Tuesday, November 13, 2012

Make cross-project defect prediction a successful story: a research roadmap

#Research questions:

1. For the first release of a new project, how can we learn quality prediction models on cross-project data?

Burak proposed using an NN-filter to support cross-project defect prediction and reported promising results (JASE 2010). Fayola is now also working on this and does much better (mostly on small test sets). There is much here worth investigating further.

1.1 can we do better on cross-learning performance?
      -- some more effective ways to understand the training and test data, like TEAK?
1.2 can we generalize cross-learning strategies to other data?
      -- not only small test sets and static code measures.
1.3 can we make cross-learning algorithms more scalable?
      -- maybe low dimensionality can help?

2. For the following releases, we have to:

2.1 handle the data shift (concept drift) problem caused by changes in the development environment, so as to produce better prediction results.
      -- I'm currently working in this direction; see my post for last week's group meeting.
2.2 know whether cross-project data can further improve prediction performance even when we have local data. If cross-project data works, how should we make use of it?


#Potential solutions:

Transductive transfer learning for research question 1, and inductive transfer learning for research question 2.
      -- transductive transfer learning: plenty of labeled data in the source domain; no labeled data in the target domain is available.
      -- inductive transfer learning: plenty of labeled data in the source domain and a small amount of labeled data in the target domain are available.
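
To make the two settings concrete, here is a minimal sketch (in Python, with scikit-learn's logistic regression standing in for whatever learner we end up using; the arrays are made-up placeholders, not our data):

import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)
X_src, y_src = rng.normal(size=(200, 5)), rng.integers(0, 2, 200)   # labeled source project(s)
X_tgt, y_tgt = rng.normal(size=(60, 5)), rng.integers(0, 2, 60)     # the new target project

# Transductive setting (research question 1): no target labels at all;
# train on source data only and predict the whole target project.
pred_rq1 = LogisticRegression(max_iter=1000).fit(X_src, y_src).predict(X_tgt)

# Inductive setting (research question 2): a small labeled target sample exists
# (here: the first 10 rows), so it can be mixed into the training data.
X_mix = np.vstack([X_src, X_tgt[:10]])
y_mix = np.concatenate([y_src, y_tgt[:10]])
pred_rq2 = LogisticRegression(max_iter=1000).fit(X_mix, y_mix).predict(X_tgt[10:])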


#My expectation:

1-2 publications on this topic.
     -- FSE conference, JESE, or somewhere better.

All comments are welcome.


Zhimin

Friday, November 9, 2012

Interesting tidbits from Discover Magazine

Nate Silver will tax your crap

Tarnished Silver

Effort Estimation Intrinsic Dimensions


Code to compute the correlation dimension as a function of "r". Note that the intrinsic dimensionality of a data set is, at most, the steepest slope in a plot of log(C(r)) vs log(r).

@include "lib.awk"
@include "dist.awk"
@include "readcsv.awk"

function _cr(   o) {
  if(args("-d,data/nasa93.csv,-steps,20,-jumps,4",o))
    cr(o["-d"],o["-steps"],o["-jumps"])
}
function cr(f,steps,jumps,
   _Rows,r,k,c,x,logc,logr) {
  readcsv("cat "f,_Rows)
  distances(_Rows,x)
  for(r=1/steps; r<=1 ; r+= 1/steps) {
    c = correlationDimension(r,x,length(d))
    if (c==0) continue
    if (c==1) break
    k++
    print (logr[k] = log(r)) "\t" (logc[k] = log(c)) 
  }
  say("# " steepest(k,jumps,logr,logc)  " " f "\n")
}
function distances(_Rows,x,       i,j) {
  for(i in all)
    for(j in all)
      if (j > i) 
x[i][j] = dist(all[i],all[j],_Rows,1)
}
function correlationDimension(r,x,n,   i,j,c) {
  for(i in x)
    for(j in x[i])
      c += x[i][j] <= r
  return 2/(n*(n-1)) * c
}
function steepest(max,jumps,logr,logc, 
 i,rise,run,m,most) {
  for(i=1; i <= max-jumps; i += jumps) {

    rise = logc[i + jumps] - logc[i]
    run  = logr[i + jumps] - logr[i]
    m    = rise / run
    if (m > most) 
      most = m
  }
  return most
}

And the results are, sorted lowest to highest...


  • low
    • 0.91 data/china.csv
    • 1.97 data/kemerer.csv
    • 2.77 data/finnish.csv
    • 2.92 data/miyazaki94.csv
    • 3.00 data/albrecht.csv
    • 3.35 data/nasa93c1.csv
  • medium
    • 3.70 data/coc81o.csv
    • 3.96 data/telecom.csv
    • 4.00 data/coc81sd.csv
    • 4.07 data/coc81.csv
    • 4.10 data/desharnais.csv
  • high
    • 4.51 data/nasa93c2.csv
    • 4.54 data/nasa93c5.csv
    • 4.78 data/coc81e.csv
    • 5.74 data/nasa93.csv
    • 8.19 data/sdr.csv

Thursday, November 8, 2012

The Peters Filter


The cross-company problem: how to find which training data is relevant to you:


Why do cross-company learning?
  • Because when you don't have enough local data, you do very badly
  • In the following, we are training and testing on very small data sets: (lo, median, hi) = (6, 20, 65) instances


So, let's reach out across data sets and compare.
Two cross-company selection filters:
  • Burak: the N training instances nearest the test data (shown in gray)
  • Peters: cluster the train data, then find the clusters that contain the test data




Note that the Peters filter uses the structure of the train data to guide the initial selection of the data (a rough code sketch of both filters follows below).
Why?
  • Intuition behind Peters' filter: there is more experience in the repo than with you, so use it to guide you
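
For concreteness, here is a rough sketch (in Python, with scikit-learn standing in for our own code) of the two selection strategies as described in the bullets above. This is a simplified reading of those bullets, not the published filters, and the train/test arrays are hypothetical placeholders.

import numpy as np
from sklearn.cluster import KMeans
from sklearn.neighbors import NearestNeighbors

def burak_filter(train, test, k=10):
    # keep the union of the k training rows nearest to each test row
    nn = NearestNeighbors(n_neighbors=k).fit(train)
    _, idx = nn.kneighbors(test)              # idx: (n_test, k) indices into train
    return train[np.unique(idx.ravel())]

def peters_filter(train, test, n_clusters=10):
    # cluster train+test together; keep training rows from clusters that also
    # contain a test row, so the training data's own structure guides the choice
    both = np.vstack([train, test])
    labels = KMeans(n_clusters=n_clusters, n_init=10, random_state=1).fit_predict(both)
    train_labels, test_labels = labels[:len(train)], labels[len(train):]
    return train[np.isin(train_labels, np.unique(test_labels))]

Either way, the defect predictor is then trained only on the rows the filter returns.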


In the following

  • We train on selected members of the 46 data sets in the repo: (lo, med, hi) = (109, 293, 885) instances
  • g = 2*(1 - pf)*pd / (1 - pf + pd)
  • The last column is the delta between the Peters and Burak filters
  • Delta is usually positive and large
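
For reference, the g measure above is the harmonic mean of pd and 1 - pf; a tiny helper, with made-up example values:

def g_measure(pd, pf):
    # harmonic mean of recall (pd) and 1 - false alarm rate (pf)
    return 2 * (1 - pf) * pd / (1 - pf + pd)

print(g_measure(pd=0.7, pf=0.2))   # -> about 0.75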

Does transfer learning work for cross project defect prediction?

The data shift issue troubles software defect prediction in two main ways:
1) along the time dimension: for multi-version projects, historical data from early releases may not be applicable to a new release.
2) along the space dimension: because of the lack of local data, we sometimes need to do cross-project defect prediction.

We see transfer learning as a potential solution to the data shift problem in defect prediction because of its capability to learn and predict when the training and test distributions differ. Our ongoing experiments support this argument.
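
One simple way to see this shift is to compare each feature's distribution in the source (training) data against the target (test) data. Here is a minimal sketch using a two-sample Kolmogorov-Smirnov test; the arrays are made-up placeholders, not our experimental data.

import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(0)
X_src = rng.normal(0.0, 1.0, size=(300, 4))   # e.g., an earlier release (or another project)
X_tgt = rng.normal(0.5, 1.5, size=(100, 4))   # e.g., the data we want to predict on

for j in range(X_src.shape[1]):
    stat, p = ks_2samp(X_src[:, j], X_tgt[:, j])
    print("feature", j, "KS =", round(stat, 2), "p =", p)   # large KS, tiny p => shifted feature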

First, we observe serious performance reductions in cross-release and cross-project defect prediction (i.e., along the time and space dimensions).


Second, after employing a transfer learning framework, BDSR (Bias Reduction via Structure Discovery), we do better on cross defect prediction.




--Zhimin 

Wednesday, November 7, 2012

Novel Subspace Clustering


This is a novel subspace clustering algorithm that uses a top-down FP Growth approach for attribute filtering, versus the typically agglomerative, bottom-up Apriori paradigm.

It produces disjoint clusters of varying dimensionality and does not suffer from the exhaustive subspace search problem of bottom-up approaches. It finds the densest, correlated attributes and then searches the points for patterns (clusters).

As with most subspace and projected clustering algorithms, the clustering is done in a cyclical manner. In this method, FP Growth is used to find a candidate attribute subset, then EM clustering is performed over those attributes. EM produces multiple clusters, which are tested by classification learners. Good clusters are labeled and removed from the data set, creating disjoint instance clusters. The null and bad clusters remain in the data set for further cycles of clustering. All attributes are available to the FP Growth step and may be repeated in later clusters. (A structural sketch of one cycle is given below.)
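
For illustration only, here is a structural sketch of one cycle of that loop. Everything in it is a stand-in rather than the actual implementation: a highest-variance attribute pick replaces FP Growth, scikit-learn's GaussianMixture plays the EM step, and a silhouette threshold replaces the classification-learner test.

import numpy as np
from sklearn.mixture import GaussianMixture
from sklearn.metrics import silhouette_score

def one_cycle(X, n_attrs=3, n_components=3, good=0.3):
    # 1. candidate attribute subset (stand-in for the FP Growth step):
    #    just take the highest-variance attributes
    attrs = np.argsort(X.var(axis=0))[-n_attrs:]
    sub = X[:, attrs]
    # 2. EM clustering over the candidate attributes
    labels = GaussianMixture(n_components=n_components, random_state=1).fit_predict(sub)
    # 3. test the clustering (stand-in for the classification-learner test);
    #    a good cluster is labeled and removed, the rest stay for later cycles
    if len(set(labels)) > 1 and silhouette_score(sub, labels) >= good:
        biggest = np.bincount(labels).argmax()
        return attrs, X[labels == biggest], X[labels != biggest]
    return attrs, None, X

# tiny synthetic example: two well-separated groups in 6 dimensions
rng = np.random.default_rng(2)
X = np.vstack([rng.normal(0, 1, (100, 6)), rng.normal(5, 1, (100, 6))])
attrs, removed, remaining = one_cycle(X)

The real method would repeat one_cycle on the remaining rows, re-running the FP Growth step each time, until the stopping criteria fire.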

This method requires several parameters: for FP Growth, for EM clustering, for the minimum test, and for the stopping criteria. I believe that it will be significantly less sensitive to the parameter values than current methods. I also believe that it will be more computationally efficient than existing techniques, since it uses FP Growth to find candidate subspaces (escaping a combinatorial search) and removes clustered instances (reducing the overall data set size).

Literature surveys compare the performance of methods over varying data set sizes. Moise09 varies the number of attributes and the number of instances to create several data sets, but does not test against data sets with roughly the same number of attributes and instances. My method was designed for this case, which I call the n x m problem. Other subspace and projected clustering algorithms suffer as the number of attributes increases, whereas my method is designed to scale.

- Erin

Tuesday, November 6, 2012

Intrinsic dimensionality

Here's an article playing my song:

http://www.stat.berkeley.edu/~bickel/mldim.pdf (cited by 300+ people)

Maximum Likelihood Estimation of Intrinsic Dimension
Elizaveta Levina
Peter J. Bickel

In Advances in NIPS, volume 17. MIT Press, 2005
There is a consensus in the high-dimensional data analysis community that the only reason any methods work in very high dimensions is that, in fact, the data are not truly high-dimensional. Rather, they are embedded in a high-dimensional space, but can be efficiently summarized in a space of a much lower dimension, such as a nonlinear manifold. Then one can reduce dimension without losing much information for many types of real-life high-dimensional data, such as images, and avoid many of the "curses of dimensionality".

The most standard method: the correlation dimension plot, log(C(r)) vs log(r):


They go on to offer another method, which I don't understand, but the resulting numbers are very close to those of (1).
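
For what it's worth, here is my reading of their maximum-likelihood estimator in code; treat the exact form as an assumption to check against the PDF rather than a verified re-implementation.

import numpy as np
from sklearn.neighbors import NearestNeighbors

def mle_dimension(X, k=10):
    # T[:, j] = distance from each point to its (j+1)-th nearest neighbor
    # (ask for k+1 neighbors because the nearest "neighbor" is the point itself)
    dists, _ = NearestNeighbors(n_neighbors=k + 1).fit(X).kneighbors(X)
    T = dists[:, 1:]
    # per-point estimate: [ (1/(k-1)) * sum_{j<k} log(T_k / T_j) ]^(-1)
    m = (k - 1) / np.log(T[:, -1:] / T[:, :-1]).sum(axis=1)
    return m.mean()

# sanity check: points on a 2-d plane embedded in 10-d space
rng = np.random.default_rng(3)
X = rng.normal(size=(500, 2)) @ rng.normal(size=(2, 10))
print(mle_dimension(X))   # should come out near 2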

For a sample of these log(C(r)) vs log(r) graphs, see fig1 of http://oldweb.ct.infn.it/~rapis/corso-fsc2/grassberger-procaccia-prl1983.pdf