Sunday, June 13, 2010

Active Learning

Labeling every data point is time-consuming, while unlabeled data is abundant. This motivates the field of active learning, in which the learner may ask for the label of any specific point, but each query has a cost.

The points to be queried are usually chosen from a pool of unlabeled data points. The goal is to ask as few queries as possible, picking the points that would help the learner the most (highest information content).

Possible ideas for choosing the points to query:
1) Build a Voronoi structure and query either a) the center of the largest circumcircle or b) the subset of Voronoi vertices whose nearest neighbors belong to different classes. This is difficult in high dimensions.
2) Train two learners and query points where they disagree, or use an SVM and query only the point closest to the hyperplane at each round. The open question for me is how to adapt this to effort estimation (a regression problem). We have formed a reading group for this problem; the bib file etc. are here:
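The SVM idea can be sketched in a few lines: a pool-based loop that fits a linear separator (a least-squares stand-in for a real SVM) and, at each round, queries the unlabeled point closest to the hyperplane. Everything here, the toy Gaussian data, the separator, the budget of 10 queries, is a hypothetical illustration, not our actual setup.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy pool: two Gaussian clusters (made-up data, for illustration only).
X = np.vstack([rng.normal(-2, 1, (50, 2)), rng.normal(2, 1, (50, 2))])
y = np.array([0] * 50 + [1] * 50)

labeled = [0, 50]                      # start with one seed label per class
unlabeled = [i for i in range(100) if i not in labeled]

def fit_linear(X, y):
    """Least-squares linear separator (a stand-in for an SVM)."""
    A = np.hstack([X, np.ones((len(X), 1))])        # add a bias column
    w, *_ = np.linalg.lstsq(A, 2 * y - 1, rcond=None)
    return w

for _ in range(10):                    # 10 queries, each one "charged"
    w = fit_linear(X[labeled], y[labeled])
    A_u = np.hstack([X[unlabeled], np.ones((len(unlabeled), 1))])
    margins = np.abs(A_u @ w)          # distance-like score to the hyperplane
    pick = unlabeled[int(np.argmin(margins))]       # most uncertain point
    labeled.append(pick)               # the oracle reveals the label here
    unlabeled.remove(pick)

print(f"labels used: {len(labeled)} of {len(X)}")
```

After the loop, only 12 of the 100 points have been labeled; the open question above is what the "margin" should become when the target is a continuous effort value rather than a class.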

Tuesday, June 8, 2010

beamer, and zoom

Adam Nelson is writing his slides using the beamer class. The slides look fabulous.

Beamer is a LaTeX style file which, unlike other approaches, can be processed with pdflatex.

Also, last time I checked, it's a standard item in most Linux / OS X / Cygwin package managers.

For notes on beamer, see

For an advanced guide, see

For a reason to use beamer instead of anything else, look at the zoom feature on slide 32 of the zoomable figures. Now you can include results at any level of detail.
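As a minimal sketch of the zoom feature (the theme, the figure file name, and the zoom coordinates below are placeholders, not taken from Adam's slides), beamer's \framezoom command puts a clickable magnifying region on one slide that jumps to a zoomed view on the next overlay. The whole file compiles directly with pdflatex:

```latex
\documentclass{beamer}
\usetheme{default}
\begin{document}

\begin{frame}{A zoomable result}
  % Clicking the 2cm x 1.5cm region at (1cm,1cm) on overlay 1
  % jumps to a zoomed-in version of that region on overlay 2.
  \framezoom<1><2>(1cm,1cm)(2cm,1.5cm)
  \includegraphics[width=\textwidth]{results}  % placeholder figure
\end{frame}

\end{document}
```

This is what makes it practical to put a full, dense results table on a slide: the audience sees the overview, and you zoom into whatever cell a question is about.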


Wednesday, June 2, 2010

Data: Feature Weighting and Instance Selection Optimization

Results of optimizing feature weighting and instance selection for analogy-based estimation of software effort with different methods.

Random Pre-Processor + Algorithm Results for Normal and Reduced Datasets

The results for the experiments:

Some more results regarding GAC-simulated datasets:
  • Populated datasets attain very high MdMRE and Pred(25) values.
  • There is more of a pattern regarding the best algorithm.
  • A better check of GAC-simulation would be simulation and prediction with leave-one-out.
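As a sketch of what "simulation and prediction with leave-one-out" could look like for analogy-based estimation, here is a minimal feature-weighted k-nearest-neighbor loop that reports MdMRE and Pred(25). The synthetic data, the fixed weight vector, and k=3 are all made-up placeholders for whatever the optimizer would actually choose.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical project table: 30 projects, 4 features, actual efforts.
X = rng.uniform(0, 1, (30, 4))
true_w = np.array([0.5, 0.2, 0.2, 0.1])
effort = 100 * X @ true_w + rng.normal(0, 5, 30)

w = np.array([0.5, 0.2, 0.2, 0.1])     # feature weights (would be optimized)
k = 3                                   # number of analogies

preds = []
for i in range(len(X)):                 # leave-one-out: hold out project i
    train = np.delete(np.arange(len(X)), i)
    d = np.sqrt(((w * (X[train] - X[i])) ** 2).sum(axis=1))  # weighted distance
    nearest = train[np.argsort(d)[:k]]
    preds.append(effort[nearest].mean())  # analogy: mean effort of k neighbors

preds = np.array(preds)
mre = np.abs(preds - effort) / np.abs(effort)
mdmre = np.median(mre)                  # magnitude of relative error: lower is better
pred25 = (mre <= 0.25).mean()           # fraction within 25% of actual: higher is better
print(f"MdMRE={mdmre:.3f}  Pred(25)={pred25:.3f}")
```

Running the same loop on a GAC-simulated dataset versus the original would show whether the simulation's "very high" scores survive an honest hold-out.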

Tuesday, June 1, 2010

Makefile tricks

I'm writing to record Bryan's "embed postscript fonts" trick. Without it, some conference/journal submission systems won't let you submit, complaining that "fonts are not embedded".

The trick is to use the "embed" rule, as called by "done" in the following Makefile. This code is available at /wisp/var/adam2/cbr/doc/Makefile

all : dirs tex bib tex tex done
one : dirs tex done

done : embed
	@printf "\n\n\n=======================================\n"
	@printf       "see output in $(HOME)/tmp/$(Src).pdf\n"
	@printf "=======================================\n\n\n"
	@printf "\n\nWarnings (may be none):\n\n"
	- grep arning $(HOME)/tmp/$(Src).log
dirs :
	- [ ! -d $(HOME)/tmp ] && mkdir $(HOME)/tmp
tex :
	- pdflatex -output-directory=$(HOME)/tmp $(Src)
embed :
	@ cd $(HOME)/tmp; \
	  gs -q -dNOPAUSE -dBATCH -dPDFSETTINGS=/prepress \
	     -sDEVICE=pdfwrite \
	     -sOutputFile=$(Src)1.pdf $(Src).pdf
	@ mv $(HOME)/tmp/$(Src)1.pdf $(HOME)/tmp/$(Src).pdf
bib :
	- bibtex $(HOME)/tmp/$(Src)

lab meeting, wednesday

note meeting time: 11am

10am: skype call with adam2.
11am: meeting
1pm: break
1:45pm: bryan's defense. (edited by Bryan L. to correct time. Its 1:45, not 2:00)

newbies (make them welcome)
  • Kel Cecil
  • Charles Corb
  • Tomi Prifti

  • PROMISE paper
  • travel arrangements to Beijing

  • what news?


  • did not understand your last explanation of your distance measure (i.e. is 20% more or less movement?). help me, please, to obtain clarity
  • does your result hold as you increase number of clusters?
  • starting 3 papers

ekrem, andrew:

  • start a sub-group: active learning.
  • begin lit reviewing
  • that experiment with people in front of interfaces..


  • what news on using teac as an instance selector for other data miners


  • effort estimation data table
  • instance collection environment