Sunday, June 13, 2010

Active Learning

Labeling every data point is time-consuming, while unlabeled data is abundant. This motivates the field of active learning, in which the learner may ask for the label of any specific point, but each query has a cost.

The points to be queried are usually chosen from a pool of unlabeled data points. The goal is to ask as few queries as possible, picking the points that would help the learner the most (highest information content).

Possible ideas for choosing the points to query:
1) Build a Voronoi structure and query either a) the center of the largest circumcircle or b) the subset of Voronoi vertices whose nearest neighbors belong to different classes. This is difficult in high dimensions.
2) Train two learners and query points where they disagree, or use an SVM and query only the point closest to the hyperplane at each round. The open question for me is how to adapt this to effort estimation (a regression problem). We have formed a reading group for this problem; the bib file etc. are here:
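The SVM idea can be sketched in a few lines: a pool-based loop that fits a linear separator (a least-squares stand-in for a real SVM) and, at each round, queries the unlabeled point closest to the hyperplane. Everything here, the toy Gaussian data, the separator, the budget of 10 queries, is a hypothetical illustration, not our actual setup.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy pool: two Gaussian clusters (made-up data, for illustration only).
X = np.vstack([rng.normal(-2, 1, (50, 2)), rng.normal(2, 1, (50, 2))])
y = np.array([0] * 50 + [1] * 50)

labeled = [0, 50]                      # start with one seed label per class
unlabeled = [i for i in range(100) if i not in labeled]

def fit_linear(X, y):
    """Least-squares linear separator (a stand-in for an SVM)."""
    A = np.hstack([X, np.ones((len(X), 1))])        # add a bias column
    w, *_ = np.linalg.lstsq(A, 2 * y - 1, rcond=None)
    return w

for _ in range(10):                    # 10 queries, each one "charged"
    w = fit_linear(X[labeled], y[labeled])
    A_u = np.hstack([X[unlabeled], np.ones((len(unlabeled), 1))])
    margins = np.abs(A_u @ w)          # distance-like score to the hyperplane
    pick = unlabeled[int(np.argmin(margins))]       # most uncertain point
    labeled.append(pick)               # the oracle reveals the label here
    unlabeled.remove(pick)

print(f"labels used: {len(labeled)} of {len(X)}")
```

After the loop, only 12 of the 100 points have been labeled; the open question above is what the "margin" should become when the target is a continuous effort value rather than a class.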

Tuesday, June 8, 2010

beamer, and zoom

Adam Nelson is writing his slides using the beamer class. The slides look fabulous.

Beamer is a LaTeX style file which, unlike other approaches, can be processed with pdflatex.

Also, last time I checked, it's a standard item in most Linux / OS X / Cygwin package managers.

For notes on beamer, see

For an advanced guide, see

For a reason to use beamer instead of anything else, look at the zoom feature on slide 32 of the zoomable figures. Now you can include results at any level of detail.
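As a minimal sketch of the zoom feature (the theme, the figure file name, and the zoom coordinates below are placeholders, not taken from Adam's slides), beamer's \framezoom command puts a clickable magnifying region on one slide that jumps to a zoomed view on the next overlay. The whole file compiles directly with pdflatex:

```latex
\documentclass{beamer}
\usetheme{default}
\begin{document}

\begin{frame}{A zoomable result}
  % Clicking the 2cm x 1.5cm region at (1cm,1cm) on overlay 1
  % jumps to a zoomed-in version of that region on overlay 2.
  \framezoom<1><2>(1cm,1cm)(2cm,1.5cm)
  \includegraphics[width=\textwidth]{results}  % placeholder figure
\end{frame}

\end{document}
```

This is what makes it practical to put a full, dense results table on a slide: the audience sees the overview, and you zoom into whatever cell a question is about.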


Wednesday, June 2, 2010

Data: Feature Weighting and Instance Selection Optimization

Results of optimizing feature weighting and instance selection for analogy-based estimation of software effort with different methods.

Random Pre-Processor + Algorithm Results for Normal and Reduced Datasets

The results for the experiments:

Some more results regarding GAC-simulated datasets:
  • Populated datasets attain very high MdMRE and Pred(25) values.
  • There is more of a pattern regarding the best algorithm.
  • A better check of GAC-simulation would be simulation and prediction with leave-one-out.
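As a sketch of what "simulation and prediction with leave-one-out" could look like for analogy-based estimation, here is a minimal feature-weighted k-nearest-neighbor loop that reports MdMRE and Pred(25). The synthetic data, the fixed weight vector, and k=3 are all made-up placeholders for whatever the optimizer would actually choose.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical project table: 30 projects, 4 features, actual efforts.
X = rng.uniform(0, 1, (30, 4))
true_w = np.array([0.5, 0.2, 0.2, 0.1])
effort = 100 * X @ true_w + rng.normal(0, 5, 30)

w = np.array([0.5, 0.2, 0.2, 0.1])     # feature weights (would be optimized)
k = 3                                   # number of analogies

preds = []
for i in range(len(X)):                 # leave-one-out: hold out project i
    train = np.delete(np.arange(len(X)), i)
    d = np.sqrt(((w * (X[train] - X[i])) ** 2).sum(axis=1))  # weighted distance
    nearest = train[np.argsort(d)[:k]]
    preds.append(effort[nearest].mean())  # analogy: mean effort of k neighbors

preds = np.array(preds)
mre = np.abs(preds - effort) / np.abs(effort)
mdmre = np.median(mre)                  # magnitude of relative error: lower is better
pred25 = (mre <= 0.25).mean()           # fraction within 25% of actual: higher is better
print(f"MdMRE={mdmre:.3f}  Pred(25)={pred25:.3f}")
```

Running the same loop on a GAC-simulated dataset versus the original would show whether the simulation's "very high" scores survive an honest hold-out.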

Tuesday, June 1, 2010

Makefile tricks

I'm writing to record Bryan's "embed postscript fonts" trick. Without it, some conference/journal submission systems won't let you submit, complaining that "fonts are not embedded".

The trick is to use the "embed" rule, as called by "done" in the following Makefile. This code is available at /wisp/var/adam2/cbr/doc/Makefile

all : dirs tex bib tex tex done
one : dirs tex done

done : embed
	@printf "\n\n\n=======================================\n"
	@printf       "see output in $(HOME)/tmp/$(Src).pdf\n"
	@printf "=======================================\n\n\n"
	@printf "\n\nWarnings (may be none):\n\n"
	- grep arning $(HOME)/tmp/$(Src).log
dirs :
	- [ ! -d $(HOME)/tmp ] && mkdir $(HOME)/tmp
tex :
	- pdflatex -output-directory=$(HOME)/tmp $(Src)
embed :
	@ cd $(HOME)/tmp; \
	  gs -q -dNOPAUSE -dBATCH -dPDFSETTINGS=/prepress \
	     -sDEVICE=pdfwrite \
	     -sOutputFile=$(Src)1.pdf $(Src).pdf
	@ mv $(HOME)/tmp/$(Src)1.pdf $(HOME)/tmp/$(Src).pdf
bib :
	- bibtex $(HOME)/tmp/$(Src)

lab meeting, wednesday

note meeting time: 11am

10am: skype call with adam2.
11am: meeting
1pm: break
1:45pm: bryan's defense. (edited by Bryan L. to correct time. Its 1:45, not 2:00)

newbies (make them welcome)
  • Kel Cecil
  • Charles Corb
  • Tomi Prifti

  • PROMISE paper
  • travel arrangements to Beijing

  • what news?


  • did not understand your last explanation of your distance measure (i.e. is 20% more or less movement?). help me, please, to obtain clarity
  • does your result hold as you increase number of clusters?
  • starting 3 papers

ekrem, andrew:

  • start a sub-group: active learning.
  • begin lit reviewing
  • that experiment with people in front of interfaces..


  • what news on using teac as an instance selector for other data miners


  • effort estimation data table
  • instance collection environment