Wednesday, January 23, 2013

Free Big Data Symposium, Feb 1

Nate Silver: The Numbers Don't Lie

A 56-minute talk by Nate Silver about The Signal and the Noise

Results for TSE paper

Results for 20 feature models

Wednesday, January 16, 2013

RRSL Report

1) We have a system for easily instantiating new problems, and have already implemented most baseline problems such as Fonseca, Srinivas, ConstrEx, and more.  It can handle both constrained and unconstrained problems.

Modifiable Genetic Algorithm Approach
2) A basis for a very plain and dull GA is as follows:
 - Read a 100-row dataset for the problem (so that we begin from a consistent starting point).
 - Evaluate all of these as the initial population and record Median & IQR metrics for each objective.
 - Begin an iterative process, stopping at a given criterion (# of runs, or % change between iterations):
 - - - a: Select offspring for crossover and mutation
 - - - b: Crossover for the offspring
 - - - c: Mutation for the offspring
 - - - d: Combine the population with the new offspring and select a new population from the result
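The plain GA loop above can be sketched as follows. This is a minimal toy, assuming a single sphere objective and simple one-point crossover / Gaussian mutation operators; the names (evaluate, crossover, mutate) are placeholders, not the actual RRSL code:

```python
import random
import statistics

def evaluate(ind):
    # Toy objective: sphere function, to be minimized (assumption, not the real model)
    return sum(x * x for x in ind)

def crossover(a, b):
    # One-point crossover
    cut = random.randrange(1, len(a))
    return a[:cut] + b[cut:]

def mutate(ind, rate=0.1):
    # Gaussian perturbation of some attributes
    return [x + random.gauss(0, 0.1) if random.random() < rate else x
            for x in ind]

def ga(pop, iterations=10):
    for _ in range(iterations):
        scored = sorted(pop, key=evaluate)
        parents = scored[:len(pop) // 2]                    # (a) select offspring
        kids = [mutate(crossover(random.choice(parents),    # (b) crossover
                                 random.choice(parents)))   # (c) mutation
                for _ in range(len(pop))]
        # (d) combine population with offspring, keep the best
        pop = sorted(pop + kids, key=evaluate)[:len(pop)]
    return pop

random.seed(1)
pop = [[random.uniform(-1, 1) for _ in range(5)] for _ in range(100)]
start = statistics.median(evaluate(p) for p in pop)   # median metric, as in step 2
final = ga(pop)
end = statistics.median(evaluate(p) for p in final)
```

Swapping the bodies of steps (a)-(d) is exactly what distinguishes the RRSL and benchmark variants below.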

3) Different Algorithms follow from adjusting (a,b,c,d) above:
 - - - a: Selection of offspring via RRSL non-dominating leaves. Maintain two copies of these offspring.
 - - - b: Crossover of the offspring via the moving-averages idea
 - - - c: Mutation of the offspring via Differential Evolution
 - - - d: Tournament selection on the (few) offspring + adjusted offspring

4) Benchmark Algorithms also follow from adjusting (a,b,c,d) above:
 - - - a: Selection of offspring using k-round tournament selection
 - - - b: Standard Crossover to swap attributes for a given pair of individuals
 - - - c: Standard Mutation to re-roll the dice on some chosen attribute of an individual
 - - - d: Selection via NSGA2 or SPEA2 sub-routines.
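Step (a) of the benchmark GA, k-round tournament selection, looks roughly like this sketch (the integer population and identity score function are placeholder assumptions):

```python
import random

def tournament(pop, score, k=3):
    # Draw k candidates at random; the best-scoring one becomes a parent.
    return min(random.sample(pop, k), key=score)

random.seed(2)
pop = list(range(100))  # stand-in individuals; lower score = better
winners = [tournament(pop, score=lambda x: x) for _ in range(200)]
```

Larger k raises selection pressure: with k=3, winners average well below the population mean.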

5) The Goal of the Research:
 - We want the RRSL GA to perform as well as NSGA2/SPEA2.
 - There is an open question about what it means to be better.  Solutions lie along the optimal Pareto frontier, but different algorithms may discover solutions in different regions of that frontier.
 - The central emphasis of the research: improving upon the number of objective function evaluations.  The theory is that where NSGA2/SPEA2 need N evaluations, the RRSL GA will need only on the order of log N, and still do just as well.
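The "different regions of the frontier" point rests on Pareto dominance; for reference, a minimal dominance check and frontier filter (assuming minimization on every objective):

```python
def dominates(a, b):
    # a dominates b if a is no worse on all objectives and strictly better on one.
    return (all(x <= y for x, y in zip(a, b)) and
            any(x < y for x, y in zip(a, b)))

def pareto_front(points):
    # Keep only the points that no other point dominates.
    return [p for p in points
            if not any(dominates(q, p) for q in points if q != p)]

front = pareto_front([(1, 5), (2, 2), (5, 1), (3, 3), (4, 4)])
```

Two algorithms can both return only non-dominated points yet cover very different stretches of the frontier, which is why "as well as" needs a spread measure, not just dominance.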

6) Results

--- full results pending ---

Wednesday, January 9, 2013

Erin's update

Here is the current pseudocode and code for my algorithm.  BicBin is the algorithm that I will run comparisons against.


loop over (confidence levels, or coverage)

run FP Growth
extract unique 4-mers from results
compile the list of selected attribute numbers
derive unselected attribute numbers

EM cluster db using selected attributes
label rows by the clusters
parse results

divide db into 10 bins
run nb, JRip, oner, prism, ZeroR, j48 using cluster labels as the class

keep clusters that are greater than the db average (effect size),
AND that do well on the pd/pf calculation you gave me today in classification

record the cluster patterns, and rows and columns

change 1s in the clusters to 0s (non-overlapping biclusters)

end loop
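The masking step at the bottom of the loop (turning covered 1s to 0s so later passes find non-overlapping biclusters) amounts to zeroing a submatrix; the row and column lists here are hypothetical stand-ins for the recorded cluster patterns:

```python
def mask_bicluster(db, rows, cols):
    # Zero every cell inside the accepted bicluster so it cannot be re-selected.
    for r in rows:
        for c in cols:
            db[r][c] = 0
    return db

db = [[1, 1, 0],
      [1, 1, 1],
      [0, 1, 1]]
mask_bicluster(db, rows=[0, 1], cols=[0, 1])
```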

? do I recalculate the db average or use the initial avg
? classification performance criteria

CODE (thus far)


Weka="nice -n 20 java -Xmx2048M -cp $Jar "

cp $Input /tiny/data/trialInput.arff

cat $Input | ./dbAvg.awk
#DbAvg=`cat $Input | ./dbAvg.awk`
#echo "DbAvg" $DbAvg

#LOOP OVER CONFIDENCE LEVELS or MinRulesPerc(aka coverage)


#run FPGrowth
$Weka weka.associations.FPGrowth -P 1 -I -1 -N 20 -T 0 -C $CONFIDENCE -D $STEP -U $UPPER -M $MinRulesPerc -t $Input > fpResults$MinRulesPerc$CONFIDENCE.txt

#extract unique 4-mers from FP

cat fpResults$MinRulesPerc$CONFIDENCE.txt |
sed 's/_binarized/ /g' |
sed -e 's/=1/ /g' -e 's/=0/ /g' |
./fpFieldParser.awk > attA$MinRulesPerc$CONFIDENCE.txt

#compile list of attribute numbers

cat fieldsNum.txt separator.txt attA$MinRulesPerc$CONFIDENCE.txt  |
./makeFieldNumList.awk > fieldListA$MinRulesPerc$CONFIDENCE.txt
#remove last comma

sed 's/.\{1\}$//' attIndicies.txt > fldListA$MinRulesPerc$CONFIDENCE.txt
sed 's/.\{1\}$//' inverseAttIndicies.txt > fldListIA$MinRulesPerc$CONFIDENCE.txt

#put list into variable

InvAttSet=`cat fldListIA$MinRulesPerc$CONFIDENCE.txt`
echo $InvAttSet

#cluster all instances on the reduced attribute list

$Weka weka.filters.unsupervised.attribute.AddCluster -W "weka.clusterers.EM  -t $Input -I 100 -N -1 -M 1.0E-6 -S 100" -I $InvAttSet

#keep clusters that are 30% more than the db average
#AND Test well 70+
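The still-commented filter ("keep clusters that are 30% more than the db average") might be sketched like this, where density is the fraction of 1s in a 0/1 matrix; the 1.3 factor and the example clusters are assumptions taken from the comments, not the final code:

```python
def density(matrix):
    # Fraction of 1s across all cells of a 0/1 matrix.
    cells = [v for row in matrix for v in row]
    return sum(cells) / len(cells)

def keep_dense(clusters, db_avg, factor=1.3):
    # Keep clusters denser than factor * the whole-db average.
    return [c for c in clusters if density(c) > factor * db_avg]

db_avg = 0.5
clusters = [[[1, 1], [1, 0]],   # density 0.75 -> kept
            [[1, 0], [0, 0]]]   # density 0.25 -> dropped
kept = keep_dense(clusters, db_avg)
```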


I used 'time' and got these numbers:

real 0m33.106s
user 0m38.313s
sys 0m0.890s

Vasil - Progress Report

1. At the previous meeting, I showed that we can approximate the NEMS score results by training on 10% of the population.
However, by clustering (cluster+LR and cluster+Centroid) we did not get better results than by applying linear regression on the entire instance space.

I tried a new grid-clustering method. Instead of using Recursive FastMap, the projected instance space was split into a specified number of equally-sized quadrants, and the quadrants were then clustered using the variance filter.
I tried different numbers of quadrants to split the space, but clustering still could not outperform linear regression on the entire space.

2. Next, I shifted my attention to applying the clustering methods to defect prediction data. (I am currently using the ant data set from the PROMISE repository.)

- The instance space is projected using FastMap.
- Then the instance space is recursively split into quadrants on the median of x and y dimensions.
- The clusters are created using Entropy to combine quadrants (i.e. a quadrant is not added if the entropy is increased).
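The recursive median split of the projected space can be sketched as follows, on a toy 2-D point set (the FastMap projection itself is not shown, and the depth parameter is an assumption):

```python
import statistics

def split(points, depth=1):
    # Recursively divide 2-D points into quadrants at the median of x and y.
    if depth == 0 or len(points) < 4:
        return [points]
    mx = statistics.median(p[0] for p in points)
    my = statistics.median(p[1] for p in points)
    quads = [[], [], [], []]
    for x, y in points:
        quads[(x > mx) * 2 + (y > my)].append((x, y))
    leaves = []
    for q in quads:
        leaves.extend(split(q, depth - 1))
    return leaves

pts = [(x, y) for x in range(4) for y in range(4)]  # 16 grid points
leaves = split(pts, depth=1)
```

The entropy-based merging step would then run over these leaf quadrants, adding a quadrant to a cluster only when class entropy does not increase.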

- Apply Naive Bayes on the entire space and on the clusters. (The implementation will be finished today, and I can then report the results.)
- Use information gain to identify the attributes that contribute to the clustering.