Thursday, July 26, 2012

Erin Thurs 7-26

Worked example of Ordonez distance algorithm

Code snippet - shows code and marks the place where I have verified the algorithm

pseudoCode up to verified line

awk code where I'm hung up

must do: decide on a class that meets requirements (hard, as there are very few open)

Tuesday, July 24, 2012

Erin's To Dos

Ellington NN clustering, preliminary

Continue working on Binary Sparse NN implementation

  Think centroids need to be binary with stdDev instead of averages of the points.

  Nearly every instance is clustered, and the lack of StdDev cutoff is not identifying unknown instances

Double check methods to make sure they are working as I think they should

Start GRE studying - exam July 31

Schedule

Utility Results

Utility for 21 defect test data-set
and 8 defect train data-sets.

Thursday, July 19, 2012

Raw and Summary Results

Raw Results

Summary Results

POM

POM Learner Results:

Software Project Performance Metrics: http://i.imgur.com/ERu9M.png

Learner Metrics: http://i.imgur.com/AJoq1.png

Cluster SA results

Results/Comparisons:
https://docs.google.com/document/d/1E0lwTCm-GDijJUjmfu_aA0EwuLCkm2B2rikJ2hoh-MM/edit

How to measure success/compare methods?
In the NSGA-II paper [1], the authors used two performance measures, neither of which was AUC. Because the test problems were widely used, they could take 500 known, evenly distributed points on the Pareto front and measure the average distance from each resultant point to the front; this was the first metric. The second metric measured the spread of the obtained solutions across the Pareto front, calculated with a given algorithm.
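A minimal sketch of the first (distance) metric, assuming Euclidean distance in objective space and taking the nearest known point as the distance to the front. The function name, the example front f2 = 1 - sqrt(f1), and the sample points are illustrative assumptions, not taken from the paper or from our code. The reference points here are evenly spaced in f1, which is only an approximation of evenly distributed along the front.

```python
import math

def convergence_metric(obtained, reference):
    """Average, over the obtained points, of the distance to the nearest reference point."""
    total = 0.0
    for p in obtained:
        total += min(math.dist(p, q) for q in reference)
    return total / len(obtained)

# Hypothetical known front f2 = 1 - sqrt(f1), sampled at 500 values of f1 in [0, 1].
reference = [(i / 499, 1 - math.sqrt(i / 499)) for i in range(500)]

# Made-up "resultant" points that lie on that front, so the metric should be near zero.
obtained = [(0.0, 1.0), (0.25, 0.5), (1.0, 0.0)]
print(convergence_metric(obtained, reference))
```

Lower is better: a value near zero means the resultant points sit close to the known front. This says nothing about spread, which is why a second metric is needed.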


REFERENCES:
[1] Kalyanmoy Deb, Amrit Pratap, Sameer Agarwal, and T. Meyarivan, "A Fast and Elitist Multiobjective Genetic Algorithm: NSGA-II," IEEE Transactions on Evolutionary Computation, vol. 6, no. 2, April 2002.

Tuesday, July 17, 2012

FpFSS slides
Leverages Association Learning (FP Growth) and Clustering (EM) to create a predictive data model in an unclassified database where the numbers of rows and columns are similar. The model is then applied to a related time series database, where cluster concentrations can be predicted for future time values.

Erin's To Dos

Local Cluster SA

Only SA for Constr (really bad):















[1] Baseline from NSGA-II for Constr:





SA on cluster for Constr(top: all points, bottom: only dominating):






Other runs for SA on cluster for Constr, only dominating:





Conclusion: Running SA with clusters was much, MUCH better, but could use some improvement.  By limiting SA to within a cluster, we get points that aren't going to be on the Pareto front.
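For reference, a minimal sketch of how the "only dominating" points could be picked out of all SA results, assuming every objective is minimized. The function names and sample points are made up for illustration; this is not the code that produced the plots.

```python
def dominates(a, b):
    """True if a is no worse than b in every objective and strictly better in at least one."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))

def non_dominated(points):
    """Keep only the points that no other point dominates."""
    return [p for p in points if not any(dominates(q, p) for q in points)]

# Made-up objective vectors (f1, f2), both minimized.
points = [(1.0, 5.0), (2.0, 3.0), (3.0, 4.0), (4.0, 1.0)]
front = non_dominated(points)
print(front)
```

Here (3.0, 4.0) is dropped because (2.0, 3.0) is better in both objectives. The filter is quadratic in the number of points, which is fine for plot-sized result sets.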

Up Next: DE



REFERENCES



[1] Kalyanmoy Deb, Amrit Pratap, Sameer Agarwal, and T. Meyarivan, "A Fast and Elitist Multiobjective Genetic Algorithm: NSGA-II," IEEE Transactions on Evolutionary Computation, vol. 6, no. 2, April 2002.

POM

Todo List: https://docs.google.com/spreadsheet/ccc?key=0AolajDUgsGZ7dGF0Y2ppYk9XS0g5aTQ3bFRVbjRXSmc




POM: Portman Owens Menzies

What it is: A software project emulator model.  Full projects that take 200 days or more complete in mere seconds in the model.  It can gather a variety of statistics, such as days to complete and money spent, among many others.  I've coded a version of it for use with a learner.

How it works: My POM model runs on Actory, which is a Finite State Machine of sorts, where each Team/Person in the development project is a different machine.  We also add a project manager and an "assigner", whose job is to decide which task is best for the team/person.

Coded in: Python

Reason for Building POM: The transitions between machines in Actory have priorities.  The main goal of POM was to use a Learner (bore = best or the rest) to learn the best transition priorities in Actory.

Methodology for Learning: We run POM 1000 times to generate average statistics and then package them with the currently used (random) transition priorities.  This package gets sent to the learner, which spits out some data analysis on what the best transition priorities should be.  After learning the best transition priorities, we run POM again, 1000 times, and regenerate the statistics and compare them to see if any improvements were found.
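The first half of that loop (run many times, average, package with the priorities) can be sketched as below. This is only an outline under assumptions: `run_pom` is a hypothetical stand-in that returns made-up numbers, not the real POM model, and the learner call is omitted entirely.

```python
import random
from statistics import mean

def run_pom(priorities, rng):
    """Hypothetical stand-in for one POM run; returns made-up statistics."""
    return {"days": rng.uniform(200, 300), "s4": rng.uniform(0.3, 0.6)}

def average_stats(priorities, runs=1000, seed=0):
    """Run the model `runs` times and average each statistic."""
    rng = random.Random(seed)
    results = [run_pom(priorities, rng) for _ in range(runs)]
    return {key: mean(r[key] for r in results) for key in results[0]}

# Random transition priorities (the count of 5 is arbitrary here).
priority_rng = random.Random(1)
priorities = [priority_rng.random() for _ in range(5)]

# The package that would be handed to the learner.
package = {"priorities": priorities, "stats": average_stats(priorities)}
print(package["stats"])
```

Running `average_stats` again with the learned priorities would give the "after" statistics for comparison.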

Data Results: The five statistics used are as follows:
 - - - days = Days to Complete Project
 - - - s1 = Money per Day Spent
 - - - s2 = Money per LOC
 - - - s3 = Days per LOC
 - - - s4 = Average time spent IDLE for a team/person

Before learning:
 - - - days = 269
 - - - s1 = 1240
 - - - s2 = 10.25
 - - - s3 = 0.0083
 - - - s4 = 0.5004


After learning:

 - - - days = 268
 - - - s1 = 1242
 - - - s2 = 10.22
 - - - s3 = 0.0082
 - - - s4 = 0.4001
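
To make the before/after comparison easier to read, the relative change of each statistic can be computed from the numbers above. A small sketch (the helper name `percent_change` is just for illustration):

```python
before = {"days": 269, "s1": 1240, "s2": 10.25, "s3": 0.0083, "s4": 0.5004}
after = {"days": 268, "s1": 1242, "s2": 10.22, "s3": 0.0082, "s4": 0.4001}

def percent_change(old, new):
    """Relative change from old to new, in percent."""
    return 100.0 * (new - old) / old

changes = {name: percent_change(before[name], after[name]) for name in before}
for name, value in changes.items():
    print(f"{name}: {value:+.2f}%")
```

The only sizable change is s4 (idle time), which drops by about 20%; every other statistic moves by less than 1.5%.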

Brooks's Law: Adding members to a project at a late phase will only make it later.  We test this in POM by allowing teams/persons to gain experience and become better coders the more they work on the project.  We test the effect and find support for Brooks's Law by running POM 35 times and gathering the number of days the project takes when teams/persons can be added at different phases of its completion.  The following chart depicts the results and indicates a steady increase in days when members can be added earlier in development.

http://i.imgur.com/frqkH.png
Y Axis: Days
X Axis: (0 to 100%) Percent of the Project Completed (Teams/Persons can only be added to the project when it is this much complete)

Wednesday, July 11, 2012

To-Do List:

  • Summer Report 3: researching MOEA performance metrics, testing NIS active breeding pool updating, new performance metrics?
  • Summer Report 4: researching niching techniques, testing new idea for niching technique against those in literature, new MOEA?
  • Thesis: compiling summer research results into a thesis
  • Paper? I'm pretty sure that my Non-dominated Insertion Sort will make MOEAs converge faster.  I think that the algorithm running times are worth publication on their own.  If it also accelerates convergence, I think the resulting algorithm could be called NSGA-III (or NISGA).
  • Prepare for job hunt: After this, I'd like to prepare for getting a job.  I'd like to do some research that can relate to landing a data mining or game coding job.

Jared Update

**UPDATED**
Notes:
Dominance eastwest heuristic worked well
splitting while y decreasing was very prone to wacky results (some clusters of 10, some of 2)

next up:
reorganize and rethink code (very hard to SA on both models and real data as is)
get some/any pareto graph out for a model
DE & GA

Erin's update

To Dos
With paper details

Slope as Estimator of Cumulative Rule Percentages

These charts show the cumulative rule percentages for Atrazine and Bromacil.  There is a graph displaying all of the rules for both chemicals. The other graphs are an example of using the slope of the cumulative rule percentage to estimate the rule percentage in a later round of experimentation.  The 'A' rule set is used for both examples.

The example shows that the slope of the line in early rounds (round 6 for Atrazine and round 3 for Bromacil) provides a good estimate that could allow experimenters to jump forward in the aptamer discovery process by several rounds.
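
A rough sketch of the idea, assuming the estimate is a straight-line extrapolation from the slope between the last two observed rounds. The function name and the numbers are hypothetical; they are not the Atrazine or Bromacil data.

```python
def extrapolate(rounds, cumulative_pct, target_round):
    """Project the cumulative rule percentage at target_round from the latest observed slope."""
    r0, r1 = rounds[-2:]
    p0, p1 = cumulative_pct[-2:]
    slope = (p1 - p0) / (r1 - r0)
    return p1 + slope * (target_round - r1)

# Made-up cumulative rule percentages observed through round 3.
rounds = [1, 2, 3]
cumulative_pct = [2.0, 4.0, 6.0]
estimate = extrapolate(rounds, cumulative_pct, 6)
print(estimate)
```

With a slope of 2 percentage points per round, the projection for round 6 is 12.0. How far ahead such a projection stays trustworthy depends on how linear the cumulative curve really is.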