Thursday, January 23, 2014

Unsupervised syns with FastMap

Click here for unsupervised syns result with fastmap.

Tuesday, January 21, 2014

Cocomo/Coqualmo/Sced-Risk Model w/4 Objectives

The Cocomo model class in things/var/darren/coco/ now has four methods, one for each of the desired objective functions:
  • Effort
  • Months
  • Defects
  • Risks
Each method needs few arguments. Ultimately none should be required, although several may optionally be passed in to preload or precompute values that are used repeatedly in the calculations.

I have begun filling in the code that supports these functions, but I have not yet started testing and debugging.  Below is example output from the cocomo class's new method, xys():

aa 3.68689510262 None
sced 4.33572792066 0.855571095869
cplx 3.54209989659 1.06860375637
site 5.5631077419 0.692430144197
resl 2.04446498018 3.63467420444
acap 3.05879106806 0.990494946701
etat 3.4521618083 None
rely 2.54968439506 0.9382368144
data 3.45688917009 1.02910631533
prec 2.13581832559 2.75739144447
pmat 4.92326222727 0.781801416853
aexp 4.33937647426 0.8933853857
flex 4.38257432283 1.15507418054
pcon 2.2586185885 1.06496698325
tool 1.58649480483 1.13975044162
time 3.00824141435 1.00119768587
stor 4.40652474984 1.10905633167
docu 1.46520028131 0.885081237188
b 5.28636472236 0.801315863421
plex 2.14868867232 1.10609100864
pcap 1.88562765343 1.1577450923
kloc 701.370627289
ltex 2.49317333366 1.07800326396
pr 3.21767023357 None
ruse 3.16205901643 1.01522858174
team 1.35876155009 3.30419154403
pvol 3.73906811314 1.05062941603
(':a', 5.286364722358447, ':b', 0.801315863420666, ':kloc', 701.370627288504, ':exp', 0.917647191324108, ':sum', 11.633132790344202, ':prod', 18.07557850083159, 'effort', 39068.355126207636, 'months', 123.50925817291382, 'defects', 12763.194794326424, 'risks', 1.876675603217158)
({'aa': 3.686895102618872, 'sced': 4.335727920663164, 'cplx': 3.542099896591699, 'site': 5.5631077418981025, 'resl': 2.044464980178878, 'acap': 3.058791068058065, 'etat': 3.4521618082979, 'rely': 2.549684395063621, 'data': 3.4568891700927185, 'prec': 2.135818325586969, 'pmat': 4.923262227274789, 'aexp': 4.339376474260203, 'flex': 4.382574322828805, 'pcon': 2.258618588503664, 'tool': 1.586494804833082, 'time': 3.0082414143526988, 'stor': 4.40652474984347, 'docu': 1.4652002813053215, 'b': 5.286364722358447, 'plex': 2.1486886723249508, 'pcap': 1.8856276534266851, 'kloc': 701.370627288504, 'ltex': 2.493173333655456, 'pr': 3.2176702335725667, 'ruse': 3.1620590164309146, 'team': 1.3587615500910672, 'pvol': 3.7390681131390493}, 39068.355126207636, 123.50925817291382, 12763.194794326424, 1.876675603217158)
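
The intermediate values in the summary tuple above are consistent with the usual COCOMO effort form: :exp equals :b + 0.01 * :sum, and effort equals :a * :kloc^:exp * :prod. As a minimal sketch of that arithmetic (the function name and argument names are hypothetical, not the class's actual code):

```python
def cocomo_effort(a, b, kloc, sf_sum, em_prod):
    """Effort = a * kloc^(b + 0.01 * sf_sum) * em_prod.

    sf_sum  -- the ':sum' value printed above (assumed: sum of scale factors)
    em_prod -- the ':prod' value printed above (assumed: product of multipliers)
    """
    exponent = b + 0.01 * sf_sum  # this is the ':exp' value in the tuple
    return a * kloc ** exponent * em_prod


# Values copied from the summary tuple above.
effort = cocomo_effort(
    a=5.286364722358447,
    b=0.801315863420666,
    kloc=701.370627288504,
    sf_sum=11.633132790344202,
    em_prod=18.07557850083159,
)
print(effort)  # should agree with the 'effort' entry in the tuple (about 3.9e4)
```

This only checks the effort objective; the months, defects, and risks formulas are not reconstructed here.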


One unresolved question, which may be evident from the data above: the COCOMO calibration values were generated by a PRNG (by calling a y() method on the decision range's subclass), whereas the COQUALMO calibrations here are read from the hardcoded values provided in the original coco.awk script.  If this is not correct, I am unclear on how exactly these values should be computed. My searches of the literature so far have turned up only the recommendation to use the calculator on the COQUALMO website, which is loaded with the "most current" calibration values.

Monday, January 20, 2014


Scrum'med POM3? Or Generic POM based model with more agile support?


*1. Input and output remain the same as POM3. More outputs--velocity/burndown charts? Or are any changes needed?
*2. Instead of pom3_requirements--add user stories. Each user story has..
   -Number of teams.
   -Estimated size (s).
   =Generate them at random and (min<Sum(s)<max)--min,max predefined
   ?Heap or not?
   ?Implementing changes on the go?
3. Teams in pom3 remain the same. If needed, make changes in pom3_teams according to (2).
*4. Maintain product backlog at all times.
    -Divide user stories into sprints and assign sprint to each team.
    -If team completes one sprint, assign one more. loop it.
    -Manage release backlog? Set of sprints at random as "release backlog".
    -End of each sprint update release backlog and calculate velocity.
5. Need to figure out a way to score. Use the Pareto frontier as the basis, as in POM3? Any other ideas?
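
Items 2 and 4 above can be sketched roughly as follows. This is only an illustration under assumptions: the function names, the integer story sizes, the greedy split into sprints, and the velocity definition (size completed per sprint) are all hypothetical and not part of POM3.

```python
import random


def make_stories(lo, hi, max_size, rng):
    """Draw random story sizes until lo < sum(sizes) < hi.

    Requiring max_size < hi - lo guarantees the final total stays below hi.
    """
    if not 0 < max_size < hi - lo:
        raise ValueError("need 0 < max_size < hi - lo")
    sizes = []
    while sum(sizes) <= lo:
        sizes.append(rng.randint(1, max_size))
    return sizes


def split_into_sprints(sizes, capacity):
    """Greedily pack stories, in backlog order, into sprints of a given capacity."""
    sprints, current = [], []
    for size in sizes:
        if current and sum(current) + size > capacity:
            sprints.append(current)
            current = []
        current.append(size)
    if current:
        sprints.append(current)
    return sprints


def velocity(completed_sprints):
    """Average story size completed per finished sprint."""
    return sum(sum(s) for s in completed_sprints) / len(completed_sprints)


rng = random.Random(1)
backlog = make_stories(lo=40, hi=60, max_size=8, rng=rng)
sprints = split_into_sprints(backlog, capacity=20)
print(len(backlog), len(sprints), velocity(sprints))
```

Assigning sprints to teams and updating a release backlog would sit on top of this; those parts are left out because the notes above have not settled them yet.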

Tuesday, January 14, 2014

Introducing multiple objective functions to support Monte Carlo simulations showed that certain values may need to be saved in the model. Here is sample output that computes both estimated effort and estimated months to completion. Note also that the estimates are generally at the upper end of previously seen data.

sced 1.10374478929 1.24379570838
flex 3.18327262758 2.61918685665
cplx 1.16993654826 0.810233522997
site 2.78086160785 1.0208471033
acap 4.08550479924 0.856455446536
rely 2.79509206577 0.970624810294
arch 1.52977754313 3.64286212636
prec 2.60300768966 2.69520959565
pmat 1.93849259272 3.69000784304
aexp 1.21214826296 1.14197561475
pexp 1.95664357311 1.13821406249
pcon 4.64991956732 0.806931122196
tool 3.672965413 0.893688345633
time 5.33559574511 1.16750907363
stor 4.00732679549 1.12328497753
docu 1.1245419464 0.853590713082
b 9.2944521547 0.643490350238
data 3.51126371228 1.05518529753
kloc 654.740065278
ltex 3.62581140746 0.908983366991
ruse 3.91603088042 1.1123939103
pcap 4.85036424753 0.827654929489
team 1.37304562403 3.23708175068
pvol 3.59547684544 1.06280165763
:a 9.2944521547 :b 0.657084517624 :kloc 85.9734460178 :exp 0.810643288666 :sum 15.3558771042 :prod 18.0554715192 effort 6207.43379569 months 45.0217290274
({'sced': 1.1037447892852588, 'flex': 3.18327262757599, 'cplx': 1.1699365482601933, 'site': 2.7808616078467976, 'acap': 4.085504799243289, 'rely': 2.79509206577425, 'arch': 1.529777543133112, 'prec': 2.603007689655303, 'pmat': 1.938492592716496, 'aexp': 1.2121482629597513, 'pexp': 1.9566435731061547, 'pcon': 4.649919567322444, 'tool': 3.6729654130036895, 'time': 5.335595745107138, 'stor': 4.007326795486877, 'docu': 1.1245419463964668, 'b': 9.294452154704873, 'data': 3.5112637122816897, 'kloc': 654.7400652777252, 'ltex': 3.6258114074591927, 'ruse': 3.9160308804151143, 'pcap': 4.8503642475262385, 'team': 1.3730456240284985, 'pvol': 3.59547684544323}, 6207.433795687203, 45.02172902735297)

Finally, the numbers in the right-hand column of the three-column rows do not seem to be the ones used in the effort (and other) calculations; it appears that y() is called each time and generates a new random.uniform()-based value (?).  These can either be kept or replaced with the hardcoded array versions from coco.awk.
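
One way to save such values in the model is to have y() draw once and then return the stored value until it is explicitly reset. The class below is a hypothetical sketch of that idea (the name CachedRange and its fields are assumptions, not the existing decision-range class):

```python
import random


class CachedRange:
    """Hypothetical decision range whose y() is drawn once and then reused."""

    def __init__(self, lo, hi, rng=None):
        self.lo, self.hi = lo, hi
        self.rng = rng if rng is not None else random.Random()
        self._y = None  # no value drawn yet

    def y(self):
        # Draw on first use only, so printing and computing see the same number.
        if self._y is None:
            self._y = self.rng.uniform(self.lo, self.hi)
        return self._y

    def reset(self):
        # Forget the saved value, e.g. before the next Monte Carlo sample.
        self._y = None


r = CachedRange(0.5, 1.5, random.Random(1))
assert r.y() == r.y()  # repeated calls agree within one sample
```

With this, the value printed next to an attribute and the value used in the effort calculation would be the same within one sample, and reset() would be called between samples.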

Problem with seeding

Problem with my seeding approach: All the good solutions are clustered in local optima.
The feature-rich seed has 5704 selected features. Subsequent good solutions are only slightly different.

What to do:
  • Make a cluster of good solutions, with a variety of “richness”…
  • OR, better yet: let the user introduce reference points. Instead of seeking to minimize the objective, minimize the difference (desired point – actual point)
      ◦ Desired number of features
      ◦ Desired number of defects
      ◦ Desired number of used before
      ◦ Desired cost
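
The reference-point idea can be sketched as a single score to minimize: the distance between the desired point and a candidate's actual objective values. The normalization by a per-objective span and all the numbers below are assumptions made for illustration only.

```python
import math


def distance_to_reference(actual, desired, spans):
    """Normalized Euclidean distance between actual and desired objective values.

    spans holds a (hypothetical) range per objective so that large-valued
    objectives such as cost do not dominate the distance.
    """
    return math.sqrt(
        sum(((actual[k] - desired[k]) / spans[k]) ** 2 for k in desired)
    )


# Made-up reference point and candidates, for illustration only.
desired = {"features": 3000, "defects": 50, "used_before": 2000, "cost": 10000}
spans = {"features": 6000, "defects": 500, "used_before": 6000, "cost": 50000}
rich = {"features": 5704, "defects": 300, "used_before": 4000, "cost": 40000}
lean = {"features": 3100, "defects": 60, "used_before": 1900, "cost": 11000}

# The candidate nearer the user's reference point scores lower.
assert distance_to_reference(lean, desired, spans) < distance_to_reference(rich, desired, spans)
```

Minimizing this one distance in place of the raw objectives should steer the search toward the user's reference point instead of toward the feature-rich seed.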

Relational Knowledge Transfer

With relational transfer, what is transferred from the source domain to the target domain is the relationship among the data [1]. In our experiments so far, we are looking at synonym learning (source and target with different features) based on relational transfer.



The source data is a combination of the following OO data: poi-3.0, ant-1.7, camel-1.6, ivy-2.0, and jEdit-4.1. The target data is jm1 (Halstead metrics).


  • x% of the target is labelled and all others are unlabeled.
  • Only 50% of the target data are used as test instances (these are from the unlabeled bunch).
  • BORE is applied separately to the labelled x% from the target and the source data.
  • Each instance now has a score that is the product of the ranks from the power ranges (the scores are normalized).
  • Each target instance gets a BORE score by using the ranks from the x%.
  • These are then matched to their nearest instance scores from the source data, and the majority defect label is assigned to the target instance.
  • For the within experiment, the x% of labelled target data is used as the train set and the 50% test instances are the test set.
  • The above is also benchmarked with a 10 x 10 cross-validation experiment on jm1 with Naive Bayes.
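
The matching step in the list above (nearest source scores, then a majority vote on the defect label) can be sketched as follows. The function name, the choice of k, and the toy scores are assumptions; the BORE scoring itself is not reproduced here.

```python
from collections import Counter


def label_by_nearest_scores(target_score, source, k=3):
    """Assign the majority label of the k source instances with the closest scores.

    source is a list of (normalized_score, defect_label) pairs.
    """
    nearest = sorted(source, key=lambda pair: abs(pair[0] - target_score))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Toy source scores and labels, for illustration only.
source = [
    (0.10, False), (0.15, False), (0.20, False),
    (0.80, True), (0.85, True), (0.90, True),
]
assert label_by_nearest_scores(0.82, source) is True
assert label_by_nearest_scores(0.12, source) is False
```

An odd k avoids ties when there are only two labels.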

Initial Results

Click here

or syns.pdf

So far there are four things offered

  1. Synonyms - (if technology, data collection methods, or metrics change, can we still use previous projects?).
  2. Cross Prediction method for synonyms based on relational transfer of different data-sets.
  3. The percentage of labelled data used - the second opinion paper goes as low as 6%, and the mixed paper experiments with 10%.
  4. The method closely resembles that of the second opinion paper; BORE is linear.


[1] Pan, Sinno Jialin, and Qiang Yang. "A survey on transfer learning." IEEE Transactions on Knowledge and Data Engineering 22.10 (2010): 1345-1359.