Wednesday, June 11, 2014

PLAN C: Prune trees

1a) delete the data from any leaf containing things from > 1 cluster.
Done. See http://unbox.org/things/var/nave/lpj/out/10_June_2014/dats/pruned_tree.dat

for the pruned tree and the actual tree.
To check that the code is working, see: http://unbox.org/things/var/nave/lpj/out/10_June_2014/dats/short_example.dat

1b) descend the trees generated by CART looking for sub-trees whose
items' cluster IDs have HIGHER entropy than the parent's, then
delete all items in those sub-trees-of-confusion

1c) for all sub-trees built by CART, compute the entropy of the leaf
items in that sub-tree. Sort those entropies to find "too much confusion",
e.g. half way down that list. Delete all sub-trees with MORE than "too
much confusion"

then, after step 1, rebuild the trees using the reduced data sets (a rough sketch of the 1b/1c rules follows below).
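A minimal sketch of the 1b/1c pruning rules, assuming each sub-tree is already summarized by a per-cluster count array (like the leaf arrays discussed below); the Node class and the function names are hypothetical, not the actual CART/scikit-learn structures.

    import numpy as np

    def entropy(counts):
        """Shannon entropy (bits) of an array of per-cluster counts."""
        counts = np.asarray(counts, dtype=float)
        p = counts[counts > 0] / counts.sum()
        return float(-(p * np.log2(p)).sum())

    class Node:
        def __init__(self, counts, children=()):
            self.counts = np.asarray(counts, dtype=float)
            self.children = list(children)

    def prune_1b(node):
        """1b: drop any child sub-tree whose entropy is HIGHER than its parent's."""
        keep = [c for c in node.children
                if entropy(c.counts) <= entropy(node.counts)]
        node.children = keep
        for c in keep:
            prune_1b(c)

    def prune_1c(subtrees):
        """1c: one entropy per sub-tree; call the value half way down the
        sorted list 'too much confusion' and return the sub-trees to delete."""
        ents = sorted(entropy(n.counts) for n in subtrees)
        cutoff = ents[len(ents) // 2]
        return [n for n in subtrees if entropy(n.counts) > cutoff]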


Apparently scikit-learn decision trees do not keep the samples in the tree: neither the leaves nor the internal nodes hold any sample data. The fit runs through the samples and gathers summary values (results) but does not store the samples themselves. So at a leaf, all I can find are stats of the samples, like
[  7.   8.   3.   0.   0.   0.   0.   0.   3.   0.   5.   2.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.]

indicating there are 7 samples of the first cluster, 8 of the second, and so on.
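A quick way to turn one of those leaf arrays into an entropy score (plain numpy here; the array is the one shown above):

    import numpy as np

    leaf = np.array([7., 8., 3., 0., 0., 0., 0., 0., 3., 0., 5., 2.,
                     0., 0., 0., 0., 0., 0., 0., 0., 0., 0.])
    p = leaf[leaf > 0] / leaf.sum()          # probabilities of the non-empty clusters
    print(-(p * np.log2(p)).sum())           # Shannon entropy of the leaf, in bits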

So I cannot actually go back to the dataset, remove those rows, and rebuild the tree.
Instead, I can calculate entropies from the above array at each leaf/node and prune sub-trees using 1b and 1c.

PS: One more problem: those value arrays are not maintained at the internal nodes! They are only available at the leaves, so I have to traverse the entire branch beneath a node and add the leaf arrays together to get the above array at that node (sketched below).
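A minimal sketch of that aggregation, assuming a fitted scikit-learn DecisionTreeClassifier called clf (a hypothetical name); it walks children_left/children_right and only reads tree_.value at the leaves, per the note above.

    import numpy as np

    def node_counts(clf):
        """Return {node_id: per-cluster count array}, filling internal nodes
        by adding up the arrays of all leaves beneath them."""
        tree = clf.tree_
        counts = {}

        def walk(node):
            left, right = tree.children_left[node], tree.children_right[node]
            if left == -1:                       # leaf: take its stored counts
                counts[node] = tree.value[node][0].copy()
            else:                                # internal: sum of both sub-trees
                counts[node] = walk(left) + walk(right)
            return counts[node]

        walk(0)                                  # node 0 is the root
        return counts

With that dictionary in hand, the entropy of any node or sub-tree is just the entropy of its count array, which is what 1b and 1c need.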
