Hi Stephan, hi Aljoscha,
thank you for your responses!
I am afraid that I explained the idea of the algorithm not good
enaugh. The workset is being constant until we find leafs in the tree.
This is because we will not split leafs therefore we can remove the
training data tuples that are assigned to them. The tuples that are
assigned to non-leaf nodes are emitted again but with another nodeId
property, i.e. they are being replaced. No append to the set is
performed (see the map function below)
Aljoscha's question has a good point. We did not have an implementation
of the work set modification function until yesterday evening. Stephan
is also right, we will join the work set (training data) with the
solution set (nodes in the tree). Within the function that receives the
joined tuples we will perform the reassignment of the training data to
the new child nodes (k*2 or k*2+1 with k as the currently assigned
node). Training data that is assigned to leafs is removed from the work
set. The join looks like this:
val newWorkSet = data.join(nodes.filter{ node => !node.isLeaf() })
.where{ data => data.nodeID }.isEqualTo{ node => node.nodeID }
.map{ (data, node) => new Data(
if (data.point.features.elementAt(node.split.dim) <= node.split.value)
data.nodeID * 2
else
data.nodeID * 2 + 1
,data.point) }
Unfortunately the exception is still raised.
Ragarding the solution set growing: We are generating new nodes with ID
2*k and 2*k+1 for each non-leaf node with ID k. Therefore new keys are
generated. We have removed the union now.
But I am afraid that we did not understood the API to emit the new
workset. We do not have keys on the training data. Is it possible to
re-emit a completely new work set in each iteration or can you only
replace existing tuples in the current work set?
Can you also please have a look on our step function invokation?
Currently it is looking like this:
initializedTree.iterateWithDelta(data, {_.nodeID}, growTree, maxDepth)
with the following variables:
- initializedTree: Node with ID 1 and split point. (Initial solution
set)
- data: Training data set. (Initial work set)
- {_.nodeId} Attribute of the training data. But we are not entirely
sure what this parameter is used for.
- growTree: Step function returning (solutionSetDelta, newWorkSet)
- maxDepth: Maximum number of iterations.
Thank you very much for your support!
Best regards
Christoph
Free forum by Nabble | Disable Popup Ads | Edit this page |