On Jun 4, 2015 7:30 PM, Felix Neutatz wrote:

Hi,

I have the following use case: I want to do regression on a time-series dataset of the form:

    id, x1, x2, ..., xn, y

where id is the point in time, x1..xn are the features, and y is the target value.

In the Flink framework I would map this to a LabeledVector(y, DenseVector(x)). (I don't want to use the id as a feature.)

When I finally apply the predict() method, I get a LabeledVector(y_predicted, DenseVector(x)).

Now my problem is that I would like to plot the predicted target value according to its time. What I have to do now is:

    a = predictedDataSet.map( LabeledVector => Tuple2(x, y_p) )
    b = originalDataSet.map( "id, x1, x2, ..., xn, y" => Tuple2(x, id) )

    a.join(b).where("x").equalTo("x") { (a, b) => (id, y_p) }

This is a really cumbersome process for such a simple thing. Is there any approach that makes this simpler? If not, can we extend the ML API to allow ids?

Best regards,
Felix
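Spelled out in Flink's Scala API, the workaround looks roughly like this (a minimal sketch: the layout of `originalDataSet` is an assumption, and both sides are keyed by the string form of the feature vector, since a DenseVector itself is not usable as a join key -- which underlines how fragile this is):

    import org.apache.flink.api.scala._
    import org.apache.flink.ml.common.LabeledVector

    // predictedDataSet: DataSet[LabeledVector]         -- (y_predicted, features)
    // originalDataSet:  DataSet[(Long, LabeledVector)] -- (id, (y, features)); assumed layout
    val predByFeatures = predictedDataSet.map(lv => (lv.vector.toString, lv.label))
    val idByFeatures   = originalDataSet.map { case (id, lv) => (lv.vector.toString, id) }

    // (id, y_predicted), ready to plot over time
    val timeSeries = predByFeatures.join(idByFeatures).where(0).equalTo(0) {
      (pred, orig) => (orig._2, pred._2)
    }
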
On Jun 4, 2015 7:56 PM, Till Rohrmann wrote:

I see your problem. One way to solve it is to implement a special PredictOperation which takes a tuple (id, vector) and returns a tuple (id, labeledVector). You can take a look at the implementation of the prediction operation for vectors. But we could also discuss adding an ID field to the Vector type.

Cheers,
Till
On Jun 6, 2015 8:14 AM, Felix Neutatz wrote:

That would be great. I like the special predict operation better, because returning the id is only necessary in some cases; a dedicated predict operation would save that overhead.

Best regards,
Felix
On Jun 6, 2015 9:46 AM, Till Rohrmann wrote:

Then you only have to provide an implicit PredictOperation[SVM, (T, Int), (LabeledVector, Int)] value, with T <: Vector, in the scope where you call the predict operation.
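Such an implicit might look roughly like this (a sketch, using the PredictOperation shape that appears later in this thread; the body that threads the id through the actual SVM prediction is elided):

    import org.apache.flink.api.scala._
    import org.apache.flink.ml.classification.SVM
    import org.apache.flink.ml.common.{LabeledVector, ParameterMap}
    import org.apache.flink.ml.math.Vector
    import org.apache.flink.ml.pipeline.PredictOperation

    implicit def idPredictOperation[T <: Vector] =
      new PredictOperation[SVM, (T, Int), (LabeledVector, Int)] {
        override def predict(
            instance: SVM,
            predictParameters: ParameterMap,
            input: DataSet[(T, Int)]): DataSet[(LabeledVector, Int)] = {
          ??? // run the SVM on the vector component, keeping the Int id attached
        }
      }

    // with this value in scope, a call like this would compile:
    // val predictions: DataSet[(LabeledVector, Int)] = svm.predict(taggedVectors)
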
On Jun 8, 2015 10:17 AM, Felix Neutatz wrote:

We probably need this for the other classes of the pipeline as well, in order to pass the ID through the whole pipeline.

Best regards,
Felix
On Jun 8, 2015 7:11 AM, Sachin Goel wrote:

A more general approach would be to take as input which indices of the vector to consider as features. The vector can then be returned as-is, and the user can do whatever they wish with the non-feature values. This wouldn't require extending the predict operation; instead it could be specified on the model itself using a set-parameter function. Or perhaps a better approach is to take this input in the predict operation itself.

Cheers!
Sachin
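As a rough sketch of that idea (the `FeatureIndices` parameter and its wiring are hypothetical, not an existing FlinkML parameter):

    import org.apache.flink.ml.common.Parameter
    import org.apache.flink.ml.math.{DenseVector, Vector}

    // hypothetical parameter: which vector indices to treat as features
    case object FeatureIndices extends Parameter[Seq[Int]] {
      val defaultValue: Option[Seq[Int]] = None // default: every index is a feature
    }

    // inside predict, each input vector would be projected onto the selected indices
    def selectFeatures(v: Vector, indices: Seq[Int]): Vector =
      DenseVector(indices.map(i => v(i)).toArray)

    // e.g. svm.parameters.add(FeatureIndices, Seq(1, 2, 3)) would skip index 0 (the id)
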
On Jun 8, 2015 10:01 AM, Till Rohrmann wrote:

You're right, Felix. You need to provide the `FitOperation` and `PredictOperation` for the `Predictor` you want to use, and the `FitOperation` and `TransformOperation` for all `Transformer`s you want to chain in front of the `Predictor`.

Specifying which features to take could be a solution. However, you would then always carry data along that is not needed; especially for large-scale data, this might be prohibitively expensive. I guess the more efficient solution would be to assign an ID and later join with the removed feature elements.

Cheers,
Till
On Jun 8, 2015 12:00 PM, Mikio Braun wrote:

Hi all,

I think there are a number of issues here:

- Whether or not we generally need ids for our examples. For time-series this is a must, but I think it would also help us with many other things (like partitioning the data, or picking a consistent subset), so I would think adding (numeric) ids in general to LabeledVector would be ok.

- Some machinery to select features. My biggest concern about putting that as a parameter of the learning algorithm is that this is something independent of the learning algorithm, so every algorithm would need to duplicate the code for it. I think it's better if the learning algorithm can assume that the LabeledVector already contains all the relevant features, and that there are other operations to project features or extract a subset of examples.

-M

--
Mikio Braun - http://blog.mikiobraun.de, http://twitter.com/mikiobraun
On Jun 8, 2015 4:06 PM, Theodore Vasiloudis wrote:

I agree with Mikio: ids would be useful overall, and feature selection should not be a part of the learning algorithms. All features in a LabeledVector should be assumed relevant by the learners.
Sachin Goel wrote:

Yes, I agree too. It makes no sense for the learning algorithm to carry extra payload; only relevant data makes sense. Further, adding the ID to the predict operation's type definition seems a legitimate choice. +1 from my side.

Regards,
Sachin Goel
On Jun 8, 2015 4:11 PM, Till Rohrmann wrote (in reply to Mikio Braun):

My gut feeling is also that a `Transformer` would be a good place to implement feature selection. Then you can reuse it across multiple algorithms by simply chaining them together.

However, I don't know yet what's the best way to realize the IDs. One way would be to add an ID field to `Vector` and `LabeledVector`. Another way would be to provide operations for `(ID, Vector)` and `(ID, LabeledVector)` tuple types which reuse the implementations for `Vector` and `LabeledVector`; then the developer doesn't have to implement special operations for the tuple variants. The latter approach has the advantage that you only use memory for IDs if you really need them.

Another question is how to assign the IDs. Does the user have to provide them? Are they randomly chosen? Or do we assign each element an increasing index based on the total number of elements?
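Such a feature-selecting `Transformer`, chained in front of a predictor, might look roughly like this (a sketch: `FeatureSelector` is hypothetical, and the `TransformOperation` shape mirrors the `PredictOperation` one used elsewhere in this thread):

    import org.apache.flink.api.scala._
    import org.apache.flink.ml.common.ParameterMap
    import org.apache.flink.ml.math.{DenseVector, Vector}
    import org.apache.flink.ml.pipeline.{TransformOperation, Transformer}

    // hypothetical Transformer that keeps only the given vector indices
    class FeatureSelector(val indices: Seq[Int]) extends Transformer[FeatureSelector]

    implicit val selectFeatures =
      new TransformOperation[FeatureSelector, Vector, Vector] {
        override def transform(
            instance: FeatureSelector,
            transformParameters: ParameterMap,
            input: DataSet[Vector]): DataSet[Vector] =
          // project every vector onto the selected feature indices
          input.map(v => DenseVector(instance.indices.map(i => v(i)).toArray): Vector)
      }

    // chained in front of the learner, so the learner only ever sees features:
    // val pipeline = new FeatureSelector(Seq(1, 2, 3)).chainPredictor(svm)
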
On Jun 8, 2015 12:52 PM, Sachin Goel wrote:

I think if the user doesn't provide IDs, we can safely assume that they don't need them. We can simply assign an ID of one as a temporary measure and return the result without IDs [just to keep the interface cleaner]. If IDs are provided, we simply use those IDs. A possible template for this, with the plain-vector operation delegating to the id-aware one (using Long ids for concreteness):

    import org.apache.flink.api.scala._
    import org.apache.flink.ml.classification.SVM
    import org.apache.flink.ml.common.{LabeledVector, ParameterMap}
    import org.apache.flink.ml.math.Vector
    import org.apache.flink.ml.pipeline.PredictOperation

    implicit def predictValues[T <: Vector] = {
      new PredictOperation[SVM, T, LabeledVector] {
        override def predict(
            instance: SVM,
            predictParameters: ParameterMap,
            input: DataSet[T]): DataSet[LabeledVector] = {
          // attach a dummy id of 1, reuse the id-aware operation, drop the id again
          predictValuesWithId[T]
            .predict(instance, predictParameters, input.map(x => (1L, x)))
            .map(x => x._2)
        }
      }
    }

    implicit def predictValuesWithId[T <: Vector] = {
      new PredictOperation[SVM, (Long, T), (Long, LabeledVector)] {
        override def predict(
            instance: SVM,
            predictParameters: ParameterMap,
            input: DataSet[(Long, T)]): DataSet[(Long, LabeledVector)] = {
          ??? // the actual SVM prediction, carrying each element's id through
        }
      }
    }

Regards,
Sachin Goel
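The choice of operation would then be driven purely by the input type (again a sketch, assuming both implicits are in scope and `vectors: DataSet[DenseVector]`, `idVectors: DataSet[(Long, DenseVector)]` exist):

    // without ids: DataSet[DenseVector]          => DataSet[LabeledVector]
    // with ids:    DataSet[(Long, DenseVector)]  => DataSet[(Long, LabeledVector)]
    val plain: DataSet[LabeledVector]         = svm.predict(vectors)
    val keyed: DataSet[(Long, LabeledVector)] = svm.predict(idVectors)
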
On Jun 8, 2015 4:26 PM, Felix Neutatz wrote:

I am in favor of efficiency. Therefore I would prefer to introduce new methods, in order to save memory and network traffic. This would also solve the problem of how to come up with ids.

Best regards,
Felix
Sachin Goel wrote:

That would be better, of course. My opinion had to do with not implementing exactly the same thing twice. Perhaps Till could weigh in here. We really do need to come up with a general mechanism for this; testing labeled vectors has exactly the same problem. I'll look into how Spark and scikit-learn approach this.

Regards,
Sachin Goel