On Jun 4, 2015 7:30 PM, Felix Neutatz wrote:

Hi,

I have the following use case: I want to do regression on a time-series dataset of the form:

    id, x1, x2, ..., xn, y

where id is the point in time, x1..xn are the features, and y is the target value.

In the Flink framework I would map this to a LabeledVector(y, DenseVector(x)). (I don't want to use the id as a feature.)

When I finally apply the predict() method, I get a LabeledVector(y_predicted, DenseVector(x)).

Now my problem is that I would like to plot the predicted target value according to its time. What I have to do now is:

    a = predictedDataSet.map( LabeledVector => Tuple2(x, y_p) )
    b = originalDataSet.map( "id, x1, x2, ..., xn, y" => Tuple2(x, id) )

    a.join(b).where("x").equalTo("x") { (a, b) => (id, y_p) }

This is a really cumbersome process for such a simple thing. Is there any approach that makes this simpler? If not, can we extend the ML API to allow ids?

Best regards,
Felix
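Spelled out in Flink's Scala API, the workaround looks roughly like this (a minimal sketch: the layout of `originalDataSet` is an assumption, and both sides are keyed by the string form of the feature vector, since a DenseVector itself is not usable as a join key -- which underlines how fragile this is):

    import org.apache.flink.api.scala._
    import org.apache.flink.ml.common.LabeledVector

    // predictedDataSet: DataSet[LabeledVector]         -- (y_predicted, features)
    // originalDataSet:  DataSet[(Long, LabeledVector)] -- (id, (y, features)); assumed layout
    val predByFeatures = predictedDataSet.map(lv => (lv.vector.toString, lv.label))
    val idByFeatures   = originalDataSet.map { case (id, lv) => (lv.vector.toString, id) }

    // (id, y_predicted), ready to plot over time
    val timeSeries = predByFeatures.join(idByFeatures).where(0).equalTo(0) {
      (pred, orig) => (orig._2, pred._2)
    }
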
On Jun 4, 2015 7:56 PM, Till Rohrmann wrote:

I see your problem. One way to solve it is to implement a special PredictOperation which takes a tuple (id, vector) and returns a tuple (id, labeledVector). You can take a look at the implementation of the prediction operation for vectors. But we could also discuss adding an ID field to the Vector type.

Cheers,
Till
On Jun 6, 2015 8:14 AM, Felix Neutatz wrote:

That would be great. I like the special predict operation better, because returning the id is only necessary in some cases; a dedicated predict operation would save that overhead.

Best regards,
Felix
On Jun 6, 2015 9:46 AM, Till Rohrmann wrote:

Then you only have to provide an implicit PredictOperation[SVM, (T, Int), (LabeledVector, Int)] value, with T <: Vector, in the scope where you call the predict operation.
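Such an implicit might look roughly like this (a sketch, using the PredictOperation shape that appears later in this thread; the body that threads the id through the actual SVM prediction is elided):

    import org.apache.flink.api.scala._
    import org.apache.flink.ml.classification.SVM
    import org.apache.flink.ml.common.{LabeledVector, ParameterMap}
    import org.apache.flink.ml.math.Vector
    import org.apache.flink.ml.pipeline.PredictOperation

    implicit def idPredictOperation[T <: Vector] =
      new PredictOperation[SVM, (T, Int), (LabeledVector, Int)] {
        override def predict(
            instance: SVM,
            predictParameters: ParameterMap,
            input: DataSet[(T, Int)]): DataSet[(LabeledVector, Int)] = {
          ??? // run the SVM on the vector component, keeping the Int id attached
        }
      }

    // with this value in scope, a call like this would compile:
    // val predictions: DataSet[(LabeledVector, Int)] = svm.predict(taggedVectors)
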
On Jun 8, 2015 10:17 AM, Felix Neutatz wrote:

We probably need this for the other classes of the pipeline as well, in order to pass the ID through the whole pipeline.

Best regards,
Felix
On Jun 8, 2015 7:11 AM, Sachin Goel wrote:

A more general approach would be to take as input which indices of the vector to consider as features. The vector can then be returned as-is, and the user can do whatever they wish with the non-feature values. This wouldn't require extending the predict operation; instead it could be specified on the model itself using a set-parameter function. Or perhaps a better approach is to take this input in the predict operation itself.

Cheers!
Sachin
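As a rough sketch of that idea (the `FeatureIndices` parameter and its wiring are hypothetical, not an existing FlinkML parameter):

    import org.apache.flink.ml.common.Parameter
    import org.apache.flink.ml.math.{DenseVector, Vector}

    // hypothetical parameter: which vector indices to treat as features
    case object FeatureIndices extends Parameter[Seq[Int]] {
      val defaultValue: Option[Seq[Int]] = None // default: every index is a feature
    }

    // inside predict, each input vector would be projected onto the selected indices
    def selectFeatures(v: Vector, indices: Seq[Int]): Vector =
      DenseVector(indices.map(i => v(i)).toArray)

    // e.g. svm.parameters.add(FeatureIndices, Seq(1, 2, 3)) would skip index 0 (the id)
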
On Jun 8, 2015 10:01 AM, Till Rohrmann wrote:

You're right, Felix. You need to provide the `FitOperation` and `PredictOperation` for the `Predictor` you want to use, and the `FitOperation` and `TransformOperation` for all `Transformer`s you want to chain in front of the `Predictor`.

Specifying which features to take could be a solution. However, you would then always carry data along that is not needed; especially for large-scale data, this might be prohibitively expensive. I guess the more efficient solution would be to assign an ID and later join with the removed feature elements.

Cheers,
Till
On Jun 8, 2015 12:00 PM, Mikio Braun wrote:

Hi all,

I think there are a number of issues here:

- Whether or not we generally need ids for our examples. For time-series this is a must, but I think it would also help us with many other things (like partitioning the data, or picking a consistent subset), so I would think adding (numeric) ids in general to LabeledVector would be ok.

- Some machinery to select features. My biggest concern about putting that as a parameter of the learning algorithm is that this is something independent of the learning algorithm, so every algorithm would need to duplicate the code for it. I think it's better if the learning algorithm can assume that the LabeledVector already contains all the relevant features, and that there are other operations to project features or extract a subset of examples.

-M

--
Mikio Braun - http://blog.mikiobraun.de, http://twitter.com/mikiobraun
On Jun 8, 2015 4:06 PM, Theodore Vasiloudis wrote:

I agree with Mikio: ids would be useful overall, and feature selection should not be a part of the learning algorithms. All features in a LabeledVector should be assumed relevant by the learners.
Sachin Goel wrote:

Yes, I agree too. It makes no sense for the learning algorithm to carry extra payload; only relevant data makes sense. Further, adding the ID to the predict operation's type definition seems a legitimate choice. +1 from my side.

Regards,
Sachin Goel
On Jun 8, 2015 4:11 PM, Till Rohrmann wrote (in reply to Mikio Braun):

My gut feeling is also that a `Transformer` would be a good place to implement feature selection. Then you can reuse it across multiple algorithms by simply chaining them together.

However, I don't know yet what's the best way to realize the IDs. One way would be to add an ID field to `Vector` and `LabeledVector`. Another way would be to provide operations for `(ID, Vector)` and `(ID, LabeledVector)` tuple types which reuse the implementations for `Vector` and `LabeledVector`; then the developer doesn't have to implement special operations for the tuple variants. The latter approach has the advantage that you only use memory for IDs if you really need them.

Another question is how to assign the IDs. Does the user have to provide them? Are they randomly chosen? Or do we assign each element an increasing index based on the total number of elements?
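Such a feature-selecting `Transformer`, chained in front of a predictor, might look roughly like this (a sketch: `FeatureSelector` is hypothetical, and the `TransformOperation` shape mirrors the `PredictOperation` one used elsewhere in this thread):

    import org.apache.flink.api.scala._
    import org.apache.flink.ml.common.ParameterMap
    import org.apache.flink.ml.math.{DenseVector, Vector}
    import org.apache.flink.ml.pipeline.{TransformOperation, Transformer}

    // hypothetical Transformer that keeps only the given vector indices
    class FeatureSelector(val indices: Seq[Int]) extends Transformer[FeatureSelector]

    implicit val selectFeatures =
      new TransformOperation[FeatureSelector, Vector, Vector] {
        override def transform(
            instance: FeatureSelector,
            transformParameters: ParameterMap,
            input: DataSet[Vector]): DataSet[Vector] =
          // project every vector onto the selected feature indices
          input.map(v => DenseVector(instance.indices.map(i => v(i)).toArray): Vector)
      }

    // chained in front of the learner, so the learner only ever sees features:
    // val pipeline = new FeatureSelector(Seq(1, 2, 3)).chainPredictor(svm)
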
On Jun 8, 2015 12:52 PM, Sachin Goel wrote:

I think if the user doesn't provide IDs, we can safely assume that they don't need them. We can simply assign an ID of one as a temporary measure and return the result without IDs [just to keep the interface cleaner]. If IDs are provided, we simply use those IDs. A possible template for this, with the plain-vector operation delegating to the id-aware one (using Long ids for concreteness):

    import org.apache.flink.api.scala._
    import org.apache.flink.ml.classification.SVM
    import org.apache.flink.ml.common.{LabeledVector, ParameterMap}
    import org.apache.flink.ml.math.Vector
    import org.apache.flink.ml.pipeline.PredictOperation

    implicit def predictValues[T <: Vector] = {
      new PredictOperation[SVM, T, LabeledVector] {
        override def predict(
            instance: SVM,
            predictParameters: ParameterMap,
            input: DataSet[T]): DataSet[LabeledVector] = {
          // attach a dummy id of 1, reuse the id-aware operation, drop the id again
          predictValuesWithId[T]
            .predict(instance, predictParameters, input.map(x => (1L, x)))
            .map(x => x._2)
        }
      }
    }

    implicit def predictValuesWithId[T <: Vector] = {
      new PredictOperation[SVM, (Long, T), (Long, LabeledVector)] {
        override def predict(
            instance: SVM,
            predictParameters: ParameterMap,
            input: DataSet[(Long, T)]): DataSet[(Long, LabeledVector)] = {
          ??? // the actual SVM prediction, carrying each element's id through
        }
      }
    }

Regards,
Sachin Goel
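The choice of operation would then be driven purely by the input type (again a sketch, assuming both implicits are in scope and `vectors: DataSet[DenseVector]`, `idVectors: DataSet[(Long, DenseVector)]` exist):

    // without ids: DataSet[DenseVector]          => DataSet[LabeledVector]
    // with ids:    DataSet[(Long, DenseVector)]  => DataSet[(Long, LabeledVector)]
    val plain: DataSet[LabeledVector]         = svm.predict(vectors)
    val keyed: DataSet[(Long, LabeledVector)] = svm.predict(idVectors)
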
On Jun 8, 2015 4:26 PM, Felix Neutatz wrote:

I am in favor of efficiency. Therefore I would prefer to introduce new methods, in order to save memory and network traffic. This would also solve the problem of how to come up with ids.

Best regards,
Felix
Sachin Goel wrote:

That would be better, of course. My opinion had to do with not implementing exactly the same thing twice. Perhaps Till could weigh in here. We really do need to come up with a general mechanism for this; testing labeled vectors has exactly the same problem. I'll look into how Spark and scikit-learn approach this.

Regards,
Sachin Goel