(DEPRECATED) Apache Flink Mailing List archive.

LabeledVector with label vector

Hilmi Yildirim

LabeledVector with label vector

Hi,
in the ML-Pipeline of Flink we have the "LabeledVector" class. It
consists of a vector and a label as a double value. Unfortunately, it is
not applicable for sequence learning where the label is also a vector.
For example, in NLP we have a vector of words and the label is a vector
of the corresponding labels.

The optimize function of the "Solver" class takes a DataSet[LabeledVector]
as input and, therefore, it is not applicable for sequence learning. I
think the LabeledVector should be adapted so that the label is a vector
instead of a single Double value. What do you think?

Best Regards,

--
==================================================================
Hilmi Yildirim, M.Sc.
Researcher

DFKI GmbH
Intelligente Analytik für Massendaten
DFKI Projektbüro Berlin
Alt-Moabit 91c
D-10559 Berlin
Phone: +49 30 23895 1814

E-Mail: [hidden email]

-------------------------------------------------------------
Deutsches Forschungszentrum fuer Kuenstliche Intelligenz GmbH
Firmensitz: Trippstadter Strasse 122, D-67663 Kaiserslautern

Geschaeftsfuehrung:
Prof. Dr. Dr. h.c. mult. Wolfgang Wahlster (Vorsitzender)
Dr. Walter Olthoff

Vorsitzender des Aufsichtsrats:
Prof. Dr. h.c. Hans A. Aukes

Amtsgericht Kaiserslautern, HRB 2313
-------------------------------------------------------------

Chiwan Park-2

Re: LabeledVector with label vector

Hi Hilmi,

Thanks for the suggestion about the label type of LabeledVector. Basically, I agree that your suggestion is reasonable. However, I would like to generalize `LabeledVector` as in the following example:

```
case class LabeledVector[T <: Serializable](label: T, vector: Vector) extends Serializable {
  // some implementations for LabeledVector
}
```

How about this implementation? If there are any other opinions, please send an email to the mailing list.
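To make the proposal concrete, here is a minimal, self-contained sketch of a label-generic `LabeledVector`. It is only an illustration: `DenseVector` below is a hypothetical stand-in for the Flink ML vector type so that the snippet compiles on its own, and the object name `GenericLabelSketch` is made up. Note that Scala's `Double` does not conform to a `T <: Serializable` bound, so the sketch leaves `T` unbounded to keep the existing Double-label use case working.

```scala
object GenericLabelSketch {
  // Hypothetical stand-in for the library's vector type (assumption, not the Flink class).
  final case class DenseVector(data: Array[Double]) {
    def size: Int = data.length
  }

  // The label type is a type parameter; T is unbounded in this sketch.
  final case class LabeledVector[T](label: T, vector: DenseVector)

  // Classic case: a single Double label per feature vector.
  val regressionExample: LabeledVector[Double] =
    LabeledVector(1.0, DenseVector(Array(0.5, 2.0)))

  // Sequence-learning case: one label per position of the feature vector.
  val sequenceExample: LabeledVector[DenseVector] =
    LabeledVector(DenseVector(Array(0.0, 1.0, 1.0)), DenseVector(Array(3.0, 7.0, 9.0)))
}
```

Both label shapes share one class here, but any code consuming `LabeledVector[T]` still has to know how to handle each concrete `T`.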

Regards,
Chiwan Park

Hilmi Yildirim

Re: Re: LabeledVector with label vector

In reply to this post by Hilmi Yildirim

Hi,
yes, it is a good idea: one implementation with a single-valued label and
a second implementation with a label vector.

Best Regards,
Hilmi

Theodore Vasiloudis

Re: LabeledVector with label vector

In reply to this post by Chiwan Park-2

Generalizing the type of the label in LabeledVector is an idea we
played with when designing the current optimization framework.

We ended up deciding against it as the double type allows us to do
regressions and (multiclass) classification which should be the majority of
the use cases out there, while keeping the code simple.

Generalizing this to [T <: Serializable] is too broad, I think. [T <:
Vector] is, I think, more reasonable; I cannot think of many cases where the
label in an optimization problem is something other than a vector/double.

Any change would of course require a number of changes in the optimization,
as optimizing for vector and double labels requires different handling of
error calculation etc., but it should be doable.
Note, however, that since LabeledVector is such a core part of the library,
any changes would involve a number of adjustments downstream.

Perhaps having different optimizers etc. for Vectors and double labels
makes sense, but I haven't put much thought into this.
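The point about error calculation can be illustrated with a small sketch. The `Label` hierarchy and the `squaredError` helper below are hypothetical names invented for this example, not part of the library; they only show that a scalar label and a vector label need separate branches even for something as simple as squared error.

```scala
object LabelLossSketch {
  // Hypothetical label hierarchy: either a single value or a vector of values.
  sealed trait Label
  final case class ScalarLabel(value: Double) extends Label
  final case class VectorLabel(values: Array[Double]) extends Label

  // Squared error, handled separately for scalar and vector labels.
  def squaredError(prediction: Label, truth: Label): Double = (prediction, truth) match {
    case (ScalarLabel(p), ScalarLabel(t)) =>
      (p - t) * (p - t)
    case (VectorLabel(p), VectorLabel(t)) =>
      require(p.length == t.length, "label vectors must have equal length")
      // Sum of element-wise squared differences.
      p.zip(t).map { case (a, b) => (a - b) * (a - b) }.sum
    case _ =>
      throw new IllegalArgumentException("prediction and label must be of the same kind")
  }
}
```

A real optimizer would also need matching gradient computations per label kind, which is where most of the downstream adjustments would presumably come from.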

Chiwan Park-2

Re: LabeledVector with label vector

Hi Theodore,

Thanks for explaining the reason. :)

So how about changing LabeledVector to contain two vectors? One vector is for the label and the other one is for the value. I think this approach would be okay because a double value label could be represented as a DenseVector(Array(LABEL_VALUE)).

The only problem with this approach is some overhead from processing the Vector type in the case of a single double label. If the overhead is significant, we should create two types of LabeledVector, such as DoubleLabeledVector and VectorLabeledVector.

Which one is preferred?
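A rough sketch of the two options, for comparison. `DoubleLabeledVector` and `VectorLabeledVector` are the names suggested above; `DenseVector` is again a hypothetical stand-in for the library type, and `toVectorLabeled` is a helper made up for this example to show the one-element wrapping.

```scala
object TwoLabelTypesSketch {
  // Hypothetical stand-in for the library's dense vector (assumption).
  final case class DenseVector(data: Array[Double])

  // Option with two separate types: one per label shape.
  final case class DoubleLabeledVector(label: Double, vector: DenseVector)
  final case class VectorLabeledVector(label: DenseVector, vector: DenseVector)

  // Option with a single vector-labeled type: a scalar label is wrapped
  // in a one-element vector, i.e. DenseVector(Array(LABEL_VALUE)).
  def toVectorLabeled(lv: DoubleLabeledVector): VectorLabeledVector =
    VectorLabeledVector(DenseVector(Array(lv.label)), lv.vector)
}
```

With the wrapping helper, code written against `VectorLabeledVector` can also accept scalar-labeled data, at the cost of the extra array per label.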

Regards,
Chiwan Park

till.rohrmann

Re: LabeledVector with label vector

Hi,

yes, initially we thought about introducing a LabeledVector where the label
can be a vector. However, for the sake of simplicity we decided to first
implement a LabeledVector with a single double value as label.

A simple double value should take 8 bytes of memory space. The
DenseVector(Array(Label_Value)) should take 4 bytes for the array
reference, 16 bytes for the array structure and 8 bytes for the value = 28
bytes. Thus, the space for the label would be roughly tripled. This might
not be too expensive assuming that the most space will be taken by the
feature vector. However, I would also assume that the access to the label
value wrapped in a DenseVector should be a little bit slower since one
first has to retrieve the reference to the array and then access the first
element instead of directly accessing a double field.

I would assume that this should not have such a big impact on the overall
performance. But without running some benchmarks, it’s hard to say.
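The space estimate above can be written out as a short calculation. The byte counts are taken directly from the figures in this message; actual JVM object layout depends on the JVM and its settings, so treat them as assumptions rather than measurements. The object and value names are made up for this sketch.

```scala
object LabelFootprintSketch {
  // Figures as quoted in the message (assumed, not measured).
  val doubleBytes = 8          // a plain double label
  val arrayReferenceBytes = 4  // reference from the DenseVector to its array
  val arrayStructureBytes = 16 // overhead of the array object itself

  // Label wrapped as DenseVector(Array(labelValue)).
  val wrappedLabelBytes: Int = arrayReferenceBytes + arrayStructureBytes + doubleBytes

  // How much larger the wrapped label is than a plain double.
  val ratio: Double = wrappedLabelBytes.toDouble / doubleBytes
}
```

Under these figures the wrapped label takes 28 bytes, i.e. 3.5 times the 8 bytes of a plain double.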

Alternatively, you can also define your own custom type for NLP. I’m not
really familiar with sequence learning but can you use gradient descent for
that? If not, then you have to write your own solver which can also work on
a different type.

Cheers,
Till
