Hi,
in the ML pipeline of Flink we have the "LabeledVector" class. It consists of a vector and a label stored as a Double value. Unfortunately, this makes it unusable for sequence learning, where the label is also a vector. For example, in NLP we have a vector of words, and the label is a vector of the corresponding labels.

The optimize function of the "Solver" class takes a DataSet[LabeledVector] as input and is therefore not applicable to sequence learning either. I think LabeledVector should be adapted so that the label is a vector instead of a single Double value. What do you think?

Best Regards,

--
==================================================================
Hilmi Yildirim, M.Sc.
Researcher

DFKI GmbH
Intelligente Analytik für Massendaten
DFKI Projektbüro Berlin
Alt-Moabit 91c
D-10559 Berlin
Phone: +49 30 23895 1814

E-Mail: [hidden email]

-------------------------------------------------------------
Deutsches Forschungszentrum fuer Kuenstliche Intelligenz GmbH
Firmensitz: Trippstadter Strasse 122, D-67663 Kaiserslautern

Geschaeftsfuehrung:
Prof. Dr. Dr. h.c. mult. Wolfgang Wahlster (Vorsitzender)
Dr. Walter Olthoff

Vorsitzender des Aufsichtsrats:
Prof. Dr. h.c. Hans A. Aukes

Amtsgericht Kaiserslautern, HRB 2313
-------------------------------------------------------------
Hi Hilmi,
Thanks for the suggestion about the label type of LabeledVector. Basically, I agree that your suggestion is reasonable. However, I would like to generalize `LabeledVector` as in the following example:

```
case class LabeledVector[T <: Serializable](label: T, vector: Vector) extends Serializable {
  // some implementations for LabeledVector
}
```

How about this implementation? If there are any other opinions, please send an email to the mailing list.

> On Jan 5, 2016, at 7:36 PM, Hilmi Yildirim <[hidden email]> wrote:
> [...]

Regards,
Chiwan Park
Hi,
yes, that is a good idea: one implementation with a single-valued label and a second implementation with a label vector.

Best Regards,
Hilmi

From: Chiwan Park <[hidden email]>
Date: Tue, Jan 5, 2016 at 12:17 PM
Subject: Re: LabeledVector with label vector
To: [hidden email]
Generalizing the type of the label in LabeledVector is an idea we played with when designing the current optimization framework.

We ended up deciding against it, as the Double type allows us to do regression and (multiclass) classification, which should cover the majority of use cases out there, while keeping the code simple.

Generalizing this to [T <: Serializable] is too broad, I think. [T <: Vector] is more reasonable; I cannot think of many cases where the label in an optimization problem is something other than a vector or a double.

Any change would of course require a number of changes in the optimization code, as optimizing for vector and double labels requires different handling of error calculation etc., but it should be doable. Note, however, that since LabeledVector is such a core part of the library, any change would involve a number of adjustments downstream.

Perhaps having different optimizers etc. for vector and double labels makes sense, but I haven't put much thought into this.

On Tue, Jan 5, 2016 at 12:17 PM, Chiwan Park <[hidden email]> wrote:
> [...]
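To illustrate the point about a `[T <: Vector]` bound and the different error handling for vector and double labels, here is a minimal sketch. It is not FlinkML code: `Vec`, `DenseVec`, `VecLabeled` and both `squaredError` overloads are hypothetical names introduced only for this example, and squared error is just one possible choice of error function.

```scala
object VectorLabelSketch {
  // Hypothetical stand-in for a vector type; not FlinkML's actual Vector.
  sealed trait Vec {
    def size: Int
    def apply(i: Int): Double
  }

  final case class DenseVec(values: Array[Double]) extends Vec {
    def size: Int = values.length
    def apply(i: Int): Double = values(i)
  }

  // Label type bounded by the vector type, i.e. the [T <: Vector] idea.
  final case class VecLabeled[T <: Vec](label: T, vector: Vec)

  // Squared error for a scalar label, as used in regression.
  def squaredError(prediction: Double, label: Double): Double = {
    val diff = prediction - label
    diff * diff
  }

  // Squared error for a vector label: sum of per-component squared differences.
  def squaredError(prediction: Vec, label: Vec): Double = {
    require(prediction.size == label.size, "prediction and label must have the same size")
    (0 until label.size).map { i =>
      val diff = prediction(i) - label(i)
      diff * diff
    }.sum
  }
}
```

The two overloads show why an optimizer written against Double labels cannot simply be reused: the error (and therefore the gradient) has to be computed per component once the label is a vector.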
Hi Theodore,
Thanks for explaining the reason. :)

So how about changing LabeledVector to contain two vectors? One vector is for the label and the other is for the value. I think this approach would be okay, because a Double label could be represented as DenseVector(Array(LABEL_VALUE)).

The only problem with this approach is some overhead from processing a Vector in the case of a single Double label. If the overhead is significant, we should create two types of LabeledVector, such as DoubleLabeledVector and VectorLabeledVector.

Which one is preferred?

> On Jan 5, 2016, at 11:38 PM, Theodore Vasiloudis <[hidden email]> wrote:
> [...]

Regards,
Chiwan Park
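The two-type option mentioned in this message could look roughly like the sketch below. Only the names DoubleLabeledVector and VectorLabeledVector come from the thread; `Vec`, `DenseVec` and the two conversion helpers are hypothetical and exist only to keep the example self-contained. This is not the FlinkML API.

```scala
object TwoLabelTypesSketch {
  // Hypothetical stand-in for a vector type; not FlinkML's actual Vector.
  sealed trait Vec {
    def size: Int
    def apply(i: Int): Double
  }

  final case class DenseVec(values: Array[Double]) extends Vec {
    def size: Int = values.length
    def apply(i: Int): Double = values(i)
  }

  // One type per label kind, as suggested in the message.
  final case class DoubleLabeledVector(label: Double, vector: Vec)
  final case class VectorLabeledVector(label: Vec, vector: Vec)

  // A Double label can always be wrapped in a one-element vector.
  def toVectorLabeled(lv: DoubleLabeledVector): VectorLabeledVector =
    VectorLabeledVector(DenseVec(Array(lv.label)), lv.vector)

  // The reverse is only defined when the label vector has exactly one element.
  def toDoubleLabeled(lv: VectorLabeledVector): Option[DoubleLabeledVector] =
    if (lv.label.size == 1) Some(DoubleLabeledVector(lv.label(0), lv.vector))
    else None
}
```

The asymmetric conversions make the trade-off visible: the vector form can represent every scalar label, while the scalar form avoids the wrapper whenever a single Double is enough.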
Hi,
yes, initially we thought about introducing a LabeledVector where the label can be a vector. However, for the sake of simplicity, we decided to first implement a LabeledVector with a single Double value as label.

A simple Double value should take 8 bytes of memory. The DenseVector(Array(Label_Value)) should take 4 bytes for the array reference, 16 bytes for the array structure and 8 bytes for the value, i.e. 28 bytes. Thus, the space for the label would more than triple (28 / 8 = 3.5). This might not be too expensive, assuming that most of the space is taken by the feature vector.

However, I would also assume that access to a label value wrapped in a DenseVector is a little slower, since one first has to retrieve the reference to the array and then access its first element instead of directly accessing a Double field. I would assume that this does not have a big impact on the overall performance, but without running some benchmarks it’s hard to say.

Alternatively, you can also define your own custom type for NLP. I’m not really familiar with sequence learning, but can you use gradient descent for that? If not, then you have to write your own solver, which can then also work on a different type.

Cheers,
Till

On Wed, Jan 6, 2016 at 2:12 AM, Chiwan Park <[hidden email]> wrote:
> [...]
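The memory estimate in this message can be written out as a small calculation, together with the two access patterns it compares. The byte counts are the rough figures quoted in the message, not measurements; actual object sizes depend on the JVM and its settings. `DirectLabel`, `WrappedLabel` and `read` are hypothetical names used only for illustration.

```scala
object LabelOverheadSketch {
  // Figures taken from the estimate in the message (assumptions, not measurements).
  val doubleBytes = 8          // a plain Double label
  val arrayReferenceBytes = 4  // reference to the backing array
  val arrayStructureBytes = 16 // array structure overhead

  // Label wrapped in a one-element array: reference + array structure + value.
  val wrappedLabelBytes: Int = arrayReferenceBytes + arrayStructureBytes + doubleBytes

  // Size of the wrapped label relative to a plain Double.
  val ratio: Double = wrappedLabelBytes.toDouble / doubleBytes

  // Direct access: the label is a field of the record itself.
  final case class DirectLabel(label: Double)

  // Wrapped access: follow the array reference, then read element 0.
  final case class WrappedLabel(label: Array[Double])

  def read(d: DirectLabel): Double = d.label
  def read(w: WrappedLabel): Double = w.label(0)
}
```

Both `read` variants return the same value; the wrapped one performs an extra dereference, which is the access cost the message suggests checking with a benchmark.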