Hi Slava,
I'm forwarding your message to our new mailing list at Apache: [hidden email]
You can subscribe to the list by sending an (empty) email to: [hidden email].
We are planning to shut down the stratosphere-dev@googlegroups soon.

Regarding your question: When using the Tuples, you don't need to specify a keySelector. It is sufficient to specify the ID(s) of the keys:
http://stratosphere-javadocs.github.io/eu/stratosphere/api/java/DataSet.html#groupBy(int...)
So you should be able to do a ".groupBy(0,3,4)"

Robert

---------- Forwarded message ----------
From: Vyacheslav Zholudev <[hidden email]>
Date: Thu, Jun 12, 2014 at 12:17 AM
Subject: [stratosphere-dev] Grouping by a tuple
To: [hidden email]

Hi,

Being used to the Hive grouping like "GROUP BY userId, productId, year", I'm wondering what's the best way to do it in Stratosphere? The groupBy's KeySelector implies that a Comparable object is returned; however, the obvious choice like TupleN is not comparable. In simple cases I would prefer to avoid introducing extra comparable entities for grouping tuples of "primitive" types. Would it make sense to introduce "ComparableTupleN<T1 extends Comparable<? extends T1>, ..., Tn extends Comparable<? extends Tn>>"?

Or am I missing the obvious Stratosphere way?

Thanks,
Vyacheslav
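To illustrate what grouping on key positions such as ".groupBy(0,3,4)" means, here is a minimal plain-Java sketch. It is not the Stratosphere API: the class `PositionalGroupingSketch`, the method `groupByPositions`, and the sample rows are all hypothetical, and `Object[]` rows merely stand in for tuples.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Plain-Java illustration only; this is NOT the Stratosphere API.
// Object[] rows stand in for tuples, and the positions stand in for
// the key IDs passed to groupBy(0, 3, 4).
class PositionalGroupingSketch {

    // Groups rows by the values found at the given field positions.
    static Map<List<Object>, List<Object[]>> groupByPositions(List<Object[]> rows, int... positions) {
        Map<List<Object>, List<Object[]>> groups = new LinkedHashMap<>();
        for (Object[] row : rows) {
            // The composite key is the list of values at the key positions.
            List<Object> key = new ArrayList<>();
            for (int p : positions) {
                key.add(row[p]);
            }
            groups.computeIfAbsent(key, k -> new ArrayList<>()).add(row);
        }
        return groups;
    }

    public static void main(String[] args) {
        // Hypothetical sample data: (userId, action, timestamp, productId, year).
        List<Object[]> rows = Arrays.asList(
            new Object[] {"u1", "click", 10L, "p7", 2014},
            new Object[] {"u1", "view", 12L, "p7", 2014},
            new Object[] {"u2", "click", 11L, "p7", 2014});

        Map<List<Object>, List<Object[]>> groups = groupByPositions(rows, 0, 3, 4);
        for (Map.Entry<List<Object>, List<Object[]>> e : groups.entrySet()) {
            System.out.println(e.getKey() + " -> " + e.getValue().size() + " row(s)");
        }
    }
}
```

The sketch groups in memory with hashing, so it only conveys the semantics of a multi-field key, not how a distributed engine would partition or sort the data.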
I think the issue is rather grouping a DataSet of custom types on multiple fields than grouping a Tuple DataSet.
In this case you need to use a KeySelector and would like to return a Tuple containing all fields you want to group on. But as Slava said, the returned type must be comparable (which Tuples are not).

I think it should be possible to check at optimization time whether all fields of a tuple are comparable and, if so, allow such tuples to be used as a grouping key.

Would be good to open a JIRA for this in any case. This is a common problem when working with POJOs.
In reply to this post by Robert Metzger
Hi Robert,
thanks, I will post my future questions to that list.
Actually my question is about the situation when I don't have tuples. Assume I have a DataSet<UserData> ds and I want to invoke ds.groupBy(/* grouping by <userId, sessionId, dayOfTheYear> */). The ideal choice would be to return a comparable tuple from the KeySelector.

On a side note, would it be possible to generate the clone method for the tuples? Yesterday I was copying a Tuple13 in a groupReduce function by hand and it was a pretty long line of code :)

Thanks,
Vyacheslav
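The "comparable tuple" idea can be sketched in plain Java as follows. Everything here is an assumption for illustration: `ComparableTuple3`, its `copy()` method, the `UserData` fields, and `groupingKey()` are hypothetical and do not exist in Stratosphere. The sketch only shows a three-field key with lexicographic ordering and a one-line copy.

```java
import java.util.Objects;

// Hypothetical sketch; none of these classes exist in Stratosphere.
class ComparableKeySketch {
    public static void main(String[] args) {
        UserData a = new UserData("u1", "s1", 163, "first event");
        UserData b = new UserData("u1", "s2", 163, "second event");
        // Keys compare field by field: userId, then sessionId, then dayOfTheYear.
        System.out.println(a.groupingKey().compareTo(b.groupingKey()) < 0);
    }
}

// A three-field tuple whose fields are all Comparable, so the tuple is too.
final class ComparableTuple3<A extends Comparable<? super A>,
                             B extends Comparable<? super B>,
                             C extends Comparable<? super C>>
        implements Comparable<ComparableTuple3<A, B, C>> {

    final A f0;
    final B f1;
    final C f2;

    ComparableTuple3(A f0, B f1, C f2) {
        this.f0 = f0;
        this.f1 = f1;
        this.f2 = f2;
    }

    // Lexicographic comparison: the first differing field decides.
    @Override
    public int compareTo(ComparableTuple3<A, B, C> other) {
        int c = f0.compareTo(other.f0);
        if (c != 0) return c;
        c = f1.compareTo(other.f1);
        if (c != 0) return c;
        return f2.compareTo(other.f2);
    }

    // A shallow copy is just another object creation with the same fields.
    ComparableTuple3<A, B, C> copy() {
        return new ComparableTuple3<>(f0, f1, f2);
    }

    @Override
    public boolean equals(Object o) {
        if (!(o instanceof ComparableTuple3)) return false;
        ComparableTuple3<?, ?, ?> t = (ComparableTuple3<?, ?, ?>) o;
        return f0.equals(t.f0) && f1.equals(t.f1) && f2.equals(t.f2);
    }

    @Override
    public int hashCode() {
        return Objects.hash(f0, f1, f2);
    }
}

// Hypothetical POJO standing in for the UserData type mentioned in the thread.
final class UserData {
    final String userId;
    final String sessionId;
    final int dayOfTheYear;
    final String payload;

    UserData(String userId, String sessionId, int dayOfTheYear, String payload) {
        this.userId = userId;
        this.sessionId = sessionId;
        this.dayOfTheYear = dayOfTheYear;
        this.payload = payload;
    }

    // Plays the role of what a KeySelector would return for this record.
    ComparableTuple3<String, String, Integer> groupingKey() {
        return new ComparableTuple3<>(userId, sessionId, dayOfTheYear);
    }
}
```

The bounds use `Comparable<? super A>`, the usual Java idiom for "can be compared with itself", which is a choice made for this sketch rather than something proposed in the thread.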
In reply to this post by Fabian Hueske
+1 for opening a ticket.
In reply to this post by Fabian Hueske
Thanks Robert and Fabian,
> I think the issue is rather grouping a DataSet of custom types on multiple fields than grouping a Tuple DataSet

Yes, that's what I meant.

Btw, would it be possible to generate a copy method for tuples as well? Copying e.g. Tuple15 could be quite tedious in the code, but essentially it's just another object creation.
Copying tuples should be possible. I will open a JIRA for that.
Regarding the grouping on multiple fields: a temporary workaround would be to write a MapFunction which extracts the key fields into a Tuple4<POJO, KEY1, KEY2, KEY3> and group on fields 1, 2, 3. This is btw also very similar to what is done under the hood when using a KeySelector.

Once the Java API is extended to work with named fields of POJOs, we won't even need KeySelector functions for this...
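The workaround above can be sketched in plain Java with streams. This is not Stratosphere code: the nested `UserData` and `Tuple4` classes, `extractKeys`, and `groupOnKeyFields` are hypothetical stand-ins, with `extractKeys` playing the role of the MapFunction and the grouping step using fields 1, 2 and 3 while field 0 carries the original POJO.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

// Plain-Java illustration of the workaround; NOT the Stratosphere API.
class KeyExtractionWorkaroundSketch {

    // Hypothetical POJO with the three fields we want to group on.
    static final class UserData {
        final String userId;
        final String sessionId;
        final int dayOfTheYear;
        final String payload;

        UserData(String userId, String sessionId, int dayOfTheYear, String payload) {
            this.userId = userId;
            this.sessionId = sessionId;
            this.dayOfTheYear = dayOfTheYear;
            this.payload = payload;
        }
    }

    // Minimal stand-in for Tuple4<POJO, KEY1, KEY2, KEY3>.
    static final class Tuple4<T0, T1, T2, T3> {
        final T0 f0;
        final T1 f1;
        final T2 f2;
        final T3 f3;

        Tuple4(T0 f0, T1 f1, T2 f2, T3 f3) {
            this.f0 = f0;
            this.f1 = f1;
            this.f2 = f2;
            this.f3 = f3;
        }
    }

    // Step 1 (the "MapFunction"): keep the POJO in field 0, copy the keys into fields 1..3.
    static Tuple4<UserData, String, String, Integer> extractKeys(UserData u) {
        return new Tuple4<>(u, u.userId, u.sessionId, u.dayOfTheYear);
    }

    // Step 2: group on fields 1, 2, 3 and hand back the original POJOs from field 0.
    static Map<List<Object>, List<UserData>> groupOnKeyFields(List<UserData> input) {
        return input.stream()
            .map(KeyExtractionWorkaroundSketch::extractKeys)
            .collect(Collectors.groupingBy(
                t -> Arrays.<Object>asList(t.f1, t.f2, t.f3),
                Collectors.mapping(t -> t.f0, Collectors.toList())));
    }

    public static void main(String[] args) {
        List<UserData> input = Arrays.asList(
            new UserData("u1", "s1", 163, "a"),
            new UserData("u1", "s1", 163, "b"),
            new UserData("u2", "s9", 163, "c"));
        groupOnKeyFields(input).forEach(
            (key, users) -> System.out.println(key + " -> " + users.size() + " record(s)"));
    }
}
```

Carrying the POJO alongside its extracted keys is what lets the grouping step work purely on positions, at the cost of one extra wrapper object per record.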