(DEPRECATED) Apache Flink Mailing List archive.

Fwd: [stratosphere-dev] Grouping by a tuple

Robert Metzger

Jun 11, 2014; 10:25pm

Fwd: [stratosphere-dev] Grouping by a tuple

Hi Slava,

I'm forwarding your message to our new mailing list at Apache:
[hidden email]
You can subscribe to the list by sending an (empty) email to:
[hidden email].
We are planning to shut down the stratosphere-dev@googlegroups soon.

Regarding your question: When using Tuples, you don't need to specify a
KeySelector. It is sufficient to specify the position(s) of the key fields:
http://stratosphere-javadocs.github.io/eu/stratosphere/api/java/DataSet.html#groupBy(int...)
So you should be able to do a ".groupBy(0,3,4)"

Robert
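
For illustration, position-based grouping of the kind that ".groupBy(0,3,4)" expresses can be sketched in plain Java. This is not the Stratosphere API: the class PositionalGrouping, its groupBy helper, and the sample rows are hypothetical, and the sketch only mimics the grouping semantics on an in-memory list.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class PositionalGrouping {

    // Groups rows by the values found at the given field positions.
    // A List serves as the key because its equals/hashCode work element by element.
    public static Map<List<Object>, List<Object[]>> groupBy(List<Object[]> rows, int... positions) {
        Map<List<Object>, List<Object[]>> groups = new LinkedHashMap<>();
        for (Object[] row : rows) {
            List<Object> key = new ArrayList<>(positions.length);
            for (int p : positions) {
                key.add(row[p]);
            }
            groups.computeIfAbsent(key, k -> new ArrayList<>()).add(row);
        }
        return groups;
    }

    public static void main(String[] args) {
        // Hypothetical five-field rows; fields 0, 3 and 4 form the grouping key.
        List<Object[]> rows = new ArrayList<>();
        rows.add(new Object[] {"u1", 10, "x", "p1", 2014});
        rows.add(new Object[] {"u1", 20, "y", "p1", 2014});
        rows.add(new Object[] {"u2", 30, "z", "p1", 2014});

        Map<List<Object>, List<Object[]>> groups = groupBy(rows, 0, 3, 4);
        for (Map.Entry<List<Object>, List<Object[]>> e : groups.entrySet()) {
            System.out.println(e.getKey() + " -> " + e.getValue().size() + " row(s)");
        }
    }
}
```

The key never has to be a user-defined comparable type here, which is the point of grouping by field positions.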

---------- Forwarded message ----------
From: Vyacheslav Zholudev <[hidden email]>
Date: Thu, Jun 12, 2014 at 12:17 AM
Subject: [stratosphere-dev] Grouping by a tuple
To: [hidden email]

Hi,

Being used to Hive grouping like "GROUP BY userId, productId, year", I'm
wondering what the best way to do it in Stratosphere is. The groupBy's
KeySelector implies that a Comparable object is returned; however, the
obvious choice, TupleN, is not comparable. In simple cases I would
prefer to avoid introducing extra comparable entities just for grouping tuples
of "primitive" types. Would it make sense to introduce "ComparableTupleN<T1
extends Comparable<? extends T1>, ..., Tn extends Comparable<? extends
Tn>>"?

Or am I missing the obvious Stratosphere way of doing this?

Thanks,
Vyacheslav

Fabian Hueske

Jun 11, 2014; 10:46pm

Re: [stratosphere-dev] Grouping by a tuple

I think the issue is rather grouping a DataSet of custom types on multiple
fields than grouping a Tuple DataSet.
In this case you need to use a KeySelector and would like to return a Tuple
containing all fields you want to group on.
But as Slava said, the returned type must be comparable (which Tuples are
not).

I think it should be possible to check at optimization time whether all
fields of a tuple are comparable and, if so, allow such tuples to be used as
a grouping key.

Would be good to open a JIRA for this in any case. This is a common problem
when working with POJOs.
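
To make the idea concrete, here is a minimal plain-Java sketch of a key that is comparable because all of its fields are. It is not Stratosphere code: CompositeKey, UserData, and the groupBy helper are hypothetical names, and the key selector is modelled as a java.util.function.Function.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.function.Function;

public class ComparableKeySketch {

    // Hypothetical composite key that compares its fields left to right.
    public static final class CompositeKey implements Comparable<CompositeKey> {
        private final Comparable<?>[] fields;

        public CompositeKey(Comparable<?>... fields) {
            this.fields = fields.clone();
        }

        // Assumes fields at the same position have mutually comparable types;
        // otherwise compareTo throws a ClassCastException.
        @Override
        @SuppressWarnings({"unchecked", "rawtypes"})
        public int compareTo(CompositeKey other) {
            int n = Math.min(fields.length, other.fields.length);
            for (int i = 0; i < n; i++) {
                int c = ((Comparable) fields[i]).compareTo(other.fields[i]);
                if (c != 0) {
                    return c;
                }
            }
            return Integer.compare(fields.length, other.fields.length);
        }

        @Override
        public boolean equals(Object o) {
            return o instanceof CompositeKey && Arrays.equals(fields, ((CompositeKey) o).fields);
        }

        @Override
        public int hashCode() {
            return Arrays.hashCode(fields);
        }

        @Override
        public String toString() {
            return Arrays.toString(fields);
        }
    }

    // Hypothetical custom type standing in for a POJO.
    public static final class UserData {
        public final String userId;
        public final String sessionId;
        public final int dayOfYear;

        public UserData(String userId, String sessionId, int dayOfYear) {
            this.userId = userId;
            this.sessionId = sessionId;
            this.dayOfYear = dayOfYear;
        }
    }

    // Groups items by the key the selector returns; TreeMap relies on compareTo.
    public static <T> Map<CompositeKey, List<T>> groupBy(List<T> items, Function<T, CompositeKey> keySelector) {
        Map<CompositeKey, List<T>> groups = new TreeMap<>();
        for (T item : items) {
            groups.computeIfAbsent(keySelector.apply(item), k -> new ArrayList<>()).add(item);
        }
        return groups;
    }

    public static void main(String[] args) {
        List<UserData> data = new ArrayList<>();
        data.add(new UserData("u2", "s1", 163));
        data.add(new UserData("u1", "s1", 163));
        data.add(new UserData("u1", "s1", 163));

        Map<CompositeKey, List<UserData>> groups =
            groupBy(data, u -> new CompositeKey(u.userId, u.sessionId, u.dayOfYear));
        groups.forEach((key, members) -> System.out.println(key + " -> " + members.size()));
    }
}
```

The comparability check happens at run time in this sketch; the suggestion above is to perform an equivalent check at optimization time.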

Vyacheslav Zholudev

Jun 12, 2014; 7:46am

Re: [stratosphere-dev] Grouping by a tuple

In reply to this post by Robert Metzger

Hi Robert,

thanks, I will post my future questions to that list.

> Regarding your question: When using the Tuples, you don't need to specify a keySelector. It is sufficient to specify the ID(s) of the keys: http://stratosphere-javadocs.github.io/eu/stratosphere/api/java/DataSet.html#groupBy(int...)
>
> So you should be able to do a ".groupBy(0,3,4)"

Actually, my question is about the situation where I don't have tuples. Assume I have a DataSet<UserData> ds and want to invoke ds.groupBy(/* grouping by <userId, sessionId, dayOfTheYear> */). The ideal choice would be to return a comparable tuple from the KeySelector.

On a side note, would it be possible to generate a clone method for the tuples? Yesterday I was copying a Tuple13 by hand in a groupReduce function, and it was a pretty long line of code :)

Thanks,

Vyacheslav

Robert Metzger

Jun 12, 2014; 7:53am

Re: [stratosphere-dev] Grouping by a tuple

In reply to this post by Fabian Hueske

+1 for opening a ticket.

Vyacheslav Zholudev

Jun 12, 2014; 7:53am

Re: [stratosphere-dev] Grouping by a tuple

In reply to this post by Fabian Hueske

Thanks Robert and Fabian,

> I think the issue is rather grouping a DataSet of custom types on multiple
> fields than grouping a Tuple DataSet

Yes, that's what I meant.

Btw, would it be possible to generate a copy method for tuples as well? Copying e.g. Tuple15 could be quite tedious in the code, but essentially it's just another object creation.

Fabian Hueske

Jun 12, 2014; 8:15am

Re: [stratosphere-dev] Grouping by a tuple

Copying tuples should be possible. I will open a JIRA for that.

Regarding grouping on multiple fields: a temporary workaround would be
to write a MapFunction which extracts the key fields into a Tuple4<POJO,
KEY1, KEY2, KEY3> and then group on fields 1, 2, 3. This is btw also very
similar to what is done under the hood when using a KeySelector.
Once the Java API has been extended to work with named fields of POJOs, we
won't even need KeySelector functions for this...
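
The map-then-group workaround can be sketched roughly as follows in plain Java. This only imitates the pattern with java.util.stream on an in-memory list and does not use the Stratosphere MapFunction or Tuple4 classes: Visit, toRow, and mapThenGroup are hypothetical names, and an Object[] stands in for the Tuple4.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

public class MapThenGroupSketch {

    // Hypothetical POJO with three fields that together form the grouping key.
    public static final class Visit {
        public final String userId;
        public final String sessionId;
        public final int dayOfYear;

        public Visit(String userId, String sessionId, int dayOfYear) {
            this.userId = userId;
            this.sessionId = sessionId;
            this.dayOfYear = dayOfYear;
        }
    }

    // "Map" step: keep the record at position 0 and copy its key fields to positions 1..3.
    public static Object[] toRow(Visit v) {
        return new Object[] {v, v.userId, v.sessionId, v.dayOfYear};
    }

    // "Group" step: group the mapped rows on positions 1, 2 and 3.
    public static Map<List<Object>, List<Object[]>> mapThenGroup(List<Visit> visits) {
        return visits.stream()
            .map(MapThenGroupSketch::toRow)
            .collect(Collectors.groupingBy(row -> Arrays.asList(row[1], row[2], row[3])));
    }

    public static void main(String[] args) {
        List<Visit> visits = new ArrayList<>();
        visits.add(new Visit("u1", "s1", 163));
        visits.add(new Visit("u1", "s1", 163));
        visits.add(new Visit("u1", "s2", 163));

        mapThenGroup(visits).forEach((key, rows) ->
            System.out.println(key + " -> " + rows.size() + " record(s)"));
    }
}
```

The original record stays available at position 0 of each row, so a later reduce step can still work on the full object.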
