Conceptual difference Windows and DataSet

classic Classic list List threaded Threaded
3 messages Options
Reply | Threaded
Open this post in threaded view
|

Conceptual difference Windows and DataSet

Kevin Jacobs
Hi,

I have the following use case:

     1. Group by a specific field.

     2. Get a list of all messages belonging to the group.

     3. Count the number of records in the group.

With the use of DataSets, it is fairly easy to do this (see
http://stackoverflow.com/questions/38745446/apache-flink-sum-and-keep-grouped/38747685#38747685):

|fromElements(("a-b", "data1", 1), ("a-c", "data2", 1), ("a-b", "data3",
1)). groupBy(0). reduceGroup { (it: Iterator[(String, String, Int)],
out: Collector[(String, List[String], Int)]) => { val group = it.toList
if (group.length > 0) out.collect((group(0)._1, group.map(_._2),
group.map(_._3).sum)) } |

So, now I am moving to DataStreams (since the input is really a
DataStream). From my perspective, a Window should provide the same
functionality as a DataSet. This would easify the process a lot:

     1. Window the elements.

     2. Apply the same operations as before.

Is there a way in Flink to do so? Otherwise, I would like to think of a
solution to this problem.

Regards,
Kevin
Reply | Threaded
Open this post in threaded view
|

Re: Conceptual difference Windows and DataSet

Theodore Vasiloudis
Hello Kevin,

I'm not very familiar with the stream API, but I think you can achieve what
you want by mapping over your elements to turn the
strings into one-item lists, so that you get a key-value that is (K:
String, V: (List[String], Int))  and then apply the window reduce function,
which produces a data stream out of
a windowed stream, you combine your lists there and sum the value. Again,
it's not a great way to use reduce, since you are growing the list with
each reduction.

Regards,
Theodore

On Thu, Aug 4, 2016 at 1:36 AM, Kevin Jacobs <[hidden email]> wrote:

> Hi,
>
> I have the following use case:
>
>     1. Group by a specific field.
>
>     2. Get a list of all messages belonging to the group.
>
>     3. Count the number of records in the group.
>
> With the use of DataSets, it is fairly easy to do this (see
> http://stackoverflow.com/questions/38745446/apache-flink-
> sum-and-keep-grouped/38747685#38747685):
>
> |fromElements(("a-b", "data1", 1), ("a-c", "data2", 1), ("a-b", "data3",
> 1)). groupBy(0). reduceGroup { (it: Iterator[(String, String, Int)], out:
> Collector[(String, List[String], Int)]) => { val group = it.toList if
> (group.length > 0) out.collect((group(0)._1, group.map(_._2),
> group.map(_._3).sum)) } |
>
> So, now I am moving to DataStreams (since the input is really a
> DataStream). From my perspective, a Window should provide the same
> functionality as a DataSet. This would easify the process a lot:
>
>     1. Window the elements.
>
>     2. Apply the same operations as before.
>
> Is there a way in Flink to do so? Otherwise, I would like to think of a
> solution to this problem.
>
> Regards,
> Kevin
>
Reply | Threaded
Open this post in threaded view
|

Re: Conceptual difference Windows and DataSet

Stephan Ewen
Hi Kevin!

The windows in Flink's DataStream API are organized by key. The reason is
that the windows are very flexible, and each key can form different windows
than the other (think sessions per user - each session starts and stops
differently).

There has been discussion about introducing something like "aligned
windows". These types of windows would be the same across all keys and
could therefor be globally organized. One could even think that these offer
DataSet-like features.
That is a bit into the future, still.

Greeting,
Stephan


On Sat, Aug 6, 2016 at 11:58 PM, Theodore Vasiloudis <
[hidden email]> wrote:

> Hello Kevin,
>
> I'm not very familiar with the stream API, but I think you can achieve what
> you want by mapping over your elements to turn the
> strings into one-item lists, so that you get a key-value that is (K:
> String, V: (List[String], Int))  and then apply the window reduce function,
> which produces a data stream out of
> a windowed stream, you combine your lists there and sum the value. Again,
> it's not a great way to use reduce, since you are growing the list with
> each reduction.
>
> Regards,
> Theodore
>
> On Thu, Aug 4, 2016 at 1:36 AM, Kevin Jacobs <[hidden email]> wrote:
>
> > Hi,
> >
> > I have the following use case:
> >
> >     1. Group by a specific field.
> >
> >     2. Get a list of all messages belonging to the group.
> >
> >     3. Count the number of records in the group.
> >
> > With the use of DataSets, it is fairly easy to do this (see
> > http://stackoverflow.com/questions/38745446/apache-flink-
> > sum-and-keep-grouped/38747685#38747685):
> >
> > |fromElements(("a-b", "data1", 1), ("a-c", "data2", 1), ("a-b", "data3",
> > 1)). groupBy(0). reduceGroup { (it: Iterator[(String, String, Int)], out:
> > Collector[(String, List[String], Int)]) => { val group = it.toList if
> > (group.length > 0) out.collect((group(0)._1, group.map(_._2),
> > group.map(_._3).sum)) } |
> >
> > So, now I am moving to DataStreams (since the input is really a
> > DataStream). From my perspective, a Window should provide the same
> > functionality as a DataSet. This would easify the process a lot:
> >
> >     1. Window the elements.
> >
> >     2. Apply the same operations as before.
> >
> > Is there a way in Flink to do so? Otherwise, I would like to think of a
> > solution to this problem.
> >
> > Regards,
> > Kevin
> >
>