(DEPRECATED) Apache Flink Mailing List archive.

Does flink support groupByKey([numTasks])

Classic

List

Threaded

3 messages Options

Liang Chen

Does flink support groupByKey([numTasks])

Hi

Now i am considering migrate Sparkstreaming case to Flink for comparing performance.

Does flink support groupByKey([numTasks]) ,When called on a dataset of (K, V) pairs, returns a dataset of (K, Iterable<V>) pairs.
If it is not exist, how to use groupBy() to implement the same function?

Martin Liesenberg

Re: Does flink support groupByKey([numTasks])

Hi,

as far as I can tell there is no direct equivalent, which is probably due
to the underlying execution models.

I think the desired behaviour can be expressed by something along the lines
of:
stream.groupBy(0).window(Count.of(<size>))
where:
stream is a DataStream<Tuple2<K, V>> and <size> would be the batch size of
your SparkStreaming job.

The window can also be expressed in terms of time which would look
something like this: .window(Time.of(<time>, <time_unit>))

You can find slides on the streaming API at [1] and there is a number of
examples at [2]

best regards,
martin

[1] http://dataartisans.github.io/flink-training/dataStreamBasics/intro.html
[2]
https://github.com/dataArtisans/flink-training-exercises/tree/master/src/main/java/com/dataArtisans/flinkTraining/exercises/dataStreamJava

Liang Chen <[hidden email]> schrieb am Sa., 12. Sep. 2015 um
05:53 Uhr:

> Hi
>
> Now i am considering migrate Sparkstreaming case to Flink for comparing
> performance.
>
> Does flink support groupByKey([numTasks]) ,When called on a dataset of (K,
> V) pairs, returns a dataset of (K, Iterable<V>) pairs.
> If it is not exist, how to use groupBy() to implement the same function?
>
>
>
> --
> View this message in context:
> http://apache-flink-mailing-list-archive.1008284.n3.nabble.com/Does-flink-support-groupByKey-numTasks-tp7973.html
> Sent from the Apache Flink Mailing List archive. mailing list archive at
> Nabble.com.
>

Fabian Hueske-2

Re: Does flink support groupByKey([numTasks])

Hi Liang,

Martin gave a very good answer.

I'd like to add a few words regarding the different stream processing
models of Spark and Flink. Spark Streaming is based on mini-batches and
each mini-batch is processed like a batch program. Hence a Spark Streaming
program is implemented like a batch program.

In contrast, Flink pipelines data streams and processes them an infinite
sequence of data. Because data streams are infinite, you cannot simply
group their data (a group would never be closed). This issue is addressed
by using windows which discretize a stream by time or count.

Although Spark's mini-batches can be interpreted as a global time windows,
it can be tricky to implement exactly the same semantics in Flink because
the internal implementations of Spark and Flink differ.

Best, Fabian

2015-09-12 12:33 GMT+02:00 Martin Liesenberg <[hidden email]>:

> Hi,
>
> as far as I can tell there is no direct equivalent, which is probably due
> to the underlying execution models.
>
> I think the desired behaviour can be expressed by something along the lines
> of:
> stream.groupBy(0).window(Count.of(<size>))
> where:
> stream is a DataStream<Tuple2<K, V>> and <size> would be the batch size of
> your SparkStreaming job.
>
> The window can also be expressed in terms of time which would look
> something like this: .window(Time.of(<time>, <time_unit>))
>
> You can find slides on the streaming API at [1] and there is a number of
> examples at [2]
>
> best regards,
> martin
>
> [1]
> http://dataartisans.github.io/flink-training/dataStreamBasics/intro.html
> [2]
>
> https://github.com/dataArtisans/flink-training-exercises/tree/master/src/main/java/com/dataArtisans/flinkTraining/exercises/dataStreamJava
>
>
> Liang Chen <[hidden email]> schrieb am Sa., 12. Sep. 2015 um
> 05:53 Uhr:
>
> > Hi
> >
> > Now i am considering migrate Sparkstreaming case to Flink for comparing
> > performance.
> >
> > Does flink support groupByKey([numTasks]) ,When called on a dataset of
> (K,
> > V) pairs, returns a dataset of (K, Iterable<V>) pairs.
> > If it is not exist, how to use groupBy() to implement the same function?
> >
> >
> >
> > --
> > View this message in context:
> >
> http://apache-flink-mailing-list-archive.1008284.n3.nabble.com/Does-flink-support-groupByKey-numTasks-tp7973.html
> > Sent from the Apache Flink Mailing List archive. mailing list archive at
> > Nabble.com.
> >
>