(DEPRECATED) Apache Flink Mailing List archive.

Benchmarks of Flink, supporting Flink in BigDataBench

Xinhui Tian

Benchmarks of Flink, supporting Flink in BigDataBench

Hello, everyone.

I'm a PhD student from the Institute of Computing Technology, Chinese Academy of Sciences. Our team has released a benchmark for big data systems called BigDataBench, which has become an industry-standard big data benchmark in China. You can find our work on this website: http://prof.ict.ac.cn/BigDataBench/

We are now planning to support Flink in our benchmark, which could provide a set of workloads across different domains and an objective comparison with systems such as Spark and Hadoop. But we are new to this system, so we are asking for your advice about benchmark design. The first thing is to decide what workloads should be added to our benchmark and which domains we should pay more attention to.

The attachment is a preliminary plan, which lists some workloads that have already been implemented in the Spark version. We plan to first implement these workloads on Flink and then evaluate the two systems. Does anyone have advice on this list? We will be very grateful for any ideas.
BigDataBench_for_Flink.docx

Thanks ;)

Stephan Ewen

Re: Benchmarks of Flink, supporting Flink in BigDataBench

Hi!

Thanks for reaching out and adding Flink to BigDataBench.

The plan you sent looks like a nice first draft. It consists almost
entirely of batch jobs. Here are a few ideas for what you could add as
batch jobs:

- Joins are something people seem to do a lot with these systems, so a 2-3
table join would be a nice addition

- For batch algorithms, it is often interesting to scale them beyond
memory (we have seen that a lot from users)

- For graph algorithms, you can try incremental versions (see here:
http://data-artisans.com/data-analysis-with-flink.html)

On the streaming side, it is harder, as the systems are very different
there and not every system can do everything.
For Flink, some ideas would be:
- Streaming Grep
- Streaming pattern detection (see
https://github.com/StephanEwen/flink-demos/tree/master/streaming-state-machine)
- Streaming word count
- For streaming Jobs, it is often interesting to play with enabled /
disabled fault tolerance
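
To make the streaming grep and streaming word count ideas above concrete, here is a minimal plain-Java sketch of the per-record logic only. It is not Flink code and uses no Flink API; the class and method names (StreamingSketch, grep, WordCounter) are made up for illustration. Grep is a stateless filter, while word count keeps running state, and it is that kind of state a fault-tolerance mechanism would presumably have to snapshot, which is why toggling fault tolerance is likely to affect the two workloads differently.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Pattern;

// Illustrative only: per-record logic of two streaming workloads,
// written without any stream-processing framework.
final class StreamingSketch {

    // Streaming grep: decide for each incoming line, independently of
    // all earlier lines, whether it contains a match for the pattern.
    static boolean grep(Pattern pattern, String line) {
        return pattern.matcher(line).find();
    }

    // Streaming word count: running counts that are updated per line.
    static final class WordCounter {
        private final Map<String, Long> counts = new HashMap<>();

        // Split the line on non-word characters and bump each word's count.
        void accept(String line) {
            for (String word : line.toLowerCase().split("\\W+")) {
                if (!word.isEmpty()) {
                    counts.merge(word, 1L, Long::sum);
                }
            }
        }

        // Current count for a word, 0 if it has not been seen yet.
        long count(String word) {
            return counts.getOrDefault(word, 0L);
        }
    }
}
```

Feeding the lines "to be or not to be" and "To do" into one WordCounter leaves the count for "to" at 3 and for "be" at 2.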

A few generic comments on Flink, for performance testing.

- The Java API is usually slightly faster than the Scala API, but only by
a bit
- Tuples (Java) and case classes (Scala) usually beat POJOs in performance.
- If your implementation allows it, turning on object reuse
("enableObjectReuse()" on the ExecutionConfig) can gain some performance.
- If you implement sorting / TeraSort, have a look here for how to make
sure that Flink handles the Hadoop types efficiently:
http://eastcirclek.blogspot.kr/2015/06/terasort-for-spark-and-flink-with-range.html
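
The "if your implementation allows it" caveat on object reuse can be illustrated with a small plain-Java sketch. This is not Flink code: ObjectReuseDemo, Record, and reusingIterator are hypothetical names, and the iterator only mimics a runtime that hands the same mutable object to user code on every call. Under that assumption, code that keeps references to its inputs sees them overwritten, while code that copies out what it needs stays correct.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;

// Illustrative only: mimics a runtime that reuses one record instance.
final class ObjectReuseDemo {

    static final class Record {
        String word;
        int count;
    }

    // Returns an iterator that overwrites and returns the SAME Record each time.
    static Iterator<Record> reusingIterator(final String[] words) {
        final Record shared = new Record();
        return new Iterator<Record>() {
            private int next = 0;

            @Override
            public boolean hasNext() {
                return next < words.length;
            }

            @Override
            public Record next() {
                if (!hasNext()) {
                    throw new NoSuchElementException();
                }
                shared.word = words[next++];
                shared.count = 1;
                return shared;
            }
        };
    }

    // Unsafe under reuse: stores references, which all point to one object.
    static List<String> collectUnsafe(Iterator<Record> records) {
        List<Record> kept = new ArrayList<>();
        while (records.hasNext()) {
            kept.add(records.next());
        }
        List<String> words = new ArrayList<>();
        for (Record r : kept) {
            words.add(r.word);
        }
        return words;
    }

    // Safe under reuse: copies the needed field before asking for the next record.
    static List<String> collectSafe(Iterator<Record> records) {
        List<String> words = new ArrayList<>();
        while (records.hasNext()) {
            words.add(records.next().word);
        }
        return words;
    }
}
```

For the input {"a", "b", "c"}, collectUnsafe returns [c, c, c] and collectSafe returns [a, b, c]. A benchmark implementation that buffers input objects would need the second style before object reuse can be switched on.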

Greetings,
Stephan

On Mon, Jul 20, 2015 at 9:47 AM, Xinhui Tian <[hidden email]> wrote:

> [...]

Fabian Hueske-2

Re: Benchmarks of Flink, supporting Flink in BigDataBench

Hi,

welcome to the Flink community and thanks for including Flink in your
benchmark suite! That's really exciting news :-)

Most of the jobs that you listed in your preliminary plan are available as
example programs in Flink's code base [1].
However, you should know that these examples are NOT tuned for performance
but rather for easy understanding and to showcase certain features.

If your implementations of the Flink programs are available online (e.g., on
GitHub), we could assist with some performance tuning.

Thank you,
Fabian

[1]
https://github.com/apache/flink/tree/master/flink-examples/flink-java-examples/src/main/java/org/apache/flink/examples/java

2015-07-20 10:19 GMT+02:00 Stephan Ewen <[hidden email]>:

> [...]

hawin

Re: Benchmarks of Flink, supporting Flink in BigDataBench

In reply to this post by Xinhui Tian

Hi Xinhui

As Stephan mentioned for the batch jobs, a 2-3 table join would be a nice addition.
Can we use the same queries as the Spark examples below to implement it?
Thanks.

For example:
1. Scan Query
SELECT pageURL, pageRank FROM rankings WHERE pageRank > X

2. Aggregation Query
SELECT SUBSTR(sourceIP, 1, X), SUM(adRevenue) FROM uservisits GROUP BY SUBSTR(sourceIP, 1, X)

3. Join Query
SELECT sourceIP, totalRevenue, avgPageRank
FROM
  (SELECT sourceIP,
          AVG(pageRank) as avgPageRank,
          SUM(adRevenue) as totalRevenue
   FROM Rankings AS R, UserVisits AS UV
   WHERE R.pageURL = UV.destURL
     AND UV.visitDate BETWEEN Date('1980-01-01') AND Date('X')
   GROUP BY UV.sourceIP)
ORDER BY totalRevenue DESC LIMIT 1

https://amplab.cs.berkeley.edu/benchmark/
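
For checking the output of a Flink (or Spark) implementation on small inputs, the three queries above can be written as a plain-Java reference. This is only a sketch under stated assumptions: the record layouts follow the column names used in the queries, pageURL is assumed to be unique in Rankings, and every class and method name (ReferenceQueries, scan, aggregate, topRevenue) is made up for illustration rather than taken from any benchmark kit.

```java
import java.time.LocalDate;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Illustrative single-machine reference versions of the three queries.
final class ReferenceQueries {

    static final class Ranking {
        final String pageURL;
        final int pageRank;
        Ranking(String pageURL, int pageRank) {
            this.pageURL = pageURL;
            this.pageRank = pageRank;
        }
    }

    static final class UserVisit {
        final String sourceIP;
        final String destURL;
        final LocalDate visitDate;
        final double adRevenue;
        UserVisit(String sourceIP, String destURL, LocalDate visitDate, double adRevenue) {
            this.sourceIP = sourceIP;
            this.destURL = destURL;
            this.visitDate = visitDate;
            this.adRevenue = adRevenue;
        }
    }

    static final class JoinResult {
        final String sourceIP;
        final double totalRevenue;
        final double avgPageRank;
        JoinResult(String sourceIP, double totalRevenue, double avgPageRank) {
            this.sourceIP = sourceIP;
            this.totalRevenue = totalRevenue;
            this.avgPageRank = avgPageRank;
        }
    }

    // 1. Scan query: keep rankings with pageRank > x.
    static List<Ranking> scan(List<Ranking> rankings, int x) {
        List<Ranking> out = new ArrayList<>();
        for (Ranking r : rankings) {
            if (r.pageRank > x) {
                out.add(r);
            }
        }
        return out;
    }

    // 2. Aggregation query: sum adRevenue per first x characters of sourceIP.
    static Map<String, Double> aggregate(List<UserVisit> visits, int x) {
        Map<String, Double> sums = new TreeMap<>();
        for (UserVisit v : visits) {
            String prefix = v.sourceIP.substring(0, Math.min(x, v.sourceIP.length()));
            sums.merge(prefix, v.adRevenue, Double::sum);
        }
        return sums;
    }

    // 3. Join query: join on pageURL = destURL, keep visits dated between
    // 1980-01-01 and x (inclusive), group by sourceIP, return the group with
    // the highest total revenue. Returns null when no row qualifies; ties
    // are broken arbitrarily.
    static JoinResult topRevenue(List<Ranking> rankings, List<UserVisit> visits, LocalDate x) {
        Map<String, Integer> rankByUrl = new HashMap<>();
        for (Ranking r : rankings) {
            rankByUrl.put(r.pageURL, r.pageRank);
        }
        // Per sourceIP: [0] = revenue sum, [1] = pageRank sum, [2] = row count.
        Map<String, double[]> groups = new HashMap<>();
        LocalDate from = LocalDate.of(1980, 1, 1);
        for (UserVisit v : visits) {
            Integer rank = rankByUrl.get(v.destURL);
            boolean inRange = !v.visitDate.isBefore(from) && !v.visitDate.isAfter(x);
            if (rank == null || !inRange) {
                continue;
            }
            double[] g = groups.computeIfAbsent(v.sourceIP, k -> new double[3]);
            g[0] += v.adRevenue;
            g[1] += rank;
            g[2] += 1;
        }
        JoinResult best = null;
        for (Map.Entry<String, double[]> e : groups.entrySet()) {
            double[] g = e.getValue();
            if (best == null || g[0] > best.totalRevenue) {
                best = new JoinResult(e.getKey(), g[0], g[1] / g[2]);
            }
        }
        return best;
    }
}
```

The join here is a simple hash join that builds a map over Rankings, which only works while that table fits in memory; a distributed implementation would pick its own join strategy.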