Hello, everyone.
I'm a PhD student from the Institute of Computing Technology, Chinese Academy of Sciences. Our team has released a benchmark for big data systems called BigDataBench, which has become an industry-standard big data benchmark in China. You can find our work on this website: http://prof.ict.ac.cn/BigDataBench/

We are now planning to support Flink in our benchmark, which would provide a set of workloads across different domains and an objective comparison with systems such as Spark and Hadoop. Since we are new to Flink, we are asking for your advice on benchmark design. The first step is to decide which workloads should be added to our benchmark and which domains we should pay more attention to.

The attachment is a preliminary plan, which lists some workloads that have already been implemented in the Spark version. We plan to implement these workloads on Flink first and then evaluate the two systems. Does anyone have advice on this list? We would be very grateful for any ideas.

BigDataBench_for_Flink.docx
<http://apache-flink-mailing-list-archive.1008284.n3.nabble.com/file/n7079/BigDataBench_for_Flink.docx>

Thanks ;)
Hi!
Thanks for reaching out and adding Flink to BigDataBench.

The plan you sent looks like a nice first draft. It consists almost entirely of batch jobs. Here are a few ideas for what you could add as batch jobs:

- Joins are something people seem to do a lot with these systems, so a 2-3 table join would be a nice addition

- For batch algorithms, it is often interesting to scale them beyond memory (we have seen that a lot from users)

- For graph algorithms, you can try incremental versions (see here: http://data-artisans.com/data-analysis-with-flink.html)

On the streaming side, it is harder, as the systems are very different there and not every system can do everything. For Flink, some ideas would be:

- Streaming Grep
- Streaming pattern detection (see https://github.com/StephanEwen/flink-demos/tree/master/streaming-state-machine )
- Streaming word count
- For streaming jobs, it is often interesting to compare runs with fault tolerance enabled and disabled

A few generic comments on Flink, for performance testing:

- The Java API is usually slightly faster than the Scala API, but only by a bit
- Tuples (Java) and case classes (Scala) usually beat POJOs in performance.
- If your implementation allows it, turning on "objectReuseMode()" can gain some performance.
- If you implement sorting / TeraSort, have a look here for how to make sure that Flink handles the Hadoop types efficiently: http://eastcirclek.blogspot.kr/2015/06/terasort-for-spark-and-flink-with-range.html

Greetings,
Stephan

On Mon, Jul 20, 2015 at 9:47 AM, Xinhui Tian <[hidden email]> wrote:
> Hello, everyone.
> [...]
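Two of the streaming workloads suggested in Stephan's reply, Streaming Grep and streaming word count, have simple enough semantics that a small reference version may help when checking a Flink or Spark implementation on tiny inputs. The sketch below is plain Python using only the standard library, not Flink code; the function names `streaming_grep` and `streaming_word_count` and the sample lines are made up for illustration, and whitespace tokenization with lowercasing is an assumption about how words are split.

```python
import re
from collections import Counter
from typing import Iterable, Iterator, Tuple


def streaming_grep(lines: Iterable[str], pattern: str) -> Iterator[str]:
    """Yield each incoming line that matches the regex, one record at a time."""
    regex = re.compile(pattern)
    for line in lines:
        if regex.search(line):
            yield line


def streaming_word_count(lines: Iterable[str]) -> Iterator[Tuple[str, int]]:
    """Emit an updated (word, running_count) pair for every word seen.

    Assumption: words are lowercased and split on whitespace.
    """
    counts: Counter = Counter()
    for line in lines:
        for word in line.lower().split():
            counts[word] += 1
            yield word, counts[word]


if __name__ == "__main__":
    # Hypothetical sample input; any iterable of strings works, including
    # a generator that reads from a socket or a file.
    sample = ["flink streams data", "spark batches data", "flink joins streams"]
    for hit in streaming_grep(sample, r"flink"):
        print("grep:", hit)
    for word, count in streaming_word_count(sample):
        print("count:", word, count)
```

Because both functions are generators, they process one record at a time and emit results incrementally, which mirrors the continuous-update behavior a streaming word count is expected to show, as opposed to a batch job that emits one final count per word.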
Hi,
welcome to the Flink community and thanks for including Flink in your benchmark suite! That's really exciting news :-)

Most of the jobs that you listed in your preliminary plan are available as example programs in Flink's code base [1]. However, you should know that these examples are NOT tuned for performance but rather for easy understanding and to showcase certain features.

If your implementations of Flink programs are available online (e.g., on GitHub), we could assist with some performance tuning.

Thank you,
Fabian

[1] https://github.com/apache/flink/tree/master/flink-examples/flink-java-examples/src/main/java/org/apache/flink/examples/java

2015-07-20 10:19 GMT+02:00 Stephan Ewen <[hidden email]>:
> Hi!
>
> Thanks for reaching out and adding Flink to BigDataBench.
> [...]
In reply to this post by Xinhui Tian
Hi Xinhui
As Stephan mentioned for the batch jobs, a 2-3 table join would be a nice addition. Could we implement it with the same queries as the Spark examples below? Thanks.

For example:

1. Scan Query

   SELECT pageURL, pageRank FROM rankings WHERE pageRank > X

2. Aggregation Query

   SELECT SUBSTR(sourceIP, 1, X), SUM(adRevenue)
   FROM uservisits
   GROUP BY SUBSTR(sourceIP, 1, X)

3. Join Query

   SELECT sourceIP, totalRevenue, avgPageRank
   FROM (SELECT sourceIP, AVG(pageRank) AS avgPageRank, SUM(adRevenue) AS totalRevenue
         FROM Rankings AS R, UserVisits AS UV
         WHERE R.pageURL = UV.destURL
           AND UV.visitDate BETWEEN Date('1980-01-01') AND Date('X')
         GROUP BY UV.sourceIP)
   ORDER BY totalRevenue DESC LIMIT 1

https://amplab.cs.berkeley.edu/benchmark/
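Before porting the scan, aggregation, and join queries to Flink, it may help to pin down their expected results on a tiny dataset. The following is a minimal sketch using Python's built-in `sqlite3` module, not Flink or Spark. The column types, the sample rows, and the function names are assumptions made for illustration. The `ORDER BY` clauses in the first two queries are additions to make the output deterministic, and dates are stored as ISO-8601 strings so that `BETWEEN` compares them correctly as text.

```python
import sqlite3


def build_toy_db() -> sqlite3.Connection:
    """Create an in-memory database with invented sample rows."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE rankings (pageURL TEXT, pageRank INTEGER);
        CREATE TABLE uservisits (
            sourceIP TEXT, destURL TEXT, visitDate TEXT, adRevenue REAL
        );
        """
    )
    conn.executemany(
        "INSERT INTO rankings VALUES (?, ?)",
        [("a.com", 10), ("b.com", 50), ("c.com", 90)],
    )
    conn.executemany(
        "INSERT INTO uservisits VALUES (?, ?, ?, ?)",
        [
            ("10.0.0.1", "a.com", "1980-02-01", 1.0),
            ("10.0.0.1", "c.com", "1980-03-01", 2.0),
            ("10.0.0.2", "b.com", "1980-02-15", 5.0),
            ("10.1.0.3", "c.com", "1999-01-01", 9.0),
        ],
    )
    return conn


def scan_query(conn, min_rank):
    # Query 1: filter rankings by pageRank > X.
    sql = ("SELECT pageURL, pageRank FROM rankings "
           "WHERE pageRank > ? ORDER BY pageURL")
    return conn.execute(sql, (min_rank,)).fetchall()


def aggregation_query(conn, prefix_len):
    # Query 2: total ad revenue per sourceIP prefix of length X.
    sql = ("SELECT SUBSTR(sourceIP, 1, ?) AS prefix, SUM(adRevenue) "
           "FROM uservisits GROUP BY prefix ORDER BY prefix")
    return conn.execute(sql, (prefix_len,)).fetchall()


def join_query(conn, end_date):
    # Query 3: join both tables, aggregate per sourceIP, keep the top earner.
    sql = """
        SELECT sourceIP, totalRevenue, avgPageRank
        FROM (SELECT UV.sourceIP AS sourceIP,
                     AVG(R.pageRank) AS avgPageRank,
                     SUM(UV.adRevenue) AS totalRevenue
              FROM rankings AS R, uservisits AS UV
              WHERE R.pageURL = UV.destURL
                AND UV.visitDate BETWEEN '1980-01-01' AND ?
              GROUP BY UV.sourceIP)
        ORDER BY totalRevenue DESC LIMIT 1
    """
    return conn.execute(sql, (end_date,)).fetchall()


if __name__ == "__main__":
    db = build_toy_db()
    print(scan_query(db, 40))
    print(aggregation_query(db, 4))
    print(join_query(db, "1980-04-01"))
```

The same toy rows could be fed to the Flink and Spark implementations, with their outputs compared against these results before running at benchmark scale.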