Hi,
I am currently working for an organization that uses Apache Spark as its main data processing framework. The organization is now wondering whether Apache Flink would be better at processing their data than Apache Spark, so I am evaluating Apache Flink and comparing it to Apache Spark.

When I looked at Apache Flink for the first time, I could not find any comparison to Apache Spark on Flink's website. Would it be an idea to give some information about the differences between the two frameworks on the website? I would like to contribute to that if you think it would be helpful.

Regards,
Kevin
Hi Kevin,
Thanks for being willing to contribute such an effort. I think it is a completely valid discussion to have in your organization, and please feel free to ask us questions during your evaluation. Putting statements on the Flink website highlighting the differences would be very tricky though. I would advise against that. Let me elaborate.

"How does it compare to Spark?" is definitely one of the most frequently asked questions that we get, and we can generally give three types of answers:

*1. General architecture decisions*

- Streaming (pipelined) execution engine (or long-running operator model).
- Native iteration operator.
- ...

The issue with this approach is that on its own it gives almost no useful information to a decision maker. For that you need benchmarks or fancy features, so let us evaluate them.

*2. Benchmarks*

You can find plenty of third-party benchmarks and soft evaluations [1,2,3] of the two systems out there. The problem with these is that they depend heavily on the version of the systems used, on tuning, and on how well the general architecture is understood. E.g. [1] favors Storm, but if you redo the whole benchmark from a Flink point of view you get [4]. After a couple of versions the benchmark results can be very different.

*3. Fancy Features*

- Exactly-once, spillable streaming state stored locally
- Savepoints
- ...

Similarly to the previous point, these might be an edge at some point in time, but the whole streaming space is moving very quickly and, because it is open source, projects tend to copy each other to a certain extent.

This of course does not mean that doing evaluations at any point in time is meaningless, but you need to update them frequently (check [5] and [6]), and they can do more harm than good if not treated with care.

I hope I was not too discouraging and could help you with your endeavor. It is also very important to take your specific use cases into account.
Best,
Marton

[1] https://yahooeng.tumblr.com/post/135321837876/benchmarking-streaming-computation-engines-at
[2] https://tech.zalando.de/blog/apache-showdown-flink-vs.-spark/
[3] http://data-artisans.com/how-we-selected-apache-flink-at-otto-group/
[4] http://data-artisans.com/extending-the-yahoo-streaming-benchmark/
[5] http://www.slideshare.net/GyulaFra/largescale-stream-processing-in-the-hadoop-ecosystem
[6] http://www.slideshare.net/GyulaFra/largescale-stream-processing-in-the-hadoop-ecosystem-hadoop-summit-2016-60887821
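The point above about benchmarks depending on version and tuning can be made concrete with a small bookkeeping sketch: store every measurement together with the context it was taken in, and refuse to compare or reuse results whose context is missing or outdated. This is only an illustration under stated assumptions. The names `BenchmarkRun`, `make_run`, `is_stale` and `comparable`, the workload name, and all values are hypothetical placeholders and are not taken from any of the linked benchmarks.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BenchmarkRun:
    """One measurement plus the context needed to interpret it (hypothetical schema)."""
    system: str
    version: str
    workload: str
    tuning: tuple  # sorted (key, value) pairs describing the configuration used
    records_per_second: float


def make_run(system, version, workload, tuning, records_per_second):
    """Build a run from a plain dict of tuning settings."""
    return BenchmarkRun(system, version, workload,
                        tuple(sorted(tuning.items())), records_per_second)


def is_stale(run, current_versions):
    """A result is stale once the system is no longer at the benchmarked version."""
    return current_versions.get(run.system) != run.version


def comparable(a, b):
    """Compare only runs of the same workload whose tuning was documented."""
    return a.workload == b.workload and bool(a.tuning) and bool(b.tuning)


if __name__ == "__main__":
    # Placeholder values, not real measurements.
    old = make_run("engine-a", "1.0", "windowed-count", {"parallelism": 4}, 1000.0)
    other = make_run("engine-b", "2.0", "windowed-count", {"parallelism": 4}, 900.0)
    print("comparable:", comparable(old, other))
    print("stale:", is_stale(old, {"engine-a": "1.1", "engine-b": "2.0"}))
```

The design choice here is simply that a number without its version and tuning is treated as unusable, which mirrors the advice to update evaluations frequently.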
Hi Marton,
Thank you for your elaborate answer. My comments are inline below:

On 08.07.2016 15:13, Márton Balassi wrote:
> Hi Kevin,
>
> Thanks for being willing to contribute such an effort. I think it is a
> completely valid discussion to have in your organization, and please feel
> free to ask us questions during your evaluation. Putting statements on the
> Flink website highlighting the differences would be very tricky though. I
> would advise against that. Let me elaborate.

Thank you, I will definitely ask questions during the evaluation. Next week we will be setting up some experiments.

> "How does it compare to Spark?" is definitely one of the most
> frequently asked questions that we get, and we can generally give three
> types of answers:
>
> *1. General architecture decisions*
>
> - Streaming (pipelined) execution engine (or long-running operator
> model).
> - Native iteration operator.
> - ...
>
> The issue with this approach is that on its own it gives almost no
> useful information to a decision maker. For that you need benchmarks or
> fancy features, so let us evaluate them.

That is definitely true, but don't you think that Flink and Spark will "collapse" at some point in time? The differences between the two frameworks are getting smaller and smaller, and Spark also has support for streaming. Or will the difference in architecture be key in differentiating the two frameworks?

> *2. Benchmarks*
> You can find plenty of third-party benchmarks and soft evaluations [1,2,3]
> of the two systems out there. The problem with these is that they depend
> heavily on the version of the systems used, on tuning, and on how well the
> general architecture is understood. E.g. [1] favors Storm, but if you redo
> the whole benchmark from a Flink point of view you get [4]. After a couple
> of versions the benchmark results can be very different.
>
> *3. Fancy Features*
>
> - Exactly-once, spillable streaming state stored locally
> - Savepoints
> - ...
>
> Similarly to the previous point, these might be an edge at some point in
> time, but the whole streaming space is moving very quickly and, because it
> is open source, projects tend to copy each other to a certain extent.

Why is this space moving so quickly? Is it due to new technologies arising for processing streaming data? Would it not converge to only a handful of stable frameworks in the future (just speculating)?

> This of course does not mean that doing evaluations at any point in time is
> meaningless, but you need to update them frequently (check [5] and [6]), and
> they can do more harm than good if not treated with care.

It would be great if there were reusable evaluation methods, so that this process does not have to be repeated every time. Unfortunately, every new framework differs from the previous ones, which implies that custom evaluations have to be made for each of them. I like the TeraGen/TeraSort/TeraValidate benchmark; that is at least a general benchmark approach to some extent.

> I hope I was not too discouraging and could help you with your endeavor. It
> is also very important to take your specific use cases into account.

It is definitely not discouraging, thank you for the answer :-)!
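For readers unfamiliar with the TeraGen/TeraSort/TeraValidate pattern mentioned above, here is a toy, single-machine sketch of its three phases: generate records with random keys, sort them by key while timing only that step, and validate that the output is ordered and complete. The real benchmark ships with Hadoop and runs on a cluster over 100-byte records with 10-byte keys; the function names, record layout and sizes below are made up for illustration and do not reproduce it.

```python
import random
import time


def tera_gen(num_records, seed=42):
    """Generate (key, payload) records with random 10-byte keys (toy stand-in)."""
    rng = random.Random(seed)
    return [(bytes(rng.getrandbits(8) for _ in range(10)), i)
            for i in range(num_records)]


def tera_sort(records):
    """The step under test: sort records by key."""
    return sorted(records, key=lambda record: record[0])


def tera_validate(original, output):
    """Check that the output is ordered by key and contains exactly the input records."""
    ordered = all(output[i][0] <= output[i + 1][0] for i in range(len(output) - 1))
    complete = sorted(original) == sorted(output)
    return ordered and complete


if __name__ == "__main__":
    data = tera_gen(100_000)
    start = time.perf_counter()
    result = tera_sort(data)
    elapsed = time.perf_counter() - start
    print(f"sorted {len(result)} records in {elapsed:.3f}s, "
          f"valid={tera_validate(data, result)}")
```

Keeping generation and validation outside the timed section is what makes the pattern reusable: in principle only the sort step has to be reimplemented for each framework under evaluation.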
Hey Kevin,
Your questions regarding the future of this space are so difficult to answer that investors are betting millions of dollars on many possible outcomes. You should consult them. :)

When you are choosing a system, your best bet is to look at its current capabilities, its roadmap and its community.

As far as TeraStuff goes, a look here [1] might help.

[1] http://eastcirclek.blogspot.hu/2015/06/terasort-for-spark-and-flink-with-range.html

Marton
In reply to this post by Kevin Jacobs
Hi Kevin
I think a good series is "Introduction to Flink Streaming", available here: http://blog.madhukaraphatak.com/page2/

Perhaps add it to the list.

Regards,
Jan
In reply to this post by Kevin Jacobs
Hi Kevin,
I have orchestrated an evaluation of Spark and Flink for various batch and graph processing workloads (no streaming, no SQL). This work has been accepted as a paper at Cluster, and I will publish a report soon; for more details please contact me directly.

Both engines did well and were stable and scalable. In some cases one is better than the other, but with some differences in tuning and resource usage. Depending on your use case, you may or may not be able to compare these two engines, but overall they complement each other :).

I think the biggest issue is recovery on failure (more so for batch use cases), something which is to be handled (better) with this proposal:
https://cwiki.apache.org/confluence/display/FLINK/FLIP-1+%3A+Fine+Grained+Recovery+from+Task+Failures

Best,
Ovidiu