(DEPRECATED) Apache Flink Mailing List archive.

Fwd: [stratosphere-dev] Spark comparison

Robert Metzger

Fwd: [stratosphere-dev] Spark comparison

Forwarding the message to the new mailing list ...

---------- Forwarded message ----------
From: Nirvanesque <[hidden email]>
Date: Fri, Aug 29, 2014 at 1:57 PM
Subject: Re: [stratosphere-dev] Spark comparison
To: [hidden email]

Ufuk and the Flink team,

You and your team are by now familiar with this comparison (Master's thesis
by Ze Ni at the KTH Institute):
http://www.diva-portal.org/smash/get/diva2:605106/FULLTEXT01.pdf

I would like to know your views on it.

Thanks in advance,
Anirvan

On Tuesday, December 3, 2013 6:19:57 PM UTC+1, Ufuk Celebi wrote:

> Hey Ankur,
>
> I like the idea of a comparison matrix. We tried to do something similar
> with Hadoop already (parts of it are on the front page of our website),
> which we used for a local summit here. Comparing Stratosphere to Spark in
> this way would be a natural extension to this. ;-)
>
> Internally, we ran some benchmarks against 0.7.3 (unfortunately right
> before the 0.8 release). We didn't publish the results as there are certain
> aspects that make the comparison unfair (for example we have no fault
> tolerance right now whereas Spark does). As soon as we (re-)introduce fault
> tolerance mechanisms, we will re-run the benchmarks.
>
> I can publish the code for the Stratosphere and Spark programs we looked
> at on GitHub. If I add Scala versions of the Stratosphere programs, this
> will also go in your proposed direction of having a direct comparison.
>
> Is there any specific use case where you want to see numbers? Or is it
> more like a general thing where you want to see how both systems perform?
>
> Best,
>
> Ufuk
>
> On 03 Dec 2013, at 18:03, Ankur Chauhan <[hidden email]> wrote:
>
> Hi all,
>
>
> Sitting at Spark Summit 2013, I was interested in figuring out if anyone
> has done a feature comparison and/or benchmarks against Spark/Storm/etc.
> This may also serve as a "compatibility matrix" and would help a lot when
> people want to compare the two projects, and help us understand the
> strengths and weaknesses of each project.
>
> -- Ankur
>
> --
> You received this message because you are subscribed to the Google Groups
> "stratosphere-dev" group.
> To unsubscribe from this group and stop receiving emails from it, send an
> email to [hidden email].
>
> Visit this group at http://groups.google.com/group/stratosphere-dev.
> For more options, visit https://groups.google.com/groups/opt_out.
>

Kostas Tzoumas-2

Re: [stratosphere-dev] Spark comparison

Hi Anirvan,

Yes, I am familiar with this thesis. I think that this comparison is by now
quite old (>1 year if I am not mistaken), and both systems have evolved
substantially since then.

Kostas

Stephan Ewen

Re: [stratosphere-dev] Spark comparison

Hi!

I agree with Kostas: the code base of Stratosphere that was used was quite
old.

The current Flink version is already different, with the new APIs and
different type handling.

Flink is taking a route that makes sure that the runtime is very robust,
memory-wise. We currently pay a few CPU cycles of overhead for that, but we
have an effort going to bring that down.

It would be interesting to rerun the experiments then...

Greetings,
Stephan

Anirvan Basu-2

RE: [stratosphere-dev] Spark comparison

Stephan and Kostas,

I agree that the study is a year old (which is old in terms of development time frames for both projects).
It seems that Spark has caught a good wind in its sails (Google, Facebook, Yahoo, IBM, ...). What about you folks?
Are you also pitching to these giants? Let's assume that it is a fat-tail scenario.
It looks to me like something similar to MongoDB in the NoSQL world (compared to Raven or Couch) :-) I still need to figure out whether it is hype or reality!

I tried Spark 1.0.2 this week:
- installation was fairly simple,
- the Python API made it easy to write some beyond-hello-world programs (I did not check their R package, though),
- they also have a good streaming package,
- an advantage was a good series of tutorials and webinars (it helps to get rid of the fear of "jumping into the water" for dummies like me).

Some pertinent questions:
1. Would you be interested if we did a neutral comparison of Flink and Spark, baselined against the Hadoop MapReduce framework? I was also thinking of adding Summingbird and would like to know your views on that.
If we did publish, we would naturally try to present it at a conference! So think of the perils as well ;-)
Actually, Robert had asked me a similar question; he put the idea in my head!

2. By what set of criteria would you want to compare Flink and Spark?

3. Where do you stand on graph-based algorithms? We are looking for a stable framework for graph-based programs (balanced graph partitioning, evolution, ...), and in that respect Spark's GraphX appeared very interesting.
I know you have your own Spargel there, so how do you compare? Do you also do vertex-based balanced partitioning (e.g. JA-BE-JA k-way partitioning)? Can you do edge-based partitioning? I didn't come across any framework that realizes the latter.
Attached is a simple paper presented by an Italian research group; they jumped on the Spark bandwagon!
Let me know your opinions (perhaps you already know the group).
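
To make the vertex-based vs. edge-based distinction in question 3 concrete, here is a minimal sketch in plain Python. It is only an illustration under simple assumptions: vertices are integers, placement is a naive modulo rule, and the function names `edge_cut` and `replication_factor` are made up for this example. It is not how Spargel, GraphX, or JA-BE-JA work internally. Vertex-based partitioning places each vertex in one partition and pays for edges that cross partitions; edge-based partitioning places each edge in one partition and pays for vertices that must be replicated.

```python
from collections import defaultdict


def edge_cut(edges, k):
    """Vertex-based partitioning: vertex v lives in partition v % k.

    Returns the number of edges whose endpoints are in different partitions.
    """
    return sum(1 for u, v in edges if u % k != v % k)


def replication_factor(edges, k):
    """Edge-based partitioning: the i-th edge lives in partition i % k.

    A vertex is replicated into every partition that holds one of its edges.
    Returns the average number of replicas per vertex.
    """
    replicas = defaultdict(set)
    for i, (u, v) in enumerate(edges):
        replicas[u].add(i % k)
        replicas[v].add(i % k)
    return sum(len(parts) for parts in replicas.values()) / len(replicas)


# A 4-cycle split into 2 partitions.
ring = [(0, 1), (1, 2), (2, 3), (3, 0)]
print(edge_cut(ring, 2))            # 4: every edge crosses partitions
print(replication_factor(ring, 2))  # 2.0: every vertex exists in both partitions
```

A balanced partitioner would try to minimize one of these two costs while keeping partition sizes even; the modulo rule here makes no such attempt.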

Best!
Anirvan

Kostas Tzoumas-2

Re: [stratosphere-dev] Spark comparison

Hi Anirvan,

I am not sure if the discussion of the level of hype, sales, etc. is relevant
to the dev@ mailing list.

My opinion on your first two questions:

1. An updated performance comparison would indeed be very nice. Some Flink
contributors have started work on performance scripts for Flink, Spark,
and MapReduce here: https://github.com/project-flink/flink-perf. Help in
this direction would definitely be very welcome; perhaps you would be
interested in contributing there! As Stephan said, the community still
needs to do some work on how the runtime deals with serialized data in
order for a performance comparison to make sense for Flink.

Keep in mind that Flink is an open source project; you do not need
anyone's permission to publish studies on it at conferences ;-)

2. I would focus on measures of performance and scalability in various
setups, data sets, and job complexities. One could also think about
usability, but getting this done objectively is, in my experience, hard and
time-consuming (it requires user studies), and the results may soon be
obsolete, as Flink is adding API features.
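
As a rough illustration of what "various setups, data sets, and job complexities" could look like in practice, here is a minimal sketch of a benchmark matrix in plain Python. Everything in it is an assumption made for the example: the name `run_matrix`, the idea that each system is wrapped in a callable taking a job name and a data set name, and the choice of the median over a few repeats. It is not taken from flink-perf.

```python
import itertools
import statistics
import time


def run_matrix(systems, datasets, jobs, repeats=3):
    """Time every (system, data set, job) combination.

    systems:  dict mapping a system name to a callable run(job, dataset)
              that submits the job and blocks until it finishes
              (hypothetical wrappers, to be supplied by the user).
    Returns a dict mapping (system, dataset, job) to the median wall-clock
    time in seconds over `repeats` runs.
    """
    results = {}
    for (name, run), dataset, job in itertools.product(
        systems.items(), datasets, jobs
    ):
        timings = []
        for _ in range(repeats):
            start = time.perf_counter()
            run(job, dataset)
            timings.append(time.perf_counter() - start)
        results[(name, dataset, job)] = statistics.median(timings)
    return results


# Placeholder runners that do nothing; real ones would launch cluster jobs.
demo = run_matrix(
    {"system-a": lambda job, data: None, "system-b": lambda job, data: None},
    datasets=["small", "large"],
    jobs=["wordcount", "pagerank"],
)
for key in sorted(demo):
    print(key, f"{demo[key]:.6f}s")
```

Using the median rather than the mean keeps a single slow run (for example a cold cache) from dominating the reported number.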

Kostas
