Stale Synchronous Parallel iterations in Flink

classic Classic list List threaded Threaded
5 messages Options
Reply | Threaded
Open this post in threaded view
|

Stale Synchronous Parallel iterations in Flink

Nam-Luc Tran
Hello Everyone, 

I am Nam-Luc Tran, research Engineer at EURA NOVA [1]. Our research
subjects cover distributed machine learning and we have been working
on dataflow graph processing for a while now. We have been reading
from you since Stratosphere :-)

Our current research focuses on Stale Synchronous Parallelism and we
are currently considering Apache Flink as a good candidate for
implementing and delivering the best results among the existing
processing solutions. I have written a post about it here:
https://www.linkedin.com/pulse/stale-synchronous-parallelism-new-frontier-apache-flink-nam-luc-tran 

What do you guys think about the approach? Does it seem feasible, or
do you have anything similar in your roadmap? 

Best regards,

Tran Nam-Luc



Links:
------
[1] http://euranova.eu

Reply | Threaded
Open this post in threaded view
|

Re: Stale Synchronous Parallel iterations in Flink

Kostas Tzoumas-2
Hi Nam-Luc,

Several of your observations in the blog post are to the point. Iterations
are already pipelined, and the distributed state that the delta iterations
access can be possibly lifted to a parameter server API.

We need to work a bit through the details on how fault tolerance and
termination would work and how the recent refactoring of the runtime to
include intermediate results would play for this. This is feasible, but
will require runtime code.

Simply removing the dams from the iterations can make iterations
asynchronous, but introduces two issues:

(1) How to checkpoint such computations in order to recover them upon
failures.

(2) How to check for the termination of the computation without a sync
barrier. Can the SSP model help with that?

Kostas

On Fri, Feb 20, 2015 at 5:27 PM, Nam-Luc Tran <[hidden email]>
wrote:

> Hello Everyone,
>
> I am Nam-Luc Tran, research Engineer at EURA NOVA [1]. Our research
> subjects cover distributed machine learning and we have been working
> on dataflow graph processing for a while now. We have been reading
> from you since Stratosphere :-)
>
> Our current research focuses on Stale Synchronous Parallelism and we
> are currently considering Apache Flink as a good candidate for
> implementing and delivering the best results among the existing
> processing solutions. I have written a post about it here:
>
> https://www.linkedin.com/pulse/stale-synchronous-parallelism-new-frontier-apache-flink-nam-luc-tran
>
>
> What do you guys think about the approach? Does it seem feasible, or
> do you have anything similar in your roadmap?
>
> Best regards,
>
> Tran Nam-Luc
>
>
>
> Links:
> ------
> [1] http://euranova.eu
>
>
Reply | Threaded
Open this post in threaded view
|

Re: Stale Synchronous Parallel iterations in Flink

Stephan Ewen
In reply to this post by Nam-Luc Tran
Hey Tran Nam-Luc!

Great post with some really cool thoughts.
I just posted this answer to your LinkedIN post.

Greetings,
Stephan

=============================================

Nice post, very cool idea! Your understanding of Flink in that respect is
really good. I had not heard of SSP before,
but it seems to be a good compromise between bulk synchronous and
asynchronous iterations.

Here are some comments and thought about how Apache Flink realizes
iterations and how that mechanism can be extended
to support SSP:

 - The loop in Flink is standing, operators are not re-created and
re-deployed in every iteration.

 - We signal the end of a superstep by pushing a special message from the
"head" of the loop to the "tail". This can be thought of
   like a clock cycle. Currently, the new superstep starts on each parallel
thread once all "tails" have received the message,
   thus forming the BSP barrier. For delta iterations, this "tail" is the
next workset - the solution set completely independent of that.

 - We can probably interpret the "end-of-superstep" messages as clock
messages. We could then allow threads to start their
   next superstep if all "tails" have seen clock messages at least of its
own clock time minus the slack.

If you are looking to implement this in Flink, or dig deeper into this, let
me know, I would be happy to help.


On Fri, Feb 20, 2015 at 5:27 PM, Nam-Luc Tran <[hidden email]>
wrote:

> Hello Everyone,
>
> I am Nam-Luc Tran, research Engineer at EURA NOVA [1]. Our research
> subjects cover distributed machine learning and we have been working
> on dataflow graph processing for a while now. We have been reading
> from you since Stratosphere :-)
>
> Our current research focuses on Stale Synchronous Parallelism and we
> are currently considering Apache Flink as a good candidate for
> implementing and delivering the best results among the existing
> processing solutions. I have written a post about it here:
>
> https://www.linkedin.com/pulse/stale-synchronous-parallelism-new-frontier-apache-flink-nam-luc-tran
>
>
> What do you guys think about the approach? Does it seem feasible, or
> do you have anything similar in your roadmap?
>
> Best regards,
>
> Tran Nam-Luc
>
>
>
> Links:
> ------
> [1] http://euranova.eu
>
>
Reply | Threaded
Open this post in threaded view
|

Re: Stale Synchronous Parallel iterations in Flink

Nam-Luc Tran
In reply to this post by Nam-Luc Tran
Hello guys,

Thank you for your replies.

>(1) How to checkpoint such computations in order to recover them
upon failures.
Approximate snapshots are possible with SSP. In this case, a snapshot
(solution set and or working set, what is exactly snapshotted can be
discussed) is triggered once a "clock=t" message has been received
from each thread. The snapshot will not be "pure" meaning that it
contains strictly updates from 1...t, as some faster threads may be
ahead of t and their results may already be incorporated by slower
threads. However iterative convergent algorithms suitable for SSPs are
able to tolerate this situation.

>(2) How to check for the termination of the computation without a
sync barrier. Can the SSP model help with that?
From what I have seen the SSP evaluates the convergence of the global
model in order to terminate the iterations. In our case, I guess this
would require to evaluate the convergence on the solution set and/or
emptiness of the working set. Is it possible to interrupt Flink
operators while they are running?

>We signal the end of a superstep by pushing a special message from
the "head" of the loop to the "tail". This can be thought of like a
clock cycle. Currently, the new superstep starts on each
parallel thread once all "tails" have received the message, thus
forming the BSP barrier. For delta iterations, this "tail" is
the next workset - the solution set completely independent of that.
Could you point me out the classes in the code that realise this
behaviour?

I am very interested in digging deeper and implementing this in Flink.
What would be the best point to start on from here according to you? 

Best regards,

Tran Nam-Luc

At Monday, 23/02/2015 on 10:32 Stephan Ewen wrote:

Hey Tran Nam-Luc!

Great post with some really cool thoughts.
I just posted this answer to your LinkedIN post.

Greetings,
Stephan

=============================================

Nice post, very cool idea! Your understanding of Flink in that respect
is
really good. I had not heard of SSP before,
but it seems to be a good compromise between bulk synchronous and
asynchronous iterations.

Here are some comments and thought about how Apache Flink realizes
iterations and how that mechanism can be extended
to support SSP:

- The loop in Flink is standing, operators are not re-created and
re-deployed in every iteration.

- We signal the end of a superstep by pushing a special message from
the
"head" of the loop to the "tail". This can be thought of
   like a clock cycle. Currently, the new superstep starts on each
parallel
thread once all "tails" have received the message,
   thus forming the BSP barrier. For delta iterations, this "tail"
is the
next workset - the solution set completely independent of that.

- We can probably interpret the "end-of-superstep" messages as clock
messages. We could then allow threads to start their
   next superstep if all "tails" have seen clock messages at least
of its
own clock time minus the slack.

If you are looking to implement this in Flink, or dig deeper into
this, let
me know, I would be happy to help.

On Fri, Feb 20, 2015 at 5:27 PM, Nam-Luc Tran
wrote:

> Hello Everyone,
>
> I am Nam-Luc Tran, research Engineer at EURA NOVA [1]. Our research
> subjects cover distributed machine learning and we have been working
> on dataflow graph processing for a while now. We have been reading
> from you since Stratosphere :-)
>
> Our current research focuses on Stale Synchronous Parallelism and we
> are currently considering Apache Flink as a good candidate for
> implementing and delivering the best results among the existing
> processing solutions. I have written a post about it here:
>
>
https://www.linkedin.com/pulse/stale-synchronous-parallelism-new-frontier-apache-flink-nam-luc-tran

>
>
> What do you guys think about the approach? Does it seem feasible, or
> do you have anything similar in your roadmap?
>
> Best regards,
>
> Tran Nam-Luc
>
>
>
> Links:
> ------
> [1] http://euranova.eu
>
>


Reply | Threaded
Open this post in threaded view
|

Re: Stale Synchronous Parallel iterations in Flink

Martin Neumann
Hej,

Very interesting discussion.
I hadn't heard of the SSP model before, looks like something I want to look
into.
I wonder if any of the algorithms that would work in that model would not
work in an asynchronous model. Since asynchronous is basically a SSP model
with infinite slack. Iterative convergent algorithms such as connected
components would run in both environment the same.
Formulated differently: Are slack parameter arbitrary or is it defined by
the structure of the algorithm like it is the case in ABSP?

For the implementation issues, maybe we can learn something from Microsoft
Naiad <http://research.microsoft.com/en-us/projects/naiad/> which has some
similarities to this.


cheers Martin

On Mon, Feb 23, 2015 at 4:55 PM, Nam-Luc Tran <[hidden email]>
wrote:

> Hello guys,
>
> Thank you for your replies.
>
> >(1) How to checkpoint such computations in order to recover them
> upon failures.
> Approximate snapshots are possible with SSP. In this case, a snapshot
> (solution set and or working set, what is exactly snapshotted can be
> discussed) is triggered once a "clock=t" message has been received
> from each thread. The snapshot will not be "pure" meaning that it
> contains strictly updates from 1...t, as some faster threads may be
> ahead of t and their results may already be incorporated by slower
> threads. However iterative convergent algorithms suitable for SSPs are
> able to tolerate this situation.
>
> >(2) How to check for the termination of the computation without a
> sync barrier. Can the SSP model help with that?
> From what I have seen the SSP evaluates the convergence of the global
> model in order to terminate the iterations. In our case, I guess this
> would require to evaluate the convergence on the solution set and/or
> emptiness of the working set. Is it possible to interrupt Flink
> operators while they are running?
>
> >We signal the end of a superstep by pushing a special message from
> the "head" of the loop to the "tail". This can be thought of like a
> clock cycle. Currently, the new superstep starts on each
> parallel thread once all "tails" have received the message, thus
> forming the BSP barrier. For delta iterations, this "tail" is
> the next workset - the solution set completely independent of that.
> Could you point me out the classes in the code that realise this
> behaviour?
>
> I am very interested in digging deeper and implementing this in Flink.
> What would be the best point to start on from here according to you?
>
> Best regards,
>
> Tran Nam-Luc
>
> At Monday, 23/02/2015 on 10:32 Stephan Ewen wrote:
>
> Hey Tran Nam-Luc!
>
> Great post with some really cool thoughts.
> I just posted this answer to your LinkedIN post.
>
> Greetings,
> Stephan
>
> =============================================
>
> Nice post, very cool idea! Your understanding of Flink in that respect
> is
> really good. I had not heard of SSP before,
> but it seems to be a good compromise between bulk synchronous and
> asynchronous iterations.
>
> Here are some comments and thought about how Apache Flink realizes
> iterations and how that mechanism can be extended
> to support SSP:
>
> - The loop in Flink is standing, operators are not re-created and
> re-deployed in every iteration.
>
> - We signal the end of a superstep by pushing a special message from
> the
> "head" of the loop to the "tail". This can be thought of
>    like a clock cycle. Currently, the new superstep starts on each
> parallel
> thread once all "tails" have received the message,
>    thus forming the BSP barrier. For delta iterations, this "tail"
> is the
> next workset - the solution set completely independent of that.
>
> - We can probably interpret the "end-of-superstep" messages as clock
> messages. We could then allow threads to start their
>    next superstep if all "tails" have seen clock messages at least
> of its
> own clock time minus the slack.
>
> If you are looking to implement this in Flink, or dig deeper into
> this, let
> me know, I would be happy to help.
>
> On Fri, Feb 20, 2015 at 5:27 PM, Nam-Luc Tran
> wrote:
>
> > Hello Everyone,
> >
> > I am Nam-Luc Tran, research Engineer at EURA NOVA [1]. Our research
> > subjects cover distributed machine learning and we have been working
> > on dataflow graph processing for a while now. We have been reading
> > from you since Stratosphere :-)
> >
> > Our current research focuses on Stale Synchronous Parallelism and we
> > are currently considering Apache Flink as a good candidate for
> > implementing and delivering the best results among the existing
> > processing solutions. I have written a post about it here:
> >
> >
>
> https://www.linkedin.com/pulse/stale-synchronous-parallelism-new-frontier-apache-flink-nam-luc-tran
> >
> >
> > What do you guys think about the approach? Does it seem feasible, or
> > do you have anything similar in your roadmap?
> >
> > Best regards,
> >
> > Tran Nam-Luc
> >
> >
> >
> > Links:
> > ------
> > [1] http://euranova.eu
> >
> >
>
>
>