Hello Everyone,
I am Nam-Luc Tran, research Engineer at EURA NOVA [1]. Our research subjects cover distributed machine learning and we have been working on dataflow graph processing for a while now. We have been reading from you since Stratosphere :-) Our current research focuses on Stale Synchronous Parallelism and we are currently considering Apache Flink as a good candidate for implementing and delivering the best results among the existing processing solutions. I have written a post about it here: https://www.linkedin.com/pulse/stale-synchronous-parallelism-new-frontier-apache-flink-nam-luc-tran What do you guys think about the approach? Does it seem feasible, or do you have anything similar in your roadmap? Best regards, Tran Nam-Luc Links: ------ [1] http://euranova.eu |
Hi Nam-Luc,
Several of your observations in the blog post are to the point. Iterations are already pipelined, and the distributed state that the delta iterations access can be possibly lifted to a parameter server API. We need to work a bit through the details on how fault tolerance and termination would work and how the recent refactoring of the runtime to include intermediate results would play for this. This is feasible, but will require runtime code. Simply removing the dams from the iterations can make iterations asynchronous, but introduces two issues: (1) How to checkpoint such computations in order to recover them upon failures. (2) How to check for the termination of the computation without a sync barrier. Can the SSP model help with that? Kostas On Fri, Feb 20, 2015 at 5:27 PM, Nam-Luc Tran <[hidden email]> wrote: > Hello Everyone, > > I am Nam-Luc Tran, research Engineer at EURA NOVA [1]. Our research > subjects cover distributed machine learning and we have been working > on dataflow graph processing for a while now. We have been reading > from you since Stratosphere :-) > > Our current research focuses on Stale Synchronous Parallelism and we > are currently considering Apache Flink as a good candidate for > implementing and delivering the best results among the existing > processing solutions. I have written a post about it here: > > https://www.linkedin.com/pulse/stale-synchronous-parallelism-new-frontier-apache-flink-nam-luc-tran > > > What do you guys think about the approach? Does it seem feasible, or > do you have anything similar in your roadmap? > > Best regards, > > Tran Nam-Luc > > > > Links: > ------ > [1] http://euranova.eu > > |
In reply to this post by Nam-Luc Tran
Hey Tran Nam-Luc!
Great post with some really cool thoughts. I just posted this answer to your LinkedIN post. Greetings, Stephan ============================================= Nice post, very cool idea! Your understanding of Flink in that respect is really good. I had not heard of SSP before, but it seems to be a good compromise between bulk synchronous and asynchronous iterations. Here are some comments and thought about how Apache Flink realizes iterations and how that mechanism can be extended to support SSP: - The loop in Flink is standing, operators are not re-created and re-deployed in every iteration. - We signal the end of a superstep by pushing a special message from the "head" of the loop to the "tail". This can be thought of like a clock cycle. Currently, the new superstep starts on each parallel thread once all "tails" have received the message, thus forming the BSP barrier. For delta iterations, this "tail" is the next workset - the solution set completely independent of that. - We can probably interpret the "end-of-superstep" messages as clock messages. We could then allow threads to start their next superstep if all "tails" have seen clock messages at least of its own clock time minus the slack. If you are looking to implement this in Flink, or dig deeper into this, let me know, I would be happy to help. On Fri, Feb 20, 2015 at 5:27 PM, Nam-Luc Tran <[hidden email]> wrote: > Hello Everyone, > > I am Nam-Luc Tran, research Engineer at EURA NOVA [1]. Our research > subjects cover distributed machine learning and we have been working > on dataflow graph processing for a while now. We have been reading > from you since Stratosphere :-) > > Our current research focuses on Stale Synchronous Parallelism and we > are currently considering Apache Flink as a good candidate for > implementing and delivering the best results among the existing > processing solutions. I have written a post about it here: > > https://www.linkedin.com/pulse/stale-synchronous-parallelism-new-frontier-apache-flink-nam-luc-tran > > > What do you guys think about the approach? Does it seem feasible, or > do you have anything similar in your roadmap? > > Best regards, > > Tran Nam-Luc > > > > Links: > ------ > [1] http://euranova.eu > > |
In reply to this post by Nam-Luc Tran
Hello guys,
Thank you for your replies. >(1) How to checkpoint such computations in order to recover them upon failures. Approximate snapshots are possible with SSP. In this case, a snapshot (solution set and or working set, what is exactly snapshotted can be discussed) is triggered once a "clock=t" message has been received from each thread. The snapshot will not be "pure" meaning that it contains strictly updates from 1...t, as some faster threads may be ahead of t and their results may already be incorporated by slower threads. However iterative convergent algorithms suitable for SSPs are able to tolerate this situation. >(2) How to check for the termination of the computation without a sync barrier. Can the SSP model help with that? From what I have seen the SSP evaluates the convergence of the global model in order to terminate the iterations. In our case, I guess this would require to evaluate the convergence on the solution set and/or emptiness of the working set. Is it possible to interrupt Flink operators while they are running? >We signal the end of a superstep by pushing a special message from the "head" of the loop to the "tail". This can be thought of like a clock cycle. Currently, the new superstep starts on each parallel thread once all "tails" have received the message, thus forming the BSP barrier. For delta iterations, this "tail" is the next workset - the solution set completely independent of that. Could you point me out the classes in the code that realise this behaviour? I am very interested in digging deeper and implementing this in Flink. What would be the best point to start on from here according to you? Best regards, Tran Nam-Luc At Monday, 23/02/2015 on 10:32 Stephan Ewen wrote: Hey Tran Nam-Luc! Great post with some really cool thoughts. I just posted this answer to your LinkedIN post. Greetings, Stephan ============================================= Nice post, very cool idea! Your understanding of Flink in that respect is really good. I had not heard of SSP before, but it seems to be a good compromise between bulk synchronous and asynchronous iterations. Here are some comments and thought about how Apache Flink realizes iterations and how that mechanism can be extended to support SSP: - The loop in Flink is standing, operators are not re-created and re-deployed in every iteration. - We signal the end of a superstep by pushing a special message from the "head" of the loop to the "tail". This can be thought of like a clock cycle. Currently, the new superstep starts on each parallel thread once all "tails" have received the message, thus forming the BSP barrier. For delta iterations, this "tail" is the next workset - the solution set completely independent of that. - We can probably interpret the "end-of-superstep" messages as clock messages. We could then allow threads to start their next superstep if all "tails" have seen clock messages at least of its own clock time minus the slack. If you are looking to implement this in Flink, or dig deeper into this, let me know, I would be happy to help. On Fri, Feb 20, 2015 at 5:27 PM, Nam-Luc Tran wrote: > Hello Everyone, > > I am Nam-Luc Tran, research Engineer at EURA NOVA [1]. Our research > subjects cover distributed machine learning and we have been working > on dataflow graph processing for a while now. We have been reading > from you since Stratosphere :-) > > Our current research focuses on Stale Synchronous Parallelism and we > are currently considering Apache Flink as a good candidate for > implementing and delivering the best results among the existing > processing solutions. I have written a post about it here: > > > > > What do you guys think about the approach? Does it seem feasible, or > do you have anything similar in your roadmap? > > Best regards, > > Tran Nam-Luc > > > > Links: > ------ > [1] http://euranova.eu > > |
Hej,
Very interesting discussion. I hadn't heard of the SSP model before, looks like something I want to look into. I wonder if any of the algorithms that would work in that model would not work in an asynchronous model. Since asynchronous is basically a SSP model with infinite slack. Iterative convergent algorithms such as connected components would run in both environment the same. Formulated differently: Are slack parameter arbitrary or is it defined by the structure of the algorithm like it is the case in ABSP? For the implementation issues, maybe we can learn something from Microsoft Naiad <http://research.microsoft.com/en-us/projects/naiad/> which has some similarities to this. cheers Martin On Mon, Feb 23, 2015 at 4:55 PM, Nam-Luc Tran <[hidden email]> wrote: > Hello guys, > > Thank you for your replies. > > >(1) How to checkpoint such computations in order to recover them > upon failures. > Approximate snapshots are possible with SSP. In this case, a snapshot > (solution set and or working set, what is exactly snapshotted can be > discussed) is triggered once a "clock=t" message has been received > from each thread. The snapshot will not be "pure" meaning that it > contains strictly updates from 1...t, as some faster threads may be > ahead of t and their results may already be incorporated by slower > threads. However iterative convergent algorithms suitable for SSPs are > able to tolerate this situation. > > >(2) How to check for the termination of the computation without a > sync barrier. Can the SSP model help with that? > From what I have seen the SSP evaluates the convergence of the global > model in order to terminate the iterations. In our case, I guess this > would require to evaluate the convergence on the solution set and/or > emptiness of the working set. Is it possible to interrupt Flink > operators while they are running? > > >We signal the end of a superstep by pushing a special message from > the "head" of the loop to the "tail". This can be thought of like a > clock cycle. Currently, the new superstep starts on each > parallel thread once all "tails" have received the message, thus > forming the BSP barrier. For delta iterations, this "tail" is > the next workset - the solution set completely independent of that. > Could you point me out the classes in the code that realise this > behaviour? > > I am very interested in digging deeper and implementing this in Flink. > What would be the best point to start on from here according to you? > > Best regards, > > Tran Nam-Luc > > At Monday, 23/02/2015 on 10:32 Stephan Ewen wrote: > > Hey Tran Nam-Luc! > > Great post with some really cool thoughts. > I just posted this answer to your LinkedIN post. > > Greetings, > Stephan > > ============================================= > > Nice post, very cool idea! Your understanding of Flink in that respect > is > really good. I had not heard of SSP before, > but it seems to be a good compromise between bulk synchronous and > asynchronous iterations. > > Here are some comments and thought about how Apache Flink realizes > iterations and how that mechanism can be extended > to support SSP: > > - The loop in Flink is standing, operators are not re-created and > re-deployed in every iteration. > > - We signal the end of a superstep by pushing a special message from > the > "head" of the loop to the "tail". This can be thought of > like a clock cycle. Currently, the new superstep starts on each > parallel > thread once all "tails" have received the message, > thus forming the BSP barrier. For delta iterations, this "tail" > is the > next workset - the solution set completely independent of that. > > - We can probably interpret the "end-of-superstep" messages as clock > messages. We could then allow threads to start their > next superstep if all "tails" have seen clock messages at least > of its > own clock time minus the slack. > > If you are looking to implement this in Flink, or dig deeper into > this, let > me know, I would be happy to help. > > On Fri, Feb 20, 2015 at 5:27 PM, Nam-Luc Tran > wrote: > > > Hello Everyone, > > > > I am Nam-Luc Tran, research Engineer at EURA NOVA [1]. Our research > > subjects cover distributed machine learning and we have been working > > on dataflow graph processing for a while now. We have been reading > > from you since Stratosphere :-) > > > > Our current research focuses on Stale Synchronous Parallelism and we > > are currently considering Apache Flink as a good candidate for > > implementing and delivering the best results among the existing > > processing solutions. I have written a post about it here: > > > > > > https://www.linkedin.com/pulse/stale-synchronous-parallelism-new-frontier-apache-flink-nam-luc-tran > > > > > > What do you guys think about the approach? Does it seem feasible, or > > do you have anything similar in your roadmap? > > > > Best regards, > > > > Tran Nam-Luc > > > > > > > > Links: > > ------ > > [1] http://euranova.eu > > > > > > > |
Free forum by Nabble | Edit this page |