Hi,
We are using Flink for streaming and find the "stop-the-world" recovery behavior of Flink prohibitive for use cases that prioritize availability. Partial recovery as outlined in FLIP-1 would probably alleviate these concerns. https://cwiki.apache.org/confluence/display/FLINK/FLIP-1+%3A+Fine+Grained+Recovery+from+Task+Failures Looking at the subtasks in https://issues.apache.org/jira/browse/FLINK-4256 it appears that much of the work was already done but not much recent progress? What is missing (for streaming)? How close is version 2 (recovery from limited intermediate results)? Thanks! Thomas |
Hi,
1. Currently, much work in FLINK-4256 is about failover improvements in the bouded dataset scenario. 2. For the streaming scenario, a new shuffle plugin + proper failover strategy could avoid the "stop-the-word" recovery. 3. We have already done many works about the new shuffle in the old Flink shuffle architectures because many of our customers have the concern. We have a plan to move the work to the new Flink pluggable shuffle architecture. Best, Guowei Thomas Weise <[hidden email]> 于2019年7月26日周五 上午8:54写道: > Hi, > > We are using Flink for streaming and find the "stop-the-world" recovery > behavior of Flink prohibitive for use cases that prioritize availability. > Partial recovery as outlined in FLIP-1 would probably alleviate these > concerns. > > > https://cwiki.apache.org/confluence/display/FLINK/FLIP-1+%3A+Fine+Grained+Recovery+from+Task+Failures > > Looking at the subtasks in > https://issues.apache.org/jira/browse/FLINK-4256 it > appears that much of the work was already done but not much recent > progress? What is missing (for streaming)? How close is version 2 (recovery > from limited intermediate results)? > > Thanks! > Thomas > |
Hi Thomas!
For Batch, this should be working in release 1.9. For streaming, it is a bit more tricky, mainly because of the fact that you have to deal with downstream correctness. Either a recovery still needs to reset downstream tasks (which means on average half of the tasks) or needs to wait before publishing the data downstream until a persistent point for recovery has been reached. I have looked a bit into the second variant here. This needs a bit more thought (currently also busy with 1.9 release) but in the course of the next release cycle we might be able to share some initial design. Best, Stephan On Fri, Jul 26, 2019 at 3:46 AM Guowei Ma <[hidden email]> wrote: > Hi, > 1. Currently, much work in FLINK-4256 is about failover improvements in the > bouded dataset scenario. > 2. For the streaming scenario, a new shuffle plugin + proper failover > strategy could avoid the "stop-the-word" recovery. > 3. We have already done many works about the new shuffle in the old Flink > shuffle architectures because many of our customers have the concern. We > have a plan to move the work to the new Flink pluggable shuffle > architecture. > > Best, > Guowei > > > Thomas Weise <[hidden email]> 于2019年7月26日周五 上午8:54写道: > > > Hi, > > > > We are using Flink for streaming and find the "stop-the-world" recovery > > behavior of Flink prohibitive for use cases that prioritize availability. > > Partial recovery as outlined in FLIP-1 would probably alleviate these > > concerns. > > > > > > > https://cwiki.apache.org/confluence/display/FLINK/FLIP-1+%3A+Fine+Grained+Recovery+from+Task+Failures > > > > Looking at the subtasks in > > https://issues.apache.org/jira/browse/FLINK-4256 it > > appears that much of the work was already done but not much recent > > progress? What is missing (for streaming)? How close is version 2 > (recovery > > from limited intermediate results)? > > > > Thanks! > > Thomas > > > |
Guowei and Stephan, thanks for the reply!
The biggest gain that FLIP-1 will deliver for streaming is that parallel processing can continue accept for those parallel paths affected by the failure, even when all tasks in an affected path need to be reset. Assuming task manager process failure as most common scenario and the default scheduling, that would leave (parallelism - number of task slots) processing paths online (all tasks in the failed TM are reset). For applications that are very latency sensitive and can produce independent results, this is a big deal. Also, since by default all tasks end up in a single slot, a TM failure will reset everything when a shuffle is involved anyways, even when it is upstream. But where needed a shuffle can be externalized using an intermediate Kafka topic, for example. So perhaps I should also ask how far away we are from completing the first variant (without intermediate results) for streaming! Thanks, Thomas On Fri, Jul 26, 2019 at 12:36 AM Stephan Ewen <[hidden email]> wrote: > Hi Thomas! > > For Batch, this should be working in release 1.9. > > For streaming, it is a bit more tricky, mainly because of the fact that you > have to deal with downstream correctness. > Either a recovery still needs to reset downstream tasks (which means on > average half of the tasks) or needs to wait before publishing the data > downstream until a persistent point for recovery has been reached. > > I have looked a bit into the second variant here. This needs a bit more > thought (currently also busy with 1.9 release) but in the course of the > next release cycle we might be able to share some initial design. > > Best, > Stephan > > > > On Fri, Jul 26, 2019 at 3:46 AM Guowei Ma <[hidden email]> wrote: > > > Hi, > > 1. Currently, much work in FLINK-4256 is about failover improvements in > the > > bouded dataset scenario. > > 2. For the streaming scenario, a new shuffle plugin + proper failover > > strategy could avoid the "stop-the-word" recovery. > > 3. We have already done many works about the new shuffle in the old Flink > > shuffle architectures because many of our customers have the concern. We > > have a plan to move the work to the new Flink pluggable shuffle > > architecture. > > > > Best, > > Guowei > > > > > > Thomas Weise <[hidden email]> 于2019年7月26日周五 上午8:54写道: > > > > > Hi, > > > > > > We are using Flink for streaming and find the "stop-the-world" recovery > > > behavior of Flink prohibitive for use cases that prioritize > availability. > > > Partial recovery as outlined in FLIP-1 would probably alleviate these > > > concerns. > > > > > > > > > > > > https://cwiki.apache.org/confluence/display/FLINK/FLIP-1+%3A+Fine+Grained+Recovery+from+Task+Failures > > > > > > Looking at the subtasks in > > > https://issues.apache.org/jira/browse/FLINK-4256 it > > > appears that much of the work was already done but not much recent > > > progress? What is missing (for streaming)? How close is version 2 > > (recovery > > > from limited intermediate results)? > > > > > > Thanks! > > > Thomas > > > > > > |
Free forum by Nabble | Edit this page |