(DEPRECATED) Apache Flink Mailing List archive.

Fine Grained Recovery / FLIP-1

Classic

List

Threaded

4 messages Options

Thomas Weise

Fine Grained Recovery / FLIP-1

Hi,

We are using Flink for streaming and find the "stop-the-world" recovery
behavior of Flink prohibitive for use cases that prioritize availability.
Partial recovery as outlined in FLIP-1 would probably alleviate these
concerns.

https://cwiki.apache.org/confluence/display/FLINK/FLIP-1+%3A+Fine+Grained+Recovery+from+Task+Failures

Looking at the subtasks in https://issues.apache.org/jira/browse/FLINK-4256 it
appears that much of the work was already done but not much recent
progress? What is missing (for streaming)? How close is version 2 (recovery
from limited intermediate results)?

Thanks!
Thomas

Guowei Ma

Re: Fine Grained Recovery / FLIP-1

Hi,
1. Currently, much work in FLINK-4256 is about failover improvements in the
bouded dataset scenario.
2. For the streaming scenario, a new shuffle plugin + proper failover
strategy could avoid the "stop-the-word" recovery.
3. We have already done many works about the new shuffle in the old Flink
shuffle architectures because many of our customers have the concern. We
have a plan to move the work to the new Flink pluggable shuffle
architecture.

Best,
Guowei

Thomas Weise <[hidden email]> 于2019年7月26日周五上午8:54写道：

> Hi,
>
> We are using Flink for streaming and find the "stop-the-world" recovery
> behavior of Flink prohibitive for use cases that prioritize availability.
> Partial recovery as outlined in FLIP-1 would probably alleviate these
> concerns.
>
>
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-1+%3A+Fine+Grained+Recovery+from+Task+Failures
>
> Looking at the subtasks in
> https://issues.apache.org/jira/browse/FLINK-4256 it
> appears that much of the work was already done but not much recent
> progress? What is missing (for streaming)? How close is version 2 (recovery
> from limited intermediate results)?
>
> Thanks!
> Thomas
>

Stephan Ewen

Re: Fine Grained Recovery / FLIP-1

Hi Thomas!

For Batch, this should be working in release 1.9.

For streaming, it is a bit more tricky, mainly because of the fact that you
have to deal with downstream correctness.
Either a recovery still needs to reset downstream tasks (which means on
average half of the tasks) or needs to wait before publishing the data
downstream until a persistent point for recovery has been reached.

I have looked a bit into the second variant here. This needs a bit more
thought (currently also busy with 1.9 release) but in the course of the
next release cycle we might be able to share some initial design.

Best,
Stephan

On Fri, Jul 26, 2019 at 3:46 AM Guowei Ma <[hidden email]> wrote:

> Hi,
> 1. Currently, much work in FLINK-4256 is about failover improvements in the
> bouded dataset scenario.
> 2. For the streaming scenario, a new shuffle plugin + proper failover
> strategy could avoid the "stop-the-word" recovery.
> 3. We have already done many works about the new shuffle in the old Flink
> shuffle architectures because many of our customers have the concern. We
> have a plan to move the work to the new Flink pluggable shuffle
> architecture.
>
> Best,
> Guowei
>
>
> Thomas Weise <[hidden email]> 于2019年7月26日周五上午8:54写道：
>
> > Hi,
> >
> > We are using Flink for streaming and find the "stop-the-world" recovery
> > behavior of Flink prohibitive for use cases that prioritize availability.
> > Partial recovery as outlined in FLIP-1 would probably alleviate these
> > concerns.
> >
> >
> >
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-1+%3A+Fine+Grained+Recovery+from+Task+Failures
> >
> > Looking at the subtasks in
> > https://issues.apache.org/jira/browse/FLINK-4256 it
> > appears that much of the work was already done but not much recent
> > progress? What is missing (for streaming)? How close is version 2
> (recovery
> > from limited intermediate results)?
> >
> > Thanks!
> > Thomas
> >
>

Thomas Weise

Re: Fine Grained Recovery / FLIP-1

Guowei and Stephan, thanks for the reply!

The biggest gain that FLIP-1 will deliver for streaming is that parallel
processing can continue accept for those parallel paths affected by the
failure, even when all tasks in an affected path need to be reset. Assuming
task manager process failure as most common scenario and the default
scheduling, that would leave (parallelism - number of task slots)
processing paths online (all tasks in the failed TM are reset). For
applications that are very latency sensitive and can produce independent
results, this is a big deal.

Also, since by default all tasks end up in a single slot, a TM failure will
reset everything when a shuffle is involved anyways, even when it is
upstream. But where needed a shuffle can be externalized using an
intermediate Kafka topic, for example. So perhaps I should also ask how far
away we are from completing the first variant (without intermediate
results) for streaming!

Thanks,
Thomas

On Fri, Jul 26, 2019 at 12:36 AM Stephan Ewen <[hidden email]> wrote:

> Hi Thomas!
>
> For Batch, this should be working in release 1.9.
>
> For streaming, it is a bit more tricky, mainly because of the fact that you
> have to deal with downstream correctness.
> Either a recovery still needs to reset downstream tasks (which means on
> average half of the tasks) or needs to wait before publishing the data
> downstream until a persistent point for recovery has been reached.
>
> I have looked a bit into the second variant here. This needs a bit more
> thought (currently also busy with 1.9 release) but in the course of the
> next release cycle we might be able to share some initial design.
>
> Best,
> Stephan
>
>
>
> On Fri, Jul 26, 2019 at 3:46 AM Guowei Ma <[hidden email]> wrote:
>
> > Hi,
> > 1. Currently, much work in FLINK-4256 is about failover improvements in
> the
> > bouded dataset scenario.
> > 2. For the streaming scenario, a new shuffle plugin + proper failover
> > strategy could avoid the "stop-the-word" recovery.
> > 3. We have already done many works about the new shuffle in the old Flink
> > shuffle architectures because many of our customers have the concern. We
> > have a plan to move the work to the new Flink pluggable shuffle
> > architecture.
> >
> > Best,
> > Guowei
> >
> >
> > Thomas Weise <[hidden email]> 于2019年7月26日周五上午8:54写道：
> >
> > > Hi,
> > >
> > > We are using Flink for streaming and find the "stop-the-world" recovery
> > > behavior of Flink prohibitive for use cases that prioritize
> availability.
> > > Partial recovery as outlined in FLIP-1 would probably alleviate these
> > > concerns.
> > >
> > >
> > >
> >
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-1+%3A+Fine+Grained+Recovery+from+Task+Failures
> > >
> > > Looking at the subtasks in
> > > https://issues.apache.org/jira/browse/FLINK-4256 it
> > > appears that much of the work was already done but not much recent
> > > progress? What is missing (for streaming)? How close is version 2
> > (recovery
> > > from limited intermediate results)?
> > >
> > > Thanks!
> > > Thomas
> > >
> >
>