[DISCUSS] FLIP-135: Approximate Task-Local Recovery

classic Classic list List threaded Threaded
2 messages Options
Reply | Threaded
Open this post in threaded view
|

[DISCUSS] FLIP-135: Approximate Task-Local Recovery

curcur
Hi Devs,

I would like to start a formal discussion about "FLIP-135 Approximate
Task-Local Recovery" [1] which supports approximate single task failure
recovery without restarting the entire streaming job.

Flink is no longer a pure streaming engine as it was born and has been
extended to fit into many different scenarios over time: batch, AI,
event-driven applications, e.t.c. Approximate task-local recovery is one of
the attempts to fulfill these diversified scenarios and trade data
consistency for fast failure recovery. More specifically, if a task fails,
only the failed task restarts without affecting the rest of the job.
Approximate task-local recovery is similar to
RestartPipelinedRegionFailoverStrategy [2], with two major differences:
- Instead of restarting a connected region, approximate task-local recovery
restarts only the failed task(s) for a streaming job.
- RestartPipelinedRegionFailoverStrategy is exactly-once, while approximate
task-local recovery expects data loss and a bit of data duplication when
sources fail.

Approximate task-local recovery is useful in scenarios where a certain
amount of data loss is tolerable, but a full pipeline restart is not
affordable. A typical use case is online training. Online training jobs are
usually complicated with all-to-all task connections, and a single task
failure with RestartPipelinedRegionFailoverStrategy may result in a
complete restart of the whole pipeline. Besides, the initialization is
time-consuming, including the procedure of loading training models and
starting Python subprocesses, etc. The initialization may take minutes to
complete on average. Hence, we introduce an approximate task-local recovery
strategy to only restart failed tasks.

To ease the discussion, we divide the problem of approximate task-local
recovery into three steps with each step only focusing on addressing a set
of problems, sketched as follows:
1). Sink Recovery, 2). Downstream Recovery and 3). Single Task Recovery.

This FLIP focuses on tackling issues related to the first step "Sink
Recovery", and we would like to collect broader feedback in this dedicated
mail thread.

Best,

Yuan

[1]
https://cwiki.apache.org/confluence/display/FLINK/FLIP-135+Approximate+Task-Local+Recovery
[2]
https://cwiki.apache.org/confluence/display/FLINK/FLIP-1+%3A+Fine+Grained+Recovery+from+Task+Failures
Reply | Threaded
Open this post in threaded view
|

Re: [DISCUSS] FLIP-135: Approximate Task-Local Recovery

Piotr Nowojski-5
Hi,

I was involved in some off-line discussions and some early stages of the
discussions on this topic. In recent weeks Yuan has also prepared a good
looking prototype and draft changes that look good to me, so I would be +1
for accepting this FLIP.

I guess as there were no objections here, I would suggest starting a vote
thread for this proposal. If someone finds a problem with this FLIP, please
vote -1 there or let us know here.

Best,
Piotrek

niedz., 16 sie 2020 o 17:02 Yuan Mei <[hidden email]> napisaƂ(a):

> Hi Devs,
>
> I would like to start a formal discussion about "FLIP-135 Approximate
> Task-Local Recovery" [1] which supports approximate single task failure
> recovery without restarting the entire streaming job.
>
> Flink is no longer a pure streaming engine as it was born and has been
> extended to fit into many different scenarios over time: batch, AI,
> event-driven applications, e.t.c. Approximate task-local recovery is one of
> the attempts to fulfill these diversified scenarios and trade data
> consistency for fast failure recovery. More specifically, if a task fails,
> only the failed task restarts without affecting the rest of the job.
> Approximate task-local recovery is similar to
> RestartPipelinedRegionFailoverStrategy [2], with two major differences:
> - Instead of restarting a connected region, approximate task-local recovery
> restarts only the failed task(s) for a streaming job.
> - RestartPipelinedRegionFailoverStrategy is exactly-once, while approximate
> task-local recovery expects data loss and a bit of data duplication when
> sources fail.
>
> Approximate task-local recovery is useful in scenarios where a certain
> amount of data loss is tolerable, but a full pipeline restart is not
> affordable. A typical use case is online training. Online training jobs are
> usually complicated with all-to-all task connections, and a single task
> failure with RestartPipelinedRegionFailoverStrategy may result in a
> complete restart of the whole pipeline. Besides, the initialization is
> time-consuming, including the procedure of loading training models and
> starting Python subprocesses, etc. The initialization may take minutes to
> complete on average. Hence, we introduce an approximate task-local recovery
> strategy to only restart failed tasks.
>
> To ease the discussion, we divide the problem of approximate task-local
> recovery into three steps with each step only focusing on addressing a set
> of problems, sketched as follows:
> 1). Sink Recovery, 2). Downstream Recovery and 3). Single Task Recovery.
>
> This FLIP focuses on tackling issues related to the first step "Sink
> Recovery", and we would like to collect broader feedback in this dedicated
> mail thread.
>
> Best,
>
> Yuan
>
> [1]
>
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-135+Approximate+Task-Local+Recovery
> [2]
>
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-1+%3A+Fine+Grained+Recovery+from+Task+Failures
>