Hi Devs,
I would like to start a formal discussion about "FLIP-135 Approximate Task-Local Recovery" [1] which supports approximate single task failure recovery without restarting the entire streaming job. Flink is no longer a pure streaming engine as it was born and has been extended to fit into many different scenarios over time: batch, AI, event-driven applications, e.t.c. Approximate task-local recovery is one of the attempts to fulfill these diversified scenarios and trade data consistency for fast failure recovery. More specifically, if a task fails, only the failed task restarts without affecting the rest of the job. Approximate task-local recovery is similar to RestartPipelinedRegionFailoverStrategy [2], with two major differences: - Instead of restarting a connected region, approximate task-local recovery restarts only the failed task(s) for a streaming job. - RestartPipelinedRegionFailoverStrategy is exactly-once, while approximate task-local recovery expects data loss and a bit of data duplication when sources fail. Approximate task-local recovery is useful in scenarios where a certain amount of data loss is tolerable, but a full pipeline restart is not affordable. A typical use case is online training. Online training jobs are usually complicated with all-to-all task connections, and a single task failure with RestartPipelinedRegionFailoverStrategy may result in a complete restart of the whole pipeline. Besides, the initialization is time-consuming, including the procedure of loading training models and starting Python subprocesses, etc. The initialization may take minutes to complete on average. Hence, we introduce an approximate task-local recovery strategy to only restart failed tasks. To ease the discussion, we divide the problem of approximate task-local recovery into three steps with each step only focusing on addressing a set of problems, sketched as follows: 1). Sink Recovery, 2). Downstream Recovery and 3). Single Task Recovery. This FLIP focuses on tackling issues related to the first step "Sink Recovery", and we would like to collect broader feedback in this dedicated mail thread. Best, Yuan [1] https://cwiki.apache.org/confluence/display/FLINK/FLIP-135+Approximate+Task-Local+Recovery [2] https://cwiki.apache.org/confluence/display/FLINK/FLIP-1+%3A+Fine+Grained+Recovery+from+Task+Failures |
Hi,
I was involved in some off-line discussions and some early stages of the discussions on this topic. In recent weeks Yuan has also prepared a good looking prototype and draft changes that look good to me, so I would be +1 for accepting this FLIP. I guess as there were no objections here, I would suggest starting a vote thread for this proposal. If someone finds a problem with this FLIP, please vote -1 there or let us know here. Best, Piotrek niedz., 16 sie 2020 o 17:02 Yuan Mei <[hidden email]> napisaĆ(a): > Hi Devs, > > I would like to start a formal discussion about "FLIP-135 Approximate > Task-Local Recovery" [1] which supports approximate single task failure > recovery without restarting the entire streaming job. > > Flink is no longer a pure streaming engine as it was born and has been > extended to fit into many different scenarios over time: batch, AI, > event-driven applications, e.t.c. Approximate task-local recovery is one of > the attempts to fulfill these diversified scenarios and trade data > consistency for fast failure recovery. More specifically, if a task fails, > only the failed task restarts without affecting the rest of the job. > Approximate task-local recovery is similar to > RestartPipelinedRegionFailoverStrategy [2], with two major differences: > - Instead of restarting a connected region, approximate task-local recovery > restarts only the failed task(s) for a streaming job. > - RestartPipelinedRegionFailoverStrategy is exactly-once, while approximate > task-local recovery expects data loss and a bit of data duplication when > sources fail. > > Approximate task-local recovery is useful in scenarios where a certain > amount of data loss is tolerable, but a full pipeline restart is not > affordable. A typical use case is online training. Online training jobs are > usually complicated with all-to-all task connections, and a single task > failure with RestartPipelinedRegionFailoverStrategy may result in a > complete restart of the whole pipeline. Besides, the initialization is > time-consuming, including the procedure of loading training models and > starting Python subprocesses, etc. The initialization may take minutes to > complete on average. Hence, we introduce an approximate task-local recovery > strategy to only restart failed tasks. > > To ease the discussion, we divide the problem of approximate task-local > recovery into three steps with each step only focusing on addressing a set > of problems, sketched as follows: > 1). Sink Recovery, 2). Downstream Recovery and 3). Single Task Recovery. > > This FLIP focuses on tackling issues related to the first step "Sink > Recovery", and we would like to collect broader feedback in this dedicated > mail thread. > > Best, > > Yuan > > [1] > > https://cwiki.apache.org/confluence/display/FLINK/FLIP-135+Approximate+Task-Local+Recovery > [2] > > https://cwiki.apache.org/confluence/display/FLINK/FLIP-1+%3A+Fine+Grained+Recovery+from+Task+Failures > |
Free forum by Nabble | Edit this page |