[DISCUSS] Checkpoint Failure process improvement

vino yang
Hi all,


Currently, the checkpoint failure handling logic is somewhat confusing
(it is not focused in one place), which makes it awkward to build some
features on the existing code.

So I have written a design document for improving the checkpoint failure
handling logic.

The design document primarily describes how to improve the checkpoint
failure handling logic and make it clearer.

Based on this, we introduce a CheckpointFailureManager, which makes
checkpoint failure handling more flexible.

This is mainly motivated by the following issues:


   - FLINK-4810[1]: Checkpoint Coordinator should fail ExecutionGraph after
     "n" unsuccessful checkpoints
   - FLINK-10074[3]: Allowable number of checkpoint failure
   - FLINK-10724[2]: Refactor failure handling in checkpoint coordinator
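
To make the idea of a tolerable number of checkpoint failures concrete, here
is a minimal sketch of a failure counter. It is not taken from the design
document: the class name FailureCounterSketch, the FailJobCallback interface,
and the choice to count consecutive failures (resetting on success) are all
assumptions made for illustration only.

```java
import java.util.concurrent.atomic.AtomicInteger;

/**
 * Hypothetical sketch, not the design from the document: counts consecutive
 * checkpoint failures and notifies a callback once the count exceeds a
 * configured tolerable limit.
 */
class FailureCounterSketch {

    /** Hypothetical callback standing in for "fail the job". */
    interface FailJobCallback {
        void failJob(String reason);
    }

    private final int tolerableFailures;
    private final FailJobCallback callback;
    private final AtomicInteger consecutiveFailures = new AtomicInteger(0);

    FailureCounterSketch(int tolerableFailures, FailJobCallback callback) {
        if (tolerableFailures < 0) {
            throw new IllegalArgumentException("tolerableFailures must be >= 0");
        }
        this.tolerableFailures = tolerableFailures;
        this.callback = callback;
    }

    /** Records a failed checkpoint; triggers the callback once the limit is exceeded. */
    void onCheckpointFailure(long checkpointId) {
        int failures = consecutiveFailures.incrementAndGet();
        if (failures > tolerableFailures) {
            callback.failJob(
                "Checkpoint " + checkpointId + " failed; " + failures
                    + " consecutive failures exceed the tolerable limit of "
                    + tolerableFailures);
        }
    }

    /** Assumption: a successful checkpoint resets the counter. */
    void onCheckpointSuccess(long checkpointId) {
        consecutiveFailures.set(0);
    }

    int getConsecutiveFailures() {
        return consecutiveFailures.get();
    }
}
```

With a limit of 2, the first two consecutive failures are tolerated and the
third one invokes the callback. Keeping the counting in one place like this is
the kind of separation that a dedicated failure manager is meant to provide.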


https://docs.google.com/document/d/1ce7RtecuTxcVUJlnU44hzcO2Dwq9g4Oyd8_biy94hJc/edit?usp=sharing

*Thanks to @Andrey Zagrebin for helping review the document and
suggesting many improvements.*

Feedback and comments are very welcome!

Best,
Vino

[1]: https://issues.apache.org/jira/browse/FLINK-4810
[2]: https://issues.apache.org/jira/browse/FLINK-10724
[3]: https://issues.apache.org/jira/browse/FLINK-10074

Re: [DISCUSS] Checkpoint Failure process improvement

vino yang
Hi all,

I will try to start coding based on the design document. Any feedback is
welcome throughout the process.

Best,
Vino

vino yang <[hidden email]> wrote on Wed, Jan 9, 2019 at 12:29 AM:

> Hi all,
>
>
> Currently, the checkpoint failure handling logic is somewhat confusing
> (it is not focused in one place), which makes it awkward to build some
> features on the existing code.
>
> So I have written a design document for improving the checkpoint failure
> handling logic.
>
> The design document primarily describes how to improve the checkpoint
> failure handling logic and make it clearer.
>
> Based on this, we introduce a CheckpointFailureManager, which makes
> checkpoint failure handling more flexible.
>
> This is mainly motivated by the following issues:
>
>
>    - FLINK-4810[1]: Checkpoint Coordinator should fail ExecutionGraph after
>      "n" unsuccessful checkpoints
>    - FLINK-10074[3]: Allowable number of checkpoint failure
>    - FLINK-10724[2]: Refactor failure handling in checkpoint coordinator
>
>
>
> https://docs.google.com/document/d/1ce7RtecuTxcVUJlnU44hzcO2Dwq9g4Oyd8_biy94hJc/edit?usp=sharing
>
> *Thanks to @Andrey Zagrebin for helping review the document and
> suggesting many improvements.*
>
> Feedback and comments are very welcome!
>
> Best,
> Vino
>
> [1]: https://issues.apache.org/jira/browse/FLINK-4810
> [2]: https://issues.apache.org/jira/browse/FLINK-10724
> [3]: https://issues.apache.org/jira/browse/FLINK-10074
>