(DEPRECATED) Apache Flink Mailing List archive.

[DISCUSS] Checkpoint Failure process improvement

vino yang

[DISCUSS] Checkpoint Failure process improvement

Hi all,

Currently, the checkpoint failure handling logic is somewhat confusing
(it is not focused in one place), which leaves some functions in the
existing code passive.

I have therefore written a design document that describes how to improve
the checkpoint failure handling logic and make it clearer.

Based on this, we introduce a CheckpointFailureManager, which makes the
checkpoint failure processing more flexible.

This is mainly motivated by the following issues:

- FLINK-4810 [1]: Checkpoint Coordinator should fail ExecutionGraph after
  "n" unsuccessful checkpoints
- FLINK-10074 [3]: Allowable number of checkpoint failures
- FLINK-10724 [2]: Refactor failure handling in checkpoint coordinator
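
To make the idea behind these issues concrete, here is a minimal sketch of
what a failure manager with a tolerable failure count could look like. It is
not the class from the design document: the name
SketchCheckpointFailureManager, the FailJobCallback interface, and all method
names are hypothetical. It also assumes that only consecutive failures are
counted and that a successful checkpoint resets the counter, which the issues
above do not specify.

```java
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical sketch only; names and behavior are assumptions, not the proposed design. */
class SketchCheckpointFailureManager {

    /** Hypothetical callback used to fail the job (for example, the ExecutionGraph). */
    interface FailJobCallback {
        void failJob(String reason);
    }

    private final int tolerableFailures;
    private final FailJobCallback callback;
    private final AtomicInteger consecutiveFailures = new AtomicInteger(0);

    SketchCheckpointFailureManager(int tolerableFailures, FailJobCallback callback) {
        if (tolerableFailures < 0) {
            throw new IllegalArgumentException("tolerableFailures must be >= 0");
        }
        if (callback == null) {
            throw new IllegalArgumentException("callback must not be null");
        }
        this.tolerableFailures = tolerableFailures;
        this.callback = callback;
    }

    /** Records one failed checkpoint and fails the job once the tolerated count is exceeded. */
    void onCheckpointFailure(long checkpointId) {
        int failures = consecutiveFailures.incrementAndGet();
        if (failures > tolerableFailures) {
            callback.failJob(
                "Checkpoint " + checkpointId + " failed; " + failures
                    + " consecutive failures exceed the tolerated " + tolerableFailures);
        }
    }

    /** Assumption: a successful checkpoint resets the consecutive failure counter. */
    void onCheckpointSuccess(long checkpointId) {
        consecutiveFailures.set(0);
    }

    int getConsecutiveFailures() {
        return consecutiveFailures.get();
    }
}
```

With tolerableFailures set to 2, the callback in this sketch fires on the
third consecutive failure. Keeping the counting in one class means the
coordinator only reports outcomes and the decision to fail the job lives in
a single place.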

https://docs.google.com/document/d/1ce7RtecuTxcVUJlnU44hzcO2Dwq9g4Oyd8_biy94hJc/edit?usp=sharing

*Thanks to @Andrey Zagrebin for reviewing the design document and
suggesting many improvements.*

Feedback and comments are very welcome!

Best,
Vino

[1]: https://issues.apache.org/jira/browse/FLINK-4810
[2]: https://issues.apache.org/jira/browse/FLINK-10724
[3]: https://issues.apache.org/jira/browse/FLINK-10074

vino yang

Re: [DISCUSS] Checkpoint Failure process improvement

Hi all,

I will start implementing this based on the design document. Feedback is
welcome at any point in the process.

Best,
Vino

In reply to vino yang <[hidden email]>, Wed, Jan 9, 2019 at 12:29 AM