Hi all,
Currently, the checkpoint failure handling logic is somewhat confusing
(it is not focused in one place), which leaves some functions in the
existing code passive.
So I have written a design document to improve the checkpoint failure
handling logic.
The design document mainly describes how to improve the checkpoint failure
handling logic and make it clearer.
Based on this, we introduce a CheckpointFailureManager, which makes
checkpoint failure processing more flexible.
This is mainly motivated by the following issues:
- FLINK-4810 [1]: Checkpoint Coordinator should fail ExecutionGraph after
  "n" unsuccessful checkpoints
- FLINK-10074 [3]: Allowable number of checkpoint failure
- FLINK-10724 [2]: Refactor failure handling in checkpoint coordinator
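
To make the request in FLINK-4810 concrete, here is a minimal sketch of counting failed checkpoints against a limit "n". It is only an illustration and not the design from the document: the class name TolerableFailureCounter, its methods, and the choice to count consecutive failures and reset on success are all my assumptions.

```java
/**
 * Hypothetical sketch only: not the CheckpointFailureManager from the design
 * document and not an existing Flink class.
 */
final class TolerableFailureCounter {

    /** The "n" from FLINK-4810: failures in a row that trigger a job failure. */
    private final int maxConsecutiveFailures;

    /** Failed checkpoints seen since the last successful one. */
    private int consecutiveFailures;

    TolerableFailureCounter(int maxConsecutiveFailures) {
        if (maxConsecutiveFailures < 1) {
            throw new IllegalArgumentException("maxConsecutiveFailures must be >= 1");
        }
        this.maxConsecutiveFailures = maxConsecutiveFailures;
    }

    /**
     * Records one failed checkpoint.
     *
     * @return true once the limit is reached, meaning the caller should fail the job
     */
    boolean onCheckpointFailure() {
        consecutiveFailures++;
        return consecutiveFailures >= maxConsecutiveFailures;
    }

    /** A successful checkpoint resets the count (an assumption of this sketch). */
    void onCheckpointSuccess() {
        consecutiveFailures = 0;
    }

    int getConsecutiveFailures() {
        return consecutiveFailures;
    }
}
```

With a limit of 2, a single failure followed by a success is tolerated, while two failures in a row would signal that the job should fail. Whether failures should be counted consecutively or in total is one of the questions the design has to settle.
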

https://docs.google.com/document/d/1ce7RtecuTxcVUJlnU44hzcO2Dwq9g4Oyd8_biy94hJc/edit?usp=sharing

Thanks to @Andrey Zagrebin for helping me review the document and
suggesting a lot of improvements.
Feedback and comments are very welcome!
Best,
Vino

[1]: https://issues.apache.org/jira/browse/FLINK-4810
[2]: https://issues.apache.org/jira/browse/FLINK-10724
[3]: https://issues.apache.org/jira/browse/FLINK-10074