|
Hi all,
The checkpoint failure handling logic is currently somewhat confusing (unfocused), which constrains some functionality built on the existing code.

I have therefore written a design document that describes how to improve the checkpoint failure handling logic and make it clearer. Based on this, we introduce a CheckpointFailureManager, which makes checkpoint failure processing more flexible.

The motivation comes mainly from the following issues:

- FLINK-4810 [1]: Checkpoint Coordinator should fail ExecutionGraph after "n" unsuccessful checkpoints
- FLINK-10074 [3]: Allowable number of checkpoint failure
- FLINK-10724 [2]: Refactor failure handling in checkpoint coordinator

https://docs.google.com/document/d/1ce7RtecuTxcVUJlnU44hzcO2Dwq9g4Oyd8_biy94hJc/edit?usp=sharing

*Thanks to @Andrey Zagrebin for helping me review the document and suggesting many improvements.*

Feedback and comments are very welcome!

Best,
Vino

[1]: https://issues.apache.org/jira/browse/FLINK-4810
[2]: https://issues.apache.org/jira/browse/FLINK-10724
[3]: https://issues.apache.org/jira/browse/FLINK-10074
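To make the idea behind FLINK-4810 and FLINK-10074 concrete, here is a minimal sketch of what a failure manager with a configurable threshold might look like. It is only an illustration, not the design from the document and not Flink's actual API: the class name `SketchCheckpointFailureManager`, its methods, the `Runnable` callback standing in for "fail the ExecutionGraph", and the choice to count consecutive (rather than total) failures are all assumptions made for this example.

```java
// Hypothetical sketch only; names and semantics are assumptions, not Flink's API.
final class SketchCheckpointFailureManager {

    // The "n" from FLINK-4810: the job is failed on the n-th consecutive failure.
    private final int maxConsecutiveFailures;

    // Stand-in for "fail the ExecutionGraph"; the real hook is not specified here.
    private final Runnable failJobCallback;

    private int consecutiveFailures = 0;

    SketchCheckpointFailureManager(int maxConsecutiveFailures, Runnable failJobCallback) {
        if (maxConsecutiveFailures < 1) {
            throw new IllegalArgumentException("maxConsecutiveFailures must be >= 1");
        }
        if (failJobCallback == null) {
            throw new IllegalArgumentException("failJobCallback must not be null");
        }
        this.maxConsecutiveFailures = maxConsecutiveFailures;
        this.failJobCallback = failJobCallback;
    }

    // Called whenever a checkpoint is declined, expires, or otherwise fails.
    synchronized void onCheckpointFailure() {
        consecutiveFailures++;
        if (consecutiveFailures >= maxConsecutiveFailures) {
            failJobCallback.run();
        }
    }

    // A completed checkpoint resets the counter (assumption: only consecutive failures count).
    synchronized void onCheckpointSuccess() {
        consecutiveFailures = 0;
    }

    synchronized int getConsecutiveFailures() {
        return consecutiveFailures;
    }
}
```

Keeping the counting and the fail-the-job decision in one place like this is the kind of separation the proposal aims for: the coordinator would only report failures and successes, and the policy for when to fail the job could then be changed independently.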
|
Hi all,
I will start coding based on the design document. Any feedback is welcome throughout the process.

Best,
Vino

vino yang <[hidden email]> wrote on Wed, Jan 9, 2019 at 12:29 AM:

> Hi all,
>
> Currently, the checkpoint's failure handling logic is somewhat confusing
> (not focused), which makes some functions on existing code passive.
>
> So I provide a design document to improve the Checkpoint failure process
> logic.
>
> This design document primarily describes how to improve checkpoint failure
> handling logic and make it more clear.
>
> Based on this, we introduce a CheckpointFailureManager, which makes the
> checkpoint failure processing more flexible.
>
> This mainly comes from the following appeals:
>
> - FLINK-4810[1]: Checkpoint Coordinator should fail ExecutionGraph after
>   "n" unsuccessful checkpoints
> - FLINK-10074[3]: Allowable number of checkpoint failure
> - FLINK-10724[2]: Refactor failure handling in checkpoint coordinator
>
> https://docs.google.com/document/d/1ce7RtecuTxcVUJlnU44hzcO2Dwq9g4Oyd8_biy94hJc/edit?usp=sharing
>
> *Thanks to @Andrey Zagrebin for helping me review the documentation and
> suggesting a lot of improvements.*
>
> Feedback and comments are very welcome!
>
> Best,
> Vino
>
> [1]: https://issues.apache.org/jira/browse/FLINK-4810
> [2]: https://issues.apache.org/jira/browse/FLINK-10724
> [3]: https://issues.apache.org/jira/browse/FLINK-10074
