Stephan Ewen created FLINK-1953:
-----------------------------------
Summary: Rework Checkpoint Coordinator
Key: FLINK-1953
URL: https://issues.apache.org/jira/browse/FLINK-1953
Project: Flink
Issue Type: Bug
Components: Streaming
Affects Versions: 0.9
Reporter: Stephan Ewen
Assignee: Stephan Ewen
Fix For: 0.9
The checkpoint coordinator currently has no tests and is vulnerable in a variety of situations. In particular, I propose to add:
- Better configurability of which tasks receive the trigger checkpoint messages, which tasks need to acknowledge the checkpoint, and which tasks need to receive confirmation messages.
- Checkpoint timeouts, such that incomplete checkpoints are guaranteed to be cleaned up after a while, regardless of whether other checkpoints succeed.
- Better sanity checking of messages and fields, to properly handle or ignore messages for old or expired checkpoints and invalidly routed messages.
- Better handling of checkpoint attempts at points where the execution has just failed or is currently being canceled.
- A good set of tests.
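
To make the timeout and sanity-checking points above more concrete, here is a rough sketch of how pending checkpoints could be tracked. It is only an illustration under my own assumptions, not the existing coordinator code: the class name CheckpointTrackerSketch, its methods, and the use of plain string task ids are all hypothetical. The current time is passed in explicitly so that the behavior is easy to test.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Iterator;
import java.util.Map;
import java.util.Set;

/** Hypothetical sketch of pending-checkpoint tracking; not Flink's actual coordinator. */
class CheckpointTrackerSketch {

    /** One in-flight checkpoint: its start time and the tasks that have not acknowledged yet. */
    private static final class Pending {
        final long startedAtMillis;
        final Set<String> awaitingAck;

        Pending(long startedAtMillis, Set<String> tasksToAck) {
            this.startedAtMillis = startedAtMillis;
            this.awaitingAck = new HashSet<>(tasksToAck);
        }
    }

    private final Map<Long, Pending> pending = new HashMap<>();
    private final Set<String> tasksToAck;   // tasks that must acknowledge each checkpoint
    private final long timeoutMillis;       // age after which a pending checkpoint is dropped
    private long nextCheckpointId = 1;

    CheckpointTrackerSketch(Set<String> tasksToAck, long timeoutMillis) {
        this.tasksToAck = new HashSet<>(tasksToAck);
        this.timeoutMillis = timeoutMillis;
    }

    /** Registers a new pending checkpoint and returns its id. */
    synchronized long trigger(long nowMillis) {
        long id = nextCheckpointId++;
        pending.put(id, new Pending(nowMillis, tasksToAck));
        return id;
    }

    /**
     * Handles an acknowledgement. Returns true only if this message completed the checkpoint.
     * Messages for unknown or expired checkpoints, from unexpected tasks, or duplicates are ignored.
     */
    synchronized boolean acknowledge(long checkpointId, String taskId) {
        Pending p = pending.get(checkpointId);
        if (p == null) {
            return false; // old, expired, or never-triggered checkpoint
        }
        if (!p.awaitingAck.remove(taskId)) {
            return false; // wrongly routed or duplicate acknowledgement
        }
        if (p.awaitingAck.isEmpty()) {
            pending.remove(checkpointId);
            return true;
        }
        return false;
    }

    /** Drops every pending checkpoint that has reached the timeout; returns how many were dropped. */
    synchronized int expire(long nowMillis) {
        int dropped = 0;
        Iterator<Pending> it = pending.values().iterator();
        while (it.hasNext()) {
            if (nowMillis - it.next().startedAtMillis >= timeoutMillis) {
                it.remove();
                dropped++;
            }
        }
        return dropped;
    }

    synchronized int numPending() {
        return pending.size();
    }
}
```

In this sketch, expiry depends only on the age of a pending checkpoint, so an incomplete checkpoint is removed even if no later checkpoint ever completes. A real implementation would presumably call something like expire from a timer and would also have to discard any state already collected for the dropped checkpoint.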
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)