Alex created FLINK-13205:
----------------------------
Summary: Checkpoints/savepoints injection has loose ordering properties when a stop-with-savepoint is triggered
Key: FLINK-13205
URL: https://issues.apache.org/jira/browse/FLINK-13205
Project: Flink
Issue Type: Bug
Components: Runtime / Checkpointing
Affects Versions: 1.9.0
Reporter: Alex
Assignee: Alex
When a stop-with-savepoint is triggered at a source task, the thread pool of the task's dispatcher ({{Task.asyncCallDispatcher}}) is extended from single-threaded to multi-threaded.
As a result, subsequent checkpoints/savepoints in the dispatcher's queue can be applied concurrently and race with each other, so they are no longer strictly ordered in the event stream.
Consequently, checkpoints/savepoints that are injected later than they should be may be "silently subsumed": they may be ignored and never reported to the checkpoint coordinator.
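The ordering difference can be illustrated outside of Flink with plain {{java.util.concurrent}} executors. The sketch below is only an illustration under assumptions: {{DispatcherOrderingSketch}}, {{injectCheckpoints}} and {{isInOrder}} are hypothetical names, and the executor merely stands in for {{Task.asyncCallDispatcher}}; it is not the actual Flink code.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Hypothetical stand-in for the task's dispatcher; not Flink code.
class DispatcherOrderingSketch {

    /** Submits checkpoint IDs 1..count and returns the order in which they were applied. */
    static List<Long> injectCheckpoints(ExecutorService dispatcher, int count) {
        List<Long> applied = Collections.synchronizedList(new ArrayList<>());
        for (long id = 1; id <= count; id++) {
            final long checkpointId = id;
            // "Applying" a checkpoint is modelled as recording its ID.
            dispatcher.execute(() -> applied.add(checkpointId));
        }
        dispatcher.shutdown();
        try {
            if (!dispatcher.awaitTermination(10, TimeUnit.SECONDS)) {
                throw new IllegalStateException("dispatcher did not terminate in time");
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for dispatcher", e);
        }
        return applied;
    }

    /** Returns true if the IDs are strictly increasing. */
    static boolean isInOrder(List<Long> ids) {
        for (int i = 0; i + 1 < ids.size(); i++) {
            if (ids.get(i) >= ids.get(i + 1)) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        // A single worker thread drains its queue in FIFO order.
        List<Long> single = injectCheckpoints(Executors.newSingleThreadExecutor(), 1000);
        System.out.println("single-threaded in order: " + isInOrder(single));

        // With several workers, tasks may complete in any order.
        List<Long> multi = injectCheckpoints(Executors.newFixedThreadPool(4), 1000);
        System.out.println("multi-threaded in order: " + isInOrder(multi) + " (not guaranteed)");
    }
}
```

With a single worker thread the tasks run in submission order; with a pool of several threads the recorded order may differ from run to run, which is the kind of reordering described above.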
*Proposed solution:*
Revert {{Task.asyncCallDispatcher}} behavior to be single-threaded.
For the stop-with-savepoint feature, the dispatcher thread that performs the synchronous savepoint doesn't need to block; the {{StreamTask.finishTask()}} invocation can be delegated to {{StreamTask.notifyCheckpointComplete()}}.
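A minimal sketch of how such a delegation could look, assuming the task simply remembers the ID of the pending synchronous savepoint. {{DeferredFinishSketch}}, {{triggerSynchronousSavepoint}} and the {{pendingSyncSavepointId}} field are hypothetical; only the method names {{finishTask()}} and {{notifyCheckpointComplete()}} are taken from the description above, and their real signatures in {{StreamTask}} may differ.

```java
import java.util.OptionalLong;

// Hypothetical model of the proposed control flow; not the actual StreamTask.
class DeferredFinishSketch {

    // ID of the synchronous savepoint we are waiting on, if any (assumed field).
    private OptionalLong pendingSyncSavepointId = OptionalLong.empty();
    private boolean finished = false;

    /** Called on the dispatcher thread; records the savepoint and returns without blocking. */
    void triggerSynchronousSavepoint(long checkpointId) {
        pendingSyncSavepointId = OptionalLong.of(checkpointId);
    }

    /** Completion callback; finishes the task only for the pending synchronous savepoint. */
    void notifyCheckpointComplete(long checkpointId) {
        if (pendingSyncSavepointId.isPresent()
                && pendingSyncSavepointId.getAsLong() == checkpointId) {
            pendingSyncSavepointId = OptionalLong.empty();
            finishTask();
        }
    }

    void finishTask() {
        finished = true;
    }

    boolean isFinished() {
        return finished;
    }

    public static void main(String[] args) {
        DeferredFinishSketch task = new DeferredFinishSketch();
        task.triggerSynchronousSavepoint(5L);
        task.notifyCheckpointComplete(4L); // unrelated checkpoint, task keeps running
        System.out.println("finished after checkpoint 4: " + task.isFinished());
        task.notifyCheckpointComplete(5L); // the synchronous savepoint completed
        System.out.println("finished after savepoint 5: " + task.isFinished());
    }
}
```

Because the dispatcher thread returns right after triggering the savepoint, it stays free to process later requests in order, and no additional threads are needed.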
*Note:* imo, the issue described here is not critical, but the proposed change should simplify the implementation. This ticket can be considered an enhancement.
--
This message was sent by Atlassian JIRA
(v7.6.14#76016)