Till Rohrmann created FLINK-21979:
-------------------------------------
Summary: Job can be restarted from the beginning after it reached a terminal state
Key: FLINK-21979
URL:
https://issues.apache.org/jira/browse/FLINK-21979 Project: Flink
Issue Type: Bug
Components: Runtime / Coordination
Affects Versions: 1.12.2, 1.11.3, 1.13.0
Reporter: Till Rohrmann
Fix For: 1.14.0
Currently, the {{JobMaster}} removes all checkpoints after a job reaches a globally terminal state. Then it notifies the {{Dispatcher}} about the termination of the job. The {{Dispatcher}} then removes the job from the {{SubmittedJobGraphStore}}. If the {{Dispatcher}} process fails before doing that it might get restarted. In this case, the {{Dispatcher}} would still find the job in the {{SubmittedJobGraphStore}} and recover it. Since the {{CompletedCheckpointStore}} is empty, it would start executing this job from the beginning.
I think we must not remove job state before the job has not been marked as done or made inaccessible for any restarted processes. Concretely, we should first remove the job from the {{SubmittedJobGraphStore}} and only then delete the checkpoints. Ideally all the job related cleanup operation happens atomically.
--
This message was sent by Atlassian Jira
(v8.3.4#803005)