[jira] [Created] (FLINK-21979) Job can be restarted from the beginning after it reached a terminal state

Shang Yuanchun (Jira)
Till Rohrmann created FLINK-21979:
-------------------------------------

             Summary: Job can be restarted from the beginning after it reached a terminal state
                 Key: FLINK-21979
                 URL: https://issues.apache.org/jira/browse/FLINK-21979
             Project: Flink
          Issue Type: Bug
          Components: Runtime / Coordination
    Affects Versions: 1.12.2, 1.11.3, 1.13.0
            Reporter: Till Rohrmann
             Fix For: 1.14.0


Currently, the {{JobMaster}} removes all checkpoints after a job reaches a globally terminal state. It then notifies the {{Dispatcher}} about the termination of the job, and the {{Dispatcher}} removes the job from the {{SubmittedJobGraphStore}}. If the {{Dispatcher}} process fails before removing the job, it might get restarted. In this case, the restarted {{Dispatcher}} would still find the job in the {{SubmittedJobGraphStore}} and recover it. Since the {{CompletedCheckpointStore}} is empty at that point, it would start executing the job from the beginning.

I think we must not remove job state before the job has been marked as done or made inaccessible to any restarted processes. Concretely, we should first remove the job from the {{SubmittedJobGraphStore}} and only then delete the checkpoints. Ideally, all job-related cleanup operations happen atomically.
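
To make the ordering argument concrete, here is a minimal sketch that models the two stores as in-memory sets and simulates a process failure between the two cleanup steps. Everything in it is hypothetical and for illustration only ({{CleanupOrderSketch}}, {{Stores}}, {{RecoveryOutcome}}, {{crashAfterFirstStep}}); it does not use or mirror the actual Flink classes or their APIs.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical model of the cleanup ordering; not Flink code.
final class CleanupOrderSketch {

    /** What a restarted Dispatcher would do with the job in this model. */
    enum RecoveryOutcome {
        NOT_RECOVERED,
        RESTARTED_FROM_BEGINNING,
        RESUMED_FROM_CHECKPOINT
    }

    /** In-memory stand-ins for the two persistent stores (assumption). */
    static final class Stores {
        final Set<String> submittedJobGraphs = new HashSet<>();
        final Set<String> completedCheckpoints = new HashSet<>();
    }

    /** Recovery as described in the issue: a job is recovered iff its graph is still stored. */
    static RecoveryOutcome recover(Stores stores, String jobId) {
        if (!stores.submittedJobGraphs.contains(jobId)) {
            return RecoveryOutcome.NOT_RECOVERED;
        }
        return stores.completedCheckpoints.contains(jobId)
                ? RecoveryOutcome.RESUMED_FROM_CHECKPOINT
                : RecoveryOutcome.RESTARTED_FROM_BEGINNING;
    }

    /**
     * Runs only the first cleanup step for a finished job, then "crashes"
     * and reports what a restarted process would do.
     */
    static RecoveryOutcome crashAfterFirstStep(boolean removeJobGraphFirst) {
        String jobId = "job-1";
        Stores stores = new Stores();
        stores.submittedJobGraphs.add(jobId);
        stores.completedCheckpoints.add(jobId);

        if (removeJobGraphFirst) {
            // Proposed order: make the job unrecoverable first.
            stores.submittedJobGraphs.remove(jobId);
        } else {
            // Current order: checkpoints are deleted first.
            stores.completedCheckpoints.remove(jobId);
        }
        // Simulated process failure: the second cleanup step never runs.

        return recover(stores, jobId);
    }

    public static void main(String[] args) {
        System.out.println("current order:  " + crashAfterFirstStep(false));
        System.out.println("proposed order: " + crashAfterFirstStep(true));
    }
}
```

In this model, a crash under the current order leaves a recoverable job with no checkpoints, so it restarts from the beginning. Under the proposed order the job is no longer recoverable, but its checkpoints are left behind and still need to be cleaned up, which is one reason an atomic cleanup would be preferable.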



--
This message was sent by Atlassian Jira
(v8.3.4#803005)