[jira] [Created] (FLINK-13593) Prevent failing the wrong job in CheckpointFailureManager

Shang Yuanchun (Jira)
Yu Li created FLINK-13593:
-----------------------------

             Summary: Prevent failing the wrong job in CheckpointFailureManager
                 Key: FLINK-13593
                 URL: https://issues.apache.org/jira/browse/FLINK-13593
             Project: Flink
          Issue Type: Bug
          Components: Runtime / Checkpointing
    Affects Versions: 1.9.0
            Reporter: Yu Li
             Fix For: 1.9.0


Because checkpoint decline messages are handled asynchronously in {{LegacyScheduler#declineCheckpoint}}, a decline message may be handled before the job status transition takes effect. In that case {{receiveDeclineMessage}} grabs the lock in {{CheckpointCoordinator}} before {{pendingCheckpoints}} is cleared by {{stopCheckpointScheduler}} (which is triggered by the job status listener {{CheckpointCoordinatorDeActivator}}). If the job/tasks restart quickly enough, the {{FailJobCallback}} in {{CheckpointFailureManager}} might then unexpectedly fail the job again, as observed in FLINK-13527.

To resolve the issue, we need to add a safeguard when failing the job: pass the {{ExecutionAttemptID}} through and check it against the current executions to make sure the to-be-failed one is still running, so that we won't fail the newly restarted one by accident.
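
A minimal sketch of the proposed safeguard, not the actual patch. It uses no Flink classes: {{AttemptId}}, {{Job}}, {{startAttempt}}, {{restart}} and {{failJobDueToTaskFailure}} are hypothetical stand-ins chosen for illustration, and the real {{ExecutionAttemptID}} and {{FailJobCallback}} types may look different. The point is only that the fail request carries the attempt ID it was raised for and is dropped when that attempt is no longer among the running ones.

```java
import java.util.Map;
import java.util.UUID;
import java.util.concurrent.ConcurrentHashMap;

public class GuardedFailJobSketch {

    /** Hypothetical stand-in for an execution attempt ID (identity-based equality). */
    public static final class AttemptId {
        private final UUID id = UUID.randomUUID();

        @Override
        public String toString() {
            return id.toString();
        }
    }

    public enum State { RUNNING, FAILED }

    /** Hypothetical job model that tracks only its current execution attempts. */
    public static final class Job {
        private final Map<AttemptId, State> currentExecutions = new ConcurrentHashMap<>();

        /** Registers a new running attempt and returns its ID. */
        public synchronized AttemptId startAttempt() {
            AttemptId attempt = new AttemptId();
            currentExecutions.put(attempt, State.RUNNING);
            return attempt;
        }

        /** Simulates a restart: attempts from the previous run are forgotten. */
        public synchronized void restart() {
            currentExecutions.clear();
        }

        /**
         * Fails the job only if the attempt that triggered the failure is
         * still a current, running execution. Returns true if the job was failed.
         */
        public synchronized boolean failJobDueToTaskFailure(Throwable cause, AttemptId failingAttempt) {
            if (currentExecutions.get(failingAttempt) != State.RUNNING) {
                // Stale request from an earlier run: ignore it.
                return false;
            }
            currentExecutions.replaceAll((attempt, state) -> State.FAILED);
            return true;
        }
    }

    public static void main(String[] args) {
        Job job = new Job();
        AttemptId oldAttempt = job.startAttempt();

        // The job restarts before the late decline message is processed.
        job.restart();
        AttemptId newAttempt = job.startAttempt();

        Throwable cause = new RuntimeException("checkpoint declined");
        System.out.println("stale request failed job: "
                + job.failJobDueToTaskFailure(cause, oldAttempt));
        System.out.println("current request failed job: "
                + job.failJobDueToTaskFailure(cause, newAttempt));
    }
}
```

In this sketch the check and the state change happen under the same lock, so a restart cannot slip in between verifying the attempt and failing the job.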



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)