[jira] [Created] (FLINK-22483) Recover checkpoints when JobMaster gains leadership

Shang Yuanchun (Jira)
Robert Metzger created FLINK-22483:
--------------------------------------

             Summary: Recover checkpoints when JobMaster gains leadership
                 Key: FLINK-22483
                 URL: https://issues.apache.org/jira/browse/FLINK-22483
             Project: Flink
          Issue Type: Bug
          Components: Runtime / Coordination
    Affects Versions: 1.13.0
            Reporter: Robert Metzger
             Fix For: 1.14.0


Recovering checkpoints (from the CompletedCheckpointStore) is a potentially blocking operation, for example if the file system implementation keeps retrying to connect to an unavailable storage backend.

Currently, we are calling the CompletedCheckpointStore.recover() method from the main thread of the JobManager, making it unresponsive to any RPC call while the recover method is blocked.

By moving the recovery to the start of the JobManager (which happens asynchronously after the JobMaster has gained leadership), Flink will remain responsive (reporting a job in INITIALIZING state).
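
To illustrate the idea, here is a minimal, self-contained sketch in plain Java. It is not Flink code: the class AsyncRecoverySketch, its methods, the RUNNING state and the "chk-1"/"chk-2" values are all hypothetical, and the blocking storage backend is simulated with a CountDownLatch. It only shows the general pattern of running a blocking recover call on a separate executor so that status requests are still answered (with INITIALIZING) while recovery is stuck.

```java
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.function.Supplier;

// Hypothetical sketch, NOT Flink's implementation: all names here are invented.
class AsyncRecoverySketch {

    // Only INITIALIZING is named in the issue; RUNNING is an assumed follow-up state.
    enum Status { INITIALIZING, RUNNING }

    private volatile Status status = Status.INITIALIZING;
    private final CompletableFuture<List<String>> recovered;

    AsyncRecoverySketch(Supplier<List<String>> blockingRecover, ExecutorService ioExecutor) {
        // Run the potentially blocking recovery off the calling ("main") thread.
        this.recovered = CompletableFuture
                .supplyAsync(blockingRecover, ioExecutor)
                .thenApply(checkpoints -> {
                    status = Status.RUNNING; // flip state only once recovery is done
                    return checkpoints;
                });
    }

    // Stand-in for an RPC such as a job status request; it never blocks.
    Status requestStatus() {
        return status;
    }

    // Waits for recovery to finish; used here only to make the demo deterministic.
    List<String> awaitRecovered() {
        return recovered.join();
    }

    static String demo() {
        ExecutorService ioExecutor = Executors.newSingleThreadExecutor();
        CountDownLatch storageReachable = new CountDownLatch(1);
        try {
            AsyncRecoverySketch job = new AsyncRecoverySketch(() -> {
                try {
                    storageReachable.await(); // simulates an unavailable storage backend
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    throw new IllegalStateException(e);
                }
                return Arrays.asList("chk-1", "chk-2");
            }, ioExecutor);

            // Recovery is still blocked, yet the status request returns immediately.
            Status whileBlocked = job.requestStatus();

            storageReachable.countDown(); // the backend becomes reachable again
            List<String> checkpoints = job.awaitRecovered();
            return whileBlocked + " -> " + job.requestStatus() + " " + checkpoints;
        } finally {
            ioExecutor.shutdown();
        }
    }

    public static void main(String[] args) {
        System.out.println(demo()); // prints "INITIALIZING -> RUNNING [chk-1, chk-2]"
    }
}
```

Had the constructor called the blocking supplier directly, the first requestStatus() call could not have been served until the latch was released, which is the unresponsiveness described above.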



--
This message was sent by Atlassian Jira
(v8.3.4#803005)