Robert Metzger created FLINK-22483:
--------------------------------------
Summary: Recover checkpoints when JobMaster gains leadership
Key: FLINK-22483
URL: https://issues.apache.org/jira/browse/FLINK-22483
Project: Flink
Issue Type: Bug
Components: Runtime / Coordination
Affects Versions: 1.13.0
Reporter: Robert Metzger
Fix For: 1.14.0
Recovering checkpoints (from the CompletedCheckpointStore) is a potentially blocking operation, for example if the file system implementation keeps retrying to connect to an unavailable storage backend.
Currently, we call the CompletedCheckpointStore.recover() method from the main thread of the JobManager, which makes it unresponsive to any RPC call for as long as recover() is blocked.
By moving the recovery to the start of the JobManager (which happens asynchronously after the JobMaster has gained leadership), Flink will remain responsive (reporting a job in INITIALIZING state).
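A minimal sketch of the idea, not Flink's actual code: the blocking recover() call runs on a separate I/O executor, so a status query can still be answered with INITIALIZING while recovery is stuck. The names AsyncRecoverySketch, CheckpointStore, JobMasterSketch, Status and requestJobStatus are hypothetical stand-ins invented for this illustration, and the latch only simulates an unavailable storage backend.

```java
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.CompletionException;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class AsyncRecoverySketch {

    /** Hypothetical stand-in for a checkpoint store; not Flink's interface. */
    interface CheckpointStore {
        List<String> recover() throws Exception; // may block on slow storage
    }

    /** Simplified job states, assumed for this sketch only. */
    enum Status { INITIALIZING, RUNNING, FAILED }

    /** Hypothetical component that recovers checkpoints off its main thread. */
    static class JobMasterSketch {
        private final ExecutorService ioExecutor;
        private volatile Status status = Status.INITIALIZING;

        JobMasterSketch(ExecutorService ioExecutor) {
            this.ioExecutor = ioExecutor;
        }

        /** Starts recovery on the I/O executor and returns immediately. */
        CompletableFuture<List<String>> start(CheckpointStore store) {
            return CompletableFuture
                .supplyAsync(() -> {
                    try {
                        return store.recover();
                    } catch (Exception e) {
                        throw new CompletionException(e);
                    }
                }, ioExecutor)
                // Update the visible status once recovery succeeds or fails.
                .whenComplete((checkpoints, error) ->
                    status = (error == null) ? Status.RUNNING : Status.FAILED);
        }

        /** Stands in for an RPC handler; never waits on recovery. */
        Status requestJobStatus() {
            return status;
        }
    }

    public static void main(String[] args) throws Exception {
        ExecutorService io = Executors.newSingleThreadExecutor();
        CountDownLatch storageAvailable = new CountDownLatch(1);
        JobMasterSketch jobMaster = new JobMasterSketch(io);

        // recover() blocks until the simulated storage backend comes back.
        CompletableFuture<List<String>> recovered = jobMaster.start(() -> {
            storageAvailable.await();
            return List.of("chk-41", "chk-42");
        });

        System.out.println("while blocked: " + jobMaster.requestJobStatus());
        storageAvailable.countDown();
        System.out.println("recovered: " + recovered.get(5, TimeUnit.SECONDS));
        System.out.println("afterwards: " + jobMaster.requestJobStatus());
        io.shutdown();
    }
}
```

If recover() were instead called directly inside start(), the status query would not return until the storage backend became reachable again, which is the unresponsiveness described above.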
--
This message was sent by Atlassian Jira
(v8.3.4#803005)