[jira] [Created] (FLINK-16931) Large _metadata file lead to JobManager not responding when restart

Shang Yuanchun (Jira)
Lu Niu created FLINK-16931:
------------------------------

             Summary: Large _metadata file lead to JobManager not responding when restart
                 Key: FLINK-16931
                 URL: https://issues.apache.org/jira/browse/FLINK-16931
             Project: Flink
          Issue Type: Bug
            Reporter: Lu Niu


When the _metadata file is large, the JobManager can never recover from a checkpoint. It falls into a loop (fetch checkpoint -> JM timeout -> restart). Here is the related log:
2020-04-01 17:08:25,689 INFO  org.apache.flink.runtime.checkpoint.ZooKeeperCompletedCheckpointStore  - Recovering checkpoints from ZooKeeper.
2020-04-01 17:08:25,698 INFO  org.apache.flink.runtime.checkpoint.ZooKeeperCompletedCheckpointStore  - Found 3 checkpoints in ZooKeeper.
2020-04-01 17:08:25,698 INFO  org.apache.flink.runtime.checkpoint.ZooKeeperCompletedCheckpointStore  - Trying to fetch 3 checkpoints from storage.
2020-04-01 17:08:25,698 INFO  org.apache.flink.runtime.checkpoint.ZooKeeperCompletedCheckpointStore  - Trying to retrieve checkpoint 50.
2020-04-01 17:08:48,589 INFO  org.apache.flink.runtime.checkpoint.ZooKeeperCompletedCheckpointStore  - Trying to retrieve checkpoint 51.
2020-04-01 17:09:12,775 INFO  org.apache.flink.yarn.YarnResourceManager                     - The heartbeat of JobManager with id 02500708baf0bb976891c391afd3d7d5 timed out.
Digging into the code, it looks like ExecutionGraph::restart runs in the JobMaster main thread and eventually calls ZooKeeperCompletedCheckpointStore::retrieveCompletedCheckpoint, which downloads the file from DFS. The main thread is blocked for as long as the download takes, which is long enough for the JobManager heartbeat to time out. One possible solution is to make the downloading part async. More things might need to be considered, as the original change tried to make this code path single-threaded. [https://github.com/apache/flink/pull/7568]
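
To illustrate the async idea, here is a minimal, self-contained sketch in plain Java. It is not Flink code: the class AsyncCheckpointFetchSketch, the methods fetchAllAsync and slowFetch, and both executors are hypothetical names made up for this example. The slow DFS read is simulated with a sleep. The blocking fetches run on a separate I/O pool, and only the final, cheap step is handed back to a single-threaded executor that stands in for the JobMaster main thread.

```java
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.Executor;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.function.Function;
import java.util.stream.Collectors;

// Hypothetical sketch, not part of Flink.
class AsyncCheckpointFetchSketch {

    // Starts one blocking fetch per checkpoint ID on ioExecutor and returns
    // a future that completes with all results, in the order of the input IDs.
    static <T> CompletableFuture<List<T>> fetchAllAsync(
            List<Long> checkpointIds,
            Function<Long, T> blockingFetch,
            Executor ioExecutor) {
        List<CompletableFuture<T>> futures = checkpointIds.stream()
                .map(id -> CompletableFuture.supplyAsync(() -> blockingFetch.apply(id), ioExecutor))
                .collect(Collectors.toList());
        return CompletableFuture
                .allOf(futures.toArray(new CompletableFuture[0]))
                // All futures are complete here, so join() does not block.
                .thenApply(ignored -> futures.stream()
                        .map(CompletableFuture::join)
                        .collect(Collectors.toList()));
    }

    // Stand-in for reading a large _metadata file from DFS (assumption: the
    // real call blocks for seconds; here it sleeps briefly).
    static String slowFetch(long checkpointId) {
        try {
            Thread.sleep(200);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("fetch interrupted", e);
        }
        return "checkpoint-" + checkpointId;
    }

    public static void main(String[] args) {
        ExecutorService ioExecutor = Executors.newFixedThreadPool(2);
        // Single thread standing in for the JobMaster main thread.
        ExecutorService mainThread = Executors.newSingleThreadExecutor();
        try {
            CompletableFuture<Void> done =
                    fetchAllAsync(Arrays.asList(50L, 51L, 52L),
                            AsyncCheckpointFetchSketch::slowFetch, ioExecutor)
                    // Only this short callback runs on the "main thread".
                    .thenAcceptAsync(
                            cps -> System.out.println("Restoring from " + cps.get(cps.size() - 1)),
                            mainThread);
            // The main thread stays free while the downloads are in flight.
            mainThread.execute(() -> System.out.println("heartbeat answered"));
            done.join();
        } finally {
            ioExecutor.shutdown();
            mainThread.shutdown();
        }
    }
}
```

In this sketch the "main thread" executor can still process other work (such as the heartbeat task) while the fetches run. A real change would also have to handle fetch failures and guard against state changing between the start of the fetch and the callback, which is likely what the single-threaded design in the linked pull request was meant to avoid.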


