(DEPRECATED) Apache Flink Mailing List archive.

[jira] [Created] (FLINK-19596) Do not recover CompletedCheckpointStore on each failover

Shang Yuanchun (Jira)

Jiayi Liao created FLINK-19596:
----------------------------------

Summary: Do not recover CompletedCheckpointStore on each failover
Key: FLINK-19596
URL: https://issues.apache.org/jira/browse/FLINK-19596
Project: Flink
Issue Type: Improvement
Components: Runtime / Checkpointing
Affects Versions: 1.11.2
Reporter: Jiayi Liao

{{completedCheckpointStore.recover()}} in {{restoreLatestCheckpointedStateInternal}} can become a bottleneck on failover, because the {{CompletedCheckpointStore}} has to load files from HDFS to instantiate the {{CompletedCheckpoint}} instances.

The impact is significant in our case:

* Our jobs have high parallelism and no shuffle, and transfer data from Kafka to other filesystems.
* If a machine goes down, several containers and tens of tasks are affected. Because those tasks are not in the same failover region, {{completedCheckpointStore.recover()}} is called tens of times.
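
The idea in the summary can be illustrated with a minimal, self-contained sketch. It does not use Flink's classes and is not how Flink implements or fixes this: {{RecoverableStore}}, {{SlowStore}}, {{RecoverOnceStore}} and {{invalidate()}} are hypothetical names, and the assumption is that the in-memory state of the store stays valid for as long as the same coordinator owns it.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for a checkpoint store with an expensive recover().
// This is NOT Flink's CompletedCheckpointStore interface.
interface RecoverableStore {
    void recover();
    List<String> getAllCheckpoints();
}

// Simulates the expensive path: every recover() "reloads" checkpoint metadata
// (in the issue this means reading files from HDFS) and is counted.
class SlowStore implements RecoverableStore {
    int recoverCalls = 0;
    private final List<String> checkpoints = new ArrayList<>();

    @Override
    public void recover() {
        recoverCalls++;
        checkpoints.clear();
        checkpoints.add("chk-1"); // placeholder for metadata loaded from remote storage
    }

    @Override
    public List<String> getAllCheckpoints() {
        return new ArrayList<>(checkpoints);
    }
}

// Wrapper that performs the expensive recovery once and afterwards serves
// the state already held in memory.
class RecoverOnceStore implements RecoverableStore {
    private final RecoverableStore delegate;
    private boolean recovered = false;

    RecoverOnceStore(RecoverableStore delegate) {
        this.delegate = delegate;
    }

    @Override
    public synchronized void recover() {
        if (!recovered) {
            delegate.recover();
            recovered = true;
        }
    }

    @Override
    public synchronized List<String> getAllCheckpoints() {
        return delegate.getAllCheckpoints();
    }

    // Hypothetical hook: force a fresh recovery, e.g. after ownership of the store changes.
    synchronized void invalidate() {
        recovered = false;
    }
}

class RecoverOnceDemo {
    public static void main(String[] args) {
        SlowStore slow = new SlowStore();
        RecoverOnceStore store = new RecoverOnceStore(slow);
        // Thirty task failovers, each of which asks the store to recover.
        for (int i = 0; i < 30; i++) {
            store.recover();
        }
        System.out.println("expensive recoveries: " + slow.recoverCalls);
    }
}
```

In this sketch, thirty failovers trigger a single expensive recovery. Whether skipping later recoveries is safe in practice depends on the in-memory state never diverging from what is persisted, which the sketch assumes rather than checks.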

--
This message was sent by Atlassian Jira
(v8.3.4#803005)