(DEPRECATED) Apache Flink Mailing List archive.

[jira] [Created] (FLINK-19816) Flink restored from a wrong checkpoint (a very old one and not the last completed one)

Classic

List

Threaded

1 message

Shang Yuanchun (Jira)

[jira] [Created] (FLINK-19816) Flink restored from a wrong checkpoint (a very old one and not the last completed one)

Steven Zhen Wu created FLINK-19816:
--------------------------------------

Summary: Flink restored from a wrong checkpoint (a very old one and not the last completed one)
Key: FLINK-19816
URL: https://issues.apache.org/jira/browse/FLINK-19816
Project: Flink
Issue Type: Bug
Components: Runtime / Checkpointing
Affects Versions: 1.11.0
Reporter: Steven Zhen Wu

h2. Summary

Upon failure, it seems that Flink didn't restore from the last completed checkpoint. Instead, it restored from a very old checkpoint. As a result, Kafka offsets are invalid and caused the job to replay from the beginning as Kafka consumer "auto.offset.reset" was set to "EARLIEST".

This is an embarrassingly parallel stateless job. Parallelism is over 1,000.

h2. Sequence of events from the logs

Just before the failure, checkpoint *210768* completed.

{code}
2020-10-25 02:35:05,970 INFO org.apache.flink.runtime.checkpoint.CheckpointCoordinator [jobmanager-future-thread-5] - Completed checkpoint 210768 for job 233b4938179c06974e4535ac8a868675 (4623776 bytes in 120402 ms).
{code}

During restart, somehow it decided to restore from checkpoint *203531*.
{code:java}
2020-10-25 02:36:03,301 INFO org.apache.flink.runtime.dispatcher.runner.SessionDispatcherLeaderProcess [cluster-io-thread-3] - Start SessionDispatcherLeaderProcess.
2020-10-25 02:36:03,302 INFO org.apache.flink.runtime.dispatcher.runner.SessionDispatcherLeaderProcess [cluster-io-thread-5] - Recover all persisted job graphs.
2020-10-25 02:36:03,304 INFO com.netflix.bdp.s3fs.BdpS3FileSystem [cluster-io-thread-25] - Deleting path: s3://<bucket>/checkpoints/XM3B/clapp_avro-clapp_avro_nontvui/1593/233b4938179c06974e4535ac8a868675/chk-210758/c31aec1e-07a7-4193-aa00-3fbe83f9e2e6
2020-10-25 02:36:03,307 INFO org.apache.flink.runtime.dispatcher.runner.SessionDispatcherLeaderProcess [cluster-io-thread-5] - Trying to recover job with job id 233b4938179c06974e4535ac8a868675.

2020-10-25 02:36:03,381 INFO com.netflix.bdp.s3fs.BdpS3FileSystem [cluster-io-thread-25] - Deleting path: s3://<bucket>/checkpoints/Hh86/clapp_avro-clapp_avro_nontvui/1593/233b4938179c06974e4535ac8a868675/chk-210758/4ab92f70-dfcd-4212-9b7f-bdbecb9257fd
...
2020-10-25 02:36:03,427 INFO org.apache.flink.runtime.checkpoint.ZooKeeperCompletedCheckpointStore [flink-akka.actor.default-dispatcher-82003] - Recovering checkpoints from ZooKeeper.
2020-10-25 02:36:03,432 INFO org.apache.flink.runtime.checkpoint.ZooKeeperCompletedCheckpointStore [flink-akka.actor.default-dispatcher-82003] - Found 0 checkpoints in ZooKeeper.
2020-10-25 02:36:03,432 INFO org.apache.flink.runtime.checkpoint.ZooKeeperCompletedCheckpointStore [flink-akka.actor.default-dispatcher-82003] - Trying to fetch 0 checkpoints from storage.
2020-10-25 02:36:03,432 INFO org.apache.flink.runtime.checkpoint.CheckpointCoordinator [flink-akka.actor.default-dispatcher-82003] - Starting job 233b4938179c06974e4535ac8a868675 from savepoint s3://<bucket>/checkpoints/metadata/clapp_avro-clapp_avro_nontvui/1113/47e2a25a8d0b696c7d0d423722bb6f54/chk-203531/_metadata ()
{code}

--
This message was sent by Atlassian Jira
(v8.3.4#803005)