Peng Zhang created FLINK-21685:
----------------------------------

             Summary: Flink JobManager failed to restart in K8S HA setup
                 Key: FLINK-21685
                 URL: https://issues.apache.org/jira/browse/FLINK-21685
             Project: Flink
          Issue Type: Bug
          Components: Deployment / Kubernetes
    Affects Versions: 1.12.2, 1.12.1
            Reporter: Peng Zhang
         Attachments: flink-ha.log

We use a Flink K8S session cluster in HA mode (1 JobManager and 4 TaskManagers). When jobs are running and the JobManager restarts, the JobManager fails to recover the job from its checkpoint:

{{2021-03-08 13:16:42,962 INFO org.apache.flink.runtime.checkpoint.DefaultCompletedCheckpointStore [] - Trying to fetch 1 checkpoints from storage.
2021-03-08 13:16:42,962 INFO org.apache.flink.runtime.checkpoint.DefaultCompletedCheckpointStore [] - Trying to fetch 1 checkpoints from storage.
2021-03-08 13:16:42,962 INFO org.apache.flink.runtime.checkpoint.DefaultCompletedCheckpointStore [] - Trying to retrieve checkpoint 1.
2021-03-08 13:16:43,014 INFO org.apache.flink.runtime.checkpoint.CheckpointCoordinator [] - Restoring job 9a534b2e309b24f78866b65d94082ead from Checkpoint 1 @ 1615208258041 for 9a534b2e309b24f78866b65d94082ead located at s3a://zalando-stellar-flink-state-eu-central-1-staging/checkpoints/9a534b2e309b24f78866b65d94082ead/chk-1.
2021-03-08 13:16:43,023 INFO org.apache.flink.runtime.checkpoint.CheckpointCoordinator [] - No master state to restore
2021-03-08 13:16:43,024 INFO org.apache.flink.runtime.jobmaster.JobMaster [] - Using failover strategy org.apache.flink.runtime.executiongraph.failover.flip1.RestartPipelinedRegionFailoverStrategy@58d927d2 for BrandCollectionTrackingJob (9a534b2e309b24f78866b65d94082ead).
2021-03-08 13:16:43,046 INFO org.apache.flink.runtime.jobmaster.JobManagerRunnerImpl [] - JobManager runner for job BrandCollectionTrackingJob (9a534b2e309b24f78866b65d94082ead) was granted leadership with session id c258d8ce-69d3-49df-8bee-1b748d5bbe74 at akka.tcp://flink@10.2.179.12:6123/user/rpc/jobmanager_2.
2021-03-08 13:16:43,060 WARN akka.remote.transport.netty.NettyTransport [] - Remote connection to [null] failed with java.net.NoRouteToHostException: No route to host
2021-03-08 13:16:43,060 WARN akka.remote.ReliableDeliverySupervisor [] - Association with remote system [akka.tcp://flink@10.2.174.188:6123] has failed, address is now gated for [50] ms. Reason: [Association failed with [akka.tcp://flink@10.2.174.188:6123]] Caused by: [java.net.NoRouteToHostException: No route to host] }}

Attached are the log and our configuration.

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
Hello, judging from the error, either the checkpoint file could not be found or there was a communication problem. Does your Flink on K8S deployment use a StatefulSet? And can you still find the checkpoint at its storage location?
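To help narrow down the communication side, the addresses in the log excerpt can be compared directly. The following is only a minimal sketch, assuming the log line format shown in the report; the regular expressions and the `summarize` helper are illustrative and not part of Flink. It lists the address at which the JobManager runner was granted leadership and the addresses whose Akka association failed.

```python
import re

# Two lines copied from the log excerpt in the report.
LOG = """\
2021-03-08 13:16:43,046 INFO org.apache.flink.runtime.jobmaster.JobManagerRunnerImpl [] - JobManager runner for job BrandCollectionTrackingJob (9a534b2e309b24f78866b65d94082ead) was granted leadership with session id c258d8ce-69d3-49df-8bee-1b748d5bbe74 at akka.tcp://flink@10.2.179.12:6123/user/rpc/jobmanager_2.
2021-03-08 13:16:43,060 WARN akka.remote.ReliableDeliverySupervisor [] - Association with remote system [akka.tcp://flink@10.2.174.188:6123] has failed, address is now gated for [50] ms. Reason: [Association failed with [akka.tcp://flink@10.2.174.188:6123]] Caused by: [java.net.NoRouteToHostException: No route to host]
"""

# Hypothetical patterns written for this log format only.
LEADER_RE = re.compile(
    r"was granted leadership .* at akka\.tcp://flink@([\d.]+):(\d+)"
)
FAILED_RE = re.compile(
    r"Association with remote system \[akka\.tcp://flink@([\d.]+):(\d+)\] has failed"
)


def summarize(log: str):
    """Return (leader addresses, unreachable addresses) as sets of (ip, port)."""
    leaders = set(LEADER_RE.findall(log))
    failed = set(FAILED_RE.findall(log))
    return leaders, failed


leaders, failed = summarize(LOG)
print("leader addresses:", sorted(leaders))
print("unreachable addresses:", sorted(failed))
print("unreachable and not the current leader:", sorted(failed - leaders))
```

For this excerpt the unreachable address differs from the address of the newly elected JobManager runner. The report does not say which component the unreachable address belonged to, so that would still need to be checked against the pods in the cluster.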
At 2021-03-09 15:23:00, "Peng Zhang (Jira)" <[hidden email]> wrote:
>Peng Zhang created FLINK-21685:
>[...]