flink 1.9 Restore from a checkpoint taken in 1.11


Lu Niu
Hi, Flink dev

Is it supported for a Flink 1.9 job to restore from a checkpoint taken by
the same job running on Flink 1.11? For context, we are migrating to 1.11
and need a backup plan for an emergency fallback. We ran a test, and it
throws the following error:
```
Caused by: org.apache.flink.runtime.rest.util.RestClientException: [Internal server error., <Exception on server side:
org.apache.flink.runtime.client.JobSubmissionException: Failed to submit job.
        at org.apache.flink.runtime.dispatcher.Dispatcher.lambda$internalSubmitJob$2(Dispatcher.java:344)
        at java.util.concurrent.CompletableFuture.uniHandle(CompletableFuture.java:822)
        at java.util.concurrent.CompletableFuture$UniHandle.tryFire(CompletableFuture.java:797)
        at java.util.concurrent.CompletableFuture$Completion.run(CompletableFuture.java:442)
        at akka.dispatch.TaskInvocation.run(AbstractDispatcher.scala:40)
        at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(ForkJoinExecutorConfigurator.scala:44)
        at akka.dispatch.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
        at akka.dispatch.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
        at akka.dispatch.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
        at akka.dispatch.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
Caused by: java.lang.RuntimeException: org.apache.flink.runtime.client.JobExecutionException: Could not set up JobManager
        at org.apache.flink.util.function.CheckedSupplier.lambda$unchecked$0(CheckedSupplier.java:36)
        at java.util.concurrent.CompletableFuture$AsyncSupply.run(CompletableFuture.java:1590)
        ... 6 more
Caused by: org.apache.flink.runtime.client.JobExecutionException: Could not set up JobManager
        at org.apache.flink.runtime.jobmaster.JobManagerRunner.<init>(JobManagerRunner.java:152)
        at org.apache.flink.runtime.dispatcher.DefaultJobManagerRunnerFactory.createJobManagerRunner(DefaultJobManagerRunnerFactory.java:83)
        at org.apache.flink.runtime.dispatcher.Dispatcher.lambda$createJobManagerRunner$5(Dispatcher.java:387)
        at org.apache.flink.util.function.CheckedSupplier.lambda$unchecked$0(CheckedSupplier.java:34)
        ... 7 more
Caused by: java.lang.IllegalArgumentException: Cannot restore savepoint version 3.
        at org.apache.flink.runtime.checkpoint.savepoint.SavepointSerializers.getSerializer(SavepointSerializers.java:80)
        at org.apache.flink.runtime.checkpoint.Checkpoints.loadCheckpointMetadata(Checkpoints.java:106)
        at org.apache.flink.runtime.checkpoint.Checkpoints.loadAndValidateCheckpoint(Checkpoints.java:143)
        at org.apache.flink.runtime.checkpoint.CheckpointCoordinator.restoreSavepoint(CheckpointCoordinator.java:1099)
        at org.apache.flink.runtime.scheduler.LegacyScheduler.tryRestoreExecutionGraphFromSavepoint(LegacyScheduler.java:237)
        at org.apache.flink.runtime.scheduler.LegacyScheduler.createAndRestoreExecutionGraph(LegacyScheduler.java:196)
        at org.apache.flink.runtime.scheduler.LegacyScheduler.<init>(LegacyScheduler.java:176)
        at org.apache.flink.runtime.scheduler.LegacySchedulerFactory.createInstance(LegacySchedulerFactory.java:70)
        at org.apache.flink.runtime.jobmaster.JobMaster.createScheduler(JobMaster.java:278)
        at org.apache.flink.runtime.jobmaster.JobMaster.<init>(JobMaster.java:266)
        at org.apache.flink.runtime.jobmaster.factories.DefaultJobMasterServiceFactory.createJobMasterService(DefaultJobMasterServiceFactory.java:98)
        at org.apache.flink.runtime.jobmaster.factories.DefaultJobMasterServiceFactory.createJobMasterService(DefaultJobMasterServiceFactory.java:40)
        at org.apache.flink.runtime.jobmaster.JobManagerRunner.<init>(JobManagerRunner.java:146)
        ... 10 more

End of exception on server side>]
```
So it seems this is not supported, but I would like to confirm that with the community.


Best
Lu

Re: flink 1.9 Restore from a checkpoint taken in 1.11

Yun Tang
Hi Lu

Flink guarantees backward compatibility but not forward compatibility; you can find the compatibility table here [1].
Flink 1.11 introduced unaligned checkpoints, which bumped the savepoint metadata version to 3. That is why Flink 1.9 cannot consume a savepoint (or checkpoint) generated by Flink 1.11: its deserializer does not recognize version 3.
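Since the failure is raised while reading the version field of the metadata header, a fallback plan could check that header before resubmitting the job on 1.9, rather than discovering the problem at submission time. The Java sketch below is illustrative only: the class and method names are hypothetical, the magic number `0x4960672d` is assumed to match Flink's `Checkpoints.HEADER_MAGIC_NUMBER`, and the per-release maximum versions (2 for 1.9, 3 for 1.11) are inferred from the error and the explanation in this thread, not taken from any official Flink API. It also assumes every version up to a release's maximum is readable by that release.

```java
import java.io.DataInputStream;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;

/**
 * Hypothetical pre-flight check: read the header of a checkpoint/savepoint "_metadata"
 * file and decide whether a given Flink release is expected to be able to load it.
 */
public class MetadataVersionCheck {

    // Assumption: magic number Flink writes at the start of the metadata file;
    // the metadata format version (an int) follows it.
    static final int HEADER_MAGIC_NUMBER = 0x4960672d;

    /** Assumed highest metadata version each release can deserialize. */
    static int maxReadableVersion(String flinkRelease) {
        switch (flinkRelease) {
            case "1.9":
                return 2; // 1.9 fails with "Cannot restore savepoint version 3."
            case "1.11":
                return 3; // 1.11 writes version 3 (unaligned checkpoint support)
            default:
                throw new IllegalArgumentException("No entry for Flink " + flinkRelease);
        }
    }

    /** Reads and validates the magic number, then returns the metadata format version. */
    static int readVersion(InputStream in) throws IOException {
        DataInputStream data = new DataInputStream(in);
        int magic = data.readInt();
        if (magic != HEADER_MAGIC_NUMBER) {
            throw new IOException("Not a Flink metadata file, magic 0x" + Integer.toHexString(magic));
        }
        return data.readInt();
    }

    /** Backward compatible only: an older release cannot read a newer format version. */
    static boolean canRestore(String flinkRelease, int metadataVersion) {
        return metadataVersion <= maxReadableVersion(flinkRelease);
    }

    public static void main(String[] args) throws IOException {
        // args[0]: path to a _metadata file, args[1]: target release, e.g. "1.9"
        try (InputStream in = new FileInputStream(args[0])) {
            int version = readVersion(in);
            System.out.printf("metadata version %d, restorable by Flink %s: %b%n",
                    version, args[1], canRestore(args[1], version));
        }
    }
}
```

For a checkpoint written by 1.11, this check would report version 3 and `false` for target release `1.9`, matching the `Cannot restore savepoint version 3.` failure above. A working fallback still requires state that 1.9 can read; this only surfaces the mismatch earlier.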


[1] https://ci.apache.org/projects/flink/flink-docs-release-1.12/ops/upgrading.html#compatibility-table

Best
Yun Tang

________________________________
From: Lu Niu <[hidden email]>
Sent: Tuesday, December 1, 2020 9:53
To: [hidden email] <[hidden email]>
Subject: flink 1.9 Restore from a checkpoint taken in 1.11

Hi, Flink dev

Is it supported that a flink job in version 1.9 could restore from a
checkpoint taken from the same job using 1.11? The context is we are
migrating to version 1.11 and we need a backup plan for emergency fallback.
We did a test and it throws error:
```
Caused by: org.apache.flink.runtime.rest.util.RestClientException:
[Internal server error., <Exception on server side:
org.apache.flink.runtime.client.JobSubmissionException: Failed to submit
job.
        at
org.apache.flink.runtime.dispatcher.Dispatcher.lambda$internalSubmitJob$2(Dispatcher.java:344)
        at
java.util.concurrent.CompletableFuture.uniHandle(CompletableFuture.java:822)
        at
java.util.concurrent.CompletableFuture$UniHandle.tryFire(CompletableFuture.java:797)
        at
java.util.concurrent.CompletableFuture$Completion.run(CompletableFuture.java:442)
        at akka.dispatch.TaskInvocation.run(AbstractDispatcher.scala:40)
        at
akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(ForkJoinExecutorConfigurator.scala:44)
        at akka.dispatch.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
        at
akka.dispatch.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
        at
akka.dispatch.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
        at
akka.dispatch.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
Caused by: java.lang.RuntimeException:
org.apache.flink.runtime.client.JobExecutionException: Could not set up
JobManager
        at
org.apache.flink.util.function.CheckedSupplier.lambda$unchecked$0(CheckedSupplier.java:36)
        at
java.util.concurrent.CompletableFuture$AsyncSupply.run(CompletableFuture.java:1590)
        ... 6 more
Caused by: org.apache.flink.runtime.client.JobExecutionException: Could not
set up JobManager
        at
org.apache.flink.runtime.jobmaster.JobManagerRunner.<init>(JobManagerRunner.java:152)
        at
org.apache.flink.runtime.dispatcher.DefaultJobManagerRunnerFactory.createJobManagerRunner(DefaultJobManagerRunnerFactory.java:83)
        at
org.apache.flink.runtime.dispatcher.Dispatcher.lambda$createJobManagerRunner$5(Dispatcher.java:387)
        at
org.apache.flink.util.function.CheckedSupplier.lambda$unchecked$0(CheckedSupplier.java:34)
        ... 7 more
Caused by: java.lang.IllegalArgumentException: Cannot restore savepoint
version 3.
        at
org.apache.flink.runtime.checkpoint.savepoint.SavepointSerializers.getSerializer(SavepointSerializers.java:80)
        at
org.apache.flink.runtime.checkpoint.Checkpoints.loadCheckpointMetadata(Checkpoints.java:106)
        at
org.apache.flink.runtime.checkpoint.Checkpoints.loadAndValidateCheckpoint(Checkpoints.java:143)
        at
org.apache.flink.runtime.checkpoint.CheckpointCoordinator.restoreSavepoint(CheckpointCoordinator.java:1099)
        at
org.apache.flink.runtime.scheduler.LegacyScheduler.tryRestoreExecutionGraphFromSavepoint(LegacyScheduler.java:237)
        at
org.apache.flink.runtime.scheduler.LegacyScheduler.createAndRestoreExecutionGraph(LegacyScheduler.java:196)
        at
org.apache.flink.runtime.scheduler.LegacyScheduler.<init>(LegacyScheduler.java:176)
        at
org.apache.flink.runtime.scheduler.LegacySchedulerFactory.createInstance(LegacySchedulerFactory.java:70)
        at
org.apache.flink.runtime.jobmaster.JobMaster.createScheduler(JobMaster.java:278)
        at
org.apache.flink.runtime.jobmaster.JobMaster.<init>(JobMaster.java:266)
        at
org.apache.flink.runtime.jobmaster.factories.DefaultJobMasterServiceFactory.createJobMasterService(DefaultJobMasterServiceFactory.java:98)
        at
org.apache.flink.runtime.jobmaster.factories.DefaultJobMasterServiceFactory.createJobMasterService(DefaultJobMasterServiceFactory.java:40)
        at
org.apache.flink.runtime.jobmaster.JobManagerRunner.<init>(JobManagerRunner.java:146)
        ... 10 more

End of exception on server side>]
```
So it seems not. I just want to confirm that with the community.


Best
Lu