Mark Cho created FLINK-11957:
-------------------------------- Summary: Expose failure cause in the API response when dispatcher fails to submit a job Key: FLINK-11957 URL: https://issues.apache.org/jira/browse/FLINK-11957 Project: Flink Issue Type: Improvement Components: Runtime / REST Affects Versions: 1.7.2 Reporter: Mark Cho We use POST /jars/:jarid/run API endpoint to submit a Flink job https://ci.apache.org/projects/flink/flink-docs-release-1.7/monitoring/rest_api.html#jars-jarid-run Currently, whenever there is an error, API response only returns the following info: {code:java} { "errors": [ "org.apache.flink.runtime.client.JobSubmissionException: Failed to submit job." ] } {code} Since job submission can fail for multiple reasons, it would be helpful to have some information that tells us why the job submission failed. Currently, we have to dig into the Flink logs to find the root cause. Some examples of job submission failure can be: {code:java} java.lang.RuntimeException: org.apache.flink.runtime.client.JobExecutionException: Could not set up JobManager at org.apache.flink.util.function.CheckedSupplier.lambda$unchecked$0(CheckedSupplier.java:36) at java.util.concurrent.CompletableFuture$AsyncSupply.run(CompletableFuture.java:1590) at akka.dispatch.TaskInvocation.run(AbstractDispatcher.scala:39) at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:415) at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260) at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339) at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979) at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107) Caused by: org.apache.flink.runtime.client.JobExecutionException: Could not set up JobManager at org.apache.flink.runtime.jobmaster.JobManagerRunner.<init>(JobManagerRunner.java:176) at org.apache.flink.runtime.dispatcher.Dispatcher$DefaultJobManagerRunnerFactory.createJobManagerRunner(Dispatcher.java:1058) at org.apache.flink.runtime.dispatcher.Dispatcher.lambda$createJobManagerRunner$5(Dispatcher.java:308) at org.apache.flink.util.function.CheckedSupplier.lambda$unchecked$0(CheckedSupplier.java:34) ... 7 more Caused by: java.io.FileNotFoundException: Cannot find checkpoint or savepoint file/directory 's3://us-east-1.spaas.test/checkpoints/metadata/spaas_app_mcho-flink_bp_test/cee4-155266396689/fa82a7d2c8dfb6f7fb14bf2e319d4367/chk-969/_metadata' on file system 's3'. at org.apache.flink.runtime.state.filesystem.AbstractFsCheckpointStorage.resolveCheckpointPointer(AbstractFsCheckpointStorage.java:241) at org.apache.flink.runtime.state.filesystem.AbstractFsCheckpointStorage.resolveCheckpoint(AbstractFsCheckpointStorage.java:109) at org.apache.flink.runtime.checkpoint.CheckpointCoordinator.restoreSavepoint(CheckpointCoordinator.java:1100) at org.apache.flink.runtime.jobmaster.JobMaster.tryRestoreExecutionGraphFromSavepoint(JobMaster.java:1241) at org.apache.flink.runtime.jobmaster.JobMaster.createAndRestoreExecutionGraph(JobMaster.java:1165) at org.apache.flink.runtime.jobmaster.JobMaster.<init>(JobMaster.java:296) at org.apache.flink.runtime.jobmaster.JobManagerRunner.<init>(JobManagerRunner.java:157) ... 10 more {code} {code:java} java.lang.RuntimeException: org.apache.flink.runtime.client.JobExecutionException: Could not set up JobManager at org.apache.flink.util.function.CheckedSupplier.lambda$unchecked$0(CheckedSupplier.java:36) at java.util.concurrent.CompletableFuture$AsyncSupply.run(CompletableFuture.java:1590) at akka.dispatch.TaskInvocation.run(AbstractDispatcher.scala:39) at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:415) at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260) at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339) at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979) at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107) Caused by: org.apache.flink.runtime.client.JobExecutionException: Could not set up JobManager at org.apache.flink.runtime.jobmaster.JobManagerRunner.<init>(JobManagerRunner.java:176) at org.apache.flink.runtime.dispatcher.Dispatcher$DefaultJobManagerRunnerFactory.createJobManagerRunner(Dispatcher.java:1058) at org.apache.flink.runtime.dispatcher.Dispatcher.lambda$createJobManagerRunner$5(Dispatcher.java:308) at org.apache.flink.util.function.CheckedSupplier.lambda$unchecked$0(CheckedSupplier.java:34) ... 7 more Caused by: org.apache.flink.util.FlinkRuntimeException: Incompatible failover strategy - strategy 'Individual Task Restart' can only handle jobs with only disconnected tasks. at org.apache.flink.runtime.executiongraph.failover.RestartIndividualStrategy.notifyNewVertices(RestartIndividualStrategy.java:142) at org.apache.flink.runtime.executiongraph.ExecutionGraph.attachJobGraph(ExecutionGraph.java:860) at org.apache.flink.runtime.executiongraph.ExecutionGraphBuilder.buildGraph(ExecutionGraphBuilder.java:232) at org.apache.flink.runtime.executiongraph.ExecutionGraphBuilder.buildGraph(ExecutionGraphBuilder.java:100) at org.apache.flink.runtime.jobmaster.JobMaster.createExecutionGraph(JobMaster.java:1173) at org.apache.flink.runtime.jobmaster.JobMaster.createAndRestoreExecutionGraph(JobMaster.java:1153) at org.apache.flink.runtime.jobmaster.JobMaster.<init>(JobMaster.java:296) at org.apache.flink.runtime.jobmaster.JobManagerRunner.<init>(JobManagerRunner.java:157) ... 10 more {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) |
Free forum by Nabble | Edit this page |