Jeff Zhang created FLINK-18167:
----------------------------------

             Summary: Flink Job hangs there when one vertex is failed and another is cancelled.
                 Key: FLINK-18167
                 URL: https://issues.apache.org/jira/browse/FLINK-18167
             Project: Flink
          Issue Type: Bug
          Components: Runtime / Checkpointing
    Affects Versions: 1.10.0
            Reporter: Jeff Zhang
         Attachments: image-2020-06-06-15-39-35-441.png


After I call cancel with savepoint, the cancel operation fails. The following is what I see on the client side.

{code:java}
 WARN [2020-06-06 13:45:16,003] ({Thread-1241} JobManager.java[cancelJob]:137) - Fail to cancel job 7e5492f35c1a7f5dad7c805ba943ea52 that is associated with paragraph paragraph_1586733868269_783581378
java.util.concurrent.ExecutionException: java.util.concurrent.CompletionException: java.util.concurrent.CompletionException: org.apache.flink.runtime.checkpoint.CheckpointException: Checkpoint Coordinator is suspending.
	at java.util.concurrent.CompletableFuture.reportGet(CompletableFuture.java:357)
	at java.util.concurrent.CompletableFuture.get(CompletableFuture.java:1908)
	at org.apache.zeppelin.flink.JobManager.cancelJob(JobManager.java:129)
	at org.apache.zeppelin.flink.FlinkScalaInterpreter.cancel(FlinkScalaInterpreter.scala:648)
	at org.apache.zeppelin.flink.FlinkInterpreter.cancel(FlinkInterpreter.java:101)
	at org.apache.zeppelin.interpreter.LazyOpenInterpreter.cancel(LazyOpenInterpreter.java:119)
	at org.apache.zeppelin.interpreter.remote.RemoteInterpreterServer.lambda$cancel$1(RemoteInterpreterServer.java:800)
	at java.lang.Thread.run(Thread.java:748)
Caused by: java.util.concurrent.CompletionException: java.util.concurrent.CompletionException: org.apache.flink.runtime.checkpoint.CheckpointException: Checkpoint Coordinator is suspending.
	at org.apache.flink.runtime.scheduler.SchedulerBase.lambda$stopWithSavepoint$9(SchedulerBase.java:873)
	at java.util.concurrent.CompletableFuture.uniHandle(CompletableFuture.java:836)
	at java.util.concurrent.CompletableFuture$UniHandle.tryFire(CompletableFuture.java:811)
	at java.util.concurrent.CompletableFuture$Completion.run(CompletableFuture.java:456)
	at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRunAsync(AkkaRpcActor.java:397)
	at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRpcMessage(AkkaRpcActor.java:190)
	at org.apache.flink.runtime.rpc.akka.FencedAkkaRpcActor.handleRpcMessage(FencedAkkaRpcActor.java:74)
	at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleMessage(AkkaRpcActor.java:152)
	at akka.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:26)
	at akka.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:21)
	at scala.PartialFunction$class.applyOrElse(PartialFunction.scala:123)
	at akka.japi.pf.UnitCaseStatement.applyOrElse(CaseStatements.scala:21)
	at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:170)
	at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:171)
	at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:171)
	at akka.actor.Actor$class.aroundReceive(Actor.scala:517)
	at akka.actor.AbstractActor.aroundReceive(AbstractActor.scala:225)
	at akka.actor.ActorCell.receiveMessage(ActorCell.scala:592)
	at akka.actor.ActorCell.invoke(ActorCell.scala:561)
	at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:258)
	at akka.dispatch.Mailbox.run(Mailbox.scala:225)
	at akka.dispatch.Mailbox.exec(Mailbox.scala:235)
	at akka.dispatch.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
	at akka.dispatch.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
	at akka.dispatch.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
	at akka.dispatch.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
Caused by: java.util.concurrent.CompletionException: org.apache.flink.runtime.checkpoint.CheckpointException: Checkpoint Coordinator is suspending.
	at org.apache.flink.runtime.checkpoint.CheckpointCoordinator.lambda$triggerSynchronousSavepoint$0(CheckpointCoordinator.java:428)
	at java.util.concurrent.CompletableFuture.uniHandle(CompletableFuture.java:836)
	at java.util.concurrent.CompletableFuture$UniHandle.tryFire(CompletableFuture.java:811)
	at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:488)
	at java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:1990)
	at org.apache.flink.runtime.checkpoint.CheckpointCoordinator.lambda$null$1(CheckpointCoordinator.java:457)
	at java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:774)
	at java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(CompletableFuture.java:750)
	at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:488)
	at java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:1990)
	at org.apache.flink.runtime.checkpoint.PendingCheckpoint.abort(PendingCheckpoint.java:429)
	at org.apache.flink.runtime.checkpoint.CheckpointCoordinator.failPendingCheckpoint(CheckpointCoordinator.java:1445)
	at org.apache.flink.runtime.checkpoint.CheckpointCoordinator.failPendingCheckpoint(CheckpointCoordinator.java:1436)
	at org.apache.flink.runtime.checkpoint.CheckpointCoordinator.abortPendingCheckpoints(CheckpointCoordinator.java:1266)
	at org.apache.flink.runtime.checkpoint.CheckpointCoordinator.stopCheckpointScheduler(CheckpointCoordinator.java:1253)
	at org.apache.flink.runtime.checkpoint.CheckpointCoordinatorDeActivator.jobStatusChanges(CheckpointCoordinatorDeActivator.java:46)
	at org.apache.flink.runtime.executiongraph.ExecutionGraph.notifyJobStatusChange(ExecutionGraph.java:1654)
	at org.apache.flink.runtime.executiongraph.ExecutionGraph.transitionState(ExecutionGraph.java:1236)
	at org.apache.flink.runtime.executiongraph.ExecutionGraph.transitionState(ExecutionGraph.java:1214)
	at org.apache.flink.runtime.scheduler.SchedulerBase.transitionExecutionGraphState(SchedulerBase.java:421)
	at org.apache.flink.runtime.scheduler.DefaultScheduler.addVerticesToRestartPending(DefaultScheduler.java:232)
	at org.apache.flink.runtime.scheduler.DefaultScheduler.restartTasksWithDelay(DefaultScheduler.java:219)
	at org.apache.flink.runtime.scheduler.DefaultScheduler.maybeRestartTasks(DefaultScheduler.java:207)
	at org.apache.flink.runtime.scheduler.DefaultScheduler.handleGlobalFailure(DefaultScheduler.java:202)
	at org.apache.flink.runtime.scheduler.UpdateSchedulerNgOnInternalFailuresListener.notifyGlobalFailure(UpdateSchedulerNgOnInternalFailuresListener.java:58)
	at org.apache.flink.runtime.executiongraph.ExecutionGraph.failGlobal(ExecutionGraph.java:1035)
	at org.apache.flink.runtime.executiongraph.ExecutionGraph$1.lambda$failJob$0(ExecutionGraph.java:468)
	... 22 more
Caused by: org.apache.flink.runtime.checkpoint.CheckpointException: Checkpoint Coordinator is suspending.
	at org.apache.flink.runtime.checkpoint.PendingCheckpoint.abort(PendingCheckpoint.java:428)
	... 38 more
ERROR [2020-06-06 13:45:16,007] ({Thread-1241} RemoteInterpreterServer.java[lambda$cancel$1]:802) - Fail to cancel paragraph: paragraph_1586733868269_783581378
 WARN [2020-06-06 13:45:16,283] ({pool-1-thread-3} JobManager.java[getJobProgress]:99) - Unable to get job progress for paragraph: paragraph_1586733868269_783581378, because no job is associated with this paragraph
 INFO [2020-06-06 13:45:16,742] ({pool-6-thread-1} AbstractStreamSqlJob.java[run]:245) - Refresh result of paragraph: paragraph_1586847370895_154139610
 WARN [2020-06-06 13:45:16,784] ({pool-1-thread-3} JobManager.java[getJobProgress]:99) - Unable to get job progress for paragraph: paragraph_1586733868269_783581378, because no job is associated with this paragraph
 WARN [2020-06-06 13:45:17,211] ({Thread-1240} JobManager.java[cancelJob]:137) - Fail to cancel job 7e5492f35c1a7f5dad7c805ba943ea52 that is associated with paragraph paragraph_1586733868269_783581378
java.util.concurrent.ExecutionException: java.util.concurrent.CompletionException: java.util.concurrent.CompletionException: org.apache.flink.runtime.checkpoint.CheckpointException: Checkpoint Coordinator is suspending.
	at java.util.concurrent.CompletableFuture.reportGet(CompletableFuture.java:357)
	at java.util.concurrent.CompletableFuture.get(CompletableFuture.java:1908)
	at org.apache.zeppelin.flink.JobManager.cancelJob(JobManager.java:129)
	at org.apache.zeppelin.flink.FlinkScalaInterpreter.cancel(FlinkScalaInterpreter.scala:648)
	at org.apache.zeppelin.flink.FlinkInterpreter.cancel(FlinkInterpreter.java:101)
	at org.apache.zeppelin.interpreter.LazyOpenInterpreter.cancel(LazyOpenInterpreter.java:119)
	at org.apache.zeppelin.interpreter.remote.RemoteInterpreterServer.lambda$cancel$1(RemoteInterpreterServer.java:800)
	at java.lang.Thread.run(Thread.java:748)
{code}

But in the Flink web UI, I see that one vertex has failed and another is cancelled.

!image-2020-06-06-15-39-35-441.png!

When I call the REST API to check the status of this job, I see that the job state is RUNNING. But the job just hangs there; it never recovers or does anything else.

{code:java}
{jid: "cc69431798db3e8a3541b4ec4c020e5d",name: "UnnamedTable_select url, count(1) as c from log group by url_0",isStoppable: false,state: "RUNNING",start-time: 1591351246553,end-time: -1,duration: 77611856,now: 1591428858409,
{code}
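For anyone trying to reproduce the status check described above, here is a minimal sketch of how the job state could be read over REST. It assumes the job details endpoint `GET /jobs/<jobid>` on the JobManager's REST address; the class name `JobStateCheck`, its method names, and the regex-based extraction are illustrative only and are not taken from Zeppelin or Flink. A real client would use a proper JSON parser.

```java
import java.io.IOException;
import java.io.InputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;
import java.util.Scanner;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper for reading a job's state from the job details response. */
public class JobStateCheck {

    // Finds the first "state" field. Quotes around the key are optional so the
    // pattern also accepts the unquoted rendering pasted in this report.
    // This is a simplification; it is not a JSON parser.
    private static final Pattern STATE =
            Pattern.compile("\"?state\"?\\s*:\\s*\"([A-Z_]+)\"");

    /** Returns the first job state found in the response body, or null if there is none. */
    static String extractState(String body) {
        Matcher m = STATE.matcher(body);
        return m.find() ? m.group(1) : null;
    }

    /** Fetches GET <restBase>/jobs/<jobId> and returns the raw response body. */
    static String fetchJobDetails(String restBase, String jobId) throws IOException {
        HttpURLConnection conn =
                (HttpURLConnection) new URL(restBase + "/jobs/" + jobId).openConnection();
        conn.setConnectTimeout(5000);
        conn.setReadTimeout(5000);
        try (InputStream in = conn.getInputStream();
             Scanner s = new Scanner(in, StandardCharsets.UTF_8.name()).useDelimiter("\\A")) {
            // The "\\A" delimiter makes the scanner return the whole stream as one token.
            return s.hasNext() ? s.next() : "";
        } finally {
            conn.disconnect();
        }
    }

    public static void main(String[] args) throws IOException {
        if (args.length < 2) {
            // No network access unless both arguments are supplied.
            System.out.println("usage: JobStateCheck <restBase> <jobId>");
            return;
        }
        System.out.println("state = " + extractState(fetchJobDetails(args[0], args[1])));
    }
}
```

For example, `java JobStateCheck http://localhost:8081 cc69431798db3e8a3541b4ec4c020e5d` would print the reported state, assuming the REST endpoint listens on that (hypothetical) address. Polling this while the web UI shows failed and cancelled vertices is one way to confirm the mismatch described in this report.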