[jira] [Created] (FLINK-18167) Flink Job hangs there when one vertex is failed and another is cancelled.

classic Classic list List threaded Threaded
1 message Options
Reply | Threaded
Open this post in threaded view
|

[jira] [Created] (FLINK-18167) Flink Job hangs there when one vertex is failed and another is cancelled.

Shang Yuanchun (Jira)
Jeff Zhang created FLINK-18167:
----------------------------------

             Summary: Flink Job hangs there when one vertex is failed and another is cancelled.
                 Key: FLINK-18167
                 URL: https://issues.apache.org/jira/browse/FLINK-18167
             Project: Flink
          Issue Type: Bug
          Components: Runtime / Checkpointing
    Affects Versions: 1.10.0
            Reporter: Jeff Zhang
         Attachments: image-2020-06-06-15-39-35-441.png

After I call cancel with savepoint, the cancel operation is failed. The following is what I see in client side. 
{code:java}

WARN [2020-06-06 13:45:16,003] ({Thread-1241} JobManager.java[cancelJob]:137) - Fail to cancel job 7e5492f35c1a7f5dad7c805ba943ea52 that is associated with paragraph paragraph_1586733868269_783581378
java.util.concurrent.ExecutionException: java.util.concurrent.CompletionException: java.util.concurrent.CompletionException: org.apache.flink.runtime.checkpoint.CheckpointException: Checkpoint Coordinator is suspending.
        at java.util.concurrent.CompletableFuture.reportGet(CompletableFuture.java:357)
        at java.util.concurrent.CompletableFuture.get(CompletableFuture.java:1908)
        at org.apache.zeppelin.flink.JobManager.cancelJob(JobManager.java:129)
        at org.apache.zeppelin.flink.FlinkScalaInterpreter.cancel(FlinkScalaInterpreter.scala:648)
        at org.apache.zeppelin.flink.FlinkInterpreter.cancel(FlinkInterpreter.java:101)
        at org.apache.zeppelin.interpreter.LazyOpenInterpreter.cancel(LazyOpenInterpreter.java:119)
        at org.apache.zeppelin.interpreter.remote.RemoteInterpreterServer.lambda$cancel$1(RemoteInterpreterServer.java:800)
        at java.lang.Thread.run(Thread.java:748)
Caused by: java.util.concurrent.CompletionException: java.util.concurrent.CompletionException: org.apache.flink.runtime.checkpoint.CheckpointException: Checkpoint Coordinator is suspending.
        at org.apache.flink.runtime.scheduler.SchedulerBase.lambda$stopWithSavepoint$9(SchedulerBase.java:873)
        at java.util.concurrent.CompletableFuture.uniHandle(CompletableFuture.java:836)
        at java.util.concurrent.CompletableFuture$UniHandle.tryFire(CompletableFuture.java:811)
        at java.util.concurrent.CompletableFuture$Completion.run(CompletableFuture.java:456)
        at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRunAsync(AkkaRpcActor.java:397)
        at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRpcMessage(AkkaRpcActor.java:190)
        at org.apache.flink.runtime.rpc.akka.FencedAkkaRpcActor.handleRpcMessage(FencedAkkaRpcActor.java:74)
        at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleMessage(AkkaRpcActor.java:152)
        at akka.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:26)
        at akka.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:21)
        at scala.PartialFunction$class.applyOrElse(PartialFunction.scala:123)
        at akka.japi.pf.UnitCaseStatement.applyOrElse(CaseStatements.scala:21)
        at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:170)
        at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:171)
        at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:171)
        at akka.actor.Actor$class.aroundReceive(Actor.scala:517)
        at akka.actor.AbstractActor.aroundReceive(AbstractActor.scala:225)
        at akka.actor.ActorCell.receiveMessage(ActorCell.scala:592)
        at akka.actor.ActorCell.invoke(ActorCell.scala:561)
        at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:258)
        at akka.dispatch.Mailbox.run(Mailbox.scala:225)
        at akka.dispatch.Mailbox.exec(Mailbox.scala:235)
        at akka.dispatch.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
        at akka.dispatch.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
        at akka.dispatch.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
        at akka.dispatch.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
Caused by: java.util.concurrent.CompletionException: org.apache.flink.runtime.checkpoint.CheckpointException: Checkpoint Coordinator is suspending.
        at org.apache.flink.runtime.checkpoint.CheckpointCoordinator.lambda$triggerSynchronousSavepoint$0(CheckpointCoordinator.java:428)
        at java.util.concurrent.CompletableFuture.uniHandle(CompletableFuture.java:836)
        at java.util.concurrent.CompletableFuture$UniHandle.tryFire(CompletableFuture.java:811)
        at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:488)
        at java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:1990)
        at org.apache.flink.runtime.checkpoint.CheckpointCoordinator.lambda$null$1(CheckpointCoordinator.java:457)
        at java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:774)
        at java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(CompletableFuture.java:750)
        at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:488)
        at java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:1990)
        at org.apache.flink.runtime.checkpoint.PendingCheckpoint.abort(PendingCheckpoint.java:429)
        at org.apache.flink.runtime.checkpoint.CheckpointCoordinator.failPendingCheckpoint(CheckpointCoordinator.java:1445)
        at org.apache.flink.runtime.checkpoint.CheckpointCoordinator.failPendingCheckpoint(CheckpointCoordinator.java:1436)
        at org.apache.flink.runtime.checkpoint.CheckpointCoordinator.abortPendingCheckpoints(CheckpointCoordinator.java:1266)
        at org.apache.flink.runtime.checkpoint.CheckpointCoordinator.stopCheckpointScheduler(CheckpointCoordinator.java:1253)
        at org.apache.flink.runtime.checkpoint.CheckpointCoordinatorDeActivator.jobStatusChanges(CheckpointCoordinatorDeActivator.java:46)
        at org.apache.flink.runtime.executiongraph.ExecutionGraph.notifyJobStatusChange(ExecutionGraph.java:1654)
        at org.apache.flink.runtime.executiongraph.ExecutionGraph.transitionState(ExecutionGraph.java:1236)
        at org.apache.flink.runtime.executiongraph.ExecutionGraph.transitionState(ExecutionGraph.java:1214)
        at org.apache.flink.runtime.scheduler.SchedulerBase.transitionExecutionGraphState(SchedulerBase.java:421)
        at org.apache.flink.runtime.scheduler.DefaultScheduler.addVerticesToRestartPending(DefaultScheduler.java:232)
        at org.apache.flink.runtime.scheduler.DefaultScheduler.restartTasksWithDelay(DefaultScheduler.java:219)
        at org.apache.flink.runtime.scheduler.DefaultScheduler.maybeRestartTasks(DefaultScheduler.java:207)
        at org.apache.flink.runtime.scheduler.DefaultScheduler.handleGlobalFailure(DefaultScheduler.java:202)
        at org.apache.flink.runtime.scheduler.UpdateSchedulerNgOnInternalFailuresListener.notifyGlobalFailure(UpdateSchedulerNgOnInternalFailuresListener.java:58)
        at org.apache.flink.runtime.executiongraph.ExecutionGraph.failGlobal(ExecutionGraph.java:1035)
        at org.apache.flink.runtime.executiongraph.ExecutionGraph$1.lambda$failJob$0(ExecutionGraph.java:468)
        ... 22 more
Caused by: org.apache.flink.runtime.checkpoint.CheckpointException: Checkpoint Coordinator is suspending.
        at org.apache.flink.runtime.checkpoint.PendingCheckpoint.abort(PendingCheckpoint.java:428)
        ... 38 more
ERROR [2020-06-06 13:45:16,007] ({Thread-1241} RemoteInterpreterServer.java[lambda$cancel$1]:802) - Fail to cancel paragraph: paragraph_1586733868269_783581378
 WARN [2020-06-06 13:45:16,283] ({pool-1-thread-3} JobManager.java[getJobProgress]:99) - Unable to get job progress for paragraph: paragraph_1586733868269_783581378, because no job is associated with this paragraph
 INFO [2020-06-06 13:45:16,742] ({pool-6-thread-1} AbstractStreamSqlJob.java[run]:245) - Refresh result of paragraph: paragraph_1586847370895_154139610
 WARN [2020-06-06 13:45:16,784] ({pool-1-thread-3} JobManager.java[getJobProgress]:99) - Unable to get job progress for paragraph: paragraph_1586733868269_783581378, because no job is associated with this paragraph
 WARN [2020-06-06 13:45:17,211] ({Thread-1240} JobManager.java[cancelJob]:137) - Fail to cancel job 7e5492f35c1a7f5dad7c805ba943ea52 that is associated with paragraph paragraph_1586733868269_783581378
java.util.concurrent.ExecutionException: java.util.concurrent.CompletionException: java.util.concurrent.CompletionException: org.apache.flink.runtime.checkpoint.CheckpointException: Checkpoint Coordinator is suspending.
        at java.util.concurrent.CompletableFuture.reportGet(CompletableFuture.java:357)
        at java.util.concurrent.CompletableFuture.get(CompletableFuture.java:1908)
        at org.apache.zeppelin.flink.JobManager.cancelJob(JobManager.java:129)
        at org.apache.zeppelin.flink.FlinkScalaInterpreter.cancel(FlinkScalaInterpreter.scala:648)
        at org.apache.zeppelin.flink.FlinkInterpreter.cancel(FlinkInterpreter.java:101)
        at org.apache.zeppelin.interpreter.LazyOpenInterpreter.cancel(LazyOpenInterpreter.java:119)
        at org.apache.zeppelin.interpreter.remote.RemoteInterpreterServer.lambda$cancel$1(RemoteInterpreterServer.java:800)
        at java.lang.Thread.run(Thread.java:748) {code}
But in the flink web UI, I see that one vertex is failed and another is cancelled. 

!image-2020-06-06-15-39-35-441.png!

And when I call rest api for check the status of this job. I see that the job state is RUNNING. But this job just hangs there, never recover or do anything else.
{code:java}

{jid: "cc69431798db3e8a3541b4ec4c020e5d",name: "UnnamedTable_select url, count(1) as c from log group by url_0",isStoppable: false,state: "RUNNING",start-time: 1591351246553,end-time: -1,duration: 77611856,now: 1591428858409, {code}
 

 

 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)