[jira] [Created] (FLINK-16429) failed to restore flink job from checkpoints due to unhandled exceptions

classic Classic list List threaded Threaded
1 message Options
Reply | Threaded
Open this post in threaded view
|

[jira] [Created] (FLINK-16429) failed to restore flink job from checkpoints due to unhandled exceptions

Shang Yuanchun (Jira)
Yu Yang created FLINK-16429:
-------------------------------

             Summary: failed to restore flink job from checkpoints due to unhandled exceptions
                 Key: FLINK-16429
                 URL: https://issues.apache.org/jira/browse/FLINK-16429
             Project: Flink
          Issue Type: Bug
          Components: Runtime / Coordination
    Affects Versions: 1.9.1
            Reporter: Yu Yang


We are trying to restore our flink job from check-points, and run into AskTimeoutException related failures at a high frequency. Our environment is Hadoop 2.7.1 + Yarn + Flink 1.9.1. 

We hit this issue in 9 out of 10 runs, and were able to restore the application from given check-points from time to time. As the application can be restored, the check-point files shall not be corrupted. It seems that the issue is that jobmaster got timeout when it handles job submission request.  

 

Below is the exception stack trace, it is thrown from

[https://github.com/apache/flink/blob/2ec645a5bfd3cfadaf0057412401e91da0b21873/flink-runtime/src/main/java/org/apache/flink/runtime/rest/handler/AbstractHandler.java#L209]

2020-03-05 00:04:14,360 ERROR org.apache.flink.runtime.rest.handler.job.JobSubmitHandler - Unhandled exception: httpRequest uri:/v1/jobs, context: ChannelHandlerContext(org.apache.flink.runtime.rest.handler.router.RouterHandler_ROUTED_HANDLER, [id: 0xc39aca33, L:/10.1.85.22:41000 - R:/10.1.16.251:44]) akka.pattern.AskTimeoutException: Ask timed out on [Actor[akka://flink/user/dispatcher#-34498396]] after [10000 ms]. Message of type [org.apache.flink.runtime.rpc.messages.LocalFencedMessage]. A typical reason for `AskTimeoutException` is that the recipient actor didn't send a reply. at akka.pattern.PromiseActorRef$$anonfun$2.apply(AskSupport.scala:635) at akka.pattern.PromiseActorRef$$anonfun$2.apply(AskSupport.scala:635) at akka.pattern.PromiseActorRef$$anonfun$1.apply$mcV$sp(AskSupport.scala:648) at akka.actor.Scheduler$$anon$4.run(Scheduler.scala:205) at scala.concurrent.Future$InternalCallbackExecutor$.unbatchedExecute(Future.scala:601) at scala.concurrent.BatchingExecutor$class.execute(BatchingExecutor.scala:109) at scala.concurrent.Future$InternalCallbackExecutor$.execute(Future.scala:599) at akka.actor.LightArrayRevolverScheduler$TaskHolder.executeTask(LightArrayRevolverScheduler.scala:328) at akka.actor.LightArrayRevolverScheduler$$anon$4.executeBucket$1(LightArrayRevolverScheduler.scala:279) at akka.actor.LightArrayRevolverScheduler$$anon$4.nextTick(LightArrayRevolverScheduler.scala:283) at akka.actor.LightArrayRevolverScheduler$$anon$4.run(LightArrayRevolverScheduler.scala:235) at java.lang.Thread.run(Thread.java:748)



--
This message was sent by Atlassian Jira
(v8.3.4#803005)