[jira] [Created] (FLINK-20138) Flink Job can not recover due to timeout of requiring slots when flink jobmanager restarted


Shang Yuanchun (Jira)
wgcn created FLINK-20138:
----------------------------

             Summary: Flink Job can not recover due to timeout of requiring slots when flink jobmanager restarted
                 Key: FLINK-20138
                 URL: https://issues.apache.org/jira/browse/FLINK-20138
             Project: Flink
          Issue Type: Bug
          Components: Deployment / YARN, Table SQL / Runtime
         Environment: Flink: 1.9.2
Hadoop: 2.7.2
JDK: 1.8
            Reporter: wgcn
         Attachments: 2820F7EE-85F9-441D-95D5-8163FB6267DF.png

Our Flink jobs run on YARN in per-job mode. We stopped some NodeManager machines, and the ApplicationMasters (AMs) that had been running on those machines were restarted on other NodeManagers. After that, some jobs could not recover because their slot requests timed out.

SlotPoolImpl never connected to the ResourceManager:
```
2020-11-09 16:31:31,794                           INFO flink-akka.actor.default-dispatcher-16 (org.apache.flink.runtime.jobmaster.slotpool.SlotPoolImpl.stashRequestWaitingForResourceManager:369) - Cannot serve slot request, no ResourceManager connected. Adding as pending request [SlotRequestId{456c9daa6670a4490810f8e51f495174}]
```

1. The JobManager log in the attachment contains no entry showing YarnResourceManager requesting containers.
2. The ZooKeeper node is also shown in the attachment.
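
The two log observations above could be checked mechanically on a JobManager log. The following is only a minimal sketch: the function name `summarize_jobmanager_log` and the sample input are hypothetical, the regular expression assumes the message format of the single log line quoted in this report, and counting lines that mention `YarnResourceManager` is a rough stand-in for "no container request was logged", not a match on a specific Flink message.

```python
import re

# Pattern for the SlotPoolImpl message quoted in this report. The message
# text is assumed from that single line; other Flink versions may differ.
STASHED = re.compile(
    r"Cannot serve slot request, no ResourceManager connected\. "
    r"Adding as pending request \[SlotRequestId\{([0-9a-f]+)\}\]"
)


def summarize_jobmanager_log(lines):
    """Return (stashed slot request ids, count of lines mentioning YarnResourceManager)."""
    stashed_ids = []
    yarn_rm_lines = 0
    for line in lines:
        match = STASHED.search(line)
        if match:
            stashed_ids.append(match.group(1))
        # Rough check only: any line that names the class at all.
        if "YarnResourceManager" in line:
            yarn_rm_lines += 1
    return stashed_ids, yarn_rm_lines


# Hypothetical input: the one line from this report, whitespace normalized.
sample = [
    "2020-11-09 16:31:31,794 INFO flink-akka.actor.default-dispatcher-16 "
    "(org.apache.flink.runtime.jobmaster.slotpool.SlotPoolImpl."
    "stashRequestWaitingForResourceManager:369) - Cannot serve slot request, "
    "no ResourceManager connected. Adding as pending request "
    "[SlotRequestId{456c9daa6670a4490810f8e51f495174}]",
]
ids, yarn_rm_lines = summarize_jobmanager_log(sample)
print(ids, yarn_rm_lines)
```

For a real log, the lines could be supplied with `open(path)` instead of the sample list. Many stashed request IDs together with zero `YarnResourceManager` lines would match the symptom described here.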





--
This message was sent by Atlassian Jira
(v8.3.4#803005)