[jira] [Created] (FLINK-15456) Job keeps failing on slot allocation timeout due to RM not allocating new TMs for slot requests

Shang Yuanchun (Jira)
Zhu Zhu created FLINK-15456:
-------------------------------

             Summary: Job keeps failing on slot allocation timeout due to RM not allocating new TMs for slot requests
                 Key: FLINK-15456
                 URL: https://issues.apache.org/jira/browse/FLINK-15456
             Project: Flink
          Issue Type: Bug
          Components: Runtime / Coordination
    Affects Versions: 1.10.0
            Reporter: Zhu Zhu
             Fix For: 1.10.0
         Attachments: jm_part.log

As shown in the attached JM log, the job tried to start 30 TMs, but only 29 registered. The job therefore fails because it cannot acquire all 30 required slots in time.
When failover happens and the tasks are re-scheduled, the RM does not request new TMs even though it cannot fulfill the slot requests. As a result, the job keeps failing with slot allocation timeouts.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)