[jira] [Created] (FLINK-9351) RM stop assigning slot to Job because the TM killed before connecting to JM successfully


Shang Yuanchun (Jira)
Sihua Zhou created FLINK-9351:
---------------------------------

             Summary: RM stop assigning slot to Job because the TM killed before connecting to JM successfully
                 Key: FLINK-9351
                 URL: https://issues.apache.org/jira/browse/FLINK-9351
             Project: Flink
          Issue Type: Bug
          Components: Distributed Coordination
    Affects Versions: 1.5.0
            Reporter: Sihua Zhou


The steps that lead to the problem are as follows (copied from Stephan's comment in [pull request 5931|https://github.com/apache/flink/pull/5931]):

1. The JobMaster / SlotPool requests a slot (AllocationID) from the ResourceManager.
2. The ResourceManager starts a container with a TaskManager.
3. The TaskManager registers at the ResourceManager, which tells the TaskManager to push a slot to the JobManager.
4. The TaskManager container is killed.
5. The ResourceManager does not queue back the slot requests (AllocationIDs) that it sent to the previous TaskManager, so the requests are lost and have to time out before another attempt is made.
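
To make the sequence concrete, here is a minimal toy model of it in Java. It is only a sketch and not Flink code: ToyResourceManager, its methods, and the "requeue" flag are hypothetical names invented for this illustration, and the real ResourceManager / SlotManager logic is more involved. With requeue = false the request disappears when the TaskManager is lost (the behaviour described above); with requeue = true it goes back into the pending queue, which is one possible way to handle the loss.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy model only: hypothetical names, not Flink's ResourceManager or SlotManager.
class ToyResourceManager {
    // Slot requests (identified by allocation ID) still waiting for a TaskManager.
    private final Deque<String> pending = new ArrayDeque<>();
    // Requests already sent to a TaskManager, keyed by TaskManager ID.
    private final Map<String, List<String>> inFlight = new HashMap<>();

    // Step 1: the JobMaster / SlotPool asks for a slot.
    void requestSlot(String allocationId) {
        pending.addLast(allocationId);
    }

    // Step 3: a TaskManager registers and is given every pending request.
    void onTaskManagerRegistered(String taskManagerId) {
        List<String> assigned =
                inFlight.computeIfAbsent(taskManagerId, k -> new ArrayList<>());
        while (!pending.isEmpty()) {
            assigned.add(pending.pollFirst());
        }
    }

    // Step 4: the TaskManager is killed before it offered its slots.
    // If requeue is false, its in-flight requests are simply dropped (step 5).
    void onTaskManagerLost(String taskManagerId, boolean requeue) {
        List<String> lost = inFlight.remove(taskManagerId);
        if (lost != null && requeue) {
            for (String allocationId : lost) {
                pending.addLast(allocationId);
            }
        }
    }

    // Snapshot of the requests that are still waiting for a TaskManager.
    List<String> pendingRequests() {
        return new ArrayList<>(pending);
    }

    public static void main(String[] args) {
        ToyResourceManager dropping = new ToyResourceManager();
        dropping.requestSlot("alloc-1");
        dropping.onTaskManagerRegistered("tm-1");
        dropping.onTaskManagerLost("tm-1", false);
        System.out.println(dropping.pendingRequests()); // prints []

        ToyResourceManager requeueing = new ToyResourceManager();
        requeueing.requestSlot("alloc-1");
        requeueing.onTaskManagerRegistered("tm-1");
        requeueing.onTaskManagerLost("tm-1", true);
        System.out.println(requeueing.pendingRequests()); // prints [alloc-1]
    }
}
```

In the first run nothing is left for a replacement TaskManager to pick up, so in this model the job would only make progress once the original request times out on the JobMaster side and is sent again.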



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)