[jira] [Created] (FLINK-16215) Start redundant TaskExecutor when JM failed

Shang Yuanchun (Jira)
YufeiLiu created FLINK-16215:
--------------------------------

             Summary: Start redundant TaskExecutor when JM failed
                 Key: FLINK-16215
                 URL: https://issues.apache.org/jira/browse/FLINK-16215
             Project: Flink
          Issue Type: Bug
          Components: Runtime / Coordination
    Affects Versions: 1.10.0
            Reporter: YufeiLiu


When the JM fails, the TaskExecutors reconnect to the new ResourceManager leader, and the JobMaster restarts and reschedules the job. If the job's slot requests arrive before the TMs have registered, the RM starts new workers rather than reusing the existing TMs.
It's hard to reproduce because TM registration usually comes first, and the timeout check will stop the redundant TMs.
But I think it would be better to make {{recoverWokerNode}} an interface method and put the recovered slots in {{pendingSlots}}, where they wait for TM reconnection.
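
To illustrate the race and the proposed fix, here is a toy model in Java. It is only a sketch, not Flink's actual SlotManager: the class {{ToySlotManager}} and its methods are hypothetical names made up for this example, and it assumes one slot per worker. Only the idea of parking recovered slots as pending until the TMs reconnect comes from the description above.

```java
/** Toy model only; not Flink code. All names below are hypothetical. */
public class RecoverySketch {

    static class ToySlotManager {
        private int freeSlots = 0;       // slots of TMs that have registered and are idle
        private int pendingSlots = 0;    // slots expected from workers that have not registered yet
        private int waitingRequests = 0; // requests parked until a worker registers
        int newWorkersStarted = 0;

        /** Proposed step: after failover, count a previously started worker as pending. */
        void recoverWorker() {
            pendingSlots++;
        }

        /** A slot request from the restarted JobMaster. */
        void requestSlot() {
            if (freeSlots > 0) {
                freeSlots--;             // reuse a registered TM
            } else if (pendingSlots > 0) {
                pendingSlots--;          // wait for a recovered TM to reconnect
                waitingRequests++;
            } else {
                newWorkersStarted++;     // nothing known: start a new worker
                waitingRequests++;
            }
        }

        /** A TM (re)connects and offers its single slot. */
        void registerTaskManager() {
            if (waitingRequests > 0) {
                waitingRequests--;       // slot goes straight to a waiting request
            } else {
                if (pendingSlots > 0) {
                    pendingSlots--;      // pending slot becomes a real one
                }
                freeSlots++;
            }
        }
    }

    /** Slot requests arrive first, the old TMs reconnect afterwards. */
    static int workersStartedAfterFailover(boolean recoverPreviousWorkers, int previousWorkers) {
        ToySlotManager sm = new ToySlotManager();
        if (recoverPreviousWorkers) {
            for (int i = 0; i < previousWorkers; i++) {
                sm.recoverWorker();
            }
        }
        for (int i = 0; i < previousWorkers; i++) {
            sm.requestSlot();
        }
        for (int i = 0; i < previousWorkers; i++) {
            sm.registerTaskManager();
        }
        return sm.newWorkersStarted;
    }

    public static void main(String[] args) {
        System.out.println("without recovery: " + workersStartedAfterFailover(false, 3));
        System.out.println("with recovery:    " + workersStartedAfterFailover(true, 3));
    }
}
```

In this model, three requests that arrive before the three old TMs reconnect start three redundant workers; with the recovered slots counted as pending, no new worker is started.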




--
This message was sent by Atlassian Jira
(v8.3.4#803005)