[jira] [Created] (FLINK-6160) Retry JobManager/ResourceManager connection in case of timeout

Shang Yuanchun (Jira)
Till Rohrmann created FLINK-6160:
------------------------------------

             Summary:  Retry JobManager/ResourceManager connection in case of timeout
                 Key: FLINK-6160
                 URL: https://issues.apache.org/jira/browse/FLINK-6160
             Project: Flink
          Issue Type: Sub-task
          Components: Distributed Coordination
    Affects Versions: 1.3.0
            Reporter: Till Rohrmann
             Fix For: 1.3.0


In case of a heartbeat timeout, the {{TaskExecutor}} closes the connection to the remote component. It also assumes that the component has actually failed, so it only tries to reconnect once it is notified about a new leader address and leader session id. This is brittle, because a heartbeat can also time out without the component having crashed. We should therefore add an automatic retry against the latest known leader address information in case of a timeout.
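
The proposed behaviour could look roughly like the following minimal sketch. It is not Flink code: the class {{LeaderReconnectSketch}}, its methods, the {{connector}} callback and the {{maxRetries}} bound are all hypothetical names chosen for illustration, and the retry loop is synchronous with no backoff, whereas a real implementation would presumably schedule attempts asynchronously.

```java
import java.util.UUID;
import java.util.function.BiPredicate;

/** Illustrative sketch only; none of these names are Flink's actual classes. */
final class LeaderReconnectSketch {

    // Latest leader information this component was notified about.
    private String leaderAddress;
    private UUID leaderSessionId;
    private boolean connected;

    // Hypothetical connection attempt: returns true if connecting succeeded.
    private final BiPredicate<String, UUID> connector;
    // Assumed upper bound on retry attempts after a timeout.
    private final int maxRetries;

    LeaderReconnectSketch(BiPredicate<String, UUID> connector, int maxRetries) {
        this.connector = connector;
        this.maxRetries = maxRetries;
    }

    /** A new leader was announced: remember it and connect once. */
    void notifyNewLeader(String address, UUID sessionId) {
        this.leaderAddress = address;
        this.leaderSessionId = sessionId;
        this.connected = connector.test(address, sessionId);
    }

    /**
     * Heartbeat timed out: drop the connection, then retry against the
     * latest known leader instead of waiting for a new leader notification.
     */
    boolean onHeartbeatTimeout() {
        connected = false;
        if (leaderAddress == null) {
            return false; // no leader known yet, nothing to retry against
        }
        for (int attempt = 0; attempt < maxRetries && !connected; attempt++) {
            connected = connector.test(leaderAddress, leaderSessionId);
        }
        return connected;
    }

    boolean isConnected() {
        return connected;
    }
}
```

In this sketch the last known address and session id are kept after the timeout, so a component that was merely slow can be reached again without a new leader election. If all attempts fail, the object stays disconnected until {{notifyNewLeader}} is called again, which matches the current behaviour.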



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)