[jira] [Created] (FLINK-15416) Add Retry Mechanism for PartitionRequestClientFactory.ConnectingChannel

classic Classic list List threaded Threaded
1 message Options
Reply | Threaded
Open this post in threaded view
|

[jira] [Created] (FLINK-15416) Add Retry Mechanism for PartitionRequestClientFactory.ConnectingChannel

Shang Yuanchun (Jira)
Zhenqiu Huang created FLINK-15416:
-------------------------------------

             Summary: Add Retry Mechanism for PartitionRequestClientFactory.ConnectingChannel
                 Key: FLINK-15416
                 URL: https://issues.apache.org/jira/browse/FLINK-15416
             Project: Flink
          Issue Type: Wish
          Components: Runtime / Network
    Affects Versions: 1.10.0
            Reporter: Zhenqiu Huang


We run a flink with 256 TMs in production. The job internally has keyby logic. Thus, it builds a 256 * 256 communication channels. An outage happened when there is a chip internal link of one of the network switchs broken that connecting these machines. During the outage, the flink can't restart successfully as there is always an exception like  "Connecting the channel failed: Connecting to remote task manager + '****/10.14.139.6:41300' has failed. This might indicate that the remote task manager has been lost.

After deep investigation with the network infrastructure team, we found there are 6 switchs connecting with these machines. Each switch has 32 physcal links. Every socket is round-robin assigned to each of links for load balances. Thus, there is always average 256 * 256 / 6 * 32  * 2 = 170 channels will be assigned to the broken link. The issue lasted for 4 hours until we found the broken link and restart the problematic switch.

Given this, we found that the retry of creating channel will help to resolve this issue. For our networking topology, we can set retry to 2. As 170 / (132 * 132) < 1, which means after retry twice no channel in 170 channels will be assigned to the broken link in the average case.

I think it is valuable fix for this kind of partial network partition.





--
This message was sent by Atlassian Jira
(v8.3.4#803005)