Zhenqiu Huang created FLINK-15416:
-------------------------------------
Summary: Add Retry Mechanism for PartitionRequestClientFactory.ConnectingChannel
Key: FLINK-15416
URL:
https://issues.apache.org/jira/browse/FLINK-15416 Project: Flink
Issue Type: Wish
Components: Runtime / Network
Affects Versions: 1.10.0
Reporter: Zhenqiu Huang
We run a flink with 256 TMs in production. The job internally has keyby logic. Thus, it builds a 256 * 256 communication channels. An outage happened when there is a chip internal link of one of the network switchs broken that connecting these machines. During the outage, the flink can't restart successfully as there is always an exception like "Connecting the channel failed: Connecting to remote task manager + '****/10.14.139.6:41300' has failed. This might indicate that the remote task manager has been lost.
After deep investigation with the network infrastructure team, we found there are 6 switchs connecting with these machines. Each switch has 32 physcal links. Every socket is round-robin assigned to each of links for load balances. Thus, there is always average 256 * 256 / 6 * 32 * 2 = 170 channels will be assigned to the broken link. The issue lasted for 4 hours until we found the broken link and restart the problematic switch.
Given this, we found that the retry of creating channel will help to resolve this issue. For our networking topology, we can set retry to 2. As 170 / (132 * 132) < 1, which means after retry twice no channel in 170 channels will be assigned to the broken link in the average case.
I think it is valuable fix for this kind of partial network partition.
--
This message was sent by Atlassian Jira
(v8.3.4#803005)