Jin Xing created FLINK-22676:
--------------------------------
Summary: The partition tracker should support remote shuffle properly
Key: FLINK-22676
URL: https://issues.apache.org/jira/browse/FLINK-22676
Project: Flink
Issue Type: Sub-task
Components: Runtime / Network
Reporter: Jin Xing
Currently, Flink binds each data partition to the ResourceID of the producing TM in Execution#startTrackingPartitions, and the partition tracker stops tracking the corresponding partitions when that TM disconnects (JobMaster#disconnectTaskManager). In other words, the lifecycle of shuffle data is bound to the computing resource (TM). This works for the internal shuffle service, but not for a remote shuffle service. When shuffle data is stored remotely, the lifecycle of a completed partition can be decoupled from the TM: a TM can safely be released once no computing tasks are running on it, and subsequent shuffle read requests can be directed to the remote shuffle cluster. In addition, when a TM is lost, its completed data partitions on the remote shuffle cluster do not need to be reproduced.
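To make the intended behavior concrete, here is a minimal, self-contained sketch of a tracker that only drops partitions whose data actually lives on the disconnected TM. It is not Flink's implementation: the class PartitionTrackerSketch, the ShuffleLocation enum, and all method signatures are hypothetical names chosen for illustration, and plain strings stand in for ResourceID and partition IDs.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

/** Hypothetical model of a partition tracker; not the Flink API. */
public class PartitionTrackerSketch {

    /** Assumption: each partition records where its shuffle data is stored. */
    public enum ShuffleLocation { TASK_MANAGER_LOCAL, REMOTE }

    private static final class TrackedPartition {
        final String producerTaskManagerId;
        final ShuffleLocation location;

        TrackedPartition(String producerTaskManagerId, ShuffleLocation location) {
            this.producerTaskManagerId = producerTaskManagerId;
            this.location = location;
        }
    }

    // partition id -> producer TM and storage location
    private final Map<String, TrackedPartition> partitions = new HashMap<>();

    public void startTrackingPartition(
            String taskManagerId, String partitionId, ShuffleLocation location) {
        partitions.put(partitionId, new TrackedPartition(taskManagerId, location));
    }

    /**
     * Called when a TM disconnects. Only partitions stored on that TM are
     * dropped; partitions held by the remote shuffle cluster stay tracked.
     */
    public Set<String> stopTrackingPartitionsFor(String taskManagerId) {
        Set<String> released = new HashSet<>();
        for (Map.Entry<String, TrackedPartition> entry : partitions.entrySet()) {
            TrackedPartition p = entry.getValue();
            if (p.location == ShuffleLocation.TASK_MANAGER_LOCAL
                    && p.producerTaskManagerId.equals(taskManagerId)) {
                released.add(entry.getKey());
            }
        }
        partitions.keySet().removeAll(released);
        return released;
    }

    public boolean isTracked(String partitionId) {
        return partitions.containsKey(partitionId);
    }

    public static void main(String[] args) {
        PartitionTrackerSketch tracker = new PartitionTrackerSketch();
        tracker.startTrackingPartition("tm-1", "p-local", ShuffleLocation.TASK_MANAGER_LOCAL);
        tracker.startTrackingPartition("tm-1", "p-remote", ShuffleLocation.REMOTE);

        // Simulate the TM disconnecting.
        Set<String> released = tracker.stopTrackingPartitionsFor("tm-1");

        System.out.println("released: " + released);
        System.out.println("p-remote still tracked: " + tracker.isTracked("p-remote"));
    }
}
```

In this sketch the decision is made per partition rather than per TM, so the same tracker can serve both the internal and a remote shuffle service. Whether the real fix should key on the shuffle descriptor, the shuffle master, or a separate tracker implementation is left open here.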
--
This message was sent by Atlassian Jira
(v8.3.4#803005)