[jira] [Created] (FLINK-20865) Prevent potential resource deadlock in fine-grained resource management

Shang Yuanchun (Jira)
Yangze Guo created FLINK-20865:
----------------------------------

             Summary: Prevent potential resource deadlock in fine-grained resource management
                 Key: FLINK-20865
                 URL: https://issues.apache.org/jira/browse/FLINK-20865
             Project: Flink
          Issue Type: Improvement
          Components: Runtime / Coordination
            Reporter: Yangze Guo
             Fix For: 1.13.0
         Attachments: 屏幕快照 2021-01-06 下午2.32.57.png

!屏幕快照 2021-01-06 下午2.32.57.png|width=954,height=288!

The figure above shows how a scheduling dependency can cause a deadlock. For the given topology, the scheduler initially requests 4 slots, one each for A, B, C and D. Assume only 2 slots are available. If both slots are assigned to Pipeline Region 0 (as shown on the left), A and B finish execution first, then C and D are executed, and finally E is executed. However, if the 2 slots are initially assigned to A and C (as shown on the right), neither A nor C can finish execution, because B and D, which consume the data they produce, never get a slot.
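The two assignments can be checked with a small model. This is only an illustrative sketch, not Flink code: the region layout (A and B in one pipeline region, C and D in another) is inferred from the description, the names `REGIONS`, `runnable_regions` and `is_deadlocked` are hypothetical, and the model assumes that every available slot is already taken and that a task keeps its slot until its whole region has finished.

```python
# Pipeline regions inferred from the description (assumption, the figure is not reproduced).
REGIONS = {
    "region0": {"A", "B"},  # A streams its output to B
    "region1": {"C", "D"},  # C streams its output to D
}


def runnable_regions(slot_holders, regions=REGIONS):
    """Return the regions in which every task holds a slot."""
    holders = set(slot_holders)
    return sorted(name for name, tasks in regions.items() if tasks <= holders)


def is_deadlocked(slot_holders, regions=REGIONS):
    """With all slots taken, the job is stuck if no region is fully deployed."""
    return not runnable_regions(slot_holders, regions)


if __name__ == "__main__":
    print(runnable_regions(["A", "B"]), is_deadlocked(["A", "B"]))
    print(runnable_regions(["A", "C"]), is_deadlocked(["A", "C"]))
```

Assigning the slots to A and B leaves one region fully deployed, so it can run to completion and free its slots. Assigning them to A and C leaves no region fully deployed, and no slot is ever released.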

Currently, with coarse-grained resource management, the scheduler guarantees that it always finishes fulfilling the requirements of one pipeline region before it starts fulfilling those of another. The deadlock case shown on the right of the figure therefore cannot happen.

However, fine-grained resource management provides no such guarantee. Because the resource requirements of slot sharing groups (SSGs) can differ, there is no control over which requirements are fulfilled first when there are not enough resources to fulfill all of them. Therefore, it is not always possible to fulfill one pipeline region before another.

To solve this problem, for fine-grained resource management we can make the scheduler defer requesting slots for other SSGs until the requirements of the current SSG are fulfilled, at the price of longer scheduling time.
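The deferral could look roughly like the following. This is a minimal sketch under stated assumptions, not the Flink scheduler API: `DeferredSlotRequester`, `on_slot_allocated` and the `(group_name, num_slots)` requirement format are hypothetical, and every group is assumed to need at least one slot.

```python
from collections import deque


class DeferredSlotRequester:
    """Sends slot requests for one SSG at a time (hypothetical helper)."""

    def __init__(self, requirements):
        # requirements: list of (group_name, num_slots) in scheduling order
        self._pending = deque(requirements)
        self._outstanding = 0  # slots still missing for the current group
        self.sent = []         # groups whose requests have been sent so far
        self._request_next()

    def _request_next(self):
        # Send the request for the next group, if any is left.
        if self._pending:
            name, num_slots = self._pending.popleft()
            self._outstanding = num_slots
            self.sent.append(name)

    def on_slot_allocated(self):
        # Called once per slot granted to the current group.
        if self._outstanding == 0:
            raise RuntimeError("no outstanding slot request")
        self._outstanding -= 1
        if self._outstanding == 0:
            # Current group fully fulfilled: only now request the next one.
            self._request_next()
```

Because the request for the second group is not sent until the first group has all of its slots, the available resources cannot be split between two partially fulfilled groups. The cost is the one named above: requests are issued one after another instead of all at once.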


