[jira] [Created] (FLINK-19142) Investigate slot hijacking from preceding pipelined regions after failover

Shang Yuanchun (Jira)
Andrey Zagrebin created FLINK-19142:
---------------------------------------

             Summary: Investigate slot hijacking from preceding pipelined regions after failover
                 Key: FLINK-19142
                 URL: https://issues.apache.org/jira/browse/FLINK-19142
             Project: Flink
          Issue Type: Improvement
            Reporter: Andrey Zagrebin


The ticket originates from [this PR discussion|https://github.com/apache/flink/pull/13181#discussion_r481087221].

PreviousAllocationSlotSelectionStrategy uses the previous AllocationIDs to schedule each subtask into the slot where it ran before a failover. If a subtask's previous slot (AllocationID) is not available, we do not want it to take the previous slots (AllocationIDs) of other subtasks.

The MergingSharedSlotProfileRetriever gets the previous AllocationIDs from SlotSharingExecutionSlotAllocator, but only those of the current bulk. The previous AllocationIDs of other bulks stay unknown to it. Therefore, the current bulk can potentially hijack the previous slots of preceding bulks. On the other hand, the previous AllocationIDs of other tasks should be free to take if those tasks are not going to run at the same time, e.g. because there are not enough resources after the failover or because the other bulks are already done.

One way to do this may be to give MergingSharedSlotProfileRetriever the previous AllocationIDs of all bulks that are going to run at the same time.
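To make the concern concrete, here is a minimal Java sketch of the selection rule described above. It is not Flink code: the class SlotSelectionSketch, the method select, and the use of plain strings as allocation IDs are all hypothetical, and returning an empty result when only reserved slots remain is an assumption of this sketch rather than a statement about what the scheduler does.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashSet;
import java.util.Optional;
import java.util.Set;

// Hypothetical illustration only; none of these names are Flink classes.
final class SlotSelectionSketch {

    /**
     * Picks a slot for one subtask.
     *
     * @param previous  allocation ID the subtask ran in before the failover, or null
     * @param available allocation IDs of the currently free slots
     * @param reserved  previous allocation IDs that other subtasks may want back
     */
    static Optional<String> select(String previous, Set<String> available, Set<String> reserved) {
        // First choice: return to the slot used before the failover.
        if (previous != null && available.contains(previous)) {
            return Optional.of(previous);
        }
        // Otherwise take any free slot that nobody else is expected to reclaim.
        for (String candidate : available) {
            if (!reserved.contains(candidate)) {
                return Optional.of(candidate);
            }
        }
        // Assumption of this sketch: give up instead of taking a reserved slot.
        return Optional.empty();
    }

    public static void main(String[] args) {
        // Slot "a1" was used by a subtask of a preceding bulk; "b1" (current bulk) is lost.
        Set<String> available = new LinkedHashSet<>(Arrays.asList("a1", "x"));

        // Only the current bulk's previous IDs are known: "a1" looks free and gets hijacked.
        Set<String> currentBulkOnly = new HashSet<>(Arrays.asList("b1"));
        System.out.println(select("b1", available, currentBulkOnly));

        // Previous IDs of all concurrently running bulks are known: "a1" is skipped.
        Set<String> allConcurrentBulks = new HashSet<>(Arrays.asList("a1", "b1"));
        System.out.println(select("b1", available, allConcurrentBulks));
    }
}
```

In the first call the reserved set plays the role of what the retriever knows today, and in the second call the role of what it would know under the proposal. If the preceding bulk is not going to run at the same time, its IDs would simply be left out of the reserved set, which makes its slots available again.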



--
This message was sent by Atlassian Jira
(v8.3.4#803005)