(DEPRECATED) Apache Flink Mailing List archive.

[jira] [Created] (FLINK-12245) Transient slot allocation failure on job recovery

Classic

List

Threaded

1 message

Shang Yuanchun (Jira)

[jira] [Created] (FLINK-12245) Transient slot allocation failure on job recovery

Hwanju Kim created FLINK-12245:
----------------------------------

Summary: Transient slot allocation failure on job recovery
Key: FLINK-12245
URL: https://issues.apache.org/jira/browse/FLINK-12245
Project: Flink
Issue Type: Bug
Components: Runtime / Coordination
Affects Versions: 1.6.3
Environment: Flink 1.6.2 with Kubernetes
Reporter: Hwanju Kim

In 1.6.2, We have experienced that slot allocation is transiently failed on job recovery especially when task manager (TM) is unavailable leading to heartbeat failure. By transient, it means it fails once with slot allocation timeout (by default 5min) and then next recovering restart is succeeded.

I found that each _Execution_ remembers previous allocations and tries to prefer the last previous allocation for the sake of local state recovery from the resolved slot candidates. If the previous allocation belongs to unavailable TM, the candidates do not have this previous allocation, thereby forcing slot provider to request a new slot to resource manager, which then finds a new TM and its available slots. So far it is expected and fine, but any next execution that also belonged to the unavailable TM and has the first task as predecessor fails with the unavailable previous allocation as well. Here it also requests another new slot since it never finds the gone previous allocation from candidates. However, this behavior may make more slot requests than available. For example, if two pipelined tasks shared one slot in one TM, which is then crashed being replaced with a new TM, two new slot requests are generated from the tasks. Since two slot requests cannot be fulfilled by one slot TM, it hits slot allocation timeout and restarts the job.

{code:java}
org.apache.flink.runtime.jobmanager.scheduler.NoResourceAvailableException: Could not allocate all requires slots within timeout of 300000 ms. Slots required: 2, slots allocated: 1 {code}

At the next round of recovery, since the second execution failed to allocate a new slot, its last previous allocation is _null_, then it falls back to locality-based allocation strategy, which can find the slot allocated for the first task, and thus succeeded. Although it is eventually succeeded, it increases downtime by slot allocation timeout.

The reason of this behavior is _PreviousAllocationSchedulingStrategy.findMatchWithLocality()_ immediately returns _null_ if previous allocation is not empty and is not contained in candidate list. I thought that if previous allocation is not in the candidates, it can fall back to _LocationPreferenceSchedulingStrategy.findMatchWithLocality()_ rather than returning _null_. By doing so, it can avoid requesting more than available. Requesting more slots could be fine in an environment where resource managers can reactively spawn up more TMs (like Yarn/Mesos) although it could spawn more than needed, but StandaloneResourceManager with statically provisioned resource cannot help but failing to allocate requested slots.

Having looked at the mainline branch and 1.8.0, although I have not attempted to reproduce this issue with mainline, the related code is changed to what I have expected (falling back to locality-based strategy if previous allocation is not in candidates): PreviousAllocationSlotSelectionStrategy.selectBestSlotForProfile(). Those led me to reading group-aware scheduling work ([https://docs.google.com/document/d/1q7NOqt05HIN-PlKEEPB36JiuU1Iu9fnxxVGJzylhsxU/edit#heading=h.k15nfgsa5bnk]). In addition, I checked in 1.6.2 _matchPreviousLocationNotAvailable_ test expects the problematic behavior I described. So, I started wondering whether the behavior of previous allocation strategy in non-mainline is by design or not. I have a fix similar to the mainline and verified that the problem is resolved, but I am bringing up the issue to have context around the behavior and to discuss what would be the side-effect of the fix. I understand the current vertex-by-vertex scheduling would be inefficient by letting an execution that belonged to unavailable slot steal another task's previous slot, but having slot allocation failure seems worse to me.

I searched with slot allocation failure term in existing issues, and couldn't find the same issue, hence this issue. Please feel free to deduplicate it if any.

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)