[jira] [Created] (FLINK-13555) Failures of slot requests requiring unfulfillable managed memory should not be ignored.

Shang Yuanchun (Jira)
Xintong Song created FLINK-13555:
------------------------------------

             Summary: Failures of slot requests requiring unfulfillable managed memory should not be ignored.
                 Key: FLINK-13555
                 URL: https://issues.apache.org/jira/browse/FLINK-13555
             Project: Flink
          Issue Type: Improvement
          Components: Runtime / Coordination
    Affects Versions: 1.9.0
            Reporter: Xintong Song
             Fix For: 1.9.0
         Attachments: flink-unk-standalonesession-0-u-home.log, flink-unk-taskexecutor-0-u-home.log

Currently, SlotPool ignores failures when requesting slots from the ResourceManager (RM) for all batch slot requests. The idea is to let batch slot requests stay pending in the SlotPool while they wait for other tasks to finish and release slots. A slot request fails only if it is not fulfilled within its timeout.

However, a slot request to the RM can fail for two different reasons:
 # The RM does not have available slots. All slots are in use at the moment, but they might become available later when the currently running tasks finish.
 # The slot request requires more resources than any slot in the cluster (available or not) can provide. The request is not likely to be fulfilled later either.

For the 2nd kind of failure, it doesn't make sense to wait for the timeout. We should fail the job immediately, with a proper error message describing the problem and suggesting that the user tune the job or cluster configuration.
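
To illustrate the distinction between the two kinds of failures, here is a minimal, self-contained sketch. It is not Flink code: the names SlotRequestFailureSketch, Slot, Decision and classify are hypothetical, and a slot is reduced to a single managed memory size in MB. It only shows the decision being proposed, namely that a request stays pending when some occupied slot could fit it, and fails right away when no slot in the cluster could ever fit it.

```java
import java.util.Arrays;
import java.util.List;

public class SlotRequestFailureSketch {

    /** Hypothetical slot model: managed memory size and whether the slot is occupied. */
    static final class Slot {
        final int managedMemoryMb;
        final boolean inUse;

        Slot(int managedMemoryMb, boolean inUse) {
            this.managedMemoryMb = managedMemoryMb;
            this.inUse = inUse;
        }
    }

    /** Hypothetical outcomes for a batch slot request. */
    enum Decision { FULFILL_NOW, KEEP_PENDING, FAIL_IMMEDIATELY }

    /**
     * Classifies a request against every slot in the cluster, free or not.
     * Assumption: a slot fits if its managed memory is at least the requested amount.
     */
    static Decision classify(int requestedManagedMemoryMb, List<Slot> allSlots) {
        boolean someOccupiedSlotFits = false;
        for (Slot slot : allSlots) {
            if (slot.managedMemoryMb >= requestedManagedMemoryMb) {
                if (!slot.inUse) {
                    return Decision.FULFILL_NOW;   // a free slot is large enough
                }
                someOccupiedSlotFits = true;       // failure kind 1: may free up later
            }
        }
        // Failure kind 2: no slot, available or not, can ever satisfy the request.
        return someOccupiedSlotFits ? Decision.KEEP_PENDING : Decision.FAIL_IMMEDIATELY;
    }

    public static void main(String[] args) {
        List<Slot> slots = Arrays.asList(
                new Slot(512, true),
                new Slot(512, false),
                new Slot(1024, true));

        System.out.println(classify(256, slots));   // FULFILL_NOW
        System.out.println(classify(1024, slots));  // KEEP_PENDING
        System.out.println(classify(4096, slots));  // FAIL_IMMEDIATELY
    }
}
```

In this sketch only the FAIL_IMMEDIATELY case would surface an error to the user; the KEEP_PENDING case keeps the existing behavior of waiting until the request timeout.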



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)