Xintong Song created FLINK-13555:
------------------------------------
Summary: Failures of slot requests requiring unfulfillable managed memory should not be ignored.
Key: FLINK-13555
URL: https://issues.apache.org/jira/browse/FLINK-13555
Project: Flink
Issue Type: Improvement
Components: Runtime / Coordination
Affects Versions: 1.9.0
Reporter: Xintong Song
Fix For: 1.9.0
Attachments: flink-unk-standalonesession-0-u-home.log, flink-unk-taskexecutor-0-u-home.log
Currently, SlotPool ignores failures when requesting slots from the ResourceManager (RM) for all batch slot requests. The idea is to let batch slot requests stay pending in the SlotPool while they wait for other tasks to finish and release their slots. A slot request fails only if it is not fulfilled within its timeout.
However, a slot request to the RM can fail for two different reasons:
# RM does not have available slots. All slots are in use at the moment. But they might become available later when the currently running tasks finish.
# The slot request requires more resources than any slot in the cluster (available or not) can provide. Such a request is unlikely to be fulfilled later either.
For the second kind of failure, it doesn't make sense to wait for the timeout. We should fail the job immediately, with a proper error message that describes the problem and suggests that the user tune the job or cluster configuration.
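
The distinction between the two failure kinds could be sketched roughly as follows. This is only an illustration, not Flink's actual code: the class SlotRequestFailureSketch, the Slot type, the FailureKind enum and both methods are hypothetical names, and managed memory (in MB) is assumed to be the only resource dimension that matters.

```java
import java.util.Arrays;
import java.util.List;

/** Hypothetical sketch; none of these types exist in Flink under these names. */
public class SlotRequestFailureSketch {

    /** The two reasons a slot request to the RM may fail, as described above. */
    public enum FailureKind {
        NO_SLOT_AVAILABLE_NOW, // every matching slot is busy, but may be released later
        UNFULFILLABLE          // no slot in the cluster could ever satisfy the request
    }

    /** Assumed, simplified view of a slot: its managed memory and whether it is free. */
    public static final class Slot {
        final int managedMemoryMb;
        final boolean free;

        public Slot(int managedMemoryMb, boolean free) {
            this.managedMemoryMb = managedMemoryMb;
            this.free = free;
        }
    }

    /**
     * Classifies an already-failed request by looking at all slots in the
     * cluster, regardless of whether they are currently in use.
     */
    public static FailureKind classifyFailure(int requestedManagedMemoryMb, List<Slot> allSlots) {
        for (Slot slot : allSlots) {
            if (slot.managedMemoryMb >= requestedManagedMemoryMb) {
                // A large enough slot exists; it is merely occupied right now.
                return FailureKind.NO_SLOT_AVAILABLE_NOW;
            }
        }
        return FailureKind.UNFULFILLABLE;
    }

    /** A batch request keeps waiting unless it can never be fulfilled. */
    public static boolean shouldFailImmediately(FailureKind kind) {
        return kind == FailureKind.UNFULFILLABLE;
    }

    public static void main(String[] args) {
        List<Slot> slots = Arrays.asList(new Slot(512, false), new Slot(1024, false));
        System.out.println(classifyFailure(1024, slots)); // a big enough slot exists, but is busy
        System.out.println(classifyFailure(4096, slots)); // no slot is big enough
    }
}
```

With a check of this shape, the SlotPool could keep ignoring failures of the first kind for batch requests and surface only the second kind to the job right away.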