Piyush Narang created FLINK-14158:
-------------------------------------
Summary: Update Mesos configs to add leaseOfferExpiration and declinedOfferRefuse durations
Key: FLINK-14158
URL: https://issues.apache.org/jira/browse/FLINK-14158
Project: Flink
Issue Type: Bug
Reporter: Piyush Narang
While debugging some Flink on Mesos scheduling issues (tied to our use of Mesos quotas), we found that we fairly often receive skewed offers that are useless. Because we neither reject these offers fast enough nor tell Mesos to hold them back for long enough, we can be unable to schedule our job (~30 Mesos containers) for upwards of an hour.
The Fenzo default is to reject expired and unused Mesos offers after 120s; this can be overridden using its TaskScheduler builder. Additionally, Mesos allows us to override the time for which it won't re-send declined offers (default is 5s). We found that rejecting more aggressively (after 1s instead of 120s) and keeping rejected offers away for longer (60s instead of 5s) dramatically increases our chances of scheduling our jobs on Mesos.
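As a rough illustration of the two tunables, here is a minimal Java sketch of a holder for both durations. The class OfferTuning and its method names are hypothetical and are not part of Flink, Fenzo, or Mesos. The comments note where each value would plausibly be passed (Fenzo's TaskScheduler.Builder#withLeaseOfferExpirySecs and the refuse_seconds field of a Mesos Filters message), on the assumption that the caller wires them in when building the scheduler and declining offers.

```java
import java.time.Duration;

/** Hypothetical holder for the two durations discussed above; not Flink's actual API. */
final class OfferTuning {
    // Defaults quoted in this issue: Fenzo expires unused offers after 120s,
    // and Mesos holds back declined offers for 5s.
    static final Duration DEFAULT_UNUSED_OFFER_EXPIRATION = Duration.ofSeconds(120);
    static final Duration DEFAULT_DECLINED_OFFER_REFUSE = Duration.ofSeconds(5);

    private final Duration unusedOfferExpiration;
    private final Duration declinedOfferRefuse;

    OfferTuning(Duration unusedOfferExpiration, Duration declinedOfferRefuse) {
        if (unusedOfferExpiration.isNegative() || unusedOfferExpiration.isZero()) {
            throw new IllegalArgumentException("unusedOfferExpiration must be positive");
        }
        if (declinedOfferRefuse.isNegative() || declinedOfferRefuse.isZero()) {
            throw new IllegalArgumentException("declinedOfferRefuse must be positive");
        }
        this.unusedOfferExpiration = unusedOfferExpiration;
        this.declinedOfferRefuse = declinedOfferRefuse;
    }

    /** The library defaults described in the issue. */
    static OfferTuning defaults() {
        return new OfferTuning(DEFAULT_UNUSED_OFFER_EXPIRATION, DEFAULT_DECLINED_OFFER_REFUSE);
    }

    /** The values that worked for the reporter: reject after 1s, refuse for 60s. */
    static OfferTuning aggressive() {
        return new OfferTuning(Duration.ofSeconds(1), Duration.ofSeconds(60));
    }

    /** Whole seconds, e.g. for Fenzo's TaskScheduler.Builder#withLeaseOfferExpirySecs(long). */
    long leaseOfferExpirySecs() {
        return unusedOfferExpiration.getSeconds();
    }

    /** Fractional seconds, e.g. for the refuse_seconds field of a Mesos Filters message. */
    double refuseSeconds() {
        return declinedOfferRefuse.toMillis() / 1000.0;
    }
}
```

With OfferTuning.aggressive(), the first value would shorten how long Fenzo holds an unused offer before rejecting it, and the second would lengthen how long Mesos waits before re-sending a declined offer.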
--
This message was sent by Atlassian Jira
(v8.3.4#803005)