[jira] [Created] (FLINK-14158) Update Mesos configs to add leaseOfferExpiration and declinedOfferRefuse durations


Shang Yuanchun (Jira)
Piyush Narang created FLINK-14158:
-------------------------------------

             Summary: Update Mesos configs to add leaseOfferExpiration and declinedOfferRefuse durations
                 Key: FLINK-14158
                 URL: https://issues.apache.org/jira/browse/FLINK-14158
             Project: Flink
          Issue Type: Bug
            Reporter: Piyush Narang


While debugging some Flink on Mesos scheduling issues (tied to our use of Mesos quotas), we found that we fairly often receive skewed offers that are useless to us. Because we do not reject these offers fast enough, and because we do not tell Mesos to hold off re-sending them for long enough, we can end up unable to schedule our job (~30 Mesos containers) for upwards of an hour.

The Fenzo default is to reject expired and unused Mesos offers after 120s; this can be overridden using their TaskScheduler builder. Additionally, Mesos allows us to override the time for which it won't re-send declined offers (default is 5s). We found that rejecting more aggressively (after 1s instead of 120s) and keeping rejected offers away for longer (60s instead of 5s) dramatically increases our chances of scheduling our jobs on Mesos.
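
As a rough illustration of the proposal, the two durations could be read from configuration and fall back to the current defaults (120s and 5s) when unset. This is only a sketch: the class OfferTimeouts and both configuration keys are hypothetical names made up for this example, not existing Flink options. The lease expiry would presumably be handed to Fenzo's TaskScheduler builder (withLeaseOfferExpirySecs), and the refuse duration to the refuse_seconds field of the Mesos Filters sent when declining an offer.

```java
import java.time.Duration;
import java.util.Map;

/** Hypothetical holder for the two offer-handling durations described above. */
public final class OfferTimeouts {
    // Hypothetical configuration keys, named after the issue title; not actual Flink options.
    static final String LEASE_OFFER_EXPIRATION_KEY = "mesos.lease-offer-expiration-secs";
    static final String DECLINED_OFFER_REFUSE_KEY = "mesos.declined-offer-refuse-secs";

    final Duration leaseOfferExpiration;
    final Duration declinedOfferRefuse;

    OfferTimeouts(Duration leaseOfferExpiration, Duration declinedOfferRefuse) {
        if (leaseOfferExpiration.isNegative() || leaseOfferExpiration.isZero()) {
            throw new IllegalArgumentException("lease offer expiration must be positive");
        }
        if (declinedOfferRefuse.isNegative()) {
            throw new IllegalArgumentException("declined offer refuse duration must not be negative");
        }
        this.leaseOfferExpiration = leaseOfferExpiration;
        this.declinedOfferRefuse = declinedOfferRefuse;
    }

    /** Defaults as stated in the issue: Fenzo expires unused offers after 120s, Mesos refuses for 5s. */
    static OfferTimeouts defaults() {
        return new OfferTimeouts(Duration.ofSeconds(120), Duration.ofSeconds(5));
    }

    /** Reads both durations (whole seconds) from a key/value map, keeping defaults for missing keys. */
    static OfferTimeouts fromConfig(Map<String, String> conf) {
        OfferTimeouts d = defaults();
        return new OfferTimeouts(
                readSeconds(conf, LEASE_OFFER_EXPIRATION_KEY, d.leaseOfferExpiration),
                readSeconds(conf, DECLINED_OFFER_REFUSE_KEY, d.declinedOfferRefuse));
    }

    private static Duration readSeconds(Map<String, String> conf, String key, Duration fallback) {
        String raw = conf.get(key);
        return raw == null ? fallback : Duration.ofSeconds(Long.parseLong(raw.trim()));
    }

    /** Value one would presumably pass to Fenzo's TaskScheduler builder (withLeaseOfferExpirySecs). */
    long leaseOfferExpirySecs() {
        return leaseOfferExpiration.getSeconds();
    }

    /** Value one would presumably set as refuse_seconds on the Mesos Filters of a declined offer. */
    double refuseSeconds() {
        return (double) declinedOfferRefuse.getSeconds();
    }

    public static void main(String[] args) {
        // The tuned values reported in the issue: expire after 1s, refuse for 60s.
        OfferTimeouts tuned = fromConfig(Map.of(
                LEASE_OFFER_EXPIRATION_KEY, "1",
                DECLINED_OFFER_REFUSE_KEY, "60"));
        System.out.println("lease expiry secs = " + tuned.leaseOfferExpirySecs());
        System.out.println("refuse secs = " + tuned.refuseSeconds());
    }
}
```

Keeping the existing defaults when a key is absent means current deployments would behave as before, while clusters that see skewed offers could opt in to the more aggressive values.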



--
This message was sent by Atlassian Jira
(v8.3.4#803005)