Levi Ramsey created FLINK-19773:
-----------------------------------
Summary: Exponential backoff restart strategy
Key: FLINK-19773
URL: https://issues.apache.org/jira/browse/FLINK-19773
Project: Flink
Issue Type: Improvement
Affects Versions: 1.11.2
Reporter: Levi Ramsey
There are situations where the current restart strategies (fixed-delay and failure-rate) are suboptimal. For example, with HDFS sinks, a delay between restarts that is shorter than the HDFS lease expiration time results in many restart attempts that fail, putting pointless stress on the cluster. On the other hand, setting the delay close to the lease expiration time means far more downtime than necessary when the cause of the failure resolves itself quickly.
An exponential backoff restart strategy would address this. For example, when jobs contend for a lease on a shared resource that expires after 1200 seconds of inactivity, the strategy might use successive delays of 1, 2, 4, 8, 16, ..., 1024 seconds. Once the 1024-second delay has elapsed, the cumulative delay (2047 seconds) exceeds 1200 seconds.
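To make the arithmetic above concrete, here is a minimal Python sketch of the delay sequence. It is only an illustration, not a proposed Flink API: the function name `backoff_delays` and the parameters `initial`, `multiplier`, and `max_delay` are hypothetical.

```python
from itertools import accumulate


def backoff_delays(initial=1.0, multiplier=2.0, max_delay=1024.0):
    """Yield restart delays in seconds, growing geometrically up to max_delay.

    All parameter names are hypothetical; they only mirror the example
    sequence 1, 2, 4, ..., 1024 from the description.
    """
    delay = initial
    while delay <= max_delay:
        yield delay
        delay *= multiplier


delays = list(backoff_delays())
# Running total of time spent waiting after each restart attempt.
cumulative = list(accumulate(delays))

for delay, total in zip(delays, cumulative):
    print(f"delay={delay:6.0f}s  cumulative={total:6.0f}s")
```

With these defaults, the cumulative delay is 1023 seconds after the 512-second delay and 2047 seconds after the 1024-second delay, so the 1200-second lease timeout is only crossed during the last step.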
While not intrinsically tied to exponential backoff (it is more an example of variable delay), adding jitter to the delays can mitigate a thundering-herd scenario when many jobs fail at once due to an infrastructure failure, e.g. 0 -> 1 -> 2 -> 3/4/5 -> 5/6/7/8/9/10/11 seconds. With this jitter, a set of jobs competing to restart will eventually spread out.
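One possible way to add such jitter is sketched below in Python. This is an assumption about how it could work, not an existing Flink strategy: the name `jittered_delay`, the `jitter` fraction, and the choice of a symmetric uniform perturbation around the base delay are all hypothetical, and the example ranges above need not follow this exact scheme.

```python
import random


def jittered_delay(attempt, initial=1.0, multiplier=2.0,
                   max_delay=1024.0, jitter=0.25, rng=random):
    """Return the delay in seconds before restart number `attempt` (0-based).

    Hypothetical sketch: the base delay doubles per attempt up to max_delay,
    then is scaled by a random factor in [1 - jitter, 1 + jitter].
    """
    base = initial
    for _ in range(attempt):
        base = min(base * multiplier, max_delay)
    return base * (1.0 + rng.uniform(-jitter, jitter))


# Two jobs that fail at the same moment, each with its own random source,
# drift apart because each draws independent jitter on every attempt.
job_a = random.Random(1)
job_b = random.Random(2)
for attempt in range(5):
    a = jittered_delay(attempt, rng=job_a)
    b = jittered_delay(attempt, rng=job_b)
    print(f"attempt {attempt}: job A waits {a:.2f}s, job B waits {b:.2f}s")
```

Because the jitter scales with the base delay, the spread between jobs widens as the backoff grows, which is the effect that lets competing jobs separate over time.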
(Logging the ticket mainly to start a discussion and perhaps get context on whether this has been considered and rejected, etc.)