[jira] [Created] (FLINK-19773) Exponential backoff restart strategy

Shang Yuanchun (Jira)
Levi Ramsey created FLINK-19773:
-----------------------------------

             Summary: Exponential backoff restart strategy
                 Key: FLINK-19773
                 URL: https://issues.apache.org/jira/browse/FLINK-19773
             Project: Flink
          Issue Type: Improvement
    Affects Versions: 1.11.2
            Reporter: Levi Ramsey


There are situations where the current restart strategies (fixed-delay and failure-rate) are suboptimal.  For example, with HDFS sinks, a restart delay shorter than the HDFS lease expiration time results in many restart attempts that fail, putting pointless stress on the cluster.  On the other hand, a delay close to the lease expiration time means far more downtime than necessary when the cause of failure is something that resolves itself quickly.

 

An exponential backoff restart strategy would address this.  For example, where jobs are contending for a lease on a shared resource that expires after 1200 seconds of inactivity, the strategy might use successive delays of 1, 2, 4, 8, 16, ..., 1024 seconds (by which point a cumulative delay of more than 1200 seconds has passed).
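
To make the arithmetic in the lease example concrete, here is a minimal sketch in plain Java. It is not Flink code: the class name `ExponentialBackoffSketch`, the method `delaysUntilExceeded` and its parameters are hypothetical names chosen for illustration. It doubles the delay after each attempt and stops once the cumulative delay has exceeded the lease timeout.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative only: hypothetical names, not part of any Flink API.
class ExponentialBackoffSketch {

    /**
     * Returns the successive restart delays (in seconds), starting at
     * initialSeconds and multiplying by multiplier each time, until the
     * cumulative delay is greater than leaseTimeoutSeconds.
     */
    static List<Long> delaysUntilExceeded(long initialSeconds, long multiplier, long leaseTimeoutSeconds) {
        List<Long> delays = new ArrayList<>();
        long cumulative = 0;
        long delay = initialSeconds;
        while (cumulative <= leaseTimeoutSeconds) {
            delays.add(delay);
            cumulative += delay;
            delay *= multiplier;
        }
        return delays;
    }

    public static void main(String[] args) {
        long cumulative = 0;
        for (long d : delaysUntilExceeded(1, 2, 1200)) {
            cumulative += d;
            System.out.println("delay=" + d + "s cumulative=" + cumulative + "s");
        }
    }
}
```

With an initial delay of 1 second, a multiplier of 2 and a 1200-second lease, this yields 11 delays ending at 1024 seconds, for a cumulative delay of 2047 seconds. A real strategy would presumably also cap the maximum delay and reset the backoff after a period of stable running; both are omitted here.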

Although not intrinsically tied to exponential backoff (it is more an example of variable delay), a thundering herd scenario, where many jobs fail at once due to an infrastructure failure, can be mitigated by adding jitter to the delays, e.g. 0 -> 1 -> 2 -> 3/4/5 -> 5/6/7/8/9/10/11 seconds.  With this jitter, a set of jobs competing to restart will eventually spread out.
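
One way to sketch the jitter idea in plain Java follows. This is an assumption about how jitter could be applied, not a proposal for the actual Flink implementation: the class `JitteredBackoffSketch`, the method `jitteredDelay` and the `jitterFactor` parameter are hypothetical. It picks a uniform integer offset within a fraction of the base delay, so the spread grows as the base delay grows. It does not reproduce the exact ranges in the example above.

```java
import java.util.Random;

// Illustrative only: hypothetical names, not part of any Flink API.
class JitteredBackoffSketch {

    /**
     * Returns baseSeconds plus a uniformly random integer offset in
     * [-spread, +spread], where spread = floor(baseSeconds * jitterFactor).
     * The result is never negative.
     */
    static long jitteredDelay(long baseSeconds, double jitterFactor, Random random) {
        long spread = (long) Math.floor(baseSeconds * jitterFactor);
        // nextInt(bound) returns a value in [0, bound), so this is in [-spread, +spread].
        long offset = random.nextInt((int) (2 * spread + 1)) - spread;
        return Math.max(0, baseSeconds + offset);
    }

    public static void main(String[] args) {
        Random random = new Random();
        for (long base = 1; base <= 1024; base *= 2) {
            System.out.println("base=" + base + "s jittered=" + jitteredDelay(base, 0.25, random) + "s");
        }
    }
}
```

With a `jitterFactor` of 0.25, a base delay of 4 seconds becomes 3, 4 or 5 seconds, and a base delay of 8 seconds becomes anything from 6 to 10 seconds. If each job draws from its own random source, jobs that failed together will tend to drift apart over successive restarts.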

(Logging the ticket mainly to start a discussion and perhaps get context on whether this has been considered and rejected, etc.)



--
This message was sent by Atlassian Jira
(v8.3.4#803005)