[jira] [Created] (FLINK-20752) FailureRateRestartBackoffTimeStrategy allows one less restart than configured

Shang Yuanchun (Jira)
Chesnay Schepler created FLINK-20752:
----------------------------------------

             Summary: FailureRateRestartBackoffTimeStrategy allows one less restart than configured
                 Key: FLINK-20752
                 URL: https://issues.apache.org/jira/browse/FLINK-20752
             Project: Flink
          Issue Type: Bug
          Components: Runtime / Coordination
    Affects Versions: 1.10.0
            Reporter: Chesnay Schepler
            Assignee: Chesnay Schepler
             Fix For: 1.13.0


The {{FailureRateRestartBackoffTimeStrategy}} maintains a list of failure timestamps, keeping N timestamps where N is the configured number of failures per interval.

The timestamp is added when #notifyFailure() is called, and later evaluated within #canRestart().

To determine whether a restart should be allowed, we first check whether we are already storing N timestamps and, if so, whether the earliest failure still falls within the current interval. If it does, we reject the restart.

The problem is that we check whether we have already stored exactly N timestamps. If we have exactly N timestamps, and we allow N failures per interval, then we are still within the limit; we should instead be checking whether N+1 timestamps have been stored.

For example, let's say we allow 2 exceptions, and 2 have occurred so far. Regardless of what the timestamps are, we should still allow a restart in this case.
Only once a third exception occurs should we be looking at the timestamps, and we should furthermore only look at the exception exceeding the allowed failure count; in this example it is the very first exception.
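
To make the off-by-one concrete, here is a minimal sketch of the check described above. It is not the actual Flink implementation: the class {{FailureRateSketch}}, its constructor parameter {{checkThreshold}} and the explicit {{nowMs}} arguments are hypothetical names chosen for illustration. Passing N as the threshold mimics the behaviour reported here, while passing N+1 mimics the proposed check.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Hypothetical illustration only; not the real FailureRateRestartBackoffTimeStrategy.
class FailureRateSketch {
    // Number of stored timestamps at which we start looking at the earliest one.
    // N reproduces the reported behaviour, N + 1 the proposed one.
    private final int checkThreshold;
    private final long intervalMs;
    private final Deque<Long> failureTimestamps = new ArrayDeque<>();

    FailureRateSketch(int checkThreshold, long intervalMs) {
        if (checkThreshold < 1) {
            throw new IllegalArgumentException("checkThreshold must be >= 1");
        }
        this.checkThreshold = checkThreshold;
        this.intervalMs = intervalMs;
    }

    // Record a failure, keeping at most checkThreshold timestamps.
    void notifyFailure(long nowMs) {
        if (failureTimestamps.size() >= checkThreshold) {
            failureTimestamps.removeFirst();
        }
        failureTimestamps.addLast(nowMs);
    }

    // Allow a restart unless checkThreshold failures are stored and the
    // earliest of them still falls within the current interval.
    boolean canRestart(long nowMs) {
        if (failureTimestamps.size() < checkThreshold) {
            return true;
        }
        long earliest = failureTimestamps.peekFirst();
        return nowMs - earliest > intervalMs;
    }
}
```

With N = 2 and a 60 second interval, {{new FailureRateSketch(2, 60_000L)}} rejects the restart after the second failure, whereas {{new FailureRateSketch(3, 60_000L)}} only rejects it after a third failure that falls within 60 seconds of the first.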

I don't know why this went unnoticed for so long; the relevant tests fail rather reliably for me locally. ({{FailureRateRestartBackoffTimeStrategyTest}}, {{SimpleRecoveryFailureRateStrategyITBase}})

 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)