[jira] [Created] (FLINK-14375) Avoid to trigger failover on a non-effective task failure notification

classic Classic list List threaded Threaded
1 message Options
Reply | Threaded
Open this post in threaded view
|

[jira] [Created] (FLINK-14375) Avoid to trigger failover on a non-effective task failure notification

Shang Yuanchun (Jira)
Zhu Zhu created FLINK-14375:
-------------------------------

             Summary: Avoid to trigger failover on a non-effective task failure notification
                 Key: FLINK-14375
                 URL: https://issues.apache.org/jira/browse/FLINK-14375
             Project: Flink
          Issue Type: Sub-task
          Components: Runtime / Coordination
    Affects Versions: 1.10.0
            Reporter: Zhu Zhu
             Fix For: 1.10.0


The DefaultScheduler triggers failover if a task is notified to be FAILED. However, in the case the multiple tasks in the same region fail together, it will trigger multiple failovers. The later triggered failovers are useless, lead to concurrent failovers and will increase the restart attempts count.

To avoid that, I'd propose to check that the effectiveness of a task failure to decide whether to trigger a failover, namely checking in {{DefaultScheduler#maybeHandleTaskFailure()}} to see whether the vertex state is really FAILED rather than CANCELED.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)