Zhu Zhu created FLINK-14375:
-------------------------------
Summary: Avoid to trigger failover on a non-effective task failure notification
Key: FLINK-14375
URL:
https://issues.apache.org/jira/browse/FLINK-14375 Project: Flink
Issue Type: Sub-task
Components: Runtime / Coordination
Affects Versions: 1.10.0
Reporter: Zhu Zhu
Fix For: 1.10.0
The DefaultScheduler triggers failover if a task is notified to be FAILED. However, in the case the multiple tasks in the same region fail together, it will trigger multiple failovers. The later triggered failovers are useless, lead to concurrent failovers and will increase the restart attempts count.
To avoid that, I'd propose to check that the effectiveness of a task failure to decide whether to trigger a failover, namely checking in {{DefaultScheduler#maybeHandleTaskFailure()}} to see whether the vertex state is really FAILED rather than CANCELED.
--
This message was sent by Atlassian Jira
(v8.3.4#803005)