Zhu Zhu created FLINK-14164:
-------------------------------
Summary: Add a metric to show failover count regarding fine grained recovery
Key: FLINK-14164
URL: https://issues.apache.org/jira/browse/FLINK-14164
Project: Flink
Issue Type: Improvement
Components: Runtime / Coordination, Runtime / Metrics
Affects Versions: 1.9.0, 1.10.0
Reporter: Zhu Zhu
Fix For: 1.10.0
Previously, Flink used the restart-all strategy to recover jobs from failures, and the "fullRestarts" metric showed the number of failovers.
However, with fine-grained recovery introduced in 1.9.0, the "fullRestarts" metric only reveals how many times the entire graph has been restarted; it does not include fine-grained failure recoveries.
As many users want to build their job alerting on failover counts, I propose adding a new metric, {{numberOfFailures}} or {{numberOfRestarts}}, that also counts fine-grained recoveries.
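
To illustrate the intended semantics, here is a minimal sketch of two counters side by side. This is not Flink code: the class {{RestartCounters}} and its methods are hypothetical names used only for this example, and how such counters would be wired into the scheduler and the metric system is left open.

```java
import java.util.concurrent.atomic.AtomicLong;

// Illustrative only: not Flink code. All names here are hypothetical.
class RestartCounters {
    // Counts restarts of the entire job graph only.
    private final AtomicLong fullRestarts = new AtomicLong();
    // Counts every recovery, whether it restarts the whole graph or a single region.
    private final AtomicLong numberOfRestarts = new AtomicLong();

    // Called when the whole graph is restarted: both counters advance.
    void onFullRestart() {
        fullRestarts.incrementAndGet();
        numberOfRestarts.incrementAndGet();
    }

    // Called when only a failover region is restarted: the full-restart counter does not move.
    void onRegionalRestart() {
        numberOfRestarts.incrementAndGet();
    }

    long getFullRestarts() {
        return fullRestarts.get();
    }

    long getNumberOfRestarts() {
        return numberOfRestarts.get();
    }

    public static void main(String[] args) {
        RestartCounters counters = new RestartCounters();
        counters.onRegionalRestart();
        counters.onRegionalRestart();
        counters.onFullRestart();
        System.out.println("fullRestarts=" + counters.getFullRestarts());
        System.out.println("numberOfRestarts=" + counters.getNumberOfRestarts());
    }
}
```

With two regional recoveries and one full restart, an alert based on the full-restart counter alone would see a single failover, while the combined counter would see all three.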
--
This message was sent by Atlassian Jira
(v8.3.4#803005)