[jira] [Created] (FLINK-14164) Add a metric to show failover count regarding fine grained recovery


Shang Yuanchun (Jira)
Zhu Zhu created FLINK-14164:
-------------------------------

             Summary: Add a metric to show failover count regarding fine grained recovery
                 Key: FLINK-14164
                 URL: https://issues.apache.org/jira/browse/FLINK-14164
             Project: Flink
          Issue Type: Improvement
          Components: Runtime / Coordination, Runtime / Metrics
    Affects Versions: 1.9.0, 1.10.0
            Reporter: Zhu Zhu
             Fix For: 1.10.0


Previously, Flink used the restart-all strategy to recover jobs from failures, and the "fullRestart" metric showed the number of failovers.

However, with fine-grained recovery introduced in 1.9.0, the "fullRestart" metric only reveals how many times the entire graph has been restarted. It does not include the number of fine-grained failure recoveries.

Because many users want to build their job alerting on failovers, I propose adding a new metric, {{numberOfFailures}} or {{numberOfRestarts}}, that also counts fine-grained recoveries.
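
To illustrate the difference between the two counts, here is a minimal sketch in plain Java. It does not use any Flink API: the class {{RestartCounters}} and its methods are hypothetical names chosen for this example, and how the counters would actually be wired into the scheduler and exposed as metrics is left open.

```java
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical helper, not part of Flink: tracks both counts side by side.
class RestartCounters {
    // Counts only restarts of the entire graph (what "fullRestart" reports today).
    private final AtomicLong fullRestartCount = new AtomicLong();
    // Counts every recovery, whether full or fine-grained (the proposed metric).
    private final AtomicLong numberOfRestarts = new AtomicLong();

    // Called when the whole job graph is restarted.
    void onFullRestart() {
        fullRestartCount.incrementAndGet();
        numberOfRestarts.incrementAndGet();
    }

    // Called when only a failed region is restarted (fine-grained recovery).
    void onRegionRestart() {
        numberOfRestarts.incrementAndGet();
    }

    long getFullRestartCount() {
        return fullRestartCount.get();
    }

    long getNumberOfRestarts() {
        return numberOfRestarts.get();
    }
}
```

With counters like these, two region restarts followed by one full restart would leave the full-restart count at 1 while the overall restart count reaches 3, which is the number an alert on failovers would need.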



--
This message was sent by Atlassian Jira
(v8.3.4#803005)