Gyula Fora created FLINK-12373:
----------------------------------
Summary: Improve checkpointing metrics
Key: FLINK-12373
URL: https://issues.apache.org/jira/browse/FLINK-12373
Project: Flink
Issue Type: New Feature
Components: Runtime / Checkpointing
Reporter: Gyula Fora
The CheckpointMetrics class currently exposes 4 core metrics for each operator: bytesBuffered, alignment time, sync duration and async duration.
I think it would be a great improvement to break up the tracking of the sync duration into its components, since that breakdown contains information that is critical for improving the SLA of large jobs.
I suggest we break up the sync duration into 4 subcomponents:
1. prepareSnapshotPreBarrier
2. Snapshot timers
3. Snapshot operator states
4. Sync keyed state checkpoint
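To make the proposed breakdown concrete, here is a minimal sketch of how the four sync sub-phases could be timed separately. It is not Flink code: the class SyncPhaseTimings, its Phase enum and its methods are hypothetical names made up for illustration, and how the values would actually be wired into CheckpointMetrics is left open.

```java
import java.util.EnumMap;
import java.util.Map;

/** Hypothetical helper (not part of Flink) that accumulates time per sync sub-phase. */
final class SyncPhaseTimings {

    /** The four sub-phases suggested in this issue. */
    enum Phase {
        PREPARE_SNAPSHOT_PRE_BARRIER,
        SNAPSHOT_TIMERS,
        SNAPSHOT_OPERATOR_STATE,
        SYNC_KEYED_STATE
    }

    private final Map<Phase, Long> nanos = new EnumMap<>(Phase.class);

    /** Runs the action and adds its elapsed time to the given phase, even if it throws. */
    void time(Phase phase, Runnable action) {
        long start = System.nanoTime();
        try {
            action.run();
        } finally {
            nanos.merge(phase, System.nanoTime() - start, Long::sum);
        }
    }

    /** Accumulated nanoseconds for one phase, or 0 if the phase never ran. */
    long nanosFor(Phase phase) {
        return nanos.getOrDefault(phase, 0L);
    }

    /** Sum over all phases; corresponds to the part of the sync duration that was measured. */
    long totalSyncNanos() {
        long total = 0L;
        for (long value : nanos.values()) {
            total += value;
        }
        return total;
    }
}
```

With something like this, each step of the synchronous snapshot would be wrapped in a call such as timings.time(Phase.SNAPSHOT_TIMERS, ...), and the per-phase values could be reported alongside the existing total.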
Maybe the operator state part could be further broken up into keyed and non-keyed parts, I don't know.
I think knowing these metrics is crucial for users to minimise the latency caused by checkpointing.
Whether we want to show all this info on the web UI is another discussion :)
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)