[jira] [Created] (FLINK-12373) Improve checkpointing metrics


Shang Yuanchun (Jira)
Gyula Fora created FLINK-12373:
----------------------------------

             Summary: Improve checkpointing metrics
                 Key: FLINK-12373
                 URL: https://issues.apache.org/jira/browse/FLINK-12373
             Project: Flink
          Issue Type: New Feature
          Components: Runtime / Checkpointing
            Reporter: Gyula Fora


The CheckpointMetrics class currently exposes 4 core metrics for each operator: bytesBuffered, alignment time, sync duration and async duration.

I think it would be a great improvement to break the sync duration down into its separate components, as it contains information that is critical for improving the SLA of large jobs.

I suggest we break up the sync duration into 4 subcomponents:

 1. prepareSnapshotPreBarrier
 2. Snapshot timers
 3. Snapshot operator states
 4. Sync keyed state checkpoint
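
To illustrate the proposed breakdown, here is a minimal sketch of how the sync part could be timed per subcomponent. The class `SyncPhaseBreakdown`, its `Phase` enum and its methods are hypothetical names made up for this example; they are not part of Flink or of the existing CheckpointMetrics class, and the real integration point would likely look different.

```java
import java.util.EnumMap;
import java.util.Map;

/** Hypothetical helper (not a Flink API): accumulates time spent in each sync-phase step. */
class SyncPhaseBreakdown {

    /** The four subcomponents proposed in this issue. */
    enum Phase {
        PREPARE_SNAPSHOT_PRE_BARRIER,
        SNAPSHOT_TIMERS,
        SNAPSHOT_OPERATOR_STATES,
        SYNC_KEYED_STATE_CHECKPOINT
    }

    private final Map<Phase, Long> nanosByPhase = new EnumMap<>(Phase.class);

    /** Runs the step and adds its elapsed time to the given phase, even if the step throws. */
    void time(Phase phase, Runnable step) {
        long start = System.nanoTime();
        try {
            step.run();
        } finally {
            nanosByPhase.merge(phase, System.nanoTime() - start, Long::sum);
        }
    }

    /** Time recorded for one phase, or 0 if the phase never ran. */
    long nanos(Phase phase) {
        return nanosByPhase.getOrDefault(phase, 0L);
    }

    /** Sum over all phases; corresponds to the single sync duration reported today. */
    long totalSyncNanos() {
        long total = 0L;
        for (long n : nanosByPhase.values()) {
            total += n;
        }
        return total;
    }
}
```

A caller would wrap each step of the synchronous part, for example `breakdown.time(Phase.SNAPSHOT_TIMERS, () -> { /* snapshot timers here */ })`, and report the per-phase values alongside the existing total.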

Maybe the operator state part could be further broken up into keyed and non-keyed parts; I don't know.

I think knowing these metrics is crucial for users to minimise the latency caused by checkpointing.

Whether we want to show all this info on the web UI is another discussion :)


--
This message was sent by Atlassian JIRA
(v7.6.3#76005)