[jira] [Created] (FLINK-20886) Add the option to get a threaddump on checkpoint timeouts


Shang Yuanchun (Jira)
Nico Kruber created FLINK-20886:
-----------------------------------

             Summary: Add the option to get a threaddump on checkpoint timeouts
                 Key: FLINK-20886
                 URL: https://issues.apache.org/jira/browse/FLINK-20886
             Project: Flink
          Issue Type: Improvement
          Components: Runtime / Checkpointing
    Affects Versions: 1.12.0
            Reporter: Nico Kruber


For debugging checkpoint timeouts, I was thinking about the following addition to Flink:

When a checkpoint times out and the async thread is still running, create a thread dump [1] and either add it to the checkpoint stats, log it, or write it out.

This may help identify where the checkpoint is stuck (perhaps on a lock, possibly inside a third-party library such as the FS connectors, ...). It would give us some insight into what the thread is currently doing.
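
To illustrate the kind of information such a dump could contain, here is a minimal sketch that uses only the JDK's standard ThreadMXBean. The class name ThreadDumpSketch and its method are hypothetical and not part of Flink; how the result would be attached to the checkpoint stats or logged is left open.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;

/** Hypothetical helper, not a Flink class: renders a dump of all live JVM threads. */
public class ThreadDumpSketch {

    /** Returns one entry per live thread: name, state, lock being waited on, full stack. */
    public static String dumpAllThreads() {
        ThreadMXBean bean = ManagementFactory.getThreadMXBean();
        // Only request monitor/synchronizer details if this JVM supports them.
        ThreadInfo[] infos = bean.dumpAllThreads(
                bean.isObjectMonitorUsageSupported(),
                bean.isSynchronizerUsageSupported());

        StringBuilder sb = new StringBuilder();
        for (ThreadInfo info : infos) {
            if (info == null) {
                continue; // defensive: skip threads that vanished meanwhile
            }
            sb.append('"').append(info.getThreadName()).append("\" ")
              .append(info.getThreadState());
            if (info.getLockName() != null) {
                sb.append(" waiting on ").append(info.getLockName());
                if (info.getLockOwnerName() != null) {
                    sb.append(" held by \"").append(info.getLockOwnerName()).append('"');
                }
            }
            sb.append('\n');
            // ThreadInfo.toString() truncates the stack, so print every frame here.
            for (StackTraceElement frame : info.getStackTrace()) {
                sb.append("\tat ").append(frame).append('\n');
            }
            sb.append('\n');
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(dumpAllThreads());
    }
}
```

The "waiting on ... held by ..." part is what would point at a lock inside a connector library; the full stack frames show where in the code the async thread currently is.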

Limiting the dump to the relevant threads would be nice, but may not be possible in the general case, since additional threads (spawned by the FS connector library, or otherwise connected) may interact with the async thread(s), e.g. by going through the same locks.
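
One partial way to narrow the scope, sketched below under stated assumptions: start from threads whose names match a prefix and then follow the chain of lock owners reported by ThreadInfo. The class name, the method, and the idea of selecting async threads by name prefix are all hypothetical. This only finds threads that currently hold a lock one of the selected threads is waiting for, so it would still miss other kinds of interaction.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical helper, not a Flink class: scoped dump that follows lock owners. */
public class ScopedThreadDumpSketch {

    /**
     * Selects threads whose name starts with namePrefix, plus (transitively)
     * every thread that owns a lock one of the selected threads is waiting on.
     */
    public static Map<Long, ThreadInfo> dumpWithLockOwners(String namePrefix) {
        ThreadMXBean bean = ManagementFactory.getThreadMXBean();

        // Take a single snapshot so that all lock-owner ids refer to the same moment.
        Map<Long, ThreadInfo> all = new LinkedHashMap<>();
        for (ThreadInfo info : bean.dumpAllThreads(
                bean.isObjectMonitorUsageSupported(),
                bean.isSynchronizerUsageSupported())) {
            if (info != null) {
                all.put(info.getThreadId(), info);
            }
        }

        // Seed the work list with the threads matching the (assumed) name prefix.
        Deque<ThreadInfo> pending = new ArrayDeque<>();
        for (ThreadInfo info : all.values()) {
            if (info.getThreadName().startsWith(namePrefix)) {
                pending.add(info);
            }
        }

        Map<Long, ThreadInfo> selected = new LinkedHashMap<>();
        while (!pending.isEmpty()) {
            ThreadInfo info = pending.poll();
            if (selected.putIfAbsent(info.getThreadId(), info) != null) {
                continue; // already visited; also guards against deadlock cycles
            }
            // getLockOwnerId() is -1 when the thread is not waiting on an owned lock.
            ThreadInfo owner = all.get(info.getLockOwnerId());
            if (owner != null) {
                pending.add(owner);
            }
        }
        return selected;
    }
}
```

Because the interacting threads are not known up front in the general case, a full dump may remain the safer default, with a scoped variant like this as an optional reduction.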


[1] https://crunchify.com/how-to-generate-java-thread-dump-programmatically/



--
This message was sent by Atlassian Jira
(v8.3.4#803005)