Julius Michaelis created FLINK-13808:
----------------------------------------

             Summary: Checkpoints expired by timeout may leak RocksDB files
                 Key: FLINK-13808
                 URL: https://issues.apache.org/jira/browse/FLINK-13808
             Project: Flink
          Issue Type: Bug
          Components: Runtime / Checkpointing
    Affects Versions: 1.8.1, 1.8.0
         Environment: Reproduction is difficult; the only place I can reliably reproduce the issue is a 4-node cluster with parallelism > 100. A semi-reliable way of reproducing the issue is provided by [this docker-compose|https://github.com/jcaesar/flink-rocksdb-file-leak].
            Reporter: Julius Michaelis

A RocksDB state backend with HDFS checkpoints, with or without local recovery, may leak files in {{io.tmp.dirs}} when a checkpoint expires by timeout.

If a checkpoint grows larger than what can be transferred within one checkpoint timeout, checkpoints will continue to fail forever. If this is combined with a quick rollover of SST files (e.g. due to a high density of writes), this may quickly exhaust available disk space (or memory, as /tmp is the default location).

As a workaround, the jobmanager's REST API can be polled frequently for failed checkpoints, and the associated files deleted accordingly.
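
A minimal sketch of the polling half of this workaround, assuming the jobmanager's {{/jobs/<jobid>/checkpoints}} endpoint returns a {{history}} array whose entries carry {{id}} and {{status}} fields. The function names and the jobmanager address are hypothetical. Which files under {{io.tmp.dirs}} belong to a given checkpoint ID is not specified in this report, so the sketch only lists the failed IDs and deletes nothing.

```python
import json
import urllib.request


def failed_checkpoint_ids(checkpoint_stats):
    """Return the sorted IDs of all history entries whose status is FAILED.

    `checkpoint_stats` is the parsed JSON body of the checkpoint statistics
    response (assumed shape: {"history": [{"id": ..., "status": ...}, ...]}).
    """
    history = checkpoint_stats.get("history", [])
    return sorted(
        entry["id"] for entry in history if entry.get("status") == "FAILED"
    )


def fetch_checkpoint_stats(base_url, job_id, timeout=10.0):
    """Fetch checkpoint statistics for one job from the jobmanager's REST API."""
    url = f"{base_url.rstrip('/')}/jobs/{job_id}/checkpoints"
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return json.load(response)


# Usage (hypothetical jobmanager address; the job ID is the one from the
# log line quoted in this report):
#
#   stats = fetch_checkpoint_stats(
#       "http://localhost:8081", "ac7efce3457d9d73b0a4f775a6ef46f8")
#   for checkpoint_id in failed_checkpoint_ids(stats):
#       print(f"checkpoint {checkpoint_id} failed")
```

Keeping the JSON filtering separate from the HTTP call makes it possible to check the filter against a saved response before pointing it at a live cluster.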

I've tried investigating the cause a little bit, but I'm stuck:
 * {{Checkpoint 19 of job ac7efce3457d9d73b0a4f775a6ef46f8 expired before completing.}} and similar messages get printed, so
 * [{{abortExpired}} is invoked|https://github.com/apache/flink/blob/release-1.8.1/flink-runtime/src/main/java/org/apache/flink/runtime/checkpoint/CheckpointCoordinator.java#L549], so
 * [{{dispose}} is invoked|https://github.com/apache/flink/blob/release-1.8.1/flink-runtime/src/main/java/org/apache/flink/runtime/checkpoint/PendingCheckpoint.java#L455], so
 * [{{cancelCaller}} is invoked|https://github.com/apache/flink/blob/release-1.8.1/flink-runtime/src/main/java/org/apache/flink/runtime/checkpoint/PendingCheckpoint.java#L488], so
 * [the canceler is invoked|https://github.com/apache/flink/blob/release-1.8.1/flink-runtime/src/main/java/org/apache/flink/runtime/checkpoint/PendingCheckpoint.java#L497] ([through one more layer|https://github.com/apache/flink/blob/release-1.8.1/flink-runtime/src/main/java/org/apache/flink/runtime/state/AsyncSnapshotCallable.java#L129]), so
 * [{{cleanup}} is invoked|https://github.com/apache/flink/blob/release-1.8.1/flink-runtime/src/main/java/org/apache/flink/runtime/state/AsyncSnapshotCallable.java#L95] (possibly [not from {{cancel}}|https://github.com/apache/flink/blob/release-1.8.1/flink-runtime/src/main/java/org/apache/flink/runtime/state/AsyncSnapshotCallable.java#L84]), so
 * [{{cleanupProvidedResources}} is invoked|https://github.com/apache/flink/blob/release-1.8.1/flink-runtime/src/main/java/org/apache/flink/runtime/state/AsyncSnapshotCallable.java#L162] (this is the indirection that made me give up), so
 * [this trace log|https://github.com/apache/flink/blob/release-1.8.1/flink-state-backends/flink-statebackend-rocksdb/src/main/java/org/apache/flink/contrib/streaming/state/snapshot/RocksIncrementalSnapshotStrategy.java#L372] should be printed, but it isn't.
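
For anyone retracing the last step: the trace message can only show up if TRACE logging is enabled for the RocksDB snapshot code on the taskmanagers. A sketch of the corresponding logger setting, assuming the taskmanagers use a log4j 1.x properties file; the logger name is simply the package of the class linked above.

```properties
# Assumption: log4j 1.x properties file read by the taskmanagers.
# Enables TRACE output for the package containing RocksIncrementalSnapshotStrategy.
log4j.logger.org.apache.flink.contrib.streaming.state.snapshot=TRACE
```

With this in place, a missing trace line points at the cleanup not being reached rather than at the log level.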

I have some time to investigate further, but I'd appreciate help finding out where in this chain things go wrong.

--
This message was sent by Atlassian Jira
(v8.3.2#803003)