[jira] [Created] (FLINK-17089) Checkpoint fail because RocksDBException: Error While opening a file for sequentially reading


Shang Yuanchun (Jira)
Lu Niu created FLINK-17089:
------------------------------

             Summary: Checkpoint fail because RocksDBException: Error While opening a file for sequentially reading
                 Key: FLINK-17089
                 URL: https://issues.apache.org/jira/browse/FLINK-17089
             Project: Flink
          Issue Type: Bug
          Components: Runtime / Checkpointing
            Reporter: Lu Niu


We use the RocksDB state backend with incremental checkpoints. After the job has been running for about 20 hours, checkpoints start failing with the following exception:
{code:java}
Caused by: org.rocksdb.RocksDBException: While opening a file for sequentially reading: /data/foo/bar/usercache/xxx/appcache/application_1584397637704_9072/flink-io-4e2294f0-7e9b-4102-b079-1089f23c47aa/job_d781983f4967703b0480c7943e8100af_op_KeyedProcessOperator_b9daf26d7397cd4b00184cc833054139__27_60__uuid_dee7e33b-9bce-42f3-909a-f6fa4ab52d8c/db/MANIFEST-000006: No such file or directory
    at org.rocksdb.Checkpoint.createCheckpoint(Native Method)
    at org.rocksdb.Checkpoint.createCheckpoint(Checkpoint.java:51)
    at org.apache.flink.contrib.streaming.state.snapshot.RocksIncrementalSnapshotStrategy.takeDBNativeCheckpoint(RocksIncrementalSnapshotStrategy.java:249)
    at org.apache.flink.contrib.streaming.state.snapshot.RocksIncrementalSnapshotStrategy.doSnapshot(RocksIncrementalSnapshotStrategy.java:160)
    at org.apache.flink.contrib.streaming.state.snapshot.RocksDBSnapshotStrategyBase.snapshot(RocksDBSnapshotStrategyBase.java:126)
    at org.apache.flink.contrib.streaming.state.RocksDBKeyedStateBackend.snapshot(RocksDBKeyedStateBackend.java:439)
    at org.apache.flink.streaming.api.operators.AbstractStreamOperator.snapshotState(AbstractStreamOperator.java:411)
    ... 17 more
{code}
Once it starts, this failure happens consistently on every checkpoint until the job is restarted.
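
The exception says that db/MANIFEST-000006 can no longer be opened under the operator's local RocksDB directory. RocksDB records the name of its active manifest in a file called CURRENT in the same directory, so one quick check on the affected TaskManager host is to compare what CURRENT names with the MANIFEST files that are actually present. A minimal sketch in Python (the helper name `check_manifest` is hypothetical, and `db_dir` stands for the `.../db` directory from the exception message):

```python
from pathlib import Path


def check_manifest(db_dir):
    """Compare the manifest named in CURRENT with the MANIFEST files on disk.

    Returns a tuple (expected, exists, present):
      expected - manifest file name read from CURRENT, or None if CURRENT is missing
      exists   - True if that manifest file is present in db_dir
      present  - sorted list of all MANIFEST-* file names found in db_dir
    """
    db = Path(db_dir)
    current = db / "CURRENT"
    # CURRENT holds a single line such as "MANIFEST-000006".
    expected = current.read_text().strip() if current.is_file() else None
    present = sorted(p.name for p in db.glob("MANIFEST-*"))
    exists = expected is not None and (db / expected).is_file()
    return expected, exists, present
```

The function only reports what is on disk. If CURRENT still names a manifest that is absent, that may hint that the file disappeared underneath the running instance, but this is a hint at most and not a diagnosis.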

Some findings:

The JobManager log shows that the error comes from a different subtask each time:
{code:java}
// grep jobManager log on appcache/application_1584397637704_9622
Caused by: org.rocksdb.RocksDBException: While opening a file for sequentially reading: /data/nvme3n1/nm-local-dir/usercache/dkapoor/appcache/application_1584397637704_9622/flink-io-c42b6665-0170-4dc9-9933-8abd78812fd5/job_03a4b302f44a8d9f5b31693a80bde30c_op_KeyedProcessOperator_b9daf26d7397cd4b00184cc833054139__5_60__uuid_fa8124e4-1678-4555-a90a-8eec4d974a22/db/MANIFEST-000006: No such file or directory
Caused by: org.rocksdb.RocksDBException: While opening a file for sequentially reading: /data/nvme3n1/nm-local-dir/usercache/dkapoor/appcache/application_1584397637704_9622/flink-io-a8dfe34d-909e-4aea-8d20-c89199b20856/job_03a4b302f44a8d9f5b31693a80bde30c_op_KeyedProcessOperator_b9daf26d7397cd4b00184cc833054139__4_60__uuid_12fc9764-418e-4802-800e-3623e385743f/db/MANIFEST-000006: No such file or directory
Caused by: org.rocksdb.RocksDBException: While opening a file for sequentially reading: /data/nvme1n1/nm-local-dir/usercache/dkapoor/appcache/application_1584397637704_9622/flink-io-e98c35d7-586a-4edb-9eba-99c6fd823540/job_03a4b302f44a8d9f5b31693a80bde30c_op_KeyedProcessOperator_b9daf26d7397cd4b00184cc833054139__9_60__uuid_f52a3f02-aa12-4285-b594-b94e1b0f8ba7/db/MANIFEST-000006: No such file or directory
Caused by: org.rocksdb.RocksDBException: While opening a file for sequentially reading: /data/nvme3n1/nm-local-dir/usercache/dkapoor/appcache/application_1584397637704_9622/flink-io-a2887f93-1c75-48b1-8b67-72acdc69ce1b/job_03a4b302f44a8d9f5b31693a80bde30c_op_KeyedProcessOperator_b9daf26d7397cd4b00184cc833054139__2_60__uuid_6a8267eb-aa04-48a3-b82f-7b5b9f21c8e0/db/MANIFEST-000006: No such file or directory
Caused by: org.rocksdb.RocksDBException: While opening a file for sequentially reading: /data/nvme2n1/nm-local-dir/usercache/dkapoor/appcache/application_1584397637704_9622/flink-io-27e797c3-de39-4140-84e8-b94e640154cc/job_03a4b302f44a8d9f5b31693a80bde30c_op_KeyedProcessOperator_b9daf26d7397cd4b00184cc833054139__1_60__uuid_fde8b198-32d8-4e0c-a412-f316a4fe1e3e/db/MANIFEST-000006: No such file or directory
Caused by: org.rocksdb.RocksDBException: While opening a file for sequentially reading: /data/nvme1n1/nm-local-dir/usercache/dkapoor/appcache/application_1584397637704_9622/flink-io-e98c35d7-586a-4edb-9eba-99c6fd823540/job_03a4b302f44a8d9f5b31693a80bde30c_op_KeyedProcessOperator_b9daf26d7397cd4b00184cc833054139__9_60__uuid_f52a3f02-aa12-4285-b594-b94e1b0f8ba7/db/MANIFEST-000006: No such file or directory
Caused by: org.rocksdb.RocksDBException: While opening a file for sequentially reading: /data/nvme2n1/nm-local-dir/usercache/dkapoor/appcache/application_1584397637704_9622/flink-io-7be6a975-c0cd-4083-a1c3-b47e4c8fbb1b/job_03a4b302f44a8d9f5b31693a80bde30c_op_KeyedProcessOperator_b9daf26d7397cd4b00184cc833054139__13_60__uuid_d779fe65-181f-40d2-b32e-e17a023c128d/db/MANIFEST-000006: No such file or directory
Caused by: org.rocksdb.RocksDBException: While opening a file for sequentially reading: /data/nvme1n1/nm-local-dir/usercache/dkapoor/appcache/application_1584397637704_9622/flink-io-44fefa0f-c58a-4ce5-ac44-b8b9a436eae5/job_03a4b302f44a8d9f5b31693a80bde30c_op_KeyedProcessOperator_b9daf26d7397cd4b00184cc833054139__40_60__uuid_bfcd85f6-270b-4e56-8c09-250d9171b8a3/db/MANIFEST-000006: No such file or directory
Caused by: org.rocksdb.RocksDBException: While opening a file for sequentially reading: /data/nvme1n1/nm-local-dir/usercache/dkapoor/appcache/application_1584397637704_9622/flink-io-1dff583b-5fb3-4521-8cdf-261a2e3a0f4d/job_03a4b302f44a8d9f5b31693a80bde30c_op_KeyedProcessOperator_b9daf26d7397cd4b00184cc833054139__6_60__uuid_27a20e68-22d6-4e35-a23f-f267c523b829/db/MANIFEST-000006: No such file or directory
Caused by: org.rocksdb.RocksDBException: While opening a file for sequentially reading: /data/nvme2n1/nm-local-dir/usercache/dkapoor/appcache/application_1584397637704_9622/flink-io-27e797c3-de39-4140-84e8-b94e640154cc/job_03a4b302f44a8d9f5b31693a80bde30c_op_KeyedProcessOperator_b9daf26d7397cd4b00184cc833054139__1_60__uuid_fde8b198-32d8-4e0c-a412-f316a4fe1e3e/db/MANIFEST-000006: No such file or directory
Caused by: org.rocksdb.RocksDBException: While opening a file for sequentially reading: /data/nvme3n1/nm-local-dir/usercache/dkapoor/appcache/application_1584397637704_9622/flink-io-a8dfe34d-909e-4aea-8d20-c89199b20856/job_03a4b302f44a8d9f5b31693a80bde30c_op_KeyedProcessOperator_b9daf26d7397cd4b00184cc833054139__4_60__uuid_12fc9764-418e-4802-800e-3623e385743f/db/MANIFEST-000006: No such file or directory
Caused by: org.rocksdb.RocksDBException: While opening a file for sequentially reading: /data/nvme3n1/nm-local-dir/usercache/dkapoor/appcache/application_1584397637704_9622/flink-io-a2887f93-1c75-48b1-8b67-72acdc69ce1b/job_03a4b302f44a8d9f5b31693a80bde30c_op_KeyedProcessOperator_b9daf26d7397cd4b00184cc833054139__2_60__uuid_6a8267eb-aa04-48a3-b82f-7b5b9f21c8e0/db/MANIFEST-000006: No such file or directory
Caused by: org.rocksdb.RocksDBException: While opening a file for sequentially reading: /data/nvme2n1/nm-local-dir/usercache/dkapoor/appcache/application_1584397637704_9622/flink-io-27e797c3-de39-4140-84e8-b94e640154cc/job_03a4b302f44a8d9f5b31693a80bde30c_op_KeyedProcessOperator_b9daf26d7397cd4b00184cc833054139__1_60__uuid_fde8b198-32d8-4e0c-a412-f316a4fe1e3e/db/MANIFEST-000006: No such file or directory
{code}
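
To see how the failures are spread across subtasks, the grep output above can be grouped by the `__<n>_<m>__` part of each path, which appears to be the subtask number and the parallelism (for example `__5_60__`). A minimal sketch in Python; the function name and the regular expression are illustrative and assume the path layout shown in the log lines above:

```python
import re
from collections import Counter

# Assumed layout: ..._op_<Operator>_<operatorId>__<subtask>_<parallelism>__uuid_<uuid>/db/...
SUBTASK_RE = re.compile(r"__(\d+)_(\d+)__uuid_")


def count_failures_by_subtask(log_lines):
    """Count RocksDBException log lines per (subtask, parallelism) pair."""
    counts = Counter()
    for line in log_lines:
        if "RocksDBException" not in line:
            continue  # ignore unrelated log lines
        match = SUBTASK_RE.search(line)
        if match:
            key = (int(match.group(1)), int(match.group(2)))
            counts[key] += 1
    return counts
```

Applied to the 13 lines above, this gives three failures for subtask 1, two each for subtasks 2, 4 and 9, and one each for subtasks 5, 6, 13 and 40, so eight different subtasks are affected.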
Question:

The state is actually small: the largest one is ~3KB, which is below the state.backend.fs.memory-threshold we set. In this case, why does the data still need to be stored in RocksDB?


