[jira] [Created] (FLINK-3431) Add retrying logic for RocksDB snapshots

classic Classic list List threaded Threaded
1 message Options
Reply | Threaded
Open this post in threaded view
|

[jira] [Created] (FLINK-3431) Add retrying logic for RocksDB snapshots

Shang Yuanchun (Jira)
Gyula Fora created FLINK-3431:
---------------------------------

             Summary: Add retrying logic for RocksDB snapshots
                 Key: FLINK-3431
                 URL: https://issues.apache.org/jira/browse/FLINK-3431
             Project: Flink
          Issue Type: Improvement
          Components: Streaming
            Reporter: Gyula Fora
            Priority: Critical


Currently the RocksDB snapshots rely on hdfs copy not failing while taking the snapshots.
In some cases when the state size is big enough the HDFS nodes might get so overloaded that the copy operation fails on errors like this:

AsynchronousException{java.io.IOException: All datanodes 172.26.86.90:50010 are bad. Aborting...}
at org.apache.flink.streaming.runtime.tasks.StreamTask$1.run(StreamTask.java:545)
Caused by: java.io.IOException: All datanodes 172.26.86.90:50010 are bad. Aborting...
at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.setupPipelineForAppendOrRecovery(DFSOutputStream.java:1023)
at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.processDatanodeError(DFSOutputStream.java:838)
at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:483)

I think it would be important that we don't immediately fail the job in these cases but retry the copy operation after some random sleep time. It might be also good to do a random sleep before the copy depending on the state size to smoothen out IO a little bit.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)