[jira] [Created] (FLINK-17288) Speedup loading from savepoints into RocksDB by bulk load

Shang Yuanchun (Jira)
Jun Qin created FLINK-17288:
-------------------------------

             Summary: Speedup loading from savepoints into RocksDB by bulk load
                 Key: FLINK-17288
                 URL: https://issues.apache.org/jira/browse/FLINK-17288
             Project: Flink
          Issue Type: Improvement
          Components: Runtime / State Backends
            Reporter: Jun Qin


When resources are constrained, loading a large savepoint into RocksDB can take a long time. This also increases job recovery time when the savepoint is used for recovery.

Bulk loading the savepoint data into RocksDB should help in this regard. Here is an excerpt from the RocksDB FAQ:
{quote}*Q: What's the fastest way to load data into RocksDB?*

A: A fast way to direct insert data to the DB:
 # using single writer thread and insert in sorted order
 # batch hundreds of keys into one write batch
 # use vector memtable
 # make sure options.max_background_flushes is at least 4
 # before inserting the data, disable automatic compaction, set options.level0_file_num_compaction_trigger, options.level0_slowdown_writes_trigger and options.level0_stop_writes_trigger to very large. After inserting all the data, issue a manual compaction.

3-5 will be automatically done if you call Options::PrepareForBulkLoad() to your option

If you can pre-process the data offline before inserting. There is a faster way: you can sort the data, generate SST files with non-overlapping ranges in parallel and bulkload the SST files. See [https://github.com/facebook/rocksdb/wiki/Creating-and-Ingesting-SST-files]
{quote}
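
As a rough illustration of the preprocessing the FAQ describes (insert in sorted order, write in batches, or build SST files with non-overlapping ranges), here is a minimal Java sketch. The class {{BulkLoadPlanner}} and its methods are hypothetical helpers and not part of Flink or RocksDB. It assumes keys are unique and that the column family uses RocksDB's default bytewise comparator, which orders keys as unsigned bytes. Java bytes are signed, so the sort uses {{Arrays.compareUnsigned}} (Java 9+).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

/** Hypothetical helper for illustration only; not a Flink or RocksDB class. */
final class BulkLoadPlanner {

    // Unsigned lexicographic order, matching RocksDB's default bytewise comparator.
    private static final Comparator<byte[]> UNSIGNED_ORDER = Arrays::compareUnsigned;

    /** Returns a sorted copy of the keys; the input list is left untouched. */
    static List<byte[]> sortKeys(List<byte[]> keys) {
        List<byte[]> sorted = new ArrayList<>(keys);
        sorted.sort(UNSIGNED_ORDER);
        return sorted;
    }

    /**
     * Splits already-sorted keys into contiguous chunks of at most chunkSize keys.
     * Because the input is sorted and keys are assumed unique, the key ranges of
     * the chunks do not overlap. Each chunk could back one write batch or one SST file.
     */
    static List<List<byte[]>> partition(List<byte[]> sortedKeys, int chunkSize) {
        if (chunkSize <= 0) {
            throw new IllegalArgumentException("chunkSize must be positive");
        }
        List<List<byte[]>> chunks = new ArrayList<>();
        for (int start = 0; start < sortedKeys.size(); start += chunkSize) {
            int end = Math.min(start + chunkSize, sortedKeys.size());
            chunks.add(new ArrayList<>(sortedKeys.subList(start, end)));
        }
        return chunks;
    }
}
```

With a chunk size in the hundreds, each chunk could be written as one write batch by a single writer thread. With much larger chunks, each one could be handed to a separate worker that generates an SST file for later ingestion. How the savepoint entries are actually read and written is left open here.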



--
This message was sent by Atlassian Jira
(v8.3.4#803005)