Jun Qin created FLINK-19008:
-------------------------------
Summary: Flink Job runs slow after restore + downscale from an incremental checkpoint (rocksdb)
Key: FLINK-19008
URL: https://issues.apache.org/jira/browse/FLINK-19008
Project: Flink
Issue Type: Improvement
Reporter: Jun Qin
A customer runs a Flink job with the RocksDB state backend. Checkpoints are retained and taken incrementally, and the state size is several TB. When they restore and downscale from a retained checkpoint, downloading the checkpoint files takes only ~20 min, but the job throughput returns to the expected level only after about 3 hours.
I do not have RocksDB logs. The suspicion is that those 3 hours are spent on heavy RocksDB compaction and/or flush, because checkpoints were observed not to finish fast enough due to a long {{checkpoint duration (sync)}}. How can we make this restore phase shorter?
For compaction, I think it is worth checking how much improvement we get from:
{code:java}
CompactionPri compaction_pri = kMinOverlappingRatio;{code}
which is the default in RocksDB 6.x:
{code:java}
// In Level-based compaction, it Determines which file from a level to be
// picked to merge to the next level. We suggest people try
// kMinOverlappingRatio first when you tune your database.
enum CompactionPri : char {
  // Slightly prioritize larger files by size compensated by #deletes
  kByCompensatedSize = 0x0,
  // First compact files whose data's latest update time is oldest.
  // Try this if you only update some hot keys in small ranges.
  kOldestLargestSeqFirst = 0x1,
  // First compact files whose range hasn't been compacted to the next level
  // for the longest. If your updates are random across the key space,
  // write amplification is slightly better with this option.
  kOldestSmallestSeqFirst = 0x2,
  // First compact files whose ratio between overlapping size in next level
  // and its size is the smallest. It in many cases can optimize write
  // amplification.
  kMinOverlappingRatio = 0x3,
};
...
// Default: kMinOverlappingRatio
CompactionPri compaction_pri = kMinOverlappingRatio;{code}
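To make the difference between the priorities more concrete, here is a toy sketch of how a size-based pick and a min-overlapping-ratio pick could diverge on the same level. This is not RocksDB's actual picker: the {{SstFile}} class, the file names, and the sizes are made up for illustration, and delete compensation as well as the two sequence-number-based priorities are ignored.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;

final class CompactionPickSketch {

    // Hypothetical stand-in for an SST file in level N: its own size and the
    // total size of the level N+1 files whose key ranges overlap it (in MiB).
    static final class SstFile {
        final String name;
        final long sizeMiB;
        final long overlappingMiBNextLevel;

        SstFile(String name, long sizeMiB, long overlappingMiBNextLevel) {
            this.name = name;
            this.sizeMiB = sizeMiB;
            this.overlappingMiBNextLevel = overlappingMiBNextLevel;
        }

        // Ratio between the overlapping size in the next level and this file's size.
        double overlapRatio() {
            return (double) overlappingMiBNextLevel / sizeMiB;
        }
    }

    // Toy version of kByCompensatedSize: pick the largest file
    // (assumption: no delete compensation is applied).
    static SstFile pickBySize(List<SstFile> level) {
        return Collections.max(level, Comparator.comparingLong((SstFile f) -> f.sizeMiB));
    }

    // Toy version of kMinOverlappingRatio: pick the file with the smallest
    // overlap-to-size ratio.
    static SstFile pickByMinOverlappingRatio(List<SstFile> level) {
        return Collections.min(level, Comparator.comparingDouble(SstFile::overlapRatio));
    }

    public static void main(String[] args) {
        List<SstFile> level = Arrays.asList(
                new SstFile("a.sst", 64, 640),  // ratio 10.0
                new SstFile("b.sst", 48, 96),   // ratio 2.0
                new SstFile("c.sst", 32, 256)); // ratio 8.0

        System.out.println("by size: " + pickBySize(level).name);
        System.out.println("by min overlapping ratio: " + pickByMinOverlappingRatio(level).name);
    }
}
```

With these made-up numbers, the size-based pick selects a.sst and has to merge roughly 64 + 640 = 704 MiB, while the min-overlapping-ratio pick selects b.sst and merges roughly 48 + 96 = 144 MiB. As far as I know, on the Java side the corresponding setting is {{ColumnFamilyOptions.setCompactionPriority(CompactionPriority.MinOverlappingRatio)}} in RocksJava.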
--
This message was sent by Atlassian Jira
(v8.3.4#803005)