[jira] [Created] (FLINK-19238) RocksDB performance issue with low managed memory and high parallelism


Juha Mynttinen created FLINK-19238:
--------------------------------------

             Summary: RocksDB performance issue with low managed memory and high parallelism
                 Key: FLINK-19238
                 URL: https://issues.apache.org/jira/browse/FLINK-19238
             Project: Flink
          Issue Type: Improvement
    Affects Versions: 1.11.1
            Reporter: Juha Mynttinen


h2. The issue

When using {{RocksDBStateBackend}}, it's possible to configure RocksDB so that it almost constantly flushes the active memtable, causing high IO and CPU usage.

This happens because this check is essentially always true: [https://github.com/dataArtisans/frocksdb/blob/49bc897d5d768026f1eb816d960c1f2383396ef4/include/rocksdb/write_buffer_manager.h#L47].
h2. Reproducing the issue

To reproduce the issue, the following needs to happen:
 * Use the RocksDB state backend
 * Use managed memory
 * Have a "low" managed memory size
 * Have "high" parallelism (e.g. 5) OR enough operators (the exact count is unknown)

The easiest way to do all of this is to use {{StreamExecutionEnvironment.createLocalEnvironment}}, create a simple Flink job, and set the parallelism "high enough". Nothing else is needed. A minimal sketch of such a job follows.
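The sketch below is my own illustration, not taken from an actual benchmark. It assumes the Flink 1.11 APIs; the managed memory size (64m), the parallelism (5) and the keyed map are arbitrary example choices.

{code:java}
import org.apache.flink.api.common.functions.RichMapFunction;
import org.apache.flink.api.common.state.ValueState;
import org.apache.flink.api.common.state.ValueStateDescriptor;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.configuration.MemorySize;
import org.apache.flink.configuration.TaskManagerOptions;
import org.apache.flink.contrib.streaming.state.RocksDBStateBackend;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.sink.DiscardingSink;

public class RocksDbFlushRepro {

    public static void main(String[] args) throws Exception {
        // "Low" managed memory for the embedded task manager (illustrative value).
        Configuration conf = new Configuration();
        conf.set(TaskManagerOptions.MANAGED_MEMORY_SIZE, MemorySize.parse("64m"));

        // "High" parallelism, e.g. 5.
        StreamExecutionEnvironment env =
                StreamExecutionEnvironment.createLocalEnvironment(5, conf);

        // RocksDB state backend; its memory comes from managed memory by default.
        env.setStateBackend(new RocksDBStateBackend("file:///tmp/checkpoints", true));

        env.generateSequence(0, Long.MAX_VALUE)
                .keyBy(v -> v % 1000)
                .map(new RichMapFunction<Long, Long>() {
                    private transient ValueState<Long> last;

                    @Override
                    public void open(Configuration parameters) {
                        last = getRuntimeContext().getState(
                                new ValueStateDescriptor<>("last", Long.class));
                    }

                    @Override
                    public Long map(Long value) throws Exception {
                        last.update(value); // every update lands in a RocksDB memtable
                        return value;
                    }
                })
                .addSink(new DiscardingSink<>());

        env.execute("rocksdb-flush-repro");
    }
}
{code}

Running this locally and watching RocksDB's LOG files (or disk IO) should show the active memtable being flushed almost continuously.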
h2. Background

The arena memory block size is by default 1/8 of the memtable size [https://github.com/dataArtisans/frocksdb/blob/49bc897d5d768026f1eb816d960c1f2383396ef4/db/column_family.cc#L196]. As soon as the memtable holds any data, it consumes one arena block. That arena block size ends up being higher than the "mutable limit", which is calculated from the shared write buffer manager size. Having low managed memory and high parallelism pushes the mutable limit down to a value that is too low.
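To make the arithmetic concrete, here is an illustrative calculation. The 64 MB write buffer size and the 8 MB write buffer manager capacity are assumed example values, and the 7/8 factor mirrors RocksDB's {{mutable_limit_}} computation.

{code:java}
public class ArenaBlockArithmetic {

    public static void main(String[] args) {
        // Assumed example values, not measurements.
        long writeBufferSize = 64L << 20;           // memtable (write_buffer_size): 64 MB
        long arenaBlockSize = writeBufferSize / 8;  // default arena block: 1/8 of the memtable = 8 MB

        // Shared write buffer manager capacity after a small managed memory budget
        // is split across parallel RocksDB instances (assumed: 8 MB).
        long writeBufferManagerCapacity = 8L << 20;
        long mutableLimit = writeBufferManagerCapacity * 7 / 8; // RocksDB's mutable_limit_

        // A memtable with any data already pins a whole arena block, which alone
        // exceeds the mutable limit, so the flush check fires essentially always.
        System.out.printf("arena block: %d MB, mutable limit: %.1f MB, flush triggered: %b%n",
                arenaBlockSize >> 20, mutableLimit / (1024.0 * 1024.0),
                arenaBlockSize > mutableLimit);
    }
}
{code}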
h2. Documentation

In the docs ([https://ci.apache.org/projects/flink/flink-docs-stable/ops/state/large_state_tuning.html]):
  
 "An advanced option (expert mode) to reduce the number of MemTable flushes in setups with many states, is to tune RocksDB’s ColumnFamily options (arena block size, max background flush threads, etc.) via a RocksDBOptionsFactory".
  
 This snippet in the docs is probably talking about the issue I'm witnessing. I think there are two issues here:
  
 1) It's hard or even impossible to know what kind of performance one can expect from a Flink application. Thus, it's hard to tell whether one is suffering from e.g. this performance issue or the system is performing normally (and is inherently slow).
 2) Even if one suspects a performance issue, it's very hard to find its root cause (the memtable being flushed frequently). To find that out, one would need to know what the normal flush frequency is.
  
 Also, the doc says "in setups with many states". The same problem is hit with just one state but "high" parallelism (e.g. 5).
  
 If the arena block size _ever_ needs to be configured only to "fix" this issue, it'd be best if there was _never_ a need to modify the arena block size at all. What if we didn't even mention the arena block size in the docs and focused on the managed memory size instead, since the managed memory size is something the user actually tunes? For reference, the "expert mode" tuning the docs currently suggest looks roughly like the sketch below.
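This is my illustration, not taken from the docs; the class name and the 2 MB arena block size are arbitrary choices, assuming the Flink 1.11 {{RocksDBOptionsFactory}} interface.

{code:java}
import java.util.Collection;

import org.apache.flink.contrib.streaming.state.RocksDBOptionsFactory;
import org.rocksdb.ColumnFamilyOptions;
import org.rocksdb.DBOptions;

// Illustrative only: lowers the arena block size so it fits under the mutable limit.
public class SmallArenaBlockOptionsFactory implements RocksDBOptionsFactory {

    @Override
    public DBOptions createDBOptions(DBOptions currentOptions,
                                     Collection<AutoCloseable> handlesToClose) {
        return currentOptions; // no DB-level changes
    }

    @Override
    public ColumnFamilyOptions createColumnOptions(ColumnFamilyOptions currentOptions,
                                                   Collection<AutoCloseable> handlesToClose) {
        // Example value (2 MB); the "right" size depends on the write buffer manager capacity.
        return currentOptions.setArenaBlockSize(2 * 1024 * 1024);
    }
}
{code}

It would be registered via {{RocksDBStateBackend#setRocksDBOptions}}; the point above stands, though: users shouldn't need to know any of this.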
h2. The proposed fix

The proposed fix is to log the issue at WARN level and tell the user clearly what is happening and how to fix it.
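Purely as an illustration, such a check could look roughly like this (the class and method names, the logger setup and the message wording are my assumptions; the 7/8 factor mirrors RocksDB's mutable limit):

{code:java}
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

public final class ArenaBlockSanityCheck {

    private static final Logger LOG = LoggerFactory.getLogger(ArenaBlockSanityCheck.class);

    /**
     * Warns if the effective arena block size exceeds the write buffer manager's
     * mutable limit (7/8 of its capacity), which makes RocksDB flush the active
     * memtable almost constantly.
     */
    public static void check(long arenaBlockSize, long writeBufferManagerCapacity) {
        long mutableLimit = writeBufferManagerCapacity * 7 / 8;
        if (arenaBlockSize > mutableLimit) {
            LOG.warn(
                "RocksDB arena block size {} bytes is above the write buffer manager's "
                    + "mutable limit {} bytes. Memtables will be flushed on almost every write, "
                    + "causing high IO and CPU usage. Increase managed memory, lower the "
                    + "parallelism, or set a smaller arena block size via a RocksDBOptionsFactory.",
                arenaBlockSize, mutableLimit);
        }
    }

    private ArenaBlockSanityCheck() {}
}
{code}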



--
This message was sent by Atlassian Jira
(v8.3.4#803005)