Where does RocksDB's mem allocation occur


Where does RocksDB's mem allocation occur

Roshan Naik-2
For YARN deployments, let's say the container size = 10 GB and
containerized.heap-cutoff-ratio = 0.3 (= 3 GB).
That means 7 GB is available for Flink's various subsystems, which include the JVM heap, all the DirectByteBuffer allocations (Netty + network buffers + ...), and JVM metadata.
I am wondering whether RocksDB's memory allocations (which are C++ native allocations) are drawn from the 3 GB "cutoff" space, or whether they come from whatever is left of the remaining 7 GB (i.e. what remains after reserving for the above-mentioned pieces).
-roshan
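To make the arithmetic above concrete, here is a minimal sketch of the split as I understand Flink's YARN containerization to compute it. The max(minimum cutoff, ratio * container) rule and the 600 MB default for containerized.heap-cutoff-min are my assumptions from the documentation, so treat this as illustrative rather than authoritative:

```python
def split_container_memory_mb(container_mb, cutoff_ratio, cutoff_min_mb=600):
    # Assumption: Flink reserves max(cutoff_min, container * ratio) as the
    # "cutoff" and sizes the JVM heap (plus other reservations) from the rest.
    cutoff = max(cutoff_min_mb, int(container_mb * cutoff_ratio))
    remainder = container_mb - cutoff
    return cutoff, remainder

# The example above: 10 GB container, ratio 0.3
print(split_container_memory_mb(10240, 0.3))  # -> (3072, 7168)
```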

Re: Where does RocksDB's mem allocation occur

Yun Tang
Hi Roshan

From our experience, RocksDB's memory allocation cannot be controlled well from Flink's side.
The option containerized.heap-cutoff-ratio is mainly used to calculate the JVM heap size; the remaining part is treated as off-heap. In a perfect situation, RocksDB's memory would live entirely in that off-heap portion. However, Flink just starts RocksDB and leaves the memory allocation to RocksDB itself. If YARN is configured to check total memory usage and the total exceeds the limit because RocksDB's memory has grown, the container will be killed.

To control RocksDB's memory, I recommend configuring an acceptable write buffer and block cache size, and setting 'cacheIndexAndFilterBlocks', 'optimizeFilterForHits' and 'pinL0FilterAndIndexBlocksInCache' to true (the first is for memory control; the latter two are for performance once index & filter blocks are cached; refer to [1] for more information). Last but not least, make sure not to use many states within one operator: that causes RocksDB to use many column families, and each column family consumes its own write buffer(s).

[1] https://github.com/facebook/rocksdb/wiki/Memory-usage-in-RocksDB#indexes-and-filter-blocks
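Following the write-buffer and block-cache guidance above, a back-of-the-envelope estimate of RocksDB's footprint can be sketched from the formula on the wiki page in [1] (block cache + per-column-family memtables + index/filter blocks). The default sizes in this sketch are hypothetical placeholders, not Flink's actual defaults:

```python
def estimate_rocksdb_mb(num_column_families,
                        write_buffer_mb=64,
                        max_write_buffer_number=2,
                        block_cache_mb=256,
                        index_filter_mb=0):
    # Each state in an operator becomes one column family, and each column
    # family keeps up to max_write_buffer_number memtables of write_buffer_mb.
    memtables = num_column_families * write_buffer_mb * max_write_buffer_number
    # With cacheIndexAndFilterBlocks=true the index/filter blocks are charged
    # against the block cache, so index_filter_mb can be left at 0.
    return memtables + block_cache_mb + index_filter_mb

# Hypothetical operator with 5 states (i.e. 5 column families):
print(estimate_rocksdb_mb(5))  # -> 896
```

This also shows why "many states within one operator" matters: the memtable term scales linearly with the number of column families.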


Best
Yun Tang


Re: Where does RocksDB's mem allocation occur

Roshan Naik-2
Sorry, resending with proper formatting. Yahoo Mail defaults to rich text, and that messes up the formatting on this mailing list.


Based on my (streaming mode) experiments, I see that it's not simply on-heap and off-heap memory. There are actually 3 divisions of memory:

1 - On-heap (-Xmx)
2 - Off-heap (DirectByteBuffer allocations: network buffers, Netty, JVM metadata, optionally the TM managed memory)
3 - The container "cutoff" part (0.3 in my example)

The cutoff ratio controls what is left over for 1 & 2. Thereafter, the other off-heap reservations dictate what is left over for on-heap.

Obviously RocksDB's memory is not on-heap.
My intuition is that RocksDB's memory might fall into the "cutoff" section plus off-heap. However, that depends on whether or not Flink + Netty fully pre-allocate whatever is reserved for off-heap memory before RocksDB spins up. If they do pre-allocate, then RocksDB's native allocations will fall into 3 only.

If the cutoff is not used by anything, I can't think of a good reason for such a high reservation (default 25%) in every container to sit totally unused.

I don't see any easy way to
  a - confirm where RocksDB's memory lands (i.e. in 2, 3, or both 2 & 3)
  b - roughly estimate the amount of memory RocksDB needs for a certain MB or GB of data that I need to host in it
  c - determine how to tune 1, 2 & 3 to ensure RocksDB gets enough memory without randomly crashing the job

Unfortunately, this memory division is covered only briefly in some unofficial presentations on YouTube, and that coverage appears to be inaccurate.
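One rough way to attack point (a) is a residual calculation: whatever resident memory the TaskManager process holds beyond the JVM heap and the tracked DirectByteBuffers is an upper bound on native (RocksDB) usage. A minimal sketch, with entirely hypothetical sample figures (real values would come from the container's RSS and the JVM's own metrics):

```python
def estimate_native_memory_mb(rss_mb, jvm_heap_used_mb, direct_mem_mb):
    # RSS not accounted for by the JVM heap or by tracked DirectByteBuffer
    # allocations is a rough upper bound on native (e.g. RocksDB) memory.
    return rss_mb - jvm_heap_used_mb - direct_mem_mb

# Hypothetical figures: 9 GB RSS, 6.5 GB heap in use, 1.2 GB direct memory
print(estimate_native_memory_mb(9216, 6656, 1228))  # -> 1332
```

This is crude (it lumps JVM metadata, thread stacks, and allocator overhead into the "native" residual), but watching that residual grow as state grows would at least show which division RocksDB is eating into.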


-roshan