[jira] [Created] (FLINK-13477) Containerized TaskManager killed because of lack of memory overhead


Shang Yuanchun (Jira)
Benoit Hanotte created FLINK-13477:
--------------------------------------

             Summary: Containerized TaskManager killed because of lack of memory overhead
                 Key: FLINK-13477
                 URL: https://issues.apache.org/jira/browse/FLINK-13477
             Project: Flink
          Issue Type: Improvement
          Components: Deployment / Mesos
    Affects Versions: 1.9.0
            Reporter: Benoit Hanotte


Currently, the `-XX:MaxDirectMemorySize` parameter is set as:
`MaxDirectMemorySize = containerMemoryMB - heapSizeMB`
(see [https://github.com/apache/flink/blob/7fec4392b21b07c69ba15ea554731886f181609e/flink-runtime/src/main/java/org/apache/flink/runtime/clusterframework/ContaineredTaskManagerParameters.java#L162])

However, as explained at
 https://docs.oracle.com/javase/8/docs/technotes/tools/unix/java.html,
`MaxDirectMemorySize` only caps the memory used by NIO direct buffers.
Other off-heap memory (for example metaspace, thread stacks, or native
allocations) is not counted against it, so the total off-heap usage can
exceed that value and the container can be killed by Mesos or YARN for
exceeding its allocated memory.

In addition, users might want to allocate off-heap memory through native
code, in which case they will want to keep some of the container memory
free and unallocated by Flink.
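
The arithmetic behind the problem can be sketched as follows. This is only an illustration: the class and method names are made up and are not Flink code, and the container and heap sizes are hypothetical values chosen so that their difference matches the 744M mentioned in this report.

```java
// Illustrative sketch only; names and numbers are assumptions, not Flink code.
final class ContainerBudgetSketch {

    // Formula described in this issue: everything that is not heap becomes direct memory.
    static long maxDirectMB(long containerMemoryMB, long heapSizeMB) {
        return containerMemoryMB - heapSizeMB;
    }

    // Memory left in the container for anything other than heap and direct buffers.
    static long headroomMB(long containerMemoryMB, long heapSizeMB, long maxDirectMB) {
        return containerMemoryMB - heapSizeMB - maxDirectMB;
    }

    public static void main(String[] args) {
        long containerMemoryMB = 2048; // hypothetical container size
        long heapSizeMB = 1304;        // hypothetical heap size

        long direct = maxDirectMB(containerMemoryMB, heapSizeMB);
        System.out.println("MaxDirectMemorySize = " + direct + "m");
        System.out.println("headroom = "
                + headroomMB(containerMemoryMB, heapSizeMB, direct) + "m");
    }
}
```

With this formula the headroom is always zero, so any memory the process uses beyond heap and direct buffers pushes it past the container limit.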

To work around this issue, we currently set the following parameter:
{code:java}
-Dcontainerized.taskmanager.env.FLINK_ENV_JAVA_OPTS='-XX:MaxDirectMemorySize=600m'
{code}
which overrides the value that Flink picks (744M in this case) with a lower one, leaving some overhead memory in the TaskManager containers. However, this is an "ugly" hack: it circumvents the memory allocation logic that Flink performs and bypasses the sanity checks done in `ContaineredTaskManagerParameters`.
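
As a rough sketch of how the override value relates to the reserved overhead, the snippet below derives the option string from Flink's computed value and a desired overhead. The helper names are hypothetical, and the 144 MB overhead is simply the difference between the 744M and 600m figures quoted above, not a recommended setting.

```java
// Illustrative sketch only; helper names are assumptions, not Flink code.
final class DirectMemoryOverrideSketch {

    // Lower Flink's computed direct memory limit by the overhead we want to keep free.
    static long cappedMaxDirectMB(long flinkMaxDirectMB, long reservedOverheadMB) {
        if (reservedOverheadMB < 0 || reservedOverheadMB >= flinkMaxDirectMB) {
            throw new IllegalArgumentException(
                    "overhead must be non-negative and smaller than the direct memory limit");
        }
        return flinkMaxDirectMB - reservedOverheadMB;
    }

    // Build the same dynamic property used in this report.
    static String overrideOption(long maxDirectMB) {
        return "-Dcontainerized.taskmanager.env.FLINK_ENV_JAVA_OPTS="
                + "'-XX:MaxDirectMemorySize=" + maxDirectMB + "m'";
    }

    public static void main(String[] args) {
        long capped = cappedMaxDirectMB(744, 144);
        System.out.println(overrideOption(capped));
    }
}
```

Having to compute this value by hand for every container size is part of what makes the workaround brittle.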



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)