[jira] [Created] (FLINK-15178) Task crash due to mmap allocation failure for BLOCKING shuffle


Shang Yuanchun (Jira)
Zhu Zhu created FLINK-15178:
-------------------------------

             Summary: Task crash due to mmap allocation failure for BLOCKING shuffle
                 Key: FLINK-15178
                 URL: https://issues.apache.org/jira/browse/FLINK-15178
             Project: Flink
          Issue Type: Bug
          Components: Runtime / Network
    Affects Versions: 1.10.0
            Reporter: Zhu Zhu
             Fix For: 1.10.0
         Attachments: MultiRegionBatchNumberCount.java, flink-conf.yaml

I hit this issue when running a test batch (DataSet) job with a parallelism of 1000.
Some TMs crash with the error below:

{code:java}
# There is insufficient memory for the Java Runtime Environment to continue.
# Native memory allocation (mmap) failed to map 12288 bytes for committing reserved memory.
[thread 139864559318784 also had an error]
[thread 139867407243008 also had an error]
{code}

The problem does not occur after either of the following changes:
1. changing ExecutionMode from BATCH_FORCED to PIPELINED
2. changing the config "taskmanager.network.bounded-blocking-subpartition-type" from the default "auto" to "file"
So it looks related to the mmap usage of the BLOCKING shuffle.
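
For readers unfamiliar with the distinction behind the "file" setting, here is a minimal sketch in plain java.nio of the two ways data written to disk can be read back: through a memory mapping (mmap) or through ordinary channel reads into a heap buffer. This is not Flink's implementation; the class and method names (MmapVersusFileRead, readViaMmap, readViaChannel) are hypothetical and only illustrate that a mapping consumes native address space outside the Java heap, while channel reads do not.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.Arrays;

// Hypothetical illustration only; not taken from the Flink code base.
class MmapVersusFileRead {

    // Reads the file through a memory mapping. The mapped region is native
    // memory managed by the operating system, outside the Java heap.
    public static byte[] readViaMmap(Path path) throws IOException {
        try (FileChannel channel = FileChannel.open(path, StandardOpenOption.READ)) {
            MappedByteBuffer mapped =
                    channel.map(FileChannel.MapMode.READ_ONLY, 0, channel.size());
            byte[] out = new byte[mapped.remaining()];
            mapped.get(out);
            return out;
        }
    }

    // Reads the file with plain channel reads into a heap buffer; no mapping is created.
    public static byte[] readViaChannel(Path path) throws IOException {
        try (FileChannel channel = FileChannel.open(path, StandardOpenOption.READ)) {
            ByteBuffer buffer = ByteBuffer.allocate((int) channel.size());
            while (buffer.hasRemaining() && channel.read(buffer) >= 0) {
                // keep reading until the buffer is full or end of file is reached
            }
            return buffer.array();
        }
    }

    public static void main(String[] args) throws IOException {
        Path tmp = Files.createTempFile("spill", ".bin");
        tmp.toFile().deleteOnExit();
        Files.write(tmp, new byte[] {1, 2, 3, 4, 5});

        // Both paths return the same bytes; they differ in where the memory comes from.
        System.out.println(Arrays.equals(readViaMmap(tmp), readViaChannel(tmp)));
    }
}
```

Both methods return identical data, so a switch between the two modes would be expected to change only the memory behaviour, not the job result.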

The problem always occurs at the beginning of a job and disappears after several rounds of failover, so the job eventually succeeds.

The job code and configuration are attached.
The command to run it (on a YARN cluster) is:

{code:java}
bin/flink run -d -m yarn-cluster -c com.alibaba.blink.tests.MultiRegionBatchNumberCount ../flink-tests-1.0-SNAPSHOT-1.10.jar --parallelism 1000
{code}

[~sewen]  [~pnowojski]  [~kevin.cyj]  Do you know why this issue could happen?




--
This message was sent by Atlassian Jira
(v8.3.4#803005)