回复:[jira] [Created] (FLINK-6057) Better default needed for num network buffers

classic Classic list List threaded Threaded
1 message Options
Reply | Threaded
Open this post in threaded view
|

回复:[jira] [Created] (FLINK-6057) Better default needed for num network buffers

wangzhijiang
Hi Luke,
       Yes, it is really not very friendly for users to set the number of network buffers. And it may cause OOM exception if not set the proper amount for large scale job.
The proper amount should be calculated automatically by framework based on the number of input and output channels for each task, and @Nico Kruber is working on 
this feature now. You can check this jira for some details "https://issues.apache.org/jira/browse/FLINK-4545".
Cheers
Zhijiang
------------------------------------------------------------------发件人:Luke Hutchison (JIRA) <[hidden email]>发送时间:2017年3月15日(星期三) 17:01收件人:dev <[hidden email]>主 题:[jira] [Created] (FLINK-6057) Better default needed for num network buffers
Luke Hutchison created FLINK-6057:
-------------------------------------

             Summary: Better default needed for num network buffers
                 Key: FLINK-6057
                 URL: https://issues.apache.org/jira/browse/FLINK-6057
             Project: Flink
          Issue Type: Bug
          Components: Core
    Affects Versions: 1.2.0
            Reporter: Luke Hutchison


Using the default environment,

{code}
ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
{code}

my code will sometimes fail with an error that Flink ran out of network buffers. To fix this, I have to do:

{code}
int numTasks = Runtime.getRuntime().availableProcessors();
config.setInteger(ConfigConstants.DEFAULT_PARALLELISM_KEY, numTasks);
config.setInteger(ConfigConstants.TASK_MANAGER_NUM_TASK_SLOTS, numTasks);
config.setInteger(ConfigConstants.TASK_MANAGER_NETWORK_NUM_BUFFERS_KEY, numTasks * 2048);
{code}

The default value of 2048 fails when I increase the degree of parallelism for a large Flink pipeline (hence the fix to set the number of buffers to numTasks * 2048).

This is particularly problematic because a pipeline can work fine on one machine, and when you start the pipeline on a machine with more cores, it can fail.

The default execution environment should pick a saner default based on the level of parallelism (or whatever is needed to ensure that the number of network buffers is not going to be exceeded for a given execution environment).




--
This message was sent by Atlassian JIRA
(v6.3.15#6346)