[jira] [Created] (FLINK-21201) Creating BoundedBlockingSubpartition blocks TaskManager’s main thread


Shang Yuanchun (Jira)
Zhilong Hong created FLINK-21201:
------------------------------------

             Summary: Creating BoundedBlockingSubpartition blocks TaskManager’s main thread
                 Key: FLINK-21201
                 URL: https://issues.apache.org/jira/browse/FLINK-21201
             Project: Flink
          Issue Type: Bug
          Components: Runtime / Network
    Affects Versions: 1.12.1
            Reporter: Zhilong Hong
         Attachments: jobmanager.log.tar.gz, taskmanager.log.tar.gz

When we run batch jobs with a parallelism of 8k, deploying the vertices takes a long time. After investigating, we found that creating BoundedBlockingSubpartition blocks the TaskManager’s main thread during {{submitTask}}.

When the JobMaster invokes {{submitTask}}, it sends an RPC call to the TaskManager, which executes the {{submitTask}} method in its main thread. In {{submitTask}}, the TaskExecutor creates a Task instance and tries to start it. While creating the Task, the TaskExecutor also creates the ResultPartition and its ResultSubpartitions.

For batch jobs, the ResultSubpartitions are of type BoundedBlockingSubpartition, backed by FileChannelBoundedData. Each BoundedBlockingSubpartition creates a file on the local disk, which is an IO operation and can take a long time.

In our test, creating 8k BoundedBlockingSubpartitions took up to 28 seconds. This blocks the TaskManager’s main thread and leads to heartbeat timeouts and slow task deployment. In my opinion, the IO operation should be executed by the IOExecutor rather than in the main thread.
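
To illustrate the idea, here is a minimal sketch in plain Java. It is not Flink code and does not use Flink's internal APIs: the class {{OffloadSubpartitionFiles}}, the methods {{createSubpartitionFiles}} and {{submitTaskAsync}}, and the two executors are hypothetical stand-ins. A single-threaded executor plays the role of the TaskManager's main thread, and a small thread pool plays the role of the IOExecutor. The blocking file creation runs on the pool, and only the cheap continuation returns to the main thread.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class OffloadSubpartitionFiles {

    /** Blocking step: creates one file per subpartition (hypothetical stand-in for the file-backed data). */
    static List<Path> createSubpartitionFiles(Path dir, int numSubpartitions) {
        List<Path> files = new ArrayList<>(numSubpartitions);
        try {
            for (int i = 0; i < numSubpartitions; i++) {
                files.add(Files.createTempFile(dir, "subpartition-" + i + "-", ".data"));
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return files;
    }

    /** Runs the blocking step on ioExecutor; the continuation (counting the files) runs on mainThread. */
    static CompletableFuture<Integer> submitTaskAsync(
            Path dir, int numSubpartitions, ExecutorService ioExecutor, ExecutorService mainThread) {
        return CompletableFuture
                .supplyAsync(() -> createSubpartitionFiles(dir, numSubpartitions), ioExecutor)
                .thenApplyAsync(files -> files.size(), mainThread);
    }

    public static void main(String[] args) throws Exception {
        // Assumption: a single-threaded executor models the TaskManager's main thread.
        ExecutorService mainThread = Executors.newSingleThreadExecutor();
        // Assumption: a small fixed pool models the IOExecutor.
        ExecutorService ioExecutor = Executors.newFixedThreadPool(4);
        Path dir = Files.createTempDirectory("flink-21201-sketch");
        try {
            CompletableFuture<Integer> started = submitTaskAsync(dir, 100, ioExecutor, mainThread);
            // The main thread is not occupied by file creation, so it can serve other work.
            CompletableFuture<String> heartbeat =
                    CompletableFuture.supplyAsync(() -> "heartbeat handled", mainThread);
            System.out.println(heartbeat.get());
            System.out.println("files created: " + started.get());
        } finally {
            mainThread.shutdown();
            ioExecutor.shutdown();
        }
    }
}
```

In this sketch the heartbeat task never waits behind the file creation, because the main-thread executor only receives the short continuation. Whether the real fix should take this shape depends on how Task construction is structured, so treat it as an illustration of the proposal only. The sketch leaves its temporary files on disk.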

The JobManager and TaskManager logs are attached. A typical task is Source 0: #898.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)