[jira] [Resolved] (FLINK-469) LocalDistributedExecutor Deadlock with Low Buffer Count

Shang Yuanchun (Jira)

     [ https://issues.apache.org/jira/browse/FLINK-469?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ufuk Celebi resolved FLINK-469.
-------------------------------

       Resolution: Fixed
    Fix Version/s:     (was: pre-apache)

Fixed in [2db78a8dc1a4664f3e384005d7e07bea594b835b|https://github.com/apache/incubator-flink/commit/2db78a8dc1a4664f3e384005d7e07bea594b835b].

> LocalDistributedExecutor Deadlock with Low Buffer Count
> -------------------------------------------------------
>
>                 Key: FLINK-469
>                 URL: https://issues.apache.org/jira/browse/FLINK-469
>             Project: Flink
>          Issue Type: Bug
>            Reporter: GitHub Import
>              Labels: github-import
>
> I'm currently working on ([#25|https://github.com/stratosphere/stratosphere/issues/25] | [FLINK-25|https://issues.apache.org/jira/browse/FLINK-25]) and discovered a possible deadlock in the network stack, caused by the buffer management in combination with the `LocalDistributedExecutor` (LDE).
> The LDE starts a JobManager and multiple TaskManagers on different network ports in a single VM. Every TaskManager has an associated `ByteBufferedChannelManager` (single instance) and `GlobalBufferPool` (singleton) for data transfers. When tasks get registered with a TaskManager (which is atomic per TaskManager), the ChannelManager ensures that there are enough network buffers available to execute the task -- this means that there has to be at least one buffer per task channel. If this condition does not hold, an exception is thrown and the task fails. This decision is made locally per task and not for the whole plan, e.g. for WordCount it is possible that all map tasks get enough buffers, but a following reduce throws an exception at runtime.
> The problem occurs in combination with the LDE: we have multiple TMs, each with its own ChannelManager instance, but only a single shared GlobalBufferPool. This breaks the available-buffer computation, because each TM just considers its local channels (registered at its ChannelManager) and not the channels of other TMs (which is perfectly fine in a real distributed setup). Therefore, tasks can deadlock when the shared pool runs out of buffers, because buffer requests are blocking.
> You can likely reproduce this problem by running `LocalDistributedExecutorTest` with the number of buffers set to 20 and the buffer size set to 4096 bytes (see `ConfigConstants`). Also make sure to set `multicastEnabled` in `ByteBufferedChannelManager` to `false`, because it influences the computation -- multicast does not work anyway.
> I will fix this with the upcoming PR for ([#25|https://github.com/stratosphere/stratosphere/issues/25] | [FLINK-25|https://issues.apache.org/jira/browse/FLINK-25]).
> ---------------- Imported from GitHub ----------------
> Url: https://github.com/stratosphere/stratosphere/issues/469
> Created by: [uce|https://github.com/uce]
> Labels: bug, runtime,
> Assignee: [uce|https://github.com/uce]
> Created at: Wed Feb 12 13:58:36 CET 2014
> State: open
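
The accounting mismatch described in the quoted report can be illustrated with a minimal Java sketch. This is not Flink code: the class `BufferAccountingSketch`, its two methods, and the channel counts (12 per TaskManager) are hypothetical and chosen only for illustration; the buffer count of 20 is the value suggested in the report. The sketch compares a check that each manager performs on its own channels with a check over all managers that share one pool.

```java
/**
 * Hypothetical sketch, not Flink code: models why a per-manager buffer check
 * can pass while a pool shared by all managers is oversubscribed.
 */
public class BufferAccountingSketch {

    /** Per-manager check: only this manager's channels are compared to the pool size. */
    static boolean localCheckPasses(int localChannels, int totalBuffers) {
        return localChannels <= totalBuffers;
    }

    /** Global check: the channels of all managers sharing the pool are summed first. */
    static boolean globalCheckPasses(int totalBuffers, int... channelsPerManager) {
        int sum = 0;
        for (int channels : channelsPerManager) {
            sum += channels;
        }
        return sum <= totalBuffers;
    }

    public static void main(String[] args) {
        int totalBuffers = 20;                // buffer count from the report
        int[] channelsPerManager = {12, 12};  // assumed channel counts, for illustration only

        // Each manager sees 12 <= 20 and accepts its tasks.
        for (int channels : channelsPerManager) {
            System.out.println("local check passes: " + localCheckPasses(channels, totalBuffers));
        }
        // Together the managers need 24 buffers, but the shared pool has only 20.
        System.out.println("global check passes: "
                + globalCheckPasses(totalBuffers, channelsPerManager));
    }
}
```

With these assumed numbers, both local checks succeed while the global check fails. That is the situation in which blocking buffer requests can stall tasks instead of failing them with an exception at registration time.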



--
This message was sent by Atlassian JIRA
(v6.2#6252)