[jira] [Created] (FLINK-12538) Network notifyDataAvailable() only called after getting a new buffer

classic Classic list List threaded Threaded
1 message Options
Reply | Threaded
Open this post in threaded view
|

[jira] [Created] (FLINK-12538) Network notifyDataAvailable() only called after getting a new buffer

Shang Yuanchun (Jira)
Nico Kruber created FLINK-12538:
-----------------------------------

             Summary: Network notifyDataAvailable() only called after getting a new buffer
                 Key: FLINK-12538
                 URL: https://issues.apache.org/jira/browse/FLINK-12538
             Project: Flink
          Issue Type: Bug
          Components: Runtime / Network
    Affects Versions: 1.8.0, 1.7.2, 1.6.3, 1.9.0
            Reporter: Nico Kruber


There is a potential regression in Flink 1.5+ which came with the low-latency changes. Whenever the {{RecordWriter}} finishes a buffer, it will first ask for a new buffer, then adds it to the appropriate result subpartition which notifies Netty of data being available.

In back-pressured scenarios where all buffers from the local pool are taken, it may happen that you do not immediately get a new buffer and have to wait for as long as it takes to get it before Netty can make use of the finished network buffer. Pre 1.5, Flink always immediately notified the downwards stack.
Although we do still have the output flusher notifying Netty within at most 100ms (by default), the new behaviour may actually decrease throughput and latency in a back-pressured scenario.

Having a quick look at the code, changing this behaviour is probably not too difficult but only needs to take care not to introduce additional locking / locking multiple times compared to now. Things to do/consider:
* {{PipelinedSubpartition#add()}} contains some optimisations to avoid unnecessary flushes but these conditions are under a lock -> try to not acquire it twice
* {{RecordWriter#requestNewBufferBuilder()}} could therefore maybe have an optimised path with a non-blocking buffer builder request if successful and if not, notify/flush and do another blocking request

After talking to [~pnowojski] offline, we are not sure how grave the issue is and whether we would improve by changing it. If you are willing to take a look and have code changing the current behaviour, please verify that it does not cause any performance regression itself and actually does improve some scenario (shown by a performance test, e.g. via https://github.com/dataArtisans/flink-benchmarks ).



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)