Hi,
I am experiencing strange behaviour with our Flink application, so I created a very simple sample application to demonstrate the problem.
The application reads data from Kafka, performs a simple transformation, and accesses an external Redis database to read data within a FlatMap operator. When running the application with a parallelism higher than 1, there is unexpectedly high latency on only one instance of the operator that accesses the external database (the “bad” instance is not always the same; it appears to be “selected” at random from run to run). There are multiple Redis instances, all running in standalone mode, so each Redis request is served by the local instance. To demonstrate that the latency is not related to Redis, I completely removed the database access and simulated its latency with a sleep of about 0.1 ms, which resulted in the same strange behaviour.
So, when the operator has more work to do (load simulated with a sleep time of 0.1 ms), there is this unexpectedly high latency on only one of the instances. The question is: why does a single instance show this strange behaviour and not all instances?
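For illustration, the simulated load could look roughly like the sketch below. This is not the exact code of the sample application: the class name `SimulatedLoad` and the helper `busyWaitNanos` are made up for this example, and the busy-wait on `System.nanoTime()` is an assumption, chosen because `Thread.sleep(long)` only accepts whole milliseconds. In the real job, a call like this would sit inside the FlatMap operator in place of the Redis lookup.

```java
import java.util.concurrent.TimeUnit;

// Hypothetical stand-in for the per-record work done in the FlatMap operator.
class SimulatedLoad {

    // Spins on the calling thread until at least `nanos` nanoseconds have
    // elapsed, and returns the time actually spent (in nanoseconds).
    static long busyWaitNanos(long nanos) {
        long start = System.nanoTime();
        while (System.nanoTime() - start < nanos) {
            // intentionally empty: burn CPU to imitate a short blocking call
        }
        return System.nanoTime() - start;
    }

    public static void main(String[] args) {
        long delayNanos = TimeUnit.MICROSECONDS.toNanos(100); // about 0.1 ms per record
        int records = 1_000;                                  // arbitrary number of records

        long start = System.nanoTime();
        for (int i = 0; i < records; i++) {
            busyWaitNanos(delayNanos);
        }
        long totalMicros = TimeUnit.NANOSECONDS.toMicros(System.nanoTime() - start);

        // With a 0.1 ms delay, a single instance cannot process much more than
        // about 10,000 records per second, regardless of the input rate.
        System.out.println("processed " + records + " records in " + totalMicros + " us");
    }
}
```

With this kind of per-record cost, each operator instance has the same upper bound on throughput, which is why I would expect all instances to behave alike.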
Profiling the application with the Flink monitoring mechanism enabled, we see that all instances of the upstream operator are backpressured. However, the input buffer pool usage (and the input exclusive buffer pool usage) on the “bad” node is 100% during the whole run, while on the other instances it is fine (more detailed monitoring data and graphs are in the attached pdf).
There is no skew in the dataset. I also replaced the keyBy with rebalance, which distributes the data round-robin, but the strange behaviour is still there.
I expected all nodes to exhibit similar (either low or high) latency. So the question is: why does only one operator instance exhibit high latency? Is there any chance that there is a starvation problem due to credit-based flow control?
When I remove the keyBy (i.e. remove the data shuffling across nodes) between the operators, the system exhibits the expected behaviour.
I have also attached a pdf with more details about the application and graphs with monitoring data.
I hope someone has an idea about this unexpected behaviour.
Thank you,
Antonis