Insufficient number of network buffers after restarting


Insufficient number of network buffers after restarting

Yufei Liu
Hey,
I’ve found that the job throws “java.io.IOException: Insufficient number of network buffers: required 51, but only 1 available” after a job restart, and I’ve observed that the TM uses many more network buffers than before.
My internal branch, which is based on 1.10.0, can easily reproduce this, but 1.12.0 doesn’t have this issue. I think it may already have been fixed by some PR; I’m curious what can lead to this problem?

Best.
YuFei.
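
A note on sizing, since the error itself is about pool exhaustion: per the
Flink documentation, each input gate needs roughly channels x
taskmanager.network.memory.buffers-per-channel (default 2) exclusive
buffers plus taskmanager.network.memory.floating-buffers-per-gate
(default 8) floating ones, so demand grows with the number of channels per
task. A minimal flink-conf.yaml sketch for enlarging the pool on 1.10+
(the values below are illustrative, not taken from this thread):

    # Grow the TaskManager network memory pool (post-FLIP-49 keys).
    taskmanager.memory.network.fraction: 0.2   # share of total Flink memory
    taskmanager.memory.network.min: 128mb      # lower bound of the pool
    taskmanager.memory.network.max: 1gb        # upper bound of the pool

This only widens the headroom, of course; it does not explain why buffer
usage grows after a restart.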

Re: Insufficient number of network buffers after restarting

Yangze Guo
Hi, Yufei.

Just to confirm: you can reproduce this issue in 1.10.0 but not in
1.12.0? The deterministic slot sharing introduced in 1.12.0 is one
possible explanation. Before 1.12.0, the distribution of tasks across
slots was not deterministic, so even if the network buffers were
sufficient from the perspective of the whole cluster, a bad distribution
of tasks could lead to the "insufficient network buffers" error as well.

Best,
Yangze Guo
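
To illustrate Yangze's point: slot sharing decides how many tasks, and
therefore how many network channels, land in a single slot. A minimal
DataStream sketch (the job itself is hypothetical) that separates stages
into explicit slot sharing groups, so network-heavy tasks cannot all pile
into one slot:

    import org.apache.flink.api.common.functions.MapFunction;
    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

    public class SlotSharingSketch {
        public static void main(String[] args) throws Exception {
            StreamExecutionEnvironment env =
                StreamExecutionEnvironment.getExecutionEnvironment();
            env.fromElements(1, 2, 3)
                // Source tasks stay in their own slot sharing group.
                .slotSharingGroup("ingest")
                .map(new MapFunction<Integer, Integer>() {
                    @Override
                    public Integer map(Integer value) {
                        return value * 2;
                    }
                })
                // A different group: these tasks cannot share slots (or be
                // chained) with the sources, so the edge becomes a network
                // exchange, but each slot hosts fewer channel endpoints.
                .slotSharingGroup("compute")
                .print();
            env.execute("slot-sharing-sketch");
        }
    }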


Re: Insufficient number of network buffers after restarting

Till Rohrmann
Hi Yufei,

I cannot remember the exact changes in this area between Flink 1.10.0 and
Flink 1.12.0. It sounds a bit as if we were either not releasing memory
segments fast enough or had a memory leak. One thing to try is increasing
the restart delay to see whether the former is the problem. Alternatively,
you could bisect the commits between these two versions; if you have a
test that fails reliably, this shouldn't take too long. Maybe Piotr knows
of a fix which could have solved this problem.

Cheers,
Till
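
A sketch of Till's first suggestion in flink-conf.yaml terms (the concrete
numbers are placeholders): lengthening the fixed-delay restart strategy
gives the failed tasks more time to hand their memory segments back to the
network buffer pool before the new attempt requests them.

    # Slow down restarts so buffers from the previous attempt are released
    # before the next attempt asks for them.
    restart-strategy: fixed-delay
    restart-strategy.fixed-delay.attempts: 3
    restart-strategy.fixed-delay.delay: 30 s

If the error disappears with a longer delay, buffers were being released
too slowly; if it persists even then, a leak looks more likely.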


Re: Insufficient number of network buffers after restarting

Piotr Nowojski-5
Hi Yufei,

My prime suspect would be the changes to the memory configuration
introduced in 1.11 [1].

Piotrek

[1]
https://ci.apache.org/projects/flink/flink-docs-release-1.11/release-notes/flink-1.11.html#memory-management
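
One concrete thing to diff between the two setups (my suggestion, not
something Piotrek stated) is the effective network pool configuration,
since the TaskManager memory model was reworked across these releases and
the legacy keys are only mapped onto the new model:

    # Legacy keys, as an internal 1.10-era branch might still set them:
    taskmanager.network.memory.fraction: 0.1
    taskmanager.network.memory.min: 64mb
    taskmanager.network.memory.max: 1gb

    # Their replacements in the unified memory model:
    taskmanager.memory.network.fraction: 0.1
    taskmanager.memory.network.min: 64mb
    taskmanager.memory.network.max: 1gb

If the old and new branches resolve these to different pool sizes, that
alone could account for the change in behaviour between versions.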
