NioEventLoop consumes most of the CPU

Kruse, Sebastian
Hi everyone,

Every time I run jvisualvm on one of the machines in our cluster during a Flink job, I see that NioEventLoop.select() takes 50% to 70% of CPU self-time. I wonder how severe this is. It might be busy-waiting time that could not be used otherwise, but I wanted to ask whether you have also seen this and/or know its cause.

Cheers,
Sebastian

Re: NioEventLoop consumes most of the CPU

Stephan Ewen
Hi!

That does not sound right, I agree. Can you tell us a bit more?

- What version of Flink are you using?

- I assume the NIO loop is executed by a Netty thread. Can you tell us
whether it is from an "io.netty.*" thread or an "org.jboss.netty.*" thread?
The former is from Flink's data network thread, the latter from Akka.

- Is your job data heavy (data transfer is in progress most of the time), or
is it compute heavy (network is not fully utilized)?
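
A minimal sketch of how the Netty origin of a thread could be checked, assuming it is enough to look for the two package prefixes named above in a thread's stack frames. The class name NettyThreadOrigin and the method classify are made up for illustration, and Thread.getAllStackTraces() only sees the JVM the code runs in, so for a remote TaskManager the same prefix check would have to be applied to a thread dump instead.

```java
import java.util.Map;
import java.util.TreeMap;

public class NettyThreadOrigin {

    /** Labels a stack trace by the first Netty package prefix found in its frames. */
    static String classify(StackTraceElement[] stack) {
        for (StackTraceElement frame : stack) {
            String cls = frame.getClassName();
            if (cls.startsWith("io.netty.")) {
                return "io.netty (Flink data network)";
            }
            if (cls.startsWith("org.jboss.netty.")) {
                return "org.jboss.netty (Akka)";
            }
        }
        return "other";
    }

    public static void main(String[] args) {
        // Sort by thread name so the listing is stable between runs.
        Map<String, String> originByThread = new TreeMap<>();
        for (Map.Entry<Thread, StackTraceElement[]> e : Thread.getAllStackTraces().entrySet()) {
            originByThread.put(e.getKey().getName(), classify(e.getValue()));
        }
        originByThread.forEach((name, origin) -> System.out.println(name + " -> " + origin));
    }
}
```

The mapping of prefixes to Flink and Akka is taken from the question above; everything else is an assumption of this sketch.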

Thanks for your help!
Stephan

Re: NioEventLoop consumes most of the CPU

Ufuk Celebi-2
I agree with Stephan's points. Thanks for reporting; let's investigate
this further.

One thing to keep in mind: I think VisualVM uses hprof for CPU sampling, which
has some known issues
(http://www.brendangregg.com/blog/2014-06-09/java-cpu-sampling-using-hprof.html).
For one thing, it profiles Java's RUNNABLE state, which does not
necessarily correspond to a running thread (in OS terms) consuming CPU. The
select call (like epollWait()) keeps the thread in this state.
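
The RUNNABLE point can be tried out in isolation. The following is a minimal sketch, not Netty's or Flink's code: it parks a thread in a plain java.nio Selector.select() with no channels registered, then reads the thread's JVM state and its CPU time via ThreadMXBean. The class IdleSelectDemo, its Observation holder and the method observe are hypothetical names, and the CPU figure is reported as -1 where the JVM does not support or enable per-thread CPU time.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadMXBean;
import java.nio.channels.Selector;

public class IdleSelectDemo {

    /** What was observed while a thread sat inside Selector.select(). */
    static final class Observation {
        final Thread.State state; // state reported by the JVM
        final long cpuMillis;     // CPU time consumed by the thread, -1 if unavailable
        final long wallMillis;    // wall-clock time the thread spent waiting

        Observation(Thread.State state, long cpuMillis, long wallMillis) {
            this.state = state;
            this.cpuMillis = cpuMillis;
            this.wallMillis = wallMillis;
        }
    }

    static Observation observe(long waitMillis) throws Exception {
        ThreadMXBean mx = ManagementFactory.getThreadMXBean();
        try (Selector selector = Selector.open()) {
            Thread waiter = new Thread(() -> {
                try {
                    selector.select(); // nothing registered: blocks until wakeup()
                } catch (Exception e) {
                    throw new RuntimeException(e);
                }
            }, "idle-select");
            waiter.start();

            long start = System.nanoTime();
            Thread.sleep(waitMillis); // give the thread time to block in select()
            Thread.State state = waiter.getState();
            long cpuNanos = mx.isThreadCpuTimeSupported()
                    ? mx.getThreadCpuTime(waiter.getId()) : -1L;
            long wallNanos = System.nanoTime() - start;

            selector.wakeup(); // release the blocked thread
            waiter.join();
            long cpuMillis = cpuNanos < 0 ? -1L : cpuNanos / 1_000_000;
            return new Observation(state, cpuMillis, wallNanos / 1_000_000);
        }
    }

    public static void main(String[] args) throws Exception {
        Observation o = observe(1000);
        System.out.println("state=" + o.state
                + " cpuMs=" + o.cpuMillis + " wallMs=" + o.wallMillis);
    }
}
```

If the reported state is RUNNABLE while the CPU time stays far below the wall-clock time, a sampler that counts RUNNABLE threads as busy would overstate the cost of select() in the same way.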



Re: NioEventLoop consumes most of the CPU

Stephan Ewen
Ufuk has a good point.
The NIO epoll wait method leaves the thread in state RUNNABLE. That may
explain things.

Still, it would be good to have more information on your setup.

Stephan

RE: NioEventLoop consumes most of the CPU

Kruse, Sebastian
Hi,

Thanks for your answers. I see that this might be an artifact of the profiler, although I cannot see the epoll function in the stack trace. NioEventLoop does not seem to use this function but rather to work around it (cf. NioEventLoop:220). Are you using any other profiling tools that you could recommend? Then I could tell you whether they confirm this issue.

To answer some of your questions regarding my setup:
I observed this behavior in both 0.8.1 and the current master.

The exact method is io.netty.channel.nio.NioEventLoop.select(). Interestingly, VisualVM displays SortMergerReading threads as the most time-consuming ones.

I would say that my job is rather data-heavy as I am trying to keep my UDFs as efficient as possible. ;) However, the CPUs of the slaves are fully loaded and - if I can trust VisualVM - most of it is for networking and serialization.

Cheers,
Sebastian
