[DISCUSS] CPU flame graph for a job vertex in web UI.

[DISCUSS] CPU flame graph for a job vertex in web UI.

David Morávek
Hello,

While looking into Flink internals, I've noticed that there is already a
mechanism for stack-trace sampling of a particular job vertex.

I think it would be really useful to let users easily render a CPU
flame graph <http://www.brendangregg.com/flamegraphs.html> in the new UI for a
selected vertex (a new tab next to back pressure) of a running job. The back
pressure tab already gives a good idea of which vertex causes trouble,
but it's hard to say what's actually going on.

I've tried to implement a basic REST endpoint
<https://github.com/dmvk/flink/commit/716231822d2fe99004895cdd0a365560479445b9>
that prepares data for the flame graph rendering, and it seems to
provide good insight.

It should be straightforward to render data from the endpoint in the new UI
using existing <https://github.com/spiermar/d3-flame-graph> JavaScript
libraries.
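
For illustration only, here is a minimal sketch of the kind of aggregation
such an endpoint could perform. It is not the code from the linked commit:
`FlameNode` and `FlameGraphSketch` are hypothetical names, and the only
external assumption is that the renderer accepts a nested
`{name, value, children}` object, which is the input shape d3-flame-graph
documents.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical node type mirroring the {name, value, children} shape. */
final class FlameNode {
    final String name;
    int value; // number of samples whose stack passes through this frame
    final Map<String, FlameNode> children = new LinkedHashMap<>();

    FlameNode(String name) {
        this.name = name;
    }

    /** Serializes the subtree as nested JSON; special characters are not escaped. */
    String toJson() {
        StringBuilder sb = new StringBuilder();
        sb.append("{\"name\":\"").append(name)
          .append("\",\"value\":").append(value)
          .append(",\"children\":[");
        boolean first = true;
        for (FlameNode child : children.values()) {
            if (!first) {
                sb.append(',');
            }
            sb.append(child.toJson());
            first = false;
        }
        return sb.append("]}").toString();
    }
}

final class FlameGraphSketch {
    /** Folds sampled stack traces (innermost frame first, as the JVM reports them) into one tree. */
    static FlameNode build(List<StackTraceElement[]> samples) {
        FlameNode root = new FlameNode("root");
        for (StackTraceElement[] trace : samples) {
            root.value++;
            FlameNode current = root;
            // Walk from the outermost caller down to the innermost frame.
            for (int i = trace.length - 1; i >= 0; i--) {
                String frame = trace[i].getClassName() + "." + trace[i].getMethodName();
                current = current.children.computeIfAbsent(frame, FlameNode::new);
                current.value++;
            }
        }
        return root;
    }

    public static void main(String[] args) {
        // Sample the current thread a few times just to have some input.
        List<StackTraceElement[]> samples = new ArrayList<>();
        for (int i = 0; i < 5; i++) {
            samples.add(Thread.currentThread().getStackTrace());
        }
        System.out.println(build(samples).toJson());
    }
}
```

The width of a frame in the rendered graph then corresponds to `value`, i.e.
the share of samples in which that frame was on the stack.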

WDYT? Is this worth pushing forward?

D.

Re: [DISCUSS] CPU flame graph for a job vertex in web UI.

Till Rohrmann
Hi David,

thanks for starting this discussion. I like the idea of improving insights
into Flink's execution and I believe that a flame graph could be helpful.

I quickly glanced over your changes and I think they go in a good
direction. One idea could be to share the `StackTraceSample` produced by
the `StackTraceSampleCoordinator` between the different
`StackTraceOperatorTracker` instances so that we don't send multiple requests
for the same operators. That way we would reduce the RPC load a bit.
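
To make the sharing idea concrete, here is a rough sketch under stated
assumptions, not Flink's actual API: `SharedSampleRequests` is a hypothetical
helper, `K` stands in for a vertex identifier, `S` for the sample type, and
the `sampler` function is assumed to only trigger an asynchronous request and
return immediately.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.function.Function;

/** Hypothetical helper: several trackers share one in-flight sample request per vertex. */
final class SharedSampleRequests<K, S> {
    private final ConcurrentMap<K, CompletableFuture<S>> inFlight = new ConcurrentHashMap<>();
    private final Function<K, CompletableFuture<S>> sampler;

    /** The sampler must not block; it should only start the request and return its future. */
    SharedSampleRequests(Function<K, CompletableFuture<S>> sampler) {
        this.sampler = sampler;
    }

    /** Returns the pending request for the vertex, or starts a new one if none is pending. */
    CompletableFuture<S> request(K vertexId) {
        CompletableFuture<S> future = inFlight.computeIfAbsent(vertexId, sampler);
        // Forget the request once it completes so that later callers get a fresh sample.
        future.whenComplete((sample, error) -> inFlight.remove(vertexId, future));
        return future;
    }
}
```

With this shape, two trackers asking for the same vertex while a request is
pending receive the same future, so only one request is sent.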

Apart from that, I think the next steps would be to find a committer who
could shepherd this effort and help you with merging it.

Cheers,
Till

On Wed, Jul 31, 2019 at 7:05 PM David Morávek <[hidden email]> wrote:

> Hello,
>
> While looking into Flink internals, I've noticed that there is already a
> mechanism for stack-trace sampling of a particular job vertex.
>
> I think it would be really useful to let users easily render a CPU
> flame graph <http://www.brendangregg.com/flamegraphs.html> in the new UI for a
> selected vertex (a new tab next to back pressure) of a running job. The back
> pressure tab already gives a good idea of which vertex causes trouble,
> but it's hard to say what's actually going on.
>
> I've tried to implement a basic REST endpoint
> <https://github.com/dmvk/flink/commit/716231822d2fe99004895cdd0a365560479445b9>
> that prepares data for the flame graph rendering, and it seems to
> provide good insight.
>
> It should be straightforward to render data from the endpoint in the new UI
> using existing <https://github.com/spiermar/d3-flame-graph> JavaScript
> libraries.
>
> WDYT? Is this worth pushing forward?
>
> D.
>

Re: [DISCUSS] CPU flame graph for a job vertex in web UI.

David Morávek-2
Hi Till, thanks for the feedback! These endpoints are only called when the
vertex is selected in the UI, so there shouldn't be any heavy RPC load. For
back pressure, we only sample the top 3 calls of the stack (depth = 3). For the
flame graph, we want to sample the whole stack trace and we need a different
sampling rate (longer period, more samples). Those are the main reasons to
split these into two "trackers", but I may be missing something.
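
As a rough illustration of the two configurations (a sketch with hypothetical
names, not the actual tracker code): the sampler below takes the number of
samples, the delay between them and a maximum depth as parameters. Only
depth = 3 comes from the description above; the other values in `main` are
made up for the example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class StackSamplerSketch {
    /** Keeps only the top maxDepth frames (innermost first); maxDepth <= 0 keeps the whole trace. */
    static StackTraceElement[] truncate(StackTraceElement[] trace, int maxDepth) {
        if (maxDepth <= 0 || trace.length <= maxDepth) {
            return trace;
        }
        return Arrays.copyOfRange(trace, 0, maxDepth);
    }

    /** Samples the given thread numSamples times, pausing delayMillis between samples. */
    static List<StackTraceElement[]> sample(
            Thread thread, int numSamples, long delayMillis, int maxDepth) {
        List<StackTraceElement[]> samples = new ArrayList<>(numSamples);
        for (int i = 0; i < numSamples; i++) {
            samples.add(truncate(thread.getStackTrace(), maxDepth));
            if (i < numSamples - 1) {
                try {
                    Thread.sleep(delayMillis);
                } catch (InterruptedException e) {
                    // Restore the interrupt flag and return what was collected so far.
                    Thread.currentThread().interrupt();
                    break;
                }
            }
        }
        return samples;
    }

    public static void main(String[] args) {
        Thread current = Thread.currentThread();
        // Back-pressure style: shallow traces (depth = 3); sample count and delay are illustrative.
        List<StackTraceElement[]> shallow = sample(current, 10, 5L, 3);
        // Flame-graph style: whole stack trace, more samples over a longer period (illustrative).
        List<StackTraceElement[]> full = sample(current, 20, 10L, 0);
        System.out.println(shallow.size() + " shallow samples, " + full.size() + " full samples");
    }
}
```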

I've prepared a little demo, so others can have a better idea of what I
have in mind.

https://youtu.be/GUNDehj9z9o

Please note that this is a proof of concept and I'm not a frontend person, so
it may look a little clumsy :)

D.

On Thu, Aug 1, 2019 at 11:40 AM Till Rohrmann <[hidden email]> wrote:

> Hi David,
>
> thanks for starting this discussion. I like the idea of improving insights
> into Flink's execution and I believe that a flame graph could be helpful.
>
> I quickly glanced over your changes and I think they go in a good
> direction. One idea could be to share the `StackTraceSample` produced by
> the `StackTraceSampleCoordinator` between the different
> `StackTraceOperatorTracker` instances so that we don't send multiple requests
> for the same operators. That way we would reduce the RPC load a bit.
>
> Apart from that, I think the next steps would be to find a committer who
> could shepherd this effort and help you with merging it.
>
> Cheers,
> Till

Re: [DISCUSS] CPU flame graph for a job vertex in web UI.

Jark Wu-2
Hi David,

The demo looks charming! I think it will definitely help a lot with
performance tuning.
A big +1 for this.

I cc-ed Yadong, who's one of the main contributors to the new Web UI.
Maybe he can give some help on the front end.

Regards,
Jark

On Fri, 2 Aug 2019 at 04:26, David Morávek <[hidden email]> wrote:

> Hi Till, thanks for the feedback! These endpoints are only called when the
> vertex is selected in the UI, so there shouldn't be any heavy RPC load. For
> back pressure, we only sample the top 3 calls of the stack (depth = 3). For the
> flame graph, we want to sample the whole stack trace and we need a different
> sampling rate (longer period, more samples). Those are the main reasons to
> split these into two "trackers", but I may be missing something.
>
> I've prepared a little demo, so others can have a better idea of what I
> have in mind.
>
> https://youtu.be/GUNDehj9z9o
>
> Please note that this is a proof of concept and I'm not a frontend person, so
> it may look a little clumsy :)
>
> D.

Re: [DISCUSS] CPU flame graph for a job vertex in web UI.

blues zheng
Big +1 for this helpful feature :)


On 08/02/2019 13:54, Jark Wu wrote:
Hi David,

The demo looks charming! I think it will definitely help a lot with
performance tuning.
A big +1 for this.

I cc-ed Yadong, who's one of the main contributors to the new Web UI.
Maybe he can give some help on the front end.

Regards,
Jark

Re: [DISCUSS] CPU flame graph for a job vertex in web UI.

PaulLam
In reply to this post by David Morávek-2
Hi David,

Thanks for the new feature! I think the flame graph would be a useful tool to understand the state of job executions, and it looks good too. +1 for this.

And a minor question: do we plan to support multiple kinds of flame graphs? It would be great if we had both on-CPU and off-CPU flame graphs.

Best,
Paul Lam

> On Aug 2, 2019, at 04:24, David Morávek <[hidden email]> wrote:
>
> Hi Till, thanks for the feedback! These endpoints are only called when the
> vertex is selected in the UI, so there shouldn't be any heavy RPC load. For
> back pressure, we only sample the top 3 calls of the stack (depth = 3). For the
> flame graph, we want to sample the whole stack trace and we need a different
> sampling rate (longer period, more samples). Those are the main reasons to
> split these into two "trackers", but I may be missing something.
>
> I've prepared a little demo, so others can have a better idea of what I
> have in mind.
>
> https://youtu.be/GUNDehj9z9o
>
> Please note that this is a proof of concept and I'm not a frontend person, so
> it may look a little clumsy :)
>
> D.

Re: [DISCUSS] CPU flame graph for a job vertex in web UI.

David Morávek
Hi Paul, for now I only plan to add the one based on Java stack traces.

On Fri, Aug 2, 2019 at 9:34 AM Paul Lam <[hidden email]> wrote:

> Hi David,
>
> Thanks for the new feature! I think the flame graph would be a useful tool
> to understand the state of job executions, and it looks good too. +1 for
> this.
>
> And a minor question: do we plan to support multiple kinds of flame
> graphs? It would be great if we had both on-CPU and off-CPU flame graphs.
>
> Best,
> Paul Lam

Re: [DISCUSS] CPU flame graph for a job vertex in web UI.

David Morávek
I've created FLINK-13550 <https://issues.apache.org/jira/browse/FLINK-13550>
to track the issue.

Is there any committer who'd be willing to "shepherd this effort"? :)

Thanks,
D.

On Fri, Aug 2, 2019 at 10:22 AM David Morávek <[hidden email]> wrote:

> Hi Paul, for now I only plan to add the one based on Java stack traces.