Hello,

While looking into Flink internals, I've noticed that there is already a mechanism for stack-trace sampling of a particular job vertex.

I think it would be really useful to let users easily render a CPU flame graph <http://www.brendangregg.com/flamegraphs.html> in the new UI for a selected vertex (a new tab next to back pressure) of a running job. The back pressure tab already gives a good idea of which vertex is causing trouble, but it's hard to say what's actually going on.

I've tried to implement a basic REST endpoint <https://github.com/dmvk/flink/commit/716231822d2fe99004895cdd0a365560479445b9> that prepares data for the flame graph rendering, and it seems to provide good insight.

It should be straightforward to render data from the endpoint in the new UI using existing JavaScript libraries such as <https://github.com/spiermar/d3-flame-graph>.

WDYT? Is this worth pushing forward?

D.

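For readers wondering what "preparing data for the flame graph" could involve, here is a minimal sketch in Java. It is not the code from the linked commit: the names `FlameGraphSketch`, `Node`, and `build` are hypothetical, and it assumes each sample is a `StackTraceElement[]` as returned by `Thread.getStackTrace()`. The samples are merged into a tree of nested name/value/children objects, which is the kind of hierarchical input that d3-flame-graph renders.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch: merges sampled stack traces into a flame graph tree. */
final class FlameGraphSketch {

    /** One frame in the tree; value counts the samples that passed through it. */
    static final class Node {
        final String name;
        int value;
        final Map<String, Node> children = new LinkedHashMap<>();

        Node(String name) {
            this.name = name;
        }

        /**
         * Serializes the subtree as nested name/value/children JSON.
         * Assumes frame names contain no characters that need JSON escaping.
         */
        String toJson() {
            StringBuilder sb = new StringBuilder();
            sb.append("{\"name\":\"").append(name)
              .append("\",\"value\":").append(value)
              .append(",\"children\":[");
            boolean first = true;
            for (Node child : children.values()) {
                if (!first) {
                    sb.append(',');
                }
                sb.append(child.toJson());
                first = false;
            }
            return sb.append("]}").toString();
        }
    }

    /** Builds the tree; every sample adds 1 to each node on its call path. */
    static Node build(List<StackTraceElement[]> samples) {
        Node root = new Node("root");
        for (StackTraceElement[] trace : samples) {
            root.value++;
            Node current = root;
            // getStackTrace() lists the innermost frame first, so walk the
            // array backwards to descend from the outermost caller.
            for (int i = trace.length - 1; i >= 0; i--) {
                String frame = trace[i].getClassName() + "." + trace[i].getMethodName();
                current = current.children.computeIfAbsent(frame, Node::new);
                current.value++;
            }
        }
        return root;
    }

    public static void main(String[] args) {
        // Sample the current thread a few times just to have some input.
        List<StackTraceElement[]> samples = new ArrayList<>();
        for (int i = 0; i < 5; i++) {
            samples.add(Thread.currentThread().getStackTrace());
        }
        System.out.println(build(samples).toJson());
    }
}
```

The width of a frame in the rendered graph is proportional to its `value`, so wide frames are the ones that showed up in many samples.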
Hi David,

thanks for starting this discussion. I like the idea of improving insights into Flink's execution and I believe that a flame graph could be helpful.

I quickly glanced over your changes and I think they go in a good direction. One idea could be to share the `StackTraceSample` produced by the `StackTraceSampleCoordinator` between the different `StackTraceOperatorTracker` instances so that we don't send multiple requests for the same operators. That way we would reduce the RPC load a bit.

Apart from that, I think the next steps would be to find a committer who could shepherd this effort and help you with merging it.

Cheers,
Till

On Wed, Jul 31, 2019 at 7:05 PM David Morávek <[hidden email]> wrote:
> I've tried to implement a basic REST endpoint
> <https://github.com/dmvk/flink/commit/716231822d2fe99004895cdd0a365560479445b9>,
> that prepares data for the flame graph rendering and it seems to be
> providing good insight.

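The sharing idea in this reply can be sketched as a small cache in front of the sampling request. This is only an illustration under assumptions: `SharedSampleCacheSketch`, `getOrRequest`, and `invalidate` are made-up names, the sample type is left generic, and the function passed to the constructor stands in for whatever actually triggers the RPC. None of this is Flink's real `StackTraceSampleCoordinator` API.

```java
import java.util.Map;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

/**
 * Hypothetical sketch: several consumers ask for the sample of the same
 * vertex, but only the first call per vertex triggers a request.
 */
final class SharedSampleCacheSketch<S> {

    // One future per vertex id; later callers reuse it instead of re-requesting.
    private final Map<String, CompletableFuture<S>> samples = new ConcurrentHashMap<>();

    // Stand-in for the code that would send the sampling request.
    private final Function<String, CompletableFuture<S>> requestSample;

    SharedSampleCacheSketch(Function<String, CompletableFuture<S>> requestSample) {
        this.requestSample = requestSample;
    }

    /** Returns the cached future for the vertex, requesting a sample only once. */
    CompletableFuture<S> getOrRequest(String vertexId) {
        return samples.computeIfAbsent(vertexId, requestSample);
    }

    /** Drops the cached sample so that the next call requests a fresh one. */
    void invalidate(String vertexId) {
        samples.remove(vertexId);
    }

    public static void main(String[] args) {
        SharedSampleCacheSketch<String> cache = new SharedSampleCacheSketch<>(
            vertexId -> CompletableFuture.supplyAsync(() -> "sample for " + vertexId));

        // Both callers receive the same future, so only one request is made.
        CompletableFuture<String> forBackPressure = cache.getOrRequest("vertex-1");
        CompletableFuture<String> forFlameGraph = cache.getOrRequest("vertex-1");
        System.out.println(forBackPressure == forFlameGraph);
        System.out.println(forFlameGraph.join());
    }
}
```

A shared sample only works if one set of sampling parameters suits every consumer; otherwise the cache key would also need to include those parameters.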
Hi Till, thanks for the feedback! These endpoints are only called when the vertex is selected in the UI, so there shouldn't be any heavy RPC load. For back pressure, we only sample the top 3 calls of the stack (depth = 3). For the flame graph, we want to sample the whole stack trace and we need a different sampling rate (longer period, more samples). Those are the main reasons to split these into two "trackers", but I may be missing something.

I've prepared a little demo, so others can get a better idea of what I have in mind.

https://youtu.be/GUNDehj9z9o

Please note that this is a proof of concept and I'm not a frontend person, so it may look a little clumsy :)

D.

On Thu, Aug 1, 2019 at 11:40 AM Till Rohrmann <[hidden email]> wrote:
> One idea could be to share the `StackTraceSample` produced by
> the `StackTraceSampleCoordinator` between the different
> `StackTraceOperatorTracker` so that we don't send multiple requests for the
> same operators. That way we would decrease a bit the RPC load.

Hi David,

The demo looks charming! I think it will definitely help a lot with performance tuning. A big +1 for this.

I cc-ed Yadong, who's one of the main contributors to the new Web UI. Maybe he can give some help on the front end.

Regards,
Jark

On Fri, 2 Aug 2019 at 04:26, David Morávek <[hidden email]> wrote:
> I've prepared a little demo, so others can have a better idea of what I
> have in mind.
>
> https://youtu.be/GUNDehj9z9o

Big +1 for this helpful feature :)

On 08/02/2019 13:54, Jark Wu wrote:
> The demo looks charming! I think it will definitely help a lot when
> performance tuning. A big +1 for this.

Hi David,

Thanks for the new feature! I think the flame graph would be a useful tool to understand the state of job executions, and it looks good too. +1 for this.

And a minor question: do we plan to support multiple kinds of flame graphs? It would be great if we had both on-CPU and off-CPU flame graphs.

Best,
Paul Lam

On Aug 2, 2019, at 04:24, David Morávek <[hidden email]> wrote:
> I've prepared a little demo, so others can have a better idea of what I
> have in mind.
>
> https://youtu.be/GUNDehj9z9o

Hi Paul, for now I only plan to add the one based on Java stack traces.

On Fri, Aug 2, 2019 at 9:34 AM Paul Lam <[hidden email]> wrote:
> And a minor question: do we plan to support multiple kinds of flame
> graphs? It would be great if we have both on-cpu and off-cpu flame graphs.

I've created FLINK-13550 <https://issues.apache.org/jira/browse/FLINK-13550> to track the issue.

Is there any committer who'd be willing to "shepherd this effort"? :)

Thanks,
D.

On Fri, Aug 2, 2019 at 10:22 AM David Morávek <[hidden email]> wrote:
> Hi Paul, for now I only plan to add the one based on java stack traces.