(DEPRECATED) Apache Flink Mailing List archive.

[DISCUSS] FLIP-165: Operator's Flame Graphs

Classic

List

Threaded

13 messages Options

Alexander Fedulov

[DISCUSS] FLIP-165: Operator's Flame Graphs

Hi All,

I would like to start a discussion for FLIP-165: Operator's Flame Graphs [1]

A Flame Graph [2] is a visualization that is very effective for providing
answers to the questions like:
- Which methods are currently consuming CPU resources?
- How CPU utilization by one method compares to the others?
- Which series of calls on the stack led to executing a particular method?

I have already opened a PR [3] that represents the implementation approach
proposed in the FLIP. It supports both on-CPU and off-CPU [4] Flame Graphs.

Looking forward to your feedback.

P.S: I would like to give kudos to David Moravek for his prototyping work
[5] on this feature. Although the proposed implementation significantly
diverges from his prototype on the Flink side, the work done on connecting
the d3-flame-graph library to the right data structure retrieved from Flink
was instrumental for enabling this feature.

[1]
https://cwiki.apache.org/confluence/display/FLINK/FLIP-165%3A+Operator%27s+Flame+Graphs
[2] http://www.brendangregg.com/flamegraphs.html
[3] https://github.com/apache/flink/pull/15054
[4] http://www.brendangregg.com/FlameGraphs/offcpuflamegraphs.html
[5]
https://issues.apache.org/jira/browse/FLINK-13550?focusedCommentId=17083026&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-17083026

Best,

--

Alexander Fedulov | Solutions Architect

<https://www.ververica.com/>

Follow us @VervericaData

--

Join Flink Forward <https://flink-forward.org/> - The Apache Flink
Conference

Stream Processing | Event Driven | Real Time

--

Ververica GmbH | Invalidenstrasse 115, 10115 Berlin, Germany

--

Ververica GmbH
Registered at Amtsgericht Charlottenburg: HRB 158244 B
Managing Directors: Yip Park Tung Jason, Jinwei (Kevin) Zhang, Karl Anton
Wehner

Till Rohrmann

Re: [DISCUSS] FLIP-165: Operator's Flame Graphs

Thanks for starting this discussion Alex. I really like this feature
because it gives better insights into what a Flink job is doing.

Quick question on the TaskExecutorGateway extension:

* What is the requestId used for in the RPC call?
* Would it make sense to group numSubSamples, delayBetweenSamples and
maxStackTraceDepth into a ThreadSamplesRequest class? This would decrease
the number of parameters and group those which are closely related together.

Cheers,
Till

On Tue, Mar 2, 2021 at 12:45 AM Alexander Fedulov <[hidden email]>
wrote:

> Hi All,
>
> I would like to start a discussion for FLIP-165: Operator's Flame Graphs
> [1]
>
> A Flame Graph [2] is a visualization that is very effective for providing
> answers to the questions like:
> - Which methods are currently consuming CPU resources?
> - How CPU utilization by one method compares to the others?
> - Which series of calls on the stack led to executing a particular method?
>
> I have already opened a PR [3] that represents the implementation approach
> proposed in the FLIP. It supports both on-CPU and off-CPU [4] Flame Graphs.
>
> Looking forward to your feedback.
>
> P.S: I would like to give kudos to David Moravek for his prototyping work
> [5] on this feature. Although the proposed implementation significantly
> diverges from his prototype on the Flink side, the work done on connecting
> the d3-flame-graph library to the right data structure retrieved from Flink
> was instrumental for enabling this feature.
>
> [1]
>
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-165%3A+Operator%27s+Flame+Graphs
> [2] http://www.brendangregg.com/flamegraphs.html
> [3] https://github.com/apache/flink/pull/15054
> [4] http://www.brendangregg.com/FlameGraphs/offcpuflamegraphs.html
> [5]
>
> https://issues.apache.org/jira/browse/FLINK-13550?focusedCommentId=17083026&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-17083026
>
> Best,
>
> --
>
> Alexander Fedulov | Solutions Architect
>
> <https://www.ververica.com/>
>
> Follow us @VervericaData
>
> --
>
> Join Flink Forward <https://flink-forward.org/> - The Apache Flink
> Conference
>
> Stream Processing | Event Driven | Real Time
>
> --
>
> Ververica GmbH | Invalidenstrasse 115, 10115 Berlin, Germany
>
> --
>
> Ververica GmbH
> Registered at Amtsgericht Charlottenburg: HRB 158244 B
> Managing Directors: Yip Park Tung Jason, Jinwei (Kevin) Zhang, Karl Anton
> Wehner
>

Alexander Fedulov

Re: [DISCUSS] FLIP-165: Operator's Flame Graphs

Hi Till,

Thanks for your comments.

* What is the requestId used for in the RPC call?

It is the handle that is used as the key in the
ThreadInfoRequestCoordinator's pending responses Map. I believe it was
called sampleId in the StackTraceSampleCoordinator, but I decided to rename
it because there is also a ThreadInfoSampleService which is actually
responsible for sampling the JVM numSamples number of times. I found that
the notion of what a sample is was a bit confusing. Now one thread info
request corresponds to gathering numSamples from a corresponding Task. Hope
that makes sense.

* Would it make sense to group numSubSamples, delayBetweenSamples and
maxStackTraceDepth into a ThreadSamplesRequest class? This would decrease
the number of parameters and group those which are closely related together.

Good point. I will rework it accordingly.

Best,
--

Alexander Fedulov | Solutions Architect
Follow us @VervericaData

--
Sent from: http://apache-flink-mailing-list-archive.1008284.n3.nabble.com/

Till Rohrmann

Re: [DISCUSS] FLIP-165: Operator's Flame Graphs

Why does the caller of TaskExecutorGateway.requestThreadInfoSamples need to
specify the request id? Is it because the caller can send a second request
with the same id? Or can the caller query the result of a previous request
by specifying the requestId?

If the TaskExecutor does not need to know about the id, then we might be
able to drop it.

Cheers
Till

On Tue, Mar 2, 2021 at 9:42 AM Alexander Fedulov <[hidden email]>
wrote:

> Hi Till,
>
> Thanks for your comments.
>
> * What is the requestId used for in the RPC call?
>
> It is the handle that is used as the key in the
> ThreadInfoRequestCoordinator's pending responses Map. I believe it was
> called sampleId in the StackTraceSampleCoordinator, but I decided to rename
> it because there is also a ThreadInfoSampleService which is actually
> responsible for sampling the JVM numSamples number of times. I found that
> the notion of what a sample is was a bit confusing. Now one thread info
> request corresponds to gathering numSamples from a corresponding Task. Hope
> that makes sense.
>
> * Would it make sense to group numSubSamples, delayBetweenSamples and
> maxStackTraceDepth into a ThreadSamplesRequest class? This would decrease
> the number of parameters and group those which are closely related
> together.
>
> Good point. I will rework it accordingly.
>
> Best,
> --
>
> Alexander Fedulov | Solutions Architect
> Follow us @VervericaData
>
>
>
> --
> Sent from: http://apache-flink-mailing-list-archive.1008284.n3.nabble.com/
>

Alexander Fedulov

Re: [DISCUSS] FLIP-165: Operator's Flame Graphs

It is passed back as part of the response to the asynchronous callback
within the coordinator and is used to decide if all outstanding requests to
the parallel instances of a particular operator returned successfully. If
so, the request is considered successful, sub-results are combined and the
thread info result future for an operator completes.

https://github.com/apache/flink/pull/15054/commits/281188a025077849efd630f1f7aa801ff79a9afd#diff-20a1c89043e8d480e7af6dd36596b3558be9c6e64f6f4cf065df97fe76411c50R150

https://github.com/apache/flink/pull/15054/commits/281188a025077849efd630f1f7aa801ff79a9afd#diff-20a1c89043e8d480e7af6dd36596b3558be9c6e64f6f4cf065df97fe76411c50R277

Best,

--

Alexander Fedulov | Solutions Architect

<https://www.ververica.com/>

Follow us @VervericaData

On Tue, Mar 2, 2021 at 12:28 PM Till Rohrmann <[hidden email]> wrote:

> Why does the caller of TaskExecutorGateway.requestThreadInfoSamples need to
> specify the request id? Is it because the caller can send a second request
> with the same id? Or can the caller query the result of a previous request
> by specifying the requestId?
>
> If the TaskExecutor does not need to know about the id, then we might be
> able to drop it.
>
> Cheers
> Till
>
> On Tue, Mar 2, 2021 at 9:42 AM Alexander Fedulov <[hidden email]>
> wrote:
>
> > Hi Till,
> >
> > Thanks for your comments.
> >
> > * What is the requestId used for in the RPC call?
> >
> > It is the handle that is used as the key in the
> > ThreadInfoRequestCoordinator's pending responses Map. I believe it was
> > called sampleId in the StackTraceSampleCoordinator, but I decided to
> rename
> > it because there is also a ThreadInfoSampleService which is actually
> > responsible for sampling the JVM numSamples number of times. I found that
> > the notion of what a sample is was a bit confusing. Now one thread info
> > request corresponds to gathering numSamples from a corresponding Task.
> Hope
> > that makes sense.
> >
> > * Would it make sense to group numSubSamples, delayBetweenSamples and
> > maxStackTraceDepth into a ThreadSamplesRequest class? This would decrease
> > the number of parameters and group those which are closely related
> > together.
> >
> > Good point. I will rework it accordingly.
> >
> > Best,
> > --
> >
> > Alexander Fedulov | Solutions Architect
> > Follow us @VervericaData
> >
> >
> >
> > --
> > Sent from:
> http://apache-flink-mailing-list-archive.1008284.n3.nabble.com/
> >
>

Till Rohrmann

Re: [DISCUSS] FLIP-165: Operator's Flame Graphs

Ah ok. Thanks for the clarification Alex.

Cheers,
Till

On Tue, Mar 2, 2021 at 2:02 PM Alexander Fedulov <[hidden email]>
wrote:

> It is passed back as part of the response to the asynchronous callback
> within the coordinator and is used to decide if all outstanding requests to
> the parallel instances of a particular operator returned successfully. If
> so, the request is considered successful, sub-results are combined and the
> thread info result future for an operator completes.
>
>
> https://github.com/apache/flink/pull/15054/commits/281188a025077849efd630f1f7aa801ff79a9afd#diff-20a1c89043e8d480e7af6dd36596b3558be9c6e64f6f4cf065df97fe76411c50R150
>
>
> https://github.com/apache/flink/pull/15054/commits/281188a025077849efd630f1f7aa801ff79a9afd#diff-20a1c89043e8d480e7af6dd36596b3558be9c6e64f6f4cf065df97fe76411c50R277
>
> Best,
>
> --
>
> Alexander Fedulov | Solutions Architect
>
> <https://www.ververica.com/>
>
> Follow us @VervericaData
>
>
> On Tue, Mar 2, 2021 at 12:28 PM Till Rohrmann <[hidden email]>
> wrote:
>
> > Why does the caller of TaskExecutorGateway.requestThreadInfoSamples need
> to
> > specify the request id? Is it because the caller can send a second
> request
> > with the same id? Or can the caller query the result of a previous
> request
> > by specifying the requestId?
> >
> > If the TaskExecutor does not need to know about the id, then we might be
> > able to drop it.
> >
> > Cheers
> > Till
> >
> > On Tue, Mar 2, 2021 at 9:42 AM Alexander Fedulov <
> [hidden email]>
> > wrote:
> >
> > > Hi Till,
> > >
> > > Thanks for your comments.
> > >
> > > * What is the requestId used for in the RPC call?
> > >
> > > It is the handle that is used as the key in the
> > > ThreadInfoRequestCoordinator's pending responses Map. I believe it was
> > > called sampleId in the StackTraceSampleCoordinator, but I decided to
> > rename
> > > it because there is also a ThreadInfoSampleService which is actually
> > > responsible for sampling the JVM numSamples number of times. I found
> that
> > > the notion of what a sample is was a bit confusing. Now one thread info
> > > request corresponds to gathering numSamples from a corresponding Task.
> > Hope
> > > that makes sense.
> > >
> > > * Would it make sense to group numSubSamples, delayBetweenSamples and
> > > maxStackTraceDepth into a ThreadSamplesRequest class? This would
> decrease
> > > the number of parameters and group those which are closely related
> > > together.
> > >
> > > Good point. I will rework it accordingly.
> > >
> > > Best,
> > > --
> > >
> > > Alexander Fedulov | Solutions Architect
> > > Follow us @VervericaData
> > >
> > >
> > >
> > > --
> > > Sent from:
> > http://apache-flink-mailing-list-archive.1008284.n3.nabble.com/
> > >
> >
>

Piotr Nowojski-5

Re: [DISCUSS] FLIP-165: Operator's Flame Graphs

Nice feature +1 from my side for it.

In the PR I think we are missing documentation. I think it's especially
important to mention the limitations of this approach for performance
analysis. If we make it easy for the user to get such kind of data, it's
important they do not proverbially shoot themselves in their own foot with
false conclusions. We should clearly mention how those data are sampled,
and point to limitations such as:
- data from all threads/operators are squashed together, so if there is a
data skew it will be averaged out
- stack sampling is/can be biased (JVM threads are more likely to be
stopped in some places than others, while skipping/rarely stopping in the
true hot spots - so one should treat the results with a grain of salt below
a certain level)
- true bottleneck might be obscured by the fact flame graphs are squashing
results from all of the threads. There can be 60% of time spent in one
component according to a flame graph, but it might not necessarily be the
bottleneck, which could be in a completely different component running
which has a single thread burning 100% of a single CPU core, barely visible
in the flame graph at all.

It's great to have such a nice tool readily and easily available, but we
need to make sure people who are using it are aware when it can be
misleading.

Piotrek

wt., 2 mar 2021 o 15:12 Till Rohrmann <[hidden email]> napisał(a):

> Ah ok. Thanks for the clarification Alex.
>
> Cheers,
> Till
>
> On Tue, Mar 2, 2021 at 2:02 PM Alexander Fedulov <[hidden email]>
> wrote:
>
> > It is passed back as part of the response to the asynchronous callback
> > within the coordinator and is used to decide if all outstanding requests
> to
> > the parallel instances of a particular operator returned successfully. If
> > so, the request is considered successful, sub-results are combined and
> the
> > thread info result future for an operator completes.
> >
> >
> >
> https://github.com/apache/flink/pull/15054/commits/281188a025077849efd630f1f7aa801ff79a9afd#diff-20a1c89043e8d480e7af6dd36596b3558be9c6e64f6f4cf065df97fe76411c50R150
> >
> >
> >
> https://github.com/apache/flink/pull/15054/commits/281188a025077849efd630f1f7aa801ff79a9afd#diff-20a1c89043e8d480e7af6dd36596b3558be9c6e64f6f4cf065df97fe76411c50R277
> >
> > Best,
> >
> > --
> >
> > Alexander Fedulov | Solutions Architect
> >
> > <https://www.ververica.com/>
> >
> > Follow us @VervericaData
> >
> >
> > On Tue, Mar 2, 2021 at 12:28 PM Till Rohrmann <[hidden email]>
> > wrote:
> >
> > > Why does the caller of TaskExecutorGateway.requestThreadInfoSamples
> need
> > to
> > > specify the request id? Is it because the caller can send a second
> > request
> > > with the same id? Or can the caller query the result of a previous
> > request
> > > by specifying the requestId?
> > >
> > > If the TaskExecutor does not need to know about the id, then we might
> be
> > > able to drop it.
> > >
> > > Cheers
> > > Till
> > >
> > > On Tue, Mar 2, 2021 at 9:42 AM Alexander Fedulov <
> > [hidden email]>
> > > wrote:
> > >
> > > > Hi Till,
> > > >
> > > > Thanks for your comments.
> > > >
> > > > * What is the requestId used for in the RPC call?
> > > >
> > > > It is the handle that is used as the key in the
> > > > ThreadInfoRequestCoordinator's pending responses Map. I believe it
> was
> > > > called sampleId in the StackTraceSampleCoordinator, but I decided to
> > > rename
> > > > it because there is also a ThreadInfoSampleService which is actually
> > > > responsible for sampling the JVM numSamples number of times. I found
> > that
> > > > the notion of what a sample is was a bit confusing. Now one thread
> info
> > > > request corresponds to gathering numSamples from a corresponding
> Task.
> > > Hope
> > > > that makes sense.
> > > >
> > > > * Would it make sense to group numSubSamples, delayBetweenSamples and
> > > > maxStackTraceDepth into a ThreadSamplesRequest class? This would
> > decrease
> > > > the number of parameters and group those which are closely related
> > > > together.
> > > >
> > > > Good point. I will rework it accordingly.
> > > >
> > > > Best,
> > > > --
> > > >
> > > > Alexander Fedulov | Solutions Architect
> > > > Follow us @VervericaData
> > > >
> > > >
> > > >
> > > > --
> > > > Sent from:
> > > http://apache-flink-mailing-list-archive.1008284.n3.nabble.com/
> > > >
> > >
> >
>

Alexander Fedulov

Re: [DISCUSS] FLIP-165: Operator's Flame Graphs

Hi Piotr,

Thanks for the comments - all valid points.
We should definitely document how the Flame Graphs are constructed - I will
work on the docs. Do you have a proposition about the part of which
page/section they should become?
I would like to also mention here that I plan to work on further
improvements, such as the ability to "zoom in" into the Flame Graphs for
the individual Tasks in the "unsquashed" form, so some of those concerns
should be mitigated in the future.

Best,

--

Alexander Fedulov | Solutions Architect

<https://www.ververica.com/>

Follow us @VervericaData

On Tue, Mar 2, 2021 at 3:17 PM Piotr Nowojski <[hidden email]> wrote:

> Nice feature +1 from my side for it.
>
> In the PR I think we are missing documentation. I think it's especially
> important to mention the limitations of this approach for performance
> analysis. If we make it easy for the user to get such kind of data, it's
> important they do not proverbially shoot themselves in their own foot with
> false conclusions. We should clearly mention how those data are sampled,
> and point to limitations such as:
> - data from all threads/operators are squashed together, so if there is a
> data skew it will be averaged out
> - stack sampling is/can be biased (JVM threads are more likely to be
> stopped in some places than others, while skipping/rarely stopping in the
> true hot spots - so one should treat the results with a grain of salt below
> a certain level)
> - true bottleneck might be obscured by the fact flame graphs are squashing
> results from all of the threads. There can be 60% of time spent in one
> component according to a flame graph, but it might not necessarily be the
> bottleneck, which could be in a completely different component running
> which has a single thread burning 100% of a single CPU core, barely visible
> in the flame graph at all.
>
> It's great to have such a nice tool readily and easily available, but we
> need to make sure people who are using it are aware when it can be
> misleading.
>
> Piotrek
>
> wt., 2 mar 2021 o 15:12 Till Rohrmann <[hidden email]> napisał(a):
>
> > Ah ok. Thanks for the clarification Alex.
> >
> > Cheers,
> > Till
> >
> > On Tue, Mar 2, 2021 at 2:02 PM Alexander Fedulov <
> [hidden email]>
> > wrote:
> >
> > > It is passed back as part of the response to the asynchronous callback
> > > within the coordinator and is used to decide if all outstanding
> requests
> > to
> > > the parallel instances of a particular operator returned successfully.
> If
> > > so, the request is considered successful, sub-results are combined and
> > the
> > > thread info result future for an operator completes.
> > >
> > >
> > >
> >
> https://github.com/apache/flink/pull/15054/commits/281188a025077849efd630f1f7aa801ff79a9afd#diff-20a1c89043e8d480e7af6dd36596b3558be9c6e64f6f4cf065df97fe76411c50R150
> > >
> > >
> > >
> >
> https://github.com/apache/flink/pull/15054/commits/281188a025077849efd630f1f7aa801ff79a9afd#diff-20a1c89043e8d480e7af6dd36596b3558be9c6e64f6f4cf065df97fe76411c50R277
> > >
> > > Best,
> > >
> > > --
> > >
> > > Alexander Fedulov | Solutions Architect
> > >
> > > <https://www.ververica.com/>
> > >
> > > Follow us @VervericaData
> > >
> > >
> > > On Tue, Mar 2, 2021 at 12:28 PM Till Rohrmann <[hidden email]>
> > > wrote:
> > >
> > > > Why does the caller of TaskExecutorGateway.requestThreadInfoSamples
> > need
> > > to
> > > > specify the request id? Is it because the caller can send a second
> > > request
> > > > with the same id? Or can the caller query the result of a previous
> > > request
> > > > by specifying the requestId?
> > > >
> > > > If the TaskExecutor does not need to know about the id, then we might
> > be
> > > > able to drop it.
> > > >
> > > > Cheers
> > > > Till
> > > >
> > > > On Tue, Mar 2, 2021 at 9:42 AM Alexander Fedulov <
> > > [hidden email]>
> > > > wrote:
> > > >
> > > > > Hi Till,
> > > > >
> > > > > Thanks for your comments.
> > > > >
> > > > > * What is the requestId used for in the RPC call?
> > > > >
> > > > > It is the handle that is used as the key in the
> > > > > ThreadInfoRequestCoordinator's pending responses Map. I believe it
> > was
> > > > > called sampleId in the StackTraceSampleCoordinator, but I decided
> to
> > > > rename
> > > > > it because there is also a ThreadInfoSampleService which is
> actually
> > > > > responsible for sampling the JVM numSamples number of times. I
> found
> > > that
> > > > > the notion of what a sample is was a bit confusing. Now one thread
> > info
> > > > > request corresponds to gathering numSamples from a corresponding
> > Task.
> > > > Hope
> > > > > that makes sense.
> > > > >
> > > > > * Would it make sense to group numSubSamples, delayBetweenSamples
> and
> > > > > maxStackTraceDepth into a ThreadSamplesRequest class? This would
> > > decrease
> > > > > the number of parameters and group those which are closely related
> > > > > together.
> > > > >
> > > > > Good point. I will rework it accordingly.
> > > > >
> > > > > Best,
> > > > > --
> > > > >
> > > > > Alexander Fedulov | Solutions Architect
> > > > > Follow us @VervericaData
> > > > >
> > > > >
> > > > >
> > > > > --
> > > > > Sent from:
> > > > http://apache-flink-mailing-list-archive.1008284.n3.nabble.com/
> > > > >
> > > >
> > >
> >
>

Seth Wiesman-4

Re: [DISCUSS] FLIP-165: Operator's Flame Graphs

Cool feature +1

There is a subsection called monitoring in the operations section of the
docs. It would fit nicely there.

Seth

On Tue, Mar 2, 2021 at 4:23 PM Alexander Fedulov <[hidden email]>
wrote:

> Hi Piotr,
>
> Thanks for the comments - all valid points.
> We should definitely document how the Flame Graphs are constructed - I will
> work on the docs. Do you have a proposition about the part of which
> page/section they should become?
> I would like to also mention here that I plan to work on further
> improvements, such as the ability to "zoom in" into the Flame Graphs for
> the individual Tasks in the "unsquashed" form, so some of those concerns
> should be mitigated in the future.
>
> Best,
>
> --
>
> Alexander Fedulov | Solutions Architect
>
> <https://www.ververica.com/>
>
> Follow us @VervericaData
>
>
> On Tue, Mar 2, 2021 at 3:17 PM Piotr Nowojski <[hidden email]>
> wrote:
>
> > Nice feature +1 from my side for it.
> >
> > In the PR I think we are missing documentation. I think it's especially
> > important to mention the limitations of this approach for performance
> > analysis. If we make it easy for the user to get such kind of data, it's
> > important they do not proverbially shoot themselves in their own foot
> with
> > false conclusions. We should clearly mention how those data are sampled,
> > and point to limitations such as:
> > - data from all threads/operators are squashed together, so if there is a
> > data skew it will be averaged out
> > - stack sampling is/can be biased (JVM threads are more likely to be
> > stopped in some places than others, while skipping/rarely stopping in the
> > true hot spots - so one should treat the results with a grain of salt
> below
> > a certain level)
> > - true bottleneck might be obscured by the fact flame graphs are
> squashing
> > results from all of the threads. There can be 60% of time spent in one
> > component according to a flame graph, but it might not necessarily be the
> > bottleneck, which could be in a completely different component running
> > which has a single thread burning 100% of a single CPU core, barely
> visible
> > in the flame graph at all.
> >
> > It's great to have such a nice tool readily and easily available, but we
> > need to make sure people who are using it are aware when it can be
> > misleading.
> >
> > Piotrek
> >
> > wt., 2 mar 2021 o 15:12 Till Rohrmann <[hidden email]> napisał(a):
> >
> > > Ah ok. Thanks for the clarification Alex.
> > >
> > > Cheers,
> > > Till
> > >
> > > On Tue, Mar 2, 2021 at 2:02 PM Alexander Fedulov <
> > [hidden email]>
> > > wrote:
> > >
> > > > It is passed back as part of the response to the asynchronous
> callback
> > > > within the coordinator and is used to decide if all outstanding
> > requests
> > > to
> > > > the parallel instances of a particular operator returned
> successfully.
> > If
> > > > so, the request is considered successful, sub-results are combined
> and
> > > the
> > > > thread info result future for an operator completes.
> > > >
> > > >
> > > >
> > >
> >
> https://github.com/apache/flink/pull/15054/commits/281188a025077849efd630f1f7aa801ff79a9afd#diff-20a1c89043e8d480e7af6dd36596b3558be9c6e64f6f4cf065df97fe76411c50R150
> > > >
> > > >
> > > >
> > >
> >
> https://github.com/apache/flink/pull/15054/commits/281188a025077849efd630f1f7aa801ff79a9afd#diff-20a1c89043e8d480e7af6dd36596b3558be9c6e64f6f4cf065df97fe76411c50R277
> > > >
> > > > Best,
> > > >
> > > > --
> > > >
> > > > Alexander Fedulov | Solutions Architect
> > > >
> > > > <https://www.ververica.com/>
> > > >
> > > > Follow us @VervericaData
> > > >
> > > >
> > > > On Tue, Mar 2, 2021 at 12:28 PM Till Rohrmann <[hidden email]>
> > > > wrote:
> > > >
> > > > > Why does the caller of TaskExecutorGateway.requestThreadInfoSamples
> > > need
> > > > to
> > > > > specify the request id? Is it because the caller can send a second
> > > > request
> > > > > with the same id? Or can the caller query the result of a previous
> > > > request
> > > > > by specifying the requestId?
> > > > >
> > > > > If the TaskExecutor does not need to know about the id, then we
> might
> > > be
> > > > > able to drop it.
> > > > >
> > > > > Cheers
> > > > > Till
> > > > >
> > > > > On Tue, Mar 2, 2021 at 9:42 AM Alexander Fedulov <
> > > > [hidden email]>
> > > > > wrote:
> > > > >
> > > > > > Hi Till,
> > > > > >
> > > > > > Thanks for your comments.
> > > > > >
> > > > > > * What is the requestId used for in the RPC call?
> > > > > >
> > > > > > It is the handle that is used as the key in the
> > > > > > ThreadInfoRequestCoordinator's pending responses Map. I believe
> it
> > > was
> > > > > > called sampleId in the StackTraceSampleCoordinator, but I decided
> > to
> > > > > rename
> > > > > > it because there is also a ThreadInfoSampleService which is
> > actually
> > > > > > responsible for sampling the JVM numSamples number of times. I
> > found
> > > > that
> > > > > > the notion of what a sample is was a bit confusing. Now one
> thread
> > > info
> > > > > > request corresponds to gathering numSamples from a corresponding
> > > Task.
> > > > > Hope
> > > > > > that makes sense.
> > > > > >
> > > > > > * Would it make sense to group numSubSamples, delayBetweenSamples
> > and
> > > > > > maxStackTraceDepth into a ThreadSamplesRequest class? This would
> > > > decrease
> > > > > > the number of parameters and group those which are closely
> related
> > > > > > together.
> > > > > >
> > > > > > Good point. I will rework it accordingly.
> > > > > >
> > > > > > Best,
> > > > > > --
> > > > > >
> > > > > > Alexander Fedulov | Solutions Architect
> > > > > > Follow us @VervericaData
> > > > > >
> > > > > >
> > > > > >
> > > > > > --
> > > > > > Sent from:
> > > > > http://apache-flink-mailing-list-archive.1008284.n3.nabble.com/
> > > > > >
> > > > >
> > > >
> > >
> >
>

David Anderson-4

Re: [DISCUSS] FLIP-165: Operator's Flame Graphs

This is going to make performance analysis and optimization much more
accessible. I can't wait to include this in our training courses.

+1

Seth suggested putting the docs for this feature under
Operations/Monitoring, but there's already a page in the docs under
Operations/Debugging for Application Profiling & Debugging, which is more
on point. I think it will be confusing if the flame graphs aren't
together with the other profilers.

David

On Tue, Mar 2, 2021 at 11:36 PM Seth Wiesman <[hidden email]> wrote:

> Cool feature +1
>
> There is a subsection called monitoring in the operations section of the
> docs. It would fit nicely there.
>
> Seth
>
> On Tue, Mar 2, 2021 at 4:23 PM Alexander Fedulov <[hidden email]>
> wrote:
>
> > Hi Piotr,
> >
> > Thanks for the comments - all valid points.
> > We should definitely document how the Flame Graphs are constructed - I
> will
> > work on the docs. Do you have a proposition about the part of which
> > page/section they should become?
> > I would like to also mention here that I plan to work on further
> > improvements, such as the ability to "zoom in" into the Flame Graphs for
> > the individual Tasks in the "unsquashed" form, so some of those concerns
> > should be mitigated in the future.
> >
> > Best,
> >
> > --
> >
> > Alexander Fedulov | Solutions Architect
> >
> > <https://www.ververica.com/>
> >
> > Follow us @VervericaData
> >
> >
> > On Tue, Mar 2, 2021 at 3:17 PM Piotr Nowojski <[hidden email]>
> > wrote:
> >
> > > Nice feature +1 from my side for it.
> > >
> > > In the PR I think we are missing documentation. I think it's especially
> > > important to mention the limitations of this approach for performance
> > > analysis. If we make it easy for the user to get such kind of data,
> it's
> > > important they do not proverbially shoot themselves in their own foot
> > with
> > > false conclusions. We should clearly mention how those data are
> sampled,
> > > and point to limitations such as:
> > > - data from all threads/operators are squashed together, so if there
> is a
> > > data skew it will be averaged out
> > > - stack sampling is/can be biased (JVM threads are more likely to be
> > > stopped in some places than others, while skipping/rarely stopping in
> the
> > > true hot spots - so one should treat the results with a grain of salt
> > below
> > > a certain level)
> > > - true bottleneck might be obscured by the fact flame graphs are
> > squashing
> > > results from all of the threads. There can be 60% of time spent in one
> > > component according to a flame graph, but it might not necessarily be
> the
> > > bottleneck, which could be in a completely different component running
> > > which has a single thread burning 100% of a single CPU core, barely
> > visible
> > > in the flame graph at all.
> > >
> > > It's great to have such a nice tool readily and easily available, but
> we
> > > need to make sure people who are using it are aware when it can be
> > > misleading.
> > >
> > > Piotrek
> > >
> > > wt., 2 mar 2021 o 15:12 Till Rohrmann <[hidden email]>
> napisał(a):
> > >
> > > > Ah ok. Thanks for the clarification Alex.
> > > >
> > > > Cheers,
> > > > Till
> > > >
> > > > On Tue, Mar 2, 2021 at 2:02 PM Alexander Fedulov <
> > > [hidden email]>
> > > > wrote:
> > > >
> > > > > It is passed back as part of the response to the asynchronous
> > callback
> > > > > within the coordinator and is used to decide if all outstanding
> > > requests
> > > > to
> > > > > the parallel instances of a particular operator returned
> > successfully.
> > > If
> > > > > so, the request is considered successful, sub-results are combined
> > and
> > > > the
> > > > > thread info result future for an operator completes.
> > > > >
> > > > >
> > > > >
> > > >
> > >
> >
> https://github.com/apache/flink/pull/15054/commits/281188a025077849efd630f1f7aa801ff79a9afd#diff-20a1c89043e8d480e7af6dd36596b3558be9c6e64f6f4cf065df97fe76411c50R150
> > > > >
> > > > >
> > > > >
> > > >
> > >
> >
> https://github.com/apache/flink/pull/15054/commits/281188a025077849efd630f1f7aa801ff79a9afd#diff-20a1c89043e8d480e7af6dd36596b3558be9c6e64f6f4cf065df97fe76411c50R277
> > > > >
> > > > > Best,
> > > > >
> > > > > --
> > > > >
> > > > > Alexander Fedulov | Solutions Architect
> > > > >
> > > > > <https://www.ververica.com/>
> > > > >
> > > > > Follow us @VervericaData
> > > > >
> > > > >
> > > > > On Tue, Mar 2, 2021 at 12:28 PM Till Rohrmann <
> [hidden email]>
> > > > > wrote:
> > > > >
> > > > > > Why does the caller of
> TaskExecutorGateway.requestThreadInfoSamples
> > > > need
> > > > > to
> > > > > > specify the request id? Is it because the caller can send a
> second
> > > > > request
> > > > > > with the same id? Or can the caller query the result of a
> previous
> > > > > request
> > > > > > by specifying the requestId?
> > > > > >
> > > > > > If the TaskExecutor does not need to know about the id, then we
> > might
> > > > be
> > > > > > able to drop it.
> > > > > >
> > > > > > Cheers
> > > > > > Till
> > > > > >
> > > > > > On Tue, Mar 2, 2021 at 9:42 AM Alexander Fedulov <
> > > > > [hidden email]>
> > > > > > wrote:
> > > > > >
> > > > > > > Hi Till,
> > > > > > >
> > > > > > > Thanks for your comments.
> > > > > > >
> > > > > > > * What is the requestId used for in the RPC call?
> > > > > > >
> > > > > > > It is the handle that is used as the key in the
> > > > > > > ThreadInfoRequestCoordinator's pending responses Map. I believe
> > it
> > > > was
> > > > > > > called sampleId in the StackTraceSampleCoordinator, but I
> decided
> > > to
> > > > > > rename
> > > > > > > it because there is also a ThreadInfoSampleService which is
> > > actually
> > > > > > > responsible for sampling the JVM numSamples number of times. I
> > > found
> > > > > that
> > > > > > > the notion of what a sample is was a bit confusing. Now one
> > thread
> > > > info
> > > > > > > request corresponds to gathering numSamples from a
> corresponding
> > > > Task.
> > > > > > Hope
> > > > > > > that makes sense.
> > > > > > >
> > > > > > > * Would it make sense to group numSubSamples,
> delayBetweenSamples
> > > and
> > > > > > > maxStackTraceDepth into a ThreadSamplesRequest class? This
> would
> > > > > decrease
> > > > > > > the number of parameters and group those which are closely
> > related
> > > > > > > together.
> > > > > > >
> > > > > > > Good point. I will rework it accordingly.
> > > > > > >
> > > > > > > Best,
> > > > > > > --
> > > > > > >
> > > > > > > Alexander Fedulov | Solutions Architect
> > > > > > > Follow us @VervericaData
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > > --
> > > > > > > Sent from:
> > > > > > http://apache-flink-mailing-list-archive.1008284.n3.nabble.com/
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
>

Alexander Fedulov

Re: [DISCUSS] FLIP-165: Operator's Flame Graphs

Added docs to the PR.
@David, thanks for the tip, it seems like a good place to put them.

--

Alexander Fedulov | Solutions Architect

<https://www.ververica.com/>

Follow us @VervericaData

On Wed, Mar 3, 2021 at 12:10 PM David Anderson <[hidden email]> wrote:

> This is going to make performance analysis and optimization much more
> accessible. I can't wait to include this in our training courses.
>
> +1
>
> Seth suggested putting the docs for this feature under
> Operations/Monitoring, but there's already a page in the docs under
> Operations/Debugging for Application Profiling & Debugging, which is more
> on point. I think it will be confusing if the flame graphs aren't
> together with the other profilers.
>
> David
>
> On Tue, Mar 2, 2021 at 11:36 PM Seth Wiesman <[hidden email]> wrote:
>
> > Cool feature +1
> >
> > There is a subsection called monitoring in the operations section of the
> > docs. It would fit nicely there.
> >
> > Seth
> >
> > On Tue, Mar 2, 2021 at 4:23 PM Alexander Fedulov <
> [hidden email]>
> > wrote:
> >
> > > Hi Piotr,
> > >
> > > Thanks for the comments - all valid points.
> > > We should definitely document how the Flame Graphs are constructed - I
> > will
> > > work on the docs. Do you have a proposition about the part of which
> > > page/section they should become?
> > > I would like to also mention here that I plan to work on further
> > > improvements, such as the ability to "zoom in" into the Flame Graphs
> for
> > > the individual Tasks in the "unsquashed" form, so some of those
> concerns
> > > should be mitigated in the future.
> > >
> > > Best,
> > >
> > > --
> > >
> > > Alexander Fedulov | Solutions Architect
> > >
> > > <https://www.ververica.com/>
> > >
> > > Follow us @VervericaData
> > >
> > >
> > > On Tue, Mar 2, 2021 at 3:17 PM Piotr Nowojski <[hidden email]>
> > > wrote:
> > >
> > > > Nice feature +1 from my side for it.
> > > >
> > > > In the PR I think we are missing documentation. I think it's
> especially
> > > > important to mention the limitations of this approach for performance
> > > > analysis. If we make it easy for the user to get such kind of data,
> > it's
> > > > important they do not proverbially shoot themselves in their own foot
> > > with
> > > > false conclusions. We should clearly mention how those data are
> > sampled,
> > > > and point to limitations such as:
> > > > - data from all threads/operators are squashed together, so if there
> > is a
> > > > data skew it will be averaged out
> > > > - stack sampling is/can be biased (JVM threads are more likely to be
> > > > stopped in some places than others, while skipping/rarely stopping in
> > the
> > > > true hot spots - so one should treat the results with a grain of salt
> > > below
> > > > a certain level)
> > > > - true bottleneck might be obscured by the fact flame graphs are
> > > squashing
> > > > results from all of the threads. There can be 60% of time spent in
> one
> > > > component according to a flame graph, but it might not necessarily be
> > the
> > > > bottleneck, which could be in a completely different component
> running
> > > > which has a single thread burning 100% of a single CPU core, barely
> > > visible
> > > > in the flame graph at all.
> > > >
> > > > It's great to have such a nice tool readily and easily available, but
> > we
> > > > need to make sure people who are using it are aware when it can be
> > > > misleading.
> > > >
> > > > Piotrek
> > > >
> > > > wt., 2 mar 2021 o 15:12 Till Rohrmann <[hidden email]>
> > napisał(a):
> > > >
> > > > > Ah ok. Thanks for the clarification Alex.
> > > > >
> > > > > Cheers,
> > > > > Till
> > > > >
> > > > > On Tue, Mar 2, 2021 at 2:02 PM Alexander Fedulov <
> > > > [hidden email]>
> > > > > wrote:
> > > > >
> > > > > > It is passed back as part of the response to the asynchronous
> > > callback
> > > > > > within the coordinator and is used to decide if all outstanding
> > > > requests
> > > > > to
> > > > > > the parallel instances of a particular operator returned
> > > successfully.
> > > > If
> > > > > > so, the request is considered successful, sub-results are
> combined
> > > and
> > > > > the
> > > > > > thread info result future for an operator completes.
> > > > > >
> > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
> https://github.com/apache/flink/pull/15054/commits/281188a025077849efd630f1f7aa801ff79a9afd#diff-20a1c89043e8d480e7af6dd36596b3558be9c6e64f6f4cf065df97fe76411c50R150
> > > > > >
> > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
> https://github.com/apache/flink/pull/15054/commits/281188a025077849efd630f1f7aa801ff79a9afd#diff-20a1c89043e8d480e7af6dd36596b3558be9c6e64f6f4cf065df97fe76411c50R277
> > > > > >
> > > > > > Best,
> > > > > >
> > > > > > --
> > > > > >
> > > > > > Alexander Fedulov | Solutions Architect
> > > > > >
> > > > > > <https://www.ververica.com/>
> > > > > >
> > > > > > Follow us @VervericaData
> > > > > >
> > > > > >
> > > > > > On Tue, Mar 2, 2021 at 12:28 PM Till Rohrmann <
> > [hidden email]>
> > > > > > wrote:
> > > > > >
> > > > > > > Why does the caller of
> > TaskExecutorGateway.requestThreadInfoSamples
> > > > > need
> > > > > > to
> > > > > > > specify the request id? Is it because the caller can send a
> > second
> > > > > > request
> > > > > > > with the same id? Or can the caller query the result of a
> > previous
> > > > > > request
> > > > > > > by specifying the requestId?
> > > > > > >
> > > > > > > If the TaskExecutor does not need to know about the id, then we
> > > might
> > > > > be
> > > > > > > able to drop it.
> > > > > > >
> > > > > > > Cheers
> > > > > > > Till
> > > > > > >
> > > > > > > On Tue, Mar 2, 2021 at 9:42 AM Alexander Fedulov <
> > > > > > [hidden email]>
> > > > > > > wrote:
> > > > > > >
> > > > > > > > Hi Till,
> > > > > > > >
> > > > > > > > Thanks for your comments.
> > > > > > > >
> > > > > > > > * What is the requestId used for in the RPC call?
> > > > > > > >
> > > > > > > > It is the handle that is used as the key in the
> > > > > > > > ThreadInfoRequestCoordinator's pending responses Map. I
> believe
> > > it
> > > > > was
> > > > > > > > called sampleId in the StackTraceSampleCoordinator, but I
> > decided
> > > > to
> > > > > > > rename
> > > > > > > > it because there is also a ThreadInfoSampleService which is
> > > > actually
> > > > > > > > responsible for sampling the JVM numSamples number of times.
> I
> > > > found
> > > > > > that
> > > > > > > > the notion of what a sample is was a bit confusing. Now one
> > > thread
> > > > > info
> > > > > > > > request corresponds to gathering numSamples from a
> > corresponding
> > > > > Task.
> > > > > > > Hope
> > > > > > > > that makes sense.
> > > > > > > >
> > > > > > > > * Would it make sense to group numSubSamples,
> > delayBetweenSamples
> > > > and
> > > > > > > > maxStackTraceDepth into a ThreadSamplesRequest class? This
> > would
> > > > > > decrease
> > > > > > > > the number of parameters and group those which are closely
> > > related
> > > > > > > > together.
> > > > > > > >
> > > > > > > > Good point. I will rework it accordingly.
> > > > > > > >
> > > > > > > > Best,
> > > > > > > > --
> > > > > > > >
> > > > > > > > Alexander Fedulov | Solutions Architect
> > > > > > > > Follow us @VervericaData
> > > > > > > >
> > > > > > > >
> > > > > > > >
> > > > > > > > --
> > > > > > > > Sent from:
> > > > > > >
> http://apache-flink-mailing-list-archive.1008284.n3.nabble.com/
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
>

Alexander Fedulov

Re: [DISCUSS] FLIP-165: Operator's Flame Graphs

@Till, I've added the proposed ThreadInfoSamplesRequest and updated the
FLIP and the PR accordingly.

Best,

--

Alexander Fedulov | Solutions Architect

<https://www.ververica.com/>

Follow us @VervericaData

--

Join Flink Forward <https://flink-forward.org/> - The Apache Flink
Conference

Stream Processing | Event Driven | Real Time

--

Ververica GmbH | Invalidenstrasse 115, 10115 Berlin, Germany

--

Ververica GmbH
Registered at Amtsgericht Charlottenburg: HRB 158244 B
Managing Directors: Yip Park Tung Jason, Jinwei (Kevin) Zhang, Karl Anton
Wehner

On Wed, Mar 3, 2021 at 5:03 PM Alexander Fedulov <[hidden email]>
wrote:

> Added docs to the PR.
> @David, thanks for the tip, it seems like a good place to put them.
>
> --
>
> Alexander Fedulov | Solutions Architect
>
> <https://www.ververica.com/>
>
> Follow us @VervericaData
>
>
>
>
> On Wed, Mar 3, 2021 at 12:10 PM David Anderson <[hidden email]>
> wrote:
>
>> This is going to make performance analysis and optimization much more
>> accessible. I can't wait to include this in our training courses.
>>
>> +1
>>
>> Seth suggested putting the docs for this feature under
>> Operations/Monitoring, but there's already a page in the docs under
>> Operations/Debugging for Application Profiling & Debugging, which is more
>> on point. I think it will be confusing if the flame graphs aren't
>> together with the other profilers.
>>
>> David
>>
>> On Tue, Mar 2, 2021 at 11:36 PM Seth Wiesman <[hidden email]> wrote:
>>
>> > Cool feature +1
>> >
>> > There is a subsection called monitoring in the operations section of the
>> > docs. It would fit nicely there.
>> >
>> > Seth
>> >
>> > On Tue, Mar 2, 2021 at 4:23 PM Alexander Fedulov <
>> [hidden email]>
>> > wrote:
>> >
>> > > Hi Piotr,
>> > >
>> > > Thanks for the comments - all valid points.
>> > > We should definitely document how the Flame Graphs are constructed - I
>> > will
>> > > work on the docs. Do you have a proposition about the part of which
>> > > page/section they should become?
>> > > I would like to also mention here that I plan to work on further
>> > > improvements, such as the ability to "zoom in" into the Flame Graphs
>> for
>> > > the individual Tasks in the "unsquashed" form, so some of those
>> concerns
>> > > should be mitigated in the future.
>> > >
>> > > Best,
>> > >
>> > > --
>> > >
>> > > Alexander Fedulov | Solutions Architect
>> > >
>> > > <https://www.ververica.com/>
>> > >
>> > > Follow us @VervericaData
>> > >
>> > >
>> > > On Tue, Mar 2, 2021 at 3:17 PM Piotr Nowojski <[hidden email]>
>> > > wrote:
>> > >
>> > > > Nice feature +1 from my side for it.
>> > > >
>> > > > In the PR I think we are missing documentation. I think it's
>> especially
>> > > > important to mention the limitations of this approach for
>> performance
>> > > > analysis. If we make it easy for the user to get such kind of data,
>> > it's
>> > > > important they do not proverbially shoot themselves in their own
>> foot
>> > > with
>> > > > false conclusions. We should clearly mention how those data are
>> > sampled,
>> > > > and point to limitations such as:
>> > > > - data from all threads/operators are squashed together, so if there
>> > is a
>> > > > data skew it will be averaged out
>> > > > - stack sampling is/can be biased (JVM threads are more likely to be
>> > > > stopped in some places than others, while skipping/rarely stopping
>> in
>> > the
>> > > > true hot spots - so one should treat the results with a grain of
>> salt
>> > > below
>> > > > a certain level)
>> > > > - true bottleneck might be obscured by the fact flame graphs are
>> > > squashing
>> > > > results from all of the threads. There can be 60% of time spent in
>> one
>> > > > component according to a flame graph, but it might not necessarily
>> be
>> > the
>> > > > bottleneck, which could be in a completely different component
>> running
>> > > > which has a single thread burning 100% of a single CPU core, barely
>> > > visible
>> > > > in the flame graph at all.
>> > > >
>> > > > It's great to have such a nice tool readily and easily available,
>> but
>> > we
>> > > > need to make sure people who are using it are aware when it can be
>> > > > misleading.
>> > > >
>> > > > Piotrek
>> > > >
>> > > > wt., 2 mar 2021 o 15:12 Till Rohrmann <[hidden email]>
>> > napisał(a):
>> > > >
>> > > > > Ah ok. Thanks for the clarification Alex.
>> > > > >
>> > > > > Cheers,
>> > > > > Till
>> > > > >
>> > > > > On Tue, Mar 2, 2021 at 2:02 PM Alexander Fedulov <
>> > > > [hidden email]>
>> > > > > wrote:
>> > > > >
>> > > > > > It is passed back as part of the response to the asynchronous
>> > > callback
>> > > > > > within the coordinator and is used to decide if all outstanding
>> > > > requests
>> > > > > to
>> > > > > > the parallel instances of a particular operator returned
>> > > successfully.
>> > > > If
>> > > > > > so, the request is considered successful, sub-results are
>> combined
>> > > and
>> > > > > the
>> > > > > > thread info result future for an operator completes.
>> > > > > >
>> > > > > >
>> > > > > >
>> > > > >
>> > > >
>> > >
>> >
>> https://github.com/apache/flink/pull/15054/commits/281188a025077849efd630f1f7aa801ff79a9afd#diff-20a1c89043e8d480e7af6dd36596b3558be9c6e64f6f4cf065df97fe76411c50R150
>> > > > > >
>> > > > > >
>> > > > > >
>> > > > >
>> > > >
>> > >
>> >
>> https://github.com/apache/flink/pull/15054/commits/281188a025077849efd630f1f7aa801ff79a9afd#diff-20a1c89043e8d480e7af6dd36596b3558be9c6e64f6f4cf065df97fe76411c50R277
>> > > > > >
>> > > > > > Best,
>> > > > > >
>> > > > > > --
>> > > > > >
>> > > > > > Alexander Fedulov | Solutions Architect
>> > > > > >
>> > > > > > <https://www.ververica.com/>
>> > > > > >
>> > > > > > Follow us @VervericaData
>> > > > > >
>> > > > > >
>> > > > > > On Tue, Mar 2, 2021 at 12:28 PM Till Rohrmann <
>> > [hidden email]>
>> > > > > > wrote:
>> > > > > >
>> > > > > > > Why does the caller of
>> > TaskExecutorGateway.requestThreadInfoSamples
>> > > > > need
>> > > > > > to
>> > > > > > > specify the request id? Is it because the caller can send a
>> > second
>> > > > > > request
>> > > > > > > with the same id? Or can the caller query the result of a
>> > previous
>> > > > > > request
>> > > > > > > by specifying the requestId?
>> > > > > > >
>> > > > > > > If the TaskExecutor does not need to know about the id, then
>> we
>> > > might
>> > > > > be
>> > > > > > > able to drop it.
>> > > > > > >
>> > > > > > > Cheers
>> > > > > > > Till
>> > > > > > >
>> > > > > > > On Tue, Mar 2, 2021 at 9:42 AM Alexander Fedulov <
>> > > > > > [hidden email]>
>> > > > > > > wrote:
>> > > > > > >
>> > > > > > > > Hi Till,
>> > > > > > > >
>> > > > > > > > Thanks for your comments.
>> > > > > > > >
>> > > > > > > > * What is the requestId used for in the RPC call?
>> > > > > > > >
>> > > > > > > > It is the handle that is used as the key in the
>> > > > > > > > ThreadInfoRequestCoordinator's pending responses Map. I
>> believe
>> > > it
>> > > > > was
>> > > > > > > > called sampleId in the StackTraceSampleCoordinator, but I
>> > decided
>> > > > to
>> > > > > > > rename
>> > > > > > > > it because there is also a ThreadInfoSampleService which is
>> > > > actually
>> > > > > > > > responsible for sampling the JVM numSamples number of
>> times. I
>> > > > found
>> > > > > > that
>> > > > > > > > the notion of what a sample is was a bit confusing. Now one
>> > > thread
>> > > > > info
>> > > > > > > > request corresponds to gathering numSamples from a
>> > corresponding
>> > > > > Task.
>> > > > > > > Hope
>> > > > > > > > that makes sense.
>> > > > > > > >
>> > > > > > > > * Would it make sense to group numSubSamples,
>> > delayBetweenSamples
>> > > > and
>> > > > > > > > maxStackTraceDepth into a ThreadSamplesRequest class? This
>> > would
>> > > > > > decrease
>> > > > > > > > the number of parameters and group those which are closely
>> > > related
>> > > > > > > > together.
>> > > > > > > >
>> > > > > > > > Good point. I will rework it accordingly.
>> > > > > > > >
>> > > > > > > > Best,
>> > > > > > > > --
>> > > > > > > >
>> > > > > > > > Alexander Fedulov | Solutions Architect
>> > > > > > > > Follow us @VervericaData
>> > > > > > > >
>> > > > > > > >
>> > > > > > > >
>> > > > > > > > --
>> > > > > > > > Sent from:
>> > > > > > >
>> http://apache-flink-mailing-list-archive.1008284.n3.nabble.com/
>> > > > > > > >
>> > > > > > >
>> > > > > >
>> > > > >
>> > > >
>> > >
>> >
>>
>

Till Rohrmann

Re: [DISCUSS] FLIP-165: Operator's Flame Graphs

Thanks Alex. From my perspective we could continue with the vote now.

Cheers,
Till

On Thu, Mar 4, 2021 at 9:15 PM Alexander Fedulov <[hidden email]>
wrote:

> @Till, I've added the proposed ThreadInfoSamplesRequest and updated the
> FLIP and the PR accordingly.
>
> Best,
>
> --
>
> Alexander Fedulov | Solutions Architect
>
> <https://www.ververica.com/>
>
> Follow us @VervericaData
>
> --
>
> Join Flink Forward <https://flink-forward.org/> - The Apache Flink
> Conference
>
> Stream Processing | Event Driven | Real Time
>
> --
>
> Ververica GmbH | Invalidenstrasse 115, 10115 Berlin, Germany
>
> --
>
> Ververica GmbH
> Registered at Amtsgericht Charlottenburg: HRB 158244 B
> Managing Directors: Yip Park Tung Jason, Jinwei (Kevin) Zhang, Karl Anton
> Wehner
>
>
>
> On Wed, Mar 3, 2021 at 5:03 PM Alexander Fedulov <[hidden email]>
> wrote:
>
> > Added docs to the PR.
> > @David, thanks for the tip, it seems like a good place to put them.
> >
> > --
> >
> > Alexander Fedulov | Solutions Architect
> >
> > <https://www.ververica.com/>
> >
> > Follow us @VervericaData
> >
> >
> >
> >
> > On Wed, Mar 3, 2021 at 12:10 PM David Anderson <[hidden email]>
> > wrote:
> >
> >> This is going to make performance analysis and optimization much more
> >> accessible. I can't wait to include this in our training courses.
> >>
> >> +1
> >>
> >> Seth suggested putting the docs for this feature under
> >> Operations/Monitoring, but there's already a page in the docs under
> >> Operations/Debugging for Application Profiling & Debugging, which is
> more
> >> on point. I think it will be confusing if the flame graphs aren't
> >> together with the other profilers.
> >>
> >> David
> >>
> >> On Tue, Mar 2, 2021 at 11:36 PM Seth Wiesman <[hidden email]>
> wrote:
> >>
> >> > Cool feature +1
> >> >
> >> > There is a subsection called monitoring in the operations section of
> the
> >> > docs. It would fit nicely there.
> >> >
> >> > Seth
> >> >
> >> > On Tue, Mar 2, 2021 at 4:23 PM Alexander Fedulov <
> >> [hidden email]>
> >> > wrote:
> >> >
> >> > > Hi Piotr,
> >> > >
> >> > > Thanks for the comments - all valid points.
> >> > > We should definitely document how the Flame Graphs are constructed
> - I
> >> > will
> >> > > work on the docs. Do you have a proposition about the part of which
> >> > > page/section they should become?
> >> > > I would like to also mention here that I plan to work on further
> >> > > improvements, such as the ability to "zoom in" into the Flame Graphs
> >> for
> >> > > the individual Tasks in the "unsquashed" form, so some of those
> >> concerns
> >> > > should be mitigated in the future.
> >> > >
> >> > > Best,
> >> > >
> >> > > --
> >> > >
> >> > > Alexander Fedulov | Solutions Architect
> >> > >
> >> > > <https://www.ververica.com/>
> >> > >
> >> > > Follow us @VervericaData
> >> > >
> >> > >
> >> > > On Tue, Mar 2, 2021 at 3:17 PM Piotr Nowojski <[hidden email]
> >
> >> > > wrote:
> >> > >
> >> > > > Nice feature +1 from my side for it.
> >> > > >
> >> > > > In the PR I think we are missing documentation. I think it's
> >> especially
> >> > > > important to mention the limitations of this approach for
> >> performance
> >> > > > analysis. If we make it easy for the user to get such kind of
> data,
> >> > it's
> >> > > > important they do not proverbially shoot themselves in their own
> >> foot
> >> > > with
> >> > > > false conclusions. We should clearly mention how those data are
> >> > sampled,
> >> > > > and point to limitations such as:
> >> > > > - data from all threads/operators are squashed together, so if
> there
> >> > is a
> >> > > > data skew it will be averaged out
> >> > > > - stack sampling is/can be biased (JVM threads are more likely to
> be
> >> > > > stopped in some places than others, while skipping/rarely stopping
> >> in
> >> > the
> >> > > > true hot spots - so one should treat the results with a grain of
> >> salt
> >> > > below
> >> > > > a certain level)
> >> > > > - true bottleneck might be obscured by the fact flame graphs are
> >> > > squashing
> >> > > > results from all of the threads. There can be 60% of time spent in
> >> one
> >> > > > component according to a flame graph, but it might not necessarily
> >> be
> >> > the
> >> > > > bottleneck, which could be in a completely different component
> >> running
> >> > > > which has a single thread burning 100% of a single CPU core,
> barely
> >> > > visible
> >> > > > in the flame graph at all.
> >> > > >
> >> > > > It's great to have such a nice tool readily and easily available,
> >> but
> >> > we
> >> > > > need to make sure people who are using it are aware when it can be
> >> > > > misleading.
> >> > > >
> >> > > > Piotrek
> >> > > >
> >> > > > wt., 2 mar 2021 o 15:12 Till Rohrmann <[hidden email]>
> >> > napisał(a):
> >> > > >
> >> > > > > Ah ok. Thanks for the clarification Alex.
> >> > > > >
> >> > > > > Cheers,
> >> > > > > Till
> >> > > > >
> >> > > > > On Tue, Mar 2, 2021 at 2:02 PM Alexander Fedulov <
> >> > > > [hidden email]>
> >> > > > > wrote:
> >> > > > >
> >> > > > > > It is passed back as part of the response to the asynchronous
> >> > > callback
> >> > > > > > within the coordinator and is used to decide if all
> outstanding
> >> > > > requests
> >> > > > > to
> >> > > > > > the parallel instances of a particular operator returned
> >> > > successfully.
> >> > > > If
> >> > > > > > so, the request is considered successful, sub-results are
> >> combined
> >> > > and
> >> > > > > the
> >> > > > > > thread info result future for an operator completes.
> >> > > > > >
> >> > > > > >
> >> > > > > >
> >> > > > >
> >> > > >
> >> > >
> >> >
> >>
> https://github.com/apache/flink/pull/15054/commits/281188a025077849efd630f1f7aa801ff79a9afd#diff-20a1c89043e8d480e7af6dd36596b3558be9c6e64f6f4cf065df97fe76411c50R150
> >> > > > > >
> >> > > > > >
> >> > > > > >
> >> > > > >
> >> > > >
> >> > >
> >> >
> >>
> https://github.com/apache/flink/pull/15054/commits/281188a025077849efd630f1f7aa801ff79a9afd#diff-20a1c89043e8d480e7af6dd36596b3558be9c6e64f6f4cf065df97fe76411c50R277
> >> > > > > >
> >> > > > > > Best,
> >> > > > > >
> >> > > > > > --
> >> > > > > >
> >> > > > > > Alexander Fedulov | Solutions Architect
> >> > > > > >
> >> > > > > > <https://www.ververica.com/>
> >> > > > > >
> >> > > > > > Follow us @VervericaData
> >> > > > > >
> >> > > > > >
> >> > > > > > On Tue, Mar 2, 2021 at 12:28 PM Till Rohrmann <
> >> > [hidden email]>
> >> > > > > > wrote:
> >> > > > > >
> >> > > > > > > Why does the caller of
> >> > TaskExecutorGateway.requestThreadInfoSamples
> >> > > > > need
> >> > > > > > to
> >> > > > > > > specify the request id? Is it because the caller can send a
> >> > second
> >> > > > > > request
> >> > > > > > > with the same id? Or can the caller query the result of a
> >> > previous
> >> > > > > > request
> >> > > > > > > by specifying the requestId?
> >> > > > > > >
> >> > > > > > > If the TaskExecutor does not need to know about the id, then
> >> we
> >> > > might
> >> > > > > be
> >> > > > > > > able to drop it.
> >> > > > > > >
> >> > > > > > > Cheers
> >> > > > > > > Till
> >> > > > > > >
> >> > > > > > > On Tue, Mar 2, 2021 at 9:42 AM Alexander Fedulov <
> >> > > > > > [hidden email]>
> >> > > > > > > wrote:
> >> > > > > > >
> >> > > > > > > > Hi Till,
> >> > > > > > > >
> >> > > > > > > > Thanks for your comments.
> >> > > > > > > >
> >> > > > > > > > * What is the requestId used for in the RPC call?
> >> > > > > > > >
> >> > > > > > > > It is the handle that is used as the key in the
> >> > > > > > > > ThreadInfoRequestCoordinator's pending responses Map. I
> >> believe
> >> > > it
> >> > > > > was
> >> > > > > > > > called sampleId in the StackTraceSampleCoordinator, but I
> >> > decided
> >> > > > to
> >> > > > > > > rename
> >> > > > > > > > it because there is also a ThreadInfoSampleService which
> is
> >> > > > actually
> >> > > > > > > > responsible for sampling the JVM numSamples number of
> >> times. I
> >> > > > found
> >> > > > > > that
> >> > > > > > > > the notion of what a sample is was a bit confusing. Now
> one
> >> > > thread
> >> > > > > info
> >> > > > > > > > request corresponds to gathering numSamples from a
> >> > corresponding
> >> > > > > Task.
> >> > > > > > > Hope
> >> > > > > > > > that makes sense.
> >> > > > > > > >
> >> > > > > > > > * Would it make sense to group numSubSamples,
> >> > delayBetweenSamples
> >> > > > and
> >> > > > > > > > maxStackTraceDepth into a ThreadSamplesRequest class? This
> >> > would
> >> > > > > > decrease
> >> > > > > > > > the number of parameters and group those which are closely
> >> > > related
> >> > > > > > > > together.
> >> > > > > > > >
> >> > > > > > > > Good point. I will rework it accordingly.
> >> > > > > > > >
> >> > > > > > > > Best,
> >> > > > > > > > --
> >> > > > > > > >
> >> > > > > > > > Alexander Fedulov | Solutions Architect
> >> > > > > > > > Follow us @VervericaData
> >> > > > > > > >
> >> > > > > > > >
> >> > > > > > > >
> >> > > > > > > > --
> >> > > > > > > > Sent from:
> >> > > > > > >
> >> http://apache-flink-mailing-list-archive.1008284.n3.nabble.com/
> >> > > > > > > >
> >> > > > > > >
> >> > > > > >
> >> > > > >
> >> > > >
> >> > >
> >> >
> >>
> >
>