Hi All,
I would like to start a discussion for FLIP-165: Operator's Flame Graphs [1] A Flame Graph [2] is a visualization that is very effective for providing answers to the questions like: - Which methods are currently consuming CPU resources? - How CPU utilization by one method compares to the others? - Which series of calls on the stack led to executing a particular method? I have already opened a PR [3] that represents the implementation approach proposed in the FLIP. It supports both on-CPU and off-CPU [4] Flame Graphs. Looking forward to your feedback. P.S: I would like to give kudos to David Moravek for his prototyping work [5] on this feature. Although the proposed implementation significantly diverges from his prototype on the Flink side, the work done on connecting the d3-flame-graph library to the right data structure retrieved from Flink was instrumental for enabling this feature. [1] https://cwiki.apache.org/confluence/display/FLINK/FLIP-165%3A+Operator%27s+Flame+Graphs [2] http://www.brendangregg.com/flamegraphs.html [3] https://github.com/apache/flink/pull/15054 [4] http://www.brendangregg.com/FlameGraphs/offcpuflamegraphs.html [5] https://issues.apache.org/jira/browse/FLINK-13550?focusedCommentId=17083026&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-17083026 Best, -- Alexander Fedulov | Solutions Architect <https://www.ververica.com/> Follow us @VervericaData -- Join Flink Forward <https://flink-forward.org/> - The Apache Flink Conference Stream Processing | Event Driven | Real Time -- Ververica GmbH | Invalidenstrasse 115, 10115 Berlin, Germany -- Ververica GmbH Registered at Amtsgericht Charlottenburg: HRB 158244 B Managing Directors: Yip Park Tung Jason, Jinwei (Kevin) Zhang, Karl Anton Wehner |
Thanks for starting this discussion Alex. I really like this feature
because it gives better insights into what a Flink job is doing. Quick question on the TaskExecutorGateway extension: * What is the requestId used for in the RPC call? * Would it make sense to group numSubSamples, delayBetweenSamples and maxStackTraceDepth into a ThreadSamplesRequest class? This would decrease the number of parameters and group those which are closely related together. Cheers, Till On Tue, Mar 2, 2021 at 12:45 AM Alexander Fedulov <[hidden email]> wrote: > Hi All, > > I would like to start a discussion for FLIP-165: Operator's Flame Graphs > [1] > > A Flame Graph [2] is a visualization that is very effective for providing > answers to the questions like: > - Which methods are currently consuming CPU resources? > - How CPU utilization by one method compares to the others? > - Which series of calls on the stack led to executing a particular method? > > I have already opened a PR [3] that represents the implementation approach > proposed in the FLIP. It supports both on-CPU and off-CPU [4] Flame Graphs. > > Looking forward to your feedback. > > P.S: I would like to give kudos to David Moravek for his prototyping work > [5] on this feature. Although the proposed implementation significantly > diverges from his prototype on the Flink side, the work done on connecting > the d3-flame-graph library to the right data structure retrieved from Flink > was instrumental for enabling this feature. > > [1] > > https://cwiki.apache.org/confluence/display/FLINK/FLIP-165%3A+Operator%27s+Flame+Graphs > [2] http://www.brendangregg.com/flamegraphs.html > [3] https://github.com/apache/flink/pull/15054 > [4] http://www.brendangregg.com/FlameGraphs/offcpuflamegraphs.html > [5] > > https://issues.apache.org/jira/browse/FLINK-13550?focusedCommentId=17083026&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-17083026 > > Best, > > -- > > Alexander Fedulov | Solutions Architect > > <https://www.ververica.com/> > > Follow us @VervericaData > > -- > > Join Flink Forward <https://flink-forward.org/> - The Apache Flink > Conference > > Stream Processing | Event Driven | Real Time > > -- > > Ververica GmbH | Invalidenstrasse 115, 10115 Berlin, Germany > > -- > > Ververica GmbH > Registered at Amtsgericht Charlottenburg: HRB 158244 B > Managing Directors: Yip Park Tung Jason, Jinwei (Kevin) Zhang, Karl Anton > Wehner > |
Hi Till,
Thanks for your comments. * What is the requestId used for in the RPC call? It is the handle that is used as the key in the ThreadInfoRequestCoordinator's pending responses Map. I believe it was called sampleId in the StackTraceSampleCoordinator, but I decided to rename it because there is also a ThreadInfoSampleService which is actually responsible for sampling the JVM numSamples number of times. I found that the notion of what a sample is was a bit confusing. Now one thread info request corresponds to gathering numSamples from a corresponding Task. Hope that makes sense. * Would it make sense to group numSubSamples, delayBetweenSamples and maxStackTraceDepth into a ThreadSamplesRequest class? This would decrease the number of parameters and group those which are closely related together. Good point. I will rework it accordingly. Best, -- Alexander Fedulov | Solutions Architect Follow us @VervericaData -- Sent from: http://apache-flink-mailing-list-archive.1008284.n3.nabble.com/ |
Why does the caller of TaskExecutorGateway.requestThreadInfoSamples need to
specify the request id? Is it because the caller can send a second request with the same id? Or can the caller query the result of a previous request by specifying the requestId? If the TaskExecutor does not need to know about the id, then we might be able to drop it. Cheers Till On Tue, Mar 2, 2021 at 9:42 AM Alexander Fedulov <[hidden email]> wrote: > Hi Till, > > Thanks for your comments. > > * What is the requestId used for in the RPC call? > > It is the handle that is used as the key in the > ThreadInfoRequestCoordinator's pending responses Map. I believe it was > called sampleId in the StackTraceSampleCoordinator, but I decided to rename > it because there is also a ThreadInfoSampleService which is actually > responsible for sampling the JVM numSamples number of times. I found that > the notion of what a sample is was a bit confusing. Now one thread info > request corresponds to gathering numSamples from a corresponding Task. Hope > that makes sense. > > * Would it make sense to group numSubSamples, delayBetweenSamples and > maxStackTraceDepth into a ThreadSamplesRequest class? This would decrease > the number of parameters and group those which are closely related > together. > > Good point. I will rework it accordingly. > > Best, > -- > > Alexander Fedulov | Solutions Architect > Follow us @VervericaData > > > > -- > Sent from: http://apache-flink-mailing-list-archive.1008284.n3.nabble.com/ > |
It is passed back as part of the response to the asynchronous callback
within the coordinator and is used to decide if all outstanding requests to the parallel instances of a particular operator returned successfully. If so, the request is considered successful, sub-results are combined and the thread info result future for an operator completes. https://github.com/apache/flink/pull/15054/commits/281188a025077849efd630f1f7aa801ff79a9afd#diff-20a1c89043e8d480e7af6dd36596b3558be9c6e64f6f4cf065df97fe76411c50R150 https://github.com/apache/flink/pull/15054/commits/281188a025077849efd630f1f7aa801ff79a9afd#diff-20a1c89043e8d480e7af6dd36596b3558be9c6e64f6f4cf065df97fe76411c50R277 Best, -- Alexander Fedulov | Solutions Architect <https://www.ververica.com/> Follow us @VervericaData On Tue, Mar 2, 2021 at 12:28 PM Till Rohrmann <[hidden email]> wrote: > Why does the caller of TaskExecutorGateway.requestThreadInfoSamples need to > specify the request id? Is it because the caller can send a second request > with the same id? Or can the caller query the result of a previous request > by specifying the requestId? > > If the TaskExecutor does not need to know about the id, then we might be > able to drop it. > > Cheers > Till > > On Tue, Mar 2, 2021 at 9:42 AM Alexander Fedulov <[hidden email]> > wrote: > > > Hi Till, > > > > Thanks for your comments. > > > > * What is the requestId used for in the RPC call? > > > > It is the handle that is used as the key in the > > ThreadInfoRequestCoordinator's pending responses Map. I believe it was > > called sampleId in the StackTraceSampleCoordinator, but I decided to > rename > > it because there is also a ThreadInfoSampleService which is actually > > responsible for sampling the JVM numSamples number of times. I found that > > the notion of what a sample is was a bit confusing. Now one thread info > > request corresponds to gathering numSamples from a corresponding Task. > Hope > > that makes sense. > > > > * Would it make sense to group numSubSamples, delayBetweenSamples and > > maxStackTraceDepth into a ThreadSamplesRequest class? This would decrease > > the number of parameters and group those which are closely related > > together. > > > > Good point. I will rework it accordingly. > > > > Best, > > -- > > > > Alexander Fedulov | Solutions Architect > > Follow us @VervericaData > > > > > > > > -- > > Sent from: > http://apache-flink-mailing-list-archive.1008284.n3.nabble.com/ > > > |
Ah ok. Thanks for the clarification Alex.
Cheers, Till On Tue, Mar 2, 2021 at 2:02 PM Alexander Fedulov <[hidden email]> wrote: > It is passed back as part of the response to the asynchronous callback > within the coordinator and is used to decide if all outstanding requests to > the parallel instances of a particular operator returned successfully. If > so, the request is considered successful, sub-results are combined and the > thread info result future for an operator completes. > > > https://github.com/apache/flink/pull/15054/commits/281188a025077849efd630f1f7aa801ff79a9afd#diff-20a1c89043e8d480e7af6dd36596b3558be9c6e64f6f4cf065df97fe76411c50R150 > > > https://github.com/apache/flink/pull/15054/commits/281188a025077849efd630f1f7aa801ff79a9afd#diff-20a1c89043e8d480e7af6dd36596b3558be9c6e64f6f4cf065df97fe76411c50R277 > > Best, > > -- > > Alexander Fedulov | Solutions Architect > > <https://www.ververica.com/> > > Follow us @VervericaData > > > On Tue, Mar 2, 2021 at 12:28 PM Till Rohrmann <[hidden email]> > wrote: > > > Why does the caller of TaskExecutorGateway.requestThreadInfoSamples need > to > > specify the request id? Is it because the caller can send a second > request > > with the same id? Or can the caller query the result of a previous > request > > by specifying the requestId? > > > > If the TaskExecutor does not need to know about the id, then we might be > > able to drop it. > > > > Cheers > > Till > > > > On Tue, Mar 2, 2021 at 9:42 AM Alexander Fedulov < > [hidden email]> > > wrote: > > > > > Hi Till, > > > > > > Thanks for your comments. > > > > > > * What is the requestId used for in the RPC call? > > > > > > It is the handle that is used as the key in the > > > ThreadInfoRequestCoordinator's pending responses Map. I believe it was > > > called sampleId in the StackTraceSampleCoordinator, but I decided to > > rename > > > it because there is also a ThreadInfoSampleService which is actually > > > responsible for sampling the JVM numSamples number of times. I found > that > > > the notion of what a sample is was a bit confusing. Now one thread info > > > request corresponds to gathering numSamples from a corresponding Task. > > Hope > > > that makes sense. > > > > > > * Would it make sense to group numSubSamples, delayBetweenSamples and > > > maxStackTraceDepth into a ThreadSamplesRequest class? This would > decrease > > > the number of parameters and group those which are closely related > > > together. > > > > > > Good point. I will rework it accordingly. > > > > > > Best, > > > -- > > > > > > Alexander Fedulov | Solutions Architect > > > Follow us @VervericaData > > > > > > > > > > > > -- > > > Sent from: > > http://apache-flink-mailing-list-archive.1008284.n3.nabble.com/ > > > > > > |
Nice feature +1 from my side for it.
In the PR I think we are missing documentation. I think it's especially important to mention the limitations of this approach for performance analysis. If we make it easy for the user to get such kind of data, it's important they do not proverbially shoot themselves in their own foot with false conclusions. We should clearly mention how those data are sampled, and point to limitations such as: - data from all threads/operators are squashed together, so if there is a data skew it will be averaged out - stack sampling is/can be biased (JVM threads are more likely to be stopped in some places than others, while skipping/rarely stopping in the true hot spots - so one should treat the results with a grain of salt below a certain level) - true bottleneck might be obscured by the fact flame graphs are squashing results from all of the threads. There can be 60% of time spent in one component according to a flame graph, but it might not necessarily be the bottleneck, which could be in a completely different component running which has a single thread burning 100% of a single CPU core, barely visible in the flame graph at all. It's great to have such a nice tool readily and easily available, but we need to make sure people who are using it are aware when it can be misleading. Piotrek wt., 2 mar 2021 o 15:12 Till Rohrmann <[hidden email]> napisał(a): > Ah ok. Thanks for the clarification Alex. > > Cheers, > Till > > On Tue, Mar 2, 2021 at 2:02 PM Alexander Fedulov <[hidden email]> > wrote: > > > It is passed back as part of the response to the asynchronous callback > > within the coordinator and is used to decide if all outstanding requests > to > > the parallel instances of a particular operator returned successfully. If > > so, the request is considered successful, sub-results are combined and > the > > thread info result future for an operator completes. > > > > > > > https://github.com/apache/flink/pull/15054/commits/281188a025077849efd630f1f7aa801ff79a9afd#diff-20a1c89043e8d480e7af6dd36596b3558be9c6e64f6f4cf065df97fe76411c50R150 > > > > > > > https://github.com/apache/flink/pull/15054/commits/281188a025077849efd630f1f7aa801ff79a9afd#diff-20a1c89043e8d480e7af6dd36596b3558be9c6e64f6f4cf065df97fe76411c50R277 > > > > Best, > > > > -- > > > > Alexander Fedulov | Solutions Architect > > > > <https://www.ververica.com/> > > > > Follow us @VervericaData > > > > > > On Tue, Mar 2, 2021 at 12:28 PM Till Rohrmann <[hidden email]> > > wrote: > > > > > Why does the caller of TaskExecutorGateway.requestThreadInfoSamples > need > > to > > > specify the request id? Is it because the caller can send a second > > request > > > with the same id? Or can the caller query the result of a previous > > request > > > by specifying the requestId? > > > > > > If the TaskExecutor does not need to know about the id, then we might > be > > > able to drop it. > > > > > > Cheers > > > Till > > > > > > On Tue, Mar 2, 2021 at 9:42 AM Alexander Fedulov < > > [hidden email]> > > > wrote: > > > > > > > Hi Till, > > > > > > > > Thanks for your comments. > > > > > > > > * What is the requestId used for in the RPC call? > > > > > > > > It is the handle that is used as the key in the > > > > ThreadInfoRequestCoordinator's pending responses Map. I believe it > was > > > > called sampleId in the StackTraceSampleCoordinator, but I decided to > > > rename > > > > it because there is also a ThreadInfoSampleService which is actually > > > > responsible for sampling the JVM numSamples number of times. I found > > that > > > > the notion of what a sample is was a bit confusing. Now one thread > info > > > > request corresponds to gathering numSamples from a corresponding > Task. > > > Hope > > > > that makes sense. > > > > > > > > * Would it make sense to group numSubSamples, delayBetweenSamples and > > > > maxStackTraceDepth into a ThreadSamplesRequest class? This would > > decrease > > > > the number of parameters and group those which are closely related > > > > together. > > > > > > > > Good point. I will rework it accordingly. > > > > > > > > Best, > > > > -- > > > > > > > > Alexander Fedulov | Solutions Architect > > > > Follow us @VervericaData > > > > > > > > > > > > > > > > -- > > > > Sent from: > > > http://apache-flink-mailing-list-archive.1008284.n3.nabble.com/ > > > > > > > > > > |
Hi Piotr,
Thanks for the comments - all valid points. We should definitely document how the Flame Graphs are constructed - I will work on the docs. Do you have a proposition about the part of which page/section they should become? I would like to also mention here that I plan to work on further improvements, such as the ability to "zoom in" into the Flame Graphs for the individual Tasks in the "unsquashed" form, so some of those concerns should be mitigated in the future. Best, -- Alexander Fedulov | Solutions Architect <https://www.ververica.com/> Follow us @VervericaData On Tue, Mar 2, 2021 at 3:17 PM Piotr Nowojski <[hidden email]> wrote: > Nice feature +1 from my side for it. > > In the PR I think we are missing documentation. I think it's especially > important to mention the limitations of this approach for performance > analysis. If we make it easy for the user to get such kind of data, it's > important they do not proverbially shoot themselves in their own foot with > false conclusions. We should clearly mention how those data are sampled, > and point to limitations such as: > - data from all threads/operators are squashed together, so if there is a > data skew it will be averaged out > - stack sampling is/can be biased (JVM threads are more likely to be > stopped in some places than others, while skipping/rarely stopping in the > true hot spots - so one should treat the results with a grain of salt below > a certain level) > - true bottleneck might be obscured by the fact flame graphs are squashing > results from all of the threads. There can be 60% of time spent in one > component according to a flame graph, but it might not necessarily be the > bottleneck, which could be in a completely different component running > which has a single thread burning 100% of a single CPU core, barely visible > in the flame graph at all. > > It's great to have such a nice tool readily and easily available, but we > need to make sure people who are using it are aware when it can be > misleading. > > Piotrek > > wt., 2 mar 2021 o 15:12 Till Rohrmann <[hidden email]> napisał(a): > > > Ah ok. Thanks for the clarification Alex. > > > > Cheers, > > Till > > > > On Tue, Mar 2, 2021 at 2:02 PM Alexander Fedulov < > [hidden email]> > > wrote: > > > > > It is passed back as part of the response to the asynchronous callback > > > within the coordinator and is used to decide if all outstanding > requests > > to > > > the parallel instances of a particular operator returned successfully. > If > > > so, the request is considered successful, sub-results are combined and > > the > > > thread info result future for an operator completes. > > > > > > > > > > > > https://github.com/apache/flink/pull/15054/commits/281188a025077849efd630f1f7aa801ff79a9afd#diff-20a1c89043e8d480e7af6dd36596b3558be9c6e64f6f4cf065df97fe76411c50R150 > > > > > > > > > > > > https://github.com/apache/flink/pull/15054/commits/281188a025077849efd630f1f7aa801ff79a9afd#diff-20a1c89043e8d480e7af6dd36596b3558be9c6e64f6f4cf065df97fe76411c50R277 > > > > > > Best, > > > > > > -- > > > > > > Alexander Fedulov | Solutions Architect > > > > > > <https://www.ververica.com/> > > > > > > Follow us @VervericaData > > > > > > > > > On Tue, Mar 2, 2021 at 12:28 PM Till Rohrmann <[hidden email]> > > > wrote: > > > > > > > Why does the caller of TaskExecutorGateway.requestThreadInfoSamples > > need > > > to > > > > specify the request id? Is it because the caller can send a second > > > request > > > > with the same id? Or can the caller query the result of a previous > > > request > > > > by specifying the requestId? > > > > > > > > If the TaskExecutor does not need to know about the id, then we might > > be > > > > able to drop it. > > > > > > > > Cheers > > > > Till > > > > > > > > On Tue, Mar 2, 2021 at 9:42 AM Alexander Fedulov < > > > [hidden email]> > > > > wrote: > > > > > > > > > Hi Till, > > > > > > > > > > Thanks for your comments. > > > > > > > > > > * What is the requestId used for in the RPC call? > > > > > > > > > > It is the handle that is used as the key in the > > > > > ThreadInfoRequestCoordinator's pending responses Map. I believe it > > was > > > > > called sampleId in the StackTraceSampleCoordinator, but I decided > to > > > > rename > > > > > it because there is also a ThreadInfoSampleService which is > actually > > > > > responsible for sampling the JVM numSamples number of times. I > found > > > that > > > > > the notion of what a sample is was a bit confusing. Now one thread > > info > > > > > request corresponds to gathering numSamples from a corresponding > > Task. > > > > Hope > > > > > that makes sense. > > > > > > > > > > * Would it make sense to group numSubSamples, delayBetweenSamples > and > > > > > maxStackTraceDepth into a ThreadSamplesRequest class? This would > > > decrease > > > > > the number of parameters and group those which are closely related > > > > > together. > > > > > > > > > > Good point. I will rework it accordingly. > > > > > > > > > > Best, > > > > > -- > > > > > > > > > > Alexander Fedulov | Solutions Architect > > > > > Follow us @VervericaData > > > > > > > > > > > > > > > > > > > > -- > > > > > Sent from: > > > > http://apache-flink-mailing-list-archive.1008284.n3.nabble.com/ > > > > > > > > > > > > > > > |
Cool feature +1
There is a subsection called monitoring in the operations section of the docs. It would fit nicely there. Seth On Tue, Mar 2, 2021 at 4:23 PM Alexander Fedulov <[hidden email]> wrote: > Hi Piotr, > > Thanks for the comments - all valid points. > We should definitely document how the Flame Graphs are constructed - I will > work on the docs. Do you have a proposition about the part of which > page/section they should become? > I would like to also mention here that I plan to work on further > improvements, such as the ability to "zoom in" into the Flame Graphs for > the individual Tasks in the "unsquashed" form, so some of those concerns > should be mitigated in the future. > > Best, > > -- > > Alexander Fedulov | Solutions Architect > > <https://www.ververica.com/> > > Follow us @VervericaData > > > On Tue, Mar 2, 2021 at 3:17 PM Piotr Nowojski <[hidden email]> > wrote: > > > Nice feature +1 from my side for it. > > > > In the PR I think we are missing documentation. I think it's especially > > important to mention the limitations of this approach for performance > > analysis. If we make it easy for the user to get such kind of data, it's > > important they do not proverbially shoot themselves in their own foot > with > > false conclusions. We should clearly mention how those data are sampled, > > and point to limitations such as: > > - data from all threads/operators are squashed together, so if there is a > > data skew it will be averaged out > > - stack sampling is/can be biased (JVM threads are more likely to be > > stopped in some places than others, while skipping/rarely stopping in the > > true hot spots - so one should treat the results with a grain of salt > below > > a certain level) > > - true bottleneck might be obscured by the fact flame graphs are > squashing > > results from all of the threads. There can be 60% of time spent in one > > component according to a flame graph, but it might not necessarily be the > > bottleneck, which could be in a completely different component running > > which has a single thread burning 100% of a single CPU core, barely > visible > > in the flame graph at all. > > > > It's great to have such a nice tool readily and easily available, but we > > need to make sure people who are using it are aware when it can be > > misleading. > > > > Piotrek > > > > wt., 2 mar 2021 o 15:12 Till Rohrmann <[hidden email]> napisał(a): > > > > > Ah ok. Thanks for the clarification Alex. > > > > > > Cheers, > > > Till > > > > > > On Tue, Mar 2, 2021 at 2:02 PM Alexander Fedulov < > > [hidden email]> > > > wrote: > > > > > > > It is passed back as part of the response to the asynchronous > callback > > > > within the coordinator and is used to decide if all outstanding > > requests > > > to > > > > the parallel instances of a particular operator returned > successfully. > > If > > > > so, the request is considered successful, sub-results are combined > and > > > the > > > > thread info result future for an operator completes. > > > > > > > > > > > > > > > > > > https://github.com/apache/flink/pull/15054/commits/281188a025077849efd630f1f7aa801ff79a9afd#diff-20a1c89043e8d480e7af6dd36596b3558be9c6e64f6f4cf065df97fe76411c50R150 > > > > > > > > > > > > > > > > > > https://github.com/apache/flink/pull/15054/commits/281188a025077849efd630f1f7aa801ff79a9afd#diff-20a1c89043e8d480e7af6dd36596b3558be9c6e64f6f4cf065df97fe76411c50R277 > > > > > > > > Best, > > > > > > > > -- > > > > > > > > Alexander Fedulov | Solutions Architect > > > > > > > > <https://www.ververica.com/> > > > > > > > > Follow us @VervericaData > > > > > > > > > > > > On Tue, Mar 2, 2021 at 12:28 PM Till Rohrmann <[hidden email]> > > > > wrote: > > > > > > > > > Why does the caller of TaskExecutorGateway.requestThreadInfoSamples > > > need > > > > to > > > > > specify the request id? Is it because the caller can send a second > > > > request > > > > > with the same id? Or can the caller query the result of a previous > > > > request > > > > > by specifying the requestId? > > > > > > > > > > If the TaskExecutor does not need to know about the id, then we > might > > > be > > > > > able to drop it. > > > > > > > > > > Cheers > > > > > Till > > > > > > > > > > On Tue, Mar 2, 2021 at 9:42 AM Alexander Fedulov < > > > > [hidden email]> > > > > > wrote: > > > > > > > > > > > Hi Till, > > > > > > > > > > > > Thanks for your comments. > > > > > > > > > > > > * What is the requestId used for in the RPC call? > > > > > > > > > > > > It is the handle that is used as the key in the > > > > > > ThreadInfoRequestCoordinator's pending responses Map. I believe > it > > > was > > > > > > called sampleId in the StackTraceSampleCoordinator, but I decided > > to > > > > > rename > > > > > > it because there is also a ThreadInfoSampleService which is > > actually > > > > > > responsible for sampling the JVM numSamples number of times. I > > found > > > > that > > > > > > the notion of what a sample is was a bit confusing. Now one > thread > > > info > > > > > > request corresponds to gathering numSamples from a corresponding > > > Task. > > > > > Hope > > > > > > that makes sense. > > > > > > > > > > > > * Would it make sense to group numSubSamples, delayBetweenSamples > > and > > > > > > maxStackTraceDepth into a ThreadSamplesRequest class? This would > > > > decrease > > > > > > the number of parameters and group those which are closely > related > > > > > > together. > > > > > > > > > > > > Good point. I will rework it accordingly. > > > > > > > > > > > > Best, > > > > > > -- > > > > > > > > > > > > Alexander Fedulov | Solutions Architect > > > > > > Follow us @VervericaData > > > > > > > > > > > > > > > > > > > > > > > > -- > > > > > > Sent from: > > > > > http://apache-flink-mailing-list-archive.1008284.n3.nabble.com/ > > > > > > > > > > > > > > > > > > > > > |
This is going to make performance analysis and optimization much more
accessible. I can't wait to include this in our training courses. +1 Seth suggested putting the docs for this feature under Operations/Monitoring, but there's already a page in the docs under Operations/Debugging for Application Profiling & Debugging, which is more on point. I think it will be confusing if the flame graphs aren't together with the other profilers. David On Tue, Mar 2, 2021 at 11:36 PM Seth Wiesman <[hidden email]> wrote: > Cool feature +1 > > There is a subsection called monitoring in the operations section of the > docs. It would fit nicely there. > > Seth > > On Tue, Mar 2, 2021 at 4:23 PM Alexander Fedulov <[hidden email]> > wrote: > > > Hi Piotr, > > > > Thanks for the comments - all valid points. > > We should definitely document how the Flame Graphs are constructed - I > will > > work on the docs. Do you have a proposition about the part of which > > page/section they should become? > > I would like to also mention here that I plan to work on further > > improvements, such as the ability to "zoom in" into the Flame Graphs for > > the individual Tasks in the "unsquashed" form, so some of those concerns > > should be mitigated in the future. > > > > Best, > > > > -- > > > > Alexander Fedulov | Solutions Architect > > > > <https://www.ververica.com/> > > > > Follow us @VervericaData > > > > > > On Tue, Mar 2, 2021 at 3:17 PM Piotr Nowojski <[hidden email]> > > wrote: > > > > > Nice feature +1 from my side for it. > > > > > > In the PR I think we are missing documentation. I think it's especially > > > important to mention the limitations of this approach for performance > > > analysis. If we make it easy for the user to get such kind of data, > it's > > > important they do not proverbially shoot themselves in their own foot > > with > > > false conclusions. We should clearly mention how those data are > sampled, > > > and point to limitations such as: > > > - data from all threads/operators are squashed together, so if there > is a > > > data skew it will be averaged out > > > - stack sampling is/can be biased (JVM threads are more likely to be > > > stopped in some places than others, while skipping/rarely stopping in > the > > > true hot spots - so one should treat the results with a grain of salt > > below > > > a certain level) > > > - true bottleneck might be obscured by the fact flame graphs are > > squashing > > > results from all of the threads. There can be 60% of time spent in one > > > component according to a flame graph, but it might not necessarily be > the > > > bottleneck, which could be in a completely different component running > > > which has a single thread burning 100% of a single CPU core, barely > > visible > > > in the flame graph at all. > > > > > > It's great to have such a nice tool readily and easily available, but > we > > > need to make sure people who are using it are aware when it can be > > > misleading. > > > > > > Piotrek > > > > > > wt., 2 mar 2021 o 15:12 Till Rohrmann <[hidden email]> > napisał(a): > > > > > > > Ah ok. Thanks for the clarification Alex. > > > > > > > > Cheers, > > > > Till > > > > > > > > On Tue, Mar 2, 2021 at 2:02 PM Alexander Fedulov < > > > [hidden email]> > > > > wrote: > > > > > > > > > It is passed back as part of the response to the asynchronous > > callback > > > > > within the coordinator and is used to decide if all outstanding > > > requests > > > > to > > > > > the parallel instances of a particular operator returned > > successfully. > > > If > > > > > so, the request is considered successful, sub-results are combined > > and > > > > the > > > > > thread info result future for an operator completes. > > > > > > > > > > > > > > > > > > > > > > > > > https://github.com/apache/flink/pull/15054/commits/281188a025077849efd630f1f7aa801ff79a9afd#diff-20a1c89043e8d480e7af6dd36596b3558be9c6e64f6f4cf065df97fe76411c50R150 > > > > > > > > > > > > > > > > > > > > > > > > > https://github.com/apache/flink/pull/15054/commits/281188a025077849efd630f1f7aa801ff79a9afd#diff-20a1c89043e8d480e7af6dd36596b3558be9c6e64f6f4cf065df97fe76411c50R277 > > > > > > > > > > Best, > > > > > > > > > > -- > > > > > > > > > > Alexander Fedulov | Solutions Architect > > > > > > > > > > <https://www.ververica.com/> > > > > > > > > > > Follow us @VervericaData > > > > > > > > > > > > > > > On Tue, Mar 2, 2021 at 12:28 PM Till Rohrmann < > [hidden email]> > > > > > wrote: > > > > > > > > > > > Why does the caller of > TaskExecutorGateway.requestThreadInfoSamples > > > > need > > > > > to > > > > > > specify the request id? Is it because the caller can send a > second > > > > > request > > > > > > with the same id? Or can the caller query the result of a > previous > > > > > request > > > > > > by specifying the requestId? > > > > > > > > > > > > If the TaskExecutor does not need to know about the id, then we > > might > > > > be > > > > > > able to drop it. > > > > > > > > > > > > Cheers > > > > > > Till > > > > > > > > > > > > On Tue, Mar 2, 2021 at 9:42 AM Alexander Fedulov < > > > > > [hidden email]> > > > > > > wrote: > > > > > > > > > > > > > Hi Till, > > > > > > > > > > > > > > Thanks for your comments. > > > > > > > > > > > > > > * What is the requestId used for in the RPC call? > > > > > > > > > > > > > > It is the handle that is used as the key in the > > > > > > > ThreadInfoRequestCoordinator's pending responses Map. I believe > > it > > > > was > > > > > > > called sampleId in the StackTraceSampleCoordinator, but I > decided > > > to > > > > > > rename > > > > > > > it because there is also a ThreadInfoSampleService which is > > > actually > > > > > > > responsible for sampling the JVM numSamples number of times. I > > > found > > > > > that > > > > > > > the notion of what a sample is was a bit confusing. Now one > > thread > > > > info > > > > > > > request corresponds to gathering numSamples from a > corresponding > > > > Task. > > > > > > Hope > > > > > > > that makes sense. > > > > > > > > > > > > > > * Would it make sense to group numSubSamples, > delayBetweenSamples > > > and > > > > > > > maxStackTraceDepth into a ThreadSamplesRequest class? This > would > > > > > decrease > > > > > > > the number of parameters and group those which are closely > > related > > > > > > > together. > > > > > > > > > > > > > > Good point. I will rework it accordingly. > > > > > > > > > > > > > > Best, > > > > > > > -- > > > > > > > > > > > > > > Alexander Fedulov | Solutions Architect > > > > > > > Follow us @VervericaData > > > > > > > > > > > > > > > > > > > > > > > > > > > > -- > > > > > > > Sent from: > > > > > > http://apache-flink-mailing-list-archive.1008284.n3.nabble.com/ > > > > > > > > > > > > > > > > > > > > > > > > > > > > |
Added docs to the PR.
@David, thanks for the tip, it seems like a good place to put them. -- Alexander Fedulov | Solutions Architect <https://www.ververica.com/> Follow us @VervericaData On Wed, Mar 3, 2021 at 12:10 PM David Anderson <[hidden email]> wrote: > This is going to make performance analysis and optimization much more > accessible. I can't wait to include this in our training courses. > > +1 > > Seth suggested putting the docs for this feature under > Operations/Monitoring, but there's already a page in the docs under > Operations/Debugging for Application Profiling & Debugging, which is more > on point. I think it will be confusing if the flame graphs aren't > together with the other profilers. > > David > > On Tue, Mar 2, 2021 at 11:36 PM Seth Wiesman <[hidden email]> wrote: > > > Cool feature +1 > > > > There is a subsection called monitoring in the operations section of the > > docs. It would fit nicely there. > > > > Seth > > > > On Tue, Mar 2, 2021 at 4:23 PM Alexander Fedulov < > [hidden email]> > > wrote: > > > > > Hi Piotr, > > > > > > Thanks for the comments - all valid points. > > > We should definitely document how the Flame Graphs are constructed - I > > will > > > work on the docs. Do you have a proposition about the part of which > > > page/section they should become? > > > I would like to also mention here that I plan to work on further > > > improvements, such as the ability to "zoom in" into the Flame Graphs > for > > > the individual Tasks in the "unsquashed" form, so some of those > concerns > > > should be mitigated in the future. > > > > > > Best, > > > > > > -- > > > > > > Alexander Fedulov | Solutions Architect > > > > > > <https://www.ververica.com/> > > > > > > Follow us @VervericaData > > > > > > > > > On Tue, Mar 2, 2021 at 3:17 PM Piotr Nowojski <[hidden email]> > > > wrote: > > > > > > > Nice feature +1 from my side for it. > > > > > > > > In the PR I think we are missing documentation. I think it's > especially > > > > important to mention the limitations of this approach for performance > > > > analysis. If we make it easy for the user to get such kind of data, > > it's > > > > important they do not proverbially shoot themselves in their own foot > > > with > > > > false conclusions. We should clearly mention how those data are > > sampled, > > > > and point to limitations such as: > > > > - data from all threads/operators are squashed together, so if there > > is a > > > > data skew it will be averaged out > > > > - stack sampling is/can be biased (JVM threads are more likely to be > > > > stopped in some places than others, while skipping/rarely stopping in > > the > > > > true hot spots - so one should treat the results with a grain of salt > > > below > > > > a certain level) > > > > - true bottleneck might be obscured by the fact flame graphs are > > > squashing > > > > results from all of the threads. There can be 60% of time spent in > one > > > > component according to a flame graph, but it might not necessarily be > > the > > > > bottleneck, which could be in a completely different component > running > > > > which has a single thread burning 100% of a single CPU core, barely > > > visible > > > > in the flame graph at all. > > > > > > > > It's great to have such a nice tool readily and easily available, but > > we > > > > need to make sure people who are using it are aware when it can be > > > > misleading. > > > > > > > > Piotrek > > > > > > > > wt., 2 mar 2021 o 15:12 Till Rohrmann <[hidden email]> > > napisał(a): > > > > > > > > > Ah ok. Thanks for the clarification Alex. > > > > > > > > > > Cheers, > > > > > Till > > > > > > > > > > On Tue, Mar 2, 2021 at 2:02 PM Alexander Fedulov < > > > > [hidden email]> > > > > > wrote: > > > > > > > > > > > It is passed back as part of the response to the asynchronous > > > callback > > > > > > within the coordinator and is used to decide if all outstanding > > > > requests > > > > > to > > > > > > the parallel instances of a particular operator returned > > > successfully. > > > > If > > > > > > so, the request is considered successful, sub-results are > combined > > > and > > > > > the > > > > > > thread info result future for an operator completes. > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > https://github.com/apache/flink/pull/15054/commits/281188a025077849efd630f1f7aa801ff79a9afd#diff-20a1c89043e8d480e7af6dd36596b3558be9c6e64f6f4cf065df97fe76411c50R150 > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > https://github.com/apache/flink/pull/15054/commits/281188a025077849efd630f1f7aa801ff79a9afd#diff-20a1c89043e8d480e7af6dd36596b3558be9c6e64f6f4cf065df97fe76411c50R277 > > > > > > > > > > > > Best, > > > > > > > > > > > > -- > > > > > > > > > > > > Alexander Fedulov | Solutions Architect > > > > > > > > > > > > <https://www.ververica.com/> > > > > > > > > > > > > Follow us @VervericaData > > > > > > > > > > > > > > > > > > On Tue, Mar 2, 2021 at 12:28 PM Till Rohrmann < > > [hidden email]> > > > > > > wrote: > > > > > > > > > > > > > Why does the caller of > > TaskExecutorGateway.requestThreadInfoSamples > > > > > need > > > > > > to > > > > > > > specify the request id? Is it because the caller can send a > > second > > > > > > request > > > > > > > with the same id? Or can the caller query the result of a > > previous > > > > > > request > > > > > > > by specifying the requestId? > > > > > > > > > > > > > > If the TaskExecutor does not need to know about the id, then we > > > might > > > > > be > > > > > > > able to drop it. > > > > > > > > > > > > > > Cheers > > > > > > > Till > > > > > > > > > > > > > > On Tue, Mar 2, 2021 at 9:42 AM Alexander Fedulov < > > > > > > [hidden email]> > > > > > > > wrote: > > > > > > > > > > > > > > > Hi Till, > > > > > > > > > > > > > > > > Thanks for your comments. > > > > > > > > > > > > > > > > * What is the requestId used for in the RPC call? > > > > > > > > > > > > > > > > It is the handle that is used as the key in the > > > > > > > > ThreadInfoRequestCoordinator's pending responses Map. I > believe > > > it > > > > > was > > > > > > > > called sampleId in the StackTraceSampleCoordinator, but I > > decided > > > > to > > > > > > > rename > > > > > > > > it because there is also a ThreadInfoSampleService which is > > > > actually > > > > > > > > responsible for sampling the JVM numSamples number of times. > I > > > > found > > > > > > that > > > > > > > > the notion of what a sample is was a bit confusing. Now one > > > thread > > > > > info > > > > > > > > request corresponds to gathering numSamples from a > > corresponding > > > > > Task. > > > > > > > Hope > > > > > > > > that makes sense. > > > > > > > > > > > > > > > > * Would it make sense to group numSubSamples, > > delayBetweenSamples > > > > and > > > > > > > > maxStackTraceDepth into a ThreadSamplesRequest class? This > > would > > > > > > decrease > > > > > > > > the number of parameters and group those which are closely > > > related > > > > > > > > together. > > > > > > > > > > > > > > > > Good point. I will rework it accordingly. > > > > > > > > > > > > > > > > Best, > > > > > > > > -- > > > > > > > > > > > > > > > > Alexander Fedulov | Solutions Architect > > > > > > > > Follow us @VervericaData > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > -- > > > > > > > > Sent from: > > > > > > > > http://apache-flink-mailing-list-archive.1008284.n3.nabble.com/ > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > |
@Till, I've added the proposed ThreadInfoSamplesRequest and updated the
FLIP and the PR accordingly. Best, -- Alexander Fedulov | Solutions Architect <https://www.ververica.com/> Follow us @VervericaData -- Join Flink Forward <https://flink-forward.org/> - The Apache Flink Conference Stream Processing | Event Driven | Real Time -- Ververica GmbH | Invalidenstrasse 115, 10115 Berlin, Germany -- Ververica GmbH Registered at Amtsgericht Charlottenburg: HRB 158244 B Managing Directors: Yip Park Tung Jason, Jinwei (Kevin) Zhang, Karl Anton Wehner On Wed, Mar 3, 2021 at 5:03 PM Alexander Fedulov <[hidden email]> wrote: > Added docs to the PR. > @David, thanks for the tip, it seems like a good place to put them. > > -- > > Alexander Fedulov | Solutions Architect > > <https://www.ververica.com/> > > Follow us @VervericaData > > > > > On Wed, Mar 3, 2021 at 12:10 PM David Anderson <[hidden email]> > wrote: > >> This is going to make performance analysis and optimization much more >> accessible. I can't wait to include this in our training courses. >> >> +1 >> >> Seth suggested putting the docs for this feature under >> Operations/Monitoring, but there's already a page in the docs under >> Operations/Debugging for Application Profiling & Debugging, which is more >> on point. I think it will be confusing if the flame graphs aren't >> together with the other profilers. >> >> David >> >> On Tue, Mar 2, 2021 at 11:36 PM Seth Wiesman <[hidden email]> wrote: >> >> > Cool feature +1 >> > >> > There is a subsection called monitoring in the operations section of the >> > docs. It would fit nicely there. >> > >> > Seth >> > >> > On Tue, Mar 2, 2021 at 4:23 PM Alexander Fedulov < >> [hidden email]> >> > wrote: >> > >> > > Hi Piotr, >> > > >> > > Thanks for the comments - all valid points. >> > > We should definitely document how the Flame Graphs are constructed - I >> > will >> > > work on the docs. Do you have a proposition about the part of which >> > > page/section they should become? >> > > I would like to also mention here that I plan to work on further >> > > improvements, such as the ability to "zoom in" into the Flame Graphs >> for >> > > the individual Tasks in the "unsquashed" form, so some of those >> concerns >> > > should be mitigated in the future. >> > > >> > > Best, >> > > >> > > -- >> > > >> > > Alexander Fedulov | Solutions Architect >> > > >> > > <https://www.ververica.com/> >> > > >> > > Follow us @VervericaData >> > > >> > > >> > > On Tue, Mar 2, 2021 at 3:17 PM Piotr Nowojski <[hidden email]> >> > > wrote: >> > > >> > > > Nice feature +1 from my side for it. >> > > > >> > > > In the PR I think we are missing documentation. I think it's >> especially >> > > > important to mention the limitations of this approach for >> performance >> > > > analysis. If we make it easy for the user to get such kind of data, >> > it's >> > > > important they do not proverbially shoot themselves in their own >> foot >> > > with >> > > > false conclusions. We should clearly mention how those data are >> > sampled, >> > > > and point to limitations such as: >> > > > - data from all threads/operators are squashed together, so if there >> > is a >> > > > data skew it will be averaged out >> > > > - stack sampling is/can be biased (JVM threads are more likely to be >> > > > stopped in some places than others, while skipping/rarely stopping >> in >> > the >> > > > true hot spots - so one should treat the results with a grain of >> salt >> > > below >> > > > a certain level) >> > > > - true bottleneck might be obscured by the fact flame graphs are >> > > squashing >> > > > results from all of the threads. There can be 60% of time spent in >> one >> > > > component according to a flame graph, but it might not necessarily >> be >> > the >> > > > bottleneck, which could be in a completely different component >> running >> > > > which has a single thread burning 100% of a single CPU core, barely >> > > visible >> > > > in the flame graph at all. >> > > > >> > > > It's great to have such a nice tool readily and easily available, >> but >> > we >> > > > need to make sure people who are using it are aware when it can be >> > > > misleading. >> > > > >> > > > Piotrek >> > > > >> > > > wt., 2 mar 2021 o 15:12 Till Rohrmann <[hidden email]> >> > napisał(a): >> > > > >> > > > > Ah ok. Thanks for the clarification Alex. >> > > > > >> > > > > Cheers, >> > > > > Till >> > > > > >> > > > > On Tue, Mar 2, 2021 at 2:02 PM Alexander Fedulov < >> > > > [hidden email]> >> > > > > wrote: >> > > > > >> > > > > > It is passed back as part of the response to the asynchronous >> > > callback >> > > > > > within the coordinator and is used to decide if all outstanding >> > > > requests >> > > > > to >> > > > > > the parallel instances of a particular operator returned >> > > successfully. >> > > > If >> > > > > > so, the request is considered successful, sub-results are >> combined >> > > and >> > > > > the >> > > > > > thread info result future for an operator completes. >> > > > > > >> > > > > > >> > > > > > >> > > > > >> > > > >> > > >> > >> https://github.com/apache/flink/pull/15054/commits/281188a025077849efd630f1f7aa801ff79a9afd#diff-20a1c89043e8d480e7af6dd36596b3558be9c6e64f6f4cf065df97fe76411c50R150 >> > > > > > >> > > > > > >> > > > > > >> > > > > >> > > > >> > > >> > >> https://github.com/apache/flink/pull/15054/commits/281188a025077849efd630f1f7aa801ff79a9afd#diff-20a1c89043e8d480e7af6dd36596b3558be9c6e64f6f4cf065df97fe76411c50R277 >> > > > > > >> > > > > > Best, >> > > > > > >> > > > > > -- >> > > > > > >> > > > > > Alexander Fedulov | Solutions Architect >> > > > > > >> > > > > > <https://www.ververica.com/> >> > > > > > >> > > > > > Follow us @VervericaData >> > > > > > >> > > > > > >> > > > > > On Tue, Mar 2, 2021 at 12:28 PM Till Rohrmann < >> > [hidden email]> >> > > > > > wrote: >> > > > > > >> > > > > > > Why does the caller of >> > TaskExecutorGateway.requestThreadInfoSamples >> > > > > need >> > > > > > to >> > > > > > > specify the request id? Is it because the caller can send a >> > second >> > > > > > request >> > > > > > > with the same id? Or can the caller query the result of a >> > previous >> > > > > > request >> > > > > > > by specifying the requestId? >> > > > > > > >> > > > > > > If the TaskExecutor does not need to know about the id, then >> we >> > > might >> > > > > be >> > > > > > > able to drop it. >> > > > > > > >> > > > > > > Cheers >> > > > > > > Till >> > > > > > > >> > > > > > > On Tue, Mar 2, 2021 at 9:42 AM Alexander Fedulov < >> > > > > > [hidden email]> >> > > > > > > wrote: >> > > > > > > >> > > > > > > > Hi Till, >> > > > > > > > >> > > > > > > > Thanks for your comments. >> > > > > > > > >> > > > > > > > * What is the requestId used for in the RPC call? >> > > > > > > > >> > > > > > > > It is the handle that is used as the key in the >> > > > > > > > ThreadInfoRequestCoordinator's pending responses Map. I >> believe >> > > it >> > > > > was >> > > > > > > > called sampleId in the StackTraceSampleCoordinator, but I >> > decided >> > > > to >> > > > > > > rename >> > > > > > > > it because there is also a ThreadInfoSampleService which is >> > > > actually >> > > > > > > > responsible for sampling the JVM numSamples number of >> times. I >> > > > found >> > > > > > that >> > > > > > > > the notion of what a sample is was a bit confusing. Now one >> > > thread >> > > > > info >> > > > > > > > request corresponds to gathering numSamples from a >> > corresponding >> > > > > Task. >> > > > > > > Hope >> > > > > > > > that makes sense. >> > > > > > > > >> > > > > > > > * Would it make sense to group numSubSamples, >> > delayBetweenSamples >> > > > and >> > > > > > > > maxStackTraceDepth into a ThreadSamplesRequest class? This >> > would >> > > > > > decrease >> > > > > > > > the number of parameters and group those which are closely >> > > related >> > > > > > > > together. >> > > > > > > > >> > > > > > > > Good point. I will rework it accordingly. >> > > > > > > > >> > > > > > > > Best, >> > > > > > > > -- >> > > > > > > > >> > > > > > > > Alexander Fedulov | Solutions Architect >> > > > > > > > Follow us @VervericaData >> > > > > > > > >> > > > > > > > >> > > > > > > > >> > > > > > > > -- >> > > > > > > > Sent from: >> > > > > > > >> http://apache-flink-mailing-list-archive.1008284.n3.nabble.com/ >> > > > > > > > >> > > > > > > >> > > > > > >> > > > > >> > > > >> > > >> > >> > |
Thanks Alex. From my perspective we could continue with the vote now.
Cheers, Till On Thu, Mar 4, 2021 at 9:15 PM Alexander Fedulov <[hidden email]> wrote: > @Till, I've added the proposed ThreadInfoSamplesRequest and updated the > FLIP and the PR accordingly. > > Best, > > -- > > Alexander Fedulov | Solutions Architect > > <https://www.ververica.com/> > > Follow us @VervericaData > > -- > > Join Flink Forward <https://flink-forward.org/> - The Apache Flink > Conference > > Stream Processing | Event Driven | Real Time > > -- > > Ververica GmbH | Invalidenstrasse 115, 10115 Berlin, Germany > > -- > > Ververica GmbH > Registered at Amtsgericht Charlottenburg: HRB 158244 B > Managing Directors: Yip Park Tung Jason, Jinwei (Kevin) Zhang, Karl Anton > Wehner > > > > On Wed, Mar 3, 2021 at 5:03 PM Alexander Fedulov <[hidden email]> > wrote: > > > Added docs to the PR. > > @David, thanks for the tip, it seems like a good place to put them. > > > > -- > > > > Alexander Fedulov | Solutions Architect > > > > <https://www.ververica.com/> > > > > Follow us @VervericaData > > > > > > > > > > On Wed, Mar 3, 2021 at 12:10 PM David Anderson <[hidden email]> > > wrote: > > > >> This is going to make performance analysis and optimization much more > >> accessible. I can't wait to include this in our training courses. > >> > >> +1 > >> > >> Seth suggested putting the docs for this feature under > >> Operations/Monitoring, but there's already a page in the docs under > >> Operations/Debugging for Application Profiling & Debugging, which is > more > >> on point. I think it will be confusing if the flame graphs aren't > >> together with the other profilers. > >> > >> David > >> > >> On Tue, Mar 2, 2021 at 11:36 PM Seth Wiesman <[hidden email]> > wrote: > >> > >> > Cool feature +1 > >> > > >> > There is a subsection called monitoring in the operations section of > the > >> > docs. It would fit nicely there. > >> > > >> > Seth > >> > > >> > On Tue, Mar 2, 2021 at 4:23 PM Alexander Fedulov < > >> [hidden email]> > >> > wrote: > >> > > >> > > Hi Piotr, > >> > > > >> > > Thanks for the comments - all valid points. > >> > > We should definitely document how the Flame Graphs are constructed > - I > >> > will > >> > > work on the docs. Do you have a proposition about the part of which > >> > > page/section they should become? > >> > > I would like to also mention here that I plan to work on further > >> > > improvements, such as the ability to "zoom in" into the Flame Graphs > >> for > >> > > the individual Tasks in the "unsquashed" form, so some of those > >> concerns > >> > > should be mitigated in the future. > >> > > > >> > > Best, > >> > > > >> > > -- > >> > > > >> > > Alexander Fedulov | Solutions Architect > >> > > > >> > > <https://www.ververica.com/> > >> > > > >> > > Follow us @VervericaData > >> > > > >> > > > >> > > On Tue, Mar 2, 2021 at 3:17 PM Piotr Nowojski <[hidden email] > > > >> > > wrote: > >> > > > >> > > > Nice feature +1 from my side for it. > >> > > > > >> > > > In the PR I think we are missing documentation. I think it's > >> especially > >> > > > important to mention the limitations of this approach for > >> performance > >> > > > analysis. If we make it easy for the user to get such kind of > data, > >> > it's > >> > > > important they do not proverbially shoot themselves in their own > >> foot > >> > > with > >> > > > false conclusions. We should clearly mention how those data are > >> > sampled, > >> > > > and point to limitations such as: > >> > > > - data from all threads/operators are squashed together, so if > there > >> > is a > >> > > > data skew it will be averaged out > >> > > > - stack sampling is/can be biased (JVM threads are more likely to > be > >> > > > stopped in some places than others, while skipping/rarely stopping > >> in > >> > the > >> > > > true hot spots - so one should treat the results with a grain of > >> salt > >> > > below > >> > > > a certain level) > >> > > > - true bottleneck might be obscured by the fact flame graphs are > >> > > squashing > >> > > > results from all of the threads. There can be 60% of time spent in > >> one > >> > > > component according to a flame graph, but it might not necessarily > >> be > >> > the > >> > > > bottleneck, which could be in a completely different component > >> running > >> > > > which has a single thread burning 100% of a single CPU core, > barely > >> > > visible > >> > > > in the flame graph at all. > >> > > > > >> > > > It's great to have such a nice tool readily and easily available, > >> but > >> > we > >> > > > need to make sure people who are using it are aware when it can be > >> > > > misleading. > >> > > > > >> > > > Piotrek > >> > > > > >> > > > wt., 2 mar 2021 o 15:12 Till Rohrmann <[hidden email]> > >> > napisał(a): > >> > > > > >> > > > > Ah ok. Thanks for the clarification Alex. > >> > > > > > >> > > > > Cheers, > >> > > > > Till > >> > > > > > >> > > > > On Tue, Mar 2, 2021 at 2:02 PM Alexander Fedulov < > >> > > > [hidden email]> > >> > > > > wrote: > >> > > > > > >> > > > > > It is passed back as part of the response to the asynchronous > >> > > callback > >> > > > > > within the coordinator and is used to decide if all > outstanding > >> > > > requests > >> > > > > to > >> > > > > > the parallel instances of a particular operator returned > >> > > successfully. > >> > > > If > >> > > > > > so, the request is considered successful, sub-results are > >> combined > >> > > and > >> > > > > the > >> > > > > > thread info result future for an operator completes. > >> > > > > > > >> > > > > > > >> > > > > > > >> > > > > > >> > > > > >> > > > >> > > >> > https://github.com/apache/flink/pull/15054/commits/281188a025077849efd630f1f7aa801ff79a9afd#diff-20a1c89043e8d480e7af6dd36596b3558be9c6e64f6f4cf065df97fe76411c50R150 > >> > > > > > > >> > > > > > > >> > > > > > > >> > > > > > >> > > > > >> > > > >> > > >> > https://github.com/apache/flink/pull/15054/commits/281188a025077849efd630f1f7aa801ff79a9afd#diff-20a1c89043e8d480e7af6dd36596b3558be9c6e64f6f4cf065df97fe76411c50R277 > >> > > > > > > >> > > > > > Best, > >> > > > > > > >> > > > > > -- > >> > > > > > > >> > > > > > Alexander Fedulov | Solutions Architect > >> > > > > > > >> > > > > > <https://www.ververica.com/> > >> > > > > > > >> > > > > > Follow us @VervericaData > >> > > > > > > >> > > > > > > >> > > > > > On Tue, Mar 2, 2021 at 12:28 PM Till Rohrmann < > >> > [hidden email]> > >> > > > > > wrote: > >> > > > > > > >> > > > > > > Why does the caller of > >> > TaskExecutorGateway.requestThreadInfoSamples > >> > > > > need > >> > > > > > to > >> > > > > > > specify the request id? Is it because the caller can send a > >> > second > >> > > > > > request > >> > > > > > > with the same id? Or can the caller query the result of a > >> > previous > >> > > > > > request > >> > > > > > > by specifying the requestId? > >> > > > > > > > >> > > > > > > If the TaskExecutor does not need to know about the id, then > >> we > >> > > might > >> > > > > be > >> > > > > > > able to drop it. > >> > > > > > > > >> > > > > > > Cheers > >> > > > > > > Till > >> > > > > > > > >> > > > > > > On Tue, Mar 2, 2021 at 9:42 AM Alexander Fedulov < > >> > > > > > [hidden email]> > >> > > > > > > wrote: > >> > > > > > > > >> > > > > > > > Hi Till, > >> > > > > > > > > >> > > > > > > > Thanks for your comments. > >> > > > > > > > > >> > > > > > > > * What is the requestId used for in the RPC call? > >> > > > > > > > > >> > > > > > > > It is the handle that is used as the key in the > >> > > > > > > > ThreadInfoRequestCoordinator's pending responses Map. I > >> believe > >> > > it > >> > > > > was > >> > > > > > > > called sampleId in the StackTraceSampleCoordinator, but I > >> > decided > >> > > > to > >> > > > > > > rename > >> > > > > > > > it because there is also a ThreadInfoSampleService which > is > >> > > > actually > >> > > > > > > > responsible for sampling the JVM numSamples number of > >> times. I > >> > > > found > >> > > > > > that > >> > > > > > > > the notion of what a sample is was a bit confusing. Now > one > >> > > thread > >> > > > > info > >> > > > > > > > request corresponds to gathering numSamples from a > >> > corresponding > >> > > > > Task. > >> > > > > > > Hope > >> > > > > > > > that makes sense. > >> > > > > > > > > >> > > > > > > > * Would it make sense to group numSubSamples, > >> > delayBetweenSamples > >> > > > and > >> > > > > > > > maxStackTraceDepth into a ThreadSamplesRequest class? This > >> > would > >> > > > > > decrease > >> > > > > > > > the number of parameters and group those which are closely > >> > > related > >> > > > > > > > together. > >> > > > > > > > > >> > > > > > > > Good point. I will rework it accordingly. > >> > > > > > > > > >> > > > > > > > Best, > >> > > > > > > > -- > >> > > > > > > > > >> > > > > > > > Alexander Fedulov | Solutions Architect > >> > > > > > > > Follow us @VervericaData > >> > > > > > > > > >> > > > > > > > > >> > > > > > > > > >> > > > > > > > -- > >> > > > > > > > Sent from: > >> > > > > > > > >> http://apache-flink-mailing-list-archive.1008284.n3.nabble.com/ > >> > > > > > > > > >> > > > > > > > >> > > > > > > >> > > > > > >> > > > > >> > > > >> > > >> > > > |
Free forum by Nabble | Edit this page |