[VOTE] FLIP-102: Add More Metrics to TaskManager

[VOTE] FLIP-102: Add More Metrics to TaskManager

Yadong Xie
Hi all

I want to start the vote for FLIP-102, which proposes to add more metrics
to the task manager in the web UI.

To help everyone better understand the proposal, we put some effort into
building an online POC.

Previous web UI:
http://101.132.122.69:8081/#/task-manager/6df6c5f37b2bff125dbc3a7388128559/metrics
POC web UI:
http://101.132.122.69:8081/web/#/task-manager/6df6c5f37b2bff125dbc3a7388128559/metrics


The vote will last for at least 72 hours, following the consensus voting
process.

FLIP wiki:
https://cwiki.apache.org/confluence/display/FLINK/FLIP-102%3A+Add+More+Metrics+to+TaskManager

Discussion thread:
http://apache-flink-mailing-list-archive.1008284.n3.nabble.com/DISCUSS-FLIP-75-Flink-Web-UI-Improvement-Proposal-td33540.html

Thanks,

Yadong

Re: [VOTE] FLIP-102: Add More Metrics to TaskManager

Xintong Song
Thanks for driving this FLIP, Yadong.

+1 (non-binding) for the FLIP in general. I think this really helps our
users to understand and use the new FLIP-49 memory configuration.

I have a few minor comments.
- There's a frame "Other" in the frame "Non-Heap", besides "JVM Overhead"
and "JVM Metaspace". IIUC, its purpose is to explain the mismatch between
the metric "non-heap maximum" and the sum of the configurations "JVM
Metaspace" and "JVM Overhead". However, from the perspective of FLIP-49,
JVM Overhead accounts for all JVM non-heap memory usage except for
metaspace. The metric does not match the configuration because we did not
set a JVM parameter for "max non-heap memory" (actually, I'm not sure
whether it can be specified in Java 8). The current UI might confuse people
into thinking that there is other non-heap memory usage not accounted for
by the configuration. Therefore, I would suggest removing the "Other"
frame and adding another frame inside "JVM Overhead", next to
"Configuration", with "JVM limit" as the title and "non-heap max metric
minus metaspace configuration" as the value.

- In the final release, we changed "shuffle memory" to "network memory"
because the latter is easier for users to understand. I think we should
update it in this FLIP as well.

- There's a typo, "Directed" (should be "Direct"), at the direct memory
metric.

Thank you~

Xintong Song
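
The "JVM limit" value suggested in the first comment above could be derived roughly as follows. This is only a minimal sketch, assuming the "non-heap max" metric corresponds to what the JVM reports through `MemoryMXBean`; the class `JvmLimitSketch`, the method `jvmLimit`, and the example metaspace size are hypothetical and do not come from the FLIP.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryUsage;

/** Hypothetical helper for illustration; not part of Flink or the FLIP. */
final class JvmLimitSketch {

    /**
     * "JVM limit" = non-heap max metric minus configured metaspace size.
     * Returns -1 when the JVM reports no non-heap maximum (getMax() is -1).
     */
    static long jvmLimit(long nonHeapMaxBytes, long metaspaceConfiguredBytes) {
        if (nonHeapMaxBytes < 0) {
            return -1L;
        }
        return nonHeapMaxBytes - metaspaceConfiguredBytes;
    }

    public static void main(String[] args) {
        MemoryUsage nonHeap = ManagementFactory.getMemoryMXBean().getNonHeapMemoryUsage();
        // Assumed example value; in practice this would come from the configuration.
        long metaspaceConfigured = 256L * 1024 * 1024;
        System.out.println("non-heap max: " + nonHeap.getMax());
        System.out.println("JVM limit:    " + jvmLimit(nonHeap.getMax(), metaspaceConfigured));
    }
}
```

Because `MemoryUsage.getMax()` returns -1 when no maximum is defined, the sketch passes that case through instead of showing a misleading negative value.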




Re: [VOTE] FLIP-102: Add More Metrics to TaskManager

Kurt Young
Some questions related to "managed memory":

1. Should the managed memory be part of direct memory?
2. Should the shuffle memory also be part of the managed memory?

Best,
Kurt



Re: [VOTE] FLIP-102: Add More Metrics to TaskManager

Xintong Song
>
> 1. Should the managed memory be part of direct memory?
>
The answer is no. Managed memory is currently allocated by accessing a
private field of Unsafe. It is not accounted for in the JVM's direct memory
limit and the corresponding metrics. In that sense, it is equivalent to
native memory.


> 2. Should the shuffle memory also be part of the managed memory?

I don't think so. Shuffle (network) memory is allocated with direct
buffers and is accounted for in the JVM's direct memory limit and the
corresponding metrics. Moreover, the FLIP-49 memory model exposes network
memory and managed memory as two independent components of the overall
memory footprint.


Thank you~

Xintong Song
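
The accounting difference described above can be observed with the JVM's own buffer pool bean. The following is a minimal sketch, assuming a HotSpot-style JVM that exposes a pool named "direct"; the class `DirectPoolSketch` and its methods are hypothetical names for illustration. It only demonstrates the direct-buffer side: memory obtained through `ByteBuffer.allocateDirect` shows up in the pool, whereas memory allocated through Unsafe would not be counted there.

```java
import java.lang.management.BufferPoolMXBean;
import java.lang.management.ManagementFactory;
import java.nio.ByteBuffer;

/** Hypothetical helper for illustration; not part of Flink. */
final class DirectPoolSketch {

    /** Bytes currently reported by the JVM's "direct" buffer pool, or -1 if absent. */
    static long directUsed() {
        for (BufferPoolMXBean pool : ManagementFactory.getPlatformMXBeans(BufferPoolMXBean.class)) {
            if ("direct".equals(pool.getName())) {
                return pool.getMemoryUsed();
            }
        }
        return -1L;
    }

    /** Allocates a direct buffer and returns how much the "direct" pool grew. */
    static long growthAfterAllocating(int bytes) {
        long before = directUsed();
        ByteBuffer buffer = ByteBuffer.allocateDirect(bytes);
        long after = directUsed();
        // Touch the buffer so it stays reachable until after the measurement.
        buffer.put(0, (byte) 1);
        return after - before;
    }

    public static void main(String[] args) {
        System.out.println("direct pool growth: " + growthAfterAllocating(1 << 20) + " bytes");
    }
}
```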




Re: [VOTE] FLIP-102: Add More Metrics to TaskManager

Yadong Xie
Hi Xintong,
Thanks for your advice. The POC web UI and the FLIP doc have now been updated.
Here is the new link:
http://101.132.122.69:8081/web/#/task-manager/7e7cf0293645c8537caab915c829aa73/metrics



Re: [VOTE] FLIP-102: Add More Metrics to TaskManager

Till Rohrmann
Thanks for creating this FLIP, Yadong. I think your proposal makes it much
easier for the user to understand what's happening on Flink TaskManagers.

I have some comments:

1. Some of the newly introduced metrics involve computations on the
TaskManager. I would like to avoid additional computations introduced by
metrics as much as possible because metrics should not affect the system.
In particular, total memory sizes which are configured should not be
derived computationally (getManagedMemoryTotal, getTotalMemorySize). For
the currently available memory sizes (e.g. getManagedMemoryUsed), one could
think about reporting them on a per slot basis and to do the aggregation on
the client side. Of course, this would increase the size of the response
payload.

2. I'm not entirely sure whether I would split the memory display into JVM
memory and non-JVM memory as you've done in the POC. From a user's
perspective, one could start displaying the total process memory. The next
three most important metrics are the heap, managed memory and network
buffer usage, I guess. If one is interested in more details, one could then
display the remaining direct memory usage, the JVM overhead (I'm not sure
whether I would call this non-heap though) and the mapped memory.

3. Displaying the memory configurations in three nested boxes does not look
so nice to me. I'm not sure how else one could display it, though.

4. What does JVM limit mean in Non-heap.JVM-Overhead?

Cheers,
Till
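
The per-slot reporting idea from point 1 above can be sketched as follows: the TaskManager only reports raw per-slot values, and the client sums them. This is a minimal sketch under stated assumptions; the class `SlotAggregationSketch`, the method `totalUsed`, and the map-based payload shape are hypothetical and do not reflect the actual REST response format.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical client-side aggregation; not the actual Flink REST client. */
final class SlotAggregationSketch {

    /** Sums the used managed memory reported per slot (slot index -> bytes). */
    static long totalUsed(Map<Integer, Long> usedBytesPerSlot) {
        long total = 0L;
        for (long used : usedBytesPerSlot.values()) {
            total += used;
        }
        return total;
    }

    public static void main(String[] args) {
        // Assumed example payload: two slots with made-up usage values.
        Map<Integer, Long> perSlot = new LinkedHashMap<>();
        perSlot.put(0, 100L);
        perSlot.put(1, 250L);
        System.out.println("total used managed memory: " + totalUsed(perSlot));
    }
}
```

The trade-off mentioned in the message still applies: the response grows with the number of slots, in exchange for no aggregation work on the TaskManager.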


Re: [VOTE] FLIP-102: Add More Metrics to TaskManager

jing
About 1:
  1. ManagedMemory's total can be obtained from TaskExecutorResourceSpec.
  2. Registering ManagedMemory's used value in a SlotMetricGroup is OK. Since
there is currently no MetricGroup for the slot, we would need to create a
SlotMetricGroup.


Re: [VOTE] FLIP-102: Add More Metrics to TaskManager

jing
In reply to this post by Till Rohrmann
Hi Till
Thanks for your response!

I'm responsible for the RestAPI design part of FLIP-102

> 1. Some of the newly introduced metrics involve computations on the
> TaskManager. I would like to avoid additional computations introduced by
> metrics as much as possible because metrics should not affect the system.
> In particular, total memory sizes which are configured should not be
> derived computationally (getManagedMemoryTotal, getTotalMemorySize). For
> the currently available memory sizes (e.g. getManagedMemoryUsed), one could
> think about reporting them on a per slot basis and to do the aggregation on
> the client side. Of course, this would increase the size of the response
> payload.


I totally agree with your comment, but I still have a question: where
should the metric for a slot's ManagedMemory be registered?

There are two ways to achieve this:

   1. Add a SlotMetricGroup.
   2. Register it in TaskManagerMetricGroup under a name such as
0.Managed.Memory.Used (where 0 is the index of the slot).

Which way do you think is better? Looking forward to your reply.
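
The two registration options above differ mainly in how the metric name is scoped. The following is a minimal sketch using a plain map as a stand-in registry; it does not use Flink's metric API. The class `SlotMetricNamingSketch`, both methods, and the "slot" scope segment in option 1 are hypothetical; only the flat name `0.Managed.Memory.Used` is taken from the message.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.LongSupplier;

/** Hypothetical naming sketch; a plain map stands in for a metric registry. */
final class SlotMetricNamingSketch {

    /** Option 2: flat name inside the TaskManager-level group. */
    static String flatName(int slotIndex) {
        return slotIndex + ".Managed.Memory.Used";
    }

    /** Option 1: name inside a dedicated per-slot group (scope segment is assumed). */
    static String slotGroupName(int slotIndex) {
        return "slot." + slotIndex + ".Managed.Memory.Used";
    }

    public static void main(String[] args) {
        Map<String, LongSupplier> registry = new LinkedHashMap<>();
        // Made-up gauge values; a real gauge would read the slot's memory manager.
        registry.put(flatName(0), () -> 100L);
        registry.put(slotGroupName(0), () -> 100L);
        registry.forEach((name, gauge) -> System.out.println(name + " = " + gauge.getAsLong()));
    }
}
```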

> > >
> >
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-102%3A+Add+More+Metrics+to+TaskManager
> > > > > >
> > > > > > Discussion thread:
> > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
> http://apache-flink-mailing-list-archive.1008284.n3.nabble.com/DISCUSS-FLIP-75-Flink-Web-UI-Improvement-Proposal-td33540.html
> > > > > >
> > > > > > Thanks,
> > > > > >
> > > > > > Yadong
> > > > > >
> > > > >
> > > >
> > >
> >
>

Re: [VOTE] FLIP-102: Add More Metrics to TaskManager

Yadong Xie
In reply to this post by Till Rohrmann
Hi Till

Thanks a lot for your response

> 2. I'm not entirely sure whether I would split the memory ...

The split memory display comes from the 'ancient' design of the web UI. I am
fine with changing it to follow the sequence
total/heap/managed/network/direct/JVM overhead/mapped.

> 3. Displaying the memory configurations...

I agree with you that it is not a very nice layout, but the hierarchical
relationship between the configurations is complex and hard to display in
other ways (I have tried).

If anyone has a better idea, please do not hesitate to help me.


> 4. What does JVM limit mean in Non-heap.JVM-Overhead?

JVM limit is "non-heap max metric minus metaspace configuration", as @Xintong
Song <[hidden email]> replied in this mail thread.
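
A minimal sketch of the "JVM limit" calculation quoted above, for
illustration only. The class name JvmLimitSketch, the method jvmLimit, and
the metaspace value used in main are assumptions made for this example and
are not taken from the FLIP or from Flink. The non-heap maximum is read
through the standard java.lang.management API, which reports -1 when no
maximum is defined.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryUsage;

/** Hypothetical helper for illustration; not part of Flink. */
final class JvmLimitSketch {

    /**
     * Computes "non-heap max metric minus metaspace configuration".
     * Returns -1 when the JVM reports no non-heap maximum
     * (MemoryUsage#getMax() returns -1 if the maximum is undefined).
     */
    static long jvmLimit(long nonHeapMaxBytes, long metaspaceConfiguredBytes) {
        if (nonHeapMaxBytes < 0) {
            return -1L;
        }
        return nonHeapMaxBytes - metaspaceConfiguredBytes;
    }

    public static void main(String[] args) {
        MemoryUsage nonHeap = ManagementFactory.getMemoryMXBean().getNonHeapMemoryUsage();
        // Assumed metaspace configuration, chosen only for this example.
        long metaspaceConfiguredBytes = 256L * 1024 * 1024;
        System.out.println(
                "JVM limit (bytes): " + jvmLimit(nonHeap.getMax(), metaspaceConfiguredBytes));
    }
}
```

If the JVM was started without any bound on its non-heap pools, the sketch
prints -1, which is one reason the displayed value may need an explanation
in the UI.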



Re: [VOTE] FLIP-102: Add More Metrics to TaskManager

Till Rohrmann
Thinking a bit more about whether to report the aggregated memory
statistics or the individual slot statistics, I think reporting them on a
per-slot basis won't work nicely together with FLIP-56 (dynamic slot
allocation). The problem is that with FLIP-56, we will no longer have
dedicated slots. The number of slots might change over the lifetime of a
TaskExecutor. Hence, it won't be easy to generate a metric path for every
slot, and the slots are furthermore ephemeral. So maybe the more general and
easier solution would be to report the overall memory usage of a
TaskExecutor, even though this means doing some aggregation on the
TaskExecutor.
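
The aggregation described above could look roughly like the following
sketch. It is not Flink code: the class ManagedMemoryAggregationSketch and
its methods are hypothetical names, and the slot identifiers are plain
strings chosen for the example. The point it illustrates is that one
TaskExecutor-level value stays well defined while slots appear and
disappear, so no per-slot metric path is needed.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical sketch; not part of Flink. */
final class ManagedMemoryAggregationSketch {

    // Slot id -> managed memory currently used by that slot, in bytes.
    // Entries come and go as slots are allocated and released.
    private final Map<String, Long> usedBySlot = new ConcurrentHashMap<>();

    /** Records the latest usage reported for a slot. */
    void slotUpdated(String slotId, long usedBytes) {
        usedBySlot.put(slotId, usedBytes);
    }

    /** Forgets a slot that no longer exists. */
    void slotReleased(String slotId) {
        usedBySlot.remove(slotId);
    }

    /** Sums the usage of all currently known slots into one value. */
    long totalUsedBytes() {
        long sum = 0L;
        for (long used : usedBySlot.values()) {
            sum += used;
        }
        return sum;
    }
}
```

The cost is a small sum over the live slots on the TaskExecutor each time
the value is read, which is the aggregation overhead mentioned above.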

Concerning the JVM limit: Isn't it mainly the code cache? If we display
this value, then we should explain what exactly it means. I fear that most
users won't understand what JVM limit actually means.
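
One way to check what actually sits behind the non-heap figure on a given
JVM is to list its non-heap memory pools. The sketch below uses only the
standard java.lang.management API. The class name NonHeapPoolsSketch is an
assumption for this example, and the pool names printed depend on the JVM
vendor and version.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryPoolMXBean;
import java.lang.management.MemoryType;
import java.lang.management.MemoryUsage;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch; not part of Flink. */
final class NonHeapPoolsSketch {

    /** Returns one line per non-heap memory pool with its used and max bytes. */
    static List<String> describeNonHeapPools() {
        List<String> lines = new ArrayList<>();
        for (MemoryPoolMXBean pool : ManagementFactory.getMemoryPoolMXBeans()) {
            if (pool.getType() != MemoryType.NON_HEAP) {
                continue;
            }
            MemoryUsage usage = pool.getUsage(); // null if the pool is no longer valid
            if (usage == null) {
                continue;
            }
            // getMax() is -1 when the pool has no defined maximum.
            lines.add(pool.getName() + ": used=" + usage.getUsed() + ", max=" + usage.getMax());
        }
        return lines;
    }

    public static void main(String[] args) {
        describeNonHeapPools().forEach(System.out::println);
    }
}
```

Comparing the per-pool maxima with the overall non-heap maximum shows how
much of the "JVM limit" is attributable to the code cache on that JVM.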

Cheers,
Till


Re: [VOTE] FLIP-102: Add More Metrics to TaskManager

Yadong Xie
Hi all

I have updated the design of the metrics page and the FLIP doc. Please let
me know what you think about it.

FLIP-102:
https://cwiki.apache.org/confluence/display/FLINK/FLIP-102%3A+Add+More+Metrics+to+TaskManager
POC web:
http://101.132.122.69:8081/web/#/task-manager/8e1f1beada3859ee8e46d0960bb1da18/metrics


Re: [VOTE] FLIP-102: Add More Metrics to TaskManager

Till Rohrmann
Thanks for updating the FLIP Yadong.

What is the difference between managedMemory and managedMemoryTotal
and networkMemory and networkMemoryTotal in the REST response? If they are
duplicates, then we might be able to remove one.

Apart from that, the proposal looks good to me.

Also pulling in Andrey to hear his opinion about the representation of the
memory components.

Cheers,
Till

> > > > > > footprint.
> > > > > >
> > > > > >
> > > > > > Thank you~
> > > > > >
> > > > > > Xintong Song
> > > > > >
> > > > > >
> > > > > >
> > > > > > On Fri, Feb 21, 2020 at 11:45 AM Kurt Young <[hidden email]>
> > > wrote:
> > > > > >
> > > > > > > Some questions related to "managed memory":
> > > > > > >
> > > > > > > 1. Should the managed memory be part of direct memory?
> > > > > > > 2. Should the shuffle memory also be part of the managed
> memory?
> > > > > > >
> > > > > > > Best,
> > > > > > > Kurt
> > > > > > >
> > > > > > >
> > > > > > > On Fri, Feb 21, 2020 at 10:41 AM Xintong Song <
> > > [hidden email]
> > > > >
> > > > > > > wrote:
> > > > > > >
> > > > > > > > Thanks for driving this FLIP, Yadong.
> > > > > > > >
> > > > > > > > +1 (non-binding) for the FLIP in general. I think this really
> > > helps
> > > > > our
> > > > > > > > users to understand and use the new FLIP-49 memory
> > configuration.
> > > > > > > >
> > > > > > > > I have a few minor comments.
> > > > > > > > - There's a frame "Other" in the frame "Non-Heap", besides
> "JVM
> > > > > > Overhead"
> > > > > > > > and "JVM Metaspace". IIUC, the purpose of this is to explain
> > the
> > > > > > > > mismatching between the metric "non-heap maximum" and the sum
> > of
> > > > the
> > > > > > > > configurations "JVM metaspace" & "JVM Overhead". However,
> from
> > > the
> > > > > > > > perspective of FLIP-49, JVM Overhead accounts for all the JVM
> > > > > non-heap
> > > > > > > > memory usages except for metaspace. The metrics does not
> match
> > > the
> > > > > > > > configuration because we did not set the a JVM parameter for
> > "max
> > > > > > > non-heap
> > > > > > > > memory" (actually I'm not sure whether it can be specified in
> > > java
> > > > > 8).
> > > > > > > The
> > > > > > > > current UI might confuse people making them think there are
> > other
> > > > > > > non-heap
> > > > > > > > memory usages not accounted by the configurations.
> Therefore, I
> > > > would
> > > > > > > > suggest to remove the "Other" frame, but add another frame
> > inside
> > > > > "JVM
> > > > > > > > Overhead", besides "Configuration", with "JVM limit" as the
> > title
> > > > and
> > > > > > > > "non-heap max metric minus metaspace configuration" as the
> > value
> > > .
> > > > > > > >
> > > > > > > > - In the final release, we have changed "shuffle memory" to
> > > > "network
> > > > > > > > memory" because the latter is easier to understand for
> users. I
> > > > think
> > > > > > we
> > > > > > > > should be updated it in this FLIP as well.
> > > > > > > >
> > > > > > > > - There's a typo "Directed" (should be "Direct") at the
> direct
> > > > memory
> > > > > > > > metric.
> > > > > > > >
> > > > > > > > Thank you~
> > > > > > > >
> > > > > > > > Xintong Song
> > > > > > > >
> > > > > > > >
> > > > > > > >
> > > > > > > > On Thu, Feb 20, 2020 at 5:52 PM Yadong Xie <
> > [hidden email]>
> > > > > > wrote:
> > > > > > > >
> > > > > > > > > Hi all
> > > > > > > > >
> > > > > > > > > I want to start the vote for FLIP-102, which proposes to
> add
> > > more
> > > > > > > metrics
> > > > > > > > > to the task manager in web UI.
> > > > > > > > >
> > > > > > > > > To help everyone better understand the proposal, we spent
> > some
> > > > > > efforts
> > > > > > > on
> > > > > > > > > making an online POC
> > > > > > > > >
> > > > > > > > > previous web:
> > > > > > > > >
> > > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
> http://101.132.122.69:8081/#/task-manager/6df6c5f37b2bff125dbc3a7388128559/metrics
> > > > > > > > > POC web:
> > > > > > > > >
> > > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
> http://101.132.122.69:8081/web/#/task-manager/6df6c5f37b2bff125dbc3a7388128559/metrics
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > The vote will last for at least 72 hours, following the
> > > consensus
> > > > > > > voting
> > > > > > > > > process.
> > > > > > > > >
> > > > > > > > > FLIP wiki:
> > > > > > > > >
> > > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-102%3A+Add+More+Metrics+to+TaskManager
> > > > > > > > >
> > > > > > > > > Discussion thread:
> > > > > > > > >
> > > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
> http://apache-flink-mailing-list-archive.1008284.n3.nabble.com/DISCUSS-FLIP-75-Flink-Web-UI-Improvement-Proposal-td33540.html
> > > > > > > > >
> > > > > > > > > Thanks,
> > > > > > > > >
> > > > > > > > > Yadong
> > > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
>

Re: [VOTE] FLIP-102: Add More Metrics to TaskManager

Xintong Song
Sorry for the late response.

I have shared my suggestions with Yadong & Lining offline. I think it would
be better to also post them here, for the public record.

   - I'm not sure about displaying Total Process Memory Used. Currently, we
   do not have a good way to monitor the full memory footprint of the
   process. Metrics for some native memory usage (e.g., thread stacks) are
   absent. Displaying a partial used-memory size could confuse users.
   - I would suggest merging the current Mapped Memory metrics into Direct
   Memory. The metrics are retrieved from the MXBeans for the direct buffer
   pool and the mapped buffer pool. Both pools are accounted for in
   -XX:MaxDirectMemorySize, and no Flink configuration can modify the
   individual pool sizes. Therefore, I think displaying the total Direct
   Memory would be good enough. Moreover, in most use cases the size of the
   mapped buffer pool is zero, and users do not need to understand what
   Mapped Memory is. Expert users who do need separate metrics for the
   individual pools can subscribe to the metrics on their own.
   - I would suggest not displaying Non-Heap Memory. Despite the name, the
   metrics (also retrieved from MXBeans) actually account for metaspace,
   code cache, and compressed class space. They do not account for all JVM
   native memory overheads, e.g., thread stacks. That means the Non-Heap
   Memory metrics do not correspond well to any of the FLIP-49 memory
   components: they cover Flink's JVM Metaspace and part of JVM Overhead. I
   think this brings more confusion than help to users, especially new
   users.
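
A minimal sketch of how the two buffer pools discussed above can be read
through the standard JDK management API and reported as a single figure. This
is not Flink's implementation: the class name BufferPoolMetrics and the helper
usedBytes are hypothetical, and the pool names "direct" and "mapped" are
assumed to be the ones a HotSpot JVM reports.

```java
import java.lang.management.BufferPoolMXBean;
import java.lang.management.ManagementFactory;
import java.util.List;

public class BufferPoolMetrics {

    /**
     * Sums the reported usage of the buffer pools with the given names.
     * A pool that reports -1 (usage unknown) is counted as 0.
     */
    public static long usedBytes(List<BufferPoolMXBean> pools, String... names) {
        long total = 0L;
        for (BufferPoolMXBean pool : pools) {
            for (String name : names) {
                if (pool.getName().equals(name)) {
                    total += Math.max(0L, pool.getMemoryUsed());
                }
            }
        }
        return total;
    }

    public static void main(String[] args) {
        List<BufferPoolMXBean> pools =
                ManagementFactory.getPlatformMXBeans(BufferPoolMXBean.class);
        // Pool names are an assumption about the running JVM (HotSpot naming).
        long direct = usedBytes(pools, "direct");
        long mapped = usedBytes(pools, "mapped");
        System.out.println("direct used (bytes):   " + direct);
        System.out.println("mapped used (bytes):   " + mapped);
        System.out.println("combined used (bytes): " + (direct + mapped));
    }
}
```

Reporting only the combined value keeps the UI simple, while the per-pool
values stay available to anyone who queries the MXBeans directly.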


Thank you~

Xintong Song



On Thu, Mar 26, 2020 at 6:34 PM Till Rohrmann <[hidden email]> wrote:

> Thanks for updating the FLIP Yadong.
>
> What is the difference between managedMemory and managedMemoryTotal
> and networkMemory and networkMemoryTotal in the REST response? If they are
> duplicates, then we might be able to remove one.
>
> Apart from that, the proposal looks good to me.
>
> Pulling also Andrey in to hear his opinion about the representation of the
> memory components.
>
> Cheers,
> Till

Re: [VOTE] FLIP-102: Add More Metrics to TaskManager

Andrey Zagrebin-4
Hi All,

Thanks for this FLIP, Yadong. This is a very good improvement to
Flink's UI.
It looks like there are still a couple of things to resolve before the final
vote.

- I also find the "Non-heap" title in the configuration confusing because
there are also other non-heap types of memory. The "off-heap" concept is
quite broad.
What about "JVM specific", meaning that it does not come directly from Flink?
Or we could remove the "Non-heap" box altogether and show JVM Metaspace and
Overhead directly as separate boxes;
this would also fit if we decide to keep the Metaspace metric.

- Total Process Memory Used: I agree with Xintong, it is hard to say what
is used there.
Then the size of "Total Process Memory" basically becomes part of the
configuration.

- Non-Heap Used/Max/..: I am not sure what "committed" means here. I also
think we should either exclude it or display only what is known for sure.
In general, the metaspace usage would be nice to have, but then it should be
exactly the metaspace usage without anything else.

- I do not know how the mapped memory works. Is it meant for the new
spilled partitions? If the mapped memory also counts against the direct
memory limit,
then this is something we do not account for in our network buffers, as I
understand it. In that case, this metric may be useful for tuning: it shows
how much of the direct memory limit the mapped memory uses, so that e.g. the
framework off-heap limit can be set correctly and direct OOMs avoided.
It could be something to discuss with Zhijiang, e.g. is the direct
memory used there to buffer fetched regions of partition files, or for
something else?

- I am not sure we need an extra wrapping box "other" for the managed memory
atm. It could be just "Managed" or "Managed by Flink".
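
To make the Non-Heap point concrete, here is a small sketch that prints the
aggregate non-heap figures next to the individual non-heap memory pools, using
only the standard java.lang.management API. In that API, "committed" is the
amount of memory guaranteed to be available to the JVM, and "max" is -1 when no
limit is defined. The class name NonHeapBreakdown and the helper describe are
hypothetical; the pool names printed (for example a pool called "Metaspace")
depend on the JVM and are an assumption, not something this FLIP defines.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryPoolMXBean;
import java.lang.management.MemoryType;
import java.lang.management.MemoryUsage;

public class NonHeapBreakdown {

    /** Formats one usage record; max is -1 when the JVM defines no limit. */
    public static String describe(String name, MemoryUsage usage) {
        return name + ": used=" + usage.getUsed()
                + " committed=" + usage.getCommitted()
                + " max=" + usage.getMax();
    }

    public static void main(String[] args) {
        // Aggregate over all non-heap pools, as exposed by the MemoryMXBean.
        System.out.println(describe("Non-Heap (aggregate)",
                ManagementFactory.getMemoryMXBean().getNonHeapMemoryUsage()));

        // Per-pool view: on HotSpot this typically lists metaspace, code
        // cache and compressed class space separately.
        for (MemoryPoolMXBean pool : ManagementFactory.getMemoryPoolMXBeans()) {
            MemoryUsage usage = pool.getUsage();
            if (pool.getType() == MemoryType.NON_HEAP && usage != null) {
                System.out.println(describe(pool.getName(), usage));
            }
        }
    }
}
```

Displaying a single pool from the per-pool view would be one way to show
metaspace usage on its own rather than the mixed aggregate.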

Best,
Andrey

On Fri, Mar 27, 2020 at 6:13 AM Xintong Song <[hidden email]> wrote:

> Sorry for the late response.
>
> I have shared my suggestions with Yadong & Lining offline. I think it would
> be better to also post them here, for the public record.
>
>    - I'm not sure about displaying Total Process Memory Used. Currently, we
>    do not have a good way to monitor all memory footprints of the process.
>    Metrics for some native memory usages (e.g., thread stack) are absent.
>    Displaying a partial used memory size could be confusing for users.
>    - I would suggest merge the current Mapped Memory metrics into Direct
>    Memory. Actually, the metrics are retrieved from MXBeans for direct buffer
>    pool and mapped buffer pool. Both of the two pools are accounted for in
>    -XX:MaxDirectMemorySize. There's no Flink configuration that can modify the
>    individual pool sizes. Therefore, I think displaying the total Direct
>    Memory would be good enough. Moreover, in most use cases the size of mapped
>    buffer pool is zero and users do not need to understand what is Mapped
>    Memory. For expert users who do need the separated metrics for individual
>    pools, they can subscribe the metrics on their own.
>    - I would suggest to not display Non-Heap Memory. Despite the name, the
>    metrics (also retrieved from MXBeans) actually accounts for metaspace, code
>    cache, and compressed class space. It does not account for all JVM native
>    memory overheads, e.g., thread stack. That means the metrics of Non-Heap
>    Memory do not well correspond to any of the FLIP-49 memory components. They
>    account for Flink's JVM Metaspace and part of JVM Overhead. I think this
>    brings more confusion then help to users, especially primary users.
>
>
> Thank you~
>
> Xintong Song
>
>
>
> On Thu, Mar 26, 2020 at 6:34 PM Till Rohrmann <[hidden email]> wrote:
>
> > Thanks for updating the FLIP Yadong.
> >
> > What is the difference between managedMemory and managedMemoryTotal
> > and networkMemory and networkMemoryTotal in the REST response? If they are
> > duplicates, then we might be able to remove one.
> >
> > Apart from that, the proposal looks good to me.
> >
> > Pulling also Andrey in to hear his opinion about the representation of the
> > memory components.
> >
> > Cheers,
> > Till
> >
> > On Thu, Mar 19, 2020 at 11:37 AM Yadong Xie <[hidden email]> wrote:
> >
> >> Hi all
> >>
> >> I have updated the design of the metric page and FLIP doc, please let me
> >> know what you think about it
> >>
> >> FLIP-102:
> >>
> >>
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-102%3A+Add+More+Metrics+to+TaskManager
> >> POC web:
> >>
> >>
> http://101.132.122.69:8081/web/#/task-manager/8e1f1beada3859ee8e46d0960bb1da18/metrics
> >>
> >> Till Rohrmann <[hidden email]> 于2020年2月27日周四 下午10:27写道:
> >>
> >> > Thinking a bit more about the problem whether to report the aggregated
> >> > memory statistics or the individual slot statistics, I think reporting
> >> it
> >> > on a per slot basis won't work nicely together with FLIP-56 (dynamic
> >> slot
> >> > allocation). The problem is that with FLIP-56, we will no longer have
> >> > dedicated slots. The number of slots might change over the lifetime
> of a
> >> > TaskExecutor. Hence, it won't be easy to generate a metric path for
> >> every
> >> > slot which are furthermore also ephemeral. So maybe, the more general
> >> and
> >> > easier solution would be to report the overall memory usage of a
> >> > TaskExecutor even though it means to do some aggregation on the
> >> > TaskExecutor.
> >> >
> >> > Concerning the JVM limit: Isn't it mainly the code cache? If we
> display
> >> > this value, then we should explain what exactly it means. I fear that
> >> most
> >> > users won't understand what JVM limit actually means.
> >> >
> >> > Cheers,
> >> > Till
> >> >
> >> > On Wed, Feb 26, 2020 at 11:15 AM Yadong Xie <[hidden email]>
> >> wrote:
> >> >
> >> > > Hi Till
> >> > >
> >> > > Thanks a lot for your response
> >> > >
> >> > > > 2. I'm not entirely sure whether I would split the memory ...
> >> > >
> >> > > Split the memory display comes from the 'ancient' design of the web,
> >> it
> >> > is
> >> > > ok for me to change it following
> total/heap/managed/network/direct/jvm
> >> > > overhead/mapped sequence
> >> > >
> >> > > > 3. Displaying the memory configurations...
> >> > >
> >> > > I agree with you that it is not a very nice way, but the
> hierarchical
> >> > > relationship of configurations is too complex and hard to display in
> >> the
> >> > > other ways (I have tried)
> >> > >
> >> > > if anyone has a better idea, please feels no hesitates to help me
> >> > >
> >> > >
> >> > > > 4. What does JVM limit mean in Non-heap.JVM-Overhead?
> >> > >
> >> > > JVM limit is "non-heap max metric minus metaspace configuration" as
> >> > > @Xintong
> >> > > Song <[hidden email]> replyed in this mail thread
> >> > >
> >> > >
> >> > > Till Rohrmann <[hidden email]> 于2020年2月25日周二 下午6:58写道:
> >> > >
> >> > > > Thanks for creating this FLIP Yadong. I think your proposal makes
> it
> >> > much
> >> > > > easier for the user to understand what's happening on Flink
> >> > > TaskManager's.
> >> > > >
> >> > > > I have some comments:
> >> > > >
> >> > > > 1. Some of the newly introduced metrics involve computations on
> the
> >> > > > TaskManager. I would like to avoid additional computations
> >> introduced
> >> > by
> >> > > > metrics as much as possible because metrics should not affect the
> >> > system.
> >> > > > In particular, total memory sizes which are configured should not
> be
> >> > > > derived computationally (getManagedMemoryTotal,
> getTotalMemorySize).
> >> > For
> >> > > > the currently available memory sizes (e.g. getManagedMemoryUsed),
> >> one
> >> > > could
> >> > > > think about reporting them on a per slot basis and to do the
> >> > aggregation
> >> > > on
> >> > > > the client side. Of course, this would increase the size of the
> >> > response
> >> > > > payload.
> >> > > >
> >> > > > 2. I'm not entirely sure whether I would split the memory display
> >> into
> >> > > JVM
> >> > > > memory and non JVM memory as you've done it int the POC. From a
> >> user's
> >> > > > perspective, one could start displaying the total process memory.
> >> The
> >> > > next
> >> > > > three most important metrics are the heap, managed memory and
> >> network
> >> > > > buffer usage, I guess. If one is interested in more details, one
> >> could
> >> > > then
> >> > > > display the remaining direct memory usage, the JVM overhead (I'm
> not
> >> > sure
> >> > > > whether I would call this non-heap though) and the mapped memory.
> >> > > >
> >> > > > 3. Displaying the memory configurations in three nested boxes does
> >> not
> >> > > look
> >> > > > so nice to me. I'm not sure how else one could display it, though.
> >> > > >
> >> > > > 4. What does JVM limit mean in Non-heap.JVM-Overhead?
> >> > > >
> >> > > > Cheers,
> >> > > > Till
> >> > > >
> >> > > > On Tue, Feb 25, 2020 at 8:19 AM Yadong Xie <[hidden email]>
> >> > wrote:
> >> > > >
> >> > > > > Hi Xintong
> >> > > > > thanks for your advice, the POC web and the FLIP doc was updated
> >> now
> >> > > > > here is the new link:
> >> > > > >
> >> > > > >
> >> > > >
> >> > >
> >> >
> >>
> http://101.132.122.69:8081/web/#/task-manager/7e7cf0293645c8537caab915c829aa73/metrics
> >> > > > >
> >> > > > >
> >> > > > > Xintong Song <[hidden email]> 于2020年2月21日周五 下午12:00写道:
> >> > > > >
> >> > > > > > >
> >> > > > > > > 1. Should the managed memory be part of direct memory?
> >> > > > > > >
> >> > > > > > The answer is no. Managed memory is currently allocated by
> >> > accessing
> >> > > to
> >> > > > > > private field of Unsafe. It is not accounted for in JVM's
> direct
> >> > > memory
> >> > > > > > limit and corresponding metrics. To that end, it is equivalent
> >> to
> >> > > > > > native memory.
> >> > > > > >
> >> > > > > >
> >> > > > > > > 2. Should the shuffle memory also be part of the managed
> >> memory?
> >> > > > > >
> >> > > > > > I don't think so. Shuffle (Network) memory is allocated with
> >> direct
> >> > > > > > buffers, and accounted for in JVM's direct memory limit and
> >> > > > corresponding
> >> > > > > > metrics. Moreover, the FLIP-49 memory model expose network
> >> memory
> >> > and
> >> > > > > > managed memory as two independent components of the overall
> >> memory
> >> > > > > > footprint.
> >> > > > > >
> >> > > > > >
> >> > > > > > Thank you~
> >> > > > > >
> >> > > > > > Xintong Song
> >> > > > > >
> >> > > > > >
> >> > > > > >

Re: [VOTE] FLIP-102: Add More Metrics to TaskManager

Zhijiang(wangzhijiang999)
Thanks for the FLIP, Yadong. In general, I think this work is valuable because it helps users better understand Flink's memory usage along different dimensions.

Sorry for not having gone through all of the detailed discussion below; I will try to do that later if possible. First, let me try to address some of Andrey's concerns about mmap.

> - I do not know how the mapped memory works. Is it meant for the new spilled partitions? If the mapped memory also pulls from the direct
> memory limit then this is something we do not account in our network buffers as I understand. In this case, this metric may be useful for tuning to understand
> how much the mapped memory uses from the direct memory limit to set e.g. framework off-heap limit correctly and avoid direct OOM.
> It could be something to discuss with Zhijiang. e.g. is the direct memory used there to buffer fetched regions of partition files or what for?

Yes, mapped memory is currently used by the bounded blocking partition for batch jobs, but it is not the default mode.

AFAIK it is neither related to nor limited by the `MaxDirectMemorySize` setting, so we do not need to worry about the current direct memory setting or a potential OOM.
The mapped file size is bounded by the address space, and on a 64-bit system we can regard it as unlimited in theory.

Regarding the size of the mapped buffer pool reported by the MXBean: it only indicates how much file space has been mapped so far. It stays unchanged regardless of actual use, so it does not reflect
real physical memory consumption. E.g. if a 100GB region of a file is mapped at the beginning, the MXBean reports 100GB for the mapped buffer pool. How much physical
memory is really consumed depends on the actual read and write operations, and is also controlled by the operating system. E.g. some unused regions might be
swapped out when physical memory is limited.

From this point of view, I guess it is not meaningful to show the size of the mapped buffer pool to users, who are probably more concerned with how much physical memory is really
used.
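
To illustrate the point about the MXBean value, here is a minimal sketch, assuming a HotSpot-style JVM that exposes buffer pools named "direct" and "mapped" through `BufferPoolMXBean`. The class and method names (`MappedPoolProbe`, `poolUsedBytes`, `mapTempFile`) are hypothetical and not part of Flink. It maps a region of a temporary file without reading or writing it, and then reads the pool counters. It only shows the counter side of the argument; it does not measure resident physical memory.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.lang.management.BufferPoolMXBean;
import java.lang.management.ManagementFactory;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

public class MappedPoolProbe {

    /** Returns the bytes reported by the named JVM buffer pool, or -1 if no such pool exists. */
    public static long poolUsedBytes(String poolName) {
        for (BufferPoolMXBean pool : ManagementFactory.getPlatformMXBeans(BufferPoolMXBean.class)) {
            if (pool.getName().equals(poolName)) {
                return pool.getMemoryUsed();
            }
        }
        return -1L;
    }

    /** Maps a region of a fresh temporary file; the pages are never read or written here. */
    public static MappedByteBuffer mapTempFile(int regionBytes) {
        try {
            Path file = Files.createTempFile("mapped-pool-probe", ".bin");
            file.toFile().deleteOnExit();
            try (FileChannel channel =
                    FileChannel.open(file, StandardOpenOption.READ, StandardOpenOption.WRITE)) {
                // The mapping stays valid after the channel is closed.
                return channel.map(FileChannel.MapMode.READ_WRITE, 0, regionBytes);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        long before = poolUsedBytes("mapped");
        MappedByteBuffer region = mapTempFile(64 * 1024 * 1024);
        long after = poolUsedBytes("mapped");

        // The pool counter reflects the mapped size, not the pages actually touched.
        System.out.println("mapped pool before: " + before + " bytes");
        System.out.println("mapped pool after:  " + after + " bytes");
        System.out.println("direct pool:        " + poolUsedBytes("direct") + " bytes");
        System.out.println("region capacity:    " + region.capacity() + " bytes");
    }
}
```

In this sketch the "mapped" counter changes as soon as the region is mapped, while the "direct" counter is reported separately, which is why the mapped value alone says little about physical memory use.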

Best,
Zhijiang




Re: [VOTE] FLIP-102: Add More Metrics to TaskManager

Andrey Zagrebin-4
Hi guys,

Thanks for the additional details, Zhijiang.
It also looks to me that the mapped memory size is mostly driven by OS limits
and the bitness of the JVM (32/64).

Thinking more about the 'Metrics' tab layout, a couple more things have
come to mind.

# 'Metrics' tab -> 'Memory': 'Metrics' and 'Configuration' tabs

It contains only memory-specific things, and the design suggests not only
metrics but configuration as well.
Moreover, there are other metrics on top which are not in the metrics tab.
Therefore, I would name it 'Memory' and then add sub-tabs, e.g. 'Metrics'
and 'Configuration'.
Alternatively, one could consider splitting 'Metrics' into 'Metrics' and
'Configuration' tabs.

# Metrics (a bit different structure)

I would put memory metrics into 4 groups:
- JVM Memory
- Managed
- Network
- Garbage collection

Alternatively, one could consider:
- Managed by JVM (same as JVM Memory)
- Managed by Flink (Managed Segments and Network buffers)
- Garbage collection

## Total memory (remove from metrics)

As mentioned in the discussions before, it is hard to measure the total
memory usage.
Therefore, I would put it into the configuration tab, see below.

## JVM Memory

Here we can have Heap, Non-Heap, Direct and Mapped because they are all
managed by the JVM.
Heap and Direct can stay as they are.

### Non-Heap (could stay for now)

I think it is ok to keep Non-Heap for now because we also had it before.
This metric does not correlate explicitly with FLIP-49 but it is exposed by
the JVM.
Once we find better things to show (related only to the JVM, e.g. Metaspace
etc.), we can reconsider this as a follow-up.

### Mapped (looks still valuable)

As I understand it at the moment, this can be valuable for users who want to
monitor the spilling of batch partitions.

### Metaspace (new, sub-component of Non-Heap, follow-up)

We have never had anything for the Metaspace. Recent experience shows
that it can be useful.
I would put it on the roadmap as a follow-up though, because it also needs
some research and preparation on the server side [1].
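
As a starting point for that research, here is a minimal sketch, assuming the standard `java.lang.management` API. The class and method names (`NonHeapBreakdown`, `nonHeapUsedByPool`) are hypothetical. It prints the aggregated Non-Heap usage next to the individual non-heap memory pools. The pool names are JVM-specific; on HotSpot one of them is typically called "Metaspace", but this is an assumption about the runtime, not something the API guarantees.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryPoolMXBean;
import java.lang.management.MemoryType;
import java.lang.management.MemoryUsage;
import java.util.LinkedHashMap;
import java.util.Map;

public class NonHeapBreakdown {

    /** Returns used bytes per non-heap memory pool, keyed by the JVM-specific pool name. */
    public static Map<String, Long> nonHeapUsedByPool() {
        Map<String, Long> usedByPool = new LinkedHashMap<>();
        for (MemoryPoolMXBean pool : ManagementFactory.getMemoryPoolMXBeans()) {
            MemoryUsage usage = pool.getUsage(); // null if the pool is no longer valid
            if (pool.getType() == MemoryType.NON_HEAP && usage != null) {
                usedByPool.put(pool.getName(), usage.getUsed());
            }
        }
        return usedByPool;
    }

    public static void main(String[] args) {
        // Aggregated view: this is the metric currently shown as "Non-Heap".
        MemoryUsage total = ManagementFactory.getMemoryMXBean().getNonHeapMemoryUsage();
        System.out.println("non-heap used=" + total.getUsed()
                + " committed=" + total.getCommitted()
                + " max=" + total.getMax()); // max is -1 when undefined

        // Per-pool view: a Metaspace-only metric would pick one entry from here.
        nonHeapUsedByPool().forEach(
                (name, used) -> System.out.println("  " + name + ": " + used + " bytes"));
    }
}
```

A server-side Metaspace metric could then select a single pool by name instead of reporting the aggregate, provided the pool name is stable across the JVMs that need to be supported.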

# Configuration (see Flink user docs picture)

We already have a picture in the docs representing memory components in
Flink [2].
The layout in this picture can also be used in this FLIP to depict the
actual configuration.
It would be clearer for users to see the same layout as in the docs.

The configuration can also depict the size of the total process memory and
the total Flink memory, as in the docs.

As mentioned above, I also suggest putting it into a separate tab.

Best,
Andrey

[1]
https://kb.novaordis.com/index.php/Memory_Monitoring_and_Management_Platform_MBeans#Metaspace
[2]
https://ci.apache.org/projects/flink/flink-docs-release-1.10/ops/memory/mem_detail.html#overview


On Wed, Apr 1, 2020 at 8:03 PM Zhijiang <[hidden email]>
wrote:

> Thanks for the FLIP, Yadong. In general I think this work is valuable for
> users to better understand the Flink's memory usages in different
> dimensions.
>
> Sorry for not going through every detailed discussions below, and I try to
> do that later if possible. Firstly I try to answer some Andrey's concerns
> with mmap.
>
> > - I do not know how the mapped memory works. Is it meant for the new
> spilled partitions? If the mapped memory also pulls from the direct
> > memory limit then this is something we do not account in our network
> buffers as I understand. In this case, this metric may be useful for tuning
> to understand
> > how much the mapped memory uses from the direct memory limit to set e.g.
> framework off-heap limit correctly and avoid direct OOM.
> > It could be something to discuss with Zhijiang. e.g. is the direct
> memory used there to buffer fetched regions of partition files or what for?
>
> Yes, the mapped memory is used in bounded blocking partition for batch
> jobs now, but not the default mode.
>
>  AIK it is not related and limited to the setting of `MaxDirectMemory`, so
> we do not need to worry about the current direct memory setting and the
> potential OOM issue.
> It is up to the address space to determine the mapped file size, and in 64
> bit system we can regard the limitless size in theory.
>
> Regarding the size of mapped buffer pool from MXBean, it only indicates
> how much file size were already mapped before, even it is unchanged to not
> reflect the real
> physical memory use. E.g. when the file was mapped 100GB region at the
> beginning, the mapped buffer pool from MXBean would be 100GB. But how many
> physical
> memories are really consumed is up to the specific read or write
> operations in practice, and also controlled by the operator system. E.g
> some unused regions might be
> exchanged into SWAP virtual memory when physical memory is limited.
>
> From this point, I guess it is no meaningful to show the size of mapped
> buffer pool for users who may be more concerned with how many physical
> memories are really
> used.
>
> Best,
> Zhijiang
>
>
> ------------------------------------------------------------------
> From:Andrey Zagrebin <[hidden email]>
> Send Time:2020 Mar. 30 (Mon.) 22:56
> To:dev <[hidden email]>
> Subject:Re: [VOTE] FLIP-102: Add More Metrics to TaskManager
>
> Hi All,
>
> Thanks for this FLIP, Yadong. This is a very good improvement to the
> Flink's UI.
> It looks like there are still couple of things to resolve before the final
> vote.
>
> - I also find the non-heap title in configuration confusing because there
> are also other non-heap types of memory. The "off-heap" concept is quite
> broad.
> What about "JVM specific" meaning that it is not coming directly from
> Flink?
> or we could remove the "Non-heap" box at all and show directly JVM
> Metaspace and Overhead as separate boxes,
> this would also fit if we decide to keep the Metaspace metric.
>
> - Total Process Memory Used: I agree with Xintong, it is hard to say what
> is used there.
> Then the size of "Total Process Memory" basically becomes part of
> configuration.
>
> - Non-Heap Used/Max/.. Not sure what committed means here. I also think we
> should either exclude it or display what is known for sure.
> In general, the metaspace usage would be nice to have but it should be then
> exactly metaspace usage without any thing else.
>
> - I do not know how the mapped memory works. Is it meant for the new
> spilled partitions? If the mapped memory also pulls from the direct
> memory limit
> then this is something we do not account in our network buffers as I
> understand. In this case, this metric may be useful for tuning to
> understand
> how much the mapped memory uses from the direct memory limit to set e.g.
> framework off-heap limit correctly and avoid direct OOM.
> It could be something to discuss with Zhijiang. e.g. is the direct
> memory used there to buffer fetched regions of partition files or what for?
>
> - Not sure, we need an extra wrapping box "other" for the managed memory
> atm. I could be just "Managed" or "Managed by Flink".
>
> Best,
> Andrey
>
> On Fri, Mar 27, 2020 at 6:13 AM Xintong Song <[hidden email]>
> wrote:
>
> > Sorry for the late response.
> >
> > I have shared my suggestions with Yadong & Lining offline. I think it
> would
> > be better to also post them here, for the public record.
> >
> >    - I'm not sure about displaying Total Process Memory Used. Currently,
> we
> >    do not have a good way to monitor all memory footprints of the
> process.
> >    Metrics for some native memory usages (e.g., thread stack) are absent.
> >    Displaying a partial used memory size could be confusing for users.
> >    - I would suggest merge the current Mapped Memory metrics into Direct
> >    Memory. Actually, the metrics are retrieved from MXBeans for direct
> > buffer
> >    pool and mapped buffer pool. Both of the two pools are accounted for
> in
> >    -XX:MaxDirectMemorySize. There's no Flink configuration that can
> modify
> > the
> >    individual pool sizes. Therefore, I think displaying the total Direct
> >    Memory would be good enough. Moreover, in most use cases the size of
> > mapped
> >    buffer pool is zero and users do not need to understand what is Mapped
> >    Memory. For expert users who do need the separated metrics for
> > individual
> >    pools, they can subscribe the metrics on their own.
> >    - I would suggest to not display Non-Heap Memory. Despite the name,
> the
> >    metrics (also retrieved from MXBeans) actually accounts for metaspace,
> > code
> >    cache, and compressed class space. It does not account for all JVM
> > native
> >    memory overheads, e.g., thread stack. That means the metrics of
> Non-Heap
> >    Memory do not well correspond to any of the FLIP-49 memory components.
> > They
> >    account for Flink's JVM Metaspace and part of JVM Overhead. I think
> this
> >    brings more confusion then help to users, especially primary users.
> >
> >
> > Thank you~
> >
> > Xintong Song
> >
> >
> >
> > On Thu, Mar 26, 2020 at 6:34 PM Till Rohrmann <[hidden email]>
> > wrote:
> >
> > > Thanks for updating the FLIP Yadong.
> > >
> > > What is the difference between managedMemory and managedMemoryTotal
> > > and networkMemory and networkMemoryTotal in the REST response? If they
> > are
> > > duplicates, then we might be able to remove one.
> > >
> > > Apart from that, the proposal looks good to me.
> > >
> > > Pulling also Andrey in to hear his opinion about the representation of
> > the
> > > memory components.
> > >
> > > Cheers,
> > > Till
> > >
> > > On Thu, Mar 19, 2020 at 11:37 AM Yadong Xie <[hidden email]>
> wrote:
> > >
> > >> Hi all
> > >>
> > >> I have updated the design of the metric page and FLIP doc, please let
> me
> > >> know what you think about it
> > >>
> > >> FLIP-102:
> > >>
> > >>
> >
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-102%3A+Add+More+Metrics+to+TaskManager
> > >> POC web:
> > >>
> > >>
> >
> http://101.132.122.69:8081/web/#/task-manager/8e1f1beada3859ee8e46d0960bb1da18/metrics
> > >>
> > >> Till Rohrmann <[hidden email]> 于2020年2月27日周四 下午10:27写道:
> > >>
> > >> > Thinking a bit more about the problem whether to report the
> aggregated
> > >> > memory statistics or the individual slot statistics, I think
> reporting
> > >> it
> > >> > on a per slot basis won't work nicely together with FLIP-56 (dynamic
> > >> slot
> > >> > allocation). The problem is that with FLIP-56, we will no longer
> have
> > >> > dedicated slots. The number of slots might change over the lifetime
> > of a
> > >> > TaskExecutor. Hence, it won't be easy to generate a metric path for
> > >> every
> > >> > slot which are furthermore also ephemeral. So maybe, the more
> general
> > >> and
> > >> > easier solution would be to report the overall memory usage of a
> > >> > TaskExecutor even though it means to do some aggregation on the
> > >> > TaskExecutor.
> > >> >
> > >> > Concerning the JVM limit: Isn't it mainly the code cache? If we
> > display
> > >> > this value, then we should explain what exactly it means. I fear
> that
> > >> most
> > >> > users won't understand what JVM limit actually means.
> > >> >
> > >> > Cheers,
> > >> > Till
> > >> >
> > >> > On Wed, Feb 26, 2020 at 11:15 AM Yadong Xie <[hidden email]>
> > >> wrote:
> > >> >
> > >> > > Hi Till
> > >> > >
> > >> > > Thanks a lot for your response
> > >> > >
> > >> > > > 2. I'm not entirely sure whether I would split the memory ...
> > >> > >
> > >> > > Split the memory display comes from the 'ancient' design of the
> web,
> > >> it
> > >> > is
> > >> > > ok for me to change it following
> > total/heap/managed/network/direct/jvm
> > >> > > overhead/mapped sequence
> > >> > >
> > >> > > > 3. Displaying the memory configurations...
> > >> > >
> > >> > > I agree with you that it is not a very nice way, but the
> > hierarchical
> > >> > > relationship of configurations is too complex and hard to display
> in
> > >> the
> > >> > > other ways (I have tried)
> > >> > >
> > >> > > if anyone has a better idea, please do not hesitate to help me
> > >> > >
> > >> > >
> > >> > > > 4. What does JVM limit mean in Non-heap.JVM-Overhead?
> > >> > >
> > >> > > JVM limit is "non-heap max metric minus metaspace configuration"
> as
> > >> > > @Xintong
> > >> > > Song <[hidden email]> replied in this mail thread
> > >> > >
> > >> > >
> > >> > > On Tue, Feb 25, 2020 at 6:58 PM Till Rohrmann <[hidden email]> wrote:
> > >> > >
> > >> > > > Thanks for creating this FLIP Yadong. I think your proposal
> makes
> > it
> > >> > much
> > >> > > > easier for the user to understand what's happening on Flink
> > >> > > TaskManager's.
> > >> > > >
> > >> > > > I have some comments:
> > >> > > >
> > >> > > > 1. Some of the newly introduced metrics involve computations on
> > the
> > >> > > > TaskManager. I would like to avoid additional computations
> > >> introduced
> > >> > by
> > >> > > > metrics as much as possible because metrics should not affect
> the
> > >> > system.
> > >> > > > In particular, total memory sizes which are configured should
> not
> > be
> > >> > > > derived computationally (getManagedMemoryTotal,
> > getTotalMemorySize).
> > >> > For
> > >> > > > the currently available memory sizes (e.g.
> getManagedMemoryUsed),
> > >> one
> > >> > > could
> > >> > > > think about reporting them on a per slot basis and to do the
> > >> > aggregation
> > >> > > on
> > >> > > > the client side. Of course, this would increase the size of the
> > >> > response
> > >> > > > payload.
> > >> > > >
> > >> > > > 2. I'm not entirely sure whether I would split the memory
> display
> > >> into
> > >> > > JVM
> > >> > > > memory and non JVM memory as you've done in the POC. From a
> > >> user's
> > >> > > > perspective, one could start displaying the total process
> memory.
> > >> The
> > >> > > next
> > >> > > > three most important metrics are the heap, managed memory and
> > >> network
> > >> > > > buffer usage, I guess. If one is interested in more details, one
> > >> could
> > >> > > then
> > >> > > > display the remaining direct memory usage, the JVM overhead (I'm
> > not
> > >> > sure
> > >> > > > whether I would call this non-heap though) and the mapped
> memory.
> > >> > > >
> > >> > > > 3. Displaying the memory configurations in three nested boxes
> does
> > >> not
> > >> > > look
> > >> > > > so nice to me. I'm not sure how else one could display it,
> though.
> > >> > > >
> > >> > > > 4. What does JVM limit mean in Non-heap.JVM-Overhead?
> > >> > > >
> > >> > > > Cheers,
> > >> > > > Till
> > >> > > >
> > >> > > > On Tue, Feb 25, 2020 at 8:19 AM Yadong Xie <[hidden email]
> >
> > >> > wrote:
> > >> > > >
> > >> > > > > Hi Xintong
> > >> > > > > thanks for your advice, the POC web and the FLIP doc was
> updated
> > >> now
> > >> > > > > here is the new link:
> > >> > > > >
> > >> > > > >
> > >> > > >
> > >> > >
> > >> >
> > >>
> >
> http://101.132.122.69:8081/web/#/task-manager/7e7cf0293645c8537caab915c829aa73/metrics
> > >> > > > >
> > >> > > > >
> > >> > > > > On Fri, Feb 21, 2020 at 12:00 PM Xintong Song <[hidden email]> wrote:
> > >> > > > >
> > >> > > > > > >
> > >> > > > > > > 1. Should the managed memory be part of direct memory?
> > >> > > > > > >
> > >> > > > > > The answer is no. Managed memory is currently allocated by
> > >> > accessing
> > >> > > to
> > >> > > > > > private field of Unsafe. It is not accounted for in JVM's
> > direct
> > >> > > memory
> > >> > > > > > limit and corresponding metrics. To that end, it is
> equivalent
> > >> to
> > >> > > > > > native memory.
> > >> > > > > >
> > >> > > > > >
> > >> > > > > > > 2. Should the shuffle memory also be part of the managed
> > >> memory?
> > >> > > > > >
> > >> > > > > > I don't think so. Shuffle (Network) memory is allocated with
> > >> direct
> > >> > > > > > buffers, and accounted for in JVM's direct memory limit and
> > >> > > > corresponding
> > >> > > > > > metrics. Moreover, the FLIP-49 memory model expose network
> > >> memory
> > >> > and
> > >> > > > > > managed memory as two independent components of the overall
> > >> memory
> > >> > > > > > footprint.
> > >> > > > > >
> > >> > > > > >
> > >> > > > > > Thank you~
> > >> > > > > >
> > >> > > > > > Xintong Song
> > >> > > > > >
> > >> > > > > >
> > >> > > > > >
> > >> > > > > > On Fri, Feb 21, 2020 at 11:45 AM Kurt Young <
> [hidden email]
> > >
> > >> > > wrote:
> > >> > > > > >
> > >> > > > > > > Some questions related to "managed memory":
> > >> > > > > > >
> > >> > > > > > > 1. Should the managed memory be part of direct memory?
> > >> > > > > > > 2. Should the shuffle memory also be part of the managed
> > >> memory?
> > >> > > > > > >
> > >> > > > > > > Best,
> > >> > > > > > > Kurt
> > >> > > > > > >
> > >> > > > > > >
> > >> > > > > > > On Fri, Feb 21, 2020 at 10:41 AM Xintong Song <
> > >> > > [hidden email]
> > >> > > > >
> > >> > > > > > > wrote:
> > >> > > > > > >
> > >> > > > > > > > Thanks for driving this FLIP, Yadong.
> > >> > > > > > > >
> > >> > > > > > > > +1 (non-binding) for the FLIP in general. I think this
> > >> really
> > >> > > helps
> > >> > > > > our
> > >> > > > > > > > users to understand and use the new FLIP-49 memory
> > >> > configuration.
> > >> > > > > > > >
> > >> > > > > > > > I have a few minor comments.
> > >> > > > > > > > - There's a frame "Other" in the frame "Non-Heap",
> besides
> > >> "JVM
> > >> > > > > > Overhead"
> > >> > > > > > > > and "JVM Metaspace". IIUC, the purpose of this is to
> > explain
> > >> > the
> > >> > > > > > > > mismatching between the metric "non-heap maximum" and
> the
> > >> sum
> > >> > of
> > >> > > > the
> > >> > > > > > > > configurations "JVM metaspace" & "JVM Overhead".
> However,
> > >> from
> > >> > > the
> > >> > > > > > > > perspective of FLIP-49, JVM Overhead accounts for all
> the
> > >> JVM
> > >> > > > > non-heap
> > >> > > > > > > > memory usages except for metaspace. The metrics do not
> > >> match
> > >> > > the
> > >> > > > > > > > configuration because we did not set a JVM parameter
> > for
> > >> > "max
> > >> > > > > > > non-heap
> > >> > > > > > > > memory" (actually I'm not sure whether it can be
> specified
> > >> in
> > >> > > java
> > >> > > > > 8).
> > >> > > > > > > The
> > >> > > > > > > > current UI might confuse people making them think there
> > are
> > >> > other
> > >> > > > > > > non-heap
> > >> > > > > > > > memory usages not accounted by the configurations.
> > >> Therefore, I
> > >> > > > would
> > >> > > > > > > > suggest to remove the "Other" frame, but add another
> frame
> > >> > inside
> > >> > > > > "JVM
> > >> > > > > > > > Overhead", besides "Configuration", with "JVM limit" as
> > the
> > >> > title
> > >> > > > and
> > >> > > > > > > > "non-heap max metric minus metaspace configuration" as
> the
> > >> > value
> > >> > > .
> > >> > > > > > > >
> > >> > > > > > > > - In the final release, we have changed "shuffle memory"
> > to
> > >> > > > "network
> > >> > > > > > > > memory" because the latter is easier to understand for
> > >> users. I
> > >> > > > think
> > >> > > > > > we
> > >> > > > > > > > should update it in this FLIP as well.
> > >> > > > > > > >
> > >> > > > > > > > - There's a typo "Directed" (should be "Direct") at the
> > >> direct
> > >> > > > memory
> > >> > > > > > > > metric.
> > >> > > > > > > >
> > >> > > > > > > > Thank you~
> > >> > > > > > > >
> > >> > > > > > > > Xintong Song
> > >> > > > > > > >
> > >> > > > > > > >
> > >> > > > > > > >
> > >> > > > > > > > On Thu, Feb 20, 2020 at 5:52 PM Yadong Xie <
> > >> > [hidden email]>
> > >> > > > > > wrote:
> > >> > > > > > > >
> > >> > > > > > > > > Hi all
> > >> > > > > > > > >
> > >> > > > > > > > > I want to start the vote for FLIP-102, which proposes
> to
> > >> add
> > >> > > more
> > >> > > > > > > metrics
> > >> > > > > > > > > to the task manager in web UI.
> > >> > > > > > > > >
> > >> > > > > > > > > To help everyone better understand the proposal, we
> > spent
> > >> > some
> > >> > > > > > efforts
> > >> > > > > > > on
> > >> > > > > > > > > making an online POC
> > >> > > > > > > > >
> > >> > > > > > > > > previous web:
> > >> > > > > > > > >
> > >> > > > > > > > >
> > >> > > > > > > >
> > >> > > > > > >
> > >> > > > > >
> > >> > > > >
> > >> > > >
> > >> > >
> > >> >
> > >>
> >
> http://101.132.122.69:8081/#/task-manager/6df6c5f37b2bff125dbc3a7388128559/metrics
> > >> > > > > > > > > POC web:
> > >> > > > > > > > >
> > >> > > > > > > > >
> > >> > > > > > > >
> > >> > > > > > >
> > >> > > > > >
> > >> > > > >
> > >> > > >
> > >> > >
> > >> >
> > >>
> >
> http://101.132.122.69:8081/web/#/task-manager/6df6c5f37b2bff125dbc3a7388128559/metrics
> > >> > > > > > > > >
> > >> > > > > > > > >
> > >> > > > > > > > > The vote will last for at least 72 hours, following
> the
> > >> > > consensus
> > >> > > > > > > voting
> > >> > > > > > > > > process.
> > >> > > > > > > > >
> > >> > > > > > > > > FLIP wiki:
> > >> > > > > > > > >
> > >> > > > > > > > >
> > >> > > > > > > >
> > >> > > > > > >
> > >> > > > > >
> > >> > > > >
> > >> > > >
> > >> > >
> > >> >
> > >>
> >
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-102%3A+Add+More+Metrics+to+TaskManager
> > >> > > > > > > > >
> > >> > > > > > > > > Discussion thread:
> > >> > > > > > > > >
> > >> > > > > > > > >
> > >> > > > > > > >
> > >> > > > > > >
> > >> > > > > >
> > >> > > > >
> > >> > > >
> > >> > >
> > >> >
> > >>
> >
> http://apache-flink-mailing-list-archive.1008284.n3.nabble.com/DISCUSS-FLIP-75-Flink-Web-UI-Improvement-Proposal-td33540.html
> > >> > > > > > > > >
> > >> > > > > > > > > Thanks,
> > >> > > > > > > > >
> > >> > > > > > > > > Yadong
> > >> > > > > > > > >
> > >> > > > > > > >
> > >> > > > > > >
> > >> > > > > >
> > >> > > > >
> > >> > > >
> > >> > >
> > >> >
> > >>
> > >
> >
>
>

Re: [VOTE] FLIP-102: Add More Metrics to TaskManager

Xintong Song
Thanks for the inputs, Andrey & Zhijiang.

# Mapped

I did some experiments myself. It seems I was wrong about the mapped buffer
pool being counted against the max direct memory limit. Therefore, I agree
with Zhijiang that the mapped metrics might be less valuable for users.
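
For anyone who wants to repeat this kind of check, here is a minimal Java
sketch of where the two numbers come from. It reads the JVM's buffer pool
MXBeans, which report the direct and mapped pools separately. The names
BufferPoolProbe and bufferPoolUsage are made up for illustration and are not
Flink code, and the pool names "direct" and "mapped" are what HotSpot-style
JVMs typically report, so treat them as an assumption.

```java
import java.lang.management.BufferPoolMXBean;
import java.lang.management.ManagementFactory;
import java.nio.ByteBuffer;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper for illustration only; not part of Flink.
final class BufferPoolProbe {

    /** Bytes currently used by each JVM buffer pool, keyed by pool name. */
    static Map<String, Long> bufferPoolUsage() {
        Map<String, Long> usage = new LinkedHashMap<>();
        for (BufferPoolMXBean pool :
                ManagementFactory.getPlatformMXBeans(BufferPoolMXBean.class)) {
            usage.put(pool.getName(), pool.getMemoryUsed());
        }
        return usage;
    }

    public static void main(String[] args) {
        System.out.println("before: " + bufferPoolUsage());
        // Allocate 1 MiB of direct memory and keep a reference so it is not freed.
        ByteBuffer held = ByteBuffer.allocateDirect(1 << 20);
        System.out.println("after:  " + bufferPoolUsage());
        System.out.println("held capacity: " + held.capacity());
    }
}
```

Running it with different -XX:MaxDirectMemorySize values while also mapping a
file is one way to see which pool the limit applies to; the sketch only
prints the pools and does not prove the limit behaviour by itself.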

# Overlapped Metrics

It seems that the following metrics do not correspond well to FLIP-49.

- Mapped
- Non Heap

They partially overlap with the FLIP-49 memory components. My concern is
that users may misread them as additional memory usage on top of the
FLIP-49 memory components. However, I also agree with Andrey that it may
not be a good idea to simply remove them, because we had these metrics
before, and they did and will still provide some value to users in certain
cases.

To that end, I would suggest keeping them and putting them into a separate
"Advanced" group. The "Advanced" group could be folded by default, and
should explicitly state that these metrics partially overlap with the basic
metrics and are controlled by the JVM automatically.

This could be addressed either in this FLIP or as a follow-up. I would not
block the vote on this.
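
To make the overlap a bit more concrete, the sketch below lists what the JVM
itself counts as non-heap: the aggregate value from the memory MXBean and
the individual NON_HEAP memory pools behind it. NonHeapProbe and its methods
are hypothetical names for illustration, not Flink code. The pool names vary
by JVM and version; on HotSpot they typically include the metaspace, the
compressed class space and the code cache.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryPoolMXBean;
import java.lang.management.MemoryType;
import java.lang.management.MemoryUsage;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper for illustration only; not part of Flink.
final class NonHeapProbe {

    /** Bytes currently used by each JVM memory pool of type NON_HEAP, keyed by pool name. */
    static Map<String, Long> nonHeapPools() {
        Map<String, Long> used = new LinkedHashMap<>();
        for (MemoryPoolMXBean pool : ManagementFactory.getMemoryPoolMXBeans()) {
            MemoryUsage usage = pool.getUsage(); // null if the pool is no longer valid
            if (pool.getType() == MemoryType.NON_HEAP && usage != null) {
                used.put(pool.getName(), usage.getUsed());
            }
        }
        return used;
    }

    /** Aggregate non-heap usage as reported by the JVM. */
    static MemoryUsage nonHeapTotal() {
        return ManagementFactory.getMemoryMXBean().getNonHeapMemoryUsage();
    }

    public static void main(String[] args) {
        MemoryUsage total = nonHeapTotal();
        // getMax() returns -1 when the JVM defines no maximum for non-heap memory.
        System.out.println("non-heap used=" + total.getUsed()
                + " committed=" + total.getCommitted()
                + " max=" + total.getMax());
        nonHeapPools().forEach(
                (name, bytes) -> System.out.println("  " + name + " = " + bytes + " bytes"));
    }
}
```

Native allocations such as thread stacks do not show up in any of these
pools, which is why the Non Heap metric only partially lines up with the
JVM Metaspace and JVM Overhead components.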

Thank you~

Xintong Song



On Mon, Apr 6, 2020 at 4:59 PM Andrey Zagrebin <[hidden email]> wrote:

> Hi guys,
>
> Thanks for the additional details, Zhijiang.
> It also looks to me that the mapped memory size is mostly driven by OS
> limits and the bitness of the JVM (32/64).
>
> Thinking more about the 'Metrics' tab layout, a couple more things have
> come to mind.
>
> # 'Metrics' tab -> 'Memory': 'Metrics' and 'Configuration' tabs
>
> It contains only memory specific things and the design suggests not only
> metrics but configuration as well.
> Moreover, there are other metrics on top which are not in the metrics tab.
> Therefore, I would name it 'Memory' and then add sub-tabs: e.g. 'Metrics'
> and 'Configuration' tab.
> Alternatively, one could consider splitting 'Metrics' into 'Metrics' and
> 'Configuration' tabs.
>
> # Metrics (a bit different structure)
>
> I would put memory metrics into 4 groups:
> - JVM Memory
> - Managed
> - Network
> - Garbage collection
>
> Alternatively, one could consider:
> - Managed by JVM (same as JVM Memory)
> - Managed by Flink (Managed Segments and Network buffers)
> - Garbage collection
>
> ## Total memory (remove from metrics)
>
> As mentioned in the discussions before, it is hard to measure the total
> memory usage.
> Therefore, I would put it into the configuration tab, see below.
>
> ## JVM Memory
>
> Here we can have Heap, Non-Heap, Direct and mapped because they are all
> managed by JVM.
> Heap and direct can stay as they are.
>
> ### Non-Heap (could stay for now)
>
> I think it is ok to keep Non-Heap for now because we had it also before.
> This metric does not correlate explicitly with FLIP-49 but it is exposed by
> JVM.
> Once we find better things to show (related only to JVM, e.g. Metaspace
> etc), we can reconsider this as a follow-up.
>
> ### Mapped (looks still valuable)
>
> As I understand it at the moment, this can be valuable for users who want
> to monitor the spilling of batch partitions.
>
> ### Metaspace (new, sub-component of Non-Heap, follow-up)
>
> We have never had anything for the Metaspace. Recent experience shows
> that it can be useful.
> I would put it on the roadmap as a follow-up though, because it also needs
> some research and preparation on the server side [1].
>
> # Configuration (see Flink user docs picture)
>
> We already have a picture in the docs representing memory components in
> Flink [2].
> The layout in this picture can also be used in this FLIP to depict the
> actual configuration.
> It would be clearer for users to see the same layout as we have in the docs.
>
> The configuration can also depict the sizes of the total process memory and
> the total Flink memory, as in the docs.
>
> As mentioned above, I also suggest putting it into a separate tab.
>
> Best,
> Andrey
>
> [1] https://kb.novaordis.com/index.php/Memory_Monitoring_and_Management_Platform_MBeans#Metaspace
> [2] https://ci.apache.org/projects/flink/flink-docs-release-1.10/ops/memory/mem_detail.html#overview
>
>
> On Wed, Apr 1, 2020 at 8:03 PM Zhijiang <[hidden email].invalid> wrote:
>
> > Thanks for the FLIP, Yadong. In general I think this work is valuable for
> > users to better understand Flink's memory usage along different
> > dimensions.
> >
> > Sorry for not going through all of the detailed discussion below; I will
> > try to do that later if possible. First, I will try to answer some of
> > Andrey's concerns about mmap.
> >
> > > - I do not know how the mapped memory works. Is it meant for the new
> > > spilled partitions? If the mapped memory also pulls from the direct
> > > memory limit then this is something we do not account in our network
> > > buffers as I understand. In this case, this metric may be useful for
> > > tuning to understand how much the mapped memory uses from the direct
> > > memory limit to set e.g. framework off-heap limit correctly and avoid
> > > direct OOM.
> > > It could be something to discuss with Zhijiang. e.g. is the direct
> > > memory used there to buffer fetched regions of partition files or what
> > > for?
> >
> > Yes, mapped memory is currently used by the bounded blocking partition
> > for batch jobs, but it is not the default mode.
> >
> > AFAIK it is neither related to nor limited by the `MaxDirectMemory`
> > setting, so we do not need to worry about the current direct memory
> > setting or a potential OOM issue.
> > The address space determines the mapped file size, and on a 64-bit
> > system we can regard the size as unlimited in theory.
> >
> > Regarding the size of the mapped buffer pool from the MXBean, it only
> > indicates how much file size has already been mapped; it stays unchanged
> > and does not reflect the real physical memory use. E.g., if a 100GB
> > region of the file was mapped at the beginning, the mapped buffer pool
> > from the MXBean would be 100GB. But how much physical memory is really
> > consumed depends on the specific read or write operations in practice,
> > and is also controlled by the operating system. E.g., some unused regions
> > might be swapped out when physical memory is limited.
> >
> > From this point of view, I guess it is not meaningful to show the size of
> > the mapped buffer pool to users, who may be more concerned with how much
> > physical memory is really used.
> >
> > Best,
> > Zhijiang
> >
> >
> > ------------------------------------------------------------------
> > From:Andrey Zagrebin <[hidden email]>
> > Send Time:2020 Mar. 30 (Mon.) 22:56
> > To:dev <[hidden email]>
> > Subject:Re: [VOTE] FLIP-102: Add More Metrics to TaskManager
> >
> > Hi All,
> >
> > Thanks for this FLIP, Yadong. This is a very good improvement to the
> > Flink's UI.
> > It looks like there are still a couple of things to resolve before the
> > final vote.
> >
> > - I also find the non-heap title in configuration confusing because there
> > are also other non-heap types of memory. The "off-heap" concept is quite
> > broad.
> > What about "JVM specific" meaning that it is not coming directly from
> > Flink?
> > or we could remove the "Non-heap" box at all and show directly JVM
> > Metaspace and Overhead as separate boxes,
> > this would also fit if we decide to keep the Metaspace metric.
> >
> > - Total Process Memory Used: I agree with Xintong, it is hard to say what
> > is used there.
> > Then the size of "Total Process Memory" basically becomes part of
> > configuration.
> >
> > - Non-Heap Used/Max/..: Not sure what committed means here. I also think
> > we should either exclude it or display what is known for sure.
> > In general, the metaspace usage would be nice to have, but then it should
> > be exactly the metaspace usage without anything else.
> >
> > - I do not know how the mapped memory works. Is it meant for the new
> > spilled partitions? If the mapped memory also pulls from the direct
> > memory limit
> > then this is something we do not account in our network buffers as I
> > understand. In this case, this metric may be useful for tuning to
> > understand
> > how much the mapped memory uses from the direct memory limit to set e.g.
> > framework off-heap limit correctly and avoid direct OOM.
> > It could be something to discuss with Zhijiang. e.g. is the direct
> > memory used there to buffer fetched regions of partition files or what
> for?
> >
> > - Not sure we need an extra wrapping box "other" for the managed memory
> > atm. It could be just "Managed" or "Managed by Flink".
> >
> > Best,
> > Andrey
> >
> > On Fri, Mar 27, 2020 at 6:13 AM Xintong Song <[hidden email]>
> > wrote:
> >
> > > Sorry for the late response.
> > >
> > > I have shared my suggestions with Yadong & Lining offline. I think it
> > would
> > > be better to also post them here, for the public record.
> > >
> > >    - I'm not sure about displaying Total Process Memory Used.
> > >      Currently, we do not have a good way to monitor all memory
> > >      footprints of the process. Metrics for some native memory usages
> > >      (e.g., thread stack) are absent. Displaying a partial used memory
> > >      size could be confusing for users.
> > >    - I would suggest merging the current Mapped Memory metrics into
> > >      Direct Memory. Actually, the metrics are retrieved from MXBeans for
> > >      the direct buffer pool and the mapped buffer pool. Both pools are
> > >      accounted for in -XX:MaxDirectMemorySize. There's no Flink
> > >      configuration that can modify the individual pool sizes. Therefore,
> > >      I think displaying the total Direct Memory would be good enough.
> > >      Moreover, in most use cases the size of the mapped buffer pool is
> > >      zero and users do not need to understand what Mapped Memory is.
> > >      Expert users who do need separate metrics for the individual pools
> > >      can subscribe to the metrics on their own.
> > >    - I would suggest not displaying Non-Heap Memory. Despite the name,
> > >      the metrics (also retrieved from MXBeans) actually account for
> > >      metaspace, code cache, and compressed class space. They do not
> > >      account for all JVM native memory overheads, e.g., thread stack.
> > >      That means the metrics of Non-Heap Memory do not correspond well to
> > >      any of the FLIP-49 memory components. They account for Flink's JVM
> > >      Metaspace and part of JVM Overhead. I think this brings more
> > >      confusion than help to users, especially primary users.
> > >
> > >
> > > Thank you~
> > >
> > > Xintong Song
> > >
> > >
> > >
> > > On Thu, Mar 26, 2020 at 6:34 PM Till Rohrmann <[hidden email]>
> > > wrote:
> > >
> > > > Thanks for updating the FLIP Yadong.
> > > >
> > > > What is the difference between managedMemory and managedMemoryTotal
> > > > and networkMemory and networkMemoryTotal in the REST response? If
> they
> > > are
> > > > duplicates, then we might be able to remove one.
> > > >
> > > > Apart from that, the proposal looks good to me.
> > > >
> > > > Pulling also Andrey in to hear his opinion about the representation
> of
> > > the
> > > > memory components.
> > > >
> > > > Cheers,
> > > > Till
> > > >
> > > > On Thu, Mar 19, 2020 at 11:37 AM Yadong Xie <[hidden email]>
> > wrote:
> > > >
> > > >> Hi all
> > > >>
> > > >> I have updated the design of the metric page and FLIP doc, please
> let
> > me
> > > >> know what you think about it
> > > >>
> > > >> FLIP-102:
> > > >>
> > > >>
> > >
> >
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-102%3A+Add+More+Metrics+to+TaskManager
> > > >> POC web:
> > > >>
> > > >>
> > >
> >
> http://101.132.122.69:8081/web/#/task-manager/8e1f1beada3859ee8e46d0960bb1da18/metrics
> > > >>
> > > >> On Thu, Feb 27, 2020 at 10:27 PM Till Rohrmann <[hidden email]> wrote:
> > > >>
> > > >> > Thinking a bit more about the problem whether to report the
> > aggregated
> > > >> > memory statistics or the individual slot statistics, I think
> > reporting
> > > >> it
> > > >> > on a per slot basis won't work nicely together with FLIP-56
> (dynamic
> > > >> slot
> > > >> > allocation). The problem is that with FLIP-56, we will no longer
> > have
> > > >> > dedicated slots. The number of slots might change over the
> lifetime
> > > of a
> > > >> > TaskExecutor. Hence, it won't be easy to generate a metric path
> for
> > > >> every
> > > >> > slot which are furthermore also ephemeral. So maybe, the more
> > general
> > > >> and
> > > >> > easier solution would be to report the overall memory usage of a
> > > >> > TaskExecutor even though it means to do some aggregation on the
> > > >> > TaskExecutor.
> > > >> >
> > > >> > Concerning the JVM limit: Isn't it mainly the code cache? If we
> > > display
> > > >> > this value, then we should explain what exactly it means. I fear
> > that
> > > >> most
> > > >> > users won't understand what JVM limit actually means.
> > > >> >
> > > >> > Cheers,
> > > >> > Till
> > > >> >
> > > >> > On Wed, Feb 26, 2020 at 11:15 AM Yadong Xie <[hidden email]>
> > > >> wrote:
> > > >> >
> > > >> > > Hi Till
> > > >> > >
> > > >> > > Thanks a lot for your response
> > > >> > >
> > > >> > > > 2. I'm not entirely sure whether I would split the memory ...
> > > >> > >
> > > >> > > Split the memory display comes from the 'ancient' design of the
> > web,
> > > >> it
> > > >> > is
> > > >> > > ok for me to change it following
> > > total/heap/managed/network/direct/jvm
> > > >> > > overhead/mapped sequence
> > > >> > >
> > > >> > > > 3. Displaying the memory configurations...
> > > >> > >
> > > >> > > I agree with you that it is not a very nice way, but the
> > > hierarchical
> > > >> > > relationship of configurations is too complex and hard to
> display
> > in
> > > >> the
> > > >> > > other ways (I have tried)
> > > >> > >
> > > >> > > if anyone has a better idea, please do not hesitate to help me
> > > >> > >
> > > >> > >
> > > >> > > > 4. What does JVM limit mean in Non-heap.JVM-Overhead?
> > > >> > >
> > > >> > > JVM limit is "non-heap max metric minus metaspace configuration"
> > as
> > > >> > > @Xintong
> > > >> > > Song <[hidden email]> replied in this mail thread
> > > >> > >
> > > >> > >
> > > >> > > On Tue, Feb 25, 2020 at 6:58 PM Till Rohrmann <[hidden email]> wrote:
> > > >> > >
> > > >> > > > Thanks for creating this FLIP Yadong. I think your proposal
> > makes
> > > it
> > > >> > much
> > > >> > > > easier for the user to understand what's happening on Flink
> > > >> > > TaskManager's.
> > > >> > > >
> > > >> > > > I have some comments:
> > > >> > > >
> > > >> > > > 1. Some of the newly introduced metrics involve computations
> on
> > > the
> > > >> > > > TaskManager. I would like to avoid additional computations
> > > >> introduced
> > > >> > by
> > > >> > > > metrics as much as possible because metrics should not affect
> > the
> > > >> > system.
> > > >> > > > In particular, total memory sizes which are configured should
> > not
> > > be
> > > >> > > > derived computationally (getManagedMemoryTotal,
> > > getTotalMemorySize).
> > > >> > For
> > > >> > > > the currently available memory sizes (e.g.
> > getManagedMemoryUsed),
> > > >> one
> > > >> > > could
> > > >> > > > think about reporting them on a per slot basis and to do the
> > > >> > aggregation
> > > >> > > on
> > > >> > > > the client side. Of course, this would increase the size of
> the
> > > >> > response
> > > >> > > > payload.
> > > >> > > >
> > > >> > > > 2. I'm not entirely sure whether I would split the memory
> > display
> > > >> into
> > > >> > > JVM
> > > >> > > > memory and non JVM memory as you've done in the POC. From
> a
> > > >> user's
> > > >> > > > perspective, one could start displaying the total process
> > memory.
> > > >> The
> > > >> > > next
> > > >> > > > three most important metrics are the heap, managed memory and
> > > >> network
> > > >> > > > buffer usage, I guess. If one is interested in more details,
> one
> > > >> could
> > > >> > > then
> > > >> > > > display the remaining direct memory usage, the JVM overhead
> (I'm
> > > not
> > > >> > sure
> > > >> > > > whether I would call this non-heap though) and the mapped
> > memory.
> > > >> > > >
> > > >> > > > 3. Displaying the memory configurations in three nested boxes
> > does
> > > >> not
> > > >> > > look
> > > >> > > > so nice to me. I'm not sure how else one could display it,
> > though.
> > > >> > > >
> > > >> > > > 4. What does JVM limit mean in Non-heap.JVM-Overhead?
> > > >> > > >
> > > >> > > > Cheers,
> > > >> > > > Till
> > > >> > > >
> > > >> > > > On Tue, Feb 25, 2020 at 8:19 AM Yadong Xie <
> [hidden email]
> > >
> > > >> > wrote:
> > > >> > > >
> > > >> > > > > Hi Xintong
> > > >> > > > > thanks for your advice, the POC web and the FLIP doc was
> > updated
> > > >> now
> > > >> > > > > here is the new link:
> > > >> > > > >
> > > >> > > > >
> > > >> > > >
> > > >> > >
> > > >> >
> > > >>
> > >
> >
> http://101.132.122.69:8081/web/#/task-manager/7e7cf0293645c8537caab915c829aa73/metrics
> > > >> > > > >
> > > >> > > > >
> > > >> > > > > On Fri, Feb 21, 2020 at 12:00 PM Xintong Song <[hidden email]> wrote:
> > > >> > > > >
> > > >> > > > > > >
> > > >> > > > > > > 1. Should the managed memory be part of direct memory?
> > > >> > > > > > >
> > > >> > > > > > The answer is no. Managed memory is currently allocated by
> > > >> > accessing
> > > >> > > to
> > > >> > > > > > private field of Unsafe. It is not accounted for in JVM's
> > > direct
> > > >> > > memory
> > > >> > > > > > limit and corresponding metrics. To that end, it is
> > equivalent
> > > >> to
> > > >> > > > > > native memory.
> > > >> > > > > >
> > > >> > > > > >
> > > >> > > > > > > 2. Should the shuffle memory also be part of the managed
> > > >> memory?
> > > >> > > > > >
> > > >> > > > > > I don't think so. Shuffle (Network) memory is allocated
> with
> > > >> direct
> > > >> > > > > > buffers, and accounted for in JVM's direct memory limit
> and
> > > >> > > > corresponding
> > > >> > > > > > metrics. Moreover, the FLIP-49 memory model expose network
> > > >> memory
> > > >> > and
> > > >> > > > > > managed memory as two independent components of the
> overall
> > > >> memory
> > > >> > > > > > footprint.
> > > >> > > > > >
> > > >> > > > > >
> > > >> > > > > > Thank you~
> > > >> > > > > >
> > > >> > > > > > Xintong Song
> > > >> > > > > >
> > > >> > > > > >
> > > >> > > > > >
> > > >> > > > > > On Fri, Feb 21, 2020 at 11:45 AM Kurt Young <
> > [hidden email]
> > > >
> > > >> > > wrote:
> > > >> > > > > >
> > > >> > > > > > > Some questions related to "managed memory":
> > > >> > > > > > >
> > > >> > > > > > > 1. Should the managed memory be part of direct memory?
> > > >> > > > > > > 2. Should the shuffle memory also be part of the managed
> > > >> memory?
> > > >> > > > > > >
> > > >> > > > > > > Best,
> > > >> > > > > > > Kurt
> > > >> > > > > > >
> > > >> > > > > > >
> > > >> > > > > > > On Fri, Feb 21, 2020 at 10:41 AM Xintong Song <
> > > >> > > [hidden email]
> > > >> > > > >
> > > >> > > > > > > wrote:
> > > >> > > > > > >
> > > >> > > > > > > > Thanks for driving this FLIP, Yadong.
> > > >> > > > > > > >
> > > >> > > > > > > > +1 (non-binding) for the FLIP in general. I think this
> > > >> really
> > > >> > > helps
> > > >> > > > > our
> > > >> > > > > > > > users to understand and use the new FLIP-49 memory
> > > >> > configuration.
> > > >> > > > > > > >
> > > >> > > > > > > > I have a few minor comments.
> > > >> > > > > > > > - There's a frame "Other" in the frame "Non-Heap",
> > besides
> > > >> "JVM
> > > >> > > > > > Overhead"
> > > >> > > > > > > > and "JVM Metaspace". IIUC, the purpose of this is to
> > > explain
> > > >> > the
> > > >> > > > > > > > mismatching between the metric "non-heap maximum" and
> > the
> > > >> sum
> > > >> > of
> > > >> > > > the
> > > >> > > > > > > > configurations "JVM metaspace" & "JVM Overhead".
> > However,
> > > >> from
> > > >> > > the
> > > >> > > > > > > > perspective of FLIP-49, JVM Overhead accounts for all
> > the
> > > >> JVM
> > > >> > > > > non-heap
> > > >> > > > > > > > memory usages except for metaspace. The metrics do
> not
> > > >> match
> > > >> > > the
> > > >> > > > > > > > configuration because we did not set a JVM
> parameter
> > > for
> > > >> > "max
> > > >> > > > > > > non-heap
> > > >> > > > > > > > memory" (actually I'm not sure whether it can be
> > specified
> > > >> in
> > > >> > > java
> > > >> > > > > 8).
> > > >> > > > > > > The
> > > >> > > > > > > > current UI might confuse people making them think
> there
> > > are
> > > >> > other
> > > >> > > > > > > non-heap
> > > >> > > > > > > > memory usages not accounted by the configurations.
> > > >> Therefore, I
> > > >> > > > would
> > > >> > > > > > > > suggest to remove the "Other" frame, but add another
> > frame
> > > >> > inside
> > > >> > > > > "JVM
> > > >> > > > > > > > Overhead", besides "Configuration", with "JVM limit"
> as
> > > the
> > > >> > title
> > > >> > > > and
> > > >> > > > > > > > "non-heap max metric minus metaspace configuration" as
> > the
> > > >> > value
> > > >> > > .
> > > >> > > > > > > >
> > > >> > > > > > > > - In the final release, we have changed "shuffle memory" to "network
> > > >> > > > > > > > memory" because the latter is easier to understand for users. I think we
> > > >> > > > > > > > should update it in this FLIP as well.
> > > >> > > > > > > >
> > > >> > > > > > > > - There's a typo "Directed" (should be "Direct") at the direct memory
> > > >> > > > > > > > metric.
> > > >> > > > > > > >
> > > >> > > > > > > > Thank you~
> > > >> > > > > > > >
> > > >> > > > > > > > Xintong Song
> > > >> > > > > > > >
> > > >> > > > > > > >
> > > >> > > > > > > >
> > > >> > > > > > > > On Thu, Feb 20, 2020 at 5:52 PM Yadong Xie <[hidden email]> wrote:
> > > >> > > > > > > >
> > > >> > > > > > > > > Hi all
> > > >> > > > > > > > >
> > > >> > > > > > > > > I want to start the vote for FLIP-102, which proposes to add more metrics
> > > >> > > > > > > > > to the task manager in web UI.
> > > >> > > > > > > > >
> > > >> > > > > > > > > To help everyone better understand the proposal, we spent some efforts on
> > > >> > > > > > > > > making an online POC
> > > >> > > > > > > > >
> > > >> > > > > > > > > previous web:
> > > >> > > > > > > > > http://101.132.122.69:8081/#/task-manager/6df6c5f37b2bff125dbc3a7388128559/metrics
> > > >> > > > > > > > > POC web:
> > > >> > > > > > > > > http://101.132.122.69:8081/web/#/task-manager/6df6c5f37b2bff125dbc3a7388128559/metrics
> > > >> > > > > > > > >
> > > >> > > > > > > > >
> > > >> > > > > > > > > The vote will last for at least 72 hours, following the consensus voting
> > > >> > > > > > > > > process.
> > > >> > > > > > > > >
> > > >> > > > > > > > > FLIP wiki:
> > > >> > > > > > > > > https://cwiki.apache.org/confluence/display/FLINK/FLIP-102%3A+Add+More+Metrics+to+TaskManager
> > > >> > > > > > > > >
> > > >> > > > > > > > > Discussion thread:
> > > >> > > > > > > > > http://apache-flink-mailing-list-archive.1008284.n3.nabble.com/DISCUSS-FLIP-75-Flink-Web-UI-Improvement-Proposal-td33540.html
> > > >> > > > > > > > >
> > > >> > > > > > > > > Thanks,
> > > >> > > > > > > > >
> > > >> > > > > > > > > Yadong

Re: [VOTE] FLIP-102: Add More Metrics to TaskManager

Matthias
Hi everyone,
let me take the opportunity to revive this discussion: I looked into the
code and the FLIP-102 proposal for a bit. Bear with me while I'm getting
into the topic. 8)

I looked into the JDK's ManagementFactory, experimenting with the different
memory pools. The Metaspace memory pool is accessible through ManagementFactory's
getMemoryPoolMXBeans() method, as stated in Andrey's reference [1].
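
To illustrate the pool lookup described above, here is a minimal sketch that lists every memory pool the JVM exposes. It is not code from Flink or the FLIP; the class name MemoryPoolOverview and the output format are made up for illustration. Which pools appear (for example "Metaspace", "Code Cache", "Compressed Class Space") depends on the JVM vendor, version, and garbage collector.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryPoolMXBean;
import java.lang.management.MemoryUsage;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper class, not part of Flink.
public class MemoryPoolOverview {

    /** Returns one line per JVM memory pool: name, type (HEAP or NON_HEAP), used and max bytes. */
    public static List<String> describePools() {
        List<String> lines = new ArrayList<>();
        for (MemoryPoolMXBean pool : ManagementFactory.getMemoryPoolMXBeans()) {
            MemoryUsage usage = pool.getUsage(); // null if the pool is no longer valid
            if (usage == null) {
                continue;
            }
            // getMax() returns -1 when no maximum is defined for the pool.
            lines.add(String.format("%s [%s] used=%d max=%d",
                    pool.getName(), pool.getType(), usage.getUsed(), usage.getMax()));
        }
        return lines;
    }

    public static void main(String[] args) {
        describePools().forEach(System.out::println);
    }
}
```

On a HotSpot JVM the output typically contains a NON_HEAP pool named "Metaspace", which is the value a Metaspace pane could display.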

This would mean that we could adapt the current proposal to replace the
Nonheap usage pane with a pane displaying the Metaspace usage. This way, we
could align the memory usage overview with the memory model, getting closer
to what's introduced in FLIP-49.

This would help remove confusion around the NonHeap term. The only issue I
see is that JVM Overhead would still not be represented in the memory usage
overview.

Best,
Matthias

[1]
https://kb.novaordis.com/index.php/Memory_Monitoring_and_Management_Platform_MBeans#Metaspace



--
Sent from: http://apache-flink-mailing-list-archive.1008284.n3.nabble.com/

Re: [VOTE] FLIP-102: Add More Metrics to TaskManager

Matthias
In reply to this post by jing
Hi Jing,
I recently joined Ververica and started looking into FLIP-102. I'm trying to
figure out how we would implement the proposal on the backend side.
I looked into the proposal for the REST API response and a few questions
popped up:
- Is there a reason for us to introduce a nested structure
TaskManagerMetricsInfo in the response object? I would rather keep it
consistent in a flat structure instead, i.e. having all the members of
TaskManagerResourceInfo being members of TaskManagerMetricsInfo.
  Alternatively, one could think of grouping the metrics, collecting the
different values (i.e. max, used, committed) per metric in a JSON object.
But this would have to apply to all the other metrics of
TaskManagerMetricsInfo as well.
- metrics.resource.managedMemory and metrics.resource.networkMemory have
counterparts in metrics.networkMemory[Used|Total] and
metrics.managedMemory[Used|Total]: Is this redundant data or do they have
different semantics?
- Is metrics.resource.totalProcessMemory a basic sum over all provided
values? I see the necessity to have this member if we decide to not provide
the memory usage for all memory pools (e.g. providing Metaspace but leaving
Code Cache and Compressed Class Space as Non-Heap pools out of the
response). Otherwise, would it be worth removing this member from the
response for simplicity, since we could sum up the memory on the
frontend side?
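
As a rough illustration of the grouping alternative mentioned in the first question, the sketch below serializes each metric as one JSON object holding its used, committed, and max values. The class names, the metric key "heap", and the field names are assumptions for illustration only; they are not the FLIP-102 response format or an existing Flink class.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.stream.Collectors;

// Hypothetical sketch of a grouped-per-metric response shape, not Flink code.
public class GroupedMetricsSketch {

    /** The values of one metric, grouped together instead of flattened into separate fields. */
    public static final class MetricValues {
        final long used;
        final long committed;
        final long max;

        public MetricValues(long used, long committed, long max) {
            this.used = used;
            this.committed = committed;
            this.max = max;
        }

        String toJson() {
            return String.format("{\"used\":%d,\"committed\":%d,\"max\":%d}", used, committed, max);
        }
    }

    /** Serializes the metrics as one JSON object keyed by metric name, in insertion order. */
    public static String toJson(Map<String, MetricValues> metrics) {
        return metrics.entrySet().stream()
                .map(e -> "\"" + e.getKey() + "\":" + e.getValue().toJson())
                .collect(Collectors.joining(",", "{", "}"));
    }

    public static void main(String[] args) {
        Map<String, MetricValues> metrics = new LinkedHashMap<>();
        metrics.put("heap", new MetricValues(1L, 2L, 3L));
        System.out.println(toJson(metrics));
    }
}
```

A flat structure would instead expose separate members per value (in the style of the existing metrics.networkMemory[Used|Total] fields), which keeps the response consistent with what is already there at the cost of more top-level members.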

Best,
Matthias




Re: [VOTE] FLIP-102: Add More Metrics to TaskManager

Andrey Zagrebin-4
Hi All,

Thanks for reviving the discussion, Matthias!

> This would mean that we could adapt the current proposal to replace the
> Nonheap usage pane by a pane displaying the Metaspace usage.

I do not know the value of having the Nonheap usage in metrics. I can see
that the Metaspace metric can be interesting for users debugging OOMs.
We had the Nonheap usage before, so as discussed, I would be a bit careful
about removing it. I believe it deserves a separate poll on the user ML
on whether the Nonheap usage is useful or not.
As a current solution, we could keep both or merge them into one box with a
slash, like Metaspace/Nonheap -> 5MB/10MB, if the majority agrees that this
is not confusing and that it is clear that Metaspace is a part of Nonheap.
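
A minimal sketch of how such a combined value could be computed from the JVM, assuming a HotSpot JVM where the pool is named "Metaspace". The class MetaspaceLabel and its methods are hypothetical and only illustrate that the Metaspace pool is one part of the overall non-heap usage reported by the MemoryMXBean.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryPoolMXBean;
import java.lang.management.MemoryUsage;

// Hypothetical helper class, not part of Flink.
public class MetaspaceLabel {

    // Pool name used by HotSpot; other JVMs may name it differently (assumption).
    static final String METASPACE_POOL = "Metaspace";

    /** Used bytes of the Metaspace pool, or -1 if the JVM exposes no such pool. */
    public static long metaspaceUsedBytes() {
        for (MemoryPoolMXBean pool : ManagementFactory.getMemoryPoolMXBeans()) {
            MemoryUsage usage = pool.getUsage();
            if (METASPACE_POOL.equals(pool.getName()) && usage != null) {
                return usage.getUsed();
            }
        }
        return -1L;
    }

    /** Formats both byte values in the "Metaspace/Nonheap" style, rounded down to whole MB. */
    public static String format(long metaspaceBytes, long nonHeapBytes) {
        long mb = 1024L * 1024L;
        return String.format("Metaspace/Nonheap -> %dMB/%dMB", metaspaceBytes / mb, nonHeapBytes / mb);
    }

    public static void main(String[] args) {
        long nonHeap = ManagementFactory.getMemoryMXBean().getNonHeapMemoryUsage().getUsed();
        long metaspace = metaspaceUsedBytes();
        if (metaspace < 0) {
            System.out.println("No pool named " + METASPACE_POOL + " on this JVM");
        } else {
            System.out.println(format(metaspace, nonHeap));
        }
    }
}
```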

Btw, the "Nonheap" in the configuration box of the current FLIP-102 is
probably incorrect or confusing, as it does not correspond one-to-one to the
Nonheap JVM metric.

> The only issue I see is that JVM Overhead would still not be represented in
> the memory usage overview.

My understanding is that we do not need a usage metric for JVM Overhead as
it is a virtual unmanaged component which is more about configuring the max
total process memory.

> Is there a reason for us to introduce a nested structure
> TaskManagerMetricsInfo in the response object? I would rather keep it
> consistent in a flat structure instead, i.e. having all the members of
> TaskManagerResourceInfo being members of TaskManagerMetricsInfo

I would suggest introducing a separate REST call for
TaskManagerResourceInfo.
Semantically, TaskManagerResourceInfo is more about the TM configuration
and it is not directly related to the usage metrics.
In the future, I would avoid having calls with many responsibilities and maybe
consider splitting the 'TM details' call into metrics, etc., unless there is a
concern about the UI having to make more calls instead of one.

> Alternatively, one could think of grouping the metrics collecting the
> different values (i.e. max, used, committed) per metric in a JSON object.
> But this would apply for all the other metrics of TaskManagerMetricsInfo
> as well.

I would personally prefer this for metrics but I am not pushing for this.

> metrics.resource.managedMemory and metrics.resource.networkMemory have
> counterparts in metrics.networkMemory[Used|Total] and
> metrics.managedMemory[Used|Total]: Is this redundant data or do they have
> different semantics?

As I understand it, they have different semantics. The former
(metrics.resource.*) is about configuration, the latter is about current
usage metrics.

> Is metrics.resource.totalProcessMemory a basic sum over all provided
> values?

This is again about configuration. I do not think it makes sense to come up
with a usage metric for the totalProcessMemory component.

Best,
Andrey


On Thu, Aug 20, 2020 at 9:06 AM Matthias <[hidden email]> wrote:

> Hi Jing,
> I recently joined Ververica and started looking into FLIP-102. I'm trying to
> figure out how we would implement the proposal on the backend side.
> I looked into the proposal for the REST API response and a few questions
> popped up:
> - Is there a reason for us to introduce a nested structure
> TaskManagerMetricsInfo in the response object? I would rather keep it
> consistent in a flat structure instead, i.e. having all the members of
> TaskManagerResourceInfo being members of TaskManagerMetricsInfo.
>   Alternatively, one could think of grouping the metrics collecting the
> different values (i.e. max, used, committed) per metric in a JSON object.
> But this would apply for all the other metrics of TaskManagerMetricsInfo as
> well.
> - metrics.resource.managedMemory and metrics.resource.networkMemory have
> counterparts in metrics.networkMemory[Used|Total] and
> metrics.managedMemory[Used|Total]: Is this redundant data or do they have
> different semantics?
> - Is metrics.resource.totalProcessMemory a basic sum over all provided
> values? I see the necessity to have this member if we decide to not provide
> the memory usage for all memory pools (e.g. providing Metaspace but leaving
> Code Cache and Compressed Class Space as Non-Heap pools out of the
> response). Otherwise, would it be worth it to remove this member from the
> response for simplicity reasons since we could sum up the memory on the
> frontend side?
>
> Best,
> Matthias
>
>
>
> --
> Sent from: http://apache-flink-mailing-list-archive.1008284.n3.nabble.com/
>