[VOTE] FLIP-102: Add More Metrics to TaskManager

classic Classic list List threaded Threaded
22 messages Options
12
Reply | Threaded
Open this post in threaded view
|

Re: [VOTE] FLIP-102: Add More Metrics to TaskManager

Matthias
Good points, Andrey. Thanks for clarification. I made some minor
adaptations to the FLIP now:
- Renamed the `resource` member into `configuration` and made it a
top-level member besides `metrics` and `hardware` since it's not fitting
the volatile metrics context that well.
- I restructured the table under Proposed Changes to cover Metaspace now.
Additionally, I renamed `shuffle` into `network` to match the memory model
of FLIP-49.
- The UI in the screenshot picture has a bug: The counts of Direct and
Mapped are accompanied by a memory unit even though they are plain counts.

On Thu, Aug 20, 2020 at 4:10 PM Andrey Zagrebin <[hidden email]>
wrote:

> Hi All,
>
> Thanks for reviving the discussion, Matthias!
>
> This would mean that we could adapt the current proposal to replace the
> > Nonheap usage pane by a pane displaying the Metaspace usage.
> >
> I do not know the value of having the Nonheap usage in metrics. I can see
> that the metaspace metric can be interesting for the users to debug OOMs.
> We had the Nonheap usage before, so as discussed, I would be a bit careful
> removing. I believe it deserves a separate poll in user ML
> whether the Nonheap usage is useless or not.
> As a current solution, we could keep both or merge them into one box with a
> slash, like Metaspace/Nonheap -> 5Mb/10Mb, if the majority agrees that this
> is not confusing and clear that the metaspace is a part of Nonheap.
>

That would be a good solution representing both metrics. I adapted the
table in FLIP-102's Confluence accordingly for now to have it visualized.
Let's see what others are thinking about it.


>
> Btw, the "Nonheap" in the configuration box of the current FLIP-102 is
> probably incorrect or confusing as it does not one-to-one correspond to the
> Nonheap JVM metric.
>
> The only issue I see is that JVM Overhead would still not be represented in
> > the memory usage
> > overview.
>
> My understanding is that we do not need a usage metric for JVM Overhead as
> it is a virtual unmanaged component which is more about configuring the max
> total process memory.
>
> Is there a reason for us to introduce a nested structure
> > TaskManagerMetricsInfo in the response object? I would rather keep it
> > consistent in a flat structure instead, i.e. having all the members of
> > TaskManagerResourceInfo being members of TaskManagerMetricsInfo
>
> I would suggest introducing a separate REST call for
> TaskManagerResourceInfo.
> Semantically, TaskManagerResourceInfo is more about the TM configuration
> and it is not directly related to the usage metrics.
> In future, I would avoid having calls with many responsibilities and maybe
> consider splitting the 'TM details' call into metrics etc unless there is a
> concern for having to do more calls instead of one from UI.
>

Good point. The growing size of the JSON response record might make it
worth splitting it up into different endpoints serving different groups of
data (e.g. /metrics for volatile values and /configuration for static ones).


>
> Alternatively, one could think of grouping the metrics collecting the
> > different values (i.e. max, used, committed) per metric in a JSON object.
> > But this would apply for all the other metrics of TaskManagerMetricsInfo
> > as
> > well.
>
> I would personally prefer this for metrics but I am not pushing for this.
>
> metrics.resource.managedMemory and metrics.resource.networkMemory have
> > counterparts in metrics.networkMemory[Used|Total] and
> > metrics.managedMemory[Used|Total]: Is this redundant data or do they have
> > different semantics?
>
> As I understand, they have different semantics. The later is about
> configuration, the former is about current usage metrics.
>

I see. Makes sense.

>
> Is metrics.resource.totalProcessMemory a basic sum over all provided
> > values?
>
> this is again about configuration, I do not think it makes sense to come up
> with a usage metric for the totalProcessMemory component.
>

Got it.


> Best,
> Andrey
>
>
> On Thu, Aug 20, 2020 at 9:06 AM Matthias <[hidden email]> wrote:
>
> > Hi Jing,
> > I recently joined Ververica and started looking into FLIP-102. I'm trying
> > to
> > figure out how we would implement the proposal on the backend side.
> > I looked into the proposal for the REST API response and a few questions
> > popped up:
> > - Is there a reason for us to introduce a nested structure
> > TaskManagerMetricsInfo in the response object? I would rather keep it
> > consistent in a flat structure instead, i.e. having all the members of
> > TaskManagerResourceInfo being members of TaskManagerMetricsInfo.
> >   Alternatively, one could think of grouping the metrics collecting the
> > different values (i.e. max, used, committed) per metric in a JSON object.
> > But this would apply for all the other metrics of TaskManagerMetricsInfo
> as
> > well.
> > - metrics.resource.managedMemory and metrics.resource.networkMemory have
> > counterparts in metrics.networkMemory[Used|Total] and
> > metrics.managedMemory[Used|Total]: Is this redundant data or do they have
> > different semantics?
> > - Is metrics.resource.totalProcessMemory a basic sum over all provided
> > values? I see the necessity to have this member if we decide to not
> provide
> > the memory usage for all memory pools (e.g. providing Metaspace but
> leaving
> > Code Cache and Compressed Class Space as Non-Heap pools out of the
> > response). Otherwise, would it be worth it to remove this member from the
> > response for simplicity reasons since we could sum up the memory on the
> > frontend side?
> >
> > Best,
> > Matthias
> >
> >
> >
> > --
> > Sent from:
> http://apache-flink-mailing-list-archive.1008284.n3.nabble.com/
> >
>


--

Matthias Pohl | Engineer

Follow us @VervericaData Ververica <https://www.ververica.com/>

--

Join Flink Forward <https://flink-forward.org/> - The Apache Flink
Conference

Stream Processing | Event Driven | Real Time

--

Ververica GmbH | Invalidenstrasse 115, 10115 Berlin, Germany

--
Ververica GmbH
Registered at Amtsgericht Charlottenburg: HRB 158244 B
Managing Directors: Yip Park Tung Jason, Jinwei (Kevin) Zhang, Karl Anton
Wehner
Reply | Threaded
Open this post in threaded view
|

Re: [VOTE] FLIP-102: Add More Metrics to TaskManager

Yadong Xie
Hi all

there have been lots of discussions since the vote started and many
suggestions have been made

Matthias and I had updated the FLIP-102
<https://cwiki.apache.org/confluence/display/FLINK/FLIP-102%3A+Add+More+Metrics+to+TaskManager>
following the suggestions and discussions

I want to cancel the vote here and start a new one, thanks

Matthias Pohl <[hidden email]> 于2020年8月21日周五 上午3:33写道:

> Good points, Andrey. Thanks for clarification. I made some minor
> adaptations to the FLIP now:
> - Renamed the `resource` member into `configuration` and made it a
> top-level member besides `metrics` and `hardware` since it's not fitting
> the volatile metrics context that well.
> - I restructured the table under Proposed Changes to cover Metaspace now.
> Additionally, I renamed `shuffle` into `network` to match the memory model
> of FLIP-49.
> - The UI in the screenshot picture has a bug: The counts of Direct and
> Mapped are accompanied by a memory unit even though they are plain counts.
>
> On Thu, Aug 20, 2020 at 4:10 PM Andrey Zagrebin <[hidden email]>
> wrote:
>
> > Hi All,
> >
> > Thanks for reviving the discussion, Matthias!
> >
> > This would mean that we could adapt the current proposal to replace the
> > > Nonheap usage pane by a pane displaying the Metaspace usage.
> > >
> > I do not know the value of having the Nonheap usage in metrics. I can see
> > that the metaspace metric can be interesting for the users to debug OOMs.
> > We had the Nonheap usage before, so as discussed, I would be a bit
> careful
> > removing. I believe it deserves a separate poll in user ML
> > whether the Nonheap usage is useless or not.
> > As a current solution, we could keep both or merge them into one box
> with a
> > slash, like Metaspace/Nonheap -> 5Mb/10Mb, if the majority agrees that
> this
> > is not confusing and clear that the metaspace is a part of Nonheap.
> >
>
> That would be a good solution representing both metrics. I adapted the
> table in FLIP-102's Confluence accordingly for now to have it visualized.
> Let's see what others are thinking about it.
>
>
> >
> > Btw, the "Nonheap" in the configuration box of the current FLIP-102 is
> > probably incorrect or confusing as it does not one-to-one correspond to
> the
> > Nonheap JVM metric.
> >
> > The only issue I see is that JVM Overhead would still not be represented
> in
> > > the memory usage
> > > overview.
> >
> > My understanding is that we do not need a usage metric for JVM Overhead
> as
> > it is a virtual unmanaged component which is more about configuring the
> max
> > total process memory.
> >
> > Is there a reason for us to introduce a nested structure
> > > TaskManagerMetricsInfo in the response object? I would rather keep it
> > > consistent in a flat structure instead, i.e. having all the members of
> > > TaskManagerResourceInfo being members of TaskManagerMetricsInfo
> >
> > I would suggest introducing a separate REST call for
> > TaskManagerResourceInfo.
> > Semantically, TaskManagerResourceInfo is more about the TM configuration
> > and it is not directly related to the usage metrics.
> > In future, I would avoid having calls with many responsibilities and
> maybe
> > consider splitting the 'TM details' call into metrics etc unless there
> is a
> > concern for having to do more calls instead of one from UI.
> >
>
> Good point. The growing size of the JSON response record might make it
> worth splitting it up into different endpoints serving different groups of
> data (e.g. /metrics for volatile values and /configuration for static
> ones).
>
>
> >
> > Alternatively, one could think of grouping the metrics collecting the
> > > different values (i.e. max, used, committed) per metric in a JSON
> object.
> > > But this would apply for all the other metrics of
> TaskManagerMetricsInfo
> > > as
> > > well.
> >
> > I would personally prefer this for metrics but I am not pushing for this.
> >
> > metrics.resource.managedMemory and metrics.resource.networkMemory have
> > > counterparts in metrics.networkMemory[Used|Total] and
> > > metrics.managedMemory[Used|Total]: Is this redundant data or do they
> have
> > > different semantics?
> >
> > As I understand, they have different semantics. The later is about
> > configuration, the former is about current usage metrics.
> >
>
> I see. Makes sense.
>
> >
> > Is metrics.resource.totalProcessMemory a basic sum over all provided
> > > values?
> >
> > this is again about configuration, I do not think it makes sense to come
> up
> > with a usage metric for the totalProcessMemory component.
> >
>
> Got it.
>
>
> > Best,
> > Andrey
> >
> >
> > On Thu, Aug 20, 2020 at 9:06 AM Matthias <[hidden email]> wrote:
> >
> > > Hi Jing,
> > > I recently joined Ververica and started looking into FLIP-102. I'm
> trying
> > > to
> > > figure out how we would implement the proposal on the backend side.
> > > I looked into the proposal for the REST API response and a few
> questions
> > > popped up:
> > > - Is there a reason for us to introduce a nested structure
> > > TaskManagerMetricsInfo in the response object? I would rather keep it
> > > consistent in a flat structure instead, i.e. having all the members of
> > > TaskManagerResourceInfo being members of TaskManagerMetricsInfo.
> > >   Alternatively, one could think of grouping the metrics collecting the
> > > different values (i.e. max, used, committed) per metric in a JSON
> object.
> > > But this would apply for all the other metrics of
> TaskManagerMetricsInfo
> > as
> > > well.
> > > - metrics.resource.managedMemory and metrics.resource.networkMemory
> have
> > > counterparts in metrics.networkMemory[Used|Total] and
> > > metrics.managedMemory[Used|Total]: Is this redundant data or do they
> have
> > > different semantics?
> > > - Is metrics.resource.totalProcessMemory a basic sum over all provided
> > > values? I see the necessity to have this member if we decide to not
> > provide
> > > the memory usage for all memory pools (e.g. providing Metaspace but
> > leaving
> > > Code Cache and Compressed Class Space as Non-Heap pools out of the
> > > response). Otherwise, would it be worth it to remove this member from
> the
> > > response for simplicity reasons since we could sum up the memory on the
> > > frontend side?
> > >
> > > Best,
> > > Matthias
> > >
> > >
> > >
> > > --
> > > Sent from:
> > http://apache-flink-mailing-list-archive.1008284.n3.nabble.com/
> > >
> >
>
>
> --
>
> Matthias Pohl | Engineer
>
> Follow us @VervericaData Ververica <https://www.ververica.com/>
>
> --
>
> Join Flink Forward <https://flink-forward.org/> - The Apache Flink
> Conference
>
> Stream Processing | Event Driven | Real Time
>
> --
>
> Ververica GmbH | Invalidenstrasse 115, 10115 Berlin, Germany
>
> --
> Ververica GmbH
> Registered at Amtsgericht Charlottenburg: HRB 158244 B
> Managing Directors: Yip Park Tung Jason, Jinwei (Kevin) Zhang, Karl Anton
> Wehner
>
12