(DEPRECATED) Apache Flink Mailing List archive.

Enhance Flink's monitoring capabilities

Nils E

Enhance Flink's monitoring capabilities

Hello everyone,

I am trying to enhance Flink's monitoring capabilities in the style of the
GSoC 2014 proposal by Rajika Kumarasiri [1].

Short abstract:
He suggested using the Java standard for this, the Java Management
Extensions (JMX).
The idea is to put an MBean server in the JobManager, so that the
JobManager itself and all TaskManagers in the cluster can register their
MBeans with this server via RMI.
Different monitoring levels (none, standard, full) limit the effect on
system performance.
The JMX service should be accessible from an improved web component through
a RESTful API.
He also suggested using the SIGAR [2] JNI library to gather system
information.
In my opinion this point is debatable. Java 7 ships Platform MXBeans [3],
which already cover the basic system information, so in my eyes a JNI
library might be overkill. But of course this depends on the intended depth
of monitoring.
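
To make the Platform MXBeans point concrete, here is a minimal sketch of what can be read without any JNI library. It is an illustration only, not code from the proposal or from Flink; the class name `PlatformStatsSketch` and the metric names are made up for this example. It uses only `java.lang.management` from the JDK.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryUsage;
import java.lang.management.OperatingSystemMXBean;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper, not part of Flink.
final class PlatformStatsSketch {

    /** Collects a few basic system figures using only the standard platform MXBeans. */
    static Map<String, Number> snapshot() {
        OperatingSystemMXBean os = ManagementFactory.getOperatingSystemMXBean();
        MemoryUsage heap = ManagementFactory.getMemoryMXBean().getHeapMemoryUsage();

        Map<String, Number> stats = new LinkedHashMap<>();
        stats.put("availableProcessors", os.getAvailableProcessors());
        // Negative on platforms where the load average is not available.
        stats.put("systemLoadAverage", os.getSystemLoadAverage());
        stats.put("heapUsedBytes", heap.getUsed());
        stats.put("heapCommittedBytes", heap.getCommitted());
        stats.put("threadCount", ManagementFactory.getThreadMXBean().getThreadCount());
        return stats;
    }

    public static void main(String[] args) {
        for (Map.Entry<String, Number> e : snapshot().entrySet()) {
            System.out.println(e.getKey() + " = " + e.getValue());
        }
    }
}
```

Figures such as per-device disk or network throughput are not covered by these standard beans, which is where a native library like SIGAR would still have a role.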

So the primary question:
Which parameters, system properties, utilization figures, and workloads
should be monitored, in your opinion?

Have a nice weekend!
Nils

[1] https://github.com/stratosphere/stratosphere/wiki/GSoC-2014-Project-Proposal-Draft-by-Rajika-Kumarasiri
[2] https://support.hyperic.com/display/SIGAR/Home
[3] https://docs.oracle.com/javase/7/docs/technotes/guides/management/overview.html

Fabian Hueske

Re: Enhance Flink's monitoring capabilities

Hi Nils,

Flink's current monitoring is quite limited and basically restricted to
status updates of the parallel tasks (scheduled, started, finished,
canceled, failed, etc.).
There is also some code lying around to collect system stats such as CPU,
memory, and network utilization. However, it is not used right now, AFAIK.
For a long-running job, it is hard to figure out what is going on
and whether a program is making progress.

Having a monitoring infrastructure that makes it easy to add, collect, and
query new metrics would be a great addition to Flink.
From what I know, JMX was explicitly designed for this purpose and seems to
be a good fit. Since it is a Java standard, other tools can easily connect
and retrieve monitoring data.

As a starting point, I would focus on getting an early prototype that uses
JMX to collect a single metric, such as the number of tuples processed by a
Map function.
Such a showcase would help us have a good discussion about how to
implement the monitoring infrastructure.
The question of which metrics to collect is orthogonal to that. If we have a
good system to collect and gather stats, these can be added one by one.
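
A single-metric JMX prototype of the kind described above could look roughly like the following sketch. It is not Flink code: `JmxCounterSketch`, `RecordCounter`, and the object name `flink.sketch:type=RecordCounter,task=map-0` are hypothetical, and the "map task" is simulated by a loop. Only standard `javax.management` calls are used.

```java
import java.lang.management.ManagementFactory;
import java.util.concurrent.atomic.AtomicLong;
import javax.management.JMException;
import javax.management.MBeanServer;
import javax.management.ObjectName;

// Hypothetical prototype, not part of Flink.
final class JmxCounterSketch {

    /** Management interface; a standard MBean interface must be public and named <Class>MBean. */
    public interface RecordCounterMBean {
        long getRecordsProcessed();
    }

    /** Counter that a map task would increment once per processed record. */
    public static final class RecordCounter implements RecordCounterMBean {
        private final AtomicLong records = new AtomicLong();

        void increment() {
            records.incrementAndGet();
        }

        @Override
        public long getRecordsProcessed() {
            return records.get();
        }
    }

    /** Registers a counter, simulates processing, and reads the value back through JMX. */
    static long runDemo(int recordsToProcess) {
        MBeanServer server = ManagementFactory.getPlatformMBeanServer();
        RecordCounter counter = new RecordCounter();
        try {
            ObjectName name = new ObjectName("flink.sketch:type=RecordCounter,task=map-0");
            server.registerMBean(counter, name);
            try {
                for (int i = 0; i < recordsToProcess; i++) {
                    counter.increment(); // stands in for the user's map function being called
                }
                // Any JMX client (for example JConsole) would see the same attribute.
                return (Long) server.getAttribute(name, "RecordsProcessed");
            } finally {
                server.unregisterMBean(name);
            }
        } catch (JMException e) {
            throw new IllegalStateException("JMX sketch failed", e);
        }
    }

    public static void main(String[] args) {
        System.out.println("records processed: " + runDemo(1000));
    }
}
```

The sketch registers with the MBean server of the local JVM only. Collecting such beans from all TaskManagers at the JobManager, as the proposal describes, would need a remote connector on top of this.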

Cheers, Fabian

2014-11-21 18:32 GMT+01:00 Nils E <[hidden email]>:

Ufuk Celebi-2

Re: Enhance Flink's monitoring capabilities

On 23 Nov 2014, at 00:03, Fabian Hueske <[hidden email]> wrote:

> Hi Nils,
>
> Flink's current monitoring is quite limited and basically restricted to
> status updates of the parallel tasks (scheduled, started, finished,
> canceled, failed, etc.).
> There is also some code lying around to collect system stats such as CPU,
> memory, and network utilization. However, it is not used right now, AFAIK.
> In case of a long running job, it is hard to figure out what is going on
> and whether a program makes progress or not.
>
> Having a monitoring infrastructure which allows to add, collect, and query
> new metrics with low effort would be a great addition to Flink.
> From what I know, JMX was explicitly designed for this purpose and seems to
> be a good fit. Since it is a Java standard, other tools can easily connect
> and retrieve monitoring data.
>
> As a starting point, I would focus to get an early prototype that uses JMX
> to collect a single metric such as number of tuples processed by a Map
> function.
> Having such a showcase, would help to have a good discussion about how to
> implement the monitoring infrastructure.
> The question of metrics to collect is orthogonal to that. If we have a good
> system to collect and gather stats, these can be added one by one.

+1

I don't have experience with JMX, but I agree with Fabian that the architecture of this monitoring service is very important and should come first. It should be flexible enough to easily support the collection of metrics by any operator and the user.

Every task manager needs to expose this service to collect (and aggregate) data, which would then be collected at a central instance (e.g. the JobManager). I am not sure at this point, but it might be worthwhile to think about separating this central monitoring service from the JobManager in order to reduce JobManager load and gain flexibility, e.g. by running it as a central history server that monitors multiple JobManager instances (for example in YARN setups).

– Ufuk

Robert Metzger

Re: Enhance Flink's monitoring capabilities

Hi Nils,

I'm not sure if it's a good idea to use JMX. I fear that we are
overengineering something here for features that we don't really need.
I don't know of any tools that can evaluate this JMX information (I think
that's the main argument for using JMX).
Also, this kind of monitoring (connecting from a local client to JVMs
running somewhere in a cluster) is often so complicated (firewalls, port
forwarding, ssh, ...) that it's probably too impractical. So users will
probably only use the web interface.
I think the main reason we agreed to use JMX for that proposal was
that the student who proposed the topic knew the JMX system very well.

What I would do is the following:
a) Work on top of Till's Akka branch until it's merged (
https://github.com/apache/incubator-flink/pull/149). There is no point in
changing the current RPC / JobManager / TaskManager infrastructure if we
are going to replace it very soon (I think it's a matter of days).

b) Just "piggyback" on the heartbeat that the TaskManagers are sending to
the JobManager and include metrics there.
Then, collect them at the JobManager and expose them via the web interface.

c) Metrics:
I would start with Garbage Collection statistics.
Then:
- "bytes in" per operator
- input splits processed for DataSources
- Current Iteration number / Avg Iteration time
- Disk IO / Network IO stats.
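
Points b) and c) can be sketched together as follows: the TaskManager gathers garbage collection totals and attaches them to its heartbeat, and the JobManager keeps only the latest values per TaskManager. This is a simplified illustration, not the actual Flink or Akka message types; `Heartbeat`, `MetricsStore`, and the metric names are hypothetical.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch, not part of Flink.
final class HeartbeatMetricsSketch {

    /** Heartbeat message: the sender's id plus a flat, read-only map of metric values. */
    static final class Heartbeat {
        final String taskManagerId;
        final Map<String, Long> metrics;

        Heartbeat(String taskManagerId, Map<String, Long> metrics) {
            this.taskManagerId = taskManagerId;
            this.metrics = Collections.unmodifiableMap(new LinkedHashMap<>(metrics));
        }
    }

    /** TaskManager side: sums garbage collection counts and times over all collectors. */
    static Map<String, Long> collectGcMetrics() {
        long collections = 0;
        long millis = 0;
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            // Both getters return -1 when a collector does not report the value.
            collections += Math.max(0L, gc.getCollectionCount());
            millis += Math.max(0L, gc.getCollectionTime());
        }
        Map<String, Long> metrics = new LinkedHashMap<>();
        metrics.put("gc.collections", collections);
        metrics.put("gc.timeMillis", millis);
        return metrics;
    }

    /** JobManager side: remembers only the most recent metrics of each TaskManager. */
    static final class MetricsStore {
        private final Map<String, Map<String, Long>> latest = new ConcurrentHashMap<>();

        void onHeartbeat(Heartbeat heartbeat) {
            latest.put(heartbeat.taskManagerId, heartbeat.metrics);
        }

        /** Returns null if the TaskManager has not sent a heartbeat yet. */
        Map<String, Long> latestFor(String taskManagerId) {
            return latest.get(taskManagerId);
        }
    }
}
```

A TaskManager would call something like `store.onHeartbeat(new Heartbeat("tm-1", collectGcMetrics()))` on every heartbeat interval, and the web interface would read from `latestFor`. Since no extra connection is opened, the firewall and port-forwarding concerns raised above do not apply to this design.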

Let me know if you need more information.

Best,
Robert

On Sun, Nov 23, 2014 at 11:28 PM, Ufuk Celebi <[hidden email]> wrote:

aalexandrov

Re: Enhance Flink's monitoring capabilities

In reply to this post by Nils E

Hello Nils,

I am going to work on a similar issue related to tracking some basic statistics of the intermediate results produced by dataflows during execution.

I just created a Jira issue here:

https://issues.apache.org/jira/browse/FLINK-1297

If you already have some work done on extending the monitoring capabilities in a branch, it might be good to sync up the development in order to avoid duplicated work (e.g. by using the same communication channel to send the data from the task managers to the job manager).

Robert Metzger

Re: Enhance Flink's monitoring capabilities

Hey Nils,

I have played around a bit with a little prototype. You can find the code
here: https://github.com/rmetzger/incubator-flink/tree/flink456 (it's
another branch in my repo).
You can see the changes that I applied on top of Till's Akka branch here:
https://github.com/rmetzger/incubator-flink/compare/tillrohrmann:akka_scala...rmetzger:flink456?expand=1

The code collects statistics about each TaskManager in the
system. These stats are assembled into a "MetricsReport", which is sent with
the periodic heartbeat to the JobManager. The JobManager stores the
latest MetricsReport for each TaskManager (in the Instance object for each
TM).
When the user accesses the TaskManager overview, the latest MetricsReport
is sent as a JSONObject to the browser.
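
The last step, turning the stored report into JSON for the browser, might look like the sketch below. It does not reproduce the prototype's actual `MetricsReport` class or its JSON library; the report is modeled as a plain map and the JSON is written by hand, and `MetricsReportJsonSketch` is a hypothetical name.

```java
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch, not the prototype's MetricsReport.
final class MetricsReportJsonSketch {

    /** Renders a flat metric map as a JSON object with keys in alphabetical order. */
    static String toJson(Map<String, Long> report) {
        Map<String, Long> sorted = new TreeMap<>(report);
        StringBuilder json = new StringBuilder("{");
        boolean first = true;
        for (Map.Entry<String, Long> entry : sorted.entrySet()) {
            if (!first) {
                json.append(',');
            }
            json.append('"').append(escape(entry.getKey())).append("\":").append(entry.getValue());
            first = false;
        }
        return json.append('}').toString();
    }

    /** Escapes only backslashes and quotes, which is enough for plain metric names. */
    private static String escape(String text) {
        return text.replace("\\", "\\\\").replace("\"", "\\\"");
    }
}
```

For example, a report containing `heapUsedBytes` and `gc.collections` is rendered as one object with `gc.collections` first. A real implementation would more likely rely on an existing JSON library than on string building.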

To test my changes, check out the code and build it:
mvn clean package -DskipTests -Dcheckstyle.skip=true
Then go into the build output:
cd flink-dist/target/flink-0.8-incubating-SNAPSHOT-bin/flink-0.8-incubating-SNAPSHOT/
and start the web interface:
./bin/start-local.sh

Go to localhost:8081; in the "TaskManager" view, you can see some metrics.
Here is a screenshot: http://img42.com/eNPve

I named my branch after this issue, as it is probably describing best what
we're working on here: FLINK-456
<https://issues.apache.org/jira/browse/FLINK-456>

As I said in the beginning, it's really just a prototype. Let me know if you
have any further questions.
For the "per TaskManager" reports, we should probably integrate some more
statistics. Also, the presentation of the numbers is very basic right
now. I think there are many good libraries for visualizing these kinds of
stats.
Also, the numbers currently represent only a "snapshot"; however, some of
the numbers could be accumulated (read/write bytes of the I/O manager).
Another missing feature is storing a short history of numbers to visualize
metrics over time.
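
The missing history could be kept in a small bounded buffer per metric, as in the sketch below. It is only one possible design and is not taken from the prototype; `MetricHistorySketch` is a hypothetical class. It also shows how a cumulative counter such as bytes written can be turned into per-interval values.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

// Hypothetical sketch, not part of the prototype.
final class MetricHistorySketch {
    private final int capacity;
    private final Deque<Long> samples = new ArrayDeque<>();

    MetricHistorySketch(int capacity) {
        if (capacity < 1) {
            throw new IllegalArgumentException("capacity must be at least 1");
        }
        this.capacity = capacity;
    }

    /** Records a new snapshot, discarding the oldest one once the buffer is full. */
    synchronized void record(long value) {
        if (samples.size() == capacity) {
            samples.removeFirst();
        }
        samples.addLast(value);
    }

    /** Returns the retained samples from oldest to newest, e.g. for plotting. */
    synchronized List<Long> values() {
        return new ArrayList<>(samples);
    }

    /** Returns the growth of a cumulative counter between consecutive samples. */
    synchronized List<Long> deltas() {
        List<Long> result = new ArrayList<>();
        Long previous = null;
        for (Long current : samples) {
            if (previous != null) {
                result.add(current - previous);
            }
            previous = current;
        }
        return result;
    }
}
```

With a capacity of 3 and the samples 10, 25, 45, 70 recorded in that order, `values()` holds 25, 45, 70 and `deltas()` holds 20, 25. The JobManager could call `record` each time a heartbeat arrives, which bounds the memory used per metric.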

I'm trying to find time to look into "per job" metrics as well. They will
require a bit more infrastructure to distinguish them on the JobManager
side and to get them on the TaskManagers.

Best,
Robert

On Tue, Dec 2, 2014 at 2:53 PM, aalexandrov <
[hidden email]> wrote:

Henry Saputra

Re: Enhance Flink's monitoring capabilities

Hi Robert,

From what I have seen so far, it is probably better and easier for Flink
to leverage the Metrics library [1] for metrics collection rather than
building something organically.

Several ASF projects like Spark [2] and Tajo [3] have used it with great success.

The main reasons are maintainability and the breadth of metric types that
could and should be collected.

- Henry

[1] https://dropwizard.github.io/metrics/3.1.0/getting-started/
[2] https://spark.apache.org/docs/1.0.1/monitoring.html
[3] https://issues.apache.org/jira/browse/TAJO-333

On Sat, Dec 6, 2014 at 11:13 AM, Robert Metzger <[hidden email]> wrote:

Stephan Ewen

Re: Enhance Flink's monitoring capabilities

That actually sounds like a great idea. I discussed a bit with Robert
offline on Friday, and it seems that Metrics has most of what we talked
about.

I also like the way they make it extensible, so people can capture their
own metrics.

On Sun, Dec 7, 2014 at 6:02 AM, Henry Saputra <[hidden email]>
wrote:

Henry Saputra

Re: Enhance Flink's monitoring capabilities

+1

Its extensibility is one of the reasons it has been used in other projects.

On Sunday, December 7, 2014, Stephan Ewen <[hidden email]> wrote:

Henry Saputra

Re: Enhance Flink's monitoring capabilities

In reply to this post by Stephan Ewen

Just curious, is there a JIRA filed for this, or is it still just
preliminary proposal talk?

- Henry

On Sun, Dec 7, 2014 at 3:36 PM, Stephan Ewen <[hidden email]> wrote:

Robert Metzger

Re: Enhance Flink's monitoring capabilities

I think this (very old) issue describes the feature fairly closely:
https://issues.apache.org/jira/browse/FLINK-456

On Thu, Dec 11, 2014 at 8:32 AM, Henry Saputra <[hidden email]>
wrote:

> >> > <https://issues.apache.org/jira/browse/FLINK-456>
> >> >
> >> > As I said in the beginning, its really just a prototype. Let me know
> if
> >> you
> >> > have any further questions.
> >> > For the "per TaskManager" reports, we should probably integrate some
> more
> >> > statistics. Also, the presentation of the numbers is very very basic
> >> right
> >> > now. I think there are many good libraries for visualizing these
> kinds of
> >> > stats.
> >> > Also, the numbers currently represent only a "snapshot", however,
> some of
> >> > the numbers can be accumulated (read/write bytes of the io manager).
> >> > Another missing feature is storing a little history of numbers to
> >> visualize
> >> > metrics over time.
> >> >
> >> > I'm trying to find time to look into "per job" metrics as well. They
> will
> >> > require a bit more infrastructure to distinguish them on the
> JobManager
> >> > side and to get them on the TaskManagers.
> >> >
> >> >
> >> > Best,
> >> > Robert
> >> >
> >> >
> >> >
> >> > On Tue, Dec 2, 2014 at 2:53 PM, aalexandrov <
> >> > [hidden email]> wrote:
> >> >
> >> >> Hello Nils,
> >> >>
> >> >> I am going to work on a similar issue related to tracking some basics
> >> >> statistics of the intermediate results produced by dataflows during
> >> >> execution.
> >> >>
> >> >> I just create a Jira issue here:
> >> >>
> >> >> https://issues.apache.org/jira/browse/FLINK-1297
> >> >>
> >> >> If you already have some work done on extending the monitoring
> >> capabilities
> >> >> in a branch, it might be good to sync-up the development in order to
> >> avoid
> >> >> duplicated work (e.g. using the same communication channel used to
> send
> >> the
> >> >> data from the task managers to the job manager).
> >> >>
> >> >>
> >> >>
> >> >> --
> >> >> View this message in context:
> >> >>
> >>
> http://apache-flink-incubator-mailing-list-archive.1008284.n3.nabble.com/Enhance-Flink-s-monitoring-capabilities-tp2573p2713.html
> >> >> Sent from the Apache Flink (Incubator) Mailing List archive. mailing
> >> list
> >> >> archive at Nabble.com.
> >> >>
> >>
>

Henry Saputra

Re: Enhance Flink's monitoring capabilities

Thanks, Robert. It looks like we could use this JIRA to do the work.

- Henry


aalexandrov

Fwd: Enhance Flink's monitoring capabilities

I have created an issue for the related dataflow statistics tracking
feature here:

https://issues.apache.org/jira/browse/FLINK-1297

FLINK-456 seems to overlap with what I described. I suggest either having
three separate issues or at least resolving FLINK-1297 and FLINK-456 in
three stages:

1. agree upon a design and implement the basic service architecture and the
model;
2. implement dataflow statistics tracking on top of (1): min, max, count,
count distinct;
3. implement runtime statistics tracking on top of (1): CPU, I/O load.
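
To make stage (2) a bit more concrete, here is a minimal sketch of a per-field
accumulator for min, max, count, and count distinct. The class name
FieldStatistics and its methods are hypothetical; nothing like this exists in
Flink or in FLINK-1297 yet. It assumes long-valued fields and tracks distinct
values exactly in a HashSet, which would probably have to be replaced by an
approximate sketch for large inputs.

```java
import java.util.HashSet;
import java.util.Set;

/** Hypothetical accumulator for one field; not an existing Flink class. */
final class FieldStatistics {
    private long count;
    private long min = Long.MAX_VALUE;
    private long max = Long.MIN_VALUE;
    // Exact distinct tracking: memory grows with the number of distinct values.
    private final Set<Long> distinct = new HashSet<>();

    /** Records one value seen by a task. */
    void add(long value) {
        count++;
        min = Math.min(min, value);
        max = Math.max(max, value);
        distinct.add(value);
    }

    /** Folds in the partial statistics collected by another parallel task. */
    void merge(FieldStatistics other) {
        count += other.count;
        min = Math.min(min, other.min);
        max = Math.max(max, other.max);
        distinct.addAll(other.distinct);
    }

    long getCount() { return count; }
    long getMin() { return min; }
    long getMax() { return max; }
    long getCountDistinct() { return distinct.size(); }

    public static void main(String[] args) {
        FieldStatistics task1 = new FieldStatistics();
        FieldStatistics task2 = new FieldStatistics();
        for (long v : new long[] {3, 7, 7}) task1.add(v);
        for (long v : new long[] {1, 7}) task2.add(v);
        task1.merge(task2); // combine the two partial results
        System.out.println("count=" + task1.getCount()
                + " min=" + task1.getMin()
                + " max=" + task1.getMax()
                + " distinct=" + task1.getCountDistinct());
    }
}
```

The merge method is the part that matters for the design: each parallel task
could keep its own accumulator and the partial results could be combined
wherever the statistics are collected.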

It makes sense to write a design document (probably Markdown) with some
figures to agree on the scope and implementation aspects of (1), as Henry
proposed in the "Statistics collection for optimization" thread, before we
start with the actual implementation.

Robert's prototype branch (
https://github.com/rmetzger/incubator-flink/tree/flink456) on top of the
latest version of Till's Akka rework seems to be a good starting point to
fork for the actual work on (1). I suggest that after that we somehow
divide and conquer (2) and (3).
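
As a rough illustration of what the basic model in (1) could look like, here is
a small sketch in the spirit of the prototype: each TaskManager captures a
snapshot, and the JobManager side keeps only the latest snapshot per
TaskManager. The names MetricsSnapshot and LatestMetricsStore are hypothetical
and are not taken from Robert's branch; the transport (heartbeat message) is
left out, and the values come from the standard java.lang.management beans.

```java
import java.lang.management.ManagementFactory;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical snapshot a TaskManager could attach to its heartbeat. */
final class MetricsSnapshot {
    final String taskManagerId;
    final long timestampMillis;
    final double systemLoadAverage; // negative if the platform cannot report it
    final long heapUsedBytes;

    MetricsSnapshot(String taskManagerId, long timestampMillis,
                    double systemLoadAverage, long heapUsedBytes) {
        this.taskManagerId = taskManagerId;
        this.timestampMillis = timestampMillis;
        this.systemLoadAverage = systemLoadAverage;
        this.heapUsedBytes = heapUsedBytes;
    }

    /** Reads the current values from the JVM's platform MXBeans. */
    static MetricsSnapshot capture(String taskManagerId) {
        return new MetricsSnapshot(
                taskManagerId,
                System.currentTimeMillis(),
                ManagementFactory.getOperatingSystemMXBean().getSystemLoadAverage(),
                ManagementFactory.getMemoryMXBean().getHeapMemoryUsage().getUsed());
    }
}

/** Hypothetical JobManager-side store: last write wins, no history is kept. */
final class LatestMetricsStore {
    private final Map<String, MetricsSnapshot> latest = new ConcurrentHashMap<>();

    /** Called whenever a heartbeat carrying a snapshot arrives. */
    void onHeartbeat(MetricsSnapshot snapshot) {
        latest.put(snapshot.taskManagerId, snapshot);
    }

    /** Returns the most recently stored snapshot, or null if none arrived yet. */
    MetricsSnapshot latestFor(String taskManagerId) {
        return latest.get(taskManagerId);
    }

    int size() {
        return latest.size();
    }

    public static void main(String[] args) {
        LatestMetricsStore store = new LatestMetricsStore();
        store.onHeartbeat(MetricsSnapshot.capture("tm-1"));
        MetricsSnapshot s = store.latestFor("tm-1");
        System.out.println(s.taskManagerId + " heapUsedBytes=" + s.heapUsedBytes);
    }
}
```

Keeping a short history per TaskManager instead of only the latest snapshot
would be one of the points to settle in the design document.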

Regards,
Alexander

---------- Forwarded message ----------
From: Henry Saputra <[hidden email]>
Date: 2014-12-12 6:18 GMT+01:00
Subject: Re: Enhance Flink's monitoring capabilities
To: "[hidden email]" <[hidden email]>

Thanks Robert, looks like we could use this JIRA to do the work

- Henry
