Hello everyone,
I am trying to enhance Flink's monitoring capabilities along the lines of the GSoC 2014 proposal by Rajika Kumarasiri [1].

Short abstract:
He suggested using the Java standard for this, the Java Management Extensions (JMX).
The idea is to put an MBean server in the JobManager, so that the JobManager itself and all TaskManagers in the cluster can register their MBeans with this server via RMI.
Different monitoring levels (none, standard, full) limit the effect on system performance.
The JMX service should be accessible from an improved web component through a RESTful API.
He also suggested using the SIGAR [2] JNI library to gather system information.

In my opinion this last point is debatable. Java 7 already provides Platform MXBeans [3], which cover the basic system information, so in my eyes a JNI library might be overkill. But of course this depends on the intended depth of monitoring.

So the primary question:
Which parameters, system properties, utilization figures, and workloads should be monitored, in your opinion?

Have a nice weekend!
Nils

[1] https://github.com/stratosphere/stratosphere/wiki/GSoC-2014-Project-Proposal-Draft-by-Rajika-Kumarasiri
[2] https://support.hyperic.com/display/SIGAR/Home
[3] https://docs.oracle.com/javase/7/docs/technotes/guides/management/overview.html
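To make the Platform MXBeans point concrete, here is a minimal sketch of what the JDK exposes without any JNI library. It only uses `java.lang.management`; the class name `PlatformStats` and the map keys are made up for illustration and are not part of the proposal or of Flink.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryMXBean;
import java.lang.management.MemoryUsage;
import java.lang.management.OperatingSystemMXBean;
import java.lang.management.ThreadMXBean;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper, not Flink code: reads a few values from the Platform MXBeans.
class PlatformStats {

    /** Collects a small snapshot of basic system information from the running JVM. */
    static Map<String, Number> snapshot() {
        OperatingSystemMXBean os = ManagementFactory.getOperatingSystemMXBean();
        MemoryMXBean memory = ManagementFactory.getMemoryMXBean();
        ThreadMXBean threads = ManagementFactory.getThreadMXBean();
        MemoryUsage heap = memory.getHeapMemoryUsage();

        Map<String, Number> stats = new LinkedHashMap<>();
        stats.put("availableProcessors", os.getAvailableProcessors());
        // Negative on platforms where the load average is not available.
        stats.put("systemLoadAverage", os.getSystemLoadAverage());
        stats.put("heapUsedBytes", heap.getUsed());
        stats.put("heapCommittedBytes", heap.getCommitted());
        stats.put("liveThreads", threads.getThreadCount());
        return stats;
    }

    public static void main(String[] args) {
        snapshot().forEach((name, value) -> System.out.println(name + " = " + value));
    }
}
```

Whether this level of detail is enough, or whether something like SIGAR is needed for per-device disk and network figures, depends on the monitoring depth asked about above.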
Hi Nils,
Flink's current monitoring is quite limited and basically restricted to status updates of the parallel tasks (scheduled, started, finished, canceled, failed, etc.).
There is also some code lying around to collect system stats such as CPU, memory, and network utilization. However, it is not used right now, AFAIK.
For a long-running job, it is hard to figure out what is going on and whether a program is making progress.

Having a monitoring infrastructure that makes it easy to add, collect, and query new metrics would be a great addition to Flink.
From what I know, JMX was explicitly designed for this purpose and seems to be a good fit. Since it is a Java standard, other tools can easily connect and retrieve monitoring data.

As a starting point, I would focus on getting an early prototype that uses JMX to collect a single metric, such as the number of tuples processed by a Map function.
Having such a showcase would help us have a good discussion about how to implement the monitoring infrastructure.
The question of which metrics to collect is orthogonal to that. If we have a good system to collect and gather stats, metrics can be added one by one.

Cheers, Fabian

2014-11-21 18:32 GMT+01:00 Nils E <[hidden email]>:
> [...]
> So the primary question:
> What parameters/system properties/utilizations/work loads should be
> monitored in your opinions?
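A rough sketch of the kind of single-metric JMX prototype suggested above could look like the following. It uses only the standard `javax.management` API; the names `TupleCounterDemo`, `TupleCounter`, and the `org.example.monitoring` object name are assumptions for illustration, and the loop is a stand-in for a real Map function rather than Flink code.

```java
import java.lang.management.ManagementFactory;
import java.util.concurrent.atomic.AtomicLong;
import javax.management.JMException;
import javax.management.MBeanServer;
import javax.management.ObjectName;

// Hypothetical prototype, not Flink code: one counter exposed as a standard MBean.
class TupleCounterDemo {

    /** Management interface; JMX exposes getCount() as the read-only attribute "Count". */
    public interface TupleCounterMBean {
        long getCount();
    }

    /** Thread-safe counter that a map task would increment once per processed tuple. */
    public static class TupleCounter implements TupleCounterMBean {
        private final AtomicLong count = new AtomicLong();

        public void increment() {
            count.incrementAndGet();
        }

        @Override
        public long getCount() {
            return count.get();
        }
    }

    /** Registers a counter, runs a stand-in map over the input, and reads the count back via JMX. */
    static long countThroughJmx(int[] input) {
        MBeanServer server = ManagementFactory.getPlatformMBeanServer();
        try {
            // Made-up naming scheme; a real system would encode job and task ids here.
            ObjectName name = new ObjectName("org.example.monitoring:type=TupleCounter,task=map-0");
            TupleCounter counter = new TupleCounter();
            server.registerMBean(counter, name);
            try {
                int[] output = new int[input.length];
                for (int i = 0; i < input.length; i++) {
                    output[i] = input[i] * 2;   // stand-in for the user's map function
                    counter.increment();
                }
                // Any JMX client (e.g. JConsole) would read the same attribute.
                return (Long) server.getAttribute(name, "Count");
            } finally {
                server.unregisterMBean(name);
            }
        } catch (JMException e) {
            throw new IllegalStateException("JMX operation failed", e);
        }
    }

    public static void main(String[] args) {
        System.out.println("tuples processed: " + countThroughJmx(new int[] {1, 2, 3}));
    }
}
```

The sketch registers with the local platform MBean server only; how TaskManager MBeans would reach a central server in the JobManager is exactly the architectural question left open in this thread.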
On 23 Nov 2014, at 00:03, Fabian Hueske <[hidden email]> wrote:
> [...]
> As a starting point, I would focus to get an early prototype that uses JMX
> to collect a single metric such as number of tuples processed by a Map
> function.
> Having such a showcase, would help to have a good discussion about how to
> implement the monitoring infrastructure.
> The question of metrics to collect is orthogonal to that. If we have a good
> system to collect and gather stats, these can be added one by one.

+1

I don't have experience with JMX, but I agree with Fabian that the architecture of this monitoring service is very important and should come first. It should be flexible enough to easily support the collection of metrics by any operator and by the user.

Every TaskManager needs to expose this service to collect (and aggregate) data, which would then be collected at a central instance (e.g. the JobManager). I am not sure at this point, but it might be worthwhile to think about separating this central monitoring service from the JobManager in order to reduce JobManager load and gain flexibility, e.g. running it as a central history server that monitors multiple JobManager instances (for example in YARN setups).

– Ufuk
Hi Nils,
I'm not sure if it's a good idea to use JMX. I fear that we are overengineering something here for features that we don't really need.
I don't know of any tools that can evaluate this JMX information (I think that's the main argument for using JMX).
Also, this kind of monitoring (connecting from a local client to JVMs running somewhere in a cluster) is often so complicated (firewalls, port forwarding, SSH, ...) that it's probably too impractical. So users will probably only use the web interface.
I think the main reason why we agreed to use JMX for that proposal was that the student who proposed the topic knew the JMX system very well.

What I would do is the following:
a) Work on Till's Akka branch until it's merged (https://github.com/apache/incubator-flink/pull/149). There is no point in changing the current RPC / JobManager / TaskManager infrastructure if we are going to replace it very soon (I think it's a matter of days).
b) Just "piggyback" on the heartbeat that the TaskManagers are sending to the JobManager and include metrics there. Then, collect them at the JobManager and expose them via the web interface.
c) Metrics: I would start with garbage collection statistics. Then:
- "bytes in" per operator
- input splits processed for DataSources
- current iteration number / average iteration time
- disk I/O / network I/O stats

Let me know if you need more information.

Best, Robert

On Sun, Nov 23, 2014 at 11:28 PM, Ufuk Celebi <[hidden email]> wrote:
> [...]
> Every task manager needs expose this service to collect (and aggregate)
> data, which then would be collected at a central instance (e.g. the
> JobManager).
> [...]
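As a sketch of points b) and c) above: garbage collection statistics are available from the JDK's `GarbageCollectorMXBean`s and could be attached to a heartbeat message as a plain map. The `Heartbeat` class and the metric key format below are hypothetical and do not reflect Flink's actual heartbeat message; only the `java.lang.management` calls are real JDK API.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch, not Flink code: GC statistics piggybacked on a heartbeat.
class HeartbeatWithMetrics {

    /** Made-up heartbeat message: the sender's id plus a metrics payload. */
    static final class Heartbeat {
        final String taskManagerId;
        final Map<String, Long> metrics;

        Heartbeat(String taskManagerId, Map<String, Long> metrics) {
            this.taskManagerId = taskManagerId;
            this.metrics = Collections.unmodifiableMap(metrics);
        }
    }

    /** Reads cumulative collection count and time for every collector of this JVM. */
    static Map<String, Long> gcMetrics() {
        Map<String, Long> metrics = new LinkedHashMap<>();
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            // Both getters return -1 if the collector does not report the value.
            metrics.put("gc." + gc.getName() + ".count", gc.getCollectionCount());
            metrics.put("gc." + gc.getName() + ".timeMillis", gc.getCollectionTime());
        }
        return metrics;
    }

    /** Builds the message a TaskManager would send on its regular heartbeat. */
    static Heartbeat createHeartbeat(String taskManagerId) {
        return new Heartbeat(taskManagerId, gcMetrics());
    }

    public static void main(String[] args) {
        Heartbeat heartbeat = createHeartbeat("taskmanager-1");
        System.out.println(heartbeat.taskManagerId + " -> " + heartbeat.metrics);
    }
}
```

The values are cumulative since JVM start, so a receiver that wants rates would have to diff consecutive heartbeats.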
In reply to this post by Nils E
Hello Nils,
I am going to work on a similar issue: tracking some basic statistics of the intermediate results produced by dataflows during execution.

I just created a Jira issue here:

https://issues.apache.org/jira/browse/FLINK-1297

If you already have some work done on extending the monitoring capabilities in a branch, it might be good to sync up the development in order to avoid duplicated work (e.g. by using the same communication channel to send the data from the TaskManagers to the JobManager).
Hey Nils,
I have played around a bit with a little prototype. You can find the code here: https://github.com/rmetzger/incubator-flink/tree/flink456 (it's another branch in my repo).
You can see the changes that I applied on top of Till's Akka branch here: https://github.com/rmetzger/incubator-flink/compare/tillrohrmann:akka_scala...rmetzger:flink456?expand=1

The code collects statistics about each TaskManager in the system. These stats are assembled into a "MetricsReport", which is sent with the periodic heartbeat to the JobManager. The JobManager stores the latest MetricsReport for each TaskManager (in the Instance object for each TM).
When the user accesses the TaskManager overview, the latest MetricsReport is sent as a JSONObject to the browser.

To test my changes, check out the code and build it:

    mvn clean package -DskipTests -Dcheckstyle.skip=true

Go into the build directory:

    cd flink-dist/target/flink-0.8-incubating-SNAPSHOT-bin/flink-0.8-incubating-SNAPSHOT/

and start the web interface:

    ./bin/start-local.sh

Go to localhost:8081; in the "TaskManager" view, you can see some metrics.
Here is a screenshot: http://img42.com/eNPve

I named my branch after this issue, as it probably describes best what we're working on here: FLINK-456 <https://issues.apache.org/jira/browse/FLINK-456>

As I said in the beginning, it's really just a prototype. Let me know if you have any further questions.
For the "per TaskManager" reports, we should probably integrate some more statistics. Also, the presentation of the numbers is very basic right now. I think there are many good libraries for visualizing these kinds of stats.
Also, the numbers currently represent only a "snapshot"; however, some of the numbers can be accumulated (read/write bytes of the I/O manager).
Another missing feature is storing a short history of numbers to visualize metrics over time.

I'm trying to find time to look into "per job" metrics as well. They will require a bit more infrastructure to distinguish them on the JobManager side and to collect them on the TaskManagers.

Best,
Robert

On Tue, Dec 2, 2014 at 2:53 PM, aalexandrov <[hidden email]> wrote:
> [...]
> If you already have some work done on extending the monitoring capabilities
> in a branch, it might be good to sync-up the development in order to avoid
> duplicated work (e.g. using the same communication channel used to send the
> data from the task managers to the job manager).
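The JobManager side of the flow described above (keep only the latest report per TaskManager, hand it to the web interface as JSON) can be sketched roughly as follows. This is not the code from the prototype branch: `LatestReports`, its methods, and the flat string-to-long report format are assumptions for illustration, and the JSON is built by hand under the assumption that metric names contain no characters that need escaping.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.TreeMap;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch, not the prototype's code: latest metrics report per TaskManager.
class LatestReports {

    // Heartbeats may arrive from several threads, so use a concurrent map.
    private final Map<String, Map<String, Long>> latest = new ConcurrentHashMap<>();

    /** Called for each heartbeat; the new report replaces the previous one (no history is kept). */
    void onHeartbeat(String taskManagerId, Map<String, Long> report) {
        // Copy into a sorted map so the JSON output has a stable key order.
        latest.put(taskManagerId, new TreeMap<>(report));
    }

    /** Renders the latest report as a flat JSON object; assumes keys need no escaping. */
    String toJson(String taskManagerId) {
        Map<String, Long> report = latest.get(taskManagerId);
        if (report == null) {
            return "{}";
        }
        StringBuilder json = new StringBuilder("{");
        String separator = "";
        for (Map.Entry<String, Long> entry : report.entrySet()) {
            json.append(separator)
                .append('"').append(entry.getKey()).append("\":")
                .append(entry.getValue());
            separator = ",";
        }
        return json.append('}').toString();
    }

    public static void main(String[] args) {
        LatestReports reports = new LatestReports();
        Map<String, Long> first = new HashMap<>();
        first.put("heapUsedBytes", 10L);
        reports.onHeartbeat("tm-1", first);

        Map<String, Long> second = new HashMap<>();
        second.put("heapUsedBytes", 20L);
        second.put("gcCount", 3L);
        reports.onHeartbeat("tm-1", second);

        // Only the second report survives: {"gcCount":3,"heapUsedBytes":20}
        System.out.println(reports.toJson("tm-1"));
    }
}
```

Replacing the stored report on every heartbeat is what makes the numbers a pure snapshot; the history feature mentioned above would need a bounded list of reports per TaskManager instead of a single entry.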
Hi Robert,
From what I have seen so far, it is probably better and easier for Flink to leverage the Metrics library [1] for metrics collection rather than building something organically.

Several ASF projects like Spark [2] and Tajo [3] have used it with great success.

One of the main reasons is maintainability and the breadth of metric types that could and should be collected.

- Henry

[1] https://dropwizard.github.io/metrics/3.1.0/getting-started/
[2] https://spark.apache.org/docs/1.0.1/monitoring.html
[3] https://issues.apache.org/jira/browse/TAJO-333

On Sat, Dec 6, 2014 at 11:13 AM, Robert Metzger <[hidden email]> wrote:
> [...]
> What the code does is collecting statistics about each TaskManager in the
> system. These stats are assembled into a "MetricsReport" which is send with
> the periodical heartbeat to the JobManager.
> [...]
That actually sounds like a great idea. I discussed a bit with Robert offline on Friday, and it seems that Metrics has most of what we talked about.

I also like the way they make it extensible, so people can capture their own metrics.

On Sun, Dec 7, 2014 at 6:02 AM, Henry Saputra <[hidden email]> wrote:
> From I have seen it so far, it is probably better and easier for Flink
> to leverage metrics library [1] for the metrics collection rather than
> building organically.
> [...]
+1
Its extensibility is one of the reasons it has been used in other projects.

On Sunday, December 7, 2014, Stephan Ewen <[hidden email]> wrote:
> [...]
> I also like the way they make it extensible, so people can capture their
> own metrics.
In reply to this post by Stephan Ewen
Just curious, is there a JIRA issue filed for this, or was it just preliminary proposal talk?

- Henry

On Sun, Dec 7, 2014 at 3:36 PM, Stephan Ewen <[hidden email]> wrote:
> That actually sounds like a great idea. I discussed a bit with Robert
> offline on Friday, and it seems that Metrics has most of what we talked
> about.
> [...]
I think this (very old) issue describes the feature fairly closely:
https://issues.apache.org/jira/browse/FLINK-456

On Thu, Dec 11, 2014 at 8:32 AM, Henry Saputra <[hidden email]> wrote:
> Just curious, is there any JIRA filed for this or was it just in
> preliminary proposal talk?
>
> - Henry
>
> On Sun, Dec 7, 2014 at 3:36 PM, Stephan Ewen <[hidden email]> wrote:
> > That actually sounds like a great idea. I discussed a bit with Robert
> > offline on Friday, and it seems that Metrics has most of what we talked
> > about.
> >
> > I also like the way they make it extensible, so people can capture their
> > own metrics.
> >
> > On Sun, Dec 7, 2014 at 6:02 AM, Henry Saputra <[hidden email]> wrote:
> >> Hi Robert,
> >>
> >> From what I have seen so far, it is probably better and easier for Flink
> >> to leverage the metrics library [1] for the metrics collection rather than
> >> building it organically.
> >>
> >> Several ASF projects like Spark [2] and Tajo [3] have used it with great
> >> success.
> >>
> >> One of the main reasons is maintainability and the breadth of types of
> >> metrics that could and should be collected.
> >>
> >> - Henry
> >>
> >> [1] https://dropwizard.github.io/metrics/3.1.0/getting-started/
> >> [2] https://spark.apache.org/docs/1.0.1/monitoring.html
> >> [3] https://issues.apache.org/jira/browse/TAJO-333
> >>
> >> On Sat, Dec 6, 2014 at 11:13 AM, Robert Metzger <[hidden email]> wrote:
> >> > Hey Nils,
> >> >
> >> > I have played around a bit with a little prototype. You can find the code
> >> > here: https://github.com/rmetzger/incubator-flink/tree/flink456 (it's
> >> > another branch in my repo).
> >> > You can see the changes that I applied on top of Till's Akka branch here:
> >> > https://github.com/rmetzger/incubator-flink/compare/tillrohrmann:akka_scala...rmetzger:flink456?expand=1
> >> >
> >> > What the code does is collect statistics about each TaskManager in the
> >> > system. These stats are assembled into a "MetricsReport" which is sent with
> >> > the periodic heartbeat to the JobManager. The JobManager stores the
> >> > latest MetricsReport for each TaskManager (in the Instance object for each
> >> > TM).
> >> > When the user accesses the TaskManager overview, the latest MetricsReport
> >> > is sent as a JSONObject to the browser.
> >> >
> >> > To test my changes, check out the code and build it:
> >> >     mvn clean package -DskipTests -Dcheckstyle.skip=true
> >> > go into
> >> >     cd flink-dist/target/flink-0.8-incubating-SNAPSHOT-bin/flink-0.8-incubating-SNAPSHOT/
> >> > and start the web interface
> >> >     ./bin/start-local.sh
> >> >
> >> > Go to localhost:8081; in the "TaskManager" view, you can see some metrics.
> >> > Here is a screenshot: http://img42.com/eNPve
> >> >
> >> > I named my branch after this issue, as it probably describes best what
> >> > we're working on here: FLINK-456
> >> > <https://issues.apache.org/jira/browse/FLINK-456>
> >> >
> >> > As I said in the beginning, it's really just a prototype. Let me know if you
> >> > have any further questions.
> >> > For the "per TaskManager" reports, we should probably integrate some more
> >> > statistics. Also, the presentation of the numbers is very very basic right
> >> > now. I think there are many good libraries for visualizing these kinds of
> >> > stats.
> >> > Also, the numbers currently represent only a "snapshot"; however, some of
> >> > the numbers can be accumulated (read/write bytes of the io manager).
> >> > Another missing feature is storing a little history of numbers to visualize
> >> > metrics over time.
> >> >
> >> > I'm trying to find time to look into "per job" metrics as well. They will
> >> > require a bit more infrastructure to distinguish them on the JobManager
> >> > side and to get them on the TaskManagers.
> >> >
> >> > Best,
> >> > Robert
> >> >
> >> > On Tue, Dec 2, 2014 at 2:53 PM, aalexandrov <[hidden email]> wrote:
> >> >> Hello Nils,
> >> >>
> >> >> I am going to work on a similar issue related to tracking some basic
> >> >> statistics of the intermediate results produced by dataflows during
> >> >> execution.
> >> >>
> >> >> I just created a Jira issue here:
> >> >>
> >> >> https://issues.apache.org/jira/browse/FLINK-1297
> >> >>
> >> >> If you already have some work done on extending the monitoring capabilities
> >> >> in a branch, it might be good to sync up the development in order to avoid
> >> >> duplicated work (e.g. using the same communication channel used to send the
> >> >> data from the task managers to the job manager).
|
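For readers who want the flow from Robert's prototype description in concrete form (a TaskManager assembles a report, ships it with the heartbeat, and the JobManager keeps only the latest one per TaskManager), here is a minimal sketch in plain Java. It is not the code from the flink456 branch. The names `HeartbeatMetricsSketch`, `MetricsReport`, `MetricsStore`, `onHeartbeat`, and `latestFor`, as well as the chosen fields, are assumptions made for illustration. The values come from the Platform MXBeans mentioned at the start of the thread, and the hand-written JSON string stands in for whatever JSON library the web interface actually uses.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryMXBean;
import java.lang.management.OperatingSystemMXBean;
import java.util.Locale;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class HeartbeatMetricsSketch {

    /** Immutable snapshot of a few TaskManager stats (hypothetical field set). */
    static final class MetricsReport {
        final long timestampMillis;
        final long heapUsedBytes;
        final double systemLoadAverage; // negative if the platform cannot report it
        final int availableProcessors;

        MetricsReport(long timestampMillis, long heapUsedBytes,
                      double systemLoadAverage, int availableProcessors) {
            this.timestampMillis = timestampMillis;
            this.heapUsedBytes = heapUsedBytes;
            this.systemLoadAverage = systemLoadAverage;
            this.availableProcessors = availableProcessors;
        }

        /** TaskManager side: read a snapshot from the JVM's Platform MXBeans. */
        static MetricsReport collect() {
            MemoryMXBean memory = ManagementFactory.getMemoryMXBean();
            OperatingSystemMXBean os = ManagementFactory.getOperatingSystemMXBean();
            return new MetricsReport(
                    System.currentTimeMillis(),
                    memory.getHeapMemoryUsage().getUsed(),
                    os.getSystemLoadAverage(),
                    os.getAvailableProcessors());
        }

        /** Renders the snapshot as a small JSON object for the web front end. */
        String toJson() {
            return String.format(Locale.ROOT,
                    "{\"timestamp\":%d,\"heapUsed\":%d,\"load\":%.2f,\"processors\":%d}",
                    timestampMillis, heapUsedBytes, systemLoadAverage, availableProcessors);
        }
    }

    /** JobManager side: remembers only the latest report per TaskManager. */
    static final class MetricsStore {
        private final Map<String, MetricsReport> latest = new ConcurrentHashMap<>();

        /** Called whenever a heartbeat carrying a report arrives. */
        void onHeartbeat(String taskManagerId, MetricsReport report) {
            latest.put(taskManagerId, report);
        }

        /** Returns the most recent report, or null if none was received yet. */
        MetricsReport latestFor(String taskManagerId) {
            return latest.get(taskManagerId);
        }
    }

    public static void main(String[] args) {
        MetricsStore store = new MetricsStore();
        // In a real cluster the report would travel inside the heartbeat message.
        store.onHeartbeat("taskmanager-1", MetricsReport.collect());
        System.out.println(store.latestFor("taskmanager-1").toJson());
    }
}
```

Because the store overwrites the previous report on every heartbeat, it reproduces the "snapshot only" limitation noted in the message above. Keeping a bounded list of reports per TaskManager would be one way to add the short history that is described as missing.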
Thanks Robert, looks like we could use this JIRA to do the work

- Henry

On Thu, Dec 11, 2014 at 9:25 AM, Robert Metzger <[hidden email]> wrote:
> I think this (very old) issue is somewhat closely describing the feature:
> https://issues.apache.org/jira/browse/FLINK-456
|
I have created an issue for the related dataflow statistics tracking
feature here: https://issues.apache.org/jira/browse/FLINK-1297

FLINK-456 seems to have some overlap with what I described. I suggest that we
either have three separate issues or at least work on resolving FLINK-1297
and FLINK-456 in three stages:

1. agree upon a design and implement the basic service architecture and the
   model;
2. implement dataflow statistics tracking on top of (1): min, max, count,
   count distinct;
3. implement runtime statistics tracking on top of (1): CPU, I/O load.

It makes sense to have a design document (probably Markdown) with some
figures to agree on the scope and implementation aspects of (1), as Henry
proposed in the "Statistics collection for optimization" thread, before we
start with the actual implementation.

Robert's prototype branch
(https://github.com/rmetzger/incubator-flink/tree/flink456) on top of the
latest version of Till's Akka rework seems to be a good starting point to
fork for the actual work on (1). I suggest that after that we somehow divide
and conquer (2) and (3).

Regards,
Alexander

---------- Forwarded message ----------
From: Henry Saputra <[hidden email]>
Date: 2014-12-12 6:18 GMT+01:00
Subject: Re: Enhance Flink's monitoring capabilities
To: "[hidden email]" <[hidden email]>

Thanks Robert, looks like we could use this JIRA to do the work

- Henry
|
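As an illustration of the dataflow statistics named in stage (2) of Alexander's proposal (min, max, count, count distinct), here is a minimal sketch in plain Java. Nothing in it comes from FLINK-1297 or from any Flink API. The class `LongColumnStats` and its methods are hypothetical, and the assumption that each parallel task keeps partial statistics which are merged later is mine, not something the thread settles.

```java
import java.util.HashSet;
import java.util.Set;

public class DataflowStatsSketch {

    /** Running statistics over long values (hypothetical helper, not a Flink class). */
    static final class LongColumnStats {
        private long count;
        private long min = Long.MAX_VALUE;
        private long max = Long.MIN_VALUE;
        // Exact distinct tracking: memory grows with the number of distinct values.
        private final Set<Long> distinct = new HashSet<>();

        /** Updates all four statistics with one observed value. */
        void add(long value) {
            count++;
            if (value < min) {
                min = value;
            }
            if (value > max) {
                max = value;
            }
            distinct.add(value);
        }

        /** Folds in partial statistics, e.g. from another parallel task. */
        void merge(LongColumnStats other) {
            count += other.count;
            min = Math.min(min, other.min);
            max = Math.max(max, other.max);
            distinct.addAll(other.distinct);
        }

        long getCount() { return count; }

        /** Only meaningful once at least one value has been added. */
        long getMin() { return min; }

        /** Only meaningful once at least one value has been added. */
        long getMax() { return max; }

        long getCountDistinct() { return distinct.size(); }
    }

    public static void main(String[] args) {
        LongColumnStats taskA = new LongColumnStats();
        for (long v : new long[] {3, 7, 3}) {
            taskA.add(v);
        }
        LongColumnStats taskB = new LongColumnStats();
        for (long v : new long[] {10, 7}) {
            taskB.add(v);
        }
        taskA.merge(taskB);
        // prints "count=5 min=3 max=10 distinct=3"
        System.out.println("count=" + taskA.getCount()
                + " min=" + taskA.getMin()
                + " max=" + taskA.getMax()
                + " distinct=" + taskA.getCountDistinct());
    }
}
```

Min, max, and count merge in constant space, which fits shipping partial results from the task managers to the job manager. The exact set used for count distinct does not, so a real implementation might prefer an approximate cardinality sketch such as HyperLogLog; that choice is outside what the thread discusses.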