(DEPRECATED) Apache Flink Mailing List archive.

Statistics collection for optimization

aalexandrov

Statistics collection for optimization

Just a quick shout to check whether somebody is already working on a
statistics collection component?

If yes, can you point me to previous discussions in the mailing list and a
WIP branch -- I want to bring myself up to date with the ongoing efforts.

If not, I would like to start working on that component and ideally
integrate some parts of it in the 0.8 release.

Cheers!

Ufuk Celebi-2

Re: Statistics collection for optimization

Very nice to hear :)

See this thread:
http://apache-flink-incubator-mailing-list-archive.1008284.n3.nabble.com/Enhance-Flink-s-monitoring-capabilities-td2573.html

On Tue, Dec 2, 2014 at 2:00 PM, Alexander Alexandrov <
[hidden email]> wrote:

> Just a quick shout to check whether somebody is already working on a
> statistics collection component?
>
> If yes, can you point me to previous discussions in the mailing list and a
> WIP branch -- I want to bring myself up to date with the ongoing efforts.
>
> If not, I would like to start working on that component and ideally
> integrate some parts of it in the 0.8 release.
>
> Cheers!
>

Kostas Tzoumas-2

Re: Statistics collection for optimization

From the status of that thread and absence of a JIRA (as far as I could
tell), I would suggest that you start working on this and announce it on
the other thread, perhaps Nils would be interested in jumping in.

On Tue, Dec 2, 2014 at 2:06 PM, Ufuk Celebi <[hidden email]> wrote:

> Very nice to hear :)
>
> See this thread:
>
> http://apache-flink-incubator-mailing-list-archive.1008284.n3.nabble.com/Enhance-Flink-s-monitoring-capabilities-td2573.html

Robert Metzger

Re: Statistics collection for optimization

The thread mentioned by Ufuk is an ongoing discussion; that's why there is
no JIRA yet.
To my understanding, it's a student doing a project on Flink.

Also, I would like to give you the same advice I already gave to Nils: I
would highly recommend using Till's Akka branch for starting to work on
that.

On Tue, Dec 2, 2014 at 2:12 PM, Kostas Tzoumas <[hidden email]> wrote:

> From the status of that thread and absence of a JIRA (as far as I could
> tell), I would suggest that you start working on this and announce it on
> the other thread, perhaps Nils would be interested in jumping in.

aalexandrov

Re: Statistics collection for optimization

In reply to this post by Kostas Tzoumas-2

I checked the thread. I am not sure whether this is aligned with what I
want to contribute.

The discussion in the other thread seems to be going in the direction of
general-purpose monitoring (you are talking about Disk + Network IO, input
splits).

I would like to have a very thin code base that can be (1) transparently
injected in UDFs (if you can manipulate the AST), or (2) wrapped in identity
mappers (if you cannot) in order to gather collection statistics (min, max,
distinct, maybe some histograms) to facilitate incremental optimization.
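
To make the identity-mapper variant concrete, here is a minimal sketch in Java. It is not Flink code: `StatsCollectingMapper` is a hypothetical name, `java.util.function.Function` stands in for a map UDF, and the distinct count is tracked exactly in a `HashSet`, which a real implementation would probably replace with a bounded-memory sketch.

```java
import java.util.HashSet;
import java.util.Set;
import java.util.function.Function;

/**
 * Hypothetical identity mapper that records simple statistics about the
 * values passing through it. java.util.function.Function is only a
 * stand-in for a Flink map UDF.
 */
final class StatsCollectingMapper<T extends Comparable<T>> implements Function<T, T> {
    private T min;
    private T max;
    private long count;
    // Exact distinct tracking; memory grows with the number of distinct values.
    private final Set<T> distinct = new HashSet<>();

    @Override
    public T apply(T value) {
        if (min == null || value.compareTo(min) < 0) {
            min = value;
        }
        if (max == null || value.compareTo(max) > 0) {
            max = value;
        }
        distinct.add(value);
        count++;
        return value; // identity: the record is forwarded unchanged
    }

    public T getMin() { return min; }
    public T getMax() { return max; }
    public long getCount() { return count; }
    public long getDistinctCount() { return distinct.size(); }
}
```

Feeding it the values 3, 1, 2, 3 yields min 1, max 3, a count of 4, and 3 distinct values, while the data itself passes through untouched.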

I agree that this should be based on existing infrastructure (Akka) and
should not be over-engineered.

I will announce this in the other thread and create a JIRA ticket to fix
the parameters of what has to be done and the best way to implement it with
the other contributors.

2014-12-02 14:12 GMT+01:00 Kostas Tzoumas <[hidden email]>:

Robert Metzger

Re: Statistics collection for optimization

Yes. I also got the impression that you are looking for something slightly
different.

It is probably easier for you right now to "hack" something into the system
to get these statistics.

On Tue, Dec 2, 2014 at 2:25 PM, Alexander Alexandrov <
[hidden email]> wrote:

> I checked the thread. I am not sure whether this is aligned with what I
> want to contribute.
>
> The discussion in the other thread seems to be going in the direction of
> general-purpose monitoring (you are talking about Disk + Network IO, input
> splits).
>
> I would like to have a very thin code base that can be (1) transparently
> injected in UDFs (if you can manipulate the AST), or (2) wrapped in identity
> mappers (if you cannot) in order to gather collection statistics (min, max,
> distinct, maybe some histograms) to facilitate incremental optimization.
>
> I agree that this should be based on existing infrastructure (Akka) and
> should not be over-engineered.
>
> I will announce this in the other thread and create a JIRA ticket to fix
> the parameters of what has to be done and the best way to implement it with
> the other contributors.

Ufuk Celebi-2

Re: Statistics collection for optimization

Have you also thought about adding the statistics collection with the
writers, i.e. the collector or record writer?

If all you care about is the data that the user emits from her code, that
should be fine.
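
A rough sketch of the writer-side variant, in Java: a wrapper that counts every record the user code emits before handing it on. The nested `Collector` interface is a local stand-in modeled on Flink's collector (assumed here to offer just `collect` and `close`); `CountingCollector` and `ListCollector` are hypothetical names used only for illustration.

```java
import java.util.ArrayList;
import java.util.List;

final class CountingCollectorSketch {

    /** Local stand-in for the runtime's collector interface (assumption). */
    public interface Collector<T> {
        void collect(T record);
        void close();
    }

    /** Wraps another collector and counts the records emitted through it. */
    public static final class CountingCollector<T> implements Collector<T> {
        private final Collector<T> delegate;
        private long recordCount;

        public CountingCollector(Collector<T> delegate) {
            this.delegate = delegate;
        }

        @Override
        public void collect(T record) {
            recordCount++;            // gather the statistic first
            delegate.collect(record); // then forward the record unchanged
        }

        @Override
        public void close() {
            delegate.close();
        }

        public long getRecordCount() {
            return recordCount;
        }
    }

    /** Trivial sink that keeps records in memory, for demonstration only. */
    public static final class ListCollector<T> implements Collector<T> {
        private final List<T> records = new ArrayList<>();

        @Override
        public void collect(T record) {
            records.add(record);
        }

        @Override
        public void close() {
            // nothing to release in this in-memory sink
        }

        public List<T> records() {
            return records;
        }
    }
}
```

Because the wrapper sits behind the collector interface, the user function would not need to change; it only ever sees a collector.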

On Tue, Dec 2, 2014 at 2:33 PM, Robert Metzger <[hidden email]> wrote:

> Yes. I also got the impression that you are looking for something slightly
> different.
>
> It is probably easier for you right now to "hack" something into the system
> to get these statistics.

aalexandrov

Re: Statistics collection for optimization

This is another way to do it.

I just created a JIRA issue for that:

https://issues.apache.org/jira/browse/FLINK-1297

If you can give me some pointers and suggest implementation strategies I
can try to prototype something in a feature branch over the weekend and
share it for review.

2014-12-02 14:43 GMT+01:00 Ufuk Celebi <[hidden email]>:

> Have you also thought about adding the statistics collection with the
> writers, i.e. the collector or record writer?
>
> If all you care about is the data that the user emits from her code, that
> should be fine.

Fabian Hueske

Re: Statistics collection for optimization

I see mainly two use cases to locally collect data on TMs (TaskManagers) and
send it to (and aggregate it on) the JM (JobManager).

1) Monitoring of the system and running jobs: This might include system
stats (CPU, disk usage, network traffic & buffer usage, internal memory
utilization, ...) but also progress information (number of processed
elements, histogram of UDF in/out ratio, UDF exec times, etc.).
2) Statistics collection for optimization: Stats would include key counts &
distributions, record count & sizes, UDF stats (in/out ratio, exec times,
...). Depending on the expertise of the user, this information could also
be valuable monitoring information.

In both cases, we need a service to ship collected data from the TMs to the
JM and aggregate and store it there.
Once this service is in place, the collection of metrics could be
independently implemented.
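
The aggregation half of such a service could look roughly like the Java sketch below. It only shows how partial statistics reported by several TMs might be merged on the JM; the transport (for example Akka messages) is left out entirely, and `PartialStats` and `JobManagerStatsStore` are hypothetical names, not existing Flink classes.

```java
import java.util.HashMap;
import java.util.Map;

final class StatsAggregationSketch {

    /** Partial statistics for one operator, as reported by a single TM. */
    public static final class PartialStats {
        public final long count;
        public final long min;
        public final long max;

        public PartialStats(long count, long min, long max) {
            this.count = count;
            this.min = min;
            this.max = max;
        }

        /** Combines two partial results; the operation is associative and commutative. */
        public PartialStats merge(PartialStats other) {
            return new PartialStats(
                    count + other.count,
                    Math.min(min, other.min),
                    Math.max(max, other.max));
        }
    }

    /** JM-side store that keeps one merged result per operator name. */
    public static final class JobManagerStatsStore {
        private final Map<String, PartialStats> byOperator = new HashMap<>();

        /** Called whenever a report from some TM arrives. */
        public synchronized void report(String operator, PartialStats stats) {
            byOperator.merge(operator, stats, PartialStats::merge);
        }

        /** Returns the merged statistics, or null if nothing was reported yet. */
        public synchronized PartialStats get(String operator) {
            return byOperator.get(operator);
        }
    }
}
```

Keeping the merge associative and commutative means reports can arrive in any order and from any number of TMs, which is what lets the collection of individual metrics be implemented independently of the shipping service.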

2014-12-02 14:57 GMT+01:00 Alexander Alexandrov <
[hidden email]>:

> This is another way to do it.
>
> I just created a JIRA issue for that:
>
> https://issues.apache.org/jira/browse/FLINK-1297
>
> If you can give me some pointers and suggest implementation strategies I
> can try to prototype something in a feature branch over the weekend and
> share it for review.

Márton Balassi

Re: Statistics collection for optimization

It would be nice to have integration with the existing tools, e.g. Ganglia
[1]. These already cover system statistics (CPU, network, I/O, ...), and one
can define custom stats to monitor.
Hadoop is nicely integrated with it.

[1] http://ganglia.sourceforge.net/

On Tue, Dec 2, 2014 at 9:37 PM, Fabian Hueske <[hidden email]> wrote:

> I see mainly two use cases to locally collect data on TMs (TaskManagers) and
> send it to (and aggregate it on) the JM (JobManager).
>
> 1) Monitoring of the system and running jobs: This might include system
> stats (CPU, disk usage, network traffic & buffer usage, internal memory
> utilization, ...) but also progress information (number of processed
> elements, histogram of UDF in/out ratio, UDF exec times, etc.).
> 2) Statistics collection for optimization: Stats would include key counts &
> distributions, record count & sizes, UDF stats (in/out ratio, exec times,
> ...). Depending on the expertise of the user, this information could also
> be valuable monitoring information.
>
> In both cases, we need a service to ship collected data from the TMs to the
> JM and aggregate and store it there.
> Once this service is in place, the collection of metrics could be
> independently implemented.

Robert Metzger

Re: Statistics collection for optimization

Fabian told me that he recently looked into the Java JMX infrastructure.
Once we have the metrics collection integrated into the JobManager, we can
expose these numbers via a JMX service by the JM. Ganglia (and other tools)
can connect to the JMX interface to retrieve metrics.
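
As an illustration of the JMX step only, the Java sketch below registers a metrics object with the platform MBean server and reads one attribute back the way a JMX client would. The class names, the `org.example.flink` domain, and the `RecordCount` attribute are all made up for this example; only the `java.lang.management` and `javax.management` calls are standard JDK API.

```java
import java.lang.management.ManagementFactory;
import java.util.concurrent.atomic.AtomicLong;
import javax.management.JMException;
import javax.management.ObjectName;

final class JmxMetricsSketch {

    /** Standard MBean interface: must be public and named <ImplClass>MBean. */
    public interface JobStatsMBean {
        long getRecordCount(); // exposed as the JMX attribute "RecordCount"
    }

    /** Hypothetical per-job metrics holder that the JM would update. */
    public static final class JobStats implements JobStatsMBean {
        private final AtomicLong recordCount = new AtomicLong();

        public void add(long records) {
            recordCount.addAndGet(records);
        }

        @Override
        public long getRecordCount() {
            return recordCount.get();
        }
    }

    /** Registers the stats under an illustrative object name and returns that name. */
    public static ObjectName register(JobStats stats, String jobName) {
        try {
            ObjectName name = new ObjectName(
                    "org.example.flink:type=JobStats,job=" + ObjectName.quote(jobName));
            ManagementFactory.getPlatformMBeanServer().registerMBean(stats, name);
            return name;
        } catch (JMException e) {
            throw new IllegalStateException("could not register MBean", e);
        }
    }

    /** Reads an attribute through the MBean server, as an external client would. */
    public static Object read(ObjectName name, String attribute) {
        try {
            return ManagementFactory.getPlatformMBeanServer().getAttribute(name, attribute);
        } catch (JMException e) {
            throw new IllegalStateException("could not read attribute", e);
        }
    }
}
```

Registering a second object under the same job name would fail with an `InstanceAlreadyExistsException` (wrapped here), so a real integration would need a policy for unregistering finished jobs.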

But I think the first step is certainly to get the infrastructure inside our
system in place.

On Tue, Dec 2, 2014 at 10:28 PM, Márton Balassi <[hidden email]>
wrote:

> It would be nice to have integration with the existing tools, e.g. Ganglia
> [1]. These already cover system statistics (CPU, network, I/O, ...), and one
> can define custom stats to monitor.
> Hadoop is nicely integrated with it.
>
> [1] http://ganglia.sourceforge.net/

Henry Saputra

Re: Statistics collection for optimization

In reply to this post by aalexandrov

Hi Guys,

I finally have time to look at this. Could you create a design
document to share how you would approach this feature?

There are two parts, I guess: one is the metrics collection itself, and the
second is publishing it externally, either via JMX or to an external service.

We could use Metrics [1] aka codahale's metrics library to do it.

- Henry

[1] https://github.com/dropwizard/metrics

On Tue, Dec 2, 2014 at 5:25 AM, Alexander Alexandrov
<[hidden email]> wrote:

> I checked the thread. I am not sure whether this is aligned with what I
> want to contribute.
>
> The discussion in the other thread seems to be going in the direction of
> general-purpose monitoring (you are talking about Disk + Network IO, input
> splits).
>
> I would like to have a very thin code base that can be (1) transparently
> injected in UDFs (if you can manipulate the AST), or (2) wrapped in identity
> mappers (if you cannot) in order to gather collection statistics (min, max,
> distinct, maybe some histograms) to facilitate incremental optimization.
>
> I agree that this should be based on existing infrastructure (Akka) and
> should not be over-engineered.
>
> I will announce this in the other thread and create a JIRA ticket to fix
> the parameters of what has to be done and the best way to implement it with
> the other contributors.