Adding custom monitoring to Flink


Maxim
Hi!
I'm looking into integrating Flink into our stack, and one of the
requirements is to report metrics to an internal system. The current
Accumulators do not provide the visibility we need to run such a
system in production. We want much more information about the internal
cluster state and the ability to calculate aggregates ourselves. That
system's core reporting API accepts a metric name, a metric type (gauge,
counter, timer) and a set of key-value pairs that act as dimensions.

The ideal solution for us would report the metrics through such an API
and provide a default binding to the existing Accumulators, but allow
overriding it with our internal reporting client.
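
As a rough illustration of the API shape described above, here is a minimal
Java sketch. Every type and method name in it (MetricType, MetricReporter,
InMemoryReporter, Metrics) is hypothetical and not part of Flink, and the
explicit numeric value parameter is an assumption, since the description only
lists name, type and dimensions. The in-memory reporter merely stands in for
a default binding; it does not talk to Flink's Accumulators.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Objects;
import java.util.TreeMap;

// Hypothetical names throughout: none of these types exist in Flink.
enum MetricType { GAUGE, COUNTER, TIMER }

/** Reporting API as described: metric name, metric type, and key-value dimensions. */
interface MetricReporter {
    // The value parameter is an assumption added for this sketch.
    void report(String name, MetricType type, double value, Map<String, String> dimensions);
}

/** Stand-in for a default binding; it only keeps formatted records in memory. */
final class InMemoryReporter implements MetricReporter {
    final List<String> records = new ArrayList<>();

    @Override
    public void report(String name, MetricType type, double value, Map<String, String> dimensions) {
        // A TreeMap gives the dimensions a stable order in the formatted record.
        records.add(type + " " + name + "=" + value + " " + new TreeMap<String, String>(dimensions));
    }
}

/** Holds the active reporter so the default can be replaced by an internal client. */
final class Metrics {
    private static volatile MetricReporter reporter = new InMemoryReporter();

    static void setReporter(MetricReporter r) {
        reporter = Objects.requireNonNull(r);
    }

    static void counter(String name, long delta, Map<String, String> dimensions) {
        reporter.report(name, MetricType.COUNTER, delta, dimensions);
    }
}
```

Overriding the default is then a single call such as
Metrics.setReporter(new OurInternalReporter()), where OurInternalReporter is
again a hypothetical class that implements MetricReporter.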

Is this something that could be added to Flink, or are there other plans
for monitoring?

Thanks!

Maxim.

Re: Adding custom monitoring to Flink

Chesnay Schepler-3
I'm currently working on a metric system that
a) exposes several TaskManager metrics
b) allows gathering metrics in various parts of a task, most notably
user-defined functions.

The first version makes these metrics available via JMX on each
TaskManager.
While a mechanism to make that pluggable is /planned/ there are no
details on that yet.
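
For readers unfamiliar with JMX, the sketch below shows how a single
attribute can be read from an MBean server in Java. The thread does not say
under which bean or attribute names the metrics will be registered, so the
usage example falls back on the JVM's built-in Runtime bean as a stand-in.
JmxPoller is a hypothetical helper, and it only reads from the MBean server
of the JVM it runs in; reading from a separate TaskManager process would need
a remote JMX connection, which is not shown.

```java
import java.lang.management.ManagementFactory;
import javax.management.MBeanServer;
import javax.management.ObjectName;

// Hypothetical helper, not part of Flink.
final class JmxPoller {
    /** Reads one attribute of one MBean from the local platform MBean server. */
    static Object read(String objectName, String attribute) {
        try {
            MBeanServer server = ManagementFactory.getPlatformMBeanServer();
            return server.getAttribute(new ObjectName(objectName), attribute);
        } catch (Exception e) {
            // Wrap the checked JMX exceptions so callers can poll in a simple loop.
            throw new IllegalStateException("cannot read " + objectName + "#" + attribute, e);
        }
    }
}
```

For example, JmxPoller.read("java.lang:type=Runtime", "Uptime") returns the
JVM uptime in milliseconds as a Long; a forwarding loop would call read
periodically and hand the value to the internal reporting client.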

I /guess/ that once it is merged you should be able to modify one of the
classes so that the data is exported directly to your tool, but I would
have to know more about it to make a definite assessment.

There are no plans to funnel all those metrics unaggregated through
Flink's accumulator mechanism; only a selection will go through it,
aggregated locally and on the JobManager, for display in the Dashboard.

Out of curiosity, what metrics are you interested in?

On 14.04.2016 20:59, Maxim wrote:

> Hi!
> I'm looking into integrating Flink into our stack, and one of the
> requirements is to report metrics to an internal system. The current
> Accumulators do not provide the visibility we need to run such a
> system in production. We want much more information about the internal
> cluster state and the ability to calculate aggregates ourselves. That
> system's core reporting API accepts a metric name, a metric type (gauge,
> counter, timer) and a set of key-value pairs that act as dimensions.
>
> The ideal solution for us would report the metrics through such an API
> and provide a default binding to the existing Accumulators, but allow
> overriding it with our internal reporting client.
>
> Is this something that could be added to Flink, or are there other plans
> for monitoring?
>
> Thanks!
>
> Maxim.
>


Re: Adding custom monitoring to Flink

Maxim
I don't have a full list of metrics, but we are interested in everything
related to runtime performance and possible bottlenecks of the system:
all interprocess communication counters, errors, latencies, checkpoint
sizes and checkpointing latencies, buffer allocations and releases, etc.
Since we aggregate ourselves, we can produce multiple views of the same
metric: min, max, tp99, tp99.9, top n, etc.
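
To make the point about multiple views concrete, here is a small Java sketch
that derives min, max and a percentile such as tp99 from the same raw
samples. The Aggregates class is hypothetical, and the nearest-rank
percentile definition is an assumption; the thread does not say which
definition the internal system uses.

```java
import java.util.Arrays;

// Hypothetical helper: several aggregate views over the same raw samples.
final class Aggregates {
    static double min(double[] samples) {
        return Arrays.stream(samples).min().orElseThrow(IllegalArgumentException::new);
    }

    static double max(double[] samples) {
        return Arrays.stream(samples).max().orElseThrow(IllegalArgumentException::new);
    }

    /** Nearest-rank percentile: p = 99 gives tp99; p must be in (0, 100]. */
    static double percentile(double[] samples, double p) {
        if (samples.length == 0) {
            throw new IllegalArgumentException("no samples");
        }
        if (p <= 0 || p > 100) {
            throw new IllegalArgumentException("p must be in (0, 100]");
        }
        double[] sorted = samples.clone(); // leave the caller's array untouched
        Arrays.sort(sorted);
        // Rank of the smallest sample with at least p percent of samples at or below it.
        int rank = (int) Math.ceil(p * sorted.length / 100.0);
        return sorted[Math.max(rank, 1) - 1];
    }
}
```

None of these views can be recovered from a value that was already
aggregated upstream, which is why unaggregated samples are needed.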

Could you point to the doc/Jira/diff for your change?


On Thu, Apr 14, 2016 at 12:32 PM, Chesnay Schepler <[hidden email]>
wrote:

> I'm currently working on a metric system that
> a) exposes several TaskManager metrics
> b) allows gathering metrics in various parts of a task, most notably
> user-defined functions.
>
> The first version makes these metrics available via JMX on each
> TaskManager.
> While a mechanism to make that pluggable is /planned/ there are no details
> on that yet.
>
> I /guess/ that once it is merged you should be able to modify one of the
> classes so that the data is exported directly to your tool, but I would
> have to know more about it to make a definite assessment.
>
> There are no plans to funnel all those metrics unaggregated through
> Flink's accumulator mechanism; only a selection will go through it,
> aggregated locally and on the JobManager, for display in the Dashboard.
>
> Out of curiosity, what metrics are you interested in?

Re: Adding custom monitoring to Flink

Till Rohrmann
Hi Maxim,

I think the corresponding JIRA issue is
https://issues.apache.org/jira/browse/FLINK-456

Cheers,
Till

On Thu, Apr 14, 2016 at 10:50 PM, Maxim <[hidden email]> wrote:

> I don't have a full list of metrics, but we are interested in everything
> related to runtime performance and possible bottlenecks of the system:
> all interprocess communication counters, errors, latencies, checkpoint
> sizes and checkpointing latencies, buffer allocations and releases, etc.
> Since we aggregate ourselves, we can produce multiple views of the same
> metric: min, max, tp99, tp99.9, top n, etc.
>
> Could you point to the doc/Jira/diff for your change?