Monitoring backpressure

Monitoring backpressure

Gyula Fóra-2
Hey guys,

Is there any way to monitor the backpressure in the Flink job? I find it
hard to debug slow operators because of the backpressure mechanism so it
would be good to get some info out of the network layer on what exactly
caused the backpressure.

For example:

task1 -> task2 -> task3 -> task4

I want to figure out whether task 2 or task 3 is slow.

Any ideas?

Thanks,
Gyula

Re: Monitoring backpressure

Stephan Ewen
I have discussed this quite a bit with other people.

It is not totally straightforward. One could try and measure exhaustion of
the output buffer pools, but that fluctuates a lot - it would need some
work to get a stable metric from that...

If you have a profiler that you can attach to the processes, you could
check whether a lot of time is spent within the "requestBufferBlocking()"
method of the buffer pool...

Stephan
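
Editor's sketch of the profiler idea above, for cases where no profiler is at hand: take a few thread dumps of the TaskManager process (for example with the JDK's jstack tool) and count how often a task thread sits inside requestBufferBlocking(). This is only a rough approximation under stated assumptions: the helper names blocked_threads and blocked_fraction are hypothetical, the parsing assumes the usual jstack layout (a quoted thread name on the header line, followed by "at ..." frames), and a handful of dumps is a far coarser sample than a real profiler would give.

```python
import re

# Method name taken from the message above; a thread inside it is
# waiting for a free output buffer.
BLOCKING_METHOD = "requestBufferBlocking"


def blocked_threads(thread_dump, method=BLOCKING_METHOD):
    """Return the names of threads whose stack contains `method`.

    Assumes jstack-style text: each thread starts with a line like
    '"name" #12 ...' and its frames follow as 'at pkg.Class.method(...)'.
    """
    blocked = []
    name = None
    for line in thread_dump.splitlines():
        header = re.match(r'^"([^"]*)"', line)
        if header:
            name = header.group(1)
        elif (
            name is not None
            and line.strip().startswith("at ")
            and f".{method}(" in line
            and name not in blocked
        ):
            blocked.append(name)
    return blocked


def blocked_fraction(dumps, thread_name, method=BLOCKING_METHOD):
    """Fraction of the given dumps in which `thread_name` is inside `method`."""
    if not dumps:
        return 0.0
    hits = sum(thread_name in blocked_threads(d, method) for d in dumps)
    return hits / len(dumps)
```

A thread that shows a high fraction is probably being held back by whatever consumes its output, so the slow operator is more likely downstream of it than the thread itself.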


On Mon, Dec 7, 2015 at 9:45 AM, Gyula Fóra <[hidden email]> wrote:

> Hey guys,
>
> Is there any way to monitor the backpressure in the Flink job? I find it
> hard to debug slow operators because of the backpressure mechanism so it
> would be good to get some info out of the network layer on what exactly
> caused the backpressure.
>
> For example:
>
> task1 -> task2 -> task3 -> task4
>
> I want to figure out whether task 2 or task 3 is slow.
>
> Any ideas?
>
> Thanks,
> Gyula
>

Re: Monitoring backpressure

Gyula Fóra
Thanks Stephan,

I will try with the profiler for now.

Gyula

Stephan Ewen <[hidden email]> wrote (on Mon, Dec 7, 2015, 10:51):

> I discussed about this quite a bit with other people.
>
> It is not totally straightforward. One could try and measure exhaustion of
> the output buffer pools, but that fluctuates a lot - it would need some
> work to get a stable metric from that...
>
> If you have a profiler that you can attach to the processes, you could
> check whether a lot of time is spent within the "requestBufferBlocking()"
> method of the buffer pool...
>
> Stephan

Re: Monitoring backpressure

alan@opsclarity.com
In reply to this post by Stephan Ewen
Hey Stephan,

My company (OpsClarity) is building a monitoring integration for Flink, and since backpressure is one of the most critical concepts in a streaming system, we need a way to expose backpressure state to a monitoring system (such as ours). I see that the Flink web UI has a way to sample the pipeline and mark a stage as high|med|ok with respect to backpressure. I'd love to be able to encode that as a metric, perhaps as simple as 2|1|0, so that we can plot backpressure state over time per stage. This would also allow users to set alerts on backpressure state.

How might we get access to this backpressure state information?

Thanks,
Alan

Re: Monitoring backpressure

Chesnay Schepler-3
Hello Alan,

The backpressure information can be retrieved from the web UI's REST API
<https://ci.apache.org/projects/flink/flink-docs-release-1.2/monitoring/rest_api.html>:

    /jobs/<jobid>/vertices/<vertexid>/backpressure

This will give you a JSON object that looks something like this:

    {
        status: "ok"|"deprecated"
        backpressure-level: "ok"|"low"|"high"
        end-timestamp: <timestamp>
        subtasks: [
            {
                subtask: 0
                backpressure-level: "ok"|"low"|"high"
                ratio: <ratio>
            },
        ]
    }

For more details you can check out the JobVertexBackPressureHandler class.

Regards,
Chesnay
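
Editor's sketch, tying the response above back to the 2|1|0 encoding asked for earlier in the thread. It assumes the response has the fields listed above; the function names encode_backpressure and first_slow_task are hypothetical, and treating any status other than "ok" as unusable is an assumption rather than documented behaviour. The second helper is a heuristic for a linear pipeline such as task1 -> task2 -> task3 -> task4: a backpressured task is waiting on its consumer, so the first task after the last backpressured one is the likely slow one.

```python
import json

# Numeric encoding suggested in the thread: high = 2, low = 1, ok = 0.
LEVEL_TO_METRIC = {"ok": 0, "low": 1, "high": 2}


def encode_backpressure(response_text):
    """Turn one backpressure response into plottable numbers.

    Returns None when the status is not "ok" (assumption: such a
    sample should not be reported as a metric).
    """
    data = json.loads(response_text)
    if data.get("status") != "ok":
        return None
    return {
        "vertex": LEVEL_TO_METRIC[data["backpressure-level"]],
        "subtasks": {
            s["subtask"]: LEVEL_TO_METRIC[s["backpressure-level"]]
            for s in data.get("subtasks", [])
        },
    }


def first_slow_task(levels):
    """Given [(task_name, metric)] in pipeline order, return the task
    directly after the last backpressured one, or None if no task is
    backpressured or the last backpressured task has no successor."""
    last = None
    for i, (_, metric) in enumerate(levels):
        if metric > 0:
            last = i
    if last is None or last + 1 >= len(levels):
        return None
    return levels[last + 1][0]
```

A monitoring agent could call the endpoint once per vertex on a fixed interval, feed each response body to encode_backpressure, and emit the resulting numbers as gauges.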
On 07.12.2016 00:58, [hidden email] wrote:

> Hey Stephan,
>
> My company (OpsClarity) is building monitoring integration for flink, and
> being that backpressure is one of the most critical concepts in a streaming
> system, we need a way to expose backpressure state to a monitoring system
> (such as ours).  I see that the flink-ui has a way to sample the pipeline
> and mark a stage as high|med|ok wrt backpressure.  I'd love to be able to
> encode that as a metric, perhaps as simple as 2|1|0, so that we can plot
> backpressure state over time per stage.  This would also allow users to set
> alerts on backpressure state.
>
> How might we get access to this backpressure state information?
>
> Thanks,
> Alan