Decompose failure recovery time

Decompose failure recovery time

Zhinan Cheng-2
Hi all,

I am working on measuring the failure recovery time of Flink, and I
want to decompose the recovery time into different parts, say the time
to detect the failure, the time to restart the job, and the time to
restore from the checkpoint.

Unfortunately, I cannot find any information about this in the Flink
docs. Does Flink provide any way to do this? If not, how can I measure
it?

Thanks a lot for your help.

Regards,
Juno

Re: Decompose failure recovery time

Piotr Nowojski-5
Hi,

> I want to decompose the recovery time into different parts, say
> (1) the time to detect the failure,
> (2) the time to restart the job,
> (3) and the time to restore from the checkpoint.

1. Maybe I'm missing something, but as far as I can tell, Flink cannot
help you with that. The time to detect the failure would be the time
between when the failure occurred and when the JobManager realised it
had happened. If we could reliably measure when the failure occurred,
we could trigger failover immediately; you are interested in this
exactly because there is no reliable way to detect the failure
immediately. You could approximate this by analysing the logs.

2. Maybe there are some metrics that you could use; if not, you can use
the REST API [1] to monitor the job status. Again, you could also do it
by analysing the logs.

3. In the future this might be measurable via the REST API (similarly
to point 2), but currently there is no way to do it that way. There is
a ticket for that [2]. For now, I think the only way is to analyse the
logs.

If you just need to do this once, I would analyse the logs manually. If
you want to do it many times or monitor this continuously, I would
write a simple script (Python?) that combines REST API calls for point
2 with log analysis.
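
For point 2, a minimal sketch of such a script could look like the one
below. It simply polls the jobs overview endpoint and timestamps every
status transition; the localhost:8081 address and the assumption that
only a single job is running are mine, so adjust them to your setup.

import json
import time
import urllib.request

# Assumption: REST endpoint of a default standalone JobManager.
REST_URL = "http://localhost:8081"

def first_job_status():
    # GET /jobs lists the id and current status of every job.
    with urllib.request.urlopen(REST_URL + "/jobs") as resp:
        job = json.load(resp)["jobs"][0]
    return job["id"], job["status"]

last_status = None
while True:
    _, status = first_job_status()
    if status != last_status:
        # Timestamp every transition (e.g. RUNNING -> RESTARTING ->
        # RUNNING); the gap between leaving and re-entering RUNNING
        # approximates the restart time from point 2.
        print(f"{time.time():.3f} job status: {status}")
        last_status = status
    time.sleep(0.1)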

Piotrek


[1]
https://ci.apache.org/projects/flink/flink-docs-stable/monitoring/rest_api.html#jobs
[2] https://issues.apache.org/jira/browse/FLINK-17012

Re: Decompose failure recovery time

Zhinan Cheng-2
Hi Piotr,

Thanks a lot for your help.
Yes, I finally realized that I can only approximate the time for [1]
and [3], and measure [2] by monitoring the uptime and downtime metrics
provided by Flink.
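
A minimal sketch of how I fetch those two metrics through the REST API
(the localhost:8081 address and taking the first job returned by /jobs
are assumptions about my local setup):

import json
import urllib.request

REST_URL = "http://localhost:8081"

# Look up the id of the (single) running job.
with urllib.request.urlopen(REST_URL + "/jobs") as resp:
    job_id = json.load(resp)["jobs"][0]["id"]

# Job-scoped metrics endpoint; ?get= selects the metrics of interest.
url = f"{REST_URL}/jobs/{job_id}/metrics?get=uptime,downtime"
with urllib.request.urlopen(url) as resp:
    metrics = {m["id"]: m["value"] for m in json.load(resp)}

# uptime: ms the job has been running without interruption;
# downtime: ms spent in the current failing/recovering situation
# (0 while the job is running).
print("uptime (ms):  ", metrics.get("uptime"))
print("downtime (ms):", metrics.get("downtime"))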

And now my problem is that the time for [2] can be up to 40s, and I
wonder why it takes so long to restart the job. The log actually shows
that switching all operator instances from CANCELING to CANCELED takes
around 30s. Do you have any ideas about this?

Many thanks.

Regards,
Zhinan

Re: Decompose failure recovery time

Piotr Nowojski-5
Hi Zhinan,

It's hard to say, but my guess is that it takes that long for the tasks
to respond to cancellation, which consists of a couple of steps. If a
task is currently busy processing something, it has to respond to
interruption (`java.lang.Thread#interrupt`). If it takes 30 seconds for
a task to react to the interruption and clean up its resources, that
can cause problems, and there is very little that Flink can do about
it.

If you want to debug it further, I would suggest collecting stack
traces during cancellation (or even better: profiling the code during
cancellation). This would help you answer the question of what the task
threads are busy with.
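
A rough sketch of collecting such stack traces with the JDK's jstack
tool, in case it helps (the TaskManager PID below is a placeholder, you
can find the real one with jps; any profiler would work as well):

import subprocess
import time
from datetime import datetime

# Placeholder: the PID of the TaskManager JVM, e.g. found via `jps`.
TASKMANAGER_PID = 12345

# Dump the stack traces once a second for ~30s while the job cancels.
for i in range(30):
    dump = subprocess.run(
        ["jstack", str(TASKMANAGER_PID)],
        capture_output=True, text=True, check=True,
    ).stdout
    with open(f"jstack-{datetime.now():%H%M%S}-{i:02d}.txt", "w") as f:
        f.write(dump)
    time.sleep(1)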

Probably not a solution, but I'm mentioning it just in case: you can
shorten the `task.cancellation.timeout` period. By default it's 180s;
after that, the whole TaskManager will be killed. If you have spare
TaskManagers, or you can restart them very quickly, lowering this
timeout might help to some extent (in exchange for a dirty shutdown,
without cleaning up the resources).
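
If I remember correctly the value is configured in milliseconds, so
lowering it to e.g. 30 seconds in flink-conf.yaml would look roughly
like this:

# flink-conf.yaml (sketch; value in milliseconds, default 180000)
task.cancellation.timeout: 30000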

Piotrek

Re: Decompose failure recovery time

Zhinan Cheng-2
Hi Piotr,

Thanks a lot.
I will try your suggestion to see what happens.

Regards,
Zhinan
