Total recovery time estimation after checkpoint recovery


Total recovery time estimation after checkpoint recovery

Woods, Jessica Hui
Hi,

I am working with Flink at the moment and am interested in knowing how one could estimate the Total Recovery Time for an application after checkpoint recovery. What I am specifically interested in is knowing the time needed for the recovery of the state + the catch-up phase (since the application's source tasks are reset to an earlier input position after recovery, this would be the data it processed before the failure and data that accumulated while the application was down).

My questions are: what important considerations should I take into account to estimate this time, and which parts of the codebase would this modification involve?

Thanks,
Jessica


Re: Total recovery time estimation after checkpoint recovery

Till Rohrmann
Hi Jessica,

Multiple factors affect the total recovery time. First of all, Flink needs
to detect that something went wrong. In the worst case this happens through
the missing heartbeat of a machine that died. The default heartbeat timeout
is configured to 50s, but one can tune it.

Next, Flink needs to cancel the running tasks. The time needed for this
operation is mainly influenced by what the user code is doing. In the
normal case, this should be quite fast.

After all tasks have been cancelled, Flink will ask the configured restart
strategy how much time it should wait before restarting the job. Once this
happens, Flink will restart the job from the last valid checkpoint. If you
have activated local recovery and all previously used machines are still
available, then the recovery should be almost instantaneous. If this is
not the case, then Flink needs to download the checkpoint data from the
persistent storage. The time to do this mainly depends on the state size
and network/IO capacity.
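As a rough back-of-envelope illustration of the download phase (not something from Flink itself; the sizes and bandwidth below are made-up assumptions):

```python
def download_time_seconds(state_size_bytes: float,
                          bandwidth_bytes_per_s: float) -> float:
    """Estimate the time to pull checkpoint data from persistent storage.

    bandwidth_bytes_per_s should be the effective per-machine throughput,
    i.e. the minimum of network bandwidth and local disk write speed.
    """
    return state_size_bytes / bandwidth_bytes_per_s

# Example: 20 GB of state over an effective 250 MB/s link.
GB = 1024 ** 3
MB = 1024 ** 2
print(download_time_seconds(20 * GB, 250 * MB))  # 81.92 seconds
```

With local recovery enabled and the same machines still available, this term drops to nearly zero, which is why it can dominate or vanish depending on the failure mode.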

The size of the checkpoint can depend on the type of checkpoint you are
choosing. Incremental checkpoints have the benefit that they are usually
faster to create but they can blow up the effective size of the checkpoint
a bit. This, however, strongly depends on the access pattern and how
RocksDB compacts the sst files.

Once this is done, Flink will start executing the job from the
checkpointed position. Depending on your process rate, the checkpoint
position and the rate of incoming events, this last part decides how
fast Flink will catch up. If p is the process rate, i the rate of
incoming events and diff the difference between the checkpoint position
and the head of the queue, then it takes diff / (p - i) seconds until
Flink has caught up with the head (assuming p > i; otherwise the job
never catches up).
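The catch-up formula can be sketched as a small helper. This is a toy model under the stated assumptions: a constant process rate p, a constant incoming rate i, and p > i:

```python
def catchup_time_seconds(diff_events: float,
                         process_rate: float,
                         incoming_rate: float) -> float:
    """Time until the restarted job reaches the head of the input.

    diff_events: backlog between the checkpointed position and the head.
    process_rate: events/s the job can process after restart (p).
    incoming_rate: events/s arriving while catching up (i).
    """
    if process_rate <= incoming_rate:
        raise ValueError("job can never catch up: p must exceed i")
    return diff_events / (process_rate - incoming_rate)

# Example: 1,000,000 events behind, processing 50k/s while 30k/s arrive.
print(catchup_time_seconds(1_000_000, 50_000, 30_000))  # 50.0 seconds
```

In practice p and i fluctuate, so this gives a lower bound rather than an exact figure; measuring sustained throughput under backpressure gives a more realistic p.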

Cheers,
Till


Total recovery time estimation after checkpoint recovery

Woods, Jessica Hui
In reply to this post by Woods, Jessica Hui
Hi,

I am working with Apache Flink and am interested in knowing how one could estimate the total amount of time an application spends in recovery, including the input stream "catch-up" after checkpoint recovery. What I am specifically interested in is knowing the time needed for the recovery of the state + the catch-up phase (since the application's source tasks are reset to an earlier input position after recovery, this would be the data it processed before the failure and data that accumulated while the application was down).

My question is: what important considerations should I take into account when estimating this time, and which portions of the Apache Flink codebase would be most helpful?

Thanks


Re: Total recovery time estimation after checkpoint recovery

Till Rohrmann
Hi Jessica,

Did you receive my previous email with the explanation?

Cheers,
Till


Re: Total recovery time estimation after checkpoint recovery

Woods, Jessica Hui
Hi Till,

No, I have not received any emails regarding my question. Could you please forward your response to me?

Thanks

Re: Total recovery time estimation after checkpoint recovery

Till Rohrmann
All right, here is the copy:

Multiple factors affect the total recovery time. First of all, Flink needs
to detect that something went wrong. In the worst case this happens through
the missing heartbeat of a machine that died. The default heartbeat timeout
is configured to 50s, but one can tune it.

Next, Flink needs to cancel the running tasks. The time needed for this
operation is mainly influenced by what the user code is doing. In the
normal case, this should be quite fast.

After all tasks have been cancelled, Flink will ask the configured restart
strategy how much time it should wait before restarting the job. Once this
happens, Flink will restart the job from the last valid checkpoint. If you
have activated local recovery and all previously used machines are still
available, then the recovery should be almost instantaneous. If this is
not the case, then Flink needs to download the checkpoint data from the
persistent storage. The time to do this mainly depends on the state size
and network/IO capacity.

The size of the checkpoint can depend on the type of checkpoint you are
choosing. Incremental checkpoints have the benefit that they are usually
faster to create but they can blow up the effective size of the checkpoint
a bit. This, however, strongly depends on the access pattern and how
RocksDB compacts the sst files.

Once this is done, Flink will start executing the job from the
checkpointed position. Depending on your process rate, the checkpoint
position and the rate of incoming events, this last part decides how
fast Flink will catch up. If p is the process rate, i the rate of
incoming events and diff the difference between the checkpoint position
and the head of the queue, then it takes diff / (p - i) seconds until
Flink has caught up with the head (assuming p > i; otherwise the job
never catches up).

Cheers,
Till
