Hi,
I am working with Flink at the moment and am interested in knowing how one could estimate the total recovery time for an application after checkpoint recovery. Specifically, I am interested in the time needed for the recovery of the state plus the catch-up phase (since the application's source tasks are reset to an earlier input position after recovery, this covers the data it processed before the failure and the data that accumulated while the application was down).

My questions are: what important considerations should I take into account to estimate this time, and which parts of the codebase would this modification involve?

Thanks,
Jessica
Hi Jessica,
Multiple factors affect the total recovery time. First of all, Flink needs to detect that something went wrong. In the worst case this happens through the missed heartbeat of a dead machine. The default heartbeat timeout is 50s, but it can be tuned.

Next, Flink needs to cancel the running tasks. The time needed for this operation is mainly influenced by what the user code is doing. In the normal case, this should be quite fast. After all tasks have been cancelled, Flink will ask the configured restart strategy how long it should wait before restarting the job.

Once this delay has passed, Flink will restart the job from the last valid checkpoint. If you have activated local recovery and all previously used machines are still available, then the recovery should be almost instantaneous. If this is not the case, then Flink needs to download the checkpoint data from the persistent storage. The time to do this mainly depends on the state size and the network/IO capacity. The size of the checkpoint can depend on the type of checkpoint you choose. Incremental checkpoints have the benefit that they are usually faster to create, but they can increase the effective size of the checkpoint somewhat. This, however, strongly depends on the access pattern and on how RocksDB compacts the SST files.

Once the state has been restored, Flink will start executing the job from the checkpointed position. How fast Flink catches up depends on your processing rate, the checkpoint position and the rate of incoming events. If p is the processing rate, i the rate of incoming events and diff the difference between the checkpoint position and the head of the queue, then it takes diff / (p - i) seconds until Flink has caught up with the head. This only holds if p > i; if the job cannot process events faster than they arrive, it never catches up.
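The catch-up formula above can be sketched in a few lines of Python. This is only an illustration, not part of Flink: the function name `catch_up_seconds` is hypothetical, and it assumes that diff is measured in events and that p and i are constant rates in events per second.

```python
import math


def catch_up_seconds(diff: float, p: float, i: float) -> float:
    """Seconds until the job reaches the head of the input, i.e. diff / (p - i).

    diff: backlog in events between the checkpointed position and the head
    p:    processing rate in events per second (assumed constant)
    i:    rate of incoming events per second (assumed constant)
    """
    if diff < 0 or p < 0 or i < 0:
        raise ValueError("diff, p and i must be non-negative")
    if diff == 0:
        return 0.0  # already at the head, nothing to catch up on
    if p <= i:
        return math.inf  # the backlog never shrinks
    return diff / (p - i)


# Example: a backlog of 6 million events, processed at 50k/s while 30k/s arrive.
print(catch_up_seconds(6_000_000, 50_000, 30_000))  # 300.0
```

In practice the processing rate right after a restart is often not constant (cold caches, backpressure), so the result is better read as a rough lower bound than as a prediction.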
Cheers,
Till
Hi,
I am working with Apache Flink and am interested in knowing how one could estimate the total amount of time an application spends in recovery, including the input stream "catch-up" after checkpoint recovery. What I am specifically interested in is knowing the time needed for the recovery of the state + the catch-up phase (since the application's source tasks are reset to an earlier input position after recovery, this would be the data it processed before the failure and data that accumulated while the application was down).

My question is, what important considerations should I take into account when estimating this time and which portions of the Apache Flink codebase would be most helpful?

Thanks
Hi Jessica,
did you receive my previous email with the explanation?

Cheers,
Till
Hi Till,
No, I have not received any emails regarding my question. Could you please forward your response to me?

Thanks
All right, here is the copy:
Multiple factors affect the total recovery time. First of all, Flink needs to detect that something went wrong. In the worst case this happens through the missed heartbeat of a dead machine. The default heartbeat timeout is 50s, but it can be tuned.

Next, Flink needs to cancel the running tasks. The time needed for this operation is mainly influenced by what the user code is doing. In the normal case, this should be quite fast. After all tasks have been cancelled, Flink will ask the configured restart strategy how long it should wait before restarting the job.

Once this delay has passed, Flink will restart the job from the last valid checkpoint. If you have activated local recovery and all previously used machines are still available, then the recovery should be almost instantaneous. If this is not the case, then Flink needs to download the checkpoint data from the persistent storage. The time to do this mainly depends on the state size and the network/IO capacity. The size of the checkpoint can depend on the type of checkpoint you choose. Incremental checkpoints have the benefit that they are usually faster to create, but they can increase the effective size of the checkpoint somewhat. This, however, strongly depends on the access pattern and on how RocksDB compacts the SST files.

Once the state has been restored, Flink will start executing the job from the checkpointed position. How fast Flink catches up depends on your processing rate, the checkpoint position and the rate of incoming events. If p is the processing rate, i the rate of incoming events and diff the difference between the checkpoint position and the head of the queue, then it takes diff / (p - i) seconds until Flink has caught up with the head. This only holds if p > i; if the job cannot process events faster than they arrive, it never catches up.

Cheers,
Till
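For anyone who wants to turn the phases from this thread into a rough end-to-end number, here is a minimal sketch. It is not Flink code, and everything in it is an assumption made for illustration: the `RecoveryInputs` fields and the function name are hypothetical, the state download is modelled as state size divided by a constant bandwidth, local recovery is modelled as taking zero seconds, and the backlog is assumed to keep growing at rate i for as long as the job is down.

```python
import math
from dataclasses import dataclass


@dataclass
class RecoveryInputs:
    detection_s: float          # time to notice the failure (worst case: heartbeat timeout)
    cancel_s: float             # time to cancel the running tasks
    restart_delay_s: float      # delay imposed by the restart strategy
    state_bytes: float          # size of the checkpoint to restore
    restore_bytes_per_s: float  # effective download bandwidth from persistent storage
    local_recovery: bool        # True if state can be restored from local copies
    lag_at_failure: float       # events between checkpoint position and head at failure time
    p: float                    # processing rate after restart, events per second
    i: float                    # rate of incoming events per second


def estimate_total_recovery_seconds(x: RecoveryInputs) -> float:
    """Downtime plus catch-up time, under the simplifying assumptions stated above."""
    if x.local_recovery:
        restore_s = 0.0  # assumption: "almost instantaneous" is treated as zero
    else:
        if x.restore_bytes_per_s <= 0:
            raise ValueError("restore_bytes_per_s must be positive")
        restore_s = x.state_bytes / x.restore_bytes_per_s

    downtime_s = x.detection_s + x.cancel_s + x.restart_delay_s + restore_s

    # Events keep arriving while the job is down, so the backlog grows.
    diff = x.lag_at_failure + x.i * downtime_s
    if diff == 0:
        return downtime_s
    if x.p <= x.i:
        return math.inf  # the job never catches up
    return downtime_s + diff / (x.p - x.i)


example = RecoveryInputs(
    detection_s=50, cancel_s=5, restart_delay_s=10,
    state_bytes=20e9, restore_bytes_per_s=200e6, local_recovery=False,
    lag_at_failure=600_000, p=50_000, i=30_000,
)
print(estimate_total_recovery_seconds(example))  # 442.5
```

With these example numbers the job is down for 165 s (50 + 5 + 10 + 100 s of state download) and then needs another 277.5 s to work off the backlog. All input values are made up; measured values from your own job and cluster should replace them.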