Hi,
We are running into intermittent checkpoint failures while checkpointing to S3.

As described in this thread - http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/1-5-some-thing-weird-td21309.html, we see that the job restarts when it encounters such a failure.

As mentioned in the thread, I see that there is an option to not fail tasks on checkpoint errors - *CheckpointConfig#setFailOnCheckpointingErrors(false)*. However, this would mean that the job would continue running even in the case of persistent checkpoint failures. Is my understanding here correct?

If the above is true, then is there a way to configure an allowable number of checkpoint failures? i.e. something along the lines of "Don't fail the job if there are <=X checkpoint failures", so that *only* transient failures can be ignored.

Thanks,
Lakshmi

Hi Lakshmi,
Your understanding of *CheckpointConfig#setFailOnCheckpointingErrors(false)* is correct. If this is set to false, the task will only decline the checkpoint and continue running.

I think it would also be a good choice to allow a number of failures to be configured. Flink currently only lets you choose whether the task fails when a checkpoint fails; configuring a threshold is not supported.

You can create an issue in JIRA to request this feature.

Thanks, vino.

2018-08-04 4:28 GMT+08:00 Lakshmi Gururaja Rao <[hidden email]>:
> [...] is there a way to configure an allowable number of checkpoint failures? [...]

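For reference, a minimal sketch of the setting discussed above, assuming the Flink 1.5-era DataStream API. The class name, the method name and the 60-second checkpoint interval are illustrative placeholders and do not come from the thread; the job's sources, operators and the call to execute() are left out.

```java
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

// Illustrative class name; not from the thread.
class TolerateCheckpointErrors {

    static StreamExecutionEnvironment configure() {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Checkpoint every 60 s (placeholder interval).
        env.enableCheckpointing(60_000L);

        // A task whose checkpoint fails declines that checkpoint and keeps running.
        // No threshold applies, so persistent failures are tolerated as well.
        env.getCheckpointConfig().setFailOnCheckpointingErrors(false);

        return env;
    }
}
```

With this setting the job is never failed by checkpoint errors, which is exactly the concern raised in the original question.
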
Hi Lakshmi,
you could somewhat achieve the described behaviour by setting setFailOnCheckpointingErrors(true) and using the FailureRateRestartStrategy as the restart strategy. That way, checkpoint failures will trigger a job restart (this is the downside), which is handled by the restart strategy. The FailureRateRestartStrategy allows for x failures to happen within a given time interval. If this number is exceeded, then the job will terminally fail.

Cheers,
Till

On Sat, Aug 4, 2018 at 4:58 AM vino yang <[hidden email]> wrote:
> [...] Flink currently only supports whether the Task fails if the checkpoint fails. It is not supported to configure a threshold. [...]

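A sketch of what this suggestion could look like in job setup code, again assuming the Flink 1.5-era DataStream API. The class and method names and all numbers (3 failures, a 5-minute window, a 10-second delay, a 60-second checkpoint interval) are placeholders chosen for illustration, not values recommended in the thread.

```java
import java.util.concurrent.TimeUnit;

import org.apache.flink.api.common.restartstrategy.RestartStrategies;
import org.apache.flink.api.common.time.Time;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

// Illustrative class name; not from the thread.
class RestartOnCheckpointErrors {

    static StreamExecutionEnvironment configure() {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Checkpoint every 60 s (placeholder interval).
        env.enableCheckpointing(60_000L);

        // A failed checkpoint fails the task, which triggers a job restart.
        env.getCheckpointConfig().setFailOnCheckpointingErrors(true);

        // Tolerate up to 3 failures per 5-minute window, waiting 10 s between
        // restart attempts. If the rate is exceeded, the job fails terminally.
        env.setRestartStrategy(RestartStrategies.failureRateRestart(
                3,
                Time.of(5, TimeUnit.MINUTES),
                Time.of(10, TimeUnit.SECONDS)));

        return env;
    }
}
```

Note that the failure rate counts every kind of job failure, not only checkpoint failures, and that each tolerated failure still costs one restart.
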
Hi Till,
I think the approach you proposed is one solution. But I think we can also provide a way to prevent checkpoints from failing indefinitely when the job itself is not failed on checkpoint errors.

Instead, a threshold would allow the checkpoint to fail a few times. Once this threshold is reached, we let the job fail.

Thanks, vino.

2018-08-06 15:14 GMT+08:00 Till Rohrmann <[hidden email]>:
> [...] That way checkpoint failures will trigger a job restart (this is the downside) which is handled by the restart strategy. [...]

Hi,
What we are looking for is that the job does *not* restart on transient checkpoint failures, and we would like to cap the number of allowable subsequent failures until a restart occurs.

The reason is that every restart is a service interruption that is potentially very expensive.

Thanks,
Thomas

On Mon, Aug 6, 2018 at 5:09 AM vino yang <[hidden email]> wrote:
> [...] a threshold is given to allow the checkpoint to fail a few times. When this threshold is reached, we decide to let the job fail.

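The requested behaviour can be illustrated with a small standalone sketch. This is not Flink code and not an existing Flink API: the class CheckpointFailureTracker and its methods are hypothetical, and it assumes that "subsequent failures" means an unbroken run of failed checkpoints, so that any completed checkpoint resets the count.

```java
// Hypothetical helper, not part of Flink: counts consecutive checkpoint failures
// and reports when a configured tolerance has been exceeded.
final class CheckpointFailureTracker {

    private final int tolerableConsecutiveFailures;
    private int consecutiveFailures = 0;

    CheckpointFailureTracker(int tolerableConsecutiveFailures) {
        if (tolerableConsecutiveFailures < 0) {
            throw new IllegalArgumentException("tolerance must be >= 0");
        }
        this.tolerableConsecutiveFailures = tolerableConsecutiveFailures;
    }

    // A completed checkpoint ends the current run of failures.
    void onCheckpointCompleted() {
        consecutiveFailures = 0;
    }

    // Records one failed checkpoint. Returns true once the run of failures
    // exceeds the tolerance, i.e. when the job should be failed or restarted.
    boolean onCheckpointFailed() {
        consecutiveFailures++;
        return consecutiveFailures > tolerableConsecutiveFailures;
    }

    int getConsecutiveFailures() {
        return consecutiveFailures;
    }
}
```

With a tolerance of 2, two failed checkpoints in a row are ignored and the third one signals that the job should fail; a successful checkpoint in between starts the count over. Transient S3 errors would therefore cause no restart, while a persistent outage would still surface.
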
Hi Thomas,
That is what I meant as well; maybe I did not express it very accurately.

Thanks, vino.

2018-08-06 21:22 GMT+08:00 Thomas Weise <[hidden email]>:
> What we are looking for is that the job does *not* restart on transient checkpoint failures [...]

Hi vino,
Yes, I believe we are on the same page. I created https://issues.apache.org/jira/browse/FLINK-10074 to track it.

Thanks,
Thomas

On Mon, Aug 6, 2018 at 8:42 AM vino yang <[hidden email]> wrote:
> What I am saying is what you mean, maybe I am not very accurate.

Thanks Thomas for opening the issue. This is indeed a useful feature to make checkpointing more controllable.

On Mon, Aug 6, 2018 at 6:36 PM Thomas Weise <[hidden email]> wrote:
> [...] I created https://issues.apache.org/jira/browse/FLINK-10074 to track it.