Setting an allowable number of checkpoint failures

classic Classic list List threaded Threaded
8 messages Options
Reply | Threaded
Open this post in threaded view
|

Setting an allowable number of checkpoint failures

lrao@lyft.com
Hi,

We are running into intermittent checkpoint failures while checkpointing to
S3.

As described in this thread -
 http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/1-5-some-thing-weird-td21309.html
<http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/1-5-some-thing-weird-td21309.html>,
we see that the job restarts when it encounters such a failure.

As mentioned in the thread, I see that there is an option to not fail tasks
on checkpoint errors -
*CheckpointConfig#setFailOnCheckpointingErrors(false)**. *However, this
would mean that the job would continue running even in the case of
persistent checkpoint failures. Is my understanding here correct?

If above is true, then is there a way to configure an allowable number of
checkpoint failures? i.e. something along the lines of "Don't fail the job
if there are <=X number of checkpoint failures", so that *only *transient
failures can be ignored.

Thanks,
Lakshmi
Reply | Threaded
Open this post in threaded view
|

Re: Setting an allowable number of checkpoint failures

vino yang
Hi Lakshmi,

Your understanding of "
*CheckpointConfig#setFailOnCheckpointingErrors(false)*" is correct, If this
is set to false, the task will only decline a the checkpoint and continue
running.

I think it is also a good choice to allow a number of failures to be set.
Flink currently only supports whether the Task fails if the checkpoint
fails. It is not supported to configure a threshold.

You can create an issue in JIRA to feedback this requirement.

Thanks, vino.

2018-08-04 4:28 GMT+08:00 Lakshmi Gururaja Rao <[hidden email]>:

> Hi,
>
> We are running into intermittent checkpoint failures while checkpointing to
> S3.
>
> As described in this thread -
>  http://apache-flink-user-mailing-list-archive.2336050.
> n4.nabble.com/1-5-some-thing-weird-td21309.html
> <http://apache-flink-user-mailing-list-archive.2336050.
> n4.nabble.com/1-5-some-thing-weird-td21309.html>,
> we see that the job restarts when it encounters such a failure.
>
> As mentioned in the thread, I see that there is an option to not fail tasks
> on checkpoint errors -
> *CheckpointConfig#setFailOnCheckpointingErrors(false)**. *However, this
> would mean that the job would continue running even in the case of
> persistent checkpoint failures. Is my understanding here correct?
>
> If above is true, then is there a way to configure an allowable number of
> checkpoint failures? i.e. something along the lines of "Don't fail the job
> if there are <=X number of checkpoint failures", so that *only *transient
> failures can be ignored.
>
> Thanks,
> Lakshmi
>
Reply | Threaded
Open this post in threaded view
|

Re: Setting an allowable number of checkpoint failures

Till Rohrmann
Hi Lakshmi,

you could somewhat achieve the described behaviour by setting
setFailOnCheckpointintErrors(true) and using the FailureRateRestartStrategy
as the restart strategy. That way checkpoint failures will trigger a job
restart (this is the downside) which is handled by the restart strategy.
The FailureRateRestartStrategy allows for x failures to happen within in a
given time interval. If this number is exceeded, then the job will
terminally fail.

Cheers,
Till

On Sat, Aug 4, 2018 at 4:58 AM vino yang <[hidden email]> wrote:

> Hi Lakshmi,
>
> Your understanding of "
> *CheckpointConfig#setFailOnCheckpointingErrors(false)*" is correct, If this
> is set to false, the task will only decline a the checkpoint and continue
> running.
>
> I think it is also a good choice to allow a number of failures to be set.
> Flink currently only supports whether the Task fails if the checkpoint
> fails. It is not supported to configure a threshold.
>
> You can create an issue in JIRA to feedback this requirement.
>
> Thanks, vino.
>
> 2018-08-04 4:28 GMT+08:00 Lakshmi Gururaja Rao <[hidden email]>:
>
> > Hi,
> >
> > We are running into intermittent checkpoint failures while checkpointing
> to
> > S3.
> >
> > As described in this thread -
> >  http://apache-flink-user-mailing-list-archive.2336050.
> > n4.nabble.com/1-5-some-thing-weird-td21309.html
> > <http://apache-flink-user-mailing-list-archive.2336050.
> > n4.nabble.com/1-5-some-thing-weird-td21309.html>,
> > we see that the job restarts when it encounters such a failure.
> >
> > As mentioned in the thread, I see that there is an option to not fail
> tasks
> > on checkpoint errors -
> > *CheckpointConfig#setFailOnCheckpointingErrors(false)**. *However, this
> > would mean that the job would continue running even in the case of
> > persistent checkpoint failures. Is my understanding here correct?
> >
> > If above is true, then is there a way to configure an allowable number of
> > checkpoint failures? i.e. something along the lines of "Don't fail the
> job
> > if there are <=X number of checkpoint failures", so that *only *transient
> > failures can be ignored.
> >
> > Thanks,
> > Lakshmi
> >
>
Reply | Threaded
Open this post in threaded view
|

Re: Setting an allowable number of checkpoint failures

vino yang
Hi Till,

I think the way you proposed is a solution. But I think we also can provide
a solution to prevent Checkpoint from failing indefinitely, in case the Job
does not fail.

Instead, a threshold is given to allow the checkpoint to fail a few times.
When this threshold is reached, we decide to let the job fail.

Thanks, vino.

2018-08-06 15:14 GMT+08:00 Till Rohrmann <[hidden email]>:

> Hi Lakshmi,
>
> you could somewhat achieve the described behaviour by setting
> setFailOnCheckpointintErrors(true) and using the
> FailureRateRestartStrategy
> as the restart strategy. That way checkpoint failures will trigger a job
> restart (this is the downside) which is handled by the restart strategy.
> The FailureRateRestartStrategy allows for x failures to happen within in a
> given time interval. If this number is exceeded, then the job will
> terminally fail.
>
> Cheers,
> Till
>
> On Sat, Aug 4, 2018 at 4:58 AM vino yang <[hidden email]> wrote:
>
> > Hi Lakshmi,
> >
> > Your understanding of "
> > *CheckpointConfig#setFailOnCheckpointingErrors(false)*" is correct, If
> this
> > is set to false, the task will only decline a the checkpoint and continue
> > running.
> >
> > I think it is also a good choice to allow a number of failures to be set.
> > Flink currently only supports whether the Task fails if the checkpoint
> > fails. It is not supported to configure a threshold.
> >
> > You can create an issue in JIRA to feedback this requirement.
> >
> > Thanks, vino.
> >
> > 2018-08-04 4:28 GMT+08:00 Lakshmi Gururaja Rao <[hidden email]>:
> >
> > > Hi,
> > >
> > > We are running into intermittent checkpoint failures while
> checkpointing
> > to
> > > S3.
> > >
> > > As described in this thread -
> > >  http://apache-flink-user-mailing-list-archive.2336050.
> > > n4.nabble.com/1-5-some-thing-weird-td21309.html
> > > <http://apache-flink-user-mailing-list-archive.2336050.
> > > n4.nabble.com/1-5-some-thing-weird-td21309.html>,
> > > we see that the job restarts when it encounters such a failure.
> > >
> > > As mentioned in the thread, I see that there is an option to not fail
> > tasks
> > > on checkpoint errors -
> > > *CheckpointConfig#setFailOnCheckpointingErrors(false)**. *However,
> this
> > > would mean that the job would continue running even in the case of
> > > persistent checkpoint failures. Is my understanding here correct?
> > >
> > > If above is true, then is there a way to configure an allowable number
> of
> > > checkpoint failures? i.e. something along the lines of "Don't fail the
> > job
> > > if there are <=X number of checkpoint failures", so that *only
> *transient
> > > failures can be ignored.
> > >
> > > Thanks,
> > > Lakshmi
> > >
> >
>
Reply | Threaded
Open this post in threaded view
|

Re: Setting an allowable number of checkpoint failures

Thomas Weise
Hi,

What we are looking for is that the job does *not* restart on transient
checkpoint failures and we would like to cap the number of allowable
subsequent failures until a restart occurs.

The reason is that every restart is a service interruption that is
potentially very expensive.

Thanks,
Thomas







On Mon, Aug 6, 2018 at 5:09 AM vino yang <[hidden email]> wrote:

> Hi Till,
>
> I think the way you proposed is a solution. But I think we also can provide
> a solution to prevent Checkpoint from failing indefinitely, in case the Job
> does not fail.
>
> Instead, a threshold is given to allow the checkpoint to fail a few times.
> When this threshold is reached, we decide to let the job fail.
>
> Thanks, vino.
>
> 2018-08-06 15:14 GMT+08:00 Till Rohrmann <[hidden email]>:
>
> > Hi Lakshmi,
> >
> > you could somewhat achieve the described behaviour by setting
> > setFailOnCheckpointintErrors(true) and using the
> > FailureRateRestartStrategy
> > as the restart strategy. That way checkpoint failures will trigger a job
> > restart (this is the downside) which is handled by the restart strategy.
> > The FailureRateRestartStrategy allows for x failures to happen within in
> a
> > given time interval. If this number is exceeded, then the job will
> > terminally fail.
> >
> > Cheers,
> > Till
> >
> > On Sat, Aug 4, 2018 at 4:58 AM vino yang <[hidden email]> wrote:
> >
> > > Hi Lakshmi,
> > >
> > > Your understanding of "
> > > *CheckpointConfig#setFailOnCheckpointingErrors(false)*" is correct, If
> > this
> > > is set to false, the task will only decline a the checkpoint and
> continue
> > > running.
> > >
> > > I think it is also a good choice to allow a number of failures to be
> set.
> > > Flink currently only supports whether the Task fails if the checkpoint
> > > fails. It is not supported to configure a threshold.
> > >
> > > You can create an issue in JIRA to feedback this requirement.
> > >
> > > Thanks, vino.
> > >
> > > 2018-08-04 4:28 GMT+08:00 Lakshmi Gururaja Rao <[hidden email]>:
> > >
> > > > Hi,
> > > >
> > > > We are running into intermittent checkpoint failures while
> > checkpointing
> > > to
> > > > S3.
> > > >
> > > > As described in this thread -
> > > >  http://apache-flink-user-mailing-list-archive.2336050.
> > > > n4.nabble.com/1-5-some-thing-weird-td21309.html
> > > > <http://apache-flink-user-mailing-list-archive.2336050.
> > > > n4.nabble.com/1-5-some-thing-weird-td21309.html>,
> > > > we see that the job restarts when it encounters such a failure.
> > > >
> > > > As mentioned in the thread, I see that there is an option to not fail
> > > tasks
> > > > on checkpoint errors -
> > > > *CheckpointConfig#setFailOnCheckpointingErrors(false)**. *However,
> > this
> > > > would mean that the job would continue running even in the case of
> > > > persistent checkpoint failures. Is my understanding here correct?
> > > >
> > > > If above is true, then is there a way to configure an allowable
> number
> > of
> > > > checkpoint failures? i.e. something along the lines of "Don't fail
> the
> > > job
> > > > if there are <=X number of checkpoint failures", so that *only
> > *transient
> > > > failures can be ignored.
> > > >
> > > > Thanks,
> > > > Lakshmi
> > > >
> > >
> >
>
Reply | Threaded
Open this post in threaded view
|

Re: Setting an allowable number of checkpoint failures

vino yang
Hi Thomas,

What I am saying is what you mean, maybe I am not very accurate.

Thanks, vino.

2018-08-06 21:22 GMT+08:00 Thomas Weise <[hidden email]>:

> Hi,
>
> What we are looking for is that the job does *not* restart on transient
> checkpoint failures and we would like to cap the number of allowable
> subsequent failures until a restart occurs.
>
> The reason is that every restart is a service interruption that is
> potentially very expensive.
>
> Thanks,
> Thomas
>
>
>
>
>
>
>
> On Mon, Aug 6, 2018 at 5:09 AM vino yang <[hidden email]> wrote:
>
> > Hi Till,
> >
> > I think the way you proposed is a solution. But I think we also can
> provide
> > a solution to prevent Checkpoint from failing indefinitely, in case the
> Job
> > does not fail.
> >
> > Instead, a threshold is given to allow the checkpoint to fail a few
> times.
> > When this threshold is reached, we decide to let the job fail.
> >
> > Thanks, vino.
> >
> > 2018-08-06 15:14 GMT+08:00 Till Rohrmann <[hidden email]>:
> >
> > > Hi Lakshmi,
> > >
> > > you could somewhat achieve the described behaviour by setting
> > > setFailOnCheckpointintErrors(true) and using the
> > > FailureRateRestartStrategy
> > > as the restart strategy. That way checkpoint failures will trigger a
> job
> > > restart (this is the downside) which is handled by the restart
> strategy.
> > > The FailureRateRestartStrategy allows for x failures to happen within
> in
> > a
> > > given time interval. If this number is exceeded, then the job will
> > > terminally fail.
> > >
> > > Cheers,
> > > Till
> > >
> > > On Sat, Aug 4, 2018 at 4:58 AM vino yang <[hidden email]>
> wrote:
> > >
> > > > Hi Lakshmi,
> > > >
> > > > Your understanding of "
> > > > *CheckpointConfig#setFailOnCheckpointingErrors(false)*" is correct,
> If
> > > this
> > > > is set to false, the task will only decline a the checkpoint and
> > continue
> > > > running.
> > > >
> > > > I think it is also a good choice to allow a number of failures to be
> > set.
> > > > Flink currently only supports whether the Task fails if the
> checkpoint
> > > > fails. It is not supported to configure a threshold.
> > > >
> > > > You can create an issue in JIRA to feedback this requirement.
> > > >
> > > > Thanks, vino.
> > > >
> > > > 2018-08-04 4:28 GMT+08:00 Lakshmi Gururaja Rao <[hidden email]>:
> > > >
> > > > > Hi,
> > > > >
> > > > > We are running into intermittent checkpoint failures while
> > > checkpointing
> > > > to
> > > > > S3.
> > > > >
> > > > > As described in this thread -
> > > > >  http://apache-flink-user-mailing-list-archive.2336050.
> > > > > n4.nabble.com/1-5-some-thing-weird-td21309.html
> > > > > <http://apache-flink-user-mailing-list-archive.2336050.
> > > > > n4.nabble.com/1-5-some-thing-weird-td21309.html>,
> > > > > we see that the job restarts when it encounters such a failure.
> > > > >
> > > > > As mentioned in the thread, I see that there is an option to not
> fail
> > > > tasks
> > > > > on checkpoint errors -
> > > > > *CheckpointConfig#setFailOnCheckpointingErrors(false)**. *However,
> > > this
> > > > > would mean that the job would continue running even in the case of
> > > > > persistent checkpoint failures. Is my understanding here correct?
> > > > >
> > > > > If above is true, then is there a way to configure an allowable
> > number
> > > of
> > > > > checkpoint failures? i.e. something along the lines of "Don't fail
> > the
> > > > job
> > > > > if there are <=X number of checkpoint failures", so that *only
> > > *transient
> > > > > failures can be ignored.
> > > > >
> > > > > Thanks,
> > > > > Lakshmi
> > > > >
> > > >
> > >
> >
>
Reply | Threaded
Open this post in threaded view
|

Re: Setting an allowable number of checkpoint failures

Thomas Weise
Hi vino,

Yes, I believe we are on the same page. I created
https://issues.apache.org/jira/browse/FLINK-10074 to track it.

Thanks,
Thomas

On Mon, Aug 6, 2018 at 8:42 AM vino yang <[hidden email]> wrote:

> Hi Thomas,
>
> What I am saying is what you mean, maybe I am not very accurate.
>
> Thanks, vino.
>
> 2018-08-06 21:22 GMT+08:00 Thomas Weise <[hidden email]>:
>
> > Hi,
> >
> > What we are looking for is that the job does *not* restart on transient
> > checkpoint failures and we would like to cap the number of allowable
> > subsequent failures until a restart occurs.
> >
> > The reason is that every restart is a service interruption that is
> > potentially very expensive.
> >
> > Thanks,
> > Thomas
> >
> >
> >
> >
> >
> >
> >
> > On Mon, Aug 6, 2018 at 5:09 AM vino yang <[hidden email]> wrote:
> >
> > > Hi Till,
> > >
> > > I think the way you proposed is a solution. But I think we also can
> > provide
> > > a solution to prevent Checkpoint from failing indefinitely, in case the
> > Job
> > > does not fail.
> > >
> > > Instead, a threshold is given to allow the checkpoint to fail a few
> > times.
> > > When this threshold is reached, we decide to let the job fail.
> > >
> > > Thanks, vino.
> > >
> > > 2018-08-06 15:14 GMT+08:00 Till Rohrmann <[hidden email]>:
> > >
> > > > Hi Lakshmi,
> > > >
> > > > you could somewhat achieve the described behaviour by setting
> > > > setFailOnCheckpointintErrors(true) and using the
> > > > FailureRateRestartStrategy
> > > > as the restart strategy. That way checkpoint failures will trigger a
> > job
> > > > restart (this is the downside) which is handled by the restart
> > strategy.
> > > > The FailureRateRestartStrategy allows for x failures to happen within
> > in
> > > a
> > > > given time interval. If this number is exceeded, then the job will
> > > > terminally fail.
> > > >
> > > > Cheers,
> > > > Till
> > > >
> > > > On Sat, Aug 4, 2018 at 4:58 AM vino yang <[hidden email]>
> > wrote:
> > > >
> > > > > Hi Lakshmi,
> > > > >
> > > > > Your understanding of "
> > > > > *CheckpointConfig#setFailOnCheckpointingErrors(false)*" is correct,
> > If
> > > > this
> > > > > is set to false, the task will only decline a the checkpoint and
> > > continue
> > > > > running.
> > > > >
> > > > > I think it is also a good choice to allow a number of failures to
> be
> > > set.
> > > > > Flink currently only supports whether the Task fails if the
> > checkpoint
> > > > > fails. It is not supported to configure a threshold.
> > > > >
> > > > > You can create an issue in JIRA to feedback this requirement.
> > > > >
> > > > > Thanks, vino.
> > > > >
> > > > > 2018-08-04 4:28 GMT+08:00 Lakshmi Gururaja Rao <[hidden email]>:
> > > > >
> > > > > > Hi,
> > > > > >
> > > > > > We are running into intermittent checkpoint failures while
> > > > checkpointing
> > > > > to
> > > > > > S3.
> > > > > >
> > > > > > As described in this thread -
> > > > > >  http://apache-flink-user-mailing-list-archive.2336050.
> > > > > > n4.nabble.com/1-5-some-thing-weird-td21309.html
> > > > > > <http://apache-flink-user-mailing-list-archive.2336050.
> > > > > > n4.nabble.com/1-5-some-thing-weird-td21309.html>,
> > > > > > we see that the job restarts when it encounters such a failure.
> > > > > >
> > > > > > As mentioned in the thread, I see that there is an option to not
> > fail
> > > > > tasks
> > > > > > on checkpoint errors -
> > > > > > *CheckpointConfig#setFailOnCheckpointingErrors(false)**.
> *However,
> > > > this
> > > > > > would mean that the job would continue running even in the case
> of
> > > > > > persistent checkpoint failures. Is my understanding here correct?
> > > > > >
> > > > > > If above is true, then is there a way to configure an allowable
> > > number
> > > > of
> > > > > > checkpoint failures? i.e. something along the lines of "Don't
> fail
> > > the
> > > > > job
> > > > > > if there are <=X number of checkpoint failures", so that *only
> > > > *transient
> > > > > > failures can be ignored.
> > > > > >
> > > > > > Thanks,
> > > > > > Lakshmi
> > > > > >
> > > > >
> > > >
> > >
> >
>
Reply | Threaded
Open this post in threaded view
|

Re: Setting an allowable number of checkpoint failures

Till Rohrmann
Thanks Thomas for opening the issue. This is indeed a useful feature to
make the checkpointing more controllable.

On Mon, Aug 6, 2018 at 6:36 PM Thomas Weise <[hidden email]> wrote:

> Hi vino,
>
> Yes, I believe we are on the same page. I created
> https://issues.apache.org/jira/browse/FLINK-10074 to track it.
>
> Thanks,
> Thomas
>
> On Mon, Aug 6, 2018 at 8:42 AM vino yang <[hidden email]> wrote:
>
> > Hi Thomas,
> >
> > What I am saying is what you mean, maybe I am not very accurate.
> >
> > Thanks, vino.
> >
> > 2018-08-06 21:22 GMT+08:00 Thomas Weise <[hidden email]>:
> >
> > > Hi,
> > >
> > > What we are looking for is that the job does *not* restart on transient
> > > checkpoint failures and we would like to cap the number of allowable
> > > subsequent failures until a restart occurs.
> > >
> > > The reason is that every restart is a service interruption that is
> > > potentially very expensive.
> > >
> > > Thanks,
> > > Thomas
> > >
> > >
> > >
> > >
> > >
> > >
> > >
> > > On Mon, Aug 6, 2018 at 5:09 AM vino yang <[hidden email]>
> wrote:
> > >
> > > > Hi Till,
> > > >
> > > > I think the way you proposed is a solution. But I think we also can
> > > provide
> > > > a solution to prevent Checkpoint from failing indefinitely, in case
> the
> > > Job
> > > > does not fail.
> > > >
> > > > Instead, a threshold is given to allow the checkpoint to fail a few
> > > times.
> > > > When this threshold is reached, we decide to let the job fail.
> > > >
> > > > Thanks, vino.
> > > >
> > > > 2018-08-06 15:14 GMT+08:00 Till Rohrmann <[hidden email]>:
> > > >
> > > > > Hi Lakshmi,
> > > > >
> > > > > you could somewhat achieve the described behaviour by setting
> > > > > setFailOnCheckpointintErrors(true) and using the
> > > > > FailureRateRestartStrategy
> > > > > as the restart strategy. That way checkpoint failures will trigger
> a
> > > job
> > > > > restart (this is the downside) which is handled by the restart
> > > strategy.
> > > > > The FailureRateRestartStrategy allows for x failures to happen
> within
> > > in
> > > > a
> > > > > given time interval. If this number is exceeded, then the job will
> > > > > terminally fail.
> > > > >
> > > > > Cheers,
> > > > > Till
> > > > >
> > > > > On Sat, Aug 4, 2018 at 4:58 AM vino yang <[hidden email]>
> > > wrote:
> > > > >
> > > > > > Hi Lakshmi,
> > > > > >
> > > > > > Your understanding of "
> > > > > > *CheckpointConfig#setFailOnCheckpointingErrors(false)*" is
> correct,
> > > If
> > > > > this
> > > > > > is set to false, the task will only decline a the checkpoint and
> > > > continue
> > > > > > running.
> > > > > >
> > > > > > I think it is also a good choice to allow a number of failures to
> > be
> > > > set.
> > > > > > Flink currently only supports whether the Task fails if the
> > > checkpoint
> > > > > > fails. It is not supported to configure a threshold.
> > > > > >
> > > > > > You can create an issue in JIRA to feedback this requirement.
> > > > > >
> > > > > > Thanks, vino.
> > > > > >
> > > > > > 2018-08-04 4:28 GMT+08:00 Lakshmi Gururaja Rao <[hidden email]>:
> > > > > >
> > > > > > > Hi,
> > > > > > >
> > > > > > > We are running into intermittent checkpoint failures while
> > > > > checkpointing
> > > > > > to
> > > > > > > S3.
> > > > > > >
> > > > > > > As described in this thread -
> > > > > > >  http://apache-flink-user-mailing-list-archive.2336050.
> > > > > > > n4.nabble.com/1-5-some-thing-weird-td21309.html
> > > > > > > <http://apache-flink-user-mailing-list-archive.2336050.
> > > > > > > n4.nabble.com/1-5-some-thing-weird-td21309.html>,
> > > > > > > we see that the job restarts when it encounters such a failure.
> > > > > > >
> > > > > > > As mentioned in the thread, I see that there is an option to
> not
> > > fail
> > > > > > tasks
> > > > > > > on checkpoint errors -
> > > > > > > *CheckpointConfig#setFailOnCheckpointingErrors(false)**.
> > *However,
> > > > > this
> > > > > > > would mean that the job would continue running even in the case
> > of
> > > > > > > persistent checkpoint failures. Is my understanding here
> correct?
> > > > > > >
> > > > > > > If above is true, then is there a way to configure an allowable
> > > > number
> > > > > of
> > > > > > > checkpoint failures? i.e. something along the lines of "Don't
> > fail
> > > > the
> > > > > > job
> > > > > > > if there are <=X number of checkpoint failures", so that *only
> > > > > *transient
> > > > > > > failures can be ignored.
> > > > > > >
> > > > > > > Thanks,
> > > > > > > Lakshmi
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
>