Why are checkpoint failures so serious?


Ron Crocker
What would it take to be a little more flexible in handling checkpoint failures?

Right now I have a team that’s checkpointing into S3, via the FsStateBackend and an appropriate URL. Sometimes these checkpoints fail. The failures are transient, though, and a retry would likely work.

However, when a checkpoint fails, the job exits and restarts from the last checkpoint. That’s fine, but I’d rather it retried the checkpoint before failing and, even after a failed checkpoint, kept running and attempted the next one. Maybe this is something that should be configurable: number of retries, failure strategy, …
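
To make the request concrete, here is a minimal sketch of such a policy in plain Java. It is not a Flink API: the class `CheckpointRetrySketch`, the interface `CheckpointAttempt`, and the parameters `maxRetries` and `tolerableFailures` are all hypothetical names chosen for illustration. Each checkpoint is retried a few times, and the job is failed only after too many checkpoints in a row could not be completed.

```java
import java.io.IOException;

/** Hypothetical policy, not part of Flink: retry each checkpoint, tolerate a few failed ones. */
final class CheckpointRetrySketch {

    /** One attempt to write a checkpoint; may fail with a transient I/O error. */
    interface CheckpointAttempt {
        void run() throws IOException;
    }

    /** What the job should do after one checkpoint round. */
    enum Outcome { COMPLETED, SKIPPED, FAIL_JOB }

    private final int maxRetries;        // extra attempts per checkpoint
    private final int tolerableFailures; // consecutive failed checkpoints allowed
    private int consecutiveFailures = 0;

    CheckpointRetrySketch(int maxRetries, int tolerableFailures) {
        this.maxRetries = maxRetries;
        this.tolerableFailures = tolerableFailures;
    }

    Outcome runCheckpoint(CheckpointAttempt attempt) {
        for (int i = 0; i <= maxRetries; i++) {
            try {
                attempt.run();
                consecutiveFailures = 0; // a success resets the failure streak
                return Outcome.COMPLETED;
            } catch (IOException e) {
                // Assumed transient: fall through and try again.
            }
        }
        // All attempts failed: keep running unless the streak is too long.
        consecutiveFailures++;
        return consecutiveFailures > tolerableFailures ? Outcome.FAIL_JOB : Outcome.SKIPPED;
    }

    public static void main(String[] args) {
        CheckpointRetrySketch policy = new CheckpointRetrySketch(2, 1);
        int[] calls = {0};
        // Simulated store that fails twice, then succeeds.
        Outcome outcome = policy.runCheckpoint(() -> {
            if (++calls[0] < 3) {
                throw new IOException("simulated transient S3 error");
            }
        });
        System.out.println(outcome + " after " + calls[0] + " attempts");
    }
}
```

With `maxRetries = 2` and `tolerableFailures = 1`, a checkpoint that succeeds on its third attempt completes normally, one fully failed checkpoint is skipped, and a second consecutive failed checkpoint fails the job.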

Ron

Re: Why are checkpoint failures so serious?

Till Rohrmann
Hi Ron,

You should be able to turn off the task failure in case of a checkpoint
failure by setting `ExecutionConfig.setFailTaskOnCheckpointError(false)`.
With this setting, a checkpoint failure should only fail the distributed
checkpoint instead of failing the task.
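
As a rough sketch of where that call might go in a job's setup code, assuming a streaming job built on `StreamExecutionEnvironment` and a Flink version that ships this setter (the class name and the 60-second checkpoint interval are placeholders, not taken from this thread):

```java
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class TolerateCheckpointFailures {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env =
                StreamExecutionEnvironment.getExecutionEnvironment();

        // Checkpoint every 60 s; the interval is an arbitrary illustrative value.
        env.enableCheckpointing(60_000L);

        // The setting named above: a failed checkpoint should no longer fail the task.
        env.getConfig().setFailTaskOnCheckpointError(false);

        // ... define sources, transformations and sinks here, then call env.execute(...)
    }
}
```

This needs the Flink streaming dependency on the classpath and is only a configuration fragment; the job topology is left out.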

Cheers,
Till

On Tue, Feb 13, 2018 at 11:41 PM, Ron Crocker <[hidden email]> wrote:

> What would it take to be a little more flexible in handling checkpoint
> failures?
>
> Right now I have a team that’s checkpointing into S3, via the
> FsStateBackend and an appropriate URL. Sometimes these checkpoints fail.
> They’re transient, though, and a retry would likely work.
>
> However, when they fail, their job exits and restarts from the last
> checkpoint. That’s fine, but I’d rather it tried again before failing, and
> even after failing just keep running and do another checkpoint. Maybe this
> is something that should be configurable - # of retries, failure strategy, …
>
> Ron

Re: Why are checkpoint failures so serious?

Aljoscha Krettek-2
Hi Ron,

Keep in mind, though, that this feature will only be available with the upcoming Flink 1.5. Just making sure you don't go looking for this and are surprised if you don't find it.

Best,
Aljoscha


> On 14. Feb 2018, at 10:20, Till Rohrmann <[hidden email]> wrote:
>
> Hi Ron,
>
> you should be able to turn off the Task failure in case of a checkpoint
> failure by setting `ExecutionConfig.setFailTaskOnCheckpointError(false)`.
> This setting should change the behavior such that checkpoint failures will
> simply fail the distributed checkpoint.
>
> Cheers,
> Till


Re: Why are checkpoint failures so serious?

Ron Crocker
Thanks Till and Aljoscha. Are there good options for 1.4? I’d rather not fork to get this, but I’ll do it if I have to.

Ron

> On Feb 14, 2018, at 2:43 AM, Aljoscha Krettek <[hidden email]> wrote:
>
> Hi Ron,
>
> Keep in mind, though, that this feature will only be available with the upcoming Flink 1.5. Just making sure you don't go looking for this and are surprised if you don't find it.
>
> Best,
> Aljoscha


Re: Why are checkpoint failures so serious?

Aljoscha Krettek-2
Hi,

I think there's currently no option for achieving this on Flink 1.4.x.

Best,
Aljoscha

> On 15. Feb 2018, at 18:11, Ron Crocker <[hidden email]> wrote:
>
> Thanks Till and Aljoscha. Are there good options for 1.4? I’d rather not fork to get this, but I’ll do it if I have to.
>
> Ron