(DEPRECATED) Apache Flink Mailing List archive.

Why are checkpoint failures so serious?

Ron Crocker

Why are checkpoint failures so serious?

What would it take to be a little more flexible in handling checkpoint failures?

Right now I have a team that’s checkpointing into S3, via the FsStateBackend and an appropriate URL. Sometimes these checkpoints fail. They’re transient, though, and a retry would likely work.

However, when they fail, their job exits and restarts from the last checkpoint. That’s fine, but I’d rather it tried again before failing, and even after failing just keep running and do another checkpoint. Maybe this is something that should be configurable - # of retries, failure strategy, …

Ron
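
For readers skimming the thread, the behaviour Ron is asking for (retry a checkpoint a few times, and if it still fails, skip it and keep running) can be sketched in plain Java. This is only an illustration of the requested policy, not Flink code: the class `CheckpointRetrySketch`, the method `checkpointWithRetries`, and the `Outcome` enum are hypothetical names invented for this sketch.

```java
import java.io.IOException;
import java.util.concurrent.Callable;

// Hypothetical illustration of "retry, then skip" checkpoint handling.
// None of these names exist in Flink.
class CheckpointRetrySketch {

    /** Result of one checkpoint round. */
    public enum Outcome { COMPLETED, SKIPPED }

    /**
     * Runs the checkpoint action up to maxAttempts times.
     * If every attempt throws, the checkpoint is skipped and the caller
     * keeps running instead of failing the whole job.
     */
    public static Outcome checkpointWithRetries(Callable<Void> checkpointAction, int maxAttempts) {
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                checkpointAction.call();
                return Outcome.COMPLETED;
            } catch (Exception e) {
                System.out.println("attempt " + attempt + " failed: " + e.getMessage());
            }
        }
        return Outcome.SKIPPED;
    }

    public static void main(String[] args) {
        // Simulated transient failure: the first two writes fail, the third succeeds.
        int[] calls = {0};
        Callable<Void> flakyWrite = () -> {
            if (++calls[0] < 3) {
                throw new IOException("transient S3 error");
            }
            return null;
        };
        System.out.println(checkpointWithRetries(flakyWrite, 5)); // COMPLETED

        // A write that never succeeds is skipped after the retry budget is spent.
        Callable<Void> brokenWrite = () -> {
            throw new IOException("S3 unavailable");
        };
        System.out.println(checkpointWithRetries(brokenWrite, 2)); // SKIPPED
    }
}
```

The two knobs Ron mentions map onto `maxAttempts` (number of retries) and the choice to return `SKIPPED` rather than rethrow (failure strategy).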

Till Rohrmann

Re: Why are checkpoint failures so serious?

Hi Ron,

You should be able to turn off task failure on a checkpoint failure by
calling `setFailTaskOnCheckpointError(false)` on the job's `ExecutionConfig`.
With this setting, a checkpoint failure simply fails the distributed
checkpoint instead of failing the task.
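
A minimal sketch of where this setting would go in a streaming job, assuming a Flink version that ships the method (per this thread, 1.5 or later; later releases may deprecate or replace it). The class name, checkpoint interval, and toy pipeline are placeholders chosen for illustration, and the Flink streaming dependency must be on the classpath.

```java
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class TolerateCheckpointErrors {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Placeholder interval: trigger a checkpoint every 60 seconds.
        env.enableCheckpointing(60_000L);

        // The setting from this thread: a failed checkpoint is declined
        // rather than failing the task (and thereby restarting the job).
        env.getConfig().setFailTaskOnCheckpointError(false);

        // Toy pipeline so the job has something to run.
        env.fromElements(1, 2, 3).print();

        env.execute("tolerate-checkpoint-errors-sketch");
    }
}
```

Note that this only stops a failed checkpoint from failing the task; it does not retry the failed checkpoint. The next checkpoint is attempted at the next regular interval.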

Cheers,
Till

On Tue, Feb 13, 2018 at 11:41 PM, Ron Crocker <[hidden email]> wrote:

> What would it take to be a little more flexible in handling checkpoint
> failures?
>
> Right now I have a team that’s checkpointing into S3, via the
> FsStateBackend and an appropriate URL. Sometimes these checkpoints fail.
> They’re transient, though, and a retry would likely work.
>
> However, when they fail, their job exits and restarts from the last
> checkpoint. That’s fine, but I’d rather it tried again before failing, and
> even after failing just keep running and do another checkpoint. Maybe this
> is something that should be configurable - # of retries, failure strategy, …
>
> Ron

Aljoscha Krettek-2

Re: Why are checkpoint failures so serious?

Hi Ron,

Keep in mind, though, that this feature will only be available with the upcoming Flink 1.5. I mention it so that you aren't surprised if you go looking for it in an earlier release and don't find it.

Best,
Aljoscha

> On 14. Feb 2018, at 10:20, Till Rohrmann <[hidden email]> wrote:
>
> Hi Ron,
>
> you should be able to turn off the Task failure in case of a checkpoint
> failure by setting `ExecutionConfig.setFailTaskOnCheckpointError(false)`.
> This setting should change the behavior such that checkpoint failures will
> simply fail the distributed checkpoint.
>
> Cheers,
> Till

Ron Crocker

Re: Why are checkpoint failures so serious?

Thanks Till and Aljoscha. Are there good options for 1.4? I’d rather not fork to get this, but I’ll do it if I have to.

Ron

> On Feb 14, 2018, at 2:43 AM, Aljoscha Krettek <[hidden email]> wrote:
>
> Hi Ron,
>
> Keep in mind, though, that this feature will only be available with the upcoming Flink 1.5. Just making sure you don't go looking for this and are surprised if you don't find it.
>
> Best,
> Aljoscha

Aljoscha Krettek-2

Re: Why are checkpoint failures so serious?

Hi,

I think there's currently no option for achieving this on Flink 1.4.x.

Best,
Aljoscha

> On 15. Feb 2018, at 18:11, Ron Crocker <[hidden email]> wrote:
>
> Thanks Till and Aljoscha. Are there good options for 1.4? I’d rather not fork to get this, but I’ll do it if I have to.
>
> Ron