What would it take to be a little more flexible in handling checkpoint failures?
Right now I have a team that’s checkpointing into S3, via the FsStateBackend and an appropriate URL. Sometimes these checkpoints fail. The failures are transient, though, and a retry would likely work.

However, when a checkpoint fails, their job exits and restarts from the last checkpoint. That’s fine, but I’d rather it tried again before failing, and even after failing just kept running and did another checkpoint. Maybe this is something that should be configurable: # of retries, failure strategy, …

Ron
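As a rough illustration of the behavior Ron is asking for (a configurable number of retries plus a failure strategy), here is a minimal Java sketch. It is not Flink code: `CheckpointRetrySketch`, `FailureStrategy`, and `runWithRetries` are hypothetical names invented for this example, and the checkpoint write is simulated by a plain `Callable`.

```java
import java.io.IOException;
import java.util.concurrent.Callable;

// Hypothetical sketch only; none of these names are Flink APIs.
class CheckpointRetrySketch {

    /** What to do once every attempt for one checkpoint has failed. */
    enum FailureStrategy { FAIL_JOB, SKIP_CHECKPOINT }

    /**
     * Tries one checkpoint up to maxAttempts times.
     * Returns true if an attempt succeeded, false if the checkpoint was
     * skipped under SKIP_CHECKPOINT. Under FAIL_JOB the last error is rethrown.
     */
    static boolean runWithRetries(Callable<Void> checkpoint,
                                  int maxAttempts,
                                  FailureStrategy strategy) throws Exception {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be >= 1");
        }
        Exception last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                checkpoint.call();
                return true;          // checkpoint written
            } catch (Exception e) {
                last = e;             // assumed transient; try again
            }
        }
        if (strategy == FailureStrategy.FAIL_JOB) {
            throw last;               // give up and let the job fail
        }
        return false;                 // keep running, wait for the next checkpoint
    }

    public static void main(String[] args) throws Exception {
        int[] calls = {0};
        // Simulated upload that fails twice and succeeds on the third attempt.
        boolean ok = runWithRetries(() -> {
            if (++calls[0] < 3) {
                throw new IOException("simulated transient S3 error");
            }
            return null;
        }, 3, FailureStrategy.FAIL_JOB);
        System.out.println("succeeded=" + ok + " after " + calls[0] + " attempts");
    }
}
```

With `SKIP_CHECKPOINT`, exhausting the retries returns `false` instead of throwing, which corresponds to "just keep running and do another checkpoint".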
Hi Ron,
you should be able to turn off the Task failure in case of a checkpoint failure by setting `ExecutionConfig.setFailTaskOnCheckpointError(false)`. This setting should change the behavior such that checkpoint failures will simply fail the distributed checkpoint.

Cheers,
Till

On Tue, Feb 13, 2018 at 11:41 PM, Ron Crocker <[hidden email]> wrote:
> What would it take to be a little more flexible in handling checkpoint failures?
>
> Right now I have a team that’s checkpointing into S3, via the FsStateBackend and an appropriate URL. Sometimes these checkpoints fail. They’re transient, though, and a retry would likely work.
>
> However, when they fail, their job exits and restarts from the last checkpoint. That’s fine, but I’d rather it tried again before failing, and even after failing just keep running and do another checkpoint. Maybe this is something that should be configurable - # of retries, failure strategy, …
>
> Ron
Hi Ron,
Keep in mind, though, that this feature will only be available with the upcoming Flink 1.5. Just making sure you don't go looking for this and are surprised if you don't find it.

Best,
Aljoscha

> On 14. Feb 2018, at 10:20, Till Rohrmann <[hidden email]> wrote:
>
> Hi Ron,
>
> you should be able to turn off the Task failure in case of a checkpoint failure by setting `ExecutionConfig.setFailTaskOnCheckpointError(false)`. This setting should change the behavior such that checkpoint failures will simply fail the distributed checkpoint.
>
> Cheers,
> Till
Thanks Till and Aljoscha. Are there good options for 1.4? I’d rather not fork to get this, but I’ll do it if I have to.
Ron

> On Feb 14, 2018, at 2:43 AM, Aljoscha Krettek <[hidden email]> wrote:
>
> Hi Ron,
>
> Keep in mind, though, that this feature will only be available with the upcoming Flink 1.5. Just making sure you don't go looking for this and are surprised if you don't find it.
>
> Best,
> Aljoscha
Hi,
I think there's currently no option for achieving this on Flink 1.4.x.

Best,
Aljoscha

> On 15. Feb 2018, at 18:11, Ron Crocker <[hidden email]> wrote:
>
> Thanks Till and Aljoscha. Are there good options for 1.4? I’d rather not fork to get this, but I’ll do it if I have to.
>
> Ron