[jira] [Created] (FLINK-15731) Stop while Checkpoint is in-progress triggers Failover

Shang Yuanchun (Jira)
Konstantin Knauf created FLINK-15731:
----------------------------------------

             Summary: Stop while Checkpoint is in-progress triggers Failover
                 Key: FLINK-15731
                 URL: https://issues.apache.org/jira/browse/FLINK-15731
             Project: Flink
          Issue Type: Bug
          Components: Runtime / Checkpointing
    Affects Versions: 1.9.1, 1.10.0
            Reporter: Konstantin Knauf


Currently, when a job is stopped via {{stop}}, all in-progress checkpoints are aborted and a synchronous savepoint is started afterwards.

Since the number of tolerable checkpoint failures defaults to 0 (see {{org.apache.flink.streaming.api.environment.CheckpointConfig#getTolerableCheckpointFailureNumber}}), aborting any ongoing checkpoint effectively triggers a restart of the job.

As a consequence, if a checkpoint (or savepoint) is in progress, the stop call only triggers a failover of the job instead of stopping it.
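
The interaction described above can be illustrated with a small, self-contained model. This is only a sketch under assumptions: {{TolerantFailureCounter}}, {{StopModel}} and the outcome strings are hypothetical names invented for illustration and are not Flink classes; the sketch assumes that every checkpoint aborted by the stop call is counted as a failure and that a failover happens once the count exceeds the tolerable number.

```java
/** Hypothetical, simplified model of the failure accounting; not Flink code. */
final class TolerantFailureCounter {
    private final int tolerableFailures;
    private int failures;

    TolerantFailureCounter(int tolerableFailures) {
        this.tolerableFailures = tolerableFailures;
    }

    /** Records one failed checkpoint; returns true if the job would fail over. */
    boolean recordFailure() {
        failures++;
        return failures > tolerableFailures;
    }
}

/** Hypothetical model of a stop call that aborts all in-progress checkpoints. */
final class StopModel {
    static String stop(int inProgressCheckpoints, int tolerableFailures) {
        TolerantFailureCounter counter = new TolerantFailureCounter(tolerableFailures);
        // Assumption: each aborted checkpoint is reported as a checkpoint failure.
        for (int i = 0; i < inProgressCheckpoints; i++) {
            if (counter.recordFailure()) {
                return "FAILOVER";
            }
        }
        // No failover was triggered, so the synchronous savepoint can run.
        return "STOPPED_WITH_SAVEPOINT";
    }
}

final class StopDuringCheckpointDemo {
    public static void main(String[] args) {
        System.out.println(StopModel.stop(0, 0)); // no ongoing checkpoint
        System.out.println(StopModel.stop(1, 0)); // one ongoing checkpoint, default tolerance
        System.out.println(StopModel.stop(1, 1)); // one ongoing checkpoint, tolerance raised to 1
    }
}
```

In this model, a single ongoing checkpoint together with the default tolerance of 0 is enough to turn the stop call into a failover, which matches the behavior reported here.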

Possible options would be:

a) change the default number of tolerable checkpoint failures to at least the maximum number of concurrent checkpoints
b) do not count checkpoint failures caused by the stop action when checking against the tolerable checkpoint failures
c) do not abort pending checkpoints when stopping a job, but queue the synchronous savepoint after all current in-progress checkpoints
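
To make option b) more concrete, here is a minimal sketch of failure accounting that ignores aborts caused by the stop action. All names ({{AbortReason}}, {{ReasonAwareFailureCounter}}, {{onCheckpointAborted}}) are hypothetical and chosen for illustration only; this is not a proposal for the actual Flink implementation, just one way the idea could look.

```java
/** Hypothetical reasons for a checkpoint abort; not Flink's actual enum. */
enum AbortReason { DECLINED, EXPIRED, JOB_STOPPING }

/** Hypothetical counter that skips aborts caused by stopping the job. */
final class ReasonAwareFailureCounter {
    private final int tolerableFailures;
    private int countedFailures;

    ReasonAwareFailureCounter(int tolerableFailures) {
        this.tolerableFailures = tolerableFailures;
    }

    /** Returns true if this abort would push the job into a failover. */
    boolean onCheckpointAborted(AbortReason reason) {
        if (reason == AbortReason.JOB_STOPPING) {
            // Option b): aborts triggered by the stop action are not counted.
            return false;
        }
        countedFailures++;
        return countedFailures > tolerableFailures;
    }

    int countedFailures() {
        return countedFailures;
    }
}
```

With a tolerance of 0, {{onCheckpointAborted(AbortReason.JOB_STOPPING)}} returns false in this sketch, so the stop call could proceed to the synchronous savepoint, while any other abort reason would still trigger a failover as before. Until the default changes, users who hit this problem could presumably approximate option a) themselves by raising the tolerance through {{CheckpointConfig#setTolerableCheckpointFailureNumber}}.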

--
This message was sent by Atlassian Jira
(v8.3.4#803005)