[jira] [Created] (FLINK-18263) Allow external checkpoints to be persisted even when the job is in "Finished" state.

Shang Yuanchun (Jira)
Mark Cho created FLINK-18263:
--------------------------------

             Summary: Allow external checkpoints to be persisted even when the job is in "Finished" state.
                 Key: FLINK-18263
                 URL: https://issues.apache.org/jira/browse/FLINK-18263
             Project: Flink
          Issue Type: Improvement
          Components: Runtime / Checkpointing
            Reporter: Mark Cho


Currently, the `execution.checkpointing.externalized-checkpoint-retention` configuration option supports two values:
- `DELETE_ON_CANCELLATION`, which keeps the externalized checkpoints when the job ends in the FAILED or SUSPENDED state.
- `RETAIN_ON_CANCELLATION`, which keeps the externalized checkpoints when the job ends in the FAILED, SUSPENDED, or CANCELED state.

This gives us control over the retention of externalized checkpoints in every terminal state of a job except FINISHED.
If the job ends up in the FINISHED state, its externalized checkpoints are cleaned up automatically, and there is currently no config option to ensure that they are persisted.

I found an old Jira ticket FLINK-4512 where this was discussed. I think it would be helpful to have a config that can control the retention policy for FINISHED state as well.
- This can be useful for cases where we want to rewind a job (that reached the FINISHED state) to a previous checkpoint.
- When we use externalized checkpoints, we want to fully delegate checkpoint clean-up to an external process in all job states (without Flink singling out the FINISHED state for its own clean-up).

We have a quick fix working in our fork, where we've changed the `ExternalizedCheckpointCleanup` enum as follows:
{code:java}
RETAIN_ON_FAILURE (renamed from DELETE_ON_CANCELLATION; retains on FAILED)
RETAIN_ON_CANCELLATION (kept the same; retains on FAILED, CANCELED)
RETAIN_ON_SUCCESS (added; retains on FAILED, CANCELED, FINISHED)
{code}
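For illustration only, the three proposed values can be modeled as a small self-contained enum. This is a hypothetical sketch, not the code from the fork and not Flink's actual `ExternalizedCheckpointCleanup` class: `RetentionSketch`, `JobState`, `Cleanup`, and `retainsOn` are made-up names, and SUSPENDED is left out because the list above does not mention it.

```java
import java.util.EnumSet;
import java.util.Set;

/** Hypothetical model of the proposed retention policy; not Flink's real classes. */
class RetentionSketch {

    /** Simplified stand-in for the terminal job states named in the list above. */
    enum JobState { FAILED, CANCELED, FINISHED }

    /** Each value lists the terminal states in which externalized checkpoints are kept. */
    enum Cleanup {
        RETAIN_ON_FAILURE(EnumSet.of(JobState.FAILED)),
        RETAIN_ON_CANCELLATION(EnumSet.of(JobState.FAILED, JobState.CANCELED)),
        RETAIN_ON_SUCCESS(EnumSet.of(JobState.FAILED, JobState.CANCELED, JobState.FINISHED));

        private final Set<JobState> retainedStates;

        Cleanup(Set<JobState> retainedStates) {
            this.retainedStates = retainedStates;
        }

        /** Returns true if checkpoints should be kept when the job ends in the given state. */
        boolean retainsOn(JobState state) {
            return retainedStates.contains(state);
        }
    }

    public static void main(String[] args) {
        // Print the retain/delete decision for every policy and terminal state.
        for (Cleanup policy : Cleanup.values()) {
            for (JobState state : JobState.values()) {
                String decision = policy.retainsOn(state) ? "retain" : "delete";
                System.out.printf("%s / %s -> %s%n", policy, state, decision);
            }
        }
    }
}
```

Modeling each value as a set of retained states makes the policies strictly cumulative, which matches the list above: each value retains everything the previous one does plus one more terminal state.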
Since this requires changes to multiple components (config values, REST API, Web UI, etc.), I wanted to get the community's thoughts before investing more time in my quick-fix PR, which currently contains only the minimal change needed to get this working.