(DEPRECATED) Apache Flink Mailing List archive.

[FLIP-47] Savepoints vs Checkpoints

Kostas Kloudas-4

[FLIP-47] Savepoints vs Checkpoints

Hi Devs,

Currently there are a number of efforts around checkpoints/savepoints, as
reflected by the number of FLIPs. From a quick look, FLIP-34, FLIP-41,
FLIP-43, and FLIP-45 are all directly related to these topics. This
reflects the importance of these two notions/features to the users of the
framework.

Although many efforts are centred around these notions, their semantics and
the interplay between them are not always clearly defined. This makes them
difficult to explain to users (all the different combinations of
state-backends, formats and tradeoffs), and in some cases it may have
negative effects on users (e.g. the already-fixed-some-time-ago issue
of savepoints not being considered for recovery although they committed
side-effects).

FLIP-47 [1] and the related document [2] aim to start a discussion
around the semantics of savepoints/checkpoints and their interplay, and to
some extent to help us decide on the future steps concerning these notions.
For example, should we work towards bringing them closer, or moving them
further apart?

This is by no means a complete proposal, as many of the practical
implications can only be fleshed out after we agree on the basic semantics
and the general frame around these notions. To that end, there are no
concrete implementation steps, and the FLIP is going to be updated as the
discussion continues.

I am really looking forward to your opinions on the topic.

Cheers,
Kostas

[1]
https://cwiki.apache.org/confluence/display/FLINK/FLIP-47%3A+Checkpoints+vs.+Savepoints
[2]
https://docs.google.com/document/d/1_1FF8D3u0tT_zHWtB-hUKCP_arVsxlmjwmJ-TvZd4fs/edit?usp=sharing

Becket Qin

Re: [FLIP-47] Savepoints vs Checkpoints

Hi Kostas,

It makes a lot of sense to just have one underlying mechanism (snapshot) to
save the state of a Flink job. And we can use that mechanism in different
scenarios, including checkpoint and user-triggered savepoint.

To facilitate the discussion, maybe it is useful to clarify a few design
goals, for example:

1. One unified snapshot format that supports:
   - both incremental and global state saving
   - rescaling on recovery
   - compatibility check / migration across different Flink versions?
2. The snapshot can easily be managed by users.

And I have two questions regarding the FLIP.

1. What are the side-effects when taking a snapshot? Do you mean that taking
a snapshot may trigger some action other than saving the state of the job?
Technically speaking, taking a snapshot should be a "read-only" action on the
Flink job. So I assume that by side-effects you mean it is no longer
read-only. If so, can you be more specific about which side-effects you
are referring to?

2. In the rejected alternative, you mentioned a scenario of A/B testing. It
seems that if execution A and execution B run different configurations
after the savepoint, the history of the two jobs will always be different
after that, right?

Thanks,

Jiangjie (Becket) Qin

On Mon, Jul 8, 2019 at 9:53 PM Kostas Kloudas <[hidden email]> wrote:

> [...]

Congxian Qiu

Re: [FLIP-47] Savepoints vs Checkpoints

Hi Kostas

Thanks for bringing this up. Currently, there are indeed some overlaps
between checkpoints and savepoints that can confuse users. I think the
FLIP's proposal can give users a clearer description.

About the FLIP, I have a question about "Deleting or moving a snapshot
must be done by Flink". It seems we will support MOVE/DELETE for a stopped
job's snapshot. What should users do when they want to DELETE/MOVE
a stopped job's snapshot?

Best,
Congxian

On Wed, Jul 10, 2019 at 9:33 AM, Becket Qin <[hidden email]> wrote:

> [...]

Yu Li

Re: [FLIP-47] Savepoints vs Checkpoints

Hi all,

Please allow me to raise some points in connection with FLIP-45 [1] for
discussion, and please don't be confused if some of them are inconsistent
with or even opposite to the current proposals in FLIP-47 (with me as a
co-author). As Kostas pointed out, the discussion is still in progress and
hasn't reached a consensus, but we all agreed to move it to the public
list to collect more feedback.

FLIP-45 and FLIP-47 both touch on cleaning up the checkpoint and savepoint
concepts, but in two different ways. Below is my understanding of their
differences and pros/cons:

* FLIP-45 proposes to map the concepts of Flink checkpoint and savepoint to
database checkpoint and backup, and furthermore the periodic system-triggered
checkpoint to a fuzzy [2] checkpoint and the stop-with-checkpoint to a sharp
[3] checkpoint. It also raises the question of whether we should introduce a
Flink concept corresponding to the database snapshot, for which IMHO FLIP-47
could be a good starting point for discussion.

- Pros
  - No change from the user perspective, both conceptually and physically,
    thus no additional education cost. (The semantic corrections are mainly
    for developers to understand.)
  - Mapping the concepts to a mature system (database) could help make them
    clear, and make it easier to implement and explain db-like functions in
    the future, such as FLIP-43 [4] and streaming ledger [5]
- Cons
  - Less beneficial for developers with no database experience (they need to
    learn database concepts to understand Flink's)
  - One may argue that Flink is Flink (a stream processing engine), not a
    database
* FLIP-47 proposes to unify the concepts of Flink checkpoint and savepoint
into a single snapshot concept, with a unified command.

- Pros
  - Pure Flink concepts, no additional cost to learn/compare concepts
    in other systems
  - Unified semantics from the developer perspective
- Cons
  - Detectable change from the user perspective; users need to re-map the
    existing checkpoint/savepoint use cases to new commands
    - Currently: checkpoint for failover, savepoint for
      upgrade/state-migration/switch-backend/import-export/blue-red-deployment
    - Future: every use case maps to the newly introduced command, for
      example (the command format is just pseudo):
      - Command format:
        - take snapshot [mode] [format]
        - mode: full (default), incremental
        - format: UNIFIED, DEFAULT (default, backend-specific)
      - Use case:
        - Resume after stop/cancel: take snapshot incremental DEFAULT
        - Upgrade: take snapshot full DEFAULT
        - State migration: take snapshot full DEFAULT
        - Switch backend: take snapshot full UNIFIED
        - Blue/red deployment: take snapshot incremental DEFAULT
  - No new functionality supplied, but requires user action
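
As a rough illustration of the pseudo command format above, here is a minimal
Python sketch. The names `Mode`, `Format`, `USE_CASES`, and `snapshot_command`
are hypothetical and exist only for this example; no such command exists in
Flink, and the mapping simply restates the use-case list above.

```python
from enum import Enum


class Mode(Enum):
    FULL = "full"                # default mode in the pseudo format
    INCREMENTAL = "incremental"


class Format(Enum):
    DEFAULT = "DEFAULT"          # default, backend-specific format
    UNIFIED = "UNIFIED"          # assumed here to be readable by any backend


# Use case -> (mode, format), copied from the use-case list in this email.
USE_CASES = {
    "resume after stop/cancel": (Mode.INCREMENTAL, Format.DEFAULT),
    "upgrade": (Mode.FULL, Format.DEFAULT),
    "state migration": (Mode.FULL, Format.DEFAULT),
    "switch backend": (Mode.FULL, Format.UNIFIED),
    "blue/red deployment": (Mode.INCREMENTAL, Format.DEFAULT),
}


def snapshot_command(use_case: str) -> str:
    """Render the pseudo command for a use case (hypothetical syntax)."""
    mode, fmt = USE_CASES[use_case.lower()]
    return f"take snapshot {mode.value} {fmt.value}"


if __name__ == "__main__":
    # Print the pseudo command for every listed use case.
    for name in USE_CASES:
        print(f"{name}: {snapshot_command(name)}")
```

The table makes the re-mapping cost visible: a user has to know which
mode/format pair fits each use case, where today the choice is only between
checkpoint and savepoint.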

Please correct me or add to this if I've stated anything wrong or missed
anything @Kostas @Aljoscha @Konstantin. Thanks!

[1]
https://cwiki.apache.org/confluence/display/FLINK/FLIP-45%3A+Reinforce+Job+Stop+Semantic
[2]
https://dev.mysql.com/doc/refman/8.0/en/glossary.html#glos_fuzzy_checkpointing
[3]
https://dev.mysql.com/doc/refman/8.0/en/glossary.html#glos_sharp_checkpoint
[4]
https://cwiki.apache.org/confluence/display/FLINK/FLIP-43%3A+State+Processor+API
[5] https://github.com/dataArtisans/da-streamingledger

Best Regards,
Yu

On Wed, 10 Jul 2019 at 15:02, Congxian Qiu <[hidden email]> wrote:

> [...]

Aljoscha Krettek-2

Re: [FLIP-47] Savepoints vs Checkpoints

Hi,

Sorry for the quite late response!

I initially understood FLIP-45 [0] more as an “allow users to stop-with-checkpoint” proposal, which is why I didn’t think too much about the other things it mentions, like the semantics of savepoints and checkpoints. I thought that “stop-with-checkpoint” would work very well together with the ideas proposed in FLIP-47 [2]. Before FLIP-47 we were somewhat adamant about the distinction between savepoints and checkpoints, especially that the former should be user controlled and the latter should be system controlled. That wouldn’t work well for “stop-with-checkpoint” because we would further weaken that separation (which is already weakened by externalised retained checkpoints).

If we said (as FLIP-47 proposes) that there are only “snapshots” and that they have certain properties, like formats or whether they are user controlled, “stop-with-checkpoint” would fit neatly into that model: the snapshot created from “stop-with-checkpoint” (which should just be called “stop”) would be user controlled and in the format that is configured for the backend anyway (which could be incremental). That last point is important for users, because it is the whole motivation for “stop-with-checkpoint”: users want a quicker way of doing a clean stop that is not as heavy as “stop-with-savepoint”.
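
To make the “only snapshots, with properties” idea concrete, here is a minimal Python sketch. The `Snapshot` class, its field names, the format labels, and the three example values are assumptions for illustration only; they are not taken from FLIP-47 or from Flink's code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Snapshot:
    """One snapshot notion described by its properties (hypothetical model)."""
    user_controlled: bool      # True: the user owns its lifecycle; False: the system does
    fmt: str                   # illustrative label for the on-disk format
    may_be_incremental: bool   # whether the snapshot is allowed to be incremental


# Existing notions expressed as property combinations (illustrative values only).
PERIODIC_CHECKPOINT = Snapshot(user_controlled=False, fmt="backend-native", may_be_incremental=True)
SAVEPOINT = Snapshot(user_controlled=True, fmt="savepoint", may_be_incremental=False)
# "stop" (stop-with-checkpoint): user controlled, in the backend's configured format.
STOP_SNAPSHOT = Snapshot(user_controlled=True, fmt="backend-native", may_be_incremental=True)


def differing_properties(a: Snapshot, b: Snapshot) -> list:
    """Return the names of the properties on which two snapshots differ."""
    names = ("user_controlled", "fmt", "may_be_incremental")
    return [name for name in names if getattr(a, name) != getattr(b, name)]


if __name__ == "__main__":
    print(differing_properties(PERIODIC_CHECKPOINT, STOP_SNAPSHOT))
    print(differing_properties(SAVEPOINT, STOP_SNAPSHOT))
```

Under these assumed values, the “stop” snapshot differs from a periodic checkpoint only in who controls it, and from a savepoint only in format and incrementality, so it needs no separate concept once everything is a snapshot.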

I think we somehow have to converge on something that we all like about how we want to treat savepoints and checkpoints going forward. I think it’s important to look at it from the user’s perspective. Right now I see two possible things that users want:

- faster way of stopping than “stop-with-savepoint”
- potentially less confusion between what exactly is a checkpoint, savepoint, externalised retained checkpoint, etc.

For me, the first is more important than the second, but the second is also important to tackle.

What do you think? I have to read Yu’s email and FLIP-45 again more carefully and think about them; then I can have a more educated opinion.

Aljoscha

[0] https://cwiki.apache.org/confluence/display/FLINK/FLIP-45%3A+Reinforce+Job+Stop+Semantic
[1] https://issues.apache.org/jira/browse/FLINK-12619
[2] https://cwiki.apache.org/confluence/display/FLINK/FLIP-47%3A+Checkpoints+vs.+Savepoints

> On 10. Jul 2019, at 11:05, Yu Li <[hidden email]> wrote:
>
> [...]