Hi all,
Currently, savepoints are exactly the completed checkpoints, and Flink provides commands (save/run) to save and restore jobs. In the near future, however, savepoints will be very different from checkpoints, because they will use a common serialization format and allow recovery from major updates. Saving and restoring based on savepoints will therefore be more costly.

To provide efficient saving and restoring of jobs, we propose adding two more commands to Flink, SUSPEND and RESUME, which are based on checkpoints.

Because the implementation of checkpoints depends on the backends (and many other components in Flink), suspending and resuming may not work if there are major changes in the job or in Flink (e.g., different backends). But because the two commands are based on checkpoints instead of savepoints, they are expected to be more efficient.

The details of the design can be viewed in the Google Doc: Support Resuming and Suspending of Flink Jobs
<https://docs.google.com/document/d/1c3vUOTrNlCu2uhfi5ZNYpAguoFR03NgQWZpDTkSxVjg/edit?usp=sharing>

I look forward to your comments. Any feedback is appreciated. :)

Thanks,
Xiaogang
Sorry, I haven't followed this development, but roughly how much more costly is the new serialization for savepoints?

On Wed, Oct 12, 2016 at 5:51 AM, SHI Xiaogang <[hidden email]> wrote:
> Hi all,
>
> Currently, savepoints are exactly the completed checkpoints, and Flink
> provides commands (save/run) to allow saving and restoring jobs. But in the
> near future, savepoints will be very different from checkpoints because
> they will have common serialization formats and allow recover from major
> updates. The saving and restoring based on savepoints will be more costly.
>
> To provide efficient saving and restoring of jobs, we propose to add two
> more commands in Flink: SUSPEND and RESUME which are based on checkpoints.
>
> As the implementation of checkpoints depends on the backends (and many
> other components in Flink), suspending and resuming may not work if there
> exist major changes in the job or Flink (e.g., different backends). But as
> the implementation is based on checkpoints instead of savepoints, they are
> supposed to be more efficient.
>
> The details of the design can be viewed in the Google Doc: Support Resuming
> and Suspending of Flink Jobs
> <https://docs.google.com/document/d/1c3vUOTrNlCu2uhfi5ZNYpAguoFR03NgQWZpDTkSxVjg/edit?usp=sharing>
>
> Look forward to your comments. Any feedback is appreciated. :)
>
> Thanks,
> Xiaogang
Hi Greg,
at the moment, serializing a savepoint costs the same as serializing a checkpoint, because both use the same serialization logic. In fact, with Ufuk's changes [1], a savepoint is a checkpoint with special properties.

However, in the future we will probably have different serialization formats. The savepoint would use a generalized format that allows restoring it with a different state backend. When drawing a checkpoint, we don't have to do this, because we know that the state backend won't change. Thus, we could store the checkpoint in a more compressed format that exploits the characteristics of the respective state backend. But this is not implemented yet.

[1] https://github.com/apache/flink/pull/2608

Cheers,
Till

On Wed, Oct 12, 2016 at 2:35 PM, Greg Hogan <[hidden email]> wrote:
> Sorry, I haven't followed this development, but roughly how much more
> costly is the new serialization for savepoints?
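
To make the compatibility rule discussed in this thread concrete, here is a minimal sketch in Java. It is not Flink code: the class `SnapshotFormats`, the `Kind` enum, and the backend names "heap" and "rocksdb" used as plain strings are all hypothetical and chosen only for illustration. The sketch assumes that a savepoint-style snapshot can be restored by any backend, while a checkpoint-style snapshot (the basis of the proposed SUSPEND/RESUME) can only be restored by the backend that wrote it.

```java
import java.util.Map;
import java.util.TreeMap;

public class SnapshotFormats {

    /** Hypothetical snapshot kinds; these names are not part of any Flink API. */
    enum Kind { SAVEPOINT, CHECKPOINT }

    /** A snapshot remembers its kind, the backend that wrote it, and a copy of the state. */
    static final class Snapshot {
        final Kind kind;
        final String writtenBy;
        final Map<String, Long> state;

        Snapshot(Kind kind, String writtenBy, Map<String, Long> state) {
            this.kind = kind;
            this.writtenBy = writtenBy;
            this.state = state;
        }
    }

    /** Takes a snapshot of the given state on behalf of the named backend. */
    static Snapshot take(Kind kind, String backend, Map<String, Long> state) {
        return new Snapshot(kind, backend, new TreeMap<>(state));
    }

    /**
     * Restores a snapshot into the target backend.
     * Assumption of this sketch: savepoints are backend-independent,
     * checkpoints are tied to the backend that produced them.
     */
    static Map<String, Long> restore(Snapshot snapshot, String targetBackend) {
        if (snapshot.kind == Kind.CHECKPOINT && !snapshot.writtenBy.equals(targetBackend)) {
            throw new IllegalStateException(
                "checkpoint written by '" + snapshot.writtenBy
                    + "' cannot be resumed on '" + targetBackend + "'");
        }
        return new TreeMap<>(snapshot.state);
    }

    public static void main(String[] args) {
        Map<String, Long> state = new TreeMap<>();
        state.put("count", 42L);

        // A savepoint-style snapshot survives a change of backend.
        Snapshot savepoint = take(Kind.SAVEPOINT, "heap", state);
        System.out.println(restore(savepoint, "rocksdb"));

        // A checkpoint-style snapshot is rejected when the backend changes.
        Snapshot checkpoint = take(Kind.CHECKPOINT, "heap", state);
        try {
            restore(checkpoint, "rocksdb");
        } catch (IllegalStateException e) {
            System.out.println("rejected: " + e.getMessage());
        }
    }
}
```

The sketch models only the restore-time check. The efficiency argument from the thread (a backend-specific format can be more compressed than a generalized one) is not modelled, since the thread notes that the separate formats are not implemented yet.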