(DEPRECATED) Apache Flink Mailing List archive.

[DISCUSS] Support Suspending and Resuming of Flink Jobs

SHI Xiaogang

[DISCUSS] Support Suspending and Resuming of Flink Jobs

Hi all,

Currently, savepoints are simply completed checkpoints, and Flink
provides commands (save/run) for saving and restoring jobs. In the
near future, however, savepoints will differ substantially from checkpoints:
they will use a common serialization format and allow recovery across major
updates. Saving and restoring based on savepoints will therefore be more costly.

To provide efficient saving and restoring of jobs, we propose adding two
more commands to Flink, SUSPEND and RESUME, which are based on checkpoints.

Because the implementation of checkpoints depends on the state backends (and
many other components in Flink), suspending and resuming may not work if there
are major changes in the job or in Flink (e.g., a different backend). But
because SUSPEND and RESUME are built on checkpoints rather than savepoints,
they should be more efficient.
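
The trade-off above can be illustrated with a toy sketch in Python. This is not Flink code: `Snapshot`, `suspend`, `savepoint`, `resume`, and the backend labels are hypothetical names, and the only rule modelled is the one stated above, namely that a checkpoint-based snapshot can be resumed only on the backend that wrote it.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Snapshot:
    """Hypothetical stand-in for a snapshot of job state."""
    backend: str    # backend that wrote the snapshot, e.g. "rocksdb"
    portable: bool  # True for a savepoint-style, backend-independent format
    state: dict


def suspend(state: dict, backend: str) -> Snapshot:
    # SUSPEND is modelled as a checkpoint: tied to the backend, not portable.
    return Snapshot(backend=backend, portable=False, state=dict(state))


def savepoint(state: dict, backend: str) -> Snapshot:
    # A savepoint is modelled as portable (the assumed future format).
    return Snapshot(backend=backend, portable=True, state=dict(state))


def resume(snapshot: Snapshot, backend: str) -> dict:
    # A non-portable snapshot can only be read by the backend that wrote it.
    if not snapshot.portable and snapshot.backend != backend:
        raise ValueError(
            f"checkpoint written by {snapshot.backend!r} "
            f"cannot be resumed on {backend!r}"
        )
    return dict(snapshot.state)
```

In this model, `resume(suspend(state, "rocksdb"), "heap")` raises an error, while the same call on a `savepoint` succeeds; the extra cost of the portable format is not modelled.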

The details of the design can be found in the Google Doc "Support Resuming
and Suspending of Flink Jobs":
<https://docs.google.com/document/d/1c3vUOTrNlCu2uhfi5ZNYpAguoFR03NgQWZpDTkSxVjg/edit?usp=sharing>

I look forward to your comments. Any feedback is appreciated. :)

Thanks,
Xiaogang

Greg Hogan

Re: [DISCUSS] Support Suspending and Resuming of Flink Jobs

Sorry, I haven't followed this development, but roughly how much more
costly is the new serialization for savepoints?

On Wed, Oct 12, 2016 at 5:51 AM, SHI Xiaogang <[hidden email]>
wrote:

> Hi all,
>
> Currently, savepoints are exactly the completed checkpoints, and Flink
> provides commands (save/run) to allow saving and restoring jobs. But in the
> near future, savepoints will be very different from checkpoints because
> they will have common serialization formats and allow recover from major
> updates. The saving and restoring based on savepoints will be more costly.
>
> To provide efficient saving and restoring of jobs, we propose to add two
> more commands in Flink: SUSPEND and RESUME which are based on checkpoints.
>
> As the implementation of checkpoints depends on the backends (and many
> other components in Flink), suspending and resuming may not work if there
> exist major changes in the job or Flink (e.g., different backends). But as
> the implementation is based on checkpoints instead of savepoints, they are
> supposed to be more efficient.
>
> The details of the design can be viewed in the Google Doc: Support Resuming
> and Suspending of Flink Jobs
> <https://docs.google.com/document/d/1c3vUOTrNlCu2uhfi5ZNYpAguoFR03
> NgQWZpDTkSxVjg/edit?usp=sharing>
> .
>
> Look forward to your comments. Any feedback is appreciated. :)
>
> Thanks,
> Xiaogang
>

Till Rohrmann

Re: [DISCUSS] Support Suspending and Resuming of Flink Jobs

Hi Greg,

At the moment, serializing a savepoint costs the same as serializing a
checkpoint, because both use the same serialization logic. In fact, with
Ufuk's changes [1], a savepoint is a checkpoint with special properties.
In the future, however, we will probably have different serialization
formats. A savepoint would use a generalized format that allows it to be
restored with a different state backend. When drawing a checkpoint, we
don't have to do this, because we know that the state backend won't
change. Thus, we could store the checkpoint in a more compressed format
that exploits the characteristics of the respective state backend. But
this is not implemented yet.

[1] https://github.com/apache/flink/pull/2608
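
As a rough illustration of the difference between a generalized and a backend-specific format, here is a minimal Python sketch. It is not how Flink serializes state: JSON stands in for a self-describing format any reader can decode, `struct` stands in for a compact layout, and `BACKEND_KEYS` is a hypothetical fixed key order that only the writing backend knows.

```python
import json
import struct

# Hypothetical layout known only to one backend: a fixed, ordered set of keys.
BACKEND_KEYS = ("clicks", "views")


def write_generalized(state: dict) -> bytes:
    # Self-describing: key names travel with the values.
    return json.dumps(state, sort_keys=True).encode("utf-8")


def read_generalized(data: bytes) -> dict:
    return json.loads(data.decode("utf-8"))


def write_backend_specific(state: dict) -> bytes:
    # Only the values are stored, as unsigned 32-bit big-endian integers,
    # in the order fixed by BACKEND_KEYS.
    values = [state[key] for key in BACKEND_KEYS]
    return struct.pack(f">{len(BACKEND_KEYS)}I", *values)


def read_backend_specific(data: bytes) -> dict:
    # Decoding requires knowing BACKEND_KEYS; another backend could not read this.
    values = struct.unpack(f">{len(BACKEND_KEYS)}I", data)
    return dict(zip(BACKEND_KEYS, values))
```

Both encodings round-trip the same state, but the backend-specific bytes are shorter because the layout is implied rather than written out, which is also why they cannot be read without that layout.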

Cheers,
Till

On Wed, Oct 12, 2016 at 2:35 PM, Greg Hogan <[hidden email]> wrote:

> Sorry, I haven't followed this development, but roughly how much more
> costly is the new serialization for savepoints?