[DISCUSS] FLIP-33: Terminate/Suspend Job with Savepoint

[DISCUSS] FLIP-33: Terminate/Suspend Job with Savepoint

Kostas Kloudas-3
Hi everyone,

 A commonly used feature of Flink is the "cancel-with-savepoint"
operation. When applied to the current exactly-once sinks, the current
implementation can be problematic, as it does not guarantee that Flink
commits side effects to the third-party storage system.

 This discussion aims to fix this issue and proposes adding two
termination modes, namely:
    1) SUSPEND, for temporarily stopping the job, e.g. when upgrading the
Flink version in your cluster
    2) TERMINATE, for a terminal shutdown that ends the stream, emits
MAX_WATERMARK, and flushes any state associated with (event-time)
timers
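
One way to picture the difference between the two modes is the small Python model below. It is only an illustrative sketch: the `TerminationMode` enum, its fields, and `shutdown_steps` are hypothetical names that are not part of Flink or of the FLIP, and the steps simply restate the description above.

```python
from enum import Enum


class TerminationMode(Enum):
    """Hypothetical model of the two proposed termination modes."""

    # Values are (emits MAX_WATERMARK, job is expected to be resumed later).
    SUSPEND = (False, True)
    TERMINATE = (True, False)

    def __init__(self, emits_max_watermark, resumable):
        self.emits_max_watermark = emits_max_watermark
        self.resumable = resumable


def shutdown_steps(mode):
    """Return the ordered steps this sketch assumes for a given mode."""
    steps = []
    if mode.emits_max_watermark:
        # TERMINATE ends the stream: advancing event time to the maximum
        # fires all pending event-time timers and flushes their state.
        steps.append("emit MAX_WATERMARK")
    steps.append("take savepoint")
    steps.append("commit side effects of exactly-once sinks")
    steps.append("shut down job")
    return steps


if __name__ == "__main__":
    for mode in TerminationMode:
        print(mode.name, "->", ", ".join(shutdown_steps(mode)))
```

In this model the only behavioral difference is the extra MAX_WATERMARK step; everything else (savepoint, committed side effects) is shared by both modes.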

A Google doc with the FLIP proposal can be found here:
https://docs.google.com/document/d/1EZf6pJMvqh_HeBCaUOnhLUr9JmkhfPgn6Mre_z6tgp8/edit?usp=sharing

And the page for the FLIP is here:
https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=103090212

 The implementation sketch is far from complete, but it is worth having a
discussion on the semantics as soon as possible. The implementation section
is going to be updated soon.

 Looking forward to the discussion,
 Kostas

--

Kostas Kloudas | Software Engineer


<https://www.ververica.com/>

Follow us @VervericaData

--

Join Flink Forward <https://flink-forward.org/> - The Apache Flink
Conference

Stream Processing | Event Driven | Real Time

--

Data Artisans GmbH | Invalidenstrasse 115, 10115 Berlin, Germany

--
Data Artisans GmbH
Registered at Amtsgericht Charlottenburg: HRB 158244 B
Managing Directors: Dr. Kostas Tzoumas, Dr. Stephan Ewen

Re: [DISCUSS] FLIP-33: Terminate/Suspend Job with Savepoint

Stephan Ewen
Thank you for starting this feature discussion.
This feature has been requested many times; great to see it happening.

+1 for this FLIP

On Tue, Feb 12, 2019 at 5:28 PM Kostas Kloudas <[hidden email]>
wrote:


Re: [DISCUSS] FLIP-33: Terminate/Suspend Job with Savepoint

Fabian Hueske-2
Thanks for working on improving cancel-with-savepoint, Kostas.
Distinguishing the termination modes would be a big step forward, IMO.

Btw, there is already another FLIP-33 on the way, so this one should be
FLIP-34.

Cheers,
Fabian

On Tue, Feb 12, 2019 at 6:49 PM Stephan Ewen <[hidden email]> wrote:


Re: [DISCUSS] FLIP-33: Terminate/Suspend Job with Savepoint

jincheng sun
Thank you for starting the discussion about cancel-with-savepoint, Kostas.

+1 for the FLIP.

Cheers,
Jincheng

On Wed, Feb 13, 2019 at 4:31 AM Fabian Hueske <[hidden email]> wrote:


Re: [DISCUSS] FLIP-33: Terminate/Suspend Job with Savepoint

Guowei Ma
Thanks for starting this discussion. It is a very cool feature.

+1 for the FLIP

Best
Guowei

On Wed, Feb 13, 2019 at 9:35 AM jincheng sun <[hidden email]> wrote:


Re: [DISCUSS] FLIP-33: Terminate/Suspend Job with Savepoint

Biao Liu
In reply to this post by jincheng sun
Thanks for bringing up this discussion.
I like the idea. It's really useful in production scenarios.

+1 for the proposal.

On Wed, Feb 13, 2019 at 9:35 AM jincheng sun <[hidden email]> wrote:


Re: [DISCUSS] FLIP-33: Terminate/Suspend Job with Savepoint

Elias Levy
In reply to this post by Kostas Kloudas-3
Apologies for the late reply.

I think this is badly needed, but I fear we are adding complexity by
introducing two more stop commands.  We'll have: cancel, stop,
terminate, and suspend.  We basically want to do two things: terminate a
job with prejudice or stop a job safely.

For the former, "cancel" is the appropriate term, and there should be no
need for a cancel-with-checkpoint option.  If the job was configured to use
externalized checkpoints and it ran long enough, a checkpoint will be
available for it.

For the latter, "stop" is the appropriate term, and it means that a job
should process no messages after the checkpoint barrier and that it should
ensure that exactly-once sinks complete their two-phase commits
successfully.  If a savepoint was requested, one should be created.

So in my mind there are two commands, cancel and stop, with appropriate
semantics.  Emitting MAX_WATERMARK before the checkpoint barrier during
stop is merely an optional behavior, like the creation of a savepoint.  But
if a specific command for it is desired, then "drain" seems appropriate.
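
To make the "stop" semantics described above concrete, here is a toy Python model. It is purely illustrative: `stop`, `BARRIER`, `MAX_WATERMARK`, and the shape of the result are made up for this sketch and do not correspond to Flink APIs; the sketch only assumes that records before the barrier are processed and records after it are not.

```python
import sys

BARRIER = object()           # stand-in for the final checkpoint barrier
MAX_WATERMARK = sys.maxsize  # stand-in for an "end of event time" watermark


def stop(records, barrier_index, drain=False, savepoint=False):
    """Model a safe stop: nothing after the barrier is processed.

    records:       the records the source would have emitted
    barrier_index: position at which the stop barrier is injected
    drain:         optionally emit MAX_WATERMARK before the barrier
    savepoint:     optionally create a savepoint as part of the stop
    """
    stream = list(records[:barrier_index])
    if drain:
        # Optional behavior: advance event time to the end before the barrier.
        stream.append(("watermark", MAX_WATERMARK))
    stream.append(BARRIER)
    stream.extend(records[barrier_index:])  # input that arrives "too late"

    processed = []
    for element in stream:
        if element is BARRIER:
            break  # no messages are processed after the barrier
        processed.append(element)

    outcome = {
        # A safe stop waits for exactly-once sinks to finish their commits.
        "two_phase_commit_completed": True,
        "savepoint_created": savepoint,
    }
    return processed, outcome


if __name__ == "__main__":
    print(stop(["a", "b", "c"], barrier_index=2))
    print(stop(["a", "b", "c"], barrier_index=2, drain=True, savepoint=True))
```

Both the drain step and the savepoint are plain optional arguments here, which mirrors the view that they are options of a single "stop" rather than separate commands.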

On Tue, Feb 12, 2019 at 9:50 AM Stephan Ewen <[hidden email]> wrote:

> Hi Elias!
>
> I remember you brought this missing feature up in the past. Do you think
> the proposed enhancement would work for your use case?
>
> Best,
> Stephan

Re: [DISCUSS] FLIP-33: Terminate/Suspend Job with Savepoint

Ufuk Celebi-2
I really like this effort. I think the original plan for
"cancel-with-savepoint" was always to just be a workaround until we
arrived at a better solution as proposed here.

Regarding the FLIP, I agree with Elias' comments. I think the number of
termination modes the FLIP introduces can be overwhelming, and I would
personally rather follow Elias' proposal. In the context of the proposal,
this would result in the following:
- "terminate" becomes "stop --drain"
- "suspend" becomes "stop --with-savepoint"
- "cancel-with-savepoint" is superseded by "stop --with-savepoint"

I have two remaining questions:

1) @Kostas: Elias suggests for stop that "a job should process no
messages after the checkpoint barrier". This is something that needs
support from the sources. Is this in the scope of your proposal (I
think not)? If not, is there a future plan for this?

2) Would we need to introduce a new command/name for "stop", as we
already have a "stop" command? Assuming that no users actually use
the existing "stop" command, since no major sources are stoppable
(I think), I would personally suggest upgrading the existing "stop"
command to the proposed one. If, on the other hand, we know of users
that rely on the current "stop" command, we'd need to find another
name for it.

Best,

Ufuk

On Wed, Mar 6, 2019 at 12:27 AM Elias Levy <[hidden email]> wrote:


Re: [DISCUSS] FLIP-33: Terminate/Suspend Job with Savepoint

Kostas Kloudas-4
Hi,

Thanks for the comments.
I agree with Ufuk's and Elias' proposal.

- "cancel" remains the good old "cancel"
- "terminate" becomes "stop --drain-with-savepoint"
- "suspend" becomes "stop --with-savepoint"
- "cancel-with-savepoint" is subsumed by "stop --with-savepoint"

As you can see from the above, I would also have both "terminate" and
"suspend" keep a savepoint by default.
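
The renaming above can be written down as a small lookup table. This is only a sketch in Python: the command strings are the ones proposed in this thread (proposals under discussion, not a statement about any released Flink CLI), and `PROPOSED_COMMANDS`, `proposed_command`, and `takes_savepoint` are hypothetical helper names.

```python
# Command names as proposed in this thread; they are proposals under
# discussion here, not a statement about any released Flink CLI.
PROPOSED_COMMANDS = {
    "cancel": "cancel",
    "terminate": "stop --drain-with-savepoint",
    "suspend": "stop --with-savepoint",
    "cancel-with-savepoint": "stop --with-savepoint",
}


def proposed_command(old_name):
    """Look up the proposed replacement for an operation name."""
    try:
        return PROPOSED_COMMANDS[old_name]
    except KeyError:
        raise ValueError(f"unknown operation: {old_name!r}") from None


def takes_savepoint(old_name):
    """True if the proposed replacement ends with a savepoint."""
    return "savepoint" in proposed_command(old_name)


if __name__ == "__main__":
    for name in PROPOSED_COMMANDS:
        print(f"{name:24} -> {proposed_command(name)}")
```

The table makes the "savepoint by default" point visible: every entry except plain "cancel" maps to a command that takes a savepoint.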

As for Ufuk's remarks:

1) You are correct that, to properly prevent elements from being fed into
the pipeline after the checkpoint barrier, we need support from the
sources. This is more the responsibility of FLIP-27:
https://cwiki.apache.org/confluence/display/FLINK/FLIP-27%3A+Refactor+Source+Interface

2) I would lean more towards replacing the old "stop" command with the new
one. But, as you said, I have no view of how many users (if any) rely on
the old "stop" command for their use cases.

Cheers,
Kostas



On Wed, Mar 6, 2019 at 9:52 PM Ufuk Celebi <[hidden email]> wrote:


Re: [DISCUSS] FLIP-33: Terminate/Suspend Job with Savepoint

Aljoscha Krettek-2
I agree and already created a Jira issue for removing the old “stop” feature as preparation: https://issues.apache.org/jira/browse/FLINK-11889

Aljoscha

> On 7. Mar 2019, at 11:08, Kostas Kloudas <[hidden email]> wrote:
>
> Hi,
>
> Thanks for the comments.
> I agree with the Ufuk's and Elias' proposal.
>
> - "cancel" remains the good old "cancel"
> - "terminate" becomes "stop --drain-with-savepoint"
> - "suspend" becomes "stop --with-savepoint"
> - "cancel-with-savepoint" is subsumed by "stop --with-savepoint"
>
> As you see from the previous, I would also add "terminate" and "suspend"
> to result in keeping a savepoint by default.
>
> As for Ufuk's remarks:
>
> 1) You are correct that to have a proper way to not allow elements to be
> fed in the pipeline
> after the checkpoint barrier, we need support from the sources. This is
> more the responsibility
> of FLIP-27
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-27%3A+Refactor+Source+Interface
>
> 2) I would lean more towards replacing the old "stop" command with the new
> one. But, as you said,
> I have no view of how many users (if any) rely on the old "stop" command
> for their usecases.
>
> Cheers,
> Kostas
>
>
>
> On Wed, Mar 6, 2019 at 9:52 PM Ufuk Celebi <[hidden email]> wrote:
>
>> I really like this effort. I think the original plan for
>> "cancel-with-savepoint" was always to just be a workaround until we
>> arrived at a better solution as proposed here.
>>
>> Regarding the FLIP, I agree with Elias comments. I think the number of
>> termination modes the FLIP introduces can be overwhelming and I would
>> personally rather follow Elias' proposal. In context of the proposal,
>> this would result in the following:
>> - "terminate" becomes "stop --drain"
>> - "suspend" becomes "stop --with-savepoint"
>> - "cancel-with-savepoint" is superseded by "stop --with-savepoint"
>>
>> I have two remaining questions:
>>
>> 1) @Kostas: Elias suggests for stop that "a job should process no
>> messages after the checkpoints barrier". This is something that needs
>> support from the sources. Is this in the scope of your proposal (I
>> think not)? If not, is there a future plan for this?
>>
>> 2) Would we need to introduce a new command/name for "stop" as we
>> already have a "stop" command? Assuming that there are no users that
>> actually use the existing "stop" command as no major sources are
>> stoppable (I think), I would personally suggest upgrading the
>> existing "stop" command to the proposed one. If, on the other hand,
>> we know of users that rely on the current "stop" command, we'd need to
>> find another name for it.
>>
>> Best,
>>
>> Ufuk
>>
>> On Wed, Mar 6, 2019 at 12:27 AM Elias Levy <[hidden email]>
>> wrote:
>>>
>>> Apologies for the late reply.
>>>
>>> I think this is badly needed, but I fear we are adding complexity by
>>> introducing two more stop commands.  We'll have: cancel, stop,
>>> terminate, and suspend.  We basically want to do two things: terminate a
>>> job with prejudice or stop a job safely.
>>>
>>> For the former, "cancel" is the appropriate term, and there should be no
>>> need for a cancel-with-checkpoint option.  If the job was configured to use
>>> externalized checkpoints and it ran long enough, a checkpoint will be
>>> available for it.
>>>
>>> For the latter, "stop" is the appropriate term, and it means that a job
>>> should process no messages after the checkpoints barrier and that it should
>>> ensure that exactly-once sinks complete their two-phase commits
>>> successfully.  If a savepoint was requested, one should be created.
>>>
>>> So in my mind there are two commands, cancel and stop, with appropriate
>>> semantics.  Emitting MAX_WATERMARK before the checkpoint barrier during
>>> stop is merely an optional behavior, like creation of a savepoint.  But if
>>> a specific command for it is desired, then "drain" seems appropriate.
>>>
>>> On Tue, Feb 12, 2019 at 9:50 AM Stephan Ewen <[hidden email]> wrote:
>>>
>>>> Hi Elias!
>>>>
>>>> I remember you brought this missing feature up in the past. Do you think
>>>> the proposed enhancement would work for your use case?
>>>>
>>>> Best,
>>>> Stephan
>>>>
>>>> ---------- Forwarded message ---------
>>>> From: Kostas Kloudas <[hidden email]>
>>>> Date: Tue, Feb 12, 2019 at 5:28 PM
>>>> Subject: [DISCUSS] FLIP-33: Terminate/Suspend Job with Savepoint
>>>> To: <[hidden email]>
>>>>
>>>>
>>>> Hi everyone,
>>>>
>>>> A commonly used functionality offered by Flink is the
>>>> "cancel-with-savepoint" operation. When applied to the current exactly-once
>>>> sinks, the current implementation of the feature can be problematic, as it
>>>> does not guarantee that side-effects will be committed by Flink to the 3rd
>>>> party storage system.
>>>>
>>>> This discussion targets fixing this issue and proposes the addition of two
>>>> termination modes, namely:
>>>>    1) SUSPEND, for temporarily stopping the job, e.g. for Flink version
>>>> upgrading in your cluster
>>>>    2) TERMINATE, for terminal shut down which ends the stream and sends
>>>> MAX_WATERMARK time, and flushes any state associated with (event time)
>>>> timers
>>>>
>>>> A google doc with the FLIP proposal can be found here:
>>>> https://docs.google.com/document/d/1EZf6pJMvqh_HeBCaUOnhLUr9JmkhfPgn6Mre_z6tgp8/edit?usp=sharing
>>>>
>>>> And the page for the FLIP is here:
>>>> https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=103090212
>>>>
>>>> The implementation sketch is far from complete, but it is worth having a
>>>> discussion on the semantics as soon as possible. The implementation section
>>>> is going to be updated soon.
>>>>
>>>> Looking forward to the discussion,
>>>> Kostas
>>>>
>>>>
>>


Re: [DISCUSS] FLIP-33: Terminate/Suspend Job with Savepoint

Kostas Kloudas-4
Thanks a lot Aljoscha!

On Tue, Mar 12, 2019 at 2:50 PM Aljoscha Krettek <[hidden email]>
wrote:

> I agree and already created a Jira issue for removing the old “stop”
> feature as preparation: https://issues.apache.org/jira/browse/FLINK-11889
>
> Aljoscha