Hi everyone,

A commonly used functionality offered by Flink is the "cancel-with-savepoint" operation. When applied to the current exactly-once sinks, the current implementation of the feature can be problematic, as it does not guarantee that side-effects will be committed by Flink to the 3rd party storage system.

This discussion targets fixing this issue and proposes the addition of two termination modes, namely:
1) SUSPEND, for temporarily stopping the job, e.g. for Flink version upgrading in your cluster
2) TERMINATE, for terminal shut down which ends the stream and sends MAX_WATERMARK time, and flushes any state associated with (event time) timers

A google doc with the FLIP proposal can be found here:
https://docs.google.com/document/d/1EZf6pJMvqh_HeBCaUOnhLUr9JmkhfPgn6Mre_z6tgp8/edit?usp=sharing

And the page for the FLIP is here:
https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=103090212

The implementation sketch is far from complete, but it is worth having a discussion on the semantics as soon as possible. The implementation section is going to be updated soon.

Looking forward to the discussion,
Kostas

--
Kostas Kloudas | Software Engineer
<https://www.ververica.com/>
Follow us @VervericaData
--
Join Flink Forward <https://flink-forward.org/> - The Apache Flink Conference
Stream Processing | Event Driven | Real Time
--
Data Artisans GmbH | Invalidenstrasse 115, 10115 Berlin, Germany
--
Data Artisans GmbH
Registered at Amtsgericht Charlottenburg: HRB 158244 B
Managing Directors: Dr. Kostas Tzoumas, Dr. Stephan Ewen

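To make the problem concrete, here is a minimal toy sketch of why a savepoint followed by an immediate cancel can leave sink side-effects uncommitted. It assumes the sink stages its output when the savepoint is taken and only publishes it once it is told the savepoint completed; the class and function names (`ToyTwoPhaseSink`, `cancel_with_savepoint`, `stop_with_savepoint`) are hypothetical and are not Flink's API.

```python
class ToyTwoPhaseSink:
    """Toy stand-in for an exactly-once sink (hypothetical, not Flink's API)."""

    def __init__(self):
        self.pending = []        # records written since the last snapshot
        self.pre_committed = []  # records staged by a snapshot, not yet visible
        self.committed = []      # records visible in the external system

    def write(self, record):
        self.pending.append(record)

    def snapshot(self):
        # Phase 1 (assumed): the savepoint stages everything written so far.
        self.pre_committed.extend(self.pending)
        self.pending.clear()

    def notify_complete(self):
        # Phase 2 (assumed): the completion notification publishes staged data.
        self.committed.extend(self.pre_committed)
        self.pre_committed.clear()


def cancel_with_savepoint(sink):
    # Savepoint is taken, then the job is cancelled; in this model the
    # completion notification is never delivered to the sink.
    sink.snapshot()
    return sink


def stop_with_savepoint(sink):
    # Savepoint is taken and the sink is allowed to finish its commit
    # before the job shuts down.
    sink.snapshot()
    sink.notify_complete()
    return sink


def make_sink():
    sink = ToyTwoPhaseSink()
    sink.write("a")
    sink.write("b")
    return sink


cancelled = cancel_with_savepoint(make_sink())
stopped = stop_with_savepoint(make_sink())
print(cancelled.committed, cancelled.pre_committed)
print(stopped.committed, stopped.pre_committed)
```

In this model the cancelled job leaves its records staged but never published, while the stopped job publishes them before shutting down, which is the guarantee the proposed termination modes are after.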
Thank you for starting this feature discussion.
This is a feature that has been requested various times, great to see it happening.

+1 for this FLIP

On Tue, Feb 12, 2019 at 5:28 PM Kostas Kloudas <[hidden email]> wrote:
> [...]

Thanks for working on improving cancel-with-savepoint, Kostas.
Distinguishing the termination modes would be a big step forward, IMO.

Btw., there is already another FLIP-33 on the way.
This one should be FLIP-34.

Cheers,
Fabian

On Tue, Feb 12, 2019 at 18:49, Stephan Ewen <[hidden email]> wrote:
> [...]

Thank you for starting the discussion about cancel-with-savepoint, Kostas.

+1 for the FLIP.

Cheers,
Jincheng

On Wed, Feb 13, 2019 at 4:31 AM, Fabian Hueske <[hidden email]> wrote:
> [...]

Thanks for starting this discussion. It is a very cool feature.

+1 for the FLIP

Best,
Guowei

On Wed, Feb 13, 2019 at 9:35 AM, jincheng sun <[hidden email]> wrote:
> [...]

Thanks for bringing us this discussion.
I like the idea. It's really useful in production scenarios.

+1 for the proposal.

On Wed, Feb 13, 2019 at 9:35 AM, jincheng sun <[hidden email]> wrote:
> [...]

Apologies for the late reply.

I think this is badly needed, but I fear we are adding complexity by introducing yet two more stop commands. We'll have: cancel, stop, terminate, and suspend. We basically want to do two things: terminate a job with prejudice or stop a job safely.

For the former, "cancel" is the appropriate term, and there should be no need for a cancel-with-checkpoint option. If the job was configured to use externalized checkpoints and it ran long enough, a checkpoint will be available for it.

For the latter, "stop" is the appropriate term, and it means that a job should process no messages after the checkpoint barrier and that it should ensure that exactly-once sinks complete their two-phase commits successfully. If a savepoint was requested, one should be created.

So in my mind there are two commands, cancel and stop, with appropriate semantics. Emitting MAX_WATERMARK before the checkpoint barrier during stop is merely an optional behavior, like creation of a savepoint. But if a specific command for it is desired, then "drain" seems appropriate.

On Tue, Feb 12, 2019 at 9:50 AM Stephan Ewen <[hidden email]> wrote:
> Hi Elias!
>
> I remember you brought this missing feature up in the past. Do you think
> the proposed enhancement would work for your use case?
>
> Best,
> Stephan
>
> ---------- Forwarded message ---------
> From: Kostas Kloudas <[hidden email]>
> Date: Tue, Feb 12, 2019 at 5:28 PM
> Subject: [DISCUSS] FLIP-33: Terminate/Suspend Job with Savepoint
> To: <[hidden email]>
>
> [...]

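The two-command model described above (cancel versus stop, with savepoint creation and draining as optional behaviors of stop) can be sketched as a small toy model. Everything here is a hypothetical illustration, not Flink code: `ToyJob`, `cancel`, `stop`, and the use of infinity as a stand-in for MAX_WATERMARK are assumptions made for the sketch.

```python
import heapq

# Stand-in for MAX_WATERMARK: larger than any event-time timer timestamp.
MAX_WATERMARK = float("inf")


class ToyJob:
    """Toy job holding event-time timers (hypothetical, not Flink's API)."""

    def __init__(self, timers):
        self.timers = list(timers)  # min-heap of (timestamp, name)
        heapq.heapify(self.timers)
        self.fired = []
        self.savepoint = None
        self.state = "RUNNING"

    def advance_watermark(self, watermark):
        # Fire every timer whose timestamp is at or below the watermark.
        while self.timers and self.timers[0][0] <= watermark:
            self.fired.append(heapq.heappop(self.timers)[1])


def cancel(job):
    # "Terminate with prejudice": no draining, no savepoint.
    job.state = "CANCELED"
    return job


def stop(job, with_savepoint=False, drain=False):
    # Draining comes first, so timers fire before any state is captured.
    if drain:
        job.advance_watermark(MAX_WATERMARK)
    if with_savepoint:
        job.savepoint = {"pending_timers": sorted(job.timers)}
    job.state = "STOPPED"
    return job


drained = stop(ToyJob([(10, "a"), (5, "b")]), with_savepoint=True, drain=True)
suspended = stop(ToyJob([(10, "a")]), with_savepoint=True)
cancelled = cancel(ToyJob([(10, "a")]))
print(drained.fired, drained.savepoint)
print(suspended.fired, suspended.savepoint)
print(cancelled.state, cancelled.savepoint)
```

Modeling drain and savepoint as flags on one `stop` function mirrors the argument that they are optional behaviors of a safe stop rather than separate commands.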
I really like this effort. I think the original plan for "cancel-with-savepoint" was always to just be a workaround until we arrived at a better solution as proposed here.

Regarding the FLIP, I agree with Elias' comments. I think the number of termination modes the FLIP introduces can be overwhelming and I would personally rather follow Elias' proposal. In context of the proposal, this would result in the following:
- "terminate" becomes "stop --drain"
- "suspend" becomes "stop --with-savepoint"
- "cancel-with-savepoint" is superseded by "stop --with-savepoint"

I have two remaining questions:

1) @Kostas: Elias suggests for stop that "a job should process no messages after the checkpoints barrier". This is something that needs support from the sources. Is this in the scope of your proposal (I think not)? If not, is there a future plan for this?

2) Would we need to introduce a new command/name for "stop" as we already have a "stop" command? Assuming that there are no users that actually use the existing "stop" command, as no major sources are stoppable (I think), I would personally suggest upgrading the existing "stop" command to the proposed one. If, on the other hand, we know of users that rely on the current "stop" command, we'd need to find another name for it.

Best,

Ufuk

On Wed, Mar 6, 2019 at 12:27 AM Elias Levy <[hidden email]> wrote:
> [...]

Hi,

Thanks for the comments.
I agree with Ufuk's and Elias' proposal.

- "cancel" remains the good old "cancel"
- "terminate" becomes "stop --drain-with-savepoint"
- "suspend" becomes "stop --with-savepoint"
- "cancel-with-savepoint" is subsumed by "stop --with-savepoint"

As the list above shows, I would also have both "terminate" and "suspend" keep a savepoint by default.

As for Ufuk's remarks:

1) You are correct that to have a proper way to not allow elements to be fed into the pipeline after the checkpoint barrier, we need support from the sources. This is more the responsibility of FLIP-27:
https://cwiki.apache.org/confluence/display/FLINK/FLIP-27%3A+Refactor+Source+Interface

2) I would lean more towards replacing the old "stop" command with the new one. But, as you said, I have no view of how many users (if any) rely on the old "stop" command for their use cases.

Cheers,
Kostas

On Wed, Mar 6, 2019 at 9:52 PM Ufuk Celebi <[hidden email]> wrote:
> [...]

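The renaming summarized in this message can be written down as a small lookup table. This is only a sketch of the spellings as proposed in this thread; the names `PROPOSED_COMMANDS` and `translate` are hypothetical, and the command strings should not be read as the options of any released Flink CLI.

```python
# Mapping from the names used so far in the discussion to the command
# spellings proposed in this thread (not a released CLI syntax).
PROPOSED_COMMANDS = {
    "cancel": "cancel",
    "terminate": "stop --drain-with-savepoint",
    "suspend": "stop --with-savepoint",
    "cancel-with-savepoint": "stop --with-savepoint",
}


def translate(old_mode: str) -> str:
    """Return the proposed command for a termination mode named in the thread."""
    try:
        return PROPOSED_COMMANDS[old_mode]
    except KeyError:
        raise ValueError(f"unknown termination mode: {old_mode!r}") from None


for mode in PROPOSED_COMMANDS:
    print(f"{mode} -> {translate(mode)}")
```

Note that "suspend" and "cancel-with-savepoint" map to the same command, which is what "subsumed" means in the list above.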
I agree and already created a Jira issue for removing the old "stop" feature as preparation: https://issues.apache.org/jira/browse/FLINK-11889

Aljoscha

On 7. Mar 2019, at 11:08, Kostas Kloudas <[hidden email]> wrote:
> [...]
Thanks a lot Aljoscha!
On Tue, Mar 12, 2019 at 2:50 PM Aljoscha Krettek <[hidden email]> wrote:

> I agree and already created a Jira issue for removing the old “stop”
> feature as preparation: https://issues.apache.org/jira/browse/FLINK-11889
>
> Aljoscha
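
For readers skimming the thread, the renaming that Kostas lists in his 7 Mar message can be written down as a small lookup table. The sketch below is illustrative only: the class and method names (TerminationCommands, proposedFor) are hypothetical and not part of Flink, and the command strings are the ones proposed in this thread, which may differ from what a released Flink CLI accepts.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Optional;

/**
 * Hypothetical helper, NOT part of Flink: records the renaming proposed in
 * this thread (FLIP mode or legacy command -> proposed CLI command).
 */
final class TerminationCommands {

    // Insertion-ordered so the table prints in the order used in the thread.
    private static final Map<String, String> PROPOSED = new LinkedHashMap<>();

    static {
        // "cancel" keeps its existing meaning.
        PROPOSED.put("cancel", "cancel");
        // TERMINATE: emit MAX_WATERMARK, then stop, keeping a savepoint.
        PROPOSED.put("terminate", "stop --drain-with-savepoint");
        // SUSPEND: stop after a savepoint, without draining.
        PROPOSED.put("suspend", "stop --with-savepoint");
        // The legacy operation is subsumed by the new stop command.
        PROPOSED.put("cancel-with-savepoint", "stop --with-savepoint");
    }

    private TerminationCommands() {
    }

    /** Returns the proposed command for a name, or empty if the name is unknown. */
    static Optional<String> proposedFor(String name) {
        return Optional.ofNullable(PROPOSED.get(name));
    }

    public static void main(String[] args) {
        PROPOSED.forEach((name, command) -> System.out.println(name + " -> " + command));
    }
}
```

Note that Ufuk's earlier variant maps "terminate" to "stop --drain" instead; the table follows Kostas' later list, in which both "terminate" and "suspend" keep a savepoint by default.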