Hi Flink devs,

Through recent discussion of job stop processing, we found that its semantics are incomplete, mainly in two respects:

1. In our released versions [1], the stop process is as follows:

   - A “stop” call is a graceful way of stopping a running streaming job. When the user requests to stop a job, all sources will receive a stop() method call. The job will keep running until all sources properly shut down. This allows the job to finish processing all inflight data.

   However, for stateful operators with retained checkpointing, the stop call does not take a checkpoint. When the job is resumed, it has to recover from the latest checkpoint and rewind the sources, which makes the wait for all inflight data to be processed meaningless (everything after that checkpoint has to be processed again).

2. In the latest master branch after FLIP-34 [2], job stop is always accompanied by a savepoint, which has the following problems:

   - It is an unexpected behavior change from the user's perspective: the old stop command will fail on jobs without a savepoint configuration.
   - It slows down the job stop procedure and might block the launch of new jobs when resources are contended.

To resolve the above issues and reinforce the job stop semantics, we have opened FLIP-45. In the document we carefully compared Flink concepts with their database counterparts, clarified what is missing in the Flink job stop process, and proposed changes to complete it.

Since this involves concept-level clarification and user-facing changes, we would like to collect feedback on the proposal before continuing our efforts.

This is the FLIP:
https://cwiki.apache.org/confluence/display/FLINK/FLIP-45%3A+Reinforce+Job+Stop+Semantic

[1] https://ci.apache.org/projects/flink/flink-docs-release-1.8/ops/cli.html
[2] https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=103090212

Best Regards,
Yu
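
To make the first problem above concrete, here is a minimal toy model in Python of what happens when a job is resumed after a stop. It is only a sketch and does not use any Flink API: the names `StopOutcome` and `stop_job` and the integer source offsets are assumptions made up for illustration. It models a single rewindable source and compares stopping without a snapshot against stopping with one.

```python
from dataclasses import dataclass


@dataclass
class StopOutcome:
    resume_offset: int     # source offset the restarted job reads from
    replayed_records: int  # records already processed before the stop that are processed again


def stop_job(last_checkpoint_offset: int, stop_offset: int, snapshot_on_stop: bool) -> StopOutcome:
    """Toy model of a graceful stop (hypothetical helper, not a Flink API).

    last_checkpoint_offset: source offset covered by the latest retained checkpoint.
    stop_offset: source offset reached once the sources shut down and all
                 inflight data has been drained.
    snapshot_on_stop: whether state is snapshotted as part of the stop.
    """
    if stop_offset < last_checkpoint_offset:
        raise ValueError("the stop cannot happen before the latest checkpoint")
    # Without a snapshot at stop time, the only recovery point is the latest
    # checkpoint, so the source rewinds to it on resume.
    resume_offset = stop_offset if snapshot_on_stop else last_checkpoint_offset
    return StopOutcome(resume_offset, stop_offset - resume_offset)


# Latest checkpoint at offset 1000, sources drained up to offset 1250.
without_snapshot = stop_job(1000, 1250, snapshot_on_stop=False)
with_snapshot = stop_job(1000, 1250, snapshot_on_stop=True)
```

In this model the 250 records drained after the latest checkpoint are replayed when no snapshot is taken at stop time, while a snapshot taken as part of the stop lets the job resume exactly where it left off. The model deliberately ignores the cost of taking that snapshot, which is the concern raised in the second problem.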
Thanks for the effort, Yu. I've shared the link on the JIRA side [1] for those who may be interested in this topic.

[1] https://issues.apache.org/jira/browse/FLINK-12619

Best,
Congxian

Yu Li <[hidden email]> wrote on Wed, Jul 3, 2019 at 10:56 PM:

> Hi Flink devs,
>
> Through recent discussion on job stop processing, we found its semantic is
> incomplete, mainly reflected in two aspects:
>
> 1. In our released versions [1], the stop process is as follows:
>
> - A “stop” call is a graceful way of stopping a running streaming job.
> When the user requests to stop a job, all sources will receive a stop()
> method call. The job will keep running until all sources properly shut
> down. This allows the job to finish processing all inflight data.
>
> However, for stateful operators with retained checkpointing, the stop
> call won’t take any checkpoint, thus when resuming the job it needs to
> recover from the latest checkpoint with source rewinding, which causes the
> wait for processing all inflight data meaningless (all need to be processed
> again).
>
> 2. In the latest master branch after FLIP-34 [2], job stop will always be
> accompanied by a savepoint, which has below problems:
>
> - It's an unexpected behavior change from user perspective, that old
> stop command will fail on jobs w/o savepoint configuration.
> - It slows down the job stop procedure and might block launching new
> jobs when resource is contended.
>
> To resolve the above issues and reinforce job stop semantic, we have
> opened FLIP-45. In the document we carefully compared the Flink concepts
> with database, clarified what's missing in Flink job stop process, and
> proposed changes to complete it.
>
> Since this involves conception level clarification and user-facing
> changes, we would like to collect feedback on the proposal before
> continuing efforts.
>
> This is the FLIP:
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-45%3A+Reinforce+Job+Stop+Semantic
>
> [1]
> https://ci.apache.org/projects/flink/flink-docs-release-1.8/ops/cli.html
> [2]
> https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=103090212
>
> Best Regards,
> Yu