[DISCUSS] FLIP-45: Reinforce Job Stop Semantic

classic Classic list List threaded Threaded
2 messages Options
Reply | Threaded
Open this post in threaded view
|

[DISCUSS] FLIP-45: Reinforce Job Stop Semantic

Yu Li
Hi Flink devs,

Through recent discussion on job stop processing, we found its semantic is
incomplete, mainly reflected in two aspects:

1. In our released versions [1], the stop process is as follows:

   - A “stop” call is a graceful way of stopping a running streaming job.
   When the user requests to stop a job, all sources will receive a stop()
   method call. The job will keep running until all sources properly shut
   down. This allows the job to finish processing all inflight data.

    However, for stateful operators with retained checkpointing, the stop
call won’t take any checkpoint, thus when resuming the job it needs to
recover from the latest checkpoint with source rewinding, which causes the
wait for processing all inflight data meaningless (all need to be processed
again).

2. In the latest master branch after FLIP-34 [2], job stop will always be
accompanied by a savepoint, which has below problems:

   - It's an unexpected behavior change from user perspective, that old
   stop command will fail on jobs w/o savepoint configuration.
   - It slows down the job stop procedure and might block launching new
   jobs when resource is contended.

To resolve the above issues and reinforce job stop semantic, we have opened
FLIP-45. In the document we carefully compared the Flink concepts with
database, clarified what's missing in Flink job stop process, and proposed
changes to complete it.

Since this involves conception level clarification and user-facing
changes, we would like to collect feedback on the proposal before
continuing efforts.

This is the FLIP:
https://cwiki.apache.org/confluence/display/FLINK/FLIP-45%3A+Reinforce+Job+Stop+Semantic

[1] https://ci.apache.org/projects/flink/flink-docs-release-1.8/ops/cli.html
[2]
https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=103090212

Best Regards,
Yu
Reply | Threaded
Open this post in threaded view
|

Re: [DISCUSS] FLIP-45: Reinforce Job Stop Semantic

Congxian Qiu
 Thanks for the effort, Yu, I've shared the link on the jira side[1] for
those who may be interested in this topic.

[1] https://issues.apache.org/jira/browse/FLINK-12619
Best,
Congxian


Yu Li <[hidden email]> 于2019年7月3日周三 下午10:56写道:

> Hi Flink devs,
>
> Through recent discussion on job stop processing, we found its semantic is
> incomplete, mainly reflected in two aspects:
>
> 1. In our released versions [1], the stop process is as follows:
>
>    - A “stop” call is a graceful way of stopping a running streaming job.
>    When the user requests to stop a job, all sources will receive a stop()
>    method call. The job will keep running until all sources properly shut
>    down. This allows the job to finish processing all inflight data.
>
>     However, for stateful operators with retained checkpointing, the stop
> call won’t take any checkpoint, thus when resuming the job it needs to
> recover from the latest checkpoint with source rewinding, which causes the
> wait for processing all inflight data meaningless (all need to be processed
> again).
>
> 2. In the latest master branch after FLIP-34 [2], job stop will always be
> accompanied by a savepoint, which has below problems:
>
>    - It's an unexpected behavior change from user perspective, that old
>    stop command will fail on jobs w/o savepoint configuration.
>    - It slows down the job stop procedure and might block launching new
>    jobs when resource is contended.
>
> To resolve the above issues and reinforce job stop semantic, we have
> opened FLIP-45. In the document we carefully compared the Flink concepts
> with database, clarified what's missing in Flink job stop process, and
> proposed changes to complete it.
>
> Since this involves conception level clarification and user-facing
> changes, we would like to collect feedback on the proposal before
> continuing efforts.
>
> This is the FLIP:
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-45%3A+Reinforce+Job+Stop+Semantic
>
> [1]
> https://ci.apache.org/projects/flink/flink-docs-release-1.8/ops/cli.html
> [2]
> https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=103090212
>
> Best Regards,
> Yu
>