Hi all,
Due to recent changes in Flink's shutdown mechanism [1], it is no longer conveniently possible to suspend a job running on a job cluster with a savepoint and then retrieve the savepoint location programmatically via the Flink API.

With these changes, the REST endpoint shuts down immediately and rejects new requests, which makes the information inaccessible.

Before the changes, it was possible to stop the job and query the savepoint info endpoint until the location was shown. Admittedly, this was never a safe solution because it relied on the REST endpoint staying alive long enough.

I would like to hear what the community thinks about this and whether it is worth implementing a different way to retrieve this information.

Best,
Fabian

[1] https://issues.apache.org/jira/browse/FLINK-18663
+1 Thank you Fabian!
On Fri, Aug 7, 2020 at 6:58 AM Fabian Paul <[hidden email]> wrote:
> Due to recent changes in the shutdown mechanism of Flink [1] it is not
> conveniently possible anymore to suspend a job running on a jobcluster
> with a savepoint and retrieve the savepoint location via the Flink API
> programmatically.
> [...]
Hi Fabian,
could you explain a bit how you are cancelling the job with a savepoint and then trying to retrieve the savepoint path?

When running Flink in per-job mode, the system should not shut down while an asynchronous operation is running whose result you have not yet queried. I believe this feature was introduced with FLINK-10309 [1]. The semantics are that Flink waits 5 minutes or until the result has been queried (by any client) [2]. If this is not working, then it is clearly a bug.

FLINK-18663 [3] fixed a bug where the cluster would hang while it was being shut down. That was obviously a bug as well.

[1] https://issues.apache.org/jira/browse/FLINK-10309
[2] https://github.com/apache/flink/blob/master/flink-runtime/src/main/java/org/apache/flink/runtime/rest/handler/async/CompletedOperationCache.java#L141
[3] https://issues.apache.org/jira/browse/FLINK-18663

Cheers,
Till

On Fri, Aug 7, 2020 at 5:58 PM Eleanore Jin <[hidden email]> wrote:
> +1 Thank you Fabian!
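
The semantics Till describes (shutdown waits until every completed asynchronous result has been queried, or until 5 minutes have passed) can be pictured with a small toy model. This is only a sketch of the behaviour as stated in the message above, not Flink's `CompletedOperationCache`; the class name, method names, and the injectable clock are all hypothetical.

```python
import time


class CompletedOperationModel:
    """Toy model (hypothetical, not Flink code): shutdown is allowed only once
    every completed result was fetched or has been waiting for the timeout."""

    def __init__(self, timeout_seconds=300.0, clock=time.monotonic):
        self._timeout = timeout_seconds      # 5 minutes, as stated in the thread
        self._clock = clock                  # injectable so the model is testable
        self._results = {}                   # trigger id -> result
        self._unserved_since = {}            # trigger id -> completion timestamp

    def complete(self, trigger_id, result):
        """Record a finished asynchronous operation, e.g. a savepoint."""
        self._results[trigger_id] = result
        self._unserved_since[trigger_id] = self._clock()

    def get(self, trigger_id):
        """Return the result; a query by any client marks it as served."""
        self._unserved_since.pop(trigger_id, None)
        return self._results[trigger_id]

    def can_shut_down(self):
        """True if no unserved result is still inside its waiting period."""
        now = self._clock()
        return all(
            now - completed_at >= self._timeout
            for completed_at in self._unserved_since.values()
        )
```

Under this model, the behaviour reported in this thread would correspond to the REST endpoint closing while `can_shut_down()` is still false.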
Hi Till,
The problem is reproducible with a basic shell script that performs the following operations:

1. Send a POST request to /jobs/${JOB_ID}/savepoints with the payload {"cancel-job": true, "target-directory": "${LOCATION}"} and store the trigger ID.
2. Sleep for 10 seconds.
3. Send a GET request to /jobs/${JOB_ID}/savepoints/${TRIGGER_ID}. This results in a connect exception because the REST endpoint has been shut down.

Sorry if I misunderstood your previous answer, but I would expect that stopping the job with a savepoint is an asynchronous operation and should therefore block the shutdown until the result has been served. I can also confirm that the cluster itself is not shut down, but the REST endpoint is, which makes it impossible to serve the asynchronous result.

Best,
Fabian
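
The three steps above can be sketched in Python as follows. This is not the script Fabian used; it is a minimal illustration using only the standard library. The function names are hypothetical, and the `request-id` field of the trigger response is an assumption based on the Flink REST API documentation rather than on anything stated in this thread.

```python
import json
import time
import urllib.error
import urllib.request


def build_trigger_request(base_url, job_id, target_directory):
    """Step 1: POST request that triggers a savepoint and cancels the job."""
    payload = {"cancel-job": True, "target-directory": target_directory}
    return urllib.request.Request(
        url=f"{base_url}/jobs/{job_id}/savepoints",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def status_url(base_url, job_id, trigger_id):
    """Step 3: URL of the asynchronous savepoint result."""
    return f"{base_url}/jobs/{job_id}/savepoints/{trigger_id}"


def reproduce(base_url, job_id, target_directory, wait_seconds=10):
    """Run steps 1-3; return the status document, or None if the endpoint is gone."""
    request = build_trigger_request(base_url, job_id, target_directory)
    with urllib.request.urlopen(request) as response:
        # Assumption: the trigger ID is returned under "request-id".
        trigger_id = json.load(response)["request-id"]

    time.sleep(wait_seconds)  # Step 2

    try:
        with urllib.request.urlopen(status_url(base_url, job_id, trigger_id)) as response:
            return json.load(response)
    except urllib.error.URLError as error:
        # A refused connection here matches the behaviour reported above.
        print(f"REST endpoint not reachable: {error.reason}")
        return None
```

A call might look like `reproduce("http://localhost:8081", job_id, "file:///tmp/savepoints")`, where the address and target directory are placeholders for your own setup.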
This sounds like a bug in Flink. Could you share the logs of the cluster (ideally with TRACE log level) with us?

Cheers,
Till

On Tue, Aug 11, 2020 at 9:49 AM Fabian Paul <[hidden email]> wrote:
> The problem is reproducible with a basic shell script doing the following
> operations.
> [...]
I attached the last log lines [1] of the jobmanager after triggering the savepoint. I just saw that the release process for 1.10.2 has started, so it would be good to determine whether this is a bug and, if necessary, postpone the release. What do you think?

Best,
Fabian

[1] https://pastebin.com/eWXN5fzS
Thanks for the logs, Fabian. It is indeed a problem we introduced recently. I've created a JIRA issue to fix the problem [1]. The fix will also be included in the Flink 1.10.2 release.

[1] https://issues.apache.org/jira/browse/FLINK-18902

Cheers,
Till

On Wed, Aug 12, 2020 at 2:30 PM Fabian Paul <[hidden email]> wrote:
> I attached the last log lines[1] of the jobmanager after triggering the
> savepoint.
> [...]