[DISCUSS] Retrieve savepoint location after suspension of jobclusters

Fabian Paul-2
Hi all,

Due to recent changes in the shutdown mechanism of Flink [1], it is no
longer conveniently possible to suspend a job running on a jobcluster
with a savepoint and retrieve the savepoint location programmatically
via the Flink API.

With the introduced changes, the REST endpoint shuts down immediately
and rejects new requests, which makes the information inaccessible.

Before the changes it was possible to stop the job and query the savepoint
info endpoint until the location was shown.
Admittedly, this was never a safe solution because it relied on the
REST endpoint staying alive long enough.
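
For illustration, the polling approach described above could look roughly like the following. This is a minimal sketch under stated assumptions, not the script actually in use: `FLINK_REST`, `fetch_status`, `extract_location`, and `poll_savepoint_location` are hypothetical names, the default address is a guess, and the parsing assumes the status response carries the path in an `operation.location` field once the savepoint has completed.

```shell
#!/usr/bin/env bash
# Sketch only: FLINK_REST and all function names are illustrative assumptions.
FLINK_REST="${FLINK_REST:-http://localhost:8081}"

# Fetch the raw status JSON for a savepoint trigger.
# Kept as its own function so it can be replaced or stubbed.
fetch_status() {
  curl --silent --fail "${FLINK_REST}/jobs/$1/savepoints/$2"
}

# Pull the value of "location" out of a status response.
# Crude text matching; a real JSON parser would be more robust.
extract_location() {
  sed -n 's/.*"location"[[:space:]]*:[[:space:]]*"\([^"]*\)".*/\1/p'
}

# Poll until a location shows up or the attempts run out.
# Returns non-zero if the endpoint cannot be reached or no location appears.
poll_savepoint_location() {
  local job_id="$1" trigger_id="$2" attempts="${3:-30}" delay="${4:-1}"
  local i=0 body location
  while [ "$i" -lt "$attempts" ]; do
    # A failed request (for example, connection refused) aborts the polling.
    body="$(fetch_status "$job_id" "$trigger_id")" || return 1
    location="$(printf '%s' "$body" | extract_location)"
    if [ -n "$location" ]; then
      printf '%s\n' "$location"
      return 0
    fi
    sleep "$delay"
    i=$((i + 1))
  done
  return 1
}
```

Calling `poll_savepoint_location "$JOB_ID" "$TRIGGER_ID"` would print the location once it is available. If the REST endpoint is already gone, the first request fails and the function returns non-zero, which is the situation described in this message.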

I would like to see what the community thinks about this and whether it is
worth implementing a different solution to retrieve this information.

Best,
Fabian
[1] https://issues.apache.org/jira/browse/FLINK-18663

Re: [DISCUSS] Retrieve savepoint location after suspension of jobclusters

Eleanore Jin
+1 Thank you Fabian!

On Fri, Aug 7, 2020 at 6:58 AM Fabian Paul <[hidden email]>
wrote:

Re: [DISCUSS] Retrieve savepoint location after suspension of jobclusters

Till Rohrmann
Hi Fabian,

Could you explain a bit how you are cancelling a job with a savepoint and
then trying to retrieve the savepoint path?

When running Flink in per-job mode, the system should not shut down if you
have an asynchronous operation running whose result you have not yet
queried. I believe that this feature was introduced with FLINK-10309 [1].
The semantics is that Flink waits 5 minutes or until the result has been
queried (by any client) [2]. If this is not working, then this is clearly a
bug.

FLINK-18663 [3] solved a bug where the cluster would hang while trying to
shut it down. This was also a bug obviously.

[1] https://issues.apache.org/jira/browse/FLINK-10309
[2] https://github.com/apache/flink/blob/master/flink-runtime/src/main/java/org/apache/flink/runtime/rest/handler/async/CompletedOperationCache.java#L141
[3] https://issues.apache.org/jira/browse/FLINK-18663

Cheers,
Till

On Fri, Aug 7, 2020 at 5:58 PM Eleanore Jin <[hidden email]> wrote:

Re: [DISCUSS] Retrieve savepoint location after suspension of jobclusters

Fabian Paul-2
Hi Till,

The problem is reproducible with a basic shell script that performs the following operations:

1. POST to /jobs/${JOB_ID}/savepoints with the payload
         {"cancel-job": true, "target-directory": "${LOCATION}"}
        and store the trigger ID.

2. Sleep for 10 seconds.

3. GET /jobs/${JOB_ID}/savepoints/${TRIGGER_ID}
        This results in a connect exception because the REST endpoint has been shut down.
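
The three steps above can be sketched as a small script. This is only an approximation of such a script, not the one used here: `FLINK_REST`, `SLEEP_SECONDS`, and the function names are hypothetical, the default address is a guess, and the parsing assumes the trigger response returns the trigger ID in a `request-id` field.

```shell
#!/usr/bin/env bash
# Sketch of the three reproduction steps. Names other than the REST paths
# and payload keys are illustrative assumptions.
FLINK_REST="${FLINK_REST:-http://localhost:8081}"

# Build the JSON payload for step 1 (no escaping of special characters).
savepoint_payload() {
  printf '{"cancel-job": true, "target-directory": "%s"}' "$1"
}

# Extract the trigger ID from the trigger response.
# Crude text matching; a real JSON parser would be more robust.
extract_trigger_id() {
  sed -n 's/.*"request-id"[[:space:]]*:[[:space:]]*"\([^"]*\)".*/\1/p'
}

reproduce() {
  local job_id="$1" location="$2" trigger_id
  # Step 1: trigger the savepoint with cancellation and keep the trigger ID.
  trigger_id="$(curl --silent --fail -X POST \
      -H 'Content-Type: application/json' \
      -d "$(savepoint_payload "$location")" \
      "${FLINK_REST}/jobs/${job_id}/savepoints" | extract_trigger_id)"
  [ -n "$trigger_id" ] || { echo "no trigger ID returned" >&2; return 1; }
  # Step 2: give the savepoint time to complete.
  sleep "${SLEEP_SECONDS:-10}"
  # Step 3: query the result. With the behaviour reported here, this
  # request fails because the REST endpoint no longer accepts connections.
  curl --silent --show-error --fail \
      "${FLINK_REST}/jobs/${job_id}/savepoints/${trigger_id}"
}
```

Running `reproduce "$JOB_ID" "$LOCATION"` against a per-job cluster would then either print the savepoint status response or surface the connection error from the last `curl` call.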

Sorry if I misunderstood your previous answer, but I would expect that stopping the job
with a savepoint is an asynchronous operation and should block the shutdown until
the result is served.
I can also confirm that the cluster is not shut down but the REST endpoint is, which makes
it impossible to serve the asynchronous result.

Best,
Fabian

Re: [DISCUSS] Retrieve savepoint location after suspension of jobclusters

Till Rohrmann
This sounds like a bug in Flink. Could you share the logs of the cluster
(ideally with TRACE log level) with us?

Cheers,
Till

On Tue, Aug 11, 2020 at 9:49 AM Fabian Paul <[hidden email]>
wrote:

Re: [DISCUSS] Retrieve savepoint location after suspension of jobclusters

Fabian Paul-2
I attached the last log lines [1] of the JobManager after triggering the savepoint. I just
saw that the release of 1.10.2 has started, so it would be great to determine
whether this is a bug, so that the release can be postponed if necessary.
What do you think?

Best,
Fabian

[1] https://pastebin.com/eWXN5fzS

Re: [DISCUSS] Retrieve savepoint location after suspension of jobclusters

Till Rohrmann
Thanks for the logs Fabian. It is indeed a problem we introduced recently.
I've created a JIRA issue to fix the problem [1]. This fix will also be
included in the Flink 1.10.2 release.

[1] https://issues.apache.org/jira/browse/FLINK-18902

Cheers,
Till

On Wed, Aug 12, 2020 at 2:30 PM Fabian Paul <[hidden email]>
wrote:
