(DEPRECATED) Apache Flink Mailing List archive.

[DISCUSS] Retrieve savepoint location after suspension of jobclusters

Fabian Paul-2

[DISCUSS] Retrieve savepoint location after suspension of jobclusters

Hi all,

Due to recent changes in the shutdown mechanism of Flink [1], it is no
longer conveniently possible to suspend a job running on a jobcluster
with a savepoint and retrieve the savepoint location programmatically
via the Flink API.

With the introduced changes, the REST endpoint shuts down immediately
and rejects new requests, which makes the information inaccessible.

Before the changes it was possible to stop the job and query the savepoint
info endpoint until the location was shown.
Admittedly, this was never a safe solution because it expected that the
rest endpoint stays alive long enough.
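
As an illustration of the "query until the location is shown" approach, here is a minimal Python sketch. It assumes the status response has the shape documented for Flink's savepoint status endpoint (a `status.id` field and an `operation.location` field); the function names and the injected `fetch_status` callable are hypothetical and not part of any Flink client.

```python
import time
from typing import Callable, Optional


def savepoint_location(status: dict) -> Optional[str]:
    """Return the savepoint location once the operation has completed, else None."""
    # Assumed response shape: {"status": {"id": ...}, "operation": {"location": ...}}
    if status.get("status", {}).get("id") != "COMPLETED":
        return None
    operation = status.get("operation", {})
    if "failure-cause" in operation:
        raise RuntimeError(f"savepoint failed: {operation['failure-cause']}")
    return operation.get("location")


def poll_for_location(
    fetch_status: Callable[[], dict],
    attempts: int = 30,
    delay_seconds: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Call fetch_status until a location is reported or the attempts run out."""
    for _ in range(attempts):
        # Any connection error raised by fetch_status propagates to the caller;
        # that is the failure mode when the REST endpoint is already gone.
        location = savepoint_location(fetch_status())
        if location is not None:
            return location
        sleep(delay_seconds)
    raise TimeoutError("savepoint location not available after polling")
```

The loop only works while the endpoint keeps answering, which is why this approach depends on the shutdown being delayed.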

I would like to hear what the community thinks about this and whether it is
worth implementing a different solution to retrieve this information.

Best,
Fabian
[1] https://issues.apache.org/jira/browse/FLINK-18663

Eleanore Jin

Re: [DISCUSS] Retrieve savepoint location after suspension of jobclusters

+1 Thank you Fabian!

On Fri, Aug 7, 2020 at 6:58 AM Fabian Paul <[hidden email]>
wrote:


Till Rohrmann

Re: [DISCUSS] Retrieve savepoint location after suspension of jobclusters

Hi Fabian,

could you explain a bit how you are cancelling a job with a savepoint and
then trying to retrieve the savepoint path?

When running Flink in per-job mode, the system should not shut down if you
have an asynchronous operation running whose result you have not yet
queried. I believe that this feature was introduced with FLINK-10309 [1].
The semantics is that Flink waits 5 minutes or until the result has been
queried (by any client) [2]. If this is not working, then this is clearly a
bug.
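
To make the described semantics concrete, here is a toy Python model of "wait until every result has been queried, or until a timeout expires". It is a sketch for illustration only and not Flink's CompletedOperationCache; the class name, method names and the 300-second default are assumptions derived from the description above.

```python
import threading
from typing import Dict, Optional, Set


class CompletedOperationCacheModel:
    """Toy model: shutdown waits until every triggered operation's result
    has been fetched by a client, or until a timeout expires."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._results: Dict[str, str] = {}
        self._unfetched: Set[str] = set()
        self._all_fetched = threading.Event()
        self._all_fetched.set()  # nothing is pending initially

    def register(self, trigger_id: str) -> None:
        """Track a newly triggered asynchronous operation."""
        with self._lock:
            self._unfetched.add(trigger_id)
            self._all_fetched.clear()

    def complete(self, trigger_id: str, result: str) -> None:
        """Store the result of a finished operation."""
        with self._lock:
            self._results[trigger_id] = result

    def get(self, trigger_id: str) -> Optional[str]:
        """Serve a result to a client; a served result counts as fetched."""
        with self._lock:
            result = self._results.get(trigger_id)
            if result is not None:
                self._unfetched.discard(trigger_id)
                if not self._unfetched:
                    self._all_fetched.set()
            return result

    def await_shutdown(self, timeout_seconds: float = 300.0) -> bool:
        """Block until all results were fetched; False if the timeout hit first."""
        return self._all_fetched.wait(timeout_seconds)
```

In this model a shutdown path would call `await_shutdown()` before closing the REST endpoint, so a client that polls within the timeout still receives its result.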

FLINK-18663 [3] solved a bug where the cluster would hang while trying to
shut it down. This was also a bug obviously.

[1] https://issues.apache.org/jira/browse/FLINK-10309
[2]
https://github.com/apache/flink/blob/master/flink-runtime/src/main/java/org/apache/flink/runtime/rest/handler/async/CompletedOperationCache.java#L141
[3] https://issues.apache.org/jira/browse/FLINK-18663

Cheers,
Till

On Fri, Aug 7, 2020 at 5:58 PM Eleanore Jin <[hidden email]> wrote:


Fabian Paul-2

Re: [DISCUSS] Retrieve savepoint location after suspension of jobclusters

Hi Till,

The problem is reproducible with a basic shell script that performs the following operations.

1. POST request to /jobs/${JOB_ID}/savepoints with the payload
{"cancel-job": true, "target-directory": "${LOCATION}"}
and store the trigger ID

2. Sleep 10 seconds

3. GET /jobs/${JOB_ID}/savepoints/${TRIGGER_ID}
results in a connect exception because the REST endpoint is shut down.

Sorry if I misunderstood your previous answer, but I would expect that stopping the job
with a savepoint is an asynchronous operation and should block the shutdown until
the result is served.
I can also confirm that the cluster is not shut down but the REST endpoint is, which makes
it impossible to serve the asynchronous result.
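
For readers who want to try the three steps above, they can be sketched in Python roughly as follows. This is not the original shell script. The base URL is an assumption, the function names are hypothetical, and the `request-id` field in the trigger response is taken from the Flink REST API documentation rather than from this thread.

```python
import json
import time
import urllib.request


def trigger_request(base_url: str, job_id: str, target_directory: str) -> urllib.request.Request:
    """Step 1: build the POST that triggers a savepoint and cancels the job."""
    payload = {"cancel-job": True, "target-directory": target_directory}
    return urllib.request.Request(
        f"{base_url}/jobs/{job_id}/savepoints",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def status_url(base_url: str, job_id: str, trigger_id: str) -> str:
    """Step 3: URL of the savepoint status for the given trigger ID."""
    return f"{base_url}/jobs/{job_id}/savepoints/{trigger_id}"


def reproduce(base_url: str, job_id: str, target_directory: str,
              wait_seconds: float = 10.0) -> dict:
    """Run all three steps against a live cluster.

    Raises urllib.error.URLError when the REST endpoint no longer accepts
    connections, which is the symptom described in this message.
    """
    request = trigger_request(base_url, job_id, target_directory)
    with urllib.request.urlopen(request) as response:
        # Assumed field name for the trigger ID in the response body.
        trigger_id = json.load(response)["request-id"]
    time.sleep(wait_seconds)  # step 2
    with urllib.request.urlopen(status_url(base_url, job_id, trigger_id)) as response:
        return json.load(response)
```

A call such as `reproduce("http://localhost:8081", job_id, "file:///tmp/savepoints")` would exercise the sequence, with the address and directory replaced by those of the cluster under test.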

Best,
Fabian

Till Rohrmann

Re: [DISCUSS] Retrieve savepoint location after suspension of jobclusters

This sounds like a bug in Flink. Could you share the logs of the cluster
(ideally with TRACE log level) with us?

Cheers,
Till

On Tue, Aug 11, 2020 at 9:49 AM Fabian Paul <[hidden email]>
wrote:


Fabian Paul-2

Re: [DISCUSS] Retrieve savepoint location after suspension of jobclusters

I attached the last log lines [1] of the jobmanager after triggering the savepoint. I just
saw that the release of 1.10.2 has started, so it would be great to determine
whether this is a bug and to postpone the release if necessary.
What do you think?

Best,
Fabian

[1] https://pastebin.com/eWXN5fzS

Till Rohrmann

Re: [DISCUSS] Retrieve savepoint location after suspension of jobclusters

Thanks for the logs Fabian. It is indeed a problem we introduced recently.
I've created a JIRA issue to fix the problem [1]. This fix will also be
included in the Flink 1.10.2 release.

[1] https://issues.apache.org/jira/browse/FLINK-18902

Cheers,
Till

On Wed, Aug 12, 2020 at 2:30 PM Fabian Paul <[hidden email]>
wrote:
