Rest handler redirect problem

Wong Lucent
Hi,


Recently, our team upgraded our Flink cluster from version 1.7 to 1.10, and we ran into a problem when calling the Flink REST API.

1) We deploy our Flink cluster in standalone mode on Kubernetes and use two JobManagers for HA.

2) We deployed a Kubernetes service in front of the two JobManagers to provide a unified URL.

3) We use the REST API to operate the Flink cluster.

After upgrading to 1.10, we found that savepoint status queries are handled differently than in 1.7. For example, if we send a savepoint trigger request to the leader JobManager, in 1.7 we can query the standby JobManager to get the status of the savepoint, while in 1.10 the standby returns a 404 response.
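For reference, the two calls involved are the savepoint trigger (POST /jobs/:jobid/savepoints) and the status query (GET /jobs/:jobid/savepoints/:triggerid). A minimal sketch of how the two URLs relate is below; the helper names and the base URL are made up for illustration and are not part of Flink.

```python
from urllib.parse import quote


def savepoint_trigger_url(base_url, job_id):
    """URL to POST to in order to trigger a savepoint for job_id."""
    return f"{base_url.rstrip('/')}/jobs/{quote(job_id)}/savepoints"


def savepoint_status_url(base_url, job_id, trigger_id):
    """URL to GET in order to poll the savepoint started under trigger_id."""
    return f"{savepoint_trigger_url(base_url, job_id)}/{quote(trigger_id)}"


# Hypothetical Kubernetes service address in front of both JobManagers.
BASE_URL = "http://flink-jobmanager:8081"
```

Behind a Kubernetes service, both URLs share the same base, so the trigger and the later status query may land on different JobManagers.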

In 1.7, all requests to the standby JobManager are redirected to the leader by "RedirectHandler", while in 1.10 the requests are forwarded via RPC in "LeaderRetrievalHandler". But there seems to be an issue in "AbstractAsynchronousOperationHandlers": this handler keeps a local in-memory cache, "completedOperationCache", that stores the pending savepoint operation before the request is forwarded to the leader JobManager, and this cache does not seem to be synced between the JobManagers. As a result, only the JobManager that received the savepoint trigger request can look up the status of the savepoint, while the others can only return 404.
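To make the difference concrete, here is a toy model of the behavior described above. It is not Flink code: the class, the "redirect"/"forward" mode names, and the status strings are all invented for illustration, and it only assumes that each JobManager keeps its own unshared operation cache.

```python
class JobManagerRest:
    """Toy REST endpoint of one JobManager (hypothetical, not a Flink class)."""

    def __init__(self, name, mode):
        self.name = name
        self.mode = mode          # "redirect" (1.7-like) or "forward" (1.10-like)
        self.leader = self        # overwritten on standbys
        self.operations = {}      # local in-memory cache, never shared

    def _should_redirect(self):
        return self.mode == "redirect" and self.leader is not self

    def trigger(self, job_id):
        if self._should_redirect():
            # The client is sent to the leader's endpoint and repeats the call there.
            return self.leader.trigger(job_id)
        # In "forward" mode the savepoint itself would run on the leader via RPC,
        # but the bookkeeping stays in the endpoint that received the request.
        trigger_id = f"{self.name}-{len(self.operations) + 1}"
        self.operations[trigger_id] = "IN_PROGRESS"
        return trigger_id

    def status(self, trigger_id):
        if self._should_redirect():
            return self.leader.status(trigger_id)
        if trigger_id not in self.operations:
            return 404, None
        return 200, self.operations[trigger_id]


def query_standby_after_trigger_on_leader(mode):
    leader = JobManagerRest("jm-1", mode)
    standby = JobManagerRest("jm-2", mode)
    standby.leader = leader
    trigger_id = leader.trigger("job-a")
    return standby.status(trigger_id)
```

In this model, query_standby_after_trigger_on_leader("redirect") returns (200, "IN_PROGRESS"), while query_standby_after_trigger_on_leader("forward") returns (404, None), which matches what we observe between 1.7 and 1.10.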

As this breaks our design for operating the Flink cluster through the REST API, we can no longer use a Kubernetes service to hide the standby JobManager. We would like to know whether this behavior is by design or is really a bug.


Thanks and Best Regards
Lucent Wong

Re: Rest handler redirect problem

Chesnay Schepler-3
I think this is not intentional and simply a case we did not consider.

Please file a JIRA.

On 15/06/2020 19:01, Wong Lucent wrote:

> [...]



Re: Rest handler redirect problem

Till Rohrmann
Thanks for reporting this issue, Lucent. One way to solve it would be to
reintroduce the RedirectHandler and allow the user to choose between
forwarding to the leader, as is done right now in Flink 1.10, and using
redirects, as was the case in the past. If I remember correctly,
redirects also had their problems.

Cheers,
Till

On Mon, Jun 15, 2020 at 8:16 PM Chesnay Schepler <[hidden email]> wrote:

> [...]