Hi,
Recently, our team upgraded our Flink cluster from version 1.7 to 1.10, and we ran into a problem when calling the Flink REST API.

1) We deploy our Flink cluster in standalone mode on Kubernetes and use two JobManagers for HA.

2) We deployed a Kubernetes service in front of the two JobManagers to provide a unified URL.

3) We use the REST API to operate the Flink cluster.

After upgrading to 1.10, we found that savepoint status queries are handled differently than in 1.7. For example, if we send a savepoint trigger request to the leader JobManager, in 1.7 we can query the standby JobManager to get the status of the savepoint, while in 1.10 the standby returns a 404 response.

In 1.7, all requests to the standby JobManager are redirected to the leader by "RedirectHandler", while in 1.10 requests are forwarded via RPC by "LeaderRetrievalHandler". But there seems to be an issue in "AbstractAsynchronousOperationHandlers": this handler keeps a local in-memory cache, "completedOperationCache", which stores the pending savepoint operation before the request is passed on to the leader JobManager, and this cache does not seem to be synced between the JobManagers. As a result, only the JobManager that received the savepoint trigger request can look up the status of the savepoint, while the others can only return 404.

As this breaks our design for operating the Flink cluster through the REST API, we can no longer use a Kubernetes service to hide the standby JobManager. We would like to know whether this behavior is by design or really a bug.

Thanks and Best Regards
Lucent Wong
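To make the reported behavior concrete, here is a minimal toy model of it in Python. It is not Flink code: the `JobManager` and `RoundRobinService` classes, the job id, and the strict round-robin routing are all assumptions made for illustration. The only point it shows is that when each process keeps its own operation cache, a status query that lands on a different process than the trigger request finds nothing.

```python
import itertools
import uuid


class JobManager:
    """Toy stand-in for one JobManager REST endpoint (hypothetical, not Flink code)."""

    def __init__(self, name):
        self.name = name
        # Per-process cache of triggered operations, loosely modelled on the
        # "completedOperationCache" described in the report above.
        self.operation_cache = {}

    def trigger_savepoint(self, job_id):
        # Record the operation locally and hand back a trigger id.
        trigger_id = uuid.uuid4().hex
        self.operation_cache[(job_id, trigger_id)] = "IN_PROGRESS"
        return 202, trigger_id

    def savepoint_status(self, job_id, trigger_id):
        # Only this process's own cache is consulted.
        status = self.operation_cache.get((job_id, trigger_id))
        if status is None:
            return 404, None
        return 200, status


class RoundRobinService:
    """Hypothetical load balancer that strictly alternates between endpoints."""

    def __init__(self, endpoints):
        self._cycle = itertools.cycle(endpoints)

    def next_endpoint(self):
        return next(self._cycle)


jm_a = JobManager("jobmanager-a")
jm_b = JobManager("jobmanager-b")
service = RoundRobinService([jm_a, jm_b])

# The trigger request is routed to the first endpoint ...
first = service.next_endpoint()
_, trigger_id = first.trigger_savepoint("job-1")

# ... and the status query is routed to the other one.
second = service.next_endpoint()
miss_code, _ = second.savepoint_status("job-1", trigger_id)
hit_code, hit_status = first.savepoint_status("job-1", trigger_id)

print(second.name, miss_code)            # jobmanager-b 404
print(first.name, hit_code, hit_status)  # jobmanager-a 200 IN_PROGRESS
```

A real Kubernetes service does not necessarily alternate strictly between backends, so in practice the 404 would show up intermittently rather than on every second request.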
I think this is not intentional and simply a case we did not consider.
Please file a JIRA.

On 15/06/2020 19:01, Wong Lucent wrote:
> Recently, our team upgraded our Flink cluster from version 1.7 to 1.10. And we met some problem when calling the Flink REST API.
> [...]
Thanks for reporting this issue, Lucent. One way to solve it would be to reintroduce the RedirectHandler and allow the user to choose between forwarding to the leader, as is done right now in Flink 1.10, and using redirects, as was the case in the past. If I remember correctly, redirects also had their problems.

Cheers,
Till

On Mon, Jun 15, 2020 at 8:16 PM Chesnay Schepler wrote:
> I think this is not intentional and simply a case we did not consider.
>
> Please file a JIRA.
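For comparison, the redirect approach mentioned above can be sketched with a similar toy model. Again, this is not Flink code: the class names, the addresses, and the use of a 307 status code are assumptions for illustration. It shows why a redirect makes the status lookup succeed: the client ends up talking to the leader process itself, so the leader's local state is the one that gets consulted. It also shows that the client has to be able to reach the leader address directly.

```python
class LeaderEndpoint:
    """Toy leader endpoint holding operation state locally (hypothetical)."""

    def __init__(self, address):
        self.address = address
        self.operations = {}  # trigger id -> status, local to the leader

    def get_status(self, trigger_id):
        if trigger_id in self.operations:
            return 200, self.operations[trigger_id]
        return 404, None


class RedirectingStandby:
    """Toy standby that answers every request with a redirect to the leader."""

    def __init__(self, address, leader_address):
        self.address = address
        self.leader_address = leader_address

    def get_status(self, trigger_id):
        # 307 is assumed here; the payload carries the leader's address.
        return 307, self.leader_address


def query_status(endpoints, start_address, trigger_id, max_redirects=3):
    """Follow redirects until some endpoint gives a non-redirect answer."""
    address = start_address
    for _ in range(max_redirects + 1):
        code, payload = endpoints[address].get_status(trigger_id)
        if code != 307:
            return address, code, payload
        address = payload  # the client must be able to reach this address
    raise RuntimeError("too many redirects")


leader = LeaderEndpoint("jm-leader:8081")
standby = RedirectingStandby("jm-standby:8081", leader.address)
endpoints = {leader.address: leader, standby.address: standby}

# Pretend a savepoint was triggered on the leader and has finished.
leader.operations["abc123"] = "COMPLETED"

answered_by, code, status = query_status(endpoints, standby.address, "abc123")
print(answered_by, code, status)  # jm-leader:8081 200 COMPLETED
```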