Hi All,
Desc
We deploy Flink streaming jobs on a Hadoop cluster in per-job mode and use ZooKeeper as the high-availability service. We found that a Flink job restarts whenever the network connection between the JobManager and ZooKeeper is temporarily interrupted, so we analyzed the problem in depth.

The Flink JobManager uses Curator's `LeaderLatch` to maintain leadership. When the network disconnects, `LeaderLatch` immediately sets leadership to false. We think it is too drastic for many long-running Flink jobs to restart because of a brief network glitch. Instead of revoking leadership as soon as the ZooKeeper connection becomes SUSPENDED, it would be better to wait until the connection is LOST.

There are two JIRA issues about this problem, FLINK-10052 and FLINK-13189, and they are duplicates. Thanks to @Elias Levy for pointing that out, so we closed FLINK-13189.

Solution
There are currently two ways to solve this: one is to rewrite the LeaderLatch#handleStateChange method, the other is to upgrade to Curator 4.2.0. The first is hacky but correct; the second requires considering compatibility. For more details, please see FLINK-10052.

Hope
FLINK-10052 was reported on 2018-08-03 (about a year ago), so we hope this problem can be fixed as soon as possible.
By the way, thanks to @TisonKun for discussing this problem and reviewing the PR.

Links
FLINK-10052 https://issues.apache.org/jira/browse/FLINK-10052
FLINK-13189 https://issues.apache.org/jira/browse/FLINK-13189

Any suggestions are welcome. What do you think?

Best,
lamber-ken.
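To make the difference between the two behaviors concrete, here is a minimal, self-contained sketch. It is a toy model only and not Curator's or Flink's actual code: the class `SuspendedToleranceSketch`, the `Policy` enum, and `countRevocations` are hypothetical names invented for illustration, and the state names simply mirror the SUSPENDED / RECONNECTED / LOST connection states discussed above. It also assumes, as a simplification, that the contender is leader again as soon as the connection is back, whereas a real latch would re-enter the election.

```java
import java.util.Arrays;
import java.util.List;

public class SuspendedToleranceSketch {

    /** Connection states, named after the ZooKeeper connection states discussed in the post. */
    enum ConnState { CONNECTED, SUSPENDED, RECONNECTED, LOST }

    /** Hypothetical policy switch for this sketch; not a real Flink or Curator option. */
    enum Policy { REVOKE_ON_SUSPENDED, REVOKE_ON_LOST }

    /**
     * Replays connection events against a contender that starts out as leader
     * and returns how many times leadership was revoked. In the scenario from
     * the post, each revocation corresponds to a job restart.
     */
    static int countRevocations(Policy policy, List<ConnState> events) {
        boolean leader = true;
        int revocations = 0;
        for (ConnState event : events) {
            switch (event) {
                case SUSPENDED:
                    // Current behavior described in the post: give up leadership right away.
                    if (policy == Policy.REVOKE_ON_SUSPENDED && leader) {
                        leader = false;
                        revocations++;
                    }
                    break;
                case LOST:
                    // Both policies give up leadership once the connection is LOST.
                    if (leader) {
                        leader = false;
                        revocations++;
                    }
                    break;
                case CONNECTED:
                case RECONNECTED:
                    // Simplifying assumption: leadership is regained on reconnect.
                    leader = true;
                    break;
            }
        }
        return revocations;
    }

    public static void main(String[] args) {
        // A short network glitch: the connection is suspended and then recovers.
        List<ConnState> glitch = Arrays.asList(ConnState.SUSPENDED, ConnState.RECONNECTED);
        // A real outage: the connection is suspended and then declared lost.
        List<ConnState> outage = Arrays.asList(ConnState.SUSPENDED, ConnState.LOST, ConnState.RECONNECTED);

        for (Policy policy : Policy.values()) {
            System.out.println(policy + " glitch=" + countRevocations(policy, glitch)
                    + " outage=" + countRevocations(policy, outage));
        }
    }
}
```

In this model, a short glitch costs one revocation under the revoke-on-SUSPENDED policy and none under the revoke-on-LOST policy, while a real outage costs one revocation under either policy. That is the trade-off the proposal is after: tolerate brief interruptions without changing what happens when the connection is really lost.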
Nice topic. Our Flink jobs ran into this problem too, and I think this work can help us deal with it.
Hi Lamber-Ken,
Thanks for starting this discussion. I think there is a benefit to not immediately losing leadership when the ZooKeeper connection goes into the SUSPENDED state. In particular, if we can guarantee that there is only a single JobMaster, it might make sense not to give up leadership too eagerly. I would suggest continuing the technical discussion on the JIRA issue, since it already contains a good amount of detail.

Cheers,
Till
OK. If you have any suggestions, we can talk about the details under FLINK-10052.
Best.
Hi committers,
Now that there is an open PR [1] for this JIRA issue, we need a committer to move it forward. It would be great to see this issue fixed in 1.9.0.

Best,
tison.

[1] https://github.com/apache/flink/pull/9158
Hi Tison,
I would consider this a new feature, so it won't be possible to include it in the 1.9.0 release, since the feature freeze has already passed. We might target 1.10, though.

Cheers,
Till
Hi Till,
Thanks for your explanation. Let's pick this thread up again during 1.10 development.

Best,
tison.
Hi Till,
I'd like to revive this thread now that 1.9.0 has been released. IMHO we have already reached a consensus on JIRA, and if you can review the pull request, we can hopefully address the issue in the next release.

Best,
tison.