(DEPRECATED) Apache Flink Mailing List archive.

[DISCUSS] Tolerate temporarily suspended ZooKeeper connections


未来阳光

[DISCUSS] Tolerate temporarily suspended ZooKeeper connections

Hi All,

Description
We deploy Flink streaming jobs on a Hadoop cluster in per-job mode and use ZooKeeper for the high-availability services. We found that Flink jobs restart whenever the network connection between the JobManager and ZooKeeper is temporarily interrupted, so we analyzed the problem in depth. The Flink JobManager uses Curator's `LeaderLatch` to maintain leadership. When the network disconnects, the `LeaderLatch` immediately sets leadership to false. We think this is too harsh, because many long-running Flink jobs restart just because of a brief network glitch. Instead of revoking leadership immediately upon a SUSPENDED ZooKeeper connection, it would be better to wait until the ZooKeeper connection is LOST.

There are two JIRA issues about this problem, FLINK-10052 and FLINK-13189, and they are duplicates. Thanks to @Elias Levy for pointing this out; FLINK-13189 has been closed.

Solution
There are currently two ways to solve this: one is to rewrite the LeaderLatch#handleStateChange method, the other is to upgrade Curator to 4.2.0. The first is hacky but correct; the second requires considering compatibility. For more details, please see FLINK-10052.
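
To make the proposal concrete, here is a minimal, self-contained sketch of the difference between revoking leadership on SUSPENDED and waiting for LOST. It is not Curator's `LeaderLatch` and not the code in the pull request: `ConnState` and `LeadershipTracker` are hypothetical names that only mirror the connection states discussed above, and the sketch assumes a single contender that regains leadership as soon as the connection comes back.

```java
/** Hypothetical stand-in for the ZooKeeper connection states discussed above. */
enum ConnState { CONNECTED, SUSPENDED, RECONNECTED, LOST }

/** Illustrative only: tracks leadership under an eager or a tolerant policy. */
final class LeadershipTracker {
    private final boolean tolerateSuspended; // assumption: a simple on/off switch
    private boolean leader = true;           // assume we start out as the leader
    private int revocations = 0;             // each revocation would restart the job

    LeadershipTracker(boolean tolerateSuspended) {
        this.tolerateSuspended = tolerateSuspended;
    }

    /** Applies one connection state change and returns whether we still lead. */
    boolean onStateChange(ConnState state) {
        switch (state) {
            case SUSPENDED:
                // Eager mode gives up leadership at once; tolerant mode waits.
                if (!tolerateSuspended) {
                    revoke();
                }
                break;
            case LOST:
                // The session is gone, so both modes must give up leadership.
                revoke();
                break;
            case RECONNECTED:
                // Simplification: a single contender wins the election again.
                leader = true;
                break;
            default:
                break;
        }
        return leader;
    }

    private void revoke() {
        if (leader) {
            leader = false;
            revocations++;
        }
    }

    boolean isLeader() { return leader; }

    int revocations() { return revocations; }
}

final class SuspendedToleranceDemo {
    public static void main(String[] args) {
        // A short network glitch: the connection is suspended, then restored.
        ConnState[] flap = { ConnState.SUSPENDED, ConnState.RECONNECTED };
        LeadershipTracker eager = new LeadershipTracker(false);
        LeadershipTracker tolerant = new LeadershipTracker(true);
        for (ConnState s : flap) {
            eager.onStateChange(s);
            tolerant.onStateChange(s);
        }
        System.out.println("eager revocations: " + eager.revocations());
        System.out.println("tolerant revocations: " + tolerant.revocations());
    }
}
```

For a brief SUSPENDED then RECONNECTED flap, the eager tracker records one revocation (one job restart in our setup), while the tolerant tracker records none. Both give up leadership once the connection is LOST.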

Hope
FLINK-10052 was reported on 2018-08-03 (about a year ago), so we hope this problem can be fixed as soon as possible.
By the way, thanks to @TisonKun for discussing this problem and reviewing the PR.

Links
FLINK-10052 https://issues.apache.org/jira/browse/FLINK-10052
FLINK-13189 https://issues.apache.org/jira/browse/FLINK-13189

Any suggestions are welcome. What do you think?

Best, lamber-ken.

abell kin

Re: [DISCUSS] Tolerate temporarily suspended ZooKeeper connections

Nice topic. Our Flink jobs hit this problem too, and I think this work can help us deal with it.

On 2019/07/20 10:55:23, QQ邮箱 <[hidden email]> wrote:

> Hi All,
>
> Description
> We deploy Flink streaming jobs on a Hadoop cluster in per-job mode and use ZooKeeper for the high-availability services. We found that Flink jobs restart whenever the network connection between the JobManager and ZooKeeper is temporarily interrupted, so we analyzed the problem in depth. The Flink JobManager uses Curator's `LeaderLatch` to maintain leadership. When the network disconnects, the `LeaderLatch` immediately sets leadership to false. We think this is too harsh, because many long-running Flink jobs restart just because of a brief network glitch. Instead of revoking leadership immediately upon a SUSPENDED ZooKeeper connection, it would be better to wait until the ZooKeeper connection is LOST.
>
> There are two JIRA issues about this problem, FLINK-10052 and FLINK-13189, and they are duplicates. Thanks to @Elias Levy for pointing this out; FLINK-13189 has been closed.
>
> Solution
> There are currently two ways to solve this: one is to rewrite the LeaderLatch#handleStateChange method, the other is to upgrade Curator to 4.2.0. The first is hacky but correct; the second requires considering compatibility. For more details, please see FLINK-10052.
>
> Hope
> FLINK-10052 was reported on 2018-08-03 (about a year ago), so we hope this problem can be fixed as soon as possible.
> By the way, thanks to @TisonKun for discussing this problem and reviewing the PR.
>
> Links
> FLINK-10052 https://issues.apache.org/jira/browse/FLINK-10052
> FLINK-13189 https://issues.apache.org/jira/browse/FLINK-13189
>
> Any suggestions are welcome. What do you think?
>
> Best, lamber-ken.

Till Rohrmann

Re: [DISCUSS] Tolerate temporarily suspended ZooKeeper connections

In reply to this post by 未来阳光

Hi Lamber-Ken,

Thanks for starting this discussion. I think there is a benefit in not
immediately losing leadership if the ZooKeeper connection goes into the
SUSPENDED state. In particular, if we can guarantee that there is only a
single JobMaster, it might make sense not to give up leadership too
eagerly. I would suggest continuing the technical discussion on the
JIRA issue, since it already contains a good amount of detail.

Cheers,
Till


未来阳光

Re: [DISCUSS] Tolerate temporarily suspended ZooKeeper connections

OK, if you have any suggestions, we can talk about the details under FLINK-10052.

Best.

------------------ Original Message ------------------
From: "Till Rohrmann"<[hidden email]>;
Sent: Tuesday, July 23, 2019, 9:19 PM
To: "dev"<[hidden email]>;

Subject: Re: [DISCUSS] Tolerate temporarily suspended ZooKeeper connections

Hi Lamber-Ken,

Thanks for starting this discussion. I think there is a benefit in not
immediately losing leadership if the ZooKeeper connection goes into the
SUSPENDED state. In particular, if we can guarantee that there is only a
single JobMaster, it might make sense not to give up leadership too
eagerly. I would suggest continuing the technical discussion on the
JIRA issue, since it already contains a good amount of detail.

Cheers,
Till


tison

Re: [DISCUSS] Tolerate temporarily suspended ZooKeeper connections

Hi committers,

Now that there is an open PR [1] for this JIRA issue, we need a committer
to push this thread forward. It would be great to see this issue fixed
in 1.9.0.

Best,
tison.

[1] https://github.com/apache/flink/pull/9158

未来阳光 <[hidden email]> wrote on Tue, Jul 23, 2019 at 9:28 PM:

> OK, if you have any suggestions, we can talk about the details under
> FLINK-10052.
>
> Best.

Till Rohrmann

Re: [DISCUSS] Tolerate temporarily suspended ZooKeeper connections

Hi Tison,

I would consider this a new feature, and as such it won't be possible to
include it in the 1.9.0 release, since the feature freeze has already passed.
We might target 1.10, though.

Cheers,
Till

On Mon, Jul 29, 2019 at 3:01 AM Zili Chen <[hidden email]> wrote:

> Hi committers,
>
> Now that there is an open PR [1] for this JIRA issue, we need a committer
> to push this thread forward. It would be great to see this issue fixed
> in 1.9.0.
>
> Best,
> tison.
>
> [1] https://github.com/apache/flink/pull/9158

tison

Re: [DISCUSS] Tolerate temporarily suspended ZooKeeper connections

Hi Till,

Thanks for your explanation. Let's pick this thread up again during the 1.10 development cycle.

Best,
tison.

Till Rohrmann <[hidden email]> wrote on Mon, Jul 29, 2019 at 9:12 PM:

> Hi Tison,
>
> I would consider this a new feature, and as such it won't be possible to
> include it in the 1.9.0 release, since the feature freeze has already passed.
> We might target 1.10, though.
>
> Cheers,
> Till

tison

Re: [DISCUSS] Tolerate temporarily suspended ZooKeeper connections

Hi Till,

I'd like to revive this thread now that 1.9.0 has been released.

IMHO we have already reached a consensus on JIRA, and if you can review
the pull request, we can hopefully address the issue in the next release.

Best,
tison.

Zili Chen <[hidden email]> wrote on Mon, Jul 29, 2019 at 11:05 PM:

> Hi Till,
>
> Thanks for your explanation. Let's pick this thread up again during the 1.10 development cycle.
>
> Best,
> tison.