handle SUSPENDED in ZooKeeperLeaderRetrievalService


handle SUSPENDED in ZooKeeperLeaderRetrievalService

chenqin
Hi there,

We observed several jobs running on Flink 1.11 restart due to job leader loss.
Digging deeper, the issue seems related to the SUSPENDED state handling in
ZooKeeperLeaderRetrievalService.

AFAIK, the SUSPENDED state is expected when ZooKeeper is not certain whether the
leader is still alive. It can be followed by RECONNECTED or LOST. In the current
implementation [1], we treat the SUSPENDED state the same as the LOST state and
actively shut down the job. This poses a stability issue in large HA setups.

My question is: can we get some insight into the reasoning behind this decision,
and could we add a tunable configuration that lets users decide how long they can
tolerate such an uncertain SUSPENDED state in their jobs?

Thanks,
Chen

[1]
https://github.com/apache/flink/blob/release-1.11/flink-runtime/src/main/java/org/apache/flink/runtime/leaderretrieval/ZooKeeperLeaderRetrievalService.java#L201
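
For readers who don't want to open the link: the pattern at [1] boils down to a
Curator ConnectionStateListener. The following is only a simplified, illustrative
sketch of that pattern (the class name and the notifyLeaderLoss callback are made
up for illustration, not the actual Flink code):

import org.apache.curator.framework.CuratorFramework;
import org.apache.curator.framework.state.ConnectionState;
import org.apache.curator.framework.state.ConnectionStateListener;

// Illustrative sketch of the "treat SUSPENDED like LOST" behaviour described above.
class LeaderLossOnSuspendListener implements ConnectionStateListener {

    private final Runnable notifyLeaderLoss; // hypothetical callback into the retrieval service

    LeaderLossOnSuspendListener(Runnable notifyLeaderLoss) {
        this.notifyLeaderLoss = notifyLeaderLoss;
    }

    @Override
    public void stateChanged(CuratorFramework client, ConnectionState newState) {
        switch (newState) {
            case SUSPENDED: // connection uncertain; may be followed by RECONNECTED or LOST
            case LOST:      // session definitely expired
                // Both states are handled identically: leadership is revoked right away,
                // which is what triggers the job restarts described in this thread.
                notifyLeaderLoss.run();
                break;
            default:
                // CONNECTED / RECONNECTED / READ_ONLY: nothing to do in this sketch
                break;
        }
    }
}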





Re: handle SUSPENDED in ZooKeeperLeaderRetrievalService

Yang Wang
This might be related to FLINK-10052 [1].
Unfortunately, we do not have any progress on this ticket yet.

cc @Till Rohrmann <[hidden email]>

[1] https://issues.apache.org/jira/browse/FLINK-10052

Best,
Yang


Re: handle SUSPENDED in ZooKeeperLeaderRetrievalService

Till Rohrmann
Hi Chenqin,

The current rationale behind assuming a leadership loss when seeing a
SUSPENDED connection is to assume the worst and to be on the safe side.

Yang Wang is correct. FLINK-10052 [1] has the goal to make the behaviour
configurable. Unfortunately, the community did not have enough time to
complete this feature.

[1] https://issues.apache.org/jira/browse/FLINK-10052

Cheers,
Till
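
Just to illustrate the kind of configurability being asked for (a hypothetical
sketch, not the design of FLINK-10052; the listener name, the notifyLeaderLoss
callback, and the gracePeriodMs parameter are all assumptions): on SUSPENDED one
could schedule a delayed leader-loss notification and cancel it if the connection
comes back in time.

import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.ScheduledFuture;
import java.util.concurrent.TimeUnit;
import org.apache.curator.framework.CuratorFramework;
import org.apache.curator.framework.state.ConnectionState;
import org.apache.curator.framework.state.ConnectionStateListener;

// Hypothetical sketch: tolerate SUSPENDED for a configurable grace period.
class GracePeriodListener implements ConnectionStateListener {

    private final ScheduledExecutorService timer = Executors.newSingleThreadScheduledExecutor();
    private final Runnable notifyLeaderLoss; // illustrative callback
    private final long gracePeriodMs;        // "how long can we endure SUSPENDED"
    private ScheduledFuture<?> pendingLoss;

    GracePeriodListener(Runnable notifyLeaderLoss, long gracePeriodMs) {
        this.notifyLeaderLoss = notifyLeaderLoss;
        this.gracePeriodMs = gracePeriodMs;
    }

    @Override
    public synchronized void stateChanged(CuratorFramework client, ConnectionState newState) {
        switch (newState) {
            case SUSPENDED:
                // Do not revoke leadership yet; only act if the connection does not recover in time.
                pendingLoss = timer.schedule(notifyLeaderLoss, gracePeriodMs, TimeUnit.MILLISECONDS);
                break;
            case RECONNECTED:
                if (pendingLoss != null) {
                    pendingLoss.cancel(false); // recovered within the grace period
                    pendingLoss = null;
                }
                break;
            case LOST:
                if (pendingLoss != null) {
                    pendingLoss.cancel(false);
                    pendingLoss = null;
                }
                notifyLeaderLoss.run(); // the session definitely expired
                break;
            default:
                break;
        }
    }
}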


Re: handle SUSPENDED in ZooKeeperLeaderRetrievalService

chenqin
Hi there,

Thanks for providing pointers to the related changes and JIRA. Some updates from our side: we applied a patch by merging FLINK-10052 into master, and we only handle the LOST state by leveraging the SessionConnectionStateErrorPolicy that FLINK-10052 introduced.

Preliminary results were good: the same workload (240 TMs) in the same environment runs stably without the frequent restarts caused by the SUSPENDED state (which appears to be a false positive). We are working on more stringent load testing as well as chaos testing (blocking ZooKeeper). Will keep folks posted.

Thanks,
Chen
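
(In case it helps others reproduce this: assuming the shaded Curator 4.x that Flink
ships, the policy mentioned above is wired in when building the client, roughly like
the sketch below. The connect string and timeouts are placeholders, and this is not
the exact patch we applied.)

import org.apache.curator.framework.CuratorFramework;
import org.apache.curator.framework.CuratorFrameworkFactory;
import org.apache.curator.framework.state.SessionConnectionStateErrorPolicy;
import org.apache.curator.retry.ExponentialBackoffRetry;

public final class SessionPolicyExample {
    public static void main(String[] args) {
        // With SessionConnectionStateErrorPolicy only LOST (session expiration) is an
        // "error" connection state; SUSPENDED (a recoverable network blip) is not.
        CuratorFramework client = CuratorFrameworkFactory.builder()
                .connectString("zk-host:2181") // placeholder address
                .sessionTimeoutMs(60_000)
                .connectionTimeoutMs(15_000)
                .retryPolicy(new ExponentialBackoffRetry(5_000, 3))
                .connectionStateErrorPolicy(new SessionConnectionStateErrorPolicy())
                .build();
        client.start();
    }
}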



Re: handle SUSPENDED in ZooKeeperLeaderRetrievalService

Yang Wang
Thanks for trying the unfinished PR and sharing the testing results. Glad
to hear that it could work, and I am really looking forward to the results of
the more stringent load testing.

After that, I think we could revive this ticket.


Best,
Yang


Re: handle SUSPENDED in ZooKeeperLeaderRetrievalService

chenqin
Hi there,

Quick update here: we have been running load testing and so far haven't
seen the SUSPENDED state cause job restarts.

Some findings: instead of the Curator framework capturing the SUSPENDED state
and actively notifying a leader loss, we have seen the task manager propagate
unhandled errors from the ZooKeeper client, most likely because
high-availability.zookeeper.client.max-retry-attempts
was set to 3 with a 5 second interval. It would be great if we handled this
exception gracefully with a meaningful exception message. These errors appear
when other task managers die due to user code exceptions, and we would like
more insight into this as well.

For more context, Lu from our team also filed [2] describing the issue on 1.9;
so far we haven't seen a regression on the ongoing load testing jobs.

Thanks,
Chen

Caused by: org.apache.flink.shaded.zookeeper3.org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss
    at org.apache.flink.shaded.zookeeper3.org.apache.zookeeper.KeeperException.create(KeeperException.java:102)
    at org.apache.flink.shaded.curator4.org.apache.curator.framework.imps.CuratorFrameworkImpl.checkBackgroundRetry(CuratorFrameworkImpl.java:862)


[1] https://issues.apache.org/jira/browse/FLINK-10052
[2] https://issues.apache.org/jira/browse/FLINK-19985
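
For reference, the retry budget mentioned above is controlled by the ZooKeeper HA
client options in flink-conf.yaml; the values below are, to the best of my knowledge,
the 1.11 defaults and are shown only to indicate which knobs are involved, not as a
recommendation:

high-availability.zookeeper.client.session-timeout: 60000
high-availability.zookeeper.client.connection-timeout: 15000
high-availability.zookeeper.client.retry-wait: 5000
high-availability.zookeeper.client.max-retry-attempts: 3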



Re: handle SUSPENDED in ZooKeeperLeaderRetrievalService

tison
> Unfortunately, we do not have any progress on this ticket.

Here is a PR[1].

Here is the base PR[2] I made about one year ago that never received a follow-up review.

[hidden email]:

It requires further investigation into the impact of FLINK-18677[3].
I do have some comments[4], but so far I regard it as a stability problem
rather than a correctness problem.

FLINK-18677 tries to "fix" an unreasonable scenario where ZooKeeper is lost FOREVER,
and I don't want to spend any more time on it before there is a reaction on
FLINK-10052; otherwise it is highly likely to be in vain again, from my perspective.

Best,
tison.

[1] https://github.com/apache/flink/pull/15675
[2] https://github.com/apache/flink/pull/11338
[3] https://issues.apache.org/jira/browse/FLINK-18677
[4] https://github.com/apache/flink/pull/13055#discussion_r615871963




Re: handle SUSPENDED in ZooKeeperLeaderRetrievalService

tison
> My question is can we get some insight behind this decision and could we add
> some tunable configuration for user to decide how long they can endure such
> uncertain suspended state in their jobs.

For that specific question, Curator provides a configuration for the session
timeout, and a LOST event will be generated if the disconnection lasts longer
than the configured timeout.

https://github.com/apache/flink/blob/58a7c80fa35424608ad44d1d6691d1407be0092a/flink-runtime/src/main/java/org/apache/flink/runtime/util/ZooKeeperUtils.java#L101-L102


Best,
tison.


tison <[hidden email]> wrote on Fri, Apr 23, 2021 at 12:57 AM:

> To be concrete, if ZK is suspended and then reconnected, the NodeCache already
> does the reset work for you, and if there is a leader epoch update, the fencing
> token (a.k.a. the leader session id) will be updated, so you will notice it.
>
> If ZK is permanently lost, I think it is a system-wide fault and you'd better
> restart the job from a checkpoint/savepoint with a working ZK ensemble.
>
> I am possibly concluding without more detailed investigation though.
>
> Best,
> tison.
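
(For readers unfamiliar with the recipe referred to above: a rough sketch of watching
the leader node with Curator's NodeCache follows. The class name, the path argument,
and the plain-string payload are placeholders; in Flink the node would hold the
serialized leader address and session id.)

import java.nio.charset.StandardCharsets;
import org.apache.curator.framework.CuratorFramework;
import org.apache.curator.framework.recipes.cache.ChildData;
import org.apache.curator.framework.recipes.cache.NodeCache;

public final class LeaderNodeWatch {
    // Rough sketch: the NodeCache re-reads the node after reconnects, so the listener
    // observes any leader info written while the connection was SUSPENDED.
    public static NodeCache watchLeaderNode(CuratorFramework client, String leaderPath) throws Exception {
        NodeCache leaderCache = new NodeCache(client, leaderPath);
        leaderCache.getListenable().addListener(() -> {
            ChildData data = leaderCache.getCurrentData();
            if (data != null) {
                System.out.println(
                        "Leader info updated: " + new String(data.getData(), StandardCharsets.UTF_8));
            } else {
                System.out.println("Leader node removed");
            }
        });
        leaderCache.start();
        return leaderCache;
    }
}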

Re: handle SUSPENDED in ZooKeeperLeaderRetrievalService

chenqin
Hi Tison,

Please read my latest comments in the thread. Using the SessionConnectionStateErrorPolicy
mitigated the SUSPENDED state issue, but it might trigger an unhandled ZooKeeper
client exception in some situations. We would like to understand the root cause of
that issue to avoid introducing another issue with the fix.

Chen



Re: handle SUSPENDED in ZooKeeperLeaderRetrievalService

tison
Could you show the log indicating which unhandled exception was thrown?

Best,
tison.



Re: handle SUSPENDED in ZooKeeperLeaderRetrievalService

tison
The original log (or the relevant section of it) is preferred over a rephrasing.

Best,
tison.



Re: handle SUSPENDED in ZooKeeperLeaderRetrievalService

chenqin
Sure, I updated the JIRA with the exception info. We can follow up there for the
technical discussion.

https://issues.apache.org/jira/browse/FLINK-10052?focusedCommentId=17330858&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-17330858
