[jira] [Created] (FLINK-14685) ZooKeeperCheckpointIDCounter forever broken if once loss connection with ZK

Shang Yuanchun (Jira)
Zili Chen created FLINK-14685:
---------------------------------

             Summary: ZooKeeperCheckpointIDCounter forever broken if once loss connection with ZK
                 Key: FLINK-14685
                 URL: https://issues.apache.org/jira/browse/FLINK-14685
             Project: Flink
          Issue Type: Bug
          Components: Runtime / Checkpointing, Runtime / Coordination
    Affects Versions: 1.10.0
            Reporter: Zili Chen
             Fix For: 1.10.0


Currently, if {{ZooKeeperCheckpointIDCounter}} enters the SUSPENDED state, i.e. loses its connection to ZooKeeper, it marks its state as invalid, so all subsequent checkpoint ID counter operations fail, even after the connection is restored.

Coupled with JM leadership management, we create a new ID counter whenever leadership is re-granted, so this has not been a problem so far. The semantics are still wrong, though: the ID counter should only check whether the current state is SUSPENDED/LOST.
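
To illustrate the difference, here is a minimal sketch in plain Java. It is a simplified model of the behavior described above, not the actual Flink or Curator code: {{ConnState}}, {{StickyCounter}} and {{CurrentStateCounter}} are hypothetical names, the enum only imitates the connection states mentioned in this issue, and an {{AtomicLong}} stands in for the ZooKeeper-backed shared count.

```java
import java.util.concurrent.atomic.AtomicLong;

/** Hypothetical stand-in for the connection states named in this issue. */
enum ConnState { CONNECTED, SUSPENDED, RECONNECTED, LOST }

/** Models the current behavior: a bad state is remembered and never cleared. */
class StickyCounter {
    private final AtomicLong next = new AtomicLong(1); // stand-in for the shared count
    private volatile ConnState lastBadState;           // set on SUSPENDED/LOST, never reset

    void stateChanged(ConnState newState) {
        if (newState == ConnState.SUSPENDED || newState == ConnState.LOST) {
            lastBadState = newState;
        }
    }

    long getAndIncrement() {
        if (lastBadState != null) {
            throw new IllegalStateException("Connection state: " + lastBadState);
        }
        return next.getAndIncrement();
    }
}

/** Models the proposed semantics: only the current state is checked. */
class CurrentStateCounter {
    private final AtomicLong next = new AtomicLong(1); // stand-in for the shared count
    private volatile ConnState current = ConnState.CONNECTED;

    void stateChanged(ConnState newState) {
        current = newState; // RECONNECTED overwrites an earlier SUSPENDED
    }

    long getAndIncrement() {
        ConnState state = current;
        if (state == ConnState.SUSPENDED || state == ConnState.LOST) {
            throw new IllegalStateException("Connection state: " + state);
        }
        return next.getAndIncrement();
    }
}

class CounterDemo {
    public static void main(String[] args) {
        StickyCounter sticky = new StickyCounter();
        sticky.stateChanged(ConnState.SUSPENDED);
        sticky.stateChanged(ConnState.RECONNECTED);
        try {
            System.out.println("sticky: " + sticky.getAndIncrement());
        } catch (IllegalStateException e) {
            System.out.println("sticky: still failing after reconnect, " + e.getMessage());
        }

        CurrentStateCounter fixed = new CurrentStateCounter();
        fixed.stateChanged(ConnState.SUSPENDED);
        fixed.stateChanged(ConnState.RECONNECTED);
        System.out.println("current-state: " + fixed.getAndIncrement());
    }
}
```

In this model the first counter keeps rejecting operations after a SUSPENDED followed by RECONNECTED, while the second one rejects them only while the connection is actually SUSPENDED or LOST.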

It also blocks the upgrade to Curator 4.2, and [~lamber-ken] provided a [fix|https://github.com/BigDataArtisans/flink/commit/bd146ddcd1d9e0501f7e792875f5887edb8b7299] there.

Besides, in a production scenario we once noticed that the JM was not re-elected when SUSPENDED was followed very quickly by RECONNECTED (this shouldn't happen after [~trohrmann] added linearized leader operations), so the JM kept running with a broken ID counter.

I think it is reasonable to pick [~lamber-ken]'s commit as a separate issue and fix these wrong semantics.

CC [~GJL] [~azagrebin]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)