Zili Chen created FLINK-14685:
---------------------------------
Summary: ZooKeeperCheckpointIDCounter forever broken if once loss connection with ZK
Key: FLINK-14685
URL: https://issues.apache.org/jira/browse/FLINK-14685
Project: Flink
Issue Type: Bug
Components: Runtime / Checkpointing, Runtime / Coordination
Affects Versions: 1.10.0
Reporter: Zili Chen
Fix For: 1.10.0
Currently, if {{ZooKeeperCheckpointIDCounter}} enters the SUSPENDED state, i.e. loses its connection, it marks its state as invalid, so that all subsequent checkpoint ID counter operations fail.
Coupled with JM leadership management, we generate a new ID counter when leadership is re-granted, so this has not been a problem so far. The semantics are wrong, however, because the ID counter should only check whether the current state is SUSPENDED/LOST.
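To make the difference concrete, here is a minimal, dependency-free sketch rather than the actual Flink code. {{ConnState}}, {{StickyCounter}} and {{StateCheckingCounter}} are hypothetical names, the enum only stands in for Curator's connection states, and an {{AtomicLong}} stands in for the counter stored in ZooKeeper.

```java
import java.util.concurrent.atomic.AtomicLong;

final class CheckpointIdCounterSketch {

    // Hypothetical stand-in for Curator's connection states (assumption, not the real type).
    enum ConnState { CONNECTED, SUSPENDED, RECONNECTED, LOST }

    // Behaviour described in this issue: one connection loss invalidates the counter for good.
    static final class StickyCounter {
        private final AtomicLong next = new AtomicLong(1); // stand-in for the ZK-backed count
        private volatile boolean invalid;

        void stateChanged(ConnState newState) {
            if (newState == ConnState.SUSPENDED || newState == ConnState.LOST) {
                invalid = true; // never reset, even after RECONNECTED
            }
        }

        long getAndIncrement() {
            if (invalid) {
                throw new IllegalStateException("connection was lost at some point");
            }
            return next.getAndIncrement();
        }
    }

    // Proposed semantics: only the current connection state decides.
    static final class StateCheckingCounter {
        private final AtomicLong next = new AtomicLong(1); // stand-in for the ZK-backed count
        private volatile ConnState lastState = ConnState.CONNECTED;

        void stateChanged(ConnState newState) {
            lastState = newState;
        }

        long getAndIncrement() {
            ConnState state = lastState;
            if (state == ConnState.SUSPENDED || state == ConnState.LOST) {
                throw new IllegalStateException("connection is currently " + state);
            }
            return next.getAndIncrement();
        }
    }

    public static void main(String[] args) {
        StickyCounter sticky = new StickyCounter();
        StateCheckingCounter checking = new StateCheckingCounter();

        // A short SUSPENDED -> RECONNECTED blip without a new leader election.
        for (ConnState s : new ConnState[] {ConnState.SUSPENDED, ConnState.RECONNECTED}) {
            sticky.stateChanged(s);
            checking.stateChanged(s);
        }

        System.out.println("state-checking counter: " + checking.getAndIncrement());
        try {
            sticky.getAndIncrement();
        } catch (IllegalStateException e) {
            System.out.println("sticky counter still broken: " + e.getMessage());
        }
    }
}
```

In this sketch the state-checking variant works again as soon as the connection is back, while the sticky variant stays unusable until a new counter instance is created.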
It is also a blocker for upgrading to Curator 4.2, and [~lamber-ken] provides a [fix|https://github.com/BigDataArtisans/flink/commit/bd146ddcd1d9e0501f7e792875f5887edb8b7299] there.
Besides, in a production scenario we once noticed that the JM was not re-elected when SUSPENDED-RECONNECTED happened very quickly (this shouldn't happen after [~trohrmann] added linearized leader operations), so that a JM kept running with a broken ID counter.
I think it is reasonable to pick [~lamber-ken]'s commit as a separate issue and fix these wrong semantics.
CC [~GJL] [~azagrebin]
--
This message was sent by Atlassian Jira
(v8.3.4#803005)