[jira] [Created] (FLINK-19778) Failed job reinitiated with wrong checkpoint after a ZK reconnection

Shang Yuanchun (Jira)
Paul Lin created FLINK-19778:
--------------------------------

             Summary: Failed job reinitiated with wrong checkpoint after a ZK reconnection
                 Key: FLINK-19778
                 URL: https://issues.apache.org/jira/browse/FLINK-19778
             Project: Flink
          Issue Type: Bug
          Components: Runtime / Checkpointing
    Affects Versions: 1.11.0
            Reporter: Paul Lin
         Attachments: jm_log

We have a Flink 1.11.0 job running on YARN that reached the FAILED state because its JobManager lost leadership during a ZK full GC. After the ZK connection recovered, however, the job was somehow reinitiated with no checkpoints found in ZK, so it was restored from an earlier savepoint, which unexpectedly rewound the job.
 
For details please see the jobmanager logs in the attachment.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)