Till Rohrmann created FLINK-11813:
-------------------------------------
Summary: Standby per job mode Dispatchers don't know job's JobSchedulingStatus
Key: FLINK-11813
URL: https://issues.apache.org/jira/browse/FLINK-11813
Project: Flink
Issue Type: Bug
Components: Runtime / Coordination
Affects Versions: 1.7.2, 1.6.4, 1.8.0
Reporter: Till Rohrmann
At the moment, a standby {{Dispatcher}} in per-job mode can restart an already terminated job after it gains leadership. The problem is that we currently clear the {{RunningJobsRegistry}} once a job has reached a globally terminal state. After the leading {{Dispatcher}} terminates, a standby {{Dispatcher}} gains leadership. Without the information from the {{RunningJobsRegistry}}, it cannot tell whether the job has already been executed or whether it needs to re-execute the job. At the moment, the {{Dispatcher}} assumes that there was a fault and hence re-executes the job. This can lead to duplicate results.
I think we need some way to tell standby {{Dispatchers}} that a certain job has been successfully executed. One trivial solution would be to not clean up the {{RunningJobsRegistry}}, but then we would clutter ZooKeeper.
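To make the failure mode concrete, here is a minimal sketch of the decision a standby {{Dispatcher}} faces when it gains leadership. It is a simplified model, not Flink's actual code: {{StandbyDispatcherSketch}}, {{InMemoryRegistry}}, {{shouldExecuteOnLeadership}} and {{runToCompletion}} are hypothetical names, job IDs are plain strings, and a map stands in for ZooKeeper. It assumes that a missing registry entry is reported as {{PENDING}}, so a cleared entry looks the same as a job that never ran.

```java
import java.util.HashMap;
import java.util.Map;

/** Simplified, hypothetical model of the failover decision; not Flink's actual classes. */
class StandbyDispatcherSketch {

    /** Scheduling states a job can be recorded in. */
    enum JobSchedulingStatus { PENDING, RUNNING, DONE }

    /** In-memory stand-in for the ZooKeeper-backed RunningJobsRegistry. */
    static class InMemoryRegistry {
        private final Map<String, JobSchedulingStatus> entries = new HashMap<>();

        void setJobRunning(String jobId) { entries.put(jobId, JobSchedulingStatus.RUNNING); }

        void setJobFinished(String jobId) { entries.put(jobId, JobSchedulingStatus.DONE); }

        void clearJob(String jobId) { entries.remove(jobId); }

        /** Assumption: a missing entry is indistinguishable from a job that was never started. */
        JobSchedulingStatus getJobSchedulingStatus(String jobId) {
            return entries.getOrDefault(jobId, JobSchedulingStatus.PENDING);
        }

        /** Number of entries still stored, i.e. the "clutter" left behind. */
        int size() { return entries.size(); }
    }

    /** Decision of a new leader: execute the job unless the registry says it is DONE. */
    static boolean shouldExecuteOnLeadership(InMemoryRegistry registry, String jobId) {
        return registry.getJobSchedulingStatus(jobId) != JobSchedulingStatus.DONE;
    }

    /** The old leader runs the job to a globally terminal state, optionally clearing the entry. */
    static void runToCompletion(InMemoryRegistry registry, String jobId, boolean clearOnTerminal) {
        registry.setJobRunning(jobId);
        registry.setJobFinished(jobId);
        if (clearOnTerminal) {
            registry.clearJob(jobId);
        }
    }
}
```

With {{clearOnTerminal = true}} (the current behavior), {{shouldExecuteOnLeadership}} returns true for a job that has already finished, which is the duplicate execution described above. With {{clearOnTerminal = false}} (the trivial solution), it returns false, but the entry stays in the registry indefinitely.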
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)