UpdateTaskExecutionState during JobManager failover

classic Classic list List threaded Threaded
3 messages Options
Reply | Threaded
Open this post in threaded view
|

UpdateTaskExecutionState during JobManager failover

wangzhijiang
Hi,     As i know, when TaskManager send UpdateTaskExecutionState to JobManager, if the JobManager failover and the future response is fail, the task will be failed. Is it feasible to retry send UpdateTaskExecutionState again when future response fail until success. In JobManager HA mode, the UpdateTaskExecutionState should be success when the leader JobManager active. Or are there any suggestions for sending messages during JobManager failover instead of fail task.    Thanks for any help in advance!Zhijiang Wang
Reply | Threaded
Open this post in threaded view
|

Re: UpdateTaskExecutionState during JobManager failover

Stephan Ewen
Hi!

That is a super interesting idea. If I understand you correctly, you are
suggesting to try and reconcile the TaskManagers and the JobManager before
restarting the job. That would mean that in case of a master failure, the
jobs may simply continue to run. That would be a nice enhancements, but I
think it is slightly more complicated, for two reasons:

1)

An assumption that we currently make is that the JobManager view and
TaskManager view of the task status get out of sync (a JobManager failure
is a reason that can happen), then the task is restarted. That is a quite
robust solution, but may lead to restarts in cases that may be recovered
otherwise as well. For example only certain state transitions are valid
(like RUNNING to FINISHED). If the JobManager gets an update that the state
is FINISHED when the JobManager thought it was CANCELED or CREATED, it will
reject this under the assumption that something went wrong in the
distributed coordination.

If we want to keep this safety checks, it would probably need something
like the JobManager asking TaskManagers that connect for their current
status and resetting the Job status to that.

2)

To make sure that JobManagers and TaskManagers do not confuse messages from
different sessions (a session being a JobManager having leader role), we
filter the critical messages by a "leaderSessionId", which is again very
robust. This would also need a careful change.


Hope that helps in understanding the current rational. If you want to work
on improving this, it would be great. We should probably talk more about
the detailed changes needed.


Greetings,
Stephan





On Thu, Jan 14, 2016 at 3:49 AM, wangzhijiang999 <[hidden email]
> wrote:

> Hi,
>
>     As i know, when TaskManager send UpdateTaskExecutionState to
> JobManager, if the JobManager failover and the future response is fail, the
> task will be failed. Is it feasible to retry send UpdateTaskExecutionState
> again when future response fail until success. In JobManager HA mode,
> the UpdateTaskExecutionState should be success when the leader JobManager
> active. Or are there any suggestions for sending messages during JobManager
> failover instead of fail task.
>
>     Thanks for any help in advance!
>
>
> Zhijiang Wang
>
Reply | Threaded
Open this post in threaded view
|

答复:UpdateTaskExecutionState during JobManager failover

wangzhijiang
In reply to this post by wangzhijiang
Hi Stephan,
 Thank you for detail explaination.  As you said, my opition is to keep task still running druing jobmanager failover, even though sending update status failed.
For the first reason you mentioned, if i understand correctly, the key issue is status out of sync between taskmanager and jobmanager. For example, when the jobmanager failover, the task is at CREATED status . When the task status transition to RUNNING, the updateStatus message can not be received because of jobmanager failover, then the taskmanager will retry sending the message to jobmanager until success. When the jobmanager recovers, the previous status of task is still CREATED in jobmanager view, and the task status maybe actually transition to FINISHED in taskmanager view. The key problem is that when the jobmanager received the FINISHED earlier than the RUNNING message, it will reject the FINISHED message.  If the task maintain a queue for sending message during jobmanager failover in order to confirm that the messages will be received in sequence at jobmanager when recover, that means the RUNNING status message must be arrived before FINISHED status message, are there any problems?
For the second reason you mentioned,  i am not very clear of the machenism of filtering the critical message by leaderSessionID, would you extend it in detail? 
I am trying to improve process of jobmanager and taskmanager failover, thank you for your help!
Zhijiang Wang