(DEPRECATED) Apache Flink Mailing List archive.

[jira] [Created] (FLINK-4911) Non-disruptive JobManager Failures


Shang Yuanchun (Jira)

Stephan Ewen created FLINK-4911:
-----------------------------------

Summary: Non-disruptive JobManager Failures
Key: FLINK-4911
URL: https://issues.apache.org/jira/browse/FLINK-4911
Project: Flink
Issue Type: New Feature
Components: JobManager
Reporter: Stephan Ewen

JobManager failures can be handled in a non-disruptive way.

The basic approach is the following:

- When a JobManager fails, TaskManagers do not cancel tasks, but attempt to reconnect to the JobManager
- On connect, the TaskManager tells the JobManager about its currently running tasks
- A new JobManager waits for TaskManagers to connect and report a task status. It re-constructs the ExecutionGraph state from these reports
- Tasks whose status has not been reconstructed within a certain time are assumed to have failed and trigger regular task recovery.
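
The steps above can be illustrated with a small, language-neutral sketch in Python. This is not Flink code: the class name RecoveringJobManager, its fields, the task ids, and the idea that the new JobManager already knows the set of expected tasks (for example from a persisted job description) are all assumptions made for illustration. Time is passed in explicitly so the timeout logic is easy to follow.

```python
from dataclasses import dataclass, field
from typing import Dict, Set


@dataclass
class RecoveringJobManager:
    """Hypothetical model of a new JobManager rebuilding task state from reports."""

    expected_tasks: Set[str]   # assumed to be known to the new JobManager
    timeout_s: float           # the "certain time" to wait for reports
    started_at: float          # when this JobManager took over
    status: Dict[str, str] = field(default_factory=dict)

    def on_task_manager_connect(self, running_tasks: Dict[str, str]) -> None:
        # A reconnecting TaskManager reports the tasks it is still running.
        for task_id, state in running_tasks.items():
            if task_id in self.expected_tasks:
                self.status[task_id] = state

    def tasks_to_recover(self, now: float) -> Set[str]:
        # Before the deadline, no task is declared failed.
        if now - self.started_at < self.timeout_s:
            return set()
        # After the deadline, every unreported task goes to regular recovery.
        return self.expected_tasks - set(self.status)


jm = RecoveringJobManager(
    expected_tasks={"map-0", "map-1", "sink-0"},
    timeout_s=30.0,
    started_at=0.0,
)
# One TaskManager reconnects; the one that hosted "map-1" never does.
jm.on_task_manager_connect({"map-0": "RUNNING", "sink-0": "RUNNING"})
recover_early = jm.tasks_to_recover(now=10.0)
recover_late = jm.tasks_to_recover(now=30.0)
```

In this sketch, tasks that were reported keep running untouched, and only the tasks that nobody reported before the deadline are handed to recovery.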

To avoid having to re-implement this for the new JobManager / TaskManager approach in flip-6, I suggest implementing this directly in the {{flip-6}} feature branch.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)