[jira] [Created] (FLINK-21846) Rethink whether failure of ExecutionGraph creation should directly fail the job

Shang Yuanchun (Jira)
Till Rohrmann created FLINK-21846:
-------------------------------------

             Summary: Rethink whether failure of ExecutionGraph creation should directly fail the job
                 Key: FLINK-21846
                 URL: https://issues.apache.org/jira/browse/FLINK-21846
             Project: Flink
          Issue Type: Sub-task
          Components: Runtime / Coordination
    Affects Versions: 1.13.0
            Reporter: Till Rohrmann
             Fix For: 1.13.0


Currently, the {{AdaptiveScheduler}} fails the job execution if the {{ExecutionGraph}} creation fails. This can be problematic because the failure could result from a transient problem (e.g. the filesystem is temporarily unavailable). With a transient problem, a job rescaling could therefore lead to a job failure, which might be surprising for users. Instead, I would expect Flink to retry the {{ExecutionGraph}} creation.

One idea could be to ask the restart policy how to treat the failure, i.e. whether to retry the {{ExecutionGraph}} creation or not.

One thing to keep in mind, though, is that some failures might be permanent (e.g. a wrongly specified savepoint path). In such a case we would ideally fail immediately.
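
To make the idea more concrete, here is a minimal sketch of how such a decision could look. It is not taken from the Flink code base: {{RestartPolicy}}, {{Decision}}, {{PermanentFailureException}}, {{fixedAttempts}} and {{createWithRetry}} are hypothetical names invented for illustration, and the synchronous loop only stands in for what would presumably be an asynchronous retry with a delay in the scheduler.

```java
import java.io.IOException;
import java.util.concurrent.Callable;

public class GraphCreationRetrySketch {

    /** Hypothetical outcome of asking the restart policy about a failure. */
    enum Decision { RETRY, FAIL }

    /** Hypothetical stand-in for a restart policy; not an actual Flink interface. */
    interface RestartPolicy {
        Decision onCreationFailure(Throwable cause, int attempt);
    }

    /** Hypothetical marker for failures that a retry cannot fix, e.g. a wrong savepoint path. */
    static class PermanentFailureException extends Exception {
        PermanentFailureException(String message) {
            super(message);
        }
    }

    /** Policy that fails immediately on permanent failures and otherwise allows maxAttempts tries. */
    static RestartPolicy fixedAttempts(int maxAttempts) {
        return (cause, attempt) -> {
            if (cause instanceof PermanentFailureException) {
                return Decision.FAIL;
            }
            return attempt < maxAttempts ? Decision.RETRY : Decision.FAIL;
        };
    }

    /** Calls the factory until it succeeds or the policy decides to fail. */
    static <T> T createWithRetry(Callable<T> factory, RestartPolicy policy) throws Exception {
        int attempt = 0;
        while (true) {
            attempt++;
            try {
                return factory.call();
            } catch (Exception e) {
                if (policy.onCreationFailure(e, attempt) == Decision.FAIL) {
                    throw e; // permanent failure or attempts exhausted: fail the job
                }
                // Transient failure: fall through and try again.
            }
        }
    }

    public static void main(String[] args) throws Exception {
        int[] calls = {0};
        // Simulated creation that fails twice with a transient error, then succeeds.
        String graph = createWithRetry(() -> {
            if (++calls[0] < 3) {
                throw new IOException("filesystem currently not available");
            }
            return "graph";
        }, fixedAttempts(5));
        System.out.println(graph + " after " + calls[0] + " attempts");
    }
}
```

In this sketch the classification of a failure as permanent is left to the code that throws it. Whether that information is available at the point where the {{ExecutionGraph}} is created is an open question of this ticket.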



--
This message was sent by Atlassian Jira
(v8.3.4#803005)