[jira] [Created] (FLINK-21376) Failed state might not provide failureCause


Shang Yuanchun (Jira)
Matthias created FLINK-21376:
--------------------------------

             Summary: Failed state might not provide failureCause
                 Key: FLINK-21376
                 URL: https://issues.apache.org/jira/browse/FLINK-21376
             Project: Flink
          Issue Type: Bug
          Components: Runtime / Task
    Affects Versions: 1.12.1
            Reporter: Matthias
             Fix For: 1.13.0


{{Task.executionState}} and {{Task.failureCause}} are not set atomically. This became an issue when implementing the exception history (FLINK-21187), where we relied on the invariant that a {{failureCause}} is present whenever the {{Task}} has failed.

Adding a check for this invariant to [Task.notifyFinalState()|https://github.com/apache/flink/blob/9b6f076a66970d3d3ef710f8d5ee66d75d87eba5/flink-runtime/src/main/java/org/apache/flink/runtime/taskmanager/Task.java#L1001] reveals the race condition.
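For illustration, the invariant could be expressed roughly as follows. This is a minimal standalone sketch, not the actual check from the exception history work: the class {{FailureInvariant}}, the method {{checkFailureCause}} and the reduced {{ExecutionState}} enum are hypothetical names chosen for this example.

```java
public class FailureInvariant {

    /** Reduced stand-in for Flink's ExecutionState (assumption: only the values needed here). */
    public enum ExecutionState { RUNNING, FINISHED, CANCELED, FAILED }

    /**
     * Throws if a FAILED state is reported without a failure cause.
     * Hypothetical helper; the real check would live inside Task.
     */
    public static void checkFailureCause(ExecutionState state, Throwable failureCause) {
        if (state == ExecutionState.FAILED && failureCause == null) {
            throw new IllegalStateException(
                    "Task is in state FAILED but no failureCause was set.");
        }
    }
}
```

Calling such a check right before the final state is forwarded would turn the silent {{null}} into a test failure.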

{{TaskExecutorSlotLifetimeTest}} becomes unstable when this invariant is added. The reason is that the test starts a task but does not wait for it to finish. The [task finalization|https://github.com/apache/flink/blob/9b6f076a66970d3d3ef710f8d5ee66d75d87eba5/flink-runtime/src/main/java/org/apache/flink/runtime/taskmanager/Task.java#L895] and [the cancellation of the task|https://github.com/apache/flink/blob/9b6f076a66970d3d3ef710f8d5ee66d75d87eba5/flink-runtime/src/main/java/org/apache/flink/runtime/taskmanager/Task.java#L1105], which is triggered by the {{TaskManager}} shutdown, compete with each other and can cause the {{executionState}} to be set to {{FAILED}} while the {{failureCause}} is still {{null}}. This state is then forwarded to {{Execution}} through [Task.notifyFinalState|https://github.com/apache/flink/blob/9b6f076a66970d3d3ef710f8d5ee66d75d87eba5/flink-runtime/src/main/java/org/apache/flink/runtime/taskmanager/Task.java#L895].
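The window can be made visible with a simplified model. The sketch below is not Flink code: {{NonAtomicTaskState}} and its methods are hypothetical, and the order of the two writes is an assumption made for illustration. A callback stands in for the competing thread so that the interleaving is deterministic.

```java
public class NonAtomicTaskState {

    /** Reduced stand-in for Flink's ExecutionState (assumption). */
    public enum ExecutionState { RUNNING, FAILED }

    private volatile ExecutionState executionState = ExecutionState.RUNNING;
    private volatile Throwable failureCause;

    /**
     * Sets state and cause in two separate writes. The callback runs in the
     * gap between them and plays the role of a concurrently running thread.
     */
    public void failNonAtomically(Throwable cause, Runnable betweenWrites) {
        executionState = ExecutionState.FAILED;
        betweenWrites.run();
        failureCause = cause;
    }

    /** True if an observer would currently see FAILED without a cause. */
    public boolean violatesInvariant() {
        return executionState == ExecutionState.FAILED && failureCause == null;
    }
}
```

An observer that reads both fields inside the gap sees {{FAILED}} together with a {{null}} cause, although both fields are individually {{volatile}}.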

We should set {{failureCause}} atomically with the transition of {{executionState}} to {{FAILED}} so that no caught error is missed.
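One possible way to achieve this is to keep state and cause in a single immutable object that is swapped with a compare-and-set. The following is a sketch under that assumption, not the fix that was eventually applied to {{Task}}; {{AtomicTaskState}}, {{Snapshot}} and {{transition}} are hypothetical names.

```java
import java.util.concurrent.atomic.AtomicReference;

public class AtomicTaskState {

    /** Reduced stand-in for Flink's ExecutionState (assumption). */
    public enum ExecutionState { RUNNING, CANCELING, CANCELED, FAILED }

    /** Immutable pair of state and cause; FAILED without a cause cannot be constructed. */
    public static final class Snapshot {
        public final ExecutionState state;
        public final Throwable failureCause;

        Snapshot(ExecutionState state, Throwable failureCause) {
            if (state == ExecutionState.FAILED && failureCause == null) {
                throw new IllegalArgumentException("FAILED requires a failureCause");
            }
            this.state = state;
            this.failureCause = failureCause;
        }
    }

    private final AtomicReference<Snapshot> current =
            new AtomicReference<>(new Snapshot(ExecutionState.RUNNING, null));

    /**
     * Moves from the expected state to the next one and publishes the cause
     * in the same atomic step. Returns false if the state was not as expected
     * or another thread won the race.
     */
    public boolean transition(ExecutionState expected, ExecutionState next, Throwable cause) {
        Snapshot old = current.get();
        if (old.state != expected) {
            return false;
        }
        return current.compareAndSet(old, new Snapshot(next, cause));
    }

    /** Returns a consistent view of state and cause. */
    public Snapshot snapshot() {
        return current.get();
    }
}
```

With this layout a reader obtains state and cause from one reference read, so it can never observe {{FAILED}} paired with a {{null}} cause. Guarding both fields with a common lock would be an alternative with the same effect.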



--
This message was sent by Atlassian Jira
(v8.3.4#803005)