[jira] [Created] (FLINK-14949) Task cancellation can be stuck against out-of-thread error

Shang Yuanchun (Jira)
Hwanju Kim created FLINK-14949:
----------------------------------

             Summary: Task cancellation can be stuck against out-of-thread error
                 Key: FLINK-14949
                 URL: https://issues.apache.org/jira/browse/FLINK-14949
             Project: Flink
          Issue Type: Bug
          Components: Runtime / Coordination
    Affects Versions: 1.8.2
            Reporter: Hwanju Kim


Task cancellation ([_cancelOrFailAndCancelInvokable_|https://github.com/apache/flink/blob/master/flink-runtime/src/main/java/org/apache/flink/runtime/taskmanager/Task.java#L991]) relies on three separate threads: _TaskCanceler_, _TaskInterrupter_, and _TaskCancelerWatchdog_. TaskCanceler performs the cancellation itself, TaskInterrupter periodically interrupts a task that does not react, and TaskCancelerWatchdog kills the JVM if cancellation has not finished within a timeout (3 minutes by default). Together they ensure that cancellation either completes or is aborted, so that the task reaches a terminal state in finite time (FLINK-4715).
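To make the division of labor concrete, here is a minimal sketch of the interrupter and watchdog roles, written as a toy model rather than Flink's actual classes. The class name, method names, and the short millisecond intervals are all assumptions chosen for illustration. The calling thread stands in for TaskCanceler, and "escalation" is recorded in a flag instead of killing the JVM.

```java
import java.util.concurrent.atomic.AtomicBoolean;

/** Toy model of the interrupter and watchdog roles; not Flink's actual classes. */
public class CancellationThreadsSketch {

    /**
     * Interrupts {@code executer} every {@code intervalMs} and waits at most
     * {@code timeoutMs} for it to exit. Returns true if the deadline passed with
     * the thread still alive (the point where a real watchdog would kill the JVM).
     */
    public static boolean cancelAndWatch(Thread executer, long intervalMs, long timeoutMs) {
        AtomicBoolean escalated = new AtomicBoolean(false);

        // Interrupter role: keep interrupting while the task thread is alive.
        Thread interrupter = new Thread(() -> {
            try {
                while (executer.isAlive()) {
                    executer.interrupt();
                    executer.join(intervalMs);
                }
            } catch (InterruptedException e) {
                // Asked to stop; nothing else to do.
            }
        }, "toy-interrupter");

        // Watchdog role: wait for the deadline, then escalate if the task is still alive.
        Thread watchdog = new Thread(() -> {
            try {
                executer.join(timeoutMs);
                if (executer.isAlive()) {
                    escalated.set(true);
                }
            } catch (InterruptedException e) {
                // Asked to stop; nothing else to do.
            }
        }, "toy-watchdog");

        interrupter.setDaemon(true);
        watchdog.setDaemon(true);
        interrupter.start();
        watchdog.start();
        try {
            watchdog.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        interrupter.interrupt();
        return escalated.get();
    }

    /** Runs a task that either honors interrupts (cooperative) or swallows them. */
    public static boolean runScenario(boolean cooperative) {
        AtomicBoolean stop = new AtomicBoolean(false);
        Thread executer = new Thread(() -> {
            while (!stop.get()) {
                try {
                    Thread.sleep(50);
                } catch (InterruptedException e) {
                    if (cooperative) {
                        return;
                    }
                    // Non-reacting task: swallow the interrupt and keep going.
                }
            }
        }, "toy-task");
        executer.setDaemon(true);
        executer.start();
        boolean escalated = cancelAndWatch(executer, 20, 300);
        stop.set(true); // let the stubborn toy task end after the experiment
        return escalated;
    }

    public static void main(String[] args) {
        System.out.println("cooperative task escalated: " + runScenario(true));
        System.out.println("stubborn task escalated: " + runScenario(false));
    }
}
```

Note that both helper roles need their own thread, which is exactly the resource that is unavailable in the failure described in this issue.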

However, if creating any of these asynchronous threads fails, for example because the process has run out of threads (_java.lang.OutOfMemoryError: unable to create new native thread_), the task has already transitioned to CANCELING, but nothing actually performs the cancellation and no watchdog monitors it. Currently, the JobManager does [retry cancellation|https://github.com/apache/flink/blob/master/flink-runtime/src/main/java/org/apache/flink/runtime/executiongraph/Execution.java#L1121] on any returned error, but the next retry [returns success once it sees CANCELING|https://github.com/apache/flink/blob/master/flink-runtime/src/main/java/org/apache/flink/runtime/taskmanager/Task.java#L997], assuming that cancellation is in progress. The task therefore remains in CANCELING, which is a non-terminal state, and the state machine makes no further progress.
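The failure sequence can be illustrated with a small toy state machine. This is a sketch under stated assumptions, not Flink's Task code: _ToyTask_, _ThreadStarter_, and the three-value _State_ enum are hypothetical, and the thread-creation failure is simulated by throwing the error directly instead of exhausting native threads.

```java
import java.util.concurrent.atomic.AtomicReference;

/** Toy model of the stuck-cancellation sequence; not Flink's actual Task class. */
public class StuckCancellationSketch {

    public enum State { RUNNING, CANCELING, CANCELED }

    /** Hypothetical stand-in for starting a helper thread such as the canceler. */
    public interface ThreadStarter {
        void start(Runnable work);
    }

    public static final class ToyTask {
        final AtomicReference<State> state = new AtomicReference<>(State.RUNNING);
        private final ThreadStarter starter;

        ToyTask(ThreadStarter starter) {
            this.starter = starter;
        }

        /** Returns normally if cancellation was started or is assumed to be in progress. */
        public void cancel() {
            if (state.get() != State.RUNNING) {
                return; // CANCELING or CANCELED: treated as "already handled"
            }
            if (state.compareAndSet(State.RUNNING, State.CANCELING)) {
                // The state has changed before the helper thread exists.
                starter.start(() -> state.set(State.CANCELED));
            }
        }
    }

    /** First attempt fails while starting the thread; the retry then reports success. */
    public static State runScenario() {
        ThreadStarter failing = work -> {
            throw new OutOfMemoryError("unable to create new native thread");
        };
        ToyTask task = new ToyTask(failing);
        try {
            task.cancel();
        } catch (OutOfMemoryError e) {
            // The caller sees an error and decides to retry.
        }
        task.cancel(); // the retry sees CANCELING and returns without doing anything
        return task.state.get();
    }

    public static void main(String[] args) {
        System.out.println(runScenario()); // prints "CANCELING"
    }
}
```

The retry is harmless on its own; the problem is that the state was advanced before the work that justifies it was guaranteed to be scheduled.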

One solution: if a task has transitioned to CANCELING but then hits a fatal error or OOM (i.e., _isJvmFatalOrOutOfMemoryError_ is true), which indicates that it may not have managed to spawn TaskCancelerWatchdog, it could immediately treat this as a fatal error (not safely cancellable) and call _notifyFatalError_, just as TaskCancelerWatchdog does, but eagerly and synchronously. That way, the task at least leaves the non-terminal state, and restarting the JVM also clears any leaked threads or memory. The same method is also invoked by _failExternally_, but the transition to FAILED seems less critical because FAILED is already a terminal state.
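A rough sketch of this proposal, again as a toy model and not a patch against Flink: _ToyTask_, _ThreadStarter_, and _FatalErrorHandler_ are hypothetical names, the local _isJvmFatalOrOutOfMemoryError_ is a simplified stand-in for the real utility, and the handler only records the error where the real callback would be expected to terminate the JVM.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.atomic.AtomicReference;

/** Toy model of the proposed eager escalation; not Flink's actual Task class. */
public class EagerFatalErrorSketch {

    public enum State { RUNNING, CANCELING, CANCELED }

    /** Hypothetical stand-in for starting the cancellation helper threads. */
    public interface ThreadStarter {
        void start(Runnable work);
    }

    /** Hypothetical stand-in for the fatal-error callback. */
    public interface FatalErrorHandler {
        void notifyFatalError(String message, Throwable cause);
    }

    /** Simplified local check (an assumption, not the real utility method). */
    static boolean isJvmFatalOrOutOfMemoryError(Throwable t) {
        return t instanceof OutOfMemoryError
                || t instanceof InternalError
                || t instanceof UnknownError;
    }

    public static final class ToyTask {
        final AtomicReference<State> state = new AtomicReference<>(State.RUNNING);
        private final ThreadStarter starter;
        private final FatalErrorHandler handler;

        ToyTask(ThreadStarter starter, FatalErrorHandler handler) {
            this.starter = starter;
            this.handler = handler;
        }

        public void cancel() {
            if (!state.compareAndSet(State.RUNNING, State.CANCELING)) {
                return; // already canceling or canceled
            }
            try {
                starter.start(() -> state.set(State.CANCELED));
            } catch (Throwable t) {
                if (isJvmFatalOrOutOfMemoryError(t)) {
                    // No watchdog may exist at this point, so escalate synchronously.
                    handler.notifyFatalError("could not start cancellation threads", t);
                    return;
                }
                throw t; // other unchecked errors propagate to the caller as before
            }
        }
    }

    /** Simulates a failed thread creation and returns the recorded fatal errors. */
    public static List<String> runScenario() {
        List<String> fatalErrors = new ArrayList<>();
        ThreadStarter failing = work -> {
            throw new OutOfMemoryError("unable to create new native thread");
        };
        ToyTask task = new ToyTask(
                failing,
                (message, cause) -> fatalErrors.add(message + ": " + cause.getMessage()));
        task.cancel();
        return fatalErrors;
    }

    public static void main(String[] args) {
        System.out.println(runScenario());
    }
}
```

The design point is that the escalation happens on the calling thread, so it does not depend on being able to create any further thread.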

The problem is straightforward to reproduce: run an application that keeps creating threads in a loop, none of which ever finishes, and that has multiple tasks, so that one task triggers a failure and the others are then cancelled by the full failover. In the web UI dashboard, some tasks on a task manager where any of the cancellation-related threads failed to spawn remain stuck in CANCELING indefinitely.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)