[jira] [Created] (FLINK-12183) Job Cluster doesn't release resources after cancel a running job in per-job Yarn mode



Shang Yuanchun (Jira)
Yumeng Zhang created FLINK-12183:
------------------------------------

             Summary: Job Cluster doesn't release resources after cancel a running job in per-job Yarn mode
                 Key: FLINK-12183
                 URL: https://issues.apache.org/jira/browse/FLINK-12183
             Project: Flink
          Issue Type: Bug
          Components: Runtime / REST
    Affects Versions: 1.8.0, 1.7.2, 1.6.4
            Reporter: Yumeng Zhang


The per-job Yarn cluster does not release its resources after a running job is cancelled if the job has restarted many times (for example, 1000 times) in a short time.

The bug is in the archiveExecutionGraph() phase, which runs before removeJobAndRegisterTerminationFuture(). The CompletableFuture thread exits unexpectedly with a NullPointerException during archiveExecutionGraph(), so the later step is apparently never reached. The failure is hard to spot because the code there only catches IOException. In SubtaskExecutionAttemptDetailsHandler and SubtaskExecutionAttemptAccumulatorsHandler, the archiveJsonWithPath() method builds JSON for the prior execution attempts, but its for loop starts at index 0, which may refer to an attempt that has already been dropped. By default, getting such a prior execution attempt returns null (AccessExecution attempt = subtask.getPriorExecutionAttempt(x)).
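To illustrate the failure mode, here is a minimal sketch of a bounded attempt history and a loop that guards against dropped entries. It is not Flink code: BoundedHistory, archive() and the capacity of 2 are hypothetical stand-ins for the prior-attempts history and for archiveJsonWithPath(), and the null check is only one possible fix.

```java
import java.util.ArrayList;
import java.util.List;

public class PriorAttemptsSketch {

    /** Hypothetical stand-in for a bounded history of prior execution attempts. */
    static final class BoundedHistory {
        private final String[] slots;
        private int total; // number of attempts ever recorded, including dropped ones

        BoundedHistory(int capacity) {
            slots = new String[capacity];
        }

        void add(String attempt) {
            // Overwrites the oldest retained entry once the buffer is full.
            slots[total % slots.length] = attempt;
            total++;
        }

        int size() {
            return total;
        }

        /** Returns null when the attempt at this index has already been dropped. */
        String get(int index) {
            if (index < 0 || index >= total) {
                throw new IndexOutOfBoundsException("index " + index);
            }
            if (index < total - slots.length) {
                return null; // dropped from the bounded history
            }
            return slots[index % slots.length];
        }
    }

    /** Hypothetical stand-in for the archiving loop that starts at index 0. */
    static List<String> archive(BoundedHistory history) {
        List<String> archived = new ArrayList<>();
        for (int x = 0; x < history.size(); x++) {
            String attempt = history.get(x);
            if (attempt == null) {
                // Without this guard, dereferencing the dropped attempt
                // would throw a NullPointerException.
                continue;
            }
            archived.add("attempt=" + attempt);
        }
        return archived;
    }

    public static void main(String[] args) {
        BoundedHistory history = new BoundedHistory(2);
        for (int i = 0; i < 5; i++) {
            history.add("a" + i);
        }
        // Indices 0..2 were dropped; only the last two attempts are archived.
        System.out.println(archive(history)); // prints [attempt=a3, attempt=a4]
    }
}
```

With many restarts in a short time, most indices in such a loop fall into the dropped range, which would explain why the problem only appears after a large number of restarts.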


