[jira] [Created] (FLINK-13962) Execution#taskRestore leaks if task fails before deploying

Shang Yuanchun (Jira)
Zhu Zhu created FLINK-13962:
-------------------------------

             Summary: Execution#taskRestore leaks if task fails before deploying
                 Key: FLINK-13962
                 URL: https://issues.apache.org/jira/browse/FLINK-13962
             Project: Flink
          Issue Type: Bug
          Components: Runtime / Coordination
    Affects Versions: 1.9.0, 1.10.0
            Reporter: Zhu Zhu


Currently, Execution#taskRestore is reset to null in the task deployment stage.

The purpose, as described in FLINK-9693, is that this "allows the JobManagerTaskRestore instance to be garbage collected. Furthermore, it won't be archived along with the Execution in the ExecutionVertex in case of a restart. This is especially important when setting state.backend.fs.memory-threshold to larger values because every state below this threshold will be stored in the meta state files and, thus, also the JobManagerTaskRestore instances."

However, if a task fails before it reaches the deployment stage, Execution#taskRestore remains non-null and is archived along with the execution in the prior executions.

This can result in high JM heap usage in certain cases, because each archived execution keeps its JobManagerTaskRestore instance reachable.

I think we should check Execution#taskRestore and make sure it is null when moving an execution to the prior executions.
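
A rough sketch of the proposed check, to make the idea concrete. It uses simplified stand-in classes, not Flink's actual Execution and ExecutionVertex; every class and method name below (TaskRestoreSketch, ExecutionSketch, ExecutionVertexSketch, failBeforeDeploy, clearTaskRestore, holdsTaskRestore) is hypothetical and only illustrates where the reference could be cleared:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for JobManagerTaskRestore: may carry large inlined state.
final class TaskRestoreSketch {
    final byte[] inlinedState;

    TaskRestoreSketch(int sizeInBytes) {
        this.inlinedState = new byte[sizeInBytes];
    }
}

// Hypothetical stand-in for Execution; only the field relevant here is modeled.
final class ExecutionSketch {
    private TaskRestoreSketch taskRestore;

    void setInitialState(TaskRestoreSketch restore) {
        this.taskRestore = restore;
    }

    // Deployment consumes the restore information and then drops the reference.
    void deploy() {
        this.taskRestore = null;
    }

    // A failure before deploy() never touches taskRestore, so it stays non-null.
    void failBeforeDeploy() {
        // intentionally empty: models the leak described above
    }

    void clearTaskRestore() {
        this.taskRestore = null;
    }

    boolean holdsTaskRestore() {
        return taskRestore != null;
    }
}

// Hypothetical stand-in for ExecutionVertex and its list of prior executions.
final class ExecutionVertexSketch {
    private final List<ExecutionSketch> priorExecutions = new ArrayList<>();
    private ExecutionSketch current = new ExecutionSketch();

    ExecutionSketch getCurrent() {
        return current;
    }

    List<ExecutionSketch> getPriorExecutions() {
        return priorExecutions;
    }

    void resetForNewExecution() {
        // Proposed guard: make sure the restore data is released before archiving,
        // regardless of whether deploy() was ever reached.
        current.clearTaskRestore();
        priorExecutions.add(current);
        current = new ExecutionSketch();
    }
}
```

With a guard like this, an execution that failed before deployment would no longer keep its restore data alive once it is archived. Whether the reference is better cleared at archiving time or directly on the failure path is left open here.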



--
This message was sent by Atlassian Jira
(v8.3.2#803003)