(DEPRECATED) Apache Flink Mailing List archive.

[jira] [Created] (FLINK-16299) Release containers recovered from previous attempt in which TaskExecutor is not started.

Classic

List

Threaded

1 message

Shang Yuanchun (Jira)

[jira] [Created] (FLINK-16299) Release containers recovered from previous attempt in which TaskExecutor is not started.

Xintong Song created FLINK-16299:
------------------------------------

Summary: Release containers recovered from previous attempt in which TaskExecutor is not started.
Key: FLINK-16299
URL: https://issues.apache.org/jira/browse/FLINK-16299
Project: Flink
Issue Type: Improvement
Components: Deployment / YARN
Reporter: Xintong Song

As discussed in FLINK-16215, on Yarn deployment, {{YarnResourceManager}} starts a new {{TaskExecutor}} in two steps:
# Request a new container from Yarn
# Starts a {{TaskExecutor}} process in the allocated container

If JM failover happens between the two steps, in the new attempt {{YarnResourceManager}} will not start {{TaskExecutor}} processes in recovered containers. That means such containers are neither used nor released.

A potential fix to this problem, is to query form the container status by calling {{NMClientAsync#getContainerStatusAsync}}, and release the containers whose state is {{NEW}}, keeps only those whose state is {{RUNNING}} and waiting for them to register.

--
This message was sent by Atlassian Jira
(v8.3.4#803005)