[jira] [Created] (FLINK-17176) Slow down Pod recreation in KubernetesResourceManager#PodCallbackHandler

Shang Yuanchun (Jira)
Canbin Zheng created FLINK-17176:
------------------------------------

             Summary: Slow down Pod recreation in KubernetesResourceManager#PodCallbackHandler
                 Key: FLINK-17176
                 URL: https://issues.apache.org/jira/browse/FLINK-17176
             Project: Flink
          Issue Type: Improvement
          Components: Deployment / Kubernetes
    Affects Versions: 1.10.0
            Reporter: Canbin Zheng
             Fix For: 1.11.0


In native K8s setups, there are cases in which the rate of Pod re-creation is not controlled, so the {{PodCallbackHandler}} implementation of {{KubernetesResourceManager}} risks flooding the K8s API Server with creation requests.

Here are the steps to reproduce this kind of problem:
 # Mount {{/opt/flink/log}} in the TaskManager Container to a path on the K8s nodes via HostPath. Make sure that the path exists but that the TaskManager process has no write permission on it. We can achieve this via the user-specified pod template support, or just hardcode it for testing only.
 # Launch a session cluster.
 # Submit a new job to the session cluster. As expected, the Pod repeatedly fails shortly after starting to launch the main Container; {{KubernetesResourceManager#onModified}} is then invoked and re-creates a new Pod immediately, without any rate control.

To sum up, once the {{KubernetesResourceManager}} receives a Pod *ADD* event and that Pod terminates before it successfully registers with the {{KubernetesResourceManager}}, the {{KubernetesResourceManager}} immediately sends another creation request to the K8s API Server.
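
One possible way to slow down the re-creation described above is a capped exponential backoff between creation requests. The following is only a minimal sketch under that assumption, not the actual fix and not existing Flink code: the class {{PodRecreationBackoff}} and its methods are hypothetical names introduced for illustration, and the delay values are arbitrary.

```java
import java.time.Duration;

/**
 * Hypothetical helper (not part of Flink): computes how long to wait before
 * requesting a replacement Pod, doubling the delay after each consecutive
 * failure and capping it at a configured maximum.
 */
final class PodRecreationBackoff {

    private final long initialDelayMs;
    private final long maxDelayMs;

    // Number of Pods that terminated before registering, since the last success.
    private int consecutiveFailures = 0;

    PodRecreationBackoff(Duration initialDelay, Duration maxDelay) {
        this.initialDelayMs = initialDelay.toMillis();
        this.maxDelayMs = maxDelay.toMillis();
    }

    /**
     * To be called when a Pod terminates before registering.
     * Returns the delay to wait before sending the next creation request.
     */
    synchronized Duration nextDelay() {
        long delay = Math.min(initialDelayMs, maxDelayMs);
        // Double once per previous consecutive failure, stopping at the cap.
        for (int i = 0; i < consecutiveFailures && delay < maxDelayMs; i++) {
            delay = Math.min(maxDelayMs, delay * 2);
        }
        consecutiveFailures++;
        return Duration.ofMillis(delay);
    }

    /** To be called when a TaskManager Pod registers successfully; resets the backoff. */
    synchronized void onPodRegistered() {
        consecutiveFailures = 0;
    }
}
```

With such a helper, the handler for a terminated Pod would schedule the creation request after {{nextDelay()}} (for example on a scheduled executor) instead of issuing it immediately, and a successful registration would call {{onPodRegistered()}} so that healthy clusters are not slowed down.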



--
This message was sent by Atlassian Jira
(v8.3.4#803005)