Containers are not released after job failed

刘建刚
      I run Flink 1.6.2 on YARN. At some point the job failed because of: org.apache.flink.util.FlinkException: The assigned slot container_e708_1555051789618_2644286_01_000061_0 was removed

      The job then restarted. Some time later, the container container_e708_1555051789618_2644286_01_000061 had still not been released.
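
      For anyone trying to locate the same container, here is a minimal sketch of how the ID in the exception maps to the YARN container and application IDs. It assumes the usual YARN container ID layout (container_e<epoch>_<clusterTimestamp>_<appSequence>_<attempt>_<containerSequence>) and assumes that the trailing "_0" in the exception is the slot index that Flink appends. The function name parse_container_id is made up for illustration.

```python
import re

# Assumed YARN container ID layout:
#   container_e<epoch>_<clusterTimestamp>_<appSequence>_<attempt>_<containerSequence>
# The "e<epoch>_" part is optional in YARN, so the pattern treats it as optional.
CONTAINER_RE = re.compile(
    r"container_(?:e(?P<epoch>\d+)_)?"
    r"(?P<cluster_ts>\d+)_(?P<app_seq>\d+)_(?P<attempt>\d+)_(?P<container_seq>\d+)"
)


def parse_container_id(text):
    """Hypothetical helper: split a container ID (or a slot ID that starts
    with one) into its parts and derive the YARN application ID."""
    m = CONTAINER_RE.match(text)
    if m is None:
        raise ValueError("not a YARN container ID: %r" % text)
    return {
        # The matched prefix; anything after it (e.g. "_0") is ignored.
        "container_id": m.group(0),
        "application_id": "application_%s_%s" % (m["cluster_ts"], m["app_seq"]),
        "attempt": int(m["attempt"]),
        "container_seq": int(m["container_seq"]),
    }


if __name__ == "__main__":
    slot_id = "container_e708_1555051789618_2644286_01_000061_0"
    info = parse_container_id(slot_id)
    print(info["container_id"])
    print(info["application_id"])
```

      The derived application ID can then be passed to `yarn logs -applicationId <id>`, and the container ID to `yarn container -status <id>`, to check what YARN itself reports for the container.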

      The log of container_e708_1555051789618_2644286_01_000061 is as follows:


      The log shows that two tasks were canceled before the container successfully registered at the resource manager, and one was canceled after registration. Five minutes later, the container registered again. In the end, the container is alive but not used.
      Does anyone have any idea about this problem? Thank you.