[jira] [Created] (FLINK-17470) Flink task executor process permanently hangs on `flink-daemon.sh stop`, deletes PID file

Shang Yuanchun (Jira)
Hunter Herman created FLINK-17470:
-------------------------------------

             Summary: Flink task executor process permanently hangs on `flink-daemon.sh stop`, deletes PID file
                 Key: FLINK-17470
                 URL: https://issues.apache.org/jira/browse/FLINK-17470
             Project: Flink
          Issue Type: Bug
          Components: Runtime / Task
    Affects Versions: 1.10.0
         Environment:  
{code:java}
$ uname -a
Linux hostname.local 3.10.0-1062.9.1.el7.x86_64 #1 SMP Fri Dec 6 15:49:49 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux
$ lsb_release -a
LSB Version: :core-4.1-amd64:core-4.1-noarch
Distributor ID: CentOS
Description: CentOS Linux release 7.7.1908 (Core)
Release: 7.7.1908
Codename: Core
{code}

Flink version 1.10
 
            Reporter: Hunter Herman
         Attachments: flink_jstack.log, flink_mixed_jstack.log

Hi Flink team!

We've attempted to upgrade our Flink 1.9 cluster to 1.10, but are experiencing reproducible instability on shutdown. It appears that the `kill` issued in the `stop` case of flink-daemon.sh causes the task executor process to hang permanently. Specifically, the process seems to hang in `org.apache.flink.runtime.util.JvmShutdownSafeguard$DelayedTerminator.run`, inside a `Thread.sleep()` call, which seems bizarre to me. Also note that every thread in the process is BLOCKED on a `pthread_cond_wait` call. Is this an OS-level issue? I'm banging my head against a wall here. See the attached stack traces for details.
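
For readers unfamiliar with the pattern behind the class name in the stack trace, here is a minimal sketch of a delayed-terminator shutdown safeguard: a shutdown hook starts a daemon thread that sleeps for a grace period and then forces the JVM to exit. This is an illustration only, not Flink's actual `JvmShutdownSafeguard` code; the class name `ShutdownSafeguardSketch`, the `install` method, and its parameters are hypothetical.

```java
/** Hypothetical illustration of the pattern; NOT Flink's JvmShutdownSafeguard. */
final class ShutdownSafeguardSketch {

    /** Sleeps for the grace period, then runs the forced-termination action. */
    static final class DelayedTerminator implements Runnable {
        private final long delayMillis;
        private final Runnable forceTerminate;

        DelayedTerminator(long delayMillis, Runnable forceTerminate) {
            this.delayMillis = delayMillis;
            this.forceTerminate = forceTerminate;
        }

        @Override
        public void run() {
            try {
                // A thread dump taken during the grace period shows this frame.
                Thread.sleep(delayMillis);
            } catch (InterruptedException e) {
                // Restore the interrupt flag and give up without terminating.
                Thread.currentThread().interrupt();
                return;
            }
            forceTerminate.run();
        }
    }

    /**
     * Registers a shutdown hook that starts the delayed terminator.
     * Assumption: Runtime.halt is used so that no further hooks or
     * finalizers can block the exit once the grace period has elapsed.
     */
    static Thread install(long delayMillis, int exitCode) {
        Thread terminator = new Thread(
                new DelayedTerminator(delayMillis, () -> Runtime.getRuntime().halt(exitCode)),
                "delayed-terminator-sketch");
        // Daemon thread: it must not itself keep the JVM alive.
        terminator.setDaemon(true);

        Thread hook = new Thread(terminator::start, "shutdown-safeguard-sketch");
        Runtime.getRuntime().addShutdownHook(hook);
        return hook;
    }
}
```

Under this sketch, a thread sitting in `Thread.sleep()` shortly after a `kill` is expected; it only looks wrong if the thread is still there long after the grace period should have expired, as in the attached stack traces.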



--
This message was sent by Atlassian Jira
(v8.3.4#803005)