[jira] [Created] (FLINK-21768) Optimize system.exit() logic of CliFrontend

Shang Yuanchun (Jira)
Junfan Zhang created FLINK-21768:
------------------------------------

             Summary: Optimize system.exit() logic of CliFrontend
                 Key: FLINK-21768
                 URL: https://issues.apache.org/jira/browse/FLINK-21768
             Project: Flink
          Issue Type: Improvement
          Components: Command Line Client
            Reporter: Junfan Zhang


h2. Why
We ran into a problem when integrating Oozie with a Flink batch action.
Oozie uses a launcher job to start the Flink client, which in turn submits the Flink job to Hadoop YARN.
When the Flink client finishes, Oozie reads its exit code to determine the job submission status and then performs some follow-up steps.

So how does Oozie catch {{System.exit()}}? It implements a JDK SecurityManager ([Oozie related code link|https://github.com/apache/oozie/blob/f1e01a9e155692aa5632f4573ab1b3ebeab7ef45/sharelib/oozie/src/main/java/org/apache/oozie/action/hadoop/security/LauncherSecurityManager.java#L24]).

Currently, when the Flink client finishes successfully, it calls {{System.exit(0)}} ([Flink related code link|https://github.com/apache/flink/blob/195298aea327b3f98d9852121f0f146368696300/flink-clients/src/main/java/org/apache/flink/client/cli/CliFrontend.java#L1133]).
The JVM then hands that call to the LauncherSecurityManager installed by Oozie, which triggers {{LauncherSecurityManager.checkExit()}}, and that method throws an exception.
Finally, the Flink client catches this exception as a {{Throwable}} and calls {{System.exit(31)}} ([related code link|https://github.com/apache/flink/blob/195298aea327b3f98d9852121f0f146368696300/flink-clients/src/main/java/org/apache/flink/client/cli/CliFrontend.java#L1139]). As a result, Oozie sees exit code 31 instead of 0 and misjudges the status of the Flink job.
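
The control flow described above can be modeled with a minimal sketch. It does not install a real SecurityManager (recent JDKs restrict that) and it is not the actual CliFrontend or Oozie code. {{ExitFlowSketch}}, {{InterceptingExit}}, {{exitInsideTry}} and {{exitOutsideTry}} are hypothetical names introduced only for illustration, and the assumption is that the interceptor records the requested code and then throws. The second method shows one possible restructuring, not necessarily the fix that will be adopted.

```java
import java.util.concurrent.Callable;

/** Illustrative model only; every name in this class is hypothetical. */
final class ExitFlowSketch {

    /** Stand-in for an intercepting security manager: records the code, then vetoes the exit. */
    static final class InterceptingExit {
        int lastExitCode = -1;

        void exit(int status) {
            lastExitCode = status;
            throw new SecurityException("Intercepted exit(" + status + ")");
        }
    }

    /** Mirrors the reported flow: the exit call sits inside the try block. */
    static int exitInsideTry(InterceptingExit jvm, Callable<Integer> submit) {
        try {
            try {
                int retCode = submit.call();
                jvm.exit(retCode); // throws because the exit is intercepted
            } catch (Throwable t) {
                jvm.exit(31);      // the interceptor's exception is treated as a client failure
            }
        } catch (SecurityException intercepted) {
            // the second intercepted exit propagates to the launcher
        }
        return jvm.lastExitCode;   // what the launcher ends up observing
    }

    /** One possible restructuring: compute the code inside try, exit once outside it. */
    static int exitOutsideTry(InterceptingExit jvm, Callable<Integer> submit) {
        int retCode;
        try {
            retCode = submit.call();
        } catch (Throwable t) {
            retCode = 31;
        }
        try {
            jvm.exit(retCode);
        } catch (SecurityException intercepted) {
            // intercepted once, with the real code
        }
        return jvm.lastExitCode;
    }

    public static void main(String[] args) {
        System.out.println(exitInsideTry(new InterceptingExit(), () -> 0));  // prints 31
        System.out.println(exitOutsideTry(new InterceptingExit(), () -> 0)); // prints 0
    }
}
```

In this model a successful submission is reported as 31 when the exit call is inside the try block, and as 0 when the exit happens once, after the try/catch has finished.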

Admittedly, this is a corner case, and in most scenarios the situation described above will not occur. It is still worth optimizing the client exit logic.

Besides, I think the same problem may also exist in other frameworks that use the Flink client to submit batch jobs, such as linkedin/azkaban and apache/airflow.




--
This message was sent by Atlassian Jira
(v8.3.4#803005)