[jira] [Created] (FLINK-21053) Prevent further RejectedExecutionExceptions in CheckpointCoordinator failing JM


Shang Yuanchun (Jira)
Roman Khachatryan created FLINK-21053:
-----------------------------------------

             Summary: Prevent further RejectedExecutionExceptions in CheckpointCoordinator failing JM
                 Key: FLINK-21053
                 URL: https://issues.apache.org/jira/browse/FLINK-21053
             Project: Flink
          Issue Type: Improvement
          Components: Runtime / Checkpointing
            Reporter: Roman Khachatryan
            Assignee: Roman Khachatryan
             Fix For: 1.13.0


In the past, multiple bugs were caused by throwing or handling RejectedExecutionException in CheckpointCoordinator (FLINK-18290, FLINK-20992).

I think this is still possible, because there are many places where an executor that may already be shut down is passed to CompletableFuture.xxxAsync calls.

In FLINK-20992 we discussed two approaches to fix this.

The first approach is to check the executor state inside a synchronized block every time the executor is used.
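
A minimal sketch of what such a guarded submission might look like. The class GuardedAsyncRunner and its methods are hypothetical names used only for illustration and are not Flink code; the sketch also assumes that shutdown goes through the same lock as every submission.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;

/** Hypothetical sketch, not Flink code: every use of the executor is guarded by one lock. */
final class GuardedAsyncRunner {
    private final Object lock = new Object();
    private final ExecutorService executor;
    private boolean shutDown; // guarded by lock

    GuardedAsyncRunner(ExecutorService executor) {
        this.executor = executor;
    }

    /** Submits the task, or returns false without submitting if already shut down. */
    boolean tryRunAsync(Runnable task) {
        synchronized (lock) {
            if (shutDown) {
                return false;
            }
            // Safe only because shutdown() takes the same lock before stopping the executor.
            CompletableFuture.runAsync(task, executor);
            return true;
        }
    }

    void shutdown() {
        synchronized (lock) {
            shutDown = true;
            executor.shutdown();
        }
    }
}
```

The check only helps if nothing shuts the executor down outside the lock. If the executor is owned and stopped elsewhere, a rejection can still happen between the check and the submission.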

The second approach is to:
 # Create the executors inside CheckpointCoordinator (both the io and timer thread pools)
 # Check isShutdown() in their error handlers (if the executor is shut down and the error is a RejectedExecutionException, just log it; otherwise delegate to FatalExitExceptionHandler)
 # (this would allow removing such RejectedExecutionException checks from the coordinator code)
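
As a rough illustration of the second approach, the error handler could be sketched as below. This assumes that "error handler" means a Thread.UncaughtExceptionHandler installed on the pool's threads. The names ShutdownAwareExceptionHandler and newExecutor are hypothetical, and fatalHandler merely stands in for FatalExitExceptionHandler.

```java
import java.lang.Thread.UncaughtExceptionHandler;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.RejectedExecutionException;
import java.util.concurrent.ThreadFactory;
import java.util.concurrent.atomic.AtomicReference;
import java.util.function.BooleanSupplier;

/** Hypothetical sketch, not Flink code. */
final class ShutdownAwareExceptionHandler implements UncaughtExceptionHandler {
    private final BooleanSupplier isShutdown;
    private final UncaughtExceptionHandler fatalHandler; // stand-in for FatalExitExceptionHandler

    ShutdownAwareExceptionHandler(BooleanSupplier isShutdown, UncaughtExceptionHandler fatalHandler) {
        this.isShutdown = isShutdown;
        this.fatalHandler = fatalHandler;
    }

    @Override
    public void uncaughtException(Thread t, Throwable e) {
        if (e instanceof RejectedExecutionException && isShutdown.getAsBoolean()) {
            // Expected during shutdown: log and carry on.
            System.err.println("Ignoring rejected task after shutdown: " + e);
        } else {
            fatalHandler.uncaughtException(t, e);
        }
    }

    /** Creates a single-threaded pool whose threads use the shutdown-aware handler. */
    static ExecutorService newExecutor(String threadName, UncaughtExceptionHandler fatalHandler) {
        AtomicReference<ExecutorService> ref = new AtomicReference<>();
        UncaughtExceptionHandler handler = new ShutdownAwareExceptionHandler(
                () -> ref.get() != null && ref.get().isShutdown(), fatalHandler);
        ThreadFactory factory = r -> {
            Thread thread = new Thread(r, threadName);
            thread.setUncaughtExceptionHandler(handler);
            return thread;
        };
        ExecutorService executor = Executors.newSingleThreadExecutor(factory);
        ref.set(executor);
        return executor;
    }
}
```

Because the coordinator would own the executor in this scheme, the handler can query its state directly. Note that an uncaught-exception handler only sees failures of tasks submitted with execute(); tasks passed to submit() capture their exceptions in the returned Future.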



--
This message was sent by Atlassian Jira
(v8.3.4#803005)