[jira] [Created] (FLINK-20672) CheckpointAborted RPC failure can fail JM

Shang Yuanchun (Jira)
Roman Khachatryan created FLINK-20672:
-----------------------------------------

             Summary: CheckpointAborted RPC failure can fail JM
                 Key: FLINK-20672
                 URL: https://issues.apache.org/jira/browse/FLINK-20672
             Project: Flink
          Issue Type: Bug
          Components: Runtime / Checkpointing
    Affects Versions: 1.11.3, 1.12.0
            Reporter: Roman Khachatryan


Since FLINK-8871, RPC notifications about aborted checkpoints are sent asynchronously:
 
{code}
private void sendAbortedMessages(long checkpointId, long timeStamp) {
    // send notification of aborted checkpoints asynchronously.
    executor.execute(() -> {
        // send the "abort checkpoint" messages to necessary vertices.
        // ..
    });
}
{code}

However, the executor that eventually runs this task is created as follows:
{code}
final ScheduledExecutorService futureExecutor = Executors.newScheduledThreadPool(
    Hardware.getNumberCPUCores(),
    new ExecutorThreadFactory("jobmanager-future"));
{code}

ExecutorThreadFactory installs an UncaughtExceptionHandler that exits the JVM on any uncaught error, so a failure while sending the abort notifications can bring down the JM.
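
To illustrate the failure mode and one possible mitigation, here is a minimal standalone sketch. It is not Flink code: the class name, the "demo-future" thread name, and the simulated failure are all hypothetical. It uses a plain fixed thread pool to keep the demo small, and its handler only records the error instead of exiting the JVM. The guarded variant catches the failure inside the task so it never reaches the handler; whether that is the right fix for the JM is left open here.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.ThreadFactory;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicBoolean;

public class AbortNotificationSketch {

    /**
     * Runs one failing task and reports whether its exception reached the
     * thread's UncaughtExceptionHandler. With guard == true the task catches
     * the failure itself.
     */
    static boolean reachesHandler(boolean guard) throws InterruptedException {
        CountDownLatch done = new CountDownLatch(1);
        AtomicBoolean escaped = new AtomicBoolean(false);

        // Hypothetical stand-in for a thread factory with a fatal handler.
        // The handler here only records the error; it does not exit the JVM.
        ThreadFactory factory = runnable -> {
            Thread thread = new Thread(runnable, "demo-future");
            thread.setDaemon(true);
            thread.setUncaughtExceptionHandler((t, error) -> {
                escaped.set(true);
                done.countDown();
            });
            return thread;
        };
        // Assumption: a plain fixed pool, not the scheduled pool from the issue.
        ExecutorService executor = Executors.newFixedThreadPool(1, factory);

        // Simulates a notification that fails while being sent.
        Runnable sendAborted = () -> {
            throw new IllegalStateException("simulated RPC failure");
        };

        // Same task, but the failure is handled inside the task body.
        Runnable guarded = () -> {
            try {
                sendAborted.run();
            } catch (Throwable t) {
                // A real implementation would log the failure here.
                done.countDown();
            }
        };

        try {
            executor.execute(guard ? guarded : sendAborted);
            if (!done.await(10, TimeUnit.SECONDS)) {
                throw new IllegalStateException("task did not finish in time");
            }
        } finally {
            executor.shutdownNow();
        }
        return escaped.get();
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println("unguarded reaches handler: " + reachesHandler(false));
        System.out.println("guarded reaches handler:   " + reachesHandler(true));
    }
}
```

In this sketch the unguarded task's exception reaches the handler, while the guarded one does not.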

cc: [~yunta]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)