Roman Khachatryan created FLINK-20672:
-----------------------------------------
Summary: CheckpointAborted RPC failure can fail JM
Key: FLINK-20672
URL: https://issues.apache.org/jira/browse/FLINK-20672
Project: Flink
Issue Type: Bug
Components: Runtime / Checkpointing
Affects Versions: 1.11.3, 1.12.0
Reporter: Roman Khachatryan
Since FLINK-8871, notifications about aborted checkpoints are sent asynchronously:
{code}
private void sendAbortedMessages(long checkpointId, long timeStamp) {
    // send notification of aborted checkpoints asynchronously.
    executor.execute(() -> {
        // send the "abort checkpoint" messages to necessary vertices.
        // ..
    });
}
{code}
However, the executor that eventually runs this task is created as follows:
{code}
final ScheduledExecutorService futureExecutor = Executors.newScheduledThreadPool(
    Hardware.getNumberCPUCores(),
    new ExecutorThreadFactory("jobmanager-future"));
{code}
ExecutorThreadFactory uses an UncaughtExceptionHandler that exits the JVM on error, so an uncaught failure while sending these notifications can bring down the JobManager.
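One possible mitigation, sketched below and not necessarily the fix that will be applied, is to catch exceptions inside the submitted task so they never reach the thread's handler. Everything in the sketch is an assumption made for illustration: the class and method names are hypothetical, a plain fixed thread pool stands in for the scheduled pool above (how an exception surfaces can differ between executor implementations), and the handler only counts invocations where the real one would exit the JVM.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.ThreadFactory;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

class AbortNotificationSketch {

    // Hypothetical stand-in for ExecutorThreadFactory: the handler only counts
    // invocations where the real one would exit the JVM.
    static ThreadFactory recordingThreadFactory(AtomicInteger fatalCalls) {
        return runnable -> {
            Thread thread = new Thread(runnable, "jobmanager-future-sketch");
            thread.setUncaughtExceptionHandler((t, error) -> fatalCalls.incrementAndGet());
            return thread;
        };
    }

    // Wraps a task so that an Exception is counted instead of escaping the worker thread.
    // Errors (e.g. OutOfMemoryError) are deliberately not caught.
    static Runnable guarded(Runnable task, AtomicInteger handledFailures) {
        return () -> {
            try {
                task.run();
            } catch (Exception e) {
                handledFailures.incrementAndGet();
            }
        };
    }

    // Runs one failing "abort notification" and returns {handledFailures, fatalCalls}.
    static int[] runDemo() throws InterruptedException {
        AtomicInteger handled = new AtomicInteger();
        AtomicInteger fatal = new AtomicInteger();
        ExecutorService executor = Executors.newFixedThreadPool(1, recordingThreadFactory(fatal));
        executor.execute(guarded(() -> {
            throw new RuntimeException("simulated RPC failure");
        }, handled));
        executor.shutdown();
        executor.awaitTermination(5, TimeUnit.SECONDS);
        return new int[] {handled.get(), fatal.get()};
    }

    public static void main(String[] args) throws InterruptedException {
        int[] counts = runDemo();
        System.out.println("handled=" + counts[0] + ", fatal=" + counts[1]);
    }
}
```

In real code the catch block would presumably log the failure rather than just count it; the point is only that the simulated RPC failure is handled inside the task and the thread-level handler is never invoked.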
cc: [~yunta]