Alex created FLINK-12858:
----------------------------
Summary: Potentially not properly working Flink job in case of stop-with-savepoint failure
Key: FLINK-12858
URL: https://issues.apache.org/jira/browse/FLINK-12858
Project: Flink
Issue Type: Bug
Components: Runtime / Checkpointing
Reporter: Alex
The current implementation of stop-with-savepoint (FLINK-11458) blocks the thread that executes {{StreamTask.performCheckpoint()}} on {{syncSavepointLatch}}. For non-source tasks, this thread is assumed to be the task's main thread (stop-with-savepoint deliberately stops all activity in the task's main thread).
The thread is unblocked either when the task is cancelled or when the corresponding checkpoint is acknowledged.
It is possible that other downstream tasks of the same Flink job "soft" fail the checkpoint/savepoint for various reasons (for example, because the max buffered bytes limit is exceeded in {{BarrierBuffer.checkSizeLimit()}}). In that case, the JM is notified that the checkpoint was aborted. However, it looks like the checkpoint coordinator handles such an abort as usual and assumes that the Flink job continues running, while the tasks that are blocked on the latch may never be released.
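
The suspected gap can be illustrated with a minimal, self-contained sketch. This is not Flink's code: {{SimpleSavepointLatch}}, its method names, and {{releasedAfter()}} are hypothetical stand-ins, and the sketch waits with a 200 ms timeout only so that it terminates (the real task thread would presumably wait indefinitely). It assumes, as described above, that only acknowledgement and cancellation release the latch and that an abort notification does not.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicBoolean;

public class SavepointLatchSketch {

    /** Hypothetical stand-in for the latch the task thread blocks on; not Flink's class. */
    static final class SimpleSavepointLatch {
        private final CountDownLatch released = new CountDownLatch(1);

        // The two release paths named in the issue description.
        void acknowledgeCheckpoint() { released.countDown(); }
        void cancel() { released.countDown(); }

        // Assumption: an abort notification does not release the latch.
        void notifyCheckpointAborted() { /* intentionally empty */ }

        // Returns true if the latch was released before the timeout expired.
        boolean await(long timeoutMillis) throws InterruptedException {
            return released.await(timeoutMillis, TimeUnit.MILLISECONDS);
        }
    }

    /** Delivers one event ("ack", "cancel" or "abort") and reports whether the waiting thread was released. */
    static boolean releasedAfter(String event) {
        SimpleSavepointLatch latch = new SimpleSavepointLatch();
        AtomicBoolean wasReleased = new AtomicBoolean(false);

        // Plays the role of the task's main thread blocked after performing the savepoint.
        Thread taskThread = new Thread(() -> {
            try {
                wasReleased.set(latch.await(200));
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        taskThread.start();

        switch (event) {
            case "ack": latch.acknowledgeCheckpoint(); break;
            case "cancel": latch.cancel(); break;
            case "abort": latch.notifyCheckpointAborted(); break;
            default: throw new IllegalArgumentException("unknown event: " + event);
        }

        try {
            taskThread.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return wasReleased.get();
    }

    public static void main(String[] args) {
        System.out.println("released after ack:    " + releasedAfter("ack"));
        System.out.println("released after cancel: " + releasedAfter("cancel"));
        System.out.println("released after abort:  " + releasedAfter("abort"));
    }
}
```

In this sketch the waiting thread is released after "ack" and "cancel" but times out after "abort", which mirrors the situation where the job is considered running while a task stays blocked.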
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)