(DEPRECATED) Apache Flink Mailing List archive.

[jira] [Created] (FLINK-17350) StreamTask should always fail immediately on failures in synchronous part of a checkpoint

Shang Yuanchun (Jira)

Piotr Nowojski created FLINK-17350:
--------------------------------------

Summary: StreamTask should always fail immediately on failures in synchronous part of a checkpoint
Key: FLINK-17350
URL: https://issues.apache.org/jira/browse/FLINK-17350
Project: Flink
Issue Type: Bug
Components: Runtime / Checkpointing, Runtime / Task
Affects Versions: 1.10.0, 1.9.2, 1.8.3, 1.7.2, 1.6.4
Reporter: Piotr Nowojski

This bug also affects the 1.5.x branch.

As described in https://issues.apache.org/jira/browse/FLINK-17327?focusedCommentId=17090576&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-17090576

{{setTolerableCheckpointFailureNumber(...)}} and its deprecated predecessor {{setFailTaskOnCheckpointError(...)}} are implemented incorrectly. Since Flink 1.5 (https://issues.apache.org/jira/browse/FLINK-4809) they can cause operators (especially sinks with external state) to end up in an inconsistent state. This holds even if neither setting is used, because of another issue: PLACEHOLDER

For details please check FLINK-17327.

The problem boils down to the fact that if an operator or user function throws an exception, the job should always fail. There is no recovery from this. In the case of {{FlinkKafkaProducer}}, ignoring such failures might mean that a whole transaction, with all of its records, is lost forever.
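
The intended rule can be illustrated with a minimal, self-contained sketch. This is not Flink's actual code: the class {{SyncCheckpointFailureSketch}}, the enums {{Phase}} and {{Action}}, and the method {{onCheckpointFailure}} are hypothetical names introduced only for illustration. It also assumes that failures outside the synchronous part (for example in the asynchronous part) stay subject to the tolerable-failure budget, which this issue does not itself specify.

```java
/** Hypothetical model of the proposed rule; none of these names exist in Flink. */
public class SyncCheckpointFailureSketch {

    /** Part of the checkpoint in which the failure occurred. */
    enum Phase { SYNC, ASYNC }

    /** Reaction of the task to the failure. */
    enum Action { FAIL_TASK, DECLINE_CHECKPOINT }

    /**
     * Decides how to react to a checkpoint failure.
     *
     * @param phase             part of the checkpoint that failed
     * @param failuresSoFar     failures already counted against the budget (assumption)
     * @param tolerableFailures value that would be passed to
     *                          setTolerableCheckpointFailureNumber(...)
     */
    static Action onCheckpointFailure(Phase phase, int failuresSoFar, int tolerableFailures) {
        // An exception thrown by operator/user code in the synchronous part
        // always fails the task, regardless of the configured budget.
        if (phase == Phase.SYNC) {
            return Action.FAIL_TASK;
        }
        // Assumption: other failures may be tolerated until the budget is exceeded.
        return failuresSoFar + 1 > tolerableFailures
                ? Action.FAIL_TASK
                : Action.DECLINE_CHECKPOINT;
    }

    public static void main(String[] args) {
        // Even with a generous budget, a synchronous failure is fatal.
        System.out.println(onCheckpointFailure(Phase.SYNC, 0, 10));
        // A first asynchronous failure within the budget only declines the checkpoint.
        System.out.println(onCheckpointFailure(Phase.ASYNC, 0, 10));
    }
}
```

The point of the sketch is that the tolerable-failure setting is never consulted for the synchronous part, so a sink such as {{FlinkKafkaProducer}} cannot silently continue after its own snapshot code has thrown.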
