[jira] [Created] (FLINK-14480) Sort out exception supression during snapshotting for non-running tasks.

classic Classic list List threaded Threaded
1 message Options
Reply | Threaded
Open this post in threaded view
|

[jira] [Created] (FLINK-14480) Sort out exception supression during snapshotting for non-running tasks.

Shang Yuanchun (Jira)
Arvid Heise created FLINK-14480:
-----------------------------------

             Summary: Sort out exception supression during snapshotting for non-running tasks.
                 Key: FLINK-14480
                 URL: https://issues.apache.org/jira/browse/FLINK-14480
             Project: Flink
          Issue Type: Bug
          Components: Runtime / Checkpointing
    Affects Versions: 1.10.0
            Reporter: Arvid Heise


Following-up on Flink-14370.

As [~becket_qin] wrote: "The cause [of Flink-14370] was that when all the records are processed before a snapshot was taken, the records that could not be sent out trigger the snapshot to fail. That snapshot failure will not cause the job to exit. However, all the records in the KafkaProducer are already expired after the snapshot failure. So when the producer closes, there will be no more exception thrown. Thus the job finished successfully.

It looks that the expected behavior from connector is to report exception in {{close()}} method as long as there was a record that could not be sent. On the other hand, exception thrown from {{CheckpointedFunction.snapshotState()}} might be ignored. Not sure if this is reasonable, but this expectation is not super clear from the connector implementation's perspective.

In terms of immediate fix, [~[hidden email]] proposes to always throw exception as long as there has been a record sending failure. I agree that is the right fix per current expected behavior on the connectors."

[~kkl0u], [~pnowojski] any other idea on how to resolve it?



--
This message was sent by Atlassian Jira
(v8.3.4#803005)