(DEPRECATED) Apache Flink Mailing List archive.

Re: [DISCUSS]FLIP-170 Adding Checkpoint Rejection Mechanism

Posted by Piotr Nowojski-5 on
URL: http://deprecated-apache-flink-mailing-list-archive.368.s1.nabble.com/DISCUSS-FLIP-170-Adding-Checkpoint-Rejection-Mechanism-tp51212p51269.html

Hi Senhong,

Thanks for the proposal. I have a couple of questions.

Have you seen
`org.apache.flink.streaming.api.checkpoint.ExternallyInducedSource` (for
the legacy SourceFunction) and
`org.apache.flink.api.connector.source.ExternallyInducedSourceReader` (for
FLIP-27) interfaces? They work the other way around, by letting the source
to trigger/initiate a checkpoint, instead of declining it. Could it be made
to work in your use case? If not, can you explain why?

Regarding declining/failing the checkpoint (without blocking the barrier
waiting for snapshot availability), can not you achieve the same thing by a
combination of throwing an exception in for example
`org.apache.flink.api.connector.source.SourceReader#snapshotState` call and
setting the tolerable checkpoint failure number? [1]

Best, Piotrek

[1]
https://ci.apache.org/projects/flink/flink-docs-master/api/java/org/apache/flink/streaming/api/environment/CheckpointConfig.html#setTolerableCheckpointFailureNumber-int-

śr., 9 cze 2021 o 09:11 Senhong Liu <[hidden email]> napisał(a):

> Here is some brief context about the new feature.
>
> 1. Actively checkpoint rejecting by the operator. Follow by the current
> checkpoint mechanism, one more preliminary step is added to help the
> operator determine that if it is able to take snapshots. The preliminary
> step is a new API provided to the users/developers. The new API will be
> implemented in the Source API (the new one based on FLIP-27) for CDC
> implementation. The new API can also be implemented in other operator if
> necessary.
>
> 2. Handling the failure returned from the operator. If the checkpoint is
> rejected by the operator, an appropriate failure reason needs to be
> returned
> from the operator as well. In the current design, two failure reasons are
> listed, soft failure and hard failure. The previous one would be ignored by
> the Flink and the later one would be counted as continuous checkpoint
> failure according to the current checkpoint failure manager mechanism.
>
> 3. To prevent that the operator keeps reporting soft failure and therefore
> no checkpoint can be completed for a long time, we introduce a new
> configuration about the tolerable checkpoint failure timeout, which is a
> timer that starts with the checkpoint scheduler. Overall, the timer would
> only be reset if and only if the checkpoint completes. Otherwise, it would
> do nothing until the tolerable timeout is hit. If the timer rings, it would
> then trigger the current checkpoint failover.
>
> Question:
> a. According to the current design, the checkpoint might fail for a
> possibly
> long time with a large checkpoint interval, for example. Is there any
> better
> idea to make the checkpoint more likely to succeed? For example, trigger
> the
> checkpoint immediately after the last one is rejected. But it seems
> unappropriate because that would increase the overhead.
> b. Is there any better idea on handling the soft failure?
>
>
>
>
>
> --
> Sent from: http://apache-flink-mailing-list-archive.1008284.n3.nabble.com/
>