S3 Checkpointing taking long time with stateful operations

Kathula, Sandeep

Hi,

We are running a stateful application in Flink with RocksDB as the state backend, with incremental checkpointing enabled and checkpoints written to S3.

  • 10 task managers each with 2 task slots
  • Checkpoint interval 3 minutes
  • Checkpointing mode: at-least-once

 

After the app has been running for 2-3 days, the end-to-end checkpoint duration is almost 2 minutes, even though the sync time is about 2 sec and the async time is at most 15 sec. Initially, when the state is small, a checkpoint takes 10-15 sec. Since the checkpointing mode is at-least-once, the reported alignment duration is 0. We see a dip in processing during checkpoints and couldn't find out what the actual issue is.

 

We also tried a remote HDFS for checkpointing but observed similar behavior.

 

We have a couple of questions:

  • When the sync time is at most 2 sec and the async time is 15 sec, why does the end-to-end checkpoint take almost 2 minutes?
  • How can we reduce the checkpoint time?

 

Any help would be appreciated.

 

 

Thank you

Sandeep Kathula

 

Re: S3 Checkpointing taking long time with stateful operations

Congxian Qiu
Hi Sandeep,

The picture isn't shown.

First, you can try to find out whether some operator has a large e2e time.
The e2e time of a snapshot for one operator is time{barrier align} +
time{sync-snapshot} + time{async-snapshot}. Both exactly-once and
at-least-once need to wait for the barriers from all inputs to arrive, but
at-least-once can keep processing records while the barriers are aligning.
For at-least-once mode, you can enable debug logging to track the barrier
alignment procedure.
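
To make the arithmetic concrete, here is a minimal Python sketch of the decomposition above, applied to the figures from the question (about 2 minutes end to end, 2 sec sync, 15 sec async). The class and function names are hypothetical and not part of Flink. The per-subtask numbers would have to be read manually from the checkpoint details in the web UI. It only shows how much of the e2e time the sync and async phases leave unaccounted for.

```python
from dataclasses import dataclass


@dataclass
class SubtaskCheckpointStats:
    """Per-subtask checkpoint timings in seconds (hypothetical container)."""
    name: str
    e2e: float
    sync: float
    async_: float

    def unexplained(self) -> float:
        # Under the decomposition e2e = align + sync + async, whatever the
        # sync and async phases do not cover is time spent waiting for barriers.
        return self.e2e - self.sync - self.async_


def slowest_by_wait(stats):
    """Order subtasks by unexplained (barrier wait) time, largest first."""
    return sorted(stats, key=lambda s: s.unexplained(), reverse=True)


# Figures from the question: ~2 min end to end, 2 s sync, 15 s async.
reported = SubtaskCheckpointStats("reported-worst-case", e2e=120.0, sync=2.0, async_=15.0)
print(f"{reported.name}: {reported.unexplained():.0f} s not covered by sync + async")
```

With the reported numbers, roughly 103 of the 120 seconds fall outside the sync and async phases, so under this decomposition the place to look is how long the barriers take to reach the slow operator, not the snapshot upload itself.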

Best,
Congxian


Kathula, Sandeep <[hidden email]> wrote on Fri, Jun 19, 2020 at 9:21 AM:

> Hi,
>
> We are running a stateful application in Flink with *RocksDB as the state
> backend*, with *incremental checkpointing enabled* and checkpoints written
> to *S3*.
>
>    - 10 task managers each with 2 task slots
>    - Checkpoint interval 3 minutes
>    - Checkpointing mode: at-least-once
>
> After the app has been running for 2-3 days, the end-to-end checkpoint
> duration is almost *2 minutes*, even though the sync time is about 2 sec and
> the async time is at most 15 sec. Initially, when the state is small, a
> checkpoint takes 10-15 sec. Since the checkpointing mode is at-least-once,
> the reported alignment duration is 0. We see a dip in processing during
> checkpoints and couldn't find out what the actual issue is.
>
> We also tried a remote HDFS for checkpointing but observed similar
> behavior.
>
> We have a couple of questions:
>
>    - *When the sync time is at most 2 sec and the async time is 15 sec, why
>    does the end-to-end checkpoint take almost 2 minutes?*
>    - *How can we reduce the checkpoint time?*
>
> [image: A screenshot of a cell phone Description automatically generated]
>
> Any help would be appreciated.
>
> Thank you
>
> Sandeep Kathula