Flink 1.9.2 why always checkpoint expired


qq
Hi all,

Why do my Flink checkpoints always expire? I use RocksDB for checkpoints,
and I cannot find any useful messages about this. Could you help me? Thanks very much.




Re: Flink 1.9.2 why always checkpoint expired

Congxian Qiu
Hi
The image is not very clear.
For RocksDBStateBackend, have you enabled incremental checkpoints?

Currently, a checkpoint on the TM side consists of these steps:
1 barrier alignment
2 sync snapshot
3 async snapshot

For the expired checkpoint, could you please check the tasks in the first operator of the DAG to find out why it timed out?
- Is there any backpressure? (affects barrier alignment)
- Is disk or network utilization high? (affects steps 2 and 3)
- Is the task thread too busy? (this can cause the barrier to be processed late)

You can enable the debug log to find out more.
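
To make the three steps above concrete, here is a toy sketch in plain Java (no Flink dependency). It assumes you have read the per-subtask durations of each phase from the checkpoint details or the debug log; the class name, method names, and numbers are all hypothetical, and the model is simplified (it ignores the time the barrier needs to reach the task).

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Toy model of one subtask's checkpoint: which phase uses most of the timeout budget. */
class CheckpointPhaseBreakdown {

    /** Returns the name of the phase with the largest duration, or null for an empty map. */
    static String dominantPhase(Map<String, Long> phaseMillis) {
        String worst = null;
        long max = -1;
        for (Map.Entry<String, Long> e : phaseMillis.entrySet()) {
            if (e.getValue() > max) {
                max = e.getValue();
                worst = e.getKey();
            }
        }
        return worst;
    }

    /** In this simplified model the checkpoint expires when the phases together exceed the timeout. */
    static boolean expires(Map<String, Long> phaseMillis, long timeoutMillis) {
        long total = 0;
        for (long v : phaseMillis.values()) {
            total += v;
        }
        return total > timeoutMillis;
    }

    public static void main(String[] args) {
        // Hypothetical measurements for one slow subtask, in milliseconds.
        Map<String, Long> phases = new LinkedHashMap<>();
        phases.put("barrier alignment", 95_000L);
        phases.put("sync snapshot", 2_000L);
        phases.put("async snapshot", 40_000L);

        long timeoutMillis = 120_000L; // a 2 min checkpoint timeout
        System.out.println("expires: " + expires(phases, timeoutMillis));
        System.out.println("dominant phase: " + dominantPhase(phases));
    }
}
```

If alignment dominates, look at backpressure first; if the sync or async snapshot dominates, look at disk and network utilization.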

Best,
Congxian


On Mon, Apr 27, 2020 at 12:34 PM qq <[hidden email]> wrote:
Hi all,

Why do my Flink checkpoints always expire? I use RocksDB for checkpoints,
and I cannot find any useful messages about this. Could you help me? Thanks very much.




Re: Flink 1.9.2 why always checkpoint expired

qq
Hi Jiayi Liao.

  Thanks for your reply. I have added the attachment. I still cannot find any useful messages.

 


On Apr 27, 2020, at 12:40, Jiayi Liao <[hidden email]> wrote:

<Pasted Graphic-1.tiff>


Re: Flink 1.9.2 why always checkpoint expired

qq


On Apr 27, 2020, at 12:40, Jiayi Liao <[hidden email]> wrote:

Hi,

The picture in your attachment is too blurry to make out any detail. Besides the overview, could you look at the details of a specific expired checkpoint in the History tab? In my experience, expiration is usually caused by one of the following:

1. Data skew, which you can spot in the checkpoint details.
2. Processing that is too slow (or a back-pressured job) combined with a checkpoint timeout that is set too short.
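
As a rough way to check the first point, you can copy the per-subtask numbers from a checkpoint's detail view (for example state size or end-to-end duration) and compare the largest value with the median. The sketch below is plain Java with hypothetical names and made-up numbers; what ratio counts as "skewed" is a judgment call for your job, not something Flink defines.

```java
import java.util.Arrays;

/** Rough skew check over per-subtask checkpoint numbers (state size, duration, ...). */
class SkewCheck {

    /** Ratio of the largest value to the median; values near 1 mean the subtasks are even. */
    static double maxToMedian(long[] perSubtask) {
        if (perSubtask.length == 0) {
            throw new IllegalArgumentException("need at least one subtask value");
        }
        long[] sorted = perSubtask.clone();
        Arrays.sort(sorted);
        int n = sorted.length;
        double median = (n % 2 == 1)
                ? sorted[n / 2]
                : (sorted[n / 2 - 1] + sorted[n / 2]) / 2.0;
        return sorted[n - 1] / median;
    }

    /** Index of the subtask with the largest value (first one on ties). */
    static int heaviestSubtask(long[] perSubtask) {
        int best = 0;
        for (int i = 1; i < perSubtask.length; i++) {
            if (perSubtask[i] > perSubtask[best]) {
                best = i;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        // Hypothetical state sizes in MB for four subtasks of one operator.
        long[] stateSizeMb = {110, 95, 120, 2400};
        System.out.println("max/median ratio: " + maxToMedian(stateSizeMb));
        System.out.println("heaviest subtask index: " + heaviestSubtask(stateSizeMb));
    }
}
```

A single subtask that is far above the median usually points to a hot key, and that subtask is the one to inspect first.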

Best Regards,
Jiayi Liao



Re: Flink 1.9.2 why always checkpoint expired

Congxian Qiu
Hi

From the picture and the previous email, I see that you use RocksDBStateBackend, all the operators are chained together, and the checkpoint timeout is set to 2 min.

Do you have keyed state in your job (that is, do you use `keyBy`)?

Here is what I usually check to find the cause of a checkpoint timeout:
1. Can the snapshot thread acquire the checkpoint lock (if you run a version < 1.10)?
2. Does the main thread consume so much CPU that the barrier cannot be handled?
3. Could you please enable the debug log to find out more information?


Best,
Congxian


On Tue, Apr 28, 2020 at 9:20 AM qq <[hidden email]> wrote:


On Apr 27, 2020, at 12:40, Jiayi Liao <[hidden email]> wrote:

Hi,

The picture in your attachment is too blurry to make out any detail. Besides the overview, could you look at the details of a specific expired checkpoint in the History tab? In my experience, expiration is usually caused by one of the following:

1. Data skew, which you can spot in the checkpoint details.
2. Processing that is too slow (or a back-pressured job) combined with a checkpoint timeout that is set too short.

Best Regards,
Jiayi Liao
