[ANNOUNCE] 1.12.1 may still produce corrupted checkpoints

Arvid Heise-4
Dear users,

Unfortunately, the unaligned checkpoint bug that we fixed in 1.12.1 still
occurs under certain circumstances, so we recommend not using unaligned
checkpoints in production until 1.12.2. Normal processing is not affected
by this bug, but recovery from a corrupted checkpoint will not succeed.

If you have used unaligned checkpoints, you can switch back to aligned
checkpoints when starting from an uncorrupted unaligned checkpoint. There is
no easy way to check whether a checkpoint is corrupted. However, the rare
corruption is most likely to happen when you have short checkpointing
intervals (<1s) and high backpressure, and the previous checkpoint was
declined for some reason. So to be safe, before switching back, make sure
that the last handful of checkpoints all succeeded.
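
As a rough illustration of the "last handful of checkpoints all succeeded"
check, here is a minimal Python sketch. It is not part of Flink. The function
name and the sample data are hypothetical, and the sketch assumes that each
checkpoint is described by a dict with an `id` and a `status` of `COMPLETED`,
`FAILED`, or `IN_PROGRESS`, roughly the shape of the `history` entries that the
REST endpoint `/jobs/<jobid>/checkpoints` returns. Please verify that shape
against your Flink version before relying on it.

```python
from typing import Iterable, Mapping


def last_checkpoints_all_completed(history: Iterable[Mapping], n: int = 5) -> bool:
    """Return True only if the n most recent finished checkpoints all completed.

    `history` is assumed (not guaranteed) to look like the "history" array of
    Flink's checkpoint statistics: dicts with an integer "id" and a "status".
    """
    # Ignore checkpoints that are still running; they have no final outcome yet.
    finished = [c for c in history if c.get("status") != "IN_PROGRESS"]
    # Sort newest first by checkpoint id instead of trusting the input order.
    finished.sort(key=lambda c: c["id"], reverse=True)
    recent = finished[:n]
    # A history shorter than n is treated as "not safe" rather than guessed at.
    if len(recent) < n:
        return False
    return all(c.get("status") == "COMPLETED" for c in recent)


if __name__ == "__main__":
    # Hypothetical sample data, not taken from a real job.
    sample = [
        {"id": 40, "status": "FAILED"},
        {"id": 41, "status": "COMPLETED"},
        {"id": 42, "status": "COMPLETED"},
        {"id": 43, "status": "COMPLETED"},
        {"id": 44, "status": "IN_PROGRESS"},
    ]
    print(last_checkpoints_all_completed(sample, n=3))  # prints True
    print(last_checkpoints_all_completed(sample, n=4))  # prints False
```

A passing check only says that the recent checkpoints were reported as
successful. It cannot prove that a checkpoint is uncorrupted, so it is a
precaution before switching back, not a guarantee.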

We have already prepared a fix that we will merge into the release branch
today, but the discussion on when to release 1.12.2 has not started yet.

Best,

Arvid

Re: [ANNOUNCE] 1.12.1 may still produce corrupted checkpoints

Xintong Song
Hi Arvid,

Thanks for the announcement.

I think we should also update the 1.12 release notes [1] and the 1.12.1
release blog post [2].
Would you have time to help prepare the warning messages?

Thank you~

Xintong Song


[1]
https://github.com/apache/flink/blob/master/docs/release-notes/flink-1.12.md

[2]
https://github.com/apache/flink-web/blob/asf-site/_posts/2021-01-19-release-1.12.1.md



On Fri, Jan 22, 2021 at 7:40 PM Arvid Heise <[hidden email]> wrote:

> [...]

Re: [ANNOUNCE] 1.12.1 may still produce corrupted checkpoints

Arvid Heise-3
Hi Xintong,

Yes, I'm on it.

Best,

Arvid

On Fri, Jan 22, 2021 at 1:01 PM Xintong Song <[hidden email]> wrote:

> [...]


--

Arvid Heise | Senior Java Developer

<https://www.ververica.com/>

Follow us @VervericaData

--

Join Flink Forward <https://flink-forward.org/> - The Apache Flink
Conference

Stream Processing | Event Driven | Real Time

--

Ververica GmbH | Invalidenstrasse 115, 10115 Berlin, Germany

--
Ververica GmbH
Registered at Amtsgericht Charlottenburg: HRB 158244 B
Managing Directors: Timothy Alexander Steinert, Yip Park Tung Jason, Ji
(Toni) Cheng

Re: [ANNOUNCE] 1.12.1 may still produce corrupted checkpoints

Till Rohrmann
Thanks for the update, Arvid. In my opinion, this fix warrants a quick
1.12.2 release.

Cheers,
Till

On Fri, Jan 22, 2021 at 1:42 PM Arvid Heise <[hidden email]> wrote:

> [...]

Re: [ANNOUNCE] 1.12.1 may still produce corrupted checkpoints

Arvid Heise-4
Hi Till,

I completely agree with you.

Best,

Arvid

On Fri, Jan 22, 2021 at 1:46 PM Till Rohrmann <[hidden email]> wrote:

> [...]