Hi,
We’re currently thinking about releasing StateFun 2.2.1 to address a critical bug that causes restores from checkpoints / savepoints to fail under certain circumstances [1].

To provide a bit more context, the full fix for this issue is two-fold:

1. *Fix restoring from checkpoints / savepoints taken with the same StateFun version:* this has already been fixed in StateFun, with changes backported to `flink-statefun/release-2.2`.
2. *Allow restoring from older savepoints taken with StateFun <= 2.2.0:* this requires a few fixes to Flink around restoring heap-based timers [2] and iterating through key groups in restored raw keyed state streams [3]. These fixes will be included in Flink 1.11.3 [4], meaning that to fix this, StateFun will need to wait until Flink 1.11.3 is out and upgrade its Flink dependency.

The main discussion point here is whether or not it makes sense for StateFun 2.2.1 to wait for Flink 1.11.3, so that both parts of the problem, 1) and 2), can be solved together in a single hotfix release.

The other option is to release StateFun 2.2.1 now with the fix for problem 1) only, and have a follow-up hotfix release 2.2.2 after Flink 1.11.3 is available.

I propose to keep a close eye on the progress of Flink 1.11.3 (you can track progress on the 1.11.3 discussion thread [4]), and *make a decision here mid-week on Wednesday, Nov. 4th*.
If by then we decide not to let StateFun 2.2.1 wait for Flink 1.11.3 because it could take a while, we can start with a StateFun 2.2.1 RC right away; otherwise, if Flink 1.11.3 seems to be just around the corner, we can wait for a few more days.

What do you think?

Cheers,
Gordon

[1] https://issues.apache.org/jira/browse/FLINK-19692
[2] https://github.com/apache/flink/pull/13761
[3] https://github.com/apache/flink/pull/13772
[4] http://apache-flink-mailing-list-archive.1008284.n3.nabble.com/DISCUSS-Releasing-Apache-Flink-1-11-3-td45989.html
Hi Gordon,
Thanks for driving this discussion!

I would go with the second suggestion: having two consecutive StateFun releases, 2.2.1 and 2.2.2, since the Flink 1.11.3 release might take a while, and this hotfix release is important enough to get out as early as possible.

Cheers,
Igal.

On Mon, Nov 2, 2020 at 11:43 AM Tzu-Li (Gordon) Tai wrote:
> [...]
Thanks a lot for starting this thread.
How many users are affected by the problem? Is anybody affected besides the initial issue reporter?
If it is just one person, I would rather suggest helping to push the 1.11.3 release over the line, or working on more StateFun features ;)

On Tue, Nov 3, 2020 at 11:58 AM Igal Shilman wrote:
> [...]
Hi Robert,
So far we've only seen a single user report the issue, but the severity of FLINK-19692 is actually pretty huge.
TL;DR: If you attempt to restore from a checkpoint / savepoint that contains feedback events (which is considered normal under typical StateFun operations), the restore will always fail.

That's why we started the discussion about releasing a "partial" solution with StateFun 2.2.1 now, so that there is at least one StateFun release available that works properly with failure recovery, and then following up with another StateFun hotfix release, 2.2.2, which would include Flink 1.11.3, to address the remaining part of the problem.

BR,
Gordon

On Tue, Nov 3, 2020 at 9:33 PM Robert Metzger wrote:
> [...]
Hi Gordon,
Thanks a lot for this clarification.

In this case I would vote for releasing StateFun 2.2.1 ASAP and not waiting for 1.11.3.

Thanks a lot for your efforts!

On Tue, Nov 3, 2020 at 3:38 PM Tzu-Li (Gordon) Tai wrote:
> [...]
Thanks everyone for the feedback.
I updated the status of Flink 1.11.3 earlier in its corresponding discussion thread [1].

From the looks of it, it makes sense to proceed with StateFun 2.2.1 without waiting for Flink 1.11.3.
Since this is also the consensus we've reached here, I have proceeded to create RC1 for StateFun 2.2.1 [2].

[1] http://apache-flink-mailing-list-archive.1008284.n3.nabble.com/DISCUSS-Releasing-Apache-Flink-1-11-3-td45989.html
[2] http://apache-flink-mailing-list-archive.1008284.n3.nabble.com/DISCUSS-Releasing-StateFun-hotfix-version-2-2-1-td46239.html

On Tue, Nov 3, 2020 at 10:42 PM Robert Metzger wrote:
> [...]