[VOTE] [FLIP-76] Unaligned checkpoints

Arvid Heise-3
Hi all,

I would like to start the vote for FLIP-76 [1], which was discussed and
reached consensus in the discussion thread [2].

The vote will be open until March 13th (72h), unless there is an objection
or not enough votes.

Thanks,
Arvid

[1]
https://cwiki.apache.org/confluence/display/FLINK/FLIP-76%3A+Unaligned+Checkpoints
[2]
http://apache-flink-mailing-list-archive.1008284.n3.nabble.com/DISCUSS-FLIP-76-Unaligned-checkpoints-td33651.html

Re: [VOTE] [FLIP-76] Unaligned checkpoints

Thomas Weise
+1

Thanks for putting this together, looking forward to the experimental
support in the next release.

One clarification: since the MVP won't support rescaling, does that imply
that savepoints will always use aligned checkpointing? If so, this would
still block the user from taking a savepoint and resuming with increased
parallelism to resolve a prolonged/permanent backpressure condition?

Thanks,
Thomas



Re: [VOTE] [FLIP-76] Unaligned checkpoints

Yu Li
+1 on the overall design and thanks for the efforts!

I totally agree with the plan of implementing the MVP first. However, since
the FLIP covers the whole feature rather than only the MVP, how about adding
a *Roadmap* or *Future Work* section to write down plans, including (but not
limited to):
* Dynamic switching between unaligned and aligned checkpoints
* Supporting local recovery

What do you think?

What's more, the existing PoC results for e2e checkpoint duration and
throughput look great, but the recovery/restore time is not mentioned. Do
we also aim to provide a recovery speed comparable to aligned checkpoints
in the MVP implementation? Hopefully we can (smile).

Best Regards,
Yu



Re: [VOTE] [FLIP-76] Unaligned checkpoints

Arvid Heise-3
In reply to this post by Thomas Weise
Hi Thomas,

It is as you said. The first version will not support rescaling and mostly
addresses the concerns about making little to no progress because of
frequent crashes.

The main reason is that we cannot guarantee the ordering of non-keyed data
(and even keyed data in some weird cases) when rescaling currently. We have
a general concept to address that, which would also enable dynamic
rescaling in the future, but that would make the changes even bigger and we
would not have any version ready for 1.11.

The current plan, of course, is to continue improving unaligned checkpoints
immediately after release, such that we have the full feature set for 1.12.
Potentially, unaligned checkpoints (with timeouts) would even become the
default option.


Re: [VOTE] [FLIP-76] Unaligned checkpoints

David Anderson-2
+1 I like where this is headed.

One question: during restore, it could happen that a new task manager is
configured with fewer or smaller buffers than was previously the case. How
will this be handled?

David



Re: [VOTE] [FLIP-76] Unaligned checkpoints

Piotr Nowojski-3
+1 (binding).

Piotrek


Re: [VOTE] [FLIP-76] Unaligned checkpoints

Zhijiang(wangzhijiang999)
+1 (binding).

As for David's concern about smaller buffers after recovery, I once wrote a draft design [1] to solve this issue.
You can take a look and leave comments if you still have concerns. :)

[1] https://docs.google.com/document/d/16_MOQymzxrKvUHXh6QFr2AAXIKt_2vPUf8vzKy4H_tU/edit

Best,
Zhijiang



Re: [VOTE] [FLIP-76] Unaligned checkpoints

Roman Khachatryan
+1 (non-binding)

Regarding Yu's suggestion about a *Roadmap* or *Future Work* section, I think
it's a good idea.
Currently, some MVP limitations are mentioned at the end of the document,
so we can extract and expand them.
As for the recovery speed, it's not a priority currently, but we could also
mention it in this section.



--
Regards,
Roman

Re: [VOTE] [FLIP-76] Unaligned checkpoints

Yun Gao
+1 (non-binding)
I think the PoC results have shown the reduction in checkpoint time when back-pressure occurs, and I totally agree that the feature could be implemented in steps.



Re: [VOTE] [FLIP-76] Unaligned checkpoints

Arvid Heise-3
I added a roadmap section to the FLIP as suggested by Yu and Roman.

Unless someone objects, I'd still consider the voting period to end
tomorrow. For me, the roadmap is only a clarification of already written
and discussed points.

We already have enough binding votes, but there may be concerns popping up
until tomorrow.


Re: [VOTE] [FLIP-76] Unaligned checkpoints

yingjie
In reply to this post by Arvid Heise-3
+1 (non-binding)

The checkpoint timeout is hard to tune in cases of backpressure. Our users
and I have spent a lot of time on that. It is great to have this feature.


Re: [VOTE] [FLIP-76] Unaligned checkpoints

Yu Li
In reply to this post by Arvid Heise-3
+1 (binding)

The updated FLIP doc LGTM. Thanks for addressing the comments, Arvid and
Roman.

Best Regards,
Yu


On Fri, 13 Mar 2020 at 03:48, Arvid Heise <[hidden email]> wrote:

> I added a roadmap section to the FLIP as suggested by Yu and Roman.
>
> Unless someone objects, I'd still consider the voting period to end
> tomorrow. For me, the roadmap is only a clarification of already written
> and discussed points.
>
> We already have enough binding votes, but there may be concerns popping up
> until tomorrow.
>
> On Thu, Mar 12, 2020 at 5:00 PM Yun Gao <[hidden email]>
> wrote:
>
> > +1 (non-binding)
> >      I think the PoC result has shown the effect on reducing checkpoint
> > time when back-pressure occurs, and I totally agree with that the feature
> > could be implemented in steps.
> >
> >
> > ------------------------------------------------------------------
> > From:Roman Khachatryan <[hidden email]>
> > Send Time:2020 Mar. 12 (Thu.) 01:33
> > To:dev <[hidden email]>; Zhijiang <[hidden email]>
> > Subject:Re: [VOTE] [FLIP-76] Unaligned checkpoints
> >
> > +1 (non-binding)
> >
> > Regarding Yu's suggestion about *Roadmap* or *Future Work* section, I
> think
> > it's a good idea.
> > Currently, some MVP limitations are mentioned at the end of the document,
> > so we can extract and expand it.
> > As for the recovery speed it's not a priority currently, but we could
> also
> > mention it in this section.
> >
> >
> > On Wed, Mar 11, 2020 at 4:11 PM Zhijiang <[hidden email]
> > .invalid>
> > wrote:
> >
> > > +1 (binding).
> > >
> > > As for David's concern of smaller buffers after recovery, I ever had a
> > > draft design [1] to solve this issue.
> > > You can take a look and leave comments if still have concerns. :)
> > >
> > > [1]
> > >
> >
> https://docs.google.com/document/d/16_MOQymzxrKvUHXh6QFr2AAXIKt_2vPUf8vzKy4H_tU/edit
> > >
> > > Best,
> > > Zhijiang
> > >
> > >
> > > ------------------------------------------------------------------
> > > From:Piotr Nowojski <[hidden email]>
> > > Send Time:2020 Mar. 11 (Wed.) 21:19
> > > To:dev <[hidden email]>
> > > Subject:Re: [VOTE] [FLIP-76] Unaligned checkpoints
> > >
> > > +1 (binding).
> > >
> > > Piotrek
> > >
> > > > On 11 Mar 2020, at 09:19, David Anderson <[hidden email]>
> wrote:
> > > >
> > > > +1 I like where this is headed.
> > > >
> > > > One question: during restore, it could happen that a new task manager
> > is
> > > > configured with fewer or smaller buffers than was previously the
> case.
> > > How
> > > > will this be handled?
> > > >
> > > > David
> > > >
> > > >
> > > > On Wed, Mar 11, 2020 at 8:31 AM Arvid Heise <[hidden email]>
> > wrote:
> > > >
> > > >> Hi Thomas,
> > > >>
> > > >> it's like you said. The first version will not support rescaling and
> > > mostly
> > > >> addresses the concerns about making little to no progress because of
> > > >> frequent crashes.
> > > >>
> > > >> The main reason is that we cannot guarantee the ordering of
> non-keyed
> > > data
> > > >> (and even keyed data in some weird cases) when rescaling currently.
> We
> > > have
> > > >> a general concept to address that, which would also enable dynamic
> > > >> rescaling in the future, but that would make the changes even bigger
> > > and we
> > > >> would not have any version ready for 1.11.
> > > >>
> > > >> The current plan, of course, is to continue improving unaligned
> > > checkpoints
> > > >> immediately after release, such that we have the full feature set
> for
> > > 1.12.
> > > >> Potentially, unaligned checkpoints (with timeouts) would even become
> > the
> > > >> default option.
> > > >>
> > > >> On Tue, Mar 10, 2020 at 11:14 PM Thomas Weise <[hidden email]>
> wrote:
> > > >>
> > > >>> +1
> > > >>>
> > > >>> Thanks for putting this together, looking forward to the
> experimental
> > > >>> support in the next release.
> > > >>>
> > > >>> One clarification: since the MVP won't support rescaling, does it
> > imply
> > > >>> that savepoints will always use aligned checkpointing? If so, this
> > > would
> > > >>> still block the user from taking a savepoint and resume with
> > increased
> > > >>> parallelism to resolve a prolonged/permanent backpressure
> condition?
> > > >>>
> > > >>> Thanks,
> > > >>> Thomas
> > > >>>
> > > >>>
> > > >>> On Tue, Mar 10, 2020 at 6:33 AM Arvid Heise <[hidden email]>
> > > wrote:
> > > >>>
> > > >>>> Hi all,
> > > >>>>
> > > >>>> I would like to start the vote for FLIP-76 [1], which is discussed
> > and
> > > >>>> reached a consensus in the discussion thread [2].
> > > >>>>
> > > >>>> The vote will be open until March. 13th (72h), unless there is an
> > > >>> objection
> > > >>>> or not enough votes.
> > > >>>>
> > > >>>> Thanks,
> > > >>>> Arvid
> > > >>>>
> > > >>>> [1]
> > > >>>>
> > > >>>>
> > > >>>
> > > >>
> > >
> >
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-76%3A+Unaligned+Checkpoints
> > > >>>> [2]
> > > >>>>
> > > >>>>
> > > >>>
> > > >>
> > >
> >
> http://apache-flink-mailing-list-archive.1008284.n3.nabble.com/DISCUSS-FLIP-76-Unaligned-checkpoints-td33651.html
> > > >>>>
> > > >>>
> > > >>
> > >
> > >
> >
> > --
> > Regards,
> > Roman
> >
> >
>

Re: [VOTE] [FLIP-76] Unaligned checkpoints

Arvid Heise-3
The voting period is now over, even allowing for the roadmap changes (I forgot
to close it on Friday because of all the Coronavirus chaos).

We have 4 binding votes (Thomas, Yu, Piotr, Zhijiang) and no objections, so
FLIP-76 passed.

Thank you very much for your feedback.

On Fri, Mar 13, 2020 at 11:08 AM Yu Li <[hidden email]> wrote:

> +1 (binding)
>
> The updated FLIP doc LGTM. Thanks for addressing the comments Arvid and
> Roman.
>
> Best Regards,
> Yu
>
>
> On Fri, 13 Mar 2020 at 03:48, Arvid Heise <[hidden email]> wrote:
>
> > I added a roadmap section to the FLIP as suggested by Yu and Roman.
> >
> > Unless someone objects, I'd still consider the voting period to end
> > tomorrow. For me, the roadmap is only a clarification of already written
> > and discussed points.
> >
> > We already have enough binding votes, but there may be concerns popping up
> > until tomorrow.
> >
> > On Thu, Mar 12, 2020 at 5:00 PM Yun Gao <[hidden email]> wrote:
> >
> > > +1 (non-binding)
> > > I think the PoC result has shown the effect on reducing checkpoint
> > > time when back-pressure occurs, and I totally agree that the feature
> > > could be implemented in steps.
> > >
> > >
> > > ------------------------------------------------------------------
> > > From:Roman Khachatryan <[hidden email]>
> > > Send Time:2020 Mar. 12 (Thu.) 01:33
> > > To:dev <[hidden email]>; Zhijiang <[hidden email]>
> > > Subject:Re: [VOTE] [FLIP-76] Unaligned checkpoints
> > >
> > > +1 (non-binding)
> > >
> > > Regarding Yu's suggestion about a *Roadmap* or *Future Work* section, I
> > > think it's a good idea.
> > > Currently, some MVP limitations are mentioned at the end of the document,
> > > so we can extract and expand them.
> > > As for the recovery speed, it's not a priority currently, but we could
> > > also mention it in this section.
> > >
> > >
> > > On Wed, Mar 11, 2020 at 4:11 PM Zhijiang <[hidden email].invalid> wrote:
> > >
> > > > +1 (binding).
> > > >
> > > > As for David's concern about smaller buffers after recovery, I once had
> > > > a draft design [1] to solve this issue.
> > > > You can take a look and leave comments if you still have concerns. :)
> > > >
> > > > [1]
> > > > https://docs.google.com/document/d/16_MOQymzxrKvUHXh6QFr2AAXIKt_2vPUf8vzKy4H_tU/edit
> > > >
> > > > Best,
> > > > Zhijiang
> > > >
> > > >
> > > > ------------------------------------------------------------------
> > > > From:Piotr Nowojski <[hidden email]>
> > > > Send Time:2020 Mar. 11 (Wed.) 21:19
> > > > To:dev <[hidden email]>
> > > > Subject:Re: [VOTE] [FLIP-76] Unaligned checkpoints
> > > >
> > > > +1 (binding).
> > > >
> > > > Piotrek
> > > >
> > > > > On 11 Mar 2020, at 09:19, David Anderson <[hidden email]> wrote:
> > > > >
> > > > > +1 I like where this is headed.
> > > > >
> > > > > > One question: during restore, it could happen that a new task manager
> > > > > > is configured with fewer or smaller buffers than was previously the
> > > > > > case. How will this be handled?
> > > > >
> > > > > David
> > > > >
> > > > >
> > > > > > On Wed, Mar 11, 2020 at 8:31 AM Arvid Heise <[hidden email]> wrote:
> > > > >
> > > > >> Hi Thomas,
> > > > >>
> > > > > >> it's like you said. The first version will not support rescaling and
> > > > > >> mostly addresses the concerns about making little to no progress
> > > > > >> because of frequent crashes.
> > > > > >>
> > > > > >> The main reason is that we cannot guarantee the ordering of non-keyed
> > > > > >> data (and even keyed data in some weird cases) when rescaling
> > > > > >> currently. We have a general concept to address that, which would also
> > > > > >> enable dynamic rescaling in the future, but that would make the changes
> > > > > >> even bigger and we would not have any version ready for 1.11.
> > > > > >>
> > > > > >> The current plan, of course, is to continue improving unaligned
> > > > > >> checkpoints immediately after release, such that we have the full
> > > > > >> feature set for 1.12. Potentially, unaligned checkpoints (with
> > > > > >> timeouts) would even become the default option.
> > > > >>
> > > > > >> On Tue, Mar 10, 2020 at 11:14 PM Thomas Weise <[hidden email]> wrote:
> > > > >>
> > > > >>> +1
> > > > >>>
> > > > > >>> Thanks for putting this together, looking forward to the experimental
> > > > > >>> support in the next release.
> > > > > >>>
> > > > > >>> One clarification: since the MVP won't support rescaling, does it
> > > > > >>> imply that savepoints will always use aligned checkpointing? If so,
> > > > > >>> this would still block the user from taking a savepoint and resume
> > > > > >>> with increased parallelism to resolve a prolonged/permanent
> > > > > >>> backpressure condition?
> > > > >>>
> > > > >>> Thanks,
> > > > >>> Thomas
> > > > >>>
> > > > >>>
> > > > > >>> On Tue, Mar 10, 2020 at 6:33 AM Arvid Heise <[hidden email]> wrote:
> > > > >>>
> > > > >>>> Hi all,
> > > > >>>>
> > > > > >>>> I would like to start the vote for FLIP-76 [1], which is discussed
> > > > > >>>> and reached a consensus in the discussion thread [2].
> > > > > >>>>
> > > > > >>>> The vote will be open until March. 13th (72h), unless there is an
> > > > > >>>> objection or not enough votes.
> > > > >>>>
> > > > >>>> Thanks,
> > > > >>>> Arvid
> > > > >>>>
> > > > >>>> [1]
> > > > >>>>
> > > > >>>>
> > > > >>>
> > > > >>
> > > >
> > >
> >
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-76%3A+Unaligned+Checkpoints
> > > > >>>> [2]
> > > > >>>>
> > > > >>>>
> > > > >>>
> > > > >>
> > > >
> > >
> >
> http://apache-flink-mailing-list-archive.1008284.n3.nabble.com/DISCUSS-FLIP-76-Unaligned-checkpoints-td33651.html
> > > > >>>>
> > > > >>>
> > > > >>
> > > >
> > > >
> > >
> > > --
> > > Regards,
> > > Roman
> > >
> > >
> >
>