[VOTE] FLIP-56: Dynamic Slot Allocation

Xintong Song
Hi all,

I would like to start the vote for FLIP-56 [1], on which consensus has been
reached in the discussion thread [2].

The vote will be open for at least 72 hours. I'll try to close it after
Sep. 20 15:00 UTC, unless there is an objection or not enough votes.

Thank you~

Xintong Song


[1]
https://cwiki.apache.org/confluence/display/FLINK/FLIP-56%3A+Dynamic+Slot+Allocation

[2]
http://apache-flink-mailing-list-archive.1008284.n3.nabble.com/DISCUSS-FLIP-56-Dynamic-Slot-Allocation-td31960.html

Re: [VOTE] FLIP-56: Dynamic Slot Allocation

Andrey Zagrebin-4
Hi Xintong,

Thanks for starting the vote, +1 from my side.

Best,
Andrey

On Tue, Sep 17, 2019 at 4:26 PM Xintong Song <[hidden email]> wrote:

> Hi all,
>
> I would like to start the vote for FLIP-56 [1], on which a consensus is
> reached in this discussion thread [2].
>
> The vote will be open for at least 72 hours. I'll try to close it after
> Sep. 20 15:00 UTC, unless there is an objection or not enough votes.
>
> Thank you~
>
> Xintong Song
>
>
> [1]
>
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-56%3A+Dynamic+Slot+Allocation
>
> [2]
>
> http://apache-flink-mailing-list-archive.1008284.n3.nabble.com/DISCUSS-FLIP-56-Dynamic-Slot-Allocation-td31960.html
>

Re: [VOTE] FLIP-56: Dynamic Slot Allocation

Till Rohrmann
Hi Xintong,

Thanks for starting the vote. The general plan looks good, hence +1 from my
side. I still have some minor comments one could think about:

* As we no longer have predetermined slots on the TaskExecutor, I think we
can get rid of the SlotID. Instead, an allocated slot will be identified by
the AllocationID and the TaskManager's ResourceID in order to differentiate
duplicate registrations.
* For the implementation plan, I believe there is only one tiny part of the
SlotManager for which we need a separate code path/feature flag, namely
how we find a matching slot. Everything else should be possible to
implement in a way that works with both dynamic and static slot allocation:
1. Let TMs register with their default slot profile at the RM
2. Change the SlotManager to use reported slot profiles instead of
pre-calculated profiles
3. Replace SlotID with SlotProfile in TaskExecutorGateway#requestSlot
4. Extend the TM to support dynamic slot allocation (aka proper bookkeeping)
(can happen concurrently with any of steps 2-3)
5. Add bookkeeping to the SlotManager (for pending TMs and registered TMs) but
still only use default slot profiles for matching with slot requests
6. Allow matching slot requests with reported resources instead of default
slot profiles (here we could use a feature flag to switch between dynamic
and static slot allocation)
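
To make the "single differing code path" idea above concrete, here is a
minimal Python sketch. It is not Flink code: the names `Profile`,
`RegisteredTaskManager`, `SlotManagerSketch` and the `dynamic` flag are
hypothetical and only model the idea. TMs register with a default slot
profile, allocated slots are keyed by (ResourceID, AllocationID) rather than
by a SlotID, and the only mode-dependent branch is how a matching slot is
chosen.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Tuple


@dataclass(frozen=True)
class Profile:
    """Hypothetical resource profile (CPU in millicores, memory in MB)."""
    cpu_millis: int
    memory_mb: int

    def fits_in(self, other: "Profile") -> bool:
        return (self.cpu_millis <= other.cpu_millis
                and self.memory_mb <= other.memory_mb)

    def minus(self, other: "Profile") -> "Profile":
        return Profile(self.cpu_millis - other.cpu_millis,
                       self.memory_mb - other.memory_mb)


@dataclass
class RegisteredTaskManager:
    default_slot: Profile  # reported at registration (step 1)
    free: Profile          # resources not yet handed out


class SlotManagerSketch:
    def __init__(self, dynamic: bool):
        self.dynamic = dynamic  # assumed feature flag from step 6
        self.task_managers: Dict[str, RegisteredTaskManager] = {}
        # Allocated slots are identified by (ResourceID, AllocationID).
        self.allocated: Dict[Tuple[str, str], Profile] = {}

    def register(self, resource_id: str, default_slot: Profile,
                 total: Profile) -> None:
        self.task_managers[resource_id] = RegisteredTaskManager(default_slot, total)

    def request_slot(self, allocation_id: str,
                     requested: Profile) -> Optional[str]:
        for resource_id, tm in self.task_managers.items():
            # The only branch that differs between static and dynamic mode.
            granted = requested if self.dynamic else tm.default_slot
            if requested.fits_in(granted) and granted.fits_in(tm.free):
                tm.free = tm.free.minus(granted)
                self.allocated[(resource_id, allocation_id)] = granted
                return resource_id
        return None  # no registered TM can serve the request
```

With `dynamic=False` a request is served with the TM's default slot profile;
with `dynamic=True` the same request path carves out exactly the requested
resources.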

Wdyt?

Cheers,
Till

On Thu, Sep 19, 2019 at 9:45 AM Andrey Zagrebin <[hidden email]>
wrote:

> Hi Xintong,
>
> Thanks for starting the vote, +1 from my side.
>
> Best,
> Andrey

Re: [VOTE] FLIP-56: Dynamic Slot Allocation

Xintong Song
Thanks for the comments, Till.

- Agree on removing SlotID.

- Regarding the implementation plan, it is true that we could reduce the
code separated by the feature option. But I think doing that would
introduce more dependencies between implementation steps. With the current
plan, we can easily separate the steps on the RM side and the TM side, and
start working on them concurrently after quickly updating the interfaces in
between. The feature will come alive when the steps on both the RM and TM
sides are finished. Since we are planning to have two people (Andrey and me)
working on this FLIP, I think the current plan is probably more convenient.

Thank you~

Xintong Song



On Thu, Sep 19, 2019 at 5:09 PM Till Rohrmann <[hidden email]> wrote:

> Hi Xintong,
>
> thanks for starting the vote. The general plan looks good. Hence +1 from my
> side. I still have some minor comments one could think about:
>
> * As we no longer have predetermined slots on the TaskExecutor, I think we
> can get rid of the SlotID. Instead, an allocated slot will be identified by
> the AllocationID and the TaskManager's ResourceID in order to differentiate
> duplicate registrations.
> * For the implementation plan, I believe there is only one tiny part on the
> SlotManager for which we need a separate code path/feature flag which is
> how we find a matching slot. Everything else should be possible to
> implement in a way that it works with dynamic and static slot allocation:
> 1. Let TMs register with default slot profile at RM
> 2. Change SlotManager to use reported slot profiles instead of
> pre-calculated profiles
> 3. Replace SlotID with SlotProfile in TaskExecutorGateway#requestSlot
> 4. Extend TM to support dynamic slot allocation (aka proper bookkeeping)
> (can happen concurrently to any of steps 2-3)
> 5. Add bookkeeping to SlotManager (for pending TMs and registered TMs) but
> still only use default slot profiles for matching with slot requests
> 6. Allow to match slot requests with reported resources instead of default
> slot profiles (here we could use a feature flag to switch between dynamic
> and static slot allocation)
>
> Wdyt?
>
> Cheers,
> Till

Re: [VOTE] FLIP-56: Dynamic Slot Allocation

Till Rohrmann
I think that, apart from steps 1 and 3, there are no dependencies between the
RM and TM side changes. Also, I'm not sure whether it makes sense to split the
slot manager changes up into the proposed steps 5, 6 and 7.

I would strongly recommend not adding too much duplicate logic or too many
separate code paths, because they just add blind spots that are probably not
as well tested as the old code paths.

Cheers,
Till

On Thu, Sep 19, 2019 at 11:58 AM Xintong Song <[hidden email]> wrote:

> Thanks for the comments, Till.
>
> - Agree on removing SlotID.
>
> - Regarding the implementation plan, it is true that we can possibly reduce
> codes separated by the feature option. But I think to do that we need to
> introduce more dependencies between implementation steps. With the current
> plan, we can easily separate steps on the RM side and the TM side, and
> start concurrently working on them after quickly updating the interfaces in
> between. The feature will come alive when the steps on both RM/TM sides are
> finished. Since we are planning to have two persons (Andrey and I) working
> on this FLIP, I think the current plan is probably more convenient.
>
> Thank you~
>
> Xintong Song

Re: [VOTE] FLIP-56: Dynamic Slot Allocation

Xintong Song
I'm not sure I understand the implementation plan you suggested
correctly. As I understand it, all the steps except for step 5 have to
happen in strict order.

   - The profiles to be used in step 2 are reported in step 1.
   - The SlotProfile in TaskExecutorGateway#requestSlot in step 3 comes from
   the profiles used in step 2.
   - Only if the RM requests slots from the TM with profiles (step 3) would
   the TM be able to do the proper bookkeeping (step 4).
   - Step 5 can be done as long as we have step 2.
   - Step 6 relies on both step 4 and step 5, for proper bookkeeping on
   both the TM and RM sides before enabling non-default profiles.

That means we can only work on the steps in the following order.
1-2-3-4-6
   \-5-/
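
The ordering constraints listed above can be checked mechanically. The
following Python sketch is illustrative only: the function name
`parallel_waves` and the `STEP_DEPENDENCIES` table are assumptions that
simply encode the bullets above. It groups the six steps into waves, where
every step in a wave depends only on steps from earlier waves.

```python
from typing import Dict, Iterable, List


def parallel_waves(depends_on: Dict[int, Iterable[int]]) -> List[List[int]]:
    """Group steps into waves that could be worked on concurrently."""
    remaining = {step: set(deps) for step, deps in depends_on.items()}
    waves: List[List[int]] = []
    while remaining:
        # A step is ready once all of its prerequisites have been completed.
        ready = sorted(step for step, deps in remaining.items() if not deps)
        if not ready:
            raise ValueError("dependency cycle between steps")
        waves.append(ready)
        for step in ready:
            del remaining[step]
        for deps in remaining.values():
            deps.difference_update(ready)
    return waves


# Assumed encoding of the bullets: step -> steps it depends on.
STEP_DEPENDENCIES = {1: [], 2: [1], 3: [2], 4: [3], 5: [2], 6: [4, 5]}

if __name__ == "__main__":
    print(parallel_waves(STEP_DEPENDENCIES))
```

Under these assumed dependencies the result is `[[1], [2], [3, 5], [4], [6]]`:
only steps 3 and 5 can proceed in parallel, which matches the diagram.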

What I'm trying to achieve with the current plan is to have most of the
implementation steps parallelized, as follows, so that Andrey and I can
work concurrently without blocking each other too much.
1-2-3-4
   \5-6-7


I also agree that it would be good not to add too many separate code paths. I
would suggest leaving that decision to implementation time. E.g., if by
the time we do the TM-side bookkeeping the RM side has already implemented
requesting slots with profiles, then we do not need to separate the code
paths.


To that end, I think it makes sense to adjust steps 5-7 to first use default
slot resource profiles for all the bookkeeping, and replace them with the
requested profiles at the end.
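
As a rough illustration of that adjustment, here is a minimal Python sketch
of TM-side bookkeeping. It is not Flink code, and `ResourceProfile` and
`TaskManagerBookkeeping` are hypothetical names. The bookkeeping falls back
to the default slot profile when no profile is requested, so switching to
requested profiles later only changes what the caller passes in.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass(frozen=True)
class ResourceProfile:
    """Hypothetical profile (CPU in millicores, memory in MB)."""
    cpu_millis: int
    memory_mb: int


class TaskManagerBookkeeping:
    def __init__(self, total: ResourceProfile, default_slot: ResourceProfile):
        self.available = total
        self.default_slot = default_slot
        self.slots: Dict[str, ResourceProfile] = {}  # AllocationID -> profile

    def allocate(self, allocation_id: str,
                 requested: Optional[ResourceProfile] = None) -> bool:
        # First stage: callers pass nothing and the default profile is used.
        # Final stage: callers pass the requested profile instead.
        profile = requested if requested is not None else self.default_slot
        if allocation_id in self.slots:
            return False  # this allocation is already tracked
        if (profile.cpu_millis > self.available.cpu_millis
                or profile.memory_mb > self.available.memory_mb):
            return False  # not enough free resources on this TM
        self.available = ResourceProfile(
            self.available.cpu_millis - profile.cpu_millis,
            self.available.memory_mb - profile.memory_mb)
        self.slots[allocation_id] = profile
        return True

    def free(self, allocation_id: str) -> None:
        # Return the slot's resources to the free pool.
        profile = self.slots.pop(allocation_id)
        self.available = ResourceProfile(
            self.available.cpu_millis + profile.cpu_millis,
            self.available.memory_mb + profile.memory_mb)
```

Calling `allocate("a1")` books a default-sized slot, while
`allocate("a2", ResourceProfile(2500, 1024))` books exactly the requested
amount through the same code path.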


What do you think?


Thank you~

Xintong Song



On Thu, Sep 19, 2019 at 7:59 PM Till Rohrmann <[hidden email]> wrote:

> I think besides of point 1. and 3. there are no dependencies between the RM
> and TM side changes. Also, I'm not sure whether it makes sense to split the
> slot manager changes up into the proposed steps 5, 6 and 7.
>
> I would highly recommend to not add too much duplicate logic/separate code
> paths because it just adds blind spots which are probably not as well
> tested as the old code paths.
>
> Cheers,
> Till

Re: [VOTE] FLIP-56: Dynamic Slot Allocation

Xintong Song
@Till @Andrey

Following the comments, I just updated the FLIP document [1] with the
following changes:

   - Removed SlotID (in the section Protocol Changes)
   - Updated the implementation steps to reduce separate code paths. As far as
   I can see at the moment, we do not need the feature option. We can add it
   later if we find it necessary during implementation.


Thank you~

Xintong Song


[1]
https://cwiki.apache.org/confluence/display/FLINK/FLIP-56%3A+Dynamic+Slot+Allocation

On Fri, Sep 20, 2019 at 11:01 AM Xintong Song <[hidden email]> wrote:

> I'm not sure if I understand the implementation plan you suggested
> correctly. To my understanding, it seems that all the steps except for step
> 5 have to happen in strict order.
>
>    - Profiles to be used in step 2 is reported with step 1.
>    - SlotProfile in TaskExecutorGateway#requestSlot in step 3 comes from
>    profiles used in step 2.
>    - Only if RM request slots from TM with profiles (step 3), would TM be
>    able to do the proper bookkeeping (step 4)
>    - Step 5 can be done as long as we have step 2.
>    - Step 6 relies on both step 4  and step 5, for proper bookkeepings on
>    both TM and RM sides before enabling non-default profiles.
>
> That means we can only work on the steps in the following order.
> 1-2-3-4-6
>    \-5-/
>
> What I'm trying to achieve with the current plan, is to have most of the
> implementation steps paralleled, as the following. So that Andrey and I can
> work concurrently without blocking each other too much.
> 1-2-3-4
>    \5-6-7
>
>
> I also agree that it would be good to not add too much separate codes. I
> would suggest leave that decision to the implementation time. E.g., if by
> the time we do the TM side bookkeeping, the RM side has already implemented
> requesting slots with profiles, then we do not need to separate the code
> paths.
>
>
> To that end, I think it makes sense to adjust step 5-7 to first use
> default slot resource profiles for all the bookkeepings, and replace it
> with the requested profiles at the end.
>
>
> What do you think?
>
>
> Thank you~
>
> Xintong Song

Re: [VOTE] FLIP-56: Dynamic Slot Allocation

Till Rohrmann
Thanks for updating the FLIP. It looks good to me.

+1 (binding)

Cheers,
Till

On Mon, Sep 23, 2019 at 4:12 PM Xintong Song <[hidden email]> wrote:

> @Till @Andrey
>
> According to the comments, I just updated the FLIP document [1], with the
> following changes:
>
>    - Remove SlotID (in the section Protocol Changes)
>    - Updated implementation steps to reduce separated code paths. As far as
>    I can see at the moment, we do not need the feature option. We can add it
>    if later we find it necessary in the implementation.
>
>
> Thank you~
>
> Xintong Song
>
>
> [1]
>
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-56%3A+Dynamic+Slot+Allocation

Re: [VOTE] FLIP-56: Dynamic Slot Allocation

Gary Yao-3
Hi Xintong,

Thanks for starting the vote. The proposed changes look good to me.

+1 (binding)

Best,
Gary


Re: [VOTE] FLIP-56: Dynamic Slot Allocation

Kurt Young
In reply to this post by Till Rohrmann
If possible, I would suggest adding a section to this doc to
emphasize that the current design has a prerequisite: each job
should have either all of its operators using the unknown resource
profile or all of them using a specified amount of resources. This
would make the document easier to understand.

(I was confused by this and only realized it after talking to Xintong
offline)
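
The prerequisite above can be illustrated with a minimal sketch. It is only an illustration under stated assumptions, not code from the FLIP or from Flink: `OperatorSpec` and `isConsistent` are hypothetical names, and a `null` CPU value is merely a stand-in for an "unknown" resource profile.

```java
import java.util.Arrays;
import java.util.List;

public class ResourceProfileCheck {

    /** Hypothetical stand-in for an operator's resource declaration (not a Flink class). */
    static final class OperatorSpec {
        final String name;
        final Double cpuCores; // null models an "unknown" resource profile (assumption)

        OperatorSpec(String name, Double cpuCores) {
            this.name = name;
            this.cpuCores = cpuCores;
        }

        boolean isUnknown() {
            return cpuCores == null;
        }
    }

    /** True if every operator is unknown or every operator is specified; false for a mix. */
    static boolean isConsistent(List<OperatorSpec> operators) {
        long unknown = operators.stream().filter(OperatorSpec::isUnknown).count();
        return unknown == 0 || unknown == operators.size();
    }

    public static void main(String[] args) {
        List<OperatorSpec> allUnknown = Arrays.asList(
                new OperatorSpec("source", null), new OperatorSpec("sink", null));
        List<OperatorSpec> mixed = Arrays.asList(
                new OperatorSpec("source", 1.0), new OperatorSpec("sink", null));

        System.out.println(isConsistent(allUnknown)); // all unknown: accepted
        System.out.println(isConsistent(mixed));      // mixed: rejected
    }
}
```

A job that mixes the two kinds of operators would fail such a check, which is the case the design does not cover.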

But still I would +1 for this.

Best,
Kurt



Re: [VOTE] FLIP-56: Dynamic Slot Allocation

Xintong Song
Thanks for the votes, Gary and Kurt.

@Kurt
Sorry for the confusion. I've added a clarification in the section "Unknown
Resource Requirement".

And +1 (non-binding) from my side.

Thank you~

Xintong Song




Re: [VOTE] FLIP-56: Dynamic Slot Allocation

Xintong Song
Thanks all for the votes.

So far, we have

   - 4 binding +1 votes (Till, Andrey, Gary and Kurt)
   - 1 non-binding +1 vote (Xintong)
   - No -1 votes

There are more than 3 binding +1 votes and no -1 votes, and the voting time
has passed. According to the community bylaws, I'm glad to announce that
FLIP-56 is approved for adoption by Apache Flink.

Thank you~

Xintong Song


