Hi everyone,
We would like to start a discussion thread on "FLIP-56: Dynamic Slot Allocation" [1]. This was originally part of the discussion thread for "FLIP-53: Fine Grained Resource Management" [2]. As Till suggested, we would like to split the original discussion into two topics, and start a separate discussion thread as well as a FLIP process for this one.

Thank you~

Xintong Song

[1] https://cwiki.apache.org/confluence/display/FLINK/FLIP-56%3A+Dynamic+Slot+Allocation
[2] http://apache-flink-mailing-list-archive.1008284.n3.nabble.com/DISCUSS-FLIP-53-Fine-Grained-Resource-Management-td31831.html
We suddenly skipped FLIP-55 lol.
On Mon, Aug 19, 2019 at 10:23 PM Xintong Song <[hidden email]> wrote:
> We would like to start a discussion thread on "FLIP-56: Dynamic Slot Allocation" [1].
@Zili
As far as I know, Timo is drafting a FLIP that has taken the number 55. There is a running number maintained on the FLIP wiki page [1] that shows which number the next new FLIP should use, and whoever takes a number for a new FLIP should increment it.

Thank you~

Xintong Song

[1] https://cwiki.apache.org/confluence/display/FLINK/Flink+Improvement+Proposals

On Tue, Aug 20, 2019 at 3:28 AM Zili Chen <[hidden email]> wrote:
> We suddenly skipped FLIP-55 lol.
Added implementation steps for this FLIP on the wiki page [1].
Thank you~

Xintong Song

[1] https://cwiki.apache.org/confluence/display/FLINK/FLIP-56%3A+Dynamic+Slot+Allocation

On Tue, Aug 20, 2019 at 3:43 PM Xintong Song <[hidden email]> wrote:
> @Zili
> As far as I know, Timo is drafting a FLIP that has taken the number 55.
Thanks for the update, Xintong. From a high-level perspective, the implementation plan looks good to me.

Cheers,
Till

On Thu, Sep 12, 2019 at 11:04 AM Xintong Song <[hidden email]> wrote:
> Added implementation steps for this FLIP on the wiki page [1].
Hi Xintong, thanks for the great proposal. Big +1 for the feature! It is something like going from mapreduce-1.0 to mapreduce-2.0.

I like the design on the whole. One point may need to be included in the proposal: how do we deal with slot sharing groups under dynamic slot allocation? The handling can be quite different with dynamic slot allocation.

On Fri, 13 Sep 2019 at 16:42, Till Rohrmann <[hidden email]> wrote:
> Thanks for the update Xintong. From a high level perspective the implementation plan looks good to me.
Thanks for the comments, Till and Wenlong.
@Wenlong
Regarding slot sharing, the general idea is to request a slot with resources for the tasks of the entire slot sharing group. Details on how to decide the slot sharing groups and how to manage task resources within the shared slots can be found in FLIP-53 [1].

Thank you~

Xintong Song

On Mon, Sep 16, 2019 at 10:42 AM wenlong.lwl <[hidden email]> wrote:
> Hi, Xintong, thanks for the great proposal. big +1 for the feature!
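To illustrate the slot sharing answer above, here is a minimal sketch. It assumes that the slot requested for a slot sharing group is sized as the component-wise sum of its tasks' requirements; the authoritative rules are in FLIP-53. `SimpleResource` and `SlotSharingSketch` are hypothetical names made up for this example and are not Flink classes.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical, simplified stand-in for a resource profile (not Flink's ResourceProfile).
final class SimpleResource {
    final double cpuCores;
    final int memoryMb;

    SimpleResource(double cpuCores, int memoryMb) {
        this.cpuCores = cpuCores;
        this.memoryMb = memoryMb;
    }

    // Returns a new resource that is the component-wise sum of this and other.
    SimpleResource plus(SimpleResource other) {
        return new SimpleResource(cpuCores + other.cpuCores, memoryMb + other.memoryMb);
    }
}

final class SlotSharingSketch {
    // Assumption: one slot is requested per slot sharing group, sized as the
    // sum of the requirements of all tasks in that group.
    static SimpleResource slotRequestFor(List<SimpleResource> tasksInGroup) {
        SimpleResource total = new SimpleResource(0.0, 0);
        for (SimpleResource task : tasksInGroup) {
            total = total.plus(task);
        }
        return total;
    }

    public static void main(String[] args) {
        List<SimpleResource> group = Arrays.asList(
                new SimpleResource(0.5, 256),
                new SimpleResource(1.0, 512),
                new SimpleResource(0.5, 128));
        SimpleResource request = slotRequestFor(group);
        System.out.println(request.cpuCores + " cores, " + request.memoryMb + " MB");
    }
}
```

With dynamic slot allocation, a request of this size could then be cut out of a TM's available resources instead of being matched against a fixed-size slot.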
Hi Xintong,
Thanks for sharing the implementation steps. I also think they make sense with the feature option.

I was wondering if we could order the steps so that each change does not affect other components too much and we always have a working system; then the feature option might not always need to split the code. Here are some thoughts.

- We could do the default slot profile first and include it in the TM registration. I would suggest adding it to ResourceManagerGateway#registerTaskExecutor, not sendSlotReport. This way the RM knows about it but does not use it at this point. (parts of steps 4 and 6)

- We could try to do step 3 first, in a way that it also supports the current way of allocation in TaskExecutorGateway#requestSlot with the default slot profile, and sends reports both with the available resources and with the free default slots that correspond to the available resources. We can just remove the free default slots later. The new TaskExecutorGateway#requestResource could also be implemented here but not used yet.

- Then step 5 can use the new TaskExecutorGateway#requestResource and the default slot profile.

- I am not sure whether steps 5 and 7 can be implemented independently without regressing what we have. Maybe if we do step 7 first, it will have only default slots at first, which will simplify step 5 later.

Best,
Andrey

On Mon, Sep 16, 2019 at 5:53 AM Xintong Song <[hidden email]> wrote:
> Thanks for the comments, Till and Wenlong.
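As a rough illustration of the second point in Andrey's message (reporting free default slots that correspond to the available resources), the sketch below derives a free default slot count from a TM's remaining resources. The class name, the parameters, and the rule "as many default slots as fit in every resource dimension" are assumptions made for this example; the thread does not define how the correspondence is computed.

```java
// Hypothetical helper; names and the min-over-dimensions rule are assumptions,
// not part of Flink's API or of the FLIP text.
final class DefaultSlotReportSketch {

    // Returns how many slots of the default profile still fit into the
    // resources that are currently available on a TM.
    static int freeDefaultSlots(double availableCpuCores, int availableMemoryMb,
                                double defaultSlotCpuCores, int defaultSlotMemoryMb) {
        if (defaultSlotCpuCores <= 0 || defaultSlotMemoryMb <= 0) {
            throw new IllegalArgumentException("default slot profile must be positive");
        }
        int fitByCpu = (int) Math.floor(availableCpuCores / defaultSlotCpuCores);
        int fitByMemory = availableMemoryMb / defaultSlotMemoryMb;
        // The scarcest dimension limits the number of default slots.
        return Math.max(0, Math.min(fitByCpu, fitByMemory));
    }

    public static void main(String[] args) {
        // 3 cores and 2048 MB left, default slot = 1 core and 512 MB.
        System.out.println(freeDefaultSlots(3.0, 2048, 1.0, 512));
    }
}
```

A report built this way would let the existing slot-based allocation path keep working while the resource-based path is introduced alongside it.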
Thanks for the comments, Andrey.
- I agree that instead of ResourceManagerGateway#sendSlotReport, we should add the default slot resource profile to ResourceManagerGateway#registerTaskExecutor.

- If I understand correctly, the reason you suggest doing the default slot resource profile first, and then doing step 3 in a way that supports both TaskExecutorGateway#requestSlot and TaskExecutorGateway#requestResource, is to avoid splitting code paths with the feature option? I think we can do that, but I also want to point out that this can only reduce the code split by the feature option (which is good), not eliminate it. We still need the feature option for the fundamental differences, e.g. creating new SlotIDs on allocation vs. allocating to free slots with existing SlotIDs.

- I don't really think we can do steps 5, 6 and 7 independently. Basically, they all make changes to the same component. We can probably do steps 6 and 7 independently of each other, but I think they both depend on step 5.

In general, I would say it's good to have as little code as possible split by the feature option, which makes the later clean-up easier. But if that cannot be done easily, I would rather not put too much effort into a good abstraction and deduplication between the new code path and the original one, which we will be removing soon.

What do you think?

Thank you~

Xintong Song

On Mon, Sep 16, 2019 at 5:59 PM Andrey Zagrebin <[hidden email]> wrote:
> Thanks for sharing the implementation steps.
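The fundamental difference mentioned above (creating new SlotIDs on allocation vs. allocating to free slots with existing SlotIDs) can be sketched as follows. This is only an illustration under simplifying assumptions: slots are described by memory alone, slot IDs are plain integers, and `SlotAllocationSketch` with its `dynamicSlotAllocation` flag is a hypothetical stand-in for the feature option, not actual TaskExecutor code.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Hypothetical sketch of the two allocation modes behind one feature option.
final class SlotAllocationSketch {
    private final boolean dynamicSlotAllocation;   // stand-in for the feature option
    private final int defaultSlotMemoryMb;
    private final Deque<Integer> freeStaticSlotIds = new ArrayDeque<>();
    private int availableMemoryMb;
    private int nextDynamicSlotId = 0;

    // Assumes numDefaultSlots > 0.
    SlotAllocationSketch(boolean dynamicSlotAllocation, int totalMemoryMb, int numDefaultSlots) {
        this.dynamicSlotAllocation = dynamicSlotAllocation;
        this.availableMemoryMb = totalMemoryMb;
        this.defaultSlotMemoryMb = totalMemoryMb / numDefaultSlots;
        if (!dynamicSlotAllocation) {
            // Original mode: slot IDs exist up front, one per default slot.
            for (int id = 0; id < numDefaultSlots; id++) {
                freeStaticSlotIds.add(id);
            }
        }
    }

    // Returns the ID of the allocated slot, or -1 if the request cannot be served.
    int allocate(int requestedMemoryMb) {
        if (dynamicSlotAllocation) {
            // New mode: a slot ID is created at allocation time, sized to the request.
            if (requestedMemoryMb > availableMemoryMb) {
                return -1;
            }
            availableMemoryMb -= requestedMemoryMb;
            return nextDynamicSlotId++;
        }
        // Original mode: hand out a pre-existing free slot of the default size.
        if (freeStaticSlotIds.isEmpty() || requestedMemoryMb > defaultSlotMemoryMb) {
            return -1;
        }
        availableMemoryMb -= defaultSlotMemoryMb;
        return freeStaticSlotIds.poll();
    }
}
```

Even in this toy version the two branches share little beyond the bookkeeping of available memory, which is the kind of split the feature option would still have to cover.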
One thing which was briefly mentioned in the FLIP but not in the implementation plan is the update of the web UI. I think it is worth adding an extra item for updating the web UI so that it properly displays the resources a TM still has to offer with dynamic slot allocation. I guess we need to pull in some JavaScript help in order to implement this step.

Cheers,
Till

On Mon, Sep 16, 2019 at 2:15 PM Xintong Song <[hidden email]> wrote:
> Thanks for the comments, Andrey.
@Xintong
Thanks for the feedback.

Just to clarify step 6:
If the first point is done before step 5 (e.g. as part of step 4), then it only keeps the info about the default slot in the RM's data structure associated with the TM, with no real change in behaviour.
Once this info is available, I think it can be used straightforwardly during step 5, where we get either a concrete slot requirement or the unknown one (step 6, point 2), which simply grabs one of the concrete default slots (btw, it is not clear which one; it seems to be picked at random?).

For steps 5 and 7, true, it is not quite clear whether we can avoid some split, e.g. after step 5 and before step 7.
I agree that we should introduce the feature flag if we clearly see that it would be a bigger effort without the flag.

Best,
Andrey
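To make the step 6 clarification above concrete, here is a minimal Python sketch (not Flink code) of the idea: the RM records each TM's default slot profile at registration, and a request with an unknown requirement is resolved to that default, while a concrete requirement is used as is. `ResourceProfile`, `SlotBookkeeping`, `register_task_executor` and `resolve` are hypothetical names chosen for illustration. Which default slot gets picked is left open here, as it is in the discussion.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass(frozen=True)
class ResourceProfile:
    """Hypothetical stand-in for a slot resource profile."""
    cpu_cores: float
    memory_mb: int


class SlotBookkeeping:
    """Hypothetical RM-side record of each TM's default slot profile."""

    def __init__(self) -> None:
        self._default_profiles: Dict[str, ResourceProfile] = {}

    def register_task_executor(self, tm_id: str, default_profile: ResourceProfile) -> None:
        # Registration only stores the info; nothing else changes at this point.
        self._default_profiles[tm_id] = default_profile

    def resolve(self, tm_id: str, requested: Optional[ResourceProfile]) -> ResourceProfile:
        # None models the "unknown" requirement: fall back to the TM's default profile.
        if requested is None:
            return self._default_profiles[tm_id]
        # A concrete requirement is passed through unchanged.
        return requested
```

With this shape, storing the profile early (as part of step 4) does not alter allocation behaviour until something actually calls `resolve`.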
@Till
Thanks for the reminder. I'll add a step for updating the web UI, and I'll try to involve Lining to help us with this step.

@Andrey
I was thinking that after we define the RM-TM interfaces in step 2, it would be good to work on the RM and TM sides concurrently. But yes, if we finish step 4 early, it would make step 6 easier. We could also start adding some IT/E2E tests once the default slot resource profiles are available.

Thank you~

Xintong Song
The implementation plan [1] is updated, with the following changes:
- Add the default slot resource profile to ResourceManagerGateway#registerTaskExecutor rather than #sendSlotReport.
- Swap the steps 'TaskExecutor derive and register with default slot resource profile' and 'Extend TaskExecutor to support dynamic slot allocation'.
- Add a step for updating the REST API / Web UI.

Thank you~

Xintong Song

[1] https://cwiki.apache.org/confluence/display/FLINK/FLIP-56%3A+Dynamic+Slot+Allocation
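A rough Python sketch of the registration change listed above, under stated assumptions: the TaskExecutor derives a default slot profile and sends it along with its registration rather than with a slot report. Dividing the total resources evenly by the configured number of slots is an assumption made for this sketch, and `ResourceProfile`, `TaskExecutorRegistration`, `derive_default_slot_profile` and `build_registration` are hypothetical names, not Flink APIs.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ResourceProfile:
    """Hypothetical stand-in for a slot resource profile."""
    cpu_cores: float
    memory_mb: int


@dataclass(frozen=True)
class TaskExecutorRegistration:
    """Hypothetical registration payload that carries the default slot profile."""
    tm_id: str
    total: ResourceProfile
    default_slot: ResourceProfile


def derive_default_slot_profile(total: ResourceProfile, num_slots: int) -> ResourceProfile:
    # Assumption: the default slot is an even share of the TM's total resources.
    if num_slots <= 0:
        raise ValueError("num_slots must be positive")
    return ResourceProfile(total.cpu_cores / num_slots, total.memory_mb // num_slots)


def build_registration(tm_id: str, total: ResourceProfile, num_slots: int) -> TaskExecutorRegistration:
    # The default profile travels with the registration, not with a later slot report.
    return TaskExecutorRegistration(tm_id, total, derive_default_slot_profile(total, num_slots))
```

For example, `build_registration("tm-1", ResourceProfile(4.0, 4096), 4)` would carry a default slot of 1.0 CPU core and 1024 MB in this model.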
Thanks for the update @Xintong.
I would be OK with starting the vote.

Best,
Andrey
> >> > > > > > > > > > >> > > > > > > > > Thank you~ > >> > > > > > > > > > >> > > > > > > > > Xintong Song > >> > > > > > > > > > >> > > > > > > > > > >> > > > > > > > > [1] > >> > > > > > > > > > >> > > > > > > > > >> > > > > > > > >> > > > > > > >> > > > > > >> > > > > >> > > > >> > > >> > https://cwiki.apache.org/confluence/display/FLINK/Flink+Improvement+Proposals > >> > > > > > > > > > >> > > > > > > > > On Tue, Aug 20, 2019 at 3:28 AM Zili Chen < > >> > > [hidden email]> > >> > > > > > > wrote: > >> > > > > > > > > > >> > > > > > > > >> We suddenly skipped FLIP-55 lol. > >> > > > > > > > >> > >> > > > > > > > >> > >> > > > > > > > >> Xintong Song <[hidden email]> 于2019年8月19日周一 > >> > 下午10:23写道: > >> > > > > > > > >> > >> > > > > > > > >> > Hi everyone, > >> > > > > > > > >> > > >> > > > > > > > >> > We would like to start a discussion thread on > "FLIP-56: > >> > > > Dynamic > >> > > > > > Slot > >> > > > > > > > >> > Allocation" [1]. This is originally part of the > >> discussion > >> > > > > thread > >> > > > > > > for > >> > > > > > > > >> > "FLIP-53: Fine Grained Resource Management" [2]. As > >> Till > >> > > > > > suggested, > >> > > > > > > we > >> > > > > > > > >> > would like split the original discussion into two > >> topics, > >> > > and > >> > > > > > start > >> > > > > > > a > >> > > > > > > > >> > separate new discussion thread as well as FLIP > process > >> for > >> > > > this > >> > > > > > one. 
> >> > > > > > > > >> > > >> > > > > > > > >> > Thank you~ > >> > > > > > > > >> > > >> > > > > > > > >> > Xintong Song > >> > > > > > > > >> > > >> > > > > > > > >> > > >> > > > > > > > >> > [1] > >> > > > > > > > >> > > >> > > > > > > > >> > > >> > > > > > > > >> > >> > > > > > > > > >> > > > > > > > >> > > > > > > >> > > > > > >> > > > > >> > > > >> > > >> > https://cwiki.apache.org/confluence/display/FLINK/FLIP-56%3A+Dynamic+Slot+Allocation > >> > > > > > > > >> > > >> > > > > > > > >> > [2] > >> > > > > > > > >> > > >> > > > > > > > >> > > >> > > > > > > > >> > >> > > > > > > > > >> > > > > > > > >> > > > > > > >> > > > > > >> > > > > >> > > > >> > > >> > http://apache-flink-mailing-list-archive.1008284.n3.nabble.com/DISCUSS-FLIP-53-Fine-Grained-Resource-Management-td31831.html > >> > > > > > > > >> > > >> > > > > > > > >> > >> > > > > > > > > > >> > > > > > > > > >> > > > > > > > >> > > > > > > >> > > > > > >> > > > > >> > > > >> > > >> > > > |
Thanks for the feedback, Andrey.
I'll start the vote.

Thank you~

Xintong Song |
Sorry if this question has been addressed before; if so, please point me to the reference.

How do we limit the CPU usage of a slot? Does the thread that executes the slot get paused when it uses more CPU cycles than it requested?

--
Regards,
Tao |
@tao
I don't think we can limit the CPU usage of a slot, nor isolate CPU usage between slots. We do have CPU limits for the task executor as a whole in some scenarios, such as on Yarn with strict cgroup mode.

The purpose of bookkeeping and dynamically allocating CPU cores is to prevent scheduling tasks with too much computation load onto a task executor, rather than to limit the CPU usage of each slot.

Thank you~

Xintong Song

On Wed, Sep 18, 2019 at 12:18 AM tao xiao <[hidden email]> wrote:
> Sorry if I ask a question that has been addressed before. please point me to the reference.
>
> How do we limit the cpu usage to a slot? Does the thread that executes the slot get paused when it uses CPU cycles more than it requests? |
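To illustrate the distinction drawn in the reply above, here is a minimal sketch of bookkeeping without enforcement. It is not Flink code: `CpuBookkeeper`, `tryAllocate`, `free` and `availableCores` are hypothetical names chosen for illustration. The only thing the sketch does is refuse a new slot when the requested cores no longer fit; nothing in it throttles or pauses a running task.

```java
import java.util.HashMap;
import java.util.Map;

/**
 * Hypothetical CPU bookkeeping for a single task executor.
 * Assumption: this is an illustration only, not a class from Flink.
 */
class CpuBookkeeper {
    private final double totalCores;
    private double reservedCores = 0.0;
    // Cores reserved per slot, keyed by a made-up slot identifier.
    private final Map<String, Double> reservedBySlot = new HashMap<>();

    CpuBookkeeper(double totalCores) {
        this.totalCores = totalCores;
    }

    /**
     * Admission check only: records the reservation if it fits and
     * returns false otherwise. Actual CPU usage is never measured or limited.
     */
    boolean tryAllocate(String slotId, double cores) {
        if (cores <= 0 || reservedBySlot.containsKey(slotId)) {
            return false;
        }
        if (reservedCores + cores > totalCores) {
            return false; // would overbook the task executor
        }
        reservedBySlot.put(slotId, cores);
        reservedCores += cores;
        return true;
    }

    /** Releases the reservation of a slot, if one exists. */
    void free(String slotId) {
        Double cores = reservedBySlot.remove(slotId);
        if (cores != null) {
            reservedCores -= cores;
        }
    }

    double availableCores() {
        return totalCores - reservedCores;
    }
}

class CpuBookkeeperDemo {
    public static void main(String[] args) {
        CpuBookkeeper tm = new CpuBookkeeper(4.0);
        System.out.println(tm.tryAllocate("slot-a", 2.5));
        System.out.println(tm.tryAllocate("slot-b", 2.0));
        System.out.println(tm.availableCores());
    }
}
```

In this sketch a task in `slot-a` could still consume more than 2.5 cores at runtime; any hard limit would have to come from outside, for example from a container-level limit on the whole task executor.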
That makes sense. I suggest we add a note to the FLIP to avoid confusion.
-- Regards, Tao |
In reply to this post by Xintong Song
Hi Xintong, it is a huge plan to carry out, and I have a few questions about the details.

First, does "specific request" for the slots mean that the requesting slot profile contains detailed information about memory and CPU? And how does a job manager determine how much memory to ask for? Is that done when it schedules the execution graph? Or maybe I am missing something here.

Second, will dynamic allocation create fragments? For example, a task executor may have 100 MB of memory left while all other tasks ask for a larger memory size. |
Hi Shaoxun,

You're right that supporting end-to-end fine-grained resource management is a huge plan, and FLIP-56 is only one step towards it. Regarding your questions:

> First, does "specific request" for the slots mean the requesting slot profile contains detailed information about memory and cpu? Is it done when it scheduled the execution graph?

Yes, it means that slot requests contain detailed information about how much CPU and memory is needed.

> And how does a job manager determine to ask how much memory?

A job graph should contain how many resources each vertex/task needs, and the JobMaster knows how many resources to request for each slot by adding up the resources of the tasks it plans to deploy in that slot.

There are various ways to initially set the resources in the job graph:
- We can expose an interface that lets the user decide how many resources each operator needs, as you can do currently in the DataStream API. But we probably want to change that later for better usability.
- The compiler can set them automatically, according to the operator type and some configured default values for each type.

Anyway, fine-grained resource management is an advanced feature, targeting expert users who know well how many resources their jobs/tasks need. There are also various efforts to make task-level fine-grained resource configuration automatic, which are not in the scope of this FLIP.

> Second, will the dynamic allocation create the fragments?

Yes, it will. You can also look at FLINK-14106, where we try to make the slot allocation strategy pluggable, so that we can have different strategies for different use cases. E.g., we can have a strategy that starts TMs only when slot requests are received, with exactly the resources requested by the slots. That avoids fragments, at the cost of longer scheduling time due to starting TMs late, which should be suitable for long-running streaming jobs.
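To make the bookkeeping above concrete, here is a minimal sketch in Python. It is not Flink code: `Resources`, `slot_request` and `TaskExecutor` are hypothetical names invented for illustration, and the numbers are made up. It shows a slot request computed by adding up task resources, the 100 MB fragment from the question, and a TM started with exactly the requested resources.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Resources:
    """Hypothetical resource profile: CPU cores and memory in MB."""
    cpu: float
    memory_mb: int

    def plus(self, other: "Resources") -> "Resources":
        return Resources(self.cpu + other.cpu, self.memory_mb + other.memory_mb)

    def minus(self, other: "Resources") -> "Resources":
        return Resources(self.cpu - other.cpu, self.memory_mb - other.memory_mb)

    def fits_in(self, other: "Resources") -> bool:
        return self.cpu <= other.cpu and self.memory_mb <= other.memory_mb


def slot_request(task_resources):
    """Add up the resources of all tasks planned for one slot."""
    total = Resources(0.0, 0)
    for resources in task_resources:
        total = total.plus(resources)
    return total


class TaskExecutor:
    """Hypothetical TM that carves slots out of its free resources on demand."""

    def __init__(self, total: Resources):
        self.free = total

    def allocate(self, request: Resources) -> bool:
        # Reject the request if it does not fit into what is left.
        if not request.fits_in(self.free):
            return False
        self.free = self.free.minus(request)
        return True


# Two tasks share one slot, so the request is the sum of their needs.
request = slot_request([Resources(0.5, 400), Resources(0.5, 500)])

# A TM with predefined resources: the first slot fits, the second does not,
# and the remaining 100 MB stays unused as a fragment.
predefined_tm = TaskExecutor(Resources(2.0, 1000))
first_ok = predefined_tm.allocate(request)
second_ok = predefined_tm.allocate(Resources(0.5, 200))

# A TM started with exactly the requested resources leaves nothing over.
exact_tm = TaskExecutor(request)
exact_ok = exact_tm.allocate(request)
```

In this sketch the predefined TM ends with 100 MB it cannot hand out, while the exactly sized TM ends with nothing free. A real allocation strategy would also have to account for TM startup time, which the sketch ignores.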
We can also have another strategy that starts a configured number of TMs with predefined resources before receiving any slot request. The benefit is that the job gets scheduled immediately, and the cost is potential fragments, which I believe is more suitable for short batch queries.

Thank you~

Xintong Song

On Tue, Mar 3, 2020 at 3:00 PM shaoxun <[hidden email]> wrote:
> Hi Xintong, it is a huge plan to carry on. And I get a few questions about the details.
>
> First, does "specific request" for the slots mean the requesting slot profile contains detailed information about memory and cpu? And how does a job manager determine to ask how much memory? Is it done when it scheduled the execution graph? Or maybe I miss something here.
>
> Second, will the dynamic allocation create the fragments? For example, if a task executor has 100mb memory left and maybe other tasks all ask for a larger memory size.
 |