Hi everyone,
We would like to start a discussion thread on "FLIP-53: Fine Grained Resource Management" [1], where we propose how to improve Flink resource management and scheduling.

This FLIP mainly discusses the following issues:
- How to support tasks with fine grained resource requirements.
- How to unify resource management for jobs with / without fine grained resource requirements.
- How to unify resource management for streaming / batch jobs.

Key changes proposed in the FLIP are as follows:
- Unify memory management for operators with / without fine grained resource requirements by applying a fraction based quota mechanism (see the sketch below).
- Unify resource scheduling for streaming and batch jobs by setting slot sharing groups for pipelined regions during the compiling stage.
- Dynamically allocate slots from task executors' available resources.

Please find more details in the FLIP wiki document [1]. Looking forward to your feedback.

Thank you~

Xintong Song

[1] https://cwiki.apache.org/confluence/display/FLINK/FLIP-53%3A+Fine+Grained+Resource+Management
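To make the fraction based quota idea above a bit more concrete, here is a minimal, self-contained sketch in plain Java. The class and method names are purely illustrative and are not the actual Flink interfaces; the only point is that each operator in a slot gets a fraction of the slot's managed memory, and operators without an explicit requirement evenly share whatever fraction remains.

    import java.util.LinkedHashMap;
    import java.util.Map;

    public class FractionQuotaSketch {

        /**
         * Derives an absolute managed-memory quota per operator from a slot budget.
         * Operators with an explicit fraction keep it; operators without a declared
         * requirement (fraction <= 0) evenly share the remaining fraction.
         */
        static Map<String, Long> deriveQuotas(long slotManagedMemoryBytes,
                                              Map<String, Double> declaredFractions) {
            double declared = declaredFractions.values().stream()
                    .filter(f -> f > 0).mapToDouble(Double::doubleValue).sum();
            long unknownCount = declaredFractions.values().stream()
                    .filter(f -> f <= 0).count();
            double defaultFraction =
                    unknownCount == 0 ? 0.0 : Math.max(0.0, 1.0 - declared) / unknownCount;

            Map<String, Long> quotas = new LinkedHashMap<>();
            declaredFractions.forEach((op, f) -> quotas.put(
                    op, (long) (slotManagedMemoryBytes * (f > 0 ? f : defaultFraction))));
            return quotas;
        }

        public static void main(String[] args) {
            Map<String, Double> fractions = new LinkedHashMap<>();
            fractions.put("hash-join", 0.5);   // fine grained requirement
            fractions.put("source", -1.0);     // no explicit requirement
            fractions.put("sink", -1.0);       // no explicit requirement
            // Prints {hash-join=268435456, source=134217728, sink=134217728} for a 512 MB slot.
            System.out.println(deriveQuotas(512L << 20, fractions));
        }
    }

At runtime, the derived quota would bound how much managed memory each operator may request from its slot, which is what allows operators with and without fine grained requirements to share the same unified memory management path.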
Hi, Xintong
Thanks for proposing this FLIP. The general design looks good to me, +1 for this feature.

Since slots in the same task executor could have different resource profiles, we will run into a resource fragmentation problem. Think about this case:
- Request A wants 1G memory, while requests B & C each want 0.5G memory.
- There are two task executors, T1 & T2, with 1G and 0.5G free memory respectively.
If B comes first and we cut a slot from T1 for B, A must wait for resources to be freed by other tasks. But A could have been scheduled immediately if we had cut a slot from T2 for B.

The logic of findMatchingSlot now becomes finding a task executor that has enough resources and then cutting a slot from it. The current method can be seen as a "first-fit strategy", which works well in general but is not always optimal. Actually, this problem can be abstracted as the "Bin Packing Problem" [1]. Common approximate algorithms include:
- First fit
- Next fit
- Best fit

However, it becomes a multi-dimensional bin packing problem once we take CPU into account, and it is then hard to define which candidate is the best fit. Some research has addressed this problem, such as Tetris [2].

Here are some thoughts about it (see also the sketch below):
1. We could make the strategy of finding a matching task executor pluggable, and let users configure the best strategy for their scenario.
2. We could support a batch request interface in the RM, because we have more opportunities to optimize when we have more information. If we know A, B and C at the same time, we can always make the best decision.

[1] http://www.or.deis.unibo.it/kp/Chapter8.pdf
[2] https://www.cs.cmu.edu/~xia/resources/Documents/grandl_sigcomm14.pdf

Best,
Yangze Guo
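To illustrate the A/B/C example above, here is a toy, self-contained Java sketch of the two placement strategies. It is not Flink code and all the types are made up; it only shows how first fit fragments T1 while best fit keeps the 1G request schedulable.

    import java.util.ArrayList;
    import java.util.Comparator;
    import java.util.List;
    import java.util.Optional;

    public class FitStrategySketch {

        static final class Executor {
            final String name;
            double freeGb;
            Executor(String name, double freeGb) { this.name = name; this.freeGb = freeGb; }
        }

        /** First fit: take the first executor that has enough free memory. */
        static Optional<Executor> firstFit(List<Executor> executors, double requestGb) {
            return executors.stream().filter(t -> t.freeGb >= requestGb).findFirst();
        }

        /** Best fit: take the executor that would have the least free memory left over. */
        static Optional<Executor> bestFit(List<Executor> executors, double requestGb) {
            return executors.stream().filter(t -> t.freeGb >= requestGb)
                    .min(Comparator.comparingDouble(t -> t.freeGb - requestGb));
        }

        static void allocate(Optional<Executor> pick, double requestGb, String request) {
            pick.ifPresentOrElse(
                    t -> { t.freeGb -= requestGb; System.out.println(request + " -> " + t.name); },
                    () -> System.out.println(request + " -> must wait"));
        }

        public static void main(String[] args) {
            // T1 has 1G free, T2 has 0.5G free; B (0.5G) arrives before A (1G).
            List<Executor> executors =
                    new ArrayList<>(List.of(new Executor("T1", 1.0), new Executor("T2", 0.5)));
            allocate(firstFit(executors, 0.5), 0.5, "B");   // B -> T1, fragments T1
            allocate(firstFit(executors, 1.0), 1.0, "A");   // A -> must wait

            executors = new ArrayList<>(List.of(new Executor("T1", 1.0), new Executor("T2", 0.5)));
            allocate(bestFit(executors, 0.5), 0.5, "B");    // B -> T2
            allocate(bestFit(executors, 1.0), 1.0, "A");    // A -> T1
        }
    }

Which strategy wins depends on the workload, which is exactly why a pluggable strategy (and possibly batched requests) is attractive.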
Hi Xintong,
thanks for drafting this FLIP. I think your proposal helps to execute batch jobs more efficiently. Moreover, it enables the proper integration of the Blink planner, which is very important as well.

Overall, the FLIP looks good to me. I was wondering whether it wouldn't make sense to actually split it up into two FLIPs: operator resource management and dynamic slot allocation. These two FLIPs could be seen as orthogonal, and it would decrease the scope of each individual FLIP.

Some smaller comments:
- I'm not sure whether we should pass in the default slot size via an environment variable. Without having unified the way Flink components are configured [1], I think it would be better to pass it in as part of the configuration.
- I would avoid returning a null value from TaskExecutorGateway#requestResource if the request cannot be fulfilled. We should either introduce an explicit return value saying so or throw an exception (see the sketch below).

Concerning Yangze's comments: I think you are right that it would be helpful to make the selection strategy pluggable. Batching slot requests to the RM could also be a good optimization. For the sake of keeping the scope of this FLIP smaller, I would try to tackle these things after the initial version has been completed (without spoiling these optimization opportunities). In particular, batching the slot requests depends on the current scheduler refactoring and could also be realized on the RM side only.

[1] https://cwiki.apache.org/confluence/display/FLINK/FLIP-54%3A+Evolve+ConfigOption+and+Configuration

Cheers,
Till
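Regarding the null return from TaskExecutorGateway#requestResource, below is one possible shape of the two alternatives, sketched in standalone Java. The names (RequestResourceSketch, RequestResourceResult) are illustrative only and not the proposed Flink interface: either make the outcome explicit in the return type, or fail the request's future with a descriptive exception.

    import java.util.concurrent.CompletableFuture;

    public class RequestResourceSketch {

        /** Option 1: an explicit result type, carrying either an allocated slot id or a rejection reason. */
        static final class RequestResourceResult {
            final String allocatedSlotId;   // non-null iff the request was fulfilled
            final String rejectionReason;   // non-null iff the request was rejected

            private RequestResourceResult(String slotId, String reason) {
                this.allocatedSlotId = slotId;
                this.rejectionReason = reason;
            }

            static RequestResourceResult fulfilled(String slotId) {
                return new RequestResourceResult(slotId, null);
            }

            static RequestResourceResult rejected(String reason) {
                return new RequestResourceResult(null, reason);
            }

            boolean isFulfilled() {
                return allocatedSlotId != null;
            }
        }

        /** Option 2: never return null; complete the future exceptionally instead. */
        static CompletableFuture<String> requestResource(double requestedGb, double freeGb) {
            if (freeGb >= requestedGb) {
                return CompletableFuture.completedFuture("slot-1");
            }
            return CompletableFuture.failedFuture(new IllegalStateException(
                    "Cannot fulfill request: need " + requestedGb + "G, only " + freeGb + "G free"));
        }
    }

Either way, the caller can distinguish "rejected" from "errored" without null checks.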
Thanks for the feedback, Yangze and Till.
Yangze,

I agree with you that we should make the scheduling strategy pluggable and optimize it to reduce the memory fragmentation problem, and thanks for the input on the potential algorithmic solutions. However, I'm in favor of keeping this FLIP focused on the overall mechanism design rather than on strategies. Solving the fragmentation issue should be considered an optimization, and I agree with Till that we should probably tackle it afterwards.

Till,

- Regarding splitting the FLIP, I think it makes sense. Operator resource management and dynamic slot allocation do not have much dependency on each other.
- Regarding the default slot size, I think this is similar to FLIP-49 [1], where we want all the deriving to happen in one place (see the sketch below). I think it would be nice to pass the default slot size into the task executor in the same way that we pass in the memory pool sizes in FLIP-49 [1].
- Regarding the return value of TaskExecutorGateway#requestResource, I think you're right. We should avoid using null as the return value; we should probably throw an exception here.

Thank you~

Xintong Song

[1] https://cwiki.apache.org/confluence/display/FLINK/FLIP-49%3A+Unified+Memory+Configuration+for+TaskExecutors
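On the point of all the deriving happening in one place, the intent can be shown with a trivial standalone sketch (illustrative names, not FLIP-49 code): whichever component computes the memory pool sizes also derives the default slot resource once and hands the ready-made value to the task executor, rather than each component re-deriving it from raw configuration.

    public class DefaultSlotSizeSketch {

        /**
         * Derived once, e.g. where the memory pool sizes are computed, and then passed to the
         * task executor as a ready-made value, so there is a single source of truth.
         */
        static long deriveDefaultSlotManagedMemoryBytes(long managedMemoryBytes, int numberOfSlots) {
            if (numberOfSlots <= 0) {
                throw new IllegalArgumentException("numberOfSlots must be positive");
            }
            return managedMemoryBytes / numberOfSlots;
        }

        public static void main(String[] args) {
            // 2 GB of managed memory and 4 default slots -> 512 MB per default slot.
            System.out.println(deriveDefaultSlotManagedMemoryBytes(2L << 30, 4)); // 536870912
        }
    }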
Hi Xintong,
Thanks for your detailed proposal. I think many users are suffering from wasted resources. The resource spec of all task managers is the same, and we have to scale up all task managers to make the heaviest one more stable. So we will benefit a lot from fine grained resource management; we could get better resource utilization and stability.

Just to share some thoughts:

1. How are the resource specifications of TaskManagers calculated? Do they all have the same resource spec calculated based on the configuration? I think we would still have wasted resources in that situation. Or could we start TaskManagers with different specs?
2. If a slot is released and returned to the SlotPool, can it be reused by another SlotRequest whose requested resources are smaller than the slot's?
   - If yes, what happens to the available resources in the TaskManager?
   - What is the SlotStatus of the cached slot in the SlotPool? Is the AllocationId null?
3. In a session cluster, some jobs are configured with operator resources, while other jobs use UNKNOWN. How do we deal with this situation?

Best,
Yang
Thanks for the comments, Yang.
Regarding your questions:

> 1. How are the resource specifications of TaskManagers calculated? Do they all have the same resource spec calculated based on the configuration? Or could we start TaskManagers with different specs?

I agree with you that we can further improve resource utilization by customizing task executors with different resource specifications. However, I'm in favor of limiting the scope of this FLIP and leaving this as a future optimization. The plan for that part is to move the logic of deciding task executor specifications into the slot manager and make the slot manager pluggable, so that different slot manager plugins can apply different logic for deciding the task executor specifications.

> 2. If a slot is released and returned to the SlotPool, can it be reused by another SlotRequest whose requested resources are smaller than the slot's?

No, I think the slot pool should always return slots if they do not exactly match the pending requests, so that the resource manager can deal with the extra resources.

> - If yes, what happens to the available resources in the TaskManager?
> - What is the SlotStatus of the cached slot in the SlotPool? Is the AllocationId null?

The allocation id does not change as long as the slot has not been returned from the job master, no matter whether it is occupied or available in the slot pool. I think we have the same behavior currently: no matter how many tasks the job master deploys into the slot, concurrently or sequentially, it is one allocation from the cluster to the job until the slot is freed by the job master.

> 3. In a session cluster, some jobs are configured with operator resources, while other jobs use UNKNOWN. How do we deal with this situation?

As long as we do not mix unknown and specified resource profiles within the same job / slot, there shouldn't be a problem. The resource manager converts unknown resource profiles in slot requests to specified default resource profiles, so they can be dynamically allocated from task executors' available resources just like other slot requests with specified resource profiles (see the sketch below).

Thank you~

Xintong Song
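To illustrate the last answer, here is a small standalone sketch (made-up types, not Flink's actual ResourceProfile API) of the resource manager rewriting UNKNOWN slot requests to the configured default profile before allocation, so that both kinds of requests go through the same dynamic allocation path.

    public class UnknownProfileConversionSketch {

        /** A tiny stand-in for a resource profile; UNKNOWN means "not specified by the user". */
        static final class Profile {
            static final Profile UNKNOWN = new Profile(-1);
            final long managedMemoryBytes;

            Profile(long managedMemoryBytes) { this.managedMemoryBytes = managedMemoryBytes; }

            boolean isUnknown() { return managedMemoryBytes < 0; }

            @Override
            public String toString() {
                return isUnknown() ? "UNKNOWN" : managedMemoryBytes + " bytes";
            }
        }

        /**
         * On the resource manager side, every UNKNOWN slot request is rewritten to the configured
         * default profile before a slot is cut from a task executor's available resources, so
         * unknown and specified requests share the same allocation path afterwards.
         */
        static Profile normalize(Profile requested, Profile configuredDefault) {
            return requested.isUnknown() ? configuredDefault : requested;
        }

        public static void main(String[] args) {
            Profile defaultProfile = new Profile(512L << 20);
            System.out.println(normalize(Profile.UNKNOWN, defaultProfile));         // 536870912 bytes
            System.out.println(normalize(new Profile(256L << 20), defaultProfile)); // 268435456 bytes
        }
    }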
Hi everyone,
As Till suggested, the original "FLIP-53: Fine Grained Resource Management" has been split into two separate FLIPs:

- FLIP-53: Fine Grained Operator Resource Management [1]
- FLIP-56: Dynamic Slot Allocation [2]

We'll continue using this discussion thread for FLIP-53. For FLIP-56, I just started a new discussion thread [3].

Thank you~

Xintong Song

[1] https://cwiki.apache.org/confluence/display/FLINK/FLIP-53%3A+Fine+Grained+Operator+Resource+Management
[2] https://cwiki.apache.org/confluence/display/FLINK/FLIP-56%3A+Dynamic+Slot+Allocation
[3] http://apache-flink-mailing-list-archive.1008284.n3.nabble.com/DISCUSS-FLIP-56-Dynamic-Slot-Allocation-td31960.html
Added implementation steps for this FLIP on the wiki page [1].
Thank you~

Xintong Song

[1] https://cwiki.apache.org/confluence/display/FLINK/FLIP-49%3A+Unified+Memory+Configuration+for+TaskExecutors
I guess there is a typo, since the link to FLIP-53 is
https://cwiki.apache.org/confluence/display/FLINK/FLIP-53%3A+Fine+Grained+Operator+Resource+Management

Cheers,
Till

On Tue, Aug 27, 2019 at 1:42 PM Xintong Song <[hidden email]> wrote:
> Added implementation steps for this FLIP on the wiki page [1].
>
> Thank you~
>
> Xintong Song
>
> [1]
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-49%3A+Unified+Memory+Configuration+for+TaskExecutors
Thanks for creating the implementation plan Xintong. Overall, the
implementation plan looks good. I had a couple of comments:

- What will happen if a user has defined a streaming job with two slot
sharing groups? Would the code insert a blocking data exchange between
these two groups? If yes, then this breaks existing Flink streaming jobs.
- How do we detect unbounded streaming jobs to set
allSourcesInSamePipelinedRegion to `true`? Wouldn't it be easier to set it
to false if we are using the DataSet API or the Blink planner with a
bounded job?

Cheers,
Till
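For concreteness, the first concern maps onto jobs that are already expressible with the public DataStream API. The sketch below is only illustrative (the topology, host/port, and group names are made up); it shows an unbounded job whose two slot sharing groups are connected by a pipelined edge today, which is exactly the kind of job that would break if that edge were rewritten into a blocking exchange:

```java
import org.apache.flink.api.common.functions.MapFunction;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class TwoSlotSharingGroupsJob {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Unbounded source placed in its own slot sharing group.
        env.socketTextStream("localhost", 9999)
                .slotSharingGroup("sources")
                // Downstream operator placed in a second slot sharing group.
                .map(new MapFunction<String, String>() {
                    @Override
                    public String map(String value) {
                        return value.toUpperCase();
                    }
                })
                .slotSharingGroup("processing")
                .print();

        // Today the edge between "sources" and "processing" stays pipelined;
        // turning it into a blocking exchange would stall an unbounded job.
        env.execute("two-slot-sharing-groups");
    }
}
```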
Thanks for the correction, Till.
Regarding your comments:

- You are right, we should not change the edge type for streaming jobs.
Then I think we can change the option 'allSourcesInSamePipelinedRegion' in
step 2 to 'isStreamingJob', and implement the current step 2 before the
current step 1, so we can use this option to decide whether we should
change the edge type. What do you think?
- Agree. It should be easier to make the default value of
'allSourcesInSamePipelinedRegion' (or 'isStreamingJob') 'true', and set it
to 'false' when using the DataSet API or the Blink planner.

Thank you~

Xintong Song
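A rough sketch of the default logic discussed above follows. Every name in it (class, enum, method) is a placeholder invented for illustration rather than anything taken from the FLIP or the Flink code base; the only point it encodes is "default to true, switch to false for DataSet jobs and bounded Blink jobs":

```java
// Illustrative sketch only; all names here are hypothetical.
public final class PipelinedRegionOption {

    /** Where the job graph comes from, as far as this decision is concerned. */
    public enum JobOrigin { DATASTREAM_API, DATASET_API, BLINK_PLANNER }

    /**
     * Default for 'allSourcesInSamePipelinedRegion' (a.k.a. 'isStreamingJob'):
     * true unless the job is a DataSet job or a bounded Blink job.
     */
    public static boolean allSourcesInSamePipelinedRegion(JobOrigin origin, boolean isBounded) {
        if (origin == JobOrigin.DATASET_API) {
            return false;
        }
        if (origin == JobOrigin.BLINK_PLANNER && isBounded) {
            return false;
        }
        return true;
    }
}
```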
Regarding changing the edge type, I think we actually don't need to do this
for batch jobs either, because we don't have public interfaces for users to
explicitly set slot sharing groups in the DataSet API and SQL/Table API. We
have such interfaces in the DataStream API only.

Thank you~

Xintong Song
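The API asymmetry mentioned above is visible directly in user code: the DataStream API has a public slotSharingGroup(...) setter per operator, while the DataSet API has no equivalent. A minimal sketch with toy data and an arbitrary group name:

```java
import org.apache.flink.api.java.ExecutionEnvironment;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class SlotSharingGroupApiSurface {
    public static void main(String[] args) throws Exception {
        // DataStream API: users can pick a slot sharing group per operator.
        StreamExecutionEnvironment streamEnv = StreamExecutionEnvironment.getExecutionEnvironment();
        streamEnv.fromElements(1, 2, 3)
                .slotSharingGroup("my-group")
                .print();
        streamEnv.execute("datastream-slot-sharing-group");

        // DataSet API: there is no slotSharingGroup(...) on DataSet operators,
        // so batch users cannot place operators into sharing groups explicitly.
        ExecutionEnvironment batchEnv = ExecutionEnvironment.getExecutionEnvironment();
        batchEnv.fromElements(1, 2, 3).print();
    }
}
```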
Updated the FLIP wiki page [1], with the following changes.
- Remove the step of converting pipelined edges between different slot sharing groups into blocking edges.
- Set `allSourcesInSamePipelinedRegion` to true by default.
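For illustration, here is a very rough sketch of the per-region grouping idea. The class and method names are made up for this mail and this is not the actual StreamingJobGraphGenerator implementation:

import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Illustration only -- not the actual generator code.
// Each pipelined region (here just a list of vertex names) gets its own
// slot sharing group, so only tasks of the same region share slots.
public class RegionSlotSharingSketch {

    static Map<String, String> groupByPipelinedRegion(List<List<String>> pipelinedRegions) {
        Map<String, String> vertexToSlotSharingGroup = new LinkedHashMap<>();
        for (int i = 0; i < pipelinedRegions.size(); i++) {
            String group = "region-" + i;
            for (String vertex : pipelinedRegions.get(i)) {
                vertexToSlotSharingGroup.put(vertex, group);
            }
        }
        return vertexToSlotSharingGroup;
    }

    public static void main(String[] args) {
        // With `allSourcesInSamePipelinedRegion` set to true (the new default), a
        // connected streaming job typically forms a single region and therefore a
        // single group, which keeps today's behavior. A batch job with two regions,
        // as below, gets two groups.
        List<List<String>> batchRegions =
                List.of(List.of("source", "map"), List.of("reduce", "sink"));
        System.out.println(groupByPipelinedRegion(batchRegions));
        // {source=region-0, map=region-0, reduce=region-1, sink=region-1}
    }
}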
Thank you~

Xintong Song |
Thanks Xintong for proposing this improvement. Fine grained resources can be very helpful when users have planned their resources well.

I have a few questions:
1. Currently in a batch job, vertices from different regions can run at the same time in slots from the same sharing group, as long as they have no data dependency on each other and the available slot count is not smaller than the *max* parallelism over all tasks. With the changes in this FLIP, however, tasks from different regions cannot share slots anymore. Once the available slot count is smaller than the *sum* of the parallelisms of the tasks from all regions, tasks may need to be executed sequentially, which might result in a performance regression. Is this (a performance regression for existing DataSet jobs) considered a necessary and accepted trade-off in this FLIP? (See the sketch at the end of this mail.)
2. The network memory depends on the input/output ExecutionEdge count and thus can differ even between parallel instances of the same JobVertex. Does this mean that, when adding task resources to calculate the slot resource for a shared group, the maximum possible network memory of the vertex instances has to be used? This might result in requesting more resources than actually needed.

And some minor comments:
1. Regarding "fracManagedMemOnHeap = 1 / numOpsUseOnHeapManagedMemory", I guess you mean numOpsUseOnHeapManagedMemoryInTheSameSharedGroup?
2. I think the *StreamGraphGenerator* in the #Slot Sharing section and in implementation step 4 should be *StreamingJobGraphGenerator*, as *StreamGraphGenerator* is not aware of the JobGraph and pipelined regions.
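To make question 1 concrete, here is a tiny self-contained sketch with made-up parallelism numbers. It compares the slots needed to run all regions concurrently with one shared group (my understanding of today's behavior) against one group per region (my reading of this FLIP); it only illustrates the question and is not meant to describe the actual scheduler:

import java.util.List;

// Illustration only; names and numbers are made up.
public class SlotCountSketch {

    // Parallelism of each vertex, grouped by pipelined region.
    static int slotsWithSingleSharedGroup(List<List<Integer>> regions) {
        // One shared group: slots needed = max parallelism over all vertices.
        return regions.stream().flatMap(List::stream).mapToInt(Integer::intValue).max().orElse(0);
    }

    static int slotsWithGroupPerRegion(List<List<Integer>> regions) {
        // One group per region: slots needed = sum over regions of the
        // max parallelism inside each region.
        return regions.stream()
                .mapToInt(r -> r.stream().mapToInt(Integer::intValue).max().orElse(0))
                .sum();
    }

    public static void main(String[] args) {
        // Two regions without data dependencies: one with vertices of parallelism 4
        // and 4, another with vertices of parallelism 3 and 3.
        List<List<Integer>> regions = List.of(List.of(4, 4), List.of(3, 3));
        System.out.println(slotsWithSingleSharedGroup(regions)); // 4
        System.out.println(slotsWithGroupPerRegion(regions));    // 7
    }
}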
Thanks,
Zhu Zhu |
Thanks Xintong for driving this effort. I haven't finished the whole document yet, but I have a couple of questions:

1. Regarding network memory, the document says it will be derived by the framework automatically. I'm wondering whether we should delete this dimension from the user-facing API?
2. Regarding the fraction based quota, I don't quite get the meaning of "slotSharingGroupOnHeapManagedMem" and "slotSharingGroupOffHeapManagedMem". What happens if a sharing group mixes specified resource requirements and UNKNOWN ones? (I wrote down my current reading as a sketch below.)
3. IIUC, even if a user has set resource requirements, say 500MB of off-heap managed memory, during execution the operator may or may not actually get 500MB of off-heap managed memory, right?
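To make question 2 more concrete, here is the reading I currently have of the fraction calculation, written as a small standalone sketch. The names slotSharingGroupOnHeapManagedMem and numOpsUseOnHeapManagedMemoryInTheSameSharedGroup are taken from the document; the code itself is only my interpretation and deliberately rejects the mixed case, because that is the part I don't understand:

import java.util.Arrays;
import java.util.List;

// My interpretation only, for the sake of the question -- not the proposed code.
// An operator's on-heap managed memory requirement is a positive number of MB, or UNKNOWN.
public class ManagedMemoryFractionSketch {

    static final int UNKNOWN = -1;

    static double[] onHeapManagedMemFractions(List<Integer> opRequirementsMb) {
        boolean anySpecified = opRequirementsMb.stream().anyMatch(m -> m != UNKNOWN);
        boolean anyUnknown = opRequirementsMb.stream().anyMatch(m -> m == UNKNOWN);
        if (anySpecified && anyUnknown) {
            // This mixed case is exactly what question 2 asks about.
            throw new IllegalArgumentException("Mixed specified / UNKNOWN requirements in one group");
        }

        double[] fractions = new double[opRequirementsMb.size()];
        if (anyUnknown) {
            // All UNKNOWN: split evenly, i.e. 1 / numOpsUseOnHeapManagedMemoryInTheSameSharedGroup.
            Arrays.fill(fractions, 1.0 / opRequirementsMb.size());
        } else {
            // All specified: each operator's share of the group's total requirement
            // (slotSharingGroupOnHeapManagedMem, as I read it).
            long total = opRequirementsMb.stream().mapToLong(Integer::longValue).sum();
            for (int i = 0; i < fractions.length; i++) {
                fractions[i] = (double) opRequirementsMb.get(i) / total;
            }
        }
        return fractions;
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(onHeapManagedMemFractions(List.of(300, 100))));          // [0.75, 0.25]
        System.out.println(Arrays.toString(onHeapManagedMemFractions(List.of(UNKNOWN, UNKNOWN))));   // two equal halves
    }
}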
Best,
Kurt |
Thanks for the comments, Zhu & Kurt.
Andrey and I also had some discussions offline, and I would like to first post a summary of our discussion:

1. The motivation of the fraction based approach is to unify resource management for both operators with specified and unknown resource requirements.
2. The fraction based approach proposed in this FLIP should only affect streaming jobs (both bounded and unbounded). For DataSet jobs, there is already a fraction based approach (in TaskConfig and ChainedDriver), and we do not make any change to the existing approach.
3. The scope of this FLIP does not include discussion of how to set ResourceSpec for operators.
   1. For Blink jobs, the optimizer can set operator resources for the users, according to their configurations (default: unknown).
   2. For DataStream jobs, there is no method / interface to set operator resources at the moment (1.10). We can add one in the future.
   3. For DataSet jobs, there are existing user interfaces to set operator resources.
4. The FLIP should explain more about how ResourceSpec works:
   1. PhysicalTransformations (deployed with operators into the StreamTasks) get a ResourceSpec: unknown by default, or known (e.g. from the Blink planner).
   2. While generating the stream graph, calculate the fractions and set them in the StreamConfig.
   3. While scheduling, convert the ResourceSpec to a ResourceProfile (ResourceSpec + network memory), and deploy to slots / TMs matching the resources.
   4. While starting the Task in the TM, each operator gets its fraction converted back to the original absolute value requested by the user, or a fair share of the slot for unknown requirements.
5. We should not set `allSourcesInSamePipelinedRegion` to `false` for DataSet jobs. Behaviors of DataSet jobs should not be changed.
6. The FLIP document should differentiate the work planned in this FLIP from the future follow-ups more clearly, by putting the follow-ups in a separate section.
7. Another limitation of the rejected alternative (setting fractions at scheduling time) is that the scheduler implementation does not know in advance which tasks will be deployed into the same slot.

Andrey, please bring it up if there is anything I missed.

Zhu, regarding your comments:

1. If we do not set `allSourcesInSamePipelinedRegion` to `false` for DataSet jobs (point 5 in the discussion summary above), then there shouldn't be any regression, right?
2. I think it makes sense to set the max possible network memory for the JobVertex. When you say parallel instances of the same JobVertex may need different network memory, I guess you mean the rescale scenarios where the parallelism of the upstream / downstream vertex cannot be exactly divided by the parallelism of the downstream / upstream vertex? I would say it's acceptable to have a slight difference between the actually needed and the allocated network memory.
3. Yes, by numOpsUseOnHeapManagedMemory I mean numOpsUseOnHeapManagedMemoryInTheSameSharedGroup. I'll update the doc.
4. Yes, it should be StreamingJobGraphGenerator. Thanks for the correction.

Kurt, regarding your comments:

1. I think we don't have network memory in ResourceSpec, which is the user facing API. We only have network memory in ResourceProfile, which is used internally for scheduling. The reason we do not expose network memory to the user is that, currently, how many network buffers each task needs is decided by the topology of the execution graph (how many input / output channels it has).
2. In the section "Operator Resource Requirements": "For the first version, we do not support mixing operators with specified / unknown resource requirements in the same job. Either all or none of the operators of the same job should specify their resource requirements. StreamGraphGenerator should check this and throw an error when mixing of specified / unknown resource requirements is detected, during the compilation stage."
3. If the user sets a resource requirement, then it is guaranteed that the task gets at least that much resource, otherwise there should be an exception. That should be guaranteed by the "Dynamic Slot Allocation" approach (FLIP-56).

I'll update the FLIP document addressing the comments ASAP.

Thank you~

Xintong Song

On Tue, Sep 3, 2019 at 2:42 PM Kurt Young <[hidden email]> wrote:

> Thanks Xintong for driving this effort, I haven't finished the whole document yet, but have a couple of questions:
>
> 1. Regarding network memory, the document said it will be derived by the framework automatically. I'm wondering whether we should delete this dimension from the user-facing API?
>
> 2. Regarding the fraction based quota, I don't quite get the meaning of "slotSharingGroupOnHeapManagedMem" and "slotSharingGroupOffHeapManagedMem". What if the sharing group is mixed with specified resource and UNKNOWN resource requirements?
>
> 3. IIUC, even if the user has set resource requirements, let's say 500MB off-heap managed memory, during execution the operator may or may not actually have 500MB off-heap managed memory, right?
>
> Best,
> Kurt
>
>
> On Mon, Sep 2, 2019 at 8:36 PM Zhu Zhu <[hidden email]> wrote:
>
> > Thanks Xintong for proposing this improvement. Fine grained resources can be very helpful when the user has good planning on resources.
> >
> > I have a few questions:
> > 1. Currently in a batch job, vertices from different regions can run at the same time in slots from the same shared group, as long as they do not have data dependency on each other and the available slot count is not smaller than the *max* of the parallelism of all tasks.
> > With changes in this FLIP however, tasks from different regions cannot share slots anymore.
> > Once the available slot count is smaller than the *sum* of the parallelism of tasks from all regions, tasks may need to be executed sequentially, which might result in a performance regression.
> > Is this (performance regression to existing DataSet jobs) considered as a necessary and accepted trade-off in this FLIP?
> >
> > 2. The network memory depends on the input/output ExecutionEdge count and thus can be different even for parallel instances of the same JobVertex.
> > Does this mean that when adding task resources to calculate the slot resource for a shared group, the max possible network memory of the vertex instance shall be used?
> > This might result in a larger resource being required than actually needed.
> >
> > And some minor comments:
> > 1. Regarding "fracManagedMemOnHeap = 1 / numOpsUseOnHeapManagedMemory", I guess you mean numOpsUseOnHeapManagedMemoryInTheSameSharedGroup?
> > 2. I think the *StreamGraphGenerator* in the #Slot Sharing section and implementation step 4 should be *StreamingJobGraphGenerator*, as *StreamGraphGenerator* is not aware of the JobGraph and pipelined regions.
> >
> >
> > Thanks,
> > Zhu Zhu
> >
> > On Mon, Sep 2, 2019 at 11:59 AM Xintong Song <[hidden email]> wrote:
> >
> > > Updated the FLIP wiki page [1], with the following changes.
> > >
> > > - Remove the step of converting pipelined edges between different slot sharing groups into blocking edges.
> > > - Set `allSourcesInSamePipelinedRegion` to true by default.
> > > > > > Thank you~ > > > > > > Xintong Song > > > > > > > > > > > > On Mon, Sep 2, 2019 at 11:50 AM Xintong Song <[hidden email]> > > > wrote: > > > > > > > Regarding changing edge type, I think actually we don't need to do > this > > > > for batch jobs neither, because we don't have public interfaces for > > users > > > > to explicitly set slot sharing groups in DataSet API and SQL/Table > API. > > > We > > > > have such interfaces in DataStream API only. > > > > > > > > Thank you~ > > > > > > > > Xintong Song > > > > > > > > > > > > > > > > On Tue, Aug 27, 2019 at 10:16 PM Xintong Song <[hidden email] > > > > > > wrote: > > > > > > > >> Thanks for the correction, Till. > > > >> > > > >> Regarding your comments: > > > >> - You are right, we should not change the edge type for streaming > > jobs. > > > >> Then I think we can change the option > > 'allSourcesInSamePipelinedRegion' > > > in > > > >> step 2 to 'isStreamingJob', and implement the current step 2 before > > the > > > >> current step 1 so we can use this option to decide whether should > > change > > > >> the edge type. What do you think? > > > >> - Agree. It should be easier to make the default value of > > > >> 'allSourcesInSamePipelinedRegion' (or 'isStreamingJob') 'true', and > > set > > > it > > > >> to 'false' when using DataSet API or blink planner. > > > >> > > > >> Thank you~ > > > >> > > > >> Xintong Song > > > >> > > > >> > > > >> > > > >> On Tue, Aug 27, 2019 at 8:59 PM Till Rohrmann <[hidden email] > > > > > >> wrote: > > > >> > > > >>> Thanks for creating the implementation plan Xintong. Overall, the > > > >>> implementation plan looks good. I had a couple of comments: > > > >>> > > > >>> - What will happen if a user has defined a streaming job with two > > slot > > > >>> sharing groups? Would the code insert a blocking data exchange > > between > > > >>> these two groups? If yes, then this breaks existing Flink streaming > > > jobs. > > > >>> - How do we detect unbounded streaming jobs to set > > > >>> the allSourcesInSamePipelinedRegion to `true`? Wouldn't it be > easier > > to > > > >>> set > > > >>> it false if we are using the DataSet API or the Blink planner with > a > > > >>> bounded job? > > > >>> > > > >>> Cheers, > > > >>> Till > > > >>> > > > >>> On Tue, Aug 27, 2019 at 2:16 PM Till Rohrmann < > [hidden email]> > > > >>> wrote: > > > >>> > > > >>> > I guess there is a typo since the link to the FLIP-53 is > > > >>> > > > > >>> > > > > > > https://cwiki.apache.org/confluence/display/FLINK/FLIP-53%3A+Fine+Grained+Operator+Resource+Management > > > >>> > > > > >>> > Cheers, > > > >>> > Till > > > >>> > > > > >>> > On Tue, Aug 27, 2019 at 1:42 PM Xintong Song < > > [hidden email]> > > > >>> > wrote: > > > >>> > > > > >>> >> Added implementation steps for this FLIP on the wiki page [1]. 
> > > >>> >> > > > >>> >> > > > >>> >> Thank you~ > > > >>> >> > > > >>> >> Xintong Song > > > >>> >> > > > >>> >> > > > >>> >> [1] > > > >>> >> > > > >>> >> > > > >>> > > > > > > https://cwiki.apache.org/confluence/display/FLINK/FLIP-49%3A+Unified+Memory+Configuration+for+TaskExecutors > > > >>> >> > > > >>> >> On Mon, Aug 19, 2019 at 10:29 PM Xintong Song < > > > [hidden email]> > > > >>> >> wrote: > > > >>> >> > > > >>> >> > Hi everyone, > > > >>> >> > > > > >>> >> > As Till suggested, the original "FLIP-53: Fine Grained > Resource > > > >>> >> > Management" splits into two separate FLIPs, > > > >>> >> > > > > >>> >> > - FLIP-53: Fine Grained Operator Resource Management [1] > > > >>> >> > - FLIP-56: Dynamic Slot Allocation [2] > > > >>> >> > > > > >>> >> > We'll continue using this discussion thread for FLIP-53. For > > > >>> FLIP-56, I > > > >>> >> > just started a new discussion thread [3]. > > > >>> >> > > > > >>> >> > Thank you~ > > > >>> >> > > > > >>> >> > Xintong Song > > > >>> >> > > > > >>> >> > > > > >>> >> > [1] > > > >>> >> > > > > >>> >> > > > >>> > > > > > > https://cwiki.apache.org/confluence/display/FLINK/FLIP-53%3A+Fine+Grained+Operator+Resource+Management > > > >>> >> > > > > >>> >> > [2] > > > >>> >> > > > > >>> >> > > > >>> > > > > > > https://cwiki.apache.org/confluence/display/FLINK/FLIP-56%3A+Dynamic+Slot+Allocation > > > >>> >> > > > > >>> >> > [3] > > > >>> >> > > > > >>> >> > > > >>> > > > > > > http://apache-flink-mailing-list-archive.1008284.n3.nabble.com/DISCUSS-FLIP-56-Dynamic-Slot-Allocation-td31960.html > > > >>> >> > > > > >>> >> > On Mon, Aug 19, 2019 at 2:55 PM Xintong Song < > > > [hidden email] > > > >>> > > > > >>> >> > wrote: > > > >>> >> > > > > >>> >> >> Thinks for the comments, Yang. > > > >>> >> >> > > > >>> >> >> Regarding your questions: > > > >>> >> >> > > > >>> >> >> 1. How to calculate the resource specification of > > > TaskManagers? > > > >>> Do > > > >>> >> they > > > >>> >> >>> have them same resource spec calculated based on the > > > >>> >> configuration? I > > > >>> >> >>> think > > > >>> >> >>> we still have wasted resources in this situation. Or we > > could > > > >>> start > > > >>> >> >>> TaskManagers with different spec. > > > >>> >> >>> > > > >>> >> >> I agree with you that we can further improve the resource > > utility > > > >>> by > > > >>> >> >> customizing task executors with different resource > > > specifications. > > > >>> >> However, > > > >>> >> >> I'm in favor of limiting the scope of this FLIP and leave it > > as a > > > >>> >> future > > > >>> >> >> optimization. The plan for that part is to move the logic of > > > >>> deciding > > > >>> >> task > > > >>> >> >> executor specifications into the slot manager and make slot > > > manager > > > >>> >> >> pluggable, so inside the slot manager plugin we can have > > > different > > > >>> >> logics > > > >>> >> >> for deciding the task executor specifications. > > > >>> >> >> > > > >>> >> >> > > > >>> >> >>> 2. If a slot is released and returned to SlotPool, does > it > > > >>> could be > > > >>> >> >>> reused by other SlotRequest that the request resource is > > > >>> smaller > > > >>> >> than > > > >>> >> >>> it? > > > >>> >> >>> > > > >>> >> >> No, I think slot pool should always return slots if they do > not > > > >>> exactly > > > >>> >> >> match the pending requests, so that resource manager can deal > > > with > > > >>> the > > > >>> >> >> extra resources. 
> > > >>> >> >> > > > >>> >> >>> - If it is yes, what happens to the available resource > > in > > > >>> the > > > >>> >> >> > > > >>> >> >> TaskManager. > > > >>> >> >>> - What is the SlotStatus of the cached slot in > SlotPool? > > > The > > > >>> >> >>> AllocationId is null? > > > >>> >> >>> > > > >>> >> >> The allocation id does not change as long as the slot is not > > > >>> returned > > > >>> >> >> from the job master, no matter its occupied or available in > the > > > >>> slot > > > >>> >> pool. > > > >>> >> >> I think we have the same behavior currently. No matter how > many > > > >>> tasks > > > >>> >> the > > > >>> >> >> job master deploy into the slot, concurrently or > sequentially, > > it > > > >>> is > > > >>> >> one > > > >>> >> >> allocation from the cluster to the job until the slot is > freed > > > from > > > >>> >> the job > > > >>> >> >> master. > > > >>> >> >> > > > >>> >> >>> 3. In a session cluster, some jobs are configured with > > > operator > > > >>> >> >>> resources, meanwhile other jobs are using UNKNOWN. How to > > > deal > > > >>> with > > > >>> >> >>> this > > > >>> >> >>> situation? > > > >>> >> >> > > > >>> >> >> As long as we do not mix unknown / specified resource > profiles > > > >>> within > > > >>> >> the > > > >>> >> >> same job / slot, there shouldn't be a problem. Resource > manager > > > >>> >> converts > > > >>> >> >> unknown resource profiles in slot requests to specified > default > > > >>> >> resource > > > >>> >> >> profiles, so they can be dynamically allocated from task > > > executors' > > > >>> >> >> available resources just as other slot requests with > specified > > > >>> resource > > > >>> >> >> profiles. > > > >>> >> >> > > > >>> >> >> Thank you~ > > > >>> >> >> > > > >>> >> >> Xintong Song > > > >>> >> >> > > > >>> >> >> > > > >>> >> >> > > > >>> >> >> On Mon, Aug 19, 2019 at 11:39 AM Yang Wang < > > > [hidden email]> > > > >>> >> wrote: > > > >>> >> >> > > > >>> >> >>> Hi Xintong, > > > >>> >> >>> > > > >>> >> >>> > > > >>> >> >>> Thanks for your detailed proposal. I think many users are > > > >>> suffering > > > >>> >> from > > > >>> >> >>> waste of resources. The resource spec of all task managers > are > > > >>> same > > > >>> >> and > > > >>> >> >>> we > > > >>> >> >>> have to increase all task managers to make the heavy one > more > > > >>> stable. > > > >>> >> So > > > >>> >> >>> we > > > >>> >> >>> will benefit from the fine grained resource management a > lot. > > We > > > >>> could > > > >>> >> >>> get > > > >>> >> >>> better resource utilization and stability. > > > >>> >> >>> > > > >>> >> >>> > > > >>> >> >>> Just to share some thoughts. > > > >>> >> >>> > > > >>> >> >>> > > > >>> >> >>> > > > >>> >> >>> 1. How to calculate the resource specification of > > > >>> TaskManagers? Do > > > >>> >> >>> they > > > >>> >> >>> have them same resource spec calculated based on the > > > >>> >> configuration? I > > > >>> >> >>> think > > > >>> >> >>> we still have wasted resources in this situation. Or we > > could > > > >>> start > > > >>> >> >>> TaskManagers with different spec. > > > >>> >> >>> 2. If a slot is released and returned to SlotPool, does > it > > > >>> could be > > > >>> >> >>> reused by other SlotRequest that the request resource is > > > >>> smaller > > > >>> >> than > > > >>> >> >>> it? > > > >>> >> >>> - If it is yes, what happens to the available resource > > in > > > >>> the > > > >>> >> >>> TaskManager. > > > >>> >> >>> - What is the SlotStatus of the cached slot in > SlotPool? 
> > > The > > > >>> >> >>> AllocationId is null? > > > >>> >> >>> 3. In a session cluster, some jobs are configured with > > > operator > > > >>> >> >>> resources, meanwhile other jobs are using UNKNOWN. How to > > > deal > > > >>> with > > > >>> >> >>> this > > > >>> >> >>> situation? > > > >>> >> >>> > > > >>> >> >>> > > > >>> >> >>> > > > >>> >> >>> Best, > > > >>> >> >>> Yang > > > >>> >> >>> > > > >>> >> >>> Xintong Song <[hidden email]> 于2019年8月16日周五 > 下午8:57写道: > > > >>> >> >>> > > > >>> >> >>> > Thanks for the feedbacks, Yangze and Till. > > > >>> >> >>> > > > > >>> >> >>> > Yangze, > > > >>> >> >>> > > > > >>> >> >>> > I agree with you that we should make scheduling strategy > > > >>> pluggable > > > >>> >> and > > > >>> >> >>> > optimize the strategy to reduce the memory fragmentation > > > >>> problem, > > > >>> >> and > > > >>> >> >>> > thanks for the inputs on the potential algorithmic > > solutions. > > > >>> >> However, > > > >>> >> >>> I'm > > > >>> >> >>> > in favor of keep this FLIP focusing on the overall > mechanism > > > >>> design > > > >>> >> >>> rather > > > >>> >> >>> > than strategies. Solving the fragmentation issue should be > > > >>> >> considered > > > >>> >> >>> as an > > > >>> >> >>> > optimization, and I agree with Till that we probably > should > > > >>> tackle > > > >>> >> this > > > >>> >> >>> > afterwards. > > > >>> >> >>> > > > > >>> >> >>> > Till, > > > >>> >> >>> > > > > >>> >> >>> > - Regarding splitting the FLIP, I think it makes sense. > The > > > >>> operator > > > >>> >> >>> > resource management and dynamic slot allocation do not > have > > > much > > > >>> >> >>> dependency > > > >>> >> >>> > on each other. > > > >>> >> >>> > > > > >>> >> >>> > - Regarding the default slot size, I think this is similar > > to > > > >>> >> FLIP-49 > > > >>> >> >>> [1] > > > >>> >> >>> > where we want all the deriving happens at one place. I > think > > > it > > > >>> >> would > > > >>> >> >>> be > > > >>> >> >>> > nice to pass the default slot size into the task executor > in > > > the > > > >>> >> same > > > >>> >> >>> way > > > >>> >> >>> > that we pass in the memory pool sizes in FLIP-49 [1]. > > > >>> >> >>> > > > > >>> >> >>> > - Regarding the return value of > > > >>> >> TaskExecutorGateway#requestResource, I > > > >>> >> >>> > think you're right. We should avoid using null as the > return > > > >>> value. > > > >>> >> I > > > >>> >> >>> think > > > >>> >> >>> > we probably should thrown an exception here. > > > >>> >> >>> > > > > >>> >> >>> > Thank you~ > > > >>> >> >>> > > > > >>> >> >>> > Xintong Song > > > >>> >> >>> > > > > >>> >> >>> > > > > >>> >> >>> > [1] > > > >>> >> >>> > > > > >>> >> >>> > > > > >>> >> >>> > > > >>> >> > > > >>> > > > > > > https://cwiki.apache.org/confluence/display/FLINK/FLIP-49%3A+Unified+Memory+Configuration+for+TaskExecutors > > > >>> >> >>> > > > > >>> >> >>> > On Fri, Aug 16, 2019 at 2:18 PM Till Rohrmann < > > > >>> [hidden email] > > > >>> >> > > > > >>> >> >>> > wrote: > > > >>> >> >>> > > > > >>> >> >>> > > Hi Xintong, > > > >>> >> >>> > > > > > >>> >> >>> > > thanks for drafting this FLIP. I think your proposal > helps > > > to > > > >>> >> >>> improve the > > > >>> >> >>> > > execution of batch jobs more efficiently. Moreover, it > > > >>> enables the > > > >>> >> >>> proper > > > >>> >> >>> > > integration of the Blink planner which is very important > > as > > > >>> well. > > > >>> >> >>> > > > > > >>> >> >>> > > Overall, the FLIP looks good to me. 
I was wondering > > whether > > > it > > > >>> >> >>> wouldn't > > > >>> >> >>> > > make sense to actually split it up into two FLIPs: > > Operator > > > >>> >> resource > > > >>> >> >>> > > management and dynamic slot allocation. I think these > two > > > >>> FLIPs > > > >>> >> >>> could be > > > >>> >> >>> > > seen as orthogonal and it would decrease the scope of > each > > > >>> >> individual > > > >>> >> >>> > FLIP. > > > >>> >> >>> > > > > > >>> >> >>> > > Some smaller comments: > > > >>> >> >>> > > > > > >>> >> >>> > > - I'm not sure whether we should pass in the default > slot > > > size > > > >>> >> via an > > > >>> >> >>> > > environment variable. Without having unified the way how > > > Flink > > > >>> >> >>> components > > > >>> >> >>> > > are configured [1], I think it would be better to pass > it > > in > > > >>> as > > > >>> >> part > > > >>> >> >>> of > > > >>> >> >>> > the > > > >>> >> >>> > > configuration. > > > >>> >> >>> > > - I would avoid returning a null value from > > > >>> >> >>> > > TaskExecutorGateway#requestResource if it cannot be > > > fulfilled. > > > >>> >> >>> Either we > > > >>> >> >>> > > should introduce an explicit return value saying this or > > > >>> throw an > > > >>> >> >>> > > exception. > > > >>> >> >>> > > > > > >>> >> >>> > > Concerning Yangze's comments: I think you are right that > > it > > > >>> would > > > >>> >> be > > > >>> >> >>> > > helpful to make the selection strategy pluggable. Also > > > >>> batching > > > >>> >> slot > > > >>> >> >>> > > requests to the RM could be a good optimization. For the > > > sake > > > >>> of > > > >>> >> >>> keeping > > > >>> >> >>> > > the scope of this FLIP smaller I would try to tackle > these > > > >>> things > > > >>> >> >>> after > > > >>> >> >>> > the > > > >>> >> >>> > > initial version has been completed (without spoiling > these > > > >>> >> >>> optimization > > > >>> >> >>> > > opportunities). In particular batching the slot requests > > > >>> depends > > > >>> >> on > > > >>> >> >>> the > > > >>> >> >>> > > current scheduler refactoring and could also be realized > > on > > > >>> the RM > > > >>> >> >>> side > > > >>> >> >>> > > only. > > > >>> >> >>> > > > > > >>> >> >>> > > [1] > > > >>> >> >>> > > > > > >>> >> >>> > > > > > >>> >> >>> > > > > >>> >> >>> > > > >>> >> > > > >>> > > > > > > https://cwiki.apache.org/confluence/display/FLINK/FLIP-54%3A+Evolve+ConfigOption+and+Configuration > > > >>> >> >>> > > > > > >>> >> >>> > > Cheers, > > > >>> >> >>> > > Till > > > >>> >> >>> > > > > > >>> >> >>> > > > > > >>> >> >>> > > > > > >>> >> >>> > > On Fri, Aug 16, 2019 at 11:11 AM Yangze Guo < > > > >>> [hidden email]> > > > >>> >> >>> wrote: > > > >>> >> >>> > > > > > >>> >> >>> > > > Hi, Xintong > > > >>> >> >>> > > > > > > >>> >> >>> > > > Thanks to propose this FLIP. The general design looks > > good > > > >>> to > > > >>> >> me, > > > >>> >> >>> +1 > > > >>> >> >>> > > > for this feature. > > > >>> >> >>> > > > > > > >>> >> >>> > > > Since slots in the same task executor could have > > different > > > >>> >> resource > > > >>> >> >>> > > > profile, we will > > > >>> >> >>> > > > meet resource fragment problem. 
Think about this case: > > > >>> >> >>> > > > - request A want 1G memory while request B & C want > > 0.5G > > > >>> memory > > > >>> >> >>> > > > - There are two task executors T1 & T2 with 1G and > 0.5G > > > >>> free > > > >>> >> >>> memory > > > >>> >> >>> > > > respectively > > > >>> >> >>> > > > If B come first and we cut a slot from T1 for B, A > must > > > >>> wait for > > > >>> >> >>> the > > > >>> >> >>> > > > free resource from > > > >>> >> >>> > > > other task. But A could have been scheduled > immediately > > if > > > >>> we > > > >>> >> cut a > > > >>> >> >>> > > > slot from T2 for B. > > > >>> >> >>> > > > > > > >>> >> >>> > > > The logic of findMatchingSlot now become finding a > task > > > >>> executor > > > >>> >> >>> which > > > >>> >> >>> > > > has enough > > > >>> >> >>> > > > resource and then cut a slot from it. Current method > > could > > > >>> be > > > >>> >> seen > > > >>> >> >>> as > > > >>> >> >>> > > > "First-fit strategy", > > > >>> >> >>> > > > which works well in general but sometimes could not be > > the > > > >>> >> >>> optimization > > > >>> >> >>> > > > method. > > > >>> >> >>> > > > > > > >>> >> >>> > > > Actually, this problem could be abstracted as "Bin > > Packing > > > >>> >> >>> Problem"[1]. > > > >>> >> >>> > > > Here are > > > >>> >> >>> > > > some common approximate algorithms: > > > >>> >> >>> > > > - First fit > > > >>> >> >>> > > > - Next fit > > > >>> >> >>> > > > - Best fit > > > >>> >> >>> > > > > > > >>> >> >>> > > > But it become multi-dimensional bin packing problem if > > we > > > >>> take > > > >>> >> CPU > > > >>> >> >>> > > > into account. It hard > > > >>> >> >>> > > > to define which one is best fit now. Some research > > > addressed > > > >>> >> this > > > >>> >> >>> > > > problem, such like Tetris[2]. > > > >>> >> >>> > > > > > > >>> >> >>> > > > Here are some thinking about it: > > > >>> >> >>> > > > 1. We could make the strategy of finding matching task > > > >>> executor > > > >>> >> >>> > > > pluginable. Let user to config the > > > >>> >> >>> > > > best strategy in their scenario. > > > >>> >> >>> > > > 2. We could support batch request interface in RM, > > because > > > >>> we > > > >>> >> have > > > >>> >> >>> > > > opportunities to optimize > > > >>> >> >>> > > > if we have more information. If we know the A, B, C at > > the > > > >>> same > > > >>> >> >>> time, > > > >>> >> >>> > > > we could always make the best decision. > > > >>> >> >>> > > > > > > >>> >> >>> > > > [1] http://www.or.deis.unibo.it/kp/Chapter8.pdf > > > >>> >> >>> > > > [2] > > > >>> >> >>> > > > > >>> >> > > > https://www.cs.cmu.edu/~xia/resources/Documents/grandl_sigcomm14.pdf > > > >>> >> >>> > > > > > > >>> >> >>> > > > Best, > > > >>> >> >>> > > > Yangze Guo > > > >>> >> >>> > > > > > > >>> >> >>> > > > On Thu, Aug 15, 2019 at 10:40 PM Xintong Song < > > > >>> >> >>> [hidden email]> > > > >>> >> >>> > > > wrote: > > > >>> >> >>> > > > > > > > >>> >> >>> > > > > Hi everyone, > > > >>> >> >>> > > > > > > > >>> >> >>> > > > > We would like to start a discussion thread on > > "FLIP-53: > > > >>> Fine > > > >>> >> >>> Grained > > > >>> >> >>> > > > > Resource Management"[1], where we propose how to > > improve > > > >>> Flink > > > >>> >> >>> > resource > > > >>> >> >>> > > > > management and scheduling. > > > >>> >> >>> > > > > > > > >>> >> >>> > > > > This FLIP mainly discusses the following issues. > > > >>> >> >>> > > > > > > > >>> >> >>> > > > > - How to support tasks with fine grained resource > > > >>> >> >>> requirements. 
> > > >>> >> >>> > > > > - How to unify resource management for jobs with > / > > > >>> without > > > >>> >> >>> fine > > > >>> >> >>> > > > grained > > > >>> >> >>> > > > > resource requirements. > > > >>> >> >>> > > > > - How to unify resource management for streaming > / > > > >>> batch > > > >>> >> jobs. > > > >>> >> >>> > > > > > > > >>> >> >>> > > > > Key changes proposed in the FLIP are as follows. > > > >>> >> >>> > > > > > > > >>> >> >>> > > > > - Unify memory management for operators with / > > > without > > > >>> fine > > > >>> >> >>> > grained > > > >>> >> >>> > > > > resource requirements by applying a fraction > based > > > >>> quota > > > >>> >> >>> > mechanism. > > > >>> >> >>> > > > > - Unify resource scheduling for streaming and > batch > > > >>> jobs by > > > >>> >> >>> > setting > > > >>> >> >>> > > > slot > > > >>> >> >>> > > > > sharing groups for pipelined regions during > > compiling > > > >>> >> stage. > > > >>> >> >>> > > > > - Dynamically allocate slots from task executors' > > > >>> available > > > >>> >> >>> > > resources. > > > >>> >> >>> > > > > > > > >>> >> >>> > > > > Please find more details in the FLIP wiki document > > [1]. > > > >>> >> Looking > > > >>> >> >>> > forward > > > >>> >> >>> > > > to > > > >>> >> >>> > > > > your feedbacks. > > > >>> >> >>> > > > > > > > >>> >> >>> > > > > Thank you~ > > > >>> >> >>> > > > > > > > >>> >> >>> > > > > Xintong Song > > > >>> >> >>> > > > > > > > >>> >> >>> > > > > > > > >>> >> >>> > > > > [1] > > > >>> >> >>> > > > > > > > >>> >> >>> > > > > > > >>> >> >>> > > > > > >>> >> >>> > > > > >>> >> >>> > > > >>> >> > > > >>> > > > > > > https://cwiki.apache.org/confluence/display/FLINK/FLIP-53%3A+Fine+Grained+Resource+Management > > > >>> >> >>> > > > > > > >>> >> >>> > > > > > >>> >> >>> > > > > >>> >> >>> > > > >>> >> >> > > > >>> >> > > > >>> > > > > >>> > > > >> > > > > > > |
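To make the compile-time / runtime split in the summary above (points 4.2 and 4.4) a bit more concrete, here is a minimal sketch of how the fractions could be computed and later converted back to absolute values. The types and names (OperatorSpec, computeFractions, absoluteQuota) are illustrative assumptions, not the actual Flink classes, and the separate on-heap / off-heap managed memory dimensions are ignored for brevity. The sketch also folds in the "all or none" check from the reply to Kurt's second question, and in the specified case it assumes the slot's managed memory equals the group's total request, so the fraction converts back to exactly what the user asked for.

import java.util.List;

/** Illustrative sketch only; not the Flink implementation. */
public class ManagedMemoryFractionSketch {

    /** Hypothetical per-operator spec: a specified size in bytes, or UNKNOWN (-1). */
    public static final class OperatorSpec {
        final String name;
        final long managedMemoryBytes; // -1 means UNKNOWN
        OperatorSpec(String name, long managedMemoryBytes) {
            this.name = name;
            this.managedMemoryBytes = managedMemoryBytes;
        }
        boolean isUnknown() { return managedMemoryBytes < 0; }
    }

    /** Compile time: compute each operator's fraction of the slot's managed memory. */
    public static double[] computeFractions(List<OperatorSpec> opsInSharingGroup) {
        boolean anyUnknown = opsInSharingGroup.stream().anyMatch(OperatorSpec::isUnknown);
        boolean allUnknown = opsInSharingGroup.stream().allMatch(OperatorSpec::isUnknown);
        // "All or none" rule: mixing specified and UNKNOWN requirements is rejected at compile time.
        if (anyUnknown && !allUnknown) {
            throw new IllegalArgumentException(
                "Mixing operators with specified and UNKNOWN resource requirements is not supported.");
        }

        double[] fractions = new double[opsInSharingGroup.size()];
        if (allUnknown) {
            // UNKNOWN: every operator in the group gets an equal (fair) share of the slot.
            for (int i = 0; i < fractions.length; i++) {
                fractions[i] = 1.0 / opsInSharingGroup.size();
            }
        } else {
            // Specified: each operator's fraction is its share of the group's total requirement.
            // Assuming the slot is sized to the group's total, this converts back to the requested bytes.
            long total = opsInSharingGroup.stream().mapToLong(op -> op.managedMemoryBytes).sum();
            for (int i = 0; i < fractions.length; i++) {
                fractions[i] = (double) opsInSharingGroup.get(i).managedMemoryBytes / total;
            }
        }
        return fractions;
    }

    /** Runtime (task manager side): convert the fraction back to absolute bytes of the slot. */
    public static long absoluteQuota(double fraction, long slotManagedMemoryBytes) {
        return (long) (fraction * slotManagedMemoryBytes);
    }
}

For a group of three operators with unknown requirements, each would get a fraction of 1/3, which matches the fracManagedMemOnHeap = 1 / numOpsUseOnHeapManagedMemoryInTheSameSharedGroup formula discussed in this thread.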
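The guarantee mentioned in the reply to Kurt's third question (a task never gets less than the requested resources, otherwise there is an exception) can be illustrated with a rough sketch of dynamic slot cutting on the task executor side. All names here (Profile, requestSlot, freeSlot) are assumptions for illustration, not the FLIP-56 interfaces; note that the request fails with an exception rather than returning null, in line with the earlier comments on TaskExecutorGateway#requestResource.

import java.util.HashMap;
import java.util.Map;
import java.util.UUID;

/** Illustrative sketch only; not the FLIP-56 implementation. */
public class DynamicSlotSketch {

    /** Simplified resource profile: just memory and CPU. */
    public static final class Profile {
        final long memoryBytes;
        final double cpuCores;
        Profile(long memoryBytes, double cpuCores) {
            this.memoryBytes = memoryBytes;
            this.cpuCores = cpuCores;
        }
        boolean fitsInto(Profile other) {
            return memoryBytes <= other.memoryBytes && cpuCores <= other.cpuCores;
        }
        Profile minus(Profile other) {
            return new Profile(memoryBytes - other.memoryBytes, cpuCores - other.cpuCores);
        }
        Profile plus(Profile other) {
            return new Profile(memoryBytes + other.memoryBytes, cpuCores + other.cpuCores);
        }
    }

    private Profile available;
    private final Map<String, Profile> allocatedSlots = new HashMap<>();

    public DynamicSlotSketch(Profile totalResources) {
        this.available = totalResources;
    }

    /** Cut a slot of exactly the requested size from the free resources, or fail explicitly. */
    public String requestSlot(Profile requested) {
        if (!requested.fitsInto(available)) {
            throw new IllegalStateException(
                "Not enough free resources to cut a slot of the requested size.");
        }
        available = available.minus(requested);
        String allocationId = UUID.randomUUID().toString();
        allocatedSlots.put(allocationId, requested);
        // The task deployed into this slot gets at least what was requested.
        return allocationId;
    }

    /** Free a slot and return its resources to the available pool. */
    public void freeSlot(String allocationId) {
        Profile released = allocatedSlots.remove(allocationId);
        if (released != null) {
            available = available.plus(released);
        }
    }
}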
Thanks Xintong for the explanation.
For question #1, I think it's good as long as the DataSet job behavior remains the same.
For question #2, agreed that the resource difference is small enough (at most one edge of difference) in the currently supported point-wise execution edge connection patterns.

Thanks,
Zhu Zhu
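Regarding question #2 above (using the max possible network memory per JobVertex), here is a small sketch of how a conservative, framework-derived bound could be computed when a ResourceSpec is turned into a ResourceProfile at scheduling time. The formula, constants, and names are illustrative assumptions only, not Flink's actual derivation; it mainly shows why point-wise patterns differ by at most one channel between subtasks, so taking the max is only a slight over-allocation.

/** Illustrative sketch only; not Flink's actual network memory derivation. */
public class NetworkMemorySketch {

    /** Buffers needed by one subtask, given its input channel and output subpartition counts. */
    static long buffersForSubtask(int inputChannels, int outputSubpartitions, int buffersPerChannel) {
        return (long) buffersPerChannel * (inputChannels + outputSubpartitions);
    }

    /**
     * Max input channels of any subtask for a single input edge.
     * All-to-all: every subtask reads from all upstream subtasks.
     * Point-wise: subtasks read from ceil(upstream / downstream) or floor(...) upstream subtasks,
     * i.e. they differ by at most one channel, which is why the max is only a slight over-estimate.
     */
    static int maxInputChannels(boolean allToAll, int upstreamParallelism, int downstreamParallelism) {
        if (allToAll) {
            return upstreamParallelism;
        }
        return (upstreamParallelism + downstreamParallelism - 1) / downstreamParallelism; // ceil
    }

    public static void main(String[] args) {
        int buffersPerChannel = 2;        // assumed value, e.g. buffers-per-channel setting
        int bufferSizeBytes = 32 * 1024;  // assumed 32 KiB network buffer size

        // A point-wise edge from a 5-parallel upstream into a 3-parallel downstream vertex:
        int maxChannels = maxInputChannels(false, 5, 3);          // = 2, some subtasks only have 1
        long buffers = buffersForSubtask(maxChannels, 0, buffersPerChannel);
        long networkBytes = buffers * bufferSizeBytes;
        System.out.println("worst-case network memory per subtask: " + networkBytes + " bytes");
    }
}

The worst-case amount computed this way would be what gets added on top of the user's ResourceSpec when forming the slot-level ResourceProfile for the shared group.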
> > > > >>> >> >>> > > > > > > > > >>> >> >>> > > > > Thank you~ > > > > >>> >> >>> > > > > > > > > >>> >> >>> > > > > Xintong Song > > > > >>> >> >>> > > > > > > > > >>> >> >>> > > > > > > > > >>> >> >>> > > > > [1] > > > > >>> >> >>> > > > > > > > > >>> >> >>> > > > > > > > >>> >> >>> > > > > > > >>> >> >>> > > > > > >>> >> >>> > > > > >>> >> > > > > >>> > > > > > > > > > > https://cwiki.apache.org/confluence/display/FLINK/FLIP-53%3A+Fine+Grained+Resource+Management > > > > >>> >> >>> > > > > > > > >>> >> >>> > > > > > > >>> >> >>> > > > > > >>> >> >>> > > > > >>> >> >> > > > > >>> >> > > > > >>> > > > > > >>> > > > > >> > > > > > > > > > > |
@all
The FLIP document [1] has been updated. Thank you~ Xintong Song [1] https://cwiki.apache.org/confluence/display/FLINK/FLIP-53%3A+Fine+Grained+Operator+Resource+Management On Tue, Sep 3, 2019 at 7:20 PM Zhu Zhu <[hidden email]> wrote: > Thanks Xintong for the explanation. > > For question #1, I think it's good as long as DataSet job behaviors remains > the same. > > For question #2, agreed that the resource difference is small enough(at > most 1 edge diff) in current supported point-wise execution edge connection > patterns. > > Thanks, > Zhu Zhu > > Xintong Song <[hidden email]> 于2019年9月3日周二 下午6:58写道: > > > Thanks for the comments, Zhu & Kurt. > > > > Andrey and I also had some discussions offline, and I would like to first > > post a summary of our discussion: > > > > 1. The motivation of the fraction based approach is to unify resource > > management for both operators with specified and unknown resource > > requirements. > > 2. The fraction based approach proposed in this FLIP should only > affect > > streaming jobs (both bounded and unbounded). For DataSet jobs, there > are > > already some fraction based approach (in TaskConfig and > ChainedDriver), > > and > > we do not make any change to the existing approach. > > 3. The scope of this FLIP does not include discussion of how to set > > ResourceSpec for operators. > > 1. For blink jobs, the optimizer can set operator resources for the > > users, according to their configurations (default: unknown) > > 2. For DataStream jobs, there are no method / interface to set > > operator resources at the moment (1.10). We can have in the future. > > 3. For DataSet jobs, there are existing user interfaces to set > > operator resources. > > 4. The FLIP should explain more about how ResourceSpecs works > > 1. PhysicalTransformations (deployed with operators into the > > StreamTasks) get ResourceSpec: unknown by default or known (e.g. > > from the > > Blink planner) > > 2. While generating stream graph, calculate fractions and set to > > StreamConfig > > 3. While scheduling, convert ResourceSpec to ResourceProfile > > (ResourceSpec + network memory), and deploy to slots / TMs matching > > the > > resources > > 4. While starting Task in TM, each operator gets fraction converted > > back to the original absolute value requested by user or fair > > unknown share > > of the slot > > 5. We should not set `allSourcesInSamePipelinedRegion` to `false` > for > > DataSet jobs. Behaviors of DataSet jobs should not be changed. > > 6. The FLIP document should differentiate works planed in this FLIP > and > > the future follow-ups more clearly, by put the follow-ups in a > separate > > section > > 7. Another limitation of the rejected alternative setting fractions at > > scheduling time is that, the scheduler implementation does not know > > which > > tasks will be deployed into the same slot in advance. > > > > Andrey, Please bring it up if there is anything I missed. > > > > Zhu, regarding your comments: > > > > 1. If we do not set `allSourcesInSamePipelinedRegion` to `false` for > > DataSet jobs (point 5 in the discussion summary above), then there > > shouldn't be any regression right? > > 2. I think it makes sense to set the max possible network memory for > the > > JobVertex. When you say parallel instances of the same JobVertex may > > have > > need different network memory, I guess you mean the rescale scenarios > > where > > parallelisms of upstream / downstream vertex cannot be exactly divided > > by > > parallelism of downstream / upstream vertex? 
I would say it's > > acceptable to > > have slight difference between actually needed and allocated network > > memory. > > 3. Yes, by numOpsUseOnHeapManagedMemory I mean > > numOpsUseOnHeapManagedMemoryInTheSameSharedGroup. I'll update the doc. > > 4. Yes, it should be StreamingJobGraphGenerator. Thanks for the > > correction. > > > > > > Kurt, regarding your comments: > > > > 1. I think we don't have network memory in ResourceSpec, which is the > > user facing API. We only have network memory in ResourceProfile, which > > is > > used internally for scheduling. The reason we do not expose network > > memory > > to the user is that, currently how many network buffers each task > needs > > is > > decided by the topology of execution graph (how many input / output > > channels it has). > > 2. In the section "Operator Resource Requirements": "For the first > > version, we do not support mixing operators with specified / unknown > > resource requirements in the same job. Either all or none of the > > operators > > of the same job should specify their resource requirements. > > StreamGraphGenerator should check this and throw an error when mixing > of > > specified / unknown resource requirements is detected, during the > > compilation stage." > > 3. If the user set a resource requirement, then it is guaranteed that > > the task should get at least the much resource, otherwise there should > > be > > an exception. That should be guaranteed by the "Dynamic Slot > Allocation" > > approach (FLIP-56). > > > > > > I'll update the FLIP document addressing the comments ASAP. > > > > > > Thank you~ > > > > Xintong Song > > > > > > > > On Tue, Sep 3, 2019 at 2:42 PM Kurt Young <[hidden email]> wrote: > > > > > Thanks Xingtong for driving this effort, I haven't finished the whole > > > document yet, > > > but have couple of questions: > > > > > > 1. Regarding to network memory, the document said it will be derived by > > > framework > > > automatically. I'm wondering whether we should delete this dimension > from > > > user- > > > facing API? > > > > > > 2. Regarding to fraction based quota, I don't quite get the meaning of > > > "slotSharingGroupOnHeapManagedMem" and > > "slotSharingGroupOffHeapManagedMem". > > > What if the sharing group is mixed with specified resource and UNKNOWN > > > resource > > > requirements. > > > > > > 3 IIUC, even user had set resource requirements, lets say 500MB > off-heap > > > managed > > > memory, during execution the operator may or may not have 500MB > off-heap > > > managed > > > memory, right? > > > > > > Best, > > > Kurt > > > > > > > > > On Mon, Sep 2, 2019 at 8:36 PM Zhu Zhu <[hidden email]> wrote: > > > > > > > Thanks Xintong for proposing this improvement. Fine grained resources > > can > > > > be very helpful when user has good planning on resources. > > > > > > > > I have a few questions: > > > > 1. Currently in a batch job, vertices from different regions can run > at > > > the > > > > same time in slots from the same shared group, as long as they do not > > > have > > > > data dependency on each other and available slot count is not smaller > > > than > > > > the *max* of parallelism of all tasks. > > > > With changes in this FLIP however, tasks from different regions > cannot > > > > share slots anymore. > > > > Once available slot count is smaller than the *sum* of all > parallelism > > of > > > > tasks from all regions, tasks may need to be executed sequentially, > > which > > > > might result in a performance regression. 
> > > > Is this(performance regression to existing DataSet jobs) considered > as > > a > > > > necessary and accepted trade off in this FLIP? > > > > > > > > 2. The network memory depends on the input/output ExecutionEdge count > > and > > > > thus can be different even for parallel instances of the same > > JobVertex. > > > > Does this mean that when adding task resources to calculating the > slot > > > > resource for a shared group, the max possible network memory of the > > > vertex > > > > instance shall be used? > > > > This might result in larger resource required than actually needed. > > > > > > > > And some minor comments: > > > > 1. Regarding "fracManagedMemOnHeap = 1 / > > numOpsUseOnHeapManagedMemory", I > > > > guess you mean numOpsUseOnHeapManagedMemoryInTheSameSharedGroup ? > > > > 2. I think the *StreamGraphGenerator* in the #Slot Sharing section > and > > > > implementation step 4 should be *StreamingJobGraphGenerator*, as > > > > *StreamGraphGenerator* is not aware of JobGraph and pipelined region. > > > > > > > > > > > > Thanks, > > > > Zhu Zhu > > > > > > > > Xintong Song <[hidden email]> 于2019年9月2日周一 上午11:59写道: > > > > > > > > > Updated the FLIP wiki page [1], with the following changes. > > > > > > > > > > - Remove the step of converting pipelined edges between > different > > > slot > > > > > sharing groups into blocking edges. > > > > > - Set `allSourcesInSamePipelinedRegion` to true by default. > > > > > > > > > > Thank you~ > > > > > > > > > > Xintong Song > > > > > > > > > > > > > > > > > > > > On Mon, Sep 2, 2019 at 11:50 AM Xintong Song < > [hidden email]> > > > > > wrote: > > > > > > > > > > > Regarding changing edge type, I think actually we don't need to > do > > > this > > > > > > for batch jobs neither, because we don't have public interfaces > for > > > > users > > > > > > to explicitly set slot sharing groups in DataSet API and > SQL/Table > > > API. > > > > > We > > > > > > have such interfaces in DataStream API only. > > > > > > > > > > > > Thank you~ > > > > > > > > > > > > Xintong Song > > > > > > > > > > > > > > > > > > > > > > > > On Tue, Aug 27, 2019 at 10:16 PM Xintong Song < > > [hidden email] > > > > > > > > > > wrote: > > > > > > > > > > > >> Thanks for the correction, Till. > > > > > >> > > > > > >> Regarding your comments: > > > > > >> - You are right, we should not change the edge type for > streaming > > > > jobs. > > > > > >> Then I think we can change the option > > > > 'allSourcesInSamePipelinedRegion' > > > > > in > > > > > >> step 2 to 'isStreamingJob', and implement the current step 2 > > before > > > > the > > > > > >> current step 1 so we can use this option to decide whether > should > > > > change > > > > > >> the edge type. What do you think? > > > > > >> - Agree. It should be easier to make the default value of > > > > > >> 'allSourcesInSamePipelinedRegion' (or 'isStreamingJob') 'true', > > and > > > > set > > > > > it > > > > > >> to 'false' when using DataSet API or blink planner. > > > > > >> > > > > > >> Thank you~ > > > > > >> > > > > > >> Xintong Song > > > > > >> > > > > > >> > > > > > >> > > > > > >> On Tue, Aug 27, 2019 at 8:59 PM Till Rohrmann < > > [hidden email] > > > > > > > > > >> wrote: > > > > > >> > > > > > >>> Thanks for creating the implementation plan Xintong. Overall, > the > > > > > >>> implementation plan looks good. I had a couple of comments: > > > > > >>> > > > > > >>> - What will happen if a user has defined a streaming job with > two > > > > slot > > > > > >>> sharing groups? 
Would the code insert a blocking data exchange > > > > between > > > > > >>> these two groups? If yes, then this breaks existing Flink > > streaming > > > > > jobs. > > > > > >>> - How do we detect unbounded streaming jobs to set > > > > > >>> the allSourcesInSamePipelinedRegion to `true`? Wouldn't it be > > > easier > > > > to > > > > > >>> set > > > > > >>> it false if we are using the DataSet API or the Blink planner > > with > > > a > > > > > >>> bounded job? > > > > > >>> > > > > > >>> Cheers, > > > > > >>> Till > > > > > >>> > > > > > >>> On Tue, Aug 27, 2019 at 2:16 PM Till Rohrmann < > > > [hidden email]> > > > > > >>> wrote: > > > > > >>> > > > > > >>> > I guess there is a typo since the link to the FLIP-53 is > > > > > >>> > > > > > > >>> > > > > > > > > > > > > > > > https://cwiki.apache.org/confluence/display/FLINK/FLIP-53%3A+Fine+Grained+Operator+Resource+Management > > > > > >>> > > > > > > >>> > Cheers, > > > > > >>> > Till > > > > > >>> > > > > > > >>> > On Tue, Aug 27, 2019 at 1:42 PM Xintong Song < > > > > [hidden email]> > > > > > >>> > wrote: > > > > > >>> > > > > > > >>> >> Added implementation steps for this FLIP on the wiki page > [1]. > > > > > >>> >> > > > > > >>> >> > > > > > >>> >> Thank you~ > > > > > >>> >> > > > > > >>> >> Xintong Song > > > > > >>> >> > > > > > >>> >> > > > > > >>> >> [1] > > > > > >>> >> > > > > > >>> >> > > > > > >>> > > > > > > > > > > > > > > > https://cwiki.apache.org/confluence/display/FLINK/FLIP-49%3A+Unified+Memory+Configuration+for+TaskExecutors > > > > > >>> >> > > > > > >>> >> On Mon, Aug 19, 2019 at 10:29 PM Xintong Song < > > > > > [hidden email]> > > > > > >>> >> wrote: > > > > > >>> >> > > > > > >>> >> > Hi everyone, > > > > > >>> >> > > > > > > >>> >> > As Till suggested, the original "FLIP-53: Fine Grained > > > Resource > > > > > >>> >> > Management" splits into two separate FLIPs, > > > > > >>> >> > > > > > > >>> >> > - FLIP-53: Fine Grained Operator Resource Management > [1] > > > > > >>> >> > - FLIP-56: Dynamic Slot Allocation [2] > > > > > >>> >> > > > > > > >>> >> > We'll continue using this discussion thread for FLIP-53. > For > > > > > >>> FLIP-56, I > > > > > >>> >> > just started a new discussion thread [3]. > > > > > >>> >> > > > > > > >>> >> > Thank you~ > > > > > >>> >> > > > > > > >>> >> > Xintong Song > > > > > >>> >> > > > > > > >>> >> > > > > > > >>> >> > [1] > > > > > >>> >> > > > > > > >>> >> > > > > > >>> > > > > > > > > > > > > > > > https://cwiki.apache.org/confluence/display/FLINK/FLIP-53%3A+Fine+Grained+Operator+Resource+Management > > > > > >>> >> > > > > > > >>> >> > [2] > > > > > >>> >> > > > > > > >>> >> > > > > > >>> > > > > > > > > > > > > > > > https://cwiki.apache.org/confluence/display/FLINK/FLIP-56%3A+Dynamic+Slot+Allocation > > > > > >>> >> > > > > > > >>> >> > [3] > > > > > >>> >> > > > > > > >>> >> > > > > > >>> > > > > > > > > > > > > > > > http://apache-flink-mailing-list-archive.1008284.n3.nabble.com/DISCUSS-FLIP-56-Dynamic-Slot-Allocation-td31960.html > > > > > >>> >> > > > > > > >>> >> > On Mon, Aug 19, 2019 at 2:55 PM Xintong Song < > > > > > [hidden email] > > > > > >>> > > > > > > >>> >> > wrote: > > > > > >>> >> > > > > > > >>> >> >> Thinks for the comments, Yang. > > > > > >>> >> >> > > > > > >>> >> >> Regarding your questions: > > > > > >>> >> >> > > > > > >>> >> >> 1. How to calculate the resource specification of > > > > > TaskManagers? 
> > > > > >>> Do > > > > > >>> >> they > > > > > >>> >> >>> have them same resource spec calculated based on the > > > > > >>> >> configuration? I > > > > > >>> >> >>> think > > > > > >>> >> >>> we still have wasted resources in this situation. Or > we > > > > could > > > > > >>> start > > > > > >>> >> >>> TaskManagers with different spec. > > > > > >>> >> >>> > > > > > >>> >> >> I agree with you that we can further improve the resource > > > > utility > > > > > >>> by > > > > > >>> >> >> customizing task executors with different resource > > > > > specifications. > > > > > >>> >> However, > > > > > >>> >> >> I'm in favor of limiting the scope of this FLIP and leave > > it > > > > as a > > > > > >>> >> future > > > > > >>> >> >> optimization. The plan for that part is to move the logic > > of > > > > > >>> deciding > > > > > >>> >> task > > > > > >>> >> >> executor specifications into the slot manager and make > slot > > > > > manager > > > > > >>> >> >> pluggable, so inside the slot manager plugin we can have > > > > > different > > > > > >>> >> logics > > > > > >>> >> >> for deciding the task executor specifications. > > > > > >>> >> >> > > > > > >>> >> >> > > > > > >>> >> >>> 2. If a slot is released and returned to SlotPool, > does > > > it > > > > > >>> could be > > > > > >>> >> >>> reused by other SlotRequest that the request resource > > is > > > > > >>> smaller > > > > > >>> >> than > > > > > >>> >> >>> it? > > > > > >>> >> >>> > > > > > >>> >> >> No, I think slot pool should always return slots if they > do > > > not > > > > > >>> exactly > > > > > >>> >> >> match the pending requests, so that resource manager can > > deal > > > > > with > > > > > >>> the > > > > > >>> >> >> extra resources. > > > > > >>> >> >> > > > > > >>> >> >>> - If it is yes, what happens to the available > > resource > > > > in > > > > > >>> the > > > > > >>> >> >> > > > > > >>> >> >> TaskManager. > > > > > >>> >> >>> - What is the SlotStatus of the cached slot in > > > SlotPool? > > > > > The > > > > > >>> >> >>> AllocationId is null? > > > > > >>> >> >>> > > > > > >>> >> >> The allocation id does not change as long as the slot is > > not > > > > > >>> returned > > > > > >>> >> >> from the job master, no matter its occupied or available > in > > > the > > > > > >>> slot > > > > > >>> >> pool. > > > > > >>> >> >> I think we have the same behavior currently. No matter > how > > > many > > > > > >>> tasks > > > > > >>> >> the > > > > > >>> >> >> job master deploy into the slot, concurrently or > > > sequentially, > > > > it > > > > > >>> is > > > > > >>> >> one > > > > > >>> >> >> allocation from the cluster to the job until the slot is > > > freed > > > > > from > > > > > >>> >> the job > > > > > >>> >> >> master. > > > > > >>> >> >> > > > > > >>> >> >>> 3. In a session cluster, some jobs are configured > with > > > > > operator > > > > > >>> >> >>> resources, meanwhile other jobs are using UNKNOWN. > How > > to > > > > > deal > > > > > >>> with > > > > > >>> >> >>> this > > > > > >>> >> >>> situation? > > > > > >>> >> >> > > > > > >>> >> >> As long as we do not mix unknown / specified resource > > > profiles > > > > > >>> within > > > > > >>> >> the > > > > > >>> >> >> same job / slot, there shouldn't be a problem. 
Resource > > > manager > > > > > >>> >> converts > > > > > >>> >> >> unknown resource profiles in slot requests to specified > > > default > > > > > >>> >> resource > > > > > >>> >> >> profiles, so they can be dynamically allocated from task > > > > > executors' > > > > > >>> >> >> available resources just as other slot requests with > > > specified > > > > > >>> resource > > > > > >>> >> >> profiles. > > > > > >>> >> >> > > > > > >>> >> >> Thank you~ > > > > > >>> >> >> > > > > > >>> >> >> Xintong Song > > > > > >>> >> >> > > > > > >>> >> >> > > > > > >>> >> >> > > > > > >>> >> >> On Mon, Aug 19, 2019 at 11:39 AM Yang Wang < > > > > > [hidden email]> > > > > > >>> >> wrote: > > > > > >>> >> >> > > > > > >>> >> >>> Hi Xintong, > > > > > >>> >> >>> > > > > > >>> >> >>> > > > > > >>> >> >>> Thanks for your detailed proposal. I think many users > are > > > > > >>> suffering > > > > > >>> >> from > > > > > >>> >> >>> waste of resources. The resource spec of all task > managers > > > are > > > > > >>> same > > > > > >>> >> and > > > > > >>> >> >>> we > > > > > >>> >> >>> have to increase all task managers to make the heavy one > > > more > > > > > >>> stable. > > > > > >>> >> So > > > > > >>> >> >>> we > > > > > >>> >> >>> will benefit from the fine grained resource management a > > > lot. > > > > We > > > > > >>> could > > > > > >>> >> >>> get > > > > > >>> >> >>> better resource utilization and stability. > > > > > >>> >> >>> > > > > > >>> >> >>> > > > > > >>> >> >>> Just to share some thoughts. > > > > > >>> >> >>> > > > > > >>> >> >>> > > > > > >>> >> >>> > > > > > >>> >> >>> 1. How to calculate the resource specification of > > > > > >>> TaskManagers? Do > > > > > >>> >> >>> they > > > > > >>> >> >>> have them same resource spec calculated based on the > > > > > >>> >> configuration? I > > > > > >>> >> >>> think > > > > > >>> >> >>> we still have wasted resources in this situation. Or > we > > > > could > > > > > >>> start > > > > > >>> >> >>> TaskManagers with different spec. > > > > > >>> >> >>> 2. If a slot is released and returned to SlotPool, > does > > > it > > > > > >>> could be > > > > > >>> >> >>> reused by other SlotRequest that the request resource > > is > > > > > >>> smaller > > > > > >>> >> than > > > > > >>> >> >>> it? > > > > > >>> >> >>> - If it is yes, what happens to the available > > resource > > > > in > > > > > >>> the > > > > > >>> >> >>> TaskManager. > > > > > >>> >> >>> - What is the SlotStatus of the cached slot in > > > SlotPool? > > > > > The > > > > > >>> >> >>> AllocationId is null? > > > > > >>> >> >>> 3. In a session cluster, some jobs are configured > with > > > > > operator > > > > > >>> >> >>> resources, meanwhile other jobs are using UNKNOWN. > How > > to > > > > > deal > > > > > >>> with > > > > > >>> >> >>> this > > > > > >>> >> >>> situation? > > > > > >>> >> >>> > > > > > >>> >> >>> > > > > > >>> >> >>> > > > > > >>> >> >>> Best, > > > > > >>> >> >>> Yang > > > > > >>> >> >>> > > > > > >>> >> >>> Xintong Song <[hidden email]> 于2019年8月16日周五 > > > 下午8:57写道: > > > > > >>> >> >>> > > > > > >>> >> >>> > Thanks for the feedbacks, Yangze and Till. 
> > > > > >>> >> >>> > > > > > > >>> >> >>> > Yangze, > > > > > >>> >> >>> > > > > > > >>> >> >>> > I agree with you that we should make scheduling > strategy > > > > > >>> pluggable > > > > > >>> >> and > > > > > >>> >> >>> > optimize the strategy to reduce the memory > fragmentation > > > > > >>> problem, > > > > > >>> >> and > > > > > >>> >> >>> > thanks for the inputs on the potential algorithmic > > > > solutions. > > > > > >>> >> However, > > > > > >>> >> >>> I'm > > > > > >>> >> >>> > in favor of keep this FLIP focusing on the overall > > > mechanism > > > > > >>> design > > > > > >>> >> >>> rather > > > > > >>> >> >>> > than strategies. Solving the fragmentation issue > should > > be > > > > > >>> >> considered > > > > > >>> >> >>> as an > > > > > >>> >> >>> > optimization, and I agree with Till that we probably > > > should > > > > > >>> tackle > > > > > >>> >> this > > > > > >>> >> >>> > afterwards. > > > > > >>> >> >>> > > > > > > >>> >> >>> > Till, > > > > > >>> >> >>> > > > > > > >>> >> >>> > - Regarding splitting the FLIP, I think it makes > sense. > > > The > > > > > >>> operator > > > > > >>> >> >>> > resource management and dynamic slot allocation do not > > > have > > > > > much > > > > > >>> >> >>> dependency > > > > > >>> >> >>> > on each other. > > > > > >>> >> >>> > > > > > > >>> >> >>> > - Regarding the default slot size, I think this is > > similar > > > > to > > > > > >>> >> FLIP-49 > > > > > >>> >> >>> [1] > > > > > >>> >> >>> > where we want all the deriving happens at one place. I > > > think > > > > > it > > > > > >>> >> would > > > > > >>> >> >>> be > > > > > >>> >> >>> > nice to pass the default slot size into the task > > executor > > > in > > > > > the > > > > > >>> >> same > > > > > >>> >> >>> way > > > > > >>> >> >>> > that we pass in the memory pool sizes in FLIP-49 [1]. > > > > > >>> >> >>> > > > > > > >>> >> >>> > - Regarding the return value of > > > > > >>> >> TaskExecutorGateway#requestResource, I > > > > > >>> >> >>> > think you're right. We should avoid using null as the > > > return > > > > > >>> value. > > > > > >>> >> I > > > > > >>> >> >>> think > > > > > >>> >> >>> > we probably should thrown an exception here. > > > > > >>> >> >>> > > > > > > >>> >> >>> > Thank you~ > > > > > >>> >> >>> > > > > > > >>> >> >>> > Xintong Song > > > > > >>> >> >>> > > > > > > >>> >> >>> > > > > > > >>> >> >>> > [1] > > > > > >>> >> >>> > > > > > > >>> >> >>> > > > > > > >>> >> >>> > > > > > >>> >> > > > > > >>> > > > > > > > > > > > > > > > https://cwiki.apache.org/confluence/display/FLINK/FLIP-49%3A+Unified+Memory+Configuration+for+TaskExecutors > > > > > >>> >> >>> > > > > > > >>> >> >>> > On Fri, Aug 16, 2019 at 2:18 PM Till Rohrmann < > > > > > >>> [hidden email] > > > > > >>> >> > > > > > > >>> >> >>> > wrote: > > > > > >>> >> >>> > > > > > > >>> >> >>> > > Hi Xintong, > > > > > >>> >> >>> > > > > > > > >>> >> >>> > > thanks for drafting this FLIP. I think your proposal > > > helps > > > > > to > > > > > >>> >> >>> improve the > > > > > >>> >> >>> > > execution of batch jobs more efficiently. Moreover, > it > > > > > >>> enables the > > > > > >>> >> >>> proper > > > > > >>> >> >>> > > integration of the Blink planner which is very > > important > > > > as > > > > > >>> well. > > > > > >>> >> >>> > > > > > > > >>> >> >>> > > Overall, the FLIP looks good to me. 
I was wondering > > > > whether > > > > > it > > > > > >>> >> >>> wouldn't > > > > > >>> >> >>> > > make sense to actually split it up into two FLIPs: > > > > Operator > > > > > >>> >> resource > > > > > >>> >> >>> > > management and dynamic slot allocation. I think > these > > > two > > > > > >>> FLIPs > > > > > >>> >> >>> could be > > > > > >>> >> >>> > > seen as orthogonal and it would decrease the scope > of > > > each > > > > > >>> >> individual > > > > > >>> >> >>> > FLIP. > > > > > >>> >> >>> > > > > > > > >>> >> >>> > > Some smaller comments: > > > > > >>> >> >>> > > > > > > > >>> >> >>> > > - I'm not sure whether we should pass in the default > > > slot > > > > > size > > > > > >>> >> via an > > > > > >>> >> >>> > > environment variable. Without having unified the way > > how > > > > > Flink > > > > > >>> >> >>> components > > > > > >>> >> >>> > > are configured [1], I think it would be better to > pass > > > it > > > > in > > > > > >>> as > > > > > >>> >> part > > > > > >>> >> >>> of > > > > > >>> >> >>> > the > > > > > >>> >> >>> > > configuration. > > > > > >>> >> >>> > > - I would avoid returning a null value from > > > > > >>> >> >>> > > TaskExecutorGateway#requestResource if it cannot be > > > > > fulfilled. > > > > > >>> >> >>> Either we > > > > > >>> >> >>> > > should introduce an explicit return value saying > this > > or > > > > > >>> throw an > > > > > >>> >> >>> > > exception. > > > > > >>> >> >>> > > > > > > > >>> >> >>> > > Concerning Yangze's comments: I think you are right > > that > > > > it > > > > > >>> would > > > > > >>> >> be > > > > > >>> >> >>> > > helpful to make the selection strategy pluggable. > Also > > > > > >>> batching > > > > > >>> >> slot > > > > > >>> >> >>> > > requests to the RM could be a good optimization. For > > the > > > > > sake > > > > > >>> of > > > > > >>> >> >>> keeping > > > > > >>> >> >>> > > the scope of this FLIP smaller I would try to tackle > > > these > > > > > >>> things > > > > > >>> >> >>> after > > > > > >>> >> >>> > the > > > > > >>> >> >>> > > initial version has been completed (without spoiling > > > these > > > > > >>> >> >>> optimization > > > > > >>> >> >>> > > opportunities). In particular batching the slot > > requests > > > > > >>> depends > > > > > >>> >> on > > > > > >>> >> >>> the > > > > > >>> >> >>> > > current scheduler refactoring and could also be > > realized > > > > on > > > > > >>> the RM > > > > > >>> >> >>> side > > > > > >>> >> >>> > > only. > > > > > >>> >> >>> > > > > > > > >>> >> >>> > > [1] > > > > > >>> >> >>> > > > > > > > >>> >> >>> > > > > > > > >>> >> >>> > > > > > > >>> >> >>> > > > > > >>> >> > > > > > >>> > > > > > > > > > > > > > > > https://cwiki.apache.org/confluence/display/FLINK/FLIP-54%3A+Evolve+ConfigOption+and+Configuration > > > > > >>> >> >>> > > > > > > > >>> >> >>> > > Cheers, > > > > > >>> >> >>> > > Till > > > > > >>> >> >>> > > > > > > > >>> >> >>> > > > > > > > >>> >> >>> > > > > > > > >>> >> >>> > > On Fri, Aug 16, 2019 at 11:11 AM Yangze Guo < > > > > > >>> [hidden email]> > > > > > >>> >> >>> wrote: > > > > > >>> >> >>> > > > > > > > >>> >> >>> > > > Hi, Xintong > > > > > >>> >> >>> > > > > > > > > >>> >> >>> > > > Thanks to propose this FLIP. The general design > > looks > > > > good > > > > > >>> to > > > > > >>> >> me, > > > > > >>> >> >>> +1 > > > > > >>> >> >>> > > > for this feature. 
> > > > > >>> >> >>> > > > > > > > > >>> >> >>> > > > Since slots in the same task executor could have > > > > different > > > > > >>> >> resource > > > > > >>> >> >>> > > > profile, we will > > > > > >>> >> >>> > > > meet resource fragment problem. Think about this > > case: > > > > > >>> >> >>> > > > - request A want 1G memory while request B & C > want > > > > 0.5G > > > > > >>> memory > > > > > >>> >> >>> > > > - There are two task executors T1 & T2 with 1G > and > > > 0.5G > > > > > >>> free > > > > > >>> >> >>> memory > > > > > >>> >> >>> > > > respectively > > > > > >>> >> >>> > > > If B come first and we cut a slot from T1 for B, A > > > must > > > > > >>> wait for > > > > > >>> >> >>> the > > > > > >>> >> >>> > > > free resource from > > > > > >>> >> >>> > > > other task. But A could have been scheduled > > > immediately > > > > if > > > > > >>> we > > > > > >>> >> cut a > > > > > >>> >> >>> > > > slot from T2 for B. > > > > > >>> >> >>> > > > > > > > > >>> >> >>> > > > The logic of findMatchingSlot now become finding a > > > task > > > > > >>> executor > > > > > >>> >> >>> which > > > > > >>> >> >>> > > > has enough > > > > > >>> >> >>> > > > resource and then cut a slot from it. Current > method > > > > could > > > > > >>> be > > > > > >>> >> seen > > > > > >>> >> >>> as > > > > > >>> >> >>> > > > "First-fit strategy", > > > > > >>> >> >>> > > > which works well in general but sometimes could > not > > be > > > > the > > > > > >>> >> >>> optimization > > > > > >>> >> >>> > > > method. > > > > > >>> >> >>> > > > > > > > > >>> >> >>> > > > Actually, this problem could be abstracted as "Bin > > > > Packing > > > > > >>> >> >>> Problem"[1]. > > > > > >>> >> >>> > > > Here are > > > > > >>> >> >>> > > > some common approximate algorithms: > > > > > >>> >> >>> > > > - First fit > > > > > >>> >> >>> > > > - Next fit > > > > > >>> >> >>> > > > - Best fit > > > > > >>> >> >>> > > > > > > > > >>> >> >>> > > > But it become multi-dimensional bin packing > problem > > if > > > > we > > > > > >>> take > > > > > >>> >> CPU > > > > > >>> >> >>> > > > into account. It hard > > > > > >>> >> >>> > > > to define which one is best fit now. Some research > > > > > addressed > > > > > >>> >> this > > > > > >>> >> >>> > > > problem, such like Tetris[2]. > > > > > >>> >> >>> > > > > > > > > >>> >> >>> > > > Here are some thinking about it: > > > > > >>> >> >>> > > > 1. We could make the strategy of finding matching > > task > > > > > >>> executor > > > > > >>> >> >>> > > > pluginable. Let user to config the > > > > > >>> >> >>> > > > best strategy in their scenario. > > > > > >>> >> >>> > > > 2. We could support batch request interface in RM, > > > > because > > > > > >>> we > > > > > >>> >> have > > > > > >>> >> >>> > > > opportunities to optimize > > > > > >>> >> >>> > > > if we have more information. If we know the A, B, > C > > at > > > > the > > > > > >>> same > > > > > >>> >> >>> time, > > > > > >>> >> >>> > > > we could always make the best decision. 
> > > > > >>> >> >>> > > > > > > > > >>> >> >>> > > > [1] http://www.or.deis.unibo.it/kp/Chapter8.pdf > > > > > >>> >> >>> > > > [2] > > > > > >>> >> >>> > > > > > > >>> >> > > > > > > https://www.cs.cmu.edu/~xia/resources/Documents/grandl_sigcomm14.pdf > > > > > >>> >> >>> > > > > > > > > >>> >> >>> > > > Best, > > > > > >>> >> >>> > > > Yangze Guo > > > > > >>> >> >>> > > > > > > > > >>> >> >>> > > > On Thu, Aug 15, 2019 at 10:40 PM Xintong Song < > > > > > >>> >> >>> [hidden email]> > > > > > >>> >> >>> > > > wrote: > > > > > >>> >> >>> > > > > > > > > > >>> >> >>> > > > > Hi everyone, > > > > > >>> >> >>> > > > > > > > > > >>> >> >>> > > > > We would like to start a discussion thread on > > > > "FLIP-53: > > > > > >>> Fine > > > > > >>> >> >>> Grained > > > > > >>> >> >>> > > > > Resource Management"[1], where we propose how to > > > > improve > > > > > >>> Flink > > > > > >>> >> >>> > resource > > > > > >>> >> >>> > > > > management and scheduling. > > > > > >>> >> >>> > > > > > > > > > >>> >> >>> > > > > This FLIP mainly discusses the following issues. > > > > > >>> >> >>> > > > > > > > > > >>> >> >>> > > > > - How to support tasks with fine grained > > resource > > > > > >>> >> >>> requirements. > > > > > >>> >> >>> > > > > - How to unify resource management for jobs > > with > > > / > > > > > >>> without > > > > > >>> >> >>> fine > > > > > >>> >> >>> > > > grained > > > > > >>> >> >>> > > > > resource requirements. > > > > > >>> >> >>> > > > > - How to unify resource management for > > streaming > > > / > > > > > >>> batch > > > > > >>> >> jobs. > > > > > >>> >> >>> > > > > > > > > > >>> >> >>> > > > > Key changes proposed in the FLIP are as follows. > > > > > >>> >> >>> > > > > > > > > > >>> >> >>> > > > > - Unify memory management for operators with > / > > > > > without > > > > > >>> fine > > > > > >>> >> >>> > grained > > > > > >>> >> >>> > > > > resource requirements by applying a fraction > > > based > > > > > >>> quota > > > > > >>> >> >>> > mechanism. > > > > > >>> >> >>> > > > > - Unify resource scheduling for streaming and > > > batch > > > > > >>> jobs by > > > > > >>> >> >>> > setting > > > > > >>> >> >>> > > > slot > > > > > >>> >> >>> > > > > sharing groups for pipelined regions during > > > > compiling > > > > > >>> >> stage. > > > > > >>> >> >>> > > > > - Dynamically allocate slots from task > > executors' > > > > > >>> available > > > > > >>> >> >>> > > resources. > > > > > >>> >> >>> > > > > > > > > > >>> >> >>> > > > > Please find more details in the FLIP wiki > document > > > > [1]. > > > > > >>> >> Looking > > > > > >>> >> >>> > forward > > > > > >>> >> >>> > > > to > > > > > >>> >> >>> > > > > your feedbacks. > > > > > >>> >> >>> > > > > > > > > > >>> >> >>> > > > > Thank you~ > > > > > >>> >> >>> > > > > > > > > > >>> >> >>> > > > > Xintong Song > > > > > >>> >> >>> > > > > > > > > > >>> >> >>> > > > > > > > > > >>> >> >>> > > > > [1] > > > > > >>> >> >>> > > > > > > > > > >>> >> >>> > > > > > > > > >>> >> >>> > > > > > > > >>> >> >>> > > > > > > >>> >> >>> > > > > > >>> >> > > > > > >>> > > > > > > > > > > > > > > > https://cwiki.apache.org/confluence/display/FLINK/FLIP-53%3A+Fine+Grained+Resource+Management > > > > > >>> >> >>> > > > > > > > > >>> >> >>> > > > > > > > >>> >> >>> > > > > > > >>> >> >>> > > > > > >>> >> >> > > > > > >>> >> > > > > > >>> > > > > > > >>> > > > > > >> > > > > > > > > > > > > > > > |
Thanks for updating the FLIP, Xintong. It looks good to me. I would be ok to
start a vote for it. Best, Andrey On Wed, Sep 4, 2019 at 10:03 AM Xintong Song <[hidden email]> wrote: > @all > > The FLIP document [1] has been updated. > > Thank you~ > > Xintong Song > > > [1] > > https://cwiki.apache.org/confluence/display/FLINK/FLIP-53%3A+Fine+Grained+Operator+Resource+Management > > On Tue, Sep 3, 2019 at 7:20 PM Zhu Zhu <[hidden email]> wrote: > > > Thanks Xintong for the explanation. > > > > For question #1, I think it's good as long as DataSet job behaviors > remains > > the same. > > > > For question #2, agreed that the resource difference is small enough(at > > most 1 edge diff) in current supported point-wise execution edge > connection > > patterns. > > > > Thanks, > > Zhu Zhu > > > > Xintong Song <[hidden email]> 于2019年9月3日周二 下午6:58写道: > > > > > Thanks for the comments, Zhu & Kurt. > > > > > > Andrey and I also had some discussions offline, and I would like to > first > > > post a summary of our discussion: > > > > > > 1. The motivation of the fraction based approach is to unify > resource > > > management for both operators with specified and unknown resource > > > requirements. > > > 2. The fraction based approach proposed in this FLIP should only > > affect > > > streaming jobs (both bounded and unbounded). For DataSet jobs, there > > are > > > already some fraction based approach (in TaskConfig and > > ChainedDriver), > > > and > > > we do not make any change to the existing approach. > > > 3. The scope of this FLIP does not include discussion of how to set > > > ResourceSpec for operators. > > > 1. For blink jobs, the optimizer can set operator resources for > the > > > users, according to their configurations (default: unknown) > > > 2. For DataStream jobs, there are no method / interface to set > > > operator resources at the moment (1.10). We can have in the > future. > > > 3. For DataSet jobs, there are existing user interfaces to set > > > operator resources. > > > 4. The FLIP should explain more about how ResourceSpecs works > > > 1. PhysicalTransformations (deployed with operators into the > > > StreamTasks) get ResourceSpec: unknown by default or known (e.g. > > > from the > > > Blink planner) > > > 2. While generating stream graph, calculate fractions and set to > > > StreamConfig > > > 3. While scheduling, convert ResourceSpec to ResourceProfile > > > (ResourceSpec + network memory), and deploy to slots / TMs > matching > > > the > > > resources > > > 4. While starting Task in TM, each operator gets fraction > converted > > > back to the original absolute value requested by user or fair > > > unknown share > > > of the slot > > > 5. We should not set `allSourcesInSamePipelinedRegion` to `false` > > for > > > DataSet jobs. Behaviors of DataSet jobs should not be changed. > > > 6. The FLIP document should differentiate works planed in this FLIP > > and > > > the future follow-ups more clearly, by put the follow-ups in a > > separate > > > section > > > 7. Another limitation of the rejected alternative setting fractions > at > > > scheduling time is that, the scheduler implementation does not know > > > which > > > tasks will be deployed into the same slot in advance. > > > > > > Andrey, Please bring it up if there is anything I missed. > > > > > > Zhu, regarding your comments: > > > > > > 1. If we do not set `allSourcesInSamePipelinedRegion` to `false` for > > > DataSet jobs (point 5 in the discussion summary above), then there > > > shouldn't be any regression right? > > > 2. 
I think it makes sense to set the max possible network memory for > > the > > > JobVertex. When you say parallel instances of the same JobVertex may > > > have > > > need different network memory, I guess you mean the rescale > scenarios > > > where > > > parallelisms of upstream / downstream vertex cannot be exactly > divided > > > by > > > parallelism of downstream / upstream vertex? I would say it's > > > acceptable to > > > have slight difference between actually needed and allocated network > > > memory. > > > 3. Yes, by numOpsUseOnHeapManagedMemory I mean > > > numOpsUseOnHeapManagedMemoryInTheSameSharedGroup. I'll update the > doc. > > > 4. Yes, it should be StreamingJobGraphGenerator. Thanks for the > > > correction. > > > > > > > > > Kurt, regarding your comments: > > > > > > 1. I think we don't have network memory in ResourceSpec, which is > the > > > user facing API. We only have network memory in ResourceProfile, > which > > > is > > > used internally for scheduling. The reason we do not expose network > > > memory > > > to the user is that, currently how many network buffers each task > > needs > > > is > > > decided by the topology of execution graph (how many input / output > > > channels it has). > > > 2. In the section "Operator Resource Requirements": "For the first > > > version, we do not support mixing operators with specified / unknown > > > resource requirements in the same job. Either all or none of the > > > operators > > > of the same job should specify their resource requirements. > > > StreamGraphGenerator should check this and throw an error when > mixing > > of > > > specified / unknown resource requirements is detected, during the > > > compilation stage." > > > 3. If the user set a resource requirement, then it is guaranteed > that > > > the task should get at least the much resource, otherwise there > should > > > be > > > an exception. That should be guaranteed by the "Dynamic Slot > > Allocation" > > > approach (FLIP-56). > > > > > > > > > I'll update the FLIP document addressing the comments ASAP. > > > > > > > > > Thank you~ > > > > > > Xintong Song > > > > > > > > > > > > On Tue, Sep 3, 2019 at 2:42 PM Kurt Young <[hidden email]> wrote: > > > > > > > Thanks Xingtong for driving this effort, I haven't finished the whole > > > > document yet, > > > > but have couple of questions: > > > > > > > > 1. Regarding to network memory, the document said it will be derived > by > > > > framework > > > > automatically. I'm wondering whether we should delete this dimension > > from > > > > user- > > > > facing API? > > > > > > > > 2. Regarding to fraction based quota, I don't quite get the meaning > of > > > > "slotSharingGroupOnHeapManagedMem" and > > > "slotSharingGroupOffHeapManagedMem". > > > > What if the sharing group is mixed with specified resource and > UNKNOWN > > > > resource > > > > requirements. > > > > > > > > 3 IIUC, even user had set resource requirements, lets say 500MB > > off-heap > > > > managed > > > > memory, during execution the operator may or may not have 500MB > > off-heap > > > > managed > > > > memory, right? > > > > > > > > Best, > > > > Kurt > > > > > > > > > > > > On Mon, Sep 2, 2019 at 8:36 PM Zhu Zhu <[hidden email]> wrote: > > > > > > > > > Thanks Xintong for proposing this improvement. Fine grained > resources > > > can > > > > > be very helpful when user has good planning on resources. > > > > > > > > > > I have a few questions: > > > > > 1. 
Currently in a batch job, vertices from different regions can > run > > at > > > > the > > > > > same time in slots from the same shared group, as long as they do > not > > > > have > > > > > data dependency on each other and available slot count is not > smaller > > > > than > > > > > the *max* of parallelism of all tasks. > > > > > With changes in this FLIP however, tasks from different regions > > cannot > > > > > share slots anymore. > > > > > Once available slot count is smaller than the *sum* of all > > parallelism > > > of > > > > > tasks from all regions, tasks may need to be executed sequentially, > > > which > > > > > might result in a performance regression. > > > > > Is this(performance regression to existing DataSet jobs) considered > > as > > > a > > > > > necessary and accepted trade off in this FLIP? > > > > > > > > > > 2. The network memory depends on the input/output ExecutionEdge > count > > > and > > > > > thus can be different even for parallel instances of the same > > > JobVertex. > > > > > Does this mean that when adding task resources to calculating the > > slot > > > > > resource for a shared group, the max possible network memory of the > > > > vertex > > > > > instance shall be used? > > > > > This might result in larger resource required than actually needed. > > > > > > > > > > And some minor comments: > > > > > 1. Regarding "fracManagedMemOnHeap = 1 / > > > numOpsUseOnHeapManagedMemory", I > > > > > guess you mean numOpsUseOnHeapManagedMemoryInTheSameSharedGroup ? > > > > > 2. I think the *StreamGraphGenerator* in the #Slot Sharing section > > and > > > > > implementation step 4 should be *StreamingJobGraphGenerator*, as > > > > > *StreamGraphGenerator* is not aware of JobGraph and pipelined > region. > > > > > > > > > > > > > > > Thanks, > > > > > Zhu Zhu > > > > > > > > > > Xintong Song <[hidden email]> 于2019年9月2日周一 上午11:59写道: > > > > > > > > > > > Updated the FLIP wiki page [1], with the following changes. > > > > > > > > > > > > - Remove the step of converting pipelined edges between > > different > > > > slot > > > > > > sharing groups into blocking edges. > > > > > > - Set `allSourcesInSamePipelinedRegion` to true by default. > > > > > > > > > > > > Thank you~ > > > > > > > > > > > > Xintong Song > > > > > > > > > > > > > > > > > > > > > > > > On Mon, Sep 2, 2019 at 11:50 AM Xintong Song < > > [hidden email]> > > > > > > wrote: > > > > > > > > > > > > > Regarding changing edge type, I think actually we don't need to > > do > > > > this > > > > > > > for batch jobs neither, because we don't have public interfaces > > for > > > > > users > > > > > > > to explicitly set slot sharing groups in DataSet API and > > SQL/Table > > > > API. > > > > > > We > > > > > > > have such interfaces in DataStream API only. > > > > > > > > > > > > > > Thank you~ > > > > > > > > > > > > > > Xintong Song > > > > > > > > > > > > > > > > > > > > > > > > > > > > On Tue, Aug 27, 2019 at 10:16 PM Xintong Song < > > > [hidden email] > > > > > > > > > > > > wrote: > > > > > > > > > > > > > >> Thanks for the correction, Till. > > > > > > >> > > > > > > >> Regarding your comments: > > > > > > >> - You are right, we should not change the edge type for > > streaming > > > > > jobs. 
|
Thanks all for joining the discussion.
It seems to me that there is a consensus on the current FLIP document. So if there is no objection, I would like to start the voting process for this FLIP. Thank you~ Xintong Song On Wed, Sep 4, 2019 at 8:23 PM Andrey Zagrebin <[hidden email]> wrote: > Thanks for updating the FLIP Xintong. It looks good to me. I would be ok to > start a vote for it. > > Best, > Andrey > > On Wed, Sep 4, 2019 at 10:03 AM Xintong Song <[hidden email]> > wrote: > > > @all > > > > The FLIP document [1] has been updated. > > > > Thank you~ > > > > Xintong Song > > > > > > [1] > > > > > https://cwiki.apache.org/confluence/display/FLINK/FLIP-53%3A+Fine+Grained+Operator+Resource+Management > > > > On Tue, Sep 3, 2019 at 7:20 PM Zhu Zhu <[hidden email]> wrote: > > > > > Thanks Xintong for the explanation. > > > > > > For question #1, I think it's good as long as DataSet job behaviors > > remains > > > the same. > > > > > > For question #2, agreed that the resource difference is small enough(at > > > most 1 edge diff) in current supported point-wise execution edge > > connection > > > patterns. > > > > > > Thanks, > > > Zhu Zhu > > > > > > Xintong Song <[hidden email]> 于2019年9月3日周二 下午6:58写道: > > > > > > > Thanks for the comments, Zhu & Kurt. > > > > > > > > Andrey and I also had some discussions offline, and I would like to > > first > > > > post a summary of our discussion: > > > > > > > > 1. The motivation of the fraction based approach is to unify > > resource > > > > management for both operators with specified and unknown resource > > > > requirements. > > > > 2. The fraction based approach proposed in this FLIP should only > > > affect > > > > streaming jobs (both bounded and unbounded). For DataSet jobs, > there > > > are > > > > already some fraction based approach (in TaskConfig and > > > ChainedDriver), > > > > and > > > > we do not make any change to the existing approach. > > > > 3. The scope of this FLIP does not include discussion of how to > set > > > > ResourceSpec for operators. > > > > 1. For blink jobs, the optimizer can set operator resources for > > the > > > > users, according to their configurations (default: unknown) > > > > 2. For DataStream jobs, there are no method / interface to set > > > > operator resources at the moment (1.10). We can have in the > > future. > > > > 3. For DataSet jobs, there are existing user interfaces to set > > > > operator resources. > > > > 4. The FLIP should explain more about how ResourceSpecs works > > > > 1. PhysicalTransformations (deployed with operators into the > > > > StreamTasks) get ResourceSpec: unknown by default or known > (e.g. > > > > from the > > > > Blink planner) > > > > 2. While generating stream graph, calculate fractions and set > to > > > > StreamConfig > > > > 3. While scheduling, convert ResourceSpec to ResourceProfile > > > > (ResourceSpec + network memory), and deploy to slots / TMs > > matching > > > > the > > > > resources > > > > 4. While starting Task in TM, each operator gets fraction > > converted > > > > back to the original absolute value requested by user or fair > > > > unknown share > > > > of the slot > > > > 5. We should not set `allSourcesInSamePipelinedRegion` to > `false` > > > for > > > > DataSet jobs. Behaviors of DataSet jobs should not be changed. > > > > 6. The FLIP document should differentiate works planed in this > FLIP > > > and > > > > the future follow-ups more clearly, by put the follow-ups in a > > > separate > > > > section > > > > 7. 
Another limitation of the rejected alternative setting > fractions > > at > > > > scheduling time is that, the scheduler implementation does not > know > > > > which > > > > tasks will be deployed into the same slot in advance. > > > > > > > > Andrey, Please bring it up if there is anything I missed. > > > > > > > > Zhu, regarding your comments: > > > > > > > > 1. If we do not set `allSourcesInSamePipelinedRegion` to `false` > for > > > > DataSet jobs (point 5 in the discussion summary above), then there > > > > shouldn't be any regression right? > > > > 2. I think it makes sense to set the max possible network memory > for > > > the > > > > JobVertex. When you say parallel instances of the same JobVertex > may > > > > have > > > > need different network memory, I guess you mean the rescale > > scenarios > > > > where > > > > parallelisms of upstream / downstream vertex cannot be exactly > > divided > > > > by > > > > parallelism of downstream / upstream vertex? I would say it's > > > > acceptable to > > > > have slight difference between actually needed and allocated > network > > > > memory. > > > > 3. Yes, by numOpsUseOnHeapManagedMemory I mean > > > > numOpsUseOnHeapManagedMemoryInTheSameSharedGroup. I'll update the > > doc. > > > > 4. Yes, it should be StreamingJobGraphGenerator. Thanks for the > > > > correction. > > > > > > > > > > > > Kurt, regarding your comments: > > > > > > > > 1. I think we don't have network memory in ResourceSpec, which is > > the > > > > user facing API. We only have network memory in ResourceProfile, > > which > > > > is > > > > used internally for scheduling. The reason we do not expose > network > > > > memory > > > > to the user is that, currently how many network buffers each task > > > needs > > > > is > > > > decided by the topology of execution graph (how many input / > output > > > > channels it has). > > > > 2. In the section "Operator Resource Requirements": "For the first > > > > version, we do not support mixing operators with specified / > unknown > > > > resource requirements in the same job. Either all or none of the > > > > operators > > > > of the same job should specify their resource requirements. > > > > StreamGraphGenerator should check this and throw an error when > > mixing > > > of > > > > specified / unknown resource requirements is detected, during the > > > > compilation stage." > > > > 3. If the user set a resource requirement, then it is guaranteed > > that > > > > the task should get at least the much resource, otherwise there > > should > > > > be > > > > an exception. That should be guaranteed by the "Dynamic Slot > > > Allocation" > > > > approach (FLIP-56). > > > > > > > > > > > > I'll update the FLIP document addressing the comments ASAP. > > > > > > > > > > > > Thank you~ > > > > > > > > Xintong Song > > > > > > > > > > > > > > > > On Tue, Sep 3, 2019 at 2:42 PM Kurt Young <[hidden email]> wrote: > > > > > > > > > Thanks Xingtong for driving this effort, I haven't finished the > whole > > > > > document yet, > > > > > but have couple of questions: > > > > > > > > > > 1. Regarding to network memory, the document said it will be > derived > > by > > > > > framework > > > > > automatically. I'm wondering whether we should delete this > dimension > > > from > > > > > user- > > > > > facing API? > > > > > > > > > > 2. Regarding to fraction based quota, I don't quite get the meaning > > of > > > > > "slotSharingGroupOnHeapManagedMem" and > > > > "slotSharingGroupOffHeapManagedMem". 
> > > > > What if the sharing group is mixed with specified resource and > > UNKNOWN > > > > > resource > > > > > requirements. > > > > > > > > > > 3 IIUC, even user had set resource requirements, lets say 500MB > > > off-heap > > > > > managed > > > > > memory, during execution the operator may or may not have 500MB > > > off-heap > > > > > managed > > > > > memory, right? > > > > > > > > > > Best, > > > > > Kurt > > > > > > > > > > > > > > > On Mon, Sep 2, 2019 at 8:36 PM Zhu Zhu <[hidden email]> wrote: > > > > > > > > > > > Thanks Xintong for proposing this improvement. Fine grained > > resources > > > > can > > > > > > be very helpful when user has good planning on resources. > > > > > > > > > > > > I have a few questions: > > > > > > 1. Currently in a batch job, vertices from different regions can > > run > > > at > > > > > the > > > > > > same time in slots from the same shared group, as long as they do > > not > > > > > have > > > > > > data dependency on each other and available slot count is not > > smaller > > > > > than > > > > > > the *max* of parallelism of all tasks. > > > > > > With changes in this FLIP however, tasks from different regions > > > cannot > > > > > > share slots anymore. > > > > > > Once available slot count is smaller than the *sum* of all > > > parallelism > > > > of > > > > > > tasks from all regions, tasks may need to be executed > sequentially, > > > > which > > > > > > might result in a performance regression. > > > > > > Is this(performance regression to existing DataSet jobs) > considered > > > as > > > > a > > > > > > necessary and accepted trade off in this FLIP? > > > > > > > > > > > > 2. The network memory depends on the input/output ExecutionEdge > > count > > > > and > > > > > > thus can be different even for parallel instances of the same > > > > JobVertex. > > > > > > Does this mean that when adding task resources to calculating the > > > slot > > > > > > resource for a shared group, the max possible network memory of > the > > > > > vertex > > > > > > instance shall be used? > > > > > > This might result in larger resource required than actually > needed. > > > > > > > > > > > > And some minor comments: > > > > > > 1. Regarding "fracManagedMemOnHeap = 1 / > > > > numOpsUseOnHeapManagedMemory", I > > > > > > guess you mean numOpsUseOnHeapManagedMemoryInTheSameSharedGroup ? > > > > > > 2. I think the *StreamGraphGenerator* in the #Slot Sharing > section > > > and > > > > > > implementation step 4 should be *StreamingJobGraphGenerator*, as > > > > > > *StreamGraphGenerator* is not aware of JobGraph and pipelined > > region. > > > > > > > > > > > > > > > > > > Thanks, > > > > > > Zhu Zhu > > > > > > > > > > > > Xintong Song <[hidden email]> 于2019年9月2日周一 上午11:59写道: > > > > > > > > > > > > > Updated the FLIP wiki page [1], with the following changes. > > > > > > > > > > > > > > - Remove the step of converting pipelined edges between > > > different > > > > > slot > > > > > > > sharing groups into blocking edges. > > > > > > > - Set `allSourcesInSamePipelinedRegion` to true by default. 
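To make the fraction-based quota mechanism discussed above a bit more concrete, here is a minimal illustrative sketch in Java. It is not Flink code; apart from the quoted formula fracManagedMemOnHeap = 1 / numOpsUseOnHeapManagedMemoryInTheSameSharedGroup, all names and the proportional-split rule for specified resources are assumptions based on the wording of the thread (e.g. "slotSharingGroupOnHeapManagedMem").

    // Illustrative sketch only -- these are not Flink's actual classes.
    final class ManagedMemoryFractions {

        // UNKNOWN resources: every operator using on-heap managed memory in the
        // same shared group gets an equal fraction of the slot's budget.
        static double fractionForUnknown(int numOpsUseOnHeapManagedMemoryInTheSameSharedGroup) {
            return 1.0 / numOpsUseOnHeapManagedMemoryInTheSameSharedGroup;
        }

        // Specified resources: the operator's share of the slot sharing group's
        // total requested on-heap managed memory (assumed, based on the option names).
        static double fractionForSpecified(long opOnHeapManagedMem, long slotSharingGroupOnHeapManagedMem) {
            return (double) opOnHeapManagedMem / slotSharingGroupOnHeapManagedMem;
        }

        // At task start-up the fraction is converted back into an absolute budget
        // of the slot, matching the summary above.
        static long absoluteBudget(double fraction, long slotOnHeapManagedMem) {
            return (long) (fraction * slotOnHeapManagedMem);
        }
    }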
> > > > > > > > > > > > > > Thank you~ > > > > > > > > > > > > > > Xintong Song > > > > > > > > > > > > > > > > > > > > > > > > > > > > On Mon, Sep 2, 2019 at 11:50 AM Xintong Song < > > > [hidden email]> > > > > > > > wrote: > > > > > > > > > > > > > > > Regarding changing edge type, I think actually we don't need > to > > > do > > > > > this > > > > > > > > for batch jobs neither, because we don't have public > interfaces > > > for > > > > > > users > > > > > > > > to explicitly set slot sharing groups in DataSet API and > > > SQL/Table > > > > > API. > > > > > > > We > > > > > > > > have such interfaces in DataStream API only. > > > > > > > > > > > > > > > > Thank you~ > > > > > > > > > > > > > > > > Xintong Song > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > On Tue, Aug 27, 2019 at 10:16 PM Xintong Song < > > > > [hidden email] > > > > > > > > > > > > > > wrote: > > > > > > > > > > > > > > > >> Thanks for the correction, Till. > > > > > > > >> > > > > > > > >> Regarding your comments: > > > > > > > >> - You are right, we should not change the edge type for > > > streaming > > > > > > jobs. > > > > > > > >> Then I think we can change the option > > > > > > 'allSourcesInSamePipelinedRegion' > > > > > > > in > > > > > > > >> step 2 to 'isStreamingJob', and implement the current step 2 > > > > before > > > > > > the > > > > > > > >> current step 1 so we can use this option to decide whether > > > should > > > > > > change > > > > > > > >> the edge type. What do you think? > > > > > > > >> - Agree. It should be easier to make the default value of > > > > > > > >> 'allSourcesInSamePipelinedRegion' (or 'isStreamingJob') > > 'true', > > > > and > > > > > > set > > > > > > > it > > > > > > > >> to 'false' when using DataSet API or blink planner. > > > > > > > >> > > > > > > > >> Thank you~ > > > > > > > >> > > > > > > > >> Xintong Song > > > > > > > >> > > > > > > > >> > > > > > > > >> > > > > > > > >> On Tue, Aug 27, 2019 at 8:59 PM Till Rohrmann < > > > > [hidden email] > > > > > > > > > > > > > >> wrote: > > > > > > > >> > > > > > > > >>> Thanks for creating the implementation plan Xintong. > Overall, > > > the > > > > > > > >>> implementation plan looks good. I had a couple of comments: > > > > > > > >>> > > > > > > > >>> - What will happen if a user has defined a streaming job > with > > > two > > > > > > slot > > > > > > > >>> sharing groups? Would the code insert a blocking data > > exchange > > > > > > between > > > > > > > >>> these two groups? If yes, then this breaks existing Flink > > > > streaming > > > > > > > jobs. > > > > > > > >>> - How do we detect unbounded streaming jobs to set > > > > > > > >>> the allSourcesInSamePipelinedRegion to `true`? Wouldn't it > be > > > > > easier > > > > > > to > > > > > > > >>> set > > > > > > > >>> it false if we are using the DataSet API or the Blink > planner > > > > with > > > > > a > > > > > > > >>> bounded job? 
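The edge-type and 'isStreamingJob' discussion above essentially decides when pipelined regions get their own slot sharing group. A purely hypothetical sketch of that decision follows; PipelinedRegion and setSlotSharingGroup are stand-in names, and per the thread the real logic would sit in the StreamingJobGraphGenerator.

    // Stand-in types only; not the actual compiler / scheduler classes.
    import java.util.List;

    final class RegionSlotSharingAssigner {

        // Streaming jobs keep their existing sharing groups unchanged (simplified
        // here to the default group); bounded DataSet / Blink batch jobs get one
        // sharing group per pipelined region, as proposed above.
        static void assign(List<PipelinedRegion> regions, boolean isStreamingJob) {
            int index = 0;
            for (PipelinedRegion region : regions) {
                region.setSlotSharingGroup(isStreamingJob ? "default" : "region-" + index++);
            }
        }

        interface PipelinedRegion {
            void setSlotSharingGroup(String groupName);
        }
    }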
> > > > > > > >>> > > > > > > > >>> Cheers, > > > > > > > >>> Till > > > > > > > >>> > > > > > > > >>> On Tue, Aug 27, 2019 at 2:16 PM Till Rohrmann < > > > > > [hidden email]> > > > > > > > >>> wrote: > > > > > > > >>> > > > > > > > >>> > I guess there is a typo since the link to the FLIP-53 is > > > > > > > >>> > > > > > > > > >>> > > > > > > > > > > > > > > > > > > > > > > > > > > > > https://cwiki.apache.org/confluence/display/FLINK/FLIP-53%3A+Fine+Grained+Operator+Resource+Management > > > > > > > >>> > > > > > > > > >>> > Cheers, > > > > > > > >>> > Till > > > > > > > >>> > > > > > > > > >>> > On Tue, Aug 27, 2019 at 1:42 PM Xintong Song < > > > > > > [hidden email]> > > > > > > > >>> > wrote: > > > > > > > >>> > > > > > > > > >>> >> Added implementation steps for this FLIP on the wiki > page > > > [1]. > > > > > > > >>> >> > > > > > > > >>> >> > > > > > > > >>> >> Thank you~ > > > > > > > >>> >> > > > > > > > >>> >> Xintong Song > > > > > > > >>> >> > > > > > > > >>> >> > > > > > > > >>> >> [1] > > > > > > > >>> >> > > > > > > > >>> >> > > > > > > > >>> > > > > > > > > > > > > > > > > > > > > > > > > > > > > https://cwiki.apache.org/confluence/display/FLINK/FLIP-49%3A+Unified+Memory+Configuration+for+TaskExecutors > > > > > > > >>> >> > > > > > > > >>> >> On Mon, Aug 19, 2019 at 10:29 PM Xintong Song < > > > > > > > [hidden email]> > > > > > > > >>> >> wrote: > > > > > > > >>> >> > > > > > > > >>> >> > Hi everyone, > > > > > > > >>> >> > > > > > > > > >>> >> > As Till suggested, the original "FLIP-53: Fine Grained > > > > > Resource > > > > > > > >>> >> > Management" splits into two separate FLIPs, > > > > > > > >>> >> > > > > > > > > >>> >> > - FLIP-53: Fine Grained Operator Resource > Management > > > [1] > > > > > > > >>> >> > - FLIP-56: Dynamic Slot Allocation [2] > > > > > > > >>> >> > > > > > > > > >>> >> > We'll continue using this discussion thread for > FLIP-53. > > > For > > > > > > > >>> FLIP-56, I > > > > > > > >>> >> > just started a new discussion thread [3]. > > > > > > > >>> >> > > > > > > > > >>> >> > Thank you~ > > > > > > > >>> >> > > > > > > > > >>> >> > Xintong Song > > > > > > > >>> >> > > > > > > > > >>> >> > > > > > > > > >>> >> > [1] > > > > > > > >>> >> > > > > > > > > >>> >> > > > > > > > >>> > > > > > > > > > > > > > > > > > > > > > > > > > > > > https://cwiki.apache.org/confluence/display/FLINK/FLIP-53%3A+Fine+Grained+Operator+Resource+Management > > > > > > > >>> >> > > > > > > > > >>> >> > [2] > > > > > > > >>> >> > > > > > > > > >>> >> > > > > > > > >>> > > > > > > > > > > > > > > > > > > > > > > > > > > > > https://cwiki.apache.org/confluence/display/FLINK/FLIP-56%3A+Dynamic+Slot+Allocation > > > > > > > >>> >> > > > > > > > > >>> >> > [3] > > > > > > > >>> >> > > > > > > > > >>> >> > > > > > > > >>> > > > > > > > > > > > > > > > > > > > > > > > > > > > > http://apache-flink-mailing-list-archive.1008284.n3.nabble.com/DISCUSS-FLIP-56-Dynamic-Slot-Allocation-td31960.html > > > > > > > >>> >> > > > > > > > > >>> >> > On Mon, Aug 19, 2019 at 2:55 PM Xintong Song < > > > > > > > [hidden email] > > > > > > > >>> > > > > > > > > >>> >> > wrote: > > > > > > > >>> >> > > > > > > > > >>> >> >> Thinks for the comments, Yang. > > > > > > > >>> >> >> > > > > > > > >>> >> >> Regarding your questions: > > > > > > > >>> >> >> > > > > > > > >>> >> >> 1. How to calculate the resource specification of > > > > > > > TaskManagers? 
> > > > > > > >>> Do > > > > > > > >>> >> they > > > > > > > >>> >> >>> have them same resource spec calculated based on > > the > > > > > > > >>> >> configuration? I > > > > > > > >>> >> >>> think > > > > > > > >>> >> >>> we still have wasted resources in this situation. > > Or > > > we > > > > > > could > > > > > > > >>> start > > > > > > > >>> >> >>> TaskManagers with different spec. > > > > > > > >>> >> >>> > > > > > > > >>> >> >> I agree with you that we can further improve the > > resource > > > > > > utility > > > > > > > >>> by > > > > > > > >>> >> >> customizing task executors with different resource > > > > > > > specifications. > > > > > > > >>> >> However, > > > > > > > >>> >> >> I'm in favor of limiting the scope of this FLIP and > > leave > > > > it > > > > > > as a > > > > > > > >>> >> future > > > > > > > >>> >> >> optimization. The plan for that part is to move the > > logic > > > > of > > > > > > > >>> deciding > > > > > > > >>> >> task > > > > > > > >>> >> >> executor specifications into the slot manager and > make > > > slot > > > > > > > manager > > > > > > > >>> >> >> pluggable, so inside the slot manager plugin we can > > have > > > > > > > different > > > > > > > >>> >> logics > > > > > > > >>> >> >> for deciding the task executor specifications. > > > > > > > >>> >> >> > > > > > > > >>> >> >> > > > > > > > >>> >> >>> 2. If a slot is released and returned to > SlotPool, > > > does > > > > > it > > > > > > > >>> could be > > > > > > > >>> >> >>> reused by other SlotRequest that the request > > resource > > > > is > > > > > > > >>> smaller > > > > > > > >>> >> than > > > > > > > >>> >> >>> it? > > > > > > > >>> >> >>> > > > > > > > >>> >> >> No, I think slot pool should always return slots if > > they > > > do > > > > > not > > > > > > > >>> exactly > > > > > > > >>> >> >> match the pending requests, so that resource manager > > can > > > > deal > > > > > > > with > > > > > > > >>> the > > > > > > > >>> >> >> extra resources. > > > > > > > >>> >> >> > > > > > > > >>> >> >>> - If it is yes, what happens to the available > > > > resource > > > > > > in > > > > > > > >>> the > > > > > > > >>> >> >> > > > > > > > >>> >> >> TaskManager. > > > > > > > >>> >> >>> - What is the SlotStatus of the cached slot in > > > > > SlotPool? > > > > > > > The > > > > > > > >>> >> >>> AllocationId is null? > > > > > > > >>> >> >>> > > > > > > > >>> >> >> The allocation id does not change as long as the slot > > is > > > > not > > > > > > > >>> returned > > > > > > > >>> >> >> from the job master, no matter its occupied or > > available > > > in > > > > > the > > > > > > > >>> slot > > > > > > > >>> >> pool. > > > > > > > >>> >> >> I think we have the same behavior currently. No > matter > > > how > > > > > many > > > > > > > >>> tasks > > > > > > > >>> >> the > > > > > > > >>> >> >> job master deploy into the slot, concurrently or > > > > > sequentially, > > > > > > it > > > > > > > >>> is > > > > > > > >>> >> one > > > > > > > >>> >> >> allocation from the cluster to the job until the slot > > is > > > > > freed > > > > > > > from > > > > > > > >>> >> the job > > > > > > > >>> >> >> master. > > > > > > > >>> >> >> > > > > > > > >>> >> >>> 3. In a session cluster, some jobs are configured > > > with > > > > > > > operator > > > > > > > >>> >> >>> resources, meanwhile other jobs are using > UNKNOWN. > > > How > > > > to > > > > > > > deal > > > > > > > >>> with > > > > > > > >>> >> >>> this > > > > > > > >>> >> >>> situation? 
> > > > > > > >>> >> >> > > > > > > > >>> >> >> As long as we do not mix unknown / specified resource > > > > > profiles > > > > > > > >>> within > > > > > > > >>> >> the > > > > > > > >>> >> >> same job / slot, there shouldn't be a problem. > Resource > > > > > manager > > > > > > > >>> >> converts > > > > > > > >>> >> >> unknown resource profiles in slot requests to > specified > > > > > default > > > > > > > >>> >> resource > > > > > > > >>> >> >> profiles, so they can be dynamically allocated from > > task > > > > > > > executors' > > > > > > > >>> >> >> available resources just as other slot requests with > > > > > specified > > > > > > > >>> resource > > > > > > > >>> >> >> profiles. > > > > > > > >>> >> >> > > > > > > > >>> >> >> Thank you~ > > > > > > > >>> >> >> > > > > > > > >>> >> >> Xintong Song > > > > > > > >>> >> >> > > > > > > > >>> >> >> > > > > > > > >>> >> >> > > > > > > > >>> >> >> On Mon, Aug 19, 2019 at 11:39 AM Yang Wang < > > > > > > > [hidden email]> > > > > > > > >>> >> wrote: > > > > > > > >>> >> >> > > > > > > > >>> >> >>> Hi Xintong, > > > > > > > >>> >> >>> > > > > > > > >>> >> >>> > > > > > > > >>> >> >>> Thanks for your detailed proposal. I think many > users > > > are > > > > > > > >>> suffering > > > > > > > >>> >> from > > > > > > > >>> >> >>> waste of resources. The resource spec of all task > > > managers > > > > > are > > > > > > > >>> same > > > > > > > >>> >> and > > > > > > > >>> >> >>> we > > > > > > > >>> >> >>> have to increase all task managers to make the heavy > > one > > > > > more > > > > > > > >>> stable. > > > > > > > >>> >> So > > > > > > > >>> >> >>> we > > > > > > > >>> >> >>> will benefit from the fine grained resource > > management a > > > > > lot. > > > > > > We > > > > > > > >>> could > > > > > > > >>> >> >>> get > > > > > > > >>> >> >>> better resource utilization and stability. > > > > > > > >>> >> >>> > > > > > > > >>> >> >>> > > > > > > > >>> >> >>> Just to share some thoughts. > > > > > > > >>> >> >>> > > > > > > > >>> >> >>> > > > > > > > >>> >> >>> > > > > > > > >>> >> >>> 1. How to calculate the resource specification of > > > > > > > >>> TaskManagers? Do > > > > > > > >>> >> >>> they > > > > > > > >>> >> >>> have them same resource spec calculated based on > > the > > > > > > > >>> >> configuration? I > > > > > > > >>> >> >>> think > > > > > > > >>> >> >>> we still have wasted resources in this situation. > > Or > > > we > > > > > > could > > > > > > > >>> start > > > > > > > >>> >> >>> TaskManagers with different spec. > > > > > > > >>> >> >>> 2. If a slot is released and returned to > SlotPool, > > > does > > > > > it > > > > > > > >>> could be > > > > > > > >>> >> >>> reused by other SlotRequest that the request > > resource > > > > is > > > > > > > >>> smaller > > > > > > > >>> >> than > > > > > > > >>> >> >>> it? > > > > > > > >>> >> >>> - If it is yes, what happens to the available > > > > resource > > > > > > in > > > > > > > >>> the > > > > > > > >>> >> >>> TaskManager. > > > > > > > >>> >> >>> - What is the SlotStatus of the cached slot in > > > > > SlotPool? > > > > > > > The > > > > > > > >>> >> >>> AllocationId is null? > > > > > > > >>> >> >>> 3. In a session cluster, some jobs are configured > > > with > > > > > > > operator > > > > > > > >>> >> >>> resources, meanwhile other jobs are using > UNKNOWN. > > > How > > > > to > > > > > > > deal > > > > > > > >>> with > > > > > > > >>> >> >>> this > > > > > > > >>> >> >>> situation? 
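A rough sketch of the conversion described above, where the resource manager maps slot requests with UNKNOWN resource profiles to a configured default profile before matching them against task executors' available resources. ResourceProfile here is a made-up stand-in, not Flink's actual class.

    // Stand-in types; only illustrates the UNKNOWN -> default mapping.
    final class SlotRequestNormalizer {

        private final ResourceProfile defaultSlotProfile;

        SlotRequestNormalizer(ResourceProfile defaultSlotProfile) {
            this.defaultSlotProfile = defaultSlotProfile;
        }

        // Requests with UNKNOWN profiles are replaced by the default slot profile,
        // so they can be matched and cut from task executors' free resources just
        // like requests with specified profiles.
        ResourceProfile normalize(ResourceProfile requested) {
            return requested.isUnknown() ? defaultSlotProfile : requested;
        }

        interface ResourceProfile {
            boolean isUnknown();
        }
    }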
> > > > > > > >>> >> >>> > > > > > > > >>> >> >>> > > > > > > > >>> >> >>> > > > > > > > >>> >> >>> Best, > > > > > > > >>> >> >>> Yang > > > > > > > >>> >> >>> > > > > > > > >>> >> >>> Xintong Song <[hidden email]> 于2019年8月16日周五 > > > > > 下午8:57写道: > > > > > > > >>> >> >>> > > > > > > > >>> >> >>> > Thanks for the feedbacks, Yangze and Till. > > > > > > > >>> >> >>> > > > > > > > > >>> >> >>> > Yangze, > > > > > > > >>> >> >>> > > > > > > > > >>> >> >>> > I agree with you that we should make scheduling > > > strategy > > > > > > > >>> pluggable > > > > > > > >>> >> and > > > > > > > >>> >> >>> > optimize the strategy to reduce the memory > > > fragmentation > > > > > > > >>> problem, > > > > > > > >>> >> and > > > > > > > >>> >> >>> > thanks for the inputs on the potential algorithmic > > > > > > solutions. > > > > > > > >>> >> However, > > > > > > > >>> >> >>> I'm > > > > > > > >>> >> >>> > in favor of keep this FLIP focusing on the overall > > > > > mechanism > > > > > > > >>> design > > > > > > > >>> >> >>> rather > > > > > > > >>> >> >>> > than strategies. Solving the fragmentation issue > > > should > > > > be > > > > > > > >>> >> considered > > > > > > > >>> >> >>> as an > > > > > > > >>> >> >>> > optimization, and I agree with Till that we > probably > > > > > should > > > > > > > >>> tackle > > > > > > > >>> >> this > > > > > > > >>> >> >>> > afterwards. > > > > > > > >>> >> >>> > > > > > > > > >>> >> >>> > Till, > > > > > > > >>> >> >>> > > > > > > > > >>> >> >>> > - Regarding splitting the FLIP, I think it makes > > > sense. > > > > > The > > > > > > > >>> operator > > > > > > > >>> >> >>> > resource management and dynamic slot allocation do > > not > > > > > have > > > > > > > much > > > > > > > >>> >> >>> dependency > > > > > > > >>> >> >>> > on each other. > > > > > > > >>> >> >>> > > > > > > > > >>> >> >>> > - Regarding the default slot size, I think this is > > > > similar > > > > > > to > > > > > > > >>> >> FLIP-49 > > > > > > > >>> >> >>> [1] > > > > > > > >>> >> >>> > where we want all the deriving happens at one > > place. I > > > > > think > > > > > > > it > > > > > > > >>> >> would > > > > > > > >>> >> >>> be > > > > > > > >>> >> >>> > nice to pass the default slot size into the task > > > > executor > > > > > in > > > > > > > the > > > > > > > >>> >> same > > > > > > > >>> >> >>> way > > > > > > > >>> >> >>> > that we pass in the memory pool sizes in FLIP-49 > > [1]. > > > > > > > >>> >> >>> > > > > > > > > >>> >> >>> > - Regarding the return value of > > > > > > > >>> >> TaskExecutorGateway#requestResource, I > > > > > > > >>> >> >>> > think you're right. We should avoid using null as > > the > > > > > return > > > > > > > >>> value. > > > > > > > >>> >> I > > > > > > > >>> >> >>> think > > > > > > > >>> >> >>> > we probably should thrown an exception here. 
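Either alternative raised above avoids a null return from TaskExecutorGateway#requestResource: throwing an exception, or returning an explicit result. A small hypothetical sketch of the explicit-result variant; SlotAllocationResponse is an invented name, not part of the actual gateway interface.

    // Hypothetical sketch of the explicit-result alternative.
    import java.util.Optional;

    final class SlotAllocationResponse {

        private final String slotId;          // non-null when the request was fulfilled
        private final String rejectionReason; // non-null when the request was rejected

        private SlotAllocationResponse(String slotId, String rejectionReason) {
            this.slotId = slotId;
            this.rejectionReason = rejectionReason;
        }

        static SlotAllocationResponse allocated(String slotId) {
            return new SlotAllocationResponse(slotId, null);
        }

        static SlotAllocationResponse rejected(String reason) {
            return new SlotAllocationResponse(null, reason);
        }

        boolean isAllocated() {
            return slotId != null;
        }

        Optional<String> rejectionReason() {
            return Optional.ofNullable(rejectionReason);
        }
    }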
> > > > > > > >>> >> >>> > > > > > > > > >>> >> >>> > Thank you~ > > > > > > > >>> >> >>> > > > > > > > > >>> >> >>> > Xintong Song > > > > > > > >>> >> >>> > > > > > > > > >>> >> >>> > > > > > > > > >>> >> >>> > [1] > > > > > > > >>> >> >>> > > > > > > > > >>> >> >>> > > > > > > > > >>> >> >>> > > > > > > > >>> >> > > > > > > > >>> > > > > > > > > > > > > > > > > > > > > > > > > > > > > https://cwiki.apache.org/confluence/display/FLINK/FLIP-49%3A+Unified+Memory+Configuration+for+TaskExecutors > > > > > > > >>> >> >>> > > > > > > > > >>> >> >>> > On Fri, Aug 16, 2019 at 2:18 PM Till Rohrmann < > > > > > > > >>> [hidden email] > > > > > > > >>> >> > > > > > > > > >>> >> >>> > wrote: > > > > > > > >>> >> >>> > > > > > > > > >>> >> >>> > > Hi Xintong, > > > > > > > >>> >> >>> > > > > > > > > > >>> >> >>> > > thanks for drafting this FLIP. I think your > > proposal > > > > > helps > > > > > > > to > > > > > > > >>> >> >>> improve the > > > > > > > >>> >> >>> > > execution of batch jobs more efficiently. > > Moreover, > > > it > > > > > > > >>> enables the > > > > > > > >>> >> >>> proper > > > > > > > >>> >> >>> > > integration of the Blink planner which is very > > > > important > > > > > > as > > > > > > > >>> well. > > > > > > > >>> >> >>> > > > > > > > > > >>> >> >>> > > Overall, the FLIP looks good to me. I was > > wondering > > > > > > whether > > > > > > > it > > > > > > > >>> >> >>> wouldn't > > > > > > > >>> >> >>> > > make sense to actually split it up into two > FLIPs: > > > > > > Operator > > > > > > > >>> >> resource > > > > > > > >>> >> >>> > > management and dynamic slot allocation. I think > > > these > > > > > two > > > > > > > >>> FLIPs > > > > > > > >>> >> >>> could be > > > > > > > >>> >> >>> > > seen as orthogonal and it would decrease the > scope > > > of > > > > > each > > > > > > > >>> >> individual > > > > > > > >>> >> >>> > FLIP. > > > > > > > >>> >> >>> > > > > > > > > > >>> >> >>> > > Some smaller comments: > > > > > > > >>> >> >>> > > > > > > > > > >>> >> >>> > > - I'm not sure whether we should pass in the > > default > > > > > slot > > > > > > > size > > > > > > > >>> >> via an > > > > > > > >>> >> >>> > > environment variable. Without having unified the > > way > > > > how > > > > > > > Flink > > > > > > > >>> >> >>> components > > > > > > > >>> >> >>> > > are configured [1], I think it would be better > to > > > pass > > > > > it > > > > > > in > > > > > > > >>> as > > > > > > > >>> >> part > > > > > > > >>> >> >>> of > > > > > > > >>> >> >>> > the > > > > > > > >>> >> >>> > > configuration. > > > > > > > >>> >> >>> > > - I would avoid returning a null value from > > > > > > > >>> >> >>> > > TaskExecutorGateway#requestResource if it cannot > > be > > > > > > > fulfilled. > > > > > > > >>> >> >>> Either we > > > > > > > >>> >> >>> > > should introduce an explicit return value saying > > > this > > > > or > > > > > > > >>> throw an > > > > > > > >>> >> >>> > > exception. > > > > > > > >>> >> >>> > > > > > > > > > >>> >> >>> > > Concerning Yangze's comments: I think you are > > right > > > > that > > > > > > it > > > > > > > >>> would > > > > > > > >>> >> be > > > > > > > >>> >> >>> > > helpful to make the selection strategy > pluggable. > > > Also > > > > > > > >>> batching > > > > > > > >>> >> slot > > > > > > > >>> >> >>> > > requests to the RM could be a good optimization. 
> > For > > > > the > > > > > > > sake > > > > > > > >>> of > > > > > > > >>> >> >>> keeping > > > > > > > >>> >> >>> > > the scope of this FLIP smaller I would try to > > tackle > > > > > these > > > > > > > >>> things > > > > > > > >>> >> >>> after > > > > > > > >>> >> >>> > the > > > > > > > >>> >> >>> > > initial version has been completed (without > > spoiling > > > > > these > > > > > > > >>> >> >>> optimization > > > > > > > >>> >> >>> > > opportunities). In particular batching the slot > > > > requests > > > > > > > >>> depends > > > > > > > >>> >> on > > > > > > > >>> >> >>> the > > > > > > > >>> >> >>> > > current scheduler refactoring and could also be > > > > realized > > > > > > on > > > > > > > >>> the RM > > > > > > > >>> >> >>> side > > > > > > > >>> >> >>> > > only. > > > > > > > >>> >> >>> > > > > > > > > > >>> >> >>> > > [1] > > > > > > > >>> >> >>> > > > > > > > > > >>> >> >>> > > > > > > > > > >>> >> >>> > > > > > > > > >>> >> >>> > > > > > > > >>> >> > > > > > > > >>> > > > > > > > > > > > > > > > > > > > > > > > > > > > > https://cwiki.apache.org/confluence/display/FLINK/FLIP-54%3A+Evolve+ConfigOption+and+Configuration > > > > > > > >>> >> >>> > > > > > > > > > >>> >> >>> > > Cheers, > > > > > > > >>> >> >>> > > Till > > > > > > > >>> >> >>> > > > > > > > > > >>> >> >>> > > > > > > > > > >>> >> >>> > > > > > > > > > >>> >> >>> > > On Fri, Aug 16, 2019 at 11:11 AM Yangze Guo < > > > > > > > >>> [hidden email]> > > > > > > > >>> >> >>> wrote: > > > > > > > >>> >> >>> > > > > > > > > > >>> >> >>> > > > Hi, Xintong > > > > > > > >>> >> >>> > > > > > > > > > > >>> >> >>> > > > Thanks to propose this FLIP. The general > design > > > > looks > > > > > > good > > > > > > > >>> to > > > > > > > >>> >> me, > > > > > > > >>> >> >>> +1 > > > > > > > >>> >> >>> > > > for this feature. > > > > > > > >>> >> >>> > > > > > > > > > > >>> >> >>> > > > Since slots in the same task executor could > have > > > > > > different > > > > > > > >>> >> resource > > > > > > > >>> >> >>> > > > profile, we will > > > > > > > >>> >> >>> > > > meet resource fragment problem. Think about > this > > > > case: > > > > > > > >>> >> >>> > > > - request A want 1G memory while request B & > C > > > want > > > > > > 0.5G > > > > > > > >>> memory > > > > > > > >>> >> >>> > > > - There are two task executors T1 & T2 with > 1G > > > and > > > > > 0.5G > > > > > > > >>> free > > > > > > > >>> >> >>> memory > > > > > > > >>> >> >>> > > > respectively > > > > > > > >>> >> >>> > > > If B come first and we cut a slot from T1 for > > B, A > > > > > must > > > > > > > >>> wait for > > > > > > > >>> >> >>> the > > > > > > > >>> >> >>> > > > free resource from > > > > > > > >>> >> >>> > > > other task. But A could have been scheduled > > > > > immediately > > > > > > if > > > > > > > >>> we > > > > > > > >>> >> cut a > > > > > > > >>> >> >>> > > > slot from T2 for B. > > > > > > > >>> >> >>> > > > > > > > > > > >>> >> >>> > > > The logic of findMatchingSlot now become > > finding a > > > > > task > > > > > > > >>> executor > > > > > > > >>> >> >>> which > > > > > > > >>> >> >>> > > > has enough > > > > > > > >>> >> >>> > > > resource and then cut a slot from it. Current > > > method > > > > > > could > > > > > > > >>> be > > > > > > > >>> >> seen > > > > > > > >>> >> >>> as > > > > > > > >>> >> >>> > > > "First-fit strategy", > > > > > > > >>> >> >>> > > > which works well in general but sometimes > could > > > not > > > > be > > > > > > the > > > > > > > >>> >> >>> optimization > > > > > > > >>> >> >>> > > > method. 
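As a toy illustration of the pluggable selection strategy idea in the quoted discussion above: first fit mirrors the current findMatchingSlot behaviour, while best fit would pick the executor that leaves the least free memory behind, reducing fragmentation in cases like the T1/T2 example. All type names are invented and resources are reduced to a single memory dimension.

    // All names invented; resources reduced to free memory (MB) for brevity.
    import java.util.Comparator;
    import java.util.List;
    import java.util.Optional;

    interface TaskExecutorSelectionStrategy {
        Optional<TaskExecutorView> select(List<TaskExecutorView> executors, long requestedMemMb);
    }

    final class TaskExecutorView {
        final String id;
        final long freeMemMb;

        TaskExecutorView(String id, long freeMemMb) {
            this.id = id;
            this.freeMemMb = freeMemMb;
        }
    }

    // Current behaviour as described above: take the first executor that fits.
    final class FirstFit implements TaskExecutorSelectionStrategy {
        @Override
        public Optional<TaskExecutorView> select(List<TaskExecutorView> executors, long requestedMemMb) {
            return executors.stream()
                    .filter(e -> e.freeMemMb >= requestedMemMb)
                    .findFirst();
        }
    }

    // Alternative: pick the executor that leaves the least free memory behind,
    // which would let A be scheduled immediately in the T1/T2 example above.
    final class BestFit implements TaskExecutorSelectionStrategy {
        @Override
        public Optional<TaskExecutorView> select(List<TaskExecutorView> executors, long requestedMemMb) {
            return executors.stream()
                    .filter(e -> e.freeMemMb >= requestedMemMb)
                    .min(Comparator.comparingLong(e -> e.freeMemMb - requestedMemMb));
        }
    }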
> > > > > > > >>> >> >>> > > > > > > > > > > >>> >> >>> > > > Actually, this problem could be abstracted as > > "Bin > > > > > > Packing > > > > > > > >>> >> >>> Problem"[1]. > > > > > > > >>> >> >>> > > > Here are > > > > > > > >>> >> >>> > > > some common approximate algorithms: > > > > > > > >>> >> >>> > > > - First fit > > > > > > > >>> >> >>> > > > - Next fit > > > > > > > >>> >> >>> > > > - Best fit > > > > > > > >>> >> >>> > > > > > > > > > > >>> >> >>> > > > But it become multi-dimensional bin packing > > > problem > > > > if > > > > > > we > > > > > > > >>> take > > > > > > > >>> >> CPU > > > > > > > >>> >> >>> > > > into account. It hard > > > > > > > >>> >> >>> > > > to define which one is best fit now. Some > > research > > > > > > > addressed > > > > > > > >>> >> this > > > > > > > >>> >> >>> > > > problem, such like Tetris[2]. > > > > > > > >>> >> >>> > > > > > > > > > > >>> >> >>> > > > Here are some thinking about it: > > > > > > > >>> >> >>> > > > 1. We could make the strategy of finding > > matching > > > > task > > > > > > > >>> executor > > > > > > > >>> >> >>> > > > pluginable. Let user to config the > > > > > > > >>> >> >>> > > > best strategy in their scenario. > > > > > > > >>> >> >>> > > > 2. We could support batch request interface in > > RM, > > > > > > because > > > > > > > >>> we > > > > > > > >>> >> have > > > > > > > >>> >> >>> > > > opportunities to optimize > > > > > > > >>> >> >>> > > > if we have more information. If we know the A, > > B, > > > C > > > > at > > > > > > the > > > > > > > >>> same > > > > > > > >>> >> >>> time, > > > > > > > >>> >> >>> > > > we could always make the best decision. > > > > > > > >>> >> >>> > > > > > > > > > > >>> >> >>> > > > [1] > http://www.or.deis.unibo.it/kp/Chapter8.pdf > > > > > > > >>> >> >>> > > > [2] > > > > > > > >>> >> >>> > > > > > > > > >>> >> > > > > > > > > > > https://www.cs.cmu.edu/~xia/resources/Documents/grandl_sigcomm14.pdf > > > > > > > >>> >> >>> > > > > > > > > > > >>> >> >>> > > > Best, > > > > > > > >>> >> >>> > > > Yangze Guo > > > > > > > >>> >> >>> > > > > > > > > > > >>> >> >>> > > > On Thu, Aug 15, 2019 at 10:40 PM Xintong Song > < > > > > > > > >>> >> >>> [hidden email]> > > > > > > > >>> >> >>> > > > wrote: > > > > > > > >>> >> >>> > > > > > > > > > > > >>> >> >>> > > > > Hi everyone, > > > > > > > >>> >> >>> > > > > > > > > > > > >>> >> >>> > > > > We would like to start a discussion thread > on > > > > > > "FLIP-53: > > > > > > > >>> Fine > > > > > > > >>> >> >>> Grained > > > > > > > >>> >> >>> > > > > Resource Management"[1], where we propose > how > > to > > > > > > improve > > > > > > > >>> Flink > > > > > > > >>> >> >>> > resource > > > > > > > >>> >> >>> > > > > management and scheduling. > > > > > > > >>> >> >>> > > > > > > > > > > > >>> >> >>> > > > > This FLIP mainly discusses the following > > issues. > > > > > > > >>> >> >>> > > > > > > > > > > > >>> >> >>> > > > > - How to support tasks with fine grained > > > > resource > > > > > > > >>> >> >>> requirements. > > > > > > > >>> >> >>> > > > > - How to unify resource management for > jobs > > > > with > > > > > / > > > > > > > >>> without > > > > > > > >>> >> >>> fine > > > > > > > >>> >> >>> > > > grained > > > > > > > >>> >> >>> > > > > resource requirements. > > > > > > > >>> >> >>> > > > > - How to unify resource management for > > > > streaming > > > > > / > > > > > > > >>> batch > > > > > > > >>> >> jobs. > > > > > > > >>> >> >>> > > > > > > > > > > > >>> >> >>> > > > > Key changes proposed in the FLIP are as > > follows. 
> > > > > > > >>> >> >>> > > > > > > > > > > > >>> >> >>> > > > > - Unify memory management for operators > > with > > > / > > > > > > > without > > > > > > > >>> fine > > > > > > > >>> >> >>> > grained > > > > > > > >>> >> >>> > > > > resource requirements by applying a > > fraction > > > > > based > > > > > > > >>> quota > > > > > > > >>> >> >>> > mechanism. > > > > > > > >>> >> >>> > > > > - Unify resource scheduling for streaming > > and > > > > > batch > > > > > > > >>> jobs by > > > > > > > >>> >> >>> > setting > > > > > > > >>> >> >>> > > > slot > > > > > > > >>> >> >>> > > > > sharing groups for pipelined regions > during > > > > > > compiling > > > > > > > >>> >> stage. > > > > > > > >>> >> >>> > > > > - Dynamically allocate slots from task > > > > executors' > > > > > > > >>> available > > > > > > > >>> >> >>> > > resources. > > > > > > > >>> >> >>> > > > > > > > > > > > >>> >> >>> > > > > Please find more details in the FLIP wiki > > > document > > > > > > [1]. > > > > > > > >>> >> Looking > > > > > > > >>> >> >>> > forward > > > > > > > >>> >> >>> > > > to > > > > > > > >>> >> >>> > > > > your feedbacks. > > > > > > > >>> >> >>> > > > > > > > > > > > >>> >> >>> > > > > Thank you~ > > > > > > > >>> >> >>> > > > > > > > > > > > >>> >> >>> > > > > Xintong Song > > > > > > > >>> >> >>> > > > > > > > > > > > >>> >> >>> > > > > > > > > > > > >>> >> >>> > > > > [1] > > > > > > > >>> >> >>> > > > > > > > > > > > >>> >> >>> > > > > > > > > > > >>> >> >>> > > > > > > > > > >>> >> >>> > > > > > > > > >>> >> >>> > > > > > > > >>> >> > > > > > > > >>> > > > > > > > > > > > > > > > > > > > > > > > > > > > > https://cwiki.apache.org/confluence/display/FLINK/FLIP-53%3A+Fine+Grained+Resource+Management > > > > > > > >>> >> >>> > > > > > > > > > > >>> >> >>> > > > > > > > > > >>> >> >>> > > > > > > > > >>> >> >>> > > > > > > > >>> >> >> > > > > > > > >>> >> > > > > > > > >>> > > > > > > > > >>> > > > > > > > >> > > > > > > > > > > > > > > > > > > > > > > > > > > > > |