Hi there,
We would like to start a discussion thread on "FLIP-156: Runtime Interfaces for Fine-Grained Resource Requirements" [1], where we propose Slot Sharing Group (SSG) based runtime interfaces for specifying fine-grained resource requirements.

In this FLIP:
- Expound the user story of fine-grained resource management.
- Propose runtime interfaces for specifying SSG-based resource requirements.
- Discuss the pros and cons of the three potential granularities for specifying the resource requirements (operator, task and slot sharing group) and explain why we choose the slot sharing group.

Please find more details in the FLIP wiki document [1]. Looking forward to your feedback.

[1] https://cwiki.apache.org/confluence/display/FLINK/FLIP-156%3A+Runtime+Interfaces+for+Fine-Grained+Resource+Requirements

Best,
Yangze Guo
Thanks for drafting this FLIP and starting this discussion, Yangze.
I like that defining resource requirements on a slot sharing group makes the overall setup easier and improves the usability of resource requirements.

What I do not like about it is that it changes slot sharing groups from being a scheduling hint into something which needs to be supported in order to support fine-grained resource requirements. So far, the idea of slot sharing groups was to tell the system that a set of operators can be deployed in the same slot, while the system still had the freedom to place these tasks in different slots if it wanted. If we now specify resource requirements per slot sharing group, then the only option for a scheduler which does not support slot sharing groups is to say that every operator in this slot sharing group needs a slot with the same resources as the whole group.

For example, if we have a job consisting of two operators op_1 and op_2, where each operator needs 100 MB of memory, we would then say that the slot sharing group needs 200 MB of memory to run. If we have a cluster with 2 TMs with one slot of 100 MB each, then the system cannot run this job. If the resources were specified on an operator level, the system could still decide to deploy op_1 to TM_1 and op_2 to TM_2.

Originally, one of the primary goals of slot sharing groups was to make it easier for the user to reason about how many slots a job needs, independent of the actual number of operators in the job. Interestingly, if all operators have their resources properly specified, then slot sharing is no longer needed, because Flink could slice off an appropriately sized slot for every task individually. What matters is whether the whole cluster has enough resources to run all tasks or not.

Cheers,
Till
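To make the arithmetic in the example above concrete, here is a minimal, self-contained sketch. It is plain Java with made-up types, not Flink's actual scheduler or ResourceProfile API, and it only illustrates why a single 200 MB group request cannot be served by two 100 MB slots while two 100 MB per-operator requests can:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical model for the 100 MB / 200 MB example above; not Flink code.
public class SlotSizingExample {

    record OperatorSpec(String name, int memoryMb) {}

    /** Greedily assigns each request to the first free slot that is large enough. */
    static boolean schedulable(List<Integer> requestsMb, List<Integer> slotsMb) {
        List<Integer> free = new ArrayList<>(slotsMb);
        for (int request : requestsMb) {
            int i = 0;
            while (i < free.size() && free.get(i) < request) i++;
            if (i == free.size()) return false; // no free slot can hold this request
            free.remove(i);                     // the chosen slot is consumed
        }
        return true;
    }

    public static void main(String[] args) {
        List<OperatorSpec> group =
                List.of(new OperatorSpec("op_1", 100), new OperatorSpec("op_2", 100));
        List<Integer> clusterSlotsMb = List.of(100, 100); // two TMs, one 100 MB slot each

        // SSG-based: the group requirement is the sum of its members -> one 200 MB request.
        int groupMb = group.stream().mapToInt(OperatorSpec::memoryMb).sum();
        System.out.println(groupMb + " MB group request fits: "
                + schedulable(List.of(groupMb), clusterSlotsMb));   // false

        // Operator-based: two independent 100 MB requests.
        List<Integer> perOperator = group.stream().map(OperatorSpec::memoryMb).toList();
        System.out.println("two 100 MB operator requests fit: "
                + schedulable(perOperator, clusterSlotsMb));        // true
    }
}
```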
Will declaring them on slot sharing groups not also waste resources if the parallelisms of the operators within that group differ?

It also seems like quite a hassle for users having to recalculate the resource requirements if they change the slot sharing. I'd think that it's not really workable for users that create a set of re-usable operators which are mixed and matched in their applications; managing the resource requirements in such a setting would be a nightmare, and in the end would require operator-level requirements anyway. In that sense, I'm not even sure whether it really increases usability.

My main worry is that if we wire the runtime to work on SSGs, it's going to be difficult to implement more fine-grained approaches, which would not be the case if, for the runtime, they are always defined on an operator level.
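As a back-of-the-envelope illustration of the first point, with made-up numbers (this is not Flink code, and the sizing rule is only the one implied by per-group requirements): an SSG containing op_A (parallelism 4) and op_B (parallelism 2), each needing 100 MB, reserves the full group size in every slot, while per-operator requirements would only reserve what each subtask needs.

```java
// Made-up numbers illustrating the parallelism-mismatch concern above; not Flink code.
public class ParallelismMismatch {
    public static void main(String[] args) {
        int opAMemoryMb = 100, opAParallelism = 4;
        int opBMemoryMb = 100, opBParallelism = 2;

        // SSG-based: every slot of the group is sized for all member operators,
        // and the group needs max(parallelism) slots.
        int slotMb = opAMemoryMb + opBMemoryMb;                    // 200 MB per slot
        int groupSlots = Math.max(opAParallelism, opBParallelism); // 4 slots
        int ssgTotalMb = slotMb * groupSlots;                      // 800 MB

        // Operator-based: each subtask reserves only what its operator needs.
        int perOpTotalMb = opAMemoryMb * opAParallelism + opBMemoryMb * opBParallelism; // 600 MB

        System.out.println("SSG-based: " + ssgTotalMb + " MB, per-operator: " + perOpTotalMb
                + " MB, difference: " + (ssgTotalMb - perOpTotalMb) + " MB");
    }
}
```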
Thanks for your feedback.
@Till

> the only option for a scheduler which does not support slot sharing groups is to say that every operator in this slot sharing group needs a slot with the same resources as the whole group.

At the moment, all implementations of the scheduler respect slot sharing groups. Regarding your example, in that case the user can directly split the two operators into two slot sharing groups with 100 MB each.

> If all operators have their resources properly specified, then slot sharing is no longer needed.

I agree. However, specifying resource requirements for each operator is impractical for complex jobs that contain tens or even hundreds of operators. It is also hard to come up with a default value for operator resource requirements. The SSG-based approach makes the user's configuration more flexible: in many cases, users only care about or know the resource requirements of some subgraphs, and forcing them to provide more information harms usability. If an expert user does know the fine-grained resource requirements, operator-granularity requirements can still be achieved by arranging the slot sharing groups accordingly.

@Chesnay

> Will declaring them on slot sharing groups not also waste resources if the parallelisms of the operators within that group differ?

Yes, we list it as one of the cons of the SSG-based approach. In that case, the user needs to separate operators with different parallelisms into different SSGs. However, compared to the benefits we list, we tend to treat it as a trade-off between usability and resource utilization for the user to decide. All in all, fine-grained resource management is meant for expert users to further optimize resource utilization, so such an extra effort might be worth it.

> It also seems like quite a hassle for users having to recalculate the resource requirements if they change the slot sharing.

If an expert user knows the exact resource requirements of each operator, they can simply place each operator in a separate slot sharing group. If they want some of them placed in the same slot, they just need to sum up the resource requirements of those operators. There is no need to maintain the resource requirements of a set of re-usable operators.

> My main worry is that if we wire the runtime to work on SSGs, it's going to be difficult to implement more fine-grained approaches.

One of the important reasons we chose the SSG-based approach is that the slot is the basic unit for resource management in Flink's runtime:
- Runtime interfaces should only require the minimum set of information needed; operator-level resource requirements will be converted to slot-level requirements anyway.
- The end-user interfaces for specifying resource requirements are still under discussion; the runtime interfaces should only require the minimum information needed for resource management.

Best,
Yangze Guo
Thanks for drafting the FLIP and driving the discussion, Yangze.
And thanks for the feedback, Till and Chesnay.

@Till

I agree that specifying requirements for SSGs means that SSGs need to be supported in fine-grained resource management, otherwise each operator might use as many resources as the whole group. However, I cannot think of a strong reason for not supporting SSGs in fine-grained resource management.

> Interestingly, if all operators have their resources properly specified, then slot sharing is no longer needed, because Flink could slice off an appropriately sized slot for every task individually.

> For example, if we have a job consisting of two operators op_1 and op_2, where each operator needs 100 MB of memory, we would then say that the slot sharing group needs 200 MB of memory to run. If we have a cluster with 2 TMs with one slot of 100 MB each, then the system cannot run this job. If the resources were specified on an operator level, the system could still decide to deploy op_1 to TM_1 and op_2 to TM_2.

Couldn't agree more that if all operators' requirements are properly specified, slot sharing is no longer needed. I think this exactly disproves the example: if we already know op_1 and op_2 each need 100 MB of memory, why would we put them in the same group? If they are in separate groups, the proposed approach lets the system freely deploy them to either a 200 MB TM or two 100 MB TMs.

Moreover, the precondition for not needing slot sharing is having resource requirements properly specified for all operators. This is not always possible, and usually requires tremendous effort. One of the benefits of SSG-based requirements is that they allow the user to freely decide the granularity, and thus the effort they want to invest. I would consider an SSG in fine-grained resource management as a group of operators for which the user would like to specify the total resources. There can be only one group in the job, 2~3 groups dividing the job into a few major parts, or as many groups as there are tasks/operators, depending on how fine-grained the user is able to specify the resources.

Having to support SSGs might be a constraint. But given that all the current scheduler implementations already support SSGs, I tend to see that as an acceptable price for the usability and flexibility discussed above.

@Chesnay

> Will declaring them on slot sharing groups not also waste resources if the parallelisms of the operators within that group differ?

Yes. It's a trade-off between usability and resource utilization. To avoid such waste, the user can define more groups, so that each group contains fewer operators and the chance of having operators with different parallelisms is reduced. The price is having more resource requirements to specify.

> It also seems like quite a hassle for users having to recalculate the resource requirements if they change the slot sharing. I'd think that it's not really workable for users that create a set of re-usable operators which are mixed and matched in their applications; managing the resource requirements in such a setting would be a nightmare, and in the end would require operator-level requirements anyway. In that sense, I'm not even sure whether it really increases usability.

- As mentioned in my reply to Till's comment, there is no reason to put multiple operators whose individual resource requirements are already known into the same group in fine-grained resource management.
- Even if an operator implementation is reused across multiple applications, that does not guarantee the same resource requirements. During our years of practice at Alibaba, with per-operator requirements specified for Blink's fine-grained resource management, very few users (including our specialists who are dedicated to supporting Blink users) were experienced enough to accurately predict/estimate the operator resource requirements. Most people rely on execution-time metrics (throughput, delay, CPU load, memory usage, GC pressure, etc.) to improve the specification.

To sum up: if the user is capable of providing proper resource requirements for every operator, that's definitely a good thing and we would not need to rely on the SSGs. However, that shouldn't be a *must* for fine-grained resource management to work. For those users who are capable and do not like having to set each operator to a separate SSG, I would be ok to have both SSG-based and operator-based runtime interfaces and to fall back to the SSG requirements only when the operator requirements are not specified. However, as a first step, I think we should prioritise the use cases where users are not that experienced.

Thank you~

Xintong Song
Thanks a lot, Yangze and Xintong for this FLIP.
I want to say, first of all, that this is super well written. And the points that the FLIP makes about how to expose the configuration to users are exactly the right thing to figure out first. So good job here!

About how to let users specify the resource profiles: if I can sum the FLIP and previous discussion up in my own words, the problem is the following:

> Operator-level specification is the simplest and cleanest approach, because it avoids mixing operator configuration (resource) and scheduling. No matter what other parameters change (chaining, slot sharing, switching pipelined and blocking shuffles), the resource profiles stay the same.
> But it would require that a user specifies resources on all operators, which makes it hard to use. That's why the FLIP suggests going with specifying resources on a Sharing-Group.

I think both thoughts are important, so can we find a solution where the Resource Profiles are specified on an Operator, but we still avoid having to specify a resource profile on every operator?

What do you think about something like the following:
- Resource Profiles are specified on an operator level.
- Not all operators need profiles.
- All operators without a Resource Profile end up in the default slot sharing group with a default profile (will get a default slot).
- All operators with a Resource Profile go into another slot sharing group (the resource-specified-group).
- Users can define different slot sharing groups for operators like they do now, with the exception that you cannot mix operators that have a resource profile and operators that have no resource profile.
- The default case where no operator has a resource profile is just a special case of this model.
- The chaining logic sums up the profiles per operator, like it does now, and the scheduler sums up the profiles of the tasks that it schedules together.

There is another question about reactive scaling raised in the FLIP. I need to think a bit about that. That is indeed a bit more tricky once we have slots of different sizes: it is not clear then which of the different slot requests the ResourceManager should fulfill when new resources (TMs) show up, or how the JobManager redistributes the slot resources when resources (TMs) disappear. This question is pretty orthogonal, though, to the "how to specify the resources" discussion.

Best,
Stephan
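A rough sketch of the grouping rule described in the proposal above, with hypothetical types and names (this is not the actual Flink API): operators with an explicit profile go into the resource-specified group, operators without one fall into the default group, and chaining / slot sharing sums the per-operator profiles.

```java
import java.util.List;
import java.util.Optional;

// Hypothetical illustration of the proposed model; names and types are made up
// and this is not the actual Flink API.
public class GroupingSketch {

    record ResourceProfile(double cpuCores, int memoryMb) {
        ResourceProfile merge(ResourceProfile other) {
            return new ResourceProfile(cpuCores + other.cpuCores, memoryMb + other.memoryMb);
        }
    }

    record Operator(String name, Optional<ResourceProfile> profile) {}

    /** Operators without a profile fall into the default group (default-sized slots);
     *  operators with a profile go into the resource-specified group. Mixing is not allowed. */
    static String groupOf(Operator op) {
        return op.profile().isPresent() ? "resource-specified-group" : "default-group";
    }

    /** Chaining / slot sharing sums the profiles of the operators scheduled together. */
    static ResourceProfile profileOf(List<Operator> scheduledTogether) {
        return scheduledTogether.stream()
                .map(op -> op.profile().orElseThrow())   // every member must carry a profile
                .reduce(new ResourceProfile(0, 0), ResourceProfile::merge);
    }

    public static void main(String[] args) {
        Operator source = new Operator("source", Optional.empty());
        Operator heavyMap = new Operator("heavyMap", Optional.of(new ResourceProfile(2.0, 512)));
        Operator sink = new Operator("sink", Optional.of(new ResourceProfile(0.5, 256)));

        System.out.println(source.name() + " -> " + groupOf(source));     // default-group
        System.out.println(heavyMap.name() + " -> " + groupOf(heavyMap)); // resource-specified-group
        System.out.println(profileOf(List.of(heavyMap, sink)));           // cpuCores=2.5, memoryMb=768
    }
}
```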
Thanks for the feedback, Stephan.
Actually, your proposal has also come to my mind at some point, and I have some concerns about it.

1. It does not give users the same control as the SSG-based approach.

While both approaches avoid specifying requirements for each operator, the SSG-based approach supports the semantic that "these operators together use this much resource", while the operator-based approach doesn't.

Think of a long pipeline with m operators (o_1, o_2, ..., o_m), where at some point there is an agg o_n (1 < n < m) which significantly reduces the data amount. One can separate the pipeline into two groups SSG_1 (o_1, ..., o_n) and SSG_2 (o_n+1, ..., o_m), so that configuring much higher parallelisms for operators in SSG_1 than for operators in SSG_2 won't lead to too much wasted resources. If the two SSGs end up needing different resources, with the SSG-based approach one can directly specify resources for the two groups. With the operator-based approach, however, the user would have to specify resources for each operator in one of the two groups, and tune the default slot resource via configuration to fit the other group.

2. It increases the chance of breaking operator chains.

Setting chainable operators into different slot sharing groups prevents them from being chained. In the current implementation, downstream operators, if their SSG is not explicitly specified, are set to the same group as their chainable upstream operators (unless there are multiple upstream operators in different groups), to reduce the chance of breaking chains.

Thinking of chainable operators o_1 -> o_2 -> o_3 -> o_4, if we decide SSGs based on whether resources are specified, we could easily get groups like (o_1, o_3) & (o_2, o_4), where none of the operators can be chained. This is also possible with the SSG-based approach, but I believe the chance is much smaller, because there is no strong reason for users to specify groups with alternating operators like that. We are more likely to get groups like (o_1, o_2) & (o_3, o_4), where the chain breaks only between o_2 and o_3.

3. It complicates the system by having two different mechanisms for sharing managed memory in a slot.

- In FLIP-141, we introduced the intra-slot managed memory sharing mechanism, where managed memory is first distributed according to the consumer type, then further distributed across the operators of that consumer type.
- With the operator-based approach, the managed memory size specified for an operator has to account for all the consumer types of that operator. That means the managed memory is first distributed across operators, then distributed to the different consumer types of each operator.

Unfortunately, the different order of the two calculation steps can lead to different results. To be specific, the semantics of the configuration option `consumer-weights` would change (within a slot vs. within an operator).

To sum things up: while (3) might be more implementation related, I think (1) and (2) suggest that the price the proposed approach pays for avoiding per-operator resource specification is that it is not as independent from operator chaining and slot sharing as the operator-based approach discussed in the FLIP.

Thank you~

Xintong Song
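To make concern (3) above concrete, here is a worked example with made-up weights and sizes (illustrative only, not Flink's defaults): a slot with 100 MB of managed memory, equal weights for the OPERATOR and STATE_BACKEND consumer types, op_A using both consumer types and op_B using only OPERATOR memory. Distributing by consumer type first (as in FLIP-141) and distributing by operator first give different results.

```java
// Made-up numbers for the managed-memory ordering concern above; not Flink code.
public class ManagedMemoryOrder {
    public static void main(String[] args) {
        double slotManagedMb = 100.0;

        // Order A (intra-slot sharing, FLIP-141): split by consumer type first, then by operator.
        double perTypeMb = slotManagedMb / 2;  // 50 MB OPERATOR, 50 MB STATE_BACKEND
        double opAState = perTypeMb;           // only op_A uses the state backend -> 50 MB
        double opAOperator = perTypeMb / 2;    // OPERATOR memory shared by op_A and op_B -> 25 MB
        double opBOperator = perTypeMb / 2;    // 25 MB
        System.out.printf("type-first:     op_A=%.0f MB (state %.0f, operator %.0f), op_B=%.0f MB%n",
                opAState + opAOperator, opAState, opAOperator, opBOperator);

        // Order B (operator-based specs): split by operator first (say 50/50), then by consumer type.
        double opAShareMb = 50.0, opBShareMb = 50.0;
        double opAState2 = opAShareMb / 2, opAOperator2 = opAShareMb / 2; // op_A splits across its two types
        double opBOperator2 = opBShareMb;                                 // op_B has a single consumer type
        System.out.printf("operator-first: op_A=%.0f MB (state %.0f, operator %.0f), op_B=%.0f MB%n",
                opAState2 + opAOperator2, opAState2, opAOperator2, opBOperator2);

        // The state backend ends up with 50 MB in one order but only 25 MB in the other,
        // i.e. the same consumer weights mean different things within a slot vs. within an operator.
    }
}
```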
Thanks for the responses, Xintong and Stephan.
I agree that being able to define the resource requirements for a group of operators is more user friendly. However, my concern is that we would thereby expose internal runtime strategies, which might limit our flexibility to execute a given job. Moreover, the semantics of configuring resource requirements for SSGs could break when switching from streaming to batch execution. If one defines the resource requirements for op_1 -> op_2, which run in pipelined mode under streaming execution, then how do we interpret these requirements when op_1 -> op_2 are executed with a blocking data exchange in batch execution mode?

Consequently, I am still leaning towards Stephan's proposal to set the resource requirements per operator. Maybe the following makes the configuration easier: if the user wants to use fine-grained resource requirements, then she needs to specify a default size which is used for operators that have no explicit resource annotation. If this holds true, then every operator would have a resource requirement, and the system can try to execute the operators in the best possible manner w/o being constrained by how the user set the SSG requirements.

Cheers,
Till
> > > > > > > > My main worry is that it if we wire the runtime to work on SSGs it's > > > > gonna be difficult to implement more fine-grained approaches, which > > > > would not be the case if, for the runtime, they are always defined on > > an > > > > operator-level. > > > > > > > > On 1/7/2021 2:42 PM, Till Rohrmann wrote: > > > > > Thanks for drafting this FLIP and starting this discussion Yangze. > > > > > > > > > > I like that defining resource requirements on a slot sharing group > > > makes > > > > > the overall setup easier and improves usability of resource > > > requirements. > > > > > > > > > > What I do not like about it is that it changes slot sharing groups > > from > > > > > being a scheduling hint to something which needs to be supported in > > > order > > > > > to support fine grained resource requirements. So far, the idea of > > slot > > > > > sharing groups was that it tells the system that a set of operators > > can > > > > be > > > > > deployed in the same slot. But the system still had the freedom to > > say > > > > that > > > > > it would rather place these tasks in different slots if it wanted. > If > > > we > > > > > now specify resource requirements on a per slot sharing group, then > > the > > > > > only option for a scheduler which does not support slot sharing > > groups > > > is > > > > > to say that every operator in this slot sharing group needs a slot > > with > > > > the > > > > > same resources as the whole group. > > > > > > > > > > So for example, if we have a job consisting of two operator op_1 > and > > > op_2 > > > > > where each op needs 100 MB of memory, we would then say that the > slot > > > > > sharing group needs 200 MB of memory to run. If we have a cluster > > with > > > 2 > > > > > TMs with one slot of 100 MB each, then the system cannot run this > > job. > > > If > > > > > the resources were specified on an operator level, then the system > > > could > > > > > still make the decision to deploy op_1 to TM_1 and op_2 to TM_2. > > > > > > > > > > Originally, one of the primary goals of slot sharing groups was to > > make > > > > it > > > > > easier for the user to reason about how many slots a job needs > > > > independent > > > > > of the actual number of operators in the job. Interestingly, if all > > > > > operators have their resources properly specified, then slot > sharing > > is > > > > no > > > > > longer needed because Flink could slice off the appropriately sized > > > slots > > > > > for every Task individually. What matters is whether the whole > > cluster > > > > has > > > > > enough resources to run all tasks or not. > > > > > > > > > > Cheers, > > > > > Till > > > > > > > > > > On Thu, Jan 7, 2021 at 4:08 AM Yangze Guo <[hidden email]> > > wrote: > > > > > > > > > >> Hi, there, > > > > >> > > > > >> We would like to start a discussion thread on "FLIP-156: Runtime > > > > >> Interfaces for Fine-Grained Resource Requirements"[1], where we > > > > >> propose Slot Sharing Group (SSG) based runtime interfaces for > > > > >> specifying fine-grained resource requirements. > > > > >> > > > > >> In this FLIP: > > > > >> - Expound the user story of fine-grained resource management. > > > > >> - Propose runtime interfaces for specifying SSG-based resource > > > > >> requirements. > > > > >> - Discuss the pros and cons of the three potential granularities > for > > > > >> specifying the resource requirements (op, task and slot sharing > > group) > > > > >> and explain why we choose the slot sharing group. 
> > > > >> > > > > >> Please find more details in the FLIP wiki document [1]. Looking > > > > >> forward to your feedback. > > > > >> > > > > >> [1] > > > > >> > > > > > > > > > > https://cwiki.apache.org/confluence/display/FLINK/FLIP-156%3A+Runtime+Interfaces+for+Fine-Grained+Resource+Requirements > > > > >> > > > > >> Best, > > > > >> Yangze Guo > > > > >> > > > > > > > > > > > > > > |
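To make the operator-based + default-value idea Till and Stephan describe above a bit more concrete, here is a minimal hypothetical sketch. The resource-related methods (setDefaultOperatorResource, setOperatorResource) and the ResourceSpec constructor shown here are illustrative assumptions, not existing Flink APIs, and MySource/HeavyParser/MySink are placeholder user functions.

    // Hypothetical sketch only: setDefaultOperatorResource / setOperatorResource and
    // this ResourceSpec constructor are illustrative, not existing Flink APIs.
    StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

    // Default profile applied to every operator without an explicit annotation.
    env.setDefaultOperatorResource(new ResourceSpec(0.25 /* cpu cores */, 64 /* heap MB */));

    env.addSource(new MySource())                         // default profile
       .map(new HeavyParser())
       .setOperatorResource(new ResourceSpec(1.0, 512))   // explicitly annotated operator
       .addSink(new MySink());                            // default profile

With such a setup, every operator ends up with some resource requirement, and the scheduler is free to place tasks without being constrained by slot sharing groups.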
Thanks for the feedback, Till.
## I feel that what you proposed (operator-based + default value) might be subsumed by the SSG-based approach.

Thinking of op_1 -> op_2, there are the following 4 cases, categorized by whether the resource requirements are known to the users.

1. *Both known.* As previously mentioned, there's no reason to put multiple operators whose individual resource requirements are already known into the same group in fine-grained resource management. And if op_1 and op_2 are in different groups, there should be no problem switching the data exchange mode from pipelined to blocking. This is equivalent to specifying operator resource requirements in your proposal.
2. *op_1 known, op_2 unknown.* Similar to 1), except that op_2 is in an SSG whose resource is not specified and thus would have the default slot resource. This is equivalent to having default operator resources in your proposal.
3. *Both unknown.* The user can either set op_1 and op_2 to the same SSG or to separate SSGs.
   - If op_1 and op_2 are in the same SSG, it will be equivalent to coarse-grained resource management, where op_1 and op_2 share a default size slot no matter which data exchange mode is used.
   - If op_1 and op_2 are in different SSGs, then each of them will use a default size slot. This is equivalent to setting them with default operator resources in your proposal.
4. *Total (pipelined) or max (blocking) of op_1 and op_2 is known.*
   - It is possible that the user learns the total / max resource requirement from executing and monitoring the job, while not being aware of the individual operator requirements.
   - I believe this is the case your proposal does not cover. And TBH, this is probably how most users learn the resource requirements, in my experience.
   - In this case, the user might need to specify different resources if he wants to switch the execution mode, which should not be worse than not being able to use fine-grained resource management at all.

## An additional idea inspired by your proposal.

We may provide multiple options for deciding the resources of SSGs whose requirement is not specified, if needed:
- Default slot resource (current design)
- Default operator resource times the number of operators (equivalent to your proposal)

## Exposing internal runtime strategies

Theoretically, yes. By tying the requirements to SSGs, they might be affected if how SSGs are internally handled changes in the future. Practically, I do not concretely see at the moment what kind of future changes might conflict with this FLIP proposal, given the answer to the data exchange mode question above. I'd suggest not giving up the user friendliness we can gain now because of future problems that may or may not exist.

Moreover, the SSG-based approach has the flexibility to achieve behavior equivalent to the operator-based approach, if we set each operator (or task) to a separate SSG. We can even provide a shortcut option to automatically do that for users, if needed (see the sketch after this message).

Thank you~

Xintong Song
|
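For comparison, here is a similar hypothetical sketch of the SSG-based specification, applied to the SSG_1/SSG_2 pipeline example above (operators before the aggregation in one group, the aggregation and everything downstream in another). The SlotSharingGroup builder and registerSlotSharingGroup are assumptions based on this discussion rather than a finalized API; slotSharingGroup(String) is the existing way of assigning operators to a named group, and source/parse/MyAgg/sink are placeholders.

    // Hypothetical sketch: the builder and registerSlotSharingGroup are assumptions
    // from this discussion, not a finalized API.
    SlotSharingGroup ssg1 = SlotSharingGroup.newBuilder("ssg_1")   // o_1 ... o_n (high parallelism)
            .setCpuCores(4.0)
            .setTaskHeapMemoryMB(2048)
            .build();
    SlotSharingGroup ssg2 = SlotSharingGroup.newBuilder("ssg_2")   // o_n+1 ... o_m (after the agg)
            .setCpuCores(1.0)
            .setTaskHeapMemoryMB(256)
            .build();
    env.registerSlotSharingGroup(ssg1);
    env.registerSlotSharingGroup(ssg2);

    source.map(parse).slotSharingGroup("ssg_1")            // operators before the aggregation
          .keyBy(r -> r.getKey())
          .reduce(new MyAgg()).slotSharingGroup("ssg_2")   // the aggregation and downstream operators
          .addSink(sink).slotSharingGroup("ssg_2");

Here the user only has to reason about two budgets, one per group, no matter how many operators each group contains.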
Thanks for the responses, Till and Xintong.
I second Xintong's comment that the SSG-based runtime interface gives us the flexibility to achieve an op/task-based approach (see the sketch after this message). That's one of the most important reasons for our design choice.

My two cents regarding the default operator resource:
- It might work well for DataStream jobs.
  ** For lightweight operators, the accumulated configuration error will not be significant, and the resource a task uses is roughly proportional to the number of operators it contains.
  ** For heavy operators like joins and windows, or operators using external resources, users will turn to the fine-grained resource configuration.
- It can increase stability for standalone clusters, where the registered task executors are heterogeneous (with different default slot resources).
- It might not be good for SQL users. The operators that SQL is translated to are a black box to the user, and we do not guarantee cross-version consistency of that translation so far.

I think this can be treated as follow-up work once fine-grained resource management is ready end-to-end.

Best,
Yangze Guo
|
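As a small illustration of the point that the SSG-based interface can emulate the op/task-based approach: giving every operator its own group (using the same hypothetical builder API as in the sketch above, with placeholder functions) effectively yields per-operator requirements.

    // Hypothetical sketch: one slot sharing group per operator emulates
    // operator-level resource requirements (same illustrative API as above).
    env.registerSlotSharingGroup(SlotSharingGroup.newBuilder("parser")
            .setCpuCores(1.0).setTaskHeapMemoryMB(512).build());
    env.registerSlotSharingGroup(SlotSharingGroup.newBuilder("sink")
            .setCpuCores(0.25).setTaskHeapMemoryMB(64).build());

    source.map(parse).slotSharingGroup("parser")   // one group per operator ...
          .addSink(sink).slotSharingGroup("sink"); // ... i.e., per-operator granularity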
Maybe a different minor idea: Would it be possible to treat the SSG
resource requirements as a hint for the runtime, similar to how slot sharing groups are designed at the moment? Meaning that we would not give the guarantee that Flink always deploys this set of tasks together, no matter what. If, for example, the runtime can somehow derive the resource requirements for each task from the requirements of the SSG, this could be possible. One easy strategy would be to give every task the same resources as the whole slot sharing group. Another one could be distributing the resources equally among the tasks (both strategies are sketched below). This does not even have to be implemented, but we would give ourselves the freedom to change the scheduling if the need should arise.

Cheers,
Till
*Total (pipeline) or max (blocking) of op_1 and op_2 is known.* > > - It is possible that the user learns the total / max resource > > requirement from executing and monitoring the job, while not > > being aware of > > individual operator requirements. > > - I believe this is the case your proposal does not cover. And TBH, > > this is probably how most users learn the resource requirements, > > according > > to my experiences. > > - In this case, the user might need to specify different resources > if > > he wants to switch the execution mode, which should not be worse > than not > > being able to use fine-grained resource management. > > > > > > ## An additional idea inspired by your proposal. > > We may provide multiple options for deciding resources for SSGs whose > > requirement is not specified, if needed. > > > > - Default slot resource (current design) > > - Default operator resource times number of operators (equivalent to > > your proposal) > > > > > > ## Exposing internal runtime strategies > > Theoretically, yes. Tying to the SSGs, the resource requirements might be > > affected if how SSGs are internally handled changes in future. > Practically, > > I do not concretely see at the moment what kind of changes we may want in > > future that might conflict with this FLIP proposal, as the question of > > switching data exchange mode answered above. I'd suggest to not give up > the > > user friendliness we may gain now for the future problems that may or may > > not exist. > > > > Moreover, the SSG-based approach has the flexibility to achieve the > > equivalent behavior as the operator-based approach, if we set each > operator > > (or task) to a separate SSG. We can even provide a shortcut option to > > automatically do that for users, if needed. > > > > > > Thank you~ > > > > Xintong Song > > > > > > > > On Tue, Jan 19, 2021 at 11:48 PM Till Rohrmann <[hidden email]> > wrote: > > > > > Thanks for the responses Xintong and Stephan, > > > > > > I agree that being able to define the resource requirements for a > group of > > > operators is more user friendly. However, my concern is that we are > > > exposing thereby internal runtime strategies which might limit our > > > flexibility to execute a given job. Moreover, the semantics of > configuring > > > resource requirements for SSGs could break if switching from streaming > to > > > batch execution. If one defines the resource requirements for op_1 -> > op_2 > > > which run in pipelined mode when using the streaming execution, then > how do > > > we interpret these requirements when op_1 -> op_2 are executed with a > > > blocking data exchange in batch execution mode? Consequently, I am > still > > > leaning towards Stephan's proposal to set the resource requirements per > > > operator. > > > > > > Maybe the following proposal makes the configuration easier: If the > user > > > wants to use fine-grained resource requirements, then she needs to > specify > > > the default size which is used for operators which have no explicit > > > resource annotation. If this holds true, then every operator would > have a > > > resource requirement and the system can try to execute the operators > in the > > > best possible manner w/o being constrained by how the user set the SSG > > > requirements. > > > > > > Cheers, > > > Till > > > > > > On Tue, Jan 19, 2021 at 9:09 AM Xintong Song <[hidden email]> > > > wrote: > > > > > > > Thanks for the feedback, Stephan. > > > > > > > > Actually, your proposal has also come to my mind at some point. 
And I > > > have > > > > some concerns about it. > > > > > > > > > > > > 1. It does not give users the same control as the SSG-based approach. > > > > > > > > > > > > While both approaches do not require specifying for each operator, > > > > SSG-based approach supports the semantic that "some operators > together > > > use > > > > this much resource" while the operator-based approach doesn't. > > > > > > > > > > > > Think of a long pipeline with m operators (o_1, o_2, ..., o_m), and > at > > > some > > > > point there's an agg o_n (1 < n < m) which significantly reduces the > data > > > > amount. One can separate the pipeline into 2 groups SSG_1 (o_1, ..., > o_n) > > > > and SSG_2 (o_n+1, ... o_m), so that configuring much higher > parallelisms > > > > for operators in SSG_1 than for operators in SSG_2 won't lead to too > much > > > > wasting of resources. If the two SSGs end up needing different > resources, > > > > with the SSG-based approach one can directly specify resources for > the > > > two > > > > groups. However, with the operator-based approach, the user will > have to > > > > specify resources for each operator in one of the two groups, and > tune > > > the > > > > default slot resource via configurations to fit the other group. > > > > > > > > > > > > 2. It increases the chance of breaking operator chains. > > > > > > > > > > > > Setting chainnable operators into different slot sharing groups will > > > > prevent them from being chained. In the current implementation, > > > downstream > > > > operators, if SSG not explicitly specified, will be set to the same > group > > > > as the chainable upstream operators (unless multiple upstream > operators > > > in > > > > different groups), to reduce the chance of breaking chains. > > > > > > > > > > > > Thinking of chainable operators o_1 -> o_2 -> o_3 -> o_3, deciding > SSGs > > > > based on whether resource is specified we will easily get groups like > > > (o_1, > > > > o_3) & (o_2, o_4), where none of the operators can be chained. This > is > > > also > > > > possible for the SSG-based approach, but I believe the chance is much > > > > smaller because there's no strong reason for users to specify the > groups > > > > with alternate operators like that. We are more likely to get groups > like > > > > (o_1, o_2) & (o_3, o_4), where the chain breaks only between o_2 and > o_3. > > > > > > > > > > > > 3. It complicates the system by having two different mechanisms for > > > sharing > > > > managed memory in a slot. > > > > > > > > > > > > - In FLIP-141, we introduced the intra-slot managed memory sharing > > > > mechanism, where managed memory is first distributed according to the > > > > consumer type, then further distributed across operators of that > consumer > > > > type. > > > > > > > > - With the operator-based approach, managed memory size specified > for an > > > > operator should account for all the consumer types of that operator. > That > > > > means the managed memory is first distributed across operators, then > > > > distributed to different consumer types of each operator. > > > > > > > > > > > > Unfortunately, the different order of the two calculation steps can > lead > > > to > > > > different results. To be specific, the semantic of the configuration > > > option > > > > `consumer-weights` changed (within a slot vs. within an operator). 
> > > > > > > > > > > > > > > > To sum up things: > > > > > > > > While (3) might be a bit more implementation related, I think (1) > and (2) > > > > somehow suggest that, the price for the proposed approach to avoid > > > > specifying resource for every operator is that it's not as > independent > > > from > > > > operator chaining and slot sharing as the operator-based approach > > > discussed > > > > in the FLIP. > > > > > > > > > > > > Thank you~ > > > > > > > > Xintong Song > > > > > > > > > > > > > > > > On Tue, Jan 19, 2021 at 4:29 AM Stephan Ewen <[hidden email]> > wrote: > > > > > > > > > Thanks a lot, Yangze and Xintong for this FLIP. > > > > > > > > > > I want to say, first of all, that this is super well written. And > the > > > > > points that the FLIP makes about how to expose the configuration to > > > users > > > > > is exactly the right thing to figure out first. > > > > > So good job here! > > > > > > > > > > About how to let users specify the resource profiles. If I can sum > the > > > > FLIP > > > > > and previous discussion up in my own words, the problem is the > > > following: > > > > > > > > > > Operator-level specification is the simplest and cleanest approach, > > > > because > > > > > > it avoids mixing operator configuration (resource) and > scheduling. No > > > > > > matter what other parameters change (chaining, slot sharing, > > > switching > > > > > > pipelined and blocking shuffles), the resource profiles stay the > > > same. > > > > > > But it would require that a user specifies resources on all > > > operators, > > > > > > which makes it hard to use. That's why the FLIP suggests going > with > > > > > > specifying resources on a Sharing-Group. > > > > > > > > > > > > > > > I think both thoughts are important, so can we find a solution > where > > > the > > > > > Resource Profiles are specified on an Operator, but we still avoid > that > > > > we > > > > > need to specify a resource profile on every operator? > > > > > > > > > > What do you think about something like the following: > > > > > - Resource Profiles are specified on an operator level. > > > > > - Not all operators need profiles > > > > > - All Operators without a Resource Profile ended up in the > default > > > slot > > > > > sharing group with a default profile (will get a default slot). > > > > > - All Operators with a Resource Profile will go into another slot > > > > sharing > > > > > group (the resource-specified-group). > > > > > - Users can define different slot sharing groups for operators > like > > > > they > > > > > do now, with the exception that you cannot mix operators that have > a > > > > > resource profile and operators that have no resource profile. > > > > > - The default case where no operator has a resource profile is > just a > > > > > special case of this model > > > > > - The chaining logic sums up the profiles per operator, like it > does > > > > now, > > > > > and the scheduler sums up the profiles of the tasks that it > schedules > > > > > together. > > > > > > > > > > > > > > > There is another question about reactive scaling raised in the > FLIP. I > > > > need > > > > > to think a bit about that. That is indeed a bit more tricky once we > > > have > > > > > slots of different sizes. 
> > > > > It is not clear then which of the different slot requests the > > > > > ResourceManager should fulfill when new resources (TMs) show up, > or how > > > > the > > > > > JobManager redistributes the slots resources when resources (TMs) > > > > disappear > > > > > This question is pretty orthogonal, though, to the "how to specify > the > > > > > resources". > > > > > > > > > > > > > > > Best, > > > > > Stephan > > > > > > > > > > On Fri, Jan 8, 2021 at 5:14 AM Xintong Song <[hidden email] > > > > > > wrote: > > > > > > > > > > > Thanks for drafting the FLIP and driving the discussion, Yangze. > > > > > > And Thanks for the feedback, Till and Chesnay. > > > > > > > > > > > > @Till, > > > > > > > > > > > > I agree that specifying requirements for SSGs means that SSGs > need to > > > > be > > > > > > supported in fine-grained resource management, otherwise each > > > operator > > > > > > might use as many resources as the whole group. However, I cannot > > > think > > > > > of > > > > > > a strong reason for not supporting SSGs in fine-grained resource > > > > > > management. > > > > > > > > > > > > > > > > > > > Interestingly, if all operators have their resources properly > > > > > specified, > > > > > > > then slot sharing is no longer needed because Flink could > slice off > > > > the > > > > > > > appropriately sized slots for every Task individually. > > > > > > > > > > > > > > > > > > > So for example, if we have a job consisting of two operator op_1 > and > > > > op_2 > > > > > > > where each op needs 100 MB of memory, we would then say that > the > > > slot > > > > > > > sharing group needs 200 MB of memory to run. If we have a > cluster > > > > with > > > > > 2 > > > > > > > TMs with one slot of 100 MB each, then the system cannot run > this > > > > job. > > > > > If > > > > > > > the resources were specified on an operator level, then the > system > > > > > could > > > > > > > still make the decision to deploy op_1 to TM_1 and op_2 to > TM_2. > > > > > > > > > > > > > > > > > > Couldn't agree more that if all operators' requirements are > properly > > > > > > specified, slot sharing should be no longer needed. I think this > > > > exactly > > > > > > disproves the example. If we already know op_1 and op_2 each > needs > > > 100 > > > > MB > > > > > > of memory, why would we put them in the same group? If they are > in > > > > > separate > > > > > > groups, with the proposed approach the system can freely deploy > them > > > to > > > > > > either a 200 MB TM or two 100 MB TMs. > > > > > > > > > > > > Moreover, the precondition for not needing slot sharing is having > > > > > resource > > > > > > requirements properly specified for all operators. This is not > always > > > > > > possible, and usually requires tremendous efforts. One of the > > > benefits > > > > > for > > > > > > SSG-based requirements is that it allows the user to freely > decide > > > the > > > > > > granularity, thus efforts they want to pay. I would consider SSG > in > > > > > > fine-grained resource management as a group of operators that the > > > user > > > > > > would like to specify the total resource for. There can be only > one > > > > group > > > > > > in the job, 2~3 groups dividing the job into a few major parts, > or as > > > > > many > > > > > > groups as the number of tasks/operators, depending on how > > > fine-grained > > > > > the > > > > > > user is able to specify the resources. > > > > > > > > > > > > Having to support SSGs might be a constraint. 
But given that all > the > > > > > > current scheduler implementations already support SSGs, I tend to > > > think > > > > > > that as an acceptable price for the above discussed usability and > > > > > > flexibility. > > > > > > > > > > > > @Chesnay > > > > > > > > > > > > Will declaring them on slot sharing groups not also waste > resources > > > if > > > > > the > > > > > > > parallelism of operators within that group are different? > > > > > > > > > > > > > Yes. It's a trade-off between usability and resource > utilization. To > > > > > avoid > > > > > > such wasting, the user can define more groups, so that each group > > > > > contains > > > > > > less operators and the chance of having operators with different > > > > > > parallelism will be reduced. The price is to have more resource > > > > > > requirements to specify. > > > > > > > > > > > > It also seems like quite a hassle for users having to > recalculate the > > > > > > > resource requirements if they change the slot sharing. > > > > > > > I'd think that it's not really workable for users that create > a set > > > > of > > > > > > > re-usable operators which are mixed and matched in their > > > > applications; > > > > > > > managing the resources requirements in such a setting would be > a > > > > > > > nightmare, and in the end would require operator-level > requirements > > > > any > > > > > > > way. > > > > > > > In that sense, I'm not even sure whether it really increases > > > > usability. > > > > > > > > > > > > > > > > > > > - As mentioned in my reply to Till's comment, there's no > reason to > > > > put > > > > > > multiple operators whose individual resource requirements are > > > > already > > > > > > known > > > > > > into the same group in fine-grained resource management. > > > > > > - Even an operator implementation is reused for multiple > > > > applications, > > > > > > it does not guarantee the same resource requirements. During > our > > > > years > > > > > > of > > > > > > practices in Alibaba, with per-operator requirements > specified for > > > > > > Blink's > > > > > > fine-grained resource management, very few users (including > our > > > > > > specialists > > > > > > who are dedicated to supporting Blink users) are as > experienced as > > > > to > > > > > > accurately predict/estimate the operator resource > requirements. > > > Most > > > > > > people > > > > > > rely on the execution-time metrics (throughput, delay, cpu > load, > > > > > memory > > > > > > usage, GC pressure, etc.) to improve the specification. > > > > > > > > > > > > To sum up: > > > > > > If the user is capable of providing proper resource requirements > for > > > > > every > > > > > > operator, that's definitely a good thing and we would not need to > > > rely > > > > on > > > > > > the SSGs. However, that shouldn't be a *must* for the > fine-grained > > > > > resource > > > > > > management to work. For those users who are capable and do not > like > > > > > having > > > > > > to set each operator to a separate SSG, I would be ok to have > both > > > > > > SSG-based and operator-based runtime interfaces and to only > fallback > > > to > > > > > the > > > > > > SSG requirements when the operator requirements are not > specified. > > > > > However, > > > > > > as the first step, I think we should prioritise the use cases > where > > > > users > > > > > > are not that experienced. 
> > > > > > > > > > > > Thank you~ > > > > > > > > > > > > Xintong Song > > > > > > > > > > > > On Thu, Jan 7, 2021 at 9:55 PM Chesnay Schepler < > [hidden email]> > > > > > > wrote: > > > > > > > > > > > > > Will declaring them on slot sharing groups not also waste > resources > > > > if > > > > > > > the parallelism of operators within that group are different? > > > > > > > > > > > > > > It also seems like quite a hassle for users having to > recalculate > > > the > > > > > > > resource requirements if they change the slot sharing. > > > > > > > I'd think that it's not really workable for users that create > a set > > > > of > > > > > > > re-usable operators which are mixed and matched in their > > > > applications; > > > > > > > managing the resources requirements in such a setting would be > a > > > > > > > nightmare, and in the end would require operator-level > requirements > > > > any > > > > > > > way. > > > > > > > In that sense, I'm not even sure whether it really increases > > > > usability. > > > > > > > > > > > > > > My main worry is that it if we wire the runtime to work on SSGs > > > it's > > > > > > > gonna be difficult to implement more fine-grained approaches, > which > > > > > > > would not be the case if, for the runtime, they are always > defined > > > on > > > > > an > > > > > > > operator-level. > > > > > > > > > > > > > > On 1/7/2021 2:42 PM, Till Rohrmann wrote: > > > > > > > > Thanks for drafting this FLIP and starting this discussion > > > Yangze. > > > > > > > > > > > > > > > > I like that defining resource requirements on a slot sharing > > > group > > > > > > makes > > > > > > > > the overall setup easier and improves usability of resource > > > > > > requirements. > > > > > > > > > > > > > > > > What I do not like about it is that it changes slot sharing > > > groups > > > > > from > > > > > > > > being a scheduling hint to something which needs to be > supported > > > in > > > > > > order > > > > > > > > to support fine grained resource requirements. So far, the > idea > > > of > > > > > slot > > > > > > > > sharing groups was that it tells the system that a set of > > > operators > > > > > can > > > > > > > be > > > > > > > > deployed in the same slot. But the system still had the > freedom > > > to > > > > > say > > > > > > > that > > > > > > > > it would rather place these tasks in different slots if it > > > wanted. > > > > If > > > > > > we > > > > > > > > now specify resource requirements on a per slot sharing > group, > > > then > > > > > the > > > > > > > > only option for a scheduler which does not support slot > sharing > > > > > groups > > > > > > is > > > > > > > > to say that every operator in this slot sharing group needs a > > > slot > > > > > with > > > > > > > the > > > > > > > > same resources as the whole group. > > > > > > > > > > > > > > > > So for example, if we have a job consisting of two operator > op_1 > > > > and > > > > > > op_2 > > > > > > > > where each op needs 100 MB of memory, we would then say that > the > > > > slot > > > > > > > > sharing group needs 200 MB of memory to run. If we have a > cluster > > > > > with > > > > > > 2 > > > > > > > > TMs with one slot of 100 MB each, then the system cannot run > this > > > > > job. > > > > > > If > > > > > > > > the resources were specified on an operator level, then the > > > system > > > > > > could > > > > > > > > still make the decision to deploy op_1 to TM_1 and op_2 to > TM_2. 
> > > > > > > > > > > > > > > > Originally, one of the primary goals of slot sharing groups > was > > > to > > > > > make > > > > > > > it > > > > > > > > easier for the user to reason about how many slots a job > needs > > > > > > > independent > > > > > > > > of the actual number of operators in the job. Interestingly, > if > > > all > > > > > > > > operators have their resources properly specified, then slot > > > > sharing > > > > > is > > > > > > > no > > > > > > > > longer needed because Flink could slice off the appropriately > > > sized > > > > > > slots > > > > > > > > for every Task individually. What matters is whether the > whole > > > > > cluster > > > > > > > has > > > > > > > > enough resources to run all tasks or not. > > > > > > > > > > > > > > > > Cheers, > > > > > > > > Till > > > > > > > > > > > > > > > > On Thu, Jan 7, 2021 at 4:08 AM Yangze Guo < > [hidden email]> > > > > > wrote: > > > > > > > > > > > > > > > >> Hi, there, > > > > > > > >> > > > > > > > >> We would like to start a discussion thread on "FLIP-156: > Runtime > > > > > > > >> Interfaces for Fine-Grained Resource Requirements"[1], > where we > > > > > > > >> propose Slot Sharing Group (SSG) based runtime interfaces > for > > > > > > > >> specifying fine-grained resource requirements. > > > > > > > >> > > > > > > > >> In this FLIP: > > > > > > > >> - Expound the user story of fine-grained resource > management. > > > > > > > >> - Propose runtime interfaces for specifying SSG-based > resource > > > > > > > >> requirements. > > > > > > > >> - Discuss the pros and cons of the three potential > granularities > > > > for > > > > > > > >> specifying the resource requirements (op, task and slot > sharing > > > > > group) > > > > > > > >> and explain why we choose the slot sharing group. > > > > > > > >> > > > > > > > >> Please find more details in the FLIP wiki document [1]. > Looking > > > > > > > >> forward to your feedback. > > > > > > > >> > > > > > > > >> [1] > > > > > > > >> > > > > > > > > > > > > > > > > > > > > > > > > > > https://cwiki.apache.org/confluence/display/FLINK/FLIP-156%3A+Runtime+Interfaces+for+Fine-Grained+Resource+Requirements > > > > > > > >> > > > > > > > >> Best, > > > > > > > >> Yangze Guo > > > > > > > >> > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > |
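To make the two fallback strategies above concrete, here is a minimal, illustrative Java sketch. It is not part of the FLIP and deliberately uses a simplified stand-in instead of Flink's real ResourceProfile class; it only shows how a scheduler that does not honour a slot sharing group could derive per-task requirements from the group-level requirement.

/** Simplified stand-in for a resource profile; illustrative only, not Flink's ResourceProfile. */
final class GroupResource {
    final double cpuCores;
    final int memoryMb;

    GroupResource(double cpuCores, int memoryMb) {
        this.cpuCores = cpuCores;
        this.memoryMb = memoryMb;
    }

    @Override
    public String toString() {
        return cpuCores + " cores / " + memoryMb + " MB";
    }
}

final class SsgFallbackStrategies {

    /** Strategy 1: give every task the same resources as the whole slot sharing group. */
    static GroupResource sameAsGroup(GroupResource group) {
        return group;
    }

    /** Strategy 2: distribute the group's resources equally among the tasks in the group. */
    static GroupResource splitEqually(GroupResource group, int numTasksInGroup) {
        return new GroupResource(
                group.cpuCores / numTasksInGroup,
                group.memoryMb / numTasksInGroup);
    }

    public static void main(String[] args) {
        // The example from earlier in the thread: two operators of 100 MB each,
        // declared as a single slot sharing group of 200 MB.
        GroupResource ssg = new GroupResource(2.0, 200);

        System.out.println("same as group: " + sameAsGroup(ssg));     // 2.0 cores / 200 MB per task
        System.out.println("split equally: " + splitEqually(ssg, 2)); // 1.0 cores / 100 MB per task
    }
}

With the equal split, the earlier example of a 200 MB group on a cluster of two 100 MB slots becomes schedulable again, which is exactly the scheduling freedom the hint semantics would keep open.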
I think this makes sense.
The semantic of an SSG is that operators in the group *can* be scheduled together in a slot, not that they *must* be. Specifying resources for SSGs should not change that semantic. In cases where the need arises to schedule the operators into different slots, it makes sense for the runtime to derive the finer-grained resource requirements, if they are not provided. We may not need to implement this at the moment, since currently SSGs are always respected, but we should make that semantic explicit in the JavaDocs for the interfaces and in the user documentation when the user APIs are exposed. Thank you~ Xintong Song On Thu, Jan 21, 2021 at 1:55 AM Till Rohrmann <[hidden email]> wrote: > Maybe a different minor idea: Would it be possible to treat the SSG > resource requirements as a hint for the runtime similar to how slot sharing > groups are designed at the moment? Meaning that we don't give the guarantee > that Flink will always deploy this set of tasks together no matter what > comes. If, for example, the runtime can derive by some means the resource > requirements for each task based on the requirements for the SSG, this > could be possible. One easy strategy would be to give every task the same > resources as the whole slot sharing group. Another one could be > distributing the resources equally among the tasks. This does not even have > to be implemented but we would give ourselves the freedom to change > scheduling if need should arise. > > Cheers, > Till |
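For reference, below is a sketch of what SSG-based specification could look like from the DataStream API once the user-facing interfaces are exposed. Assigning operators to named groups via slotSharingGroup(String) is the existing API; the commented-out call for declaring a group's resources uses purely hypothetical names, since the concrete user-facing interface is exactly what this FLIP still has to define.

import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class SsgRequirementsSketch {

    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Existing API: operators are assigned to named slot sharing groups.
        env.fromElements(1, 2, 3)
                .map(i -> 2 * i).slotSharingGroup("heavy") // resource-hungry part of the pipeline
                .print().slotSharingGroup("light");        // lightweight tail

        // Hypothetical call, illustrative names only: declare the total resources of each
        // group instead of annotating every operator. Putting every operator into its own
        // group would recover operator-level granularity.
        // env.declareSlotSharingGroupResource("heavy", 2.0 /* CPU cores */, 1024 /* MB */);
        // env.declareSlotSharingGroupResource("light", 0.5 /* CPU cores */, 256 /* MB */);

        env.execute("SSG-based resource requirements sketch");
    }
}

Whether the declared group resources are then treated as a strict co-location constraint or only as a hint is the semantic that, as discussed above, should be spelled out in the JavaDocs and user documentation.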
@Till
Also +1 to treat the SSG resource requirements as a hint instead of a restrict. We can treat it as a follow-up effort and make it clear in JavaDocs at the first step. Best, Yangze Guo On Thu, Jan 21, 2021 at 10:00 AM Xintong Song <[hidden email]> wrote: > > I think this makes sense. > > The semantic of a SSG is that operators in the group *can* be scheduled > together in a slot, which is not a *must*. Specifying resources for SSGs > should not change that semantic. In cases that needs for scheduling the > operators into different slots arise, it makes sense for the runtime to > derive the finer grained resource requirements, if not provided. > > We may not need to implement this at the moment since currently SSGs are > always respected, but we should make that semantic explicit in JavaDocs for > the interfaces and user documentations when the user APIs are exposed. > > Thank you~ > > Xintong Song > > > > On Thu, Jan 21, 2021 at 1:55 AM Till Rohrmann <[hidden email]> wrote: > > > Maybe a different minor idea: Would it be possible to treat the SSG > > resource requirements as a hint for the runtime similar to how slot sharing > > groups are designed at the moment? Meaning that we don't give the guarantee > > that Flink will always deploy this set of tasks together no matter what > > comes. If, for example, the runtime can derive by some means the resource > > requirements for each task based on the requirements for the SSG, this > > could be possible. One easy strategy would be to give every task the same > > resources as the whole slot sharing group. Another one could be > > distributing the resources equally among the tasks. This does not even have > > to be implemented but we would give ourselves the freedom to change > > scheduling if need should arise. > > > > Cheers, > > Till > > > > On Wed, Jan 20, 2021 at 7:04 AM Yangze Guo <[hidden email]> wrote: > > > > > Thanks for the responses, Till and Xintong. > > > > > > I second Xintong's comment that SSG-based runtime interface will give > > > us the flexibility to achieve op/task-based approach. That's one of > > > the most important reasons for our design choice. > > > > > > Some cents regarding the default operator resource: > > > - It might be good for the scenario of DataStream jobs. > > > ** For light-weight operators, the accumulative configuration error > > > will not be significant. Then, the resource of a task used is > > > proportional to the number of operators it contains. > > > ** For heavy operators like join and window or operators using the > > > external resources, user will turn to the fine-grained resource > > > configuration. > > > - It can increase the stability for the standalone cluster where task > > > executors registered are heterogeneous(with different default slot > > > resources). > > > - It might not be good for SQL users. The operators that SQL will be > > > transferred to is a black box to the user. We also do not guarantee > > > the cross-version of consistency of the transformation so far. > > > > > > I think it can be treated as a follow-up work when the fine-grained > > > resource management is end-to-end ready. > > > > > > Best, > > > Yangze Guo > > > > > > > > > On Wed, Jan 20, 2021 at 11:16 AM Xintong Song <[hidden email]> > > > wrote: > > > > > > > > Thanks for the feedback, Till. > > > > > > > > ## I feel that what you proposed (operator-based + default value) might > > > be > > > > subsumed by the SSG-based approach. 
> > > > Thinking of op_1 -> op_2, there are the following 4 cases, categorized > > by > > > > whether the resource requirements are known to the users. > > > > > > > > 1. *Both known.* As previously mentioned, there's no reason to put > > > > multiple operators whose individual resource requirements are > > already > > > known > > > > into the same group in fine-grained resource management. And if op_1 > > > and > > > > op_2 are in different groups, there should be no problem switching > > > data > > > > exchange mode from pipelined to blocking. This is equivalent to > > > specifying > > > > operator resource requirements in your proposal. > > > > 2. *op_1 known, op_2 unknown.* Similar to 1), except that op_2 is > > in a > > > > SSG whose resource is not specified thus would have the default slot > > > > resource. This is equivalent to having default operator resources in > > > your > > > > proposal. > > > > 3. *Both unknown*. The user can either set op_1 and op_2 to the same > > > SSG > > > > or separate SSGs. > > > > - If op_1 and op_2 are in the same SSG, it will be equivalent to > > > the > > > > coarse-grained resource management, where op_1 and op_2 share a > > > default > > > > size slot no matter which data exchange mode is used. > > > > - If op_1 and op_2 are in different SSGs, then each of them will > > > use > > > > a default size slot. This is equivalent to setting them with > > > default > > > > operator resources in your proposal. > > > > 4. *Total (pipeline) or max (blocking) of op_1 and op_2 is known.* > > > > - It is possible that the user learns the total / max resource > > > > requirement from executing and monitoring the job, while not > > > > being aware of > > > > individual operator requirements. > > > > - I believe this is the case your proposal does not cover. And > > TBH, > > > > this is probably how most users learn the resource requirements, > > > > according > > > > to my experiences. > > > > - In this case, the user might need to specify different > > resources > > > if > > > > he wants to switch the execution mode, which should not be worse > > > than not > > > > being able to use fine-grained resource management. > > > > > > > > > > > > ## An additional idea inspired by your proposal. > > > > We may provide multiple options for deciding resources for SSGs whose > > > > requirement is not specified, if needed. > > > > > > > > - Default slot resource (current design) > > > > - Default operator resource times number of operators (equivalent to > > > > your proposal) > > > > > > > > > > > > ## Exposing internal runtime strategies > > > > Theoretically, yes. Tying to the SSGs, the resource requirements might > > be > > > > affected if how SSGs are internally handled changes in future. > > > Practically, > > > > I do not concretely see at the moment what kind of changes we may want > > in > > > > future that might conflict with this FLIP proposal, as the question of > > > > switching data exchange mode answered above. I'd suggest to not give up > > > the > > > > user friendliness we may gain now for the future problems that may or > > may > > > > not exist. > > > > > > > > Moreover, the SSG-based approach has the flexibility to achieve the > > > > equivalent behavior as the operator-based approach, if we set each > > > operator > > > > (or task) to a separate SSG. We can even provide a shortcut option to > > > > automatically do that for users, if needed. 
> > > > > > > > > > > > Thank you~ > > > > > > > > Xintong Song > > > > > > > > > > > > > > > > On Tue, Jan 19, 2021 at 11:48 PM Till Rohrmann <[hidden email]> > > > wrote: > > > > > > > > > Thanks for the responses Xintong and Stephan, > > > > > > > > > > I agree that being able to define the resource requirements for a > > > group of > > > > > operators is more user friendly. However, my concern is that we are > > > > > exposing thereby internal runtime strategies which might limit our > > > > > flexibility to execute a given job. Moreover, the semantics of > > > configuring > > > > > resource requirements for SSGs could break if switching from > > streaming > > > to > > > > > batch execution. If one defines the resource requirements for op_1 -> > > > op_2 > > > > > which run in pipelined mode when using the streaming execution, then > > > how do > > > > > we interpret these requirements when op_1 -> op_2 are executed with a > > > > > blocking data exchange in batch execution mode? Consequently, I am > > > still > > > > > leaning towards Stephan's proposal to set the resource requirements > > per > > > > > operator. > > > > > > > > > > Maybe the following proposal makes the configuration easier: If the > > > user > > > > > wants to use fine-grained resource requirements, then she needs to > > > specify > > > > > the default size which is used for operators which have no explicit > > > > > resource annotation. If this holds true, then every operator would > > > have a > > > > > resource requirement and the system can try to execute the operators > > > in the > > > > > best possible manner w/o being constrained by how the user set the > > SSG > > > > > requirements. > > > > > > > > > > Cheers, > > > > > Till > > > > > > > > > > On Tue, Jan 19, 2021 at 9:09 AM Xintong Song <[hidden email]> > > > > > wrote: > > > > > > > > > > > Thanks for the feedback, Stephan. > > > > > > > > > > > > Actually, your proposal has also come to my mind at some point. > > And I > > > > > have > > > > > > some concerns about it. > > > > > > > > > > > > > > > > > > 1. It does not give users the same control as the SSG-based > > approach. > > > > > > > > > > > > > > > > > > While both approaches do not require specifying for each operator, > > > > > > SSG-based approach supports the semantic that "some operators > > > together > > > > > use > > > > > > this much resource" while the operator-based approach doesn't. > > > > > > > > > > > > > > > > > > Think of a long pipeline with m operators (o_1, o_2, ..., o_m), and > > > at > > > > > some > > > > > > point there's an agg o_n (1 < n < m) which significantly reduces > > the > > > data > > > > > > amount. One can separate the pipeline into 2 groups SSG_1 (o_1, > > ..., > > > o_n) > > > > > > and SSG_2 (o_n+1, ... o_m), so that configuring much higher > > > parallelisms > > > > > > for operators in SSG_1 than for operators in SSG_2 won't lead to > > too > > > much > > > > > > wasting of resources. If the two SSGs end up needing different > > > resources, > > > > > > with the SSG-based approach one can directly specify resources for > > > the > > > > > two > > > > > > groups. However, with the operator-based approach, the user will > > > have to > > > > > > specify resources for each operator in one of the two groups, and > > > tune > > > > > the > > > > > > default slot resource via configurations to fit the other group. > > > > > > > > > > > > > > > > > > 2. It increases the chance of breaking operator chains. 
> > > > > > > > > > > > > > > > > > Setting chainnable operators into different slot sharing groups > > will > > > > > > prevent them from being chained. In the current implementation, > > > > > downstream > > > > > > operators, if SSG not explicitly specified, will be set to the same > > > group > > > > > > as the chainable upstream operators (unless multiple upstream > > > operators > > > > > in > > > > > > different groups), to reduce the chance of breaking chains. > > > > > > > > > > > > > > > > > > Thinking of chainable operators o_1 -> o_2 -> o_3 -> o_3, deciding > > > SSGs > > > > > > based on whether resource is specified we will easily get groups > > like > > > > > (o_1, > > > > > > o_3) & (o_2, o_4), where none of the operators can be chained. This > > > is > > > > > also > > > > > > possible for the SSG-based approach, but I believe the chance is > > much > > > > > > smaller because there's no strong reason for users to specify the > > > groups > > > > > > with alternate operators like that. We are more likely to get > > groups > > > like > > > > > > (o_1, o_2) & (o_3, o_4), where the chain breaks only between o_2 > > and > > > o_3. > > > > > > > > > > > > > > > > > > 3. It complicates the system by having two different mechanisms for > > > > > sharing > > > > > > managed memory in a slot. > > > > > > > > > > > > > > > > > > - In FLIP-141, we introduced the intra-slot managed memory sharing > > > > > > mechanism, where managed memory is first distributed according to > > the > > > > > > consumer type, then further distributed across operators of that > > > consumer > > > > > > type. > > > > > > > > > > > > - With the operator-based approach, managed memory size specified > > > for an > > > > > > operator should account for all the consumer types of that > > operator. > > > That > > > > > > means the managed memory is first distributed across operators, > > then > > > > > > distributed to different consumer types of each operator. > > > > > > > > > > > > > > > > > > Unfortunately, the different order of the two calculation steps can > > > lead > > > > > to > > > > > > different results. To be specific, the semantic of the > > configuration > > > > > option > > > > > > `consumer-weights` changed (within a slot vs. within an operator). > > > > > > > > > > > > > > > > > > > > > > > > To sum up things: > > > > > > > > > > > > While (3) might be a bit more implementation related, I think (1) > > > and (2) > > > > > > somehow suggest that, the price for the proposed approach to avoid > > > > > > specifying resource for every operator is that it's not as > > > independent > > > > > from > > > > > > operator chaining and slot sharing as the operator-based approach > > > > > discussed > > > > > > in the FLIP. > > > > > > > > > > > > > > > > > > Thank you~ > > > > > > > > > > > > Xintong Song > > > > > > > > > > > > > > > > > > > > > > > > On Tue, Jan 19, 2021 at 4:29 AM Stephan Ewen <[hidden email]> > > > wrote: > > > > > > > > > > > > > Thanks a lot, Yangze and Xintong for this FLIP. > > > > > > > > > > > > > > I want to say, first of all, that this is super well written. And > > > the > > > > > > > points that the FLIP makes about how to expose the configuration > > to > > > > > users > > > > > > > is exactly the right thing to figure out first. > > > > > > > So good job here! > > > > > > > > > > > > > > About how to let users specify the resource profiles. 
If I can > > sum > > > the > > > > > > FLIP > > > > > > > and previous discussion up in my own words, the problem is the > > > > > following: > > > > > > > > > > > > > > Operator-level specification is the simplest and cleanest > > approach, > > > > > > because > > > > > > > > it avoids mixing operator configuration (resource) and > > > scheduling. No > > > > > > > > matter what other parameters change (chaining, slot sharing, > > > > > switching > > > > > > > > pipelined and blocking shuffles), the resource profiles stay > > the > > > > > same. > > > > > > > > But it would require that a user specifies resources on all > > > > > operators, > > > > > > > > which makes it hard to use. That's why the FLIP suggests going > > > with > > > > > > > > specifying resources on a Sharing-Group. > > > > > > > > > > > > > > > > > > > > > I think both thoughts are important, so can we find a solution > > > where > > > > > the > > > > > > > Resource Profiles are specified on an Operator, but we still > > avoid > > > that > > > > > > we > > > > > > > need to specify a resource profile on every operator? > > > > > > > > > > > > > > What do you think about something like the following: > > > > > > > - Resource Profiles are specified on an operator level. > > > > > > > - Not all operators need profiles > > > > > > > - All Operators without a Resource Profile ended up in the > > > default > > > > > slot > > > > > > > sharing group with a default profile (will get a default slot). > > > > > > > - All Operators with a Resource Profile will go into another > > slot > > > > > > sharing > > > > > > > group (the resource-specified-group). > > > > > > > - Users can define different slot sharing groups for operators > > > like > > > > > > they > > > > > > > do now, with the exception that you cannot mix operators that > > have > > > a > > > > > > > resource profile and operators that have no resource profile. > > > > > > > - The default case where no operator has a resource profile is > > > just a > > > > > > > special case of this model > > > > > > > - The chaining logic sums up the profiles per operator, like it > > > does > > > > > > now, > > > > > > > and the scheduler sums up the profiles of the tasks that it > > > schedules > > > > > > > together. > > > > > > > > > > > > > > > > > > > > > There is another question about reactive scaling raised in the > > > FLIP. I > > > > > > need > > > > > > > to think a bit about that. That is indeed a bit more tricky once > > we > > > > > have > > > > > > > slots of different sizes. > > > > > > > It is not clear then which of the different slot requests the > > > > > > > ResourceManager should fulfill when new resources (TMs) show up, > > > or how > > > > > > the > > > > > > > JobManager redistributes the slots resources when resources (TMs) > > > > > > disappear > > > > > > > This question is pretty orthogonal, though, to the "how to > > specify > > > the > > > > > > > resources". > > > > > > > > > > > > > > > > > > > > > Best, > > > > > > > Stephan > > > > > > > > > > > > > > On Fri, Jan 8, 2021 at 5:14 AM Xintong Song < > > [hidden email] > > > > > > > > > > wrote: > > > > > > > > > > > > > > > Thanks for drafting the FLIP and driving the discussion, > > Yangze. > > > > > > > > And Thanks for the feedback, Till and Chesnay. 
> > > > > > > > > > > > > > > > @Till, > > > > > > > > > > > > > > > > I agree that specifying requirements for SSGs means that SSGs > > > need to > > > > > > be > > > > > > > > supported in fine-grained resource management, otherwise each > > > > > operator > > > > > > > > might use as many resources as the whole group. However, I > > cannot > > > > > think > > > > > > > of > > > > > > > > a strong reason for not supporting SSGs in fine-grained > > resource > > > > > > > > management. > > > > > > > > > > > > > > > > > > > > > > > > > Interestingly, if all operators have their resources properly > > > > > > > specified, > > > > > > > > > then slot sharing is no longer needed because Flink could > > > slice off > > > > > > the > > > > > > > > > appropriately sized slots for every Task individually. > > > > > > > > > > > > > > > > > > > > > > > > > So for example, if we have a job consisting of two operator > > op_1 > > > and > > > > > > op_2 > > > > > > > > > where each op needs 100 MB of memory, we would then say that > > > the > > > > > slot > > > > > > > > > sharing group needs 200 MB of memory to run. If we have a > > > cluster > > > > > > with > > > > > > > 2 > > > > > > > > > TMs with one slot of 100 MB each, then the system cannot run > > > this > > > > > > job. > > > > > > > If > > > > > > > > > the resources were specified on an operator level, then the > > > system > > > > > > > could > > > > > > > > > still make the decision to deploy op_1 to TM_1 and op_2 to > > > TM_2. > > > > > > > > > > > > > > > > > > > > > > > > Couldn't agree more that if all operators' requirements are > > > properly > > > > > > > > specified, slot sharing should be no longer needed. I think > > this > > > > > > exactly > > > > > > > > disproves the example. If we already know op_1 and op_2 each > > > needs > > > > > 100 > > > > > > MB > > > > > > > > of memory, why would we put them in the same group? If they are > > > in > > > > > > > separate > > > > > > > > groups, with the proposed approach the system can freely deploy > > > them > > > > > to > > > > > > > > either a 200 MB TM or two 100 MB TMs. > > > > > > > > > > > > > > > > Moreover, the precondition for not needing slot sharing is > > having > > > > > > > resource > > > > > > > > requirements properly specified for all operators. This is not > > > always > > > > > > > > possible, and usually requires tremendous efforts. One of the > > > > > benefits > > > > > > > for > > > > > > > > SSG-based requirements is that it allows the user to freely > > > decide > > > > > the > > > > > > > > granularity, thus efforts they want to pay. I would consider > > SSG > > > in > > > > > > > > fine-grained resource management as a group of operators that > > the > > > > > user > > > > > > > > would like to specify the total resource for. There can be only > > > one > > > > > > group > > > > > > > > in the job, 2~3 groups dividing the job into a few major parts, > > > or as > > > > > > > many > > > > > > > > groups as the number of tasks/operators, depending on how > > > > > fine-grained > > > > > > > the > > > > > > > > user is able to specify the resources. > > > > > > > > > > > > > > > > Having to support SSGs might be a constraint. But given that > > all > > > the > > > > > > > > current scheduler implementations already support SSGs, I tend > > to > > > > > think > > > > > > > > that as an acceptable price for the above discussed usability > > and > > > > > > > > flexibility. 
> > > > > > > > > > > > > > > > @Chesnay > > > > > > > > > > > > > > > > Will declaring them on slot sharing groups not also waste > > > resources > > > > > if > > > > > > > the > > > > > > > > > parallelism of operators within that group are different? > > > > > > > > > > > > > > > > > Yes. It's a trade-off between usability and resource > > > utilization. To > > > > > > > avoid > > > > > > > > such wasting, the user can define more groups, so that each > > group > > > > > > > contains > > > > > > > > less operators and the chance of having operators with > > different > > > > > > > > parallelism will be reduced. The price is to have more resource > > > > > > > > requirements to specify. > > > > > > > > > > > > > > > > It also seems like quite a hassle for users having to > > > recalculate the > > > > > > > > > resource requirements if they change the slot sharing. > > > > > > > > > I'd think that it's not really workable for users that create > > > a set > > > > > > of > > > > > > > > > re-usable operators which are mixed and matched in their > > > > > > applications; > > > > > > > > > managing the resources requirements in such a setting would > > be > > > a > > > > > > > > > nightmare, and in the end would require operator-level > > > requirements > > > > > > any > > > > > > > > > way. > > > > > > > > > In that sense, I'm not even sure whether it really increases > > > > > > usability. > > > > > > > > > > > > > > > > > > > > > > > > > - As mentioned in my reply to Till's comment, there's no > > > reason to > > > > > > put > > > > > > > > multiple operators whose individual resource requirements > > are > > > > > > already > > > > > > > > known > > > > > > > > into the same group in fine-grained resource management. > > > > > > > > - Even an operator implementation is reused for multiple > > > > > > applications, > > > > > > > > it does not guarantee the same resource requirements. During > > > our > > > > > > years > > > > > > > > of > > > > > > > > practices in Alibaba, with per-operator requirements > > > specified for > > > > > > > > Blink's > > > > > > > > fine-grained resource management, very few users (including > > > our > > > > > > > > specialists > > > > > > > > who are dedicated to supporting Blink users) are as > > > experienced as > > > > > > to > > > > > > > > accurately predict/estimate the operator resource > > > requirements. > > > > > Most > > > > > > > > people > > > > > > > > rely on the execution-time metrics (throughput, delay, cpu > > > load, > > > > > > > memory > > > > > > > > usage, GC pressure, etc.) to improve the specification. > > > > > > > > > > > > > > > > To sum up: > > > > > > > > If the user is capable of providing proper resource > > requirements > > > for > > > > > > > every > > > > > > > > operator, that's definitely a good thing and we would not need > > to > > > > > rely > > > > > > on > > > > > > > > the SSGs. However, that shouldn't be a *must* for the > > > fine-grained > > > > > > > resource > > > > > > > > management to work. For those users who are capable and do not > > > like > > > > > > > having > > > > > > > > to set each operator to a separate SSG, I would be ok to have > > > both > > > > > > > > SSG-based and operator-based runtime interfaces and to only > > > fallback > > > > > to > > > > > > > the > > > > > > > > SSG requirements when the operator requirements are not > > > specified. 
> > > > > > > However, > > > > > > > > as the first step, I think we should prioritise the use cases > > > where > > > > > > users > > > > > > > > are not that experienced. > > > > > > > > > > > > > > > > Thank you~ > > > > > > > > > > > > > > > > Xintong Song > > > > > > > > > > > > > > > > On Thu, Jan 7, 2021 at 9:55 PM Chesnay Schepler < > > > [hidden email]> > > > > > > > > wrote: > > > > > > > > > > > > > > > > > Will declaring them on slot sharing groups not also waste > > > resources > > > > > > if > > > > > > > > > the parallelism of operators within that group are different? > > > > > > > > > > > > > > > > > > It also seems like quite a hassle for users having to > > > recalculate > > > > > the > > > > > > > > > resource requirements if they change the slot sharing. > > > > > > > > > I'd think that it's not really workable for users that create > > > a set > > > > > > of > > > > > > > > > re-usable operators which are mixed and matched in their > > > > > > applications; > > > > > > > > > managing the resources requirements in such a setting would > > be > > > a > > > > > > > > > nightmare, and in the end would require operator-level > > > requirements > > > > > > any > > > > > > > > > way. > > > > > > > > > In that sense, I'm not even sure whether it really increases > > > > > > usability. > > > > > > > > > > > > > > > > > > My main worry is that it if we wire the runtime to work on > > SSGs > > > > > it's > > > > > > > > > gonna be difficult to implement more fine-grained approaches, > > > which > > > > > > > > > would not be the case if, for the runtime, they are always > > > defined > > > > > on > > > > > > > an > > > > > > > > > operator-level. > > > > > > > > > > > > > > > > > > On 1/7/2021 2:42 PM, Till Rohrmann wrote: > > > > > > > > > > Thanks for drafting this FLIP and starting this discussion > > > > > Yangze. > > > > > > > > > > > > > > > > > > > > I like that defining resource requirements on a slot > > sharing > > > > > group > > > > > > > > makes > > > > > > > > > > the overall setup easier and improves usability of resource > > > > > > > > requirements. > > > > > > > > > > > > > > > > > > > > What I do not like about it is that it changes slot sharing > > > > > groups > > > > > > > from > > > > > > > > > > being a scheduling hint to something which needs to be > > > supported > > > > > in > > > > > > > > order > > > > > > > > > > to support fine grained resource requirements. So far, the > > > idea > > > > > of > > > > > > > slot > > > > > > > > > > sharing groups was that it tells the system that a set of > > > > > operators > > > > > > > can > > > > > > > > > be > > > > > > > > > > deployed in the same slot. But the system still had the > > > freedom > > > > > to > > > > > > > say > > > > > > > > > that > > > > > > > > > > it would rather place these tasks in different slots if it > > > > > wanted. > > > > > > If > > > > > > > > we > > > > > > > > > > now specify resource requirements on a per slot sharing > > > group, > > > > > then > > > > > > > the > > > > > > > > > > only option for a scheduler which does not support slot > > > sharing > > > > > > > groups > > > > > > > > is > > > > > > > > > > to say that every operator in this slot sharing group > > needs a > > > > > slot > > > > > > > with > > > > > > > > > the > > > > > > > > > > same resources as the whole group. 
> > > > > > > > > > > > > > > > > > > > So for example, if we have a job consisting of two operator > > > op_1 > > > > > > and > > > > > > > > op_2 > > > > > > > > > > where each op needs 100 MB of memory, we would then say > > that > > > the > > > > > > slot > > > > > > > > > > sharing group needs 200 MB of memory to run. If we have a > > > cluster > > > > > > > with > > > > > > > > 2 > > > > > > > > > > TMs with one slot of 100 MB each, then the system cannot > > run > > > this > > > > > > > job. > > > > > > > > If > > > > > > > > > > the resources were specified on an operator level, then the > > > > > system > > > > > > > > could > > > > > > > > > > still make the decision to deploy op_1 to TM_1 and op_2 to > > > TM_2. > > > > > > > > > > > > > > > > > > > > Originally, one of the primary goals of slot sharing groups > > > was > > > > > to > > > > > > > make > > > > > > > > > it > > > > > > > > > > easier for the user to reason about how many slots a job > > > needs > > > > > > > > > independent > > > > > > > > > > of the actual number of operators in the job. > > Interestingly, > > > if > > > > > all > > > > > > > > > > operators have their resources properly specified, then > > slot > > > > > > sharing > > > > > > > is > > > > > > > > > no > > > > > > > > > > longer needed because Flink could slice off the > > appropriately > > > > > sized > > > > > > > > slots > > > > > > > > > > for every Task individually. What matters is whether the > > > whole > > > > > > > cluster > > > > > > > > > has > > > > > > > > > > enough resources to run all tasks or not. > > > > > > > > > > > > > > > > > > > > Cheers, > > > > > > > > > > Till > > > > > > > > > > > > > > > > > > > > On Thu, Jan 7, 2021 at 4:08 AM Yangze Guo < > > > [hidden email]> > > > > > > > wrote: > > > > > > > > > > > > > > > > > > > >> Hi, there, > > > > > > > > > >> > > > > > > > > > >> We would like to start a discussion thread on "FLIP-156: > > > Runtime > > > > > > > > > >> Interfaces for Fine-Grained Resource Requirements"[1], > > > where we > > > > > > > > > >> propose Slot Sharing Group (SSG) based runtime interfaces > > > for > > > > > > > > > >> specifying fine-grained resource requirements. > > > > > > > > > >> > > > > > > > > > >> In this FLIP: > > > > > > > > > >> - Expound the user story of fine-grained resource > > > management. > > > > > > > > > >> - Propose runtime interfaces for specifying SSG-based > > > resource > > > > > > > > > >> requirements. > > > > > > > > > >> - Discuss the pros and cons of the three potential > > > granularities > > > > > > for > > > > > > > > > >> specifying the resource requirements (op, task and slot > > > sharing > > > > > > > group) > > > > > > > > > >> and explain why we choose the slot sharing group. > > > > > > > > > >> > > > > > > > > > >> Please find more details in the FLIP wiki document [1]. > > > Looking > > > > > > > > > >> forward to your feedback. > > > > > > > > > >> > > > > > > > > > >> [1] > > > > > > > > > >> > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > https://cwiki.apache.org/confluence/display/FLINK/FLIP-156%3A+Runtime+Interfaces+for+Fine-Grained+Resource+Requirements > > > > > > > > > >> > > > > > > > > > >> Best, > > > > > > > > > >> Yangze Guo > > > > > > > > > >> > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > |
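To make the proposals above concrete: the following is a minimal, self-contained Java sketch of the SSG-based specification and of the two "hint" fallback strategies Till describes (give every task the whole group's resources, or split the group's resources equally among its tasks). All class and method names here are illustrative assumptions for the purpose of this thread, not Flink's actual API.

    import java.util.List;
    import java.util.Map;
    import java.util.stream.Collectors;

    /** Illustrative model only -- not Flink's actual API. */
    public class SsgResourceSketch {

        /** A simplified resource profile: CPU cores and memory in MB. */
        record ResourceSpec(double cpuCores, int memoryMB) {
            ResourceSpec dividedBy(int n) {
                return new ResourceSpec(cpuCores / n, memoryMB / n);
            }
        }

        /** A slot sharing group whose total resource requirement is declared by the user. */
        record SlotSharingGroupSpec(String name, List<String> operators, ResourceSpec total) {}

        /** Fallback 1: every task asks for the whole group's resources. */
        static Map<String, ResourceSpec> wholeGroupPerTask(SlotSharingGroupSpec ssg) {
            return ssg.operators().stream()
                    .collect(Collectors.toMap(op -> op, op -> ssg.total()));
        }

        /** Fallback 2: the group's resources are split equally among its tasks. */
        static Map<String, ResourceSpec> equalSplitPerTask(SlotSharingGroupSpec ssg) {
            int n = ssg.operators().size();
            return ssg.operators().stream()
                    .collect(Collectors.toMap(op -> op, op -> ssg.total().dividedBy(n)));
        }

        public static void main(String[] args) {
            // The user declares only the total for the group: op_1 and op_2 together need 200 MB.
            SlotSharingGroupSpec ssg = new SlotSharingGroupSpec(
                    "ssg_1", List.of("op_1", "op_2"), new ResourceSpec(2.0, 200));

            System.out.println(wholeGroupPerTask(ssg)); // every task asks for 2.0 cores / 200 MB
            System.out.println(equalSplitPerTask(ssg)); // every task asks for 1.0 cores / 100 MB
        }
    }

Under the equal-split fallback, Till's earlier example of two TMs with one 100 MB slot each becomes schedulable again even though the requirement was declared for the group as a whole; the whole-group fallback reproduces the conservative behaviour of a scheduler that does not understand slot sharing groups.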
Is there even a functional difference between specifying the
requirements for an SSG vs specifying the same requirements on a single operator within that group (ideally a colocation group to avoid this whole hint business)? Wouldn't we get the best of both worlds in the latter case? Users can take shortcuts to define shared requirements, but refine them further as needed on a per-operator basis, without changing semantics of slotsharing groups nor the runtime being locked into SSG-based requirements. (And before anyone argues what happens if slotsharing groups change or whatnot, that's a plain API issue that we could surely solve. (A plain iteration over slotsharing groups and therein contained operators would suffice)). On 1/20/2021 6:48 PM, Till Rohrmann wrote: > Maybe a different minor idea: Would it be possible to treat the SSG > resource requirements as a hint for the runtime similar to how slot sharing > groups are designed at the moment? Meaning that we don't give the guarantee > that Flink will always deploy this set of tasks together no matter what > comes. If, for example, the runtime can derive by some means the resource > requirements for each task based on the requirements for the SSG, this > could be possible. One easy strategy would be to give every task the same > resources as the whole slot sharing group. Another one could be > distributing the resources equally among the tasks. This does not even have > to be implemented but we would give ourselves the freedom to change > scheduling if need should arise. > > Cheers, > Till > > On Wed, Jan 20, 2021 at 7:04 AM Yangze Guo <[hidden email]> wrote: > >> Thanks for the responses, Till and Xintong. >> >> I second Xintong's comment that SSG-based runtime interface will give >> us the flexibility to achieve op/task-based approach. That's one of >> the most important reasons for our design choice. >> >> Some cents regarding the default operator resource: >> - It might be good for the scenario of DataStream jobs. >> ** For light-weight operators, the accumulative configuration error >> will not be significant. Then, the resource of a task used is >> proportional to the number of operators it contains. >> ** For heavy operators like join and window or operators using the >> external resources, user will turn to the fine-grained resource >> configuration. >> - It can increase the stability for the standalone cluster where task >> executors registered are heterogeneous(with different default slot >> resources). >> - It might not be good for SQL users. The operators that SQL will be >> transferred to is a black box to the user. We also do not guarantee >> the cross-version of consistency of the transformation so far. >> >> I think it can be treated as a follow-up work when the fine-grained >> resource management is end-to-end ready. >> >> Best, >> Yangze Guo >> >> >> On Wed, Jan 20, 2021 at 11:16 AM Xintong Song <[hidden email]> >> wrote: >>> Thanks for the feedback, Till. >>> >>> ## I feel that what you proposed (operator-based + default value) might >> be >>> subsumed by the SSG-based approach. >>> Thinking of op_1 -> op_2, there are the following 4 cases, categorized by >>> whether the resource requirements are known to the users. >>> >>> 1. *Both known.* As previously mentioned, there's no reason to put >>> multiple operators whose individual resource requirements are already >> known >>> into the same group in fine-grained resource management. And if op_1 >> and >>> op_2 are in different groups, there should be no problem switching >> data >>> exchange mode from pipelined to blocking. 
This is equivalent to >> specifying >>> operator resource requirements in your proposal. >>> 2. *op_1 known, op_2 unknown.* Similar to 1), except that op_2 is in a >>> SSG whose resource is not specified thus would have the default slot >>> resource. This is equivalent to having default operator resources in >> your >>> proposal. >>> 3. *Both unknown*. The user can either set op_1 and op_2 to the same >> SSG >>> or separate SSGs. >>> - If op_1 and op_2 are in the same SSG, it will be equivalent to >> the >>> coarse-grained resource management, where op_1 and op_2 share a >> default >>> size slot no matter which data exchange mode is used. >>> - If op_1 and op_2 are in different SSGs, then each of them will >> use >>> a default size slot. This is equivalent to setting them with >> default >>> operator resources in your proposal. >>> 4. *Total (pipeline) or max (blocking) of op_1 and op_2 is known.* >>> - It is possible that the user learns the total / max resource >>> requirement from executing and monitoring the job, while not >>> being aware of >>> individual operator requirements. >>> - I believe this is the case your proposal does not cover. And TBH, >>> this is probably how most users learn the resource requirements, >>> according >>> to my experiences. >>> - In this case, the user might need to specify different resources >> if >>> he wants to switch the execution mode, which should not be worse >> than not >>> being able to use fine-grained resource management. >>> >>> >>> ## An additional idea inspired by your proposal. >>> We may provide multiple options for deciding resources for SSGs whose >>> requirement is not specified, if needed. >>> >>> - Default slot resource (current design) >>> - Default operator resource times number of operators (equivalent to >>> your proposal) >>> >>> >>> ## Exposing internal runtime strategies >>> Theoretically, yes. Tying to the SSGs, the resource requirements might be >>> affected if how SSGs are internally handled changes in future. >> Practically, >>> I do not concretely see at the moment what kind of changes we may want in >>> future that might conflict with this FLIP proposal, as the question of >>> switching data exchange mode answered above. I'd suggest to not give up >> the >>> user friendliness we may gain now for the future problems that may or may >>> not exist. >>> >>> Moreover, the SSG-based approach has the flexibility to achieve the >>> equivalent behavior as the operator-based approach, if we set each >> operator >>> (or task) to a separate SSG. We can even provide a shortcut option to >>> automatically do that for users, if needed. >>> >>> >>> Thank you~ >>> >>> Xintong Song >>> >>> >>> >>> On Tue, Jan 19, 2021 at 11:48 PM Till Rohrmann <[hidden email]> >> wrote: >>>> Thanks for the responses Xintong and Stephan, >>>> >>>> I agree that being able to define the resource requirements for a >> group of >>>> operators is more user friendly. However, my concern is that we are >>>> exposing thereby internal runtime strategies which might limit our >>>> flexibility to execute a given job. Moreover, the semantics of >> configuring >>>> resource requirements for SSGs could break if switching from streaming >> to >>>> batch execution. If one defines the resource requirements for op_1 -> >> op_2 >>>> which run in pipelined mode when using the streaming execution, then >> how do >>>> we interpret these requirements when op_1 -> op_2 are executed with a >>>> blocking data exchange in batch execution mode? 
Consequently, I am >> still >>>> leaning towards Stephan's proposal to set the resource requirements per >>>> operator. >>>> >>>> Maybe the following proposal makes the configuration easier: If the >> user >>>> wants to use fine-grained resource requirements, then she needs to >> specify >>>> the default size which is used for operators which have no explicit >>>> resource annotation. If this holds true, then every operator would >> have a >>>> resource requirement and the system can try to execute the operators >> in the >>>> best possible manner w/o being constrained by how the user set the SSG >>>> requirements. >>>> >>>> Cheers, >>>> Till >>>> >>>> On Tue, Jan 19, 2021 at 9:09 AM Xintong Song <[hidden email]> >>>> wrote: >>>> >>>>> Thanks for the feedback, Stephan. >>>>> >>>>> Actually, your proposal has also come to my mind at some point. And I >>>> have >>>>> some concerns about it. >>>>> >>>>> >>>>> 1. It does not give users the same control as the SSG-based approach. >>>>> >>>>> >>>>> While both approaches do not require specifying for each operator, >>>>> SSG-based approach supports the semantic that "some operators >> together >>>> use >>>>> this much resource" while the operator-based approach doesn't. >>>>> >>>>> >>>>> Think of a long pipeline with m operators (o_1, o_2, ..., o_m), and >> at >>>> some >>>>> point there's an agg o_n (1 < n < m) which significantly reduces the >> data >>>>> amount. One can separate the pipeline into 2 groups SSG_1 (o_1, ..., >> o_n) >>>>> and SSG_2 (o_n+1, ... o_m), so that configuring much higher >> parallelisms >>>>> for operators in SSG_1 than for operators in SSG_2 won't lead to too >> much >>>>> wasting of resources. If the two SSGs end up needing different >> resources, >>>>> with the SSG-based approach one can directly specify resources for >> the >>>> two >>>>> groups. However, with the operator-based approach, the user will >> have to >>>>> specify resources for each operator in one of the two groups, and >> tune >>>> the >>>>> default slot resource via configurations to fit the other group. >>>>> >>>>> >>>>> 2. It increases the chance of breaking operator chains. >>>>> >>>>> >>>>> Setting chainnable operators into different slot sharing groups will >>>>> prevent them from being chained. In the current implementation, >>>> downstream >>>>> operators, if SSG not explicitly specified, will be set to the same >> group >>>>> as the chainable upstream operators (unless multiple upstream >> operators >>>> in >>>>> different groups), to reduce the chance of breaking chains. >>>>> >>>>> >>>>> Thinking of chainable operators o_1 -> o_2 -> o_3 -> o_3, deciding >> SSGs >>>>> based on whether resource is specified we will easily get groups like >>>> (o_1, >>>>> o_3) & (o_2, o_4), where none of the operators can be chained. This >> is >>>> also >>>>> possible for the SSG-based approach, but I believe the chance is much >>>>> smaller because there's no strong reason for users to specify the >> groups >>>>> with alternate operators like that. We are more likely to get groups >> like >>>>> (o_1, o_2) & (o_3, o_4), where the chain breaks only between o_2 and >> o_3. >>>>> >>>>> 3. It complicates the system by having two different mechanisms for >>>> sharing >>>>> managed memory in a slot. >>>>> >>>>> >>>>> - In FLIP-141, we introduced the intra-slot managed memory sharing >>>>> mechanism, where managed memory is first distributed according to the >>>>> consumer type, then further distributed across operators of that >> consumer >>>>> type. 
>>>>> >>>>> - With the operator-based approach, managed memory size specified >> for an >>>>> operator should account for all the consumer types of that operator. >> That >>>>> means the managed memory is first distributed across operators, then >>>>> distributed to different consumer types of each operator. >>>>> >>>>> >>>>> Unfortunately, the different order of the two calculation steps can >> lead >>>> to >>>>> different results. To be specific, the semantic of the configuration >>>> option >>>>> `consumer-weights` changed (within a slot vs. within an operator). >>>>> >>>>> >>>>> >>>>> To sum up things: >>>>> >>>>> While (3) might be a bit more implementation related, I think (1) >> and (2) >>>>> somehow suggest that, the price for the proposed approach to avoid >>>>> specifying resource for every operator is that it's not as >> independent >>>> from >>>>> operator chaining and slot sharing as the operator-based approach >>>> discussed >>>>> in the FLIP. >>>>> >>>>> >>>>> Thank you~ >>>>> >>>>> Xintong Song >>>>> >>>>> >>>>> >>>>> On Tue, Jan 19, 2021 at 4:29 AM Stephan Ewen <[hidden email]> >> wrote: >>>>>> Thanks a lot, Yangze and Xintong for this FLIP. >>>>>> >>>>>> I want to say, first of all, that this is super well written. And >> the >>>>>> points that the FLIP makes about how to expose the configuration to >>>> users >>>>>> is exactly the right thing to figure out first. >>>>>> So good job here! >>>>>> >>>>>> About how to let users specify the resource profiles. If I can sum >> the >>>>> FLIP >>>>>> and previous discussion up in my own words, the problem is the >>>> following: >>>>>> Operator-level specification is the simplest and cleanest approach, >>>>> because >>>>>>> it avoids mixing operator configuration (resource) and >> scheduling. No >>>>>>> matter what other parameters change (chaining, slot sharing, >>>> switching >>>>>>> pipelined and blocking shuffles), the resource profiles stay the >>>> same. >>>>>>> But it would require that a user specifies resources on all >>>> operators, >>>>>>> which makes it hard to use. That's why the FLIP suggests going >> with >>>>>>> specifying resources on a Sharing-Group. >>>>>> >>>>>> I think both thoughts are important, so can we find a solution >> where >>>> the >>>>>> Resource Profiles are specified on an Operator, but we still avoid >> that >>>>> we >>>>>> need to specify a resource profile on every operator? >>>>>> >>>>>> What do you think about something like the following: >>>>>> - Resource Profiles are specified on an operator level. >>>>>> - Not all operators need profiles >>>>>> - All Operators without a Resource Profile ended up in the >> default >>>> slot >>>>>> sharing group with a default profile (will get a default slot). >>>>>> - All Operators with a Resource Profile will go into another slot >>>>> sharing >>>>>> group (the resource-specified-group). >>>>>> - Users can define different slot sharing groups for operators >> like >>>>> they >>>>>> do now, with the exception that you cannot mix operators that have >> a >>>>>> resource profile and operators that have no resource profile. >>>>>> - The default case where no operator has a resource profile is >> just a >>>>>> special case of this model >>>>>> - The chaining logic sums up the profiles per operator, like it >> does >>>>> now, >>>>>> and the scheduler sums up the profiles of the tasks that it >> schedules >>>>>> together. >>>>>> >>>>>> >>>>>> There is another question about reactive scaling raised in the >> FLIP. I >>>>> need >>>>>> to think a bit about that. 
That is indeed a bit more tricky once we >>>> have >>>>>> slots of different sizes. >>>>>> It is not clear then which of the different slot requests the >>>>>> ResourceManager should fulfill when new resources (TMs) show up, >> or how >>>>> the >>>>>> JobManager redistributes the slots resources when resources (TMs) >>>>> disappear >>>>>> This question is pretty orthogonal, though, to the "how to specify >> the >>>>>> resources". >>>>>> >>>>>> >>>>>> Best, >>>>>> Stephan >>>>>> >>>>>> On Fri, Jan 8, 2021 at 5:14 AM Xintong Song <[hidden email] >>>>> wrote: >>>>>>> Thanks for drafting the FLIP and driving the discussion, Yangze. >>>>>>> And Thanks for the feedback, Till and Chesnay. >>>>>>> >>>>>>> @Till, >>>>>>> >>>>>>> I agree that specifying requirements for SSGs means that SSGs >> need to >>>>> be >>>>>>> supported in fine-grained resource management, otherwise each >>>> operator >>>>>>> might use as many resources as the whole group. However, I cannot >>>> think >>>>>> of >>>>>>> a strong reason for not supporting SSGs in fine-grained resource >>>>>>> management. >>>>>>> >>>>>>> >>>>>>>> Interestingly, if all operators have their resources properly >>>>>> specified, >>>>>>>> then slot sharing is no longer needed because Flink could >> slice off >>>>> the >>>>>>>> appropriately sized slots for every Task individually. >>>>>>>> >>>>>>> So for example, if we have a job consisting of two operator op_1 >> and >>>>> op_2 >>>>>>>> where each op needs 100 MB of memory, we would then say that >> the >>>> slot >>>>>>>> sharing group needs 200 MB of memory to run. If we have a >> cluster >>>>> with >>>>>> 2 >>>>>>>> TMs with one slot of 100 MB each, then the system cannot run >> this >>>>> job. >>>>>> If >>>>>>>> the resources were specified on an operator level, then the >> system >>>>>> could >>>>>>>> still make the decision to deploy op_1 to TM_1 and op_2 to >> TM_2. >>>>>>> >>>>>>> Couldn't agree more that if all operators' requirements are >> properly >>>>>>> specified, slot sharing should be no longer needed. I think this >>>>> exactly >>>>>>> disproves the example. If we already know op_1 and op_2 each >> needs >>>> 100 >>>>> MB >>>>>>> of memory, why would we put them in the same group? If they are >> in >>>>>> separate >>>>>>> groups, with the proposed approach the system can freely deploy >> them >>>> to >>>>>>> either a 200 MB TM or two 100 MB TMs. >>>>>>> >>>>>>> Moreover, the precondition for not needing slot sharing is having >>>>>> resource >>>>>>> requirements properly specified for all operators. This is not >> always >>>>>>> possible, and usually requires tremendous efforts. One of the >>>> benefits >>>>>> for >>>>>>> SSG-based requirements is that it allows the user to freely >> decide >>>> the >>>>>>> granularity, thus efforts they want to pay. I would consider SSG >> in >>>>>>> fine-grained resource management as a group of operators that the >>>> user >>>>>>> would like to specify the total resource for. There can be only >> one >>>>> group >>>>>>> in the job, 2~3 groups dividing the job into a few major parts, >> or as >>>>>> many >>>>>>> groups as the number of tasks/operators, depending on how >>>> fine-grained >>>>>> the >>>>>>> user is able to specify the resources. >>>>>>> >>>>>>> Having to support SSGs might be a constraint. But given that all >> the >>>>>>> current scheduler implementations already support SSGs, I tend to >>>> think >>>>>>> that as an acceptable price for the above discussed usability and >>>>>>> flexibility. 
>>>>>>> >>>>>>> @Chesnay >>>>>>> >>>>>>> Will declaring them on slot sharing groups not also waste >> resources >>>> if >>>>>> the >>>>>>>> parallelism of operators within that group are different? >>>>>>>> >>>>>>> Yes. It's a trade-off between usability and resource >> utilization. To >>>>>> avoid >>>>>>> such wasting, the user can define more groups, so that each group >>>>>> contains >>>>>>> less operators and the chance of having operators with different >>>>>>> parallelism will be reduced. The price is to have more resource >>>>>>> requirements to specify. >>>>>>> >>>>>>> It also seems like quite a hassle for users having to >> recalculate the >>>>>>>> resource requirements if they change the slot sharing. >>>>>>>> I'd think that it's not really workable for users that create >> a set >>>>> of >>>>>>>> re-usable operators which are mixed and matched in their >>>>> applications; >>>>>>>> managing the resources requirements in such a setting would be >> a >>>>>>>> nightmare, and in the end would require operator-level >> requirements >>>>> any >>>>>>>> way. >>>>>>>> In that sense, I'm not even sure whether it really increases >>>>> usability. >>>>>>> - As mentioned in my reply to Till's comment, there's no >> reason to >>>>> put >>>>>>> multiple operators whose individual resource requirements are >>>>> already >>>>>>> known >>>>>>> into the same group in fine-grained resource management. >>>>>>> - Even an operator implementation is reused for multiple >>>>> applications, >>>>>>> it does not guarantee the same resource requirements. During >> our >>>>> years >>>>>>> of >>>>>>> practices in Alibaba, with per-operator requirements >> specified for >>>>>>> Blink's >>>>>>> fine-grained resource management, very few users (including >> our >>>>>>> specialists >>>>>>> who are dedicated to supporting Blink users) are as >> experienced as >>>>> to >>>>>>> accurately predict/estimate the operator resource >> requirements. >>>> Most >>>>>>> people >>>>>>> rely on the execution-time metrics (throughput, delay, cpu >> load, >>>>>> memory >>>>>>> usage, GC pressure, etc.) to improve the specification. >>>>>>> >>>>>>> To sum up: >>>>>>> If the user is capable of providing proper resource requirements >> for >>>>>> every >>>>>>> operator, that's definitely a good thing and we would not need to >>>> rely >>>>> on >>>>>>> the SSGs. However, that shouldn't be a *must* for the >> fine-grained >>>>>> resource >>>>>>> management to work. For those users who are capable and do not >> like >>>>>> having >>>>>>> to set each operator to a separate SSG, I would be ok to have >> both >>>>>>> SSG-based and operator-based runtime interfaces and to only >> fallback >>>> to >>>>>> the >>>>>>> SSG requirements when the operator requirements are not >> specified. >>>>>> However, >>>>>>> as the first step, I think we should prioritise the use cases >> where >>>>> users >>>>>>> are not that experienced. >>>>>>> >>>>>>> Thank you~ >>>>>>> >>>>>>> Xintong Song >>>>>>> >>>>>>> On Thu, Jan 7, 2021 at 9:55 PM Chesnay Schepler < >> [hidden email]> >>>>>>> wrote: >>>>>>> >>>>>>>> Will declaring them on slot sharing groups not also waste >> resources >>>>> if >>>>>>>> the parallelism of operators within that group are different? >>>>>>>> >>>>>>>> It also seems like quite a hassle for users having to >> recalculate >>>> the >>>>>>>> resource requirements if they change the slot sharing. 
>>>>>>>> I'd think that it's not really workable for users that create >> a set >>>>> of >>>>>>>> re-usable operators which are mixed and matched in their >>>>> applications; >>>>>>>> managing the resources requirements in such a setting would be >> a >>>>>>>> nightmare, and in the end would require operator-level >> requirements >>>>> any >>>>>>>> way. >>>>>>>> In that sense, I'm not even sure whether it really increases >>>>> usability. >>>>>>>> My main worry is that it if we wire the runtime to work on SSGs >>>> it's >>>>>>>> gonna be difficult to implement more fine-grained approaches, >> which >>>>>>>> would not be the case if, for the runtime, they are always >> defined >>>> on >>>>>> an >>>>>>>> operator-level. >>>>>>>> >>>>>>>> On 1/7/2021 2:42 PM, Till Rohrmann wrote: >>>>>>>>> Thanks for drafting this FLIP and starting this discussion >>>> Yangze. >>>>>>>>> I like that defining resource requirements on a slot sharing >>>> group >>>>>>> makes >>>>>>>>> the overall setup easier and improves usability of resource >>>>>>> requirements. >>>>>>>>> What I do not like about it is that it changes slot sharing >>>> groups >>>>>> from >>>>>>>>> being a scheduling hint to something which needs to be >> supported >>>> in >>>>>>> order >>>>>>>>> to support fine grained resource requirements. So far, the >> idea >>>> of >>>>>> slot >>>>>>>>> sharing groups was that it tells the system that a set of >>>> operators >>>>>> can >>>>>>>> be >>>>>>>>> deployed in the same slot. But the system still had the >> freedom >>>> to >>>>>> say >>>>>>>> that >>>>>>>>> it would rather place these tasks in different slots if it >>>> wanted. >>>>> If >>>>>>> we >>>>>>>>> now specify resource requirements on a per slot sharing >> group, >>>> then >>>>>> the >>>>>>>>> only option for a scheduler which does not support slot >> sharing >>>>>> groups >>>>>>> is >>>>>>>>> to say that every operator in this slot sharing group needs a >>>> slot >>>>>> with >>>>>>>> the >>>>>>>>> same resources as the whole group. >>>>>>>>> >>>>>>>>> So for example, if we have a job consisting of two operator >> op_1 >>>>> and >>>>>>> op_2 >>>>>>>>> where each op needs 100 MB of memory, we would then say that >> the >>>>> slot >>>>>>>>> sharing group needs 200 MB of memory to run. If we have a >> cluster >>>>>> with >>>>>>> 2 >>>>>>>>> TMs with one slot of 100 MB each, then the system cannot run >> this >>>>>> job. >>>>>>> If >>>>>>>>> the resources were specified on an operator level, then the >>>> system >>>>>>> could >>>>>>>>> still make the decision to deploy op_1 to TM_1 and op_2 to >> TM_2. >>>>>>>>> Originally, one of the primary goals of slot sharing groups >> was >>>> to >>>>>> make >>>>>>>> it >>>>>>>>> easier for the user to reason about how many slots a job >> needs >>>>>>>> independent >>>>>>>>> of the actual number of operators in the job. Interestingly, >> if >>>> all >>>>>>>>> operators have their resources properly specified, then slot >>>>> sharing >>>>>> is >>>>>>>> no >>>>>>>>> longer needed because Flink could slice off the appropriately >>>> sized >>>>>>> slots >>>>>>>>> for every Task individually. What matters is whether the >> whole >>>>>> cluster >>>>>>>> has >>>>>>>>> enough resources to run all tasks or not. 
>>>>>>>>> >>>>>>>>> Cheers, >>>>>>>>> Till >>>>>>>>> >>>>>>>>> On Thu, Jan 7, 2021 at 4:08 AM Yangze Guo < >> [hidden email]> >>>>>> wrote: >>>>>>>>>> Hi, there, >>>>>>>>>> >>>>>>>>>> We would like to start a discussion thread on "FLIP-156: >> Runtime >>>>>>>>>> Interfaces for Fine-Grained Resource Requirements"[1], >> where we >>>>>>>>>> propose Slot Sharing Group (SSG) based runtime interfaces >> for >>>>>>>>>> specifying fine-grained resource requirements. >>>>>>>>>> >>>>>>>>>> In this FLIP: >>>>>>>>>> - Expound the user story of fine-grained resource >> management. >>>>>>>>>> - Propose runtime interfaces for specifying SSG-based >> resource >>>>>>>>>> requirements. >>>>>>>>>> - Discuss the pros and cons of the three potential >> granularities >>>>> for >>>>>>>>>> specifying the resource requirements (op, task and slot >> sharing >>>>>> group) >>>>>>>>>> and explain why we choose the slot sharing group. >>>>>>>>>> >>>>>>>>>> Please find more details in the FLIP wiki document [1]. >> Looking >>>>>>>>>> forward to your feedback. >>>>>>>>>> >>>>>>>>>> [1] >>>>>>>>>> >> https://cwiki.apache.org/confluence/display/FLINK/FLIP-156%3A+Runtime+Interfaces+for+Fine-Grained+Resource+Requirements >>>>>>>>>> Best, >>>>>>>>>> Yangze Guo >>>>>>>>>> >>>>>>>> |
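For comparison with the operator-level alternative quoted above (Stephan's proposal, plus Till's suggestion of a default size for operators without an explicit annotation), here is a similarly hedged sketch: per-operator profiles with a default profile, summed up per chain or per slot sharing group by the scheduler. Again, the names are illustrative assumptions only, not Flink's actual API.

    import java.util.List;
    import java.util.Map;
    import java.util.Optional;

    /** Illustrative model only -- not Flink's actual API. */
    public class OperatorResourceSketch {

        /** A simplified resource profile: CPU cores and memory in MB. */
        record ResourceSpec(double cpuCores, int memoryMB) {
            ResourceSpec plus(ResourceSpec other) {
                return new ResourceSpec(cpuCores + other.cpuCores, memoryMB + other.memoryMB);
            }
        }

        /** Profile assumed for operators the user did not annotate explicitly. */
        static final ResourceSpec DEFAULT_PROFILE = new ResourceSpec(0.5, 64);

        /** Per-operator annotations; an absent entry falls back to the default profile. */
        static ResourceSpec effectiveProfile(Map<String, ResourceSpec> annotations, String operator) {
            return Optional.ofNullable(annotations.get(operator)).orElse(DEFAULT_PROFILE);
        }

        /** The chaining logic / scheduler sums the profiles of operators placed together. */
        static ResourceSpec sumForGroup(Map<String, ResourceSpec> annotations, List<String> group) {
            return group.stream()
                    .map(op -> effectiveProfile(annotations, op))
                    .reduce(new ResourceSpec(0, 0), ResourceSpec::plus);
        }

        public static void main(String[] args) {
            // Only op_2 is annotated; op_1 and op_3 fall back to the default profile.
            Map<String, ResourceSpec> annotations = Map.of("op_2", new ResourceSpec(2.0, 512));

            ResourceSpec slotRequest = sumForGroup(annotations, List.of("op_1", "op_2", "op_3"));
            System.out.println(slotRequest); // 3.0 cores / 640 MB for the tasks scheduled together
        }
    }

The DEFAULT_PROFILE is what keeps unannotated operators from implicitly requiring zero resources, which is the surprising behaviour discussed in the reply that follows.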
If I understand you correctly Chesnay, then you want to decouple the
resource requirement specification from the slot sharing group assignment. Hence, per default all operators would be in the same slot sharing group. If there is no operator with a resource specification, then the system would allocate a default slot for it. If there is at least one operator, then the system would sum up all the specified resources and allocate a slot of this size. This effectively means that all unspecified operators will implicitly have a zero resource requirement. Did I understand your idea correctly? I am wondering whether this wouldn't lead to a surprising behaviour for the user. If the user specifies the resource requirements for a single operator, then he probably will assume that the other operators will get the default share of resources and not nothing. Cheers, Till On Thu, Jan 21, 2021 at 3:25 AM Chesnay Schepler <[hidden email]> wrote: > Is there even a functional difference between specifying the > requirements for an SSG vs specifying the same requirements on a single > operator within that group (ideally a colocation group to avoid this > whole hint business)? > > Wouldn't we get the best of both worlds in the latter case? > > Users can take shortcuts to define shared requirements, > but refine them further as needed on a per-operator basis, > without changing semantics of slotsharing groups > nor the runtime being locked into SSG-based requirements. > > (And before anyone argues what happens if slotsharing groups change or > whatnot, that's a plain API issue that we could surely solve. (A plain > iteration over slotsharing groups and therein contained operators would > suffice)). > > On 1/20/2021 6:48 PM, Till Rohrmann wrote: > > Maybe a different minor idea: Would it be possible to treat the SSG > > resource requirements as a hint for the runtime similar to how slot > sharing > > groups are designed at the moment? Meaning that we don't give the > guarantee > > that Flink will always deploy this set of tasks together no matter what > > comes. If, for example, the runtime can derive by some means the resource > > requirements for each task based on the requirements for the SSG, this > > could be possible. One easy strategy would be to give every task the same > > resources as the whole slot sharing group. Another one could be > > distributing the resources equally among the tasks. This does not even > have > > to be implemented but we would give ourselves the freedom to change > > scheduling if need should arise. > > > > Cheers, > > Till > > > > On Wed, Jan 20, 2021 at 7:04 AM Yangze Guo <[hidden email]> wrote: > > > >> Thanks for the responses, Till and Xintong. > >> > >> I second Xintong's comment that SSG-based runtime interface will give > >> us the flexibility to achieve op/task-based approach. That's one of > >> the most important reasons for our design choice. > >> > >> Some cents regarding the default operator resource: > >> - It might be good for the scenario of DataStream jobs. > >> ** For light-weight operators, the accumulative configuration error > >> will not be significant. Then, the resource of a task used is > >> proportional to the number of operators it contains. > >> ** For heavy operators like join and window or operators using the > >> external resources, user will turn to the fine-grained resource > >> configuration. > >> - It can increase the stability for the standalone cluster where task > >> executors registered are heterogeneous(with different default slot > >> resources). > >> - It might not be good for SQL users. 
The operators that SQL will be > >> transferred to is a black box to the user. We also do not guarantee > >> the cross-version of consistency of the transformation so far. > >> > >> I think it can be treated as a follow-up work when the fine-grained > >> resource management is end-to-end ready. > >> > >> Best, > >> Yangze Guo > >> > >> > >> On Wed, Jan 20, 2021 at 11:16 AM Xintong Song <[hidden email]> > >> wrote: > >>> Thanks for the feedback, Till. > >>> > >>> ## I feel that what you proposed (operator-based + default value) might > >> be > >>> subsumed by the SSG-based approach. > >>> Thinking of op_1 -> op_2, there are the following 4 cases, categorized > by > >>> whether the resource requirements are known to the users. > >>> > >>> 1. *Both known.* As previously mentioned, there's no reason to put > >>> multiple operators whose individual resource requirements are > already > >> known > >>> into the same group in fine-grained resource management. And if > op_1 > >> and > >>> op_2 are in different groups, there should be no problem switching > >> data > >>> exchange mode from pipelined to blocking. This is equivalent to > >> specifying > >>> operator resource requirements in your proposal. > >>> 2. *op_1 known, op_2 unknown.* Similar to 1), except that op_2 is > in a > >>> SSG whose resource is not specified thus would have the default > slot > >>> resource. This is equivalent to having default operator resources > in > >> your > >>> proposal. > >>> 3. *Both unknown*. The user can either set op_1 and op_2 to the > same > >> SSG > >>> or separate SSGs. > >>> - If op_1 and op_2 are in the same SSG, it will be equivalent to > >> the > >>> coarse-grained resource management, where op_1 and op_2 share a > >> default > >>> size slot no matter which data exchange mode is used. > >>> - If op_1 and op_2 are in different SSGs, then each of them will > >> use > >>> a default size slot. This is equivalent to setting them with > >> default > >>> operator resources in your proposal. > >>> 4. *Total (pipeline) or max (blocking) of op_1 and op_2 is known.* > >>> - It is possible that the user learns the total / max resource > >>> requirement from executing and monitoring the job, while not > >>> being aware of > >>> individual operator requirements. > >>> - I believe this is the case your proposal does not cover. And > TBH, > >>> this is probably how most users learn the resource requirements, > >>> according > >>> to my experiences. > >>> - In this case, the user might need to specify different > resources > >> if > >>> he wants to switch the execution mode, which should not be worse > >> than not > >>> being able to use fine-grained resource management. > >>> > >>> > >>> ## An additional idea inspired by your proposal. > >>> We may provide multiple options for deciding resources for SSGs whose > >>> requirement is not specified, if needed. > >>> > >>> - Default slot resource (current design) > >>> - Default operator resource times number of operators (equivalent > to > >>> your proposal) > >>> > >>> > >>> ## Exposing internal runtime strategies > >>> Theoretically, yes. Tying to the SSGs, the resource requirements might > be > >>> affected if how SSGs are internally handled changes in future. > >> Practically, > >>> I do not concretely see at the moment what kind of changes we may want > in > >>> future that might conflict with this FLIP proposal, as the question of > >>> switching data exchange mode answered above. 
I'd suggest to not give up > >> the > >>> user friendliness we may gain now for the future problems that may or > may > >>> not exist. > >>> > >>> Moreover, the SSG-based approach has the flexibility to achieve the > >>> equivalent behavior as the operator-based approach, if we set each > >> operator > >>> (or task) to a separate SSG. We can even provide a shortcut option to > >>> automatically do that for users, if needed. > >>> > >>> > >>> Thank you~ > >>> > >>> Xintong Song > >>> > >>> > >>> > >>> On Tue, Jan 19, 2021 at 11:48 PM Till Rohrmann <[hidden email]> > >> wrote: > >>>> Thanks for the responses Xintong and Stephan, > >>>> > >>>> I agree that being able to define the resource requirements for a > >> group of > >>>> operators is more user friendly. However, my concern is that we are > >>>> exposing thereby internal runtime strategies which might limit our > >>>> flexibility to execute a given job. Moreover, the semantics of > >> configuring > >>>> resource requirements for SSGs could break if switching from streaming > >> to > >>>> batch execution. If one defines the resource requirements for op_1 -> > >> op_2 > >>>> which run in pipelined mode when using the streaming execution, then > >> how do > >>>> we interpret these requirements when op_1 -> op_2 are executed with a > >>>> blocking data exchange in batch execution mode? Consequently, I am > >> still > >>>> leaning towards Stephan's proposal to set the resource requirements > per > >>>> operator. > >>>> > >>>> Maybe the following proposal makes the configuration easier: If the > >> user > >>>> wants to use fine-grained resource requirements, then she needs to > >> specify > >>>> the default size which is used for operators which have no explicit > >>>> resource annotation. If this holds true, then every operator would > >> have a > >>>> resource requirement and the system can try to execute the operators > >> in the > >>>> best possible manner w/o being constrained by how the user set the SSG > >>>> requirements. > >>>> > >>>> Cheers, > >>>> Till > >>>> > >>>> On Tue, Jan 19, 2021 at 9:09 AM Xintong Song <[hidden email]> > >>>> wrote: > >>>> > >>>>> Thanks for the feedback, Stephan. > >>>>> > >>>>> Actually, your proposal has also come to my mind at some point. And I > >>>> have > >>>>> some concerns about it. > >>>>> > >>>>> > >>>>> 1. It does not give users the same control as the SSG-based approach. > >>>>> > >>>>> > >>>>> While both approaches do not require specifying for each operator, > >>>>> SSG-based approach supports the semantic that "some operators > >> together > >>>> use > >>>>> this much resource" while the operator-based approach doesn't. > >>>>> > >>>>> > >>>>> Think of a long pipeline with m operators (o_1, o_2, ..., o_m), and > >> at > >>>> some > >>>>> point there's an agg o_n (1 < n < m) which significantly reduces the > >> data > >>>>> amount. One can separate the pipeline into 2 groups SSG_1 (o_1, ..., > >> o_n) > >>>>> and SSG_2 (o_n+1, ... o_m), so that configuring much higher > >> parallelisms > >>>>> for operators in SSG_1 than for operators in SSG_2 won't lead to too > >> much > >>>>> wasting of resources. If the two SSGs end up needing different > >> resources, > >>>>> with the SSG-based approach one can directly specify resources for > >> the > >>>> two > >>>>> groups. 
However, with the operator-based approach, the user will > >> have to > >>>>> specify resources for each operator in one of the two groups, and > >> tune > >>>> the > >>>>> default slot resource via configurations to fit the other group. > >>>>> > >>>>> > >>>>> 2. It increases the chance of breaking operator chains. > >>>>> > >>>>> > >>>>> Setting chainnable operators into different slot sharing groups will > >>>>> prevent them from being chained. In the current implementation, > >>>> downstream > >>>>> operators, if SSG not explicitly specified, will be set to the same > >> group > >>>>> as the chainable upstream operators (unless multiple upstream > >> operators > >>>> in > >>>>> different groups), to reduce the chance of breaking chains. > >>>>> > >>>>> > >>>>> Thinking of chainable operators o_1 -> o_2 -> o_3 -> o_3, deciding > >> SSGs > >>>>> based on whether resource is specified we will easily get groups like > >>>> (o_1, > >>>>> o_3) & (o_2, o_4), where none of the operators can be chained. This > >> is > >>>> also > >>>>> possible for the SSG-based approach, but I believe the chance is much > >>>>> smaller because there's no strong reason for users to specify the > >> groups > >>>>> with alternate operators like that. We are more likely to get groups > >> like > >>>>> (o_1, o_2) & (o_3, o_4), where the chain breaks only between o_2 and > >> o_3. > >>>>> > >>>>> 3. It complicates the system by having two different mechanisms for > >>>> sharing > >>>>> managed memory in a slot. > >>>>> > >>>>> > >>>>> - In FLIP-141, we introduced the intra-slot managed memory sharing > >>>>> mechanism, where managed memory is first distributed according to the > >>>>> consumer type, then further distributed across operators of that > >> consumer > >>>>> type. > >>>>> > >>>>> - With the operator-based approach, managed memory size specified > >> for an > >>>>> operator should account for all the consumer types of that operator. > >> That > >>>>> means the managed memory is first distributed across operators, then > >>>>> distributed to different consumer types of each operator. > >>>>> > >>>>> > >>>>> Unfortunately, the different order of the two calculation steps can > >> lead > >>>> to > >>>>> different results. To be specific, the semantic of the configuration > >>>> option > >>>>> `consumer-weights` changed (within a slot vs. within an operator). > >>>>> > >>>>> > >>>>> > >>>>> To sum up things: > >>>>> > >>>>> While (3) might be a bit more implementation related, I think (1) > >> and (2) > >>>>> somehow suggest that, the price for the proposed approach to avoid > >>>>> specifying resource for every operator is that it's not as > >> independent > >>>> from > >>>>> operator chaining and slot sharing as the operator-based approach > >>>> discussed > >>>>> in the FLIP. > >>>>> > >>>>> > >>>>> Thank you~ > >>>>> > >>>>> Xintong Song > >>>>> > >>>>> > >>>>> > >>>>> On Tue, Jan 19, 2021 at 4:29 AM Stephan Ewen <[hidden email]> > >> wrote: > >>>>>> Thanks a lot, Yangze and Xintong for this FLIP. > >>>>>> > >>>>>> I want to say, first of all, that this is super well written. And > >> the > >>>>>> points that the FLIP makes about how to expose the configuration to > >>>> users > >>>>>> is exactly the right thing to figure out first. > >>>>>> So good job here! > >>>>>> > >>>>>> About how to let users specify the resource profiles. 
If I can sum > >> the > >>>>> FLIP > >>>>>> and previous discussion up in my own words, the problem is the > >>>> following: > >>>>>> Operator-level specification is the simplest and cleanest approach, > >>>>> because > >>>>>>> it avoids mixing operator configuration (resource) and > >> scheduling. No > >>>>>>> matter what other parameters change (chaining, slot sharing, > >>>> switching > >>>>>>> pipelined and blocking shuffles), the resource profiles stay the > >>>> same. > >>>>>>> But it would require that a user specifies resources on all > >>>> operators, > >>>>>>> which makes it hard to use. That's why the FLIP suggests going > >> with > >>>>>>> specifying resources on a Sharing-Group. > >>>>>> > >>>>>> I think both thoughts are important, so can we find a solution > >> where > >>>> the > >>>>>> Resource Profiles are specified on an Operator, but we still avoid > >> that > >>>>> we > >>>>>> need to specify a resource profile on every operator? > >>>>>> > >>>>>> What do you think about something like the following: > >>>>>> - Resource Profiles are specified on an operator level. > >>>>>> - Not all operators need profiles > >>>>>> - All Operators without a Resource Profile ended up in the > >> default > >>>> slot > >>>>>> sharing group with a default profile (will get a default slot). > >>>>>> - All Operators with a Resource Profile will go into another slot > >>>>> sharing > >>>>>> group (the resource-specified-group). > >>>>>> - Users can define different slot sharing groups for operators > >> like > >>>>> they > >>>>>> do now, with the exception that you cannot mix operators that have > >> a > >>>>>> resource profile and operators that have no resource profile. > >>>>>> - The default case where no operator has a resource profile is > >> just a > >>>>>> special case of this model > >>>>>> - The chaining logic sums up the profiles per operator, like it > >> does > >>>>> now, > >>>>>> and the scheduler sums up the profiles of the tasks that it > >> schedules > >>>>>> together. > >>>>>> > >>>>>> > >>>>>> There is another question about reactive scaling raised in the > >> FLIP. I > >>>>> need > >>>>>> to think a bit about that. That is indeed a bit more tricky once we > >>>> have > >>>>>> slots of different sizes. > >>>>>> It is not clear then which of the different slot requests the > >>>>>> ResourceManager should fulfill when new resources (TMs) show up, > >> or how > >>>>> the > >>>>>> JobManager redistributes the slots resources when resources (TMs) > >>>>> disappear > >>>>>> This question is pretty orthogonal, though, to the "how to specify > >> the > >>>>>> resources". > >>>>>> > >>>>>> > >>>>>> Best, > >>>>>> Stephan > >>>>>> > >>>>>> On Fri, Jan 8, 2021 at 5:14 AM Xintong Song <[hidden email] > >>>>> wrote: > >>>>>>> Thanks for drafting the FLIP and driving the discussion, Yangze. > >>>>>>> And Thanks for the feedback, Till and Chesnay. > >>>>>>> > >>>>>>> @Till, > >>>>>>> > >>>>>>> I agree that specifying requirements for SSGs means that SSGs > >> need to > >>>>> be > >>>>>>> supported in fine-grained resource management, otherwise each > >>>> operator > >>>>>>> might use as many resources as the whole group. However, I cannot > >>>> think > >>>>>> of > >>>>>>> a strong reason for not supporting SSGs in fine-grained resource > >>>>>>> management. 
> >>>>>>> > >>>>>>> > >>>>>>>> Interestingly, if all operators have their resources properly > >>>>>> specified, > >>>>>>>> then slot sharing is no longer needed because Flink could > >> slice off > >>>>> the > >>>>>>>> appropriately sized slots for every Task individually. > >>>>>>>> > >>>>>>> So for example, if we have a job consisting of two operator op_1 > >> and > >>>>> op_2 > >>>>>>>> where each op needs 100 MB of memory, we would then say that > >> the > >>>> slot > >>>>>>>> sharing group needs 200 MB of memory to run. If we have a > >> cluster > >>>>> with > >>>>>> 2 > >>>>>>>> TMs with one slot of 100 MB each, then the system cannot run > >> this > >>>>> job. > >>>>>> If > >>>>>>>> the resources were specified on an operator level, then the > >> system > >>>>>> could > >>>>>>>> still make the decision to deploy op_1 to TM_1 and op_2 to > >> TM_2. > >>>>>>> > >>>>>>> Couldn't agree more that if all operators' requirements are > >> properly > >>>>>>> specified, slot sharing should be no longer needed. I think this > >>>>> exactly > >>>>>>> disproves the example. If we already know op_1 and op_2 each > >> needs > >>>> 100 > >>>>> MB > >>>>>>> of memory, why would we put them in the same group? If they are > >> in > >>>>>> separate > >>>>>>> groups, with the proposed approach the system can freely deploy > >> them > >>>> to > >>>>>>> either a 200 MB TM or two 100 MB TMs. > >>>>>>> > >>>>>>> Moreover, the precondition for not needing slot sharing is having > >>>>>> resource > >>>>>>> requirements properly specified for all operators. This is not > >> always > >>>>>>> possible, and usually requires tremendous efforts. One of the > >>>> benefits > >>>>>> for > >>>>>>> SSG-based requirements is that it allows the user to freely > >> decide > >>>> the > >>>>>>> granularity, thus efforts they want to pay. I would consider SSG > >> in > >>>>>>> fine-grained resource management as a group of operators that the > >>>> user > >>>>>>> would like to specify the total resource for. There can be only > >> one > >>>>> group > >>>>>>> in the job, 2~3 groups dividing the job into a few major parts, > >> or as > >>>>>> many > >>>>>>> groups as the number of tasks/operators, depending on how > >>>> fine-grained > >>>>>> the > >>>>>>> user is able to specify the resources. > >>>>>>> > >>>>>>> Having to support SSGs might be a constraint. But given that all > >> the > >>>>>>> current scheduler implementations already support SSGs, I tend to > >>>> think > >>>>>>> that as an acceptable price for the above discussed usability and > >>>>>>> flexibility. > >>>>>>> > >>>>>>> @Chesnay > >>>>>>> > >>>>>>> Will declaring them on slot sharing groups not also waste > >> resources > >>>> if > >>>>>> the > >>>>>>>> parallelism of operators within that group are different? > >>>>>>>> > >>>>>>> Yes. It's a trade-off between usability and resource > >> utilization. To > >>>>>> avoid > >>>>>>> such wasting, the user can define more groups, so that each group > >>>>>> contains > >>>>>>> less operators and the chance of having operators with different > >>>>>>> parallelism will be reduced. The price is to have more resource > >>>>>>> requirements to specify. > >>>>>>> > >>>>>>> It also seems like quite a hassle for users having to > >> recalculate the > >>>>>>>> resource requirements if they change the slot sharing. 
> >>>>>>>> I'd think that it's not really workable for users that create > >> a set > >>>>> of > >>>>>>>> re-usable operators which are mixed and matched in their > >>>>> applications; > >>>>>>>> managing the resources requirements in such a setting would be > >> a > >>>>>>>> nightmare, and in the end would require operator-level > >> requirements > >>>>> any > >>>>>>>> way. > >>>>>>>> In that sense, I'm not even sure whether it really increases > >>>>> usability. > >>>>>>> - As mentioned in my reply to Till's comment, there's no > >> reason to > >>>>> put > >>>>>>> multiple operators whose individual resource requirements are > >>>>> already > >>>>>>> known > >>>>>>> into the same group in fine-grained resource management. > >>>>>>> - Even an operator implementation is reused for multiple > >>>>> applications, > >>>>>>> it does not guarantee the same resource requirements. During > >> our > >>>>> years > >>>>>>> of > >>>>>>> practices in Alibaba, with per-operator requirements > >> specified for > >>>>>>> Blink's > >>>>>>> fine-grained resource management, very few users (including > >> our > >>>>>>> specialists > >>>>>>> who are dedicated to supporting Blink users) are as > >> experienced as > >>>>> to > >>>>>>> accurately predict/estimate the operator resource > >> requirements. > >>>> Most > >>>>>>> people > >>>>>>> rely on the execution-time metrics (throughput, delay, cpu > >> load, > >>>>>> memory > >>>>>>> usage, GC pressure, etc.) to improve the specification. > >>>>>>> > >>>>>>> To sum up: > >>>>>>> If the user is capable of providing proper resource requirements > >> for > >>>>>> every > >>>>>>> operator, that's definitely a good thing and we would not need to > >>>> rely > >>>>> on > >>>>>>> the SSGs. However, that shouldn't be a *must* for the > >> fine-grained > >>>>>> resource > >>>>>>> management to work. For those users who are capable and do not > >> like > >>>>>> having > >>>>>>> to set each operator to a separate SSG, I would be ok to have > >> both > >>>>>>> SSG-based and operator-based runtime interfaces and to only > >> fallback > >>>> to > >>>>>> the > >>>>>>> SSG requirements when the operator requirements are not > >> specified. > >>>>>> However, > >>>>>>> as the first step, I think we should prioritise the use cases > >> where > >>>>> users > >>>>>>> are not that experienced. > >>>>>>> > >>>>>>> Thank you~ > >>>>>>> > >>>>>>> Xintong Song > >>>>>>> > >>>>>>> On Thu, Jan 7, 2021 at 9:55 PM Chesnay Schepler < > >> [hidden email]> > >>>>>>> wrote: > >>>>>>> > >>>>>>>> Will declaring them on slot sharing groups not also waste > >> resources > >>>>> if > >>>>>>>> the parallelism of operators within that group are different? > >>>>>>>> > >>>>>>>> It also seems like quite a hassle for users having to > >> recalculate > >>>> the > >>>>>>>> resource requirements if they change the slot sharing. > >>>>>>>> I'd think that it's not really workable for users that create > >> a set > >>>>> of > >>>>>>>> re-usable operators which are mixed and matched in their > >>>>> applications; > >>>>>>>> managing the resources requirements in such a setting would be > >> a > >>>>>>>> nightmare, and in the end would require operator-level > >> requirements > >>>>> any > >>>>>>>> way. > >>>>>>>> In that sense, I'm not even sure whether it really increases > >>>>> usability. 
> >>>>>>>> My main worry is that it if we wire the runtime to work on SSGs > >>>> it's > >>>>>>>> gonna be difficult to implement more fine-grained approaches, > >> which > >>>>>>>> would not be the case if, for the runtime, they are always > >> defined > >>>> on > >>>>>> an > >>>>>>>> operator-level. > >>>>>>>> > >>>>>>>> On 1/7/2021 2:42 PM, Till Rohrmann wrote: > >>>>>>>>> Thanks for drafting this FLIP and starting this discussion > >>>> Yangze. > >>>>>>>>> I like that defining resource requirements on a slot sharing > >>>> group > >>>>>>> makes > >>>>>>>>> the overall setup easier and improves usability of resource > >>>>>>> requirements. > >>>>>>>>> What I do not like about it is that it changes slot sharing > >>>> groups > >>>>>> from > >>>>>>>>> being a scheduling hint to something which needs to be > >> supported > >>>> in > >>>>>>> order > >>>>>>>>> to support fine grained resource requirements. So far, the > >> idea > >>>> of > >>>>>> slot > >>>>>>>>> sharing groups was that it tells the system that a set of > >>>> operators > >>>>>> can > >>>>>>>> be > >>>>>>>>> deployed in the same slot. But the system still had the > >> freedom > >>>> to > >>>>>> say > >>>>>>>> that > >>>>>>>>> it would rather place these tasks in different slots if it > >>>> wanted. > >>>>> If > >>>>>>> we > >>>>>>>>> now specify resource requirements on a per slot sharing > >> group, > >>>> then > >>>>>> the > >>>>>>>>> only option for a scheduler which does not support slot > >> sharing > >>>>>> groups > >>>>>>> is > >>>>>>>>> to say that every operator in this slot sharing group needs a > >>>> slot > >>>>>> with > >>>>>>>> the > >>>>>>>>> same resources as the whole group. > >>>>>>>>> > >>>>>>>>> So for example, if we have a job consisting of two operator > >> op_1 > >>>>> and > >>>>>>> op_2 > >>>>>>>>> where each op needs 100 MB of memory, we would then say that > >> the > >>>>> slot > >>>>>>>>> sharing group needs 200 MB of memory to run. If we have a > >> cluster > >>>>>> with > >>>>>>> 2 > >>>>>>>>> TMs with one slot of 100 MB each, then the system cannot run > >> this > >>>>>> job. > >>>>>>> If > >>>>>>>>> the resources were specified on an operator level, then the > >>>> system > >>>>>>> could > >>>>>>>>> still make the decision to deploy op_1 to TM_1 and op_2 to > >> TM_2. > >>>>>>>>> Originally, one of the primary goals of slot sharing groups > >> was > >>>> to > >>>>>> make > >>>>>>>> it > >>>>>>>>> easier for the user to reason about how many slots a job > >> needs > >>>>>>>> independent > >>>>>>>>> of the actual number of operators in the job. Interestingly, > >> if > >>>> all > >>>>>>>>> operators have their resources properly specified, then slot > >>>>> sharing > >>>>>> is > >>>>>>>> no > >>>>>>>>> longer needed because Flink could slice off the appropriately > >>>> sized > >>>>>>> slots > >>>>>>>>> for every Task individually. What matters is whether the > >> whole > >>>>>> cluster > >>>>>>>> has > >>>>>>>>> enough resources to run all tasks or not. > >>>>>>>>> > >>>>>>>>> Cheers, > >>>>>>>>> Till > >>>>>>>>> > >>>>>>>>> On Thu, Jan 7, 2021 at 4:08 AM Yangze Guo < > >> [hidden email]> > >>>>>> wrote: > >>>>>>>>>> Hi, there, > >>>>>>>>>> > >>>>>>>>>> We would like to start a discussion thread on "FLIP-156: > >> Runtime > >>>>>>>>>> Interfaces for Fine-Grained Resource Requirements"[1], > >> where we > >>>>>>>>>> propose Slot Sharing Group (SSG) based runtime interfaces > >> for > >>>>>>>>>> specifying fine-grained resource requirements. 
> >>>>>>>>>> > >>>>>>>>>> In this FLIP: > >>>>>>>>>> - Expound the user story of fine-grained resource > >> management. > >>>>>>>>>> - Propose runtime interfaces for specifying SSG-based > >> resource > >>>>>>>>>> requirements. > >>>>>>>>>> - Discuss the pros and cons of the three potential > >> granularities > >>>>> for > >>>>>>>>>> specifying the resource requirements (op, task and slot > >> sharing > >>>>>> group) > >>>>>>>>>> and explain why we choose the slot sharing group. > >>>>>>>>>> > >>>>>>>>>> Please find more details in the FLIP wiki document [1]. > >> Looking > >>>>>>>>>> forward to your feedback. > >>>>>>>>>> > >>>>>>>>>> [1] > >>>>>>>>>> > >> > https://cwiki.apache.org/confluence/display/FLINK/FLIP-156%3A+Runtime+Interfaces+for+Fine-Grained+Resource+Requirements > >>>>>>>>>> Best, > >>>>>>>>>> Yangze Guo > >>>>>>>>>> > >>>>>>>> > > |
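A small numeric illustration of Xintong's point (3) in the quoted discussion above, i.e. that distributing a slot's managed memory first by consumer type and then by operator (the FLIP-141 order) can give a different result than distributing it first by operator and then by consumer type (which per-operator managed-memory specifications would imply). The weights, the 100 MB slot size, and the even split across operators below are made up purely for illustration and are not taken from FLIP-141:

public class ManagedMemoryOrderingExample {

    public static void main(String[] args) {
        double slotManagedMb = 100.0;
        // Illustrative consumer-type weights, e.g. DATAPROC : PYTHON = 70 : 30.
        double dataprocWeight = 70.0;
        double pythonWeight = 30.0;
        double totalWeight = dataprocWeight + pythonWeight;

        // Two operators in the slot: op_A uses only DATAPROC, op_B uses DATAPROC and PYTHON.

        // Order 1: slot -> consumer type -> operators (intra-slot sharing).
        double dataprocBudget = slotManagedMb * dataprocWeight / totalWeight; // 70
        double pythonBudget = slotManagedMb * pythonWeight / totalWeight;     // 30
        double opA1 = dataprocBudget / 2;                // op_A shares DATAPROC with op_B -> 35
        double opB1 = dataprocBudget / 2 + pythonBudget; // 35 + 30 = 65

        // Order 2: slot -> operators -> consumer types (per-operator specification).
        double perOperator = slotManagedMb / 2;                           // 50 each, assuming an even split
        double opA2 = perOperator;                                        // all of it goes to DATAPROC
        double opB2Dataproc = perOperator * dataprocWeight / totalWeight; // 35
        double opB2Python = perOperator * pythonWeight / totalWeight;     // 15

        System.out.printf("slot-first:     op_A=%.0f MB, op_B=%.0f MB, DATAPROC=%.0f MB, PYTHON=%.0f MB%n",
                opA1, opB1, dataprocBudget, pythonBudget);
        System.out.printf("operator-first: op_A=%.0f MB, op_B=%.0f MB, DATAPROC=%.0f MB, PYTHON=%.0f MB%n",
                opA2, opB2Dataproc + opB2Python, opA2 + opB2Dataproc, opB2Python);
    }
}

With the same 70:30 weights, the slot-first order ends up with 70 MB of DATAPROC and 30 MB of PYTHON memory, while the operator-first order ends up with 85 MB and 15 MB. That is the sense in which the semantics of `consumer-weights` would change (within a slot vs. within an operator).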
I second Till's concern about implicitly interpreting zero resource
requirements for unspecified operators. I'm not against allowing both: specifying SSG requirements as a shortcut, and further refining operator requirements as needed. Combining Till's idea, we can do the following.
- Prefer using operator requirements if they are available for all operators in a SSG; otherwise fall back to the SSG requirements, or to the default slot resource if neither is specified.
- In cases where SSGs are not strictly respected and finer-grained requirements are needed, derive them automatically if not provided.
I'm leaning towards introducing the SSG interfaces as the first step, and introducing the operator interfaces and the deriving logic as future improvements. Thank you~ Xintong Song On Thu, Jan 21, 2021 at 4:45 PM Till Rohrmann <[hidden email]> wrote: > If I understand you correctly Chesnay, then you want to decouple the > resource requirement specification from the slot sharing group assignment. > Hence, per default all operators would be in the same slot sharing group. > If there is no operator with a resource specification, then the system > would allocate a default slot for it. If there is at least one operator, > then the system would sum up all the specified resources and allocate a > slot of this size. This effectively means that all unspecified operators > will implicitly have a zero resource requirement. Did I understand your > idea correctly? > > I am wondering whether this wouldn't lead to a surprising behaviour for the > user. If the user specifies the resource requirements for a single > operator, then he probably will assume that the other operators will get > the default share of resources and not nothing. > > Cheers, > Till > > On Thu, Jan 21, 2021 at 3:25 AM Chesnay Schepler <[hidden email]> > wrote: > > > Is there even a functional difference between specifying the > > requirements for an SSG vs specifying the same requirements on a single > > operator within that group (ideally a colocation group to avoid this > > whole hint business)? > > > > Wouldn't we get the best of both worlds in the latter case? > > > > Users can take shortcuts to define shared requirements, > > but refine them further as needed on a per-operator basis, > > without changing semantics of slotsharing groups > > nor the runtime being locked into SSG-based requirements. > > > > (And before anyone argues what happens if slotsharing groups change or > > whatnot, that's a plain API issue that we could surely solve. (A plain > > iteration over slotsharing groups and therein contained operators would > > suffice)). > > > > On 1/20/2021 6:48 PM, Till Rohrmann wrote: > > > Maybe a different minor idea: Would it be possible to treat the SSG > > > resource requirements as a hint for the runtime similar to how slot > > sharing > > > groups are designed at the moment? Meaning that we don't give the > > guarantee > > > that Flink will always deploy this set of tasks together no matter what > > > comes. If, for example, the runtime can derive by some means the > resource > > > requirements for each task based on the requirements for the SSG, this > > > could be possible. One easy strategy would be to give every task the > same > > > resources as the whole slot sharing group. Another one could be > > > distributing the resources equally among the tasks. This does not even > > have > > > to be implemented but we would give ourselves the freedom to change > > > scheduling if need should arise. 
> > > > > > Cheers, > > > Till > > > > > > On Wed, Jan 20, 2021 at 7:04 AM Yangze Guo <[hidden email]> wrote: > > > > > >> Thanks for the responses, Till and Xintong. > > >> > > >> I second Xintong's comment that SSG-based runtime interface will give > > >> us the flexibility to achieve op/task-based approach. That's one of > > >> the most important reasons for our design choice. > > >> > > >> Some cents regarding the default operator resource: > > >> - It might be good for the scenario of DataStream jobs. > > >> ** For light-weight operators, the accumulative configuration > error > > >> will not be significant. Then, the resource of a task used is > > >> proportional to the number of operators it contains. > > >> ** For heavy operators like join and window or operators using the > > >> external resources, user will turn to the fine-grained resource > > >> configuration. > > >> - It can increase the stability for the standalone cluster where task > > >> executors registered are heterogeneous(with different default slot > > >> resources). > > >> - It might not be good for SQL users. The operators that SQL will be > > >> transferred to is a black box to the user. We also do not guarantee > > >> the cross-version of consistency of the transformation so far. > > >> > > >> I think it can be treated as a follow-up work when the fine-grained > > >> resource management is end-to-end ready. > > >> > > >> Best, > > >> Yangze Guo > > >> > > >> > > >> On Wed, Jan 20, 2021 at 11:16 AM Xintong Song <[hidden email]> > > >> wrote: > > >>> Thanks for the feedback, Till. > > >>> > > >>> ## I feel that what you proposed (operator-based + default value) > might > > >> be > > >>> subsumed by the SSG-based approach. > > >>> Thinking of op_1 -> op_2, there are the following 4 cases, > categorized > > by > > >>> whether the resource requirements are known to the users. > > >>> > > >>> 1. *Both known.* As previously mentioned, there's no reason to > put > > >>> multiple operators whose individual resource requirements are > > already > > >> known > > >>> into the same group in fine-grained resource management. And if > > op_1 > > >> and > > >>> op_2 are in different groups, there should be no problem > switching > > >> data > > >>> exchange mode from pipelined to blocking. This is equivalent to > > >> specifying > > >>> operator resource requirements in your proposal. > > >>> 2. *op_1 known, op_2 unknown.* Similar to 1), except that op_2 is > > in a > > >>> SSG whose resource is not specified thus would have the default > > slot > > >>> resource. This is equivalent to having default operator resources > > in > > >> your > > >>> proposal. > > >>> 3. *Both unknown*. The user can either set op_1 and op_2 to the > > same > > >> SSG > > >>> or separate SSGs. > > >>> - If op_1 and op_2 are in the same SSG, it will be equivalent > to > > >> the > > >>> coarse-grained resource management, where op_1 and op_2 share > a > > >> default > > >>> size slot no matter which data exchange mode is used. > > >>> - If op_1 and op_2 are in different SSGs, then each of them > will > > >> use > > >>> a default size slot. This is equivalent to setting them with > > >> default > > >>> operator resources in your proposal. > > >>> 4. *Total (pipeline) or max (blocking) of op_1 and op_2 is > known.* > > >>> - It is possible that the user learns the total / max resource > > >>> requirement from executing and monitoring the job, while not > > >>> being aware of > > >>> individual operator requirements. 
> > >>> - I believe this is the case your proposal does not cover. And > > TBH, > > >>> this is probably how most users learn the resource > requirements, > > >>> according > > >>> to my experiences. > > >>> - In this case, the user might need to specify different > > resources > > >> if > > >>> he wants to switch the execution mode, which should not be > worse > > >> than not > > >>> being able to use fine-grained resource management. > > >>> > > >>> > > >>> ## An additional idea inspired by your proposal. > > >>> We may provide multiple options for deciding resources for SSGs whose > > >>> requirement is not specified, if needed. > > >>> > > >>> - Default slot resource (current design) > > >>> - Default operator resource times number of operators (equivalent > > to > > >>> your proposal) > > >>> > > >>> > > >>> ## Exposing internal runtime strategies > > >>> Theoretically, yes. Tying to the SSGs, the resource requirements > might > > be > > >>> affected if how SSGs are internally handled changes in future. > > >> Practically, > > >>> I do not concretely see at the moment what kind of changes we may > want > > in > > >>> future that might conflict with this FLIP proposal, as the question > of > > >>> switching data exchange mode answered above. I'd suggest to not give > up > > >> the > > >>> user friendliness we may gain now for the future problems that may or > > may > > >>> not exist. > > >>> > > >>> Moreover, the SSG-based approach has the flexibility to achieve the > > >>> equivalent behavior as the operator-based approach, if we set each > > >> operator > > >>> (or task) to a separate SSG. We can even provide a shortcut option to > > >>> automatically do that for users, if needed. > > >>> > > >>> > > >>> Thank you~ > > >>> > > >>> Xintong Song > > >>> > > >>> > > >>> > > >>> On Tue, Jan 19, 2021 at 11:48 PM Till Rohrmann <[hidden email] > > > > >> wrote: > > >>>> Thanks for the responses Xintong and Stephan, > > >>>> > > >>>> I agree that being able to define the resource requirements for a > > >> group of > > >>>> operators is more user friendly. However, my concern is that we are > > >>>> exposing thereby internal runtime strategies which might limit our > > >>>> flexibility to execute a given job. Moreover, the semantics of > > >> configuring > > >>>> resource requirements for SSGs could break if switching from > streaming > > >> to > > >>>> batch execution. If one defines the resource requirements for op_1 > -> > > >> op_2 > > >>>> which run in pipelined mode when using the streaming execution, then > > >> how do > > >>>> we interpret these requirements when op_1 -> op_2 are executed with > a > > >>>> blocking data exchange in batch execution mode? Consequently, I am > > >> still > > >>>> leaning towards Stephan's proposal to set the resource requirements > > per > > >>>> operator. > > >>>> > > >>>> Maybe the following proposal makes the configuration easier: If the > > >> user > > >>>> wants to use fine-grained resource requirements, then she needs to > > >> specify > > >>>> the default size which is used for operators which have no explicit > > >>>> resource annotation. If this holds true, then every operator would > > >> have a > > >>>> resource requirement and the system can try to execute the operators > > >> in the > > >>>> best possible manner w/o being constrained by how the user set the > SSG > > >>>> requirements. 
> > >>>> > > >>>> Cheers, > > >>>> Till > > >>>> > > >>>> On Tue, Jan 19, 2021 at 9:09 AM Xintong Song <[hidden email] > > > > >>>> wrote: > > >>>> > > >>>>> Thanks for the feedback, Stephan. > > >>>>> > > >>>>> Actually, your proposal has also come to my mind at some point. > And I > > >>>> have > > >>>>> some concerns about it. > > >>>>> > > >>>>> > > >>>>> 1. It does not give users the same control as the SSG-based > approach. > > >>>>> > > >>>>> > > >>>>> While both approaches do not require specifying for each operator, > > >>>>> SSG-based approach supports the semantic that "some operators > > >> together > > >>>> use > > >>>>> this much resource" while the operator-based approach doesn't. > > >>>>> > > >>>>> > > >>>>> Think of a long pipeline with m operators (o_1, o_2, ..., o_m), and > > >> at > > >>>> some > > >>>>> point there's an agg o_n (1 < n < m) which significantly reduces > the > > >> data > > >>>>> amount. One can separate the pipeline into 2 groups SSG_1 (o_1, > ..., > > >> o_n) > > >>>>> and SSG_2 (o_n+1, ... o_m), so that configuring much higher > > >> parallelisms > > >>>>> for operators in SSG_1 than for operators in SSG_2 won't lead to > too > > >> much > > >>>>> wasting of resources. If the two SSGs end up needing different > > >> resources, > > >>>>> with the SSG-based approach one can directly specify resources for > > >> the > > >>>> two > > >>>>> groups. However, with the operator-based approach, the user will > > >> have to > > >>>>> specify resources for each operator in one of the two groups, and > > >> tune > > >>>> the > > >>>>> default slot resource via configurations to fit the other group. > > >>>>> > > >>>>> > > >>>>> 2. It increases the chance of breaking operator chains. > > >>>>> > > >>>>> > > >>>>> Setting chainnable operators into different slot sharing groups > will > > >>>>> prevent them from being chained. In the current implementation, > > >>>> downstream > > >>>>> operators, if SSG not explicitly specified, will be set to the same > > >> group > > >>>>> as the chainable upstream operators (unless multiple upstream > > >> operators > > >>>> in > > >>>>> different groups), to reduce the chance of breaking chains. > > >>>>> > > >>>>> > > >>>>> Thinking of chainable operators o_1 -> o_2 -> o_3 -> o_3, deciding > > >> SSGs > > >>>>> based on whether resource is specified we will easily get groups > like > > >>>> (o_1, > > >>>>> o_3) & (o_2, o_4), where none of the operators can be chained. This > > >> is > > >>>> also > > >>>>> possible for the SSG-based approach, but I believe the chance is > much > > >>>>> smaller because there's no strong reason for users to specify the > > >> groups > > >>>>> with alternate operators like that. We are more likely to get > groups > > >> like > > >>>>> (o_1, o_2) & (o_3, o_4), where the chain breaks only between o_2 > and > > >> o_3. > > >>>>> > > >>>>> 3. It complicates the system by having two different mechanisms for > > >>>> sharing > > >>>>> managed memory in a slot. > > >>>>> > > >>>>> > > >>>>> - In FLIP-141, we introduced the intra-slot managed memory sharing > > >>>>> mechanism, where managed memory is first distributed according to > the > > >>>>> consumer type, then further distributed across operators of that > > >> consumer > > >>>>> type. > > >>>>> > > >>>>> - With the operator-based approach, managed memory size specified > > >> for an > > >>>>> operator should account for all the consumer types of that > operator. 
> > >> That > > >>>>> means the managed memory is first distributed across operators, > then > > >>>>> distributed to different consumer types of each operator. > > >>>>> > > >>>>> > > >>>>> Unfortunately, the different order of the two calculation steps can > > >> lead > > >>>> to > > >>>>> different results. To be specific, the semantic of the > configuration > > >>>> option > > >>>>> `consumer-weights` changed (within a slot vs. within an operator). > > >>>>> > > >>>>> > > >>>>> > > >>>>> To sum up things: > > >>>>> > > >>>>> While (3) might be a bit more implementation related, I think (1) > > >> and (2) > > >>>>> somehow suggest that, the price for the proposed approach to avoid > > >>>>> specifying resource for every operator is that it's not as > > >> independent > > >>>> from > > >>>>> operator chaining and slot sharing as the operator-based approach > > >>>> discussed > > >>>>> in the FLIP. > > >>>>> > > >>>>> > > >>>>> Thank you~ > > >>>>> > > >>>>> Xintong Song > > >>>>> > > >>>>> > > >>>>> > > >>>>> On Tue, Jan 19, 2021 at 4:29 AM Stephan Ewen <[hidden email]> > > >> wrote: > > >>>>>> Thanks a lot, Yangze and Xintong for this FLIP. > > >>>>>> > > >>>>>> I want to say, first of all, that this is super well written. And > > >> the > > >>>>>> points that the FLIP makes about how to expose the configuration > to > > >>>> users > > >>>>>> is exactly the right thing to figure out first. > > >>>>>> So good job here! > > >>>>>> > > >>>>>> About how to let users specify the resource profiles. If I can sum > > >> the > > >>>>> FLIP > > >>>>>> and previous discussion up in my own words, the problem is the > > >>>> following: > > >>>>>> Operator-level specification is the simplest and cleanest > approach, > > >>>>> because > > >>>>>>> it avoids mixing operator configuration (resource) and > > >> scheduling. No > > >>>>>>> matter what other parameters change (chaining, slot sharing, > > >>>> switching > > >>>>>>> pipelined and blocking shuffles), the resource profiles stay the > > >>>> same. > > >>>>>>> But it would require that a user specifies resources on all > > >>>> operators, > > >>>>>>> which makes it hard to use. That's why the FLIP suggests going > > >> with > > >>>>>>> specifying resources on a Sharing-Group. > > >>>>>> > > >>>>>> I think both thoughts are important, so can we find a solution > > >> where > > >>>> the > > >>>>>> Resource Profiles are specified on an Operator, but we still avoid > > >> that > > >>>>> we > > >>>>>> need to specify a resource profile on every operator? > > >>>>>> > > >>>>>> What do you think about something like the following: > > >>>>>> - Resource Profiles are specified on an operator level. > > >>>>>> - Not all operators need profiles > > >>>>>> - All Operators without a Resource Profile ended up in the > > >> default > > >>>> slot > > >>>>>> sharing group with a default profile (will get a default slot). > > >>>>>> - All Operators with a Resource Profile will go into another > slot > > >>>>> sharing > > >>>>>> group (the resource-specified-group). > > >>>>>> - Users can define different slot sharing groups for operators > > >> like > > >>>>> they > > >>>>>> do now, with the exception that you cannot mix operators that have > > >> a > > >>>>>> resource profile and operators that have no resource profile. 
> > >>>>>> - The default case where no operator has a resource profile is > > >> just a > > >>>>>> special case of this model > > >>>>>> - The chaining logic sums up the profiles per operator, like it > > >> does > > >>>>> now, > > >>>>>> and the scheduler sums up the profiles of the tasks that it > > >> schedules > > >>>>>> together. > > >>>>>> > > >>>>>> > > >>>>>> There is another question about reactive scaling raised in the > > >> FLIP. I > > >>>>> need > > >>>>>> to think a bit about that. That is indeed a bit more tricky once > we > > >>>> have > > >>>>>> slots of different sizes. > > >>>>>> It is not clear then which of the different slot requests the > > >>>>>> ResourceManager should fulfill when new resources (TMs) show up, > > >> or how > > >>>>> the > > >>>>>> JobManager redistributes the slots resources when resources (TMs) > > >>>>> disappear > > >>>>>> This question is pretty orthogonal, though, to the "how to specify > > >> the > > >>>>>> resources". > > >>>>>> > > >>>>>> > > >>>>>> Best, > > >>>>>> Stephan > > >>>>>> > > >>>>>> On Fri, Jan 8, 2021 at 5:14 AM Xintong Song < > [hidden email] > > >>>>> wrote: > > >>>>>>> Thanks for drafting the FLIP and driving the discussion, Yangze. > > >>>>>>> And Thanks for the feedback, Till and Chesnay. > > >>>>>>> > > >>>>>>> @Till, > > >>>>>>> > > >>>>>>> I agree that specifying requirements for SSGs means that SSGs > > >> need to > > >>>>> be > > >>>>>>> supported in fine-grained resource management, otherwise each > > >>>> operator > > >>>>>>> might use as many resources as the whole group. However, I cannot > > >>>> think > > >>>>>> of > > >>>>>>> a strong reason for not supporting SSGs in fine-grained resource > > >>>>>>> management. > > >>>>>>> > > >>>>>>> > > >>>>>>>> Interestingly, if all operators have their resources properly > > >>>>>> specified, > > >>>>>>>> then slot sharing is no longer needed because Flink could > > >> slice off > > >>>>> the > > >>>>>>>> appropriately sized slots for every Task individually. > > >>>>>>>> > > >>>>>>> So for example, if we have a job consisting of two operator op_1 > > >> and > > >>>>> op_2 > > >>>>>>>> where each op needs 100 MB of memory, we would then say that > > >> the > > >>>> slot > > >>>>>>>> sharing group needs 200 MB of memory to run. If we have a > > >> cluster > > >>>>> with > > >>>>>> 2 > > >>>>>>>> TMs with one slot of 100 MB each, then the system cannot run > > >> this > > >>>>> job. > > >>>>>> If > > >>>>>>>> the resources were specified on an operator level, then the > > >> system > > >>>>>> could > > >>>>>>>> still make the decision to deploy op_1 to TM_1 and op_2 to > > >> TM_2. > > >>>>>>> > > >>>>>>> Couldn't agree more that if all operators' requirements are > > >> properly > > >>>>>>> specified, slot sharing should be no longer needed. I think this > > >>>>> exactly > > >>>>>>> disproves the example. If we already know op_1 and op_2 each > > >> needs > > >>>> 100 > > >>>>> MB > > >>>>>>> of memory, why would we put them in the same group? If they are > > >> in > > >>>>>> separate > > >>>>>>> groups, with the proposed approach the system can freely deploy > > >> them > > >>>> to > > >>>>>>> either a 200 MB TM or two 100 MB TMs. > > >>>>>>> > > >>>>>>> Moreover, the precondition for not needing slot sharing is having > > >>>>>> resource > > >>>>>>> requirements properly specified for all operators. This is not > > >> always > > >>>>>>> possible, and usually requires tremendous efforts. 
One of the > > >>>> benefits > > >>>>>> for > > >>>>>>> SSG-based requirements is that it allows the user to freely > > >> decide > > >>>> the > > >>>>>>> granularity, thus efforts they want to pay. I would consider SSG > > >> in > > >>>>>>> fine-grained resource management as a group of operators that the > > >>>> user > > >>>>>>> would like to specify the total resource for. There can be only > > >> one > > >>>>> group > > >>>>>>> in the job, 2~3 groups dividing the job into a few major parts, > > >> or as > > >>>>>> many > > >>>>>>> groups as the number of tasks/operators, depending on how > > >>>> fine-grained > > >>>>>> the > > >>>>>>> user is able to specify the resources. > > >>>>>>> > > >>>>>>> Having to support SSGs might be a constraint. But given that all > > >> the > > >>>>>>> current scheduler implementations already support SSGs, I tend to > > >>>> think > > >>>>>>> that as an acceptable price for the above discussed usability and > > >>>>>>> flexibility. > > >>>>>>> > > >>>>>>> @Chesnay > > >>>>>>> > > >>>>>>> Will declaring them on slot sharing groups not also waste > > >> resources > > >>>> if > > >>>>>> the > > >>>>>>>> parallelism of operators within that group are different? > > >>>>>>>> > > >>>>>>> Yes. It's a trade-off between usability and resource > > >> utilization. To > > >>>>>> avoid > > >>>>>>> such wasting, the user can define more groups, so that each group > > >>>>>> contains > > >>>>>>> less operators and the chance of having operators with different > > >>>>>>> parallelism will be reduced. The price is to have more resource > > >>>>>>> requirements to specify. > > >>>>>>> > > >>>>>>> It also seems like quite a hassle for users having to > > >> recalculate the > > >>>>>>>> resource requirements if they change the slot sharing. > > >>>>>>>> I'd think that it's not really workable for users that create > > >> a set > > >>>>> of > > >>>>>>>> re-usable operators which are mixed and matched in their > > >>>>> applications; > > >>>>>>>> managing the resources requirements in such a setting would be > > >> a > > >>>>>>>> nightmare, and in the end would require operator-level > > >> requirements > > >>>>> any > > >>>>>>>> way. > > >>>>>>>> In that sense, I'm not even sure whether it really increases > > >>>>> usability. > > >>>>>>> - As mentioned in my reply to Till's comment, there's no > > >> reason to > > >>>>> put > > >>>>>>> multiple operators whose individual resource requirements are > > >>>>> already > > >>>>>>> known > > >>>>>>> into the same group in fine-grained resource management. > > >>>>>>> - Even an operator implementation is reused for multiple > > >>>>> applications, > > >>>>>>> it does not guarantee the same resource requirements. During > > >> our > > >>>>> years > > >>>>>>> of > > >>>>>>> practices in Alibaba, with per-operator requirements > > >> specified for > > >>>>>>> Blink's > > >>>>>>> fine-grained resource management, very few users (including > > >> our > > >>>>>>> specialists > > >>>>>>> who are dedicated to supporting Blink users) are as > > >> experienced as > > >>>>> to > > >>>>>>> accurately predict/estimate the operator resource > > >> requirements. > > >>>> Most > > >>>>>>> people > > >>>>>>> rely on the execution-time metrics (throughput, delay, cpu > > >> load, > > >>>>>> memory > > >>>>>>> usage, GC pressure, etc.) to improve the specification. 
> > >>>>>>> > > >>>>>>> To sum up: > > >>>>>>> If the user is capable of providing proper resource requirements > > >> for > > >>>>>> every > > >>>>>>> operator, that's definitely a good thing and we would not need to > > >>>> rely > > >>>>> on > > >>>>>>> the SSGs. However, that shouldn't be a *must* for the > > >> fine-grained > > >>>>>> resource > > >>>>>>> management to work. For those users who are capable and do not > > >> like > > >>>>>> having > > >>>>>>> to set each operator to a separate SSG, I would be ok to have > > >> both > > >>>>>>> SSG-based and operator-based runtime interfaces and to only > > >> fallback > > >>>> to > > >>>>>> the > > >>>>>>> SSG requirements when the operator requirements are not > > >> specified. > > >>>>>> However, > > >>>>>>> as the first step, I think we should prioritise the use cases > > >> where > > >>>>> users > > >>>>>>> are not that experienced. > > >>>>>>> > > >>>>>>> Thank you~ > > >>>>>>> > > >>>>>>> Xintong Song > > >>>>>>> > > >>>>>>> On Thu, Jan 7, 2021 at 9:55 PM Chesnay Schepler < > > >> [hidden email]> > > >>>>>>> wrote: > > >>>>>>> > > >>>>>>>> Will declaring them on slot sharing groups not also waste > > >> resources > > >>>>> if > > >>>>>>>> the parallelism of operators within that group are different? > > >>>>>>>> > > >>>>>>>> It also seems like quite a hassle for users having to > > >> recalculate > > >>>> the > > >>>>>>>> resource requirements if they change the slot sharing. > > >>>>>>>> I'd think that it's not really workable for users that create > > >> a set > > >>>>> of > > >>>>>>>> re-usable operators which are mixed and matched in their > > >>>>> applications; > > >>>>>>>> managing the resources requirements in such a setting would be > > >> a > > >>>>>>>> nightmare, and in the end would require operator-level > > >> requirements > > >>>>> any > > >>>>>>>> way. > > >>>>>>>> In that sense, I'm not even sure whether it really increases > > >>>>> usability. > > >>>>>>>> My main worry is that it if we wire the runtime to work on SSGs > > >>>> it's > > >>>>>>>> gonna be difficult to implement more fine-grained approaches, > > >> which > > >>>>>>>> would not be the case if, for the runtime, they are always > > >> defined > > >>>> on > > >>>>>> an > > >>>>>>>> operator-level. > > >>>>>>>> > > >>>>>>>> On 1/7/2021 2:42 PM, Till Rohrmann wrote: > > >>>>>>>>> Thanks for drafting this FLIP and starting this discussion > > >>>> Yangze. > > >>>>>>>>> I like that defining resource requirements on a slot sharing > > >>>> group > > >>>>>>> makes > > >>>>>>>>> the overall setup easier and improves usability of resource > > >>>>>>> requirements. > > >>>>>>>>> What I do not like about it is that it changes slot sharing > > >>>> groups > > >>>>>> from > > >>>>>>>>> being a scheduling hint to something which needs to be > > >> supported > > >>>> in > > >>>>>>> order > > >>>>>>>>> to support fine grained resource requirements. So far, the > > >> idea > > >>>> of > > >>>>>> slot > > >>>>>>>>> sharing groups was that it tells the system that a set of > > >>>> operators > > >>>>>> can > > >>>>>>>> be > > >>>>>>>>> deployed in the same slot. But the system still had the > > >> freedom > > >>>> to > > >>>>>> say > > >>>>>>>> that > > >>>>>>>>> it would rather place these tasks in different slots if it > > >>>> wanted. 
> > >>>>> If > > >>>>>>> we > > >>>>>>>>> now specify resource requirements on a per slot sharing > > >> group, > > >>>> then > > >>>>>> the > > >>>>>>>>> only option for a scheduler which does not support slot > > >> sharing > > >>>>>> groups > > >>>>>>> is > > >>>>>>>>> to say that every operator in this slot sharing group needs a > > >>>> slot > > >>>>>> with > > >>>>>>>> the > > >>>>>>>>> same resources as the whole group. > > >>>>>>>>> > > >>>>>>>>> So for example, if we have a job consisting of two operator > > >> op_1 > > >>>>> and > > >>>>>>> op_2 > > >>>>>>>>> where each op needs 100 MB of memory, we would then say that > > >> the > > >>>>> slot > > >>>>>>>>> sharing group needs 200 MB of memory to run. If we have a > > >> cluster > > >>>>>> with > > >>>>>>> 2 > > >>>>>>>>> TMs with one slot of 100 MB each, then the system cannot run > > >> this > > >>>>>> job. > > >>>>>>> If > > >>>>>>>>> the resources were specified on an operator level, then the > > >>>> system > > >>>>>>> could > > >>>>>>>>> still make the decision to deploy op_1 to TM_1 and op_2 to > > >> TM_2. > > >>>>>>>>> Originally, one of the primary goals of slot sharing groups > > >> was > > >>>> to > > >>>>>> make > > >>>>>>>> it > > >>>>>>>>> easier for the user to reason about how many slots a job > > >> needs > > >>>>>>>> independent > > >>>>>>>>> of the actual number of operators in the job. Interestingly, > > >> if > > >>>> all > > >>>>>>>>> operators have their resources properly specified, then slot > > >>>>> sharing > > >>>>>> is > > >>>>>>>> no > > >>>>>>>>> longer needed because Flink could slice off the appropriately > > >>>> sized > > >>>>>>> slots > > >>>>>>>>> for every Task individually. What matters is whether the > > >> whole > > >>>>>> cluster > > >>>>>>>> has > > >>>>>>>>> enough resources to run all tasks or not. > > >>>>>>>>> > > >>>>>>>>> Cheers, > > >>>>>>>>> Till > > >>>>>>>>> > > >>>>>>>>> On Thu, Jan 7, 2021 at 4:08 AM Yangze Guo < > > >> [hidden email]> > > >>>>>> wrote: > > >>>>>>>>>> Hi, there, > > >>>>>>>>>> > > >>>>>>>>>> We would like to start a discussion thread on "FLIP-156: > > >> Runtime > > >>>>>>>>>> Interfaces for Fine-Grained Resource Requirements"[1], > > >> where we > > >>>>>>>>>> propose Slot Sharing Group (SSG) based runtime interfaces > > >> for > > >>>>>>>>>> specifying fine-grained resource requirements. > > >>>>>>>>>> > > >>>>>>>>>> In this FLIP: > > >>>>>>>>>> - Expound the user story of fine-grained resource > > >> management. > > >>>>>>>>>> - Propose runtime interfaces for specifying SSG-based > > >> resource > > >>>>>>>>>> requirements. > > >>>>>>>>>> - Discuss the pros and cons of the three potential > > >> granularities > > >>>>> for > > >>>>>>>>>> specifying the resource requirements (op, task and slot > > >> sharing > > >>>>>> group) > > >>>>>>>>>> and explain why we choose the slot sharing group. > > >>>>>>>>>> > > >>>>>>>>>> Please find more details in the FLIP wiki document [1]. > > >> Looking > > >>>>>>>>>> forward to your feedback. > > >>>>>>>>>> > > >>>>>>>>>> [1] > > >>>>>>>>>> > > >> > > > https://cwiki.apache.org/confluence/display/FLINK/FLIP-156%3A+Runtime+Interfaces+for+Fine-Grained+Resource+Requirements > > >>>>>>>>>> Best, > > >>>>>>>>>> Yangze Guo > > >>>>>>>>>> > > >>>>>>>> > > > > > |
You're raising a good point, but I think I can rectify that with a minor
adjustment. Default requirements are whatever the default requirements are; setting the requirements for one operator has no effect on other operators. With these rules, and some API enhancements, the following mockup would replicate the SSG-based behavior:

Map<SlotSharingGroupId, Requirements> requirements = ...
for slotSharingGroup in env.getSlotSharingGroups() {
    vertices = slotSharingGroup.getVertices()
    vertices.first().setRequirements(requirements.get(slotSharingGroup.getID()))
    vertices.remaining().setRequirements(ZERO)
}

We could even allow setting requirements on slotsharing-groups or colocation-groups and internally translate them accordingly. I can't help but feel this is a plain API issue. On 1/21/2021 9:44 AM, Till Rohrmann wrote: > If I understand you correctly Chesnay, then you want to decouple the > resource requirement specification from the slot sharing group > assignment. Hence, per default all operators would be in the same slot > sharing group. If there is no operator with a resource specification, > then the system would allocate a default slot for it. If there is at > least one operator, then the system would sum up all the specified > resources and allocate a slot of this size. This effectively means > that all unspecified operators will implicitly have a zero resource > requirement. Did I understand your idea correctly? > > I am wondering whether this wouldn't lead to a surprising behaviour > for the user. If the user specifies the resource requirements for a > single operator, then he probably will assume that the other operators > will get the default share of resources and not nothing. > > Cheers, > Till > > On Thu, Jan 21, 2021 at 3:25 AM Chesnay Schepler <[hidden email]> wrote: > > Is there even a functional difference between specifying the > requirements for an SSG vs specifying the same requirements on a > single > operator within that group (ideally a colocation group to avoid this > whole hint business)? > > Wouldn't we get the best of both worlds in the latter case? > > Users can take shortcuts to define shared requirements, > but refine them further as needed on a per-operator basis, > without changing semantics of slotsharing groups > nor the runtime being locked into SSG-based requirements. > > (And before anyone argues what happens if slotsharing groups > change or > whatnot, that's a plain API issue that we could surely solve. (A > plain > iteration over slotsharing groups and therein contained operators > would > suffice)). > > On 1/20/2021 6:48 PM, Till Rohrmann wrote: > > Maybe a different minor idea: Would it be possible to treat the SSG > > resource requirements as a hint for the runtime similar to how > slot sharing > > groups are designed at the moment? Meaning that we don't give > the guarantee > > that Flink will always deploy this set of tasks together no > matter what > > comes. If, for example, the runtime can derive by some means the > resource > > requirements for each task based on the requirements for the > SSG, this > > could be possible. One easy strategy would be to give every task > the same > > resources as the whole slot sharing group. Another one could be > > distributing the resources equally among the tasks. This does > not even have > > to be implemented but we would give ourselves the freedom to change > > scheduling if need should arise. 
|
I think Chesnay's proposal could actually work. IIUC, the key point is to derive operator requirements from SSG requirements on the API side, so that the runtime only deals with operator requirements. It's debatable how the deriving should be done, though. E.g., an alternative could be to evenly divide the SSG requirement into requirements of the operators in the group.

However, I'm not entirely sure which option is more desired. Illustrating my understanding in the following figure: on the top is Chesnay's proposal and on the bottom is the SSG-based proposal in this FLIP.

I think the major difference between the two approaches is where the deriving of operator requirements from SSG requirements happens.

- Chesnay's proposal simplifies the runtime logic and the interface to expose, at the price of moving more complexity (i.e. the deriving) to the API side. The question is, where do we prefer to keep the complexity? I'm slightly leaning towards having a thin API and keeping the complexity in the runtime if possible.

- Notice that the dashed arrows represent optional steps that are needed only for schedulers that do not respect SSGs, which we don't have at the moment. If we only look at the solid arrows, then the SSG-based approach is much simpler, without needing to derive and aggregate the requirements back and forth. I'm not sure about complicating the current design only for potential future needs.

Thank you~

Xintong Song

On Fri, Jan 22, 2021 at 7:35 AM Chesnay Schepler <[hidden email]> wrote: You're raising a good point, but I think I can rectify that with a minor |
FGRuntimeInterface.png (figure attachment)

Thank you~

Xintong Song

On Fri, Jan 22, 2021 at 11:11 AM Xintong Song <[hidden email]> wrote:
|
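To make the two API-side derivation options discussed above concrete, here is a minimal, self-contained sketch in plain Java. The Resources type and the method names are made up for illustration; they are not Flink's actual ResourceSpec or transformation API.

import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical, simplified stand-in for a resource specification (not Flink's ResourceSpec).
record Resources(double cpuCores, int memoryMb) {
    static final Resources ZERO = new Resources(0.0, 0);

    Resources dividedBy(int n) {
        return new Resources(cpuCores / n, memoryMb / n);
    }
}

public class SsgRequirementDerivation {

    // Option A (Chesnay's mockup): attribute the whole group requirement to the
    // first operator and zero to the rest, so the per-operator sum equals the group.
    static Map<String, Resources> allToFirst(List<String> operators, Resources group) {
        Map<String, Resources> derived = new LinkedHashMap<>();
        for (int i = 0; i < operators.size(); i++) {
            derived.put(operators.get(i), i == 0 ? group : Resources.ZERO);
        }
        return derived;
    }

    // Option B (the alternative mentioned above): divide the group requirement
    // evenly among the operators in the group.
    static Map<String, Resources> evenSplit(List<String> operators, Resources group) {
        Map<String, Resources> derived = new LinkedHashMap<>();
        Resources share = group.dividedBy(operators.size());
        for (String operator : operators) {
            derived.put(operator, share);
        }
        return derived;
    }

    public static void main(String[] args) {
        List<String> ssg1 = List.of("source", "map", "agg");
        Resources groupRequirement = new Resources(3.0, 3000);

        // Whole requirement on "source", zero on the others.
        System.out.println(allToFirst(ssg1, groupRequirement));
        // Each operator gets a third of the group requirement.
        System.out.println(evenSplit(ssg1, groupRequirement));
    }
}

In both variants the per-operator requirements sum back to what the user declared for the group (up to integer rounding of the memory split), which is what would allow the runtime to keep working purely on operator-level requirements.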
Thanks everyone for the lively discussion. I'd like to try to
summarize the current convergence in the discussion. Please let me know if I got things wrong or missed something crucial here.

Change of this FLIP:
- Treat the SSG resource requirements as a hint instead of a restriction for the runtime. That should be explicitly explained in the JavaDocs.

Potential follow-up issues if needed:
- Provide an operator-level resource configuration interface.
- Provide multiple options for deciding resources for SSGs whose requirement is not specified:
  ** Default slot resource.
  ** Default operator resource times number of operators.

If there are no other issues, I'll update the FLIP accordingly and start a vote thread. Thanks all for the valuable feedback again.

Best,
Yangze Guo

On Fri, Jan 22, 2021 at 11:30 AM Xintong Song <[hidden email]> wrote:
>
> FGRuntimeInterface.png
>
> Thank you~
>
> Xintong Song
>
> On Fri, Jan 22, 2021 at 11:11 AM Xintong Song <[hidden email]> wrote:
>>
>> I think Chesnay's proposal could actually work. IIUC, the key point is to derive operator requirements from SSG requirements on the API side, so that the runtime only deals with operator requirements. It's debatable how the deriving should be done, though. E.g., an alternative could be to evenly divide the SSG requirement into requirements of the operators in the group.
>>
>> However, I'm not entirely sure which option is more desired. Illustrating my understanding in the following figure: on the top is Chesnay's proposal and on the bottom is the SSG-based proposal in this FLIP.
>>
>> I think the major difference between the two approaches is where the deriving of operator requirements from SSG requirements happens.
>>
>> - Chesnay's proposal simplifies the runtime logic and the interface to expose, at the price of moving more complexity (i.e. the deriving) to the API side. The question is, where do we prefer to keep the complexity? I'm slightly leaning towards having a thin API and keeping the complexity in the runtime if possible.
>>
>> - Notice that the dashed arrows represent optional steps that are needed only for schedulers that do not respect SSGs, which we don't have at the moment. If we only look at the solid arrows, then the SSG-based approach is much simpler, without needing to derive and aggregate the requirements back and forth. I'm not sure about complicating the current design only for potential future needs.
>>
>> Thank you~
>>
>> Xintong Song
>>
>> On Fri, Jan 22, 2021 at 7:35 AM Chesnay Schepler <[hidden email]> wrote:
>>>
>>> You're raising a good point, but I think I can rectify that with a minor
>>> adjustment.
>>>
>>> Default requirements are whatever the default requirements are; setting
>>> the requirements for one operator has no effect on other operators.
>>>
>>> With these rules, and some API enhancements, the following mockup would
>>> replicate the SSG-based behavior:
>>>
>>> Map<SlotSharingGroupId, Requirements> requirements = ...
>>> for slotSharingGroup in env.getSlotSharingGroups() {
>>>   vertices = slotSharingGroup.getVertices()
>>>   vertices.first().setRequirements(requirements.get(slotSharingGroup.getID()))
>>>   vertices.remaining().setRequirements(ZERO)
>>> }
>>>
>>> We could even allow setting requirements on slot sharing groups /
>>> colocation groups and internally translate them accordingly.
>>> I can't help but feel this is a plain API issue. 
>>> >>> On 1/21/2021 9:44 AM, Till Rohrmann wrote: >>> > If I understand you correctly Chesnay, then you want to decouple the >>> > resource requirement specification from the slot sharing group >>> > assignment. Hence, per default all operators would be in the same slot >>> > sharing group. If there is no operator with a resource specification, >>> > then the system would allocate a default slot for it. If there is at >>> > least one operator, then the system would sum up all the specified >>> > resources and allocate a slot of this size. This effectively means >>> > that all unspecified operators will implicitly have a zero resource >>> > requirement. Did I understand your idea correctly? >>> > >>> > I am wondering whether this wouldn't lead to a surprising behaviour >>> > for the user. If the user specifies the resource requirements for a >>> > single operator, then he probably will assume that the other operators >>> > will get the default share of resources and not nothing. >>> > >>> > Cheers, >>> > Till >>> > >>> > On Thu, Jan 21, 2021 at 3:25 AM Chesnay Schepler <[hidden email] >>> > <mailto:[hidden email]>> wrote: >>> > >>> > Is there even a functional difference between specifying the >>> > requirements for an SSG vs specifying the same requirements on a >>> > single >>> > operator within that group (ideally a colocation group to avoid this >>> > whole hint business)? >>> > >>> > Wouldn't we get the best of both worlds in the latter case? >>> > >>> > Users can take shortcuts to define shared requirements, >>> > but refine them further as needed on a per-operator basis, >>> > without changing semantics of slotsharing groups >>> > nor the runtime being locked into SSG-based requirements. >>> > >>> > (And before anyone argues what happens if slotsharing groups >>> > change or >>> > whatnot, that's a plain API issue that we could surely solve. (A >>> > plain >>> > iteration over slotsharing groups and therein contained operators >>> > would >>> > suffice)). >>> > >>> > On 1/20/2021 6:48 PM, Till Rohrmann wrote: >>> > > Maybe a different minor idea: Would it be possible to treat the SSG >>> > > resource requirements as a hint for the runtime similar to how >>> > slot sharing >>> > > groups are designed at the moment? Meaning that we don't give >>> > the guarantee >>> > > that Flink will always deploy this set of tasks together no >>> > matter what >>> > > comes. If, for example, the runtime can derive by some means the >>> > resource >>> > > requirements for each task based on the requirements for the >>> > SSG, this >>> > > could be possible. One easy strategy would be to give every task >>> > the same >>> > > resources as the whole slot sharing group. Another one could be >>> > > distributing the resources equally among the tasks. This does >>> > not even have >>> > > to be implemented but we would give ourselves the freedom to change >>> > > scheduling if need should arise. >>> > > >>> > > Cheers, >>> > > Till >>> > > >>> > > On Wed, Jan 20, 2021 at 7:04 AM Yangze Guo <[hidden email] >>> > <mailto:[hidden email]>> wrote: >>> > > >>> > >> Thanks for the responses, Till and Xintong. >>> > >> >>> > >> I second Xintong's comment that SSG-based runtime interface >>> > will give >>> > >> us the flexibility to achieve op/task-based approach. That's one of >>> > >> the most important reasons for our design choice. >>> > >> >>> > >> Some cents regarding the default operator resource: >>> > >> - It might be good for the scenario of DataStream jobs. 
>>> > >> ** For light-weight operators, the accumulative >>> > configuration error >>> > >> will not be significant. Then, the resource of a task used is >>> > >> proportional to the number of operators it contains. >>> > >> ** For heavy operators like join and window or operators >>> > using the >>> > >> external resources, user will turn to the fine-grained resource >>> > >> configuration. >>> > >> - It can increase the stability for the standalone cluster >>> > where task >>> > >> executors registered are heterogeneous(with different default slot >>> > >> resources). >>> > >> - It might not be good for SQL users. The operators that SQL >>> > will be >>> > >> transferred to is a black box to the user. We also do not guarantee >>> > >> the cross-version of consistency of the transformation so far. >>> > >> >>> > >> I think it can be treated as a follow-up work when the fine-grained >>> > >> resource management is end-to-end ready. >>> > >> >>> > >> Best, >>> > >> Yangze Guo >>> > >> >>> > >> >>> > >> On Wed, Jan 20, 2021 at 11:16 AM Xintong Song >>> > <[hidden email] <mailto:[hidden email]>> >>> > >> wrote: >>> > >>> Thanks for the feedback, Till. >>> > >>> >>> > >>> ## I feel that what you proposed (operator-based + default >>> > value) might >>> > >> be >>> > >>> subsumed by the SSG-based approach. >>> > >>> Thinking of op_1 -> op_2, there are the following 4 cases, >>> > categorized by >>> > >>> whether the resource requirements are known to the users. >>> > >>> >>> > >>> 1. *Both known.* As previously mentioned, there's no >>> > reason to put >>> > >>> multiple operators whose individual resource requirements >>> > are already >>> > >> known >>> > >>> into the same group in fine-grained resource management. >>> > And if op_1 >>> > >> and >>> > >>> op_2 are in different groups, there should be no problem >>> > switching >>> > >> data >>> > >>> exchange mode from pipelined to blocking. This is >>> > equivalent to >>> > >> specifying >>> > >>> operator resource requirements in your proposal. >>> > >>> 2. *op_1 known, op_2 unknown.* Similar to 1), except that >>> > op_2 is in a >>> > >>> SSG whose resource is not specified thus would have the >>> > default slot >>> > >>> resource. This is equivalent to having default operator >>> > resources in >>> > >> your >>> > >>> proposal. >>> > >>> 3. *Both unknown*. The user can either set op_1 and op_2 >>> > to the same >>> > >> SSG >>> > >>> or separate SSGs. >>> > >>> - If op_1 and op_2 are in the same SSG, it will be >>> > equivalent to >>> > >> the >>> > >>> coarse-grained resource management, where op_1 and op_2 >>> > share a >>> > >> default >>> > >>> size slot no matter which data exchange mode is used. >>> > >>> - If op_1 and op_2 are in different SSGs, then each of >>> > them will >>> > >> use >>> > >>> a default size slot. This is equivalent to setting them >>> > with >>> > >> default >>> > >>> operator resources in your proposal. >>> > >>> 4. *Total (pipeline) or max (blocking) of op_1 and op_2 is >>> > known.* >>> > >>> - It is possible that the user learns the total / max >>> > resource >>> > >>> requirement from executing and monitoring the job, >>> > while not >>> > >>> being aware of >>> > >>> individual operator requirements. >>> > >>> - I believe this is the case your proposal does not >>> > cover. And TBH, >>> > >>> this is probably how most users learn the resource >>> > requirements, >>> > >>> according >>> > >>> to my experiences. 
>>> > >>> - In this case, the user might need to specify >>> > different resources >>> > >> if >>> > >>> he wants to switch the execution mode, which should not >>> > be worse >>> > >> than not >>> > >>> being able to use fine-grained resource management. >>> > >>> >>> > >>> >>> > >>> ## An additional idea inspired by your proposal. >>> > >>> We may provide multiple options for deciding resources for >>> > SSGs whose >>> > >>> requirement is not specified, if needed. >>> > >>> >>> > >>> - Default slot resource (current design) >>> > >>> - Default operator resource times number of operators >>> > (equivalent to >>> > >>> your proposal) >>> > >>> >>> > >>> >>> > >>> ## Exposing internal runtime strategies >>> > >>> Theoretically, yes. Tying to the SSGs, the resource >>> > requirements might be >>> > >>> affected if how SSGs are internally handled changes in future. >>> > >> Practically, >>> > >>> I do not concretely see at the moment what kind of changes we >>> > may want in >>> > >>> future that might conflict with this FLIP proposal, as the >>> > question of >>> > >>> switching data exchange mode answered above. I'd suggest to >>> > not give up >>> > >> the >>> > >>> user friendliness we may gain now for the future problems that >>> > may or may >>> > >>> not exist. >>> > >>> >>> > >>> Moreover, the SSG-based approach has the flexibility to >>> > achieve the >>> > >>> equivalent behavior as the operator-based approach, if we set each >>> > >> operator >>> > >>> (or task) to a separate SSG. We can even provide a shortcut >>> > option to >>> > >>> automatically do that for users, if needed. >>> > >>> >>> > >>> >>> > >>> Thank you~ >>> > >>> >>> > >>> Xintong Song >>> > >>> >>> > >>> >>> > >>> >>> > >>> On Tue, Jan 19, 2021 at 11:48 PM Till Rohrmann >>> > <[hidden email] <mailto:[hidden email]>> >>> > >> wrote: >>> > >>>> Thanks for the responses Xintong and Stephan, >>> > >>>> >>> > >>>> I agree that being able to define the resource requirements for a >>> > >> group of >>> > >>>> operators is more user friendly. However, my concern is that >>> > we are >>> > >>>> exposing thereby internal runtime strategies which might >>> > limit our >>> > >>>> flexibility to execute a given job. Moreover, the semantics of >>> > >> configuring >>> > >>>> resource requirements for SSGs could break if switching from >>> > streaming >>> > >> to >>> > >>>> batch execution. If one defines the resource requirements for >>> > op_1 -> >>> > >> op_2 >>> > >>>> which run in pipelined mode when using the streaming >>> > execution, then >>> > >> how do >>> > >>>> we interpret these requirements when op_1 -> op_2 are >>> > executed with a >>> > >>>> blocking data exchange in batch execution mode? Consequently, >>> > I am >>> > >> still >>> > >>>> leaning towards Stephan's proposal to set the resource >>> > requirements per >>> > >>>> operator. >>> > >>>> >>> > >>>> Maybe the following proposal makes the configuration easier: >>> > If the >>> > >> user >>> > >>>> wants to use fine-grained resource requirements, then she >>> > needs to >>> > >> specify >>> > >>>> the default size which is used for operators which have no >>> > explicit >>> > >>>> resource annotation. If this holds true, then every operator >>> > would >>> > >> have a >>> > >>>> resource requirement and the system can try to execute the >>> > operators >>> > >> in the >>> > >>>> best possible manner w/o being constrained by how the user >>> > set the SSG >>> > >>>> requirements. 
>>> > >>>> >>> > >>>> Cheers, >>> > >>>> Till >>> > >>>> >>> > >>>> On Tue, Jan 19, 2021 at 9:09 AM Xintong Song >>> > <[hidden email] <mailto:[hidden email]>> >>> > >>>> wrote: >>> > >>>> >>> > >>>>> Thanks for the feedback, Stephan. >>> > >>>>> >>> > >>>>> Actually, your proposal has also come to my mind at some >>> > point. And I >>> > >>>> have >>> > >>>>> some concerns about it. >>> > >>>>> >>> > >>>>> >>> > >>>>> 1. It does not give users the same control as the SSG-based >>> > approach. >>> > >>>>> >>> > >>>>> >>> > >>>>> While both approaches do not require specifying for each >>> > operator, >>> > >>>>> SSG-based approach supports the semantic that "some operators >>> > >> together >>> > >>>> use >>> > >>>>> this much resource" while the operator-based approach doesn't. >>> > >>>>> >>> > >>>>> >>> > >>>>> Think of a long pipeline with m operators (o_1, o_2, ..., >>> > o_m), and >>> > >> at >>> > >>>> some >>> > >>>>> point there's an agg o_n (1 < n < m) which significantly >>> > reduces the >>> > >> data >>> > >>>>> amount. One can separate the pipeline into 2 groups SSG_1 >>> > (o_1, ..., >>> > >> o_n) >>> > >>>>> and SSG_2 (o_n+1, ... o_m), so that configuring much higher >>> > >> parallelisms >>> > >>>>> for operators in SSG_1 than for operators in SSG_2 won't >>> > lead to too >>> > >> much >>> > >>>>> wasting of resources. If the two SSGs end up needing different >>> > >> resources, >>> > >>>>> with the SSG-based approach one can directly specify >>> > resources for >>> > >> the >>> > >>>> two >>> > >>>>> groups. However, with the operator-based approach, the user will >>> > >> have to >>> > >>>>> specify resources for each operator in one of the two >>> > groups, and >>> > >> tune >>> > >>>> the >>> > >>>>> default slot resource via configurations to fit the other group. >>> > >>>>> >>> > >>>>> >>> > >>>>> 2. It increases the chance of breaking operator chains. >>> > >>>>> >>> > >>>>> >>> > >>>>> Setting chainnable operators into different slot sharing >>> > groups will >>> > >>>>> prevent them from being chained. In the current implementation, >>> > >>>> downstream >>> > >>>>> operators, if SSG not explicitly specified, will be set to >>> > the same >>> > >> group >>> > >>>>> as the chainable upstream operators (unless multiple upstream >>> > >> operators >>> > >>>> in >>> > >>>>> different groups), to reduce the chance of breaking chains. >>> > >>>>> >>> > >>>>> >>> > >>>>> Thinking of chainable operators o_1 -> o_2 -> o_3 -> o_3, >>> > deciding >>> > >> SSGs >>> > >>>>> based on whether resource is specified we will easily get >>> > groups like >>> > >>>> (o_1, >>> > >>>>> o_3) & (o_2, o_4), where none of the operators can be >>> > chained. This >>> > >> is >>> > >>>> also >>> > >>>>> possible for the SSG-based approach, but I believe the >>> > chance is much >>> > >>>>> smaller because there's no strong reason for users to >>> > specify the >>> > >> groups >>> > >>>>> with alternate operators like that. We are more likely to >>> > get groups >>> > >> like >>> > >>>>> (o_1, o_2) & (o_3, o_4), where the chain breaks only between >>> > o_2 and >>> > >> o_3. >>> > >>>>> >>> > >>>>> 3. It complicates the system by having two different >>> > mechanisms for >>> > >>>> sharing >>> > >>>>> managed memory in a slot. 
>>> > >>>>> >>> > >>>>> >>> > >>>>> - In FLIP-141, we introduced the intra-slot managed memory >>> > sharing >>> > >>>>> mechanism, where managed memory is first distributed >>> > according to the >>> > >>>>> consumer type, then further distributed across operators of that >>> > >> consumer >>> > >>>>> type. >>> > >>>>> >>> > >>>>> - With the operator-based approach, managed memory size >>> > specified >>> > >> for an >>> > >>>>> operator should account for all the consumer types of that >>> > operator. >>> > >> That >>> > >>>>> means the managed memory is first distributed across >>> > operators, then >>> > >>>>> distributed to different consumer types of each operator. >>> > >>>>> >>> > >>>>> >>> > >>>>> Unfortunately, the different order of the two calculation >>> > steps can >>> > >> lead >>> > >>>> to >>> > >>>>> different results. To be specific, the semantic of the >>> > configuration >>> > >>>> option >>> > >>>>> `consumer-weights` changed (within a slot vs. within an >>> > operator). >>> > >>>>> >>> > >>>>> >>> > >>>>> >>> > >>>>> To sum up things: >>> > >>>>> >>> > >>>>> While (3) might be a bit more implementation related, I >>> > think (1) >>> > >> and (2) >>> > >>>>> somehow suggest that, the price for the proposed approach to >>> > avoid >>> > >>>>> specifying resource for every operator is that it's not as >>> > >> independent >>> > >>>> from >>> > >>>>> operator chaining and slot sharing as the operator-based >>> > approach >>> > >>>> discussed >>> > >>>>> in the FLIP. >>> > >>>>> >>> > >>>>> >>> > >>>>> Thank you~ >>> > >>>>> >>> > >>>>> Xintong Song >>> > >>>>> >>> > >>>>> >>> > >>>>> >>> > >>>>> On Tue, Jan 19, 2021 at 4:29 AM Stephan Ewen >>> > <[hidden email] <mailto:[hidden email]>> >>> > >> wrote: >>> > >>>>>> Thanks a lot, Yangze and Xintong for this FLIP. >>> > >>>>>> >>> > >>>>>> I want to say, first of all, that this is super well >>> > written. And >>> > >> the >>> > >>>>>> points that the FLIP makes about how to expose the >>> > configuration to >>> > >>>> users >>> > >>>>>> is exactly the right thing to figure out first. >>> > >>>>>> So good job here! >>> > >>>>>> >>> > >>>>>> About how to let users specify the resource profiles. If I >>> > can sum >>> > >> the >>> > >>>>> FLIP >>> > >>>>>> and previous discussion up in my own words, the problem is the >>> > >>>> following: >>> > >>>>>> Operator-level specification is the simplest and cleanest >>> > approach, >>> > >>>>> because >>> > >>>>>>> it avoids mixing operator configuration (resource) and >>> > >> scheduling. No >>> > >>>>>>> matter what other parameters change (chaining, slot sharing, >>> > >>>> switching >>> > >>>>>>> pipelined and blocking shuffles), the resource profiles >>> > stay the >>> > >>>> same. >>> > >>>>>>> But it would require that a user specifies resources on all >>> > >>>> operators, >>> > >>>>>>> which makes it hard to use. That's why the FLIP suggests going >>> > >> with >>> > >>>>>>> specifying resources on a Sharing-Group. >>> > >>>>>> >>> > >>>>>> I think both thoughts are important, so can we find a solution >>> > >> where >>> > >>>> the >>> > >>>>>> Resource Profiles are specified on an Operator, but we >>> > still avoid >>> > >> that >>> > >>>>> we >>> > >>>>>> need to specify a resource profile on every operator? >>> > >>>>>> >>> > >>>>>> What do you think about something like the following: >>> > >>>>>> - Resource Profiles are specified on an operator level. 
>>> > >>>>>> - Not all operators need profiles >>> > >>>>>> - All Operators without a Resource Profile ended up in the >>> > >> default >>> > >>>> slot >>> > >>>>>> sharing group with a default profile (will get a default slot). >>> > >>>>>> - All Operators with a Resource Profile will go into >>> > another slot >>> > >>>>> sharing >>> > >>>>>> group (the resource-specified-group). >>> > >>>>>> - Users can define different slot sharing groups for >>> > operators >>> > >> like >>> > >>>>> they >>> > >>>>>> do now, with the exception that you cannot mix operators >>> > that have >>> > >> a >>> > >>>>>> resource profile and operators that have no resource profile. >>> > >>>>>> - The default case where no operator has a resource >>> > profile is >>> > >> just a >>> > >>>>>> special case of this model >>> > >>>>>> - The chaining logic sums up the profiles per operator, >>> > like it >>> > >> does >>> > >>>>> now, >>> > >>>>>> and the scheduler sums up the profiles of the tasks that it >>> > >> schedules >>> > >>>>>> together. >>> > >>>>>> >>> > >>>>>> >>> > >>>>>> There is another question about reactive scaling raised in the >>> > >> FLIP. I >>> > >>>>> need >>> > >>>>>> to think a bit about that. That is indeed a bit more tricky >>> > once we >>> > >>>> have >>> > >>>>>> slots of different sizes. >>> > >>>>>> It is not clear then which of the different slot requests the >>> > >>>>>> ResourceManager should fulfill when new resources (TMs) >>> > show up, >>> > >> or how >>> > >>>>> the >>> > >>>>>> JobManager redistributes the slots resources when resources >>> > (TMs) >>> > >>>>> disappear >>> > >>>>>> This question is pretty orthogonal, though, to the "how to >>> > specify >>> > >> the >>> > >>>>>> resources". >>> > >>>>>> >>> > >>>>>> >>> > >>>>>> Best, >>> > >>>>>> Stephan >>> > >>>>>> >>> > >>>>>> On Fri, Jan 8, 2021 at 5:14 AM Xintong Song >>> > <[hidden email] <mailto:[hidden email]> >>> > >>>>> wrote: >>> > >>>>>>> Thanks for drafting the FLIP and driving the discussion, >>> > Yangze. >>> > >>>>>>> And Thanks for the feedback, Till and Chesnay. >>> > >>>>>>> >>> > >>>>>>> @Till, >>> > >>>>>>> >>> > >>>>>>> I agree that specifying requirements for SSGs means that SSGs >>> > >> need to >>> > >>>>> be >>> > >>>>>>> supported in fine-grained resource management, otherwise each >>> > >>>> operator >>> > >>>>>>> might use as many resources as the whole group. However, I >>> > cannot >>> > >>>> think >>> > >>>>>> of >>> > >>>>>>> a strong reason for not supporting SSGs in fine-grained >>> > resource >>> > >>>>>>> management. >>> > >>>>>>> >>> > >>>>>>> >>> > >>>>>>>> Interestingly, if all operators have their resources properly >>> > >>>>>> specified, >>> > >>>>>>>> then slot sharing is no longer needed because Flink could >>> > >> slice off >>> > >>>>> the >>> > >>>>>>>> appropriately sized slots for every Task individually. >>> > >>>>>>>> >>> > >>>>>>> So for example, if we have a job consisting of two >>> > operator op_1 >>> > >> and >>> > >>>>> op_2 >>> > >>>>>>>> where each op needs 100 MB of memory, we would then say that >>> > >> the >>> > >>>> slot >>> > >>>>>>>> sharing group needs 200 MB of memory to run. If we have a >>> > >> cluster >>> > >>>>> with >>> > >>>>>> 2 >>> > >>>>>>>> TMs with one slot of 100 MB each, then the system cannot run >>> > >> this >>> > >>>>> job. >>> > >>>>>> If >>> > >>>>>>>> the resources were specified on an operator level, then the >>> > >> system >>> > >>>>>> could >>> > >>>>>>>> still make the decision to deploy op_1 to TM_1 and op_2 to >>> > >> TM_2. 
>>> > >>>>>>> >>> > >>>>>>> Couldn't agree more that if all operators' requirements are >>> > >> properly >>> > >>>>>>> specified, slot sharing should be no longer needed. I >>> > think this >>> > >>>>> exactly >>> > >>>>>>> disproves the example. If we already know op_1 and op_2 each >>> > >> needs >>> > >>>> 100 >>> > >>>>> MB >>> > >>>>>>> of memory, why would we put them in the same group? If >>> > they are >>> > >> in >>> > >>>>>> separate >>> > >>>>>>> groups, with the proposed approach the system can freely >>> > deploy >>> > >> them >>> > >>>> to >>> > >>>>>>> either a 200 MB TM or two 100 MB TMs. >>> > >>>>>>> >>> > >>>>>>> Moreover, the precondition for not needing slot sharing is >>> > having >>> > >>>>>> resource >>> > >>>>>>> requirements properly specified for all operators. This is not >>> > >> always >>> > >>>>>>> possible, and usually requires tremendous efforts. One of the >>> > >>>> benefits >>> > >>>>>> for >>> > >>>>>>> SSG-based requirements is that it allows the user to freely >>> > >> decide >>> > >>>> the >>> > >>>>>>> granularity, thus efforts they want to pay. I would >>> > consider SSG >>> > >> in >>> > >>>>>>> fine-grained resource management as a group of operators >>> > that the >>> > >>>> user >>> > >>>>>>> would like to specify the total resource for. There can be >>> > only >>> > >> one >>> > >>>>> group >>> > >>>>>>> in the job, 2~3 groups dividing the job into a few major >>> > parts, >>> > >> or as >>> > >>>>>> many >>> > >>>>>>> groups as the number of tasks/operators, depending on how >>> > >>>> fine-grained >>> > >>>>>> the >>> > >>>>>>> user is able to specify the resources. >>> > >>>>>>> >>> > >>>>>>> Having to support SSGs might be a constraint. But given >>> > that all >>> > >> the >>> > >>>>>>> current scheduler implementations already support SSGs, I >>> > tend to >>> > >>>> think >>> > >>>>>>> that as an acceptable price for the above discussed >>> > usability and >>> > >>>>>>> flexibility. >>> > >>>>>>> >>> > >>>>>>> @Chesnay >>> > >>>>>>> >>> > >>>>>>> Will declaring them on slot sharing groups not also waste >>> > >> resources >>> > >>>> if >>> > >>>>>> the >>> > >>>>>>>> parallelism of operators within that group are different? >>> > >>>>>>>> >>> > >>>>>>> Yes. It's a trade-off between usability and resource >>> > >> utilization. To >>> > >>>>>> avoid >>> > >>>>>>> such wasting, the user can define more groups, so that >>> > each group >>> > >>>>>> contains >>> > >>>>>>> less operators and the chance of having operators with >>> > different >>> > >>>>>>> parallelism will be reduced. The price is to have more >>> > resource >>> > >>>>>>> requirements to specify. >>> > >>>>>>> >>> > >>>>>>> It also seems like quite a hassle for users having to >>> > >> recalculate the >>> > >>>>>>>> resource requirements if they change the slot sharing. >>> > >>>>>>>> I'd think that it's not really workable for users that create >>> > >> a set >>> > >>>>> of >>> > >>>>>>>> re-usable operators which are mixed and matched in their >>> > >>>>> applications; >>> > >>>>>>>> managing the resources requirements in such a setting >>> > would be >>> > >> a >>> > >>>>>>>> nightmare, and in the end would require operator-level >>> > >> requirements >>> > >>>>> any >>> > >>>>>>>> way. >>> > >>>>>>>> In that sense, I'm not even sure whether it really increases >>> > >>>>> usability. 
>>> > >>>>>>> - As mentioned in my reply to Till's comment, there's no >>> > >> reason to >>> > >>>>> put >>> > >>>>>>> multiple operators whose individual resource >>> > requirements are >>> > >>>>> already >>> > >>>>>>> known >>> > >>>>>>> into the same group in fine-grained resource management. >>> > >>>>>>> - Even an operator implementation is reused for multiple >>> > >>>>> applications, >>> > >>>>>>> it does not guarantee the same resource requirements. >>> > During >>> > >> our >>> > >>>>> years >>> > >>>>>>> of >>> > >>>>>>> practices in Alibaba, with per-operator requirements >>> > >> specified for >>> > >>>>>>> Blink's >>> > >>>>>>> fine-grained resource management, very few users >>> > (including >>> > >> our >>> > >>>>>>> specialists >>> > >>>>>>> who are dedicated to supporting Blink users) are as >>> > >> experienced as >>> > >>>>> to >>> > >>>>>>> accurately predict/estimate the operator resource >>> > >> requirements. >>> > >>>> Most >>> > >>>>>>> people >>> > >>>>>>> rely on the execution-time metrics (throughput, delay, cpu >>> > >> load, >>> > >>>>>> memory >>> > >>>>>>> usage, GC pressure, etc.) to improve the specification. >>> > >>>>>>> >>> > >>>>>>> To sum up: >>> > >>>>>>> If the user is capable of providing proper resource >>> > requirements >>> > >> for >>> > >>>>>> every >>> > >>>>>>> operator, that's definitely a good thing and we would not >>> > need to >>> > >>>> rely >>> > >>>>> on >>> > >>>>>>> the SSGs. However, that shouldn't be a *must* for the >>> > >> fine-grained >>> > >>>>>> resource >>> > >>>>>>> management to work. For those users who are capable and do not >>> > >> like >>> > >>>>>> having >>> > >>>>>>> to set each operator to a separate SSG, I would be ok to have >>> > >> both >>> > >>>>>>> SSG-based and operator-based runtime interfaces and to only >>> > >> fallback >>> > >>>> to >>> > >>>>>> the >>> > >>>>>>> SSG requirements when the operator requirements are not >>> > >> specified. >>> > >>>>>> However, >>> > >>>>>>> as the first step, I think we should prioritise the use cases >>> > >> where >>> > >>>>> users >>> > >>>>>>> are not that experienced. >>> > >>>>>>> >>> > >>>>>>> Thank you~ >>> > >>>>>>> >>> > >>>>>>> Xintong Song >>> > >>>>>>> >>> > >>>>>>> On Thu, Jan 7, 2021 at 9:55 PM Chesnay Schepler < >>> > >> [hidden email] <mailto:[hidden email]>> >>> > >>>>>>> wrote: >>> > >>>>>>> >>> > >>>>>>>> Will declaring them on slot sharing groups not also waste >>> > >> resources >>> > >>>>> if >>> > >>>>>>>> the parallelism of operators within that group are different? >>> > >>>>>>>> >>> > >>>>>>>> It also seems like quite a hassle for users having to >>> > >> recalculate >>> > >>>> the >>> > >>>>>>>> resource requirements if they change the slot sharing. >>> > >>>>>>>> I'd think that it's not really workable for users that create >>> > >> a set >>> > >>>>> of >>> > >>>>>>>> re-usable operators which are mixed and matched in their >>> > >>>>> applications; >>> > >>>>>>>> managing the resources requirements in such a setting >>> > would be >>> > >> a >>> > >>>>>>>> nightmare, and in the end would require operator-level >>> > >> requirements >>> > >>>>> any >>> > >>>>>>>> way. >>> > >>>>>>>> In that sense, I'm not even sure whether it really increases >>> > >>>>> usability. 
>>> > >>>>>>>> My main worry is that it if we wire the runtime to work >>> > on SSGs >>> > >>>> it's >>> > >>>>>>>> gonna be difficult to implement more fine-grained approaches, >>> > >> which >>> > >>>>>>>> would not be the case if, for the runtime, they are always >>> > >> defined >>> > >>>> on >>> > >>>>>> an >>> > >>>>>>>> operator-level. >>> > >>>>>>>> >>> > >>>>>>>> On 1/7/2021 2:42 PM, Till Rohrmann wrote: >>> > >>>>>>>>> Thanks for drafting this FLIP and starting this discussion >>> > >>>> Yangze. >>> > >>>>>>>>> I like that defining resource requirements on a slot sharing >>> > >>>> group >>> > >>>>>>> makes >>> > >>>>>>>>> the overall setup easier and improves usability of resource >>> > >>>>>>> requirements. >>> > >>>>>>>>> What I do not like about it is that it changes slot sharing >>> > >>>> groups >>> > >>>>>> from >>> > >>>>>>>>> being a scheduling hint to something which needs to be >>> > >> supported >>> > >>>> in >>> > >>>>>>> order >>> > >>>>>>>>> to support fine grained resource requirements. So far, the >>> > >> idea >>> > >>>> of >>> > >>>>>> slot >>> > >>>>>>>>> sharing groups was that it tells the system that a set of >>> > >>>> operators >>> > >>>>>> can >>> > >>>>>>>> be >>> > >>>>>>>>> deployed in the same slot. But the system still had the >>> > >> freedom >>> > >>>> to >>> > >>>>>> say >>> > >>>>>>>> that >>> > >>>>>>>>> it would rather place these tasks in different slots if it >>> > >>>> wanted. >>> > >>>>> If >>> > >>>>>>> we >>> > >>>>>>>>> now specify resource requirements on a per slot sharing >>> > >> group, >>> > >>>> then >>> > >>>>>> the >>> > >>>>>>>>> only option for a scheduler which does not support slot >>> > >> sharing >>> > >>>>>> groups >>> > >>>>>>> is >>> > >>>>>>>>> to say that every operator in this slot sharing group >>> > needs a >>> > >>>> slot >>> > >>>>>> with >>> > >>>>>>>> the >>> > >>>>>>>>> same resources as the whole group. >>> > >>>>>>>>> >>> > >>>>>>>>> So for example, if we have a job consisting of two operator >>> > >> op_1 >>> > >>>>> and >>> > >>>>>>> op_2 >>> > >>>>>>>>> where each op needs 100 MB of memory, we would then say that >>> > >> the >>> > >>>>> slot >>> > >>>>>>>>> sharing group needs 200 MB of memory to run. If we have a >>> > >> cluster >>> > >>>>>> with >>> > >>>>>>> 2 >>> > >>>>>>>>> TMs with one slot of 100 MB each, then the system cannot run >>> > >> this >>> > >>>>>> job. >>> > >>>>>>> If >>> > >>>>>>>>> the resources were specified on an operator level, then the >>> > >>>> system >>> > >>>>>>> could >>> > >>>>>>>>> still make the decision to deploy op_1 to TM_1 and op_2 to >>> > >> TM_2. >>> > >>>>>>>>> Originally, one of the primary goals of slot sharing groups >>> > >> was >>> > >>>> to >>> > >>>>>> make >>> > >>>>>>>> it >>> > >>>>>>>>> easier for the user to reason about how many slots a job >>> > >> needs >>> > >>>>>>>> independent >>> > >>>>>>>>> of the actual number of operators in the job. Interestingly, >>> > >> if >>> > >>>> all >>> > >>>>>>>>> operators have their resources properly specified, then slot >>> > >>>>> sharing >>> > >>>>>> is >>> > >>>>>>>> no >>> > >>>>>>>>> longer needed because Flink could slice off the >>> > appropriately >>> > >>>> sized >>> > >>>>>>> slots >>> > >>>>>>>>> for every Task individually. What matters is whether the >>> > >> whole >>> > >>>>>> cluster >>> > >>>>>>>> has >>> > >>>>>>>>> enough resources to run all tasks or not. 
>>> > >>>>>>>>> >>> > >>>>>>>>> Cheers, >>> > >>>>>>>>> Till >>> > >>>>>>>>> >>> > >>>>>>>>> On Thu, Jan 7, 2021 at 4:08 AM Yangze Guo < >>> > >> [hidden email] <mailto:[hidden email]>> >>> > >>>>>> wrote: >>> > >>>>>>>>>> Hi, there, >>> > >>>>>>>>>> >>> > >>>>>>>>>> We would like to start a discussion thread on "FLIP-156: >>> > >> Runtime >>> > >>>>>>>>>> Interfaces for Fine-Grained Resource Requirements"[1], >>> > >> where we >>> > >>>>>>>>>> propose Slot Sharing Group (SSG) based runtime interfaces >>> > >> for >>> > >>>>>>>>>> specifying fine-grained resource requirements. >>> > >>>>>>>>>> >>> > >>>>>>>>>> In this FLIP: >>> > >>>>>>>>>> - Expound the user story of fine-grained resource >>> > >> management. >>> > >>>>>>>>>> - Propose runtime interfaces for specifying SSG-based >>> > >> resource >>> > >>>>>>>>>> requirements. >>> > >>>>>>>>>> - Discuss the pros and cons of the three potential >>> > >> granularities >>> > >>>>> for >>> > >>>>>>>>>> specifying the resource requirements (op, task and slot >>> > >> sharing >>> > >>>>>> group) >>> > >>>>>>>>>> and explain why we choose the slot sharing group. >>> > >>>>>>>>>> >>> > >>>>>>>>>> Please find more details in the FLIP wiki document [1]. >>> > >> Looking >>> > >>>>>>>>>> forward to your feedback. >>> > >>>>>>>>>> >>> > >>>>>>>>>> [1] >>> > >>>>>>>>>> >>> > >> >>> > https://cwiki.apache.org/confluence/display/FLINK/FLIP-156%3A+Runtime+Interfaces+for+Fine-Grained+Resource+Requirements >>> > <https://cwiki.apache.org/confluence/display/FLINK/FLIP-156%3A+Runtime+Interfaces+for+Fine-Grained+Resource+Requirements> >>> > >>>>>>>>>> Best, >>> > >>>>>>>>>> Yangze Guo >>> > >>>>>>>>>> >>> > >>>>>>>> >>> > >>> |
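As a closing illustration of the "hint, not restriction" semantics converged on above, here is a minimal sketch in plain Java of how a scheduler might interpret one SSG-level requirement. The types and method names are hypothetical (this is not Flink's scheduler API), and the fallback shown is only the "same resources as the whole group for every task" strategy mentioned in the thread.

import java.util.ArrayList;
import java.util.List;

// Hypothetical model of interpreting an SSG requirement as a hint; not Flink's scheduler API.
public class SsgHintInterpretation {

    record SlotRequest(String owner, double cpuCores, int memoryMb) {}

    // A scheduler that honours slot sharing groups: one group requirement
    // becomes exactly one slot request of the declared size.
    static List<SlotRequest> respectingSsg(String group, double cpuCores, int memoryMb) {
        return List.of(new SlotRequest(group, cpuCores, memoryMb));
    }

    // A scheduler that ignores slot sharing groups: one simple fallback from the
    // thread is to request, for every task, a slot as large as the whole group.
    static List<SlotRequest> ignoringSsg(List<String> tasks, double cpuCores, int memoryMb) {
        List<SlotRequest> requests = new ArrayList<>();
        for (String task : tasks) {
            requests.add(new SlotRequest(task, cpuCores, memoryMb));
        }
        return requests;
    }

    public static void main(String[] args) {
        List<String> tasks = List.of("source", "window", "sink");
        // The user declares 2 CPU cores / 2048 MB once, for the whole group.
        System.out.println(respectingSsg("ssg-1", 2.0, 2048));   // one slot request
        System.out.println(ignoringSsg(tasks, 2.0, 2048));       // one request per task
    }
}

An equal-split fallback (like the derivation sketch earlier in the thread) would trade the opposite way: smaller slots, but a risk of under-provisioning individual heavy tasks. Keeping the SSG requirement a hint leaves the runtime free to pick either strategy without changing the user-facing interface.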