Thanks for the summary, Yangze.
The changes and follow-up issues LGTM. Let's wait for responses from the others before starting a vote.

Thank you~

Xintong Song

On Tue, Jan 26, 2021 at 11:08 AM Yangze Guo <[hidden email]> wrote:

> Thanks everyone for the lively discussion. I'd like to try to
> summarize the current convergence in the discussion. Please let me
> know if I got things wrong or missed something crucial here.
>
> Change of this FLIP:
> - Treat the SSG resource requirements as a hint instead of a
> restriction for the runtime. That should be explicitly explained in
> the JavaDocs.
>
> Potential follow-up issues if needed:
> - Provide operator-level resource configuration interface.
> - Provide multiple options for deciding resources for SSGs whose
> requirement is not specified:
> ** Default slot resource.
> ** Default operator resource times number of operators.
>
> If there are no other issues, I'll update the FLIP accordingly and
> start a vote thread. Thanks all for the valuable feedback again.
>
> Best,
> Yangze Guo
>
> On Fri, Jan 22, 2021 at 11:30 AM Xintong Song <[hidden email]> wrote:
> >
> > FGRuntimeInterface.png
> >
> > Thank you~
> >
> > Xintong Song
> >
> > On Fri, Jan 22, 2021 at 11:11 AM Xintong Song <[hidden email]> wrote:
> >>
> >> I think Chesnay's proposal could actually work. IIUC, the key point is to derive operator requirements from SSG requirements on the API side, so that the runtime only deals with operator requirements. It's debatable how the deriving should be done though. E.g., an alternative could be to evenly divide the SSG requirement into requirements of operators in the group.
> >>
> >> However, I'm not entirely sure which option is more desired. I've illustrated my understanding in the following figure, with Chesnay's proposal on top and the SSG-based proposal of this FLIP on the bottom.
> >>
> >> I think the major difference between the two approaches is where deriving operator requirements from SSG requirements happens.
> >>
> >> - Chesnay's proposal simplifies the runtime logic and the interface to expose, at the price of moving more complexity (i.e. the deriving) to the API side. The question is, where do we prefer to keep the complexity? I'm slightly leaning towards having a thin API and keeping the complexity in the runtime if possible.
> >>
> >> - Notice that the dashed-line arrows represent optional steps that are needed only for schedulers that do not respect SSGs, which we don't have at the moment. If we only look at the solid-line arrows, then the SSG-based approach is much simpler, without needing to derive and aggregate the requirements back and forth. I'm not sure about complicating the current design only for potential future needs.
> >>
> >> Thank you~
> >>
> >> Xintong Song
> >>
> >> On Fri, Jan 22, 2021 at 7:35 AM Chesnay Schepler <[hidden email]> wrote:
> >>>
> >>> You're raising a good point, but I think I can rectify that with a minor
> >>> adjustment.
> >>>
> >>> Default requirements are whatever the default requirements are; setting
> >>> the requirements for one operator has no effect on other operators.
> >>>
> >>> With these rules, and some API enhancements, the following mockup would
> >>> replicate the SSG-based behavior:
> >>>
> >>> Map<SlotSharingGroupId, Requirements> requirements = ...
> >>> for slotSharingGroup in env.getSlotSharingGroups() {
> >>>     vertices = slotSharingGroup.getVertices()
> >>>     vertices.first().setRequirements(requirements.get(slotSharingGroup.getID()))
> >>>     vertices.remaining().setRequirements(ZERO)
> >>> }
> >>>
> >>> We could even allow setting requirements on slotsharing-groups /
> >>> colocation-groups and internally translate them accordingly.
> >>> I can't help but feel this is a plain API issue.
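The mockup above is pseudocode. A minimal runnable sketch of the same idea follows, assuming hypothetical `Requirements` and `Vertex` types and a plain dict from group id to vertices (none of these are Flink classes or APIs). It pushes each group's requirement onto the first vertex, sets the rest to zero, and shows that summing per group gives back the SSG requirement.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Requirements:
    """Hypothetical resource profile; only two dimensions for brevity."""
    cpu_cores: float = 0.0
    memory_mb: int = 0

    def __add__(self, other: "Requirements") -> "Requirements":
        return Requirements(self.cpu_cores + other.cpu_cores,
                            self.memory_mb + other.memory_mb)


ZERO = Requirements()


@dataclass
class Vertex:
    """Hypothetical stand-in for an operator/vertex in a slot sharing group."""
    name: str
    requirements: Requirements = ZERO


def push_down(groups: dict, group_requirements: dict) -> None:
    """Assign each group's requirement to its first vertex, ZERO to the rest."""
    for group_id, vertices in groups.items():
        if not vertices:
            continue
        first, *remaining = vertices
        first.requirements = group_requirements[group_id]
        for vertex in remaining:
            vertex.requirements = ZERO


def slot_requirement(vertices: list) -> Requirements:
    """What a scheduler would sum up for tasks it places in one slot."""
    total = ZERO
    for vertex in vertices:
        total = total + vertex.requirements
    return total


groups = {"ssg-1": [Vertex("source"), Vertex("map"), Vertex("sink")]}
push_down(groups, {"ssg-1": Requirements(cpu_cores=1.0, memory_mb=200)})
print(slot_requirement(groups["ssg-1"]))
```

As long as the whole group is scheduled into one slot, the sum equals the SSG requirement. If a scheduler split the group, only the first vertex would carry any resources, which is the concern Till raises about unspecified operators implicitly getting zero.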
> >>>
> >>> On 1/21/2021 9:44 AM, Till Rohrmann wrote:
> >>> > If I understand you correctly Chesnay, then you want to decouple the resource requirement specification from the slot sharing group assignment. Hence, per default all operators would be in the same slot sharing group. If there is no operator with a resource specification, then the system would allocate a default slot for it. If there is at least one operator, then the system would sum up all the specified resources and allocate a slot of this size. This effectively means that all unspecified operators will implicitly have a zero resource requirement. Did I understand your idea correctly?
> >>> >
> >>> > I am wondering whether this wouldn't lead to a surprising behaviour for the user. If the user specifies the resource requirements for a single operator, then he probably will assume that the other operators will get the default share of resources and not nothing.
> >>> >
> >>> > Cheers,
> >>> > Till
> >>> >
> >>> > On Thu, Jan 21, 2021 at 3:25 AM Chesnay Schepler <[hidden email]> wrote:
> >>> >
> >>> >     Is there even a functional difference between specifying the requirements for an SSG vs specifying the same requirements on a single operator within that group (ideally a colocation group to avoid this whole hint business)?
> >>> >
> >>> >     Wouldn't we get the best of both worlds in the latter case?
> >>> >
> >>> >     Users can take shortcuts to define shared requirements,
> >>> >     but refine them further as needed on a per-operator basis,
> >>> >     without changing semantics of slotsharing groups
> >>> >     nor the runtime being locked into SSG-based requirements.
> >>> >
> >>> >     (And before anyone argues what happens if slotsharing groups change or whatnot, that's a plain API issue that we could surely solve.
> >>> >     (A plain iteration over slotsharing groups and therein contained operators would suffice)).
> >>> >
> >>> >     On 1/20/2021 6:48 PM, Till Rohrmann wrote:
> >>> >     > Maybe a different minor idea: Would it be possible to treat the SSG resource requirements as a hint for the runtime similar to how slot sharing groups are designed at the moment? Meaning that we don't give the guarantee that Flink will always deploy this set of tasks together no matter what comes. If, for example, the runtime can derive by some means the resource requirements for each task based on the requirements for the SSG, this could be possible. One easy strategy would be to give every task the same resources as the whole slot sharing group. Another one could be distributing the resources equally among the tasks. This does not even have to be implemented but we would give ourselves the freedom to change scheduling if the need should arise.
> >>> >     >
> >>> >     > Cheers,
> >>> >     > Till
> >>> >     >
> >>> >     > On Wed, Jan 20, 2021 at 7:04 AM Yangze Guo <[hidden email]> wrote:
> >>> >     >
> >>> >     >> Thanks for the responses, Till and Xintong.
> >>> >     >>
> >>> >     >> I second Xintong's comment that the SSG-based runtime interface will give us the flexibility to achieve an op/task-based approach. That's one of the most important reasons for our design choice.
> >>> >     >>
> >>> >     >> Some cents regarding the default operator resource:
> >>> >     >> - It might be good for the scenario of DataStream jobs.
> >>> >     >>    ** For light-weight operators, the accumulated configuration error will not be significant. Then, the resources a task uses are proportional to the number of operators it contains.
> >>> >     >>    ** For heavy operators like join and window, or operators using external resources, users will turn to the fine-grained resource configuration.
> >>> >     >> - It can increase the stability for the standalone cluster where the registered task executors are heterogeneous (with different default slot resources).
> >>> >     >> - It might not be good for SQL users. The operators that SQL will be translated to are a black box to the user. We also do not guarantee the cross-version consistency of the transformation so far.
> >>> >     >>
> >>> >     >> I think it can be treated as follow-up work when the fine-grained resource management is end-to-end ready.
> >>> >     >>
> >>> >     >> Best,
> >>> >     >> Yangze Guo
> >>> >     >>
> >>> >     >> On Wed, Jan 20, 2021 at 11:16 AM Xintong Song <[hidden email]> wrote:
> >>> >     >>> Thanks for the feedback, Till.
> >>> >     >>>
> >>> >     >>> ## I feel that what you proposed (operator-based + default value) might be subsumed by the SSG-based approach.
> >>> >     >>> Thinking of op_1 -> op_2, there are the following 4 cases, categorized by whether the resource requirements are known to the users.
> >>> >     >>>
> >>> >     >>>    1. *Both known.* As previously mentioned, there's no reason to put multiple operators whose individual resource requirements are already known into the same group in fine-grained resource management. And if op_1 and op_2 are in different groups, there should be no problem switching data exchange mode from pipelined to blocking. This is equivalent to specifying operator resource requirements in your proposal.
> >>> >     >>>    2. *op_1 known, op_2 unknown.* Similar to 1), except that op_2 is in an SSG whose resource is not specified and thus would have the default slot resource. This is equivalent to having default operator resources in your proposal.
> >>> >     >>>    3. *Both unknown.* The user can either set op_1 and op_2 to the same SSG or separate SSGs.
> >>> >     >>>       - If op_1 and op_2 are in the same SSG, it will be equivalent to the coarse-grained resource management, where op_1 and op_2 share a default size slot no matter which data exchange mode is used.
> >>> >     >>>       - If op_1 and op_2 are in different SSGs, then each of them will use a default size slot. This is equivalent to setting them with default operator resources in your proposal.
> >>> >     >>>    4. *Total (pipeline) or max (blocking) of op_1 and op_2 is known.*
> >>> >     >>>       - It is possible that the user learns the total / max resource requirement from executing and monitoring the job, while not being aware of individual operator requirements.
> >>> >     >>>       - I believe this is the case your proposal does not cover. And TBH, this is probably how most users learn the resource requirements, in my experience.
> >>> >     >>>       - In this case, the user might need to specify different resources if he wants to switch the execution mode, which should not be worse than not being able to use fine-grained resource management.
> >>> >     >>>
> >>> >     >>> ## An additional idea inspired by your proposal.
> >>> >     >>> We may provide multiple options for deciding resources for SSGs whose requirement is not specified, if needed.
> >>> >     >>>
> >>> >     >>>    - Default slot resource (current design)
> >>> >     >>>    - Default operator resource times number of operators (equivalent to your proposal)
> >>> >     >>>
> >>> >     >>> ## Exposing internal runtime strategies
> >>> >     >>> Theoretically, yes. Being tied to the SSGs, the resource requirements might be affected if how SSGs are internally handled changes in future. Practically, I do not concretely see at the moment what kind of changes we may want in future that might conflict with this FLIP proposal, given that the question of switching data exchange mode is answered above. I'd suggest not giving up the user friendliness we may gain now for future problems that may or may not exist.
> >>> >     >>>
> >>> >     >>> Moreover, the SSG-based approach has the flexibility to achieve the equivalent behavior as the operator-based approach, if we set each operator (or task) to a separate SSG. We can even provide a shortcut option to automatically do that for users, if needed.
> >>> >     >>>
> >>> >     >>> Thank you~
> >>> >     >>>
> >>> >     >>> Xintong Song
> >>> >     >>>
> >>> >     >>> On Tue, Jan 19, 2021 at 11:48 PM Till Rohrmann <[hidden email]> wrote:
> >>> >     >>>> Thanks for the responses Xintong and Stephan,
> >>> >     >>>>
> >>> >     >>>> I agree that being able to define the resource requirements for a group of operators is more user friendly. However, my concern is that we are thereby exposing internal runtime strategies which might limit our flexibility to execute a given job.
> >>> >     >>>> Moreover, the semantics of configuring resource requirements for SSGs could break if switching from streaming to batch execution. If one defines the resource requirements for op_1 -> op_2 which run in pipelined mode when using the streaming execution, then how do we interpret these requirements when op_1 -> op_2 are executed with a blocking data exchange in batch execution mode? Consequently, I am still leaning towards Stephan's proposal to set the resource requirements per operator.
> >>> >     >>>>
> >>> >     >>>> Maybe the following proposal makes the configuration easier: If the user wants to use fine-grained resource requirements, then she needs to specify the default size which is used for operators which have no explicit resource annotation. If this holds true, then every operator would have a resource requirement and the system can try to execute the operators in the best possible manner w/o being constrained by how the user set the SSG requirements.
> >>> >     >>>>
> >>> >     >>>> Cheers,
> >>> >     >>>> Till
> >>> >     >>>>
> >>> >     >>>> On Tue, Jan 19, 2021 at 9:09 AM Xintong Song <[hidden email]> wrote:
> >>> >     >>>>
> >>> >     >>>>> Thanks for the feedback, Stephan.
> >>> >     >>>>>
> >>> >     >>>>> Actually, your proposal has also come to my mind at some point. And I have some concerns about it.
> >>> >     >>>>>
> >>> >     >>>>> 1. It does not give users the same control as the SSG-based approach.
> >>> >     >>>>>
> >>> >     >>>>> While neither approach requires specifying resources for every operator, the SSG-based approach supports the semantic that "some operators together use this much resource" while the operator-based approach doesn't.
> >>> >     >>>>>
> >>> >     >>>>> Think of a long pipeline with m operators (o_1, o_2, ..., o_m), and at some point there's an agg o_n (1 < n < m) which significantly reduces the data amount. One can separate the pipeline into 2 groups SSG_1 (o_1, ..., o_n) and SSG_2 (o_n+1, ..., o_m), so that configuring much higher parallelisms for operators in SSG_1 than for operators in SSG_2 won't lead to too much wasting of resources. If the two SSGs end up needing different resources, with the SSG-based approach one can directly specify resources for the two groups. However, with the operator-based approach, the user will have to specify resources for each operator in one of the two groups, and tune the default slot resource via configurations to fit the other group.
> >>> >     >>>>>
> >>> >     >>>>> 2. It increases the chance of breaking operator chains.
> >>> >     >>>>>
> >>> >     >>>>> Setting chainable operators into different slot sharing groups will prevent them from being chained.
> >>> >     >>>>> In the current implementation, downstream operators, if their SSG is not explicitly specified, will be set to the same group as the chainable upstream operators (unless there are multiple upstream operators in different groups), to reduce the chance of breaking chains.
> >>> >     >>>>>
> >>> >     >>>>> Thinking of chainable operators o_1 -> o_2 -> o_3 -> o_4, deciding SSGs based on whether resource is specified we will easily get groups like (o_1, o_3) & (o_2, o_4), where none of the operators can be chained. This is also possible for the SSG-based approach, but I believe the chance is much smaller because there's no strong reason for users to specify the groups with alternate operators like that. We are more likely to get groups like (o_1, o_2) & (o_3, o_4), where the chain breaks only between o_2 and o_3.
> >>> >     >>>>>
> >>> >     >>>>> 3. It complicates the system by having two different mechanisms for sharing managed memory in a slot.
> >>> >     >>>>>
> >>> >     >>>>> - In FLIP-141, we introduced the intra-slot managed memory sharing mechanism, where managed memory is first distributed according to the consumer type, then further distributed across operators of that consumer type.
> >>> >     >>>>>
> >>> >     >>>>> - With the operator-based approach, managed memory size specified for an operator should account for all the consumer types of that operator.
> >>> >     >>>>> That means the managed memory is first distributed across operators, then distributed to different consumer types of each operator.
> >>> >     >>>>>
> >>> >     >>>>> Unfortunately, the different order of the two calculation steps can lead to different results. To be specific, the semantics of the configuration option `consumer-weights` change (within a slot vs. within an operator).
> >>> >     >>>>>
> >>> >     >>>>> To sum things up:
> >>> >     >>>>>
> >>> >     >>>>> While (3) might be a bit more implementation related, I think (1) and (2) somehow suggest that the price for the proposed approach to avoid specifying resource for every operator is that it's not as independent from operator chaining and slot sharing as the operator-based approach discussed in the FLIP.
> >>> >     >>>>>
> >>> >     >>>>> Thank you~
> >>> >     >>>>>
> >>> >     >>>>> Xintong Song
> >>> >     >>>>>
> >>> >     >>>>> On Tue, Jan 19, 2021 at 4:29 AM Stephan Ewen <[hidden email]> wrote:
> >>> >     >>>>>> Thanks a lot, Yangze and Xintong for this FLIP.
> >>> >     >>>>>>
> >>> >     >>>>>> I want to say, first of all, that this is super well written. And the points that the FLIP makes about how to expose the configuration to users is exactly the right thing to figure out first.
> >>> >     >>>>>> So good job here!
> >>> >     >>>>>>
> >>> >     >>>>>> About how to let users specify the resource profiles.
> >>> >     >>>>>> If I can sum the FLIP and previous discussion up in my own words, the problem is the following:
> >>> >     >>>>>>
> >>> >     >>>>>>> Operator-level specification is the simplest and cleanest approach, because it avoids mixing operator configuration (resource) and scheduling. No matter what other parameters change (chaining, slot sharing, switching pipelined and blocking shuffles), the resource profiles stay the same.
> >>> >     >>>>>>> But it would require that a user specifies resources on all operators, which makes it hard to use. That's why the FLIP suggests going with specifying resources on a Sharing-Group.
> >>> >     >>>>>>
> >>> >     >>>>>> I think both thoughts are important, so can we find a solution where the Resource Profiles are specified on an Operator, but we still avoid that we need to specify a resource profile on every operator?
> >>> >     >>>>>>
> >>> >     >>>>>> What do you think about something like the following:
> >>> >     >>>>>>    - Resource Profiles are specified on an operator level.
> >>> >     >>>>>>    - Not all operators need profiles.
> >>> >     >>>>>>    - All Operators without a Resource Profile end up in the default slot sharing group with a default profile (will get a default slot).
> >>> >     >>>>>>    - All Operators with a Resource Profile will go into another slot sharing group (the resource-specified-group).
> >>> >     >>>>>>    - Users can define different slot sharing groups for operators like they do now, with the exception that you cannot mix operators that have a resource profile and operators that have no resource profile.
> >>> >     >>>>>>    - The default case where no operator has a resource profile is just a special case of this model.
> >>> >     >>>>>>    - The chaining logic sums up the profiles per operator, like it does now, and the scheduler sums up the profiles of the tasks that it schedules together.
> >>> >     >>>>>>
> >>> >     >>>>>> There is another question about reactive scaling raised in the FLIP. I need to think a bit about that. That is indeed a bit more tricky once we have slots of different sizes.
> >>> >     >>>>>> It is not clear then which of the different slot requests the ResourceManager should fulfill when new resources (TMs) show up, or how the JobManager redistributes the slot resources when resources (TMs) disappear.
> >>> >     >>>>>> This question is pretty orthogonal, though, to the "how to specify the resources".
> >>> >     >>>>>>
> >>> >     >>>>>> Best,
> >>> >     >>>>>> Stephan
> >>> >     >>>>>>
> >>> >     >>>>>> On Fri, Jan 8, 2021 at 5:14 AM Xintong Song <[hidden email]> wrote:
> >>> >     >>>>>>> Thanks for drafting the FLIP and driving the discussion, Yangze.
> >>> >     >>>>>>> And thanks for the feedback, Till and Chesnay.
> >>> >     >>>>>>>
> >>> >     >>>>>>> @Till,
> >>> >     >>>>>>>
> >>> >     >>>>>>> I agree that specifying requirements for SSGs means that SSGs need to be supported in fine-grained resource management, otherwise each operator might use as many resources as the whole group. However, I cannot think of a strong reason for not supporting SSGs in fine-grained resource management.
> >>> >     >>>>>>>
> >>> >     >>>>>>>> Interestingly, if all operators have their resources properly specified, then slot sharing is no longer needed because Flink could slice off the appropriately sized slots for every Task individually.
> >>> >     >>>>>>>>
> >>> >     >>>>>>>> So for example, if we have a job consisting of two operators op_1 and op_2 where each op needs 100 MB of memory, we would then say that the slot sharing group needs 200 MB of memory to run. If we have a cluster with 2 TMs with one slot of 100 MB each, then the system cannot run this job. If the resources were specified on an operator level, then the system could still make the decision to deploy op_1 to TM_1 and op_2 to TM_2.
> >>> >     >>>>>>>
> >>> >     >>>>>>> Couldn't agree more that if all operators' requirements are properly specified, slot sharing should no longer be needed. I think this exactly disproves the example. If we already know op_1 and op_2 each need 100 MB of memory, why would we put them in the same group?
> >>> >     >>>>>>> If they are in separate groups, with the proposed approach the system can freely deploy them to either a 200 MB TM or two 100 MB TMs.
> >>> >     >>>>>>>
> >>> >     >>>>>>> Moreover, the precondition for not needing slot sharing is having resource requirements properly specified for all operators. This is not always possible, and usually requires tremendous effort. One of the benefits of SSG-based requirements is that it allows the user to freely decide the granularity, and thus the effort they want to spend. I would consider an SSG in fine-grained resource management as a group of operators that the user would like to specify the total resource for. There can be only one group in the job, 2~3 groups dividing the job into a few major parts, or as many groups as the number of tasks/operators, depending on how fine-grained the user is able to specify the resources.
> >>> >     >>>>>>>
> >>> >     >>>>>>> Having to support SSGs might be a constraint. But given that all the current scheduler implementations already support SSGs, I tend to think of that as an acceptable price for the above discussed usability and flexibility.
> >>> >     >>>>>>>
> >>> >     >>>>>>> @Chesnay
> >>> >     >>>>>>>
> >>> >     >>>>>>>> Will declaring them on slot sharing groups not also waste resources if the parallelism of operators within that group are different?
> >>> >     >>>>>>>>
> >>> >     >>>>>>> Yes.
> >>> >     >>>>>>> It's a trade-off between usability and resource utilization. To avoid such wasting, the user can define more groups, so that each group contains fewer operators and the chance of having operators with different parallelism will be reduced. The price is to have more resource requirements to specify.
> >>> >     >>>>>>>
> >>> >     >>>>>>>> It also seems like quite a hassle for users having to recalculate the resource requirements if they change the slot sharing.
> >>> >     >>>>>>>> I'd think that it's not really workable for users that create a set of re-usable operators which are mixed and matched in their applications; managing the resource requirements in such a setting would be a nightmare, and in the end would require operator-level requirements anyway.
> >>> >     >>>>>>>> In that sense, I'm not even sure whether it really increases usability.
> >>> >     >>>>>>>
> >>> >     >>>>>>>    - As mentioned in my reply to Till's comment, there's no reason to put multiple operators whose individual resource requirements are already known into the same group in fine-grained resource management.
> >>> >     >>>>>>>    - Even if an operator implementation is reused for multiple applications, it does not guarantee the same resource requirements.
> >>> >     >>>>>>>      During our years of practice at Alibaba, with per-operator requirements specified for Blink's fine-grained resource management, very few users (including our specialists who are dedicated to supporting Blink users) are experienced enough to accurately predict/estimate the operator resource requirements. Most people rely on the execution-time metrics (throughput, delay, CPU load, memory usage, GC pressure, etc.) to improve the specification.
> >>> >     >>>>>>>
> >>> >     >>>>>>> To sum up:
> >>> >     >>>>>>> If the user is capable of providing proper resource requirements for every operator, that's definitely a good thing and we would not need to rely on the SSGs. However, that shouldn't be a *must* for the fine-grained resource management to work. For those users who are capable and do not like having to set each operator to a separate SSG, I would be ok to have both SSG-based and operator-based runtime interfaces and to only fall back to the SSG requirements when the operator requirements are not specified. However, as the first step, I think we should prioritise the use cases where users are not that experienced.
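The fallback order described above (operator-level requirements first, SSG requirements second, default slot resource last) can be sketched as a small resolution function. This is only an illustration under stated assumptions: the function name and parameters are hypothetical, only memory is modelled, operator-level values are used only when every operator in the group has one, and the operators are assumed to run pipelined so their requirements add up.

```python
from typing import Optional, Sequence


def resolve_slot_memory_mb(
    operator_mb: Sequence[Optional[int]],
    ssg_mb: Optional[int],
    default_slot_mb: int,
) -> int:
    """Pick the memory requirement for a slot hosting one slot sharing group.

    operator_mb: per-operator requirement, None where the user gave none.
    ssg_mb: requirement set on the group as a whole, or None.
    default_slot_mb: the cluster's default slot size.
    """
    # 1. Operator-level: only if every operator in the group is specified.
    if operator_mb and all(mb is not None for mb in operator_mb):
        return sum(operator_mb)  # pipelined assumption: requirements add up
    # 2. Otherwise fall back to the requirement on the group.
    if ssg_mb is not None:
        return ssg_mb
    # 3. Neither given: coarse-grained behaviour, a default-size slot.
    return default_slot_mb


print(resolve_slot_memory_mb([100, 100], ssg_mb=300, default_slot_mb=512))
print(resolve_slot_memory_mb([100, None], ssg_mb=300, default_slot_mb=512))
print(resolve_slot_memory_mb([None, None], ssg_mb=None, default_slot_mb=512))
```

How to treat a group where only some operators are specified is a design choice the thread leaves open; this sketch simply ignores partial operator-level input and uses the group value.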
> >>> >     >>>>>>>
> >>> >     >>>>>>> Thank you~
> >>> >     >>>>>>>
> >>> >     >>>>>>> Xintong Song
> >>> >     >>>>>>>
> >>> >     >>>>>>> On Thu, Jan 7, 2021 at 9:55 PM Chesnay Schepler <[hidden email]> wrote:
> >>> >     >>>>>>>
> >>> >     >>>>>>>> Will declaring them on slot sharing groups not also waste resources if the parallelism of operators within that group are different?
> >>> >     >>>>>>>>
> >>> >     >>>>>>>> It also seems like quite a hassle for users having to recalculate the resource requirements if they change the slot sharing.
> >>> >     >>>>>>>> I'd think that it's not really workable for users that create a set of re-usable operators which are mixed and matched in their applications; managing the resource requirements in such a setting would be a nightmare, and in the end would require operator-level requirements anyway.
> >>> >     >>>>>>>> In that sense, I'm not even sure whether it really increases usability.
> >>> >     >>>>>>>>
> >>> >     >>>>>>>> My main worry is that if we wire the runtime to work on SSGs it's gonna be difficult to implement more fine-grained approaches, which would not be the case if, for the runtime, they are always defined on an operator level.
> >>> >     >>>>>>>>
> >>> >     >>>>>>>> On 1/7/2021 2:42 PM, Till Rohrmann wrote:
> >>> >     >>>>>>>>> Thanks for drafting this FLIP and starting this discussion Yangze.
> >>> >     >>>>>>>>>
> >>> >     >>>>>>>>> I like that defining resource requirements on a slot sharing group makes the overall setup easier and improves usability of resource requirements.
> >>> > >>>>>>>>> What I do not like about it is that it changes slot > sharing > >>> > >>>> groups > >>> > >>>>>> from > >>> > >>>>>>>>> being a scheduling hint to something which needs to be > >>> > >> supported > >>> > >>>> in > >>> > >>>>>>> order > >>> > >>>>>>>>> to support fine grained resource requirements. So far, > the > >>> > >> idea > >>> > >>>> of > >>> > >>>>>> slot > >>> > >>>>>>>>> sharing groups was that it tells the system that a set > of > >>> > >>>> operators > >>> > >>>>>> can > >>> > >>>>>>>> be > >>> > >>>>>>>>> deployed in the same slot. But the system still had the > >>> > >> freedom > >>> > >>>> to > >>> > >>>>>> say > >>> > >>>>>>>> that > >>> > >>>>>>>>> it would rather place these tasks in different slots > if it > >>> > >>>> wanted. > >>> > >>>>> If > >>> > >>>>>>> we > >>> > >>>>>>>>> now specify resource requirements on a per slot sharing > >>> > >> group, > >>> > >>>> then > >>> > >>>>>> the > >>> > >>>>>>>>> only option for a scheduler which does not support slot > >>> > >> sharing > >>> > >>>>>> groups > >>> > >>>>>>> is > >>> > >>>>>>>>> to say that every operator in this slot sharing group > >>> > needs a > >>> > >>>> slot > >>> > >>>>>> with > >>> > >>>>>>>> the > >>> > >>>>>>>>> same resources as the whole group. > >>> > >>>>>>>>> > >>> > >>>>>>>>> So for example, if we have a job consisting of two > operator > >>> > >> op_1 > >>> > >>>>> and > >>> > >>>>>>> op_2 > >>> > >>>>>>>>> where each op needs 100 MB of memory, we would then > say that > >>> > >> the > >>> > >>>>> slot > >>> > >>>>>>>>> sharing group needs 200 MB of memory to run. If we > have a > >>> > >> cluster > >>> > >>>>>> with > >>> > >>>>>>> 2 > >>> > >>>>>>>>> TMs with one slot of 100 MB each, then the system > cannot run > >>> > >> this > >>> > >>>>>> job. 
> >>> > >>>>>>> If > >>> > >>>>>>>>> the resources were specified on an operator level, > then the > >>> > >>>> system > >>> > >>>>>>> could > >>> > >>>>>>>>> still make the decision to deploy op_1 to TM_1 and > op_2 to > >>> > >> TM_2. > >>> > >>>>>>>>> Originally, one of the primary goals of slot sharing > groups > >>> > >> was > >>> > >>>> to > >>> > >>>>>> make > >>> > >>>>>>>> it > >>> > >>>>>>>>> easier for the user to reason about how many slots a > job > >>> > >> needs > >>> > >>>>>>>> independent > >>> > >>>>>>>>> of the actual number of operators in the job. > Interestingly, > >>> > >> if > >>> > >>>> all > >>> > >>>>>>>>> operators have their resources properly specified, > then slot > >>> > >>>>> sharing > >>> > >>>>>> is > >>> > >>>>>>>> no > >>> > >>>>>>>>> longer needed because Flink could slice off the > >>> > appropriately > >>> > >>>> sized > >>> > >>>>>>> slots > >>> > >>>>>>>>> for every Task individually. What matters is whether > the > >>> > >> whole > >>> > >>>>>> cluster > >>> > >>>>>>>> has > >>> > >>>>>>>>> enough resources to run all tasks or not. > >>> > >>>>>>>>> > >>> > >>>>>>>>> Cheers, > >>> > >>>>>>>>> Till > >>> > >>>>>>>>> > >>> > >>>>>>>>> On Thu, Jan 7, 2021 at 4:08 AM Yangze Guo < > >>> > >> [hidden email] <mailto:[hidden email]>> > >>> > >>>>>> wrote: > >>> > >>>>>>>>>> Hi, there, > >>> > >>>>>>>>>> > >>> > >>>>>>>>>> We would like to start a discussion thread on > "FLIP-156: > >>> > >> Runtime > >>> > >>>>>>>>>> Interfaces for Fine-Grained Resource Requirements"[1], > >>> > >> where we > >>> > >>>>>>>>>> propose Slot Sharing Group (SSG) based runtime > interfaces > >>> > >> for > >>> > >>>>>>>>>> specifying fine-grained resource requirements. > >>> > >>>>>>>>>> > >>> > >>>>>>>>>> In this FLIP: > >>> > >>>>>>>>>> - Expound the user story of fine-grained resource > >>> > >> management. > >>> > >>>>>>>>>> - Propose runtime interfaces for specifying SSG-based > >>> > >> resource > >>> > >>>>>>>>>> requirements. 
> >>> > >>>>>>>>>> - Discuss the pros and cons of the three potential > >>> > >> granularities > >>> > >>>>> for > >>> > >>>>>>>>>> specifying the resource requirements (op, task and > slot > >>> > >> sharing > >>> > >>>>>> group) > >>> > >>>>>>>>>> and explain why we choose the slot sharing group. > >>> > >>>>>>>>>> > >>> > >>>>>>>>>> Please find more details in the FLIP wiki document > [1]. > >>> > >> Looking > >>> > >>>>>>>>>> forward to your feedback. > >>> > >>>>>>>>>> > >>> > >>>>>>>>>> [1] > >>> > >>>>>>>>>> > >>> > >> > >>> > > https://cwiki.apache.org/confluence/display/FLINK/FLIP-156%3A+Runtime+Interfaces+for+Fine-Grained+Resource+Requirements > >>> > < > https://cwiki.apache.org/confluence/display/FLINK/FLIP-156%3A+Runtime+Interfaces+for+Fine-Grained+Resource+Requirements > > > >>> > >>>>>>>>>> Best, > >>> > >>>>>>>>>> Yangze Guo > >>> > >>>>>>>>>> > >>> > >>>>>>>> > >>> > > >>> > |
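Till's op_1/op_2 example quoted above can be made concrete with a small feasibility check. The sketch below is illustrative only: `SlotFitSketch` and `canPlace` are hypothetical names, not Flink APIs, and it assumes that each request must occupy one whole pre-sized slot and that memory in MB is the only resource.

```java
import java.util.Arrays;

public class SlotFitSketch {
    /**
     * Returns true if every request (in MB) can get its own slot (in MB).
     * Assumption: one request per slot; slots cannot be merged or split.
     */
    public static boolean canPlace(int[] requestsMb, int[] slotsMb) {
        if (requestsMb.length > slotsMb.length) {
            return false;
        }
        int[] requests = requestsMb.clone();
        int[] slots = slotsMb.clone();
        Arrays.sort(requests);
        Arrays.sort(slots);
        // Pair the k-th largest request with the k-th largest slot.
        for (int k = 1; k <= requests.length; k++) {
            if (requests[requests.length - k] > slots[slots.length - k]) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        int[] cluster = {100, 100}; // two TMs, one 100 MB slot each
        // One SSG holding op_1 and op_2: a single 200 MB request.
        System.out.println(canPlace(new int[] {200}, cluster));      // false
        // Operator-level requirements: two independent 100 MB requests.
        System.out.println(canPlace(new int[] {100, 100}, cluster)); // true
    }
}
```

Under these assumptions the 200 MB group request has no slot to land in, while the two 100 MB operator requests each fit a TM.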
Thanks for summarizing the discussion, Yangze. I agree that setting resource requirements per operator is not very user friendly. Moreover, I couldn't come up with a different proposal which would be as easy to use and wouldn't expose internal scheduling details. In fact, following this argument then we shouldn't have exposed the slot sharing groups in the first place.

What is important for the user is that we properly document the limitations and constraints the fine grained resource specification has. For example, we should explain how optimizations like chaining are affected by it and how different execution modes (batch vs. streaming) affect the execution of operators which have specified resources. These things shouldn't become part of the contract of this feature and are more caused by internal implementation details but it will be important to understand these things properly in order to use this feature effectively.

Hence, +1 for starting the vote for this FLIP.

Cheers,
Till

On Tue, Jan 26, 2021 at 4:37 AM Xintong Song <[hidden email]> wrote:

> Thanks for the summary, Yangze.
>
> The changes and follow-up issues LGTM. Let's wait for responses from the others before starting a vote.
>
> Thank you~
>
> Xintong Song
>
> On Tue, Jan 26, 2021 at 11:08 AM Yangze Guo <[hidden email]> wrote:
>
>> Thanks everyone for the lively discussion. I'd like to try to summarize the current convergence in the discussion. Please let me know if I got things wrong or missed something crucial here.
>>
>> Change of this FLIP:
>> - Treat the SSG resource requirements as a hint instead of a restriction for the runtime. That should be explicitly explained in the JavaDocs.
>>
>> Potential follow-up issues if needed:
>> - Provide operator-level resource configuration interface.
>> - Provide multiple options for deciding resources for SSGs whose requirement is not specified:
>> ** Default slot resource.
>> ** Default operator resource times number of operators.
>>
>> If there are no other issues, I'll update the FLIP accordingly and start a vote thread. Thanks all for the valuable feedback again.
>>
>> Best,
>> Yangze Guo
>>
>> On Fri, Jan 22, 2021 at 11:30 AM Xintong Song <[hidden email]> wrote:
>>>
>>> FGRuntimeInterface.png
>>>
>>> Thank you~
>>>
>>> Xintong Song
>>>
>>> On Fri, Jan 22, 2021 at 11:11 AM Xintong Song <[hidden email]> wrote:
>>>>
>>>> I think Chesnay's proposal could actually work. IIUC, the keypoint is to derive operator requirements from SSG requirements on the API side, so that the runtime only deals with operator requirements. It's debatable how the deriving should be done though. E.g., an alternative could be to evenly divide the SSG requirement into requirements of operators in the group.
>>>>
>>>> However, I'm not entirely sure which option is more desired. Illustrating my understanding in the following figure, in which on the top is Chesnay's proposal and on the bottom is the SSG-based proposal in this FLIP.
>>>>
>>>> I think the major difference between the two approaches is where deriving operator requirements from SSG requirements happens.
>>>>
>>>> - Chesnay's proposal simplifies the runtime logic and the interface to expose, at the price of moving more complexity (i.e. the deriving) to the API side. The question is, where do we prefer to keep the complexity? I'm slightly leaning towards having a thin API and keep the complexity in runtime if possible.
>>>>
>>>> - Notice that the dash line arrows represent optional steps that are needed only for schedulers that do not respect SSGs, which we don't have at the moment. If we only look at the solid line arrows, then the SSG-based approach is much simpler, without needing to derive and aggregate the requirements back and forth. I'm not sure about complicating the current design only for the potential future needs.
>>>>
>>>> Thank you~
>>>>
>>>> Xintong Song
>>>>
>>>> On Fri, Jan 22, 2021 at 7:35 AM Chesnay Schepler <[hidden email]> wrote:
>>>>>
>>>>> You're raising a good point, but I think I can rectify that with a minor adjustment.
>>>>>
>>>>> Default requirements are whatever the default requirements are, setting the requirements for one operator has no effect on other operators.
>>>>>
>>>>> With these rules, and some API enhancements, the following mockup would replicate the SSG-based behavior:
>>>>>
>>>>> Map<SlotSharingGroupId, Requirements> requirements = ...
>>>>> for slotSharingGroup in env.getSlotSharingGroups() {
>>>>>     vertices = slotSharingGroup.getVertices()
>>>>>     vertices.first().setRequirements(requirements.get(slotSharingGroup.getID()))
>>>>>     vertices.remaining().setRequirements(ZERO)
>>>>> }
>>>>>
>>>>> We could even allow setting requirements on slotsharing-groups colocation-groups and internally translate them accordingly.
>>>>> I can't help but feel this is a plain API issue.
>>>>>
>>>>> On 1/21/2021 9:44 AM, Till Rohrmann wrote:
>>>>>> If I understand you correctly Chesnay, then you want to decouple the resource requirement specification from the slot sharing group assignment. Hence, per default all operators would be in the same slot sharing group. If there is no operator with a resource specification, then the system would allocate a default slot for it. If there is at least one operator, then the system would sum up all the specified resources and allocate a slot of this size. This effectively means that all unspecified operators will implicitly have a zero resource requirement. Did I understand your idea correctly?
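The sizing rule Till describes in this paragraph can be written down in a few lines. This is only a sketch of his reading of Chesnay's proposal, not Flink code: `SharedSlotSizeSketch`, `slotSizeMb` and the 1024 MB default are hypothetical, and memory in MB stands in for a full resource profile.

```java
import java.util.Arrays;
import java.util.List;
import java.util.OptionalInt;

public class SharedSlotSizeSketch {
    /**
     * Size of the single shared slot for operators in one slot sharing group.
     * No operator specified: use the default slot.
     * At least one specified: sum the specified values; the rest count as zero.
     */
    public static int slotSizeMb(List<OptionalInt> operatorRequirementsMb, int defaultSlotMb) {
        boolean anySpecified = false;
        int sumMb = 0;
        for (OptionalInt requirement : operatorRequirementsMb) {
            if (requirement.isPresent()) {
                anySpecified = true;
                sumMb += requirement.getAsInt();
            }
        }
        return anySpecified ? sumMb : defaultSlotMb;
    }

    public static void main(String[] args) {
        int defaultSlotMb = 1024; // hypothetical default slot size
        List<OptionalInt> noneSpecified =
                Arrays.asList(OptionalInt.empty(), OptionalInt.empty(), OptionalInt.empty());
        List<OptionalInt> oneSpecified =
                Arrays.asList(OptionalInt.of(300), OptionalInt.empty(), OptionalInt.empty());
        System.out.println(slotSizeMb(noneSpecified, defaultSlotMb)); // 1024
        System.out.println(slotSizeMb(oneSpecified, defaultSlotMb));  // 300
    }
}
```

With one operator annotated at 300 MB, the shared slot goes from the default size to 300 MB, so the two unannotated operators contribute nothing to the slot size.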
>>>>>>
>>>>>> I am wondering whether this wouldn't lead to a surprising behaviour for the user. If the user specifies the resource requirements for a single operator, then he probably will assume that the other operators will get the default share of resources and not nothing.
>>>>>>
>>>>>> Cheers,
>>>>>> Till
>>>>>>
>>>>>> On Thu, Jan 21, 2021 at 3:25 AM Chesnay Schepler <[hidden email]> wrote:
>>>>>>
>>>>>>     Is there even a functional difference between specifying the requirements for an SSG vs specifying the same requirements on a single operator within that group (ideally a colocation group to avoid this whole hint business)?
>>>>>>
>>>>>>     Wouldn't we get the best of both worlds in the latter case?
>>>>>>
>>>>>>     Users can take shortcuts to define shared requirements, but refine them further as needed on a per-operator basis, without changing semantics of slotsharing groups nor the runtime being locked into SSG-based requirements.
>>>>>>
>>>>>>     (And before anyone argues what happens if slotsharing groups change or whatnot, that's a plain API issue that we could surely solve. (A plain iteration over slotsharing groups and therein contained operators would suffice)).
>>>>>>
>>>>>>     On 1/20/2021 6:48 PM, Till Rohrmann wrote:
>>>>>>> Maybe a different minor idea: Would it be possible to treat the SSG resource requirements as a hint for the runtime similar to how slot sharing groups are designed at the moment? Meaning that we don't give the guarantee that Flink will always deploy this set of tasks together no matter what comes. If, for example, the runtime can derive by some means the resource requirements for each task based on the requirements for the SSG, this could be possible. One easy strategy would be to give every task the same resources as the whole slot sharing group. Another one could be distributing the resources equally among the tasks. This does not even have to be implemented but we would give ourselves the freedom to change scheduling if need should arise.
>>>>>>>
>>>>>>> Cheers,
>>>>>>> Till
>>>>>>>
>>>>>>> On Wed, Jan 20, 2021 at 7:04 AM Yangze Guo <[hidden email]> wrote:
>>>>>>>
>>>>>>>> Thanks for the responses, Till and Xintong.
>>>>>>>>
>>>>>>>> I second Xintong's comment that SSG-based runtime interface will give us the flexibility to achieve op/task-based approach. That's one of the most important reasons for our design choice.
>>>>>>>>
>>>>>>>> Some cents regarding the default operator resource:
>>>>>>>> - It might be good for the scenario of DataStream jobs.
>>>>>>>> ** For light-weight operators, the accumulative configuration error will not be significant. Then, the resource of a task used is proportional to the number of operators it contains.
>>>>>>>> ** For heavy operators like join and window or operators using the external resources, user will turn to the fine-grained resource configuration.
>>>>>>>> - It can increase the stability for the standalone cluster where task executors registered are heterogeneous (with different default slot resources).
>>>>>>>> - It might not be good for SQL users. The operators that SQL will be transferred to is a black box to the user. We also do not guarantee the cross-version of consistency of the transformation so far.
>>>>>>>>
>>>>>>>> I think it can be treated as a follow-up work when the fine-grained resource management is end-to-end ready.
>>>>>>>>
>>>>>>>> Best,
>>>>>>>> Yangze Guo
>>>>>>>>
>>>>>>>> On Wed, Jan 20, 2021 at 11:16 AM Xintong Song <[hidden email]> wrote:
>>>>>>>>> Thanks for the feedback, Till.
>>>>>>>>>
>>>>>>>>> ## I feel that what you proposed (operator-based + default value) might be subsumed by the SSG-based approach.
>>>>>>>>> Thinking of op_1 -> op_2, there are the following 4 cases, categorized by whether the resource requirements are known to the users.
>>>>>>>>>
>>>>>>>>> 1. *Both known.* As previously mentioned, there's no reason to put multiple operators whose individual resource requirements are already known into the same group in fine-grained resource management. And if op_1 and op_2 are in different groups, there should be no problem switching data exchange mode from pipelined to blocking. This is equivalent to specifying operator resource requirements in your proposal.
>>>>>>>>> 2. *op_1 known, op_2 unknown.* Similar to 1), except that op_2 is in a SSG whose resource is not specified thus would have the default slot resource. This is equivalent to having default operator resources in your proposal.
>>>>>>>>> 3. *Both unknown*.
>>>>>>>>> The user can either set op_1 and op_2 to the same SSG or separate SSGs.
>>>>>>>>>    - If op_1 and op_2 are in the same SSG, it will be equivalent to the coarse-grained resource management, where op_1 and op_2 share a default size slot no matter which data exchange mode is used.
>>>>>>>>>    - If op_1 and op_2 are in different SSGs, then each of them will use a default size slot. This is equivalent to setting them with default operator resources in your proposal.
>>>>>>>>> 4. *Total (pipeline) or max (blocking) of op_1 and op_2 is known.*
>>>>>>>>>    - It is possible that the user learns the total / max resource requirement from executing and monitoring the job, while not being aware of individual operator requirements.
>>>>>>>>>    - I believe this is the case your proposal does not cover. And TBH, this is probably how most users learn the resource requirements, according to my experiences.
>>>>>>>>>    - In this case, the user might need to specify different resources if he wants to switch the execution mode, which should not be worse than not being able to use fine-grained resource management.
>>>>>>>>>
>>>>>>>>> ## An additional idea inspired by your proposal.
>>>>>>>>> We may provide multiple options for deciding resources for SSGs whose requirement is not specified, if needed.
>>>>>>>>>
>>>>>>>>>    - Default slot resource (current design)
>>>>>>>>>    - Default operator resource times number of operators (equivalent to your proposal)
>>>>>>>>>
>>>>>>>>> ## Exposing internal runtime strategies
>>>>>>>>> Theoretically, yes. Tying to the SSGs, the resource requirements might be affected if how SSGs are internally handled changes in future. Practically, I do not concretely see at the moment what kind of changes we may want in future that might conflict with this FLIP proposal, as the question of switching data exchange mode answered above. I'd suggest to not give up the user friendliness we may gain now for the future problems that may or may not exist.
>>>>>>>>>
>>>>>>>>> Moreover, the SSG-based approach has the flexibility to achieve the equivalent behavior as the operator-based approach, if we set each operator (or task) to a separate SSG. We can even provide a shortcut option to automatically do that for users, if needed.
>>>>>>>>>
>>>>>>>>> Thank you~
>>>>>>>>>
>>>>>>>>> Xintong Song
>>>>>>>>>
>>>>>>>>> On Tue, Jan 19, 2021 at 11:48 PM Till Rohrmann <[hidden email]> wrote:
>>>>>>>>>> Thanks for the responses Xintong and Stephan,
>>>>>>>>>>
>>>>>>>>>> I agree that being able to define the resource requirements for a group of operators is more user friendly. However, my concern is that we are exposing thereby internal runtime strategies which might limit our flexibility to execute a given job. Moreover, the semantics of configuring resource requirements for SSGs could break if switching from streaming to batch execution. If one defines the resource requirements for op_1 -> op_2 which run in pipelined mode when using the streaming execution, then how do we interpret these requirements when op_1 -> op_2 are executed with a blocking data exchange in batch execution mode? Consequently, I am still leaning towards Stephan's proposal to set the resource requirements per operator.
>>>>>>>>>>
>>>>>>>>>> Maybe the following proposal makes the configuration easier: If the user wants to use fine-grained resource requirements, then she needs to specify the default size which is used for operators which have no explicit resource annotation. If this holds true, then every operator would have a resource requirement and the system can try to execute the operators in the best possible manner w/o being constrained by how the user set the SSG requirements.
>>>>>>>>>>
>>>>>>>>>> Cheers,
>>>>>>>>>> Till
>>>>>>>>>>
>>>>>>>>>> On Tue, Jan 19, 2021 at 9:09 AM Xintong Song <[hidden email]> wrote:
>>>>>>>>>>
>>>>>>>>>>> Thanks for the feedback, Stephan.
>>>>>>>>>>>
>>>>>>>>>>> Actually, your proposal has also come to my mind at some point. And I have some concerns about it.
>>>>>>>>>>>
>>>>>>>>>>> 1. It does not give users the same control as the SSG-based approach.
>>>>>>>>>>>
>>>>>>>>>>> While both approaches do not require specifying for each operator, SSG-based approach supports the semantic that "some operators together use this much resource" while the operator-based approach doesn't.
>>>>>>>>>>>
>>>>>>>>>>> Think of a long pipeline with m operators (o_1, o_2, ..., o_m), and at some point there's an agg o_n (1 < n < m) which significantly reduces the data amount. One can separate the pipeline into 2 groups SSG_1 (o_1, ..., o_n) and SSG_2 (o_n+1, ... o_m), so that configuring much higher parallelisms for operators in SSG_1 than for operators in SSG_2 won't lead to too much wasting of resources. If the two SSGs end up needing different resources, with the SSG-based approach one can directly specify resources for the two groups. However, with the operator-based approach, the user will have to specify resources for each operator in one of the two groups, and tune the default slot resource via configurations to fit the other group.
>>>>>>>>>>>
>>>>>>>>>>> 2. It increases the chance of breaking operator chains.
>>>>>>>>>>>
>>>>>>>>>>> Setting chainable operators into different slot sharing groups will prevent them from being chained. In the current implementation, downstream operators, if SSG not explicitly specified, will be set to the same group as the chainable upstream operators (unless multiple upstream operators in different groups), to reduce the chance of breaking chains.
>>>>>>>>>>>
>>>>>>>>>>> Thinking of chainable operators o_1 -> o_2 -> o_3 -> o_4, deciding SSGs based on whether resource is specified we will easily get groups like (o_1, o_3) & (o_2, o_4), where none of the operators can be chained. This is also possible for the SSG-based approach, but I believe the chance is much smaller because there's no strong reason for users to specify the groups with alternate operators like that. We are more likely to get groups like (o_1, o_2) & (o_3, o_4), where the chain breaks only between o_2 and o_3.
>>>>>>>>>>>
>>>>>>>>>>> 3. It complicates the system by having two different mechanisms for sharing managed memory in a slot.
>>>>>>>>>>>
>>>>>>>>>>> - In FLIP-141, we introduced the intra-slot managed memory sharing mechanism, where managed memory is first distributed according to the consumer type, then further distributed across operators of that consumer type.
>>>>>>>>>>>
>>>>>>>>>>> - With the operator-based approach, managed memory size specified for an operator should account for all the consumer types of that operator. That means the managed memory is first distributed across operators, then distributed to different consumer types of each operator.
>>>>>>>>>>>
>>>>>>>>>>> Unfortunately, the different order of the two calculation steps can lead to different results. To be specific, the semantic of the configuration option `consumer-weights` changed (within a slot vs. within an operator).
>>>>>>>>>>>
>>>>>>>>>>> To sum up things:
>>>>>>>>>>>
>>>>>>>>>>> While (3) might be a bit more implementation related, I think (1) and (2) somehow suggest that, the price for the proposed approach to avoid specifying resource for every operator is that it's not as independent from operator chaining and slot sharing as the operator-based approach discussed in the FLIP.
>>>>>>>>>>>
>>>>>>>>>>> Thank you~
>>>>>>>>>>>
>>>>>>>>>>> Xintong Song
>>>>>>>>>>>
>>>>>>>>>>> On Tue, Jan 19, 2021 at 4:29 AM Stephan Ewen <[hidden email]> wrote:
>>>>>>>>>>>> Thanks a lot, Yangze and Xintong for this FLIP.
>>>>>>>>>>>>
>>>>>>>>>>>> I want to say, first of all, that this is super well written. And the points that the FLIP makes about how to expose the configuration to users is exactly the right thing to figure out first.
>>>>>>>>>>>> So good job here!
>>>>>>>>>>>>
>>>>>>>>>>>> About how to let users specify the resource profiles.
>>>>>>>>>>>> If I can sum the FLIP and previous discussion up in my own words, the problem is the following:
>>>>>>>>>>>>
>>>>>>>>>>>>> Operator-level specification is the simplest and cleanest approach, because it avoids mixing operator configuration (resource) and scheduling. No matter what other parameters change (chaining, slot sharing, switching pipelined and blocking shuffles), the resource profiles stay the same.
>>>>>>>>>>>>> But it would require that a user specifies resources on all operators, which makes it hard to use. That's why the FLIP suggests going with specifying resources on a Sharing-Group.
>>>>>>>>>>>>
>>>>>>>>>>>> I think both thoughts are important, so can we find a solution where the Resource Profiles are specified on an Operator, but we still avoid that we need to specify a resource profile on every operator?
>>>>>>>>>>>>
>>>>>>>>>>>> What do you think about something like the following:
>>>>>>>>>>>>   - Resource Profiles are specified on an operator level.
>>>>>>>>>>>>   - Not all operators need profiles
>>>>>>>>>>>>   - All Operators without a Resource Profile end up in the default slot sharing group with a default profile (will get a default slot).
>>>>>>>>>>>>   - All Operators with a Resource Profile will go into another slot sharing group (the resource-specified-group).
>>>>>>>>>>>>   - Users can define different slot sharing groups for operators like they do now, with the exception that you cannot mix operators that have a resource profile and operators that have no resource profile.
>>>>>>>>>>>>   - The default case where no operator has a resource profile is just a special case of this model
>>>>>>>>>>>>   - The chaining logic sums up the profiles per operator, like it does now, and the scheduler sums up the profiles of the tasks that it schedules together.
>>>>>>>>>>>>
>>>>>>>>>>>> There is another question about reactive scaling raised in the FLIP. I need to think a bit about that. That is indeed a bit more tricky once we have slots of different sizes.
>>>>>>>>>>>> It is not clear then which of the different slot requests the ResourceManager should fulfill when new resources (TMs) show up, or how the JobManager redistributes the slots resources when resources (TMs) disappear.
>>>>>>>>>>>> This question is pretty orthogonal, though, to the "how to specify the resources".
>>>>>>>>>>>>
>>>>>>>>>>>> Best,
>>>>>>>>>>>> Stephan
>>>>>>>>>>>>
>>>>>>>>>>>> On Fri, Jan 8, 2021 at 5:14 AM Xintong Song <[hidden email]> wrote:
>>>>>>>>>>>>> Thanks for drafting the FLIP and driving the discussion, Yangze.
>>>>>>>>>>>>> And Thanks for the feedback, Till and Chesnay.
>>>>>>>>>>>>>
>>>>>>>>>>>>> @Till,
>>>>>>>>>>>>>
>>>>>>>>>>>>> I agree that specifying requirements for SSGs means that SSGs need to be supported in fine-grained resource management, otherwise each operator might use as many resources as the whole group. However, I cannot think of a strong reason for not supporting SSGs in fine-grained resource management.
>>>>>>>>>>>>>
>>>>>>>>>>>>>> Interestingly, if all operators have their resources properly specified, then slot sharing is no longer needed because Flink could slice off the appropriately sized slots for every Task individually.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> So for example, if we have a job consisting of two operators op_1 and op_2 where each op needs 100 MB of memory, we would then say that the slot sharing group needs 200 MB of memory to run. If we have a cluster with 2 TMs with one slot of 100 MB each, then the system cannot run this job. If the resources were specified on an operator level, then the system could still make the decision to deploy op_1 to TM_1 and op_2 to TM_2.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Couldn't agree more that if all operators' requirements are properly specified, slot sharing should be no longer needed. I think this exactly disproves the example.
If we already know op_1 and op_2 > > each > > >>> > >> needs > > >>> > >>>> 100 > > >>> > >>>>> MB > > >>> > >>>>>>> of memory, why would we put them in the same group? If > > >>> > they are > > >>> > >> in > > >>> > >>>>>> separate > > >>> > >>>>>>> groups, with the proposed approach the system can > freely > > >>> > deploy > > >>> > >> them > > >>> > >>>> to > > >>> > >>>>>>> either a 200 MB TM or two 100 MB TMs. > > >>> > >>>>>>> > > >>> > >>>>>>> Moreover, the precondition for not needing slot sharing > > is > > >>> > having > > >>> > >>>>>> resource > > >>> > >>>>>>> requirements properly specified for all operators. This > > is not > > >>> > >> always > > >>> > >>>>>>> possible, and usually requires tremendous efforts. One > > of the > > >>> > >>>> benefits > > >>> > >>>>>> for > > >>> > >>>>>>> SSG-based requirements is that it allows the user to > > freely > > >>> > >> decide > > >>> > >>>> the > > >>> > >>>>>>> granularity, thus efforts they want to pay. I would > > >>> > consider SSG > > >>> > >> in > > >>> > >>>>>>> fine-grained resource management as a group of > operators > > >>> > that the > > >>> > >>>> user > > >>> > >>>>>>> would like to specify the total resource for. There can > > be > > >>> > only > > >>> > >> one > > >>> > >>>>> group > > >>> > >>>>>>> in the job, 2~3 groups dividing the job into a few > major > > >>> > parts, > > >>> > >> or as > > >>> > >>>>>> many > > >>> > >>>>>>> groups as the number of tasks/operators, depending on > how > > >>> > >>>> fine-grained > > >>> > >>>>>> the > > >>> > >>>>>>> user is able to specify the resources. > > >>> > >>>>>>> > > >>> > >>>>>>> Having to support SSGs might be a constraint. But given > > >>> > that all > > >>> > >> the > > >>> > >>>>>>> current scheduler implementations already support > SSGs, I > > >>> > tend to > > >>> > >>>> think > > >>> > >>>>>>> that as an acceptable price for the above discussed > > >>> > usability and > > >>> > >>>>>>> flexibility. 
> > >>> > >>>>>>> > > >>> > >>>>>>> @Chesnay > > >>> > >>>>>>> > > >>> > >>>>>>> Will declaring them on slot sharing groups not also > waste > > >>> > >> resources > > >>> > >>>> if > > >>> > >>>>>> the > > >>> > >>>>>>>> parallelism of operators within that group are > > different? > > >>> > >>>>>>>> > > >>> > >>>>>>> Yes. It's a trade-off between usability and resource > > >>> > >> utilization. To > > >>> > >>>>>> avoid > > >>> > >>>>>>> such wasting, the user can define more groups, so that > > >>> > each group > > >>> > >>>>>> contains > > >>> > >>>>>>> less operators and the chance of having operators with > > >>> > different > > >>> > >>>>>>> parallelism will be reduced. The price is to have more > > >>> > resource > > >>> > >>>>>>> requirements to specify. > > >>> > >>>>>>> > > >>> > >>>>>>> It also seems like quite a hassle for users having to > > >>> > >> recalculate the > > >>> > >>>>>>>> resource requirements if they change the slot sharing. > > >>> > >>>>>>>> I'd think that it's not really workable for users that > > create > > >>> > >> a set > > >>> > >>>>> of > > >>> > >>>>>>>> re-usable operators which are mixed and matched in > their > > >>> > >>>>> applications; > > >>> > >>>>>>>> managing the resources requirements in such a setting > > >>> > would be > > >>> > >> a > > >>> > >>>>>>>> nightmare, and in the end would require operator-level > > >>> > >> requirements > > >>> > >>>>> any > > >>> > >>>>>>>> way. > > >>> > >>>>>>>> In that sense, I'm not even sure whether it really > > increases > > >>> > >>>>> usability. > > >>> > >>>>>>> - As mentioned in my reply to Till's comment, > > there's no > > >>> > >> reason to > > >>> > >>>>> put > > >>> > >>>>>>> multiple operators whose individual resource > > >>> > requirements are > > >>> > >>>>> already > > >>> > >>>>>>> known > > >>> > >>>>>>> into the same group in fine-grained resource > > management. 
> > >>> > >>>>>>> - Even an operator implementation is reused for > > multiple > > >>> > >>>>> applications, > > >>> > >>>>>>> it does not guarantee the same resource > requirements. > > >>> > During > > >>> > >> our > > >>> > >>>>> years > > >>> > >>>>>>> of > > >>> > >>>>>>> practices in Alibaba, with per-operator > requirements > > >>> > >> specified for > > >>> > >>>>>>> Blink's > > >>> > >>>>>>> fine-grained resource management, very few users > > >>> > (including > > >>> > >> our > > >>> > >>>>>>> specialists > > >>> > >>>>>>> who are dedicated to supporting Blink users) are as > > >>> > >> experienced as > > >>> > >>>>> to > > >>> > >>>>>>> accurately predict/estimate the operator resource > > >>> > >> requirements. > > >>> > >>>> Most > > >>> > >>>>>>> people > > >>> > >>>>>>> rely on the execution-time metrics (throughput, > > delay, cpu > > >>> > >> load, > > >>> > >>>>>> memory > > >>> > >>>>>>> usage, GC pressure, etc.) to improve the > > specification. > > >>> > >>>>>>> > > >>> > >>>>>>> To sum up: > > >>> > >>>>>>> If the user is capable of providing proper resource > > >>> > requirements > > >>> > >> for > > >>> > >>>>>> every > > >>> > >>>>>>> operator, that's definitely a good thing and we would > not > > >>> > need to > > >>> > >>>> rely > > >>> > >>>>> on > > >>> > >>>>>>> the SSGs. However, that shouldn't be a *must* for the > > >>> > >> fine-grained > > >>> > >>>>>> resource > > >>> > >>>>>>> management to work. For those users who are capable and > > do not > > >>> > >> like > > >>> > >>>>>> having > > >>> > >>>>>>> to set each operator to a separate SSG, I would be ok > to > > have > > >>> > >> both > > >>> > >>>>>>> SSG-based and operator-based runtime interfaces and to > > only > > >>> > >> fallback > > >>> > >>>> to > > >>> > >>>>>> the > > >>> > >>>>>>> SSG requirements when the operator requirements are not > > >>> > >> specified. 
> > >>> > >>>>>> However, > > >>> > >>>>>>> as the first step, I think we should prioritise the use > > cases > > >>> > >> where > > >>> > >>>>> users > > >>> > >>>>>>> are not that experienced. > > >>> > >>>>>>> > > >>> > >>>>>>> Thank you~ > > >>> > >>>>>>> > > >>> > >>>>>>> Xintong Song > > >>> > >>>>>>> > > >>> > >>>>>>> On Thu, Jan 7, 2021 at 9:55 PM Chesnay Schepler < > > >>> > >> [hidden email] <mailto:[hidden email]>> > > >>> > >>>>>>> wrote: > > >>> > >>>>>>> > > >>> > >>>>>>>> Will declaring them on slot sharing groups not also > > waste > > >>> > >> resources > > >>> > >>>>> if > > >>> > >>>>>>>> the parallelism of operators within that group are > > different? > > >>> > >>>>>>>> > > >>> > >>>>>>>> It also seems like quite a hassle for users having to > > >>> > >> recalculate > > >>> > >>>> the > > >>> > >>>>>>>> resource requirements if they change the slot sharing. > > >>> > >>>>>>>> I'd think that it's not really workable for users that > > create > > >>> > >> a set > > >>> > >>>>> of > > >>> > >>>>>>>> re-usable operators which are mixed and matched in > their > > >>> > >>>>> applications; > > >>> > >>>>>>>> managing the resources requirements in such a setting > > >>> > would be > > >>> > >> a > > >>> > >>>>>>>> nightmare, and in the end would require operator-level > > >>> > >> requirements > > >>> > >>>>> any > > >>> > >>>>>>>> way. > > >>> > >>>>>>>> In that sense, I'm not even sure whether it really > > increases > > >>> > >>>>> usability. > > >>> > >>>>>>>> My main worry is that it if we wire the runtime to > work > > >>> > on SSGs > > >>> > >>>> it's > > >>> > >>>>>>>> gonna be difficult to implement more fine-grained > > approaches, > > >>> > >> which > > >>> > >>>>>>>> would not be the case if, for the runtime, they are > > always > > >>> > >> defined > > >>> > >>>> on > > >>> > >>>>>> an > > >>> > >>>>>>>> operator-level. 
> > >>> > >>>>>>>> > > >>> > >>>>>>>> On 1/7/2021 2:42 PM, Till Rohrmann wrote: > > >>> > >>>>>>>>> Thanks for drafting this FLIP and starting this > > discussion > > >>> > >>>> Yangze. > > >>> > >>>>>>>>> I like that defining resource requirements on a slot > > sharing > > >>> > >>>> group > > >>> > >>>>>>> makes > > >>> > >>>>>>>>> the overall setup easier and improves usability of > > resource > > >>> > >>>>>>> requirements. > > >>> > >>>>>>>>> What I do not like about it is that it changes slot > > sharing > > >>> > >>>> groups > > >>> > >>>>>> from > > >>> > >>>>>>>>> being a scheduling hint to something which needs to > be > > >>> > >> supported > > >>> > >>>> in > > >>> > >>>>>>> order > > >>> > >>>>>>>>> to support fine grained resource requirements. So > far, > > the > > >>> > >> idea > > >>> > >>>> of > > >>> > >>>>>> slot > > >>> > >>>>>>>>> sharing groups was that it tells the system that a > set > > of > > >>> > >>>> operators > > >>> > >>>>>> can > > >>> > >>>>>>>> be > > >>> > >>>>>>>>> deployed in the same slot. But the system still had > the > > >>> > >> freedom > > >>> > >>>> to > > >>> > >>>>>> say > > >>> > >>>>>>>> that > > >>> > >>>>>>>>> it would rather place these tasks in different slots > > if it > > >>> > >>>> wanted. > > >>> > >>>>> If > > >>> > >>>>>>> we > > >>> > >>>>>>>>> now specify resource requirements on a per slot > sharing > > >>> > >> group, > > >>> > >>>> then > > >>> > >>>>>> the > > >>> > >>>>>>>>> only option for a scheduler which does not support > slot > > >>> > >> sharing > > >>> > >>>>>> groups > > >>> > >>>>>>> is > > >>> > >>>>>>>>> to say that every operator in this slot sharing group > > >>> > needs a > > >>> > >>>> slot > > >>> > >>>>>> with > > >>> > >>>>>>>> the > > >>> > >>>>>>>>> same resources as the whole group. 
> > >>> > >>>>>>>>> > > >>> > >>>>>>>>> So for example, if we have a job consisting of two > > operator > > >>> > >> op_1 > > >>> > >>>>> and > > >>> > >>>>>>> op_2 > > >>> > >>>>>>>>> where each op needs 100 MB of memory, we would then > > say that > > >>> > >> the > > >>> > >>>>> slot > > >>> > >>>>>>>>> sharing group needs 200 MB of memory to run. If we > > have a > > >>> > >> cluster > > >>> > >>>>>> with > > >>> > >>>>>>> 2 > > >>> > >>>>>>>>> TMs with one slot of 100 MB each, then the system > > cannot run > > >>> > >> this > > >>> > >>>>>> job. > > >>> > >>>>>>> If > > >>> > >>>>>>>>> the resources were specified on an operator level, > > then the > > >>> > >>>> system > > >>> > >>>>>>> could > > >>> > >>>>>>>>> still make the decision to deploy op_1 to TM_1 and > > op_2 to > > >>> > >> TM_2. > > >>> > >>>>>>>>> Originally, one of the primary goals of slot sharing > > groups > > >>> > >> was > > >>> > >>>> to > > >>> > >>>>>> make > > >>> > >>>>>>>> it > > >>> > >>>>>>>>> easier for the user to reason about how many slots a > > job > > >>> > >> needs > > >>> > >>>>>>>> independent > > >>> > >>>>>>>>> of the actual number of operators in the job. > > Interestingly, > > >>> > >> if > > >>> > >>>> all > > >>> > >>>>>>>>> operators have their resources properly specified, > > then slot > > >>> > >>>>> sharing > > >>> > >>>>>> is > > >>> > >>>>>>>> no > > >>> > >>>>>>>>> longer needed because Flink could slice off the > > >>> > appropriately > > >>> > >>>> sized > > >>> > >>>>>>> slots > > >>> > >>>>>>>>> for every Task individually. What matters is whether > > the > > >>> > >> whole > > >>> > >>>>>> cluster > > >>> > >>>>>>>> has > > >>> > >>>>>>>>> enough resources to run all tasks or not. 
> > >>> > >>>>>>>>>
> > >>> > >>>>>>>>> Cheers,
> > >>> > >>>>>>>>> Till
> > >>> > >>>>>>>>>
> > >>> > >>>>>>>>> On Thu, Jan 7, 2021 at 4:08 AM Yangze Guo <[hidden email]> wrote:
> > >>> > >>>>>>>>>> Hi, there,
> > >>> > >>>>>>>>>>
> > >>> > >>>>>>>>>> We would like to start a discussion thread on "FLIP-156: Runtime Interfaces for Fine-Grained Resource Requirements"[1], where we propose Slot Sharing Group (SSG) based runtime interfaces for specifying fine-grained resource requirements.
> > >>> > >>>>>>>>>>
> > >>> > >>>>>>>>>> In this FLIP:
> > >>> > >>>>>>>>>> - Expound the user story of fine-grained resource management.
> > >>> > >>>>>>>>>> - Propose runtime interfaces for specifying SSG-based resource requirements.
> > >>> > >>>>>>>>>> - Discuss the pros and cons of the three potential granularities for specifying the resource requirements (op, task and slot sharing group) and explain why we choose the slot sharing group.
> > >>> > >>>>>>>>>>
> > >>> > >>>>>>>>>> Please find more details in the FLIP wiki document [1]. Looking forward to your feedback.
> > >>> > >>>>>>>>>>
> > >>> > >>>>>>>>>> [1] https://cwiki.apache.org/confluence/display/FLINK/FLIP-156%3A+Runtime+Interfaces+for+Fine-Grained+Resource+Requirements
> > >>> > >>>>>>>>>>
> > >>> > >>>>>>>>>> Best,
> > >>> > >>>>>>>>>> Yangze Guo
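For reference, Till's op_1/op_2 example quoted above can be checked with a small sketch. This is a minimal Python illustration, not Flink code: the `first_fit` helper, the request names, and the simplification that every request occupies exactly one slot are assumptions made for this example only.

```python
from typing import Dict, List, Optional


def first_fit(requests_mb: Dict[str, int],
              slots_mb: List[int]) -> Optional[Dict[str, int]]:
    """Give every request its own slot; return request -> slot index, or None.

    Hypothetical helper: requests are handled largest first, and each one
    takes the smallest free slot that is still large enough.
    """
    free = dict(enumerate(slots_mb))  # slot index -> slot size in MB
    placement: Dict[str, int] = {}
    for name, need in sorted(requests_mb.items(), key=lambda kv: -kv[1]):
        candidates = [(size, idx) for idx, size in free.items() if size >= need]
        if not candidates:
            return None  # no free slot can hold this request
        _, idx = min(candidates)
        placement[name] = idx
        del free[idx]
    return placement


# Two TMs with one 100 MB slot each, as in the example.
cluster = [100, 100]

# Requirement declared on the slot sharing group: one 200 MB request.
ssg_level = first_fit({"ssg(op_1+op_2)": 200}, cluster)

# Requirements declared per operator: two 100 MB requests.
op_level = first_fit({"op_1": 100, "op_2": 100}, cluster)

print(ssg_level)  # None: no single slot is large enough
print(op_level)   # each operator lands in its own 100 MB slot
```

Under the same assumptions, putting op_1 and op_2 into two separate groups of 100 MB each (Xintong's counter-argument) produces the same placement as the per-operator case, and a single 200 MB slot would satisfy the group-level request.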
Thanks for the replies, Till and Xintong!

I've updated the FLIP with the following changes:
- Edited the JavaDoc of the proposed StreamGraphGenerator#setSlotSharingGroupResource.
- Added a "Future Plan" section, which contains the potential follow-up issues and the limitations to be documented when fine-grained resource management is exposed to users.

I'll start a vote in another thread.

Best,
Yangze Guo

On Fri, Jan 29, 2021 at 10:07 PM Till Rohrmann <[hidden email]> wrote:
>
> Thanks for summarizing the discussion, Yangze. I agree that setting resource requirements per operator is not very user friendly. Moreover, I couldn't come up with a different proposal which would be as easy to use and wouldn't expose internal scheduling details. In fact, following this argument then we shouldn't have exposed the slot sharing groups in the first place.
>
> What is important for the user is that we properly document the limitations and constraints the fine grained resource specification has. For example, we should explain how optimizations like chaining are affected by it and how different execution modes (batch vs. streaming) affect the execution of operators which have specified resources. These things shouldn't become part of the contract of this feature and are more caused by internal implementation details but it will be important to understand these things properly in order to use this feature effectively.
>
> Hence, +1 for starting the vote for this FLIP.
>
> Cheers,
> Till
>
> On Tue, Jan 26, 2021 at 4:37 AM Xintong Song <[hidden email]> wrote:
>
> > Thanks for the summary, Yangze.
> >
> > The changes and follow-up issues LGTM. Let's wait for responses from the others before starting a vote.
> >
> > Thank you~
> >
> > Xintong Song
> >
> > On Tue, Jan 26, 2021 at 11:08 AM Yangze Guo <[hidden email]> wrote:
> >
> > > Thanks everyone for the lively discussion. I'd like to try to summarize the current convergence in the discussion.
> > > Please let me know if I got things wrong or missed something crucial here.
> > >
> > > Change of this FLIP:
> > > - Treat the SSG resource requirements as a hint instead of a restriction for the runtime. That should be explicitly explained in the JavaDocs.
> > >
> > > Potential follow-up issues if needed:
> > > - Provide operator-level resource configuration interface.
> > > - Provide multiple options for deciding resources for SSGs whose requirement is not specified:
> > > ** Default slot resource.
> > > ** Default operator resource times number of operators.
> > >
> > > If there are no other issues, I'll update the FLIP accordingly and start a vote thread. Thanks all for the valuable feedback again.
> > >
> > > Best,
> > > Yangze Guo
> > >
> > > On Fri, Jan 22, 2021 at 11:30 AM Xintong Song <[hidden email]> wrote:
> > > >
> > > > FGRuntimeInterface.png
> > > >
> > > > Thank you~
> > > >
> > > > Xintong Song
> > > >
> > > > On Fri, Jan 22, 2021 at 11:11 AM Xintong Song <[hidden email]> wrote:
> > > >>
> > > >> I think Chesnay's proposal could actually work. IIUC, the keypoint is to derive operator requirements from SSG requirements on the API side, so that the runtime only deals with operator requirements. It's debatable how the deriving should be done though. E.g., an alternative could be to evenly divide the SSG requirement into requirements of operators in the group.
> > > >>
> > > >> However, I'm not entirely sure which option is more desired. Illustrating my understanding in the following figure, in which on the top is Chesnay's proposal and on the bottom is the SSG-based proposal in this FLIP.
> > > >>
> > > >> I think the major difference between the two approaches is where deriving operator requirements from SSG requirements happens.
> > > >>
> > > >> - Chesnay's proposal simplifies the runtime logic and the interface to expose, at the price of moving more complexity (i.e. the deriving) to the API side. The question is, where do we prefer to keep the complexity? I'm slightly leaning towards having a thin API and keep the complexity in runtime if possible.
> > > >>
> > > >> - Notice that the dash line arrows represent optional steps that are needed only for schedulers that do not respect SSGs, which we don't have at the moment. If we only look at the solid line arrows, then the SSG-based approach is much simpler, without needing to derive and aggregate the requirements back and forth. I'm not sure about complicating the current design only for the potential future needs.
> > > >>
> > > >> Thank you~
> > > >>
> > > >> Xintong Song
> > > >>
> > > >> On Fri, Jan 22, 2021 at 7:35 AM Chesnay Schepler <[hidden email]> wrote:
> > > >>>
> > > >>> You're raising a good point, but I think I can rectify that with a minor adjustment.
> > > >>>
> > > >>> Default requirements are whatever the default requirements are, setting the requirements for one operator has no effect on other operators.
> > > >>>
> > > >>> With these rules, and some API enhancements, the following mockup would replicate the SSG-based behavior:
> > > >>>
> > > >>> Map<SlotSharingGroupId, Requirements> requirements = ...
> > > >>> for slotSharingGroup in env.getSlotSharingGroups() {
> > > >>>     vertices = slotSharingGroup.getVertices()
> > > >>>     vertices.first().setRequirements(requirements.get(slotSharingGroup.getID()))
> > > >>>     vertices.remaining().setRequirements(ZERO)
> > > >>> }
> > > >>>
> > > >>> We could even allow setting requirements on slotsharing-groups colocation-groups and internally translate them accordingly.
> > > >>> I can't help but feel this is a plain API issue. > > > >>> > > > >>> On 1/21/2021 9:44 AM, Till Rohrmann wrote: > > > >>> > If I understand you correctly Chesnay, then you want to decouple > > the > > > >>> > resource requirement specification from the slot sharing group > > > >>> > assignment. Hence, per default all operators would be in the same > > > slot > > > >>> > sharing group. If there is no operator with a resource > > specification, > > > >>> > then the system would allocate a default slot for it. If there is > > at > > > >>> > least one operator, then the system would sum up all the specified > > > >>> > resources and allocate a slot of this size. This effectively means > > > >>> > that all unspecified operators will implicitly have a zero resource > > > >>> > requirement. Did I understand your idea correctly? > > > >>> > > > > >>> > I am wondering whether this wouldn't lead to a surprising behaviour > > > >>> > for the user. If the user specifies the resource requirements for a > > > >>> > single operator, then he probably will assume that the other > > > operators > > > >>> > will get the default share of resources and not nothing. > > > >>> > > > > >>> > Cheers, > > > >>> > Till > > > >>> > > > > >>> > On Thu, Jan 21, 2021 at 3:25 AM Chesnay Schepler < > > [hidden email] > > > >>> > <mailto:[hidden email]>> wrote: > > > >>> > > > > >>> > Is there even a functional difference between specifying the > > > >>> > requirements for an SSG vs specifying the same requirements on > > a > > > >>> > single > > > >>> > operator within that group (ideally a colocation group to avoid > > > this > > > >>> > whole hint business)? > > > >>> > > > > >>> > Wouldn't we get the best of both worlds in the latter case? 
> > > >>> > > > > >>> > Users can take shortcuts to define shared requirements, > > > >>> > but refine them further as needed on a per-operator basis, > > > >>> > without changing semantics of slotsharing groups > > > >>> > nor the runtime being locked into SSG-based requirements. > > > >>> > > > > >>> > (And before anyone argues what happens if slotsharing groups > > > >>> > change or > > > >>> > whatnot, that's a plain API issue that we could surely solve. > > (A > > > >>> > plain > > > >>> > iteration over slotsharing groups and therein contained > > operators > > > >>> > would > > > >>> > suffice)). > > > >>> > > > > >>> > On 1/20/2021 6:48 PM, Till Rohrmann wrote: > > > >>> > > Maybe a different minor idea: Would it be possible to treat > > > the SSG > > > >>> > > resource requirements as a hint for the runtime similar to > > how > > > >>> > slot sharing > > > >>> > > groups are designed at the moment? Meaning that we don't give > > > >>> > the guarantee > > > >>> > > that Flink will always deploy this set of tasks together no > > > >>> > matter what > > > >>> > > comes. If, for example, the runtime can derive by some means > > > the > > > >>> > resource > > > >>> > > requirements for each task based on the requirements for the > > > >>> > SSG, this > > > >>> > > could be possible. One easy strategy would be to give every > > > task > > > >>> > the same > > > >>> > > resources as the whole slot sharing group. Another one could > > be > > > >>> > > distributing the resources equally among the tasks. This does > > > >>> > not even have > > > >>> > > to be implemented but we would give ourselves the freedom to > > > change > > > >>> > > scheduling if need should arise. > > > >>> > > > > > >>> > > Cheers, > > > >>> > > Till > > > >>> > > > > > >>> > > On Wed, Jan 20, 2021 at 7:04 AM Yangze Guo < > > [hidden email] > > > >>> > <mailto:[hidden email]>> wrote: > > > >>> > > > > > >>> > >> Thanks for the responses, Till and Xintong. 
> > > >>> > >> > > > >>> > >> I second Xintong's comment that SSG-based runtime interface > > > >>> > will give > > > >>> > >> us the flexibility to achieve op/task-based approach. That's > > > one of > > > >>> > >> the most important reasons for our design choice. > > > >>> > >> > > > >>> > >> Some cents regarding the default operator resource: > > > >>> > >> - It might be good for the scenario of DataStream jobs. > > > >>> > >> ** For light-weight operators, the accumulative > > > >>> > configuration error > > > >>> > >> will not be significant. Then, the resource of a task used > > is > > > >>> > >> proportional to the number of operators it contains. > > > >>> > >> ** For heavy operators like join and window or operators > > > >>> > using the > > > >>> > >> external resources, user will turn to the fine-grained > > > resource > > > >>> > >> configuration. > > > >>> > >> - It can increase the stability for the standalone cluster > > > >>> > where task > > > >>> > >> executors registered are heterogeneous(with different > > default > > > slot > > > >>> > >> resources). > > > >>> > >> - It might not be good for SQL users. The operators that SQL > > > >>> > will be > > > >>> > >> transferred to is a black box to the user. We also do not > > > guarantee > > > >>> > >> the cross-version of consistency of the transformation so > > far. > > > >>> > >> > > > >>> > >> I think it can be treated as a follow-up work when the > > > fine-grained > > > >>> > >> resource management is end-to-end ready. > > > >>> > >> > > > >>> > >> Best, > > > >>> > >> Yangze Guo > > > >>> > >> > > > >>> > >> > > > >>> > >> On Wed, Jan 20, 2021 at 11:16 AM Xintong Song > > > >>> > <[hidden email] <mailto:[hidden email]>> > > > >>> > >> wrote: > > > >>> > >>> Thanks for the feedback, Till. > > > >>> > >>> > > > >>> > >>> ## I feel that what you proposed (operator-based + default > > > >>> > value) might > > > >>> > >> be > > > >>> > >>> subsumed by the SSG-based approach. 
> > > >>> > >>> Thinking of op_1 -> op_2, there are the following 4 cases, > > > >>> > categorized by > > > >>> > >>> whether the resource requirements are known to the users. > > > >>> > >>> > > > >>> > >>> 1. *Both known.* As previously mentioned, there's no > > > >>> > reason to put > > > >>> > >>> multiple operators whose individual resource > > requirements > > > >>> > are already > > > >>> > >> known > > > >>> > >>> into the same group in fine-grained resource > > management. > > > >>> > And if op_1 > > > >>> > >> and > > > >>> > >>> op_2 are in different groups, there should be no > > problem > > > >>> > switching > > > >>> > >> data > > > >>> > >>> exchange mode from pipelined to blocking. This is > > > >>> > equivalent to > > > >>> > >> specifying > > > >>> > >>> operator resource requirements in your proposal. > > > >>> > >>> 2. *op_1 known, op_2 unknown.* Similar to 1), except > > that > > > >>> > op_2 is in a > > > >>> > >>> SSG whose resource is not specified thus would have the > > > >>> > default slot > > > >>> > >>> resource. This is equivalent to having default operator > > > >>> > resources in > > > >>> > >> your > > > >>> > >>> proposal. > > > >>> > >>> 3. *Both unknown*. The user can either set op_1 and > > op_2 > > > >>> > to the same > > > >>> > >> SSG > > > >>> > >>> or separate SSGs. > > > >>> > >>> - If op_1 and op_2 are in the same SSG, it will be > > > >>> > equivalent to > > > >>> > >> the > > > >>> > >>> coarse-grained resource management, where op_1 and > > > op_2 > > > >>> > share a > > > >>> > >> default > > > >>> > >>> size slot no matter which data exchange mode is > > used. > > > >>> > >>> - If op_1 and op_2 are in different SSGs, then each > > of > > > >>> > them will > > > >>> > >> use > > > >>> > >>> a default size slot. This is equivalent to setting > > > them > > > >>> > with > > > >>> > >> default > > > >>> > >>> operator resources in your proposal. > > > >>> > >>> 4. 
*Total (pipeline) or max (blocking) of op_1 and op_2 > > > is > > > >>> > known.* > > > >>> > >>> - It is possible that the user learns the total / > > max > > > >>> > resource > > > >>> > >>> requirement from executing and monitoring the job, > > > >>> > while not > > > >>> > >>> being aware of > > > >>> > >>> individual operator requirements. > > > >>> > >>> - I believe this is the case your proposal does not > > > >>> > cover. And TBH, > > > >>> > >>> this is probably how most users learn the resource > > > >>> > requirements, > > > >>> > >>> according > > > >>> > >>> to my experiences. > > > >>> > >>> - In this case, the user might need to specify > > > >>> > different resources > > > >>> > >> if > > > >>> > >>> he wants to switch the execution mode, which should > > > not > > > >>> > be worse > > > >>> > >> than not > > > >>> > >>> being able to use fine-grained resource management. > > > >>> > >>> > > > >>> > >>> > > > >>> > >>> ## An additional idea inspired by your proposal. > > > >>> > >>> We may provide multiple options for deciding resources for > > > >>> > SSGs whose > > > >>> > >>> requirement is not specified, if needed. > > > >>> > >>> > > > >>> > >>> - Default slot resource (current design) > > > >>> > >>> - Default operator resource times number of operators > > > >>> > (equivalent to > > > >>> > >>> your proposal) > > > >>> > >>> > > > >>> > >>> > > > >>> > >>> ## Exposing internal runtime strategies > > > >>> > >>> Theoretically, yes. Tying to the SSGs, the resource > > > >>> > requirements might be > > > >>> > >>> affected if how SSGs are internally handled changes in > > > future. > > > >>> > >> Practically, > > > >>> > >>> I do not concretely see at the moment what kind of changes > > we > > > >>> > may want in > > > >>> > >>> future that might conflict with this FLIP proposal, as the > > > >>> > question of > > > >>> > >>> switching data exchange mode answered above. 
I'd suggest to > > > >>> > not give up > > > >>> > >> the > > > >>> > >>> user friendliness we may gain now for the future problems > > > that > > > >>> > may or may > > > >>> > >>> not exist. > > > >>> > >>> > > > >>> > >>> Moreover, the SSG-based approach has the flexibility to > > > >>> > achieve the > > > >>> > >>> equivalent behavior as the operator-based approach, if we > > > set each > > > >>> > >> operator > > > >>> > >>> (or task) to a separate SSG. We can even provide a shortcut > > > >>> > option to > > > >>> > >>> automatically do that for users, if needed. > > > >>> > >>> > > > >>> > >>> > > > >>> > >>> Thank you~ > > > >>> > >>> > > > >>> > >>> Xintong Song > > > >>> > >>> > > > >>> > >>> > > > >>> > >>> > > > >>> > >>> On Tue, Jan 19, 2021 at 11:48 PM Till Rohrmann > > > >>> > <[hidden email] <mailto:[hidden email]>> > > > >>> > >> wrote: > > > >>> > >>>> Thanks for the responses Xintong and Stephan, > > > >>> > >>>> > > > >>> > >>>> I agree that being able to define the resource > > requirements > > > for a > > > >>> > >> group of > > > >>> > >>>> operators is more user friendly. However, my concern is > > that > > > >>> > we are > > > >>> > >>>> exposing thereby internal runtime strategies which might > > > >>> > limit our > > > >>> > >>>> flexibility to execute a given job. Moreover, the > > semantics > > > of > > > >>> > >> configuring > > > >>> > >>>> resource requirements for SSGs could break if switching > > from > > > >>> > streaming > > > >>> > >> to > > > >>> > >>>> batch execution. If one defines the resource requirements > > > for > > > >>> > op_1 -> > > > >>> > >> op_2 > > > >>> > >>>> which run in pipelined mode when using the streaming > > > >>> > execution, then > > > >>> > >> how do > > > >>> > >>>> we interpret these requirements when op_1 -> op_2 are > > > >>> > executed with a > > > >>> > >>>> blocking data exchange in batch execution mode? 
> > > Consequently, > > > >>> > I am > > > >>> > >> still > > > >>> > >>>> leaning towards Stephan's proposal to set the resource > > > >>> > requirements per > > > >>> > >>>> operator. > > > >>> > >>>> > > > >>> > >>>> Maybe the following proposal makes the configuration > > easier: > > > >>> > If the > > > >>> > >> user > > > >>> > >>>> wants to use fine-grained resource requirements, then she > > > >>> > needs to > > > >>> > >> specify > > > >>> > >>>> the default size which is used for operators which have no > > > >>> > explicit > > > >>> > >>>> resource annotation. If this holds true, then every > > operator > > > >>> > would > > > >>> > >> have a > > > >>> > >>>> resource requirement and the system can try to execute the > > > >>> > operators > > > >>> > >> in the > > > >>> > >>>> best possible manner w/o being constrained by how the user > > > >>> > set the SSG > > > >>> > >>>> requirements. > > > >>> > >>>> > > > >>> > >>>> Cheers, > > > >>> > >>>> Till > > > >>> > >>>> > > > >>> > >>>> On Tue, Jan 19, 2021 at 9:09 AM Xintong Song > > > >>> > <[hidden email] <mailto:[hidden email]>> > > > >>> > >>>> wrote: > > > >>> > >>>> > > > >>> > >>>>> Thanks for the feedback, Stephan. > > > >>> > >>>>> > > > >>> > >>>>> Actually, your proposal has also come to my mind at some > > > >>> > point. And I > > > >>> > >>>> have > > > >>> > >>>>> some concerns about it. > > > >>> > >>>>> > > > >>> > >>>>> > > > >>> > >>>>> 1. It does not give users the same control as the > > SSG-based > > > >>> > approach. > > > >>> > >>>>> > > > >>> > >>>>> > > > >>> > >>>>> While both approaches do not require specifying for each > > > >>> > operator, > > > >>> > >>>>> SSG-based approach supports the semantic that "some > > > operators > > > >>> > >> together > > > >>> > >>>> use > > > >>> > >>>>> this much resource" while the operator-based approach > > > doesn't. 
> Think of a long pipeline with m operators (o_1, o_2, ..., o_m), and at some point there's an agg o_n (1 < n < m) which significantly reduces the data amount. One can separate the pipeline into 2 groups SSG_1 (o_1, ..., o_n) and SSG_2 (o_n+1, ... o_m), so that configuring much higher parallelisms for operators in SSG_1 than for operators in SSG_2 won't lead to too much wasting of resources. If the two SSGs end up needing different resources, with the SSG-based approach one can directly specify resources for the two groups. However, with the operator-based approach, the user will have to specify resources for each operator in one of the two groups, and tune the default slot resource via configurations to fit the other group.
>
> 2. It increases the chance of breaking operator chains.
>
> Setting chainable operators into different slot sharing groups will prevent them from being chained. In the current implementation, downstream operators, if SSG not explicitly specified, will be set to the same group as the chainable upstream operators (unless multiple upstream operators in different groups), to reduce the chance of breaking chains.
> Thinking of chainable operators o_1 -> o_2 -> o_3 -> o_4, deciding SSGs based on whether resource is specified we will easily get groups like (o_1, o_3) & (o_2, o_4), where none of the operators can be chained. This is also possible for the SSG-based approach, but I believe the chance is much smaller because there's no strong reason for users to specify the groups with alternate operators like that. We are more likely to get groups like (o_1, o_2) & (o_3, o_4), where the chain breaks only between o_2 and o_3.
>
> 3. It complicates the system by having two different mechanisms for sharing managed memory in a slot.
>
> - In FLIP-141, we introduced the intra-slot managed memory sharing mechanism, where managed memory is first distributed according to the consumer type, then further distributed across operators of that consumer type.
>
> - With the operator-based approach, managed memory size specified for an operator should account for all the consumer types of that operator. That means the managed memory is first distributed across operators, then distributed to different consumer types of each operator.
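A minimal sketch of why the order of the two distribution steps in point 3 matters. It is not Flink code: the class name, the 70/30 weights, the 100 MB slot, the 50 MB per-operator sizes and the "split evenly across operators of a type" rule are all assumptions chosen for illustration.

```java
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;

/** Illustrative only: names, weights and the even-split rule are assumptions. */
final class ManagedMemoryOrderSketch {

    /** Hypothetical consumer weights, in the spirit of the `consumer-weights` option. */
    static final Map<String, Integer> WEIGHTS = Map.of("DATAPROC", 70, "PYTHON", 30);

    /** Sum of the weights of the given consumer types. */
    static int totalWeight(Set<String> types) {
        int sum = 0;
        for (String type : types) {
            sum += WEIGHTS.get(type);
        }
        return sum;
    }

    /**
     * Slot first: split the slot's memory by consumer type, then split each
     * type's share evenly across the operators that use that type.
     */
    static int slotFirst(int slotMb, Map<String, Set<String>> typesByOperator,
                         String operator, String type) {
        if (!typesByOperator.get(operator).contains(type)) {
            return 0;
        }
        Set<String> typesInSlot = new HashSet<>();
        int usersOfType = 0;
        for (Set<String> types : typesByOperator.values()) {
            typesInSlot.addAll(types);
            if (types.contains(type)) {
                usersOfType++;
            }
        }
        int typeShareMb = slotMb * WEIGHTS.get(type) / totalWeight(typesInSlot);
        return typeShareMb / usersOfType;
    }

    /** Operator first: the operator's own memory is split across its consumer types. */
    static int operatorFirst(int operatorMb, Set<String> types, String type) {
        if (!types.contains(type)) {
            return 0;
        }
        return operatorMb * WEIGHTS.get(type) / totalWeight(types);
    }

    /** Returns {PYTHON MB of operator B slot-first, PYTHON MB of operator B operator-first}. */
    static int[] demo() {
        Map<String, Set<String>> typesByOperator = new LinkedHashMap<>();
        typesByOperator.put("A", Set.of("DATAPROC"));
        typesByOperator.put("B", Set.of("DATAPROC", "PYTHON"));
        int bySlot = slotFirst(100, typesByOperator, "B", "PYTHON");
        int byOperator = operatorFirst(50, typesByOperator.get("B"), "PYTHON");
        return new int[] {bySlot, byOperator};
    }

    public static void main(String[] args) {
        int[] result = demo();
        System.out.println("slot first: " + result[0] + " MB, operator first: " + result[1] + " MB");
    }
}
```

With these assumed numbers the same weights give operator B a different Python share depending on which step runs first, which is the change in meaning of `consumer-weights` described in the message.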
> Unfortunately, the different order of the two calculation steps can lead to different results. To be specific, the semantic of the configuration option `consumer-weights` changed (within a slot vs. within an operator).
>
> To sum up things:
>
> While (3) might be a bit more implementation related, I think (1) and (2) somehow suggest that, the price for the proposed approach to avoid specifying resource for every operator is that it's not as independent from operator chaining and slot sharing as the operator-based approach discussed in the FLIP.
>
> Thank you~
>
> Xintong Song
>
> On Tue, Jan 19, 2021 at 4:29 AM Stephan Ewen <[hidden email]> wrote:
>
> Thanks a lot, Yangze and Xintong for this FLIP.
>
> I want to say, first of all, that this is super well written. And the points that the FLIP makes about how to expose the configuration to users is exactly the right thing to figure out first.
> So good job here!
>
> About how to let users specify the resource profiles.
> If I can sum the FLIP and previous discussion up in my own words, the problem is the following:
>
> > Operator-level specification is the simplest and cleanest approach, because it avoids mixing operator configuration (resource) and scheduling. No matter what other parameters change (chaining, slot sharing, switching pipelined and blocking shuffles), the resource profiles stay the same.
> > But it would require that a user specifies resources on all operators, which makes it hard to use. That's why the FLIP suggests going with specifying resources on a Sharing-Group.
>
> I think both thoughts are important, so can we find a solution where the Resource Profiles are specified on an Operator, but we still avoid that we need to specify a resource profile on every operator?
>
> What do you think about something like the following:
> - Resource Profiles are specified on an operator level.
> - Not all operators need profiles
> - All Operators without a Resource Profile ended up in the default slot sharing group with a default profile (will get a default slot).
> - All Operators with a Resource Profile will go into another slot sharing group (the resource-specified-group).
> - Users can define different slot sharing groups for operators like they do now, with the exception that you cannot mix operators that have a resource profile and operators that have no resource profile.
> - The default case where no operator has a resource profile is just a special case of this model
> - The chaining logic sums up the profiles per operator, like it does now, and the scheduler sums up the profiles of the tasks that it schedules together.
>
> There is another question about reactive scaling raised in the FLIP. I need to think a bit about that. That is indeed a bit more tricky once we have slots of different sizes.
> It is not clear then which of the different slot requests the ResourceManager should fulfill when new resources (TMs) show up, or how the JobManager redistributes the slots resources when resources (TMs) disappear
> This question is pretty orthogonal, though, to the "how to specify the resources".
>
> Best,
> Stephan
>
> On Fri, Jan 8, 2021 at 5:14 AM Xintong Song <[hidden email]> wrote:
>
> Thanks for drafting the FLIP and driving the discussion, Yangze.
> And Thanks for the feedback, Till and Chesnay.
>
> @Till,
>
> I agree that specifying requirements for SSGs means that SSGs need to be supported in fine-grained resource management, otherwise each operator might use as many resources as the whole group. However, I cannot think of a strong reason for not supporting SSGs in fine-grained resource management.
>
> > Interestingly, if all operators have their resources properly specified, then slot sharing is no longer needed because Flink could slice off the appropriately sized slots for every Task individually.
> >
> > So for example, if we have a job consisting of two operator op_1 and op_2 where each op needs 100 MB of memory, we would then say that the slot sharing group needs 200 MB of memory to run. If we have a cluster with 2 TMs with one slot of 100 MB each, then the system cannot run this job. If the resources were specified on an operator level, then the system could still make the decision to deploy op_1 to TM_1 and op_2 to TM_2.
> Couldn't agree more that if all operators' requirements are properly specified, slot sharing should be no longer needed. I think this exactly disproves the example. If we already know op_1 and op_2 each needs 100 MB of memory, why would we put them in the same group? If they are in separate groups, with the proposed approach the system can freely deploy them to either a 200 MB TM or two 100 MB TMs.
>
> Moreover, the precondition for not needing slot sharing is having resource requirements properly specified for all operators. This is not always possible, and usually requires tremendous efforts. One of the benefits for SSG-based requirements is that it allows the user to freely decide the granularity, thus efforts they want to pay. I would consider SSG in fine-grained resource management as a group of operators that the user would like to specify the total resource for. There can be only one group in the job, 2~3 groups dividing the job into a few major parts, or as many groups as the number of tasks/operators, depending on how fine-grained the user is able to specify the resources.
> Having to support SSGs might be a constraint. But given that all the current scheduler implementations already support SSGs, I tend to think that as an acceptable price for the above discussed usability and flexibility.
>
> @Chesnay
>
> > Will declaring them on slot sharing groups not also waste resources if the parallelism of operators within that group are different?
>
> Yes. It's a trade-off between usability and resource utilization. To avoid such wasting, the user can define more groups, so that each group contains less operators and the chance of having operators with different parallelism will be reduced. The price is to have more resource requirements to specify.
>
> > It also seems like quite a hassle for users having to recalculate the resource requirements if they change the slot sharing.
> > I'd think that it's not really workable for users that create a set of re-usable operators which are mixed and matched in their applications; managing the resources requirements in such a setting would be a nightmare, and in the end would require operator-level requirements any way.
> > In that sense, I'm not even sure whether it really increases usability.
>
> - As mentioned in my reply to Till's comment, there's no reason to put multiple operators whose individual resource requirements are already known into the same group in fine-grained resource management.
> - Even an operator implementation is reused for multiple applications, it does not guarantee the same resource requirements. During our years of practices in Alibaba, with per-operator requirements specified for Blink's fine-grained resource management, very few users (including our specialists who are dedicated to supporting Blink users) are as experienced as to accurately predict/estimate the operator resource requirements. Most people rely on the execution-time metrics (throughput, delay, cpu load, memory usage, GC pressure, etc.) to improve the specification.
>
> To sum up:
> If the user is capable of providing proper resource requirements for every operator, that's definitely a good thing and we would not need to rely on the SSGs.
> However, that shouldn't be a *must* for the fine-grained resource management to work. For those users who are capable and do not like having to set each operator to a separate SSG, I would be ok to have both SSG-based and operator-based runtime interfaces and to only fallback to the SSG requirements when the operator requirements are not specified. However, as the first step, I think we should prioritise the use cases where users are not that experienced.
>
> Thank you~
>
> Xintong Song
>
> On Thu, Jan 7, 2021 at 9:55 PM Chesnay Schepler <[hidden email]> wrote:
>
> Will declaring them on slot sharing groups not also waste resources if the parallelism of operators within that group are different?
>
> It also seems like quite a hassle for users having to recalculate the resource requirements if they change the slot sharing.
> I'd think that it's not really workable for users that create a set of re-usable operators which are mixed and matched in their applications; managing the resources requirements in such a setting would be a nightmare, and in the end would require operator-level requirements any way.
> In that sense, I'm not even sure whether it really increases usability.
> My main worry is that it if we wire the runtime to work on SSGs it's gonna be difficult to implement more fine-grained approaches, which would not be the case if, for the runtime, they are always defined on an operator-level.
>
> On 1/7/2021 2:42 PM, Till Rohrmann wrote:
>
> Thanks for drafting this FLIP and starting this discussion Yangze.
> I like that defining resource requirements on a slot sharing group makes the overall setup easier and improves usability of resource requirements.
> What I do not like about it is that it changes slot sharing groups from being a scheduling hint to something which needs to be supported in order to support fine grained resource requirements.
> So far, the idea of slot sharing groups was that it tells the system that a set of operators can be deployed in the same slot. But the system still had the freedom to say that it would rather place these tasks in different slots if it wanted. If we now specify resource requirements on a per slot sharing group, then the only option for a scheduler which does not support slot sharing groups is to say that every operator in this slot sharing group needs a slot with the same resources as the whole group.
>
> So for example, if we have a job consisting of two operator op_1 and op_2 where each op needs 100 MB of memory, we would then say that the slot sharing group needs 200 MB of memory to run. If we have a cluster with 2 TMs with one slot of 100 MB each, then the system cannot run this job. If the resources were specified on an operator level, then the system could still make the decision to deploy op_1 to TM_1 and op_2 to TM_2.
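The 200 MB example above can be checked with a small sketch. It assumes the simplest possible model, which is not how Flink's scheduler is implemented: each request needs its own slot, and a slot fits a request if it is at least as large. The class and method names are hypothetical.

```java
import java.util.Arrays;

/** Illustrative only: a toy "one request per slot" feasibility check. */
final class SlotPlacementSketch {

    /** Returns true if every request can be given its own slot that is large enough. */
    static boolean canPlace(int[] requestsMb, int[] slotsMb) {
        if (requestsMb.length > slotsMb.length) {
            return false;
        }
        int[] requests = requestsMb.clone();
        int[] slots = slotsMb.clone();
        Arrays.sort(requests);
        Arrays.sort(slots);
        // Match the largest request with the largest slot, the second largest
        // with the second largest, and so on.
        for (int i = 1; i <= requests.length; i++) {
            if (requests[requests.length - i] > slots[slots.length - i]) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        int[] cluster = {100, 100}; // two TMs, one 100 MB slot each
        // One 200 MB requirement for the whole slot sharing group, treated as a hard restriction.
        System.out.println("SSG-level: " + canPlace(new int[] {200}, cluster));
        // Two separate 100 MB requirements, one per operator.
        System.out.println("operator-level: " + canPlace(new int[] {100, 100}, cluster));
    }
}
```

Under this model the single group-level request does not fit, while the two operator-level requests do, which matches the example as stated in the message.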
> Originally, one of the primary goals of slot sharing groups was to make it easier for the user to reason about how many slots a job needs independent of the actual number of operators in the job. Interestingly, if all operators have their resources properly specified, then slot sharing is no longer needed because Flink could slice off the appropriately sized slots for every Task individually. What matters is whether the whole cluster has enough resources to run all tasks or not.
>
> Cheers,
> Till
>
> On Thu, Jan 7, 2021 at 4:08 AM Yangze Guo <[hidden email]> wrote:
>
> Hi, there,
>
> We would like to start a discussion thread on "FLIP-156: Runtime Interfaces for Fine-Grained Resource Requirements"[1], where we propose Slot Sharing Group (SSG) based runtime interfaces for specifying fine-grained resource requirements.
>
> In this FLIP:
> - Expound the user story of fine-grained resource management.
> - Propose runtime interfaces for specifying SSG-based resource requirements.
> - Discuss the pros and cons of the three potential granularities for specifying the resource requirements (op, task and slot sharing group) and explain why we choose the slot sharing group.
>
> Please find more details in the FLIP wiki document [1]. Looking forward to your feedback.
>
> [1] https://cwiki.apache.org/confluence/display/FLINK/FLIP-156%3A+Runtime+Interfaces+for+Fine-Grained+Resource+Requirements
>
> Best,
> Yangze Guo
Hi all, sorry for joining the discussion even after voting has started.
I want to share my thoughts on this after reading the discussion above.

I think the Flink *runtime* already has an ideal granularity for resource management: the task. If a slot is shared by multiple tasks, that slot's resource requirement is simply the sum of all its logical slots. So basically, there has been no resource requirement for SlotSharingGroup in the runtime until now, right?

As discussed, we already agree that "If all operators have their resources properly specified, then slot sharing is no longer needed." So it seems to me that the natural next question is: how do we bridge the impractical operator-level resource specification to the runtime's task-level resource requirement? This is actually a pure API matter, as Chesnay has pointed out.

But FLIP-156 brings another direction to the table: how about using SSGs for resource specification in both the API and the runtime? From the FLIP and the discussion, I assume that an SSG resource specification will override the operator-level resource specification if both are specified?

So, I wonder whether we could interpret an SSG resource specification as an "add" rather than a "set" on the resource requirement? The semantics would be that the SSG resource specification adds extra resources to the shared slot, to express concerns about possibly high throughput and resource requirements for the tasks in one physical slot. The result is that if the scheduler does respect slot sharing, the allocated slot gains the extra resources specified for that SSG.

I think one coding barrier for the "add" approach is ResourceSpec.UNKNOWN, which didn't support the 'merge' operation. I tend to use ResourceSpec.ZERO as the default; the task executor should be aware of this.

@Chesnay
> My main worry is that it if we wire the runtime to work on SSGs it's gonna be difficult to implement more fine-grained approaches, which would not be the case if, for the runtime, they are always defined on an operator-level.

An "add" operation should be less invasive and set a low barrier for future fine-grained approaches.
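A rough sketch of the "add" versus "set" interpretation discussed above. `Res` is a hypothetical stand-in and not Flink's `ResourceSpec`; the "set" branch encodes my assumption that a specified SSG requirement replaces the sum of the task requirements, and all numbers are made up.

```java
import java.util.List;

/** Hypothetical stand-in for a resource spec; this is not Flink's ResourceSpec. */
final class Res {
    static final Res ZERO = new Res(0, 0);

    final int cpuMilli;
    final int memoryMb;

    Res(int cpuMilli, int memoryMb) {
        this.cpuMilli = cpuMilli;
        this.memoryMb = memoryMb;
    }

    /** Component-wise sum; ZERO is the neutral element, so no UNKNOWN value is needed. */
    Res merge(Res other) {
        return new Res(cpuMilli + other.cpuMilli, memoryMb + other.memoryMb);
    }

    boolean isZero() {
        return cpuMilli == 0 && memoryMb == 0;
    }
}

final class AddVsSetSketch {

    /** Sum of the task-level requirements of all tasks sharing the slot. */
    static Res sum(List<Res> tasks) {
        Res total = Res.ZERO;
        for (Res task : tasks) {
            total = total.merge(task);
        }
        return total;
    }

    /** "add": the SSG value is extra resource on top of the task sum. */
    static Res slotWithAdd(List<Res> tasks, Res ssgExtra) {
        return sum(tasks).merge(ssgExtra);
    }

    /** "set" (assumed): a specified SSG value replaces the task sum. */
    static Res slotWithSet(List<Res> tasks, Res ssgRequirement) {
        return ssgRequirement.isZero() ? sum(tasks) : ssgRequirement;
    }

    /** Returns {memory MB under "add", memory MB under "set"} for made-up inputs. */
    static int[] demo() {
        List<Res> tasks = List.of(new Res(500, 100), new Res(500, 100));
        Res added = slotWithAdd(tasks, new Res(0, 50));
        Res set = slotWithSet(tasks, new Res(1000, 200));
        return new int[] {added.memoryMb, set.memoryMb};
    }
}
```

In this sketch a slot whose tasks and SSG are all `Res.ZERO` stays at `Res.ZERO`, which a scheduler could read as "use a default-sized slot"; that reading is part of the proposal in this message, not existing behavior.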
@Stephan
> - Users can define different slot sharing groups for operators like they do now, with the exception that you cannot mix operators that have a resource profile and operators that have no resource profile.

@Till
> This effectively means that all unspecified operators will implicitly have a zero resource requirement.
> I am wondering whether this wouldn't lead to a surprising behaviour for the user. If the user specifies the resource requirements for a single operator, then he probably will assume that the other operators will get the default share of resources and not nothing.

I think this is inherent, because we cannot define ResourceSpec.ONE, i.e. the resource requirement for exactly one default slot, with concrete numbers? I tend to squash out the unspecified operators if there are operators in the chain with an explicit resource specification. Otherwise, the protocol tends to be verbose, as in "give me this much resource and a default". I think if we have an explicit resource specification for only some operators, it is just saying "I don't care about the other operators that much, just get them a place to run". Most likely these are stateless filter/map operators or other operators that consume few resources. If this does turn out to be a problem, I think clients can specify a global default (or defaults at other levels in the future). In the job graph generation phase, we could take that default into account for unspecified operators.

@FLIP-156
> Expose operator chaining. (Cons of task level resource specifying)

Is this inherent to all group-level resource specification? It will either break chaining or obey it, or even fail to work with it.

To sum up the above, my suggestions are:

On the API side:
* StreamExecutionEnvironment: A global default (ResourceSpec.ZERO if unspecified).
* Operator: ResourceSpec.ZERO (unspecified) as default.
* Task: sum of requirements from specified operators + global default (if there are any unspecified operators)
* SSG: additional resource to physical slot.
On the runtime side:
* Task: ResourceSpec.Task or ResourceSpec.ZERO
* SSG: ResourceSpec.SSG or ResourceSpec.ZERO

The physical slot gets the sum of the resources from its logical slots and the SSG; if that sum is ResourceSpec.ZERO, it is just a default-sized slot.

In short, turn the SSG resource specification into an "add" and drop ResourceSpec.UNKNOWN.

Questions/Issues:
* Could an SSG express a negative resource requirement?
* Is there a concrete barrier that keeps partially configured resources from working? I saw that it fails job submission in Dispatcher.submitJob.
* An option (cluster/job level) to force slot sharing in the scheduler? This could be useful when migrating from FLIP-156 to a future approach.
* An option (cluster) to ignore resource specifications (allowing a resource-specified job to run on an open box environment) for non-production usage?

On February 1, 2021 at 11:54:10, Yangze Guo ([hidden email]) wrote:

Thanks for reply, Till and Xintong!

I update the FLIP, including:
- Edit the JavaDoc of the proposed StreamGraphGenerator#setSlotSharingGroupResource.
- Add "Future Plan" section, which contains the potential follow-up issues and the limitations to be documented when fine-grained resource management is exposed to users.

I'll start a vote in another thread.

Best,
Yangze Guo

On Fri, Jan 29, 2021 at 10:07 PM Till Rohrmann <[hidden email]> wrote:

> Thanks for summarizing the discussion, Yangze. I agree that setting resource requirements per operator is not very user friendly. Moreover, I couldn't come up with a different proposal which would be as easy to use and wouldn't expose internal scheduling details. In fact, following this argument then we shouldn't have exposed the slot sharing groups in the first place.
>
> What is important for the user is that we properly document the limitations and constraints the fine grained resource specification has. For example, we should explain how optimizations like chaining are affected by it and how different execution modes (batch vs.
> streaming) affect the execution of operators which have specified resources. These things shouldn't become part of the contract of this feature and are more caused by internal implementation details but it will be important to understand these things properly in order to use this feature effectively.
>
> Hence, +1 for starting the vote for this FLIP.
>
> Cheers,
> Till
>
> On Tue, Jan 26, 2021 at 4:37 AM Xintong Song <[hidden email]> wrote:
>
> Thanks for the summary, Yangze.
>
> The changes and follow-up issues LGTM. Let's wait for responses from the others before starting a vote.
>
> Thank you~
>
> Xintong Song
>
> On Tue, Jan 26, 2021 at 11:08 AM Yangze Guo <[hidden email]> wrote:
>
> Thanks everyone for the lively discussion. I'd like to try to summarize the current convergence in the discussion. Please let me know if I got things wrong or missed something crucial here.
>
> Change of this FLIP:
> - Treat the SSG resource requirements as a hint instead of a restriction for the runtime. That should be explicitly explained in the JavaDocs.
>
> Potential follow-up issues if needed:
> - Provide operator-level resource configuration interface.
> - Provide multiple options for deciding resources for SSGs whose requirement is not specified:
> ** Default slot resource.
> ** Default operator resource times number of operators.
>
> If there are no other issues, I'll update the FLIP accordingly and start a vote thread. Thanks all for the valuable feedback again.
> > > > > > Best, > > > Yangze Guo > > > > > > Best, > > > Yangze Guo > > > > > > > > > On Fri, Jan 22, 2021 at 11:30 AM Xintong Song <[hidden email]> > > > wrote: > > > > > > > > > > > > FGRuntimeInterface.png > > > > > > > > Thank you~ > > > > > > > > Xintong Song > > > > > > > > > > > > > > > > On Fri, Jan 22, 2021 at 11:11 AM Xintong Song <[hidden email]> > > > wrote: > > > >> > > > >> I think Chesnay's proposal could actually work. IIUC, the keypoint is > > > to derive operator requirements from SSG requirements on the API side, so > > > that the runtime only deals with operator requirements. It's debatable > > how > > > the deriving should be done though. E.g., an alternative could be to > > evenly > > > divide the SSG requirement into requirements of operators in the group. > > > >> > > > >> > > > >> However, I'm not entirely sure which option is more desired. > > > Illustrating my understanding in the following figure, in which on the > > top > > > is Chesnay's proposal and on the bottom is the SSG-based proposal in this > > > FLIP. > > > >> > > > >> > > > >> > > > >> I think the major difference between the two approaches is where > > > deriving operator requirements from SSG requirements happens. > > > >> > > > >> - Chesnay's proposal simplifies the runtime logic and the interface to > > > expose, at the price of moving more complexity (i.e. the deriving) to the > > > API side. The question is, where do we prefer to keep the complexity? I'm > > > slightly leaning towards having a thin API and keep the complexity in > > > runtime if possible. > > > >> > > > >> - Notice that the dash line arrows represent optional steps that are > > > needed only for schedulers that do not respect SSGs, which we don't have > > at > > > the moment. If we only look at the solid line arrows, then the SSG-based > > > approach is much simpler, without needing to derive and aggregate the > > > requirements back and forth. 
I'm not sure about complicating the current design only for the potential future needs.
> > > >>
> > > >>
> > > >> Thank you~
> > > >>
> > > >> Xintong Song
> > > >>
> > > >>
> > > >> On Fri, Jan 22, 2021 at 7:35 AM Chesnay Schepler <[hidden email]> wrote:
> > > >>>
> > > >>> You're raising a good point, but I think I can rectify that with a
> > > >>> minor adjustment.
> > > >>>
> > > >>> Default requirements are whatever the default requirements are,
> > > >>> setting the requirements for one operator has no effect on other
> > > >>> operators.
> > > >>>
> > > >>> With these rules, and some API enhancements, the following mockup
> > > >>> would replicate the SSG-based behavior:
> > > >>>
> > > >>> Map<SlotSharingGroupId, Requirements> requirements = ...
> > > >>> for slotSharingGroup in env.getSlotSharingGroups() {
> > > >>>     vertices = slotSharingGroup.getVertices()
> > > >>>     vertices.first().setRequirements(requirements.get(slotSharingGroup.getID()))
> > > >>>     vertices.remaining().setRequirements(ZERO)
> > > >>> }
> > > >>>
> > > >>> We could even allow setting requirements on slotsharing-groups
> > > >>> colocation-groups and internally translate them accordingly.
> > > >>> I can't help but feel this is a plain API issue.
> > > >>>
> > > >>> On 1/21/2021 9:44 AM, Till Rohrmann wrote:
> > > >>> > If I understand you correctly Chesnay, then you want to decouple the
> > > >>> > resource requirement specification from the slot sharing group
> > > >>> > assignment. Hence, per default all operators would be in the same
> > > >>> > slot sharing group. If there is no operator with a resource
> > > >>> > specification, then the system would allocate a default slot for it.
> > > >>> > If there is at least one operator, then the system would sum up all
> > > >>> > the specified resources and allocate a slot of this size. This
> > > >>> > effectively means that all unspecified operators will implicitly
> > > >>> > have a zero resource requirement. Did I understand your idea correctly?
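The allocation rule Till restates in the message above can be sketched in a few lines. This is only an illustration of the rule as worded in the thread, not Flink code: the class `SlotSizeSketch`, the method `slotSizeMb`, and the 1024 MB default slot size are hypothetical, and a resource requirement is reduced to a single memory number.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Optional;

public class SlotSizeSketch {
    // Hypothetical default slot size; the real value would come from the cluster configuration.
    static final int DEFAULT_SLOT_MB = 1024;

    /**
     * Each list entry is the memory one operator in the group asked for,
     * or Optional.empty() if the user did not specify anything for it.
     */
    static int slotSizeMb(List<Optional<Integer>> operatorRequirementsMb) {
        boolean anySpecified =
                operatorRequirementsMb.stream().anyMatch(Optional::isPresent);
        if (!anySpecified) {
            // No operator carries a requirement: fall back to a default slot.
            return DEFAULT_SLOT_MB;
        }
        // At least one requirement is set: sum the specified ones and
        // treat every unspecified operator as requiring zero.
        return operatorRequirementsMb.stream()
                .mapToInt(r -> r.orElse(0))
                .sum();
    }

    public static void main(String[] args) {
        List<Optional<Integer>> noneSpecified =
                Arrays.asList(Optional.<Integer>empty(), Optional.<Integer>empty());
        List<Optional<Integer>> oneSpecified =
                Arrays.asList(Optional.of(300), Optional.<Integer>empty());

        System.out.println(slotSizeMb(noneSpecified)); // 1024
        System.out.println(slotSizeMb(oneSpecified));  // 300
    }
}
```

In the second case the slot shrinks from the default to 300 MB as soon as one operator is annotated, and the unannotated operator contributes nothing to the slot size.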
> > > >>> > > > > >>> > I am wondering whether this wouldn't lead to a surprising behaviour > > > >>> > for the user. If the user specifies the resource requirements for a > > > >>> > single operator, then he probably will assume that the other > > > operators > > > >>> > will get the default share of resources and not nothing. > > > >>> > > > > >>> > Cheers, > > > >>> > Till > > > >>> > > > > >>> > On Thu, Jan 21, 2021 at 3:25 AM Chesnay Schepler < > > [hidden email] > > > >>> > <mailto:[hidden email]>> wrote: > > > >>> > > > > >>> > Is there even a functional difference between specifying the > > > >>> > requirements for an SSG vs specifying the same requirements on > > a > > > >>> > single > > > >>> > operator within that group (ideally a colocation group to avoid > > > this > > > >>> > whole hint business)? > > > >>> > > > > >>> > Wouldn't we get the best of both worlds in the latter case? > > > >>> > > > > >>> > Users can take shortcuts to define shared requirements, > > > >>> > but refine them further as needed on a per-operator basis, > > > >>> > without changing semantics of slotsharing groups > > > >>> > nor the runtime being locked into SSG-based requirements. > > > >>> > > > > >>> > (And before anyone argues what happens if slotsharing groups > > > >>> > change or > > > >>> > whatnot, that's a plain API issue that we could surely solve. > > (A > > > >>> > plain > > > >>> > iteration over slotsharing groups and therein contained > > operators > > > >>> > would > > > >>> > suffice)). > > > >>> > > > > >>> > On 1/20/2021 6:48 PM, Till Rohrmann wrote: > > > >>> > > Maybe a different minor idea: Would it be possible to treat > > > the SSG > > > >>> > > resource requirements as a hint for the runtime similar to > > how > > > >>> > slot sharing > > > >>> > > groups are designed at the moment? Meaning that we don't give > > > >>> > the guarantee > > > >>> > > that Flink will always deploy this set of tasks together no > > > >>> > matter what > > > >>> > > comes. 
If, for example, the runtime can derive by some means > > > the > > > >>> > resource > > > >>> > > requirements for each task based on the requirements for the > > > >>> > SSG, this > > > >>> > > could be possible. One easy strategy would be to give every > > > task > > > >>> > the same > > > >>> > > resources as the whole slot sharing group. Another one could > > be > > > >>> > > distributing the resources equally among the tasks. This does > > > >>> > not even have > > > >>> > > to be implemented but we would give ourselves the freedom to > > > change > > > >>> > > scheduling if need should arise. > > > >>> > > > > > >>> > > Cheers, > > > >>> > > Till > > > >>> > > > > > >>> > > On Wed, Jan 20, 2021 at 7:04 AM Yangze Guo < > > [hidden email] > > > >>> > <mailto:[hidden email]>> wrote: > > > >>> > > > > > >>> > >> Thanks for the responses, Till and Xintong. > > > >>> > >> > > > >>> > >> I second Xintong's comment that SSG-based runtime interface > > > >>> > will give > > > >>> > >> us the flexibility to achieve op/task-based approach. That's > > > one of > > > >>> > >> the most important reasons for our design choice. > > > >>> > >> > > > >>> > >> Some cents regarding the default operator resource: > > > >>> > >> - It might be good for the scenario of DataStream jobs. > > > >>> > >> ** For light-weight operators, the accumulative > > > >>> > configuration error > > > >>> > >> will not be significant. Then, the resource of a task used > > is > > > >>> > >> proportional to the number of operators it contains. > > > >>> > >> ** For heavy operators like join and window or operators > > > >>> > using the > > > >>> > >> external resources, user will turn to the fine-grained > > > resource > > > >>> > >> configuration. > > > >>> > >> - It can increase the stability for the standalone cluster > > > >>> > where task > > > >>> > >> executors registered are heterogeneous(with different > > default > > > slot > > > >>> > >> resources). 
> > > >>> > >> - It might not be good for SQL users. The operators that SQL > > > >>> > will be > > > >>> > >> transferred to is a black box to the user. We also do not > > > guarantee > > > >>> > >> the cross-version of consistency of the transformation so > > far. > > > >>> > >> > > > >>> > >> I think it can be treated as a follow-up work when the > > > fine-grained > > > >>> > >> resource management is end-to-end ready. > > > >>> > >> > > > >>> > >> Best, > > > >>> > >> Yangze Guo > > > >>> > >> > > > >>> > >> > > > >>> > >> On Wed, Jan 20, 2021 at 11:16 AM Xintong Song > > > >>> > <[hidden email] <mailto:[hidden email]>> > > > >>> > >> wrote: > > > >>> > >>> Thanks for the feedback, Till. > > > >>> > >>> > > > >>> > >>> ## I feel that what you proposed (operator-based + default > > > >>> > value) might > > > >>> > >> be > > > >>> > >>> subsumed by the SSG-based approach. > > > >>> > >>> Thinking of op_1 -> op_2, there are the following 4 cases, > > > >>> > categorized by > > > >>> > >>> whether the resource requirements are known to the users. > > > >>> > >>> > > > >>> > >>> 1. *Both known.* As previously mentioned, there's no > > > >>> > reason to put > > > >>> > >>> multiple operators whose individual resource > > requirements > > > >>> > are already > > > >>> > >> known > > > >>> > >>> into the same group in fine-grained resource > > management. > > > >>> > And if op_1 > > > >>> > >> and > > > >>> > >>> op_2 are in different groups, there should be no > > problem > > > >>> > switching > > > >>> > >> data > > > >>> > >>> exchange mode from pipelined to blocking. This is > > > >>> > equivalent to > > > >>> > >> specifying > > > >>> > >>> operator resource requirements in your proposal. > > > >>> > >>> 2. *op_1 known, op_2 unknown.* Similar to 1), except > > that > > > >>> > op_2 is in a > > > >>> > >>> SSG whose resource is not specified thus would have the > > > >>> > default slot > > > >>> > >>> resource. 
This is equivalent to having default operator > > > >>> > resources in > > > >>> > >> your > > > >>> > >>> proposal. > > > >>> > >>> 3. *Both unknown*. The user can either set op_1 and > > op_2 > > > >>> > to the same > > > >>> > >> SSG > > > >>> > >>> or separate SSGs. > > > >>> > >>> - If op_1 and op_2 are in the same SSG, it will be > > > >>> > equivalent to > > > >>> > >> the > > > >>> > >>> coarse-grained resource management, where op_1 and > > > op_2 > > > >>> > share a > > > >>> > >> default > > > >>> > >>> size slot no matter which data exchange mode is > > used. > > > >>> > >>> - If op_1 and op_2 are in different SSGs, then each > > of > > > >>> > them will > > > >>> > >> use > > > >>> > >>> a default size slot. This is equivalent to setting > > > them > > > >>> > with > > > >>> > >> default > > > >>> > >>> operator resources in your proposal. > > > >>> > >>> 4. *Total (pipeline) or max (blocking) of op_1 and op_2 > > > is > > > >>> > known.* > > > >>> > >>> - It is possible that the user learns the total / > > max > > > >>> > resource > > > >>> > >>> requirement from executing and monitoring the job, > > > >>> > while not > > > >>> > >>> being aware of > > > >>> > >>> individual operator requirements. > > > >>> > >>> - I believe this is the case your proposal does not > > > >>> > cover. And TBH, > > > >>> > >>> this is probably how most users learn the resource > > > >>> > requirements, > > > >>> > >>> according > > > >>> > >>> to my experiences. > > > >>> > >>> - In this case, the user might need to specify > > > >>> > different resources > > > >>> > >> if > > > >>> > >>> he wants to switch the execution mode, which should > > > not > > > >>> > be worse > > > >>> > >> than not > > > >>> > >>> being able to use fine-grained resource management. > > > >>> > >>> > > > >>> > >>> > > > >>> > >>> ## An additional idea inspired by your proposal. 
> > > >>> > >>> We may provide multiple options for deciding resources for > > > >>> > SSGs whose > > > >>> > >>> requirement is not specified, if needed. > > > >>> > >>> > > > >>> > >>> - Default slot resource (current design) > > > >>> > >>> - Default operator resource times number of operators > > > >>> > (equivalent to > > > >>> > >>> your proposal) > > > >>> > >>> > > > >>> > >>> > > > >>> > >>> ## Exposing internal runtime strategies > > > >>> > >>> Theoretically, yes. Tying to the SSGs, the resource > > > >>> > requirements might be > > > >>> > >>> affected if how SSGs are internally handled changes in > > > future. > > > >>> > >> Practically, > > > >>> > >>> I do not concretely see at the moment what kind of changes > > we > > > >>> > may want in > > > >>> > >>> future that might conflict with this FLIP proposal, as the > > > >>> > question of > > > >>> > >>> switching data exchange mode answered above. I'd suggest to > > > >>> > not give up > > > >>> > >> the > > > >>> > >>> user friendliness we may gain now for the future problems > > > that > > > >>> > may or may > > > >>> > >>> not exist. > > > >>> > >>> > > > >>> > >>> Moreover, the SSG-based approach has the flexibility to > > > >>> > achieve the > > > >>> > >>> equivalent behavior as the operator-based approach, if we > > > set each > > > >>> > >> operator > > > >>> > >>> (or task) to a separate SSG. We can even provide a shortcut > > > >>> > option to > > > >>> > >>> automatically do that for users, if needed. 
> > > >>> > >>> > > > >>> > >>> > > > >>> > >>> Thank you~ > > > >>> > >>> > > > >>> > >>> Xintong Song > > > >>> > >>> > > > >>> > >>> > > > >>> > >>> > > > >>> > >>> On Tue, Jan 19, 2021 at 11:48 PM Till Rohrmann > > > >>> > <[hidden email] <mailto:[hidden email]>> > > > >>> > >> wrote: > > > >>> > >>>> Thanks for the responses Xintong and Stephan, > > > >>> > >>>> > > > >>> > >>>> I agree that being able to define the resource > > requirements > > > for a > > > >>> > >> group of > > > >>> > >>>> operators is more user friendly. However, my concern is > > that > > > >>> > we are > > > >>> > >>>> exposing thereby internal runtime strategies which might > > > >>> > limit our > > > >>> > >>>> flexibility to execute a given job. Moreover, the > > semantics > > > of > > > >>> > >> configuring > > > >>> > >>>> resource requirements for SSGs could break if switching > > from > > > >>> > streaming > > > >>> > >> to > > > >>> > >>>> batch execution. If one defines the resource requirements > > > for > > > >>> > op_1 -> > > > >>> > >> op_2 > > > >>> > >>>> which run in pipelined mode when using the streaming > > > >>> > execution, then > > > >>> > >> how do > > > >>> > >>>> we interpret these requirements when op_1 -> op_2 are > > > >>> > executed with a > > > >>> > >>>> blocking data exchange in batch execution mode? > > > Consequently, > > > >>> > I am > > > >>> > >> still > > > >>> > >>>> leaning towards Stephan's proposal to set the resource > > > >>> > requirements per > > > >>> > >>>> operator. > > > >>> > >>>> > > > >>> > >>>> Maybe the following proposal makes the configuration > > easier: > > > >>> > If the > > > >>> > >> user > > > >>> > >>>> wants to use fine-grained resource requirements, then she > > > >>> > needs to > > > >>> > >> specify > > > >>> > >>>> the default size which is used for operators which have no > > > >>> > explicit > > > >>> > >>>> resource annotation. 
If this holds true, then every > > operator > > > >>> > would > > > >>> > >> have a > > > >>> > >>>> resource requirement and the system can try to execute the > > > >>> > operators > > > >>> > >> in the > > > >>> > >>>> best possible manner w/o being constrained by how the user > > > >>> > set the SSG > > > >>> > >>>> requirements. > > > >>> > >>>> > > > >>> > >>>> Cheers, > > > >>> > >>>> Till > > > >>> > >>>> > > > >>> > >>>> On Tue, Jan 19, 2021 at 9:09 AM Xintong Song > > > >>> > <[hidden email] <mailto:[hidden email]>> > > > >>> > >>>> wrote: > > > >>> > >>>> > > > >>> > >>>>> Thanks for the feedback, Stephan. > > > >>> > >>>>> > > > >>> > >>>>> Actually, your proposal has also come to my mind at some > > > >>> > point. And I > > > >>> > >>>> have > > > >>> > >>>>> some concerns about it. > > > >>> > >>>>> > > > >>> > >>>>> > > > >>> > >>>>> 1. It does not give users the same control as the > > SSG-based > > > >>> > approach. > > > >>> > >>>>> > > > >>> > >>>>> > > > >>> > >>>>> While both approaches do not require specifying for each > > > >>> > operator, > > > >>> > >>>>> SSG-based approach supports the semantic that "some > > > operators > > > >>> > >> together > > > >>> > >>>> use > > > >>> > >>>>> this much resource" while the operator-based approach > > > doesn't. > > > >>> > >>>>> > > > >>> > >>>>> > > > >>> > >>>>> Think of a long pipeline with m operators (o_1, o_2, ..., > > > >>> > o_m), and > > > >>> > >> at > > > >>> > >>>> some > > > >>> > >>>>> point there's an agg o_n (1 < n < m) which significantly > > > >>> > reduces the > > > >>> > >> data > > > >>> > >>>>> amount. One can separate the pipeline into 2 groups SSG_1 > > > >>> > (o_1, ..., > > > >>> > >> o_n) > > > >>> > >>>>> and SSG_2 (o_n+1, ... o_m), so that configuring much > > higher > > > >>> > >> parallelisms > > > >>> > >>>>> for operators in SSG_1 than for operators in SSG_2 won't > > > >>> > lead to too > > > >>> > >> much > > > >>> > >>>>> wasting of resources. 
If the two SSGs end up needing different resources,
> > > >>> > >>>>> with the SSG-based approach one can directly specify resources for
> > > >>> > >>>>> the two groups. However, with the operator-based approach, the user
> > > >>> > >>>>> will have to specify resources for each operator in one of the two
> > > >>> > >>>>> groups, and tune the default slot resource via configurations to fit
> > > >>> > >>>>> the other group.
> > > >>> > >>>>>
> > > >>> > >>>>>
> > > >>> > >>>>> 2. It increases the chance of breaking operator chains.
> > > >>> > >>>>>
> > > >>> > >>>>>
> > > >>> > >>>>> Setting chainable operators into different slot sharing groups will
> > > >>> > >>>>> prevent them from being chained. In the current implementation,
> > > >>> > >>>>> downstream operators, if their SSG is not explicitly specified, will
> > > >>> > >>>>> be set to the same group as the chainable upstream operators (unless
> > > >>> > >>>>> there are multiple upstream operators in different groups), to
> > > >>> > >>>>> reduce the chance of breaking chains.
> > > >>> > >>>>>
> > > >>> > >>>>>
> > > >>> > >>>>> Thinking of chainable operators o_1 -> o_2 -> o_3 -> o_4, deciding
> > > >>> > >>>>> SSGs based on whether a resource is specified we will easily get
> > > >>> > >>>>> groups like (o_1, o_3) & (o_2, o_4), where none of the operators can
> > > >>> > >>>>> be chained. This is also possible for the SSG-based approach, but I
> > > >>> > >>>>> believe the chance is much smaller because there's no strong reason
> > > >>> > >>>>> for users to specify the groups with alternate operators like that.
We are more likely to > > > >>> > get groups > > > >>> > >> like > > > >>> > >>>>> (o_1, o_2) & (o_3, o_4), where the chain breaks only > > > between > > > >>> > o_2 and > > > >>> > >> o_3. > > > >>> > >>>>> > > > >>> > >>>>> 3. It complicates the system by having two different > > > >>> > mechanisms for > > > >>> > >>>> sharing > > > >>> > >>>>> managed memory in a slot. > > > >>> > >>>>> > > > >>> > >>>>> > > > >>> > >>>>> - In FLIP-141, we introduced the intra-slot managed > > memory > > > >>> > sharing > > > >>> > >>>>> mechanism, where managed memory is first distributed > > > >>> > according to the > > > >>> > >>>>> consumer type, then further distributed across operators > > > of that > > > >>> > >> consumer > > > >>> > >>>>> type. > > > >>> > >>>>> > > > >>> > >>>>> - With the operator-based approach, managed memory size > > > >>> > specified > > > >>> > >> for an > > > >>> > >>>>> operator should account for all the consumer types of > > that > > > >>> > operator. > > > >>> > >> That > > > >>> > >>>>> means the managed memory is first distributed across > > > >>> > operators, then > > > >>> > >>>>> distributed to different consumer types of each operator. > > > >>> > >>>>> > > > >>> > >>>>> > > > >>> > >>>>> Unfortunately, the different order of the two calculation > > > >>> > steps can > > > >>> > >> lead > > > >>> > >>>> to > > > >>> > >>>>> different results. To be specific, the semantic of the > > > >>> > configuration > > > >>> > >>>> option > > > >>> > >>>>> `consumer-weights` changed (within a slot vs. within an > > > >>> > operator). 
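A small worked example may make the ordering point in (3) concrete. It is a simplified model, not the FLIP-141 implementation: the class `ManagedMemoryOrderSketch`, both methods, the consumer types X and Y, the 70/30 weights, and the even split across operators are all assumptions chosen for illustration. Operator A uses consumer types X and Y, operator B uses only X, and the slot has 100 MB of managed memory.

```java
public class ManagedMemoryOrderSketch {
    // Illustrative consumer weights for two hypothetical consumer types.
    static final int WEIGHT_X = 70;
    static final int WEIGHT_Y = 30;

    /**
     * Slot-first order: split the slot's managed memory across the consumer
     * types present in the slot, then split one type's pool evenly across the
     * operators that use that type.
     */
    static int slotFirstShare(int slotMb, int typeWeight, int totalWeightInSlot,
                              int operatorsUsingType) {
        int poolMb = slotMb * typeWeight / totalWeightInSlot;
        return poolMb / operatorsUsingType;
    }

    /**
     * Operator-first order: each operator has its own budget, which is split
     * across the consumer types that this operator uses.
     */
    static int operatorFirstShare(int operatorMb, int typeWeight,
                                  int totalWeightInOperator) {
        return operatorMb * typeWeight / totalWeightInOperator;
    }

    public static void main(String[] args) {
        int slotMb = 100;
        int totalWeight = WEIGHT_X + WEIGHT_Y;

        // Slot-first: X pool = 70 MB shared by A and B, Y pool = 30 MB used by A only.
        int aXSlotFirst = slotFirstShare(slotMb, WEIGHT_X, totalWeight, 2); // 35
        int aYSlotFirst = slotFirstShare(slotMb, WEIGHT_Y, totalWeight, 1); // 30

        // Operator-first: the user gives A and B 50 MB each.
        int aXOpFirst = operatorFirstShare(50, WEIGHT_X, totalWeight); // 35
        int aYOpFirst = operatorFirstShare(50, WEIGHT_Y, totalWeight); // 15
        int bXOpFirst = operatorFirstShare(50, WEIGHT_X, WEIGHT_X);    // 50

        System.out.println("slot-first:     A.X=" + aXSlotFirst + " A.Y=" + aYSlotFirst);
        System.out.println("operator-first: A.X=" + aXOpFirst + " A.Y=" + aYOpFirst
                + " B.X=" + bXOpFirst);
    }
}
```

With the same weights, consumer type Y ends up with 30 MB in the slot-first order but only 15 MB in the operator-first order, which is the sense in which the weights mean something different within a slot than within an operator.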
> > > >>> > >>>>> > > > >>> > >>>>> > > > >>> > >>>>> > > > >>> > >>>>> To sum up things: > > > >>> > >>>>> > > > >>> > >>>>> While (3) might be a bit more implementation related, I > > > >>> > think (1) > > > >>> > >> and (2) > > > >>> > >>>>> somehow suggest that, the price for the proposed approach > > > to > > > >>> > avoid > > > >>> > >>>>> specifying resource for every operator is that it's not > > as > > > >>> > >> independent > > > >>> > >>>> from > > > >>> > >>>>> operator chaining and slot sharing as the operator-based > > > >>> > approach > > > >>> > >>>> discussed > > > >>> > >>>>> in the FLIP. > > > >>> > >>>>> > > > >>> > >>>>> > > > >>> > >>>>> Thank you~ > > > >>> > >>>>> > > > >>> > >>>>> Xintong Song > > > >>> > >>>>> > > > >>> > >>>>> > > > >>> > >>>>> > > > >>> > >>>>> On Tue, Jan 19, 2021 at 4:29 AM Stephan Ewen > > > >>> > <[hidden email] <mailto:[hidden email]>> > > > >>> > >> wrote: > > > >>> > >>>>>> Thanks a lot, Yangze and Xintong for this FLIP. > > > >>> > >>>>>> > > > >>> > >>>>>> I want to say, first of all, that this is super well > > > >>> > written. And > > > >>> > >> the > > > >>> > >>>>>> points that the FLIP makes about how to expose the > > > >>> > configuration to > > > >>> > >>>> users > > > >>> > >>>>>> is exactly the right thing to figure out first. > > > >>> > >>>>>> So good job here! > > > >>> > >>>>>> > > > >>> > >>>>>> About how to let users specify the resource profiles. > > If I > > > >>> > can sum > > > >>> > >> the > > > >>> > >>>>> FLIP > > > >>> > >>>>>> and previous discussion up in my own words, the problem > > > is the > > > >>> > >>>> following: > > > >>> > >>>>>> Operator-level specification is the simplest and > > cleanest > > > >>> > approach, > > > >>> > >>>>> because > > > >>> > >>>>>>> it avoids mixing operator configuration (resource) and > > > >>> > >> scheduling. 
No > > > >>> > >>>>>>> matter what other parameters change (chaining, slot > > > sharing, > > > >>> > >>>> switching > > > >>> > >>>>>>> pipelined and blocking shuffles), the resource profiles > > > >>> > stay the > > > >>> > >>>> same. > > > >>> > >>>>>>> But it would require that a user specifies resources on > > > all > > > >>> > >>>> operators, > > > >>> > >>>>>>> which makes it hard to use. That's why the FLIP > > suggests > > > going > > > >>> > >> with > > > >>> > >>>>>>> specifying resources on a Sharing-Group. > > > >>> > >>>>>> > > > >>> > >>>>>> I think both thoughts are important, so can we find a > > > solution > > > >>> > >> where > > > >>> > >>>> the > > > >>> > >>>>>> Resource Profiles are specified on an Operator, but we > > > >>> > still avoid > > > >>> > >> that > > > >>> > >>>>> we > > > >>> > >>>>>> need to specify a resource profile on every operator? > > > >>> > >>>>>> > > > >>> > >>>>>> What do you think about something like the following: > > > >>> > >>>>>> - Resource Profiles are specified on an operator > > level. > > > >>> > >>>>>> - Not all operators need profiles > > > >>> > >>>>>> - All Operators without a Resource Profile ended up > > in > > > the > > > >>> > >> default > > > >>> > >>>> slot > > > >>> > >>>>>> sharing group with a default profile (will get a default > > > slot). > > > >>> > >>>>>> - All Operators with a Resource Profile will go into > > > >>> > another slot > > > >>> > >>>>> sharing > > > >>> > >>>>>> group (the resource-specified-group). > > > >>> > >>>>>> - Users can define different slot sharing groups for > > > >>> > operators > > > >>> > >> like > > > >>> > >>>>> they > > > >>> > >>>>>> do now, with the exception that you cannot mix operators > > > >>> > that have > > > >>> > >> a > > > >>> > >>>>>> resource profile and operators that have no resource > > > profile. 
> > > >>> > >>>>>> - The default case where no operator has a resource > > > >>> > profile is > > > >>> > >> just a > > > >>> > >>>>>> special case of this model > > > >>> > >>>>>> - The chaining logic sums up the profiles per > > operator, > > > >>> > like it > > > >>> > >> does > > > >>> > >>>>> now, > > > >>> > >>>>>> and the scheduler sums up the profiles of the tasks that > > > it > > > >>> > >> schedules > > > >>> > >>>>>> together. > > > >>> > >>>>>> > > > >>> > >>>>>> > > > >>> > >>>>>> There is another question about reactive scaling raised > > > in the > > > >>> > >> FLIP. I > > > >>> > >>>>> need > > > >>> > >>>>>> to think a bit about that. That is indeed a bit more > > > tricky > > > >>> > once we > > > >>> > >>>> have > > > >>> > >>>>>> slots of different sizes. > > > >>> > >>>>>> It is not clear then which of the different slot > > requests > > > the > > > >>> > >>>>>> ResourceManager should fulfill when new resources (TMs) > > > >>> > show up, > > > >>> > >> or how > > > >>> > >>>>> the > > > >>> > >>>>>> JobManager redistributes the slots resources when > > > resources > > > >>> > (TMs) > > > >>> > >>>>> disappear > > > >>> > >>>>>> This question is pretty orthogonal, though, to the "how > > to > > > >>> > specify > > > >>> > >> the > > > >>> > >>>>>> resources". > > > >>> > >>>>>> > > > >>> > >>>>>> > > > >>> > >>>>>> Best, > > > >>> > >>>>>> Stephan > > > >>> > >>>>>> > > > >>> > >>>>>> On Fri, Jan 8, 2021 at 5:14 AM Xintong Song > > > >>> > <[hidden email] <mailto:[hidden email]> > > > >>> > >>>>> wrote: > > > >>> > >>>>>>> Thanks for drafting the FLIP and driving the > > discussion, > > > >>> > Yangze. > > > >>> > >>>>>>> And Thanks for the feedback, Till and Chesnay. 
> > > >>> > >>>>>>> > > > >>> > >>>>>>> @Till, > > > >>> > >>>>>>> > > > >>> > >>>>>>> I agree that specifying requirements for SSGs means > > that > > > SSGs > > > >>> > >> need to > > > >>> > >>>>> be > > > >>> > >>>>>>> supported in fine-grained resource management, > > otherwise > > > each > > > >>> > >>>> operator > > > >>> > >>>>>>> might use as many resources as the whole group. > > However, > > > I > > > >>> > cannot > > > >>> > >>>> think > > > >>> > >>>>>> of > > > >>> > >>>>>>> a strong reason for not supporting SSGs in fine-grained > > > >>> > resource > > > >>> > >>>>>>> management. > > > >>> > >>>>>>> > > > >>> > >>>>>>> > > > >>> > >>>>>>>> Interestingly, if all operators have their resources > > > properly > > > >>> > >>>>>> specified, > > > >>> > >>>>>>>> then slot sharing is no longer needed because Flink > > > could > > > >>> > >> slice off > > > >>> > >>>>> the > > > >>> > >>>>>>>> appropriately sized slots for every Task individually. > > > >>> > >>>>>>>> > > > >>> > >>>>>>> So for example, if we have a job consisting of two > > > >>> > operator op_1 > > > >>> > >> and > > > >>> > >>>>> op_2 > > > >>> > >>>>>>>> where each op needs 100 MB of memory, we would then > > say > > > that > > > >>> > >> the > > > >>> > >>>> slot > > > >>> > >>>>>>>> sharing group needs 200 MB of memory to run. If we > > have > > > a > > > >>> > >> cluster > > > >>> > >>>>> with > > > >>> > >>>>>> 2 > > > >>> > >>>>>>>> TMs with one slot of 100 MB each, then the system > > > cannot run > > > >>> > >> this > > > >>> > >>>>> job. > > > >>> > >>>>>> If > > > >>> > >>>>>>>> the resources were specified on an operator level, > > then > > > the > > > >>> > >> system > > > >>> > >>>>>> could > > > >>> > >>>>>>>> still make the decision to deploy op_1 to TM_1 and > > op_2 > > > to > > > >>> > >> TM_2. 
> > > >>> > >>>>>>> > > > >>> > >>>>>>> Couldn't agree more that if all operators' requirements > > > are > > > >>> > >> properly > > > >>> > >>>>>>> specified, slot sharing should be no longer needed. I > > > >>> > think this > > > >>> > >>>>> exactly > > > >>> > >>>>>>> disproves the example. If we already know op_1 and op_2 > > > each > > > >>> > >> needs > > > >>> > >>>> 100 > > > >>> > >>>>> MB > > > >>> > >>>>>>> of memory, why would we put them in the same group? If > > > >>> > they are > > > >>> > >> in > > > >>> > >>>>>> separate > > > >>> > >>>>>>> groups, with the proposed approach the system can > > freely > > > >>> > deploy > > > >>> > >> them > > > >>> > >>>> to > > > >>> > >>>>>>> either a 200 MB TM or two 100 MB TMs. > > > >>> > >>>>>>> > > > >>> > >>>>>>> Moreover, the precondition for not needing slot sharing > > > is > > > >>> > having > > > >>> > >>>>>> resource > > > >>> > >>>>>>> requirements properly specified for all operators. This > > > is not > > > >>> > >> always > > > >>> > >>>>>>> possible, and usually requires tremendous efforts. One > > > of the > > > >>> > >>>> benefits > > > >>> > >>>>>> for > > > >>> > >>>>>>> SSG-based requirements is that it allows the user to > > > freely > > > >>> > >> decide > > > >>> > >>>> the > > > >>> > >>>>>>> granularity, thus efforts they want to pay. I would > > > >>> > consider SSG > > > >>> > >> in > > > >>> > >>>>>>> fine-grained resource management as a group of > > operators > > > >>> > that the > > > >>> > >>>> user > > > >>> > >>>>>>> would like to specify the total resource for. There can > > > be > > > >>> > only > > > >>> > >> one > > > >>> > >>>>> group > > > >>> > >>>>>>> in the job, 2~3 groups dividing the job into a few > > major > > > >>> > parts, > > > >>> > >> or as > > > >>> > >>>>>> many > > > >>> > >>>>>>> groups as the number of tasks/operators, depending on > > how > > > >>> > >>>> fine-grained > > > >>> > >>>>>> the > > > >>> > >>>>>>> user is able to specify the resources. 
> > > >>> > >>>>>>> > > > >>> > >>>>>>> Having to support SSGs might be a constraint. But given > > > >>> > that all > > > >>> > >> the > > > >>> > >>>>>>> current scheduler implementations already support > > SSGs, I > > > >>> > tend to > > > >>> > >>>> think > > > >>> > >>>>>>> that as an acceptable price for the above discussed > > > >>> > usability and > > > >>> > >>>>>>> flexibility. > > > >>> > >>>>>>> > > > >>> > >>>>>>> @Chesnay > > > >>> > >>>>>>> > > > >>> > >>>>>>> Will declaring them on slot sharing groups not also > > waste > > > >>> > >> resources > > > >>> > >>>> if > > > >>> > >>>>>> the > > > >>> > >>>>>>>> parallelism of operators within that group are > > > different? > > > >>> > >>>>>>>> > > > >>> > >>>>>>> Yes. It's a trade-off between usability and resource > > > >>> > >> utilization. To > > > >>> > >>>>>> avoid > > > >>> > >>>>>>> such wasting, the user can define more groups, so that > > > >>> > each group > > > >>> > >>>>>> contains > > > >>> > >>>>>>> less operators and the chance of having operators with > > > >>> > different > > > >>> > >>>>>>> parallelism will be reduced. The price is to have more > > > >>> > resource > > > >>> > >>>>>>> requirements to specify. > > > >>> > >>>>>>> > > > >>> > >>>>>>> It also seems like quite a hassle for users having to > > > >>> > >> recalculate the > > > >>> > >>>>>>>> resource requirements if they change the slot sharing. > > > >>> > >>>>>>>> I'd think that it's not really workable for users that > > > create > > > >>> > >> a set > > > >>> > >>>>> of > > > >>> > >>>>>>>> re-usable operators which are mixed and matched in > > their > > > >>> > >>>>> applications; > > > >>> > >>>>>>>> managing the resources requirements in such a setting > > > >>> > would be > > > >>> > >> a > > > >>> > >>>>>>>> nightmare, and in the end would require operator-level > > > >>> > >> requirements > > > >>> > >>>>> any > > > >>> > >>>>>>>> way. 
> > > >>> > >>>>>>>> In that sense, I'm not even sure whether it really > > > increases > > > >>> > >>>>> usability. > > > >>> > >>>>>>> - As mentioned in my reply to Till's comment, > > > there's no > > > >>> > >> reason to > > > >>> > >>>>> put > > > >>> > >>>>>>> multiple operators whose individual resource > > > >>> > requirements are > > > >>> > >>>>> already > > > >>> > >>>>>>> known > > > >>> > >>>>>>> into the same group in fine-grained resource > > > management. > > > >>> > >>>>>>> - Even an operator implementation is reused for > > > multiple > > > >>> > >>>>> applications, > > > >>> > >>>>>>> it does not guarantee the same resource > > requirements. > > > >>> > During > > > >>> > >> our > > > >>> > >>>>> years > > > >>> > >>>>>>> of > > > >>> > >>>>>>> practices in Alibaba, with per-operator > > requirements > > > >>> > >> specified for > > > >>> > >>>>>>> Blink's > > > >>> > >>>>>>> fine-grained resource management, very few users > > > >>> > (including > > > >>> > >> our > > > >>> > >>>>>>> specialists > > > >>> > >>>>>>> who are dedicated to supporting Blink users) are as > > > >>> > >> experienced as > > > >>> > >>>>> to > > > >>> > >>>>>>> accurately predict/estimate the operator resource > > > >>> > >> requirements. > > > >>> > >>>> Most > > > >>> > >>>>>>> people > > > >>> > >>>>>>> rely on the execution-time metrics (throughput, > > > delay, cpu > > > >>> > >> load, > > > >>> > >>>>>> memory > > > >>> > >>>>>>> usage, GC pressure, etc.) to improve the > > > specification. > > > >>> > >>>>>>> > > > >>> > >>>>>>> To sum up: > > > >>> > >>>>>>> If the user is capable of providing proper resource > > > >>> > requirements > > > >>> > >> for > > > >>> > >>>>>> every > > > >>> > >>>>>>> operator, that's definitely a good thing and we would > > not > > > >>> > need to > > > >>> > >>>> rely > > > >>> > >>>>> on > > > >>> > >>>>>>> the SSGs. 
However, that shouldn't be a *must* for the > > > >>> > >> fine-grained > > > >>> > >>>>>> resource > > > >>> > >>>>>>> management to work. For those users who are capable and > > > do not > > > >>> > >> like > > > >>> > >>>>>> having > > > >>> > >>>>>>> to set each operator to a separate SSG, I would be ok > > to > > > have > > > >>> > >> both > > > >>> > >>>>>>> SSG-based and operator-based runtime interfaces and to > > > only > > > >>> > >> fallback > > > >>> > >>>> to > > > >>> > >>>>>> the > > > >>> > >>>>>>> SSG requirements when the operator requirements are not > > > >>> > >> specified. > > > >>> > >>>>>> However, > > > >>> > >>>>>>> as the first step, I think we should prioritise the use > > > cases > > > >>> > >> where > > > >>> > >>>>> users > > > >>> > >>>>>>> are not that experienced. > > > >>> > >>>>>>> > > > >>> > >>>>>>> Thank you~ > > > >>> > >>>>>>> > > > >>> > >>>>>>> Xintong Song > > > >>> > >>>>>>> > > > >>> > >>>>>>> On Thu, Jan 7, 2021 at 9:55 PM Chesnay Schepler < > > > >>> > >> [hidden email] <mailto:[hidden email]>> > > > >>> > >>>>>>> wrote: > > > >>> > >>>>>>> > > > >>> > >>>>>>>> Will declaring them on slot sharing groups not also > > > waste > > > >>> > >> resources > > > >>> > >>>>> if > > > >>> > >>>>>>>> the parallelism of operators within that group are > > > different? > > > >>> > >>>>>>>> > > > >>> > >>>>>>>> It also seems like quite a hassle for users having to > > > >>> > >> recalculate > > > >>> > >>>> the > > > >>> > >>>>>>>> resource requirements if they change the slot sharing. 
> > > >>> > >>>>>>>> I'd think that it's not really workable for users that > > > create > > > >>> > >> a set > > > >>> > >>>>> of > > > >>> > >>>>>>>> re-usable operators which are mixed and matched in > > their > > > >>> > >>>>> applications; > > > >>> > >>>>>>>> managing the resources requirements in such a setting > > > >>> > would be > > > >>> > >> a > > > >>> > >>>>>>>> nightmare, and in the end would require operator-level > > > >>> > >> requirements > > > >>> > >>>>> any > > > >>> > >>>>>>>> way. > > > >>> > >>>>>>>> In that sense, I'm not even sure whether it really > > > increases > > > >>> > >>>>> usability. > > > >>> > >>>>>>>> My main worry is that it if we wire the runtime to > > work > > > >>> > on SSGs > > > >>> > >>>> it's > > > >>> > >>>>>>>> gonna be difficult to implement more fine-grained > > > approaches, > > > >>> > >> which > > > >>> > >>>>>>>> would not be the case if, for the runtime, they are > > > always > > > >>> > >> defined > > > >>> > >>>> on > > > >>> > >>>>>> an > > > >>> > >>>>>>>> operator-level. > > > >>> > >>>>>>>> > > > >>> > >>>>>>>> On 1/7/2021 2:42 PM, Till Rohrmann wrote: > > > >>> > >>>>>>>>> Thanks for drafting this FLIP and starting this > > > discussion > > > >>> > >>>> Yangze. > > > >>> > >>>>>>>>> I like that defining resource requirements on a slot > > > sharing > > > >>> > >>>> group > > > >>> > >>>>>>> makes > > > >>> > >>>>>>>>> the overall setup easier and improves usability of > > > resource > > > >>> > >>>>>>> requirements. > > > >>> > >>>>>>>>> What I do not like about it is that it changes slot > > > sharing > > > >>> > >>>> groups > > > >>> > >>>>>> from > > > >>> > >>>>>>>>> being a scheduling hint to something which needs to > > be > > > >>> > >> supported > > > >>> > >>>> in > > > >>> > >>>>>>> order > > > >>> > >>>>>>>>> to support fine grained resource requirements. 
So > > far, > > > the > > > >>> > >> idea > > > >>> > >>>> of > > > >>> > >>>>>> slot > > > >>> > >>>>>>>>> sharing groups was that it tells the system that a > > set > > > of > > > >>> > >>>> operators > > > >>> > >>>>>> can > > > >>> > >>>>>>>> be > > > >>> > >>>>>>>>> deployed in the same slot. But the system still had > > the > > > >>> > >> freedom > > > >>> > >>>> to > > > >>> > >>>>>> say > > > >>> > >>>>>>>> that > > > >>> > >>>>>>>>> it would rather place these tasks in different slots > > > if it > > > >>> > >>>> wanted. > > > >>> > >>>>> If > > > >>> > >>>>>>> we > > > >>> > >>>>>>>>> now specify resource requirements on a per slot > > sharing > > > >>> > >> group, > > > >>> > >>>> then > > > >>> > >>>>>> the > > > >>> > >>>>>>>>> only option for a scheduler which does not support > > slot > > > >>> > >> sharing > > > >>> > >>>>>> groups > > > >>> > >>>>>>> is > > > >>> > >>>>>>>>> to say that every operator in this slot sharing group > > > >>> > needs a > > > >>> > >>>> slot > > > >>> > >>>>>> with > > > >>> > >>>>>>>> the > > > >>> > >>>>>>>>> same resources as the whole group. > > > >>> > >>>>>>>>> > > > >>> > >>>>>>>>> So for example, if we have a job consisting of two > > > operator > > > >>> > >> op_1 > > > >>> > >>>>> and > > > >>> > >>>>>>> op_2 > > > >>> > >>>>>>>>> where each op needs 100 MB of memory, we would then > > > say that > > > >>> > >> the > > > >>> > >>>>> slot > > > >>> > >>>>>>>>> sharing group needs 200 MB of memory to run. If we > > > have a > > > >>> > >> cluster > > > >>> > >>>>>> with > > > >>> > >>>>>>> 2 > > > >>> > >>>>>>>>> TMs with one slot of 100 MB each, then the system > > > cannot run > > > >>> > >> this > > > >>> > >>>>>> job. > > > >>> > >>>>>>> If > > > >>> > >>>>>>>>> the resources were specified on an operator level, > > > then the > > > >>> > >>>> system > > > >>> > >>>>>>> could > > > >>> > >>>>>>>>> still make the decision to deploy op_1 to TM_1 and > > > op_2 to > > > >>> > >> TM_2. 
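Till's 100 MB / 200 MB example above can be checked with a small toy model. This is only a sketch under simplifying assumptions: memory is the only resource, each requirement needs a slot of its own, and slots cannot be resized. The class and method names (`SlotFitSketch`, `canPlace`) are made up for illustration and are not Flink APIs.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

/** Toy model only: memory in MB, one requirement per slot, no slot resizing. */
final class SlotFitSketch {

    /**
     * Returns true if every requirement can be given its own slot.
     * With one requirement per slot, sorting both lists in descending
     * order and comparing them pairwise is enough to decide feasibility.
     */
    static boolean canPlace(List<Integer> requirementsMb, List<Integer> slotSizesMb) {
        if (requirementsMb.size() > slotSizesMb.size()) {
            return false;
        }
        List<Integer> reqs = new ArrayList<>(requirementsMb);
        List<Integer> slots = new ArrayList<>(slotSizesMb);
        reqs.sort(Comparator.reverseOrder());
        slots.sort(Comparator.reverseOrder());
        for (int i = 0; i < reqs.size(); i++) {
            if (reqs.get(i) > slots.get(i)) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        // Two TMs with one 100 MB slot each, as in the example.
        List<Integer> cluster = List.of(100, 100);
        // Requirement declared on the SSG: a single 200 MB slot is needed.
        System.out.println(canPlace(List.of(200), cluster));      // false
        // Requirements declared per operator: op_1 and op_2 need 100 MB each.
        System.out.println(canPlace(List.of(100, 100), cluster)); // true
    }
}
```

The same total amount of memory is requested in both calls; only the granularity at which it is declared decides whether this toy scheduler can place the job.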
> > > >>> > >>>>>>>>> Originally, one of the primary goals of slot sharing > > > groups > > > >>> > >> was > > > >>> > >>>> to > > > >>> > >>>>>> make > > > >>> > >>>>>>>> it > > > >>> > >>>>>>>>> easier for the user to reason about how many slots a > > > job > > > >>> > >> needs > > > >>> > >>>>>>>> independent > > > >>> > >>>>>>>>> of the actual number of operators in the job. > > > Interestingly, > > > >>> > >> if > > > >>> > >>>> all > > > >>> > >>>>>>>>> operators have their resources properly specified, > > > then slot > > > >>> > >>>>> sharing > > > >>> > >>>>>> is > > > >>> > >>>>>>>> no > > > >>> > >>>>>>>>> longer needed because Flink could slice off the > > > >>> > appropriately > > > >>> > >>>> sized > > > >>> > >>>>>>> slots > > > >>> > >>>>>>>>> for every Task individually. What matters is whether > > > the > > > >>> > >> whole > > > >>> > >>>>>> cluster > > > >>> > >>>>>>>> has > > > >>> > >>>>>>>>> enough resources to run all tasks or not. > > > >>> > >>>>>>>>> > > > >>> > >>>>>>>>> Cheers, > > > >>> > >>>>>>>>> Till > > > >>> > >>>>>>>>> > > > >>> > >>>>>>>>> On Thu, Jan 7, 2021 at 4:08 AM Yangze Guo < > > > >>> > >> [hidden email] <mailto:[hidden email]>> > > > >>> > >>>>>> wrote: > > > >>> > >>>>>>>>>> Hi, there, > > > >>> > >>>>>>>>>> > > > >>> > >>>>>>>>>> We would like to start a discussion thread on > > > "FLIP-156: > > > >>> > >> Runtime > > > >>> > >>>>>>>>>> Interfaces for Fine-Grained Resource > > Requirements"[1], > > > >>> > >> where we > > > >>> > >>>>>>>>>> propose Slot Sharing Group (SSG) based runtime > > > interfaces > > > >>> > >> for > > > >>> > >>>>>>>>>> specifying fine-grained resource requirements. > > > >>> > >>>>>>>>>> > > > >>> > >>>>>>>>>> In this FLIP: > > > >>> > >>>>>>>>>> - Expound the user story of fine-grained resource > > > >>> > >> management. > > > >>> > >>>>>>>>>> - Propose runtime interfaces for specifying > > SSG-based > > > >>> > >> resource > > > >>> > >>>>>>>>>> requirements. 
> > > >>> > >>>>>>>>>> - Discuss the pros and cons of the three potential granularities for specifying the resource requirements (op, task and slot sharing group) and explain why we choose the slot sharing group. > > > >>> > >>>>>>>>>> > > > >>> > >>>>>>>>>> Please find more details in the FLIP wiki document [1]. Looking forward to your feedback. > > > >>> > >>>>>>>>>> > > > >>> > >>>>>>>>>> [1] https://cwiki.apache.org/confluence/display/FLINK/FLIP-156%3A+Runtime+Interfaces+for+Fine-Grained+Resource+Requirements > > > >>> > >>>>>>>>>> > > > >>> > >>>>>>>>>> Best, > > > >>> > >>>>>>>>>> Yangze Guo
Hi, Kezhu.
Thanks for your feedback.

> Flink *runtime* already has an ideal granularity for resource management 'task'.

As mentioned in the FLIP, there is some legacy code in the Flink code base, but that code has never really been used or exposed to users. So there are actually no operator-level or SSG-level resource requirements at the moment, but the slot is already the basic unit for resource management in Flink's runtime.

> that SSG resource specifying will override operator level resource specifying if both are specified

We now treat operator-level resource specification as a potential follow-up for fine-grained resource management. We need to collect more feedback to decide whether we really need it. Regarding whether and how to allow a hybrid (SSG + OP) configuration, I think there might be no point in discussing it at present. IIUC, your proposal is based on the assumption that we already have operator-level resource configuration, and it aims to determine the slot resource spec when both configurations exist.
- First, we are not sure that we need operator-level resource configuration atm.
- Second, we are not even sure whether we need to support a hybrid configuration.

So, as written in the future plan, I tend to first collect feedback on the operator-level resource configuration interface once fine-grained resource management is ready. Then we can consider further optimizations, such as your proposal.

Best,
Yangze Guo

On Wed, Feb 3, 2021 at 1:28 PM Kezhu Wang <[hidden email]> wrote: > > Hi all, sorry for joining the discussion even after voting started. > > I want to share my thoughts on this after reading the above discussions. > > I think Flink *runtime* already has an ideal granularity for resource > management 'task'. If there is > a slot shared by multiple tasks, that slot's resource requirement is simply the > sum of all its logical > slots. So basically, there is no resource requirement for SlotSharingGroup > in runtime until now, > right?
> > As in the discussion, we already agree that: "If all operators have their resources properly specified, then slot sharing is no longer needed." > > So it seems to me that the natural question to discuss is: how do we bridge impractical operator level resource specifying to the runtime's task level resource requirement? This is actually a pure API thing, as Chesnay has pointed out. > > But FLIP-156 brings another direction to the table: how about using SSG for both API and runtime resource specifying? > > From the FLIP and the discussion, I assume that SSG resource specifying will override operator level resource specifying if both are specified? > > So, I wonder whether we could interpret SSG resource specifying as an "add" rather than a "set" on the resource requirement? > > The semantics is that SSG resource specifying adds additional resources to the shared slot, to express concerns about possibly high throughput and resource requirements for the tasks in one physical slot. > > The result is that, if the scheduler indeed respects slot sharing, the allocated slot will gain the extra resources specified for that SSG. > > I think one coding barrier for the "add" approach is ResourceSpec.UNKNOWN, which doesn't support the 'merge' operation. I tend to use ResourceSpec.ZERO as the default; the task executor should be aware of this. > > @Chesnay > > My main worry is that if we wire the runtime to work on SSGs it's > > gonna be difficult to implement more fine-grained approaches, which > > would not be the case if, for the runtime, they are always defined on an > > operator-level. > > An "add" operation should be less invasive and pose a low barrier for future fine-grained approaches. > > @Stephan > > - Users can define different slot sharing groups for operators like they > > do now, with the exception that you cannot mix operators that have a > > resource profile and operators that have no resource profile.
> > @Till > > This effectively means that all unspecified operators > > will implicitly have a zero resource requirement. > > I am wondering whether this wouldn't lead to a surprising behaviour for the > > user. If the user specifies the resource requirements for a single > > operator, then he probably will assume that the other operators will get > > the default share of resources and not nothing. > > I think this is inherent, due to the fact that we cannot define ResourceSpec.ONE, e.g. the resource requirement for exactly one default slot, with concrete numbers? I tend to squash out the unspecified ones if there are operators in the chain with explicit resource specifying. Otherwise, the protocol tends to be verbose, as in "give me this much resource and a default". I think if we have explicit resource specifying for only some of the operators, it is just saying "I don't care about the other operators that much, just get them places to run". Most likely these are cases with stateless filter/map or other less resource-consuming operators. If there is indeed a problem, I think clients can specify a global default (or a default at another level in the future). In the job graph generating phase, we could take that default into account for unspecified operators. > > @FLIP-156 > > Expose operator chaining. (Cons of task level resource specifying) > > Isn't that inherent to all group level resource specifying? It will either break chaining or obey it, or even not work with it at all. > > To sum up the above, my suggestions are: > > On the API side: > * StreamExecutionEnvironment: A global default (ResourceSpec.ZERO if unspecified). > * Operator: ResourceSpec.ZERO (unspecified) as default. > * Task: sum of requirements from specified operators + global default (if there are any unspecified operators) > * SSG: additional resource to physical slot.
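One possible reading of the "add" semantics suggested in the list above, combined with the slot rule stated right after it in the same message, can be sketched as follows. This is a toy model, not the FLIP-156 design and not Flink code: a single `long` (MB) stands in for a full ResourceSpec, `0` plays the role of ResourceSpec.ZERO ("unspecified"), and the names `AddSemanticsSketch`, `taskRequirement` and `slotRequirement` are hypothetical. It also assumes the global default is added once per task, not once per unspecified operator.

```java
/** Toy model of the "add" semantics: a single long (MB) stands in for a ResourceSpec. */
final class AddSemanticsSketch {

    /** Plays the role of ResourceSpec.ZERO, i.e. "unspecified". */
    static final long ZERO = 0L;

    /**
     * Task requirement: sum of the specified operators, plus the global
     * default once if at least one operator in the task is unspecified.
     */
    static long taskRequirement(long[] operatorMb, long globalDefaultMb) {
        long sum = 0L;
        boolean anyUnspecified = false;
        for (long mb : operatorMb) {
            if (mb == ZERO) {
                anyUnspecified = true;
            } else {
                sum += mb;
            }
        }
        return anyUnspecified ? sum + globalDefaultMb : sum;
    }

    /**
     * Shared slot requirement: sum of its tasks plus the extra amount
     * declared on the SSG ("add", not "set"). A result of ZERO would
     * mean "just use a default sized slot".
     */
    static long slotRequirement(long[] taskMb, long ssgExtraMb) {
        long sum = ssgExtraMb;
        for (long mb : taskMb) {
            sum += mb;
        }
        return sum;
    }

    public static void main(String[] args) {
        long t1 = taskRequirement(new long[] {100, ZERO, 50}, 20); // 100 + 50 + default 20
        long t2 = taskRequirement(new long[] {64, 64}, 20);        // all specified, no default
        System.out.println(t1);                                        // 170
        System.out.println(t2);                                        // 128
        System.out.println(slotRequirement(new long[] {t1, t2}, 32)); // 330
    }
}
```

Under "set" semantics the SSG value would replace the per-task sum; under the "add" reading sketched here it only tops it up, so operator-level values stay meaningful when both are given.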
> > On the runtime side: > * Task: ResourceSpec.Task or ResourceSpec.ZERO > * SSG: ResourceSpec.SSG or ResourceSpec.ZERO > > A physical slot gets the summed-up resources from its logical slots and the SSG; if it gets ResourceSpec.ZERO, it is just a default sized slot. > > In short, turn SSG resource specifying into an "add" and drop ResourceSpec.UNKNOWN. > > > Questions/Issues: > * Could an SSG express a negative resource requirement? > * Is there a concrete barrier that prevents partially configured resources from working? I saw it will fail job submission in Dispatcher.submitJob. > * An option (cluster/job level) to force slot sharing in the scheduler? This could be useful in case of migration from FLIP-156 to a future approach. > * An option (cluster) to ignore resource specifying (allow a resource-specified job to run on an out-of-the-box environment) for non-production usage? > > > > On February 1, 2021 at 11:54:10, Yangze Guo ([hidden email]) wrote: > > Thanks for the replies, Till and Xintong! > > I updated the FLIP, including: > - Edit the JavaDoc of the proposed > StreamGraphGenerator#setSlotSharingGroupResource. > - Add "Future Plan" section, which contains the potential follow-up > issues and the limitations to be documented when fine-grained resource > management is exposed to users. > > I'll start a vote in another thread. > > Best, > Yangze Guo > > On Fri, Jan 29, 2021 at 10:07 PM Till Rohrmann <[hidden email]> wrote: > > > > Thanks for summarizing the discussion, Yangze. I agree that setting > > resource requirements per operator is not very user friendly. Moreover, I > > couldn't come up with a different proposal which would be as easy to use > > and wouldn't expose internal scheduling details. In fact, following this > > argument then we shouldn't have exposed the slot sharing groups in the > > first place. > > > > What is important for the user is that we properly document the limitations > > and constraints the fine grained resource specification has.
For example, > > we should explain how optimizations like chaining are affected by it and > > how different execution modes (batch vs. streaming) affect the execution > of > > operators which have specified resources. These things shouldn't become > > part of the contract of this feature and are more caused by internal > > implementation details but it will be important to understand these > things > > properly in order to use this feature effectively. > > > > Hence, +1 for starting the vote for this FLIP. > > > > Cheers, > > Till > > > > On Tue, Jan 26, 2021 at 4:37 AM Xintong Song <[hidden email]> > wrote: > > > > > Thanks for the summary, Yangze. > > > > > > The changes and follow-up issues LGTM. Let's wait for responses from > the > > > others before starting a vote. > > > > > > Thank you~ > > > > > > Xintong Song > > > > > > > > > > > > On Tue, Jan 26, 2021 at 11:08 AM Yangze Guo <[hidden email]> wrote: > > > > > > > Thanks everyone for the lively discussion. I'd like to try to > > > > summarize the current convergence in the discussion. Please let me > > > > know if I got things wrong or missed something crucial here. > > > > > > > > Change of this FLIP: > > > > - Treat the SSG resource requirements as a hint instead of a > > > > restriction for the runtime. That's should be explicitly explained in > > > > the JavaDocs. > > > > > > > > Potential follow-up issues if needed: > > > > - Provide operator-level resource configuration interface. > > > > - Provide multiple options for deciding resources for SSGs whose > > > > requirement is not specified: > > > > ** Default slot resource. > > > > ** Default operator resource times number of operators. > > > > > > > > If there are no other issues, I'll update the FLIP accordingly and > > > > start a vote thread. Thanks all for the valuable feedback again. 
> > > > > > > > Best, > > > > Yangze Guo > > > > > > > > Best, > > > > Yangze Guo > > > > > > > > > > > > On Fri, Jan 22, 2021 at 11:30 AM Xintong Song <[hidden email]> > > > > wrote: > > > > > > > > > > > > > > > FGRuntimeInterface.png > > > > > > > > > > Thank you~ > > > > > > > > > > Xintong Song > > > > > > > > > > > > > > > > > > > > On Fri, Jan 22, 2021 at 11:11 AM Xintong Song <[hidden email]> > > > > > wrote: > > > > >> > > > > >> I think Chesnay's proposal could actually work. IIUC, the keypoint > is > > > > to derive operator requirements from SSG requirements on the API > side, so > > > > that the runtime only deals with operator requirements. It's > debatable > > > how > > > > the deriving should be done though. E.g., an alternative could be to > > > evenly > > > > divide the SSG requirement into requirements of operators in the > group. > > > > >> > > > > >> > > > > >> However, I'm not entirely sure which option is more desired. > > > > Illustrating my understanding in the following figure, in which on > the > > > top > > > > is Chesnay's proposal and on the bottom is the SSG-based proposal in > this > > > > FLIP. > > > > >> > > > > >> > > > > >> > > > > >> I think the major difference between the two approaches is where > > > > deriving operator requirements from SSG requirements happens. > > > > >> > > > > >> - Chesnay's proposal simplifies the runtime logic and the > interface to > > > > expose, at the price of moving more complexity (i.e. the deriving) to > the > > > > API side. The question is, where do we prefer to keep the complexity? > I'm > > > > slightly leaning towards having a thin API and keep the complexity in > > > > runtime if possible. > > > > >> > > > > >> - Notice that the dash line arrows represent optional steps that > are > > > > needed only for schedulers that do not respect SSGs, which we don't > have > > > at > > > > the moment. 
If we only look at the solid line arrows, then the SSG-based approach is much simpler, without needing to derive and aggregate the requirements back and forth. I'm not sure about complicating the current design only for the potential future needs. > > > > >> > > > > >> > > > > >> Thank you~ > > > > >> > > > > >> Xintong Song > > > > >> > > > > >> > > > > >> On Fri, Jan 22, 2021 at 7:35 AM Chesnay Schepler <[hidden email]> wrote: > > > > >>> > > > > >>> You're raising a good point, but I think I can rectify that with a minor adjustment. > > > > >>> > > > > >>> Default requirements are whatever the default requirements are, setting the requirements for one operator has no effect on other operators. > > > > >>> > > > > >>> With these rules, and some API enhancements, the following mockup would replicate the SSG-based behavior: > > > > >>> > > > > >>> Map<SlotSharingGroupId, Requirements> requirements = ... > > > > >>> for slotSharingGroup in env.getSlotSharingGroups() { > > > > >>> vertices = slotSharingGroup.getVertices() > > > > >>> vertices.first().setRequirements(requirements.get(slotSharingGroup.getID())) > > > > >>> vertices.remaining().setRequirements(ZERO) > > > > >>> } > > > > >>> > > > > >>> We could even allow setting requirements on slotsharing-groups > > > > >>> colocation-groups and internally translate them accordingly. > > > > >>> I can't help but feel this is a plain API issue. > > > > >>> > > > > >>> On 1/21/2021 9:44 AM, Till Rohrmann wrote: > > > > >>> > If I understand you correctly Chesnay, then you want to decouple the > > > > >>> > resource requirement specification from the slot sharing group > > > > >>> > assignment. Hence, per default all operators would be in the same slot > > > > >>> > sharing group.
If there is no operator with a resource > > > specification, > > > > >>> > then the system would allocate a default slot for it. If there > is > > > at > > > > >>> > least one operator, then the system would sum up all the > specified > > > > >>> > resources and allocate a slot of this size. This effectively > means > > > > >>> > that all unspecified operators will implicitly have a zero > resource > > > > >>> > requirement. Did I understand your idea correctly? > > > > >>> > > > > > >>> > I am wondering whether this wouldn't lead to a surprising > behaviour > > > > >>> > for the user. If the user specifies the resource requirements > for a > > > > >>> > single operator, then he probably will assume that the other > > > > operators > > > > >>> > will get the default share of resources and not nothing. > > > > >>> > > > > > >>> > Cheers, > > > > >>> > Till > > > > >>> > > > > > >>> > On Thu, Jan 21, 2021 at 3:25 AM Chesnay Schepler < > > > [hidden email] > > > > >>> > <mailto:[hidden email]>> wrote: > > > > >>> > > > > > >>> > Is there even a functional difference between specifying the > > > > >>> > requirements for an SSG vs specifying the same requirements on > > > a > > > > >>> > single > > > > >>> > operator within that group (ideally a colocation group to avoid > > > > this > > > > >>> > whole hint business)? > > > > >>> > > > > > >>> > Wouldn't we get the best of both worlds in the latter case? > > > > >>> > > > > > >>> > Users can take shortcuts to define shared requirements, > > > > >>> > but refine them further as needed on a per-operator basis, > > > > >>> > without changing semantics of slotsharing groups > > > > >>> > nor the runtime being locked into SSG-based requirements. > > > > >>> > > > > > >>> > (And before anyone argues what happens if slotsharing groups > > > > >>> > change or > > > > >>> > whatnot, that's a plain API issue that we could surely solve. 
> > > (A > > > > >>> > plain > > > > >>> > iteration over slotsharing groups and therein contained > > > operators > > > > >>> > would > > > > >>> > suffice)). > > > > >>> > > > > > >>> > On 1/20/2021 6:48 PM, Till Rohrmann wrote: > > > > >>> > > Maybe a different minor idea: Would it be possible to treat > > > > the SSG > > > > >>> > > resource requirements as a hint for the runtime similar to > > > how > > > > >>> > slot sharing > > > > >>> > > groups are designed at the moment? Meaning that we don't give > > > > >>> > the guarantee > > > > >>> > > that Flink will always deploy this set of tasks together no > > > > >>> > matter what > > > > >>> > > comes. If, for example, the runtime can derive by some means > > > > the > > > > >>> > resource > > > > >>> > > requirements for each task based on the requirements for the > > > > >>> > SSG, this > > > > >>> > > could be possible. One easy strategy would be to give every > > > > task > > > > >>> > the same > > > > >>> > > resources as the whole slot sharing group. Another one could > > > be > > > > >>> > > distributing the resources equally among the tasks. This does > > > > >>> > not even have > > > > >>> > > to be implemented but we would give ourselves the freedom to > > > > change > > > > >>> > > scheduling if need should arise. > > > > >>> > > > > > > >>> > > Cheers, > > > > >>> > > Till > > > > >>> > > > > > > >>> > > On Wed, Jan 20, 2021 at 7:04 AM Yangze Guo < > > > [hidden email] > > > > >>> > <mailto:[hidden email]>> wrote: > > > > >>> > > > > > > >>> > >> Thanks for the responses, Till and Xintong. > > > > >>> > >> > > > > >>> > >> I second Xintong's comment that SSG-based runtime interface > > > > >>> > will give > > > > >>> > >> us the flexibility to achieve op/task-based approach. That's > > > > one of > > > > >>> > >> the most important reasons for our design choice. 
> > > > >>> > >> > > > > >>> > >> Some cents regarding the default operator resource: > > > > >>> > >> - It might be good for the scenario of DataStream jobs. > > > > >>> > >> ** For light-weight operators, the accumulative > > > > >>> > configuration error > > > > >>> > >> will not be significant. Then, the resource of a task used > > > is > > > > >>> > >> proportional to the number of operators it contains. > > > > >>> > >> ** For heavy operators like join and window or operators > > > > >>> > using the > > > > >>> > >> external resources, user will turn to the fine-grained > > > > resource > > > > >>> > >> configuration. > > > > >>> > >> - It can increase the stability for the standalone cluster > > > > >>> > where task > > > > >>> > >> executors registered are heterogeneous(with different > > > default > > > > slot > > > > >>> > >> resources). > > > > >>> > >> - It might not be good for SQL users. The operators that SQL > > > > >>> > will be > > > > >>> > >> transferred to is a black box to the user. We also do not > > > > guarantee > > > > >>> > >> the cross-version of consistency of the transformation so > > > far. > > > > >>> > >> > > > > >>> > >> I think it can be treated as a follow-up work when the > > > > fine-grained > > > > >>> > >> resource management is end-to-end ready. > > > > >>> > >> > > > > >>> > >> Best, > > > > >>> > >> Yangze Guo > > > > >>> > >> > > > > >>> > >> > > > > >>> > >> On Wed, Jan 20, 2021 at 11:16 AM Xintong Song > > > > >>> > <[hidden email] <mailto:[hidden email]>> > > > > >>> > >> wrote: > > > > >>> > >>> Thanks for the feedback, Till. > > > > >>> > >>> > > > > >>> > >>> ## I feel that what you proposed (operator-based + default > > > > >>> > value) might > > > > >>> > >> be > > > > >>> > >>> subsumed by the SSG-based approach. > > > > >>> > >>> Thinking of op_1 -> op_2, there are the following 4 cases, > > > > >>> > categorized by > > > > >>> > >>> whether the resource requirements are known to the users. 
> > > > >>> > >>> > > > > >>> > >>> 1. *Both known.* As previously mentioned, there's no > > > > >>> > reason to put > > > > >>> > >>> multiple operators whose individual resource > > > requirements > > > > >>> > are already > > > > >>> > >> known > > > > >>> > >>> into the same group in fine-grained resource > > > management. > > > > >>> > And if op_1 > > > > >>> > >> and > > > > >>> > >>> op_2 are in different groups, there should be no > > > problem > > > > >>> > switching > > > > >>> > >> data > > > > >>> > >>> exchange mode from pipelined to blocking. This is > > > > >>> > equivalent to > > > > >>> > >> specifying > > > > >>> > >>> operator resource requirements in your proposal. > > > > >>> > >>> 2. *op_1 known, op_2 unknown.* Similar to 1), except > > > that > > > > >>> > op_2 is in a > > > > >>> > >>> SSG whose resource is not specified thus would have the > > > > >>> > default slot > > > > >>> > >>> resource. This is equivalent to having default operator > > > > >>> > resources in > > > > >>> > >> your > > > > >>> > >>> proposal. > > > > >>> > >>> 3. *Both unknown*. The user can either set op_1 and > > > op_2 > > > > >>> > to the same > > > > >>> > >> SSG > > > > >>> > >>> or separate SSGs. > > > > >>> > >>> - If op_1 and op_2 are in the same SSG, it will be > > > > >>> > equivalent to > > > > >>> > >> the > > > > >>> > >>> coarse-grained resource management, where op_1 and > > > > op_2 > > > > >>> > share a > > > > >>> > >> default > > > > >>> > >>> size slot no matter which data exchange mode is > > > used. > > > > >>> > >>> - If op_1 and op_2 are in different SSGs, then each > > > of > > > > >>> > them will > > > > >>> > >> use > > > > >>> > >>> a default size slot. This is equivalent to setting > > > > them > > > > >>> > with > > > > >>> > >> default > > > > >>> > >>> operator resources in your proposal. > > > > >>> > >>> 4. 
*Total (pipeline) or max (blocking) of op_1 and op_2 > > > > is > > > > >>> > known.* > > > > >>> > >>> - It is possible that the user learns the total / > > > max > > > > >>> > resource > > > > >>> > >>> requirement from executing and monitoring the job, > > > > >>> > while not > > > > >>> > >>> being aware of > > > > >>> > >>> individual operator requirements. > > > > >>> > >>> - I believe this is the case your proposal does not > > > > >>> > cover. And TBH, > > > > >>> > >>> this is probably how most users learn the resource > > > > >>> > requirements, > > > > >>> > >>> according > > > > >>> > >>> to my experiences. > > > > >>> > >>> - In this case, the user might need to specify > > > > >>> > different resources > > > > >>> > >> if > > > > >>> > >>> he wants to switch the execution mode, which should > > > > not > > > > >>> > be worse > > > > >>> > >> than not > > > > >>> > >>> being able to use fine-grained resource management. > > > > >>> > >>> > > > > >>> > >>> > > > > >>> > >>> ## An additional idea inspired by your proposal. > > > > >>> > >>> We may provide multiple options for deciding resources for > > > > >>> > SSGs whose > > > > >>> > >>> requirement is not specified, if needed. > > > > >>> > >>> > > > > >>> > >>> - Default slot resource (current design) > > > > >>> > >>> - Default operator resource times number of operators > > > > >>> > (equivalent to > > > > >>> > >>> your proposal) > > > > >>> > >>> > > > > >>> > >>> > > > > >>> > >>> ## Exposing internal runtime strategies > > > > >>> > >>> Theoretically, yes. Tying to the SSGs, the resource > > > > >>> > requirements might be > > > > >>> > >>> affected if how SSGs are internally handled changes in > > > > future. 
> > > > >>> > >> Practically, > > > > >>> > >>> I do not concretely see at the moment what kind of changes > > > we > > > > >>> > may want in > > > > >>> > >>> future that might conflict with this FLIP proposal, as the > > > > >>> > question of > > > > >>> > >>> switching data exchange mode answered above. I'd suggest to > > > > >>> > not give up > > > > >>> > >> the > > > > >>> > >>> user friendliness we may gain now for the future problems > > > > that > > > > >>> > may or may > > > > >>> > >>> not exist. > > > > >>> > >>> > > > > >>> > >>> Moreover, the SSG-based approach has the flexibility to > > > > >>> > achieve the > > > > >>> > >>> equivalent behavior as the operator-based approach, if we > > > > set each > > > > >>> > >> operator > > > > >>> > >>> (or task) to a separate SSG. We can even provide a shortcut > > > > >>> > option to > > > > >>> > >>> automatically do that for users, if needed. > > > > >>> > >>> > > > > >>> > >>> > > > > >>> > >>> Thank you~ > > > > >>> > >>> > > > > >>> > >>> Xintong Song > > > > >>> > >>> > > > > >>> > >>> > > > > >>> > >>> > > > > >>> > >>> On Tue, Jan 19, 2021 at 11:48 PM Till Rohrmann > > > > >>> > <[hidden email] <mailto:[hidden email]>> > > > > >>> > >> wrote: > > > > >>> > >>>> Thanks for the responses Xintong and Stephan, > > > > >>> > >>>> > > > > >>> > >>>> I agree that being able to define the resource > > > requirements > > > > for a > > > > >>> > >> group of > > > > >>> > >>>> operators is more user friendly. However, my concern is > > > that > > > > >>> > we are > > > > >>> > >>>> exposing thereby internal runtime strategies which might > > > > >>> > limit our > > > > >>> > >>>> flexibility to execute a given job. Moreover, the > > > semantics > > > > of > > > > >>> > >> configuring > > > > >>> > >>>> resource requirements for SSGs could break if switching > > > from > > > > >>> > streaming > > > > >>> > >> to > > > > >>> > >>>> batch execution. 
If one defines the resource requirements > > > > for > > > > >>> > op_1 -> > > > > >>> > >> op_2 > > > > >>> > >>>> which run in pipelined mode when using the streaming > > > > >>> > execution, then > > > > >>> > >> how do > > > > >>> > >>>> we interpret these requirements when op_1 -> op_2 are > > > > >>> > executed with a > > > > >>> > >>>> blocking data exchange in batch execution mode? > > > > Consequently, > > > > >>> > I am > > > > >>> > >> still > > > > >>> > >>>> leaning towards Stephan's proposal to set the resource > > > > >>> > requirements per > > > > >>> > >>>> operator. > > > > >>> > >>>> > > > > >>> > >>>> Maybe the following proposal makes the configuration > > > easier: > > > > >>> > If the > > > > >>> > >> user > > > > >>> > >>>> wants to use fine-grained resource requirements, then she > > > > >>> > needs to > > > > >>> > >> specify > > > > >>> > >>>> the default size which is used for operators which have no > > > > >>> > explicit > > > > >>> > >>>> resource annotation. If this holds true, then every > > > operator > > > > >>> > would > > > > >>> > >> have a > > > > >>> > >>>> resource requirement and the system can try to execute the > > > > >>> > operators > > > > >>> > >> in the > > > > >>> > >>>> best possible manner w/o being constrained by how the user > > > > >>> > set the SSG > > > > >>> > >>>> requirements. > > > > >>> > >>>> > > > > >>> > >>>> Cheers, > > > > >>> > >>>> Till > > > > >>> > >>>> > > > > >>> > >>>> On Tue, Jan 19, 2021 at 9:09 AM Xintong Song > > > > >>> > <[hidden email] <mailto:[hidden email]>> > > > > >>> > >>>> wrote: > > > > >>> > >>>> > > > > >>> > >>>>> Thanks for the feedback, Stephan. > > > > >>> > >>>>> > > > > >>> > >>>>> Actually, your proposal has also come to my mind at some > > > > >>> > point. And I > > > > >>> > >>>> have > > > > >>> > >>>>> some concerns about it. > > > > >>> > >>>>> > > > > >>> > >>>>> > > > > >>> > >>>>> 1. 
It does not give users the same control as the > > > SSG-based > > > > >>> > approach. > > > > >>> > >>>>> > > > > >>> > >>>>> > > > > >>> > >>>>> While both approaches do not require specifying for each > > > > >>> > operator, > > > > >>> > >>>>> SSG-based approach supports the semantic that "some > > > > operators > > > > >>> > >> together > > > > >>> > >>>> use > > > > >>> > >>>>> this much resource" while the operator-based approach > > > > doesn't. > > > > >>> > >>>>> > > > > >>> > >>>>> > > > > >>> > >>>>> Think of a long pipeline with m operators (o_1, o_2, ..., > > > > >>> > o_m), and > > > > >>> > >> at > > > > >>> > >>>> some > > > > >>> > >>>>> point there's an agg o_n (1 < n < m) which significantly > > > > >>> > reduces the > > > > >>> > >> data > > > > >>> > >>>>> amount. One can separate the pipeline into 2 groups SSG_1 > > > > >>> > (o_1, ..., > > > > >>> > >> o_n) > > > > >>> > >>>>> and SSG_2 (o_n+1, ... o_m), so that configuring much > > > higher > > > > >>> > >> parallelisms > > > > >>> > >>>>> for operators in SSG_1 than for operators in SSG_2 won't > > > > >>> > lead to too > > > > >>> > >> much > > > > >>> > >>>>> wasting of resources. If the two SSGs end up needing > > > > different > > > > >>> > >> resources, > > > > >>> > >>>>> with the SSG-based approach one can directly specify > > > > >>> > resources for > > > > >>> > >> the > > > > >>> > >>>> two > > > > >>> > >>>>> groups. However, with the operator-based approach, the > > > > user will > > > > >>> > >> have to > > > > >>> > >>>>> specify resources for each operator in one of the two > > > > >>> > groups, and > > > > >>> > >> tune > > > > >>> > >>>> the > > > > >>> > >>>>> default slot resource via configurations to fit the other > > > > group. > > > > >>> > >>>>> > > > > >>> > >>>>> > > > > >>> > >>>>> 2. It increases the chance of breaking operator chains. 
> > > > >>> > >>>>> > > > > >>> > >>>>> > > > > >>> > >>>>> Setting chainnable operators into different slot sharing > > > > >>> > groups will > > > > >>> > >>>>> prevent them from being chained. In the current > > > > implementation, > > > > >>> > >>>> downstream > > > > >>> > >>>>> operators, if SSG not explicitly specified, will be set > > > to > > > > >>> > the same > > > > >>> > >> group > > > > >>> > >>>>> as the chainable upstream operators (unless multiple > > > > upstream > > > > >>> > >> operators > > > > >>> > >>>> in > > > > >>> > >>>>> different groups), to reduce the chance of breaking > > > chains. > > > > >>> > >>>>> > > > > >>> > >>>>> > > > > >>> > >>>>> Thinking of chainable operators o_1 -> o_2 -> o_3 -> o_3, > > > > >>> > deciding > > > > >>> > >> SSGs > > > > >>> > >>>>> based on whether resource is specified we will easily get > > > > >>> > groups like > > > > >>> > >>>> (o_1, > > > > >>> > >>>>> o_3) & (o_2, o_4), where none of the operators can be > > > > >>> > chained. This > > > > >>> > >> is > > > > >>> > >>>> also > > > > >>> > >>>>> possible for the SSG-based approach, but I believe the > > > > >>> > chance is much > > > > >>> > >>>>> smaller because there's no strong reason for users to > > > > >>> > specify the > > > > >>> > >> groups > > > > >>> > >>>>> with alternate operators like that. We are more likely to > > > > >>> > get groups > > > > >>> > >> like > > > > >>> > >>>>> (o_1, o_2) & (o_3, o_4), where the chain breaks only > > > > between > > > > >>> > o_2 and > > > > >>> > >> o_3. > > > > >>> > >>>>> > > > > >>> > >>>>> 3. It complicates the system by having two different > > > > >>> > mechanisms for > > > > >>> > >>>> sharing > > > > >>> > >>>>> managed memory in a slot. 
> > > > >>> > >>>>> > > > > >>> > >>>>> > > > > >>> > >>>>> - In FLIP-141, we introduced the intra-slot managed > > > memory > > > > >>> > sharing > > > > >>> > >>>>> mechanism, where managed memory is first distributed > > > > >>> > according to the > > > > >>> > >>>>> consumer type, then further distributed across operators > > > > of that > > > > >>> > >> consumer > > > > >>> > >>>>> type. > > > > >>> > >>>>> > > > > >>> > >>>>> - With the operator-based approach, managed memory size > > > > >>> > specified > > > > >>> > >> for an > > > > >>> > >>>>> operator should account for all the consumer types of > > > that > > > > >>> > operator. > > > > >>> > >> That > > > > >>> > >>>>> means the managed memory is first distributed across > > > > >>> > operators, then > > > > >>> > >>>>> distributed to different consumer types of each operator. > > > > >>> > >>>>> > > > > >>> > >>>>> > > > > >>> > >>>>> Unfortunately, the different order of the two calculation > > > > >>> > steps can > > > > >>> > >> lead > > > > >>> > >>>> to > > > > >>> > >>>>> different results. To be specific, the semantic of the > > > > >>> > configuration > > > > >>> > >>>> option > > > > >>> > >>>>> `consumer-weights` changed (within a slot vs. within an > > > > >>> > operator). > > > > >>> > >>>>> > > > > >>> > >>>>> > > > > >>> > >>>>> > > > > >>> > >>>>> To sum up things: > > > > >>> > >>>>> > > > > >>> > >>>>> While (3) might be a bit more implementation related, I > > > > >>> > think (1) > > > > >>> > >> and (2) > > > > >>> > >>>>> somehow suggest that, the price for the proposed approach > > > > to > > > > >>> > avoid > > > > >>> > >>>>> specifying resource for every operator is that it's not > > > as > > > > >>> > >> independent > > > > >>> > >>>> from > > > > >>> > >>>>> operator chaining and slot sharing as the operator-based > > > > >>> > approach > > > > >>> > >>>> discussed > > > > >>> > >>>>> in the FLIP. 
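Point (3) above can be illustrated with a small arithmetic sketch. This is a simplified model in Python, not Flink code: the function names, the consumer type names, the 70/30 weights, the even split across operators and the 50/50 per-operator sizes are all assumptions chosen for illustration.

```python
def slot_first(slot_mb, weights, operators):
    """Split slot memory by consumer type first, then evenly across the
    operators that use that type (assumed simplification)."""
    used = {t for types in operators.values() for t in types}
    total_weight = sum(weights[t] for t in used)
    result = {op: {} for op in operators}
    for t in used:
        share = slot_mb * weights[t] / total_weight
        users = [op for op, types in operators.items() if t in types]
        for op in users:
            result[op][t] = share / len(users)
    return result


def operator_first(op_mb, weights, operators):
    """Give each operator its own budget first, then split that budget
    across the consumer types of that operator."""
    result = {}
    for op, types in operators.items():
        total_weight = sum(weights[t] for t in types)
        result[op] = {t: op_mb[op] * weights[t] / total_weight for t in types}
    return result


# Hypothetical setup: operator A uses two consumer types, operator B one.
weights = {"DATAPROC": 70, "PYTHON": 30}
operators = {"A": {"DATAPROC", "PYTHON"}, "B": {"DATAPROC"}}

by_slot = slot_first(100, weights, operators)
by_operator = operator_first({"A": 50, "B": 50}, weights, operators)
print(by_slot["A"]["PYTHON"], by_operator["A"]["PYTHON"])
```

With the same 100 MB and the same weights, the Python consumer of operator A ends up with a different amount depending on which split happens first, which is the change in the meaning of `consumer-weights` (within a slot vs. within an operator) described above.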
> > > > >>> > >>>>> > > > > >>> > >>>>> > > > > >>> > >>>>> Thank you~ > > > > >>> > >>>>> > > > > >>> > >>>>> Xintong Song > > > > >>> > >>>>> > > > > >>> > >>>>> > > > > >>> > >>>>> > > > > >>> > >>>>> On Tue, Jan 19, 2021 at 4:29 AM Stephan Ewen > > > > >>> > <[hidden email] <mailto:[hidden email]>> > > > > >>> > >> wrote: > > > > >>> > >>>>>> Thanks a lot, Yangze and Xintong for this FLIP. > > > > >>> > >>>>>> > > > > >>> > >>>>>> I want to say, first of all, that this is super well > > > > >>> > written. And > > > > >>> > >> the > > > > >>> > >>>>>> points that the FLIP makes about how to expose the > > > > >>> > configuration to > > > > >>> > >>>> users > > > > >>> > >>>>>> is exactly the right thing to figure out first. > > > > >>> > >>>>>> So good job here! > > > > >>> > >>>>>> > > > > >>> > >>>>>> About how to let users specify the resource profiles. > > > If I > > > > >>> > can sum > > > > >>> > >> the > > > > >>> > >>>>> FLIP > > > > >>> > >>>>>> and previous discussion up in my own words, the problem > > > > is the > > > > >>> > >>>> following: > > > > >>> > >>>>>> Operator-level specification is the simplest and > > > cleanest > > > > >>> > approach, > > > > >>> > >>>>> because > > > > >>> > >>>>>>> it avoids mixing operator configuration (resource) and > > > > >>> > >> scheduling. No > > > > >>> > >>>>>>> matter what other parameters change (chaining, slot > > > > sharing, > > > > >>> > >>>> switching > > > > >>> > >>>>>>> pipelined and blocking shuffles), the resource profiles > > > > >>> > stay the > > > > >>> > >>>> same. > > > > >>> > >>>>>>> But it would require that a user specifies resources on > > > > all > > > > >>> > >>>> operators, > > > > >>> > >>>>>>> which makes it hard to use. That's why the FLIP > > > suggests > > > > going > > > > >>> > >> with > > > > >>> > >>>>>>> specifying resources on a Sharing-Group. 
> > > > >>> > >>>>>> > > > > >>> > >>>>>> I think both thoughts are important, so can we find a > > > > solution > > > > >>> > >> where > > > > >>> > >>>> the > > > > >>> > >>>>>> Resource Profiles are specified on an Operator, but we > > > > >>> > still avoid > > > > >>> > >> that > > > > >>> > >>>>> we > > > > >>> > >>>>>> need to specify a resource profile on every operator? > > > > >>> > >>>>>> > > > > >>> > >>>>>> What do you think about something like the following: > > > > >>> > >>>>>> - Resource Profiles are specified on an operator > > > level. > > > > >>> > >>>>>> - Not all operators need profiles > > > > >>> > >>>>>> - All Operators without a Resource Profile ended up > > > in > > > > the > > > > >>> > >> default > > > > >>> > >>>> slot > > > > >>> > >>>>>> sharing group with a default profile (will get a default > > > > slot). > > > > >>> > >>>>>> - All Operators with a Resource Profile will go into > > > > >>> > another slot > > > > >>> > >>>>> sharing > > > > >>> > >>>>>> group (the resource-specified-group). > > > > >>> > >>>>>> - Users can define different slot sharing groups for > > > > >>> > operators > > > > >>> > >> like > > > > >>> > >>>>> they > > > > >>> > >>>>>> do now, with the exception that you cannot mix operators > > > > >>> > that have > > > > >>> > >> a > > > > >>> > >>>>>> resource profile and operators that have no resource > > > > profile. > > > > >>> > >>>>>> - The default case where no operator has a resource > > > > >>> > profile is > > > > >>> > >> just a > > > > >>> > >>>>>> special case of this model > > > > >>> > >>>>>> - The chaining logic sums up the profiles per > > > operator, > > > > >>> > like it > > > > >>> > >> does > > > > >>> > >>>>> now, > > > > >>> > >>>>>> and the scheduler sums up the profiles of the tasks that > > > > it > > > > >>> > >> schedules > > > > >>> > >>>>>> together. 
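The grouping rule in Stephan's list can be sketched as follows. This is a minimal illustration in Python under stated assumptions: only memory is modelled, there is a single resource-specified group, and `plan_groups`, the operator names and all sizes are hypothetical rather than part of any Flink API.

```python
def plan_groups(profiles_mb, default_slot_mb):
    """Map operator profiles to slot sharing groups.

    profiles_mb maps an operator name to its memory in MB, or to None
    when the operator has no resource profile.
    """
    unspecified = [op for op, mb in profiles_mb.items() if mb is None]
    specified = {op: mb for op, mb in profiles_mb.items() if mb is not None}

    plan = {}
    if unspecified:
        # Operators without a profile share a default-sized slot.
        plan["default"] = {"operators": unspecified, "slot_mb": default_slot_mb}
    if specified:
        # Operators with a profile are grouped and their profiles summed.
        plan["resource-specified"] = {
            "operators": list(specified),
            "slot_mb": sum(specified.values()),
        }
    return plan


plan = plan_groups({"source": None, "agg": 300, "sink": 200}, default_slot_mb=1024)
print(plan)
```

When no operator has a profile, the plan contains only the default group, which matches the "special case" bullet in the list.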
> > > > >>> > >>>>>> > > > > >>> > >>>>>> > > > > >>> > >>>>>> There is another question about reactive scaling raised > > > > in the > > > > >>> > >> FLIP. I > > > > >>> > >>>>> need > > > > >>> > >>>>>> to think a bit about that. That is indeed a bit more > > > > tricky > > > > >>> > once we > > > > >>> > >>>> have > > > > >>> > >>>>>> slots of different sizes. > > > > >>> > >>>>>> It is not clear then which of the different slot > > > requests > > > > the > > > > >>> > >>>>>> ResourceManager should fulfill when new resources (TMs) > > > > >>> > show up, > > > > >>> > >> or how > > > > >>> > >>>>> the > > > > >>> > >>>>>> JobManager redistributes the slots resources when > > > > resources > > > > >>> > (TMs) > > > > >>> > >>>>> disappear > > > > >>> > >>>>>> This question is pretty orthogonal, though, to the "how > > > to > > > > >>> > specify > > > > >>> > >> the > > > > >>> > >>>>>> resources". > > > > >>> > >>>>>> > > > > >>> > >>>>>> > > > > >>> > >>>>>> Best, > > > > >>> > >>>>>> Stephan > > > > >>> > >>>>>> > > > > >>> > >>>>>> On Fri, Jan 8, 2021 at 5:14 AM Xintong Song > > > > >>> > <[hidden email] <mailto:[hidden email]> > > > > >>> > >>>>> wrote: > > > > >>> > >>>>>>> Thanks for drafting the FLIP and driving the > > > discussion, > > > > >>> > Yangze. > > > > >>> > >>>>>>> And Thanks for the feedback, Till and Chesnay. > > > > >>> > >>>>>>> > > > > >>> > >>>>>>> @Till, > > > > >>> > >>>>>>> > > > > >>> > >>>>>>> I agree that specifying requirements for SSGs means > > > that > > > > SSGs > > > > >>> > >> need to > > > > >>> > >>>>> be > > > > >>> > >>>>>>> supported in fine-grained resource management, > > > otherwise > > > > each > > > > >>> > >>>> operator > > > > >>> > >>>>>>> might use as many resources as the whole group. > > > However, > > > > I > > > > >>> > cannot > > > > >>> > >>>> think > > > > >>> > >>>>>> of > > > > >>> > >>>>>>> a strong reason for not supporting SSGs in fine-grained > > > > >>> > resource > > > > >>> > >>>>>>> management. 
> > > > >>> > >>>>>>> > > > > >>> > >>>>>>> > > > > >>> > >>>>>>>> Interestingly, if all operators have their resources > > > > properly > > > > >>> > >>>>>> specified, > > > > >>> > >>>>>>>> then slot sharing is no longer needed because Flink > > > > could > > > > >>> > >> slice off > > > > >>> > >>>>> the > > > > >>> > >>>>>>>> appropriately sized slots for every Task individually. > > > > >>> > >>>>>>>> > > > > >>> > >>>>>>> So for example, if we have a job consisting of two > > > > >>> > operator op_1 > > > > >>> > >> and > > > > >>> > >>>>> op_2 > > > > >>> > >>>>>>>> where each op needs 100 MB of memory, we would then > > > say > > > > that > > > > >>> > >> the > > > > >>> > >>>> slot > > > > >>> > >>>>>>>> sharing group needs 200 MB of memory to run. If we > > > have > > > > a > > > > >>> > >> cluster > > > > >>> > >>>>> with > > > > >>> > >>>>>> 2 > > > > >>> > >>>>>>>> TMs with one slot of 100 MB each, then the system > > > > cannot run > > > > >>> > >> this > > > > >>> > >>>>> job. > > > > >>> > >>>>>> If > > > > >>> > >>>>>>>> the resources were specified on an operator level, > > > then > > > > the > > > > >>> > >> system > > > > >>> > >>>>>> could > > > > >>> > >>>>>>>> still make the decision to deploy op_1 to TM_1 and > > > op_2 > > > > to > > > > >>> > >> TM_2. > > > > >>> > >>>>>>> > > > > >>> > >>>>>>> Couldn't agree more that if all operators' requirements > > > > are > > > > >>> > >> properly > > > > >>> > >>>>>>> specified, slot sharing should be no longer needed. I > > > > >>> > think this > > > > >>> > >>>>> exactly > > > > >>> > >>>>>>> disproves the example. If we already know op_1 and op_2 > > > > each > > > > >>> > >> needs > > > > >>> > >>>> 100 > > > > >>> > >>>>> MB > > > > >>> > >>>>>>> of memory, why would we put them in the same group? 
If > > > > >>> > they are > > > > >>> > >> in > > > > >>> > >>>>>> separate > > > > >>> > >>>>>>> groups, with the proposed approach the system can > > > freely > > > > >>> > deploy > > > > >>> > >> them > > > > >>> > >>>> to > > > > >>> > >>>>>>> either a 200 MB TM or two 100 MB TMs. > > > > >>> > >>>>>>> > > > > >>> > >>>>>>> Moreover, the precondition for not needing slot sharing > > > > is > > > > >>> > having > > > > >>> > >>>>>> resource > > > > >>> > >>>>>>> requirements properly specified for all operators. This > > > > is not > > > > >>> > >> always > > > > >>> > >>>>>>> possible, and usually requires tremendous efforts. One > > > > of the > > > > >>> > >>>> benefits > > > > >>> > >>>>>> for > > > > >>> > >>>>>>> SSG-based requirements is that it allows the user to > > > > freely > > > > >>> > >> decide > > > > >>> > >>>> the > > > > >>> > >>>>>>> granularity, thus efforts they want to pay. I would > > > > >>> > consider SSG > > > > >>> > >> in > > > > >>> > >>>>>>> fine-grained resource management as a group of > > > operators > > > > >>> > that the > > > > >>> > >>>> user > > > > >>> > >>>>>>> would like to specify the total resource for. There can > > > > be > > > > >>> > only > > > > >>> > >> one > > > > >>> > >>>>> group > > > > >>> > >>>>>>> in the job, 2~3 groups dividing the job into a few > > > major > > > > >>> > parts, > > > > >>> > >> or as > > > > >>> > >>>>>> many > > > > >>> > >>>>>>> groups as the number of tasks/operators, depending on > > > how > > > > >>> > >>>> fine-grained > > > > >>> > >>>>>> the > > > > >>> > >>>>>>> user is able to specify the resources. > > > > >>> > >>>>>>> > > > > >>> > >>>>>>> Having to support SSGs might be a constraint. 
But given > > > > >>> > that all > > > > >>> > >> the > > > > >>> > >>>>>>> current scheduler implementations already support > > > SSGs, I > > > > >>> > tend to > > > > >>> > >>>> think > > > > >>> > >>>>>>> that as an acceptable price for the above discussed > > > > >>> > usability and > > > > >>> > >>>>>>> flexibility. > > > > >>> > >>>>>>> > > > > >>> > >>>>>>> @Chesnay > > > > >>> > >>>>>>> > > > > >>> > >>>>>>> Will declaring them on slot sharing groups not also > > > waste > > > > >>> > >> resources > > > > >>> > >>>> if > > > > >>> > >>>>>> the > > > > >>> > >>>>>>>> parallelism of operators within that group are > > > > different? > > > > >>> > >>>>>>>> > > > > >>> > >>>>>>> Yes. It's a trade-off between usability and resource > > > > >>> > >> utilization. To > > > > >>> > >>>>>> avoid > > > > >>> > >>>>>>> such wasting, the user can define more groups, so that > > > > >>> > each group > > > > >>> > >>>>>> contains > > > > >>> > >>>>>>> less operators and the chance of having operators with > > > > >>> > different > > > > >>> > >>>>>>> parallelism will be reduced. The price is to have more > > > > >>> > resource > > > > >>> > >>>>>>> requirements to specify. > > > > >>> > >>>>>>> > > > > >>> > >>>>>>> It also seems like quite a hassle for users having to > > > > >>> > >> recalculate the > > > > >>> > >>>>>>>> resource requirements if they change the slot sharing. > > > > >>> > >>>>>>>> I'd think that it's not really workable for users that > > > > create > > > > >>> > >> a set > > > > >>> > >>>>> of > > > > >>> > >>>>>>>> re-usable operators which are mixed and matched in > > > their > > > > >>> > >>>>> applications; > > > > >>> > >>>>>>>> managing the resources requirements in such a setting > > > > >>> > would be > > > > >>> > >> a > > > > >>> > >>>>>>>> nightmare, and in the end would require operator-level > > > > >>> > >> requirements > > > > >>> > >>>>> any > > > > >>> > >>>>>>>> way. 
> > > > >>> > >>>>>>>> In that sense, I'm not even sure whether it really > > > > increases > > > > >>> > >>>>> usability. > > > > >>> > >>>>>>> - As mentioned in my reply to Till's comment, > > > > there's no > > > > >>> > >> reason to > > > > >>> > >>>>> put > > > > >>> > >>>>>>> multiple operators whose individual resource > > > > >>> > requirements are > > > > >>> > >>>>> already > > > > >>> > >>>>>>> known > > > > >>> > >>>>>>> into the same group in fine-grained resource > > > > management. > > > > >>> > >>>>>>> - Even an operator implementation is reused for > > > > multiple > > > > >>> > >>>>> applications, > > > > >>> > >>>>>>> it does not guarantee the same resource > > > requirements. > > > > >>> > During > > > > >>> > >> our > > > > >>> > >>>>> years > > > > >>> > >>>>>>> of > > > > >>> > >>>>>>> practices in Alibaba, with per-operator > > > requirements > > > > >>> > >> specified for > > > > >>> > >>>>>>> Blink's > > > > >>> > >>>>>>> fine-grained resource management, very few users > > > > >>> > (including > > > > >>> > >> our > > > > >>> > >>>>>>> specialists > > > > >>> > >>>>>>> who are dedicated to supporting Blink users) are as > > > > >>> > >> experienced as > > > > >>> > >>>>> to > > > > >>> > >>>>>>> accurately predict/estimate the operator resource > > > > >>> > >> requirements. > > > > >>> > >>>> Most > > > > >>> > >>>>>>> people > > > > >>> > >>>>>>> rely on the execution-time metrics (throughput, > > > > delay, cpu > > > > >>> > >> load, > > > > >>> > >>>>>> memory > > > > >>> > >>>>>>> usage, GC pressure, etc.) to improve the > > > > specification. > > > > >>> > >>>>>>> > > > > >>> > >>>>>>> To sum up: > > > > >>> > >>>>>>> If the user is capable of providing proper resource > > > > >>> > requirements > > > > >>> > >> for > > > > >>> > >>>>>> every > > > > >>> > >>>>>>> operator, that's definitely a good thing and we would > > > not > > > > >>> > need to > > > > >>> > >>>> rely > > > > >>> > >>>>> on > > > > >>> > >>>>>>> the SSGs. 
However, that shouldn't be a *must* for the > > > > >>> > >> fine-grained > > > > >>> > >>>>>> resource > > > > >>> > >>>>>>> management to work. For those users who are capable and > > > > do not > > > > >>> > >> like > > > > >>> > >>>>>> having > > > > >>> > >>>>>>> to set each operator to a separate SSG, I would be ok > > > to > > > > have > > > > >>> > >> both > > > > >>> > >>>>>>> SSG-based and operator-based runtime interfaces and to > > > > only > > > > >>> > >> fallback > > > > >>> > >>>> to > > > > >>> > >>>>>> the > > > > >>> > >>>>>>> SSG requirements when the operator requirements are not > > > > >>> > >> specified. > > > > >>> > >>>>>> However, > > > > >>> > >>>>>>> as the first step, I think we should prioritise the use > > > > cases > > > > >>> > >> where > > > > >>> > >>>>> users > > > > >>> > >>>>>>> are not that experienced. > > > > >>> > >>>>>>> > > > > >>> > >>>>>>> Thank you~ > > > > >>> > >>>>>>> > > > > >>> > >>>>>>> Xintong Song > > > > >>> > >>>>>>> > > > > >>> > >>>>>>> On Thu, Jan 7, 2021 at 9:55 PM Chesnay Schepler < > > > > >>> > >> [hidden email] <mailto:[hidden email]>> > > > > >>> > >>>>>>> wrote: > > > > >>> > >>>>>>> > > > > >>> > >>>>>>>> Will declaring them on slot sharing groups not also > > > > waste > > > > >>> > >> resources > > > > >>> > >>>>> if > > > > >>> > >>>>>>>> the parallelism of operators within that group are > > > > different? > > > > >>> > >>>>>>>> > > > > >>> > >>>>>>>> It also seems like quite a hassle for users having to > > > > >>> > >> recalculate > > > > >>> > >>>> the > > > > >>> > >>>>>>>> resource requirements if they change the slot sharing. 
> > > > >>> > >>>>>>>> I'd think that it's not really workable for users that > > > > create > > > > >>> > >> a set > > > > >>> > >>>>> of > > > > >>> > >>>>>>>> re-usable operators which are mixed and matched in > > > their > > > > >>> > >>>>> applications; > > > > >>> > >>>>>>>> managing the resources requirements in such a setting > > > > >>> > would be > > > > >>> > >> a > > > > >>> > >>>>>>>> nightmare, and in the end would require operator-level > > > > >>> > >> requirements > > > > >>> > >>>>> any > > > > >>> > >>>>>>>> way. > > > > >>> > >>>>>>>> In that sense, I'm not even sure whether it really > > > > increases > > > > >>> > >>>>> usability. > > > > >>> > >>>>>>>> My main worry is that it if we wire the runtime to > > > work > > > > >>> > on SSGs > > > > >>> > >>>> it's > > > > >>> > >>>>>>>> gonna be difficult to implement more fine-grained > > > > approaches, > > > > >>> > >> which > > > > >>> > >>>>>>>> would not be the case if, for the runtime, they are > > > > always > > > > >>> > >> defined > > > > >>> > >>>> on > > > > >>> > >>>>>> an > > > > >>> > >>>>>>>> operator-level. > > > > >>> > >>>>>>>> > > > > >>> > >>>>>>>> On 1/7/2021 2:42 PM, Till Rohrmann wrote: > > > > >>> > >>>>>>>>> Thanks for drafting this FLIP and starting this > > > > discussion > > > > >>> > >>>> Yangze. > > > > >>> > >>>>>>>>> I like that defining resource requirements on a slot > > > > sharing > > > > >>> > >>>> group > > > > >>> > >>>>>>> makes > > > > >>> > >>>>>>>>> the overall setup easier and improves usability of > > > > resource > > > > >>> > >>>>>>> requirements. > > > > >>> > >>>>>>>>> What I do not like about it is that it changes slot > > > > sharing > > > > >>> > >>>> groups > > > > >>> > >>>>>> from > > > > >>> > >>>>>>>>> being a scheduling hint to something which needs to > > > be > > > > >>> > >> supported > > > > >>> > >>>> in > > > > >>> > >>>>>>> order > > > > >>> > >>>>>>>>> to support fine grained resource requirements. 
So > > > far, > > > > the > > > > >>> > >> idea > > > > >>> > >>>> of > > > > >>> > >>>>>> slot > > > > >>> > >>>>>>>>> sharing groups was that it tells the system that a > > > set > > > > of > > > > >>> > >>>> operators > > > > >>> > >>>>>> can > > > > >>> > >>>>>>>> be > > > > >>> > >>>>>>>>> deployed in the same slot. But the system still had > > > the > > > > >>> > >> freedom > > > > >>> > >>>> to > > > > >>> > >>>>>> say > > > > >>> > >>>>>>>> that > > > > >>> > >>>>>>>>> it would rather place these tasks in different slots > > > > if it > > > > >>> > >>>> wanted. > > > > >>> > >>>>> If > > > > >>> > >>>>>>> we > > > > >>> > >>>>>>>>> now specify resource requirements on a per slot > > > sharing > > > > >>> > >> group, > > > > >>> > >>>> then > > > > >>> > >>>>>> the > > > > >>> > >>>>>>>>> only option for a scheduler which does not support > > > slot > > > > >>> > >> sharing > > > > >>> > >>>>>> groups > > > > >>> > >>>>>>> is > > > > >>> > >>>>>>>>> to say that every operator in this slot sharing group > > > > >>> > needs a > > > > >>> > >>>> slot > > > > >>> > >>>>>> with > > > > >>> > >>>>>>>> the > > > > >>> > >>>>>>>>> same resources as the whole group. > > > > >>> > >>>>>>>>> > > > > >>> > >>>>>>>>> So for example, if we have a job consisting of two > > > > operator > > > > >>> > >> op_1 > > > > >>> > >>>>> and > > > > >>> > >>>>>>> op_2 > > > > >>> > >>>>>>>>> where each op needs 100 MB of memory, we would then > > > > say that > > > > >>> > >> the > > > > >>> > >>>>> slot > > > > >>> > >>>>>>>>> sharing group needs 200 MB of memory to run. If we > > > > have a > > > > >>> > >> cluster > > > > >>> > >>>>>> with > > > > >>> > >>>>>>> 2 > > > > >>> > >>>>>>>>> TMs with one slot of 100 MB each, then the system > > > > cannot run > > > > >>> > >> this > > > > >>> > >>>>>> job. 
> > > > >>> > >>>>>>> If > > > > >>> > >>>>>>>>> the resources were specified on an operator level, > > > > then the > > > > >>> > >>>> system > > > > >>> > >>>>>>> could > > > > >>> > >>>>>>>>> still make the decision to deploy op_1 to TM_1 and > > > > op_2 to > > > > >>> > >> TM_2. > > > > >>> > >>>>>>>>> Originally, one of the primary goals of slot sharing > > > > groups > > > > >>> > >> was > > > > >>> > >>>> to > > > > >>> > >>>>>> make > > > > >>> > >>>>>>>> it > > > > >>> > >>>>>>>>> easier for the user to reason about how many slots a > > > > job > > > > >>> > >> needs > > > > >>> > >>>>>>>> independent > > > > >>> > >>>>>>>>> of the actual number of operators in the job. > > > > Interestingly, > > > > >>> > >> if > > > > >>> > >>>> all > > > > >>> > >>>>>>>>> operators have their resources properly specified, > > > > then slot > > > > >>> > >>>>> sharing > > > > >>> > >>>>>> is > > > > >>> > >>>>>>>> no > > > > >>> > >>>>>>>>> longer needed because Flink could slice off the > > > > >>> > appropriately > > > > >>> > >>>> sized > > > > >>> > >>>>>>> slots > > > > >>> > >>>>>>>>> for every Task individually. What matters is whether > > > > the > > > > >>> > >> whole > > > > >>> > >>>>>> cluster > > > > >>> > >>>>>>>> has > > > > >>> > >>>>>>>>> enough resources to run all tasks or not. 
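Till's example above can be worked through with a small feasibility check. This is a simplified sketch in Python, not Flink's scheduler: `fits` is a hypothetical helper, each request is assumed to need a slot of its own, and the numbers are the ones from the example (two operators of 100 MB each, two TMs with one 100 MB slot each).

```python
def fits(requests_mb, slots_mb):
    """Return True if every request can get its own slot.

    Pairing the largest request with the largest slot is sufficient
    when each slot holds at most one request.
    """
    requests = sorted(requests_mb, reverse=True)
    slots = sorted(slots_mb, reverse=True)
    return len(requests) <= len(slots) and all(
        req <= slot for req, slot in zip(requests, slots)
    )


slots = [100, 100]            # two TMs, one 100 MB slot each
ssg_level = [200]             # one request for the whole sharing group
operator_level = [100, 100]   # one request per operator

print(fits(ssg_level, slots), fits(operator_level, slots))
```

The group-level request cannot be placed, while the per-operator requests can. Putting op_1 and op_2 into two separate groups of 100 MB each produces the same request list as the operator-level case.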
> > > > >>> > >>>>>>>>> > > > > >>> > >>>>>>>>> Cheers, > > > > >>> > >>>>>>>>> Till > > > > >>> > >>>>>>>>> > > > > >>> > >>>>>>>>> On Thu, Jan 7, 2021 at 4:08 AM Yangze Guo < > > > > >>> > >> [hidden email] <mailto:[hidden email]>> > > > > >>> > >>>>>> wrote: > > > > >>> > >>>>>>>>>> Hi, there, > > > > >>> > >>>>>>>>>> > > > > >>> > >>>>>>>>>> We would like to start a discussion thread on > > > > "FLIP-156: > > > > >>> > >> Runtime > > > > >>> > >>>>>>>>>> Interfaces for Fine-Grained Resource > > > Requirements"[1], > > > > >>> > >> where we > > > > >>> > >>>>>>>>>> propose Slot Sharing Group (SSG) based runtime > > > > interfaces > > > > >>> > >> for > > > > >>> > >>>>>>>>>> specifying fine-grained resource requirements. > > > > >>> > >>>>>>>>>> > > > > >>> > >>>>>>>>>> In this FLIP: > > > > >>> > >>>>>>>>>> - Expound the user story of fine-grained resource > > > > >>> > >> management. > > > > >>> > >>>>>>>>>> - Propose runtime interfaces for specifying > > > SSG-based > > > > >>> > >> resource > > > > >>> > >>>>>>>>>> requirements. > > > > >>> > >>>>>>>>>> - Discuss the pros and cons of the three potential > > > > >>> > >> granularities > > > > >>> > >>>>> for > > > > >>> > >>>>>>>>>> specifying the resource requirements (op, task and > > > > slot > > > > >>> > >> sharing > > > > >>> > >>>>>> group) > > > > >>> > >>>>>>>>>> and explain why we choose the slot sharing group. > > > > >>> > >>>>>>>>>> > > > > >>> > >>>>>>>>>> Please find more details in the FLIP wiki document > > > > [1]. > > > > >>> > >> Looking > > > > >>> > >>>>>>>>>> forward to your feedback. 
> > > > >>> > >>>>>>>>>> [1] https://cwiki.apache.org/confluence/display/FLINK/FLIP-156%3A+Runtime+Interfaces+for+Fine-Grained+Resource+Requirements
> > > > >>> > >>>>>>>>>>
> > > > >>> > >>>>>>>>>> Best,
> > > > >>> > >>>>>>>>>> Yangze Guo
In reply to this post by Kezhu Wang
Thanks for your feedback, Kezhu.
> I think Flink *runtime* already has an ideal granularity for resource management: 'task'. If there is a slot shared by multiple tasks, that slot's resource requirement is the simple sum of all its logical slots. So basically, there is no resource requirement for SlotSharingGroup in runtime until now, right?

That is a half-baked implementation left over from previous attempts (years ago) to deliver the fine-grained resource management feature, and it was never really put into use.

> From the FLIP and discussion, I assume that SSG resource specifying will override operator level resource specifying if both are specified?

Actually, I think we should use the finer-grained resources (i.e. operator level) if both are specified. And more importantly, that is based on the assumption that we do need two different levels of interfaces.

> So, I wonder whether we could interpret SSG resource specifying as an "add" but not a "set" on resource requirement?

IIUC, this is the core idea behind your proposal. I think it provides an interesting idea of how we could combine operator-level and SSG-level resources, *if we allow configuring resources at both levels*. However, I'm not sure whether configuring resources on the operator level is indeed needed. Therefore, as a first step, this FLIP proposes to only introduce the SSG-level interfaces. As listed in the future plan, we would consider allowing operator-level resource configuration later if we do see a need for it. At that time, we definitely should discuss what to do if resources are configured at both levels.

> * Could SSG express negative resource requirement?

No.

> * Is there a concrete bar for partial resource configuration not to function? I saw it will fail job submission in Dispatcher.submitJob.

With the SSG-based approach, this should no longer be needed.
The constraint was introduced because we can properly define neither the resource of a task chained from an operator with specified resources and another with unspecified resources, nor the resource of a slot shared by a task with specified resources and another with unspecified resources. With the SSG-based approach, we no longer have those problems.

> An option (cluster/job level) to force slot sharing in the scheduler? This could be useful in case of migration from FLIP-156 to a future approach.

I think this is exactly what we are trying to avoid: requiring the scheduler to enforce slot sharing.

> An option (cluster) to ignore resource specifying (allowing a resource-specified job to run on an open box environment) for non-production usage?

That's possible. Actually, we are planning to introduce an option for activating the fine-grained resource management, for development purposes. We might consider keeping that option after the feature is completed, to allow disabling the feature without having to touch the job code.

Thank you~

Xintong Song

On Wed, Feb 3, 2021 at 1:28 PM Kezhu Wang <[hidden email]> wrote:

> Hi all, sorry for joining the discussion even after voting started.
>
> I want to share my thoughts on this after reading the above discussions.
>
> I think Flink *runtime* already has an ideal granularity for resource management: 'task'. If there is a slot shared by multiple tasks, that slot's resource requirement is the simple sum of all its logical slots. So basically, there is no resource requirement for SlotSharingGroup in runtime until now, right?
>
> As in the discussion, we already agree that: "If all operators have their resources properly specified, then slot sharing is no longer needed."
>
> So it seems to me that what we would naturally discuss is: how to bridge impractical operator-level resource specifying to the runtime task-level resource requirement? This is actually a pure API thing, as Chesnay has pointed out.
>
> But FLIP-156 brings another direction to the table: how about using SSG
> for both API and runtime resource specifying?
>
> From the FLIP and discussion, I assume that SSG resource specifying will
> override operator level resource specifying if both are specified?
>
> So, I wonder whether we could interpret SSG resource specifying as an
> "add" but not a "set" on resource requirement?
>
> The semantics is that SSG resource specifying adds additional resource to
> the shared slot, to express concerns about possible high throughput and
> resource requirements for tasks in one physical slot.
>
> The result is that if the scheduler indeed respects slot sharing, the
> allocated slot will gain the extra resource specified for that SSG.
>
> I think one coding barrier for the "add" approach is ResourceSpec.UNKNOWN,
> which doesn't support the 'merge' operation. I tend to use
> ResourceSpec.ZERO as default; the task executor should be aware of this.
>
> @Chesnay
> > My main worry is that if we wire the runtime to work on SSGs it's
> > gonna be difficult to implement more fine-grained approaches, which
> > would not be the case if, for the runtime, they are always defined on an
> > operator-level.
>
> An "add" operation should be less invasive and pose a low barrier for
> future fine-grained approaches.
>
> @Stephan
> > - Users can define different slot sharing groups for operators like they
> > do now, with the exception that you cannot mix operators that have a
> > resource profile and operators that have no resource profile.
>
> @Till
> > This effectively means that all unspecified operators
> > will implicitly have a zero resource requirement.
> > I am wondering whether this wouldn't lead to a surprising behaviour for
> > the user. If the user specifies the resource requirements for a single
> > operator, then he probably will assume that the other operators will get
> > the default share of resources and not nothing.
>
> I think it is inherent, due to the fact that we cannot define
> ResourceSpec.ONE, e.g. the resource requirement for exactly one default
> slot, with concrete numbers? I tend to squash out unspecified ones if
> there are operators in the chain with explicit resource specifying.
> Otherwise, the protocol tends to be verbose, as in "give me this much
> resource and a default". I think if we have explicit resource specifying
> for only some operators, it is just saying "I don't care about the other
> operators that much, just get them places to run". Most likely these are
> stateless filter/map or other less resource-consuming operators. If there
> is indeed a problem, I think clients can specify a global default (or
> other level default in future). In the job graph generating phase, we
> could take that default into account for unspecified operators.
>
> @FLIP-156
> > Expose operator chaining. (Cons of task level resource specifying)
>
> Is it inherent for all group level resource specifying? They will either
> break chaining or obey it, or even could not work with it.
>
> To sum up the above, my suggestions are:
>
> On the API side:
> * StreamExecutionEnvironment: A global default (ResourceSpec.ZERO if
> unspecified).
> * Operator: ResourceSpec.ZERO (unspecified) as default.
> * Task: sum of requirements from specified operators + global default (if
> there are any unspecified operators)
> * SSG: additional resource to physical slot.
>
> On the runtime side:
> * Task: ResourceSpec.Task or ResourceSpec.ZERO
> * SSG: ResourceSpec.SSG or ResourceSpec.ZERO
>
> A physical slot gets the summed-up resources from its logical slots and
> SSG; if it gets ResourceSpec.ZERO, it is just a default sized slot.
>
> In short, turn SSG resource specifying into "add" and drop
> ResourceSpec.UNKNOWN.
>
>
> Questions/Issues:
> * Could SSG express negative resource requirement?
> * Is there a concrete bar preventing partially configured resources from
> functioning? I saw it will fail job submission in Dispatcher.submitJob.
> * An option (cluster/job level) to force slot sharing in scheduler? This
> could be useful in case of migration from FLIP-156 to future approach.
> * An option (cluster) to ignore resource specifying (allow resource
> specified job to run on open box environment) for non-production usage?
>
>
>
> On February 1, 2021 at 11:54:10, Yangze Guo ([hidden email]) wrote:
>
> Thanks for the reply, Till and Xintong!
>
> I updated the FLIP, including:
> - Edit the JavaDoc of the proposed
> StreamGraphGenerator#setSlotSharingGroupResource.
> - Add "Future Plan" section, which contains the potential follow-up
> issues and the limitations to be documented when fine-grained resource
> management is exposed to users.
>
> I'll start a vote in another thread.
>
> Best,
> Yangze Guo
>
> On Fri, Jan 29, 2021 at 10:07 PM Till Rohrmann <[hidden email]> wrote:
> >
> > Thanks for summarizing the discussion, Yangze. I agree that setting
> > resource requirements per operator is not very user friendly. Moreover, I
> > couldn't come up with a different proposal which would be as easy to use
> > and wouldn't expose internal scheduling details. In fact, following this
> > argument then we shouldn't have exposed the slot sharing groups in the
> > first place.
> >
> > What is important for the user is that we properly document the
> > limitations and constraints the fine grained resource specification has.
> > For example, we should explain how optimizations like chaining are
> > affected by it and how different execution modes (batch vs. streaming)
> > affect the execution of operators which have specified resources. These
> > things shouldn't become part of the contract of this feature and are more
> > caused by internal implementation details but it will be important to
> > understand these things properly in order to use this feature effectively.
> >
> > Hence, +1 for starting the vote for this FLIP.
> > > > Cheers, > > Till > > > > On Tue, Jan 26, 2021 at 4:37 AM Xintong Song <[hidden email]> > wrote: > > > > > Thanks for the summary, Yangze. > > > > > > The changes and follow-up issues LGTM. Let's wait for responses from > the > > > others before starting a vote. > > > > > > Thank you~ > > > > > > Xintong Song > > > > > > > > > > > > On Tue, Jan 26, 2021 at 11:08 AM Yangze Guo <[hidden email]> > wrote: > > > > > > > Thanks everyone for the lively discussion. I'd like to try to > > > > summarize the current convergence in the discussion. Please let me > > > > know if I got things wrong or missed something crucial here. > > > > > > > > Change of this FLIP: > > > > - Treat the SSG resource requirements as a hint instead of a > > > > restriction for the runtime. That's should be explicitly explained in > > > > the JavaDocs. > > > > > > > > Potential follow-up issues if needed: > > > > - Provide operator-level resource configuration interface. > > > > - Provide multiple options for deciding resources for SSGs whose > > > > requirement is not specified: > > > > ** Default slot resource. > > > > ** Default operator resource times number of operators. > > > > > > > > If there are no other issues, I'll update the FLIP accordingly and > > > > start a vote thread. Thanks all for the valuable feedback again. > > > > > > > > Best, > > > > Yangze Guo > > > > > > > > Best, > > > > Yangze Guo > > > > > > > > > > > > On Fri, Jan 22, 2021 at 11:30 AM Xintong Song <[hidden email] > > > > > > wrote: > > > > > > > > > > > > > > > FGRuntimeInterface.png > > > > > > > > > > Thank you~ > > > > > > > > > > Xintong Song > > > > > > > > > > > > > > > > > > > > On Fri, Jan 22, 2021 at 11:11 AM Xintong Song < > [hidden email]> > > > > > wrote: > > > > >> > > > > >> I think Chesnay's proposal could actually work. IIUC, the keypoint > is > > > > to derive operator requirements from SSG requirements on the API > side, so > > > > that the runtime only deals with operator requirements. 
It's > debatable > > > how > > > > the deriving should be done though. E.g., an alternative could be to > > > evenly > > > > divide the SSG requirement into requirements of operators in the > group. > > > > >> > > > > >> > > > > >> However, I'm not entirely sure which option is more desired. > > > > Illustrating my understanding in the following figure, in which on > the > > > top > > > > is Chesnay's proposal and on the bottom is the SSG-based proposal in > this > > > > FLIP. > > > > >> > > > > >> > > > > >> > > > > >> I think the major difference between the two approaches is where > > > > deriving operator requirements from SSG requirements happens. > > > > >> > > > > >> - Chesnay's proposal simplifies the runtime logic and the > interface to > > > > expose, at the price of moving more complexity (i.e. the deriving) to > the > > > > API side. The question is, where do we prefer to keep the complexity? > I'm > > > > slightly leaning towards having a thin API and keep the complexity in > > > > runtime if possible. > > > > >> > > > > >> - Notice that the dash line arrows represent optional steps that > are > > > > needed only for schedulers that do not respect SSGs, which we don't > have > > > at > > > > the moment. If we only look at the solid line arrows, then the > SSG-based > > > > approach is much simpler, without needing to derive and aggregate the > > > > requirements back and forth. I'm not sure about complicating the > current > > > > design only for the potential future needs. > > > > >> > > > > >> > > > > >> Thank you~ > > > > >> > > > > >> Xintong Song > > > > >> > > > > >> > > > > >> > > > > >> > > > > >> On Fri, Jan 22, 2021 at 7:35 AM Chesnay Schepler < > [hidden email]> > > > > wrote: > > > > >>> > > > > >>> You're raising a good point, but I think I can rectify that with > a > > > > minor > > > > >>> adjustment. 
> > > > >>>
> > > > >>> Default requirements are whatever the default requirements are;
> > > > >>> setting the requirements for one operator has no effect on other
> > > > >>> operators.
> > > > >>>
> > > > >>> With these rules, and some API enhancements, the following mockup
> > > > >>> would replicate the SSG-based behavior:
> > > > >>>
> > > > >>> Map<SlotSharingGroupId, Requirements> requirements = ...
> > > > >>> for slotSharingGroup in env.getSlotSharingGroups() {
> > > > >>>     vertices = slotSharingGroup.getVertices()
> > > > >>>     vertices.first().setRequirements(requirements.get(slotSharingGroup.getID()))
> > > > >>>     vertices.remaining().setRequirements(ZERO)
> > > > >>> }
> > > > >>>
> > > > >>> We could even allow setting requirements on slotsharing-groups
> > > > >>> colocation-groups and internally translate them accordingly.
> > > > >>> I can't help but feel this is a plain API issue.
> > > > >>>
> > > > >>> On 1/21/2021 9:44 AM, Till Rohrmann wrote:
> > > > >>> > If I understand you correctly Chesnay, then you want to decouple
> > > > >>> > the resource requirement specification from the slot sharing group
> > > > >>> > assignment. Hence, per default all operators would be in the same
> > > > >>> > slot sharing group. If there is no operator with a resource
> > > > >>> > specification, then the system would allocate a default slot for
> > > > >>> > it. If there is at least one operator, then the system would sum
> > > > >>> > up all the specified resources and allocate a slot of this size.
> > > > >>> > This effectively means that all unspecified operators will
> > > > >>> > implicitly have a zero resource requirement. Did I understand your
> > > > >>> > idea correctly?
> > > > >>> >
> > > > >>> > I am wondering whether this wouldn't lead to a surprising
> > > > >>> > behaviour for the user.
If the user specifies the resource requirements > for a > > > > >>> > single operator, then he probably will assume that the other > > > > operators > > > > >>> > will get the default share of resources and not nothing. > > > > >>> > > > > > >>> > Cheers, > > > > >>> > Till > > > > >>> > > > > > >>> > On Thu, Jan 21, 2021 at 3:25 AM Chesnay Schepler < > > > [hidden email] > > > > >>> > <mailto:[hidden email]>> wrote: > > > > >>> > > > > > >>> > Is there even a functional difference between specifying the > > > > >>> > requirements for an SSG vs specifying the same requirements on > > > a > > > > >>> > single > > > > >>> > operator within that group (ideally a colocation group to avoid > > > > this > > > > >>> > whole hint business)? > > > > >>> > > > > > >>> > Wouldn't we get the best of both worlds in the latter case? > > > > >>> > > > > > >>> > Users can take shortcuts to define shared requirements, > > > > >>> > but refine them further as needed on a per-operator basis, > > > > >>> > without changing semantics of slotsharing groups > > > > >>> > nor the runtime being locked into SSG-based requirements. > > > > >>> > > > > > >>> > (And before anyone argues what happens if slotsharing groups > > > > >>> > change or > > > > >>> > whatnot, that's a plain API issue that we could surely solve. > > > (A > > > > >>> > plain > > > > >>> > iteration over slotsharing groups and therein contained > > > operators > > > > >>> > would > > > > >>> > suffice)). > > > > >>> > > > > > >>> > On 1/20/2021 6:48 PM, Till Rohrmann wrote: > > > > >>> > > Maybe a different minor idea: Would it be possible to treat > > > > the SSG > > > > >>> > > resource requirements as a hint for the runtime similar to > > > how > > > > >>> > slot sharing > > > > >>> > > groups are designed at the moment? Meaning that we don't give > > > > >>> > the guarantee > > > > >>> > > that Flink will always deploy this set of tasks together no > > > > >>> > matter what > > > > >>> > > comes. 
If, for example, the runtime can derive by some means > > > > the > > > > >>> > resource > > > > >>> > > requirements for each task based on the requirements for the > > > > >>> > SSG, this > > > > >>> > > could be possible. One easy strategy would be to give every > > > > task > > > > >>> > the same > > > > >>> > > resources as the whole slot sharing group. Another one could > > > be > > > > >>> > > distributing the resources equally among the tasks. This does > > > > >>> > not even have > > > > >>> > > to be implemented but we would give ourselves the freedom to > > > > change > > > > >>> > > scheduling if need should arise. > > > > >>> > > > > > > >>> > > Cheers, > > > > >>> > > Till > > > > >>> > > > > > > >>> > > On Wed, Jan 20, 2021 at 7:04 AM Yangze Guo < > > > [hidden email] > > > > >>> > <mailto:[hidden email]>> wrote: > > > > >>> > > > > > > >>> > >> Thanks for the responses, Till and Xintong. > > > > >>> > >> > > > > >>> > >> I second Xintong's comment that SSG-based runtime interface > > > > >>> > will give > > > > >>> > >> us the flexibility to achieve op/task-based approach. That's > > > > one of > > > > >>> > >> the most important reasons for our design choice. > > > > >>> > >> > > > > >>> > >> Some cents regarding the default operator resource: > > > > >>> > >> - It might be good for the scenario of DataStream jobs. > > > > >>> > >> ** For light-weight operators, the accumulative > > > > >>> > configuration error > > > > >>> > >> will not be significant. Then, the resource of a task used > > > is > > > > >>> > >> proportional to the number of operators it contains. > > > > >>> > >> ** For heavy operators like join and window or operators > > > > >>> > using the > > > > >>> > >> external resources, user will turn to the fine-grained > > > > resource > > > > >>> > >> configuration. 
> > > > >>> > >> - It can increase the stability for the standalone cluster > > > > >>> > where task > > > > >>> > >> executors registered are heterogeneous(with different > > > default > > > > slot > > > > >>> > >> resources). > > > > >>> > >> - It might not be good for SQL users. The operators that SQL > > > > >>> > will be > > > > >>> > >> transferred to is a black box to the user. We also do not > > > > guarantee > > > > >>> > >> the cross-version of consistency of the transformation so > > > far. > > > > >>> > >> > > > > >>> > >> I think it can be treated as a follow-up work when the > > > > fine-grained > > > > >>> > >> resource management is end-to-end ready. > > > > >>> > >> > > > > >>> > >> Best, > > > > >>> > >> Yangze Guo > > > > >>> > >> > > > > >>> > >> > > > > >>> > >> On Wed, Jan 20, 2021 at 11:16 AM Xintong Song > > > > >>> > <[hidden email] <mailto:[hidden email]>> > > > > >>> > >> wrote: > > > > >>> > >>> Thanks for the feedback, Till. > > > > >>> > >>> > > > > >>> > >>> ## I feel that what you proposed (operator-based + default > > > > >>> > value) might > > > > >>> > >> be > > > > >>> > >>> subsumed by the SSG-based approach. > > > > >>> > >>> Thinking of op_1 -> op_2, there are the following 4 cases, > > > > >>> > categorized by > > > > >>> > >>> whether the resource requirements are known to the users. > > > > >>> > >>> > > > > >>> > >>> 1. *Both known.* As previously mentioned, there's no > > > > >>> > reason to put > > > > >>> > >>> multiple operators whose individual resource > > > requirements > > > > >>> > are already > > > > >>> > >> known > > > > >>> > >>> into the same group in fine-grained resource > > > management. > > > > >>> > And if op_1 > > > > >>> > >> and > > > > >>> > >>> op_2 are in different groups, there should be no > > > problem > > > > >>> > switching > > > > >>> > >> data > > > > >>> > >>> exchange mode from pipelined to blocking. 
This is > > > > >>> > equivalent to > > > > >>> > >> specifying > > > > >>> > >>> operator resource requirements in your proposal. > > > > >>> > >>> 2. *op_1 known, op_2 unknown.* Similar to 1), except > > > that > > > > >>> > op_2 is in a > > > > >>> > >>> SSG whose resource is not specified thus would have the > > > > >>> > default slot > > > > >>> > >>> resource. This is equivalent to having default operator > > > > >>> > resources in > > > > >>> > >> your > > > > >>> > >>> proposal. > > > > >>> > >>> 3. *Both unknown*. The user can either set op_1 and > > > op_2 > > > > >>> > to the same > > > > >>> > >> SSG > > > > >>> > >>> or separate SSGs. > > > > >>> > >>> - If op_1 and op_2 are in the same SSG, it will be > > > > >>> > equivalent to > > > > >>> > >> the > > > > >>> > >>> coarse-grained resource management, where op_1 and > > > > op_2 > > > > >>> > share a > > > > >>> > >> default > > > > >>> > >>> size slot no matter which data exchange mode is > > > used. > > > > >>> > >>> - If op_1 and op_2 are in different SSGs, then each > > > of > > > > >>> > them will > > > > >>> > >> use > > > > >>> > >>> a default size slot. This is equivalent to setting > > > > them > > > > >>> > with > > > > >>> > >> default > > > > >>> > >>> operator resources in your proposal. > > > > >>> > >>> 4. *Total (pipeline) or max (blocking) of op_1 and op_2 > > > > is > > > > >>> > known.* > > > > >>> > >>> - It is possible that the user learns the total / > > > max > > > > >>> > resource > > > > >>> > >>> requirement from executing and monitoring the job, > > > > >>> > while not > > > > >>> > >>> being aware of > > > > >>> > >>> individual operator requirements. > > > > >>> > >>> - I believe this is the case your proposal does not > > > > >>> > cover. And TBH, > > > > >>> > >>> this is probably how most users learn the resource > > > > >>> > requirements, > > > > >>> > >>> according > > > > >>> > >>> to my experiences. 
> > > > >>> > >>> - In this case, the user might need to specify > > > > >>> > different resources > > > > >>> > >> if > > > > >>> > >>> he wants to switch the execution mode, which should > > > > not > > > > >>> > be worse > > > > >>> > >> than not > > > > >>> > >>> being able to use fine-grained resource management. > > > > >>> > >>> > > > > >>> > >>> > > > > >>> > >>> ## An additional idea inspired by your proposal. > > > > >>> > >>> We may provide multiple options for deciding resources for > > > > >>> > SSGs whose > > > > >>> > >>> requirement is not specified, if needed. > > > > >>> > >>> > > > > >>> > >>> - Default slot resource (current design) > > > > >>> > >>> - Default operator resource times number of operators > > > > >>> > (equivalent to > > > > >>> > >>> your proposal) > > > > >>> > >>> > > > > >>> > >>> > > > > >>> > >>> ## Exposing internal runtime strategies > > > > >>> > >>> Theoretically, yes. Tying to the SSGs, the resource > > > > >>> > requirements might be > > > > >>> > >>> affected if how SSGs are internally handled changes in > > > > future. > > > > >>> > >> Practically, > > > > >>> > >>> I do not concretely see at the moment what kind of changes > > > we > > > > >>> > may want in > > > > >>> > >>> future that might conflict with this FLIP proposal, as the > > > > >>> > question of > > > > >>> > >>> switching data exchange mode answered above. I'd suggest to > > > > >>> > not give up > > > > >>> > >> the > > > > >>> > >>> user friendliness we may gain now for the future problems > > > > that > > > > >>> > may or may > > > > >>> > >>> not exist. > > > > >>> > >>> > > > > >>> > >>> Moreover, the SSG-based approach has the flexibility to > > > > >>> > achieve the > > > > >>> > >>> equivalent behavior as the operator-based approach, if we > > > > set each > > > > >>> > >> operator > > > > >>> > >>> (or task) to a separate SSG. We can even provide a shortcut > > > > >>> > option to > > > > >>> > >>> automatically do that for users, if needed. 
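The two options mentioned above for sizing an SSG whose requirement is not specified (default slot resource, or default operator resource times the number of operators) differ only in whether the group size scales with the operator count. A minimal sketch, assuming made-up numbers; `Requirement` and `UnspecifiedSsgSketch` are hypothetical names for illustration and are not part of Flink:

```java
// Hypothetical stand-in for a resource requirement; NOT Flink's ResourceSpec.
final class Requirement {
    final double cpuCores;
    final long memoryMb;

    Requirement(double cpuCores, long memoryMb) {
        this.cpuCores = cpuCores;
        this.memoryMb = memoryMb;
    }

    // Scale every dimension by the same integer factor.
    Requirement times(int factor) {
        return new Requirement(cpuCores * factor, memoryMb * factor);
    }

    @Override
    public String toString() {
        return cpuCores + " cores, " + memoryMb + " MB";
    }
}

final class UnspecifiedSsgSketch {
    // Option 1: every unspecified SSG gets one default-sized slot,
    // regardless of how many operators it contains.
    static Requirement byDefaultSlot(Requirement defaultSlot, int numOperators) {
        return defaultSlot;
    }

    // Option 2: the SSG gets the default operator resource once per operator.
    static Requirement byDefaultOperator(Requirement defaultOperator, int numOperators) {
        return defaultOperator.times(numOperators);
    }

    public static void main(String[] args) {
        Requirement defaultSlot = new Requirement(1.0, 1024);
        Requirement defaultOperator = new Requirement(0.25, 64);
        int numOperators = 4;
        System.out.println("default slot:     " + byDefaultSlot(defaultSlot, numOperators));
        System.out.println("default operator: " + byDefaultOperator(defaultOperator, numOperators));
    }
}
```

Under option 2, setting each operator into its own SSG reduces to plain per-operator defaults, which matches the equivalence with the operator-based approach described above.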
> > > > >>> > >>> > > > > >>> > >>> > > > > >>> > >>> Thank you~ > > > > >>> > >>> > > > > >>> > >>> Xintong Song > > > > >>> > >>> > > > > >>> > >>> > > > > >>> > >>> > > > > >>> > >>> On Tue, Jan 19, 2021 at 11:48 PM Till Rohrmann > > > > >>> > <[hidden email] <mailto:[hidden email]>> > > > > >>> > >> wrote: > > > > >>> > >>>> Thanks for the responses Xintong and Stephan, > > > > >>> > >>>> > > > > >>> > >>>> I agree that being able to define the resource > > > requirements > > > > for a > > > > >>> > >> group of > > > > >>> > >>>> operators is more user friendly. However, my concern is > > > that > > > > >>> > we are > > > > >>> > >>>> exposing thereby internal runtime strategies which might > > > > >>> > limit our > > > > >>> > >>>> flexibility to execute a given job. Moreover, the > > > semantics > > > > of > > > > >>> > >> configuring > > > > >>> > >>>> resource requirements for SSGs could break if switching > > > from > > > > >>> > streaming > > > > >>> > >> to > > > > >>> > >>>> batch execution. If one defines the resource requirements > > > > for > > > > >>> > op_1 -> > > > > >>> > >> op_2 > > > > >>> > >>>> which run in pipelined mode when using the streaming > > > > >>> > execution, then > > > > >>> > >> how do > > > > >>> > >>>> we interpret these requirements when op_1 -> op_2 are > > > > >>> > executed with a > > > > >>> > >>>> blocking data exchange in batch execution mode? > > > > Consequently, > > > > >>> > I am > > > > >>> > >> still > > > > >>> > >>>> leaning towards Stephan's proposal to set the resource > > > > >>> > requirements per > > > > >>> > >>>> operator. 
> > > > >>> > >>>> > > > > >>> > >>>> Maybe the following proposal makes the configuration > > > easier: > > > > >>> > If the > > > > >>> > >> user > > > > >>> > >>>> wants to use fine-grained resource requirements, then she > > > > >>> > needs to > > > > >>> > >> specify > > > > >>> > >>>> the default size which is used for operators which have no > > > > >>> > explicit > > > > >>> > >>>> resource annotation. If this holds true, then every > > > operator > > > > >>> > would > > > > >>> > >> have a > > > > >>> > >>>> resource requirement and the system can try to execute the > > > > >>> > operators > > > > >>> > >> in the > > > > >>> > >>>> best possible manner w/o being constrained by how the user > > > > >>> > set the SSG > > > > >>> > >>>> requirements. > > > > >>> > >>>> > > > > >>> > >>>> Cheers, > > > > >>> > >>>> Till > > > > >>> > >>>> > > > > >>> > >>>> On Tue, Jan 19, 2021 at 9:09 AM Xintong Song > > > > >>> > <[hidden email] <mailto:[hidden email]>> > > > > >>> > >>>> wrote: > > > > >>> > >>>> > > > > >>> > >>>>> Thanks for the feedback, Stephan. > > > > >>> > >>>>> > > > > >>> > >>>>> Actually, your proposal has also come to my mind at some > > > > >>> > point. And I > > > > >>> > >>>> have > > > > >>> > >>>>> some concerns about it. > > > > >>> > >>>>> > > > > >>> > >>>>> > > > > >>> > >>>>> 1. It does not give users the same control as the > > > SSG-based > > > > >>> > approach. > > > > >>> > >>>>> > > > > >>> > >>>>> > > > > >>> > >>>>> While both approaches do not require specifying for each > > > > >>> > operator, > > > > >>> > >>>>> SSG-based approach supports the semantic that "some > > > > operators > > > > >>> > >> together > > > > >>> > >>>> use > > > > >>> > >>>>> this much resource" while the operator-based approach > > > > doesn't. 
> > > > >>> > >>>>> > > > > >>> > >>>>> > > > > >>> > >>>>> Think of a long pipeline with m operators (o_1, o_2, ..., > > > > >>> > o_m), and > > > > >>> > >> at > > > > >>> > >>>> some > > > > >>> > >>>>> point there's an agg o_n (1 < n < m) which significantly > > > > >>> > reduces the > > > > >>> > >> data > > > > >>> > >>>>> amount. One can separate the pipeline into 2 groups SSG_1 > > > > >>> > (o_1, ..., > > > > >>> > >> o_n) > > > > >>> > >>>>> and SSG_2 (o_n+1, ... o_m), so that configuring much > > > higher > > > > >>> > >> parallelisms > > > > >>> > >>>>> for operators in SSG_1 than for operators in SSG_2 won't > > > > >>> > lead to too > > > > >>> > >> much > > > > >>> > >>>>> wasting of resources. If the two SSGs end up needing > > > > different > > > > >>> > >> resources, > > > > >>> > >>>>> with the SSG-based approach one can directly specify > > > > >>> > resources for > > > > >>> > >> the > > > > >>> > >>>> two > > > > >>> > >>>>> groups. However, with the operator-based approach, the > > > > user will > > > > >>> > >> have to > > > > >>> > >>>>> specify resources for each operator in one of the two > > > > >>> > groups, and > > > > >>> > >> tune > > > > >>> > >>>> the > > > > >>> > >>>>> default slot resource via configurations to fit the other > > > > group. > > > > >>> > >>>>> > > > > >>> > >>>>> > > > > >>> > >>>>> 2. It increases the chance of breaking operator chains. > > > > >>> > >>>>> > > > > >>> > >>>>> > > > > >>> > >>>>> Setting chainnable operators into different slot sharing > > > > >>> > groups will > > > > >>> > >>>>> prevent them from being chained. 
In the current > > > > implementation, > > > > >>> > >>>> downstream > > > > >>> > >>>>> operators, if SSG not explicitly specified, will be set > > > to > > > > >>> > the same > > > > >>> > >> group > > > > >>> > >>>>> as the chainable upstream operators (unless multiple > > > > upstream > > > > >>> > >> operators > > > > >>> > >>>> in > > > > >>> > >>>>> different groups), to reduce the chance of breaking > > > chains. > > > > >>> > >>>>> > > > > >>> > >>>>> > > > > >>> > >>>>> Thinking of chainable operators o_1 -> o_2 -> o_3 -> o_3, > > > > >>> > deciding > > > > >>> > >> SSGs > > > > >>> > >>>>> based on whether resource is specified we will easily get > > > > >>> > groups like > > > > >>> > >>>> (o_1, > > > > >>> > >>>>> o_3) & (o_2, o_4), where none of the operators can be > > > > >>> > chained. This > > > > >>> > >> is > > > > >>> > >>>> also > > > > >>> > >>>>> possible for the SSG-based approach, but I believe the > > > > >>> > chance is much > > > > >>> > >>>>> smaller because there's no strong reason for users to > > > > >>> > specify the > > > > >>> > >> groups > > > > >>> > >>>>> with alternate operators like that. We are more likely to > > > > >>> > get groups > > > > >>> > >> like > > > > >>> > >>>>> (o_1, o_2) & (o_3, o_4), where the chain breaks only > > > > between > > > > >>> > o_2 and > > > > >>> > >> o_3. > > > > >>> > >>>>> > > > > >>> > >>>>> 3. It complicates the system by having two different > > > > >>> > mechanisms for > > > > >>> > >>>> sharing > > > > >>> > >>>>> managed memory in a slot. > > > > >>> > >>>>> > > > > >>> > >>>>> > > > > >>> > >>>>> - In FLIP-141, we introduced the intra-slot managed > > > memory > > > > >>> > sharing > > > > >>> > >>>>> mechanism, where managed memory is first distributed > > > > >>> > according to the > > > > >>> > >>>>> consumer type, then further distributed across operators > > > > of that > > > > >>> > >> consumer > > > > >>> > >>>>> type. 
> > > > >>> > >>>>> > > > > >>> > >>>>> - With the operator-based approach, managed memory size > > > > >>> > specified > > > > >>> > >> for an > > > > >>> > >>>>> operator should account for all the consumer types of > > > that > > > > >>> > operator. > > > > >>> > >> That > > > > >>> > >>>>> means the managed memory is first distributed across > > > > >>> > operators, then > > > > >>> > >>>>> distributed to different consumer types of each operator. > > > > >>> > >>>>> > > > > >>> > >>>>> > > > > >>> > >>>>> Unfortunately, the different order of the two calculation > > > > >>> > steps can > > > > >>> > >> lead > > > > >>> > >>>> to > > > > >>> > >>>>> different results. To be specific, the semantic of the > > > > >>> > configuration > > > > >>> > >>>> option > > > > >>> > >>>>> `consumer-weights` changed (within a slot vs. within an > > > > >>> > operator). > > > > >>> > >>>>> > > > > >>> > >>>>> > > > > >>> > >>>>> > > > > >>> > >>>>> To sum up things: > > > > >>> > >>>>> > > > > >>> > >>>>> While (3) might be a bit more implementation related, I > > > > >>> > think (1) > > > > >>> > >> and (2) > > > > >>> > >>>>> somehow suggest that, the price for the proposed approach > > > > to > > > > >>> > avoid > > > > >>> > >>>>> specifying resource for every operator is that it's not > > > as > > > > >>> > >> independent > > > > >>> > >>>> from > > > > >>> > >>>>> operator chaining and slot sharing as the operator-based > > > > >>> > approach > > > > >>> > >>>> discussed > > > > >>> > >>>>> in the FLIP. > > > > >>> > >>>>> > > > > >>> > >>>>> > > > > >>> > >>>>> Thank you~ > > > > >>> > >>>>> > > > > >>> > >>>>> Xintong Song > > > > >>> > >>>>> > > > > >>> > >>>>> > > > > >>> > >>>>> > > > > >>> > >>>>> On Tue, Jan 19, 2021 at 4:29 AM Stephan Ewen > > > > >>> > <[hidden email] <mailto:[hidden email]>> > > > > >>> > >> wrote: > > > > >>> > >>>>>> Thanks a lot, Yangze and Xintong for this FLIP. 
> > > > >>> > >>>>>> > > > > >>> > >>>>>> I want to say, first of all, that this is super well > > > > >>> > written. And > > > > >>> > >> the > > > > >>> > >>>>>> points that the FLIP makes about how to expose the > > > > >>> > configuration to > > > > >>> > >>>> users > > > > >>> > >>>>>> is exactly the right thing to figure out first. > > > > >>> > >>>>>> So good job here! > > > > >>> > >>>>>> > > > > >>> > >>>>>> About how to let users specify the resource profiles. > > > If I > > > > >>> > can sum > > > > >>> > >> the > > > > >>> > >>>>> FLIP > > > > >>> > >>>>>> and previous discussion up in my own words, the problem > > > > is the > > > > >>> > >>>> following: > > > > >>> > >>>>>> Operator-level specification is the simplest and > > > cleanest > > > > >>> > approach, > > > > >>> > >>>>> because > > > > >>> > >>>>>>> it avoids mixing operator configuration (resource) and > > > > >>> > >> scheduling. No > > > > >>> > >>>>>>> matter what other parameters change (chaining, slot > > > > sharing, > > > > >>> > >>>> switching > > > > >>> > >>>>>>> pipelined and blocking shuffles), the resource profiles > > > > >>> > stay the > > > > >>> > >>>> same. > > > > >>> > >>>>>>> But it would require that a user specifies resources on > > > > all > > > > >>> > >>>> operators, > > > > >>> > >>>>>>> which makes it hard to use. That's why the FLIP > > > suggests > > > > going > > > > >>> > >> with > > > > >>> > >>>>>>> specifying resources on a Sharing-Group. > > > > >>> > >>>>>> > > > > >>> > >>>>>> I think both thoughts are important, so can we find a > > > > solution > > > > >>> > >> where > > > > >>> > >>>> the > > > > >>> > >>>>>> Resource Profiles are specified on an Operator, but we > > > > >>> > still avoid > > > > >>> > >> that > > > > >>> > >>>>> we > > > > >>> > >>>>>> need to specify a resource profile on every operator? 
> > > > >>> > >>>>>> > > > > >>> > >>>>>> What do you think about something like the following: > > > > >>> > >>>>>> - Resource Profiles are specified on an operator > > > level. > > > > >>> > >>>>>> - Not all operators need profiles > > > > >>> > >>>>>> - All Operators without a Resource Profile ended up > > > in > > > > the > > > > >>> > >> default > > > > >>> > >>>> slot > > > > >>> > >>>>>> sharing group with a default profile (will get a default > > > > slot). > > > > >>> > >>>>>> - All Operators with a Resource Profile will go into > > > > >>> > another slot > > > > >>> > >>>>> sharing > > > > >>> > >>>>>> group (the resource-specified-group). > > > > >>> > >>>>>> - Users can define different slot sharing groups for > > > > >>> > operators > > > > >>> > >> like > > > > >>> > >>>>> they > > > > >>> > >>>>>> do now, with the exception that you cannot mix operators > > > > >>> > that have > > > > >>> > >> a > > > > >>> > >>>>>> resource profile and operators that have no resource > > > > profile. > > > > >>> > >>>>>> - The default case where no operator has a resource > > > > >>> > profile is > > > > >>> > >> just a > > > > >>> > >>>>>> special case of this model > > > > >>> > >>>>>> - The chaining logic sums up the profiles per > > > operator, > > > > >>> > like it > > > > >>> > >> does > > > > >>> > >>>>> now, > > > > >>> > >>>>>> and the scheduler sums up the profiles of the tasks that > > > > it > > > > >>> > >> schedules > > > > >>> > >>>>>> together. > > > > >>> > >>>>>> > > > > >>> > >>>>>> > > > > >>> > >>>>>> There is another question about reactive scaling raised > > > > in the > > > > >>> > >> FLIP. I > > > > >>> > >>>>> need > > > > >>> > >>>>>> to think a bit about that. That is indeed a bit more > > > > tricky > > > > >>> > once we > > > > >>> > >>>> have > > > > >>> > >>>>>> slots of different sizes. 
> > > > >>> > >>>>>> It is not clear then which of the different slot > > > requests > > > > the > > > > >>> > >>>>>> ResourceManager should fulfill when new resources (TMs) > > > > >>> > show up, > > > > >>> > >> or how > > > > >>> > >>>>> the > > > > >>> > >>>>>> JobManager redistributes the slots resources when > > > > resources > > > > >>> > (TMs) > > > > >>> > >>>>> disappear > > > > >>> > >>>>>> This question is pretty orthogonal, though, to the "how > > > to > > > > >>> > specify > > > > >>> > >> the > > > > >>> > >>>>>> resources". > > > > >>> > >>>>>> > > > > >>> > >>>>>> > > > > >>> > >>>>>> Best, > > > > >>> > >>>>>> Stephan > > > > >>> > >>>>>> > > > > >>> > >>>>>> On Fri, Jan 8, 2021 at 5:14 AM Xintong Song > > > > >>> > <[hidden email] <mailto:[hidden email]> > > > > >>> > >>>>> wrote: > > > > >>> > >>>>>>> Thanks for drafting the FLIP and driving the > > > discussion, > > > > >>> > Yangze. > > > > >>> > >>>>>>> And Thanks for the feedback, Till and Chesnay. > > > > >>> > >>>>>>> > > > > >>> > >>>>>>> @Till, > > > > >>> > >>>>>>> > > > > >>> > >>>>>>> I agree that specifying requirements for SSGs means > > > that > > > > SSGs > > > > >>> > >> need to > > > > >>> > >>>>> be > > > > >>> > >>>>>>> supported in fine-grained resource management, > > > otherwise > > > > each > > > > >>> > >>>> operator > > > > >>> > >>>>>>> might use as many resources as the whole group. > > > However, > > > > I > > > > >>> > cannot > > > > >>> > >>>> think > > > > >>> > >>>>>> of > > > > >>> > >>>>>>> a strong reason for not supporting SSGs in fine-grained > > > > >>> > resource > > > > >>> > >>>>>>> management. 
> > > > >>> > >>>>>>> > > > > >>> > >>>>>>> > > > > >>> > >>>>>>>> Interestingly, if all operators have their resources > > > > properly > > > > >>> > >>>>>> specified, > > > > >>> > >>>>>>>> then slot sharing is no longer needed because Flink > > > > could > > > > >>> > >> slice off > > > > >>> > >>>>> the > > > > >>> > >>>>>>>> appropriately sized slots for every Task individually. > > > > >>> > >>>>>>>> > > > > >>> > >>>>>>> So for example, if we have a job consisting of two > > > > >>> > operator op_1 > > > > >>> > >> and > > > > >>> > >>>>> op_2 > > > > >>> > >>>>>>>> where each op needs 100 MB of memory, we would then > > > say > > > > that > > > > >>> > >> the > > > > >>> > >>>> slot > > > > >>> > >>>>>>>> sharing group needs 200 MB of memory to run. If we > > > have > > > > a > > > > >>> > >> cluster > > > > >>> > >>>>> with > > > > >>> > >>>>>> 2 > > > > >>> > >>>>>>>> TMs with one slot of 100 MB each, then the system > > > > cannot run > > > > >>> > >> this > > > > >>> > >>>>> job. > > > > >>> > >>>>>> If > > > > >>> > >>>>>>>> the resources were specified on an operator level, > > > then > > > > the > > > > >>> > >> system > > > > >>> > >>>>>> could > > > > >>> > >>>>>>>> still make the decision to deploy op_1 to TM_1 and > > > op_2 > > > > to > > > > >>> > >> TM_2. > > > > >>> > >>>>>>> > > > > >>> > >>>>>>> Couldn't agree more that if all operators' requirements > > > > are > > > > >>> > >> properly > > > > >>> > >>>>>>> specified, slot sharing should be no longer needed. I > > > > >>> > think this > > > > >>> > >>>>> exactly > > > > >>> > >>>>>>> disproves the example. If we already know op_1 and op_2 > > > > each > > > > >>> > >> needs > > > > >>> > >>>> 100 > > > > >>> > >>>>> MB > > > > >>> > >>>>>>> of memory, why would we put them in the same group? 
If > > > > >>> > they are > > > > >>> > >> in > > > > >>> > >>>>>> separate > > > > >>> > >>>>>>> groups, with the proposed approach the system can > > > freely > > > > >>> > deploy > > > > >>> > >> them > > > > >>> > >>>> to > > > > >>> > >>>>>>> either a 200 MB TM or two 100 MB TMs. > > > > >>> > >>>>>>> > > > > >>> > >>>>>>> Moreover, the precondition for not needing slot sharing > > > > is > > > > >>> > having > > > > >>> > >>>>>> resource > > > > >>> > >>>>>>> requirements properly specified for all operators. This > > > > is not > > > > >>> > >> always > > > > >>> > >>>>>>> possible, and usually requires tremendous efforts. One > > > > of the > > > > >>> > >>>> benefits > > > > >>> > >>>>>> for > > > > >>> > >>>>>>> SSG-based requirements is that it allows the user to > > > > freely > > > > >>> > >> decide > > > > >>> > >>>> the > > > > >>> > >>>>>>> granularity, thus efforts they want to pay. I would > > > > >>> > consider SSG > > > > >>> > >> in > > > > >>> > >>>>>>> fine-grained resource management as a group of > > > operators > > > > >>> > that the > > > > >>> > >>>> user > > > > >>> > >>>>>>> would like to specify the total resource for. There can > > > > be > > > > >>> > only > > > > >>> > >> one > > > > >>> > >>>>> group > > > > >>> > >>>>>>> in the job, 2~3 groups dividing the job into a few > > > major > > > > >>> > parts, > > > > >>> > >> or as > > > > >>> > >>>>>> many > > > > >>> > >>>>>>> groups as the number of tasks/operators, depending on > > > how > > > > >>> > >>>> fine-grained > > > > >>> > >>>>>> the > > > > >>> > >>>>>>> user is able to specify the resources. > > > > >>> > >>>>>>> > > > > >>> > >>>>>>> Having to support SSGs might be a constraint. 
But given > > > > >>> > that all > > > > >>> > >> the > > > > >>> > >>>>>>> current scheduler implementations already support > > > SSGs, I > > > > >>> > tend to > > > > >>> > >>>> think > > > > >>> > >>>>>>> that as an acceptable price for the above discussed > > > > >>> > usability and > > > > >>> > >>>>>>> flexibility. > > > > >>> > >>>>>>> > > > > >>> > >>>>>>> @Chesnay > > > > >>> > >>>>>>> > > > > >>> > >>>>>>> Will declaring them on slot sharing groups not also > > > waste > > > > >>> > >> resources > > > > >>> > >>>> if > > > > >>> > >>>>>> the > > > > >>> > >>>>>>>> parallelism of operators within that group are > > > > different? > > > > >>> > >>>>>>>> > > > > >>> > >>>>>>> Yes. It's a trade-off between usability and resource > > > > >>> > >> utilization. To > > > > >>> > >>>>>> avoid > > > > >>> > >>>>>>> such wasting, the user can define more groups, so that > > > > >>> > each group > > > > >>> > >>>>>> contains > > > > >>> > >>>>>>> less operators and the chance of having operators with > > > > >>> > different > > > > >>> > >>>>>>> parallelism will be reduced. The price is to have more > > > > >>> > resource > > > > >>> > >>>>>>> requirements to specify. > > > > >>> > >>>>>>> > > > > >>> > >>>>>>> It also seems like quite a hassle for users having to > > > > >>> > >> recalculate the > > > > >>> > >>>>>>>> resource requirements if they change the slot sharing. > > > > >>> > >>>>>>>> I'd think that it's not really workable for users that > > > > create > > > > >>> > >> a set > > > > >>> > >>>>> of > > > > >>> > >>>>>>>> re-usable operators which are mixed and matched in > > > their > > > > >>> > >>>>> applications; > > > > >>> > >>>>>>>> managing the resources requirements in such a setting > > > > >>> > would be > > > > >>> > >> a > > > > >>> > >>>>>>>> nightmare, and in the end would require operator-level > > > > >>> > >> requirements > > > > >>> > >>>>> any > > > > >>> > >>>>>>>> way. 
> > > > >>> > >>>>>>>> In that sense, I'm not even sure whether it really > > > > increases > > > > >>> > >>>>> usability. > > > > >>> > >>>>>>> - As mentioned in my reply to Till's comment, > > > > there's no > > > > >>> > >> reason to > > > > >>> > >>>>> put > > > > >>> > >>>>>>> multiple operators whose individual resource > > > > >>> > requirements are > > > > >>> > >>>>> already > > > > >>> > >>>>>>> known > > > > >>> > >>>>>>> into the same group in fine-grained resource > > > > management. > > > > >>> > >>>>>>> - Even an operator implementation is reused for > > > > multiple > > > > >>> > >>>>> applications, > > > > >>> > >>>>>>> it does not guarantee the same resource > > > requirements. > > > > >>> > During > > > > >>> > >> our > > > > >>> > >>>>> years > > > > >>> > >>>>>>> of > > > > >>> > >>>>>>> practices in Alibaba, with per-operator > > > requirements > > > > >>> > >> specified for > > > > >>> > >>>>>>> Blink's > > > > >>> > >>>>>>> fine-grained resource management, very few users > > > > >>> > (including > > > > >>> > >> our > > > > >>> > >>>>>>> specialists > > > > >>> > >>>>>>> who are dedicated to supporting Blink users) are as > > > > >>> > >> experienced as > > > > >>> > >>>>> to > > > > >>> > >>>>>>> accurately predict/estimate the operator resource > > > > >>> > >> requirements. > > > > >>> > >>>> Most > > > > >>> > >>>>>>> people > > > > >>> > >>>>>>> rely on the execution-time metrics (throughput, > > > > delay, cpu > > > > >>> > >> load, > > > > >>> > >>>>>> memory > > > > >>> > >>>>>>> usage, GC pressure, etc.) to improve the > > > > specification. > > > > >>> > >>>>>>> > > > > >>> > >>>>>>> To sum up: > > > > >>> > >>>>>>> If the user is capable of providing proper resource > > > > >>> > requirements > > > > >>> > >> for > > > > >>> > >>>>>> every > > > > >>> > >>>>>>> operator, that's definitely a good thing and we would > > > not > > > > >>> > need to > > > > >>> > >>>> rely > > > > >>> > >>>>> on > > > > >>> > >>>>>>> the SSGs. 
However, that shouldn't be a *must* for the > > > > >>> > >> fine-grained > > > > >>> > >>>>>> resource > > > > >>> > >>>>>>> management to work. For those users who are capable and > > > > do not > > > > >>> > >> like > > > > >>> > >>>>>> having > > > > >>> > >>>>>>> to set each operator to a separate SSG, I would be ok > > > to > > > > have > > > > >>> > >> both > > > > >>> > >>>>>>> SSG-based and operator-based runtime interfaces and to > > > > only > > > > >>> > >> fallback > > > > >>> > >>>> to > > > > >>> > >>>>>> the > > > > >>> > >>>>>>> SSG requirements when the operator requirements are not > > > > >>> > >> specified. > > > > >>> > >>>>>> However, > > > > >>> > >>>>>>> as the first step, I think we should prioritise the use > > > > cases > > > > >>> > >> where > > > > >>> > >>>>> users > > > > >>> > >>>>>>> are not that experienced. > > > > >>> > >>>>>>> > > > > >>> > >>>>>>> Thank you~ > > > > >>> > >>>>>>> > > > > >>> > >>>>>>> Xintong Song > > > > >>> > >>>>>>> > > > > >>> > >>>>>>> On Thu, Jan 7, 2021 at 9:55 PM Chesnay Schepler < > > > > >>> > >> [hidden email] <mailto:[hidden email]>> > > > > >>> > >>>>>>> wrote: > > > > >>> > >>>>>>> > > > > >>> > >>>>>>>> Will declaring them on slot sharing groups not also > > > > waste > > > > >>> > >> resources > > > > >>> > >>>>> if > > > > >>> > >>>>>>>> the parallelism of operators within that group are > > > > different? > > > > >>> > >>>>>>>> > > > > >>> > >>>>>>>> It also seems like quite a hassle for users having to > > > > >>> > >> recalculate > > > > >>> > >>>> the > > > > >>> > >>>>>>>> resource requirements if they change the slot sharing. 
> > > > >>> > >>>>>>>> I'd think that it's not really workable for users that > > > > create > > > > >>> > >> a set > > > > >>> > >>>>> of > > > > >>> > >>>>>>>> re-usable operators which are mixed and matched in > > > their > > > > >>> > >>>>> applications; > > > > >>> > >>>>>>>> managing the resources requirements in such a setting > > > > >>> > would be > > > > >>> > >> a > > > > >>> > >>>>>>>> nightmare, and in the end would require operator-level > > > > >>> > >> requirements > > > > >>> > >>>>> any > > > > >>> > >>>>>>>> way. > > > > >>> > >>>>>>>> In that sense, I'm not even sure whether it really > > > > increases > > > > >>> > >>>>> usability. > > > > >>> > >>>>>>>> My main worry is that it if we wire the runtime to > > > work > > > > >>> > on SSGs > > > > >>> > >>>> it's > > > > >>> > >>>>>>>> gonna be difficult to implement more fine-grained > > > > approaches, > > > > >>> > >> which > > > > >>> > >>>>>>>> would not be the case if, for the runtime, they are > > > > always > > > > >>> > >> defined > > > > >>> > >>>> on > > > > >>> > >>>>>> an > > > > >>> > >>>>>>>> operator-level. > > > > >>> > >>>>>>>> > > > > >>> > >>>>>>>> On 1/7/2021 2:42 PM, Till Rohrmann wrote: > > > > >>> > >>>>>>>>> Thanks for drafting this FLIP and starting this > > > > discussion > > > > >>> > >>>> Yangze. > > > > >>> > >>>>>>>>> I like that defining resource requirements on a slot > > > > sharing > > > > >>> > >>>> group > > > > >>> > >>>>>>> makes > > > > >>> > >>>>>>>>> the overall setup easier and improves usability of > > > > resource > > > > >>> > >>>>>>> requirements. > > > > >>> > >>>>>>>>> What I do not like about it is that it changes slot > > > > sharing > > > > >>> > >>>> groups > > > > >>> > >>>>>> from > > > > >>> > >>>>>>>>> being a scheduling hint to something which needs to > > > be > > > > >>> > >> supported > > > > >>> > >>>> in > > > > >>> > >>>>>>> order > > > > >>> > >>>>>>>>> to support fine grained resource requirements. 
So > > > far, > > > > the > > > > >>> > >> idea > > > > >>> > >>>> of > > > > >>> > >>>>>> slot > > > > >>> > >>>>>>>>> sharing groups was that it tells the system that a > > > set > > > > of > > > > >>> > >>>> operators > > > > >>> > >>>>>> can > > > > >>> > >>>>>>>> be > > > > >>> > >>>>>>>>> deployed in the same slot. But the system still had > > > the > > > > >>> > >> freedom > > > > >>> > >>>> to > > > > >>> > >>>>>> say > > > > >>> > >>>>>>>> that > > > > >>> > >>>>>>>>> it would rather place these tasks in different slots > > > > if it > > > > >>> > >>>> wanted. > > > > >>> > >>>>> If > > > > >>> > >>>>>>> we > > > > >>> > >>>>>>>>> now specify resource requirements on a per slot > > > sharing > > > > >>> > >> group, > > > > >>> > >>>> then > > > > >>> > >>>>>> the > > > > >>> > >>>>>>>>> only option for a scheduler which does not support > > > slot > > > > >>> > >> sharing > > > > >>> > >>>>>> groups > > > > >>> > >>>>>>> is > > > > >>> > >>>>>>>>> to say that every operator in this slot sharing group > > > > >>> > needs a > > > > >>> > >>>> slot > > > > >>> > >>>>>> with > > > > >>> > >>>>>>>> the > > > > >>> > >>>>>>>>> same resources as the whole group. > > > > >>> > >>>>>>>>> > > > > >>> > >>>>>>>>> So for example, if we have a job consisting of two > > > > operator > > > > >>> > >> op_1 > > > > >>> > >>>>> and > > > > >>> > >>>>>>> op_2 > > > > >>> > >>>>>>>>> where each op needs 100 MB of memory, we would then > > > > say that > > > > >>> > >> the > > > > >>> > >>>>> slot > > > > >>> > >>>>>>>>> sharing group needs 200 MB of memory to run. If we > > > > have a > > > > >>> > >> cluster > > > > >>> > >>>>>> with > > > > >>> > >>>>>>> 2 > > > > >>> > >>>>>>>>> TMs with one slot of 100 MB each, then the system > > > > cannot run > > > > >>> > >> this > > > > >>> > >>>>>> job. 
> > > > >>> > >>>>>>> If > > > > >>> > >>>>>>>>> the resources were specified on an operator level, > > > > then the > > > > >>> > >>>> system > > > > >>> > >>>>>>> could > > > > >>> > >>>>>>>>> still make the decision to deploy op_1 to TM_1 and > > > > op_2 to > > > > >>> > >> TM_2. > > > > >>> > >>>>>>>>> Originally, one of the primary goals of slot sharing > > > > groups > > > > >>> > >> was > > > > >>> > >>>> to > > > > >>> > >>>>>> make > > > > >>> > >>>>>>>> it > > > > >>> > >>>>>>>>> easier for the user to reason about how many slots a > > > > job > > > > >>> > >> needs > > > > >>> > >>>>>>>> independent > > > > >>> > >>>>>>>>> of the actual number of operators in the job. > > > > Interestingly, > > > > >>> > >> if > > > > >>> > >>>> all > > > > >>> > >>>>>>>>> operators have their resources properly specified, > > > > then slot > > > > >>> > >>>>> sharing > > > > >>> > >>>>>> is > > > > >>> > >>>>>>>> no > > > > >>> > >>>>>>>>> longer needed because Flink could slice off the > > > > >>> > appropriately > > > > >>> > >>>> sized > > > > >>> > >>>>>>> slots > > > > >>> > >>>>>>>>> for every Task individually. What matters is whether > > > > the > > > > >>> > >> whole > > > > >>> > >>>>>> cluster > > > > >>> > >>>>>>>> has > > > > >>> > >>>>>>>>> enough resources to run all tasks or not. 
> > > > >>> > >>>>>>>>> > > > > >>> > >>>>>>>>> Cheers, > > > > >>> > >>>>>>>>> Till > > > > >>> > >>>>>>>>> > > > > >>> > >>>>>>>>> On Thu, Jan 7, 2021 at 4:08 AM Yangze Guo < > > > > >>> > >> [hidden email] <mailto:[hidden email]>> > > > > >>> > >>>>>> wrote: > > > > >>> > >>>>>>>>>> Hi, there, > > > > >>> > >>>>>>>>>> > > > > >>> > >>>>>>>>>> We would like to start a discussion thread on > > > > "FLIP-156: > > > > >>> > >> Runtime > > > > >>> > >>>>>>>>>> Interfaces for Fine-Grained Resource > > > Requirements"[1], > > > > >>> > >> where we > > > > >>> > >>>>>>>>>> propose Slot Sharing Group (SSG) based runtime > > > > interfaces > > > > >>> > >> for > > > > >>> > >>>>>>>>>> specifying fine-grained resource requirements. > > > > >>> > >>>>>>>>>> > > > > >>> > >>>>>>>>>> In this FLIP: > > > > >>> > >>>>>>>>>> - Expound the user story of fine-grained resource > > > > >>> > >> management. > > > > >>> > >>>>>>>>>> - Propose runtime interfaces for specifying > > > SSG-based > > > > >>> > >> resource > > > > >>> > >>>>>>>>>> requirements. > > > > >>> > >>>>>>>>>> - Discuss the pros and cons of the three potential > > > > >>> > >> granularities > > > > >>> > >>>>> for > > > > >>> > >>>>>>>>>> specifying the resource requirements (op, task and > > > > slot > > > > >>> > >> sharing > > > > >>> > >>>>>> group) > > > > >>> > >>>>>>>>>> and explain why we choose the slot sharing group. > > > > >>> > >>>>>>>>>> > > > > >>> > >>>>>>>>>> Please find more details in the FLIP wiki document > > > > [1]. > > > > >>> > >> Looking > > > > >>> > >>>>>>>>>> forward to your feedback. 
> > > > >>> > >>>>>>>>>> > > > > >>> > >>>>>>>>>> [1] > > > > >>> > >>>>>>>>>> > > > > >>> > >> > > > > >>> > > > > > > > > > > https://cwiki.apache.org/confluence/display/FLINK/FLIP-156%3A+Runtime+Interfaces+for+Fine-Grained+Resource+Requirements > > > > >>> > < > > > > > > > > > https://cwiki.apache.org/confluence/display/FLINK/FLIP-156%3A+Runtime+Interfaces+for+Fine-Grained+Resource+Requirements > > > > > > > > > >>> > >>>>>>>>>> Best, > > > > >>> > >>>>>>>>>> Yangze Guo > > > > >>> > >>>>>>>>>> > > > > >>> > >>>>>>>> > > > > >>> > > > > > >>> > > > > > > > > |
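Till's op_1/op_2 example quoted above can be made concrete with a small sketch. It is not Flink code: the function names are hypothetical, memory is the only resource considered, and the operator-level case assumes that every operator gets a slot of its own. It only compares whether a job fits when the requirement is attached to the slot sharing group as a whole versus to each operator.

```python
from typing import Dict, List


def fits_as_group(op_mem_mb: Dict[str, int], slot_mem_mb: List[int]) -> bool:
    """SSG-level view: one slot must hold the summed requirement of the group."""
    group_mb = sum(op_mem_mb.values())
    return any(slot >= group_mb for slot in slot_mem_mb)


def fits_per_operator(op_mem_mb: Dict[str, int], slot_mem_mb: List[int]) -> bool:
    """Operator-level view (assumption: one operator per slot, no sharing).

    Sorting both sides in descending order and comparing pairwise is enough
    to decide whether every operator can be given a slot that is large enough.
    """
    needs = sorted(op_mem_mb.values(), reverse=True)
    slots = sorted(slot_mem_mb, reverse=True)
    return len(needs) <= len(slots) and all(s >= n for n, s in zip(needs, slots))


if __name__ == "__main__":
    ops = {"op_1": 100, "op_2": 100}   # each operator needs 100 MB
    cluster = [100, 100]               # two TMs with one 100 MB slot each
    print(fits_as_group(ops, cluster))      # False: no single slot has 200 MB
    print(fits_per_operator(ops, cluster))  # True: op_1 -> TM_1, op_2 -> TM_2
```

With a single 200 MB slot the results flip under these assumptions, which matches Xintong's point that the grouping chosen by the user decides which cluster layouts are usable.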
Thanks for sharing your thoughts, Kezhu. I like your ideas of how
per-operator and SSG requirements can be combined. I've also thought about
defining a default resource profile for all tasks which have no resources
configured. That way all operators would have resources assigned if the
user chooses to use this feature.

As Yangze and Xintong have said, we have decided to first only support
specifying resources for SSGs as this seems more user friendly. Based on
the feedback for this feature, one potential development direction might be
to allow the resource specification on a per-operator basis. Here we could
pick up your ideas.

Cheers,
Till

On Wed, Feb 3, 2021 at 7:31 AM Xintong Song <[hidden email]> wrote:

> Thanks for your feedback, Kezhu.
>
> > I think Flink *runtime* already has an ideal granularity for resource
> > management 'task'. If there is a slot shared by multiple tasks, that
> > slot's resource requirement is simple sum of all its logical slots. So
> > basically, this is no resource requirement for SlotSharingGroup in
> > runtime until now, right ?
>
> That is a half-finished implementation, coming from the previous attempts
> (years ago) to deliver the fine-grained resource management feature, and
> it was never really put into use.
>
> > From the FLIP and discussion, I assume that SSG resource specifying will
> > override operator level resource specifying if both are specified ?
>
> Actually, I think we should use the finer-grained resources (i.e. operator
> level) if both are specified. And more importantly, that is based on the
> assumption that we do need two different levels of interfaces.
>
> > So, I wonder whether we could interpret SSG resource specifying as an
> > "add" but not an "set" on resource requirement ?
>
> IIUC, this is the core idea behind your proposal. I think it provides an
> interesting idea of how we combine operator level and SSG level resources,
> *if we allow configuring resources at both levels*. However, I'm not sure
> whether configuring resources on the operator level is indeed needed.
> Therefore, as a first step, this FLIP proposes to only introduce the
> SSG-level interfaces. As listed in the future plan, we would consider
> allowing operator level resource configuration later if we do see a need
> for it. At that time, we definitely should discuss what to do if resources
> are configured at both levels.
>
> > * Could SSG express negative resource requirement ?
>
> No.
>
> > Is there concrete bar for partial resource configured not function ? I
> > saw it will fail job submission in Dispatcher.submitJob.
>
> With the SSG-based approach, this should no longer be needed. The
> constraint was introduced because we can neither properly define what is
> the resource of a task chained from an operator with specified resource and
> another with unspecified resource, nor for a slot shared by a task with
> specified resource and another with unspecified resource. With the
> SSG-based approach, we no longer have those problems.
>
> > An option(cluster/job level) to force slot sharing in scheduler ? This
> > could be useful in case of migration from FLIP-156 to future approach.
>
> I think this is exactly what we are trying to avoid: requiring the
> scheduler to enforce slot sharing.
>
> > An option(cluster) to ignore resource specifying(allow resource specified
> > job to run on open box environment) for no production usage ?
>
> That's possible. Actually, we are planning to introduce an option for
> activating the fine-grained resource management, for development purposes.
> We might consider keeping that option after the feature is completed, to
> allow disabling the feature without having to touch the job code.
>
> Thank you~
>
> Xintong Song
>
>
>
> On Wed, Feb 3, 2021 at 1:28 PM Kezhu Wang <[hidden email]> wrote:
>
> > Hi all, sorry for joining the discussion even after voting started.
> >
> > I want to share my thoughts on this after reading the above discussions.
> >
> > I think the Flink *runtime* already has an ideal granularity for resource
> > management: the 'task'. If a slot is shared by multiple tasks, that slot's
> > resource requirement is simply the sum of all its logical slots. So
> > basically, there is no resource requirement for SlotSharingGroup in the
> > runtime until now, right?
> >
> > As in the discussion, we already agree that: "If all operators have their
> > resources properly specified, then slot sharing is no longer needed."
> >
> > So it seems to me that the natural question to discuss is how to bridge
> > the impractical operator-level resource specification to the runtime's
> > task-level resource requirements. This is actually a pure API matter, as
> > Chesnay has pointed out.
> >
> > But FLIP-156 brings another direction to the table: how about using SSGs
> > for both API and runtime resource specification?
> >
> > From the FLIP and discussion, I assume that an SSG resource specification
> > will override the operator-level resource specification if both are
> > specified?
> >
> > So, I wonder whether we could interpret an SSG resource specification as
> > an "add" rather than a "set" on the resource requirement?
> >
> > The semantics is that an SSG resource specification adds additional
> > resources to the shared slot, to express concerns about possible high
> > throughput and resource requirements for tasks in one physical slot.
> >
> > The result is that if the scheduler indeed respects slot sharing, the
> > allocated slot will gain the extra resources specified for that SSG.
> >
> > I think one coding barrier for the "add" approach is ResourceSpec.UNKNOWN,
> > which does not support the 'merge' operation. I tend to use
> > ResourceSpec.ZERO as the default; the task executor should be aware of
> > this.
> >
> > @Chesnay
> > > My main worry is that if we wire the runtime to work on SSGs it's
> > > gonna be difficult to implement more fine-grained approaches, which
> > > would not be the case if, for the runtime, they are always defined on an
> > > operator-level.
> >
> > An "add" operation should be less invasive and set a low barrier for
> > future fine-grained approaches.
> >
> > @Stephan
> > > - Users can define different slot sharing groups for operators like they
> > > do now, with the exception that you cannot mix operators that have a
> > > resource profile and operators that have no resource profile.
> >
> > @Till
> > > This effectively means that all unspecified operators
> > > will implicitly have a zero resource requirement.
> > > I am wondering whether this wouldn't lead to a surprising behaviour for
> > > the user. If the user specifies the resource requirements for a single
> > > operator, then he probably will assume that the other operators will get
> > > the default share of resources and not nothing.
> >
> > I think this is inherent, because we cannot define ResourceSpec.ONE, i.e.
> > the resource requirement for exactly one default slot, with concrete
> > numbers. I tend to squash out the unspecified ones if there are operators
> > in the chain with explicitly specified resources. Otherwise, the protocol
> > becomes verbose, as in "give me this much resource and a default". I think
> > if we have explicit resource specifications for only some operators, it is
> > just saying "I don't care about the other operators that much, just get
> > them places to run". Most likely these are stateless filter/map or other
> > less resource-consuming operators. If there is indeed a problem, I think
> > clients can specify a global default (or a default at another level in the
> > future). In the job graph generation phase, we could take that default
> > into account for unspecified operators.
> >
> > @FLIP-156
> > > Expose operator chaining. (Cons of task level resource specifying)
> >
> > Isn't this inherent to all group-level resource specification? It will
> > either break chaining, obey it, or not work with it at all.
> >
> > To sum up the above, my suggestions are:
> >
> > On the API side:
> > * StreamExecutionEnvironment: a global default (ResourceSpec.ZERO if
> > unspecified).
> > * Operator: ResourceSpec.ZERO (unspecified) as default.
> > * Task: sum of requirements from specified operators + global default (if
> > there are any unspecified operators).
> > * SSG: additional resource for the physical slot.
> >
> > On the runtime side:
> > * Task: ResourceSpec.Task or ResourceSpec.ZERO
> > * SSG: ResourceSpec.SSG or ResourceSpec.ZERO
> >
> > A physical slot gets the summed-up resources of its logical slots and the
> > SSG; if that is ResourceSpec.ZERO, it is just a default-sized slot.
> >
> > In short, turn the SSG resource specification into an "add" and drop
> > ResourceSpec.UNKNOWN.
> >
> >
> > Questions/Issues:
> > * Could an SSG express a negative resource requirement?
> > * Is there a concrete barrier that keeps partially configured resources
> > from working? I saw it will fail job submission in Dispatcher.submitJob.
> > * An option (cluster/job level) to force slot sharing in the scheduler?
> > This could be useful in case of migration from FLIP-156 to a future
> > approach.
> > * An option (cluster level) to ignore resource specifications (allowing a
> > resource-specified job to run on an open box environment) for
> > non-production usage?
> >
> >
> >
> > On February 1, 2021 at 11:54:10, Yangze Guo ([hidden email]) wrote:
> >
> > Thanks for the reply, Till and Xintong!
> >
> > I updated the FLIP, including:
> > - Edit the JavaDoc of the proposed
> > StreamGraphGenerator#setSlotSharingGroupResource.
> > - Add "Future Plan" section, which contains the potential follow-up
> > issues and the limitations to be documented when fine-grained resource
> > management is exposed to users.
> >
> > I'll start a vote in another thread.
> > > > Best, > > Yangze Guo > > > > On Fri, Jan 29, 2021 at 10:07 PM Till Rohrmann <[hidden email]> > > wrote: > > > > > > Thanks for summarizing the discussion, Yangze. I agree that setting > > > resource requirements per operator is not very user friendly. > Moreover, I > > > couldn't come up with a different proposal which would be as easy to > use > > > and wouldn't expose internal scheduling details. In fact, following > this > > > argument then we shouldn't have exposed the slot sharing groups in the > > > first place. > > > > > > What is important for the user is that we properly document the > > limitations > > > and constraints the fine grained resource specification has. For > example, > > > we should explain how optimizations like chaining are affected by it > and > > > how different execution modes (batch vs. streaming) affect the > execution > > of > > > operators which have specified resources. These things shouldn't become > > > part of the contract of this feature and are more caused by internal > > > implementation details but it will be important to understand these > > things > > > properly in order to use this feature effectively. > > > > > > Hence, +1 for starting the vote for this FLIP. > > > > > > Cheers, > > > Till > > > > > > On Tue, Jan 26, 2021 at 4:37 AM Xintong Song <[hidden email]> > > wrote: > > > > > > > Thanks for the summary, Yangze. > > > > > > > > The changes and follow-up issues LGTM. Let's wait for responses from > > the > > > > others before starting a vote. > > > > > > > > Thank you~ > > > > > > > > Xintong Song > > > > > > > > > > > > > > > > On Tue, Jan 26, 2021 at 11:08 AM Yangze Guo <[hidden email]> > > wrote: > > > > > > > > > Thanks everyone for the lively discussion. I'd like to try to > > > > > summarize the current convergence in the discussion. Please let me > > > > > know if I got things wrong or missed something crucial here. 
> > > > > > > > > > Change of this FLIP: > > > > > - Treat the SSG resource requirements as a hint instead of a > > > > > restriction for the runtime. That's should be explicitly explained > in > > > > > the JavaDocs. > > > > > > > > > > Potential follow-up issues if needed: > > > > > - Provide operator-level resource configuration interface. > > > > > - Provide multiple options for deciding resources for SSGs whose > > > > > requirement is not specified: > > > > > ** Default slot resource. > > > > > ** Default operator resource times number of operators. > > > > > > > > > > If there are no other issues, I'll update the FLIP accordingly and > > > > > start a vote thread. Thanks all for the valuable feedback again. > > > > > > > > > > Best, > > > > > Yangze Guo > > > > > > > > > > Best, > > > > > Yangze Guo > > > > > > > > > > > > > > > On Fri, Jan 22, 2021 at 11:30 AM Xintong Song < > [hidden email] > > > > > > > > wrote: > > > > > > > > > > > > > > > > > > FGRuntimeInterface.png > > > > > > > > > > > > Thank you~ > > > > > > > > > > > > Xintong Song > > > > > > > > > > > > > > > > > > > > > > > > On Fri, Jan 22, 2021 at 11:11 AM Xintong Song < > > [hidden email]> > > > > > > > wrote: > > > > > >> > > > > > >> I think Chesnay's proposal could actually work. IIUC, the > keypoint > > is > > > > > to derive operator requirements from SSG requirements on the API > > side, so > > > > > that the runtime only deals with operator requirements. It's > > debatable > > > > how > > > > > the deriving should be done though. E.g., an alternative could be > to > > > > evenly > > > > > divide the SSG requirement into requirements of operators in the > > group. > > > > > >> > > > > > >> > > > > > >> However, I'm not entirely sure which option is more desired. > > > > > Illustrating my understanding in the following figure, in which on > > the > > > > top > > > > > is Chesnay's proposal and on the bottom is the SSG-based proposal > in > > this > > > > > FLIP. 
> > > > > >>
> > > > > >> I think the major difference between the two approaches is where deriving operator requirements from SSG requirements happens.
> > > > > >>
> > > > > >> - Chesnay's proposal simplifies the runtime logic and the interface to expose, at the price of moving more complexity (i.e. the deriving) to the API side. The question is, where do we prefer to keep the complexity? I'm slightly leaning towards having a thin API and keeping the complexity in runtime if possible.
> > > > > >>
> > > > > >> - Notice that the dashed-line arrows represent optional steps that are needed only for schedulers that do not respect SSGs, which we don't have at the moment. If we only look at the solid-line arrows, then the SSG-based approach is much simpler, without needing to derive and aggregate the requirements back and forth. I'm not sure about complicating the current design only for the potential future needs.
> > > > > >>
> > > > > >> Thank you~
> > > > > >>
> > > > > >> Xintong Song
> > > > > >>
> > > > > >> On Fri, Jan 22, 2021 at 7:35 AM Chesnay Schepler <[hidden email]> wrote:
> > > > > >>>
> > > > > >>> You're raising a good point, but I think I can rectify that with a minor adjustment.
> > > > > >>>
> > > > > >>> Default requirements are whatever the default requirements are, setting the requirements for one operator has no effect on other operators.
> > > > > >>>
> > > > > >>> With these rules, and some API enhancements, the following mockup would replicate the SSG-based behavior:
> > > > > >>>
> > > > > >>> Map<SlotSharingGroupId, Requirements> requirements = ...
> > > > > >>> for slotSharingGroup in env.getSlotSharingGroups() {
> > > > > >>>     vertices = slotSharingGroup.getVertices()
> > > > > >>>     vertices.first().setRequirements(requirements.get(slotSharingGroup.getID()))
> > > > > >>>     vertices.remaining().setRequirements(ZERO)
> > > > > >>> }
> > > > > >>>
> > > > > >>> We could even allow setting requirements on slotsharing-groups colocation-groups and internally translate them accordingly.
> > > > > >>> I can't help but feel this is a plain API issue.
> > > > > >>>
> > > > > >>> On 1/21/2021 9:44 AM, Till Rohrmann wrote:
> > > > > >>> > If I understand you correctly Chesnay, then you want to decouple the resource requirement specification from the slot sharing group assignment. Hence, by default all operators would be in the same slot sharing group. If there is no operator with a resource specification, then the system would allocate a default slot for it. If there is at least one operator with a specification, then the system would sum up all the specified resources and allocate a slot of this size. This effectively means that all unspecified operators will implicitly have a zero resource requirement. Did I understand your idea correctly?
> > > > > >>> >
> > > > > >>> > I am wondering whether this wouldn't lead to a surprising behaviour for the user. If the user specifies the resource requirements for a single operator, then he will probably assume that the other operators will get the default share of resources and not nothing.
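The slot-sizing rule as Till restates it above can be sketched in a few lines. This is only an illustration of that reading of the proposal, not Flink code: the `Resources` class, `DEFAULT_SLOT`, and `slot_requirement` are hypothetical names, and the default slot size is an arbitrary assumption.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class Resources:
    """Hypothetical resource spec for illustration; not Flink's ResourceSpec."""
    cpu_cores: float
    memory_mb: int

    def __add__(self, other: "Resources") -> "Resources":
        return Resources(self.cpu_cores + other.cpu_cores,
                         self.memory_mb + other.memory_mb)


# Assumed default slot size, chosen only for this example.
DEFAULT_SLOT = Resources(cpu_cores=1.0, memory_mb=1024)


def slot_requirement(operator_specs: Sequence[Optional[Resources]],
                     default_slot: Resources = DEFAULT_SLOT) -> Resources:
    """Size of the shared slot for operators in one slot sharing group.

    None means "no requirement specified for this operator".
    """
    specified = [spec for spec in operator_specs if spec is not None]
    if not specified:
        # No operator carries a specification: fall back to a default slot.
        return default_slot
    # At least one specification: sum only the specified ones, so every
    # unspecified operator implicitly contributes zero.
    total = Resources(0.0, 0)
    for spec in specified:
        total = total + spec
    return total
```

With one specified operator and two unspecified ones, the slot is exactly as large as the single specification, which is the behaviour Till expects users to find surprising.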
> > > > > >>> >
> > > > > >>> > Cheers,
> > > > > >>> > Till
> > > > > >>> >
> > > > > >>> > On Thu, Jan 21, 2021 at 3:25 AM Chesnay Schepler <[hidden email]> wrote:
> > > > > >>> >
> > > > > >>> > Is there even a functional difference between specifying the requirements for an SSG vs specifying the same requirements on a single operator within that group (ideally a colocation group to avoid this whole hint business)?
> > > > > >>> >
> > > > > >>> > Wouldn't we get the best of both worlds in the latter case?
> > > > > >>> >
> > > > > >>> > Users can take shortcuts to define shared requirements, but refine them further as needed on a per-operator basis, without changing semantics of slotsharing groups nor the runtime being locked into SSG-based requirements.
> > > > > >>> >
> > > > > >>> > (And before anyone argues what happens if slotsharing groups change or whatnot, that's a plain API issue that we could surely solve. (A plain iteration over slotsharing groups and therein contained operators would suffice)).
> > > > > >>> >
> > > > > >>> > On 1/20/2021 6:48 PM, Till Rohrmann wrote:
> > > > > >>> > > Maybe a different minor idea: Would it be possible to treat the SSG resource requirements as a hint for the runtime similar to how slot sharing groups are designed at the moment? Meaning that we don't give the guarantee that Flink will always deploy this set of tasks together no matter what comes.
> > > > > >>> > > If, for example, the runtime can derive by some means the resource requirements for each task based on the requirements for the SSG, this could be possible. One easy strategy would be to give every task the same resources as the whole slot sharing group. Another one could be distributing the resources equally among the tasks. This does not even have to be implemented but we would give ourselves the freedom to change scheduling if the need should arise.
> > > > > >>> > >
> > > > > >>> > > Cheers,
> > > > > >>> > > Till
> > > > > >>> > >
> > > > > >>> > > On Wed, Jan 20, 2021 at 7:04 AM Yangze Guo <[hidden email]> wrote:
> > > > > >>> > >
> > > > > >>> > >> Thanks for the responses, Till and Xintong.
> > > > > >>> > >>
> > > > > >>> > >> I second Xintong's comment that the SSG-based runtime interface will give us the flexibility to achieve an op/task-based approach. That's one of the most important reasons for our design choice.
> > > > > >>> > >>
> > > > > >>> > >> Some cents regarding the default operator resource:
> > > > > >>> > >> - It might be good for the scenario of DataStream jobs.
> > > > > >>> > >> ** For light-weight operators, the accumulative configuration error will not be significant. Then, the resources used by a task are proportional to the number of operators it contains.
> > > > > >>> > >> ** For heavy operators like join and window or operators using the external resources, users will turn to the fine-grained resource configuration.
> > > > > >>> > >> - It can increase the stability for the standalone cluster where the registered task executors are heterogeneous (with different default slot resources).
> > > > > >>> > >> - It might not be good for SQL users. The operators that SQL will be translated to are a black box to the user. We also do not guarantee cross-version consistency of the transformation so far.
> > > > > >>> > >>
> > > > > >>> > >> I think it can be treated as follow-up work when the fine-grained resource management is end-to-end ready.
> > > > > >>> > >>
> > > > > >>> > >> Best,
> > > > > >>> > >> Yangze Guo
> > > > > >>> > >>
> > > > > >>> > >> On Wed, Jan 20, 2021 at 11:16 AM Xintong Song <[hidden email]> wrote:
> > > > > >>> > >>> Thanks for the feedback, Till.
> > > > > >>> > >>>
> > > > > >>> > >>> ## I feel that what you proposed (operator-based + default value) might be subsumed by the SSG-based approach.
> > > > > >>> > >>> Thinking of op_1 -> op_2, there are the following 4 cases, categorized by whether the resource requirements are known to the users.
> > > > > >>> > >>>
> > > > > >>> > >>> 1. *Both known.* As previously mentioned, there's no reason to put multiple operators whose individual resource requirements are already known into the same group in fine-grained resource management. And if op_1 and op_2 are in different groups, there should be no problem switching data exchange mode from pipelined to blocking. This is equivalent to specifying operator resource requirements in your proposal.
> > > > > >>> > >>> 2. *op_1 known, op_2 unknown.* Similar to 1), except that op_2 is in an SSG whose resource is not specified and thus would have the default slot resource. This is equivalent to having default operator resources in your proposal.
> > > > > >>> > >>> 3. *Both unknown.* The user can either set op_1 and op_2 to the same SSG or separate SSGs.
> > > > > >>> > >>>    - If op_1 and op_2 are in the same SSG, it will be equivalent to the coarse-grained resource management, where op_1 and op_2 share a default size slot no matter which data exchange mode is used.
> > > > > >>> > >>>    - If op_1 and op_2 are in different SSGs, then each of them will use a default size slot. This is equivalent to setting them with default operator resources in your proposal.
> > > > > >>> > >>> 4. *Total (pipeline) or max (blocking) of op_1 and op_2 is known.*
> > > > > >>> > >>>    - It is possible that the user learns the total / max resource requirement from executing and monitoring the job, while not being aware of individual operator requirements.
> > > > > >>> > >>>    - I believe this is the case your proposal does not cover. And TBH, this is probably how most users learn the resource requirements, according to my experiences.
> > > > > >>> > >>>    - In this case, the user might need to specify different resources if he wants to switch the execution mode, which should not be worse than not being able to use fine-grained resource management.
> > > > > >>> > >>>
> > > > > >>> > >>> ## An additional idea inspired by your proposal.
> > > > > >>> > >>> We may provide multiple options for deciding resources for SSGs whose requirement is not specified, if needed.
> > > > > >>> > >>>
> > > > > >>> > >>> - Default slot resource (current design)
> > > > > >>> > >>> - Default operator resource times number of operators (equivalent to your proposal)
> > > > > >>> > >>>
> > > > > >>> > >>> ## Exposing internal runtime strategies
> > > > > >>> > >>> Theoretically, yes. Tying to the SSGs, the resource requirements might be affected if how SSGs are internally handled changes in future.
> > > > > >>> > >>> Practically, I do not concretely see at the moment what kind of changes we may want in future that might conflict with this FLIP proposal, as with the question of switching data exchange mode answered above. I'd suggest not giving up the user friendliness we may gain now for the future problems that may or may not exist.
> > > > > >>> > >>>
> > > > > >>> > >>> Moreover, the SSG-based approach has the flexibility to achieve behavior equivalent to the operator-based approach, if we set each operator (or task) to a separate SSG. We can even provide a shortcut option to automatically do that for users, if needed.
> > > > > >>> > >>>
> > > > > >>> > >>> Thank you~
> > > > > >>> > >>>
> > > > > >>> > >>> Xintong Song
> > > > > >>> > >>>
> > > > > >>> > >>> On Tue, Jan 19, 2021 at 11:48 PM Till Rohrmann <[hidden email]> wrote:
> > > > > >>> > >>>> Thanks for the responses Xintong and Stephan,
> > > > > >>> > >>>>
> > > > > >>> > >>>> I agree that being able to define the resource requirements for a group of operators is more user friendly. However, my concern is that we are thereby exposing internal runtime strategies which might limit our flexibility to execute a given job.
> > > > > >>> > >>>> Moreover, the semantics of configuring resource requirements for SSGs could break if switching from streaming to batch execution. If one defines the resource requirements for op_1 -> op_2 which run in pipelined mode when using the streaming execution, then how do we interpret these requirements when op_1 -> op_2 are executed with a blocking data exchange in batch execution mode? Consequently, I am still leaning towards Stephan's proposal to set the resource requirements per operator.
> > > > > >>> > >>>>
> > > > > >>> > >>>> Maybe the following proposal makes the configuration easier: If the user wants to use fine-grained resource requirements, then she needs to specify the default size which is used for operators which have no explicit resource annotation. If this holds true, then every operator would have a resource requirement and the system can try to execute the operators in the best possible manner w/o being constrained by how the user set the SSG requirements.
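A rough sketch of what this proposal buys the runtime: once every operator has a requirement (explicit or default), a per-slot figure can be derived for either data exchange mode. The sum-for-pipelined / max-for-blocking rule below is borrowed from the "total (pipeline) or max (blocking)" wording elsewhere in this thread and is an assumption for illustration, not how Flink's scheduler is known to work. All function and parameter names are hypothetical, and only memory is modelled.

```python
from typing import Dict, List, Sequence


def effective_memory_mb(operators: Sequence[str],
                        explicit_mb: Dict[str, int],
                        default_mb: int) -> Dict[str, int]:
    """Give every operator a requirement: its explicit one, else the default."""
    return {op: explicit_mb.get(op, default_mb) for op in operators}


def slot_memory_mb(operators: Sequence[str],
                   explicit_mb: Dict[str, int],
                   default_mb: int,
                   pipelined: bool) -> int:
    """Memory needed to run the given (non-empty) operators in one slot.

    Assumption: pipelined operators run concurrently, so their needs add up;
    with a blocking exchange they run one after another, so the largest wins.
    """
    sizes: List[int] = [explicit_mb.get(op, default_mb) for op in operators]
    return sum(sizes) if pipelined else max(sizes)
```

For op_1 -> op_2 with op_1 annotated at 300 MB and a 100 MB default, this yields 400 MB when pipelined and 300 MB when blocking, without the user restating anything when the execution mode changes.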
> > > > > >>> > >>>>
> > > > > >>> > >>>> Cheers,
> > > > > >>> > >>>> Till
> > > > > >>> > >>>>
> > > > > >>> > >>>> On Tue, Jan 19, 2021 at 9:09 AM Xintong Song <[hidden email]> wrote:
> > > > > >>> > >>>>
> > > > > >>> > >>>>> Thanks for the feedback, Stephan.
> > > > > >>> > >>>>>
> > > > > >>> > >>>>> Actually, your proposal has also come to my mind at some point. And I have some concerns about it.
> > > > > >>> > >>>>>
> > > > > >>> > >>>>> 1. It does not give users the same control as the SSG-based approach.
> > > > > >>> > >>>>>
> > > > > >>> > >>>>> While neither approach requires specifying resources for every operator, the SSG-based approach supports the semantics that "some operators together use this much resource" while the operator-based approach doesn't.
> > > > > >>> > >>>>>
> > > > > >>> > >>>>> Think of a long pipeline with m operators (o_1, o_2, ..., o_m), and at some point there's an agg o_n (1 < n < m) which significantly reduces the data amount. One can separate the pipeline into 2 groups SSG_1 (o_1, ..., o_n) and SSG_2 (o_n+1, ..., o_m), so that configuring much higher parallelisms for operators in SSG_1 than for operators in SSG_2 won't lead to too much waste of resources.
> > > > > >>> > >>>>> If the two SSGs end up needing different resources, with the SSG-based approach one can directly specify resources for the two groups. However, with the operator-based approach, the user will have to specify resources for each operator in one of the two groups, and tune the default slot resource via configurations to fit the other group.
> > > > > >>> > >>>>>
> > > > > >>> > >>>>> 2. It increases the chance of breaking operator chains.
> > > > > >>> > >>>>>
> > > > > >>> > >>>>> Setting chainable operators into different slot sharing groups will prevent them from being chained. In the current implementation, downstream operators, if their SSG is not explicitly specified, will be set to the same group as the chainable upstream operators (unless there are multiple upstream operators in different groups), to reduce the chance of breaking chains.
> > > > > >>> > >>>>>
> > > > > >>> > >>>>> Thinking of chainable operators o_1 -> o_2 -> o_3 -> o_4, if we decide SSGs based on whether resources are specified, we will easily get groups like (o_1, o_3) & (o_2, o_4), where none of the operators can be chained.
> > > > > >>> > >>>>> This is also possible for the SSG-based approach, but I believe the chance is much smaller because there's no strong reason for users to specify the groups with alternate operators like that. We are more likely to get groups like (o_1, o_2) & (o_3, o_4), where the chain breaks only between o_2 and o_3.
> > > > > >>> > >>>>>
> > > > > >>> > >>>>> 3. It complicates the system by having two different mechanisms for sharing managed memory in a slot.
> > > > > >>> > >>>>>
> > > > > >>> > >>>>> - In FLIP-141, we introduced the intra-slot managed memory sharing mechanism, where managed memory is first distributed according to the consumer type, then further distributed across operators of that consumer type.
> > > > > >>> > >>>>>
> > > > > >>> > >>>>> - With the operator-based approach, managed memory size specified for an operator should account for all the consumer types of that operator. That means the managed memory is first distributed across operators, then distributed to different consumer types of each operator.
> > > > > >>> > >>>>>
> > > > > >>> > >>>>> Unfortunately, the different order of the two calculation steps can lead to different results.
> > > > > >>> > >>>>> To be specific, the semantics of the configuration option `consumer-weights` change (within a slot vs. within an operator).
> > > > > >>> > >>>>>
> > > > > >>> > >>>>> To sum things up:
> > > > > >>> > >>>>>
> > > > > >>> > >>>>> While (3) might be a bit more implementation related, I think (1) and (2) somehow suggest that the price for the proposed approach to avoid specifying resource for every operator is that it's not as independent from operator chaining and slot sharing as the operator-based approach discussed in the FLIP.
> > > > > >>> > >>>>>
> > > > > >>> > >>>>> Thank you~
> > > > > >>> > >>>>>
> > > > > >>> > >>>>> Xintong Song
> > > > > >>> > >>>>>
> > > > > >>> > >>>>> On Tue, Jan 19, 2021 at 4:29 AM Stephan Ewen <[hidden email]> wrote:
> > > > > >>> > >>>>>> Thanks a lot, Yangze and Xintong for this FLIP.
> > > > > >>> > >>>>>>
> > > > > >>> > >>>>>> I want to say, first of all, that this is super well written. And the points that the FLIP makes about how to expose the configuration to users are exactly the right thing to figure out first.
> > > > > >>> > >>>>>> So good job here!
> > > > > >>> > >>>>>>
> > > > > >>> > >>>>>> About how to let users specify the resource profiles.
> > > > > >>> > >>>>>> If I can sum the FLIP and previous discussion up in my own words, the problem is the following:
> > > > > >>> > >>>>>>
> > > > > >>> > >>>>>>> Operator-level specification is the simplest and cleanest approach, because it avoids mixing operator configuration (resource) and scheduling. No matter what other parameters change (chaining, slot sharing, switching pipelined and blocking shuffles), the resource profiles stay the same.
> > > > > >>> > >>>>>>> But it would require that a user specifies resources on all operators, which makes it hard to use. That's why the FLIP suggests going with specifying resources on a Sharing-Group.
> > > > > >>> > >>>>>>
> > > > > >>> > >>>>>> I think both thoughts are important, so can we find a solution where the Resource Profiles are specified on an Operator, but we still avoid that we need to specify a resource profile on every operator?
> > > > > >>> > >>>>>>
> > > > > >>> > >>>>>> What do you think about something like the following:
> > > > > >>> > >>>>>> - Resource Profiles are specified on an operator level.
> > > > > >>> > >>>>>> - Not all operators need profiles.
> > > > > >>> > >>>>>> - All Operators without a Resource Profile end up in the default slot sharing group with a default profile (will get a default slot).
> > > > > >>> > >>>>>> - All Operators with a Resource Profile will go into another slot sharing group (the resource-specified-group).
> > > > > >>> > >>>>>> - Users can define different slot sharing groups for operators like they do now, with the exception that you cannot mix operators that have a resource profile and operators that have no resource profile.
> > > > > >>> > >>>>>> - The default case where no operator has a resource profile is just a special case of this model.
> > > > > >>> > >>>>>> - The chaining logic sums up the profiles per operator, like it does now, and the scheduler sums up the profiles of the tasks that it schedules together.
> > > > > >>> > >>>>>>
> > > > > >>> > >>>>>> There is another question about reactive scaling raised in the FLIP. I need to think a bit about that. That is indeed a bit more tricky once we have slots of different sizes.
> > > > > >>> > >>>>>> It is not clear then which of the different slot requests the ResourceManager should fulfill when new resources (TMs) show up, or how the JobManager redistributes the slot resources when resources (TMs) disappear.
> > > > > >>> > >>>>>> This question is pretty orthogonal, though, to the "how to specify the resources".
> > > > > >>> > >>>>>>
> > > > > >>> > >>>>>> Best,
> > > > > >>> > >>>>>> Stephan
> > > > > >>> > >>>>>>
> > > > > >>> > >>>>>> On Fri, Jan 8, 2021 at 5:14 AM Xintong Song <[hidden email]> wrote:
> > > > > >>> > >>>>>>> Thanks for drafting the FLIP and driving the discussion, Yangze.
> > > > > >>> > >>>>>>> And thanks for the feedback, Till and Chesnay.
> > > > > >>> > >>>>>>>
> > > > > >>> > >>>>>>> @Till,
> > > > > >>> > >>>>>>>
> > > > > >>> > >>>>>>> I agree that specifying requirements for SSGs means that SSGs need to be supported in fine-grained resource management, otherwise each operator might use as many resources as the whole group. However, I cannot think of a strong reason for not supporting SSGs in fine-grained resource management.
> > > > > >>> > >>>>>>>
> > > > > >>> > >>>>>>>> Interestingly, if all operators have their resources properly specified, then slot sharing is no longer needed because Flink could slice off the appropriately sized slots for every Task individually.
> > > > > >>> > >>>>>>>>
> > > > > >>> > >>>>>>>> So for example, if we have a job consisting of two operators op_1 and op_2 where each op needs 100 MB of memory, we would then say that the slot sharing group needs 200 MB of memory to run. If we have a cluster with 2 TMs with one slot of 100 MB each, then the system cannot run this job. If the resources were specified on an operator level, then the system could still make the decision to deploy op_1 to TM_1 and op_2 to TM_2.
> > > > > >>> > >>>>>>>
> > > > > >>> > >>>>>>> Couldn't agree more that if all operators' requirements are properly specified, slot sharing should be no longer needed. I think this exactly disproves the example. If we already know op_1 and op_2 each need 100 MB of memory, why would we put them in the same group?
> > > > > >>> > >>>>>>> If they are in separate groups, with the proposed approach the system can freely deploy them to either a 200 MB TM or two 100 MB TMs.
> > > > > >>> > >>>>>>>
> > > > > >>> > >>>>>>> Moreover, the precondition for not needing slot sharing is having resource requirements properly specified for all operators. This is not always possible, and usually requires tremendous effort. One of the benefits of SSG-based requirements is that it allows the user to freely decide the granularity, and thus the effort they want to invest. I would consider SSG in fine-grained resource management as a group of operators that the user would like to specify the total resource for. There can be only one group in the job, 2~3 groups dividing the job into a few major parts, or as many groups as the number of tasks/operators, depending on how fine-grained the user is able to specify the resources.
> > > > > >>> > >>>>>>>
> > > > > >>> > >>>>>>> Having to support SSGs might be a constraint.
> > > > > >>> > >>>>>>> But given that all the current scheduler implementations already support SSGs, I tend to think of that as an acceptable price for the above discussed usability and flexibility.
> > > > > >>> > >>>>>>>
> > > > > >>> > >>>>>>> @Chesnay
> > > > > >>> > >>>>>>>
> > > > > >>> > >>>>>>>> Will declaring them on slot sharing groups not also waste resources if the parallelism of operators within that group is different?
> > > > > >>> > >>>>>>>
> > > > > >>> > >>>>>>> Yes. It's a trade-off between usability and resource utilization. To avoid such waste, the user can define more groups, so that each group contains fewer operators and the chance of having operators with different parallelism will be reduced. The price is to have more resource requirements to specify.
> > > > > >>> > >>>>>>>
> > > > > >>> > >>>>>>>> It also seems like quite a hassle for users having to recalculate the resource requirements if they change the slot sharing.
> > > > > >>> > >>>>>>>> I'd think that it's not really workable for users that create a set of re-usable operators which are mixed and matched in their applications; managing the resource requirements in such a setting would be a nightmare, and in the end would require operator-level requirements anyway.
> > > > > >>> > >>>>>>>> In that sense, I'm not even sure whether it really increases usability.
> > > > > >>> > >>>>>>>
> > > > > >>> > >>>>>>> - As mentioned in my reply to Till's comment, there's no reason to put multiple operators whose individual resource requirements are already known into the same group in fine-grained resource management.
> > > > > >>> > >>>>>>> - Even if an operator implementation is reused for multiple applications, it does not guarantee the same resource requirements. During our years of practice at Alibaba, with per-operator requirements specified for Blink's fine-grained resource management, very few users (including our specialists who are dedicated to supporting Blink users) are experienced enough to accurately predict/estimate the operator resource requirements.
> > > > > >>> > >>>>>>> Most people rely on the execution-time metrics (throughput, delay, cpu load, memory usage, GC pressure, etc.) to improve the specification.
> > > > > >>> > >>>>>>>
> > > > > >>> > >>>>>>> To sum up:
> > > > > >>> > >>>>>>> If the user is capable of providing proper resource requirements for every operator, that's definitely a good thing and we would not need to rely on the SSGs. However, that shouldn't be a *must* for the fine-grained resource management to work. For those users who are capable and do not like having to set each operator to a separate SSG, I would be ok to have both SSG-based and operator-based runtime interfaces and to only fallback to the SSG requirements when the operator requirements are not specified. However, as the first step, I think we should prioritise the use cases where users are not that experienced.
> > > > > >>> > >>>>>>>
> > > > > >>> > >>>>>>> Thank you~
> > > > > >>> > >>>>>>>
> > > > > >>> > >>>>>>> Xintong Song
> > > > > >>> > >>>>>>>
> > > > > >>> > >>>>>>> On Thu, Jan 7, 2021 at 9:55 PM Chesnay Schepler <[hidden email]> wrote:
> > > > > >>> > >>>>>>>
> > > > > >>> > >>>>>>>> Will declaring them on slot sharing groups not also waste resources if the parallelism of operators within that group are different?
> > > > > >>> > >>>>>>>>
> > > > > >>> > >>>>>>>> It also seems like quite a hassle for users having to recalculate the resource requirements if they change the slot sharing.
> > > > > >>> > >>>>>>>> I'd think that it's not really workable for users that create a set of re-usable operators which are mixed and matched in their applications; managing the resources requirements in such a setting would be a nightmare, and in the end would require operator-level requirements any way.
> > > > > >>> > >>>>>>>> In that sense, I'm not even sure whether it really increases usability.
> > > > > >>> > >>>>>>>>
> > > > > >>> > >>>>>>>> My main worry is that it if we wire the runtime to work on SSGs it's gonna be difficult to implement more fine-grained approaches, which would not be the case if, for the runtime, they are always defined on an operator-level.
> > > > > >>> > >>>>>>>>
> > > > > >>> > >>>>>>>> On 1/7/2021 2:42 PM, Till Rohrmann wrote:
> > > > > >>> > >>>>>>>>> Thanks for drafting this FLIP and starting this discussion Yangze.
> > > > > >>> > >>>>>>>>>
> > > > > >>> > >>>>>>>>> I like that defining resource requirements on a slot sharing group makes the overall setup easier and improves usability of resource requirements.
> > > > > >>> > >>>>>>>>>
> > > > > >>> > >>>>>>>>> What I do not like about it is that it changes slot sharing groups from being a scheduling hint to something which needs to be supported in order to support fine grained resource requirements. So far, the idea of slot sharing groups was that it tells the system that a set of operators can be deployed in the same slot. But the system still had the freedom to say that it would rather place these tasks in different slots if it wanted.
> > > > > >>> > >>>>>>>>> If we now specify resource requirements on a per slot sharing group, then the only option for a scheduler which does not support slot sharing groups is to say that every operator in this slot sharing group needs a slot with the same resources as the whole group.
> > > > > >>> > >>>>>>>>>
> > > > > >>> > >>>>>>>>> So for example, if we have a job consisting of two operator op_1 and op_2 where each op needs 100 MB of memory, we would then say that the slot sharing group needs 200 MB of memory to run. If we have a cluster with 2 TMs with one slot of 100 MB each, then the system cannot run this job. If the resources were specified on an operator level, then the system could still make the decision to deploy op_1 to TM_1 and op_2 to TM_2.
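Till's 200 MB example above can be checked with a few lines of Python. This is only an illustrative sketch, not Flink code: `fits_as_group` and `fits_per_operator` are hypothetical helpers, memory is the only resource considered, and the one-operator-per-slot placement is an assumption made to mirror the example.

```python
from typing import Dict, List, Optional


def fits_as_group(op_mem_mb: Dict[str, int], slot_mem_mb: List[int]) -> bool:
    """SSG-level requirement: a single slot must hold the sum of all operators."""
    group_mem = sum(op_mem_mb.values())
    return any(slot >= group_mem for slot in slot_mem_mb)


def fits_per_operator(
    op_mem_mb: Dict[str, int], slot_mem_mb: List[int]
) -> Optional[Dict[str, int]]:
    """Operator-level requirements: give each operator its own slot.

    Operators are placed largest first into the first free slot that fits.
    Returns a mapping operator -> slot index, or None if placement fails.
    """
    free = list(enumerate(slot_mem_mb))
    placement: Dict[str, int] = {}
    for op, mem in sorted(op_mem_mb.items(), key=lambda kv: -kv[1]):
        for i, (slot_idx, capacity) in enumerate(free):
            if capacity >= mem:
                placement[op] = slot_idx
                del free[i]
                break
        else:
            return None  # no free slot is large enough for this operator
    return placement


# Two operators of 100 MB each, two TMs with one 100 MB slot each.
ops = {"op_1": 100, "op_2": 100}
slots = [100, 100]
print(fits_as_group(ops, slots))      # False: no slot can hold 200 MB
print(fits_per_operator(ops, slots))  # {'op_1': 0, 'op_2': 1}
```

With the group-level requirement the job is rejected, while the operator-level requirements still allow one operator per TM, which is the trade-off Till describes.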
> > > > > >>> > >>>>>>>>>
> > > > > >>> > >>>>>>>>> Originally, one of the primary goals of slot sharing groups was to make it easier for the user to reason about how many slots a job needs independent of the actual number of operators in the job. Interestingly, if all operators have their resources properly specified, then slot sharing is no longer needed because Flink could slice off the appropriately sized slots for every Task individually. What matters is whether the whole cluster has enough resources to run all tasks or not.
> > > > > >>> > >>>>>>>>>
> > > > > >>> > >>>>>>>>> Cheers,
> > > > > >>> > >>>>>>>>> Till
> > > > > >>> > >>>>>>>>>
> > > > > >>> > >>>>>>>>> On Thu, Jan 7, 2021 at 4:08 AM Yangze Guo <[hidden email]> wrote:
> > > > > >>> > >>>>>>>>>> Hi, there,
> > > > > >>> > >>>>>>>>>>
> > > > > >>> > >>>>>>>>>> We would like to start a discussion thread on "FLIP-156: Runtime Interfaces for Fine-Grained Resource Requirements"[1], where we propose Slot Sharing Group (SSG) based runtime interfaces for specifying fine-grained resource requirements.
> > > > > >>> > >>>>>>>>>>
> > > > > >>> > >>>>>>>>>> In this FLIP:
> > > > > >>> > >>>>>>>>>> - Expound the user story of fine-grained resource management.
> > > > > >>> > >>>>>>>>>> - Propose runtime interfaces for specifying SSG-based resource requirements.
> > > > > >>> > >>>>>>>>>> - Discuss the pros and cons of the three potential granularities for specifying the resource requirements (op, task and slot sharing group) and explain why we choose the slot sharing group.
> > > > > >>> > >>>>>>>>>>
> > > > > >>> > >>>>>>>>>> Please find more details in the FLIP wiki document [1]. Looking forward to your feedback.
> > > > > >>> > >>>>>>>>>>
> > > > > >>> > >>>>>>>>>> [1] https://cwiki.apache.org/confluence/display/FLINK/FLIP-156%3A+Runtime+Interfaces+for+Fine-Grained+Resource+Requirements
> > > > > >>> > >>>>>>>>>>
> > > > > >>> > >>>>>>>>>> Best,
> > > > > >>> > >>>>>>>>>> Yangze Guo
In reply to this post by Xintong Song
Hi Yangze and Xintong, thank you for your replies.
I did indeed make some assumptions. I list them here in order:

1. There is only task/LogicalSlot level resource specification in the runtime. It comes from the API side and is respected in the runtime.
2. The current operator-level resource specification on the client side is respected and used to aggregate the task resource specification for runtime usage.
3. It is possible that other fine-grained, group-level resource specifications, which could obey chaining, will emerge in the future.

My proposal is based on the first, and tries to make room for the last two in the SSG resource specification.

@Xintong
> I think this is exactly what we are trying to avoid, requiring the scheduler to enforce slot sharing.

I saw the discussion about keeping slot sharing as a hint, but in reality, are SSG jobs expected to fail or run slowly if the scheduler does not respect it? A slot with 20GB memory is different from two 1GB default sized slots. So we actually depend on the scheduler version/implementation/de facto behaviour if we claim it is a hint.

@Xintong
> > So, I wonder whether we could interpret SSG resource specifying as an "add"
> > but not an "set" on resource requirement ?
> IIUC, this is the core idea behind your proposal.

You are right, all other changes serve this. It is also the semantic divergence between the two: my suggestion treats the SSG as a hint and an extra place to specify resources, while FLIP-156 tends to treat the SSG as a restriction and the authoritative resource specification. With this change, I think FLIP-156 is just a special case that forces SSG-only specification and no other specifications. That is, if there are no other resource specifications, "set" equals "add" to zero. So if this is the case after FLIP-156, then there is still room for this direction, if it is indeed required.

@Yangze, @Xintong
> never really used

Do you mean the code path or the production environment? If it is the code path, could you please point out where the story breaks?
From the discussion and history, could I consider FLIP-156 a redirection rather than an inheritance/enhancement of the current half-baked/ancient implementation?

Thank you, Yangze and Xintong.

On February 3, 2021 at 14:31:28, Xintong Song ([hidden email]) wrote:

Thanks for your feedback, Kezhu.

> I think Flink *runtime* already has an ideal granularity for resource management: 'task'. If there is a slot shared by multiple tasks, that slot's resource requirement is simply the sum of all its logical slots. So basically, there is no resource requirement for SlotSharingGroup in the runtime until now, right?

That is a half-baked implementation, coming from the previous attempts (years ago) to deliver the fine-grained resource management feature, and never really put into use.

> From the FLIP and discussion, I assume that SSG resource specifying will override operator level resource specifying if both are specified?

Actually, I think we should use the finer-grained resources (i.e. operator level) if both are specified. And more importantly, that is based on the assumption that we do need two different levels of interfaces.

> So, I wonder whether we could interpret SSG resource specifying as an "add" but not a "set" on the resource requirement?

IIUC, this is the core idea behind your proposal. I think it provides an interesting idea of how we combine operator level and SSG level resources, *if we allow configuring resources at both levels*. However, I'm not sure whether configuring resources on the operator level is indeed needed. Therefore, as a first step, this FLIP proposes to only introduce the SSG-level interfaces. As listed in the future plan, we would consider allowing operator level resource configuration later if we do see a need for it. At that time, we definitely should discuss what to do if resources are configured at both levels.

> * Could SSG express negative resource requirement?

No.
> * Is there a concrete barrier that keeps partially configured resources from working? I saw it will fail job submission in Dispatcher.submitJob.

With the SSG-based approach, this should no longer be needed. The constraint was introduced because we can neither properly define what is the resource of a task chained from an operator with specified resource and another with unspecified resource, nor for a slot shared by a task with specified resource and another with unspecified resource. With the SSG-based approach, we no longer have those problems.

> * An option (cluster/job level) to force slot sharing in the scheduler? This could be useful in case of migration from FLIP-156 to a future approach.

I think this is exactly what we are trying to avoid, requiring the scheduler to enforce slot sharing.

> * An option (cluster) to ignore resource specifying (allow a resource-specified job to run on an open-box environment) for non-production usage?

That's possible. Actually, we are planning to introduce an option for activating the fine-grained resource management, for development purposes. We might consider keeping that option after the feature is completed, to allow disabling the feature without having to touch the job code.

Thank you~

Xintong Song

On Wed, Feb 3, 2021 at 1:28 PM Kezhu Wang <[hidden email]> wrote:

> Hi all, sorry for joining the discussion even after voting started.
>
> I want to share my thoughts on this after reading the above discussions.
>
> I think Flink *runtime* already has an ideal granularity for resource management: 'task'. If there is a slot shared by multiple tasks, that slot's resource requirement is simply the sum of all its logical slots. So basically, there is no resource requirement for SlotSharingGroup in the runtime until now, right?
>
> As in the discussion, we already agree that: "If all operators have their resources properly specified, then slot sharing is no longer needed."
>
> So it seems to me that, naturally, what we would discuss is: how to bridge the impractical operator-level resource specification to the runtime task-level resource requirement? This is actually a pure API thing, as Chesnay has pointed out.
>
> But FLIP-156 brings another direction to the table: how about using SSG for both API and runtime resource specifying?
>
> From the FLIP and discussion, I assume that SSG resource specifying will override operator level resource specifying if both are specified?
>
> So, I wonder whether we could interpret SSG resource specifying as an "add" but not a "set" on the resource requirement?
>
> The semantics is that SSG resource specifying adds additional resources to the shared slot, to express concerns about possible high throughput and resource requirements for tasks in one physical slot.
>
> The result is that if the scheduler indeed respects slot sharing, the allocated slot will gain the extra resources specified for that SSG.
>
> I think one coding barrier for the "add" approach is ResourceSpec.UNKNOWN, which didn't support the 'merge' operation. I tend to use ResourceSpec.ZERO as the default; the task executor should be aware of this.
>
> @Chesnay
> > My main worry is that it if we wire the runtime to work on SSGs it's
> > gonna be difficult to implement more fine-grained approaches, which
> > would not be the case if, for the runtime, they are always defined on an
> > operator-level.
>
> An "add" operation should be less invasive and set a low barrier for future fine-grained approaches.
>
> @Stephan
> > - Users can define different slot sharing groups for operators like they
> > do now, with the exception that you cannot mix operators that have a
> > resource profile and operators that have no resource profile.
>
> @Till
> > This effectively means that all unspecified operators
> > will implicitly have a zero resource requirement.
> > I am wondering whether this wouldn't lead to a surprising behaviour for the
> > user.
> > If the user specifies the resource requirements for a single
> > operator, then he probably will assume that the other operators will get
> > the default share of resources and not nothing.
>
> I think it is inherent, due to the fact that we cannot define ResourceSpec.ONE, e.g. the resource requirement for exactly one default slot, with concrete numbers? I tend to squash out unspecified ones if there are operators in the chain with explicit resource specifications. Otherwise, the protocol tends to be verbose, as in saying "give me this much resource and a default". I think if we have explicit resource specifications for only some operators, it is just saying "I don't care about the other operators that much, just get them places to run". Most likely these are cases where there are stateless filter/map or other less resource-consuming operators. If there is indeed a problem, I think clients can specify a global default (or a default at another level in the future). In the job graph generation phase, we could take that default into account for unspecified operators.
>
> @FLIP-156
> > Expose operator chaining. (Cons of task level resource specifying)
>
> Is it inherent for all group level resource specifying? They will either break chaining or obey it, or even could not work with it.
>
> To sum up the above, my suggestions are:
>
> On the API side:
> * StreamExecutionEnvironment: A global default (ResourceSpec.ZERO if unspecified).
> * Operator: ResourceSpec.ZERO (unspecified) as default.
> * Task: sum of requirements from specified operators + global default (if there are any unspecified operators)
> * SSG: additional resource to physical slot.
>
> On the runtime side:
> * Task: ResourceSpec.Task or ResourceSpec.ZERO
> * SSG: ResourceSpec.SSG or ResourceSpec.ZERO
>
> The physical slot gets the summed-up resources from logical slots and SSG; if it is ResourceSpec.ZERO, it is just a default sized slot.
>
> In short, turn SSG resource specifying into "add" and drop ResourceSpec.UNKNOWN.
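The "add" semantics summarized above can be sketched in a few lines of Python. This is a hedged illustration of Kezhu's suggestion, not Flink's implementation: the `ResourceSpec` class here is a simplified stand-in with only CPU and memory, and `task_spec`, `physical_slot_spec`, and the `default_slot` parameter are hypothetical names introduced for the example.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class ResourceSpec:
    """Simplified stand-in for a resource spec (assumption: CPU and memory only)."""
    cpu_cores: float = 0.0
    memory_mb: int = 0

    def merge(self, other: "ResourceSpec") -> "ResourceSpec":
        return ResourceSpec(self.cpu_cores + other.cpu_cores,
                            self.memory_mb + other.memory_mb)


ZERO = ResourceSpec()  # plays the role of "unspecified" in this sketch


def task_spec(operator_specs: Iterable[ResourceSpec],
              global_default: ResourceSpec = ZERO) -> ResourceSpec:
    """Task = sum of specified operators + global default if any operator is unspecified."""
    total = ZERO
    any_unspecified = False
    for spec in operator_specs:
        if spec == ZERO:
            any_unspecified = True
        else:
            total = total.merge(spec)
    return total.merge(global_default) if any_unspecified else total


def physical_slot_spec(task_specs: Iterable[ResourceSpec],
                       ssg_extra: ResourceSpec,
                       default_slot: ResourceSpec) -> ResourceSpec:
    """Physical slot = sum of its tasks plus the SSG's extra; ZERO means a default slot."""
    total = ssg_extra
    for spec in task_specs:
        total = total.merge(spec)
    return default_slot if total == ZERO else total


task_a = task_spec([ResourceSpec(1.0, 512), ZERO], global_default=ResourceSpec(0.5, 128))
task_b = task_spec([ZERO])  # nothing specified, no global default
slot = physical_slot_spec([task_a, task_b],
                          ssg_extra=ResourceSpec(0.0, 256),
                          default_slot=ResourceSpec(1.0, 1024))
print(task_a)  # ResourceSpec(cpu_cores=1.5, memory_mb=640)
print(slot)    # ResourceSpec(cpu_cores=1.5, memory_mb=896)
```

Under these assumptions, an SSG with no extra resources and no specified tasks falls back to the default slot, which matches the "set equals add to zero" reading of FLIP-156 discussed in the thread.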
>
> Questions/Issues:
> * Could SSG express negative resource requirement?
> * Is there a concrete barrier that keeps partially configured resources from working? I saw it will fail job submission in Dispatcher.submitJob.
> * An option (cluster/job level) to force slot sharing in the scheduler? This could be useful in case of migration from FLIP-156 to a future approach.
> * An option (cluster) to ignore resource specifying (allow a resource-specified job to run on an open-box environment) for non-production usage?
>
> On February 1, 2021 at 11:54:10, Yangze Guo ([hidden email]) wrote:
>
> Thanks for the replies, Till and Xintong!
>
> I updated the FLIP, including:
> - Edit the JavaDoc of the proposed StreamGraphGenerator#setSlotSharingGroupResource.
> - Add a "Future Plan" section, which contains the potential follow-up issues and the limitations to be documented when fine-grained resource management is exposed to users.
>
> I'll start a vote in another thread.
>
> Best,
> Yangze Guo
>
> On Fri, Jan 29, 2021 at 10:07 PM Till Rohrmann <[hidden email]> wrote:
> >
> > Thanks for summarizing the discussion, Yangze. I agree that setting resource requirements per operator is not very user friendly. Moreover, I couldn't come up with a different proposal which would be as easy to use and wouldn't expose internal scheduling details. In fact, following this argument then we shouldn't have exposed the slot sharing groups in the first place.
> >
> > What is important for the user is that we properly document the limitations and constraints the fine grained resource specification has. For example, we should explain how optimizations like chaining are affected by it and how different execution modes (batch vs. streaming) affect the execution of operators which have specified resources.
> > These things shouldn't become part of the contract of this feature and are more caused by internal implementation details but it will be important to understand these things properly in order to use this feature effectively.
> >
> > Hence, +1 for starting the vote for this FLIP.
> >
> > Cheers,
> > Till
> >
> > On Tue, Jan 26, 2021 at 4:37 AM Xintong Song <[hidden email]> wrote:
> >
> > > Thanks for the summary, Yangze.
> > >
> > > The changes and follow-up issues LGTM. Let's wait for responses from the others before starting a vote.
> > >
> > > Thank you~
> > >
> > > Xintong Song
> > >
> > > On Tue, Jan 26, 2021 at 11:08 AM Yangze Guo <[hidden email]> wrote:
> > >
> > > > Thanks everyone for the lively discussion. I'd like to try to summarize the current convergence in the discussion. Please let me know if I got things wrong or missed something crucial here.
> > > >
> > > > Change of this FLIP:
> > > > - Treat the SSG resource requirements as a hint instead of a restriction for the runtime. That should be explicitly explained in the JavaDocs.
> > > >
> > > > Potential follow-up issues if needed:
> > > > - Provide operator-level resource configuration interface.
> > > > - Provide multiple options for deciding resources for SSGs whose requirement is not specified:
> > > > ** Default slot resource.
> > > > ** Default operator resource times number of operators.
> > > >
> > > > If there are no other issues, I'll update the FLIP accordingly and start a vote thread. Thanks all for the valuable feedback again.
> > > >
> > > > Best,
> > > > Yangze Guo
> > > >
> > > > On Fri, Jan 22, 2021 at 11:30 AM Xintong Song <[hidden email]> wrote:
> > > > >
> > > > > FGRuntimeInterface.png
> > > > >
> > > > > Thank you~
> > > > >
> > > > > Xintong Song
> > > > >
> > > > > On Fri, Jan 22, 2021 at 11:11 AM Xintong Song <[hidden email]> wrote:
> > > > >>
> > > > >> I think Chesnay's proposal could actually work. IIUC, the keypoint is to derive operator requirements from SSG requirements on the API side, so that the runtime only deals with operator requirements. It's debatable how the deriving should be done though. E.g., an alternative could be to evenly divide the SSG requirement into requirements of operators in the group.
> > > > >>
> > > > >> However, I'm not entirely sure which option is more desired. Illustrating my understanding in the following figure, in which on the top is Chesnay's proposal and on the bottom is the SSG-based proposal in this FLIP.
> > > > >>
> > > > >> I think the major difference between the two approaches is where deriving operator requirements from SSG requirements happens.
> > > > >>
> > > > >> - Chesnay's proposal simplifies the runtime logic and the interface to expose, at the price of moving more complexity (i.e. the deriving) to the API side. The question is, where do we prefer to keep the complexity? I'm slightly leaning towards having a thin API and keep the complexity in runtime if possible.
> > > > >>
> > > > >> - Notice that the dash line arrows represent optional steps that are needed only for schedulers that do not respect SSGs, which we don't have at the moment.
> > > > >> If we only look at the solid line arrows, then the SSG-based approach is much simpler, without needing to derive and aggregate the requirements back and forth. I'm not sure about complicating the current design only for the potential future needs.
> > > > >>
> > > > >> Thank you~
> > > > >>
> > > > >> Xintong Song
> > > > >>
> > > > >> On Fri, Jan 22, 2021 at 7:35 AM Chesnay Schepler <[hidden email]> wrote:
> > > > >>>
> > > > >>> You're raising a good point, but I think I can rectify that with a minor adjustment.
> > > > >>>
> > > > >>> Default requirements are whatever the default requirements are, setting the requirements for one operator has no effect on other operators.
> > > > >>>
> > > > >>> With these rules, and some API enhancements, the following mockup would replicate the SSG-based behavior:
> > > > >>>
> > > > >>> Map<SlotSharingGroupId, Requirements> requirements = ...
> > > > >>> for slotSharingGroup in env.getSlotSharingGroups() {
> > > > >>>     vertices = slotSharingGroup.getVertices()
> > > > >>>     vertices.first().setRequirements(requirements.get(slotSharingGroup.getID()))
> > > > >>>     vertices.remaining().setRequirements(ZERO)
> > > > >>> }
> > > > >>>
> > > > >>> We could even allow setting requirements on slotsharing-groups colocation-groups and internally translate them accordingly.
> > > > >>> I can't help but feel this is a plain API issue.
> > > > >>>
> > > > >>> On 1/21/2021 9:44 AM, Till Rohrmann wrote:
> > > > >>> > If I understand you correctly Chesnay, then you want to decouple the resource requirement specification from the slot sharing assignment. Hence, per default all operators would be in the same slot sharing group. If there is no operator with a resource specification, then the system would allocate a default slot for it.
> > > > >>> > If there is at least one operator, then the system would sum up all the specified resources and allocate a slot of this size. This effectively means that all unspecified operators will implicitly have a zero resource requirement. Did I understand your idea correctly?
> > > > >>> >
> > > > >>> > I am wondering whether this wouldn't lead to a surprising behaviour for the user. If the user specifies the resource requirements for a single operator, then he probably will assume that the other operators will get the default share of resources and not nothing.
> > > > >>> >
> > > > >>> > Cheers,
> > > > >>> > Till
> > > > >>> >
> > > > >>> > On Thu, Jan 21, 2021 at 3:25 AM Chesnay Schepler <[hidden email]> wrote:
> > > > >>> >
> > > > >>> > Is there even a functional difference between specifying the requirements for an SSG vs specifying the same requirements on a single operator within that group (ideally a colocation group to avoid this whole hint business)?
> > > > >>> >
> > > > >>> > Wouldn't we get the best of both worlds in the latter case?
> > > > >>> >
> > > > >>> > Users can take shortcuts to define shared requirements,
> > > > >>> > but refine them further as needed on a per-operator basis,
> > > > >>> > without changing semantics of slotsharing groups
> > > > >>> > nor the runtime being locked into SSG-based requirements.
> > > > >>> >
> > > > >>> > (And before anyone argues what happens if slotsharing groups change or whatnot, that's a plain API issue that we could surely solve. (A plain iteration over slotsharing groups and therein contained operators would suffice)).
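The two ways of deriving operator requirements from an SSG requirement that come up in this thread (putting the whole requirement on one operator, as in Chesnay's mockup, or dividing it evenly, as Xintong mentions) can be compared with a small sketch. It is illustrative only: `first_gets_all` and `even_split` are hypothetical helper names, and memory in MB stands in for a full resource requirement.

```python
from typing import Dict, List


def first_gets_all(group_mem_mb: int, vertices: List[str]) -> Dict[str, int]:
    """Assign the whole SSG requirement to the first vertex, zero to the rest."""
    return {v: (group_mem_mb if i == 0 else 0) for i, v in enumerate(vertices)}


def even_split(group_mem_mb: int, vertices: List[str]) -> Dict[str, int]:
    """Divide the SSG requirement evenly; leftover MBs go to the first vertices."""
    if not vertices:
        return {}
    share, rest = divmod(group_mem_mb, len(vertices))
    return {v: share + (1 if i < rest else 0) for i, v in enumerate(vertices)}


group = ["source", "map", "sink"]
print(first_gets_all(300, group))  # {'source': 300, 'map': 0, 'sink': 0}
print(even_split(300, group))      # {'source': 100, 'map': 100, 'sink': 100}
print(even_split(301, group))      # {'source': 101, 'map': 100, 'sink': 100}
```

Both strategies preserve the group total, so a scheduler that does respect slot sharing ends up with the same slot size either way; they differ only when tasks of the group are placed in separate slots.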
> > > > >>> >
> > > > >>> > On 1/20/2021 6:48 PM, Till Rohrmann wrote:
> > > > >>> > > Maybe a different minor idea: Would it be possible to treat the SSG resource requirements as a hint for the runtime similar to how slot sharing groups are designed at the moment? Meaning that we don't give the guarantee that Flink will always deploy this set of tasks together no matter what comes. If, for example, the runtime can derive by some means the resource requirements for each task based on the requirements for the SSG, this could be possible. One easy strategy would be to give every task the same resources as the whole slot sharing group. Another one could be distributing the resources equally among the tasks. This does not even have to be implemented but we would give ourselves the freedom to change scheduling if need should arise.
> > > > >>> > >
> > > > >>> > > Cheers,
> > > > >>> > > Till
> > > > >>> > >
> > > > >>> > > On Wed, Jan 20, 2021 at 7:04 AM Yangze Guo <[hidden email]> wrote:
> > > > >>> > >
> > > > >>> > >> Thanks for the responses, Till and Xintong.
> > > > >>> > >>
> > > > >>> > >> I second Xintong's comment that SSG-based runtime interfaces will give us the flexibility to achieve op/task-based approach. That's one of the most important reasons for our design choice.
> > > > >>> > >>
> > > > >>> > >> Some cents regarding the default operator resource:
> > > > >>> > >> - It might be good for the scenario of DataStream jobs.
> > > > >>> > >> ** For light-weight operators, the accumulative configuration error will not be significant.
> > > > >>> > >> Then, the resource of a task used is proportional to the number of operators it contains.
> > > > >>> > >> ** For heavy operators like join and window or operators using the external resources, user will turn to the fine-grained resource configuration.
> > > > >>> > >> - It can increase the stability for the standalone cluster where task executors registered are heterogeneous (with different default slot resources).
> > > > >>> > >> - It might not be good for SQL users. The operators that SQL will be transferred to are a black box to the user. We also do not guarantee the cross-version consistency of the transformation so far.
> > > > >>> > >>
> > > > >>> > >> I think it can be treated as a follow-up work when the fine-grained resource management is end-to-end ready.
> > > > >>> > >>
> > > > >>> > >> Best,
> > > > >>> > >> Yangze Guo
> > > > >>> > >>
> > > > >>> > >> On Wed, Jan 20, 2021 at 11:16 AM Xintong Song <[hidden email]> wrote:
> > > > >>> > >>> Thanks for the feedback, Till.
> > > > >>> > >>>
> > > > >>> > >>> ## I feel that what you proposed (operator-based + default value) might be subsumed by the SSG-based approach.
> > > > >>> > >>> Thinking of op_1 -> op_2, there are the following 4 cases, categorized by whether the resource requirements are known to the users.
> > > > >>> > >>>
> > > > >>> > >>> 1. *Both known.* As previously mentioned, there's no reason to put multiple operators whose individual resource requirements are already known into the same group in fine-grained resource management.
> > > > >>> > >>> And if op_1 and op_2 are in different groups, there should be no problem switching data exchange mode from pipelined to blocking. This is equivalent to specifying operator resource requirements in your proposal.
> > > > >>> > >>> 2. *op_1 known, op_2 unknown.* Similar to 1), except that op_2 is in an SSG whose resource is not specified and thus would have the default slot resource. This is equivalent to having default operator resources in your proposal.
> > > > >>> > >>> 3. *Both unknown.* The user can either set op_1 and op_2 to the same SSG or separate SSGs.
> > > > >>> > >>>    - If op_1 and op_2 are in the same SSG, it will be equivalent to the coarse-grained resource management, where op_1 and op_2 share a default size slot no matter which data exchange mode is used.
> > > > >>> > >>>    - If op_1 and op_2 are in different SSGs, then each of them will use a default size slot. This is equivalent to setting them with default operator resources in your proposal.
> > > > >>> > >>> 4. *Total (pipeline) or max (blocking) of op_1 and op_2 is known.*
> > > > >>> > >>>    - It is possible that the user learns the total / max resource requirement from executing and monitoring the job, while not being aware of individual operator requirements.
> > > > >>> > >>>    - I believe this is the case your proposal does not cover.
> > > > >>> > >>> And TBH, this is probably how most users learn the resource requirements, according to my experiences.
> > > > >>> > >>>    - In this case, the user might need to specify different resources if they want to switch the execution mode, which should not be worse than not being able to use fine-grained resource management.
> > > > >>> > >>>
> > > > >>> > >>> ## An additional idea inspired by your proposal.
> > > > >>> > >>> We may provide multiple options for deciding resources for SSGs whose requirement is not specified, if needed.
> > > > >>> > >>>
> > > > >>> > >>> - Default slot resource (current design)
> > > > >>> > >>> - Default operator resource times number of operators (equivalent to your proposal)
> > > > >>> > >>>
> > > > >>> > >>> ## Exposing internal runtime strategies
> > > > >>> > >>> Theoretically, yes. Tying to the SSGs, the resource requirements might be affected if how SSGs are internally handled changes in future. Practically, I do not concretely see at the moment what kind of changes we may want in future that might conflict with this FLIP proposal, as the question of switching data exchange mode is answered above. I'd suggest not giving up the user friendliness we may gain now for future problems that may or may not exist.
> > > > >>> > >>>
> > > > >>> > >>> Moreover, the SSG-based approach has the flexibility to achieve behavior equivalent to the operator-based approach, if we set each operator (or task) to a separate SSG. We can even provide an option to automatically do that for users, if needed.
> > > > >>> > >>>
> > > > >>> > >>> Thank you~
> > > > >>> > >>>
> > > > >>> > >>> Xintong Song
> > > > >>> > >>>
> > > > >>> > >>> On Tue, Jan 19, 2021 at 11:48 PM Till Rohrmann <[hidden email]> wrote:
> > > > >>> > >>>> Thanks for the responses Xintong and Stephan,
> > > > >>> > >>>>
> > > > >>> > >>>> I agree that being able to define the resource requirements for a group of operators is more user friendly. However, my concern is that we are thereby exposing internal runtime strategies which might limit our flexibility to execute a given job. Moreover, the semantics of configuring resource requirements for SSGs could break if switching from streaming to batch execution. If one defines the resource requirements for op_1 -> op_2 which run in pipelined mode when using the streaming execution, then how do we interpret these requirements when op_1 -> op_2 are executed with a blocking data exchange in batch execution mode? Consequently, I am still leaning towards Stephan's proposal to set the resource requirements per operator.
> > > > >>> > >>>>
> > > > >>> > >>>> Maybe the following proposal makes the configuration easier: If the user wants to use fine-grained resource requirements, then they need to specify the default size which is used for operators which have no explicit resource annotation. If this holds true, then every operator would have a resource requirement and the system can try to execute the operators in the best possible manner w/o being constrained by how the user set the SSG requirements.
> > > > >>> > >>>>
> > > > >>> > >>>> Cheers,
> > > > >>> > >>>> Till
> > > > >>> > >>>>
> > > > >>> > >>>> On Tue, Jan 19, 2021 at 9:09 AM Xintong Song <[hidden email]> wrote:
> > > > >>> > >>>>
> > > > >>> > >>>>> Thanks for the feedback, Stephan.
> > > > >>> > >>>>>
> > > > >>> > >>>>> Actually, your proposal has also come to my mind at some point. And I have some concerns about it.
> > > > >>> > >>>>>
> > > > >>> > >>>>> 1. It does not give users the same control as the SSG-based approach.
> > > > >>> > >>>>>
> > > > >>> > >>>>> While neither approach requires specifying resources for every operator, the SSG-based approach supports the semantic that "some operators together use this much resource" while the operator-based approach doesn't.
> > > > >>> > >>>>>
> > > > >>> > >>>>> Think of a long pipeline with m operators (o_1, o_2, ..., o_m), and at some point there's an agg o_n (1 < n < m) which significantly reduces the data amount. One can separate the pipeline into 2 groups SSG_1 (o_1, ..., o_n) and SSG_2 (o_n+1, ..., o_m), so that configuring much higher parallelisms for operators in SSG_1 than for operators in SSG_2 won't lead to too much wasting of resources. If the two SSGs end up needing different resources, with the SSG-based approach one can directly specify resources for the two groups. However, with the operator-based approach, the user will have to specify resources for each operator in one of the two groups, and tune the default slot resource via configurations to fit the other group.
> > > > >>> > >>>>>
> > > > >>> > >>>>> 2. It increases the chance of breaking operator chains.
> > > > >>> > >>>>>
> > > > >>> > >>>>> Setting chainable operators into different slot sharing groups will prevent them from being chained.
> > > > >>> > >>>>> In the current implementation, downstream operators, if their SSG is not explicitly specified, will be set to the same group as the chainable upstream operators (unless there are multiple upstream operators in different groups), to reduce the chance of breaking chains.
> > > > >>> > >>>>>
> > > > >>> > >>>>> Thinking of chainable operators o_1 -> o_2 -> o_3 -> o_4, deciding SSGs based on whether resource is specified we will easily get groups like (o_1, o_3) & (o_2, o_4), where none of the operators can be chained. This is also possible for the SSG-based approach, but I believe the chance is much smaller because there's no strong reason for users to specify the groups with alternate operators like that. We are more likely to get groups like (o_1, o_2) & (o_3, o_4), where the chain breaks only between o_2 and o_3.
> > > > >>> > >>>>>
> > > > >>> > >>>>> 3. It complicates the system by having two different mechanisms for sharing managed memory in a slot.
> > > > >>> > >>>>>
> > > > >>> > >>>>> - In FLIP-141, we introduced the intra-slot managed memory sharing mechanism, where managed memory is first distributed according to the consumer type, then further distributed across operators of that consumer type.
> > > > >>> > >>>>>
> > > > >>> > >>>>> - With the operator-based approach, managed memory size specified for an operator should account for all the consumer types of that operator. That means the managed memory is first distributed across operators, then distributed to different consumer types of each operator.
> > > > >>> > >>>>>
> > > > >>> > >>>>> Unfortunately, the different order of the two calculation steps can lead to different results. To be specific, the semantic of the configuration option `consumer-weights` changes (within a slot vs. within an operator).
> > > > >>> > >>>>>
> > > > >>> > >>>>> To sum up things:
> > > > >>> > >>>>>
> > > > >>> > >>>>> While (3) might be a bit more implementation related, I think (1) and (2) somehow suggest that the price for the proposed approach to avoid specifying resource for every operator is that it's not as independent from operator chaining and slot sharing as the operator-based approach discussed in the FLIP.
> > > > >>> > >>>>>
> > > > >>> > >>>>> Thank you~
> > > > >>> > >>>>>
> > > > >>> > >>>>> Xintong Song
> > > > >>> > >>>>>
> > > > >>> > >>>>> On Tue, Jan 19, 2021 at 4:29 AM Stephan Ewen <[hidden email]> wrote:
> > > > >>> > >>>>>> Thanks a lot, Yangze and Xintong for this FLIP.
> > > > >>> > >>>>>>
> > > > >>> > >>>>>> I want to say, first of all, that this is super well written. And the points that the FLIP makes about how to expose the configuration to users are exactly the right thing to figure out first.
> > > > >>> > >>>>>> So good job here!
> > > > >>> > >>>>>>
> > > > >>> > >>>>>> About how to let users specify the resource profiles. If I can sum the FLIP and previous discussion up in my own words, it is the following:
> > > > >>> > >>>>>>
> > > > >>> > >>>>>>> Operator-level specification is the simplest and cleanest approach, because it avoids mixing operator configuration (resource) and scheduling. No matter what other parameters change (chaining, slot sharing, switching pipelined and blocking shuffles), the resource profiles stay the same.
> > > > >>> > >>>>>>> But it would require that a user specifies resources on all operators, which makes it hard to use. That's why the FLIP suggests going with specifying resources on a Sharing-Group.
> > > > >>> > >>>>>>
> > > > >>> > >>>>>> I think both thoughts are important, so can we find a solution where the Resource Profiles are specified on an Operator, but we still avoid that we need to specify a resource profile on every operator?
> > > > >>> > >>>>>>
> > > > >>> > >>>>>> What do you think about something like the following:
> > > > >>> > >>>>>>   - Resource Profiles are specified on an operator level.
> > > > >>> > >>>>>>   - Not all operators need profiles.
> > > > >>> > >>>>>>   - All Operators without a Resource Profile end up in the default slot sharing group with a default profile (will get a default slot).
> > > > >>> > >>>>>>   - All Operators with a Resource Profile will go into another slot sharing group (the resource-specified-group).
> > > > >>> > >>>>>>   - Users can define different slot sharing groups for operators like they do now, with the exception that you cannot mix operators that have a resource profile and operators that have no resource profile.
> > > > >>> > >>>>>>   - The default case where no operator has a resource profile is just a special case of this model.
> > > > >>> > >>>>>>   - The chaining logic sums up the profiles per operator, like it does now, and the scheduler sums up the profiles of the tasks that it schedules together.
> > > > >>> > >>>>>>
> > > > >>> > >>>>>> There is another question about reactive scaling raised in the FLIP. I need to think a bit about that. That is indeed a bit more tricky once we have slots of different sizes.
> > > > >>> > >>>>>> It is not clear then which of the different slot requests the ResourceManager should fulfill when new resources (TMs) show up, or how the JobManager redistributes the slot resources when resources (TMs) disappear.
> > > > >>> > >>>>>> This question is pretty orthogonal, though, to the "how to specify the resources".
> > > > >>> > >>>>>>
> > > > >>> > >>>>>> Best,
> > > > >>> > >>>>>> Stephan
> > > > >>> > >>>>>>
> > > > >>> > >>>>>> On Fri, Jan 8, 2021 at 5:14 AM Xintong Song <[hidden email]> wrote:
> > > > >>> > >>>>>>> Thanks for drafting the FLIP and driving the discussion, Yangze.
> > > > >>> > >>>>>>> And thanks for the feedback, Till and Chesnay.
> > > > >>> > >>>>>>>
> > > > >>> > >>>>>>> @Till,
> > > > >>> > >>>>>>>
> > > > >>> > >>>>>>> I agree that specifying requirements for SSGs means that SSGs need to be supported in fine-grained resource management, otherwise each operator might use as many resources as the whole group. However, I cannot think of a strong reason for not supporting SSGs in fine-grained resource management.
> > > > >>> > >>>>>>>
> > > > >>> > >>>>>>>> Interestingly, if all operators have their resources properly specified, then slot sharing is no longer needed because Flink could slice off the appropriately sized slots for every Task
> > > > >>> > >>>>>>>>
> > > > >>> > >>>>>>>> So for example, if we have a job consisting of two operators op_1 and op_2 where each op needs 100 MB of memory, we would then say that the slot sharing group needs 200 MB of memory to run. If we have a cluster with 2 TMs with one slot of 100 MB each, then the system cannot run this job. If the resources were specified on an operator level, then the system could still make the decision to deploy op_1 to TM_1 and op_2 to TM_2.
> > > > >>> > >>>>>>>
> > > > >>> > >>>>>>> Couldn't agree more that if all operators' requirements are properly specified, slot sharing should be no longer needed. I think this exactly disproves the example. If we already know op_1 and op_2 each needs 100 MB of memory, why would we put them in the same group?
> > > > >>> > >>>>>>> If they are in separate groups, with the proposed approach the system can freely deploy them to either a 200 MB TM or two 100 MB TMs.
> > > > >>> > >>>>>>>
> > > > >>> > >>>>>>> Moreover, the precondition for not needing slot sharing is having resource requirements properly specified for all operators. This is not always possible, and usually requires tremendous efforts. One of the benefits of SSG-based requirements is that they allow the user to freely decide the granularity, and thus the effort they want to pay. I would consider an SSG in fine-grained resource management as a group of operators that the user would like to specify the total resource for. There can be only one group in the job, 2~3 groups dividing the job into a few major parts, or as many groups as the number of tasks/operators, depending on how fine-grained the user is able to specify the resources.
> > > > >>> > >>>>>>>
> > > > >>> > >>>>>>> Having to support SSGs might be a constraint.
> > > > >>> > >>>>>>> But given that all the current scheduler implementations already support SSGs, I tend to think of that as an acceptable price for the above discussed usability and flexibility.
> > > > >>> > >>>>>>>
> > > > >>> > >>>>>>> @Chesnay
> > > > >>> > >>>>>>>
> > > > >>> > >>>>>>>> Will declaring them on slot sharing groups not also waste resources if the parallelism of operators within that group are different?
> > > > >>> > >>>>>>>
> > > > >>> > >>>>>>> Yes. It's a trade-off between usability and resource utilization. To avoid such wasting, the user can define more groups, so that each group contains fewer operators and the chance of having operators with different parallelism will be reduced. The price is to have more resource requirements to specify.
> > > > >>> > >>>>>>>
> > > > >>> > >>>>>>>> It also seems like quite a hassle for users having to recalculate the resource requirements if they change the slot sharing.
> > > > >>> > >>>>>>>> I'd think that it's not really workable for users that create a set of re-usable operators which are mixed and matched in their applications; managing the resources requirements in such a setting would be a nightmare, and in the end would require operator-level requirements any way.
> > > > >>> > >>>>>>>> In that sense, I'm not even sure whether it really increases usability.
> > > > >>> > >>>>>>>
> > > > >>> > >>>>>>> - As mentioned in my reply to Till's comment, there's no reason to put multiple operators whose individual resource requirements are already known into the same group in fine-grained resource management.
> > > > >>> > >>>>>>> - Even if an operator implementation is reused for multiple applications, it does not guarantee the same resource requirements. During our years of practices in Alibaba, with per-operator requirements specified for Blink's fine-grained resource management, very few users (including our specialists who are dedicated to supporting Blink users) are experienced enough to accurately predict/estimate the operator resource requirements. Most people rely on the execution-time metrics (throughput, delay, cpu load, memory usage, GC pressure, etc.) to improve the specification.
> > > > >>> > >>>>>>>
> > > > >>> > >>>>>>> To sum up:
> > > > >>> > >>>>>>> If the user is capable of providing proper resource requirements for every operator, that's definitely a good thing and we would not need to rely on the SSGs.
> > > > >>> > >>>>>>> However, that shouldn't be a *must* for the fine-grained resource management to work. For those users who are capable and do not like having to set each operator to a separate SSG, I would be ok to have both SSG-based and operator-based runtime interfaces and to only fall back to the SSG requirements when the operator requirements are not specified. However, as the first step, I think we should prioritise the use cases where users are not that experienced.
> > > > >>> > >>>>>>>
> > > > >>> > >>>>>>> Thank you~
> > > > >>> > >>>>>>>
> > > > >>> > >>>>>>> Xintong Song
> > > > >>> > >>>>>>>
> > > > >>> > >>>>>>> On Thu, Jan 7, 2021 at 9:55 PM Chesnay Schepler <[hidden email]> wrote:
> > > > >>> > >>>>>>>
> > > > >>> > >>>>>>>> Will declaring them on slot sharing groups not also waste resources if the parallelism of operators within that group are different?
> > > > >>> > >>>>>>>>
> > > > >>> > >>>>>>>> It also seems like quite a hassle for users having to recalculate the resource requirements if they change the slot sharing.
> > > > >>> > >>>>>>>> I'd think that it's not really workable for users that create a set of re-usable operators which are mixed and matched in their applications; managing the resources requirements in such a setting would be a nightmare, and in the end would require operator-level requirements any way.
> > > > >>> > >>>>>>>> In that sense, I'm not even sure whether it really increases usability.
> > > > >>> > >>>>>>>>
> > > > >>> > >>>>>>>> My main worry is that if we wire the runtime to work on SSGs it's gonna be difficult to implement more fine-grained approaches, which would not be the case if, for the runtime, they are always defined on an operator-level.
> > > > >>> > >>>>>>>>
> > > > >>> > >>>>>>>> On 1/7/2021 2:42 PM, Till Rohrmann wrote:
> > > > >>> > >>>>>>>>> Thanks for drafting this FLIP and starting this discussion Yangze.
> > > > >>> > >>>>>>>>>
> > > > >>> > >>>>>>>>> I like that defining resource requirements on a slot sharing group makes the overall setup easier and improves usability of resource requirements.
> > > > >>> > >>>>>>>>>
> > > > >>> > >>>>>>>>> What I do not like about it is that it changes slot sharing groups from being a scheduling hint to something which needs to be supported in order to support fine grained resource requirements.
> > > > >>> > >>>>>>>>> So far, the idea of slot sharing groups was that it tells the system that a set of operators can be deployed in the same slot. But the system still had the freedom to say that it would rather place these tasks in different slots if it wanted. If we now specify resource requirements per slot sharing group, then the only option for a scheduler which does not support slot sharing groups is to say that every operator in this slot sharing group needs a slot with the same resources as the whole group.
> > > > >>> > >>>>>>>>>
> > > > >>> > >>>>>>>>> So for example, if we have a job consisting of two operators op_1 and op_2 where each op needs 100 MB of memory, we would then say that the slot sharing group needs 200 MB of memory to run. If we have a cluster with 2 TMs with one slot of 100 MB each, then the system cannot run this job.
> > > > >>> > >>>>>>>>> If the resources were specified on an operator level, then the system could still make the decision to deploy op_1 to TM_1 and op_2 to TM_2.
> > > > >>> > >>>>>>>>>
> > > > >>> > >>>>>>>>> Originally, one of the primary goals of slot sharing groups was to make it easier for the user to reason about how many slots a job needs independent of the actual number of operators in the job. Interestingly, if all operators have their resources properly specified, then slot sharing is no longer needed because Flink could slice off the appropriately sized slots for every Task individually. What matters is whether the whole cluster has enough resources to run all tasks or not.
> > > > >>> > >>>>>>>>> > > > > >>> > >>>>>>>>> Cheers, > > > > >>> > >>>>>>>>> Till > > > > >>> > >>>>>>>>> > > > > >>> > >>>>>>>>> On Thu, Jan 7, 2021 at 4:08 AM Yangze Guo < > > > > >>> > >> [hidden email] <mailto:[hidden email]>> > > > > >>> > >>>>>> wrote: > > > > >>> > >>>>>>>>>> Hi, there, > > > > >>> > >>>>>>>>>> > > > > >>> > >>>>>>>>>> We would like to start a discussion thread on > > > > "FLIP-156: > > > > >>> > >> Runtime > > > > >>> > >>>>>>>>>> Interfaces for Fine-Grained Resource > > > Requirements"[1], > > > > >>> > >> where we > > > > >>> > >>>>>>>>>> propose Slot Sharing Group (SSG) based runtime > > > > interfaces > > > > >>> > >> for > > > > >>> > >>>>>>>>>> specifying fine-grained resource requirements. > > > > >>> > >>>>>>>>>> > > > > >>> > >>>>>>>>>> In this FLIP: > > > > >>> > >>>>>>>>>> - Expound the user story of fine-grained resource > > > > >>> > >> management. > > > > >>> > >>>>>>>>>> - Propose runtime interfaces for specifying > > > SSG-based > > > > >>> > >> resource > > > > >>> > >>>>>>>>>> requirements. > > > > >>> > >>>>>>>>>> - Discuss the pros and cons of the three potential > > > > >>> > >> granularities > > > > >>> > >>>>> for > > > > >>> > >>>>>>>>>> specifying the resource requirements (op, task and > > > > slot > > > > >>> > >> sharing > > > > >>> > >>>>>> group) > > > > >>> > >>>>>>>>>> and explain why we choose the slot sharing group. > > > > >>> > >>>>>>>>>> > > > > >>> > >>>>>>>>>> Please find more details in the FLIP wiki document > > > > [1]. > > > > >>> > >> Looking > > > > >>> > >>>>>>>>>> forward to your feedback. 
> > > > >>> > >>>>>>>>>> > > > > >>> > >>>>>>>>>> [1] > > > > >>> > >>>>>>>>>> > > > > >>> > >> > > > > >>> > > > > > > > > > > > > > > >>> > < > > > > > > > > > https://cwiki.apache.org/confluence/display/FLINK/FLIP-156%3A+Runtime+Interfaces+for+Fine-Grained+Resource+Requirements > > > > > > > > > >>> > >>>>>>>>>> Best, > > > > >>> > >>>>>>>>>> Yangze Guo > > > > >>> > >>>>>>>>>> > > > > >>> > >>>>>>>> > > > > >>> > > > > > >>> > > > > > > > > |
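Till's two-operator example earlier in the thread (op_1 and op_2 at 100 MB each, two TMs with one 100 MB slot each) can be checked with a small sketch. This is only an illustration under simplifying assumptions: memory is the only resource, each slot hosts at most one operator in the per-operator case, and the names `Slot`, `fits_as_one_group` and `fits_per_operator` are hypothetical and not part of Flink.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Slot:
    """Toy stand-in for a task manager slot; not a Flink class."""
    memory_mb: int


def fits_as_one_group(op_memory_mb, slots):
    """SSG-level requirement: the whole group must fit into one slot."""
    total = sum(op_memory_mb)
    return any(slot.memory_mb >= total for slot in slots)


def fits_per_operator(op_memory_mb, slots):
    """Operator-level requirements: every operator may get its own slot.

    Assumes one operator per slot, so matching the largest need with the
    largest slot (both sorted descending) decides feasibility.
    """
    needs = sorted(op_memory_mb, reverse=True)
    free = sorted((slot.memory_mb for slot in slots), reverse=True)
    if len(needs) > len(free):
        return False
    return all(need <= cap for need, cap in zip(needs, free))


cluster = [Slot(100), Slot(100)]   # 2 TMs, one 100 MB slot each
operators = [100, 100]             # op_1 and op_2

print(fits_as_one_group(operators, cluster))
print(fits_per_operator(operators, cluster))
```

Under these assumptions the group-level check fails (no single slot offers 200 MB) while the per-operator check succeeds, which is the asymmetry the example points at.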
In reply to this post by Till Rohrmann
Hi Till,
Based on what I understood (correct me if I'm wrong), the door is not closed after SSG resource specifying. So I hope this could be useful for potential future improvements.

Best,
Kezhu Wang

On February 3, 2021 at 18:07:21, Till Rohrmann ([hidden email]) wrote:

> Thanks for sharing your thoughts Kezhu. I like your ideas of how per-operator and SSG requirements can be combined. I've also thought about defining a default resource profile for all tasks which have no resources configured. That way all operators would have resources assigned if the user chooses to use this feature.
>
> As Yangze and Xintong have said, we have decided to first only support specifying resources for SSGs as this seems more user friendly. Based on the feedback for this feature one potential development direction might be to allow the resource specification on a per-operator basis. Here we could pick up your ideas.
>
> Cheers,
> Till
>
> On Wed, Feb 3, 2021 at 7:31 AM Xintong Song <[hidden email]> wrote:
>
> Thanks for your feedback, Kezhu.
>
> > I think Flink *runtime* already has an ideal granularity for resource management: 'task'. If there is a slot shared by multiple tasks, that slot's resource requirement is simply the sum of all its logical slots. So basically, there is no resource requirement for SlotSharingGroup in runtime until now, right?
>
> That is a half-baked implementation, coming from the previous attempts (years ago) trying to deliver the fine-grained resource management feature, and never really put into use.
>
> > From the FLIP and discussion, I assume that SSG resource specifying will override operator level resource specifying if both are specified?
>
> Actually, I think we should use the finer-grained resources (i.e. operator level) if both are specified. And more importantly, that is based on the assumption that we do need two different levels of interfaces.
>
> > So, I wonder whether we could interpret SSG resource specifying as an "add" but not a "set" on resource requirement?
>
> IIUC, this is the core idea behind your proposal. I think it provides an interesting idea of how we combine operator level and SSG level resources, *if we allow configuring resources at both levels*. However, I'm not sure whether configuring resources on the operator level is indeed needed. Therefore, as a first step, this FLIP proposes to only introduce the SSG-level interfaces. As listed in the future plan, we would consider allowing operator level resource configuration later if we do see a need for it. At that time, we definitely should discuss what to do if resources are configured at both levels.
>
> > * Could SSG express negative resource requirement?
>
> No.
>
> > Is there a concrete barrier that keeps partially configured resources from working? I saw it will fail job submission in Dispatcher.submitJob.
>
> With the SSG-based approach, this should no longer be needed. The constraint was introduced because we can neither properly define what is the resource of a task chained from an operator with specified resource and another with unspecified resource, nor for a slot shared by a task with specified resource and another with unspecified resource. With the SSG-based approach, we no longer have those problems.
>
> > An option (cluster/job level) to force slot sharing in scheduler? This could be useful in case of migration from FLIP-156 to a future approach.
>
> I think this is exactly what we are trying to avoid, requiring the scheduler to enforce slot sharing.
>
> > An option (cluster) to ignore resource specifying (allow resource specified job to run on open box environment) for non-production usage?
>
> That's possible. Actually, we are planning to introduce an option for activating the fine-grained resource management, for development purposes. We might consider keeping that option after the feature is completed, to allow disabling the feature without having to touch the job codes.
>
> Thank you~
>
> Xintong Song
>
> On Wed, Feb 3, 2021 at 1:28 PM Kezhu Wang <[hidden email]> wrote:
>
> Hi all, sorry for joining the discussion even after voting has started.
>
> I want to share my thoughts on this after reading the above discussions.
>
> I think Flink *runtime* already has an ideal granularity for resource management: 'task'. If there is a slot shared by multiple tasks, that slot's resource requirement is simply the sum of all its logical slots. So basically, there is no resource requirement for SlotSharingGroup in runtime until now, right?
>
> As in the discussion, we already agree that: "If all operators have their resources properly specified, then slot sharing is no longer needed."
>
> So it seems to me that, naturally, what we would discuss is: how to bridge impractical operator level resource specifying to runtime task level resource requirement? This is actually a pure API thing as Chesnay has pointed out.
>
> But FLIP-156 brings another direction to the table: how about using SSG for both API and runtime resource specifying?
>
> From the FLIP and discussion, I assume that SSG resource specifying will override operator level resource specifying if both are specified?
>
> So, I wonder whether we could interpret SSG resource specifying as an "add" but not a "set" on resource requirement?
>
> The semantics is that SSG resource specifying adds additional resource to the shared slot to express concerns on possible high throughput and resource requirement for tasks in one physical slot.
>
> The result is that if the scheduler indeed respects slot sharing, the allocated slot will gain the extra resource specified for that SSG.
>
> I think one coding barrier for the "add" approach is ResourceSpec.UNKNOWN, which doesn't support the 'merge' operation. I tend to use ResourceSpec.ZERO as default; the task executor should be aware of this.
>
> @Chesnay
> > My main worry is that it if we wire the runtime to work on SSGs it's gonna be difficult to implement more fine-grained approaches, which would not be the case if, for the runtime, they are always defined on an operator-level.
>
> An "add" operation should be less invasive and keep a low barrier for future fine-grained approaches.
>
> @Stephan
> > - Users can define different slot sharing groups for operators like they do now, with the exception that you cannot mix operators that have a resource profile and operators that have no resource profile.
>
> @Till
> > This effectively means that all unspecified operators will implicitly have a zero resource requirement.
> > I am wondering whether this wouldn't lead to a surprising behaviour for the user. If the user specifies the resource requirements for a single operator, then he probably will assume that the other operators will get the default share of resources and not nothing.
>
> I think it is inherent, due to the fact that we cannot define ResourceSpec.ONE, e.g. the resource requirement for exactly one default slot, with concrete numbers? I tend to squash out unspecified ones if there are operators in the chain with explicit resource specifying. Otherwise, the protocol tends to be verbose, as if saying "give me this much resource and a default". I think if we have explicit resource specifying for only some operators, it is just saying "I don't care about the other operators that much, just get them places to run". Most likely these are stateless filter/map or other less resource-consuming operators. If there is indeed a problem, I think clients can specify a global default (or other level default in future). During job graph generation we could take that default into account for unspecified operators.
>
> @FLIP-156
> > Expose operator chaining. (Cons of task level resource specifying)
>
> Is it inherent for all group level resource specifying? They will either break chaining or obey it, or even could not work with it.
>
> To sum up the above, my suggestions are:
>
> On the API side:
> * StreamExecutionEnvironment: A global default (ResourceSpec.ZERO if unspecified).
> * Operator: ResourceSpec.ZERO (unspecified) as default.
> * Task: sum of requirements from specified operators + global default (if there are any unspecified operators)
> * SSG: additional resource to physical slot.
>
> On the runtime side:
> * Task: ResourceSpec.Task or ResourceSpec.ZERO
> * SSG: ResourceSpec.SSG or ResourceSpec.ZERO
>
> The physical slot gets the summed-up resources from logical slots and SSG; if it gets ResourceSpec.ZERO, it is just a default sized slot.
>
> In short, turn SSG resource specifying into an "add" and drop ResourceSpec.UNKNOWN.
>
> Questions/Issues:
> * Could SSG express negative resource requirement?
> * Is there a concrete barrier that keeps partially configured resources from working? I saw it will fail job submission in Dispatcher.submitJob.
> * An option (cluster/job level) to force slot sharing in scheduler? This could be useful in case of migration from FLIP-156 to a future approach.
> * An option (cluster) to ignore resource specifying (allow resource specified job to run on open box environment) for non-production usage?
>
> On February 1, 2021 at 11:54:10, Yangze Guo ([hidden email]) wrote:
>
> Thanks for reply, Till and Xintong!
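The "add" semantics that Kezhu summarizes above (the physical slot requests the sum of its tasks' requirements plus the SSG's extra, and an all-zero total falls back to a default-sized slot) could look roughly like the sketch below. It is a toy model, not Flink code: this `ResourceSpec` is a hypothetical two-field stand-in for the real class, and `physical_slot_request` and its parameters are made up for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ResourceSpec:
    """Hypothetical, simplified stand-in (CPU and memory only)."""
    cpu_cores: float = 0.0
    memory_mb: int = 0

    def merge(self, other):
        # "add" semantics: requirements accumulate field by field.
        return ResourceSpec(self.cpu_cores + other.cpu_cores,
                            self.memory_mb + other.memory_mb)


ZERO = ResourceSpec()  # plays the role of "unspecified" in this sketch


def physical_slot_request(task_specs, ssg_extra, default_slot):
    """Sum the task requirements, add the SSG's extra, and fall back to
    the default slot size if nothing at all was specified."""
    total = ssg_extra
    for spec in task_specs:
        total = total.merge(spec)
    return default_slot if total == ZERO else total


default_slot = ResourceSpec(cpu_cores=1.0, memory_mb=1024)

# Nothing specified anywhere: a default-sized slot is requested.
print(physical_slot_request([ZERO, ZERO], ZERO, default_slot))

# One task specified, plus 50 MB added at the SSG level.
print(physical_slot_request([ResourceSpec(0.5, 100), ZERO],
                            ResourceSpec(0.0, 50), default_slot))
```

Using ZERO as the neutral element is what makes `merge` total in this sketch, which is the reason given above for dropping an UNKNOWN value that cannot be merged.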
>
> I updated the FLIP, including:
> - Edit the JavaDoc of the proposed StreamGraphGenerator#setSlotSharingGroupResource.
> - Add "Future Plan" section, which contains the potential follow-up issues and the limitations to be documented when fine-grained resource management is exposed to users.
>
> I'll start a vote in another thread.
>
> Best,
> Yangze Guo
>
> On Fri, Jan 29, 2021 at 10:07 PM Till Rohrmann <[hidden email]> wrote:
>
> Thanks for summarizing the discussion, Yangze. I agree that setting resource requirements per operator is not very user friendly. Moreover, I couldn't come up with a different proposal which would be as easy to use and wouldn't expose internal scheduling details. In fact, following this argument then we shouldn't have exposed the slot sharing groups in the first place.
>
> What is important for the user is that we properly document the limitations and constraints the fine grained resource specification has. For example, we should explain how optimizations like chaining are affected by it and how different execution modes (batch vs. streaming) affect the execution of operators which have specified resources. These things shouldn't be part of the contract of this feature and are more caused by internal implementation details but it will be important to understand these things properly in order to use this feature effectively.
>
> Hence, +1 for starting the vote for this FLIP.
>
> Cheers,
> Till
>
> On Tue, Jan 26, 2021 at 4:37 AM Xintong Song <[hidden email]> wrote:
>
> Thanks for the summary, Yangze.
>
> The changes and follow-up issues LGTM. Let's wait for responses from the others before starting a vote.
>
> Thank you~
>
> Xintong Song
>
> On Tue, Jan 26, 2021 at 11:08 AM Yangze Guo <[hidden email]> wrote:
>
> Thanks everyone for the lively discussion. I'd like to try to summarize the current convergence in the discussion. Please let me know if I got things wrong or missed something crucial here.
>
> Change of this FLIP:
> - Treat the SSG resource requirements as a hint instead of a restriction for the runtime. That should be explicitly explained in the JavaDocs.
>
> Potential follow-up issues if needed:
> - Provide operator-level resource configuration interface.
> - Provide multiple options for deciding resources for SSGs whose requirement is not specified:
>   ** Default slot resource.
>   ** Default operator resource times number of operators.
>
> If there are no other issues, I'll update the FLIP accordingly and start a vote thread. Thanks all for the valuable feedback again.
>
> Best,
> Yangze Guo
>
> On Fri, Jan 22, 2021 at 11:30 AM Xintong Song <[hidden email]> wrote:
>
> FGRuntimeInterface.png
>
> Thank you~
>
> Xintong Song
>
> On Fri, Jan 22, 2021 at 11:11 AM Xintong Song <[hidden email]> wrote:
>
> I think Chesnay's proposal could actually work. IIUC, the keypoint is to derive operator requirements from SSG requirements on the API side, so that the runtime only deals with operator requirements. It's debatable how the deriving should be done though. E.g., an alternative could be to evenly divide the SSG requirement into requirements of operators in the group.
>
> However, I'm not entirely sure which option is more desired. Illustrating my understanding in the following figure, in which on the top is Chesnay's proposal and on the bottom is the SSG-based proposal in this FLIP.
>
> I think the major difference between the two approaches is where deriving operator requirements from SSG requirements happens.
>
> - Chesnay's proposal simplifies the runtime logic and the interface to expose, at the price of moving more complexity (i.e. the deriving) to the API side. The question is, where do we prefer to keep the complexity? I'm slightly leaning towards having a thin API and keeping the complexity in runtime if possible.
>
> - Notice that the dash line arrows represent optional steps that are needed only for schedulers that do not respect SSGs, which we don't have at the moment. If we only look at the solid line arrows, then the SSG-based approach is much simpler, without needing to derive and aggregate the requirements back and forth. I'm not sure about complicating the current design only for the potential future needs.
>
> Thank you~
>
> Xintong Song
>
> On Fri, Jan 22, 2021 at 7:35 AM Chesnay Schepler <[hidden email]> wrote:
>
> You're raising a good point, but I think I can rectify that with a minor adjustment.
>
> Default requirements are whatever the default requirements are, setting the requirements for one operator has no effect on other operators.
>
> With these rules, and some API enhancements, the following mockup would replicate the SSG-based behavior:
>
>     Map<SlotSharingGroupId, Requirements> requirements = ...
>     for slotSharingGroup in env.getSlotSharingGroups() {
>         vertices = slotSharingGroup.getVertices()
>         vertices.first().setRequirements(requirements.get(slotSharingGroup.getID()))
>         vertices.remaining().setRequirements(ZERO)
>     }
>
> We could even allow setting requirements on slotsharing-groups colocation-groups and internally translate them accordingly.
> I can't help but feel this is a plain API issue.
>
> On 1/21/2021 9:44 AM, Till Rohrmann wrote:
>
> If I understand you correctly Chesnay, then you want to decouple the resource requirement specification from the slot sharing group assignment. Hence, per default all operators would be in the same slot sharing group. If there is no operator with a resource specification, then the system would allocate a default slot for it. If there is at least one operator, then the system would sum up all the specified resources and allocate a slot of this size. This effectively means that all unspecified operators will implicitly have a zero resource requirement. Did I understand your idea correctly?
>
> I am wondering whether this wouldn't lead to a surprising behaviour for the user. If the user specifies the resource requirements for a single operator, then he probably will assume that the other operators will get the default share of resources and not nothing.
>
> Cheers,
> Till
>
> On Thu, Jan 21, 2021 at 3:25 AM Chesnay Schepler <[hidden email]> wrote:
>
> Is there even a functional difference between specifying the requirements for an SSG vs specifying the same requirements on a single operator within that group (ideally a colocation group to avoid this whole hint business)?
>
> Wouldn't we get the best of both worlds in the latter case?
>
> Users can take shortcuts to define shared requirements, but refine them further as needed on a per-operator basis, without changing semantics of slotsharing groups nor the runtime being locked into SSG-based requirements.
>
> (And before anyone argues what happens if slotsharing groups change or whatnot, that's a plain API issue that we could surely solve. (A plain iteration over slotsharing groups and therein contained operators would suffice)).
>
> On 1/20/2021 6:48 PM, Till Rohrmann wrote:
>
> Maybe a different minor idea: Would it be possible to treat the SSG resource requirements as a hint for the runtime similar to how slot sharing groups are designed at the moment?
> Meaning that we don't give the guarantee that Flink will always deploy this set of tasks together no matter what comes. If, for example, the runtime can derive by some means the resource requirements for each task based on the requirements for the SSG, this could be possible. One easy strategy would be to give every task the same resources as the whole slot sharing group. Another one could be distributing the resources equally among the tasks. This does not even have to be implemented but we would give ourselves the freedom to change scheduling if need should arise.
>
> Cheers,
> Till
>
> On Wed, Jan 20, 2021 at 7:04 AM Yangze Guo <[hidden email]> wrote:
>
> Thanks for the responses, Till and Xintong.
>
> I second Xintong's comment that the SSG-based runtime interface will give us the flexibility to achieve an op/task-based approach. That's one of the most important reasons for our design choice.
>
> Some cents regarding the default operator resource:
> - It might be good for the scenario of DataStream jobs.
>   ** For light-weight operators, the accumulative configuration error will not be significant. Then, the resource of a task is proportional to the number of operators it contains.
>   ** For heavy operators like join and window or operators using the external resources, users will turn to the fine-grained resource configuration.
> - It can increase the stability for the standalone cluster where the registered task executors are heterogeneous (with different default slot resources).
> - It might not be good for SQL users. The operators that SQL will be translated to are a black box to the user. We also do not guarantee the cross-version consistency of the transformation so far.
>
> I think it can be treated as a follow-up work when the fine-grained resource management is end-to-end ready.
>
> Best,
> Yangze Guo
>
> On Wed, Jan 20, 2021 at 11:16 AM Xintong Song <[hidden email]> wrote:
>
> Thanks for the feedback, Till.
>
> ## I feel that what you proposed (operator-based + default value) might be subsumed by the SSG-based approach.
> Thinking of op_1 -> op_2, there are the following 4 cases, categorized by whether the resource requirements are known to the users.
>
> 1. *Both known.* As previously mentioned, there's no reason to put multiple operators whose individual resource requirements are already known into the same group in fine-grained resource management.
>    And if op_1 and op_2 are in different groups, there should be no problem switching data exchange mode from pipelined to blocking. This is equivalent to specifying operator resource requirements in your proposal.
> 2. *op_1 known, op_2 unknown.* Similar to 1), except that op_2 is in a SSG whose resource is not specified thus would have the default slot resource. This is equivalent to having default operator resources in your proposal.
> 3. *Both unknown*. The user can either set op_1 and op_2 to the same SSG or separate SSGs.
>    - If op_1 and op_2 are in the same SSG, it will be equivalent to the coarse-grained resource management, where op_1 and op_2 share a default size slot no matter which data exchange mode is used.
>    - If op_1 and op_2 are in different SSGs, then each of them will use a default size slot. This is equivalent to setting them with default operator resources in your proposal.
> 4. *Total (pipeline) or max (blocking) of op_1 and op_2 is known.*
>    - It is possible that the user learns the total / max resource requirement from executing and monitoring the job, while not being aware of individual operator requirements.
>    - I believe this is the case your proposal does not cover. And TBH, this is probably how most users learn the resource requirements, according to my experiences.
>    - In this case, the user might need to specify different resources if he wants to switch the execution mode, which should not be worse than not being able to use fine-grained resource management.
>
> ## An additional idea inspired by your proposal.
> We may provide multiple options for deciding resources for SSGs whose requirement is not specified, if needed.
>
> - Default slot resource (current design)
> - Default operator resource times number of operators (equivalent to your proposal)
>
> ## Exposing internal runtime strategies
> Theoretically, yes. Tying to the SSGs, the resource requirements might be affected if how SSGs are internally handled changes in future. Practically, I do not concretely see at the moment what kind of changes we may want in future that might conflict with this FLIP proposal, as the question of switching data exchange mode is answered above. I'd suggest to not give up the user friendliness we may gain now for the future problems that may or may not exist.
>
> Moreover, the SSG-based approach has the flexibility to achieve the equivalent behavior as the operator-based approach, if we set each operator (or task) to a separate SSG. We can even provide a shortcut option to automatically do that for users, if needed.
>
> Thank you~
>
> Xintong Song
>
> On Tue, Jan 19, 2021 at 11:48 PM Till Rohrmann <[hidden email]> wrote:
>
> Thanks for the responses Xintong and Stephan,
>
> I agree that being able to define the resource requirements for a group of operators is more user friendly. However, my concern is that we are exposing thereby internal runtime strategies which might limit our flexibility to execute a given job.
> Moreover, the semantics of configuring resource requirements for SSGs could break if switching from streaming to batch execution. If one defines the resource requirements for op_1 -> op_2 which run in pipelined mode when using the streaming execution, then how do we interpret these requirements when op_1 -> op_2 are executed with a blocking data exchange in batch execution mode? Consequently, I am still leaning towards Stephan's proposal to set the resource requirements per operator.
>
> Maybe the following proposal makes the configuration easier: If the user wants to use fine-grained resource requirements, then she needs to specify the default size which is used for operators which have no explicit resource annotation. If this holds true, then every operator would have a resource requirement and the system can try to execute the operators in the best possible manner w/o being constrained by how the user set the SSG requirements.
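Till's idea of a mandatory default size for unannotated operators can be sketched as follows. This is an illustration only: memory is the sole resource, the function names are hypothetical, and the sum-versus-max rule is one possible reading that borrows the "total (pipeline) or max (blocking)" wording used elsewhere in this thread rather than anything the FLIP specifies.

```python
def resolve_operator_requirements(operators, explicit_mb, default_mb):
    """Give every operator a requirement: the explicit annotation if
    present, otherwise the user-provided default size."""
    return {op: explicit_mb.get(op, default_mb) for op in operators}


def slot_requirement_mb(resolved, operators, pipelined):
    """Derive what a slot hosting these operators would need.

    Assumption: pipelined operators run at the same time (sum), while
    operators separated by a blocking exchange run one after another (max).
    """
    needs = [resolved[op] for op in operators]
    return sum(needs) if pipelined else max(needs)


# op_1 is annotated with 300 MB; op_2 falls back to the 100 MB default.
resolved = resolve_operator_requirements(
    ["op_1", "op_2"], {"op_1": 300}, default_mb=100)

print(resolved)
print(slot_requirement_mb(resolved, ["op_1", "op_2"], pipelined=True))
print(slot_requirement_mb(resolved, ["op_1", "op_2"], pipelined=False))
```

Because every operator ends up with a concrete number, the runtime in this sketch can recompute the slot size when the exchange mode changes, without the user restating a group-level total.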
> > > > > >>> > >>>> > > > > > >>> > >>>> Cheers, > > > > > >>> > >>>> Till > > > > > >>> > >>>> > > > > > >>> > >>>> On Tue, Jan 19, 2021 at 9:09 AM Xintong Song > > > > > >>> > <[hidden email] <mailto:[hidden email]>> > > > > > >>> > >>>> wrote: > > > > > >>> > >>>> > > > > > >>> > >>>>> Thanks for the feedback, Stephan. > > > > > >>> > >>>>> > > > > > >>> > >>>>> Actually, your proposal has also come to my mind at > some > > > > > >>> > point. And I > > > > > >>> > >>>> have > > > > > >>> > >>>>> some concerns about it. > > > > > >>> > >>>>> > > > > > >>> > >>>>> > > > > > >>> > >>>>> 1. It does not give users the same control as the > > > > SSG-based > > > > > >>> > approach. > > > > > >>> > >>>>> > > > > > >>> > >>>>> > > > > > >>> > >>>>> While both approaches do not require specifying for > each > > > > > >>> > operator, > > > > > >>> > >>>>> SSG-based approach supports the semantic that "some > > > > > operators > > > > > >>> > >> together > > > > > >>> > >>>> use > > > > > >>> > >>>>> this much resource" while the operator-based approach > > > > > doesn't. > > > > > >>> > >>>>> > > > > > >>> > >>>>> > > > > > >>> > >>>>> Think of a long pipeline with m operators (o_1, o_2, > ..., > > > > > >>> > o_m), and > > > > > >>> > >> at > > > > > >>> > >>>> some > > > > > >>> > >>>>> point there's an agg o_n (1 < n < m) which > significantly > > > > > >>> > reduces the > > > > > >>> > >> data > > > > > >>> > >>>>> amount. One can separate the pipeline into 2 groups > SSG_1 > > > > > >>> > (o_1, ..., > > > > > >>> > >> o_n) > > > > > >>> > >>>>> and SSG_2 (o_n+1, ... o_m), so that configuring much > > > > higher > > > > > >>> > >> parallelisms > > > > > >>> > >>>>> for operators in SSG_1 than for operators in SSG_2 > won't > > > > > >>> > lead to too > > > > > >>> > >> much > > > > > >>> > >>>>> wasting of resources. 
>>>>> If the two SSGs end up needing different resources, with the SSG-based approach one can directly specify resources for the two groups. However, with the operator-based approach, the user will have to specify resources for each operator in one of the two groups, and tune the default slot resource via configurations to fit the other group.
>>>>>
>>>>> 2. It increases the chance of breaking operator chains.
>>>>>
>>>>> Setting chainable operators into different slot sharing groups will prevent them from being chained. In the current implementation, downstream operators, if SSG not explicitly specified, will be set to the same group as the chainable upstream operators (unless multiple upstream operators in different groups), to reduce the chance of breaking chains.
>>>>>
>>>>> Thinking of chainable operators o_1 -> o_2 -> o_3 -> o_4, deciding SSGs based on whether resource is specified we will easily get groups like (o_1, o_3) & (o_2, o_4), where none of the operators can be chained.
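The chain-breaking concern can be made concrete with a small count. The sketch below is a toy Python model, not Flink code: it assumes a linear pipeline and treats every adjacent pair of operators in different slot sharing groups as a broken chain. `chain_breaks` and the group labels "A" and "B" are hypothetical.

```python
from typing import Dict, List


def chain_breaks(pipeline: List[str], group_of: Dict[str, str]) -> int:
    """Count adjacent operator pairs that fall into different slot sharing
    groups and therefore could not be chained."""
    return sum(
        1
        for upstream, downstream in zip(pipeline, pipeline[1:])
        if group_of[upstream] != group_of[downstream]
    )


pipeline = ["o_1", "o_2", "o_3", "o_4"]

# Groups derived from "has a resource profile or not" may alternate.
alternating = {"o_1": "A", "o_3": "A", "o_2": "B", "o_4": "B"}
# Groups chosen deliberately by a user are more likely to be contiguous.
contiguous = {"o_1": "A", "o_2": "A", "o_3": "B", "o_4": "B"}

print(chain_breaks(pipeline, alternating))  # 3: no pair can be chained
print(chain_breaks(pipeline, contiguous))   # 1: only o_2 -> o_3 breaks
```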
>>>>> This is also possible for the SSG-based approach, but I believe the chance is much smaller because there's no strong reason for users to specify the groups with alternate operators like that. We are more likely to get groups like (o_1, o_2) & (o_3, o_4), where the chain breaks only between o_2 and o_3.
>>>>>
>>>>> 3. It complicates the system by having two different mechanisms for sharing managed memory in a slot.
>>>>>
>>>>> - In FLIP-141, we introduced the intra-slot managed memory sharing mechanism, where managed memory is first distributed according to the consumer type, then further distributed across operators of that consumer type.
>>>>>
>>>>> - With the operator-based approach, managed memory specified for an operator should account for all the consumer types of that operator. That means the managed memory is first distributed across operators, then distributed to different consumer types of each operator.
>>>>>
>>>>> Unfortunately, the different order of the two calculation steps can lead to different results.
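A toy calculation showing how the order of the two distribution steps can change the outcome. This is a simplified Python model and not Flink's actual accounting: the weight values, the even split across operators of one consumer type, and the names `slot_first` and `operator_first` are assumptions made for illustration.

```python
from typing import Dict, List

# Assumed consumer weights, for illustration only.
WEIGHTS = {"DATAPROC": 70, "PYTHON": 30}


def slot_first(slot_mb: float, consumers: Dict[str, List[str]]) -> Dict[str, Dict[str, float]]:
    """Split the slot's memory by consumer type first, then evenly across
    the operators that use that type."""
    used_types = sorted({t for types in consumers.values() for t in types})
    total_weight = sum(WEIGHTS[t] for t in used_types)
    result: Dict[str, Dict[str, float]] = {op: {} for op in consumers}
    for t in used_types:
        type_mb = slot_mb * WEIGHTS[t] / total_weight
        ops = [op for op, types in consumers.items() if t in types]
        for op in ops:
            result[op][t] = type_mb / len(ops)
    return result


def operator_first(op_mb: Dict[str, float], consumers: Dict[str, List[str]]) -> Dict[str, Dict[str, float]]:
    """Start from each operator's own memory, then split it across the
    consumer types of that operator."""
    result: Dict[str, Dict[str, float]] = {}
    for op, types in consumers.items():
        total_weight = sum(WEIGHTS[t] for t in types)
        result[op] = {t: op_mb[op] * WEIGHTS[t] / total_weight for t in types}
    return result


# op_a only uses DATAPROC memory; op_b uses both consumer types.
consumers = {"op_a": ["DATAPROC"], "op_b": ["DATAPROC", "PYTHON"]}
by_slot = slot_first(100, consumers)
by_operator = operator_first({"op_a": 50, "op_b": 50}, consumers)
```

In this model the same 100 MB gives op_b 30 MB for PYTHON when split per slot, but only 15 MB when split per operator, even though the weights are identical.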
>>>>> To be specific, the semantics of the configuration option `consumer-weights` would change (within a slot vs. within an operator).
>>>>>
>>>>> To sum up things:
>>>>>
>>>>> While (3) might be a bit more implementation related, I think (1) and (2) somehow suggest that the price for the proposed approach to avoid specifying resource for every operator is that it's not as independent from operator chaining and slot sharing as the operator-based approach discussed in the FLIP.
>>>>>
>>>>> Thank you~
>>>>>
>>>>> Xintong Song
>>>>>
>>>>> On Tue, Jan 19, 2021 at 4:29 AM Stephan Ewen <[hidden email]> wrote:
>>>>>
>>>>>> Thanks a lot, Yangze and Xintong for this FLIP.
>>>>>>
>>>>>> I want to say, first of all, that this is super well written. And the points that the FLIP makes about how to expose the configuration to users is exactly the right thing to figure out first.
>>>>>> So good job here!
>>>>>>
>>>>>> About how to let users specify the resource profiles. If I can sum the FLIP and previous discussion up in my own words, the problem is the following:
>>>>>>
>>>>>>> Operator-level specification is the simplest and cleanest approach, because it avoids mixing operator configuration (resource) and scheduling. No matter what other parameters change (chaining, slot sharing, switching pipelined and blocking shuffles), the resource profiles stay the same.
>>>>>>> But it would require that a user specifies resources on all operators, which makes it hard to use. That's why the FLIP suggests going with specifying resources on a Sharing-Group.
>>>>>>
>>>>>> I think both thoughts are important, so can we find a solution where the Resource Profiles are specified on an Operator, but we still avoid that we need to specify a resource profile on every operator?
>>>>>>
>>>>>> What do you think about something like the following:
>>>>>> - Resource Profiles are specified on an operator level.
>>>>>> - Not all operators need profiles
>>>>>> - All Operators without a Resource Profile end up in the default slot sharing group with a default profile (will get a default slot).
>>>>>> - All Operators with a Resource Profile will go into another slot sharing group (the resource-specified-group).
>>>>>> - Users can define different slot sharing groups for operators like they do now, with the exception that you cannot mix operators that have a resource profile and operators that have no resource profile.
>>>>>> - The default case where no operator has a resource profile is just a special case of this model
>>>>>> - The chaining logic sums up the profiles per operator, like it does now, and the scheduler sums up the profiles of the tasks that it schedules together.
>>>>>>
>>>>>> There is another question about reactive scaling raised in the FLIP. I need to think a bit about that. That is indeed a bit more tricky once we have slots of different sizes.
>>>>>> It is not clear then which of the different slot requests the ResourceManager should fulfill when new resources (TMs) show up, or how the JobManager redistributes the slots resources when resources (TMs) disappear.
>>>>>> This question is pretty orthogonal, though, to the "how to specify the resources".
>>>>>>
>>>>>> Best,
>>>>>> Stephan
>>>>>>
>>>>>> On Fri, Jan 8, 2021 at 5:14 AM Xintong Song <[hidden email]> wrote:
>>>>>>
>>>>>>> Thanks for drafting the FLIP and driving the discussion, Yangze.
>>>>>>> And Thanks for the feedback, Till and Chesnay.
>>>>>>>
>>>>>>> @Till,
>>>>>>>
>>>>>>> I agree that specifying requirements for SSGs means that SSGs need to be supported in fine-grained resource management, otherwise each operator might use as many resources as the whole group. However, I cannot think of a strong reason for not supporting SSGs in fine-grained resource management.
>>>>>>>
>>>>>>>> Interestingly, if all operators have their resources properly specified, then slot sharing is no longer needed because Flink could slice off the appropriately sized slots for every Task individually.
>>>>>>>>
>>>>>>>> So for example, if we have a job consisting of two operators op_1 and op_2 where each op needs 100 MB of memory, we would then say that the slot sharing group needs 200 MB of memory to run. If we have a cluster with 2 TMs with one slot of 100 MB each, then the system cannot run this job. If the resources were specified on an operator level, then the system could still make the decision to deploy op_1 to TM_1 and op_2 to TM_2.
>>>>>>>
>>>>>>> Couldn't agree more that if all operators' requirements are properly specified, slot sharing should be no longer needed. I think this exactly disproves the example. If we already know op_1 and op_2 each needs 100 MB of memory, why would we put them in the same group?
>>>>>>> If they are in separate groups, with the proposed approach the system can freely deploy them to either a 200 MB TM or two 100 MB TMs.
>>>>>>>
>>>>>>> Moreover, the precondition for not needing slot sharing is having resource requirements properly specified for all operators. This is not always possible, and usually requires tremendous efforts. One of the benefits for SSG-based requirements is that it allows the user to freely decide the granularity, and thus the effort they want to pay. I would consider SSG in fine-grained resource management as a group of operators that the user would like to specify the total resource for. There can be only one group in the job, 2~3 groups dividing the job into a few major parts, or as many groups as the number of tasks/operators, depending on how fine-grained the user is able to specify the resources.
>>>>>>>
>>>>>>> Having to support SSGs might be a constraint.
>>>>>>> But given that all the current scheduler implementations already support SSGs, I tend to think of that as an acceptable price for the above discussed usability and flexibility.
>>>>>>>
>>>>>>> @Chesnay
>>>>>>>
>>>>>>>> Will declaring them on slot sharing groups not also waste resources if the parallelism of operators within that group are different?
>>>>>>>
>>>>>>> Yes. It's a trade-off between usability and utilization. To avoid such wasting, the user can define more groups, so that each group contains fewer operators and the chance of having operators with different parallelism will be reduced. The price is to have more resource requirements to specify.
>>>>>>>
>>>>>>>> It also seems like quite a hassle for users having to recalculate the resource requirements if they change the slot sharing.
>>>>>>>> I'd think that it's not really workable for users that create a set of re-usable operators which are mixed and matched in their applications; managing the resources requirements in such a setting would be a nightmare, and in the end would require operator-level requirements any way.
>>>>>>>> In that sense, I'm not even sure whether it really increases usability.
>>>>>>>
>>>>>>> - As mentioned in my reply to Till's comment, there's no reason to put multiple operators whose individual resource requirements are already known into the same group in fine-grained resource management.
>>>>>>> - Even if an operator implementation is reused for multiple applications, it does not guarantee the same resource requirements. During our years of practice at Alibaba, with per-operator requirements specified for Blink's fine-grained resource management, very few users (including our specialists who are dedicated to supporting Blink users) are experienced enough to accurately predict/estimate the operator resource requirements.
>>>>>>> Most people rely on the execution-time metrics (throughput, delay, cpu load, memory usage, GC pressure, etc.) to improve the specification.
>>>>>>>
>>>>>>> To sum up:
>>>>>>> If the user is capable of providing proper resource requirements for every operator, that's definitely a good thing and we would not need to rely on the SSGs. However, that shouldn't be a *must* for the fine-grained resource management to work. For those users who are capable and do not like having to set each operator to a separate SSG, I would be ok to have both SSG-based and operator-based runtime interfaces and to only fall back to the SSG requirements when the operator requirements are not specified. However, as the first step, I think we should prioritise the use cases where users are not that experienced.
>>>>>>>
>>>>>>> Thank you~
>>>>>>>
>>>>>>> Xintong Song
>>>>>>>
>>>>>>> On Thu, Jan 7, 2021 at 9:55 PM Chesnay Schepler <[hidden email]> wrote:
>>>>>>>
>>>>>>>> Will declaring them on slot sharing groups not also waste resources if the parallelism of operators within that group are different?
>>>>>>>>
>>>>>>>> It also seems like quite a hassle for users having to recalculate the resource requirements if they change the slot sharing.
>>>>>>>> I'd think that it's not really workable for users that create a set of re-usable operators which are mixed and matched in their applications; managing the resources requirements in such a setting would be a nightmare, and in the end would require operator-level requirements any way.
>>>>>>>> In that sense, I'm not even sure whether it really increases usability.
>>>>>>>>
>>>>>>>> My main worry is that if we wire the runtime to work on SSGs it's gonna be difficult to implement more fine-grained approaches, which would not be the case if, for the runtime, they are always defined on an operator-level.
>>>>>>>>
>>>>>>>> On 1/7/2021 2:42 PM, Till Rohrmann wrote:
>>>>>>>>> Thanks for drafting this FLIP and starting this discussion Yangze.
>>>>>>>>>
>>>>>>>>> I like that defining resource requirements on a slot sharing group makes the overall setup easier and improves usability of resource requirements.
>>>>>>>>>
>>>>>>>>> What I do not like about it is that it changes slot sharing groups from being a scheduling hint to something which needs to be supported in order to support fine grained resource requirements. So far, the idea of slot sharing groups was that it tells the system that a set of operators can be deployed in the same slot. But the system still had the freedom to say that it would rather place these tasks in different slots if it wanted.
>>>>>>>>> If we now specify resource requirements on a per slot sharing group, then the only option for a scheduler which does not support slot sharing groups is to say that every operator in this slot sharing group needs a slot with the same resources as the whole group.
>>>>>>>>>
>>>>>>>>> So for example, if we have a job consisting of two operators op_1 and op_2 where each op needs 100 MB of memory, we would then say that the slot sharing group needs 200 MB of memory to run. If we have a cluster with 2 TMs with one slot of 100 MB each, then the system cannot run this job. If the resources were specified on an operator level, then the system could still make the decision to deploy op_1 to TM_1 and op_2 to TM_2.
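The 100 MB example can be replayed with a toy placement function. This is a Python sketch under simplifying assumptions (memory is the only resource, each slot serves at most one request) and does not reflect how Flink's scheduler or resource manager is implemented; `place` is a hypothetical helper.

```python
from typing import Dict, List, Optional


def place(requests_mb: List[int], slots_mb: List[int]) -> Optional[Dict[int, int]]:
    """Assign each request to a distinct slot that is large enough.

    Requests are handled largest first. Returns a mapping from request
    index to slot index, or None if some request cannot be placed.
    """
    free = list(range(len(slots_mb)))
    assignment: Dict[int, int] = {}
    for req in sorted(range(len(requests_mb)), key=lambda i: -requests_mb[i]):
        slot = next((s for s in free if slots_mb[s] >= requests_mb[req]), None)
        if slot is None:
            return None
        free.remove(slot)
        assignment[req] = slot
    return assignment


cluster = [100, 100]  # two TMs with one 100 MB slot each

# One request for the whole slot sharing group: 200 MB does not fit anywhere.
ssg_level = place([200], cluster)
# One request per operator: op_1 and op_2 each fit into one slot.
operator_level = place([100, 100], cluster)
```

Putting op_1 and op_2 into two separate groups would produce the same two 100 MB requests as the operator-level case in this model.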
>>>>>>>>> Originally, one of the primary goals of slot sharing groups was to make it easier for the user to reason about how many slots a job needs independent of the actual number of operators in the job. Interestingly, if all operators have their resources properly specified, then slot sharing is no longer needed because Flink could slice off the appropriately sized slots for every Task individually. What matters is whether the whole cluster has enough resources to run all tasks or not.
>>>>>>>>>
>>>>>>>>> Cheers,
>>>>>>>>> Till
>>>>>>>>>
>>>>>>>>> On Thu, Jan 7, 2021 at 4:08 AM Yangze Guo <[hidden email]> wrote:
>>>>>>>>>> Hi, there,
>>>>>>>>>>
>>>>>>>>>> We would like to start a discussion thread on "FLIP-156: Runtime Interfaces for Fine-Grained Resource Requirements"[1], where we propose Slot Sharing Group (SSG) based runtime interfaces for specifying fine-grained resource requirements.
>>>>>>>>>>
>>>>>>>>>> In this FLIP:
>>>>>>>>>> - Expound the user story of fine-grained resource management.
>>>>>>>>>> - Propose runtime interfaces for specifying SSG-based resource requirements.
>>>>>>>>>> - Discuss the pros and cons of the three potential granularities for specifying the resource requirements (op, task and slot sharing group) and explain why we choose the slot sharing group.
>>>>>>>>>>
>>>>>>>>>> Please find more details in the FLIP wiki document [1]. Looking forward to your feedback.
>>>>>>>>>>
>>>>>>>>>> [1] https://cwiki.apache.org/confluence/display/FLINK/FLIP-156%3A+Runtime+Interfaces+for+Fine-Grained+Resource+Requirements
>>>>>>>>>>
>>>>>>>>>> Best,
>>>>>>>>>> Yangze Guo
Hi Kezhu,
Maybe let me share some background first.

- We at Alibaba have been using fine-grained resource management for many years, with Blink (an internal version of Flink).
- We have been trying to contribute this feature to Apache Flink for many years. However, we haven't succeeded, for various reasons.
  - Years ago, I believe there were not many users that used Flink in production at a very large scale, thus less demand for the feature.
  - The feature on Blink is quite specific to our internal use cases and scenarios. We have not made it general enough to cover the community's common use cases.
  - Divergences between Flink & Blink code bases.
- Blink used operator-level resource interfaces. According to our years of production experience, we believe that specifying operator-level resources is neither necessary nor easy to use. This is why we propose group-level interfaces.

Back to your questions.

> I saw the discussion to keep slot sharing as a hint, but in reality, will SSG jobs expect to fail or run slowly if scheduler does not respect it? A slot with 20GB memory is different from two 1GB default sized slots. So, we actually depend on scheduler version/implementation/de-facto behavior if we claim it is a hint.

SSG-based resource requirements are considered hints because the SSG itself is a hint. There's no guarantee that operators of an SSG will always be scheduled together. I think you have a good point: if SSGs are not respected, is it preferred to fail the job or to reinterpret the resource of an actual slot? It's possible that we provide a configuration option and leave that decision to the users. However, that is a design choice we need to make when there's indeed a need for not respecting the SSGs.

> Do you mean code-path or production environment? If it is code-path, could you please point out where the story breaks?
>
> From the discussion and history, could I consider FLIP-156 a redirection more than an inheritance/enhancement of the current half-baked/ancient implementation?

If you try to set the operator resources, you would find that it won't work at the moment. There are several things not ready.

- Interfaces for setting operator resources are never really exposed to users.
- The resource manager never allocates slots with the requested resources.
- Managed memory size specified for operators will not be respected, because managed memory is shared within a slot with a different approach.

While the first 2 points are more about the feature not being ready yet, the last point is closely related to specifying operator-level resources.

To sum up, we do not want to support specifying operator-level resources in the first step, for the following reasons.

- It's not likely needed, due to poor usability compared to the SSG-based approach.
- It introduces the complexity to deal with the managed memory sharing.
- It introduces the complexity to deal with combining resource requirements from two different levels.

Thank you~

Xintong Song

On Wed, Feb 3, 2021 at 7:50 PM Kezhu Wang <[hidden email]> wrote:

> Hi Till,
>
> Based on what I understood, if not wrong, the door is not closed after SSG resource specifying. So, hope it could be useful in potential future improvement.
>
> Best,
> Kezhu Wang
>
> On February 3, 2021 at 18:07:21, Till Rohrmann ([hidden email]) wrote:
>
> Thanks for sharing your thoughts Kezhu. I like your ideas of how per-operator and SSG requirements can be combined. I've also thought about defining a default resource profile for all tasks which have no resources configured. That way all operators would have resources assigned if the user chooses to use this feature.
>
> As Yangze and Xintong have said, we have decided to first only support specifying resources for SSGs as this seems more user friendly.
> Based on the feedback for this feature one potential development direction might be to allow the resource specification on per-operator basis. Here we could pick up your ideas.
>
> Cheers,
> Till
>
> On Wed, Feb 3, 2021 at 7:31 AM Xintong Song <[hidden email]> wrote:
>
> > Thanks for your feedback, Kezhu.
> >
> > > I think Flink *runtime* already has an ideal granularity for resource management: 'task'. If there is a slot shared by multiple tasks, that slot's resource requirement is the simple sum of all its logical slots. So basically, there is no resource requirement for SlotSharingGroup in runtime until now, right?
> >
> > That is a half-baked implementation, coming from the previous attempts (years ago) trying to deliver the fine-grained resource management feature, and never really put into use.
> >
> > > From the FLIP and discussion, I assume that SSG resource specifying will override operator level resource specifying if both are specified?
> >
> > Actually, I think we should use the finer-grained resources (i.e. operator level) if both are specified. And more importantly, that is based on the assumption that we do need two different levels of interfaces.
> >
> > > So, I wonder whether we could interpret SSG resource specifying as an "add" but not a "set" on resource requirement?
> >
> > IIUC, this is the core idea behind your proposal. I think it provides an interesting idea of how we combine operator level and SSG level resources, *if we allow configuring resources at both levels*. However, I'm not sure whether configuring resources on the operator level is indeed needed. Therefore, as a first step, this FLIP proposes to only introduce the SSG-level interfaces. As listed in the future plan, we would consider allowing operator level resource configuration later if we do see a need for it.
> > At that time, we definitely should discuss what to do if resources
> > are configured at both levels.
> >
> > > * Could SSG express negative resource requirement?
> >
> > No.
> >
> > > Is there a concrete obstacle that keeps partially configured resources
> > > from working? I saw it will fail job submission in Dispatcher.submitJob.
> >
> > With the SSG-based approach, this should no longer be needed. The
> > constraint was introduced because we can neither properly define the
> > resource of a task chained from an operator with specified resources and
> > another with unspecified resources, nor that of a slot shared by a task
> > with specified resources and another with unspecified resources. With the
> > SSG-based approach, we no longer have those problems.
> >
> > > An option (cluster/job level) to force slot sharing in the scheduler?
> > > This could be useful in case of migration from FLIP-156 to a future
> > > approach.
> >
> > I think this is exactly what we are trying to avoid: requiring the
> > scheduler to enforce slot sharing.
> >
> > > An option (cluster) to ignore resource specifying (allow a
> > > resource-specified job to run on an open box environment) for
> > > non-production usage?
> >
> > That's possible. Actually, we are planning to introduce an option for
> > activating fine-grained resource management, for development purposes.
> > We might consider keeping that option after the feature is completed, to
> > allow disabling the feature without having to touch the job code.
> >
> > Thank you~
> >
> > Xintong Song
> >
> > On Wed, Feb 3, 2021 at 1:28 PM Kezhu Wang <[hidden email]> wrote:
> >
> > > Hi all, sorry for joining the discussion even after voting started.
> > >
> > > I want to share my thoughts on this after reading the above discussions.
> > >
> > > I think Flink *runtime* already has an ideal granularity for resource
> > > management: 'task'.
> > > If there is a slot shared by multiple tasks, that slot's resource
> > > requirement is simply the sum of all its logical slots. So basically,
> > > there is no resource requirement for SlotSharingGroup in the runtime
> > > until now, right?
> > >
> > > As discussed, we already agree that: "If all operators have their
> > > resources properly specified, then slot sharing is no longer needed."
> > >
> > > So it seems to me that, naturally, what we should discuss is how to
> > > bridge impractical operator-level resource specifying to runtime
> > > task-level resource requirements. This is actually a pure API thing, as
> > > Chesnay has pointed out.
> > >
> > > But FLIP-156 brings another direction to the table: how about using SSGs
> > > for both API and runtime resource specifying?
> > >
> > > From the FLIP and discussion, I assume that SSG resource specifying will
> > > override operator-level resource specifying if both are specified?
> > >
> > > So, I wonder whether we could interpret SSG resource specifying as an
> > > "add" but not a "set" on the resource requirement?
> > >
> > > The semantics are that SSG resource specifying adds additional resources
> > > to the shared slot, to express concerns about possibly high throughput
> > > and resource requirements for tasks in one physical slot.
> > >
> > > The result is that if the scheduler indeed respects slot sharing, the
> > > allocated slot will gain the extra resources specified for that SSG.
> > >
> > > I think one coding barrier for the "add" approach is
> > > ResourceSpec.UNKNOWN, which does not support the 'merge' operation. I
> > > tend to use ResourceSpec.ZERO as the default; the task executor should
> > > be aware of this.
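
A minimal sketch of the "add" semantics described above may help make the proposal concrete. It is an illustration under stated assumptions, not code from Flink or the FLIP: `Res` is a hypothetical two-field stand-in for a resource spec (it is not Flink's `ResourceSpec`), and `physicalSlot` is an invented helper name. The point it shows is that a zero value can always be merged, which is the property the message says `ResourceSpec.UNKNOWN` lacks.

```java
import java.util.Arrays;
import java.util.List;

/** Illustrative only: Res is a hypothetical stand-in, not Flink's ResourceSpec. */
final class AddSemanticsSketch {

    /** A two-dimensional resource: CPU in millicores and memory in MB. */
    static final class Res {
        static final Res ZERO = new Res(0, 0);

        final int cpuMillis;
        final int memoryMb;

        Res(int cpuMillis, int memoryMb) {
            this.cpuMillis = cpuMillis;
            this.memoryMb = memoryMb;
        }

        /** ZERO is the identity of merge, so "unspecified" can always be merged. */
        Res merge(Res other) {
            return new Res(cpuMillis + other.cpuMillis, memoryMb + other.memoryMb);
        }

        boolean isZero() {
            return cpuMillis == 0 && memoryMb == 0;
        }

        @Override
        public String toString() {
            return cpuMillis + "m CPU, " + memoryMb + " MB";
        }
    }

    /**
     * "Add" semantics: the physical slot is the sum of its tasks' requirements
     * plus the extra amount declared on the SSG. A ZERO total means nothing was
     * specified anywhere, so the slot falls back to the default size.
     */
    static Res physicalSlot(List<Res> taskRequirements, Res ssgExtra, Res defaultSlot) {
        Res total = ssgExtra;
        for (Res task : taskRequirements) {
            total = total.merge(task);
        }
        return total.isZero() ? defaultSlot : total;
    }

    public static void main(String[] args) {
        Res defaultSlot = new Res(1000, 1024);
        // One task with a requirement, one unspecified task, plus extra on the SSG.
        List<Res> tasks = Arrays.asList(new Res(500, 256), Res.ZERO);
        System.out.println(physicalSlot(tasks, new Res(500, 128), defaultSlot));
        // Nothing specified at any level: the default-sized slot is used.
        System.out.println(physicalSlot(Arrays.asList(Res.ZERO), Res.ZERO, defaultSlot));
    }
}
```

Treating a zero total as "use the default slot" is one reading of the proposal; it also means a slot cannot be explicitly requested with zero resources, which is a design trade-off rather than a settled decision.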
> > >
> > > @Chesnay
> > > > My main worry is that if we wire the runtime to work on SSGs it's
> > > > gonna be difficult to implement more fine-grained approaches, which
> > > > would not be the case if, for the runtime, they are always defined on
> > > > an operator-level.
> > >
> > > An "add" operation should be less invasive and impose a low barrier for
> > > future fine-grained approaches.
> > >
> > > @Stephan
> > > > - Users can define different slot sharing groups for operators like
> > > > they do now, with the exception that you cannot mix operators that
> > > > have a resource profile and operators that have no resource profile.
> > >
> > > @Till
> > > > This effectively means that all unspecified operators
> > > > will implicitly have a zero resource requirement.
> > > > I am wondering whether this wouldn't lead to a surprising behaviour
> > > > for the user. If the user specifies the resource requirements for a
> > > > single operator, then he probably will assume that the other operators
> > > > will get the default share of resources and not nothing.
> > >
> > > I think it is inherent, due to the fact that we cannot define
> > > ResourceSpec.ONE, e.g. the resource requirement for exactly one default
> > > slot, with concrete numbers. I tend to squash out unspecified operators
> > > if there are operators in the chain with explicit resource specifying.
> > > Otherwise, the protocol tends to be verbose, as in saying "give me this
> > > much resource and a default". I think if we have explicit resource
> > > specifying for only some operators, it is just saying "I don't care
> > > about the other operators that much, just get them places to run". These
> > > are most likely cases where there are stateless filter/map or other less
> > > resource-consuming operators. If there is indeed a problem, I think
> > > clients can specify a global default (or another level of default in the
> > > future).
> > > In the job graph generation phase, we could take that default into
> > > account for unspecified operators.
> > >
> > > @FLIP-156
> > > > Expose operator chaining. (Cons of task level resource specifying)
> > >
> > > Is this inherent to all group-level resource specifying? They will
> > > either break chaining or obey it, or might not even work with it.
> > >
> > > To sum up the above, my suggestions are:
> > >
> > > On the API side:
> > > * StreamExecutionEnvironment: a global default (ResourceSpec.ZERO if
> > > unspecified).
> > > * Operator: ResourceSpec.ZERO (unspecified) as default.
> > > * Task: sum of requirements from specified operators + global default
> > > (if there are any unspecified operators).
> > > * SSG: additional resource for the physical slot.
> > >
> > > On the runtime side:
> > > * Task: ResourceSpec.Task or ResourceSpec.ZERO
> > > * SSG: ResourceSpec.SSG or ResourceSpec.ZERO
> > >
> > > A physical slot gets the summed-up resources from its logical slots and
> > > the SSG; if it gets ResourceSpec.ZERO, it is just a default-sized slot.
> > >
> > > In short, turn SSG resource specifying into "add" and drop
> > > ResourceSpec.UNKNOWN.
> > >
> > > Questions/Issues:
> > > * Could SSG express negative resource requirement?
> > > * Is there a concrete obstacle that keeps partially configured resources
> > > from working? I saw it will fail job submission in Dispatcher.submitJob.
> > > * An option (cluster/job level) to force slot sharing in the scheduler?
> > > This could be useful in case of migration from FLIP-156 to a future
> > > approach.
> > > * An option (cluster) to ignore resource specifying (allow a
> > > resource-specified job to run on an open box environment) for
> > > non-production usage?
> > >
> > > On February 1, 2021 at 11:54:10, Yangze Guo ([hidden email]) wrote:
> > >
> > > Thanks for the reply, Till and Xintong!
> > >
> > > I updated the FLIP, including:
> > > - Edit the JavaDoc of the proposed
> > > StreamGraphGenerator#setSlotSharingGroupResource.
> > > - Add "Future Plan" section, which contains the potential follow-up
> > > issues and the limitations to be documented when fine-grained resource
> > > management is exposed to users.
> > >
> > > I'll start a vote in another thread.
> > >
> > > Best,
> > > Yangze Guo
> > >
> > > On Fri, Jan 29, 2021 at 10:07 PM Till Rohrmann <[hidden email]> wrote:
> > > >
> > > > Thanks for summarizing the discussion, Yangze. I agree that setting
> > > > resource requirements per operator is not very user friendly.
> > > > Moreover, I couldn't come up with a different proposal which would be
> > > > as easy to use and wouldn't expose internal scheduling details. In
> > > > fact, following this argument then we shouldn't have exposed the slot
> > > > sharing groups in the first place.
> > > >
> > > > What is important for the user is that we properly document the
> > > > limitations and constraints the fine grained resource specification
> > > > has. For example, we should explain how optimizations like chaining
> > > > are affected by it and how different execution modes (batch vs.
> > > > streaming) affect the execution of operators which have specified
> > > > resources. These things shouldn't become part of the contract of this
> > > > feature and are more caused by internal implementation details but it
> > > > will be important to understand these things properly in order to use
> > > > this feature effectively.
> > > >
> > > > Hence, +1 for starting the vote for this FLIP.
> > > >
> > > > Cheers,
> > > > Till
> > > >
> > > > On Tue, Jan 26, 2021 at 4:37 AM Xintong Song <[hidden email]> wrote:
> > > > >
> > > > > Thanks for the summary, Yangze.
> > > > >
> > > > > The changes and follow-up issues LGTM. Let's wait for responses from
> > > > > the others before starting a vote.
> > > > > > > > > > Thank you~ > > > > > > > > > > Xintong Song > > > > > > > > > > > > > > > > > > > > On Tue, Jan 26, 2021 at 11:08 AM Yangze Guo <[hidden email]> > > > wrote: > > > > > > > > > > > Thanks everyone for the lively discussion. I'd like to try to > > > > > > summarize the current convergence in the discussion. Please let > me > > > > > > know if I got things wrong or missed something crucial here. > > > > > > > > > > > > Change of this FLIP: > > > > > > - Treat the SSG resource requirements as a hint instead of a > > > > > > restriction for the runtime. That's should be explicitly > explained > > in > > > > > > the JavaDocs. > > > > > > > > > > > > Potential follow-up issues if needed: > > > > > > - Provide operator-level resource configuration interface. > > > > > > - Provide multiple options for deciding resources for SSGs whose > > > > > > requirement is not specified: > > > > > > ** Default slot resource. > > > > > > ** Default operator resource times number of operators. > > > > > > > > > > > > If there are no other issues, I'll update the FLIP accordingly > and > > > > > > start a vote thread. Thanks all for the valuable feedback again. > > > > > > > > > > > > Best, > > > > > > Yangze Guo > > > > > > > > > > > > Best, > > > > > > Yangze Guo > > > > > > > > > > > > > > > > > > On Fri, Jan 22, 2021 at 11:30 AM Xintong Song < > > [hidden email] > > > > > > > > > > wrote: > > > > > > > > > > > > > > > > > > > > > FGRuntimeInterface.png > > > > > > > > > > > > > > Thank you~ > > > > > > > > > > > > > > Xintong Song > > > > > > > > > > > > > > > > > > > > > > > > > > > > On Fri, Jan 22, 2021 at 11:11 AM Xintong Song < > > > [hidden email]> > > > > > > > > > wrote: > > > > > > >> > > > > > > >> I think Chesnay's proposal could actually work. IIUC, the > > keypoint > > > is > > > > > > to derive operator requirements from SSG requirements on the API > > > side, so > > > > > > that the runtime only deals with operator requirements. 
It's > > > debatable > > > > > how > > > > > > the deriving should be done though. E.g., an alternative could be > > to > > > > > evenly > > > > > > divide the SSG requirement into requirements of operators in the > > > group. > > > > > > >> > > > > > > >> > > > > > > >> However, I'm not entirely sure which option is more desired. > > > > > > Illustrating my understanding in the following figure, in which > on > > > the > > > > > top > > > > > > is Chesnay's proposal and on the bottom is the SSG-based proposal > > in > > > this > > > > > > FLIP. > > > > > > >> > > > > > > >> > > > > > > >> > > > > > > >> I think the major difference between the two approaches is > where > > > > > > deriving operator requirements from SSG requirements happens. > > > > > > >> > > > > > > >> - Chesnay's proposal simplifies the runtime logic and the > > > interface to > > > > > > expose, at the price of moving more complexity (i.e. the > deriving) > > to > > > the > > > > > > API side. The question is, where do we prefer to keep the > > complexity? > > > I'm > > > > > > slightly leaning towards having a thin API and keep the > complexity > > in > > > > > > runtime if possible. > > > > > > >> > > > > > > >> - Notice that the dash line arrows represent optional steps > that > > > are > > > > > > needed only for schedulers that do not respect SSGs, which we > don't > > > have > > > > > at > > > > > > the moment. If we only look at the solid line arrows, then the > > > SSG-based > > > > > > approach is much simpler, without needing to derive and aggregate > > the > > > > > > requirements back and forth. I'm not sure about complicating the > > > current > > > > > > design only for the potential future needs. 
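
The two ways of deriving operator requirements from an SSG requirement that come up in this exchange (one operator carries the whole amount, or the amount is divided evenly) can be sketched as follows. This is only an illustration: it tracks a single memory dimension in MB, and `DeriveFromSsgSketch`, `firstTakesAll`, `evenSplit` and `aggregate` are hypothetical names, not part of any Flink API. It also shows the "derive and aggregate back and forth" round trip: summing the derived values returns the original SSG requirement under either strategy.

```java
import java.util.Arrays;

/** Illustrative only: a single memory dimension in MB, hypothetical names. */
final class DeriveFromSsgSketch {

    /** The first operator carries the whole SSG requirement, the rest get zero. */
    static int[] firstTakesAll(int ssgMemoryMb, int operatorCount) {
        requirePositive(operatorCount);
        int[] perOperator = new int[operatorCount]; // entries start at 0
        perOperator[0] = ssgMemoryMb;
        return perOperator;
    }

    /** The SSG requirement is divided evenly; any remainder goes to the first operators. */
    static int[] evenSplit(int ssgMemoryMb, int operatorCount) {
        requirePositive(operatorCount);
        int base = ssgMemoryMb / operatorCount;
        int remainder = ssgMemoryMb % operatorCount;
        int[] perOperator = new int[operatorCount];
        for (int i = 0; i < operatorCount; i++) {
            perOperator[i] = base + (i < remainder ? 1 : 0);
        }
        return perOperator;
    }

    /** Aggregating back: a scheduler that respects the SSG sums the operators again. */
    static int aggregate(int[] perOperator) {
        return Arrays.stream(perOperator).sum();
    }

    private static void requirePositive(int operatorCount) {
        if (operatorCount < 1) {
            throw new IllegalArgumentException("an SSG needs at least one operator");
        }
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(firstTakesAll(1000, 3)));
        System.out.println(Arrays.toString(evenSplit(1000, 3)));
        System.out.println(aggregate(evenSplit(1000, 3)));
    }
}
```

The strategies only differ for a scheduler that does not keep the group together: with `firstTakesAll` the remaining operators would be placed with zero resources, while `evenSplit` gives each operator a share that may not match its real need.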
> > > > > > >> > > > > > > >> > > > > > > >> Thank you~ > > > > > > >> > > > > > > >> Xintong Song > > > > > > >> > > > > > > >> > > > > > > >> > > > > > > >> > > > > > > >> On Fri, Jan 22, 2021 at 7:35 AM Chesnay Schepler < > > > [hidden email]> > > > > > > wrote: > > > > > > >>> > > > > > > >>> You're raising a good point, but I think I can rectify that > > with > > > a > > > > > > minor > > > > > > >>> adjustment. > > > > > > >>> > > > > > > >>> Default requirements are whatever the default requirements > are, > > > > > setting > > > > > > >>> the requirements for one operator has no effect on other > > > operators. > > > > > > >>> > > > > > > >>> With these rules, and some API enhancements, the following > > mockup > > > > > would > > > > > > >>> replicate the SSG-based behavior: > > > > > > >>> > > > > > > >>> Map<SlotSharingGroupId, Requirements> requirements = ... > > > > > > >>> for slotSharingGroup in env.getSlotSharingGroups() { > > > > > > >>> vertices = slotSharingGroup.getVertices() > > > > > > >>> > > > > > > > > > > > > > > > > > vertices.first().setRequirements(requirements.get(slotSharingGroup.getID()) > > > > > > >>> vertices.remainint().setRequirements(ZERO) > > > > > > >>> } > > > > > > >>> > > > > > > >>> We could even allow setting requirements on > slotsharing-groups > > > > > > >>> colocation-groups and internally translate them accordingly. > > > > > > >>> I can't help but feel this is a plain API issue. > > > > > > >>> > > > > > > >>> On 1/21/2021 9:44 AM, Till Rohrmann wrote: > > > > > > >>> > If I understand you correctly Chesnay, then you want to > > > decouple > > > > > the > > > > > > >>> > resource requirement specification from the slot sharing > > group > > > > > > >>> > assignment. Hence, per default all operators would be in > the > > > same > > > > > > slot > > > > > > >>> > sharing group. If there is no operator with a resource > > > > > specification, > > > > > > >>> > then the system would allocate a default slot for it. 
If > > there > > > is > > > > > at > > > > > > >>> > least one operator, then the system would sum up all the > > > specified > > > > > > >>> > resources and allocate a slot of this size. This > effectively > > > means > > > > > > >>> > that all unspecified operators will implicitly have a zero > > > resource > > > > > > >>> > requirement. Did I understand your idea correctly? > > > > > > >>> > > > > > > > >>> > I am wondering whether this wouldn't lead to a surprising > > > behaviour > > > > > > >>> > for the user. If the user specifies the resource > requirements > > > for a > > > > > > >>> > single operator, then he probably will assume that the > other > > > > > > operators > > > > > > >>> > will get the default share of resources and not nothing. > > > > > > >>> > > > > > > > >>> > Cheers, > > > > > > >>> > Till > > > > > > >>> > > > > > > > >>> > On Thu, Jan 21, 2021 at 3:25 AM Chesnay Schepler < > > > > > [hidden email] > > > > > > >>> > <mailto:[hidden email]>> wrote: > > > > > > >>> > > > > > > > >>> > Is there even a functional difference between specifying > the > > > > > > >>> > requirements for an SSG vs specifying the same requirements > > on > > > > > a > > > > > > >>> > single > > > > > > >>> > operator within that group (ideally a colocation group to > > avoid > > > > > > this > > > > > > >>> > whole hint business)? > > > > > > >>> > > > > > > > >>> > Wouldn't we get the best of both worlds in the latter case? > > > > > > >>> > > > > > > > >>> > Users can take shortcuts to define shared requirements, > > > > > > >>> > but refine them further as needed on a per-operator basis, > > > > > > >>> > without changing semantics of slotsharing groups > > > > > > >>> > nor the runtime being locked into SSG-based requirements. > > > > > > >>> > > > > > > > >>> > (And before anyone argues what happens if slotsharing > groups > > > > > > >>> > change or > > > > > > >>> > whatnot, that's a plain API issue that we could surely > solve. 
> > > > > (A > > > > > > >>> > plain > > > > > > >>> > iteration over slotsharing groups and therein contained > > > > > operators > > > > > > >>> > would > > > > > > >>> > suffice)). > > > > > > >>> > > > > > > > >>> > On 1/20/2021 6:48 PM, Till Rohrmann wrote: > > > > > > >>> > > Maybe a different minor idea: Would it be possible to > treat > > > > > > the SSG > > > > > > >>> > > resource requirements as a hint for the runtime similar > to > > > > > how > > > > > > >>> > slot sharing > > > > > > >>> > > groups are designed at the moment? Meaning that we don't > > give > > > > > > >>> > the guarantee > > > > > > >>> > > that Flink will always deploy this set of tasks together > no > > > > > > >>> > matter what > > > > > > >>> > > comes. If, for example, the runtime can derive by some > > means > > > > > > the > > > > > > >>> > resource > > > > > > >>> > > requirements for each task based on the requirements for > > the > > > > > > >>> > SSG, this > > > > > > >>> > > could be possible. One easy strategy would be to give > every > > > > > > task > > > > > > >>> > the same > > > > > > >>> > > resources as the whole slot sharing group. Another one > > could > > > > > be > > > > > > >>> > > distributing the resources equally among the tasks. This > > does > > > > > > >>> > not even have > > > > > > >>> > > to be implemented but we would give ourselves the freedom > > to > > > > > > change > > > > > > >>> > > scheduling if need should arise. > > > > > > >>> > > > > > > > > >>> > > Cheers, > > > > > > >>> > > Till > > > > > > >>> > > > > > > > > >>> > > On Wed, Jan 20, 2021 at 7:04 AM Yangze Guo < > > > > > [hidden email] > > > > > > >>> > <mailto:[hidden email]>> wrote: > > > > > > >>> > > > > > > > > >>> > >> Thanks for the responses, Till and Xintong. > > > > > > >>> > >> > > > > > > >>> > >> I second Xintong's comment that SSG-based runtime > > interface > > > > > > >>> > will give > > > > > > >>> > >> us the flexibility to achieve op/task-based approach. 
> > That's > > > > > > one of > > > > > > >>> > >> the most important reasons for our design choice. > > > > > > >>> > >> > > > > > > >>> > >> Some cents regarding the default operator resource: > > > > > > >>> > >> - It might be good for the scenario of DataStream jobs. > > > > > > >>> > >> ** For light-weight operators, the accumulative > > > > > > >>> > configuration error > > > > > > >>> > >> will not be significant. Then, the resource of a task > used > > > > > is > > > > > > >>> > >> proportional to the number of operators it contains. > > > > > > >>> > >> ** For heavy operators like join and window or operators > > > > > > >>> > using the > > > > > > >>> > >> external resources, user will turn to the fine-grained > > > > > > resource > > > > > > >>> > >> configuration. > > > > > > >>> > >> - It can increase the stability for the standalone > cluster > > > > > > >>> > where task > > > > > > >>> > >> executors registered are heterogeneous(with different > > > > > default > > > > > > slot > > > > > > >>> > >> resources). > > > > > > >>> > >> - It might not be good for SQL users. The operators that > > SQL > > > > > > >>> > will be > > > > > > >>> > >> transferred to is a black box to the user. We also do > not > > > > > > guarantee > > > > > > >>> > >> the cross-version of consistency of the transformation > so > > > > > far. > > > > > > >>> > >> > > > > > > >>> > >> I think it can be treated as a follow-up work when the > > > > > > fine-grained > > > > > > >>> > >> resource management is end-to-end ready. > > > > > > >>> > >> > > > > > > >>> > >> Best, > > > > > > >>> > >> Yangze Guo > > > > > > >>> > >> > > > > > > >>> > >> > > > > > > >>> > >> On Wed, Jan 20, 2021 at 11:16 AM Xintong Song > > > > > > >>> > <[hidden email] <mailto:[hidden email]>> > > > > > > >>> > >> wrote: > > > > > > >>> > >>> Thanks for the feedback, Till. 
> > > > > > >>> > >>> > > > > > > >>> > >>> ## I feel that what you proposed (operator-based + > > default > > > > > > >>> > value) might > > > > > > >>> > >> be > > > > > > >>> > >>> subsumed by the SSG-based approach. > > > > > > >>> > >>> Thinking of op_1 -> op_2, there are the following 4 > > cases, > > > > > > >>> > categorized by > > > > > > >>> > >>> whether the resource requirements are known to the > users. > > > > > > >>> > >>> > > > > > > >>> > >>> 1. *Both known.* As previously mentioned, there's no > > > > > > >>> > reason to put > > > > > > >>> > >>> multiple operators whose individual resource > > > > > requirements > > > > > > >>> > are already > > > > > > >>> > >> known > > > > > > >>> > >>> into the same group in fine-grained resource > > > > > management. > > > > > > >>> > And if op_1 > > > > > > >>> > >> and > > > > > > >>> > >>> op_2 are in different groups, there should be no > > > > > problem > > > > > > >>> > switching > > > > > > >>> > >> data > > > > > > >>> > >>> exchange mode from pipelined to blocking. This is > > > > > > >>> > equivalent to > > > > > > >>> > >> specifying > > > > > > >>> > >>> operator resource requirements in your proposal. > > > > > > >>> > >>> 2. *op_1 known, op_2 unknown.* Similar to 1), except > > > > > that > > > > > > >>> > op_2 is in a > > > > > > >>> > >>> SSG whose resource is not specified thus would have the > > > > > > >>> > default slot > > > > > > >>> > >>> resource. This is equivalent to having default operator > > > > > > >>> > resources in > > > > > > >>> > >> your > > > > > > >>> > >>> proposal. > > > > > > >>> > >>> 3. *Both unknown*. The user can either set op_1 and > > > > > op_2 > > > > > > >>> > to the same > > > > > > >>> > >> SSG > > > > > > >>> > >>> or separate SSGs. 
> > > > > > >>> > >>> - If op_1 and op_2 are in the same SSG, it will be > > > > > > >>> > equivalent to > > > > > > >>> > >> the > > > > > > >>> > >>> coarse-grained resource management, where op_1 and > > > > > > op_2 > > > > > > >>> > share a > > > > > > >>> > >> default > > > > > > >>> > >>> size slot no matter which data exchange mode is > > > > > used. > > > > > > >>> > >>> - If op_1 and op_2 are in different SSGs, then each > > > > > of > > > > > > >>> > them will > > > > > > >>> > >> use > > > > > > >>> > >>> a default size slot. This is equivalent to setting > > > > > > them > > > > > > >>> > with > > > > > > >>> > >> default > > > > > > >>> > >>> operator resources in your proposal. > > > > > > >>> > >>> 4. *Total (pipeline) or max (blocking) of op_1 and op_2 > > > > > > is > > > > > > >>> > known.* > > > > > > >>> > >>> - It is possible that the user learns the total / > > > > > max > > > > > > >>> > resource > > > > > > >>> > >>> requirement from executing and monitoring the job, > > > > > > >>> > while not > > > > > > >>> > >>> being aware of > > > > > > >>> > >>> individual operator requirements. > > > > > > >>> > >>> - I believe this is the case your proposal does not > > > > > > >>> > cover. And TBH, > > > > > > >>> > >>> this is probably how most users learn the resource > > > > > > >>> > requirements, > > > > > > >>> > >>> according > > > > > > >>> > >>> to my experiences. > > > > > > >>> > >>> - In this case, the user might need to specify > > > > > > >>> > different resources > > > > > > >>> > >> if > > > > > > >>> > >>> he wants to switch the execution mode, which should > > > > > > not > > > > > > >>> > be worse > > > > > > >>> > >> than not > > > > > > >>> > >>> being able to use fine-grained resource management. > > > > > > >>> > >>> > > > > > > >>> > >>> > > > > > > >>> > >>> ## An additional idea inspired by your proposal. 
> > > > > > >>> > >>> We may provide multiple options for deciding resources > > for > > > > > > >>> > SSGs whose > > > > > > >>> > >>> requirement is not specified, if needed. > > > > > > >>> > >>> > > > > > > >>> > >>> - Default slot resource (current design) > > > > > > >>> > >>> - Default operator resource times number of operators > > > > > > >>> > (equivalent to > > > > > > >>> > >>> your proposal) > > > > > > >>> > >>> > > > > > > >>> > >>> > > > > > > >>> > >>> ## Exposing internal runtime strategies > > > > > > >>> > >>> Theoretically, yes. Tying to the SSGs, the resource > > > > > > >>> > requirements might be > > > > > > >>> > >>> affected if how SSGs are internally handled changes in > > > > > > future. > > > > > > >>> > >> Practically, > > > > > > >>> > >>> I do not concretely see at the moment what kind of > > changes > > > > > we > > > > > > >>> > may want in > > > > > > >>> > >>> future that might conflict with this FLIP proposal, as > > the > > > > > > >>> > question of > > > > > > >>> > >>> switching data exchange mode answered above. I'd > suggest > > to > > > > > > >>> > not give up > > > > > > >>> > >> the > > > > > > >>> > >>> user friendliness we may gain now for the future > problems > > > > > > that > > > > > > >>> > may or may > > > > > > >>> > >>> not exist. > > > > > > >>> > >>> > > > > > > >>> > >>> Moreover, the SSG-based approach has the flexibility to > > > > > > >>> > achieve the > > > > > > >>> > >>> equivalent behavior as the operator-based approach, if > we > > > > > > set each > > > > > > >>> > >> operator > > > > > > >>> > >>> (or task) to a separate SSG. We can even provide a > > shortcut > > > > > > >>> > option to > > > > > > >>> > >>> automatically do that for users, if needed. 
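
The two options listed in this message for SSGs whose requirement is not specified can be written down as a small sketch. Everything here is hypothetical and for illustration only (the `Strategy` enum, `resolveMemoryMb`, and the single memory dimension in MB); the thread does not define such an interface.

```java
/** Illustrative only: hypothetical names, one memory dimension in MB. */
final class UnspecifiedSsgSketch {

    /** How to size an SSG whose requirement was not specified by the user. */
    enum Strategy {
        DEFAULT_SLOT,                 // use the default slot resource (current design)
        DEFAULT_OPERATOR_TIMES_COUNT  // default operator resource times number of operators
    }

    /**
     * Returns the memory for one SSG. A null specifiedMb means "not specified",
     * in which case the chosen strategy decides the size.
     */
    static int resolveMemoryMb(
            Integer specifiedMb,
            int operatorCount,
            Strategy strategy,
            int defaultSlotMb,
            int defaultOperatorMb) {
        if (specifiedMb != null) {
            return specifiedMb;
        }
        switch (strategy) {
            case DEFAULT_SLOT:
                return defaultSlotMb;
            case DEFAULT_OPERATOR_TIMES_COUNT:
                return defaultOperatorMb * operatorCount;
            default:
                throw new IllegalArgumentException("unknown strategy: " + strategy);
        }
    }

    public static void main(String[] args) {
        // An unspecified SSG with 4 operators, default slot 1024 MB, default operator 128 MB.
        System.out.println(resolveMemoryMb(null, 4, Strategy.DEFAULT_SLOT, 1024, 128));
        System.out.println(resolveMemoryMb(null, 4, Strategy.DEFAULT_OPERATOR_TIMES_COUNT, 1024, 128));
        // An explicitly specified SSG ignores both defaults.
        System.out.println(resolveMemoryMb(2048, 4, Strategy.DEFAULT_SLOT, 1024, 128));
    }
}
```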
> > > > > > >>> > >>> > > > > > > >>> > >>> > > > > > > >>> > >>> Thank you~ > > > > > > >>> > >>> > > > > > > >>> > >>> Xintong Song > > > > > > >>> > >>> > > > > > > >>> > >>> > > > > > > >>> > >>> > > > > > > >>> > >>> On Tue, Jan 19, 2021 at 11:48 PM Till Rohrmann > > > > > > >>> > <[hidden email] <mailto:[hidden email]>> > > > > > > >>> > >> wrote: > > > > > > >>> > >>>> Thanks for the responses Xintong and Stephan, > > > > > > >>> > >>>> > > > > > > >>> > >>>> I agree that being able to define the resource > > > > > requirements > > > > > > for a > > > > > > >>> > >> group of > > > > > > >>> > >>>> operators is more user friendly. However, my concern > is > > > > > that > > > > > > >>> > we are > > > > > > >>> > >>>> exposing thereby internal runtime strategies which > might > > > > > > >>> > limit our > > > > > > >>> > >>>> flexibility to execute a given job. Moreover, the > > > > > semantics > > > > > > of > > > > > > >>> > >> configuring > > > > > > >>> > >>>> resource requirements for SSGs could break if > switching > > > > > from > > > > > > >>> > streaming > > > > > > >>> > >> to > > > > > > >>> > >>>> batch execution. If one defines the resource > > requirements > > > > > > for > > > > > > >>> > op_1 -> > > > > > > >>> > >> op_2 > > > > > > >>> > >>>> which run in pipelined mode when using the streaming > > > > > > >>> > execution, then > > > > > > >>> > >> how do > > > > > > >>> > >>>> we interpret these requirements when op_1 -> op_2 are > > > > > > >>> > executed with a > > > > > > >>> > >>>> blocking data exchange in batch execution mode? > > > > > > Consequently, > > > > > > >>> > I am > > > > > > >>> > >> still > > > > > > >>> > >>>> leaning towards Stephan's proposal to set the resource > > > > > > >>> > requirements per > > > > > > >>> > >>>> operator. 
> > > > > > >>> > >>>> > > > > > > >>> > >>>> Maybe the following proposal makes the configuration > > > > > easier: > > > > > > >>> > If the > > > > > > >>> > >> user > > > > > > >>> > >>>> wants to use fine-grained resource requirements, then > > she > > > > > > >>> > needs to > > > > > > >>> > >> specify > > > > > > >>> > >>>> the default size which is used for operators which > have > > no > > > > > > >>> > explicit > > > > > > >>> > >>>> resource annotation. If this holds true, then every > > > > > operator > > > > > > >>> > would > > > > > > >>> > >> have a > > > > > > >>> > >>>> resource requirement and the system can try to execute > > the > > > > > > >>> > operators > > > > > > >>> > >> in the > > > > > > >>> > >>>> best possible manner w/o being constrained by how the > > user > > > > > > >>> > set the SSG > > > > > > >>> > >>>> requirements. > > > > > > >>> > >>>> > > > > > > >>> > >>>> Cheers, > > > > > > >>> > >>>> Till > > > > > > >>> > >>>> > > > > > > >>> > >>>> On Tue, Jan 19, 2021 at 9:09 AM Xintong Song > > > > > > >>> > <[hidden email] <mailto:[hidden email]>> > > > > > > >>> > >>>> wrote: > > > > > > >>> > >>>> > > > > > > >>> > >>>>> Thanks for the feedback, Stephan. > > > > > > >>> > >>>>> > > > > > > >>> > >>>>> Actually, your proposal has also come to my mind at > > some > > > > > > >>> > point. And I > > > > > > >>> > >>>> have > > > > > > >>> > >>>>> some concerns about it. > > > > > > >>> > >>>>> > > > > > > >>> > >>>>> > > > > > > >>> > >>>>> 1. It does not give users the same control as the > > > > > SSG-based > > > > > > >>> > approach. 
> > > > > > >>> > >>>>> > > > > > > >>> > >>>>> > > > > > > >>> > >>>>> While both approaches do not require specifying for > > each > > > > > > >>> > operator, > > > > > > >>> > >>>>> SSG-based approach supports the semantic that "some > > > > > > operators > > > > > > >>> > >> together > > > > > > >>> > >>>> use > > > > > > >>> > >>>>> this much resource" while the operator-based approach > > > > > > doesn't. > > > > > > >>> > >>>>> > > > > > > >>> > >>>>> > > > > > > >>> > >>>>> Think of a long pipeline with m operators (o_1, o_2, > > ..., > > > > > > >>> > o_m), and > > > > > > >>> > >> at > > > > > > >>> > >>>> some > > > > > > >>> > >>>>> point there's an agg o_n (1 < n < m) which > > significantly > > > > > > >>> > reduces the > > > > > > >>> > >> data > > > > > > >>> > >>>>> amount. One can separate the pipeline into 2 groups > > SSG_1 > > > > > > >>> > (o_1, ..., > > > > > > >>> > >> o_n) > > > > > > >>> > >>>>> and SSG_2 (o_n+1, ... o_m), so that configuring much > > > > > higher > > > > > > >>> > >> parallelisms > > > > > > >>> > >>>>> for operators in SSG_1 than for operators in SSG_2 > > won't > > > > > > >>> > lead to too > > > > > > >>> > >> much > > > > > > >>> > >>>>> wasting of resources. If the two SSGs end up needing > > > > > > different > > > > > > >>> > >> resources, > > > > > > >>> > >>>>> with the SSG-based approach one can directly specify > > > > > > >>> > resources for > > > > > > >>> > >> the > > > > > > >>> > >>>> two > > > > > > >>> > >>>>> groups. However, with the operator-based approach, > the > > > > > > user will > > > > > > >>> > >> have to > > > > > > >>> > >>>>> specify resources for each operator in one of the two > > > > > > >>> > groups, and > > > > > > >>> > >> tune > > > > > > >>> > >>>> the > > > > > > >>> > >>>>> default slot resource via configurations to fit the > > other > > > > > > group. > > > > > > >>> > >>>>> > > > > > > >>> > >>>>> > > > > > > >>> > >>>>> 2. It increases the chance of breaking operator > chains. 
> > > > > > >>> > >>>>> > > > > > > >>> > >>>>> > > > > > > >>> > >>>>> Setting chainnable operators into different slot > > sharing > > > > > > >>> > groups will > > > > > > >>> > >>>>> prevent them from being chained. In the current > > > > > > implementation, > > > > > > >>> > >>>> downstream > > > > > > >>> > >>>>> operators, if SSG not explicitly specified, will be > set > > > > > to > > > > > > >>> > the same > > > > > > >>> > >> group > > > > > > >>> > >>>>> as the chainable upstream operators (unless multiple > > > > > > upstream > > > > > > >>> > >> operators > > > > > > >>> > >>>> in > > > > > > >>> > >>>>> different groups), to reduce the chance of breaking > > > > > chains. > > > > > > >>> > >>>>> > > > > > > >>> > >>>>> > > > > > > >>> > >>>>> Thinking of chainable operators o_1 -> o_2 -> o_3 -> > > o_3, > > > > > > >>> > deciding > > > > > > >>> > >> SSGs > > > > > > >>> > >>>>> based on whether resource is specified we will easily > > get > > > > > > >>> > groups like > > > > > > >>> > >>>> (o_1, > > > > > > >>> > >>>>> o_3) & (o_2, o_4), where none of the operators can be > > > > > > >>> > chained. This > > > > > > >>> > >> is > > > > > > >>> > >>>> also > > > > > > >>> > >>>>> possible for the SSG-based approach, but I believe > the > > > > > > >>> > chance is much > > > > > > >>> > >>>>> smaller because there's no strong reason for users to > > > > > > >>> > specify the > > > > > > >>> > >> groups > > > > > > >>> > >>>>> with alternate operators like that. We are more > likely > > to > > > > > > >>> > get groups > > > > > > >>> > >> like > > > > > > >>> > >>>>> (o_1, o_2) & (o_3, o_4), where the chain breaks only > > > > > > between > > > > > > >>> > o_2 and > > > > > > >>> > >> o_3. > > > > > > >>> > >>>>> > > > > > > >>> > >>>>> 3. It complicates the system by having two different > > > > > > >>> > mechanisms for > > > > > > >>> > >>>> sharing > > > > > > >>> > >>>>> managed memory in a slot. 
> > > > > > >>> > >>>>>
> > > > > > >>> > >>>>> - In FLIP-141, we introduced the intra-slot managed memory sharing mechanism, where managed memory is first distributed according to the consumer type, then further distributed across operators of that consumer type.
> > > > > > >>> > >>>>>
> > > > > > >>> > >>>>> - With the operator-based approach, the managed memory size specified for an operator should account for all the consumer types of that operator. That means the managed memory is first distributed across operators, then distributed to the different consumer types of each operator.
> > > > > > >>> > >>>>>
> > > > > > >>> > >>>>> Unfortunately, the different order of the two calculation steps can lead to different results. To be specific, the semantics of the configuration option `consumer-weights` change (within a slot vs. within an operator).
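The ordering issue described above can be illustrated with a small sketch. It is not taken from the FLIP or from Flink's code. It assumes one slot with 1000 MB of managed memory, two hypothetical operators `A` (one consumer type) and `B` (two consumer types), illustrative consumer type names and weights (`DATAPROC` = 70, `PYTHON` = 30), and an even split wherever the thread does not say how memory is divided.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

public class ManagedMemoryOrder {
    // Illustrative consumer weights (assumption), in the spirit of `consumer-weights`.
    static final Map<String, Integer> WEIGHTS = Map.of("DATAPROC", 70, "PYTHON", 30);

    // Hypothetical operators sharing one slot, with the consumer types each one uses.
    static final Map<String, Set<String>> OPERATORS = new LinkedHashMap<>();
    static {
        OPERATORS.put("A", Set.of("DATAPROC"));
        OPERATORS.put("B", Set.of("DATAPROC", "PYTHON"));
    }

    /** Slot-first: split the slot by consumer type, then evenly across operators of that type. */
    static Map<String, Long> slotFirst(long slotMb) {
        Map<String, Long> result = new TreeMap<>();
        int totalWeight = WEIGHTS.values().stream().mapToInt(Integer::intValue).sum();
        for (Map.Entry<String, Integer> type : WEIGHTS.entrySet()) {
            long typeShare = slotMb * type.getValue() / totalWeight;
            long users = OPERATORS.values().stream()
                    .filter(types -> types.contains(type.getKey()))
                    .count();
            for (Map.Entry<String, Set<String>> op : OPERATORS.entrySet()) {
                if (op.getValue().contains(type.getKey())) {
                    result.put(op.getKey() + "." + type.getKey(), typeShare / users);
                }
            }
        }
        return result;
    }

    /** Operator-first: split the slot evenly across operators, then by weight within each operator. */
    static Map<String, Long> operatorFirst(long slotMb) {
        Map<String, Long> result = new TreeMap<>();
        long perOperator = slotMb / OPERATORS.size();
        for (Map.Entry<String, Set<String>> op : OPERATORS.entrySet()) {
            int totalWeight = op.getValue().stream().mapToInt(WEIGHTS::get).sum();
            for (String type : op.getValue()) {
                result.put(op.getKey() + "." + type, perOperator * WEIGHTS.get(type) / totalWeight);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println("slot-first:     " + slotFirst(1000));
        System.out.println("operator-first: " + operatorFirst(1000));
    }
}
```

Under these assumptions, the slot-first order gives `A` 350 MB and `B` 350 MB + 300 MB, while the operator-first order gives `A` 500 MB and `B` 350 MB + 150 MB. The same weights therefore produce different budgets depending on where they are applied.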
> > > > > > >>> > >>>>> > > > > > > >>> > >>>>> > > > > > > >>> > >>>>> > > > > > > >>> > >>>>> To sum up things: > > > > > > >>> > >>>>> > > > > > > >>> > >>>>> While (3) might be a bit more implementation related, > I > > > > > > >>> > think (1) > > > > > > >>> > >> and (2) > > > > > > >>> > >>>>> somehow suggest that, the price for the proposed > > approach > > > > > > to > > > > > > >>> > avoid > > > > > > >>> > >>>>> specifying resource for every operator is that it's > not > > > > > as > > > > > > >>> > >> independent > > > > > > >>> > >>>> from > > > > > > >>> > >>>>> operator chaining and slot sharing as the > > operator-based > > > > > > >>> > approach > > > > > > >>> > >>>> discussed > > > > > > >>> > >>>>> in the FLIP. > > > > > > >>> > >>>>> > > > > > > >>> > >>>>> > > > > > > >>> > >>>>> Thank you~ > > > > > > >>> > >>>>> > > > > > > >>> > >>>>> Xintong Song > > > > > > >>> > >>>>> > > > > > > >>> > >>>>> > > > > > > >>> > >>>>> > > > > > > >>> > >>>>> On Tue, Jan 19, 2021 at 4:29 AM Stephan Ewen > > > > > > >>> > <[hidden email] <mailto:[hidden email]>> > > > > > > >>> > >> wrote: > > > > > > >>> > >>>>>> Thanks a lot, Yangze and Xintong for this FLIP. > > > > > > >>> > >>>>>> > > > > > > >>> > >>>>>> I want to say, first of all, that this is super well > > > > > > >>> > written. And > > > > > > >>> > >> the > > > > > > >>> > >>>>>> points that the FLIP makes about how to expose the > > > > > > >>> > configuration to > > > > > > >>> > >>>> users > > > > > > >>> > >>>>>> is exactly the right thing to figure out first. > > > > > > >>> > >>>>>> So good job here! > > > > > > >>> > >>>>>> > > > > > > >>> > >>>>>> About how to let users specify the resource > profiles. 
> > > > > If I > > > > > > >>> > can sum > > > > > > >>> > >> the > > > > > > >>> > >>>>> FLIP > > > > > > >>> > >>>>>> and previous discussion up in my own words, the > > problem > > > > > > is the > > > > > > >>> > >>>> following: > > > > > > >>> > >>>>>> Operator-level specification is the simplest and > > > > > cleanest > > > > > > >>> > approach, > > > > > > >>> > >>>>> because > > > > > > >>> > >>>>>>> it avoids mixing operator configuration (resource) > > and > > > > > > >>> > >> scheduling. No > > > > > > >>> > >>>>>>> matter what other parameters change (chaining, slot > > > > > > sharing, > > > > > > >>> > >>>> switching > > > > > > >>> > >>>>>>> pipelined and blocking shuffles), the resource > > profiles > > > > > > >>> > stay the > > > > > > >>> > >>>> same. > > > > > > >>> > >>>>>>> But it would require that a user specifies > resources > > on > > > > > > all > > > > > > >>> > >>>> operators, > > > > > > >>> > >>>>>>> which makes it hard to use. That's why the FLIP > > > > > suggests > > > > > > going > > > > > > >>> > >> with > > > > > > >>> > >>>>>>> specifying resources on a Sharing-Group. > > > > > > >>> > >>>>>> > > > > > > >>> > >>>>>> I think both thoughts are important, so can we find > a > > > > > > solution > > > > > > >>> > >> where > > > > > > >>> > >>>> the > > > > > > >>> > >>>>>> Resource Profiles are specified on an Operator, but > we > > > > > > >>> > still avoid > > > > > > >>> > >> that > > > > > > >>> > >>>>> we > > > > > > >>> > >>>>>> need to specify a resource profile on every > operator? > > > > > > >>> > >>>>>> > > > > > > >>> > >>>>>> What do you think about something like the > following: > > > > > > >>> > >>>>>> - Resource Profiles are specified on an operator > > > > > level. 
> > > > > > >>> > >>>>>> - Not all operators need profiles > > > > > > >>> > >>>>>> - All Operators without a Resource Profile ended up > > > > > in > > > > > > the > > > > > > >>> > >> default > > > > > > >>> > >>>> slot > > > > > > >>> > >>>>>> sharing group with a default profile (will get a > > default > > > > > > slot). > > > > > > >>> > >>>>>> - All Operators with a Resource Profile will go into > > > > > > >>> > another slot > > > > > > >>> > >>>>> sharing > > > > > > >>> > >>>>>> group (the resource-specified-group). > > > > > > >>> > >>>>>> - Users can define different slot sharing groups for > > > > > > >>> > operators > > > > > > >>> > >> like > > > > > > >>> > >>>>> they > > > > > > >>> > >>>>>> do now, with the exception that you cannot mix > > operators > > > > > > >>> > that have > > > > > > >>> > >> a > > > > > > >>> > >>>>>> resource profile and operators that have no resource > > > > > > profile. > > > > > > >>> > >>>>>> - The default case where no operator has a resource > > > > > > >>> > profile is > > > > > > >>> > >> just a > > > > > > >>> > >>>>>> special case of this model > > > > > > >>> > >>>>>> - The chaining logic sums up the profiles per > > > > > operator, > > > > > > >>> > like it > > > > > > >>> > >> does > > > > > > >>> > >>>>> now, > > > > > > >>> > >>>>>> and the scheduler sums up the profiles of the tasks > > that > > > > > > it > > > > > > >>> > >> schedules > > > > > > >>> > >>>>>> together. > > > > > > >>> > >>>>>> > > > > > > >>> > >>>>>> > > > > > > >>> > >>>>>> There is another question about reactive scaling > > raised > > > > > > in the > > > > > > >>> > >> FLIP. I > > > > > > >>> > >>>>> need > > > > > > >>> > >>>>>> to think a bit about that. That is indeed a bit more > > > > > > tricky > > > > > > >>> > once we > > > > > > >>> > >>>> have > > > > > > >>> > >>>>>> slots of different sizes. 
> > > > > > >>> > >>>>>> It is not clear then which of the different slot > > > > > requests > > > > > > the > > > > > > >>> > >>>>>> ResourceManager should fulfill when new resources > > (TMs) > > > > > > >>> > show up, > > > > > > >>> > >> or how > > > > > > >>> > >>>>> the > > > > > > >>> > >>>>>> JobManager redistributes the slots resources when > > > > > > resources > > > > > > >>> > (TMs) > > > > > > >>> > >>>>> disappear > > > > > > >>> > >>>>>> This question is pretty orthogonal, though, to the > > "how > > > > > to > > > > > > >>> > specify > > > > > > >>> > >> the > > > > > > >>> > >>>>>> resources". > > > > > > >>> > >>>>>> > > > > > > >>> > >>>>>> > > > > > > >>> > >>>>>> Best, > > > > > > >>> > >>>>>> Stephan > > > > > > >>> > >>>>>> > > > > > > >>> > >>>>>> On Fri, Jan 8, 2021 at 5:14 AM Xintong Song > > > > > > >>> > <[hidden email] <mailto:[hidden email]> > > > > > > >>> > >>>>> wrote: > > > > > > >>> > >>>>>>> Thanks for drafting the FLIP and driving the > > > > > discussion, > > > > > > >>> > Yangze. > > > > > > >>> > >>>>>>> And Thanks for the feedback, Till and Chesnay. > > > > > > >>> > >>>>>>> > > > > > > >>> > >>>>>>> @Till, > > > > > > >>> > >>>>>>> > > > > > > >>> > >>>>>>> I agree that specifying requirements for SSGs means > > > > > that > > > > > > SSGs > > > > > > >>> > >> need to > > > > > > >>> > >>>>> be > > > > > > >>> > >>>>>>> supported in fine-grained resource management, > > > > > otherwise > > > > > > each > > > > > > >>> > >>>> operator > > > > > > >>> > >>>>>>> might use as many resources as the whole group. > > > > > However, > > > > > > I > > > > > > >>> > cannot > > > > > > >>> > >>>> think > > > > > > >>> > >>>>>> of > > > > > > >>> > >>>>>>> a strong reason for not supporting SSGs in > > fine-grained > > > > > > >>> > resource > > > > > > >>> > >>>>>>> management. 
> > > > > > >>> > >>>>>>> > > > > > > >>> > >>>>>>> > > > > > > >>> > >>>>>>>> Interestingly, if all operators have their > resources > > > > > > properly > > > > > > >>> > >>>>>> specified, > > > > > > >>> > >>>>>>>> then slot sharing is no longer needed because > Flink > > > > > > could > > > > > > >>> > >> slice off > > > > > > >>> > >>>>> the > > > > > > >>> > >>>>>>>> appropriately sized slots for every Task > > individually. > > > > > > >>> > >>>>>>>> > > > > > > >>> > >>>>>>> So for example, if we have a job consisting of two > > > > > > >>> > operator op_1 > > > > > > >>> > >> and > > > > > > >>> > >>>>> op_2 > > > > > > >>> > >>>>>>>> where each op needs 100 MB of memory, we would > then > > > > > say > > > > > > that > > > > > > >>> > >> the > > > > > > >>> > >>>> slot > > > > > > >>> > >>>>>>>> sharing group needs 200 MB of memory to run. If we > > > > > have > > > > > > a > > > > > > >>> > >> cluster > > > > > > >>> > >>>>> with > > > > > > >>> > >>>>>> 2 > > > > > > >>> > >>>>>>>> TMs with one slot of 100 MB each, then the system > > > > > > cannot run > > > > > > >>> > >> this > > > > > > >>> > >>>>> job. > > > > > > >>> > >>>>>> If > > > > > > >>> > >>>>>>>> the resources were specified on an operator level, > > > > > then > > > > > > the > > > > > > >>> > >> system > > > > > > >>> > >>>>>> could > > > > > > >>> > >>>>>>>> still make the decision to deploy op_1 to TM_1 and > > > > > op_2 > > > > > > to > > > > > > >>> > >> TM_2. > > > > > > >>> > >>>>>>> > > > > > > >>> > >>>>>>> Couldn't agree more that if all operators' > > requirements > > > > > > are > > > > > > >>> > >> properly > > > > > > >>> > >>>>>>> specified, slot sharing should be no longer needed. > I > > > > > > >>> > think this > > > > > > >>> > >>>>> exactly > > > > > > >>> > >>>>>>> disproves the example. 
> > > > > > >>> > >>>>>>> If we already know op_1 and op_2 each need 100 MB of memory, why would we put them in the same group? If they are in separate groups, with the proposed approach the system can freely deploy them to either a 200 MB TM or two 100 MB TMs.
> > > > > > >>> > >>>>>>>
> > > > > > >>> > >>>>>>> Moreover, the precondition for not needing slot sharing is having resource requirements properly specified for all operators. This is not always possible, and usually requires tremendous effort. One of the benefits of SSG-based requirements is that they allow the user to freely decide the granularity, and thus the effort they want to invest. I would consider an SSG in fine-grained resource management as a group of operators that the user would like to specify the total resource for.
> > > > > > >>> > >>>>>>> There can be only one group in the job, 2~3 groups dividing the job into a few major parts, or as many groups as the number of tasks/operators, depending on how fine-grained the user is able to specify the resources.
> > > > > > >>> > >>>>>>>
> > > > > > >>> > >>>>>>> Having to support SSGs might be a constraint. But given that all the current scheduler implementations already support SSGs, I tend to think of that as an acceptable price for the usability and flexibility discussed above.
> > > > > > >>> > >>>>>>>
> > > > > > >>> > >>>>>>> @Chesnay
> > > > > > >>> > >>>>>>>
> > > > > > >>> > >>>>>>>> Will declaring them on slot sharing groups not also waste resources if the parallelism of operators within that group are different?
> > > > > > >>> > >>>>>>>>
> > > > > > >>> > >>>>>>> Yes. It's a trade-off between usability and resource utilization. To avoid such waste, the user can define more groups, so that each group contains fewer operators and the chance of having operators with different parallelism will be reduced. The price is to have more resource requirements to specify.
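As a rough illustration of the trade-off just described, the sketch below compares the memory reserved for one large group with the memory reserved after splitting it into two groups. It rests on two assumptions that are not spelled out in this message: a group needs as many slots as the highest parallelism among its operators, and every slot of the group is sized for the group's full requirement. The operators, parallelisms, and per-subtask sizes are hypothetical.

```java
public class SsgWaste {
    /** Memory reserved for one group: max parallelism slots, each sized for the whole group. */
    static long reservedMb(int[] parallelism, long[] memMb) {
        int slots = 0;
        long perSlot = 0;
        for (int i = 0; i < parallelism.length; i++) {
            slots = Math.max(slots, parallelism[i]);
            perSlot += memMb[i];
        }
        return slots * perSlot;
    }

    /** Memory the subtasks actually need: parallelism times per-subtask memory, summed. */
    static long neededMb(int[] parallelism, long[] memMb) {
        long total = 0;
        for (int i = 0; i < parallelism.length; i++) {
            total += parallelism[i] * memMb[i];
        }
        return total;
    }

    public static void main(String[] args) {
        // Hypothetical operators a, b, c with parallelism 4, 4, 1 and 100, 100, 200 MB per subtask.
        int[] parallelism = {4, 4, 1};
        long[] memMb = {100, 100, 200};

        long oneGroup = reservedMb(parallelism, memMb);
        long twoGroups = reservedMb(new int[] {4, 4}, new long[] {100, 100})
                + reservedMb(new int[] {1}, new long[] {200});

        System.out.println("needed:     " + neededMb(parallelism, memMb) + " MB");
        System.out.println("one group:  " + oneGroup + " MB reserved");
        System.out.println("two groups: " + twoGroups + " MB reserved");
    }
}
```

With these numbers a single group reserves 1600 MB for 1000 MB of actual need, while the two-group split reserves exactly 1000 MB, at the cost of one more requirement for the user to specify.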
> > > > > > >>> > >>>>>>> > > > > > > >>> > >>>>>>> It also seems like quite a hassle for users having > to > > > > > > >>> > >> recalculate the > > > > > > >>> > >>>>>>>> resource requirements if they change the slot > > sharing. > > > > > > >>> > >>>>>>>> I'd think that it's not really workable for users > > that > > > > > > create > > > > > > >>> > >> a set > > > > > > >>> > >>>>> of > > > > > > >>> > >>>>>>>> re-usable operators which are mixed and matched in > > > > > their > > > > > > >>> > >>>>> applications; > > > > > > >>> > >>>>>>>> managing the resources requirements in such a > > setting > > > > > > >>> > would be > > > > > > >>> > >> a > > > > > > >>> > >>>>>>>> nightmare, and in the end would require > > operator-level > > > > > > >>> > >> requirements > > > > > > >>> > >>>>> any > > > > > > >>> > >>>>>>>> way. > > > > > > >>> > >>>>>>>> In that sense, I'm not even sure whether it really > > > > > > increases > > > > > > >>> > >>>>> usability. > > > > > > >>> > >>>>>>> - As mentioned in my reply to Till's comment, > > > > > > there's no > > > > > > >>> > >> reason to > > > > > > >>> > >>>>> put > > > > > > >>> > >>>>>>> multiple operators whose individual resource > > > > > > >>> > requirements are > > > > > > >>> > >>>>> already > > > > > > >>> > >>>>>>> known > > > > > > >>> > >>>>>>> into the same group in fine-grained resource > > > > > > management. > > > > > > >>> > >>>>>>> - Even an operator implementation is reused for > > > > > > multiple > > > > > > >>> > >>>>> applications, > > > > > > >>> > >>>>>>> it does not guarantee the same resource > > > > > requirements. 
> > > > > > >>> > During > > > > > > >>> > >> our > > > > > > >>> > >>>>> years > > > > > > >>> > >>>>>>> of > > > > > > >>> > >>>>>>> practices in Alibaba, with per-operator > > > > > requirements > > > > > > >>> > >> specified for > > > > > > >>> > >>>>>>> Blink's > > > > > > >>> > >>>>>>> fine-grained resource management, very few users > > > > > > >>> > (including > > > > > > >>> > >> our > > > > > > >>> > >>>>>>> specialists > > > > > > >>> > >>>>>>> who are dedicated to supporting Blink users) are as > > > > > > >>> > >> experienced as > > > > > > >>> > >>>>> to > > > > > > >>> > >>>>>>> accurately predict/estimate the operator resource > > > > > > >>> > >> requirements. > > > > > > >>> > >>>> Most > > > > > > >>> > >>>>>>> people > > > > > > >>> > >>>>>>> rely on the execution-time metrics (throughput, > > > > > > delay, cpu > > > > > > >>> > >> load, > > > > > > >>> > >>>>>> memory > > > > > > >>> > >>>>>>> usage, GC pressure, etc.) to improve the > > > > > > specification. > > > > > > >>> > >>>>>>> > > > > > > >>> > >>>>>>> To sum up: > > > > > > >>> > >>>>>>> If the user is capable of providing proper resource > > > > > > >>> > requirements > > > > > > >>> > >> for > > > > > > >>> > >>>>>> every > > > > > > >>> > >>>>>>> operator, that's definitely a good thing and we > would > > > > > not > > > > > > >>> > need to > > > > > > >>> > >>>> rely > > > > > > >>> > >>>>> on > > > > > > >>> > >>>>>>> the SSGs. However, that shouldn't be a *must* for > the > > > > > > >>> > >> fine-grained > > > > > > >>> > >>>>>> resource > > > > > > >>> > >>>>>>> management to work. 
For those users who are capable > > and > > > > > > do not > > > > > > >>> > >> like > > > > > > >>> > >>>>>> having > > > > > > >>> > >>>>>>> to set each operator to a separate SSG, I would be > ok > > > > > to > > > > > > have > > > > > > >>> > >> both > > > > > > >>> > >>>>>>> SSG-based and operator-based runtime interfaces and > > to > > > > > > only > > > > > > >>> > >> fallback > > > > > > >>> > >>>> to > > > > > > >>> > >>>>>> the > > > > > > >>> > >>>>>>> SSG requirements when the operator requirements are > > not > > > > > > >>> > >> specified. > > > > > > >>> > >>>>>> However, > > > > > > >>> > >>>>>>> as the first step, I think we should prioritise the > > use > > > > > > cases > > > > > > >>> > >> where > > > > > > >>> > >>>>> users > > > > > > >>> > >>>>>>> are not that experienced. > > > > > > >>> > >>>>>>> > > > > > > >>> > >>>>>>> Thank you~ > > > > > > >>> > >>>>>>> > > > > > > >>> > >>>>>>> Xintong Song > > > > > > >>> > >>>>>>> > > > > > > >>> > >>>>>>> On Thu, Jan 7, 2021 at 9:55 PM Chesnay Schepler < > > > > > > >>> > >> [hidden email] <mailto:[hidden email]>> > > > > > > >>> > >>>>>>> wrote: > > > > > > >>> > >>>>>>> > > > > > > >>> > >>>>>>>> Will declaring them on slot sharing groups not > also > > > > > > waste > > > > > > >>> > >> resources > > > > > > >>> > >>>>> if > > > > > > >>> > >>>>>>>> the parallelism of operators within that group are > > > > > > different? > > > > > > >>> > >>>>>>>> > > > > > > >>> > >>>>>>>> It also seems like quite a hassle for users having > > to > > > > > > >>> > >> recalculate > > > > > > >>> > >>>> the > > > > > > >>> > >>>>>>>> resource requirements if they change the slot > > sharing. 
> > > > > > >>> > >>>>>>>> I'd think that it's not really workable for users > > that > > > > > > create > > > > > > >>> > >> a set > > > > > > >>> > >>>>> of > > > > > > >>> > >>>>>>>> re-usable operators which are mixed and matched in > > > > > their > > > > > > >>> > >>>>> applications; > > > > > > >>> > >>>>>>>> managing the resources requirements in such a > > setting > > > > > > >>> > would be > > > > > > >>> > >> a > > > > > > >>> > >>>>>>>> nightmare, and in the end would require > > operator-level > > > > > > >>> > >> requirements > > > > > > >>> > >>>>> any > > > > > > >>> > >>>>>>>> way. > > > > > > >>> > >>>>>>>> In that sense, I'm not even sure whether it really > > > > > > increases > > > > > > >>> > >>>>> usability. > > > > > > >>> > >>>>>>>> My main worry is that it if we wire the runtime to > > > > > work > > > > > > >>> > on SSGs > > > > > > >>> > >>>> it's > > > > > > >>> > >>>>>>>> gonna be difficult to implement more fine-grained > > > > > > approaches, > > > > > > >>> > >> which > > > > > > >>> > >>>>>>>> would not be the case if, for the runtime, they > are > > > > > > always > > > > > > >>> > >> defined > > > > > > >>> > >>>> on > > > > > > >>> > >>>>>> an > > > > > > >>> > >>>>>>>> operator-level. > > > > > > >>> > >>>>>>>> > > > > > > >>> > >>>>>>>> On 1/7/2021 2:42 PM, Till Rohrmann wrote: > > > > > > >>> > >>>>>>>>> Thanks for drafting this FLIP and starting this > > > > > > discussion > > > > > > >>> > >>>> Yangze. > > > > > > >>> > >>>>>>>>> I like that defining resource requirements on a > > slot > > > > > > sharing > > > > > > >>> > >>>> group > > > > > > >>> > >>>>>>> makes > > > > > > >>> > >>>>>>>>> the overall setup easier and improves usability > of > > > > > > resource > > > > > > >>> > >>>>>>> requirements. 
> > > > > > >>> > >>>>>>>>> What I do not like about it is that it changes > slot > > > > > > sharing > > > > > > >>> > >>>> groups > > > > > > >>> > >>>>>> from > > > > > > >>> > >>>>>>>>> being a scheduling hint to something which needs > to > > > > > be > > > > > > >>> > >> supported > > > > > > >>> > >>>> in > > > > > > >>> > >>>>>>> order > > > > > > >>> > >>>>>>>>> to support fine grained resource requirements. So > > > > > far, > > > > > > the > > > > > > >>> > >> idea > > > > > > >>> > >>>> of > > > > > > >>> > >>>>>> slot > > > > > > >>> > >>>>>>>>> sharing groups was that it tells the system that > a > > > > > set > > > > > > of > > > > > > >>> > >>>> operators > > > > > > >>> > >>>>>> can > > > > > > >>> > >>>>>>>> be > > > > > > >>> > >>>>>>>>> deployed in the same slot. But the system still > had > > > > > the > > > > > > >>> > >> freedom > > > > > > >>> > >>>> to > > > > > > >>> > >>>>>> say > > > > > > >>> > >>>>>>>> that > > > > > > >>> > >>>>>>>>> it would rather place these tasks in different > > slots > > > > > > if it > > > > > > >>> > >>>> wanted. > > > > > > >>> > >>>>> If > > > > > > >>> > >>>>>>> we > > > > > > >>> > >>>>>>>>> now specify resource requirements on a per slot > > > > > sharing > > > > > > >>> > >> group, > > > > > > >>> > >>>> then > > > > > > >>> > >>>>>> the > > > > > > >>> > >>>>>>>>> only option for a scheduler which does not > support > > > > > slot > > > > > > >>> > >> sharing > > > > > > >>> > >>>>>> groups > > > > > > >>> > >>>>>>> is > > > > > > >>> > >>>>>>>>> to say that every operator in this slot sharing > > group > > > > > > >>> > needs a > > > > > > >>> > >>>> slot > > > > > > >>> > >>>>>> with > > > > > > >>> > >>>>>>>> the > > > > > > >>> > >>>>>>>>> same resources as the whole group. 
> > > > > > >>> > >>>>>>>>> > > > > > > >>> > >>>>>>>>> So for example, if we have a job consisting of > two > > > > > > operator > > > > > > >>> > >> op_1 > > > > > > >>> > >>>>> and > > > > > > >>> > >>>>>>> op_2 > > > > > > >>> > >>>>>>>>> where each op needs 100 MB of memory, we would > then > > > > > > say that > > > > > > >>> > >> the > > > > > > >>> > >>>>> slot > > > > > > >>> > >>>>>>>>> sharing group needs 200 MB of memory to run. If > we > > > > > > have a > > > > > > >>> > >> cluster > > > > > > >>> > >>>>>> with > > > > > > >>> > >>>>>>> 2 > > > > > > >>> > >>>>>>>>> TMs with one slot of 100 MB each, then the system > > > > > > cannot run > > > > > > >>> > >> this > > > > > > >>> > >>>>>> job. > > > > > > >>> > >>>>>>> If > > > > > > >>> > >>>>>>>>> the resources were specified on an operator > level, > > > > > > then the > > > > > > >>> > >>>> system > > > > > > >>> > >>>>>>> could > > > > > > >>> > >>>>>>>>> still make the decision to deploy op_1 to TM_1 > and > > > > > > op_2 to > > > > > > >>> > >> TM_2. > > > > > > >>> > >>>>>>>>> Originally, one of the primary goals of slot > > sharing > > > > > > groups > > > > > > >>> > >> was > > > > > > >>> > >>>> to > > > > > > >>> > >>>>>> make > > > > > > >>> > >>>>>>>> it > > > > > > >>> > >>>>>>>>> easier for the user to reason about how many > slots > > a > > > > > > job > > > > > > >>> > >> needs > > > > > > >>> > >>>>>>>> independent > > > > > > >>> > >>>>>>>>> of the actual number of operators in the job. > > > > > > Interestingly, > > > > > > >>> > >> if > > > > > > >>> > >>>> all > > > > > > >>> > >>>>>>>>> operators have their resources properly > specified, > > > > > > then slot > > > > > > >>> > >>>>> sharing > > > > > > >>> > >>>>>> is > > > > > > >>> > >>>>>>>> no > > > > > > >>> > >>>>>>>>> longer needed because Flink could slice off the > > > > > > >>> > appropriately > > > > > > >>> > >>>> sized > > > > > > >>> > >>>>>>> slots > > > > > > >>> > >>>>>>>>> for every Task individually. 
What matters is > > whether > > > > > > the > > > > > > >>> > >> whole > > > > > > >>> > >>>>>> cluster > > > > > > >>> > >>>>>>>> has > > > > > > >>> > >>>>>>>>> enough resources to run all tasks or not. > > > > > > >>> > >>>>>>>>> > > > > > > >>> > >>>>>>>>> Cheers, > > > > > > >>> > >>>>>>>>> Till > > > > > > >>> > >>>>>>>>> > > > > > > >>> > >>>>>>>>> On Thu, Jan 7, 2021 at 4:08 AM Yangze Guo < > > > > > > >>> > >> [hidden email] <mailto:[hidden email]>> > > > > > > >>> > >>>>>> wrote: > > > > > > >>> > >>>>>>>>>> Hi, there, > > > > > > >>> > >>>>>>>>>> > > > > > > >>> > >>>>>>>>>> We would like to start a discussion thread on > > > > > > "FLIP-156: > > > > > > >>> > >> Runtime > > > > > > >>> > >>>>>>>>>> Interfaces for Fine-Grained Resource > > > > > Requirements"[1], > > > > > > >>> > >> where we > > > > > > >>> > >>>>>>>>>> propose Slot Sharing Group (SSG) based runtime > > > > > > interfaces > > > > > > >>> > >> for > > > > > > >>> > >>>>>>>>>> specifying fine-grained resource requirements. > > > > > > >>> > >>>>>>>>>> > > > > > > >>> > >>>>>>>>>> In this FLIP: > > > > > > >>> > >>>>>>>>>> - Expound the user story of fine-grained > resource > > > > > > >>> > >> management. > > > > > > >>> > >>>>>>>>>> - Propose runtime interfaces for specifying > > > > > SSG-based > > > > > > >>> > >> resource > > > > > > >>> > >>>>>>>>>> requirements. > > > > > > >>> > >>>>>>>>>> - Discuss the pros and cons of the three > potential > > > > > > >>> > >> granularities > > > > > > >>> > >>>>> for > > > > > > >>> > >>>>>>>>>> specifying the resource requirements (op, task > and > > > > > > slot > > > > > > >>> > >> sharing > > > > > > >>> > >>>>>> group) > > > > > > >>> > >>>>>>>>>> and explain why we choose the slot sharing > group. > > > > > > >>> > >>>>>>>>>> > > > > > > >>> > >>>>>>>>>> Please find more details in the FLIP wiki > document > > > > > > [1]. > > > > > > >>> > >> Looking > > > > > > >>> > >>>>>>>>>> forward to your feedback. 
> > > > > > >>> > >>>>>>>>>>
> > > > > > >>> > >>>>>>>>>> [1] https://cwiki.apache.org/confluence/display/FLINK/FLIP-156%3A+Runtime+Interfaces+for+Fine-Grained+Resource+Requirements
> > > > > > >>> > >>>>>>>>>>
> > > > > > >>> > >>>>>>>>>> Best,
> > > > > > >>> > >>>>>>>>>> Yangze Guo
Hi Xintong,
Thanks for the background!

I understand the impracticality of operator-level specifications and the value of group-level specifications. I'm just not that confident about “Coupling between operator chaining / slot sharing”; it seems to me that it requires more knowledge than “Expose operator chaining”.

Best,
Kezhu Wang

On Thu, Feb 4, 2021 at 13:22 Xintong Song <[hidden email]> wrote:

> Hi Kezhu,
>
> Let me share some background first.
>
> - We at Alibaba have been using fine-grained resource management for many years, with Blink (an internal version of Flink).
> - We have been trying to contribute this feature to Apache Flink for many years. However, we haven't succeeded, for various reasons:
>   - Years ago, I believe there were not many users running Flink in production at a very large scale, and thus there was less demand for the feature.
>   - The feature in Blink is quite specific to our internal use cases and scenarios. We have not made it general enough to cover the community's common use cases.
>   - Divergences between the Flink and Blink code bases.
> - Blink used operator-level resource interfaces. According to our years of production experience, we believe that specifying operator-level resources is neither necessary nor easy to use. This is why we propose group-level interfaces.
>
> Back to your questions.
>
> > I saw the discussion to keep slot sharing as a hint, but in reality, should SSG jobs be expected to fail or run slowly if the scheduler does not respect it? A slot with 20GB memory is different from two 1GB default-sized slots. So, we actually depend on the scheduler version/implementation/de facto behavior if we claim it is a hint.
>
> SSG-based resource requirements are considered hints because the SSG itself is a hint. There's no guarantee that operators of a SSG will always be scheduled together.
> I think you have a good point: if SSGs are not respected, is it preferable to fail the job or to interpret the resource of an actual slot? It's possible that we provide a configuration option and leave that decision to the users. However, that is a design choice we need to make only when there's indeed a need for not respecting the SSGs.
>
> > Do you mean code-path or production environment? If it is code-path, could you please point out where the story breaks?
> >
> > From the discussion and history, could I consider FLIP-156 a redirection more than an inheritance/enhancement of the current half-baked/ancient implementation?
>
> If you try to set the operator resources, you would find that it won't work at the moment. There are several things not ready.
>
> - Interfaces for setting operator resources are never really exposed to users.
> - The resource manager never allocates slots with the requested resources.
> - Managed memory size specified for operators will not be respected, because managed memory is shared within a slot with a different approach.
>
> While the first 2 points are more related to the feature not yet being ready, the last point is closely related to specifying operator-level resources.
>
> To sum up, we do not want to support specifying operator-level resources in the first step, for the following reasons.
>
> - It's not likely needed, due to poor usability compared to the SSG-based approach.
> - It introduces the complexity of dealing with managed memory sharing.
> - It introduces the complexity of combining resource requirements from two different levels.
>
> Thank you~
>
> Xintong Song
>
> On Wed, Feb 3, 2021 at 7:50 PM Kezhu Wang <[hidden email]> wrote:
>
> > Hi Till,
> >
> > Based on what I understood, if not wrong, the door is not closed after SSG resource specifying. So, I hope it could be useful in potential future improvement.
> >
> > Best,
> > Kezhu Wang
> >
> > On February 3, 2021 at 18:07:21, Till Rohrmann ([hidden email]) wrote:
> >
> > Thanks for sharing your thoughts Kezhu. I like your ideas of how per-operator and SSG requirements can be combined. I've also thought about defining a default resource profile for all tasks which have no resources configured. That way all operators would have resources assigned if the user chooses to use this feature.
> >
> > As Yangze and Xintong have said, we have decided to first only support specifying resources for SSGs as this seems more user friendly. Based on the feedback for this feature, one potential development direction might be to allow the resource specification on a per-operator basis. Here we could pick up your ideas.
> >
> > Cheers,
> > Till
> >
> > On Wed, Feb 3, 2021 at 7:31 AM Xintong Song <[hidden email]> wrote:
> >
> > > Thanks for your feedback, Kezhu.
> > >
> > > > I think Flink *runtime* already has an ideal granularity for resource management: 'task'. If there is a slot shared by multiple tasks, that slot's resource requirement is the simple sum of all its logical slots. So basically, there is no resource requirement for SlotSharingGroup in runtime until now, right?
> > >
> > > That is a half-baked implementation, coming from the previous attempts (years ago) trying to deliver the fine-grained resource management feature, and never really put into use.
> > >
> > > > From the FLIP and discussion, I assume that SSG resource specifying will override operator level resource specifying if both are specified?
> > >
> > > Actually, I think we should use the finer-grained resources (i.e. operator level) if both are specified. And more importantly, that is based on the assumption that we do need two different levels of interfaces.
> > >
> > > > So, I wonder whether we could interpret SSG resource specifying as an
> > > > "add" but not a "set" on resource requirement?
> > >
> > > IIUC, this is the core idea behind your proposal. I think it provides an
> > > interesting idea of how we combine operator-level and SSG-level
> > > resources, *if we allow configuring resources at both levels*. However,
> > > I'm not sure whether configuring resources on the operator level is
> > > indeed needed. Therefore, as a first step, this FLIP proposes to only
> > > introduce the SSG-level interfaces. As listed in the future plan, we
> > > would consider allowing operator-level resource configuration later if
> > > we do see a need for it. At that time, we definitely should discuss what
> > > to do if resources are configured at both levels.
> > >
> > > > * Could SSG express negative resource requirement?
> > >
> > > No.
> > >
> > > > Is there a concrete barrier that keeps partially configured resources
> > > > from working? I saw it will fail job submission in
> > > > Dispatcher.submitJob.
> > >
> > > With the SSG-based approach, this should no longer be needed. The
> > > constraint was introduced because we can neither properly define what is
> > > the resource of a task chained from an operator with specified resource
> > > and another with unspecified resource, nor for a slot shared by a task
> > > with specified resource and another with unspecified resource. With the
> > > SSG-based approach, we no longer have those problems.
> > >
> > > > An option (cluster/job level) to force slot sharing in the scheduler?
> > > > This could be useful in case of migration from FLIP-156 to a future
> > > > approach.
> > >
> > > I think this is exactly what we are trying to avoid: requiring the
> > > scheduler to enforce slot sharing.
> > >
> > > > An option (cluster level) to ignore resource specifying (allowing jobs
> > > > with specified resources to run on an open box environment) for
> > > > non-production usage?
> > > > > > > That's possible. Actually, we are planning to introduce an option for > > > activating the fine-grained resource management, for development > > purposes. > > > We might consider to keep that option after the feature is completed, > to > > > allow disable the feature without having to touch the job codes. > > > > > > Thank you~ > > > > > > Xintong Song > > > > > > > > > > > > On Wed, Feb 3, 2021 at 1:28 PM Kezhu Wang <[hidden email]> wrote: > > > > > > > Hi all, sorry for join discussion even after voting started. > > > > > > > > I want to share my thoughts on this after reading above discussions. > > > > > > > > I think Flink *runtime* already has an ideal granularity for resource > > > > management 'task'. If there is > > > > a slot shared by multiple tasks, that slot's resource requirement is > > > simple > > > > sum of all its logical > > > > slots. So basically, this is no resource requirement for > > SlotSharingGroup > > > > in runtime until now, > > > > right ? > > > > > > > > As in discussion, we already agree upon that: "If all operators have > > > their > > > > resources properly > > > > specified, then slot sharing is no longer needed. " > > > > > > > > So seems to me, naturally in mind path, what we would discuss is > that: > > > how > > > > to bridge impractical > > > > operator level resource specifying to runtime task level resource > > > > requirement ? This is actually a > > > > pure api thing as Chesnay has pointed out. > > > > > > > > But FLIP-156 brings another direction on table: how about using SSG > for > > > > both api and runtime > > > > resource specifying ? > > > > > > > > From the FLIP and dicusssion, I assume that SSG resource specifying > > will > > > > override operator level > > > > resource specifying if both are specified ? > > > > > > > > So, I wonder whether we could interpret SSG resource specifying as an > > > "add" > > > > but not an "set" on > > > > resource requirement ? 
> > > >
> > > > The semantics are that SSG resource specifying adds additional resources
> > > > to the shared slot, to express concerns about possible high throughput
> > > > and resource requirements for tasks in one physical slot.
> > > >
> > > > The result is that, if the scheduler indeed respects slot sharing, the
> > > > allocated slot will gain the extra resources specified for that SSG.
> > > >
> > > > I think one coding barrier for the "add" approach is
> > > > ResourceSpec.UNKNOWN, which doesn't support the 'merge' operation. I
> > > > tend to use ResourceSpec.ZERO as the default; the task executor should
> > > > be aware of this.
> > > >
> > > > @Chesnay
> > > > > My main worry is that if we wire the runtime to work on SSGs it's
> > > > > gonna be difficult to implement more fine-grained approaches, which
> > > > > would not be the case if, for the runtime, they are always defined on
> > > > > an operator-level.
> > > >
> > > > An "add" operation should be less invasive and set a low barrier for
> > > > future fine-grained approaches.
> > > >
> > > > @Stephan
> > > > > - Users can define different slot sharing groups for operators like
> > > > > they do now, with the exception that you cannot mix operators that
> > > > > have a resource profile and operators that have no resource profile.
> > > >
> > > > @Till
> > > > > This effectively means that all unspecified operators will implicitly
> > > > > have a zero resource requirement.
> > > > > I am wondering whether this wouldn't lead to a surprising behaviour
> > > > > for the user. If the user specifies the resource requirements for a
> > > > > single operator, then he probably will assume that the other operators
> > > > > will get the default share of resources and not nothing.
> > > >
> > > > I think it is inherent, due to the fact that we cannot define
> > > > ResourceSpec.ONE, e.g. the resource requirement for exactly one default
> > > > slot, with concrete numbers. I tend to squash out unspecified operators
> > > > if there are operators in the chain with explicit resource specifying.
> > > > Otherwise, the protocol tends to be verbose, as in "give me this much
> > > > resource and a default". I think if we have explicit resource specifying
> > > > for only some operators, it is just saying "I don't care about the other
> > > > operators that much, just get them places to run". Most likely these are
> > > > stateless filter/map or other less resource-consuming operators. If
> > > > there is indeed a problem, I think clients can specify a global default
> > > > (or another level of default in future). In the job graph generating
> > > > phase, we could take that default into account for unspecified
> > > > operators.
> > > >
> > > > @FLIP-156
> > > > > Expose operator chaining. (Cons of task level resource specifying)
> > > >
> > > > Is it inherent for all group-level resource specifying? They will either
> > > > break chaining or obey it, or even could not work with it.
> > > >
> > > > To sum up the above, my suggestions are:
> > > >
> > > > On the API side:
> > > > * StreamExecutionEnvironment: a global default (ResourceSpec.ZERO if
> > > >   unspecified).
> > > > * Operator: ResourceSpec.ZERO (unspecified) as default.
> > > > * Task: sum of requirements from specified operators + global default
> > > >   (if there are any unspecified operators).
> > > > * SSG: additional resource for the physical slot.
> > > >
> > > > On the runtime side:
> > > > * Task: ResourceSpec.Task or ResourceSpec.ZERO
> > > > * SSG: ResourceSpec.SSG or ResourceSpec.ZERO
> > > >
> > > > The physical slot gets the sum of resources from its logical slots and
> > > > the SSG; if it gets ResourceSpec.ZERO, it is just a default-sized slot.
> > > >
> > > > In short, turn SSG resource specifying into "add" and drop
> > > > ResourceSpec.UNKNOWN.
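
The "add" semantics suggested in the message above (the physical slot is the sum of its logical slots plus the SSG's extra resources, and an all-ZERO result means a default-sized slot) could be sketched roughly as follows. This is only an illustration under stated assumptions: `SlotResources` is a hypothetical stand-in for Flink's `ResourceSpec` that models just CPU cores and memory, and `physicalSlot` is an invented helper, not an existing Flink API.

```java
import java.util.List;

/** Hypothetical stand-in for ResourceSpec; only CPU cores and memory are modeled. */
final class SlotResources {
    /** Plays the role of ResourceSpec.ZERO: "nothing specified". */
    static final SlotResources ZERO = new SlotResources(0.0, 0);

    final double cpuCores;
    final int memoryMb;

    SlotResources(double cpuCores, int memoryMb) {
        this.cpuCores = cpuCores;
        this.memoryMb = memoryMb;
    }

    /** The 'merge' operation that an UNKNOWN value cannot support but ZERO can. */
    SlotResources add(SlotResources other) {
        return new SlotResources(cpuCores + other.cpuCores, memoryMb + other.memoryMb);
    }

    boolean isZero() {
        return cpuCores == 0.0 && memoryMb == 0;
    }
}

final class AddSemanticsSketch {
    /**
     * Sums the task-level requirements of all logical slots and adds the SSG's
     * extra resources on top. If nothing was specified anywhere, the result is
     * ZERO and the default-sized slot is used instead.
     */
    static SlotResources physicalSlot(
            List<SlotResources> taskRequirements,
            SlotResources ssgExtra,
            SlotResources defaultSlot) {
        SlotResources total = ssgExtra;
        for (SlotResources task : taskRequirements) {
            total = total.add(task);
        }
        return total.isZero() ? defaultSlot : total;
    }
}
```

For example, with a default slot of 1 core / 1024 MB, two tasks of which only one specifies (0.5 cores, 256 MB), plus an SSG extra of (0.5 cores, 128 MB), the sketch yields a slot of 1.0 core / 384 MB. If neither the tasks nor the SSG specify anything, the default slot is returned unchanged.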
> > > > > > > > > > > > Questions/Issues: > > > > * Could SSG express negative resource requirement ? > > > > * Is there concrete bar for partial resource configured not function > ? > > I > > > > saw it will fail job submission in Dispatcher.submitJob. > > > > * An option(cluster/job level) to force slot sharing in scheduler ? > > This > > > > could be useful in case of migration from FLIP-156 to future > approach. > > > > * An option(cluster) to ignore resource specifying(allow resource > > > specified > > > > job to run on open box environment) for no production usage ? > > > > > > > > > > > > > > > > On February 1, 2021 at 11:54:10, Yangze Guo ([hidden email]) > > wrote: > > > > > > > > Thanks for reply, Till and Xintong! > > > > > > > > I update the FLIP, including: > > > > - Edit the JavaDoc of the proposed > > > > StreamGraphGenerator#setSlotSharingGroupResource. > > > > - Add "Future Plan" section, which contains the potential follow-up > > > > issues and the limitations to be documented when fine-grained > resource > > > > management is exposed to users. > > > > > > > > I'll start a vote in another thread. > > > > > > > > Best, > > > > Yangze Guo > > > > > > > > On Fri, Jan 29, 2021 at 10:07 PM Till Rohrmann <[hidden email] > > > > > > wrote: > > > > > > > > > > Thanks for summarizing the discussion, Yangze. I agree that setting > > > > > resource requirements per operator is not very user friendly. > > > Moreover, I > > > > > couldn't come up with a different proposal which would be as easy > to > > > use > > > > > and wouldn't expose internal scheduling details. In fact, following > > > this > > > > > argument then we shouldn't have exposed the slot sharing groups in > > the > > > > > first place. > > > > > > > > > > What is important for the user is that we properly document the > > > > limitations > > > > > and constraints the fine grained resource specification has. 
For > > > example, > > > > > we should explain how optimizations like chaining are affected by > it > > > and > > > > > how different execution modes (batch vs. streaming) affect the > > > execution > > > > of > > > > > operators which have specified resources. These things shouldn't > > become > > > > > part of the contract of this feature and are more caused by > internal > > > > > implementation details but it will be important to understand these > > > > things > > > > > properly in order to use this feature effectively. > > > > > > > > > > Hence, +1 for starting the vote for this FLIP. > > > > > > > > > > Cheers, > > > > > Till > > > > > > > > > > On Tue, Jan 26, 2021 at 4:37 AM Xintong Song < > [hidden email]> > > > > wrote: > > > > > > > > > > > Thanks for the summary, Yangze. > > > > > > > > > > > > The changes and follow-up issues LGTM. Let's wait for responses > > from > > > > the > > > > > > others before starting a vote. > > > > > > > > > > > > Thank you~ > > > > > > > > > > > > Xintong Song > > > > > > > > > > > > > > > > > > > > > > > > On Tue, Jan 26, 2021 at 11:08 AM Yangze Guo <[hidden email]> > > > > wrote: > > > > > > > > > > > > > Thanks everyone for the lively discussion. I'd like to try to > > > > > > > summarize the current convergence in the discussion. Please let > > me > > > > > > > know if I got things wrong or missed something crucial here. > > > > > > > > > > > > > > Change of this FLIP: > > > > > > > - Treat the SSG resource requirements as a hint instead of a > > > > > > > restriction for the runtime. That's should be explicitly > > explained > > > in > > > > > > > the JavaDocs. > > > > > > > > > > > > > > Potential follow-up issues if needed: > > > > > > > - Provide operator-level resource configuration interface. > > > > > > > - Provide multiple options for deciding resources for SSGs > whose > > > > > > > requirement is not specified: > > > > > > > ** Default slot resource. 
> > > > > > > ** Default operator resource times number of operators. > > > > > > > > > > > > > > If there are no other issues, I'll update the FLIP accordingly > > and > > > > > > > start a vote thread. Thanks all for the valuable feedback > again. > > > > > > > > > > > > > > Best, > > > > > > > Yangze Guo > > > > > > > > > > > > > > Best, > > > > > > > Yangze Guo > > > > > > > > > > > > > > > > > > > > > On Fri, Jan 22, 2021 at 11:30 AM Xintong Song < > > > [hidden email] > > > > > > > > > > > > wrote: > > > > > > > > > > > > > > > > > > > > > > > > FGRuntimeInterface.png > > > > > > > > > > > > > > > > Thank you~ > > > > > > > > > > > > > > > > Xintong Song > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > On Fri, Jan 22, 2021 at 11:11 AM Xintong Song < > > > > [hidden email]> > > > > > > > > > > > wrote: > > > > > > > >> > > > > > > > >> I think Chesnay's proposal could actually work. IIUC, the > > > keypoint > > > > is > > > > > > > to derive operator requirements from SSG requirements on the > API > > > > side, so > > > > > > > that the runtime only deals with operator requirements. It's > > > > debatable > > > > > > how > > > > > > > the deriving should be done though. E.g., an alternative could > be > > > to > > > > > > evenly > > > > > > > divide the SSG requirement into requirements of operators in > the > > > > group. > > > > > > > >> > > > > > > > >> > > > > > > > >> However, I'm not entirely sure which option is more desired. > > > > > > > Illustrating my understanding in the following figure, in which > > on > > > > the > > > > > > top > > > > > > > is Chesnay's proposal and on the bottom is the SSG-based > proposal > > > in > > > > this > > > > > > > FLIP. > > > > > > > >> > > > > > > > >> > > > > > > > >> > > > > > > > >> I think the major difference between the two approaches is > > where > > > > > > > deriving operator requirements from SSG requirements happens. 
> > > > > > > >>
> > > > > > > >> - Chesnay's proposal simplifies the runtime logic and the interface to
> > > > > > > >> expose, at the price of moving more complexity (i.e. the deriving) to
> > > > > > > >> the API side. The question is, where do we prefer to keep the
> > > > > > > >> complexity? I'm slightly leaning towards having a thin API and keeping
> > > > > > > >> the complexity in the runtime if possible.
> > > > > > > >>
> > > > > > > >> - Notice that the dashed arrows represent optional steps that are
> > > > > > > >> needed only for schedulers that do not respect SSGs, which we don't
> > > > > > > >> have at the moment. If we only look at the solid arrows, then the
> > > > > > > >> SSG-based approach is much simpler, without needing to derive and
> > > > > > > >> aggregate the requirements back and forth. I'm not sure about
> > > > > > > >> complicating the current design only for potential future needs.
> > > > > > > >>
> > > > > > > >> Thank you~
> > > > > > > >>
> > > > > > > >> Xintong Song
> > > > > > > >>
> > > > > > > >> On Fri, Jan 22, 2021 at 7:35 AM Chesnay Schepler <[hidden email]> wrote:
> > > > > > > >>>
> > > > > > > >>> You're raising a good point, but I think I can rectify that with a
> > > > > > > >>> minor adjustment.
> > > > > > > >>>
> > > > > > > >>> Default requirements are whatever the default requirements are; setting
> > > > > > > >>> the requirements for one operator has no effect on other operators.
> > > > > > > >>>
> > > > > > > >>> With these rules, and some API enhancements, the following mockup would
> > > > > > > >>> replicate the SSG-based behavior:
> > > > > > > >>>
> > > > > > > >>> Map<SlotSharingGroupId, Requirements> requirements = ...
> > > > > > > >>> for slotSharingGroup in env.getSlotSharingGroups() {
> > > > > > > >>>     vertices = slotSharingGroup.getVertices()
> > > > > > > >>>     vertices.first().setRequirements(requirements.get(slotSharingGroup.getID()))
> > > > > > > >>>     vertices.remaining().setRequirements(ZERO)
> > > > > > > >>> }
> > > > > > > >>>
> > > > > > > >>> We could even allow setting requirements on slotsharing-groups
> > > > > > > >>> colocation-groups and internally translate them accordingly.
> > > > > > > >>> I can't help but feel this is a plain API issue.
> > > > > > > >>>
> > > > > > > >>> On 1/21/2021 9:44 AM, Till Rohrmann wrote:
> > > > > > > >>> > If I understand you correctly Chesnay, then you want to decouple the
> > > > > > > >>> > resource requirement specification from the slot sharing group
> > > > > > > >>> > assignment. Hence, per default all operators would be in the same slot
> > > > > > > >>> > sharing group. If there is no operator with a resource specification,
> > > > > > > >>> > then the system would allocate a default slot for it. If there is at
> > > > > > > >>> > least one operator with a resource specification, then the system
> > > > > > > >>> > would sum up all the specified resources and allocate a slot of this
> > > > > > > >>> > size. This effectively means that all unspecified operators will
> > > > > > > >>> > implicitly have a zero resource requirement. Did I understand your
> > > > > > > >>> > idea correctly?
> > > > > > > >>> >
> > > > > > > >>> > I am wondering whether this wouldn't lead to a surprising behaviour
> > > > > > > >>> > for the user. If the user specifies the resource requirements for a
> > > > > > > >>> > single operator, then he probably will assume that the other operators
> > > > > > > >>> > will get the default share of resources and not nothing.
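
The mockup and the reading of it quoted above can be made concrete with a small, self-contained Java sketch. Everything here is an assumption made for illustration: operators are plain strings, a requirement is a single memory figure in MB, and `derive` / `slotSize` are invented helper names rather than Flink APIs.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class SsgToOperatorSketch {
    /** Hypothetical ZERO requirement: 0 MB means "nothing requested". */
    static final int ZERO = 0;

    /**
     * API-side derivation as in the mockup: the first operator of a slot
     * sharing group carries the whole group requirement, the rest get ZERO.
     */
    static Map<String, Integer> derive(List<String> operatorsInGroup, int groupRequirementMb) {
        Map<String, Integer> perOperator = new LinkedHashMap<>();
        for (int i = 0; i < operatorsInGroup.size(); i++) {
            perOperator.put(operatorsInGroup.get(i), i == 0 ? groupRequirementMb : ZERO);
        }
        return perOperator;
    }

    /**
     * Runtime-side aggregation as read in the reply: sum whatever is specified
     * and fall back to the default slot only if nothing is specified at all.
     */
    static int slotSize(Map<String, Integer> perOperator, int defaultSlotMb) {
        int sum = 0;
        for (int mb : perOperator.values()) {
            sum += mb;
        }
        return sum == ZERO ? defaultSlotMb : sum;
    }
}
```

For a group of `source`, `map`, `sink` with a 512 MB requirement, `derive` assigns 512 MB to `source` and 0 to the others, and `slotSize` recovers 512 MB. With no requirement at all and a 1024 MB default, the slot is 1024 MB. The same arithmetic also shows the surprise raised in the reply: specifying a small requirement on a single operator produces a slot smaller than the default one.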
> > > > > > > >>> > > > > > > > > >>> > Cheers, > > > > > > > >>> > Till > > > > > > > >>> > > > > > > > > >>> > On Thu, Jan 21, 2021 at 3:25 AM Chesnay Schepler < > > > > > > [hidden email] > > > > > > > >>> > <mailto:[hidden email]>> wrote: > > > > > > > >>> > > > > > > > > >>> > Is there even a functional difference between specifying > > the > > > > > > > >>> > requirements for an SSG vs specifying the same > requirements > > > on > > > > > > a > > > > > > > >>> > single > > > > > > > >>> > operator within that group (ideally a colocation group to > > > avoid > > > > > > > this > > > > > > > >>> > whole hint business)? > > > > > > > >>> > > > > > > > > >>> > Wouldn't we get the best of both worlds in the latter > case? > > > > > > > >>> > > > > > > > > >>> > Users can take shortcuts to define shared requirements, > > > > > > > >>> > but refine them further as needed on a per-operator > basis, > > > > > > > >>> > without changing semantics of slotsharing groups > > > > > > > >>> > nor the runtime being locked into SSG-based requirements. > > > > > > > >>> > > > > > > > > >>> > (And before anyone argues what happens if slotsharing > > groups > > > > > > > >>> > change or > > > > > > > >>> > whatnot, that's a plain API issue that we could surely > > solve. > > > > > > (A > > > > > > > >>> > plain > > > > > > > >>> > iteration over slotsharing groups and therein contained > > > > > > operators > > > > > > > >>> > would > > > > > > > >>> > suffice)). > > > > > > > >>> > > > > > > > > >>> > On 1/20/2021 6:48 PM, Till Rohrmann wrote: > > > > > > > >>> > > Maybe a different minor idea: Would it be possible to > > treat > > > > > > > the SSG > > > > > > > >>> > > resource requirements as a hint for the runtime similar > > to > > > > > > how > > > > > > > >>> > slot sharing > > > > > > > >>> > > groups are designed at the moment? 
Meaning that we > don't > > > give > > > > > > > >>> > the guarantee > > > > > > > >>> > > that Flink will always deploy this set of tasks > together > > no > > > > > > > >>> > matter what > > > > > > > >>> > > comes. If, for example, the runtime can derive by some > > > means > > > > > > > the > > > > > > > >>> > resource > > > > > > > >>> > > requirements for each task based on the requirements > for > > > the > > > > > > > >>> > SSG, this > > > > > > > >>> > > could be possible. One easy strategy would be to give > > every > > > > > > > task > > > > > > > >>> > the same > > > > > > > >>> > > resources as the whole slot sharing group. Another one > > > could > > > > > > be > > > > > > > >>> > > distributing the resources equally among the tasks. > This > > > does > > > > > > > >>> > not even have > > > > > > > >>> > > to be implemented but we would give ourselves the > freedom > > > to > > > > > > > change > > > > > > > >>> > > scheduling if need should arise. > > > > > > > >>> > > > > > > > > > >>> > > Cheers, > > > > > > > >>> > > Till > > > > > > > >>> > > > > > > > > > >>> > > On Wed, Jan 20, 2021 at 7:04 AM Yangze Guo < > > > > > > [hidden email] > > > > > > > >>> > <mailto:[hidden email]>> wrote: > > > > > > > >>> > > > > > > > > > >>> > >> Thanks for the responses, Till and Xintong. > > > > > > > >>> > >> > > > > > > > >>> > >> I second Xintong's comment that SSG-based runtime > > > interface > > > > > > > >>> > will give > > > > > > > >>> > >> us the flexibility to achieve op/task-based approach. > > > That's > > > > > > > one of > > > > > > > >>> > >> the most important reasons for our design choice. > > > > > > > >>> > >> > > > > > > > >>> > >> Some cents regarding the default operator resource: > > > > > > > >>> > >> - It might be good for the scenario of DataStream > jobs. > > > > > > > >>> > >> ** For light-weight operators, the accumulative > > > > > > > >>> > configuration error > > > > > > > >>> > >> will not be significant. 
Then, the resource of a task > > used > > > > > > is > > > > > > > >>> > >> proportional to the number of operators it contains. > > > > > > > >>> > >> ** For heavy operators like join and window or > operators > > > > > > > >>> > using the > > > > > > > >>> > >> external resources, user will turn to the fine-grained > > > > > > > resource > > > > > > > >>> > >> configuration. > > > > > > > >>> > >> - It can increase the stability for the standalone > > cluster > > > > > > > >>> > where task > > > > > > > >>> > >> executors registered are heterogeneous(with different > > > > > > default > > > > > > > slot > > > > > > > >>> > >> resources). > > > > > > > >>> > >> - It might not be good for SQL users. The operators > that > > > SQL > > > > > > > >>> > will be > > > > > > > >>> > >> transferred to is a black box to the user. We also do > > not > > > > > > > guarantee > > > > > > > >>> > >> the cross-version of consistency of the transformation > > so > > > > > > far. > > > > > > > >>> > >> > > > > > > > >>> > >> I think it can be treated as a follow-up work when the > > > > > > > fine-grained > > > > > > > >>> > >> resource management is end-to-end ready. > > > > > > > >>> > >> > > > > > > > >>> > >> Best, > > > > > > > >>> > >> Yangze Guo > > > > > > > >>> > >> > > > > > > > >>> > >> > > > > > > > >>> > >> On Wed, Jan 20, 2021 at 11:16 AM Xintong Song > > > > > > > >>> > <[hidden email] <mailto:[hidden email]>> > > > > > > > >>> > >> wrote: > > > > > > > >>> > >>> Thanks for the feedback, Till. > > > > > > > >>> > >>> > > > > > > > >>> > >>> ## I feel that what you proposed (operator-based + > > > default > > > > > > > >>> > value) might > > > > > > > >>> > >> be > > > > > > > >>> > >>> subsumed by the SSG-based approach. > > > > > > > >>> > >>> Thinking of op_1 -> op_2, there are the following 4 > > > cases, > > > > > > > >>> > categorized by > > > > > > > >>> > >>> whether the resource requirements are known to the > > users. 
> > > > > > > >>> > >>>
> > > > > > > >>> > >>> 1. *Both known.* As previously mentioned, there's no reason to put
> > > > > > > >>> > >>>    multiple operators whose individual resource requirements are
> > > > > > > >>> > >>>    already known into the same group in fine-grained resource
> > > > > > > >>> > >>>    management. And if op_1 and op_2 are in different groups, there
> > > > > > > >>> > >>>    should be no problem switching data exchange mode from pipelined to
> > > > > > > >>> > >>>    blocking. This is equivalent to specifying operator resource
> > > > > > > >>> > >>>    requirements in your proposal.
> > > > > > > >>> > >>> 2. *op_1 known, op_2 unknown.* Similar to 1), except that op_2 is in
> > > > > > > >>> > >>>    a SSG whose resource is not specified thus would have the default
> > > > > > > >>> > >>>    slot resource. This is equivalent to having default operator
> > > > > > > >>> > >>>    resources in your proposal.
> > > > > > > >>> > >>> 3. *Both unknown.* The user can either set op_1 and op_2 to the same
> > > > > > > >>> > >>>    SSG or separate SSGs.
> > > > > > > >>> > >>>    - If op_1 and op_2 are in the same SSG, it will be equivalent to the
> > > > > > > >>> > >>>      coarse-grained resource management, where op_1 and op_2 share a
> > > > > > > >>> > >>>      default size slot no matter which data exchange mode is used.
> > > > > > > >>> > >>>    - If op_1 and op_2 are in different SSGs, then each of them will use
> > > > > > > >>> > >>>      a default size slot. This is equivalent to setting them with
> > > > > > > >>> > >>>      default operator resources in your proposal.
> > > > > > > >>> > >>> 4. *Total (pipeline) or max (blocking) of op_1 and op_2 is known.*
> > > > > > > >>> > >>>    - It is possible that the user learns the total / max resource
> > > > > > > >>> > >>>      requirement from executing and monitoring the job, while not
> > > > > > > >>> > >>>      being aware of individual operator requirements.
> > > > > > > >>> > >>>    - I believe this is the case your proposal does not cover. And TBH,
> > > > > > > >>> > >>>      this is probably how most users learn the resource requirements,
> > > > > > > >>> > >>>      according to my experiences.
> > > > > > > >>> > >>>    - In this case, the user might need to specify different resources
> > > > > > > >>> > >>>      if he wants to switch the execution mode, which should not be
> > > > > > > >>> > >>>      worse than not being able to use fine-grained resource management.
> > > > > > > >>> > >>>
> > > > > > > >>> > >>> ## An additional idea inspired by your proposal.
> > > > > > > >>> > >>> We may provide multiple options for deciding resources for SSGs whose
> > > > > > > >>> > >>> requirement is not specified, if needed.
> > > > > > > >>> > >>>
> > > > > > > >>> > >>> - Default slot resource (current design)
> > > > > > > >>> > >>> - Default operator resource times number of operators (equivalent to
> > > > > > > >>> > >>>   your proposal)
> > > > > > > >>> > >>>
> > > > > > > >>> > >>> ## Exposing internal runtime strategies
> > > > > > > >>> > >>> Theoretically, yes.
Tying to the SSGs, the resource > > > > > > > >>> > requirements might be > > > > > > > >>> > >>> affected if how SSGs are internally handled changes > in > > > > > > > future. > > > > > > > >>> > >> Practically, > > > > > > > >>> > >>> I do not concretely see at the moment what kind of > > > changes > > > > > > we > > > > > > > >>> > may want in > > > > > > > >>> > >>> future that might conflict with this FLIP proposal, > as > > > the > > > > > > > >>> > question of > > > > > > > >>> > >>> switching data exchange mode answered above. I'd > > suggest > > > to > > > > > > > >>> > not give up > > > > > > > >>> > >> the > > > > > > > >>> > >>> user friendliness we may gain now for the future > > problems > > > > > > > that > > > > > > > >>> > may or may > > > > > > > >>> > >>> not exist. > > > > > > > >>> > >>> > > > > > > > >>> > >>> Moreover, the SSG-based approach has the flexibility > to > > > > > > > >>> > achieve the > > > > > > > >>> > >>> equivalent behavior as the operator-based approach, > if > > we > > > > > > > set each > > > > > > > >>> > >> operator > > > > > > > >>> > >>> (or task) to a separate SSG. We can even provide a > > > shortcut > > > > > > > >>> > option to > > > > > > > >>> > >>> automatically do that for users, if needed. > > > > > > > >>> > >>> > > > > > > > >>> > >>> > > > > > > > >>> > >>> Thank you~ > > > > > > > >>> > >>> > > > > > > > >>> > >>> Xintong Song > > > > > > > >>> > >>> > > > > > > > >>> > >>> > > > > > > > >>> > >>> > > > > > > > >>> > >>> On Tue, Jan 19, 2021 at 11:48 PM Till Rohrmann > > > > > > > >>> > <[hidden email] <mailto:[hidden email]>> > > > > > > > >>> > >> wrote: > > > > > > > >>> > >>>> Thanks for the responses Xintong and Stephan, > > > > > > > >>> > >>>> > > > > > > > >>> > >>>> I agree that being able to define the resource > > > > > > requirements > > > > > > > for a > > > > > > > >>> > >> group of > > > > > > > >>> > >>>> operators is more user friendly. 
However, my concern > > is > > > > > > that > > > > > > > >>> > we are > > > > > > > >>> > >>>> exposing thereby internal runtime strategies which > > might > > > > > > > >>> > limit our > > > > > > > >>> > >>>> flexibility to execute a given job. Moreover, the > > > > > > semantics > > > > > > > of > > > > > > > >>> > >> configuring > > > > > > > >>> > >>>> resource requirements for SSGs could break if > > switching > > > > > > from > > > > > > > >>> > streaming > > > > > > > >>> > >> to > > > > > > > >>> > >>>> batch execution. If one defines the resource > > > requirements > > > > > > > for > > > > > > > >>> > op_1 -> > > > > > > > >>> > >> op_2 > > > > > > > >>> > >>>> which run in pipelined mode when using the streaming > > > > > > > >>> > execution, then > > > > > > > >>> > >> how do > > > > > > > >>> > >>>> we interpret these requirements when op_1 -> op_2 > are > > > > > > > >>> > executed with a > > > > > > > >>> > >>>> blocking data exchange in batch execution mode? > > > > > > > Consequently, > > > > > > > >>> > I am > > > > > > > >>> > >> still > > > > > > > >>> > >>>> leaning towards Stephan's proposal to set the > resource > > > > > > > >>> > requirements per > > > > > > > >>> > >>>> operator. > > > > > > > >>> > >>>> > > > > > > > >>> > >>>> Maybe the following proposal makes the configuration > > > > > > easier: > > > > > > > >>> > If the > > > > > > > >>> > >> user > > > > > > > >>> > >>>> wants to use fine-grained resource requirements, > then > > > she > > > > > > > >>> > needs to > > > > > > > >>> > >> specify > > > > > > > >>> > >>>> the default size which is used for operators which > > have > > > no > > > > > > > >>> > explicit > > > > > > > >>> > >>>> resource annotation. 
If this holds true, then every > > > > > > operator > > > > > > > >>> > would > > > > > > > >>> > >> have a > > > > > > > >>> > >>>> resource requirement and the system can try to > execute > > > the > > > > > > > >>> > operators > > > > > > > >>> > >> in the > > > > > > > >>> > >>>> best possible manner w/o being constrained by how > the > > > user > > > > > > > >>> > set the SSG > > > > > > > >>> > >>>> requirements. > > > > > > > >>> > >>>> > > > > > > > >>> > >>>> Cheers, > > > > > > > >>> > >>>> Till > > > > > > > >>> > >>>> > > > > > > > >>> > >>>> On Tue, Jan 19, 2021 at 9:09 AM Xintong Song > > > > > > > >>> > <[hidden email] <mailto:[hidden email]>> > > > > > > > >>> > >>>> wrote: > > > > > > > >>> > >>>> > > > > > > > >>> > >>>>> Thanks for the feedback, Stephan. > > > > > > > >>> > >>>>> > > > > > > > >>> > >>>>> Actually, your proposal has also come to my mind at > > > some > > > > > > > >>> > point. And I > > > > > > > >>> > >>>> have > > > > > > > >>> > >>>>> some concerns about it. > > > > > > > >>> > >>>>> > > > > > > > >>> > >>>>> > > > > > > > >>> > >>>>> 1. It does not give users the same control as the > > > > > > SSG-based > > > > > > > >>> > approach. > > > > > > > >>> > >>>>> > > > > > > > >>> > >>>>> > > > > > > > >>> > >>>>> While both approaches do not require specifying for > > > each > > > > > > > >>> > operator, > > > > > > > >>> > >>>>> SSG-based approach supports the semantic that "some > > > > > > > operators > > > > > > > >>> > >> together > > > > > > > >>> > >>>> use > > > > > > > >>> > >>>>> this much resource" while the operator-based > approach > > > > > > > doesn't. 
Think of a long pipeline with m operators (o_1, o_2, ..., o_m), and at some point there's an agg o_n (1 < n < m) which significantly reduces the data amount. One can separate the pipeline into 2 groups, SSG_1 (o_1, ..., o_n) and SSG_2 (o_n+1, ..., o_m), so that configuring much higher parallelisms for operators in SSG_1 than for operators in SSG_2 won't lead to too much wasting of resources. If the two SSGs end up needing different resources, with the SSG-based approach one can directly specify resources for the two groups. However, with the operator-based approach, the user will have to specify resources for each operator in one of the two groups, and tune the default slot resource via configurations to fit the other group.

2. It increases the chance of breaking operator chains.

Setting chainable operators into different slot sharing groups will prevent them from being chained.
In the current implementation, downstream operators, if their SSG is not explicitly specified, will be set to the same group as the chainable upstream operators (unless there are multiple upstream operators in different groups), to reduce the chance of breaking chains.

Thinking of chainable operators o_1 -> o_2 -> o_3 -> o_4, if we decide SSGs based on whether a resource is specified, we will easily get groups like (o_1, o_3) & (o_2, o_4), where none of the operators can be chained. This is also possible for the SSG-based approach, but I believe the chance is much smaller because there's no strong reason for users to specify the groups with alternate operators like that. We are more likely to get groups like (o_1, o_2) & (o_3, o_4), where the chain breaks only between o_2 and o_3.

3. It complicates the system by having two different mechanisms for sharing managed memory in a slot.
- In FLIP-141, we introduced the intra-slot managed memory sharing mechanism, where managed memory is first distributed according to the consumer type, then further distributed across operators of that consumer type.

- With the operator-based approach, the managed memory size specified for an operator should account for all the consumer types of that operator. That means the managed memory is first distributed across operators, then distributed to different consumer types of each operator.

Unfortunately, the different order of the two calculation steps can lead to different results. To be specific, the semantic of the configuration option `consumer-weights` changes (within a slot vs. within an operator).
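To make the ordering issue in (3) concrete, here is a minimal sketch. It is a simplified model and not Flink's actual implementation: the consumer type names, the 70/30 weights, the 100 MB slot, the 50 MB per-operator sizes, and the even split across operators of the same consumer type are all assumptions chosen only for illustration.

```python
# Hypothetical weights, in the spirit of the `consumer-weights` option.
CONSUMER_WEIGHTS = {"DATAPROC": 70, "PYTHON": 30}

# Assumed operators and the consumer types each of them uses.
OPERATOR_CONSUMERS = {"op_a": ["DATAPROC"], "op_b": ["DATAPROC", "PYTHON"]}


def split_by_weight(total_mb, consumers):
    """Split total_mb across the given consumer types in proportion to their weights."""
    weight_sum = sum(CONSUMER_WEIGHTS[c] for c in consumers)
    return {c: total_mb * CONSUMER_WEIGHTS[c] / weight_sum for c in consumers}


def slot_first(slot_mb):
    """Weights apply within the slot; each type's share is then split
    evenly (an assumption) across the operators using that type."""
    used = sorted({c for cs in OPERATOR_CONSUMERS.values() for c in cs})
    per_type = split_by_weight(slot_mb, used)
    result = {op: {} for op in OPERATOR_CONSUMERS}
    for c in used:
        ops = [op for op, cs in OPERATOR_CONSUMERS.items() if c in cs]
        for op in ops:
            result[op][c] = per_type[c] / len(ops)
    return result


def operator_first(per_operator_mb):
    """Each operator has its own size; weights apply within the operator."""
    return {
        op: split_by_weight(mb, OPERATOR_CONSUMERS[op])
        for op, mb in per_operator_mb.items()
    }


slot_result = slot_first(100)
operator_result = operator_first({"op_a": 50, "op_b": 50})
print("slot-first:    ", slot_result)
print("operator-first:", operator_result)
```

With the same 100 MB in total and the same weights, op_b's PYTHON share is 30 MB in the slot-first order but 15 MB in the operator-first order, and op_a's DATAPROC share is 35 MB vs. 50 MB. The weights are identical, yet they mean different things depending on the scope they are applied to.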
To sum up things:

While (3) might be a bit more implementation related, I think (1) and (2) somehow suggest that the price for the proposed approach to avoid specifying resources for every operator is that it's not as independent from operator chaining and slot sharing as the operator-based approach discussed in the FLIP.

Thank you~

Xintong Song

On Tue, Jan 19, 2021 at 4:29 AM Stephan Ewen <[hidden email]> wrote:

Thanks a lot, Yangze and Xintong for this FLIP.

I want to say, first of all, that this is super well written. And the points that the FLIP makes about how to expose the configuration to users are exactly the right thing to figure out first.
So good job here!

About how to let users specify the resource profiles.
If I can sum the FLIP and previous discussion up in my own words, the problem is the following:

> Operator-level specification is the simplest and cleanest approach, because it avoids mixing operator configuration (resource) and scheduling. No matter what other parameters change (chaining, slot sharing, switching pipelined and blocking shuffles), the resource profiles stay the same.
> But it would require that a user specifies resources on all operators, which makes it hard to use. That's why the FLIP suggests going with specifying resources on a Sharing-Group.

I think both thoughts are important, so can we find a solution where the Resource Profiles are specified on an Operator, but we still avoid that we need to specify a resource profile on every operator?

What do you think about something like the following:
- Resource Profiles are specified on an operator level.
- Not all operators need profiles
- All Operators without a Resource Profile end up in the default slot sharing group with a default profile (will get a default slot).
- All Operators with a Resource Profile will go into another slot sharing group (the resource-specified-group).
- Users can define different slot sharing groups for operators like they do now, with the exception that you cannot mix operators that have a resource profile and operators that have no resource profile.
- The default case where no operator has a resource profile is just a special case of this model
- The chaining logic sums up the profiles per operator, like it does now, and the scheduler sums up the profiles of the tasks that it schedules together.

There is another question about reactive scaling raised in the FLIP. I need to think a bit about that. That is indeed a bit more tricky once we have slots of different sizes.
It is not clear then which of the different slot requests the ResourceManager should fulfill when new resources (TMs) show up, or how the JobManager redistributes the slot resources when resources (TMs) disappear.
This question is pretty orthogonal, though, to the "how to specify the resources".

Best,
Stephan

On Fri, Jan 8, 2021 at 5:14 AM Xintong Song <[hidden email]> wrote:

Thanks for drafting the FLIP and driving the discussion, Yangze.
And thanks for the feedback, Till and Chesnay.

@Till,

I agree that specifying requirements for SSGs means that SSGs need to be supported in fine-grained resource management, otherwise each operator might use as many resources as the whole group. However, I cannot think of a strong reason for not supporting SSGs in fine-grained resource management.
> Interestingly, if all operators have their resources properly specified, then slot sharing is no longer needed because Flink could slice off the appropriately sized slots for every Task individually.
>
> So for example, if we have a job consisting of two operators op_1 and op_2 where each op needs 100 MB of memory, we would then say that the slot sharing group needs 200 MB of memory to run. If we have a cluster with 2 TMs with one slot of 100 MB each, then the system cannot run this job. If the resources were specified on an operator level, then the system could still make the decision to deploy op_1 to TM_1 and op_2 to TM_2.

Couldn't agree more that if all operators' requirements are properly specified, slot sharing should be no longer needed. I think this exactly disproves the example.
If we already know op_1 and op_2 each need 100 MB of memory, why would we put them in the same group? If they are in separate groups, with the proposed approach the system can freely deploy them to either a 200 MB TM or two 100 MB TMs.

Moreover, the precondition for not needing slot sharing is having resource requirements properly specified for all operators. This is not always possible, and usually requires tremendous effort. One of the benefits of SSG-based requirements is that it allows the user to freely decide the granularity, and thus the effort they want to pay. I would consider an SSG in fine-grained resource management as a group of operators that the user would like to specify the total resource for.
There can be only one group in the job, 2~3 groups dividing the job into a few major parts, or as many groups as the number of tasks/operators, depending on how fine-grained the user is able to specify the resources.

Having to support SSGs might be a constraint. But given that all the current scheduler implementations already support SSGs, I tend to think of that as an acceptable price for the above discussed usability and flexibility.

@Chesnay

> Will declaring them on slot sharing groups not also waste resources if the parallelism of operators within that group are different?

Yes. It's a trade-off between usability and resource utilization. To avoid such wasting, the user can define more groups, so that each group contains fewer operators and the chance of having operators with different parallelism will be reduced.
The price is to have more resource requirements to specify.

> It also seems like quite a hassle for users having to recalculate the resource requirements if they change the slot sharing.
> I'd think that it's not really workable for users that create a set of re-usable operators which are mixed and matched in their applications; managing the resources requirements in such a setting would be a nightmare, and in the end would require operator-level requirements any way.
> In that sense, I'm not even sure whether it really increases usability.

- As mentioned in my reply to Till's comment, there's no reason to put multiple operators whose individual resource requirements are already known into the same group in fine-grained resource management.
- Even if an operator implementation is reused for multiple applications, it does not guarantee the same resource requirements.
During our years of practice at Alibaba, with per-operator requirements specified for Blink's fine-grained resource management, very few users (including our specialists who are dedicated to supporting Blink users) are experienced enough to accurately predict/estimate the operator resource requirements. Most people rely on the execution-time metrics (throughput, delay, cpu load, memory usage, GC pressure, etc.) to improve the specification.

To sum up:
If the user is capable of providing proper resource requirements for every operator, that's definitely a good thing and we would not need to rely on the SSGs. However, that shouldn't be a *must* for the fine-grained resource management to work.
For those users who are capable and do not like having to set each operator to a separate SSG, I would be ok to have both SSG-based and operator-based runtime