[DISCUSS] FLIP-53: Fine Grained Resource Management


[DISCUSS] FLIP-53: Fine Grained Resource Management

Xintong Song
Hi everyone,

We would like to start a discussion thread on "FLIP-53: Fine Grained
Resource Management"[1], where we propose how to improve Flink resource
management and scheduling.

This FLIP mainly discusses the following issues.

   - How to support tasks with fine grained resource requirements.
   - How to unify resource management for jobs with / without fine grained
   resource requirements.
   - How to unify resource management for streaming / batch jobs.

Key changes proposed in the FLIP are as follows.

   - Unify memory management for operators with / without fine grained
   resource requirements by applying a fraction-based quota mechanism (see
   the sketch below).
   - Unify resource scheduling for streaming and batch jobs by setting slot
   sharing groups for pipelined regions during the compilation stage.
   - Dynamically allocate slots from task executors' available resources.
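
To make the fraction-based quota idea a bit more concrete, here is a
rough, purely illustrative sketch in plain Java (all names and numbers are
made up and not taken from the FLIP): operators with specified managed
memory requirements in one slot sharing group get fractions proportional
to those requirements, and each operator would later claim
"fraction * slot managed memory" at runtime.

import java.util.HashMap;
import java.util.Map;

public class FractionQuotaSketch {

    // Turns specified managed memory requirements (operator name -> bytes)
    // into fractions of the slot's managed memory, proportional to each
    // operator's share of the total.
    static Map<String, Double> computeFractions(Map<String, Long> specifiedBytes) {
        long total = specifiedBytes.values().stream().mapToLong(Long::longValue).sum();
        Map<String, Double> fractions = new HashMap<>();
        for (Map.Entry<String, Long> entry : specifiedBytes.entrySet()) {
            fractions.put(entry.getKey(), total == 0 ? 0.0 : (double) entry.getValue() / total);
        }
        return fractions;
    }

    public static void main(String[] args) {
        Map<String, Long> requirements = new HashMap<>();
        requirements.put("hash-join", 256L << 20); // 256 MB
        requirements.put("sorter", 128L << 20);    // 128 MB
        // hash-join -> 2/3 of the slot's managed memory, sorter -> 1/3
        System.out.println(computeFractions(requirements));
    }
}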

Please find more details in the FLIP wiki document [1]. Looking forward to
your feedback.

Thank you~

Xintong Song


[1]
https://cwiki.apache.org/confluence/display/FLINK/FLIP-53%3A+Fine+Grained+Resource+Management

Re: [DISCUSS] FLIP-53: Fine Grained Resource Management

Yangze Guo
Hi, Xintong

Thanks for proposing this FLIP. The general design looks good to me, +1
for this feature.

Since slots in the same task executor can have different resource
profiles, we will run into a resource fragmentation problem. Consider this
case:
 - request A wants 1G of memory, while requests B & C each want 0.5G
 - there are two task executors T1 & T2 with 1G and 0.5G of free memory
respectively
If B comes first and we cut a slot from T1 for B, A must wait for
resources to be freed by other tasks. But A could have been scheduled
immediately if we had cut the slot for B from T2.

The logic of findMatchingSlot now becomes finding a task executor that has
enough resources and then cutting a slot from it. The current method can
be seen as a "first-fit" strategy, which works well in general but is not
always optimal.

Actually, this problem can be abstracted as the "Bin Packing Problem" [1].
Here are some common approximation algorithms:
- First fit
- Next fit
- Best fit

But it becomes a multi-dimensional bin packing problem if we take CPU into
account, and it is hard to define which executor is the best fit. Some
research has addressed this problem, such as Tetris [2].
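
Just to illustrate the difference, here is a tiny, self-contained sketch of
the two placement strategies with the T1/T2 example above (plain Java, not
Flink code; the Executor class is only a stand-in for a task executor's
free memory):

import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.Optional;

public class PlacementStrategies {

    static final class Executor {
        final String id;
        final long freeMemoryMb;
        Executor(String id, long freeMemoryMb) {
            this.id = id;
            this.freeMemoryMb = freeMemoryMb;
        }
    }

    // First fit: take the first executor with enough free memory.
    static Optional<Executor> firstFit(List<Executor> executors, long requestMb) {
        return executors.stream()
                .filter(e -> e.freeMemoryMb >= requestMb)
                .findFirst();
    }

    // Best fit: take the executor that leaves the least free memory over.
    static Optional<Executor> bestFit(List<Executor> executors, long requestMb) {
        return executors.stream()
                .filter(e -> e.freeMemoryMb >= requestMb)
                .min(Comparator.comparingLong((Executor e) -> e.freeMemoryMb - requestMb));
    }

    public static void main(String[] args) {
        List<Executor> executors =
                Arrays.asList(new Executor("T1", 1024), new Executor("T2", 512));
        // Request B (0.5G): first fit cuts the slot from T1, best fit from T2,
        // leaving T1's full 1G available for request A.
        System.out.println(firstFit(executors, 512).get().id); // T1
        System.out.println(bestFit(executors, 512).get().id);  // T2
    }
}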

Here are some thoughts on it:
1. We could make the strategy of finding a matching task executor
pluggable, letting users configure the best strategy for their scenario.
2. We could support a batch request interface in the RM, because more
information gives us more opportunities to optimize. If we know about A, B
and C at the same time, we can always make the best decision.

[1] http://www.or.deis.unibo.it/kp/Chapter8.pdf
[2] https://www.cs.cmu.edu/~xia/resources/Documents/grandl_sigcomm14.pdf

Best,
Yangze Guo


Re: [DISCUSS] FLIP-53: Fine Grained Resource Management

Till Rohrmann
Hi Xintong,

thanks for drafting this FLIP. I think your proposal helps to execute
batch jobs more efficiently. Moreover, it enables the proper integration
of the Blink planner, which is very important as well.

Overall, the FLIP looks good to me. I was wondering whether it wouldn't
make sense to actually split it up into two FLIPs: Operator resource
management and dynamic slot allocation. I think these two FLIPs could be
seen as orthogonal and it would decrease the scope of each individual FLIP.

Some smaller comments:

- I'm not sure whether we should pass in the default slot size via an
environment variable. Until we have unified the way Flink components are
configured [1], I think it would be better to pass it in as part of the
configuration.
- I would avoid returning a null value from
TaskExecutorGateway#requestResource if the request cannot be fulfilled.
Either we should introduce an explicit return value indicating this or
throw an exception (see the sketch below).
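
Purely as an illustration of the second point (hypothetical types only,
not the actual TaskExecutorGateway signature), the request could for
example return a future that is completed exceptionally with a dedicated
exception type instead of yielding null:

import java.util.concurrent.CompletableFuture;

interface ResourceRequestSketch {

    // Placeholder for whatever descriptor the allocated slot would carry.
    final class SlotAllocation {}

    // Dedicated exception signalling that the request cannot be fulfilled.
    final class ResourceUnfulfillableException extends Exception {
        ResourceUnfulfillableException(String message) {
            super(message);
        }
    }

    // Completes exceptionally with ResourceUnfulfillableException instead of
    // ever returning null.
    CompletableFuture<SlotAllocation> requestResource(long memoryMb, double cpuCores);
}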

Concerning Yangze's comments: I think you are right that it would be
helpful to make the selection strategy pluggable. Also batching slot
requests to the RM could be a good optimization. For the sake of keeping
the scope of this FLIP smaller I would try to tackle these things after the
initial version has been completed (without spoiling these optimization
opportunities). In particular batching the slot requests depends on the
current scheduler refactoring and could also be realized on the RM side
only.

[1]
https://cwiki.apache.org/confluence/display/FLINK/FLIP-54%3A+Evolve+ConfigOption+and+Configuration

Cheers,
Till

Re: [DISCUSS] FLIP-53: Fine Grained Resource Management

Xintong Song
Thanks for the feedback, Yangze and Till.

Yangze,

I agree with you that we should make the scheduling strategy pluggable and
optimize it to reduce the memory fragmentation problem, and thanks for the
inputs on potential algorithmic solutions. However, I'm in favor of
keeping this FLIP focused on the overall mechanism design rather than
strategies. Solving the fragmentation issue should be considered an
optimization, and I agree with Till that we should probably tackle it
afterwards.

Till,

- Regarding splitting the FLIP, I think it makes sense. Operator resource
management and dynamic slot allocation do not depend much on each other.

- Regarding the default slot size, I think this is similar to FLIP-49 [1],
where we want all the derivation to happen in one place. I think it would
be nice to pass the default slot size into the task executor in the same
way that we pass in the memory pool sizes in FLIP-49 [1].
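
As a rough sketch of what "deriving in one place" could look like (plain
Java, no Flink API; the numbers and the idea of dividing by the slot count
are only an assumption for illustration), the default slot size would be
computed once and then handed to the task executor together with the other
derived memory sizes:

public class DefaultSlotSizeSketch {

    // Hypothetical derivation: split the task executor's managed memory
    // evenly across its configured slots.
    static long defaultSlotManagedMemoryBytes(long totalManagedMemoryBytes, int numberOfSlots) {
        return totalManagedMemoryBytes / numberOfSlots;
    }

    public static void main(String[] args) {
        long totalManaged = 4L << 30; // 4 GB of managed memory on the task executor
        int slots = 8;                // e.g. taskmanager.numberOfTaskSlots
        // Computed once (e.g. where the FLIP-49 memory pools are derived) and
        // passed to the task executor as part of its configuration.
        System.out.println(defaultSlotManagedMemoryBytes(totalManaged, slots)); // 536870912
    }
}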

- Regarding the return value of TaskExecutorGateway#requestResource, I
think you're right. We should avoid using null as the return value. I
think we should probably throw an exception here.

Thank you~

Xintong Song


[1]
https://cwiki.apache.org/confluence/display/FLINK/FLIP-49%3A+Unified+Memory+Configuration+for+TaskExecutors


Re: [DISCUSS] FLIP-53: Fine Grained Resource Management

Yang Wang
Hi Xintong,


Thanks for your detailed proposal. I think many users are suffering from
wasted resources. The resource spec of all task managers is the same, and
we have to scale up all task managers to make the heaviest one more
stable. So we will benefit a lot from fine grained resource management. We
could get better resource utilization and stability.


Just to share some thoughts.



   1. How do we calculate the resource specification of TaskManagers? Do
   they all have the same resource spec calculated based on the
   configuration? I think we would still waste resources in that
   situation. Or could we start TaskManagers with different specs?
   2. If a slot is released and returned to the SlotPool, can it be reused
   by another SlotRequest whose requested resources are smaller than the
   slot?
      - If yes, what happens to the available resources in the TaskManager?
      - What is the SlotStatus of the cached slot in the SlotPool? Is the
      AllocationId null?
   3. In a session cluster, some jobs are configured with operator
   resources, while other jobs use UNKNOWN. How do we deal with this
   situation?



Best,
Yang


Re: [DISCUSS] FLIP-53: Fine Grained Resource Management

Xintong Song
Thanks for the comments, Yang.

Regarding your questions:

>    1. How do we calculate the resource specification of TaskManagers? Do
>    they all have the same resource spec calculated based on the
>    configuration? I think we would still waste resources in that
>    situation. Or could we start TaskManagers with different specs?
>
I agree with you that we can further improve resource utilization by
customizing task executors with different resource specifications.
However, I'm in favor of limiting the scope of this FLIP and leaving that
as a future optimization. The plan for that part is to move the logic of
deciding task executor specifications into the slot manager and make the
slot manager pluggable, so that different slot manager plugins can have
different logic for deciding the task executor specifications.


>    2. If a slot is released and returned to the SlotPool, can it be
>    reused by another SlotRequest whose requested resources are smaller
>    than the slot?
>
No, I think the slot pool should always return slots that do not exactly
match the pending requests, so that the resource manager can deal with the
extra resources.

>       - If yes, what happens to the available resources in the
>       TaskManager?
>       - What is the SlotStatus of the cached slot in the SlotPool? Is the
>       AllocationId null?
>
The allocation id does not change as long as the slot is not returned by
the job master, no matter whether it is occupied or available in the slot
pool. I think we have the same behavior currently: no matter how many
tasks the job master deploys into the slot, concurrently or sequentially,
it counts as one allocation from the cluster to the job until the slot is
freed by the job master.

>    3. In a session cluster, some jobs are configured with operator
>    resources, while other jobs use UNKNOWN. How do we deal with this
>    situation?

As long as we do not mix unknown / specified resource profiles within the
same job / slot, there shouldn't be a problem. The resource manager
converts unknown resource profiles in slot requests to the specified
default resource profile, so they can be dynamically allocated from the
task executors' available resources just like other slot requests with
specified resource profiles.
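
To illustrate that conversion (with a hypothetical Profile class, not the
actual ResourceProfile type), a minimal sketch could look like this:

public class ProfileConversionSketch {

    static final class Profile {
        static final Profile UNKNOWN = new Profile(-1, -1.0);
        final long memoryMb;
        final double cpuCores;
        Profile(long memoryMb, double cpuCores) {
            this.memoryMb = memoryMb;
            this.cpuCores = cpuCores;
        }
        @Override
        public String toString() {
            return memoryMb + " MB / " + cpuCores + " cores";
        }
    }

    // Requests with UNKNOWN profiles fall back to the configured default
    // slot profile; fully specified requests are kept as they are.
    static Profile effectiveProfile(Profile requested, Profile defaultSlotProfile) {
        return requested == Profile.UNKNOWN ? defaultSlotProfile : requested;
    }

    public static void main(String[] args) {
        Profile defaultSlot = new Profile(1024, 1.0);
        System.out.println(effectiveProfile(Profile.UNKNOWN, defaultSlot));       // 1024 MB / 1.0 cores
        System.out.println(effectiveProfile(new Profile(512, 0.5), defaultSlot)); // 512 MB / 0.5 cores
    }
}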

Thank you~

Xintong Song

Re: [DISCUSS] FLIP-53: Fine Grained Resource Management

Xintong Song
Hi everyone,

As Till suggested, the original "FLIP-53: Fine Grained Resource
Management" has been split into two separate FLIPs:

   - FLIP-53: Fine Grained Operator Resource Management [1]
   - FLIP-56: Dynamic Slot Allocation [2]

We'll continue using this discussion thread for FLIP-53. For FLIP-56, I
just started a new discussion thread [3].

Thank you~

Xintong Song


[1]
https://cwiki.apache.org/confluence/display/FLINK/FLIP-53%3A+Fine+Grained+Operator+Resource+Management

[2]
https://cwiki.apache.org/confluence/display/FLINK/FLIP-56%3A+Dynamic+Slot+Allocation

[3]
http://apache-flink-mailing-list-archive.1008284.n3.nabble.com/DISCUSS-FLIP-56-Dynamic-Slot-Allocation-td31960.html


Re: [DISCUSS] FLIP-53: Fine Grained Resource Management

Xintong Song
Added implementation steps for this FLIP on the wiki page [1].


Thank you~

Xintong Song


[1]
https://cwiki.apache.org/confluence/display/FLINK/FLIP-49%3A+Unified+Memory+Configuration+for+TaskExecutors

On Mon, Aug 19, 2019 at 10:29 PM Xintong Song <[hidden email]> wrote:

> Hi everyone,
>
> As Till suggested, the original "FLIP-53: Fine Grained Resource
> Management" splits into two separate FLIPs,
>
>    - FLIP-53: Fine Grained Operator Resource Management [1]
>    - FLIP-56: Dynamic Slot Allocation [2]
>
> We'll continue using this discussion thread for FLIP-53. For FLIP-56, I
> just started a new discussion thread [3].
>
> Thank you~
>
> Xintong Song
>
>
> [1]
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-53%3A+Fine+Grained+Operator+Resource+Management
>
> [2]
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-56%3A+Dynamic+Slot+Allocation
>
> [3]
> http://apache-flink-mailing-list-archive.1008284.n3.nabble.com/DISCUSS-FLIP-56-Dynamic-Slot-Allocation-td31960.html
>
> On Mon, Aug 19, 2019 at 2:55 PM Xintong Song <[hidden email]>
> wrote:
>
>> Thinks for the comments, Yang.
>>
>> Regarding your questions:
>>
>>    1. How to calculate the resource specification of TaskManagers? Do they
>>>    have them same resource spec calculated based on the configuration? I
>>> think
>>>    we still have wasted resources in this situation. Or we could start
>>>    TaskManagers with different spec.
>>>
>> I agree with you that we can further improve the resource utility by
>> customizing task executors with different resource specifications. However,
>> I'm in favor of limiting the scope of this FLIP and leave it as a future
>> optimization. The plan for that part is to move the logic of deciding task
>> executor specifications into the slot manager and make slot manager
>> pluggable, so inside the slot manager plugin we can have different logics
>> for deciding the task executor specifications.
>>
>>
>>>    2. If a slot is released and returned to SlotPool, does it could be
>>>    reused by other SlotRequest that the request resource is smaller than
>>> it?
>>>
>> No, I think slot pool should always return slots if they do not exactly
>> match the pending requests, so that resource manager can deal with the
>> extra resources.
>>
>>>       - If it is yes, what happens to the available resource in the
>>
>>       TaskManager.
>>>       - What is the SlotStatus of the cached slot in SlotPool? The
>>>       AllocationId is null?
>>>
>> The allocation id does not change as long as the slot is not returned
>> from the job master, no matter its occupied or available in the slot pool.
>> I think we have the same behavior currently. No matter how many tasks the
>> job master deploy into the slot, concurrently or sequentially, it is one
>> allocation from the cluster to the job until the slot is freed from the job
>> master.
>>
>>>    3. In a session cluster, some jobs are configured with operator
>>>    resources, meanwhile other jobs are using UNKNOWN. How to deal with
>>> this
>>>    situation?
>>
>> As long as we do not mix unknown / specified resource profiles within the
>> same job / slot, there shouldn't be a problem. Resource manager converts
>> unknown resource profiles in slot requests to specified default resource
>> profiles, so they can be dynamically allocated from task executors'
>> available resources just as other slot requests with specified resource
>> profiles.
>>
>> Thank you~
>>
>> Xintong Song
>>
>>
>>
>> On Mon, Aug 19, 2019 at 11:39 AM Yang Wang <[hidden email]> wrote:
>>
>>> Hi Xintong,
>>>
>>>
>>> Thanks for your detailed proposal. I think many users are suffering from
>>> waste of resources. The resource spec of all task managers are same and
>>> we
>>> have to increase all task managers to make the heavy one more stable. So
>>> we
>>> will benefit from the fine grained resource management a lot. We could
>>> get
>>> better resource utilization and stability.
>>>
>>>
>>> Just to share some thoughts.
>>>
>>>
>>>
>>>    1. How to calculate the resource specification of TaskManagers? Do
>>> they
>>>    have them same resource spec calculated based on the configuration? I
>>> think
>>>    we still have wasted resources in this situation. Or we could start
>>>    TaskManagers with different spec.
>>>    2. If a slot is released and returned to SlotPool, does it could be
>>>    reused by other SlotRequest that the request resource is smaller than
>>> it?
>>>       - If it is yes, what happens to the available resource in the
>>>       TaskManager.
>>>       - What is the SlotStatus of the cached slot in SlotPool? The
>>>       AllocationId is null?
>>>    3. In a session cluster, some jobs are configured with operator
>>>    resources, meanwhile other jobs are using UNKNOWN. How to deal with
>>> this
>>>    situation?
>>>
>>>
>>>
>>> Best,
>>> Yang
>>>
>>> Xintong Song <[hidden email]> 于2019年8月16日周五 下午8:57写道:
>>>
>>> > Thanks for the feedbacks, Yangze and Till.
>>> >
>>> > Yangze,
>>> >
>>> > I agree with you that we should make scheduling strategy pluggable and
>>> > optimize the strategy to reduce the memory fragmentation problem, and
>>> > thanks for the inputs on the potential algorithmic solutions. However,
>>> I'm
>>> > in favor of keep this FLIP focusing on the overall mechanism design
>>> rather
>>> > than strategies. Solving the fragmentation issue should be considered
>>> as an
>>> > optimization, and I agree with Till that we probably should tackle this
>>> > afterwards.
>>> >
>>> > Till,
>>> >
>>> > - Regarding splitting the FLIP, I think it makes sense. The operator
>>> > resource management and dynamic slot allocation do not have much
>>> dependency
>>> > on each other.
>>> >
>>> > - Regarding the default slot size, I think this is similar to FLIP-49
>>> [1]
>>> > where we want all the deriving happens at one place. I think it would
>>> be
>>> > nice to pass the default slot size into the task executor in the same
>>> way
>>> > that we pass in the memory pool sizes in FLIP-49 [1].
>>> >
>>> > - Regarding the return value of TaskExecutorGateway#requestResource, I
>>> > think you're right. We should avoid using null as the return value. I
>>> think
>>> > we probably should thrown an exception here.
>>> >
>>> > Thank you~
>>> >
>>> > Xintong Song
>>> >
>>> >
>>> > [1]
>>> >
>>> >
>>> https://cwiki.apache.org/confluence/display/FLINK/FLIP-49%3A+Unified+Memory+Configuration+for+TaskExecutors
>>> >
>>> > On Fri, Aug 16, 2019 at 2:18 PM Till Rohrmann <[hidden email]>
>>> > wrote:
>>> >
>>> > > Hi Xintong,
>>> > >
>>> > > thanks for drafting this FLIP. I think your proposal helps to
>>> improve the
>>> > > execution of batch jobs more efficiently. Moreover, it enables the
>>> proper
>>> > > integration of the Blink planner which is very important as well.
>>> > >
>>> > > Overall, the FLIP looks good to me. I was wondering whether it
>>> wouldn't
>>> > > make sense to actually split it up into two FLIPs: Operator resource
>>> > > management and dynamic slot allocation. I think these two FLIPs
>>> could be
>>> > > seen as orthogonal and it would decrease the scope of each individual
>>> > FLIP.
>>> > >
>>> > > Some smaller comments:
>>> > >
>>> > > - I'm not sure whether we should pass in the default slot size via an
>>> > > environment variable. Without having unified the way how Flink
>>> components
>>> > > are configured [1], I think it would be better to pass it in as part
>>> of
>>> > the
>>> > > configuration.
>>> > > - I would avoid returning a null value from
>>> > > TaskExecutorGateway#requestResource if it cannot be fulfilled.
>>> Either we
>>> > > should introduce an explicit return value saying this or throw an
>>> > > exception.
>>> > >
>>> > > Concerning Yangze's comments: I think you are right that it would be
>>> > > helpful to make the selection strategy pluggable. Also batching slot
>>> > > requests to the RM could be a good optimization. For the sake of
>>> keeping
>>> > > the scope of this FLIP smaller I would try to tackle these things
>>> after
>>> > the
>>> > > initial version has been completed (without spoiling these
>>> optimization
>>> > > opportunities). In particular batching the slot requests depends on
>>> the
>>> > > current scheduler refactoring and could also be realized on the RM
>>> side
>>> > > only.
>>> > >
>>> > > [1]
>>> > >
>>> > >
>>> >
>>> https://cwiki.apache.org/confluence/display/FLINK/FLIP-54%3A+Evolve+ConfigOption+and+Configuration
>>> > >
>>> > > Cheers,
>>> > > Till
>>> > >
>>> > >
>>> > >
>>> > > On Fri, Aug 16, 2019 at 11:11 AM Yangze Guo <[hidden email]>
>>> wrote:
>>> > >
>>> > > > Hi, Xintong
>>> > > >
>>> > > > Thanks to propose this FLIP. The general design looks good to me,
>>> +1
>>> > > > for this feature.
>>> > > >
>>> > > > Since slots in the same task executor could have different resource
>>> > > > profile, we will
>>> > > > meet resource fragment problem. Think about this case:
>>> > > >  - request A want 1G memory while request B & C want 0.5G memory
>>> > > >  - There are two task executors T1 & T2 with 1G and 0.5G free
>>> memory
>>> > > > respectively
>>> > > > If B come first and we cut a slot from T1 for B, A must wait for
>>> the
>>> > > > free resource from
>>> > > > other task. But A could have been scheduled immediately if we cut a
>>> > > > slot from T2 for B.
>>> > > >
>>> > > > The logic of findMatchingSlot now become finding a task executor
>>> which
>>> > > > has enough
>>> > > > resource and then cut a slot from it. Current method could be seen
>>> as
>>> > > > "First-fit strategy",
>>> > > > which works well in general but sometimes could not be the
>>> optimization
>>> > > > method.
>>> > > >
>>> > > > Actually, this problem could be abstracted as "Bin Packing
>>> Problem"[1].
>>> > > > Here are
>>> > > > some common approximate algorithms:
>>> > > > - First fit
>>> > > > - Next fit
>>> > > > - Best fit
>>> > > >
>>> > > > But it become multi-dimensional bin packing problem if we take CPU
>>> > > > into account. It hard
>>> > > > to define which one is best fit now. Some research addressed this
>>> > > > problem, such like Tetris[2].
>>> > > >
>>> > > > Here are some thinking about it:
>>> > > > 1. We could make the strategy of finding matching task executor
>>> > > > pluginable. Let user to config the
>>> > > > best strategy in their scenario.
>>> > > > 2. We could support batch request interface in RM, because we have
>>> > > > opportunities to optimize
>>> > > > if we have more information. If we know the A, B, C at the same
>>> time,
>>> > > > we could always make the best decision.
>>> > > >
>>> > > > [1] http://www.or.deis.unibo.it/kp/Chapter8.pdf
>>> > > > [2]
>>> > https://www.cs.cmu.edu/~xia/resources/Documents/grandl_sigcomm14.pdf
>>> > > >
>>> > > > Best,
>>> > > > Yangze Guo
>>> > > >
>>> > > > On Thu, Aug 15, 2019 at 10:40 PM Xintong Song <
>>> [hidden email]>
>>> > > > wrote:
>>> > > > >
>>> > > > > Hi everyone,
>>> > > > >
>>> > > > > We would like to start a discussion thread on "FLIP-53: Fine
>>> Grained
>>> > > > > Resource Management"[1], where we propose how to improve Flink
>>> > resource
>>> > > > > management and scheduling.
>>> > > > >
>>> > > > > This FLIP mainly discusses the following issues.
>>> > > > >
>>> > > > >    - How to support tasks with fine grained resource
>>> requirements.
>>> > > > >    - How to unify resource management for jobs with / without
>>> fine
>>> > > > grained
>>> > > > >    resource requirements.
>>> > > > >    - How to unify resource management for streaming / batch jobs.
>>> > > > >
>>> > > > > Key changes proposed in the FLIP are as follows.
>>> > > > >
>>> > > > >    - Unify memory management for operators with / without fine
>>> > grained
>>> > > > >    resource requirements by applying a fraction based quota
>>> > mechanism.
>>> > > > >    - Unify resource scheduling for streaming and batch jobs by
>>> > setting
>>> > > > slot
>>> > > > >    sharing groups for pipelined regions during compiling stage.
>>> > > > >    - Dynamically allocate slots from task executors' available
>>> > > resources.
>>> > > > >
>>> > > > > Please find more details in the FLIP wiki document [1]. Looking
>>> > forward
>>> > > > to
>>> > > > > your feedbacks.
>>> > > > >
>>> > > > > Thank you~
>>> > > > >
>>> > > > > Xintong Song
>>> > > > >
>>> > > > >
>>> > > > > [1]
>>> > > > >
>>> > > >
>>> > >
>>> >
>>> https://cwiki.apache.org/confluence/display/FLINK/FLIP-53%3A+Fine+Grained+Resource+Management
>>> > > >
>>> > >
>>> >
>>>
>>

Re: [DISCUSS] FLIP-53: Fine Grained Resource Management

Till Rohrmann
I guess there is a typo in the link; the correct link to FLIP-53 is
https://cwiki.apache.org/confluence/display/FLINK/FLIP-53%3A+Fine+Grained+Operator+Resource+Management

Cheers,
Till

On Tue, Aug 27, 2019 at 1:42 PM Xintong Song <[hidden email]> wrote:

> Added implementation steps for this FLIP on the wiki page [1].
>
>
> Thank you~
>
> Xintong Song
>
>
> [1]
>
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-49%3A+Unified+Memory+Configuration+for+TaskExecutors
>
> On Mon, Aug 19, 2019 at 10:29 PM Xintong Song <[hidden email]>
> wrote:
>
> > Hi everyone,
> >
> > As Till suggested, the original "FLIP-53: Fine Grained Resource
> > Management" splits into two separate FLIPs,
> >
> >    - FLIP-53: Fine Grained Operator Resource Management [1]
> >    - FLIP-56: Dynamic Slot Allocation [2]
> >
> > We'll continue using this discussion thread for FLIP-53. For FLIP-56, I
> > just started a new discussion thread [3].
> >
> > Thank you~
> >
> > Xintong Song
> >
> >
> > [1]
> >
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-53%3A+Fine+Grained+Operator+Resource+Management
> >
> > [2]
> >
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-56%3A+Dynamic+Slot+Allocation
> >
> > [3]
> >
> http://apache-flink-mailing-list-archive.1008284.n3.nabble.com/DISCUSS-FLIP-56-Dynamic-Slot-Allocation-td31960.html
> >
> > On Mon, Aug 19, 2019 at 2:55 PM Xintong Song <[hidden email]>
> > wrote:
> >
> >> Thinks for the comments, Yang.
> >>
> >> Regarding your questions:
> >>
> >>    1. How to calculate the resource specification of TaskManagers? Do
> they
> >>>    have them same resource spec calculated based on the configuration?
> I
> >>> think
> >>>    we still have wasted resources in this situation. Or we could start
> >>>    TaskManagers with different spec.
> >>>
> >> I agree with you that we can further improve the resource utility by
> >> customizing task executors with different resource specifications.
> However,
> >> I'm in favor of limiting the scope of this FLIP and leave it as a future
> >> optimization. The plan for that part is to move the logic of deciding
> task
> >> executor specifications into the slot manager and make slot manager
> >> pluggable, so inside the slot manager plugin we can have different
> logics
> >> for deciding the task executor specifications.
> >>
> >>
> >>>    2. If a slot is released and returned to SlotPool, does it could be
> >>>    reused by other SlotRequest that the request resource is smaller
> than
> >>> it?
> >>>
> >> No, I think slot pool should always return slots if they do not exactly
> >> match the pending requests, so that resource manager can deal with the
> >> extra resources.
> >>
> >>>       - If it is yes, what happens to the available resource in the
> >>
> >>       TaskManager.
> >>>       - What is the SlotStatus of the cached slot in SlotPool? The
> >>>       AllocationId is null?
> >>>
> >> The allocation id does not change as long as the slot is not returned
> >> from the job master, no matter its occupied or available in the slot
> pool.
> >> I think we have the same behavior currently. No matter how many tasks
> the
> >> job master deploy into the slot, concurrently or sequentially, it is one
> >> allocation from the cluster to the job until the slot is freed from the
> job
> >> master.
> >>
> >>>    3. In a session cluster, some jobs are configured with operator
> >>>    resources, meanwhile other jobs are using UNKNOWN. How to deal with
> >>> this
> >>>    situation?
> >>
> >> As long as we do not mix unknown / specified resource profiles within
> the
> >> same job / slot, there shouldn't be a problem. Resource manager converts
> >> unknown resource profiles in slot requests to specified default resource
> >> profiles, so they can be dynamically allocated from task executors'
> >> available resources just as other slot requests with specified resource
> >> profiles.
> >>
> >> Thank you~
> >>
> >> Xintong Song
> >>
> >>
> >>
> >> On Mon, Aug 19, 2019 at 11:39 AM Yang Wang <[hidden email]>
> wrote:
> >>
> >>> Hi Xintong,
> >>>
> >>>
> >>> Thanks for your detailed proposal. I think many users are suffering
> from
> >>> waste of resources. The resource spec of all task managers are same and
> >>> we
> >>> have to increase all task managers to make the heavy one more stable.
> So
> >>> we
> >>> will benefit from the fine grained resource management a lot. We could
> >>> get
> >>> better resource utilization and stability.
> >>>
> >>>
> >>> Just to share some thoughts.
> >>>
> >>>
> >>>
> >>>    1. How to calculate the resource specification of TaskManagers? Do
> >>> they
> >>>    have them same resource spec calculated based on the configuration?
> I
> >>> think
> >>>    we still have wasted resources in this situation. Or we could start
> >>>    TaskManagers with different spec.
> >>>    2. If a slot is released and returned to SlotPool, does it could be
> >>>    reused by other SlotRequest that the request resource is smaller
> than
> >>> it?
> >>>       - If it is yes, what happens to the available resource in the
> >>>       TaskManager.
> >>>       - What is the SlotStatus of the cached slot in SlotPool? The
> >>>       AllocationId is null?
> >>>    3. In a session cluster, some jobs are configured with operator
> >>>    resources, meanwhile other jobs are using UNKNOWN. How to deal with
> >>> this
> >>>    situation?
> >>>
> >>>
> >>>
> >>> Best,
> >>> Yang
> >>>
> >>> On Fri, Aug 16, 2019 at 8:57 PM Xintong Song <[hidden email]> wrote:
> >>>
> >>> > Thanks for the feedbacks, Yangze and Till.
> >>> >
> >>> > Yangze,
> >>> >
> >>> > I agree with you that we should make scheduling strategy pluggable
> and
> >>> > optimize the strategy to reduce the memory fragmentation problem, and
> >>> > thanks for the inputs on the potential algorithmic solutions.
> However,
> >>> I'm
> >>> > in favor of keep this FLIP focusing on the overall mechanism design
> >>> rather
> >>> > than strategies. Solving the fragmentation issue should be considered
> >>> as an
> >>> > optimization, and I agree with Till that we probably should tackle
> this
> >>> > afterwards.
> >>> >
> >>> > Till,
> >>> >
> >>> > - Regarding splitting the FLIP, I think it makes sense. The operator
> >>> > resource management and dynamic slot allocation do not have much
> >>> dependency
> >>> > on each other.
> >>> >
> >>> > - Regarding the default slot size, I think this is similar to FLIP-49
> >>> [1]
> >>> > where we want all the deriving happens at one place. I think it would
> >>> be
> >>> > nice to pass the default slot size into the task executor in the same
> >>> way
> >>> > that we pass in the memory pool sizes in FLIP-49 [1].
> >>> >
> >>> > - Regarding the return value of TaskExecutorGateway#requestResource,
> I
> >>> > think you're right. We should avoid using null as the return value. I
> >>> think
> >>> > we probably should thrown an exception here.
> >>> >
> >>> > Thank you~
> >>> >
> >>> > Xintong Song
> >>> >
> >>> >
> >>> > [1]
> >>> >
> >>> >
> >>>
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-49%3A+Unified+Memory+Configuration+for+TaskExecutors
> >>> >
> >>> > On Fri, Aug 16, 2019 at 2:18 PM Till Rohrmann <[hidden email]>
> >>> > wrote:
> >>> >
> >>> > > Hi Xintong,
> >>> > >
> >>> > > thanks for drafting this FLIP. I think your proposal helps to
> >>> improve the
> >>> > > execution of batch jobs more efficiently. Moreover, it enables the
> >>> proper
> >>> > > integration of the Blink planner which is very important as well.
> >>> > >
> >>> > > Overall, the FLIP looks good to me. I was wondering whether it
> >>> wouldn't
> >>> > > make sense to actually split it up into two FLIPs: Operator
> resource
> >>> > > management and dynamic slot allocation. I think these two FLIPs
> >>> could be
> >>> > > seen as orthogonal and it would decrease the scope of each
> individual
> >>> > FLIP.
> >>> > >
> >>> > > Some smaller comments:
> >>> > >
> >>> > > - I'm not sure whether we should pass in the default slot size via
> an
> >>> > > environment variable. Without having unified the way how Flink
> >>> components
> >>> > > are configured [1], I think it would be better to pass it in as
> part
> >>> of
> >>> > the
> >>> > > configuration.
> >>> > > - I would avoid returning a null value from
> >>> > > TaskExecutorGateway#requestResource if it cannot be fulfilled.
> >>> Either we
> >>> > > should introduce an explicit return value saying this or throw an
> >>> > > exception.
> >>> > >
> >>> > > Concerning Yangze's comments: I think you are right that it would
> be
> >>> > > helpful to make the selection strategy pluggable. Also batching
> slot
> >>> > > requests to the RM could be a good optimization. For the sake of
> >>> keeping
> >>> > > the scope of this FLIP smaller I would try to tackle these things
> >>> after
> >>> > the
> >>> > > initial version has been completed (without spoiling these
> >>> optimization
> >>> > > opportunities). In particular batching the slot requests depends on
> >>> the
> >>> > > current scheduler refactoring and could also be realized on the RM
> >>> side
> >>> > > only.
> >>> > >
> >>> > > [1]
> >>> > >
> >>> > >
> >>> >
> >>>
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-54%3A+Evolve+ConfigOption+and+Configuration
> >>> > >
> >>> > > Cheers,
> >>> > > Till
> >>> > >
> >>> > >
> >>> > >
> >>> > > On Fri, Aug 16, 2019 at 11:11 AM Yangze Guo <[hidden email]>
> >>> wrote:
> >>> > >
> >>> > > > Hi, Xintong
> >>> > > >
> >>> > > > Thanks to propose this FLIP. The general design looks good to me,
> >>> +1
> >>> > > > for this feature.
> >>> > > >
> >>> > > > Since slots in the same task executor could have different
> resource
> >>> > > > profile, we will
> >>> > > > meet resource fragment problem. Think about this case:
> >>> > > >  - request A want 1G memory while request B & C want 0.5G memory
> >>> > > >  - There are two task executors T1 & T2 with 1G and 0.5G free
> >>> memory
> >>> > > > respectively
> >>> > > > If B come first and we cut a slot from T1 for B, A must wait for
> >>> the
> >>> > > > free resource from
> >>> > > > other task. But A could have been scheduled immediately if we
> cut a
> >>> > > > slot from T2 for B.
> >>> > > >
> >>> > > > The logic of findMatchingSlot now become finding a task executor
> >>> which
> >>> > > > has enough
> >>> > > > resource and then cut a slot from it. Current method could be
> seen
> >>> as
> >>> > > > "First-fit strategy",
> >>> > > > which works well in general but sometimes could not be the
> >>> optimization
> >>> > > > method.
> >>> > > >
> >>> > > > Actually, this problem could be abstracted as "Bin Packing
> >>> Problem"[1].
> >>> > > > Here are
> >>> > > > some common approximate algorithms:
> >>> > > > - First fit
> >>> > > > - Next fit
> >>> > > > - Best fit
> >>> > > >
> >>> > > > But it become multi-dimensional bin packing problem if we take
> CPU
> >>> > > > into account. It hard
> >>> > > > to define which one is best fit now. Some research addressed this
> >>> > > > problem, such like Tetris[2].
> >>> > > >
> >>> > > > Here are some thinking about it:
> >>> > > > 1. We could make the strategy of finding matching task executor
> >>> > > > pluginable. Let user to config the
> >>> > > > best strategy in their scenario.
> >>> > > > 2. We could support batch request interface in RM, because we
> have
> >>> > > > opportunities to optimize
> >>> > > > if we have more information. If we know the A, B, C at the same
> >>> time,
> >>> > > > we could always make the best decision.
> >>> > > >
> >>> > > > [1] http://www.or.deis.unibo.it/kp/Chapter8.pdf
> >>> > > > [2]
> >>> > https://www.cs.cmu.edu/~xia/resources/Documents/grandl_sigcomm14.pdf
> >>> > > >
> >>> > > > Best,
> >>> > > > Yangze Guo
> >>> > > >
> >>> > > > On Thu, Aug 15, 2019 at 10:40 PM Xintong Song <
> >>> [hidden email]>
> >>> > > > wrote:
> >>> > > > >
> >>> > > > > Hi everyone,
> >>> > > > >
> >>> > > > > We would like to start a discussion thread on "FLIP-53: Fine
> >>> Grained
> >>> > > > > Resource Management"[1], where we propose how to improve Flink
> >>> > resource
> >>> > > > > management and scheduling.
> >>> > > > >
> >>> > > > > This FLIP mainly discusses the following issues.
> >>> > > > >
> >>> > > > >    - How to support tasks with fine grained resource
> >>> requirements.
> >>> > > > >    - How to unify resource management for jobs with / without
> >>> fine
> >>> > > > grained
> >>> > > > >    resource requirements.
> >>> > > > >    - How to unify resource management for streaming / batch
> jobs.
> >>> > > > >
> >>> > > > > Key changes proposed in the FLIP are as follows.
> >>> > > > >
> >>> > > > >    - Unify memory management for operators with / without fine
> >>> > grained
> >>> > > > >    resource requirements by applying a fraction based quota
> >>> > mechanism.
> >>> > > > >    - Unify resource scheduling for streaming and batch jobs by
> >>> > setting
> >>> > > > slot
> >>> > > > >    sharing groups for pipelined regions during compiling stage.
> >>> > > > >    - Dynamically allocate slots from task executors' available
> >>> > > resources.
> >>> > > > >
> >>> > > > > Please find more details in the FLIP wiki document [1]. Looking
> >>> > forward
> >>> > > > to
> >>> > > > > your feedbacks.
> >>> > > > >
> >>> > > > > Thank you~
> >>> > > > >
> >>> > > > > Xintong Song
> >>> > > > >
> >>> > > > >
> >>> > > > > [1]
> >>> > > > >
> >>> > > >
> >>> > >
> >>> >
> >>>
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-53%3A+Fine+Grained+Resource+Management
> >>> > > >
> >>> > >
> >>> >
> >>>
> >>
>

Re: [DISCUSS] FLIP-53: Fine Grained Resource Management

Till Rohrmann
Thanks for creating the implementation plan Xintong. Overall, the
implementation plan looks good. I had a couple of comments:

- What will happen if a user has defined a streaming job with two slot
sharing groups? Would the code insert a blocking data exchange between
these two groups? If yes, then this breaks existing Flink streaming jobs
(see the example sketched below).
- How do we detect unbounded streaming jobs to set
the allSourcesInSamePipelinedRegion flag to `true`? Wouldn't it be easier to
set it to `false` if we are using the DataSet API or the Blink planner with a
bounded job?
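
For reference, such a job only needs the existing public DataStream API;
here is a minimal sketch (the group names and the functions are made up,
only to make the scenario concrete):

import org.apache.flink.api.common.functions.MapFunction;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class TwoSlotSharingGroupsJob {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env =
                StreamExecutionEnvironment.getExecutionEnvironment();

        env.fromElements(1, 2, 3)
                // this operator is explicitly placed into the first group
                .map(new MapFunction<Integer, Integer>() {
                    @Override
                    public Integer map(Integer value) {
                        return value * 2;
                    }
                })
                .slotSharingGroup("groupA")
                // the downstream operator is placed into a second group
                .filter(value -> value > 2)
                .slotSharingGroup("groupB")
                .print();

        env.execute("two-slot-sharing-groups");
    }
}

The question above is whether the proposed compilation step would insert a
blocking exchange between "groupA" and "groupB" for such a job.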

Cheers,
Till

On Tue, Aug 27, 2019 at 2:16 PM Till Rohrmann <[hidden email]> wrote:

> I guess there is a typo since the link to the FLIP-53 is
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-53%3A+Fine+Grained+Operator+Resource+Management
>
> Cheers,
> Till
>
> On Tue, Aug 27, 2019 at 1:42 PM Xintong Song <[hidden email]>
> wrote:
>
>> Added implementation steps for this FLIP on the wiki page [1].
>>
>>
>> Thank you~
>>
>> Xintong Song
>>
>>
>> [1]
>>
>> https://cwiki.apache.org/confluence/display/FLINK/FLIP-49%3A+Unified+Memory+Configuration+for+TaskExecutors
>>
>> On Mon, Aug 19, 2019 at 10:29 PM Xintong Song <[hidden email]>
>> wrote:
>>
>> > Hi everyone,
>> >
>> > As Till suggested, the original "FLIP-53: Fine Grained Resource
>> > Management" splits into two separate FLIPs,
>> >
>> >    - FLIP-53: Fine Grained Operator Resource Management [1]
>> >    - FLIP-56: Dynamic Slot Allocation [2]
>> >
>> > We'll continue using this discussion thread for FLIP-53. For FLIP-56, I
>> > just started a new discussion thread [3].
>> >
>> > Thank you~
>> >
>> > Xintong Song
>> >
>> >
>> > [1]
>> >
>> https://cwiki.apache.org/confluence/display/FLINK/FLIP-53%3A+Fine+Grained+Operator+Resource+Management
>> >
>> > [2]
>> >
>> https://cwiki.apache.org/confluence/display/FLINK/FLIP-56%3A+Dynamic+Slot+Allocation
>> >
>> > [3]
>> >
>> http://apache-flink-mailing-list-archive.1008284.n3.nabble.com/DISCUSS-FLIP-56-Dynamic-Slot-Allocation-td31960.html
>> >
>> > On Mon, Aug 19, 2019 at 2:55 PM Xintong Song <[hidden email]>
>> > wrote:
>> >
>> >> Thinks for the comments, Yang.
>> >>
>> >> Regarding your questions:
>> >>
>> >>    1. How to calculate the resource specification of TaskManagers? Do
>> they
>> >>>    have them same resource spec calculated based on the
>> configuration? I
>> >>> think
>> >>>    we still have wasted resources in this situation. Or we could start
>> >>>    TaskManagers with different spec.
>> >>>
>> >> I agree with you that we can further improve the resource utility by
>> >> customizing task executors with different resource specifications.
>> However,
>> >> I'm in favor of limiting the scope of this FLIP and leave it as a
>> future
>> >> optimization. The plan for that part is to move the logic of deciding
>> task
>> >> executor specifications into the slot manager and make slot manager
>> >> pluggable, so inside the slot manager plugin we can have different
>> logics
>> >> for deciding the task executor specifications.
>> >>
>> >>
>> >>>    2. If a slot is released and returned to SlotPool, does it could be
>> >>>    reused by other SlotRequest that the request resource is smaller
>> than
>> >>> it?
>> >>>
>> >> No, I think slot pool should always return slots if they do not exactly
>> >> match the pending requests, so that resource manager can deal with the
>> >> extra resources.
>> >>
>> >>>       - If it is yes, what happens to the available resource in the
>> >>
>> >>       TaskManager.
>> >>>       - What is the SlotStatus of the cached slot in SlotPool? The
>> >>>       AllocationId is null?
>> >>>
>> >> The allocation id does not change as long as the slot is not returned
>> >> from the job master, no matter its occupied or available in the slot
>> pool.
>> >> I think we have the same behavior currently. No matter how many tasks
>> the
>> >> job master deploy into the slot, concurrently or sequentially, it is
>> one
>> >> allocation from the cluster to the job until the slot is freed from
>> the job
>> >> master.
>> >>
>> >>>    3. In a session cluster, some jobs are configured with operator
>> >>>    resources, meanwhile other jobs are using UNKNOWN. How to deal with
>> >>> this
>> >>>    situation?
>> >>
>> >> As long as we do not mix unknown / specified resource profiles within
>> the
>> >> same job / slot, there shouldn't be a problem. Resource manager
>> converts
>> >> unknown resource profiles in slot requests to specified default
>> resource
>> >> profiles, so they can be dynamically allocated from task executors'
>> >> available resources just as other slot requests with specified resource
>> >> profiles.
>> >>
>> >> Thank you~
>> >>
>> >> Xintong Song
>> >>
>> >>
>> >>
>> >> On Mon, Aug 19, 2019 at 11:39 AM Yang Wang <[hidden email]>
>> wrote:
>> >>
>> >>> Hi Xintong,
>> >>>
>> >>>
>> >>> Thanks for your detailed proposal. I think many users are suffering
>> from
>> >>> waste of resources. The resource spec of all task managers are same
>> and
>> >>> we
>> >>> have to increase all task managers to make the heavy one more stable.
>> So
>> >>> we
>> >>> will benefit from the fine grained resource management a lot. We could
>> >>> get
>> >>> better resource utilization and stability.
>> >>>
>> >>>
>> >>> Just to share some thoughts.
>> >>>
>> >>>
>> >>>
>> >>>    1. How to calculate the resource specification of TaskManagers? Do
>> >>> they
>> >>>    have them same resource spec calculated based on the
>> configuration? I
>> >>> think
>> >>>    we still have wasted resources in this situation. Or we could start
>> >>>    TaskManagers with different spec.
>> >>>    2. If a slot is released and returned to SlotPool, does it could be
>> >>>    reused by other SlotRequest that the request resource is smaller
>> than
>> >>> it?
>> >>>       - If it is yes, what happens to the available resource in the
>> >>>       TaskManager.
>> >>>       - What is the SlotStatus of the cached slot in SlotPool? The
>> >>>       AllocationId is null?
>> >>>    3. In a session cluster, some jobs are configured with operator
>> >>>    resources, meanwhile other jobs are using UNKNOWN. How to deal with
>> >>> this
>> >>>    situation?
>> >>>
>> >>>
>> >>>
>> >>> Best,
>> >>> Yang
>> >>>
>> >>> On Fri, Aug 16, 2019 at 8:57 PM Xintong Song <[hidden email]> wrote:
>> >>>
>> >>> > Thanks for the feedbacks, Yangze and Till.
>> >>> >
>> >>> > Yangze,
>> >>> >
>> >>> > I agree with you that we should make scheduling strategy pluggable
>> and
>> >>> > optimize the strategy to reduce the memory fragmentation problem,
>> and
>> >>> > thanks for the inputs on the potential algorithmic solutions.
>> However,
>> >>> I'm
>> >>> > in favor of keep this FLIP focusing on the overall mechanism design
>> >>> rather
>> >>> > than strategies. Solving the fragmentation issue should be
>> considered
>> >>> as an
>> >>> > optimization, and I agree with Till that we probably should tackle
>> this
>> >>> > afterwards.
>> >>> >
>> >>> > Till,
>> >>> >
>> >>> > - Regarding splitting the FLIP, I think it makes sense. The operator
>> >>> > resource management and dynamic slot allocation do not have much
>> >>> dependency
>> >>> > on each other.
>> >>> >
>> >>> > - Regarding the default slot size, I think this is similar to
>> FLIP-49
>> >>> [1]
>> >>> > where we want all the deriving happens at one place. I think it
>> would
>> >>> be
>> >>> > nice to pass the default slot size into the task executor in the
>> same
>> >>> way
>> >>> > that we pass in the memory pool sizes in FLIP-49 [1].
>> >>> >
>> >>> > - Regarding the return value of
>> TaskExecutorGateway#requestResource, I
>> >>> > think you're right. We should avoid using null as the return value.
>> I
>> >>> think
>> >>> > we probably should thrown an exception here.
>> >>> >
>> >>> > Thank you~
>> >>> >
>> >>> > Xintong Song
>> >>> >
>> >>> >
>> >>> > [1]
>> >>> >
>> >>> >
>> >>>
>> https://cwiki.apache.org/confluence/display/FLINK/FLIP-49%3A+Unified+Memory+Configuration+for+TaskExecutors
>> >>> >
>> >>> > On Fri, Aug 16, 2019 at 2:18 PM Till Rohrmann <[hidden email]
>> >
>> >>> > wrote:
>> >>> >
>> >>> > > Hi Xintong,
>> >>> > >
>> >>> > > thanks for drafting this FLIP. I think your proposal helps to
>> >>> improve the
>> >>> > > execution of batch jobs more efficiently. Moreover, it enables the
>> >>> proper
>> >>> > > integration of the Blink planner which is very important as well.
>> >>> > >
>> >>> > > Overall, the FLIP looks good to me. I was wondering whether it
>> >>> wouldn't
>> >>> > > make sense to actually split it up into two FLIPs: Operator
>> resource
>> >>> > > management and dynamic slot allocation. I think these two FLIPs
>> >>> could be
>> >>> > > seen as orthogonal and it would decrease the scope of each
>> individual
>> >>> > FLIP.
>> >>> > >
>> >>> > > Some smaller comments:
>> >>> > >
>> >>> > > - I'm not sure whether we should pass in the default slot size
>> via an
>> >>> > > environment variable. Without having unified the way how Flink
>> >>> components
>> >>> > > are configured [1], I think it would be better to pass it in as
>> part
>> >>> of
>> >>> > the
>> >>> > > configuration.
>> >>> > > - I would avoid returning a null value from
>> >>> > > TaskExecutorGateway#requestResource if it cannot be fulfilled.
>> >>> Either we
>> >>> > > should introduce an explicit return value saying this or throw an
>> >>> > > exception.
>> >>> > >
>> >>> > > Concerning Yangze's comments: I think you are right that it would
>> be
>> >>> > > helpful to make the selection strategy pluggable. Also batching
>> slot
>> >>> > > requests to the RM could be a good optimization. For the sake of
>> >>> keeping
>> >>> > > the scope of this FLIP smaller I would try to tackle these things
>> >>> after
>> >>> > the
>> >>> > > initial version has been completed (without spoiling these
>> >>> optimization
>> >>> > > opportunities). In particular batching the slot requests depends
>> on
>> >>> the
>> >>> > > current scheduler refactoring and could also be realized on the RM
>> >>> side
>> >>> > > only.
>> >>> > >
>> >>> > > [1]
>> >>> > >
>> >>> > >
>> >>> >
>> >>>
>> https://cwiki.apache.org/confluence/display/FLINK/FLIP-54%3A+Evolve+ConfigOption+and+Configuration
>> >>> > >
>> >>> > > Cheers,
>> >>> > > Till
>> >>> > >
>> >>> > >
>> >>> > >
>> >>> > > On Fri, Aug 16, 2019 at 11:11 AM Yangze Guo <[hidden email]>
>> >>> wrote:
>> >>> > >
>> >>> > > > Hi, Xintong
>> >>> > > >
>> >>> > > > Thanks to propose this FLIP. The general design looks good to
>> me,
>> >>> +1
>> >>> > > > for this feature.
>> >>> > > >
>> >>> > > > Since slots in the same task executor could have different
>> resource
>> >>> > > > profile, we will
>> >>> > > > meet resource fragment problem. Think about this case:
>> >>> > > >  - request A want 1G memory while request B & C want 0.5G memory
>> >>> > > >  - There are two task executors T1 & T2 with 1G and 0.5G free
>> >>> memory
>> >>> > > > respectively
>> >>> > > > If B come first and we cut a slot from T1 for B, A must wait for
>> >>> the
>> >>> > > > free resource from
>> >>> > > > other task. But A could have been scheduled immediately if we
>> cut a
>> >>> > > > slot from T2 for B.
>> >>> > > >
>> >>> > > > The logic of findMatchingSlot now become finding a task executor
>> >>> which
>> >>> > > > has enough
>> >>> > > > resource and then cut a slot from it. Current method could be
>> seen
>> >>> as
>> >>> > > > "First-fit strategy",
>> >>> > > > which works well in general but sometimes could not be the
>> >>> optimization
>> >>> > > > method.
>> >>> > > >
>> >>> > > > Actually, this problem could be abstracted as "Bin Packing
>> >>> Problem"[1].
>> >>> > > > Here are
>> >>> > > > some common approximate algorithms:
>> >>> > > > - First fit
>> >>> > > > - Next fit
>> >>> > > > - Best fit
>> >>> > > >
>> >>> > > > But it become multi-dimensional bin packing problem if we take
>> CPU
>> >>> > > > into account. It hard
>> >>> > > > to define which one is best fit now. Some research addressed
>> this
>> >>> > > > problem, such like Tetris[2].
>> >>> > > >
>> >>> > > > Here are some thinking about it:
>> >>> > > > 1. We could make the strategy of finding matching task executor
>> >>> > > > pluginable. Let user to config the
>> >>> > > > best strategy in their scenario.
>> >>> > > > 2. We could support batch request interface in RM, because we
>> have
>> >>> > > > opportunities to optimize
>> >>> > > > if we have more information. If we know the A, B, C at the same
>> >>> time,
>> >>> > > > we could always make the best decision.
>> >>> > > >
>> >>> > > > [1] http://www.or.deis.unibo.it/kp/Chapter8.pdf
>> >>> > > > [2]
>> >>> >
>> https://www.cs.cmu.edu/~xia/resources/Documents/grandl_sigcomm14.pdf
>> >>> > > >
>> >>> > > > Best,
>> >>> > > > Yangze Guo
>> >>> > > >
>> >>> > > > On Thu, Aug 15, 2019 at 10:40 PM Xintong Song <
>> >>> [hidden email]>
>> >>> > > > wrote:
>> >>> > > > >
>> >>> > > > > Hi everyone,
>> >>> > > > >
>> >>> > > > > We would like to start a discussion thread on "FLIP-53: Fine
>> >>> Grained
>> >>> > > > > Resource Management"[1], where we propose how to improve Flink
>> >>> > resource
>> >>> > > > > management and scheduling.
>> >>> > > > >
>> >>> > > > > This FLIP mainly discusses the following issues.
>> >>> > > > >
>> >>> > > > >    - How to support tasks with fine grained resource
>> >>> requirements.
>> >>> > > > >    - How to unify resource management for jobs with / without
>> >>> fine
>> >>> > > > grained
>> >>> > > > >    resource requirements.
>> >>> > > > >    - How to unify resource management for streaming / batch
>> jobs.
>> >>> > > > >
>> >>> > > > > Key changes proposed in the FLIP are as follows.
>> >>> > > > >
>> >>> > > > >    - Unify memory management for operators with / without fine
>> >>> > grained
>> >>> > > > >    resource requirements by applying a fraction based quota
>> >>> > mechanism.
>> >>> > > > >    - Unify resource scheduling for streaming and batch jobs by
>> >>> > setting
>> >>> > > > slot
>> >>> > > > >    sharing groups for pipelined regions during compiling
>> stage.
>> >>> > > > >    - Dynamically allocate slots from task executors' available
>> >>> > > resources.
>> >>> > > > >
>> >>> > > > > Please find more details in the FLIP wiki document [1].
>> Looking
>> >>> > forward
>> >>> > > > to
>> >>> > > > > your feedbacks.
>> >>> > > > >
>> >>> > > > > Thank you~
>> >>> > > > >
>> >>> > > > > Xintong Song
>> >>> > > > >
>> >>> > > > >
>> >>> > > > > [1]
>> >>> > > > >
>> >>> > > >
>> >>> > >
>> >>> >
>> >>>
>> https://cwiki.apache.org/confluence/display/FLINK/FLIP-53%3A+Fine+Grained+Resource+Management
>> >>> > > >
>> >>> > >
>> >>> >
>> >>>
>> >>
>>
>

Re: [DISCUSS] FLIP-53: Fine Grained Resource Management

Xintong Song
Thanks for the correction, Till.

Regarding your comments:
- You are right, we should not change the edge type for streaming jobs.
Then I think we can rename the option 'allSourcesInSamePipelinedRegion' in
step 2 to 'isStreamingJob', and implement the current step 2 before the
current step 1, so that we can use this option to decide whether we should
change the edge type (see the small sketch below). What do you think?
- Agree. It should be easier to make the default value of
'allSourcesInSamePipelinedRegion' (or 'isStreamingJob') 'true', and set it
to 'false' when using the DataSet API or the Blink planner.
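
Just to make that ordering concrete, here is a tiny self-contained sketch
of the decision the compiler would make with such a flag (the class and
method names below are purely illustrative, not proposed interfaces):

public class EdgeTypeDecision {

    enum ExchangeType { PIPELINED, BLOCKING }

    // Streaming jobs keep pipelined exchanges between slot sharing groups,
    // so existing streaming jobs are not broken; batch jobs get blocking
    // exchanges so that slot sharing groups align with pipelined regions.
    static ExchangeType exchangeTypeBetweenSlotSharingGroups(boolean isStreamingJob) {
        return isStreamingJob ? ExchangeType.PIPELINED : ExchangeType.BLOCKING;
    }

    public static void main(String[] args) {
        // Default is 'true'; the DataSet API and bounded Blink jobs set it to 'false'.
        System.out.println(exchangeTypeBetweenSlotSharingGroups(true));   // PIPELINED
        System.out.println(exchangeTypeBetweenSlotSharingGroups(false));  // BLOCKING
    }
}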

Thank you~

Xintong Song



On Tue, Aug 27, 2019 at 8:59 PM Till Rohrmann <[hidden email]> wrote:

> Thanks for creating the implementation plan Xintong. Overall, the
> implementation plan looks good. I had a couple of comments:
>
> - What will happen if a user has defined a streaming job with two slot
> sharing groups? Would the code insert a blocking data exchange between
> these two groups? If yes, then this breaks existing Flink streaming jobs.
> - How do we detect unbounded streaming jobs to set
> the allSourcesInSamePipelinedRegion to `true`? Wouldn't it be easier to set
> it false if we are using the DataSet API or the Blink planner with a
> bounded job?
>
> Cheers,
> Till
>
> On Tue, Aug 27, 2019 at 2:16 PM Till Rohrmann <[hidden email]>
> wrote:
>
> > I guess there is a typo since the link to the FLIP-53 is
> >
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-53%3A+Fine+Grained+Operator+Resource+Management
> >
> > Cheers,
> > Till
> >
> > On Tue, Aug 27, 2019 at 1:42 PM Xintong Song <[hidden email]>
> > wrote:
> >
> >> Added implementation steps for this FLIP on the wiki page [1].
> >>
> >>
> >> Thank you~
> >>
> >> Xintong Song
> >>
> >>
> >> [1]
> >>
> >>
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-49%3A+Unified+Memory+Configuration+for+TaskExecutors
> >>
> >> On Mon, Aug 19, 2019 at 10:29 PM Xintong Song <[hidden email]>
> >> wrote:
> >>
> >> > Hi everyone,
> >> >
> >> > As Till suggested, the original "FLIP-53: Fine Grained Resource
> >> > Management" splits into two separate FLIPs,
> >> >
> >> >    - FLIP-53: Fine Grained Operator Resource Management [1]
> >> >    - FLIP-56: Dynamic Slot Allocation [2]
> >> >
> >> > We'll continue using this discussion thread for FLIP-53. For FLIP-56,
> I
> >> > just started a new discussion thread [3].
> >> >
> >> > Thank you~
> >> >
> >> > Xintong Song
> >> >
> >> >
> >> > [1]
> >> >
> >>
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-53%3A+Fine+Grained+Operator+Resource+Management
> >> >
> >> > [2]
> >> >
> >>
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-56%3A+Dynamic+Slot+Allocation
> >> >
> >> > [3]
> >> >
> >>
> http://apache-flink-mailing-list-archive.1008284.n3.nabble.com/DISCUSS-FLIP-56-Dynamic-Slot-Allocation-td31960.html
> >> >
> >> > On Mon, Aug 19, 2019 at 2:55 PM Xintong Song <[hidden email]>
> >> > wrote:
> >> >
> >> >> Thinks for the comments, Yang.
> >> >>
> >> >> Regarding your questions:
> >> >>
> >> >>    1. How to calculate the resource specification of TaskManagers? Do
> >> they
> >> >>>    have them same resource spec calculated based on the
> >> configuration? I
> >> >>> think
> >> >>>    we still have wasted resources in this situation. Or we could
> start
> >> >>>    TaskManagers with different spec.
> >> >>>
> >> >> I agree with you that we can further improve the resource utility by
> >> >> customizing task executors with different resource specifications.
> >> However,
> >> >> I'm in favor of limiting the scope of this FLIP and leave it as a
> >> future
> >> >> optimization. The plan for that part is to move the logic of deciding
> >> task
> >> >> executor specifications into the slot manager and make slot manager
> >> >> pluggable, so inside the slot manager plugin we can have different
> >> logics
> >> >> for deciding the task executor specifications.
> >> >>
> >> >>
> >> >>>    2. If a slot is released and returned to SlotPool, does it could
> be
> >> >>>    reused by other SlotRequest that the request resource is smaller
> >> than
> >> >>> it?
> >> >>>
> >> >> No, I think slot pool should always return slots if they do not
> exactly
> >> >> match the pending requests, so that resource manager can deal with
> the
> >> >> extra resources.
> >> >>
> >> >>>       - If it is yes, what happens to the available resource in the
> >> >>
> >> >>       TaskManager.
> >> >>>       - What is the SlotStatus of the cached slot in SlotPool? The
> >> >>>       AllocationId is null?
> >> >>>
> >> >> The allocation id does not change as long as the slot is not returned
> >> >> from the job master, no matter its occupied or available in the slot
> >> pool.
> >> >> I think we have the same behavior currently. No matter how many tasks
> >> the
> >> >> job master deploy into the slot, concurrently or sequentially, it is
> >> one
> >> >> allocation from the cluster to the job until the slot is freed from
> >> the job
> >> >> master.
> >> >>
> >> >>>    3. In a session cluster, some jobs are configured with operator
> >> >>>    resources, meanwhile other jobs are using UNKNOWN. How to deal
> with
> >> >>> this
> >> >>>    situation?
> >> >>
> >> >> As long as we do not mix unknown / specified resource profiles within
> >> the
> >> >> same job / slot, there shouldn't be a problem. Resource manager
> >> converts
> >> >> unknown resource profiles in slot requests to specified default
> >> resource
> >> >> profiles, so they can be dynamically allocated from task executors'
> >> >> available resources just as other slot requests with specified
> resource
> >> >> profiles.
> >> >>
> >> >> Thank you~
> >> >>
> >> >> Xintong Song
> >> >>
> >> >>
> >> >>
> >> >> On Mon, Aug 19, 2019 at 11:39 AM Yang Wang <[hidden email]>
> >> wrote:
> >> >>
> >> >>> Hi Xintong,
> >> >>>
> >> >>>
> >> >>> Thanks for your detailed proposal. I think many users are suffering
> >> from
> >> >>> waste of resources. The resource spec of all task managers are same
> >> and
> >> >>> we
> >> >>> have to increase all task managers to make the heavy one more
> stable.
> >> So
> >> >>> we
> >> >>> will benefit from the fine grained resource management a lot. We
> could
> >> >>> get
> >> >>> better resource utilization and stability.
> >> >>>
> >> >>>
> >> >>> Just to share some thoughts.
> >> >>>
> >> >>>
> >> >>>
> >> >>>    1. How to calculate the resource specification of TaskManagers?
> Do
> >> >>> they
> >> >>>    have them same resource spec calculated based on the
> >> configuration? I
> >> >>> think
> >> >>>    we still have wasted resources in this situation. Or we could
> start
> >> >>>    TaskManagers with different spec.
> >> >>>    2. If a slot is released and returned to SlotPool, does it could
> be
> >> >>>    reused by other SlotRequest that the request resource is smaller
> >> than
> >> >>> it?
> >> >>>       - If it is yes, what happens to the available resource in the
> >> >>>       TaskManager.
> >> >>>       - What is the SlotStatus of the cached slot in SlotPool? The
> >> >>>       AllocationId is null?
> >> >>>    3. In a session cluster, some jobs are configured with operator
> >> >>>    resources, meanwhile other jobs are using UNKNOWN. How to deal
> with
> >> >>> this
> >> >>>    situation?
> >> >>>
> >> >>>
> >> >>>
> >> >>> Best,
> >> >>> Yang
> >> >>>
> >> >>> On Fri, Aug 16, 2019 at 8:57 PM Xintong Song <[hidden email]> wrote:
> >> >>>
> >> >>> > Thanks for the feedbacks, Yangze and Till.
> >> >>> >
> >> >>> > Yangze,
> >> >>> >
> >> >>> > I agree with you that we should make scheduling strategy pluggable
> >> and
> >> >>> > optimize the strategy to reduce the memory fragmentation problem,
> >> and
> >> >>> > thanks for the inputs on the potential algorithmic solutions.
> >> However,
> >> >>> I'm
> >> >>> > in favor of keep this FLIP focusing on the overall mechanism
> design
> >> >>> rather
> >> >>> > than strategies. Solving the fragmentation issue should be
> >> considered
> >> >>> as an
> >> >>> > optimization, and I agree with Till that we probably should tackle
> >> this
> >> >>> > afterwards.
> >> >>> >
> >> >>> > Till,
> >> >>> >
> >> >>> > - Regarding splitting the FLIP, I think it makes sense. The
> operator
> >> >>> > resource management and dynamic slot allocation do not have much
> >> >>> dependency
> >> >>> > on each other.
> >> >>> >
> >> >>> > - Regarding the default slot size, I think this is similar to
> >> FLIP-49
> >> >>> [1]
> >> >>> > where we want all the deriving happens at one place. I think it
> >> would
> >> >>> be
> >> >>> > nice to pass the default slot size into the task executor in the
> >> same
> >> >>> way
> >> >>> > that we pass in the memory pool sizes in FLIP-49 [1].
> >> >>> >
> >> >>> > - Regarding the return value of
> >> TaskExecutorGateway#requestResource, I
> >> >>> > think you're right. We should avoid using null as the return
> value.
> >> I
> >> >>> think
> >> >>> > we probably should thrown an exception here.
> >> >>> >
> >> >>> > Thank you~
> >> >>> >
> >> >>> > Xintong Song
> >> >>> >
> >> >>> >
> >> >>> > [1]
> >> >>> >
> >> >>> >
> >> >>>
> >>
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-49%3A+Unified+Memory+Configuration+for+TaskExecutors
> >> >>> >
> >> >>> > On Fri, Aug 16, 2019 at 2:18 PM Till Rohrmann <
> [hidden email]
> >> >
> >> >>> > wrote:
> >> >>> >
> >> >>> > > Hi Xintong,
> >> >>> > >
> >> >>> > > thanks for drafting this FLIP. I think your proposal helps to
> >> >>> improve the
> >> >>> > > execution of batch jobs more efficiently. Moreover, it enables
> the
> >> >>> proper
> >> >>> > > integration of the Blink planner which is very important as
> well.
> >> >>> > >
> >> >>> > > Overall, the FLIP looks good to me. I was wondering whether it
> >> >>> wouldn't
> >> >>> > > make sense to actually split it up into two FLIPs: Operator
> >> resource
> >> >>> > > management and dynamic slot allocation. I think these two FLIPs
> >> >>> could be
> >> >>> > > seen as orthogonal and it would decrease the scope of each
> >> individual
> >> >>> > FLIP.
> >> >>> > >
> >> >>> > > Some smaller comments:
> >> >>> > >
> >> >>> > > - I'm not sure whether we should pass in the default slot size
> >> via an
> >> >>> > > environment variable. Without having unified the way how Flink
> >> >>> components
> >> >>> > > are configured [1], I think it would be better to pass it in as
> >> part
> >> >>> of
> >> >>> > the
> >> >>> > > configuration.
> >> >>> > > - I would avoid returning a null value from
> >> >>> > > TaskExecutorGateway#requestResource if it cannot be fulfilled.
> >> >>> Either we
> >> >>> > > should introduce an explicit return value saying this or throw
> an
> >> >>> > > exception.
> >> >>> > >
> >> >>> > > Concerning Yangze's comments: I think you are right that it
> would
> >> be
> >> >>> > > helpful to make the selection strategy pluggable. Also batching
> >> slot
> >> >>> > > requests to the RM could be a good optimization. For the sake of
> >> >>> keeping
> >> >>> > > the scope of this FLIP smaller I would try to tackle these
> things
> >> >>> after
> >> >>> > the
> >> >>> > > initial version has been completed (without spoiling these
> >> >>> optimization
> >> >>> > > opportunities). In particular batching the slot requests depends
> >> on
> >> >>> the
> >> >>> > > current scheduler refactoring and could also be realized on the
> RM
> >> >>> side
> >> >>> > > only.
> >> >>> > >
> >> >>> > > [1]
> >> >>> > >
> >> >>> > >
> >> >>> >
> >> >>>
> >>
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-54%3A+Evolve+ConfigOption+and+Configuration
> >> >>> > >
> >> >>> > > Cheers,
> >> >>> > > Till
> >> >>> > >
> >> >>> > >
> >> >>> > >
> >> >>> > > On Fri, Aug 16, 2019 at 11:11 AM Yangze Guo <[hidden email]
> >
> >> >>> wrote:
> >> >>> > >
> >> >>> > > > Hi, Xintong
> >> >>> > > >
> >> >>> > > > Thanks to propose this FLIP. The general design looks good to
> >> me,
> >> >>> +1
> >> >>> > > > for this feature.
> >> >>> > > >
> >> >>> > > > Since slots in the same task executor could have different
> >> resource
> >> >>> > > > profile, we will
> >> >>> > > > meet resource fragment problem. Think about this case:
> >> >>> > > >  - request A want 1G memory while request B & C want 0.5G
> memory
> >> >>> > > >  - There are two task executors T1 & T2 with 1G and 0.5G free
> >> >>> memory
> >> >>> > > > respectively
> >> >>> > > > If B come first and we cut a slot from T1 for B, A must wait
> for
> >> >>> the
> >> >>> > > > free resource from
> >> >>> > > > other task. But A could have been scheduled immediately if we
> >> cut a
> >> >>> > > > slot from T2 for B.
> >> >>> > > >
> >> >>> > > > The logic of findMatchingSlot now become finding a task
> executor
> >> >>> which
> >> >>> > > > has enough
> >> >>> > > > resource and then cut a slot from it. Current method could be
> >> seen
> >> >>> as
> >> >>> > > > "First-fit strategy",
> >> >>> > > > which works well in general but sometimes could not be the
> >> >>> optimization
> >> >>> > > > method.
> >> >>> > > >
> >> >>> > > > Actually, this problem could be abstracted as "Bin Packing
> >> >>> Problem"[1].
> >> >>> > > > Here are
> >> >>> > > > some common approximate algorithms:
> >> >>> > > > - First fit
> >> >>> > > > - Next fit
> >> >>> > > > - Best fit
> >> >>> > > >
> >> >>> > > > But it become multi-dimensional bin packing problem if we take
> >> CPU
> >> >>> > > > into account. It hard
> >> >>> > > > to define which one is best fit now. Some research addressed
> >> this
> >> >>> > > > problem, such like Tetris[2].
> >> >>> > > >
> >> >>> > > > Here are some thinking about it:
> >> >>> > > > 1. We could make the strategy of finding matching task
> executor
> >> >>> > > > pluginable. Let user to config the
> >> >>> > > > best strategy in their scenario.
> >> >>> > > > 2. We could support batch request interface in RM, because we
> >> have
> >> >>> > > > opportunities to optimize
> >> >>> > > > if we have more information. If we know the A, B, C at the
> same
> >> >>> time,
> >> >>> > > > we could always make the best decision.
> >> >>> > > >
> >> >>> > > > [1] http://www.or.deis.unibo.it/kp/Chapter8.pdf
> >> >>> > > > [2]
> >> >>> >
> >> https://www.cs.cmu.edu/~xia/resources/Documents/grandl_sigcomm14.pdf
> >> >>> > > >
> >> >>> > > > Best,
> >> >>> > > > Yangze Guo
> >> >>> > > >
> >> >>> > > > On Thu, Aug 15, 2019 at 10:40 PM Xintong Song <
> >> >>> [hidden email]>
> >> >>> > > > wrote:
> >> >>> > > > >
> >> >>> > > > > Hi everyone,
> >> >>> > > > >
> >> >>> > > > > We would like to start a discussion thread on "FLIP-53: Fine
> >> >>> Grained
> >> >>> > > > > Resource Management"[1], where we propose how to improve
> Flink
> >> >>> > resource
> >> >>> > > > > management and scheduling.
> >> >>> > > > >
> >> >>> > > > > This FLIP mainly discusses the following issues.
> >> >>> > > > >
> >> >>> > > > >    - How to support tasks with fine grained resource
> >> >>> requirements.
> >> >>> > > > >    - How to unify resource management for jobs with /
> without
> >> >>> fine
> >> >>> > > > grained
> >> >>> > > > >    resource requirements.
> >> >>> > > > >    - How to unify resource management for streaming / batch
> >> jobs.
> >> >>> > > > >
> >> >>> > > > > Key changes proposed in the FLIP are as follows.
> >> >>> > > > >
> >> >>> > > > >    - Unify memory management for operators with / without
> fine
> >> >>> > grained
> >> >>> > > > >    resource requirements by applying a fraction based quota
> >> >>> > mechanism.
> >> >>> > > > >    - Unify resource scheduling for streaming and batch jobs
> by
> >> >>> > setting
> >> >>> > > > slot
> >> >>> > > > >    sharing groups for pipelined regions during compiling
> >> stage.
> >> >>> > > > >    - Dynamically allocate slots from task executors'
> available
> >> >>> > > resources.
> >> >>> > > > >
> >> >>> > > > > Please find more details in the FLIP wiki document [1].
> >> Looking
> >> >>> > forward
> >> >>> > > > to
> >> >>> > > > > your feedbacks.
> >> >>> > > > >
> >> >>> > > > > Thank you~
> >> >>> > > > >
> >> >>> > > > > Xintong Song
> >> >>> > > > >
> >> >>> > > > >
> >> >>> > > > > [1]
> >> >>> > > > >
> >> >>> > > >
> >> >>> > >
> >> >>> >
> >> >>>
> >>
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-53%3A+Fine+Grained+Resource+Management
> >> >>> > > >
> >> >>> > >
> >> >>> >
> >> >>>
> >> >>
> >>
> >
>

Re: [DISCUSS] FLIP-53: Fine Grained Resource Management

Xintong Song
Regarding changing the edge type, I think we actually don't need to do this for
batch jobs either, because we don't have public interfaces for users to
explicitly set slot sharing groups in the DataSet API and SQL/Table API. We
have such interfaces in the DataStream API only.
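
Since the groups for batch jobs would then be decided entirely by the
framework, assigning them per pipelined region could be as simple as the
following sketch (purely illustrative, none of these names are existing
Flink APIs):

import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class RegionSlotSharingSketch {
    public static void main(String[] args) {
        // Pipelined regions represented simply as lists of vertex names.
        List<List<String>> pipelinedRegions = Arrays.asList(
                Arrays.asList("source", "map"),
                Arrays.asList("reduce", "sink"));

        // One slot sharing group per pipelined region, assigned at compile time.
        Map<String, String> vertexToGroup = new LinkedHashMap<>();
        int index = 0;
        for (List<String> region : pipelinedRegions) {
            String group = "region-" + index++;
            for (String vertex : region) {
                vertexToGroup.put(vertex, group);
            }
        }

        // Prints: {source=region-0, map=region-0, reduce=region-1, sink=region-1}
        System.out.println(vertexToGroup);
    }
}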

Thank you~

Xintong Song



On Tue, Aug 27, 2019 at 10:16 PM Xintong Song <[hidden email]> wrote:

> Thanks for the correction, Till.
>
> Regarding your comments:
> - You are right, we should not change the edge type for streaming jobs.
> Then I think we can change the option 'allSourcesInSamePipelinedRegion' in
> step 2 to 'isStreamingJob', and implement the current step 2 before the
> current step 1 so we can use this option to decide whether should change
> the edge type. What do you think?
> - Agree. It should be easier to make the default value of
> 'allSourcesInSamePipelinedRegion' (or 'isStreamingJob') 'true', and set it
> to 'false' when using DataSet API or blink planner.
>
> Thank you~
>
> Xintong Song
>
>
>
> On Tue, Aug 27, 2019 at 8:59 PM Till Rohrmann <[hidden email]>
> wrote:
>
>> Thanks for creating the implementation plan Xintong. Overall, the
>> implementation plan looks good. I had a couple of comments:
>>
>> - What will happen if a user has defined a streaming job with two slot
>> sharing groups? Would the code insert a blocking data exchange between
>> these two groups? If yes, then this breaks existing Flink streaming jobs.
>> - How do we detect unbounded streaming jobs to set
>> the allSourcesInSamePipelinedRegion to `true`? Wouldn't it be easier to
>> set
>> it false if we are using the DataSet API or the Blink planner with a
>> bounded job?
>>
>> Cheers,
>> Till
>>
>> On Tue, Aug 27, 2019 at 2:16 PM Till Rohrmann <[hidden email]>
>> wrote:
>>
>> > I guess there is a typo since the link to the FLIP-53 is
>> >
>> https://cwiki.apache.org/confluence/display/FLINK/FLIP-53%3A+Fine+Grained+Operator+Resource+Management
>> >
>> > Cheers,
>> > Till
>> >
>> > On Tue, Aug 27, 2019 at 1:42 PM Xintong Song <[hidden email]>
>> > wrote:
>> >
>> >> Added implementation steps for this FLIP on the wiki page [1].
>> >>
>> >>
>> >> Thank you~
>> >>
>> >> Xintong Song
>> >>
>> >>
>> >> [1]
>> >>
>> >>
>> https://cwiki.apache.org/confluence/display/FLINK/FLIP-49%3A+Unified+Memory+Configuration+for+TaskExecutors
>> >>
>> >> On Mon, Aug 19, 2019 at 10:29 PM Xintong Song <[hidden email]>
>> >> wrote:
>> >>
>> >> > Hi everyone,
>> >> >
>> >> > As Till suggested, the original "FLIP-53: Fine Grained Resource
>> >> > Management" splits into two separate FLIPs,
>> >> >
>> >> >    - FLIP-53: Fine Grained Operator Resource Management [1]
>> >> >    - FLIP-56: Dynamic Slot Allocation [2]
>> >> >
>> >> > We'll continue using this discussion thread for FLIP-53. For
>> FLIP-56, I
>> >> > just started a new discussion thread [3].
>> >> >
>> >> > Thank you~
>> >> >
>> >> > Xintong Song
>> >> >
>> >> >
>> >> > [1]
>> >> >
>> >>
>> https://cwiki.apache.org/confluence/display/FLINK/FLIP-53%3A+Fine+Grained+Operator+Resource+Management
>> >> >
>> >> > [2]
>> >> >
>> >>
>> https://cwiki.apache.org/confluence/display/FLINK/FLIP-56%3A+Dynamic+Slot+Allocation
>> >> >
>> >> > [3]
>> >> >
>> >>
>> http://apache-flink-mailing-list-archive.1008284.n3.nabble.com/DISCUSS-FLIP-56-Dynamic-Slot-Allocation-td31960.html
>> >> >
>> >> > On Mon, Aug 19, 2019 at 2:55 PM Xintong Song <[hidden email]>
>> >> > wrote:
>> >> >
>> >> >> Thinks for the comments, Yang.
>> >> >>
>> >> >> Regarding your questions:
>> >> >>
>> >> >>    1. How to calculate the resource specification of TaskManagers?
>> Do
>> >> they
>> >> >>>    have them same resource spec calculated based on the
>> >> configuration? I
>> >> >>> think
>> >> >>>    we still have wasted resources in this situation. Or we could
>> start
>> >> >>>    TaskManagers with different spec.
>> >> >>>
>> >> >> I agree with you that we can further improve the resource utility by
>> >> >> customizing task executors with different resource specifications.
>> >> However,
>> >> >> I'm in favor of limiting the scope of this FLIP and leave it as a
>> >> future
>> >> >> optimization. The plan for that part is to move the logic of
>> deciding
>> >> task
>> >> >> executor specifications into the slot manager and make slot manager
>> >> >> pluggable, so inside the slot manager plugin we can have different
>> >> logics
>> >> >> for deciding the task executor specifications.
>> >> >>
>> >> >>
>> >> >>>    2. If a slot is released and returned to SlotPool, does it
>> could be
>> >> >>>    reused by other SlotRequest that the request resource is smaller
>> >> than
>> >> >>> it?
>> >> >>>
>> >> >> No, I think slot pool should always return slots if they do not
>> exactly
>> >> >> match the pending requests, so that resource manager can deal with
>> the
>> >> >> extra resources.
>> >> >>
>> >> >>>       - If it is yes, what happens to the available resource in the
>> >> >>
>> >> >>       TaskManager.
>> >> >>>       - What is the SlotStatus of the cached slot in SlotPool? The
>> >> >>>       AllocationId is null?
>> >> >>>
>> >> >> The allocation id does not change as long as the slot is not
>> returned
>> >> >> from the job master, no matter its occupied or available in the slot
>> >> pool.
>> >> >> I think we have the same behavior currently. No matter how many
>> tasks
>> >> the
>> >> >> job master deploy into the slot, concurrently or sequentially, it is
>> >> one
>> >> >> allocation from the cluster to the job until the slot is freed from
>> >> the job
>> >> >> master.
>> >> >>
>> >> >>>    3. In a session cluster, some jobs are configured with operator
>> >> >>>    resources, meanwhile other jobs are using UNKNOWN. How to deal
>> with
>> >> >>> this
>> >> >>>    situation?
>> >> >>
>> >> >> As long as we do not mix unknown / specified resource profiles
>> within
>> >> the
>> >> >> same job / slot, there shouldn't be a problem. Resource manager
>> >> converts
>> >> >> unknown resource profiles in slot requests to specified default
>> >> resource
>> >> >> profiles, so they can be dynamically allocated from task executors'
>> >> >> available resources just as other slot requests with specified
>> resource
>> >> >> profiles.
>> >> >>
>> >> >> Thank you~
>> >> >>
>> >> >> Xintong Song
>> >> >>
>> >> >>
>> >> >>
>> >> >> On Mon, Aug 19, 2019 at 11:39 AM Yang Wang <[hidden email]>
>> >> wrote:
>> >> >>
>> >> >>> Hi Xintong,
>> >> >>>
>> >> >>>
>> >> >>> Thanks for your detailed proposal. I think many users are suffering
>> >> from
>> >> >>> waste of resources. The resource spec of all task managers are same
>> >> and
>> >> >>> we
>> >> >>> have to increase all task managers to make the heavy one more
>> stable.
>> >> So
>> >> >>> we
>> >> >>> will benefit from the fine grained resource management a lot. We
>> could
>> >> >>> get
>> >> >>> better resource utilization and stability.
>> >> >>>
>> >> >>>
>> >> >>> Just to share some thoughts.
>> >> >>>
>> >> >>>
>> >> >>>
>> >> >>>    1. How to calculate the resource specification of TaskManagers?
>> Do
>> >> >>> they
>> >> >>>    have them same resource spec calculated based on the
>> >> configuration? I
>> >> >>> think
>> >> >>>    we still have wasted resources in this situation. Or we could
>> start
>> >> >>>    TaskManagers with different spec.
>> >> >>>    2. If a slot is released and returned to SlotPool, does it
>> could be
>> >> >>>    reused by other SlotRequest that the request resource is smaller
>> >> than
>> >> >>> it?
>> >> >>>       - If it is yes, what happens to the available resource in the
>> >> >>>       TaskManager.
>> >> >>>       - What is the SlotStatus of the cached slot in SlotPool? The
>> >> >>>       AllocationId is null?
>> >> >>>    3. In a session cluster, some jobs are configured with operator
>> >> >>>    resources, meanwhile other jobs are using UNKNOWN. How to deal
>> with
>> >> >>> this
>> >> >>>    situation?
>> >> >>>
>> >> >>>
>> >> >>>
>> >> >>> Best,
>> >> >>> Yang
>> >> >>>
>> >> >>> On Fri, Aug 16, 2019 at 8:57 PM Xintong Song <[hidden email]> wrote:
>> >> >>>
>> >> >>> > Thanks for the feedbacks, Yangze and Till.
>> >> >>> >
>> >> >>> > Yangze,
>> >> >>> >
>> >> >>> > I agree with you that we should make scheduling strategy
>> pluggable
>> >> and
>> >> >>> > optimize the strategy to reduce the memory fragmentation problem,
>> >> and
>> >> >>> > thanks for the inputs on the potential algorithmic solutions.
>> >> However,
>> >> >>> I'm
>> >> >>> > in favor of keep this FLIP focusing on the overall mechanism
>> design
>> >> >>> rather
>> >> >>> > than strategies. Solving the fragmentation issue should be
>> >> considered
>> >> >>> as an
>> >> >>> > optimization, and I agree with Till that we probably should
>> tackle
>> >> this
>> >> >>> > afterwards.
>> >> >>> >
>> >> >>> > Till,
>> >> >>> >
>> >> >>> > - Regarding splitting the FLIP, I think it makes sense. The
>> operator
>> >> >>> > resource management and dynamic slot allocation do not have much
>> >> >>> dependency
>> >> >>> > on each other.
>> >> >>> >
>> >> >>> > - Regarding the default slot size, I think this is similar to
>> >> FLIP-49
>> >> >>> [1]
>> >> >>> > where we want all the deriving happens at one place. I think it
>> >> would
>> >> >>> be
>> >> >>> > nice to pass the default slot size into the task executor in the
>> >> same
>> >> >>> way
>> >> >>> > that we pass in the memory pool sizes in FLIP-49 [1].
>> >> >>> >
>> >> >>> > - Regarding the return value of
>> >> TaskExecutorGateway#requestResource, I
>> >> >>> > think you're right. We should avoid using null as the return
>> value.
>> >> I
>> >> >>> think
>> >> >>> > we probably should thrown an exception here.
>> >> >>> >
>> >> >>> > Thank you~
>> >> >>> >
>> >> >>> > Xintong Song
>> >> >>> >
>> >> >>> >
>> >> >>> > [1]
>> >> >>> >
>> >> >>> >
>> >> >>>
>> >>
>> https://cwiki.apache.org/confluence/display/FLINK/FLIP-49%3A+Unified+Memory+Configuration+for+TaskExecutors
>> >> >>> >
>> >> >>> > On Fri, Aug 16, 2019 at 2:18 PM Till Rohrmann <
>> [hidden email]
>> >> >
>> >> >>> > wrote:
>> >> >>> >
>> >> >>> > > Hi Xintong,
>> >> >>> > >
>> >> >>> > > thanks for drafting this FLIP. I think your proposal helps to
>> >> >>> improve the
>> >> >>> > > execution of batch jobs more efficiently. Moreover, it enables
>> the
>> >> >>> proper
>> >> >>> > > integration of the Blink planner which is very important as
>> well.
>> >> >>> > >
>> >> >>> > > Overall, the FLIP looks good to me. I was wondering whether it
>> >> >>> wouldn't
>> >> >>> > > make sense to actually split it up into two FLIPs: Operator
>> >> resource
>> >> >>> > > management and dynamic slot allocation. I think these two FLIPs
>> >> >>> could be
>> >> >>> > > seen as orthogonal and it would decrease the scope of each
>> >> individual
>> >> >>> > FLIP.
>> >> >>> > >
>> >> >>> > > Some smaller comments:
>> >> >>> > >
>> >> >>> > > - I'm not sure whether we should pass in the default slot size
>> >> via an
>> >> >>> > > environment variable. Without having unified the way how Flink
>> >> >>> components
>> >> >>> > > are configured [1], I think it would be better to pass it in as
>> >> part
>> >> >>> of
>> >> >>> > the
>> >> >>> > > configuration.
>> >> >>> > > - I would avoid returning a null value from
>> >> >>> > > TaskExecutorGateway#requestResource if it cannot be fulfilled.
>> >> >>> Either we
>> >> >>> > > should introduce an explicit return value saying this or throw
>> an
>> >> >>> > > exception.
>> >> >>> > >
>> >> >>> > > Concerning Yangze's comments: I think you are right that it
>> would
>> >> be
>> >> >>> > > helpful to make the selection strategy pluggable. Also batching
>> >> slot
>> >> >>> > > requests to the RM could be a good optimization. For the sake
>> of
>> >> >>> keeping
>> >> >>> > > the scope of this FLIP smaller I would try to tackle these
>> things
>> >> >>> after
>> >> >>> > the
>> >> >>> > > initial version has been completed (without spoiling these
>> >> >>> optimization
>> >> >>> > > opportunities). In particular batching the slot requests
>> depends
>> >> on
>> >> >>> the
>> >> >>> > > current scheduler refactoring and could also be realized on
>> the RM
>> >> >>> side
>> >> >>> > > only.
>> >> >>> > >
>> >> >>> > > [1]
>> >> >>> > >
>> >> >>> > >
>> >> >>> >
>> >> >>>
>> >>
>> https://cwiki.apache.org/confluence/display/FLINK/FLIP-54%3A+Evolve+ConfigOption+and+Configuration
>> >> >>> > >
>> >> >>> > > Cheers,
>> >> >>> > > Till
>> >> >>> > >
>> >> >>> > >
>> >> >>> > >
>> >> >>> > > On Fri, Aug 16, 2019 at 11:11 AM Yangze Guo <
>> [hidden email]>
>> >> >>> wrote:
>> >> >>> > >
>> >> >>> > > > Hi, Xintong
>> >> >>> > > >
>> >> >>> > > > Thanks to propose this FLIP. The general design looks good to
>> >> me,
>> >> >>> +1
>> >> >>> > > > for this feature.
>> >> >>> > > >
>> >> >>> > > > Since slots in the same task executor could have different
>> >> resource
>> >> >>> > > > profile, we will
>> >> >>> > > > meet resource fragment problem. Think about this case:
>> >> >>> > > >  - request A want 1G memory while request B & C want 0.5G
>> memory
>> >> >>> > > >  - There are two task executors T1 & T2 with 1G and 0.5G free
>> >> >>> memory
>> >> >>> > > > respectively
>> >> >>> > > > If B come first and we cut a slot from T1 for B, A must wait
>> for
>> >> >>> the
>> >> >>> > > > free resource from
>> >> >>> > > > other task. But A could have been scheduled immediately if we
>> >> cut a
>> >> >>> > > > slot from T2 for B.
>> >> >>> > > >
>> >> >>> > > > The logic of findMatchingSlot now become finding a task
>> executor
>> >> >>> which
>> >> >>> > > > has enough
>> >> >>> > > > resource and then cut a slot from it. Current method could be
>> >> seen
>> >> >>> as
>> >> >>> > > > "First-fit strategy",
>> >> >>> > > > which works well in general but sometimes could not be the
>> >> >>> optimization
>> >> >>> > > > method.
>> >> >>> > > >
>> >> >>> > > > Actually, this problem could be abstracted as "Bin Packing
>> >> >>> Problem"[1].
>> >> >>> > > > Here are
>> >> >>> > > > some common approximate algorithms:
>> >> >>> > > > - First fit
>> >> >>> > > > - Next fit
>> >> >>> > > > - Best fit
>> >> >>> > > >
>> >> >>> > > > But it become multi-dimensional bin packing problem if we
>> take
>> >> CPU
>> >> >>> > > > into account. It hard
>> >> >>> > > > to define which one is best fit now. Some research addressed
>> >> this
>> >> >>> > > > problem, such like Tetris[2].
>> >> >>> > > >
>> >> >>> > > > Here are some thinking about it:
>> >> >>> > > > 1. We could make the strategy of finding matching task
>> executor
>> >> >>> > > > pluginable. Let user to config the
>> >> >>> > > > best strategy in their scenario.
>> >> >>> > > > 2. We could support batch request interface in RM, because we
>> >> have
>> >> >>> > > > opportunities to optimize
>> >> >>> > > > if we have more information. If we know the A, B, C at the
>> same
>> >> >>> time,
>> >> >>> > > > we could always make the best decision.
>> >> >>> > > >
>> >> >>> > > > [1] http://www.or.deis.unibo.it/kp/Chapter8.pdf
>> >> >>> > > > [2]
>> >> >>> >
>> >> https://www.cs.cmu.edu/~xia/resources/Documents/grandl_sigcomm14.pdf
>> >> >>> > > >
>> >> >>> > > > Best,
>> >> >>> > > > Yangze Guo
>> >> >>> > > >
>> >> >>> > > > On Thu, Aug 15, 2019 at 10:40 PM Xintong Song <
>> >> >>> [hidden email]>
>> >> >>> > > > wrote:
>> >> >>> > > > >
>> >> >>> > > > > Hi everyone,
>> >> >>> > > > >
>> >> >>> > > > > We would like to start a discussion thread on "FLIP-53:
>> Fine
>> >> >>> Grained
>> >> >>> > > > > Resource Management"[1], where we propose how to improve
>> Flink
>> >> >>> > resource
>> >> >>> > > > > management and scheduling.
>> >> >>> > > > >
>> >> >>> > > > > This FLIP mainly discusses the following issues.
>> >> >>> > > > >
>> >> >>> > > > >    - How to support tasks with fine grained resource
>> >> >>> requirements.
>> >> >>> > > > >    - How to unify resource management for jobs with /
>> without
>> >> >>> fine
>> >> >>> > > > grained
>> >> >>> > > > >    resource requirements.
>> >> >>> > > > >    - How to unify resource management for streaming / batch
>> >> jobs.
>> >> >>> > > > >
>> >> >>> > > > > Key changes proposed in the FLIP are as follows.
>> >> >>> > > > >
>> >> >>> > > > >    - Unify memory management for operators with / without
>> fine
>> >> >>> > grained
>> >> >>> > > > >    resource requirements by applying a fraction based quota
>> >> >>> > mechanism.
>> >> >>> > > > >    - Unify resource scheduling for streaming and batch
>> jobs by
>> >> >>> > setting
>> >> >>> > > > slot
>> >> >>> > > > >    sharing groups for pipelined regions during compiling
>> >> stage.
>> >> >>> > > > >    - Dynamically allocate slots from task executors'
>> available
>> >> >>> > > resources.
>> >> >>> > > > >
>> >> >>> > > > > Please find more details in the FLIP wiki document [1].
>> >> Looking
>> >> >>> > forward
>> >> >>> > > > to
>> >> >>> > > > > your feedbacks.
>> >> >>> > > > >
>> >> >>> > > > > Thank you~
>> >> >>> > > > >
>> >> >>> > > > > Xintong Song
>> >> >>> > > > >
>> >> >>> > > > >
>> >> >>> > > > > [1]
>> >> >>> > > > >
>> >> >>> > > >
>> >> >>> > >
>> >> >>> >
>> >> >>>
>> >>
>> https://cwiki.apache.org/confluence/display/FLINK/FLIP-53%3A+Fine+Grained+Resource+Management
>> >> >>> > > >
>> >> >>> > >
>> >> >>> >
>> >> >>>
>> >> >>
>> >>
>> >
>>
>
Reply | Threaded
Open this post in threaded view
|

Re: [DISCUSS] FLIP-53: Fine Grained Resource Management

Xintong Song
Updated the FLIP wiki page [1], with the following changes.

   - Remove the step of converting pipelined edges between different slot
   sharing groups into blocking edges.
   - Set `allSourcesInSamePipelinedRegion` to true by default.
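
For context on the first change: the edges that conversion step would have
touched are the ones users create by explicitly assigning slot sharing groups
in the DataStream API, which is the case Till pointed out in the quoted
discussion below. A minimal, purely illustrative job of that shape (group
names and functions are arbitrary, not taken from the FLIP) could look like:

    import org.apache.flink.api.common.functions.MapFunction;
    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

    // Purely illustrative: an existing streaming job with two explicit slot
    // sharing groups. The pipelined edge between "group-a" and "group-b" must
    // stay pipelined, which is why the conversion step was dropped.
    public class TwoSlotSharingGroupsJob {
        public static void main(String[] args) throws Exception {
            StreamExecutionEnvironment env =
                    StreamExecutionEnvironment.getExecutionEnvironment();

            env.fromElements(1, 2, 3)
                    .map(new MapFunction<Integer, Integer>() {
                        @Override
                        public Integer map(Integer value) {
                            return value * 2;
                        }
                    })
                    .slotSharingGroup("group-a")  // this map runs in group-a
                    .map(new MapFunction<Integer, Integer>() {
                        @Override
                        public Integer map(Integer value) {
                            return value + 1;
                        }
                    })
                    .slotSharingGroup("group-b")  // this map (and the sink) runs in group-b
                    .print();

            env.execute("two-slot-sharing-groups");
        }
    }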

Thank you~

Xintong Song



On Mon, Sep 2, 2019 at 11:50 AM Xintong Song <[hidden email]> wrote:

> Regarding changing edge type, I think actually we don't need to do this
> for batch jobs neither, because we don't have public interfaces for users
> to explicitly set slot sharing groups in DataSet API and SQL/Table API. We
> have such interfaces in DataStream API only.
>
> Thank you~
>
> Xintong Song
>
>
>
> On Tue, Aug 27, 2019 at 10:16 PM Xintong Song <[hidden email]>
> wrote:
>
>> Thanks for the correction, Till.
>>
>> Regarding your comments:
>> - You are right, we should not change the edge type for streaming jobs.
>> Then I think we can change the option 'allSourcesInSamePipelinedRegion' in
>> step 2 to 'isStreamingJob', and implement the current step 2 before the
>> current step 1 so we can use this option to decide whether should change
>> the edge type. What do you think?
>> - Agree. It should be easier to make the default value of
>> 'allSourcesInSamePipelinedRegion' (or 'isStreamingJob') 'true', and set it
>> to 'false' when using DataSet API or blink planner.
>>
>> Thank you~
>>
>> Xintong Song
>>
>>
>>
>> On Tue, Aug 27, 2019 at 8:59 PM Till Rohrmann <[hidden email]>
>> wrote:
>>
>>> Thanks for creating the implementation plan Xintong. Overall, the
>>> implementation plan looks good. I had a couple of comments:
>>>
>>> - What will happen if a user has defined a streaming job with two slot
>>> sharing groups? Would the code insert a blocking data exchange between
>>> these two groups? If yes, then this breaks existing Flink streaming jobs.
>>> - How do we detect unbounded streaming jobs to set
>>> the allSourcesInSamePipelinedRegion to `true`? Wouldn't it be easier to
>>> set
>>> it false if we are using the DataSet API or the Blink planner with a
>>> bounded job?
>>>
>>> Cheers,
>>> Till
>>>
>>> On Tue, Aug 27, 2019 at 2:16 PM Till Rohrmann <[hidden email]>
>>> wrote:
>>>
>>> > I guess there is a typo since the link to the FLIP-53 is
>>> >
>>> https://cwiki.apache.org/confluence/display/FLINK/FLIP-53%3A+Fine+Grained+Operator+Resource+Management
>>> >
>>> > Cheers,
>>> > Till
>>> >
>>> > On Tue, Aug 27, 2019 at 1:42 PM Xintong Song <[hidden email]>
>>> > wrote:
>>> >
>>> >> Added implementation steps for this FLIP on the wiki page [1].
>>> >>
>>> >>
>>> >> Thank you~
>>> >>
>>> >> Xintong Song
>>> >>
>>> >>
>>> >> [1]
>>> >>
>>> >>
>>> https://cwiki.apache.org/confluence/display/FLINK/FLIP-49%3A+Unified+Memory+Configuration+for+TaskExecutors
>>> >>
>>> >> On Mon, Aug 19, 2019 at 10:29 PM Xintong Song <[hidden email]>
>>> >> wrote:
>>> >>
>>> >> > Hi everyone,
>>> >> >
>>> >> > As Till suggested, the original "FLIP-53: Fine Grained Resource
>>> >> > Management" splits into two separate FLIPs,
>>> >> >
>>> >> >    - FLIP-53: Fine Grained Operator Resource Management [1]
>>> >> >    - FLIP-56: Dynamic Slot Allocation [2]
>>> >> >
>>> >> > We'll continue using this discussion thread for FLIP-53. For
>>> FLIP-56, I
>>> >> > just started a new discussion thread [3].
>>> >> >
>>> >> > Thank you~
>>> >> >
>>> >> > Xintong Song
>>> >> >
>>> >> >
>>> >> > [1]
>>> >> >
>>> >>
>>> https://cwiki.apache.org/confluence/display/FLINK/FLIP-53%3A+Fine+Grained+Operator+Resource+Management
>>> >> >
>>> >> > [2]
>>> >> >
>>> >>
>>> https://cwiki.apache.org/confluence/display/FLINK/FLIP-56%3A+Dynamic+Slot+Allocation
>>> >> >
>>> >> > [3]
>>> >> >
>>> >>
>>> http://apache-flink-mailing-list-archive.1008284.n3.nabble.com/DISCUSS-FLIP-56-Dynamic-Slot-Allocation-td31960.html
>>> >> >
>>> >> > On Mon, Aug 19, 2019 at 2:55 PM Xintong Song <[hidden email]
>>> >
>>> >> > wrote:
>>> >> >
>>> >> >> Thinks for the comments, Yang.
>>> >> >>
>>> >> >> Regarding your questions:
>>> >> >>
>>> >> >>    1. How to calculate the resource specification of TaskManagers?
>>> Do
>>> >> they
>>> >> >>>    have them same resource spec calculated based on the
>>> >> configuration? I
>>> >> >>> think
>>> >> >>>    we still have wasted resources in this situation. Or we could
>>> start
>>> >> >>>    TaskManagers with different spec.
>>> >> >>>
>>> >> >> I agree with you that we can further improve the resource utility
>>> by
>>> >> >> customizing task executors with different resource specifications.
>>> >> However,
>>> >> >> I'm in favor of limiting the scope of this FLIP and leave it as a
>>> >> future
>>> >> >> optimization. The plan for that part is to move the logic of
>>> deciding
>>> >> task
>>> >> >> executor specifications into the slot manager and make slot manager
>>> >> >> pluggable, so inside the slot manager plugin we can have different
>>> >> logics
>>> >> >> for deciding the task executor specifications.
>>> >> >>
>>> >> >>
>>> >> >>>    2. If a slot is released and returned to SlotPool, does it
>>> could be
>>> >> >>>    reused by other SlotRequest that the request resource is
>>> smaller
>>> >> than
>>> >> >>> it?
>>> >> >>>
>>> >> >> No, I think slot pool should always return slots if they do not
>>> exactly
>>> >> >> match the pending requests, so that resource manager can deal with
>>> the
>>> >> >> extra resources.
>>> >> >>
>>> >> >>>       - If it is yes, what happens to the available resource in
>>> the
>>> >> >>
>>> >> >>       TaskManager.
>>> >> >>>       - What is the SlotStatus of the cached slot in SlotPool? The
>>> >> >>>       AllocationId is null?
>>> >> >>>
>>> >> >> The allocation id does not change as long as the slot is not
>>> returned
>>> >> >> from the job master, no matter its occupied or available in the
>>> slot
>>> >> pool.
>>> >> >> I think we have the same behavior currently. No matter how many
>>> tasks
>>> >> the
>>> >> >> job master deploy into the slot, concurrently or sequentially, it
>>> is
>>> >> one
>>> >> >> allocation from the cluster to the job until the slot is freed from
>>> >> the job
>>> >> >> master.
>>> >> >>
>>> >> >>>    3. In a session cluster, some jobs are configured with operator
>>> >> >>>    resources, meanwhile other jobs are using UNKNOWN. How to deal
>>> with
>>> >> >>> this
>>> >> >>>    situation?
>>> >> >>
>>> >> >> As long as we do not mix unknown / specified resource profiles
>>> within
>>> >> the
>>> >> >> same job / slot, there shouldn't be a problem. Resource manager
>>> >> converts
>>> >> >> unknown resource profiles in slot requests to specified default
>>> >> resource
>>> >> >> profiles, so they can be dynamically allocated from task executors'
>>> >> >> available resources just as other slot requests with specified
>>> resource
>>> >> >> profiles.
>>> >> >>
>>> >> >> Thank you~
>>> >> >>
>>> >> >> Xintong Song
>>> >> >>
>>> >> >>
>>> >> >>
>>> >> >> On Mon, Aug 19, 2019 at 11:39 AM Yang Wang <[hidden email]>
>>> >> wrote:
>>> >> >>
>>> >> >>> Hi Xintong,
>>> >> >>>
>>> >> >>>
>>> >> >>> Thanks for your detailed proposal. I think many users are
>>> suffering
>>> >> from
>>> >> >>> waste of resources. The resource spec of all task managers are
>>> same
>>> >> and
>>> >> >>> we
>>> >> >>> have to increase all task managers to make the heavy one more
>>> stable.
>>> >> So
>>> >> >>> we
>>> >> >>> will benefit from the fine grained resource management a lot. We
>>> could
>>> >> >>> get
>>> >> >>> better resource utilization and stability.
>>> >> >>>
>>> >> >>>
>>> >> >>> Just to share some thoughts.
>>> >> >>>
>>> >> >>>
>>> >> >>>
>>> >> >>>    1. How to calculate the resource specification of
>>> TaskManagers? Do
>>> >> >>> they
>>> >> >>>    have them same resource spec calculated based on the
>>> >> configuration? I
>>> >> >>> think
>>> >> >>>    we still have wasted resources in this situation. Or we could
>>> start
>>> >> >>>    TaskManagers with different spec.
>>> >> >>>    2. If a slot is released and returned to SlotPool, does it
>>> could be
>>> >> >>>    reused by other SlotRequest that the request resource is
>>> smaller
>>> >> than
>>> >> >>> it?
>>> >> >>>       - If it is yes, what happens to the available resource in
>>> the
>>> >> >>>       TaskManager.
>>> >> >>>       - What is the SlotStatus of the cached slot in SlotPool? The
>>> >> >>>       AllocationId is null?
>>> >> >>>    3. In a session cluster, some jobs are configured with operator
>>> >> >>>    resources, meanwhile other jobs are using UNKNOWN. How to deal
>>> with
>>> >> >>> this
>>> >> >>>    situation?
>>> >> >>>
>>> >> >>>
>>> >> >>>
>>> >> >>> Best,
>>> >> >>> Yang
>>> >> >>>
>>> >> >>> Xintong Song <[hidden email]> 于2019年8月16日周五 下午8:57写道:
>>> >> >>>
>>> >> >>> > Thanks for the feedbacks, Yangze and Till.
>>> >> >>> >
>>> >> >>> > Yangze,
>>> >> >>> >
>>> >> >>> > I agree with you that we should make scheduling strategy
>>> pluggable
>>> >> and
>>> >> >>> > optimize the strategy to reduce the memory fragmentation
>>> problem,
>>> >> and
>>> >> >>> > thanks for the inputs on the potential algorithmic solutions.
>>> >> However,
>>> >> >>> I'm
>>> >> >>> > in favor of keep this FLIP focusing on the overall mechanism
>>> design
>>> >> >>> rather
>>> >> >>> > than strategies. Solving the fragmentation issue should be
>>> >> considered
>>> >> >>> as an
>>> >> >>> > optimization, and I agree with Till that we probably should
>>> tackle
>>> >> this
>>> >> >>> > afterwards.
>>> >> >>> >
>>> >> >>> > Till,
>>> >> >>> >
>>> >> >>> > - Regarding splitting the FLIP, I think it makes sense. The
>>> operator
>>> >> >>> > resource management and dynamic slot allocation do not have much
>>> >> >>> dependency
>>> >> >>> > on each other.
>>> >> >>> >
>>> >> >>> > - Regarding the default slot size, I think this is similar to
>>> >> FLIP-49
>>> >> >>> [1]
>>> >> >>> > where we want all the deriving happens at one place. I think it
>>> >> would
>>> >> >>> be
>>> >> >>> > nice to pass the default slot size into the task executor in the
>>> >> same
>>> >> >>> way
>>> >> >>> > that we pass in the memory pool sizes in FLIP-49 [1].
>>> >> >>> >
>>> >> >>> > - Regarding the return value of
>>> >> TaskExecutorGateway#requestResource, I
>>> >> >>> > think you're right. We should avoid using null as the return
>>> value.
>>> >> I
>>> >> >>> think
>>> >> >>> > we probably should thrown an exception here.
>>> >> >>> >
>>> >> >>> > Thank you~
>>> >> >>> >
>>> >> >>> > Xintong Song
>>> >> >>> >
>>> >> >>> >
>>> >> >>> > [1]
>>> >> >>> >
>>> >> >>> >
>>> >> >>>
>>> >>
>>> https://cwiki.apache.org/confluence/display/FLINK/FLIP-49%3A+Unified+Memory+Configuration+for+TaskExecutors
>>> >> >>> >
>>> >> >>> > On Fri, Aug 16, 2019 at 2:18 PM Till Rohrmann <
>>> [hidden email]
>>> >> >
>>> >> >>> > wrote:
>>> >> >>> >
>>> >> >>> > > Hi Xintong,
>>> >> >>> > >
>>> >> >>> > > thanks for drafting this FLIP. I think your proposal helps to
>>> >> >>> improve the
>>> >> >>> > > execution of batch jobs more efficiently. Moreover, it
>>> enables the
>>> >> >>> proper
>>> >> >>> > > integration of the Blink planner which is very important as
>>> well.
>>> >> >>> > >
>>> >> >>> > > Overall, the FLIP looks good to me. I was wondering whether it
>>> >> >>> wouldn't
>>> >> >>> > > make sense to actually split it up into two FLIPs: Operator
>>> >> resource
>>> >> >>> > > management and dynamic slot allocation. I think these two
>>> FLIPs
>>> >> >>> could be
>>> >> >>> > > seen as orthogonal and it would decrease the scope of each
>>> >> individual
>>> >> >>> > FLIP.
>>> >> >>> > >
>>> >> >>> > > Some smaller comments:
>>> >> >>> > >
>>> >> >>> > > - I'm not sure whether we should pass in the default slot size
>>> >> via an
>>> >> >>> > > environment variable. Without having unified the way how Flink
>>> >> >>> components
>>> >> >>> > > are configured [1], I think it would be better to pass it in
>>> as
>>> >> part
>>> >> >>> of
>>> >> >>> > the
>>> >> >>> > > configuration.
>>> >> >>> > > - I would avoid returning a null value from
>>> >> >>> > > TaskExecutorGateway#requestResource if it cannot be fulfilled.
>>> >> >>> Either we
>>> >> >>> > > should introduce an explicit return value saying this or
>>> throw an
>>> >> >>> > > exception.
>>> >> >>> > >
>>> >> >>> > > Concerning Yangze's comments: I think you are right that it
>>> would
>>> >> be
>>> >> >>> > > helpful to make the selection strategy pluggable. Also
>>> batching
>>> >> slot
>>> >> >>> > > requests to the RM could be a good optimization. For the sake
>>> of
>>> >> >>> keeping
>>> >> >>> > > the scope of this FLIP smaller I would try to tackle these
>>> things
>>> >> >>> after
>>> >> >>> > the
>>> >> >>> > > initial version has been completed (without spoiling these
>>> >> >>> optimization
>>> >> >>> > > opportunities). In particular batching the slot requests
>>> depends
>>> >> on
>>> >> >>> the
>>> >> >>> > > current scheduler refactoring and could also be realized on
>>> the RM
>>> >> >>> side
>>> >> >>> > > only.
>>> >> >>> > >
>>> >> >>> > > [1]
>>> >> >>> > >
>>> >> >>> > >
>>> >> >>> >
>>> >> >>>
>>> >>
>>> https://cwiki.apache.org/confluence/display/FLINK/FLIP-54%3A+Evolve+ConfigOption+and+Configuration
>>> >> >>> > >
>>> >> >>> > > Cheers,
>>> >> >>> > > Till
>>> >> >>> > >
>>> >> >>> > >
>>> >> >>> > >
>>> >> >>> > > On Fri, Aug 16, 2019 at 11:11 AM Yangze Guo <
>>> [hidden email]>
>>> >> >>> wrote:
>>> >> >>> > >
>>> >> >>> > > > Hi, Xintong
>>> >> >>> > > >
>>> >> >>> > > > Thanks to propose this FLIP. The general design looks good
>>> to
>>> >> me,
>>> >> >>> +1
>>> >> >>> > > > for this feature.
>>> >> >>> > > >
>>> >> >>> > > > Since slots in the same task executor could have different
>>> >> resource
>>> >> >>> > > > profile, we will
>>> >> >>> > > > meet resource fragment problem. Think about this case:
>>> >> >>> > > >  - request A want 1G memory while request B & C want 0.5G
>>> memory
>>> >> >>> > > >  - There are two task executors T1 & T2 with 1G and 0.5G
>>> free
>>> >> >>> memory
>>> >> >>> > > > respectively
>>> >> >>> > > > If B come first and we cut a slot from T1 for B, A must
>>> wait for
>>> >> >>> the
>>> >> >>> > > > free resource from
>>> >> >>> > > > other task. But A could have been scheduled immediately if
>>> we
>>> >> cut a
>>> >> >>> > > > slot from T2 for B.
>>> >> >>> > > >
>>> >> >>> > > > The logic of findMatchingSlot now become finding a task
>>> executor
>>> >> >>> which
>>> >> >>> > > > has enough
>>> >> >>> > > > resource and then cut a slot from it. Current method could
>>> be
>>> >> seen
>>> >> >>> as
>>> >> >>> > > > "First-fit strategy",
>>> >> >>> > > > which works well in general but sometimes could not be the
>>> >> >>> optimization
>>> >> >>> > > > method.
>>> >> >>> > > >
>>> >> >>> > > > Actually, this problem could be abstracted as "Bin Packing
>>> >> >>> Problem"[1].
>>> >> >>> > > > Here are
>>> >> >>> > > > some common approximate algorithms:
>>> >> >>> > > > - First fit
>>> >> >>> > > > - Next fit
>>> >> >>> > > > - Best fit
>>> >> >>> > > >
>>> >> >>> > > > But it become multi-dimensional bin packing problem if we
>>> take
>>> >> CPU
>>> >> >>> > > > into account. It hard
>>> >> >>> > > > to define which one is best fit now. Some research addressed
>>> >> this
>>> >> >>> > > > problem, such like Tetris[2].
>>> >> >>> > > >
>>> >> >>> > > > Here are some thinking about it:
>>> >> >>> > > > 1. We could make the strategy of finding matching task
>>> executor
>>> >> >>> > > > pluginable. Let user to config the
>>> >> >>> > > > best strategy in their scenario.
>>> >> >>> > > > 2. We could support batch request interface in RM, because
>>> we
>>> >> have
>>> >> >>> > > > opportunities to optimize
>>> >> >>> > > > if we have more information. If we know the A, B, C at the
>>> same
>>> >> >>> time,
>>> >> >>> > > > we could always make the best decision.
>>> >> >>> > > >
>>> >> >>> > > > [1] http://www.or.deis.unibo.it/kp/Chapter8.pdf
>>> >> >>> > > > [2]
>>> >> >>> >
>>> >> https://www.cs.cmu.edu/~xia/resources/Documents/grandl_sigcomm14.pdf
>>> >> >>> > > >
>>> >> >>> > > > Best,
>>> >> >>> > > > Yangze Guo
>>> >> >>> > > >
>>> >> >>> > > > On Thu, Aug 15, 2019 at 10:40 PM Xintong Song <
>>> >> >>> [hidden email]>
>>> >> >>> > > > wrote:
>>> >> >>> > > > >
>>> >> >>> > > > > Hi everyone,
>>> >> >>> > > > >
>>> >> >>> > > > > We would like to start a discussion thread on "FLIP-53:
>>> Fine
>>> >> >>> Grained
>>> >> >>> > > > > Resource Management"[1], where we propose how to improve
>>> Flink
>>> >> >>> > resource
>>> >> >>> > > > > management and scheduling.
>>> >> >>> > > > >
>>> >> >>> > > > > This FLIP mainly discusses the following issues.
>>> >> >>> > > > >
>>> >> >>> > > > >    - How to support tasks with fine grained resource
>>> >> >>> requirements.
>>> >> >>> > > > >    - How to unify resource management for jobs with /
>>> without
>>> >> >>> fine
>>> >> >>> > > > grained
>>> >> >>> > > > >    resource requirements.
>>> >> >>> > > > >    - How to unify resource management for streaming /
>>> batch
>>> >> jobs.
>>> >> >>> > > > >
>>> >> >>> > > > > Key changes proposed in the FLIP are as follows.
>>> >> >>> > > > >
>>> >> >>> > > > >    - Unify memory management for operators with / without
>>> fine
>>> >> >>> > grained
>>> >> >>> > > > >    resource requirements by applying a fraction based
>>> quota
>>> >> >>> > mechanism.
>>> >> >>> > > > >    - Unify resource scheduling for streaming and batch
>>> jobs by
>>> >> >>> > setting
>>> >> >>> > > > slot
>>> >> >>> > > > >    sharing groups for pipelined regions during compiling
>>> >> stage.
>>> >> >>> > > > >    - Dynamically allocate slots from task executors'
>>> available
>>> >> >>> > > resources.
>>> >> >>> > > > >
>>> >> >>> > > > > Please find more details in the FLIP wiki document [1].
>>> >> Looking
>>> >> >>> > forward
>>> >> >>> > > > to
>>> >> >>> > > > > your feedbacks.
>>> >> >>> > > > >
>>> >> >>> > > > > Thank you~
>>> >> >>> > > > >
>>> >> >>> > > > > Xintong Song
>>> >> >>> > > > >
>>> >> >>> > > > >
>>> >> >>> > > > > [1]
>>> >> >>> > > > >
>>> >> >>> > > >
>>> >> >>> > >
>>> >> >>> >
>>> >> >>>
>>> >>
>>> https://cwiki.apache.org/confluence/display/FLINK/FLIP-53%3A+Fine+Grained+Resource+Management
>>> >> >>> > > >
>>> >> >>> > >
>>> >> >>> >
>>> >> >>>
>>> >> >>
>>> >>
>>> >
>>>
>>
Reply | Threaded
Open this post in threaded view
|

Re: [DISCUSS] FLIP-53: Fine Grained Resource Management

Zhu Zhu
Thanks Xintong for proposing this improvement. Fine grained resources can
be very helpful when users have planned their resources well.

I have a few questions:
1. Currently in a batch job, vertices from different regions can run at the
same time in slots from the same shared group, as long as they have no
data dependency on each other and the available slot count is not smaller
than the *max* parallelism over all tasks.
With the changes in this FLIP, however, tasks from different regions cannot
share slots anymore.
Once the available slot count is smaller than the *sum* of the parallelisms
of tasks over all regions, tasks may need to be executed sequentially, which
might result in a performance regression (a small numeric illustration
follows after question 2).
Is this (a performance regression for existing DataSet jobs) considered a
necessary and accepted trade-off in this FLIP?

2. The network memory depends on the input/output ExecutionEdge count and
thus can differ even across parallel instances of the same JobVertex.
Does this mean that, when adding up task resources to calculate the slot
resource for a shared group, the maximum possible network memory among the
vertex instances shall be used?
This might result in a larger resource requirement than actually needed.
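
To make these two points concrete, here is a small made-up illustration (the
numbers are arbitrary and only meant to show the arithmetic): consider a
DataSet job with two independent pipelined regions, each containing tasks of
parallelism 8. Today both regions can run concurrently in the same 8 shared
slots, since 8 >= max(8, 8). If each region becomes its own slot sharing
group, running both regions concurrently needs sum = 8 + 8 = 16 slots, so
with only 8 slots available the regions would run one after the other. For
the network memory point: if one subtask of a vertex has 4 input channels
and another has 6 (because their ExecutionEdge counts differ), sizing the
shared slot by the maximum means every slot reserves network memory for 6
channels even where 4 would suffice.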

And some minor comments:
1. Regarding "fracManagedMemOnHeap = 1 / numOpsUseOnHeapManagedMemory", I
guess you mean numOpsUseOnHeapManagedMemoryInTheSameSharedGroup ?
2. I think the *StreamGraphGenerator* in the #Slot Sharing section and
implementation step 4 should be *StreamingJobGraphGenerator*, as
*StreamGraphGenerator* is not aware of the JobGraph and pipelined regions.
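
For readers skimming the thread, here is a rough sketch of the fraction
calculation being discussed. All class and helper names are hypothetical;
only the UNKNOWN-resource formula (1 / the number of operators in the same
shared group that use on-heap managed memory) comes from the FLIP text, and
the specified-resource branch is just one possible reading, not a confirmed
design:

    import java.util.List;

    // Hypothetical sketch, not actual Flink code.
    public class ManagedMemoryFractionSketch {

        // Minimal stand-in for an operator's declared on-heap managed memory.
        public static class OpSpec {
            final boolean unknown;          // operator uses UNKNOWN resources
            final long onHeapManagedMemMb;  // only meaningful when unknown == false
            OpSpec(boolean unknown, long onHeapManagedMemMb) {
                this.unknown = unknown;
                this.onHeapManagedMemMb = onHeapManagedMemMb;
            }
        }

        // Fraction of the slot's on-heap managed memory assigned to one operator,
        // given all operators in the same slot sharing group that use on-heap
        // managed memory. How UNKNOWN and specified operators mix within one
        // group is exactly the open question raised later in this thread.
        public static double fracManagedMemOnHeap(OpSpec op, List<OpSpec> groupOps) {
            if (op.unknown) {
                // FLIP formula for UNKNOWN resources: split evenly among the
                // operators in the group that use on-heap managed memory.
                return 1.0 / groupOps.size();
            }
            // Possible handling of specified resources: the operator's share of
            // the group's total declared on-heap managed memory.
            long groupTotalMb = groupOps.stream()
                    .filter(o -> !o.unknown)
                    .mapToLong(o -> o.onHeapManagedMemMb)
                    .sum();
            return (double) op.onHeapManagedMemMb / groupTotalMb;
        }
    }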


Thanks,
Zhu Zhu

Xintong Song <[hidden email]> 于2019年9月2日周一 上午11:59写道:

> Updated the FLIP wiki page [1], with the following changes.
>
>    - Remove the step of converting pipelined edges between different slot
>    sharing groups into blocking edges.
>    - Set `allSourcesInSamePipelinedRegion` to true by default.
>
> Thank you~
>
> Xintong Song
>
>
>
> On Mon, Sep 2, 2019 at 11:50 AM Xintong Song <[hidden email]>
> wrote:
>
> > Regarding changing edge type, I think actually we don't need to do this
> > for batch jobs neither, because we don't have public interfaces for users
> > to explicitly set slot sharing groups in DataSet API and SQL/Table API.
> We
> > have such interfaces in DataStream API only.
> >
> > Thank you~
> >
> > Xintong Song
> >
> >
> >
> > On Tue, Aug 27, 2019 at 10:16 PM Xintong Song <[hidden email]>
> > wrote:
> >
> >> Thanks for the correction, Till.
> >>
> >> Regarding your comments:
> >> - You are right, we should not change the edge type for streaming jobs.
> >> Then I think we can change the option 'allSourcesInSamePipelinedRegion'
> in
> >> step 2 to 'isStreamingJob', and implement the current step 2 before the
> >> current step 1 so we can use this option to decide whether should change
> >> the edge type. What do you think?
> >> - Agree. It should be easier to make the default value of
> >> 'allSourcesInSamePipelinedRegion' (or 'isStreamingJob') 'true', and set
> it
> >> to 'false' when using DataSet API or blink planner.
> >>
> >> Thank you~
> >>
> >> Xintong Song
> >>
> >>
> >>
> >> On Tue, Aug 27, 2019 at 8:59 PM Till Rohrmann <[hidden email]>
> >> wrote:
> >>
> >>> Thanks for creating the implementation plan Xintong. Overall, the
> >>> implementation plan looks good. I had a couple of comments:
> >>>
> >>> - What will happen if a user has defined a streaming job with two slot
> >>> sharing groups? Would the code insert a blocking data exchange between
> >>> these two groups? If yes, then this breaks existing Flink streaming
> jobs.
> >>> - How do we detect unbounded streaming jobs to set
> >>> the allSourcesInSamePipelinedRegion to `true`? Wouldn't it be easier to
> >>> set
> >>> it false if we are using the DataSet API or the Blink planner with a
> >>> bounded job?
> >>>
> >>> Cheers,
> >>> Till
> >>>
> >>> On Tue, Aug 27, 2019 at 2:16 PM Till Rohrmann <[hidden email]>
> >>> wrote:
> >>>
> >>> > I guess there is a typo since the link to the FLIP-53 is
> >>> >
> >>>
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-53%3A+Fine+Grained+Operator+Resource+Management
> >>> >
> >>> > Cheers,
> >>> > Till
> >>> >
> >>> > On Tue, Aug 27, 2019 at 1:42 PM Xintong Song <[hidden email]>
> >>> > wrote:
> >>> >
> >>> >> Added implementation steps for this FLIP on the wiki page [1].
> >>> >>
> >>> >>
> >>> >> Thank you~
> >>> >>
> >>> >> Xintong Song
> >>> >>
> >>> >>
> >>> >> [1]
> >>> >>
> >>> >>
> >>>
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-49%3A+Unified+Memory+Configuration+for+TaskExecutors
> >>> >>
> >>> >> On Mon, Aug 19, 2019 at 10:29 PM Xintong Song <
> [hidden email]>
> >>> >> wrote:
> >>> >>
> >>> >> > Hi everyone,
> >>> >> >
> >>> >> > As Till suggested, the original "FLIP-53: Fine Grained Resource
> >>> >> > Management" splits into two separate FLIPs,
> >>> >> >
> >>> >> >    - FLIP-53: Fine Grained Operator Resource Management [1]
> >>> >> >    - FLIP-56: Dynamic Slot Allocation [2]
> >>> >> >
> >>> >> > We'll continue using this discussion thread for FLIP-53. For
> >>> FLIP-56, I
> >>> >> > just started a new discussion thread [3].
> >>> >> >
> >>> >> > Thank you~
> >>> >> >
> >>> >> > Xintong Song
> >>> >> >
> >>> >> >
> >>> >> > [1]
> >>> >> >
> >>> >>
> >>>
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-53%3A+Fine+Grained+Operator+Resource+Management
> >>> >> >
> >>> >> > [2]
> >>> >> >
> >>> >>
> >>>
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-56%3A+Dynamic+Slot+Allocation
> >>> >> >
> >>> >> > [3]
> >>> >> >
> >>> >>
> >>>
> http://apache-flink-mailing-list-archive.1008284.n3.nabble.com/DISCUSS-FLIP-56-Dynamic-Slot-Allocation-td31960.html
> >>> >> >
> >>> >> > On Mon, Aug 19, 2019 at 2:55 PM Xintong Song <
> [hidden email]
> >>> >
> >>> >> > wrote:
> >>> >> >
> >>> >> >> Thinks for the comments, Yang.
> >>> >> >>
> >>> >> >> Regarding your questions:
> >>> >> >>
> >>> >> >>    1. How to calculate the resource specification of
> TaskManagers?
> >>> Do
> >>> >> they
> >>> >> >>>    have them same resource spec calculated based on the
> >>> >> configuration? I
> >>> >> >>> think
> >>> >> >>>    we still have wasted resources in this situation. Or we could
> >>> start
> >>> >> >>>    TaskManagers with different spec.
> >>> >> >>>
> >>> >> >> I agree with you that we can further improve the resource utility
> >>> by
> >>> >> >> customizing task executors with different resource
> specifications.
> >>> >> However,
> >>> >> >> I'm in favor of limiting the scope of this FLIP and leave it as a
> >>> >> future
> >>> >> >> optimization. The plan for that part is to move the logic of
> >>> deciding
> >>> >> task
> >>> >> >> executor specifications into the slot manager and make slot
> manager
> >>> >> >> pluggable, so inside the slot manager plugin we can have
> different
> >>> >> logics
> >>> >> >> for deciding the task executor specifications.
> >>> >> >>
> >>> >> >>
> >>> >> >>>    2. If a slot is released and returned to SlotPool, does it
> >>> could be
> >>> >> >>>    reused by other SlotRequest that the request resource is
> >>> smaller
> >>> >> than
> >>> >> >>> it?
> >>> >> >>>
> >>> >> >> No, I think slot pool should always return slots if they do not
> >>> exactly
> >>> >> >> match the pending requests, so that resource manager can deal
> with
> >>> the
> >>> >> >> extra resources.
> >>> >> >>
> >>> >> >>>       - If it is yes, what happens to the available resource in
> >>> the
> >>> >> >>
> >>> >> >>       TaskManager.
> >>> >> >>>       - What is the SlotStatus of the cached slot in SlotPool?
> The
> >>> >> >>>       AllocationId is null?
> >>> >> >>>
> >>> >> >> The allocation id does not change as long as the slot is not
> >>> returned
> >>> >> >> from the job master, no matter its occupied or available in the
> >>> slot
> >>> >> pool.
> >>> >> >> I think we have the same behavior currently. No matter how many
> >>> tasks
> >>> >> the
> >>> >> >> job master deploy into the slot, concurrently or sequentially, it
> >>> is
> >>> >> one
> >>> >> >> allocation from the cluster to the job until the slot is freed
> from
> >>> >> the job
> >>> >> >> master.
> >>> >> >>
> >>> >> >>>    3. In a session cluster, some jobs are configured with
> operator
> >>> >> >>>    resources, meanwhile other jobs are using UNKNOWN. How to
> deal
> >>> with
> >>> >> >>> this
> >>> >> >>>    situation?
> >>> >> >>
> >>> >> >> As long as we do not mix unknown / specified resource profiles
> >>> within
> >>> >> the
> >>> >> >> same job / slot, there shouldn't be a problem. Resource manager
> >>> >> converts
> >>> >> >> unknown resource profiles in slot requests to specified default
> >>> >> resource
> >>> >> >> profiles, so they can be dynamically allocated from task
> executors'
> >>> >> >> available resources just as other slot requests with specified
> >>> resource
> >>> >> >> profiles.
> >>> >> >>
> >>> >> >> Thank you~
> >>> >> >>
> >>> >> >> Xintong Song
> >>> >> >>
> >>> >> >>
> >>> >> >>
> >>> >> >> On Mon, Aug 19, 2019 at 11:39 AM Yang Wang <
> [hidden email]>
> >>> >> wrote:
> >>> >> >>
> >>> >> >>> Hi Xintong,
> >>> >> >>>
> >>> >> >>>
> >>> >> >>> Thanks for your detailed proposal. I think many users are
> >>> suffering
> >>> >> from
> >>> >> >>> waste of resources. The resource spec of all task managers are
> >>> same
> >>> >> and
> >>> >> >>> we
> >>> >> >>> have to increase all task managers to make the heavy one more
> >>> stable.
> >>> >> So
> >>> >> >>> we
> >>> >> >>> will benefit from the fine grained resource management a lot. We
> >>> could
> >>> >> >>> get
> >>> >> >>> better resource utilization and stability.
> >>> >> >>>
> >>> >> >>>
> >>> >> >>> Just to share some thoughts.
> >>> >> >>>
> >>> >> >>>
> >>> >> >>>
> >>> >> >>>    1. How to calculate the resource specification of
> >>> TaskManagers? Do
> >>> >> >>> they
> >>> >> >>>    have them same resource spec calculated based on the
> >>> >> configuration? I
> >>> >> >>> think
> >>> >> >>>    we still have wasted resources in this situation. Or we could
> >>> start
> >>> >> >>>    TaskManagers with different spec.
> >>> >> >>>    2. If a slot is released and returned to SlotPool, does it
> >>> could be
> >>> >> >>>    reused by other SlotRequest that the request resource is
> >>> smaller
> >>> >> than
> >>> >> >>> it?
> >>> >> >>>       - If it is yes, what happens to the available resource in
> >>> the
> >>> >> >>>       TaskManager.
> >>> >> >>>       - What is the SlotStatus of the cached slot in SlotPool?
> The
> >>> >> >>>       AllocationId is null?
> >>> >> >>>    3. In a session cluster, some jobs are configured with
> operator
> >>> >> >>>    resources, meanwhile other jobs are using UNKNOWN. How to
> deal
> >>> with
> >>> >> >>> this
> >>> >> >>>    situation?
> >>> >> >>>
> >>> >> >>>
> >>> >> >>>
> >>> >> >>> Best,
> >>> >> >>> Yang
> >>> >> >>>
> >>> >> >>> Xintong Song <[hidden email]> 于2019年8月16日周五 下午8:57写道:
> >>> >> >>>
> >>> >> >>> > Thanks for the feedbacks, Yangze and Till.
> >>> >> >>> >
> >>> >> >>> > Yangze,
> >>> >> >>> >
> >>> >> >>> > I agree with you that we should make scheduling strategy
> >>> pluggable
> >>> >> and
> >>> >> >>> > optimize the strategy to reduce the memory fragmentation
> >>> problem,
> >>> >> and
> >>> >> >>> > thanks for the inputs on the potential algorithmic solutions.
> >>> >> However,
> >>> >> >>> I'm
> >>> >> >>> > in favor of keep this FLIP focusing on the overall mechanism
> >>> design
> >>> >> >>> rather
> >>> >> >>> > than strategies. Solving the fragmentation issue should be
> >>> >> considered
> >>> >> >>> as an
> >>> >> >>> > optimization, and I agree with Till that we probably should
> >>> tackle
> >>> >> this
> >>> >> >>> > afterwards.
> >>> >> >>> >
> >>> >> >>> > Till,
> >>> >> >>> >
> >>> >> >>> > - Regarding splitting the FLIP, I think it makes sense. The
> >>> operator
> >>> >> >>> > resource management and dynamic slot allocation do not have
> much
> >>> >> >>> dependency
> >>> >> >>> > on each other.
> >>> >> >>> >
> >>> >> >>> > - Regarding the default slot size, I think this is similar to
> >>> >> FLIP-49
> >>> >> >>> [1]
> >>> >> >>> > where we want all the deriving happens at one place. I think
> it
> >>> >> would
> >>> >> >>> be
> >>> >> >>> > nice to pass the default slot size into the task executor in
> the
> >>> >> same
> >>> >> >>> way
> >>> >> >>> > that we pass in the memory pool sizes in FLIP-49 [1].
> >>> >> >>> >
> >>> >> >>> > - Regarding the return value of
> >>> >> TaskExecutorGateway#requestResource, I
> >>> >> >>> > think you're right. We should avoid using null as the return
> >>> value.
> >>> >> I
> >>> >> >>> think
> >>> >> >>> > we probably should thrown an exception here.
> >>> >> >>> >
> >>> >> >>> > Thank you~
> >>> >> >>> >
> >>> >> >>> > Xintong Song
> >>> >> >>> >
> >>> >> >>> >
> >>> >> >>> > [1]
> >>> >> >>> >
> >>> >> >>> >
> >>> >> >>>
> >>> >>
> >>>
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-49%3A+Unified+Memory+Configuration+for+TaskExecutors
> >>> >> >>> >
> >>> >> >>> > On Fri, Aug 16, 2019 at 2:18 PM Till Rohrmann <
> >>> [hidden email]
> >>> >> >
> >>> >> >>> > wrote:
> >>> >> >>> >
> >>> >> >>> > > Hi Xintong,
> >>> >> >>> > >
> >>> >> >>> > > thanks for drafting this FLIP. I think your proposal helps
> to
> >>> >> >>> improve the
> >>> >> >>> > > execution of batch jobs more efficiently. Moreover, it
> >>> enables the
> >>> >> >>> proper
> >>> >> >>> > > integration of the Blink planner which is very important as
> >>> well.
> >>> >> >>> > >
> >>> >> >>> > > Overall, the FLIP looks good to me. I was wondering whether
> it
> >>> >> >>> wouldn't
> >>> >> >>> > > make sense to actually split it up into two FLIPs: Operator
> >>> >> resource
> >>> >> >>> > > management and dynamic slot allocation. I think these two
> >>> FLIPs
> >>> >> >>> could be
> >>> >> >>> > > seen as orthogonal and it would decrease the scope of each
> >>> >> individual
> >>> >> >>> > FLIP.
> >>> >> >>> > >
> >>> >> >>> > > Some smaller comments:
> >>> >> >>> > >
> >>> >> >>> > > - I'm not sure whether we should pass in the default slot
> size
> >>> >> via an
> >>> >> >>> > > environment variable. Without having unified the way how
> Flink
> >>> >> >>> components
> >>> >> >>> > > are configured [1], I think it would be better to pass it in
> >>> as
> >>> >> part
> >>> >> >>> of
> >>> >> >>> > the
> >>> >> >>> > > configuration.
> >>> >> >>> > > - I would avoid returning a null value from
> >>> >> >>> > > TaskExecutorGateway#requestResource if it cannot be
> fulfilled.
> >>> >> >>> Either we
> >>> >> >>> > > should introduce an explicit return value saying this or
> >>> throw an
> >>> >> >>> > > exception.
> >>> >> >>> > >
> >>> >> >>> > > Concerning Yangze's comments: I think you are right that it
> >>> would
> >>> >> be
> >>> >> >>> > > helpful to make the selection strategy pluggable. Also
> >>> batching
> >>> >> slot
> >>> >> >>> > > requests to the RM could be a good optimization. For the
> sake
> >>> of
> >>> >> >>> keeping
> >>> >> >>> > > the scope of this FLIP smaller I would try to tackle these
> >>> things
> >>> >> >>> after
> >>> >> >>> > the
> >>> >> >>> > > initial version has been completed (without spoiling these
> >>> >> >>> optimization
> >>> >> >>> > > opportunities). In particular batching the slot requests
> >>> depends
> >>> >> on
> >>> >> >>> the
> >>> >> >>> > > current scheduler refactoring and could also be realized on
> >>> the RM
> >>> >> >>> side
> >>> >> >>> > > only.
> >>> >> >>> > >
> >>> >> >>> > > [1]
> >>> >> >>> > >
> >>> >> >>> > >
> >>> >> >>> >
> >>> >> >>>
> >>> >>
> >>>
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-54%3A+Evolve+ConfigOption+and+Configuration
> >>> >> >>> > >
> >>> >> >>> > > Cheers,
> >>> >> >>> > > Till
> >>> >> >>> > >
> >>> >> >>> > >
> >>> >> >>> > >
> >>> >> >>> > > On Fri, Aug 16, 2019 at 11:11 AM Yangze Guo <
> >>> [hidden email]>
> >>> >> >>> wrote:
> >>> >> >>> > >
> >>> >> >>> > > > Hi, Xintong
> >>> >> >>> > > >
> >>> >> >>> > > > Thanks to propose this FLIP. The general design looks good
> >>> to
> >>> >> me,
> >>> >> >>> +1
> >>> >> >>> > > > for this feature.
> >>> >> >>> > > >
> >>> >> >>> > > > Since slots in the same task executor could have different
> >>> >> resource
> >>> >> >>> > > > profile, we will
> >>> >> >>> > > > meet resource fragment problem. Think about this case:
> >>> >> >>> > > >  - request A want 1G memory while request B & C want 0.5G
> >>> memory
> >>> >> >>> > > >  - There are two task executors T1 & T2 with 1G and 0.5G
> >>> free
> >>> >> >>> memory
> >>> >> >>> > > > respectively
> >>> >> >>> > > > If B come first and we cut a slot from T1 for B, A must
> >>> wait for
> >>> >> >>> the
> >>> >> >>> > > > free resource from
> >>> >> >>> > > > other task. But A could have been scheduled immediately if
> >>> we
> >>> >> cut a
> >>> >> >>> > > > slot from T2 for B.
> >>> >> >>> > > >
> >>> >> >>> > > > The logic of findMatchingSlot now become finding a task
> >>> executor
> >>> >> >>> which
> >>> >> >>> > > > has enough
> >>> >> >>> > > > resource and then cut a slot from it. Current method could
> >>> be
> >>> >> seen
> >>> >> >>> as
> >>> >> >>> > > > "First-fit strategy",
> >>> >> >>> > > > which works well in general but sometimes could not be the
> >>> >> >>> optimization
> >>> >> >>> > > > method.
> >>> >> >>> > > >
> >>> >> >>> > > > Actually, this problem could be abstracted as "Bin Packing
> >>> >> >>> Problem"[1].
> >>> >> >>> > > > Here are
> >>> >> >>> > > > some common approximate algorithms:
> >>> >> >>> > > > - First fit
> >>> >> >>> > > > - Next fit
> >>> >> >>> > > > - Best fit
> >>> >> >>> > > >
> >>> >> >>> > > > But it become multi-dimensional bin packing problem if we
> >>> take
> >>> >> CPU
> >>> >> >>> > > > into account. It hard
> >>> >> >>> > > > to define which one is best fit now. Some research
> addressed
> >>> >> this
> >>> >> >>> > > > problem, such like Tetris[2].
> >>> >> >>> > > >
> >>> >> >>> > > > Here are some thinking about it:
> >>> >> >>> > > > 1. We could make the strategy of finding matching task
> >>> executor
> >>> >> >>> > > > pluginable. Let user to config the
> >>> >> >>> > > > best strategy in their scenario.
> >>> >> >>> > > > 2. We could support batch request interface in RM, because
> >>> we
> >>> >> have
> >>> >> >>> > > > opportunities to optimize
> >>> >> >>> > > > if we have more information. If we know the A, B, C at the
> >>> same
> >>> >> >>> time,
> >>> >> >>> > > > we could always make the best decision.
> >>> >> >>> > > >
> >>> >> >>> > > > [1] http://www.or.deis.unibo.it/kp/Chapter8.pdf
> >>> >> >>> > > > [2]
> >>> >> >>> >
> >>> >>
> https://www.cs.cmu.edu/~xia/resources/Documents/grandl_sigcomm14.pdf
> >>> >> >>> > > >
> >>> >> >>> > > > Best,
> >>> >> >>> > > > Yangze Guo
> >>> >> >>> > > >
> >>> >> >>> > > > On Thu, Aug 15, 2019 at 10:40 PM Xintong Song <
> >>> >> >>> [hidden email]>
> >>> >> >>> > > > wrote:
> >>> >> >>> > > > >
> >>> >> >>> > > > > Hi everyone,
> >>> >> >>> > > > >
> >>> >> >>> > > > > We would like to start a discussion thread on "FLIP-53:
> >>> Fine
> >>> >> >>> Grained
> >>> >> >>> > > > > Resource Management"[1], where we propose how to improve
> >>> Flink
> >>> >> >>> > resource
> >>> >> >>> > > > > management and scheduling.
> >>> >> >>> > > > >
> >>> >> >>> > > > > This FLIP mainly discusses the following issues.
> >>> >> >>> > > > >
> >>> >> >>> > > > >    - How to support tasks with fine grained resource
> >>> >> >>> requirements.
> >>> >> >>> > > > >    - How to unify resource management for jobs with /
> >>> without
> >>> >> >>> fine
> >>> >> >>> > > > grained
> >>> >> >>> > > > >    resource requirements.
> >>> >> >>> > > > >    - How to unify resource management for streaming /
> >>> batch
> >>> >> jobs.
> >>> >> >>> > > > >
> >>> >> >>> > > > > Key changes proposed in the FLIP are as follows.
> >>> >> >>> > > > >
> >>> >> >>> > > > >    - Unify memory management for operators with /
> without
> >>> fine
> >>> >> >>> > grained
> >>> >> >>> > > > >    resource requirements by applying a fraction based
> >>> quota
> >>> >> >>> > mechanism.
> >>> >> >>> > > > >    - Unify resource scheduling for streaming and batch
> >>> jobs by
> >>> >> >>> > setting
> >>> >> >>> > > > slot
> >>> >> >>> > > > >    sharing groups for pipelined regions during compiling
> >>> >> stage.
> >>> >> >>> > > > >    - Dynamically allocate slots from task executors'
> >>> available
> >>> >> >>> > > resources.
> >>> >> >>> > > > >
> >>> >> >>> > > > > Please find more details in the FLIP wiki document [1].
> >>> >> Looking
> >>> >> >>> > forward
> >>> >> >>> > > > to
> >>> >> >>> > > > > your feedbacks.
> >>> >> >>> > > > >
> >>> >> >>> > > > > Thank you~
> >>> >> >>> > > > >
> >>> >> >>> > > > > Xintong Song
> >>> >> >>> > > > >
> >>> >> >>> > > > >
> >>> >> >>> > > > > [1]
> >>> >> >>> > > > >
> >>> >> >>> > > >
> >>> >> >>> > >
> >>> >> >>> >
> >>> >> >>>
> >>> >>
> >>>
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-53%3A+Fine+Grained+Resource+Management
> >>> >> >>> > > >
> >>> >> >>> > >
> >>> >> >>> >
> >>> >> >>>
> >>> >> >>
> >>> >>
> >>> >
> >>>
> >>
>
Reply | Threaded
Open this post in threaded view
|

Re: [DISCUSS] FLIP-53: Fine Grained Resource Management

Kurt Young
Thanks Xintong for driving this effort. I haven't finished the whole
document yet, but I have a couple of questions:

1. Regarding network memory, the document says it will be derived by the
framework automatically. I'm wondering whether we should remove this
dimension from the user-facing API?

2. Regarding the fraction based quota, I don't quite get the meaning of
"slotSharingGroupOnHeapManagedMem" and "slotSharingGroupOffHeapManagedMem".
What if the sharing group mixes specified and UNKNOWN resource
requirements?

3. IIUC, even if a user has set resource requirements, let's say 500MB of
off-heap managed memory, during execution the operator may or may not
actually get 500MB of off-heap managed memory, right?

Best,
Kurt


On Mon, Sep 2, 2019 at 8:36 PM Zhu Zhu <[hidden email]> wrote:

> Thanks Xintong for proposing this improvement. Fine grained resources can
> be very helpful when user has good planning on resources.
>
> I have a few questions:
> 1. Currently in a batch job, vertices from different regions can run at the
> same time in slots from the same shared group, as long as they do not have
> data dependency on each other and available slot count is not smaller than
> the *max* of parallelism of all tasks.
> With changes in this FLIP however, tasks from different regions cannot
> share slots anymore.
> Once available slot count is smaller than the *sum* of all parallelism of
> tasks from all regions, tasks may need to be executed sequentially, which
> might result in a performance regression.
> Is this(performance regression to existing DataSet jobs) considered as a
> necessary and accepted trade off in this FLIP?
>
> 2. The network memory depends on the input/output ExecutionEdge count and
> thus can be different even for parallel instances of the same JobVertex.
> Does this mean that when adding task resources to calculating the slot
> resource for a shared group, the max possible network memory of the vertex
> instance shall be used?
> This might result in larger resource required than actually needed.
>
> And some minor comments:
> 1. Regarding "fracManagedMemOnHeap = 1 / numOpsUseOnHeapManagedMemory", I
> guess you mean numOpsUseOnHeapManagedMemoryInTheSameSharedGroup ?
> 2. I think the *StreamGraphGenerator* in the #Slot Sharing section and
> implementation step 4 should be *StreamingJobGraphGenerator*, as
> *StreamGraphGenerator* is not aware of JobGraph and pipelined region.
>
>
> Thanks,
> Zhu Zhu
>
> Xintong Song <[hidden email]> 于2019年9月2日周一 上午11:59写道:
>
> > Updated the FLIP wiki page [1], with the following changes.
> >
> >    - Remove the step of converting pipelined edges between different slot
> >    sharing groups into blocking edges.
> >    - Set `allSourcesInSamePipelinedRegion` to true by default.
> >
> > Thank you~
> >
> > Xintong Song
> >
> >
> >
> > On Mon, Sep 2, 2019 at 11:50 AM Xintong Song <[hidden email]>
> > wrote:
> >
> > > Regarding changing edge type, I think actually we don't need to do this
> > > for batch jobs neither, because we don't have public interfaces for
> users
> > > to explicitly set slot sharing groups in DataSet API and SQL/Table API.
> > We
> > > have such interfaces in DataStream API only.
> > >
> > > Thank you~
> > >
> > > Xintong Song
> > >
> > >
> > >
> > > On Tue, Aug 27, 2019 at 10:16 PM Xintong Song <[hidden email]>
> > > wrote:
> > >
> > >> Thanks for the correction, Till.
> > >>
> > >> Regarding your comments:
> > >> - You are right, we should not change the edge type for streaming
> jobs.
> > >> Then I think we can change the option
> 'allSourcesInSamePipelinedRegion'
> > in
> > >> step 2 to 'isStreamingJob', and implement the current step 2 before
> the
> > >> current step 1 so we can use this option to decide whether should
> change
> > >> the edge type. What do you think?
> > >> - Agree. It should be easier to make the default value of
> > >> 'allSourcesInSamePipelinedRegion' (or 'isStreamingJob') 'true', and
> set
> > it
> > >> to 'false' when using DataSet API or blink planner.
> > >>
> > >> Thank you~
> > >>
> > >> Xintong Song
> > >>
> > >>
> > >>
> > >> On Tue, Aug 27, 2019 at 8:59 PM Till Rohrmann <[hidden email]>
> > >> wrote:
> > >>
> > >>> Thanks for creating the implementation plan Xintong. Overall, the
> > >>> implementation plan looks good. I had a couple of comments:
> > >>>
> > >>> - What will happen if a user has defined a streaming job with two
> slot
> > >>> sharing groups? Would the code insert a blocking data exchange
> between
> > >>> these two groups? If yes, then this breaks existing Flink streaming
> > jobs.
> > >>> - How do we detect unbounded streaming jobs to set
> > >>> the allSourcesInSamePipelinedRegion to `true`? Wouldn't it be easier
> to
> > >>> set
> > >>> it false if we are using the DataSet API or the Blink planner with a
> > >>> bounded job?
> > >>>
> > >>> Cheers,
> > >>> Till
> > >>>
> > >>> On Tue, Aug 27, 2019 at 2:16 PM Till Rohrmann <[hidden email]>
> > >>> wrote:
> > >>>
> > >>> > I guess there is a typo since the link to the FLIP-53 is
> > >>> >
> > >>>
> >
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-53%3A+Fine+Grained+Operator+Resource+Management
> > >>> >
> > >>> > Cheers,
> > >>> > Till
> > >>> >
> > >>> > On Tue, Aug 27, 2019 at 1:42 PM Xintong Song <
> [hidden email]>
> > >>> > wrote:
> > >>> >
> > >>> >> Added implementation steps for this FLIP on the wiki page [1].
> > >>> >>
> > >>> >>
> > >>> >> Thank you~
> > >>> >>
> > >>> >> Xintong Song
> > >>> >>
> > >>> >>
> > >>> >> [1]
> > >>> >>
> > >>> >>
> > >>>
> >
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-49%3A+Unified+Memory+Configuration+for+TaskExecutors
> > >>> >>
> > >>> >> On Mon, Aug 19, 2019 at 10:29 PM Xintong Song <
> > [hidden email]>
> > >>> >> wrote:
> > >>> >>
> > >>> >> > Hi everyone,
> > >>> >> >
> > >>> >> > As Till suggested, the original "FLIP-53: Fine Grained Resource
> > >>> >> > Management" splits into two separate FLIPs,
> > >>> >> >
> > >>> >> >    - FLIP-53: Fine Grained Operator Resource Management [1]
> > >>> >> >    - FLIP-56: Dynamic Slot Allocation [2]
> > >>> >> >
> > >>> >> > We'll continue using this discussion thread for FLIP-53. For
> > >>> FLIP-56, I
> > >>> >> > just started a new discussion thread [3].
> > >>> >> >
> > >>> >> > Thank you~
> > >>> >> >
> > >>> >> > Xintong Song
> > >>> >> >
> > >>> >> >
> > >>> >> > [1]
> > >>> >> >
> > >>> >>
> > >>>
> >
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-53%3A+Fine+Grained+Operator+Resource+Management
> > >>> >> >
> > >>> >> > [2]
> > >>> >> >
> > >>> >>
> > >>>
> >
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-56%3A+Dynamic+Slot+Allocation
> > >>> >> >
> > >>> >> > [3]
> > >>> >> >
> > >>> >>
> > >>>
> >
> http://apache-flink-mailing-list-archive.1008284.n3.nabble.com/DISCUSS-FLIP-56-Dynamic-Slot-Allocation-td31960.html
> > >>> >> >
> > >>> >> > On Mon, Aug 19, 2019 at 2:55 PM Xintong Song <
> > [hidden email]
> > >>> >
> > >>> >> > wrote:
> > >>> >> >
> > >>> >> >> Thinks for the comments, Yang.
> > >>> >> >>
> > >>> >> >> Regarding your questions:
> > >>> >> >>
> > >>> >> >>    1. How to calculate the resource specification of
> > TaskManagers?
> > >>> Do
> > >>> >> they
> > >>> >> >>>    have them same resource spec calculated based on the
> > >>> >> configuration? I
> > >>> >> >>> think
> > >>> >> >>>    we still have wasted resources in this situation. Or we
> could
> > >>> start
> > >>> >> >>>    TaskManagers with different spec.
> > >>> >> >>>
> > >>> >> >> I agree with you that we can further improve the resource
> utility
> > >>> by
> > >>> >> >> customizing task executors with different resource
> > specifications.
> > >>> >> However,
> > >>> >> >> I'm in favor of limiting the scope of this FLIP and leave it
> as a
> > >>> >> future
> > >>> >> >> optimization. The plan for that part is to move the logic of
> > >>> deciding
> > >>> >> task
> > >>> >> >> executor specifications into the slot manager and make slot
> > manager
> > >>> >> >> pluggable, so inside the slot manager plugin we can have
> > different
> > >>> >> logics
> > >>> >> >> for deciding the task executor specifications.
> > >>> >> >>
> > >>> >> >>
> > >>> >> >>>    2. If a slot is released and returned to SlotPool, does it
> > >>> could be
> > >>> >> >>>    reused by other SlotRequest that the request resource is
> > >>> smaller
> > >>> >> than
> > >>> >> >>> it?
> > >>> >> >>>
> > >>> >> >> No, I think slot pool should always return slots if they do not
> > >>> exactly
> > >>> >> >> match the pending requests, so that resource manager can deal
> > with
> > >>> the
> > >>> >> >> extra resources.
> > >>> >> >>
> > >>> >> >>>       - If it is yes, what happens to the available resource
> in
> > >>> the
> > >>> >> >>
> > >>> >> >>       TaskManager.
> > >>> >> >>>       - What is the SlotStatus of the cached slot in SlotPool?
> > The
> > >>> >> >>>       AllocationId is null?
> > >>> >> >>>
> > >>> >> >> The allocation id does not change as long as the slot is not
> > >>> returned
> > >>> >> >> from the job master, no matter its occupied or available in the
> > >>> slot
> > >>> >> pool.
> > >>> >> >> I think we have the same behavior currently. No matter how many
> > >>> tasks
> > >>> >> the
> > >>> >> >> job master deploy into the slot, concurrently or sequentially,
> it
> > >>> is
> > >>> >> one
> > >>> >> >> allocation from the cluster to the job until the slot is freed
> > from
> > >>> >> the job
> > >>> >> >> master.
> > >>> >> >>
> > >>> >> >>>    3. In a session cluster, some jobs are configured with
> > operator
> > >>> >> >>>    resources, meanwhile other jobs are using UNKNOWN. How to
> > deal
> > >>> with
> > >>> >> >>> this
> > >>> >> >>>    situation?
> > >>> >> >>
> > >>> >> >> As long as we do not mix unknown / specified resource profiles
> > >>> within
> > >>> >> the
> > >>> >> >> same job / slot, there shouldn't be a problem. Resource manager
> > >>> >> converts
> > >>> >> >> unknown resource profiles in slot requests to specified default
> > >>> >> resource
> > >>> >> >> profiles, so they can be dynamically allocated from task
> > executors'
> > >>> >> >> available resources just as other slot requests with specified
> > >>> resource
> > >>> >> >> profiles.
> > >>> >> >>
> > >>> >> >> Thank you~
> > >>> >> >>
> > >>> >> >> Xintong Song
> > >>> >> >>
> > >>> >> >>
> > >>> >> >>
> > >>> >> >> On Mon, Aug 19, 2019 at 11:39 AM Yang Wang <
> > [hidden email]>
> > >>> >> wrote:
> > >>> >> >>
> > >>> >> >>> Hi Xintong,
> > >>> >> >>>
> > >>> >> >>>
> > >>> >> >>> Thanks for your detailed proposal. I think many users are
> > >>> suffering
> > >>> >> from
> > >>> >> >>> waste of resources. The resource spec of all task managers are
> > >>> same
> > >>> >> and
> > >>> >> >>> we
> > >>> >> >>> have to increase all task managers to make the heavy one more
> > >>> stable.
> > >>> >> So
> > >>> >> >>> we
> > >>> >> >>> will benefit from the fine grained resource management a lot.
> We
> > >>> could
> > >>> >> >>> get
> > >>> >> >>> better resource utilization and stability.
> > >>> >> >>>
> > >>> >> >>>
> > >>> >> >>> Just to share some thoughts.
> > >>> >> >>>
> > >>> >> >>>
> > >>> >> >>>
> > >>> >> >>>    1. How to calculate the resource specification of
> > >>> TaskManagers? Do
> > >>> >> >>> they
> > >>> >> >>>    have them same resource spec calculated based on the
> > >>> >> configuration? I
> > >>> >> >>> think
> > >>> >> >>>    we still have wasted resources in this situation. Or we
> could
> > >>> start
> > >>> >> >>>    TaskManagers with different spec.
> > >>> >> >>>    2. If a slot is released and returned to SlotPool, does it
> > >>> could be
> > >>> >> >>>    reused by other SlotRequest that the request resource is
> > >>> smaller
> > >>> >> than
> > >>> >> >>> it?
> > >>> >> >>>       - If it is yes, what happens to the available resource
> in
> > >>> the
> > >>> >> >>>       TaskManager.
> > >>> >> >>>       - What is the SlotStatus of the cached slot in SlotPool?
> > The
> > >>> >> >>>       AllocationId is null?
> > >>> >> >>>    3. In a session cluster, some jobs are configured with
> > operator
> > >>> >> >>>    resources, meanwhile other jobs are using UNKNOWN. How to
> > deal
> > >>> with
> > >>> >> >>> this
> > >>> >> >>>    situation?
> > >>> >> >>>
> > >>> >> >>>
> > >>> >> >>>
> > >>> >> >>> Best,
> > >>> >> >>> Yang
> > >>> >> >>>
> > >>> >> >>> Xintong Song <[hidden email]> 于2019年8月16日周五 下午8:57写道:
> > >>> >> >>>
> > >>> >> >>> > Thanks for the feedbacks, Yangze and Till.
> > >>> >> >>> >
> > >>> >> >>> > Yangze,
> > >>> >> >>> >
> > >>> >> >>> > I agree with you that we should make scheduling strategy
> > >>> pluggable
> > >>> >> and
> > >>> >> >>> > optimize the strategy to reduce the memory fragmentation
> > >>> problem,
> > >>> >> and
> > >>> >> >>> > thanks for the inputs on the potential algorithmic
> solutions.
> > >>> >> However,
> > >>> >> >>> I'm
> > >>> >> >>> > in favor of keep this FLIP focusing on the overall mechanism
> > >>> design
> > >>> >> >>> rather
> > >>> >> >>> > than strategies. Solving the fragmentation issue should be
> > >>> >> considered
> > >>> >> >>> as an
> > >>> >> >>> > optimization, and I agree with Till that we probably should
> > >>> tackle
> > >>> >> this
> > >>> >> >>> > afterwards.
> > >>> >> >>> >
> > >>> >> >>> > Till,
> > >>> >> >>> >
> > >>> >> >>> > - Regarding splitting the FLIP, I think it makes sense. The
> > >>> operator
> > >>> >> >>> > resource management and dynamic slot allocation do not have
> > much
> > >>> >> >>> dependency
> > >>> >> >>> > on each other.
> > >>> >> >>> >
> > >>> >> >>> > - Regarding the default slot size, I think this is similar
> to
> > >>> >> FLIP-49
> > >>> >> >>> [1]
> > >>> >> >>> > where we want all the deriving happens at one place. I think
> > it
> > >>> >> would
> > >>> >> >>> be
> > >>> >> >>> > nice to pass the default slot size into the task executor in
> > the
> > >>> >> same
> > >>> >> >>> way
> > >>> >> >>> > that we pass in the memory pool sizes in FLIP-49 [1].
> > >>> >> >>> >
> > >>> >> >>> > - Regarding the return value of
> > >>> >> TaskExecutorGateway#requestResource, I
> > >>> >> >>> > think you're right. We should avoid using null as the return
> > >>> value.
> > >>> >> I
> > >>> >> >>> think
> > >>> >> >>> > we probably should thrown an exception here.
> > >>> >> >>> >
> > >>> >> >>> > Thank you~
> > >>> >> >>> >
> > >>> >> >>> > Xintong Song
> > >>> >> >>> >
> > >>> >> >>> >
> > >>> >> >>> > [1]
> > >>> >> >>> >
> > >>> >> >>> >
> > >>> >> >>>
> > >>> >>
> > >>>
> >
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-49%3A+Unified+Memory+Configuration+for+TaskExecutors
> > >>> >> >>> >
> > >>> >> >>> > On Fri, Aug 16, 2019 at 2:18 PM Till Rohrmann <
> > >>> [hidden email]
> > >>> >> >
> > >>> >> >>> > wrote:
> > >>> >> >>> >
> > >>> >> >>> > > Hi Xintong,
> > >>> >> >>> > >
> > >>> >> >>> > > thanks for drafting this FLIP. I think your proposal helps
> > to
> > >>> >> >>> improve the
> > >>> >> >>> > > execution of batch jobs more efficiently. Moreover, it
> > >>> enables the
> > >>> >> >>> proper
> > >>> >> >>> > > integration of the Blink planner which is very important
> as
> > >>> well.
> > >>> >> >>> > >
> > >>> >> >>> > > Overall, the FLIP looks good to me. I was wondering
> whether
> > it
> > >>> >> >>> wouldn't
> > >>> >> >>> > > make sense to actually split it up into two FLIPs:
> Operator
> > >>> >> resource
> > >>> >> >>> > > management and dynamic slot allocation. I think these two
> > >>> FLIPs
> > >>> >> >>> could be
> > >>> >> >>> > > seen as orthogonal and it would decrease the scope of each
> > >>> >> individual
> > >>> >> >>> > FLIP.
> > >>> >> >>> > >
> > >>> >> >>> > > Some smaller comments:
> > >>> >> >>> > >
> > >>> >> >>> > > - I'm not sure whether we should pass in the default slot
> > size
> > >>> >> via an
> > >>> >> >>> > > environment variable. Without having unified the way how
> > Flink
> > >>> >> >>> components
> > >>> >> >>> > > are configured [1], I think it would be better to pass it
> in
> > >>> as
> > >>> >> part
> > >>> >> >>> of
> > >>> >> >>> > the
> > >>> >> >>> > > configuration.
> > >>> >> >>> > > - I would avoid returning a null value from
> > >>> >> >>> > > TaskExecutorGateway#requestResource if it cannot be
> > fulfilled.
> > >>> >> >>> Either we
> > >>> >> >>> > > should introduce an explicit return value saying this or
> > >>> throw an
> > >>> >> >>> > > exception.
> > >>> >> >>> > >
> > >>> >> >>> > > Concerning Yangze's comments: I think you are right that
> it
> > >>> would
> > >>> >> be
> > >>> >> >>> > > helpful to make the selection strategy pluggable. Also
> > >>> batching
> > >>> >> slot
> > >>> >> >>> > > requests to the RM could be a good optimization. For the
> > sake
> > >>> of
> > >>> >> >>> keeping
> > >>> >> >>> > > the scope of this FLIP smaller I would try to tackle these
> > >>> things
> > >>> >> >>> after
> > >>> >> >>> > the
> > >>> >> >>> > > initial version has been completed (without spoiling these
> > >>> >> >>> optimization
> > >>> >> >>> > > opportunities). In particular batching the slot requests
> > >>> depends
> > >>> >> on
> > >>> >> >>> the
> > >>> >> >>> > > current scheduler refactoring and could also be realized
> on
> > >>> the RM
> > >>> >> >>> side
> > >>> >> >>> > > only.
> > >>> >> >>> > >
> > >>> >> >>> > > [1]
> > >>> >> >>> > >
> > >>> >> >>> > >
> > >>> >> >>> >
> > >>> >> >>>
> > >>> >>
> > >>>
> >
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-54%3A+Evolve+ConfigOption+and+Configuration
> > >>> >> >>> > >
> > >>> >> >>> > > Cheers,
> > >>> >> >>> > > Till
> > >>> >> >>> > >
> > >>> >> >>> > >
> > >>> >> >>> > >
> > >>> >> >>> > > On Fri, Aug 16, 2019 at 11:11 AM Yangze Guo <
> > >>> [hidden email]>
> > >>> >> >>> wrote:
> > >>> >> >>> > >
> > >>> >> >>> > > > Hi, Xintong
> > >>> >> >>> > > >
> > >>> >> >>> > > > Thanks to propose this FLIP. The general design looks
> good
> > >>> to
> > >>> >> me,
> > >>> >> >>> +1
> > >>> >> >>> > > > for this feature.
> > >>> >> >>> > > >
> > >>> >> >>> > > > Since slots in the same task executor could have
> different
> > >>> >> resource
> > >>> >> >>> > > > profile, we will
> > >>> >> >>> > > > meet resource fragment problem. Think about this case:
> > >>> >> >>> > > >  - request A want 1G memory while request B & C want
> 0.5G
> > >>> memory
> > >>> >> >>> > > >  - There are two task executors T1 & T2 with 1G and 0.5G
> > >>> free
> > >>> >> >>> memory
> > >>> >> >>> > > > respectively
> > >>> >> >>> > > > If B come first and we cut a slot from T1 for B, A must
> > >>> wait for
> > >>> >> >>> the
> > >>> >> >>> > > > free resource from
> > >>> >> >>> > > > other task. But A could have been scheduled immediately
> if
> > >>> we
> > >>> >> cut a
> > >>> >> >>> > > > slot from T2 for B.
> > >>> >> >>> > > >
> > >>> >> >>> > > > The logic of findMatchingSlot now become finding a task
> > >>> executor
> > >>> >> >>> which
> > >>> >> >>> > > > has enough
> > >>> >> >>> > > > resource and then cut a slot from it. Current method
> could
> > >>> be
> > >>> >> seen
> > >>> >> >>> as
> > >>> >> >>> > > > "First-fit strategy",
> > >>> >> >>> > > > which works well in general but sometimes could not be
> the
> > >>> >> >>> optimization
> > >>> >> >>> > > > method.
> > >>> >> >>> > > >
> > >>> >> >>> > > > Actually, this problem could be abstracted as "Bin
> Packing
> > >>> >> >>> Problem"[1].
> > >>> >> >>> > > > Here are
> > >>> >> >>> > > > some common approximate algorithms:
> > >>> >> >>> > > > - First fit
> > >>> >> >>> > > > - Next fit
> > >>> >> >>> > > > - Best fit
> > >>> >> >>> > > >
> > >>> >> >>> > > > But it become multi-dimensional bin packing problem if
> we
> > >>> take
> > >>> >> CPU
> > >>> >> >>> > > > into account. It hard
> > >>> >> >>> > > > to define which one is best fit now. Some research
> > addressed
> > >>> >> this
> > >>> >> >>> > > > problem, such like Tetris[2].
> > >>> >> >>> > > >
> > >>> >> >>> > > > Here are some thinking about it:
> > >>> >> >>> > > > 1. We could make the strategy of finding matching task
> > >>> executor
> > >>> >> >>> > > > pluginable. Let user to config the
> > >>> >> >>> > > > best strategy in their scenario.
> > >>> >> >>> > > > 2. We could support batch request interface in RM,
> because
> > >>> we
> > >>> >> have
> > >>> >> >>> > > > opportunities to optimize
> > >>> >> >>> > > > if we have more information. If we know the A, B, C at
> the
> > >>> same
> > >>> >> >>> time,
> > >>> >> >>> > > > we could always make the best decision.
> > >>> >> >>> > > >
> > >>> >> >>> > > > [1] http://www.or.deis.unibo.it/kp/Chapter8.pdf
> > >>> >> >>> > > > [2]
> > >>> >> >>> >
> > >>> >>
> > https://www.cs.cmu.edu/~xia/resources/Documents/grandl_sigcomm14.pdf
> > >>> >> >>> > > >
> > >>> >> >>> > > > Best,
> > >>> >> >>> > > > Yangze Guo
> > >>> >> >>> > > >
> > >>> >> >>> > > > On Thu, Aug 15, 2019 at 10:40 PM Xintong Song <
> > >>> >> >>> [hidden email]>
> > >>> >> >>> > > > wrote:
> > >>> >> >>> > > > >
> > >>> >> >>> > > > > Hi everyone,
> > >>> >> >>> > > > >
> > >>> >> >>> > > > > We would like to start a discussion thread on
> "FLIP-53:
> > >>> Fine
> > >>> >> >>> Grained
> > >>> >> >>> > > > > Resource Management"[1], where we propose how to
> improve
> > >>> Flink
> > >>> >> >>> > resource
> > >>> >> >>> > > > > management and scheduling.
> > >>> >> >>> > > > >
> > >>> >> >>> > > > > This FLIP mainly discusses the following issues.
> > >>> >> >>> > > > >
> > >>> >> >>> > > > >    - How to support tasks with fine grained resource
> > >>> >> >>> requirements.
> > >>> >> >>> > > > >    - How to unify resource management for jobs with /
> > >>> without
> > >>> >> >>> fine
> > >>> >> >>> > > > grained
> > >>> >> >>> > > > >    resource requirements.
> > >>> >> >>> > > > >    - How to unify resource management for streaming /
> > >>> batch
> > >>> >> jobs.
> > >>> >> >>> > > > >
> > >>> >> >>> > > > > Key changes proposed in the FLIP are as follows.
> > >>> >> >>> > > > >
> > >>> >> >>> > > > >    - Unify memory management for operators with /
> > without
> > >>> fine
> > >>> >> >>> > grained
> > >>> >> >>> > > > >    resource requirements by applying a fraction based
> > >>> quota
> > >>> >> >>> > mechanism.
> > >>> >> >>> > > > >    - Unify resource scheduling for streaming and batch
> > >>> jobs by
> > >>> >> >>> > setting
> > >>> >> >>> > > > slot
> > >>> >> >>> > > > >    sharing groups for pipelined regions during
> compiling
> > >>> >> stage.
> > >>> >> >>> > > > >    - Dynamically allocate slots from task executors'
> > >>> available
> > >>> >> >>> > > resources.
> > >>> >> >>> > > > >
> > >>> >> >>> > > > > Please find more details in the FLIP wiki document
> [1].
> > >>> >> Looking
> > >>> >> >>> > forward
> > >>> >> >>> > > > to
> > >>> >> >>> > > > > your feedbacks.
> > >>> >> >>> > > > >
> > >>> >> >>> > > > > Thank you~
> > >>> >> >>> > > > >
> > >>> >> >>> > > > > Xintong Song
> > >>> >> >>> > > > >
> > >>> >> >>> > > > >
> > >>> >> >>> > > > > [1]
> > >>> >> >>> > > > >
> > >>> >> >>> > > >
> > >>> >> >>> > >
> > >>> >> >>> >
> > >>> >> >>>
> > >>> >>
> > >>>
> >
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-53%3A+Fine+Grained+Resource+Management
> > >>> >> >>> > > >
> > >>> >> >>> > >
> > >>> >> >>> >
> > >>> >> >>>
> > >>> >> >>
> > >>> >>
> > >>> >
> > >>>
> > >>
> >
>
Reply | Threaded
Open this post in threaded view
|

Re: [DISCUSS] FLIP-53: Fine Grained Resource Management

Xintong Song
 Thanks for the comments, Zhu & Kurt.

Andrey and I also had some discussions offline, and I would like to first
post a summary of our discussion:

   1. The motivation of the fraction based approach is to unify resource
   management for both operators with specified and unknown resource
   requirements.
   2. The fraction based approach proposed in this FLIP should only affect
   streaming jobs (both bounded and unbounded). For DataSet jobs, there is
   already a fraction based approach (in TaskConfig and ChainedDriver), and
   we do not make any change to that existing approach.
   3. The scope of this FLIP does not include discussion of how to set
   ResourceSpec for operators.
      1. For Blink jobs, the optimizer can set operator resources for the
      users, according to their configurations (default: unknown).
      2. For DataStream jobs, there is no method / interface to set
      operator resources at the moment (1.10). We can add one in the future.
      3. For DataSet jobs, there are existing user interfaces to set
      operator resources.
   4. The FLIP should explain more about how ResourceSpecs work (a rough
   sketch follows this list).
      1. PhysicalTransformations (deployed with operators into the
      StreamTasks) get a ResourceSpec: unknown by default, or known (e.g.
      from the Blink planner).
      2. While generating the stream graph, calculate the fractions and set
      them in the StreamConfig.
      3. While scheduling, convert each ResourceSpec to a ResourceProfile
      (ResourceSpec + network memory), and deploy to slots / TMs matching the
      resources.
      4. While starting the Task in the TM, each operator gets its fraction
      converted back to either the original absolute value requested by the
      user, or a fair share of the slot for unknown requirements.
   5. We should not set `allSourcesInSamePipelinedRegion` to `false` for
   DataSet jobs. Behaviors of DataSet jobs should not be changed.
   6. The FLIP document should differentiate the work planned in this FLIP
   from future follow-ups more clearly, by putting the follow-ups in a
   separate section.
   7. Another limitation of the rejected alternative (setting fractions at
   scheduling time) is that the scheduler implementation does not know in
   advance which tasks will be deployed into the same slot.

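To make point 4 a bit more concrete, below is a minimal sketch of how the
fraction could be derived at stream graph generation time and converted back
to an absolute quota when the Task starts in the TM. The class and method
names are illustrative assumptions for this discussion, not actual Flink
classes.

public final class ManagedMemoryFractionSketch {

    // Derived while generating the stream graph, per operator within its
    // slot sharing group. Illustrative only, not an actual Flink class.
    static double computeFraction(boolean unknownRequirement,
                                  long opManagedMemBytes,
                                  long groupManagedMemBytes,
                                  int numUnknownOpsInGroup) {
        if (unknownRequirement) {
            // Operators with unknown requirements get a fair share of the slot.
            return 1.0 / numUnknownOpsInGroup;
        }
        // Operators with specified requirements keep their requested proportion.
        return (double) opManagedMemBytes / groupManagedMemBytes;
    }

    // Applied when the Task starts in the TM: the fraction is turned back
    // into an absolute quota against the managed memory of the slot.
    static long toAbsoluteQuota(double fraction, long slotManagedMemBytes) {
        return (long) Math.floor(fraction * slotManagedMemBytes);
    }
}

If the slot is sized to exactly the group's requested managed memory, the
specified case converts back to the originally requested absolute value;
otherwise the operators scale proportionally with the slot.
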
Andrey, please bring it up if there is anything I missed.

Zhu, regarding your comments:

   1. If we do not set `allSourcesInSamePipelinedRegion` to `false` for
   DataSet jobs (point 5 in the discussion summary above), then there
   shouldn't be any regression, right?
   2. I think it makes sense to set the max possible network memory for the
   JobVertex. When you say parallel instances of the same JobVertex may need
   different network memory, I guess you mean the rescale scenarios where the
   parallelism of the upstream / downstream vertex cannot be exactly divided
   by the parallelism of the downstream / upstream vertex? I would say it's
   acceptable to have a slight difference between the actually needed and the
   allocated network memory (a rough estimate of the worst case follows this
   list).
   3. Yes, by numOpsUseOnHeapManagedMemory I mean
   numOpsUseOnHeapManagedMemoryInTheSameSharedGroup. I'll update the doc.
   4. Yes, it should be StreamingJobGraphGenerator. Thanks for the
   correction.

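To illustrate point 2 above, here is a rough, back-of-the-envelope estimate
of the worst-case network memory of one parallel instance of a JobVertex.
The buffer constants mirror typical default values, but both the constants
and the method are assumptions for this discussion, not the actual
derivation used by Flink.

public final class NetworkMemorySketch {

    // Assumed typical defaults; the real values come from the configuration.
    private static final long MEMORY_SEGMENT_SIZE = 32 * 1024;  // 32 KB buffers
    private static final int BUFFERS_PER_CHANNEL = 2;           // exclusive buffers
    private static final int FLOATING_BUFFERS_PER_GATE = 8;

    // Worst case over all parallel instances of the JobVertex, based on the
    // maximum possible number of input channels and output subpartitions.
    static long maxNetworkMemoryBytes(int maxInputChannels,
                                      int numInputGates,
                                      int maxOutputSubpartitions,
                                      int numResultPartitions) {
        long inputBuffers = (long) maxInputChannels * BUFFERS_PER_CHANNEL
                + (long) numInputGates * FLOATING_BUFFERS_PER_GATE;
        // Roughly: exclusive buffers per output subpartition plus one extra
        // buffer per result partition.
        long outputBuffers = (long) maxOutputSubpartitions * BUFFERS_PER_CHANNEL
                + numResultPartitions;
        return (inputBuffers + outputBuffers) * MEMORY_SEGMENT_SIZE;
    }
}

Using the maximum channel counts over all subtasks of the vertex, as
discussed above, over-estimates by at most a few buffers per instance.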

Kurt, regarding your comments:

   1. I think we don't have network memory in ResourceSpec, which is the
   user-facing API. We only have network memory in ResourceProfile, which is
   used internally for scheduling. The reason we do not expose network memory
   to the user is that, currently, how many network buffers each task needs is
   decided by the topology of the execution graph (how many input / output
   channels it has).
   2. In the section "Operator Resource Requirements": "For the first
   version, we do not support mixing operators with specified / unknown
   resource requirements in the same job. Either all or none of the operators
   of the same job should specify their resource requirements.
   StreamGraphGenerator should check this and throw an error when mixing of
   specified / unknown resource requirements is detected, during the
   compilation stage." (A sketch of such a check follows this list.)
   3. If the user sets a resource requirement, then it is guaranteed that
   the task gets at least that much resource; otherwise there should be
   an exception. That should be guaranteed by the "Dynamic Slot Allocation"
   approach (FLIP-56).

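A minimal sketch of the compile-time check mentioned in point 2, assuming a
hypothetical helper in the graph generation code path (the method and types
are illustrative, not the actual StreamGraphGenerator implementation):

import java.util.Collection;

final class ResourceRequirementCheck {

    // Each flag says whether one operator has an explicitly specified
    // ResourceSpec (true) or leaves it UNKNOWN (false). Illustrative only.
    static void checkNotMixed(Collection<Boolean> resourceSpecified) {
        boolean anySpecified = resourceSpecified.contains(Boolean.TRUE);
        boolean anyUnknown = resourceSpecified.contains(Boolean.FALSE);
        if (anySpecified && anyUnknown) {
            throw new IllegalStateException(
                    "Mixing operators with specified and UNKNOWN resource requirements"
                            + " within the same job is not supported in the first version.");
        }
    }
}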

I'll update the FLIP document addressing the comments ASAP.


Thank you~

Xintong Song



On Tue, Sep 3, 2019 at 2:42 PM Kurt Young <[hidden email]> wrote:

> Thanks Xingtong for driving this effort, I haven't finished the whole
> document yet,
> but have couple of questions:
>
> 1. Regarding to network memory, the document said it will be derived by
> framework
> automatically. I'm wondering whether we should delete this dimension from
> user-
> facing API?
>
> 2. Regarding to fraction based quota, I don't quite get the meaning of
> "slotSharingGroupOnHeapManagedMem" and "slotSharingGroupOffHeapManagedMem".
> What if the sharing group is mixed with specified resource and UNKNOWN
> resource
> requirements.
>
> 3 IIUC, even user had set resource requirements, lets say 500MB off-heap
> managed
> memory, during execution the operator may or may not have 500MB off-heap
> managed
> memory, right?
>
> Best,
> Kurt
>
>
> On Mon, Sep 2, 2019 at 8:36 PM Zhu Zhu <[hidden email]> wrote:
>
> > Thanks Xintong for proposing this improvement. Fine grained resources can
> > be very helpful when user has good planning on resources.
> >
> > I have a few questions:
> > 1. Currently in a batch job, vertices from different regions can run at
> the
> > same time in slots from the same shared group, as long as they do not
> have
> > data dependency on each other and available slot count is not smaller
> than
> > the *max* of parallelism of all tasks.
> > With changes in this FLIP however, tasks from different regions cannot
> > share slots anymore.
> > Once available slot count is smaller than the *sum* of all parallelism of
> > tasks from all regions, tasks may need to be executed sequentially, which
> > might result in a performance regression.
> > Is this(performance regression to existing DataSet jobs) considered as a
> > necessary and accepted trade off in this FLIP?
> >
> > 2. The network memory depends on the input/output ExecutionEdge count and
> > thus can be different even for parallel instances of the same JobVertex.
> > Does this mean that when adding task resources to calculating the slot
> > resource for a shared group, the max possible network memory of the
> vertex
> > instance shall be used?
> > This might result in larger resource required than actually needed.
> >
> > And some minor comments:
> > 1. Regarding "fracManagedMemOnHeap = 1 / numOpsUseOnHeapManagedMemory", I
> > guess you mean numOpsUseOnHeapManagedMemoryInTheSameSharedGroup ?
> > 2. I think the *StreamGraphGenerator* in the #Slot Sharing section and
> > implementation step 4 should be *StreamingJobGraphGenerator*, as
> > *StreamGraphGenerator* is not aware of JobGraph and pipelined region.
> >
> >
> > Thanks,
> > Zhu Zhu
> >
> > Xintong Song <[hidden email]> 于2019年9月2日周一 上午11:59写道:
> >
> > > Updated the FLIP wiki page [1], with the following changes.
> > >
> > >    - Remove the step of converting pipelined edges between different
> slot
> > >    sharing groups into blocking edges.
> > >    - Set `allSourcesInSamePipelinedRegion` to true by default.
> > >
> > > Thank you~
> > >
> > > Xintong Song
> > >
> > >
> > >
> > > On Mon, Sep 2, 2019 at 11:50 AM Xintong Song <[hidden email]>
> > > wrote:
> > >
> > > > Regarding changing edge type, I think actually we don't need to do
> this
> > > > for batch jobs neither, because we don't have public interfaces for
> > users
> > > > to explicitly set slot sharing groups in DataSet API and SQL/Table
> API.
> > > We
> > > > have such interfaces in DataStream API only.
> > > >
> > > > Thank you~
> > > >
> > > > Xintong Song
> > > >
> > > >
> > > >
> > > > On Tue, Aug 27, 2019 at 10:16 PM Xintong Song <[hidden email]
> >
> > > > wrote:
> > > >
> > > >> Thanks for the correction, Till.
> > > >>
> > > >> Regarding your comments:
> > > >> - You are right, we should not change the edge type for streaming
> > jobs.
> > > >> Then I think we can change the option
> > 'allSourcesInSamePipelinedRegion'
> > > in
> > > >> step 2 to 'isStreamingJob', and implement the current step 2 before
> > the
> > > >> current step 1 so we can use this option to decide whether should
> > change
> > > >> the edge type. What do you think?
> > > >> - Agree. It should be easier to make the default value of
> > > >> 'allSourcesInSamePipelinedRegion' (or 'isStreamingJob') 'true', and
> > set
> > > it
> > > >> to 'false' when using DataSet API or blink planner.
> > > >>
> > > >> Thank you~
> > > >>
> > > >> Xintong Song
> > > >>
> > > >>
> > > >>
> > > >> On Tue, Aug 27, 2019 at 8:59 PM Till Rohrmann <[hidden email]
> >
> > > >> wrote:
> > > >>
> > > >>> Thanks for creating the implementation plan Xintong. Overall, the
> > > >>> implementation plan looks good. I had a couple of comments:
> > > >>>
> > > >>> - What will happen if a user has defined a streaming job with two
> > slot
> > > >>> sharing groups? Would the code insert a blocking data exchange
> > between
> > > >>> these two groups? If yes, then this breaks existing Flink streaming
> > > jobs.
> > > >>> - How do we detect unbounded streaming jobs to set
> > > >>> the allSourcesInSamePipelinedRegion to `true`? Wouldn't it be
> easier
> > to
> > > >>> set
> > > >>> it false if we are using the DataSet API or the Blink planner with
> a
> > > >>> bounded job?
> > > >>>
> > > >>> Cheers,
> > > >>> Till
> > > >>>
> > > >>> On Tue, Aug 27, 2019 at 2:16 PM Till Rohrmann <
> [hidden email]>
> > > >>> wrote:
> > > >>>
> > > >>> > I guess there is a typo since the link to the FLIP-53 is
> > > >>> >
> > > >>>
> > >
> >
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-53%3A+Fine+Grained+Operator+Resource+Management
> > > >>> >
> > > >>> > Cheers,
> > > >>> > Till
> > > >>> >
> > > >>> > On Tue, Aug 27, 2019 at 1:42 PM Xintong Song <
> > [hidden email]>
> > > >>> > wrote:
> > > >>> >
> > > >>> >> Added implementation steps for this FLIP on the wiki page [1].
> > > >>> >>
> > > >>> >>
> > > >>> >> Thank you~
> > > >>> >>
> > > >>> >> Xintong Song
> > > >>> >>
> > > >>> >>
> > > >>> >> [1]
> > > >>> >>
> > > >>> >>
> > > >>>
> > >
> >
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-49%3A+Unified+Memory+Configuration+for+TaskExecutors
> > > >>> >>
> > > >>> >> On Mon, Aug 19, 2019 at 10:29 PM Xintong Song <
> > > [hidden email]>
> > > >>> >> wrote:
> > > >>> >>
> > > >>> >> > Hi everyone,
> > > >>> >> >
> > > >>> >> > As Till suggested, the original "FLIP-53: Fine Grained
> Resource
> > > >>> >> > Management" splits into two separate FLIPs,
> > > >>> >> >
> > > >>> >> >    - FLIP-53: Fine Grained Operator Resource Management [1]
> > > >>> >> >    - FLIP-56: Dynamic Slot Allocation [2]
> > > >>> >> >
> > > >>> >> > We'll continue using this discussion thread for FLIP-53. For
> > > >>> FLIP-56, I
> > > >>> >> > just started a new discussion thread [3].
> > > >>> >> >
> > > >>> >> > Thank you~
> > > >>> >> >
> > > >>> >> > Xintong Song
> > > >>> >> >
> > > >>> >> >
> > > >>> >> > [1]
> > > >>> >> >
> > > >>> >>
> > > >>>
> > >
> >
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-53%3A+Fine+Grained+Operator+Resource+Management
> > > >>> >> >
> > > >>> >> > [2]
> > > >>> >> >
> > > >>> >>
> > > >>>
> > >
> >
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-56%3A+Dynamic+Slot+Allocation
> > > >>> >> >
> > > >>> >> > [3]
> > > >>> >> >
> > > >>> >>
> > > >>>
> > >
> >
> http://apache-flink-mailing-list-archive.1008284.n3.nabble.com/DISCUSS-FLIP-56-Dynamic-Slot-Allocation-td31960.html
> > > >>> >> >
> > > >>> >> > On Mon, Aug 19, 2019 at 2:55 PM Xintong Song <
> > > [hidden email]
> > > >>> >
> > > >>> >> > wrote:
> > > >>> >> >
> > > >>> >> >> Thinks for the comments, Yang.
> > > >>> >> >>
> > > >>> >> >> Regarding your questions:
> > > >>> >> >>
> > > >>> >> >>    1. How to calculate the resource specification of
> > > TaskManagers?
> > > >>> Do
> > > >>> >> they
> > > >>> >> >>>    have them same resource spec calculated based on the
> > > >>> >> configuration? I
> > > >>> >> >>> think
> > > >>> >> >>>    we still have wasted resources in this situation. Or we
> > could
> > > >>> start
> > > >>> >> >>>    TaskManagers with different spec.
> > > >>> >> >>>
> > > >>> >> >> I agree with you that we can further improve the resource
> > utility
> > > >>> by
> > > >>> >> >> customizing task executors with different resource
> > > specifications.
> > > >>> >> However,
> > > >>> >> >> I'm in favor of limiting the scope of this FLIP and leave it
> > as a
> > > >>> >> future
> > > >>> >> >> optimization. The plan for that part is to move the logic of
> > > >>> deciding
> > > >>> >> task
> > > >>> >> >> executor specifications into the slot manager and make slot
> > > manager
> > > >>> >> >> pluggable, so inside the slot manager plugin we can have
> > > different
> > > >>> >> logics
> > > >>> >> >> for deciding the task executor specifications.
> > > >>> >> >>
> > > >>> >> >>
> > > >>> >> >>>    2. If a slot is released and returned to SlotPool, does
> it
> > > >>> could be
> > > >>> >> >>>    reused by other SlotRequest that the request resource is
> > > >>> smaller
> > > >>> >> than
> > > >>> >> >>> it?
> > > >>> >> >>>
> > > >>> >> >> No, I think slot pool should always return slots if they do
> not
> > > >>> exactly
> > > >>> >> >> match the pending requests, so that resource manager can deal
> > > with
> > > >>> the
> > > >>> >> >> extra resources.
> > > >>> >> >>
> > > >>> >> >>>       - If it is yes, what happens to the available resource
> > in
> > > >>> the
> > > >>> >> >>
> > > >>> >> >>       TaskManager.
> > > >>> >> >>>       - What is the SlotStatus of the cached slot in
> SlotPool?
> > > The
> > > >>> >> >>>       AllocationId is null?
> > > >>> >> >>>
> > > >>> >> >> The allocation id does not change as long as the slot is not
> > > >>> returned
> > > >>> >> >> from the job master, no matter its occupied or available in
> the
> > > >>> slot
> > > >>> >> pool.
> > > >>> >> >> I think we have the same behavior currently. No matter how
> many
> > > >>> tasks
> > > >>> >> the
> > > >>> >> >> job master deploy into the slot, concurrently or
> sequentially,
> > it
> > > >>> is
> > > >>> >> one
> > > >>> >> >> allocation from the cluster to the job until the slot is
> freed
> > > from
> > > >>> >> the job
> > > >>> >> >> master.
> > > >>> >> >>
> > > >>> >> >>>    3. In a session cluster, some jobs are configured with
> > > operator
> > > >>> >> >>>    resources, meanwhile other jobs are using UNKNOWN. How to
> > > deal
> > > >>> with
> > > >>> >> >>> this
> > > >>> >> >>>    situation?
> > > >>> >> >>
> > > >>> >> >> As long as we do not mix unknown / specified resource
> profiles
> > > >>> within
> > > >>> >> the
> > > >>> >> >> same job / slot, there shouldn't be a problem. Resource
> manager
> > > >>> >> converts
> > > >>> >> >> unknown resource profiles in slot requests to specified
> default
> > > >>> >> resource
> > > >>> >> >> profiles, so they can be dynamically allocated from task
> > > executors'
> > > >>> >> >> available resources just as other slot requests with
> specified
> > > >>> resource
> > > >>> >> >> profiles.
> > > >>> >> >>
> > > >>> >> >> Thank you~
> > > >>> >> >>
> > > >>> >> >> Xintong Song
> > > >>> >> >>
> > > >>> >> >>
> > > >>> >> >>
> > > >>> >> >> On Mon, Aug 19, 2019 at 11:39 AM Yang Wang <
> > > [hidden email]>
> > > >>> >> wrote:
> > > >>> >> >>
> > > >>> >> >>> Hi Xintong,
> > > >>> >> >>>
> > > >>> >> >>>
> > > >>> >> >>> Thanks for your detailed proposal. I think many users are
> > > >>> suffering
> > > >>> >> from
> > > >>> >> >>> waste of resources. The resource spec of all task managers
> are
> > > >>> same
> > > >>> >> and
> > > >>> >> >>> we
> > > >>> >> >>> have to increase all task managers to make the heavy one
> more
> > > >>> stable.
> > > >>> >> So
> > > >>> >> >>> we
> > > >>> >> >>> will benefit from the fine grained resource management a
> lot.
> > We
> > > >>> could
> > > >>> >> >>> get
> > > >>> >> >>> better resource utilization and stability.
> > > >>> >> >>>
> > > >>> >> >>>
> > > >>> >> >>> Just to share some thoughts.
> > > >>> >> >>>
> > > >>> >> >>>
> > > >>> >> >>>
> > > >>> >> >>>    1. How to calculate the resource specification of
> > > >>> TaskManagers? Do
> > > >>> >> >>> they
> > > >>> >> >>>    have them same resource spec calculated based on the
> > > >>> >> configuration? I
> > > >>> >> >>> think
> > > >>> >> >>>    we still have wasted resources in this situation. Or we
> > could
> > > >>> start
> > > >>> >> >>>    TaskManagers with different spec.
> > > >>> >> >>>    2. If a slot is released and returned to SlotPool, does
> it
> > > >>> could be
> > > >>> >> >>>    reused by other SlotRequest that the request resource is
> > > >>> smaller
> > > >>> >> than
> > > >>> >> >>> it?
> > > >>> >> >>>       - If it is yes, what happens to the available resource
> > in
> > > >>> the
> > > >>> >> >>>       TaskManager.
> > > >>> >> >>>       - What is the SlotStatus of the cached slot in
> SlotPool?
> > > The
> > > >>> >> >>>       AllocationId is null?
> > > >>> >> >>>    3. In a session cluster, some jobs are configured with
> > > operator
> > > >>> >> >>>    resources, meanwhile other jobs are using UNKNOWN. How to
> > > deal
> > > >>> with
> > > >>> >> >>> this
> > > >>> >> >>>    situation?
> > > >>> >> >>>
> > > >>> >> >>>
> > > >>> >> >>>
> > > >>> >> >>> Best,
> > > >>> >> >>> Yang
> > > >>> >> >>>
> > > >>> >> >>> Xintong Song <[hidden email]> 于2019年8月16日周五
> 下午8:57写道:
> > > >>> >> >>>
> > > >>> >> >>> > Thanks for the feedbacks, Yangze and Till.
> > > >>> >> >>> >
> > > >>> >> >>> > Yangze,
> > > >>> >> >>> >
> > > >>> >> >>> > I agree with you that we should make scheduling strategy
> > > >>> pluggable
> > > >>> >> and
> > > >>> >> >>> > optimize the strategy to reduce the memory fragmentation
> > > >>> problem,
> > > >>> >> and
> > > >>> >> >>> > thanks for the inputs on the potential algorithmic
> > solutions.
> > > >>> >> However,
> > > >>> >> >>> I'm
> > > >>> >> >>> > in favor of keep this FLIP focusing on the overall
> mechanism
> > > >>> design
> > > >>> >> >>> rather
> > > >>> >> >>> > than strategies. Solving the fragmentation issue should be
> > > >>> >> considered
> > > >>> >> >>> as an
> > > >>> >> >>> > optimization, and I agree with Till that we probably
> should
> > > >>> tackle
> > > >>> >> this
> > > >>> >> >>> > afterwards.
> > > >>> >> >>> >
> > > >>> >> >>> > Till,
> > > >>> >> >>> >
> > > >>> >> >>> > - Regarding splitting the FLIP, I think it makes sense.
> The
> > > >>> operator
> > > >>> >> >>> > resource management and dynamic slot allocation do not
> have
> > > much
> > > >>> >> >>> dependency
> > > >>> >> >>> > on each other.
> > > >>> >> >>> >
> > > >>> >> >>> > - Regarding the default slot size, I think this is similar
> > to
> > > >>> >> FLIP-49
> > > >>> >> >>> [1]
> > > >>> >> >>> > where we want all the deriving happens at one place. I
> think
> > > it
> > > >>> >> would
> > > >>> >> >>> be
> > > >>> >> >>> > nice to pass the default slot size into the task executor
> in
> > > the
> > > >>> >> same
> > > >>> >> >>> way
> > > >>> >> >>> > that we pass in the memory pool sizes in FLIP-49 [1].
> > > >>> >> >>> >
> > > >>> >> >>> > - Regarding the return value of
> > > >>> >> TaskExecutorGateway#requestResource, I
> > > >>> >> >>> > think you're right. We should avoid using null as the
> return
> > > >>> value.
> > > >>> >> I
> > > >>> >> >>> think
> > > >>> >> >>> > we probably should thrown an exception here.
> > > >>> >> >>> >
> > > >>> >> >>> > Thank you~
> > > >>> >> >>> >
> > > >>> >> >>> > Xintong Song
> > > >>> >> >>> >
> > > >>> >> >>> >
> > > >>> >> >>> > [1]
> > > >>> >> >>> >
> > > >>> >> >>> >
> > > >>> >> >>>
> > > >>> >>
> > > >>>
> > >
> >
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-49%3A+Unified+Memory+Configuration+for+TaskExecutors
> > > >>> >> >>> >
> > > >>> >> >>> > On Fri, Aug 16, 2019 at 2:18 PM Till Rohrmann <
> > > >>> [hidden email]
> > > >>> >> >
> > > >>> >> >>> > wrote:
> > > >>> >> >>> >
> > > >>> >> >>> > > Hi Xintong,
> > > >>> >> >>> > >
> > > >>> >> >>> > > thanks for drafting this FLIP. I think your proposal
> helps
> > > to
> > > >>> >> >>> improve the
> > > >>> >> >>> > > execution of batch jobs more efficiently. Moreover, it
> > > >>> enables the
> > > >>> >> >>> proper
> > > >>> >> >>> > > integration of the Blink planner which is very important
> > as
> > > >>> well.
> > > >>> >> >>> > >
> > > >>> >> >>> > > Overall, the FLIP looks good to me. I was wondering
> > whether
> > > it
> > > >>> >> >>> wouldn't
> > > >>> >> >>> > > make sense to actually split it up into two FLIPs:
> > Operator
> > > >>> >> resource
> > > >>> >> >>> > > management and dynamic slot allocation. I think these
> two
> > > >>> FLIPs
> > > >>> >> >>> could be
> > > >>> >> >>> > > seen as orthogonal and it would decrease the scope of
> each
> > > >>> >> individual
> > > >>> >> >>> > FLIP.
> > > >>> >> >>> > >
> > > >>> >> >>> > > Some smaller comments:
> > > >>> >> >>> > >
> > > >>> >> >>> > > - I'm not sure whether we should pass in the default
> slot
> > > size
> > > >>> >> via an
> > > >>> >> >>> > > environment variable. Without having unified the way how
> > > Flink
> > > >>> >> >>> components
> > > >>> >> >>> > > are configured [1], I think it would be better to pass
> it
> > in
> > > >>> as
> > > >>> >> part
> > > >>> >> >>> of
> > > >>> >> >>> > the
> > > >>> >> >>> > > configuration.
> > > >>> >> >>> > > - I would avoid returning a null value from
> > > >>> >> >>> > > TaskExecutorGateway#requestResource if it cannot be
> > > fulfilled.
> > > >>> >> >>> Either we
> > > >>> >> >>> > > should introduce an explicit return value saying this or
> > > >>> throw an
> > > >>> >> >>> > > exception.
> > > >>> >> >>> > >
> > > >>> >> >>> > > Concerning Yangze's comments: I think you are right that
> > it
> > > >>> would
> > > >>> >> be
> > > >>> >> >>> > > helpful to make the selection strategy pluggable. Also
> > > >>> batching
> > > >>> >> slot
> > > >>> >> >>> > > requests to the RM could be a good optimization. For the
> > > sake
> > > >>> of
> > > >>> >> >>> keeping
> > > >>> >> >>> > > the scope of this FLIP smaller I would try to tackle
> these
> > > >>> things
> > > >>> >> >>> after
> > > >>> >> >>> > the
> > > >>> >> >>> > > initial version has been completed (without spoiling
> these
> > > >>> >> >>> optimization
> > > >>> >> >>> > > opportunities). In particular batching the slot requests
> > > >>> depends
> > > >>> >> on
> > > >>> >> >>> the
> > > >>> >> >>> > > current scheduler refactoring and could also be realized
> > on
> > > >>> the RM
> > > >>> >> >>> side
> > > >>> >> >>> > > only.
> > > >>> >> >>> > >
> > > >>> >> >>> > > [1]
> > > >>> >> >>> > >
> > > >>> >> >>> > >
> > > >>> >> >>> >
> > > >>> >> >>>
> > > >>> >>
> > > >>>
> > >
> >
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-54%3A+Evolve+ConfigOption+and+Configuration
> > > >>> >> >>> > >
> > > >>> >> >>> > > Cheers,
> > > >>> >> >>> > > Till
> > > >>> >> >>> > >
> > > >>> >> >>> > >
> > > >>> >> >>> > >
> > > >>> >> >>> > > On Fri, Aug 16, 2019 at 11:11 AM Yangze Guo <
> > > >>> [hidden email]>
> > > >>> >> >>> wrote:
> > > >>> >> >>> > >
> > > >>> >> >>> > > > Hi, Xintong
> > > >>> >> >>> > > >
> > > >>> >> >>> > > > Thanks to propose this FLIP. The general design looks
> > good
> > > >>> to
> > > >>> >> me,
> > > >>> >> >>> +1
> > > >>> >> >>> > > > for this feature.
> > > >>> >> >>> > > >
> > > >>> >> >>> > > > Since slots in the same task executor could have
> > different
> > > >>> >> resource
> > > >>> >> >>> > > > profile, we will
> > > >>> >> >>> > > > meet resource fragment problem. Think about this case:
> > > >>> >> >>> > > >  - request A want 1G memory while request B & C want
> > 0.5G
> > > >>> memory
> > > >>> >> >>> > > >  - There are two task executors T1 & T2 with 1G and
> 0.5G
> > > >>> free
> > > >>> >> >>> memory
> > > >>> >> >>> > > > respectively
> > > >>> >> >>> > > > If B come first and we cut a slot from T1 for B, A
> must
> > > >>> wait for
> > > >>> >> >>> the
> > > >>> >> >>> > > > free resource from
> > > >>> >> >>> > > > other task. But A could have been scheduled
> immediately
> > if
> > > >>> we
> > > >>> >> cut a
> > > >>> >> >>> > > > slot from T2 for B.
> > > >>> >> >>> > > >
> > > >>> >> >>> > > > The logic of findMatchingSlot now become finding a
> task
> > > >>> executor
> > > >>> >> >>> which
> > > >>> >> >>> > > > has enough
> > > >>> >> >>> > > > resource and then cut a slot from it. Current method
> > could
> > > >>> be
> > > >>> >> seen
> > > >>> >> >>> as
> > > >>> >> >>> > > > "First-fit strategy",
> > > >>> >> >>> > > > which works well in general but sometimes could not be
> > the
> > > >>> >> >>> optimization
> > > >>> >> >>> > > > method.
> > > >>> >> >>> > > >
> > > >>> >> >>> > > > Actually, this problem could be abstracted as "Bin
> > Packing
> > > >>> >> >>> Problem"[1].
> > > >>> >> >>> > > > Here are
> > > >>> >> >>> > > > some common approximate algorithms:
> > > >>> >> >>> > > > - First fit
> > > >>> >> >>> > > > - Next fit
> > > >>> >> >>> > > > - Best fit
> > > >>> >> >>> > > >
> > > >>> >> >>> > > > But it become multi-dimensional bin packing problem if
> > we
> > > >>> take
> > > >>> >> CPU
> > > >>> >> >>> > > > into account. It hard
> > > >>> >> >>> > > > to define which one is best fit now. Some research
> > > addressed
> > > >>> >> this
> > > >>> >> >>> > > > problem, such like Tetris[2].
> > > >>> >> >>> > > >
> > > >>> >> >>> > > > Here are some thinking about it:
> > > >>> >> >>> > > > 1. We could make the strategy of finding matching task
> > > >>> executor
> > > >>> >> >>> > > > pluginable. Let user to config the
> > > >>> >> >>> > > > best strategy in their scenario.
> > > >>> >> >>> > > > 2. We could support batch request interface in RM,
> > because
> > > >>> we
> > > >>> >> have
> > > >>> >> >>> > > > opportunities to optimize
> > > >>> >> >>> > > > if we have more information. If we know the A, B, C at
> > the
> > > >>> same
> > > >>> >> >>> time,
> > > >>> >> >>> > > > we could always make the best decision.
> > > >>> >> >>> > > >
> > > >>> >> >>> > > > [1] http://www.or.deis.unibo.it/kp/Chapter8.pdf
> > > >>> >> >>> > > > [2]
> > > >>> >> >>> >
> > > >>> >>
> > > https://www.cs.cmu.edu/~xia/resources/Documents/grandl_sigcomm14.pdf
> > > >>> >> >>> > > >
> > > >>> >> >>> > > > Best,
> > > >>> >> >>> > > > Yangze Guo
> > > >>> >> >>> > > >
> > > >>> >> >>> > > > On Thu, Aug 15, 2019 at 10:40 PM Xintong Song <
> > > >>> >> >>> [hidden email]>
> > > >>> >> >>> > > > wrote:
> > > >>> >> >>> > > > >
> > > >>> >> >>> > > > > Hi everyone,
> > > >>> >> >>> > > > >
> > > >>> >> >>> > > > > We would like to start a discussion thread on
> > "FLIP-53:
> > > >>> Fine
> > > >>> >> >>> Grained
> > > >>> >> >>> > > > > Resource Management"[1], where we propose how to
> > improve
> > > >>> Flink
> > > >>> >> >>> > resource
> > > >>> >> >>> > > > > management and scheduling.
> > > >>> >> >>> > > > >
> > > >>> >> >>> > > > > This FLIP mainly discusses the following issues.
> > > >>> >> >>> > > > >
> > > >>> >> >>> > > > >    - How to support tasks with fine grained resource
> > > >>> >> >>> requirements.
> > > >>> >> >>> > > > >    - How to unify resource management for jobs with
> /
> > > >>> without
> > > >>> >> >>> fine
> > > >>> >> >>> > > > grained
> > > >>> >> >>> > > > >    resource requirements.
> > > >>> >> >>> > > > >    - How to unify resource management for streaming
> /
> > > >>> batch
> > > >>> >> jobs.
> > > >>> >> >>> > > > >
> > > >>> >> >>> > > > > Key changes proposed in the FLIP are as follows.
> > > >>> >> >>> > > > >
> > > >>> >> >>> > > > >    - Unify memory management for operators with /
> > > without
> > > >>> fine
> > > >>> >> >>> > grained
> > > >>> >> >>> > > > >    resource requirements by applying a fraction
> based
> > > >>> quota
> > > >>> >> >>> > mechanism.
> > > >>> >> >>> > > > >    - Unify resource scheduling for streaming and
> batch
> > > >>> jobs by
> > > >>> >> >>> > setting
> > > >>> >> >>> > > > slot
> > > >>> >> >>> > > > >    sharing groups for pipelined regions during
> > compiling
> > > >>> >> stage.
> > > >>> >> >>> > > > >    - Dynamically allocate slots from task executors'
> > > >>> available
> > > >>> >> >>> > > resources.
> > > >>> >> >>> > > > >
> > > >>> >> >>> > > > > Please find more details in the FLIP wiki document
> > [1].
> > > >>> >> Looking
> > > >>> >> >>> > forward
> > > >>> >> >>> > > > to
> > > >>> >> >>> > > > > your feedbacks.
> > > >>> >> >>> > > > >
> > > >>> >> >>> > > > > Thank you~
> > > >>> >> >>> > > > >
> > > >>> >> >>> > > > > Xintong Song
> > > >>> >> >>> > > > >
> > > >>> >> >>> > > > >
> > > >>> >> >>> > > > > [1]
> > > >>> >> >>> > > > >
> > > >>> >> >>> > > >
> > > >>> >> >>> > >
> > > >>> >> >>> >
> > > >>> >> >>>
> > > >>> >>
> > > >>>
> > >
> >
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-53%3A+Fine+Grained+Resource+Management
> > > >>> >> >>> > > >
> > > >>> >> >>> > >
> > > >>> >> >>> >
> > > >>> >> >>>
> > > >>> >> >>
> > > >>> >>
> > > >>> >
> > > >>>
> > > >>
> > >
> >
>
Reply | Threaded
Open this post in threaded view
|

Re: [DISCUSS] FLIP-53: Fine Grained Resource Management

Zhu Zhu
Thanks Xintong for the explanation.

For question #1, I think it's good as long as DataSet job behavior remains
the same.

For question #2, agreed that the resource difference is small enough (at
most one edge of difference) in the currently supported point-wise execution
edge connection patterns.
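
(For reference, the "at most one edge" bound comes from distributing
upstream subtasks as evenly as possible over downstream subtasks in a
point-wise connection; the sketch below is illustrative and not the actual
Flink edge-assignment code.)

final class PointwiseChannelCount {

    // Number of input channels of a given downstream subtask under an even
    // point-wise distribution: every subtask gets floor or ceil of the
    // upstream/downstream ratio, so any two subtasks differ by at most one.
    static int numInputChannels(int upstreamParallelism,
                                int downstreamParallelism,
                                int downstreamSubtaskIndex) {
        int base = upstreamParallelism / downstreamParallelism;
        int remainder = upstreamParallelism % downstreamParallelism;
        return downstreamSubtaskIndex < remainder ? base + 1 : base;
    }
}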

Thanks,
Zhu Zhu

Xintong Song <[hidden email]> 于2019年9月3日周二 下午6:58写道:

>  Thanks for the comments, Zhu & Kurt.
>
> Andrey and I also had some discussions offline, and I would like to first
> post a summary of our discussion:
>
>    1. The motivation of the fraction based approach is to unify resource
>    management for both operators with specified and unknown resource
>    requirements.
>    2. The fraction based approach proposed in this FLIP should only affect
>    streaming jobs (both bounded and unbounded). For DataSet jobs, there are
>    already some fraction based approach (in TaskConfig and ChainedDriver),
> and
>    we do not make any change to the existing approach.
>    3. The scope of this FLIP does not include discussion of how to set
>    ResourceSpec for operators.
>       1. For blink jobs, the optimizer can set operator resources for the
>       users, according to their configurations (default: unknown)
>       2. For DataStream jobs, there are no method / interface to set
>       operator resources at the moment (1.10). We can have in the future.
>       3. For DataSet jobs, there are existing user interfaces to set
>       operator resources.
>    4. The FLIP should explain more about how ResourceSpecs works
>       1. PhysicalTransformations (deployed with operators into the
>       StreamTasks) get ResourceSpec: unknown by default or known (e.g.
> from the
>       Blink planner)
>       2. While generating stream graph, calculate fractions and set to
>       StreamConfig
>       3. While scheduling, convert ResourceSpec to ResourceProfile
>       (ResourceSpec + network memory), and deploy to slots / TMs matching
> the
>       resources
>       4. While starting Task in TM, each operator gets fraction converted
>       back to the original absolute value requested by user or fair
> unknown share
>       of the slot
>       5. We should not set `allSourcesInSamePipelinedRegion` to `false` for
>    DataSet jobs. Behaviors of DataSet jobs should not be changed.
>    6. The FLIP document should differentiate works planed in this FLIP and
>    the future follow-ups more clearly, by put the follow-ups in a separate
>    section
>    7. Another limitation of the rejected alternative setting fractions at
>    scheduling time is that, the scheduler implementation does not know
> which
>    tasks will be deployed into the same slot in advance.
>
> Andrey, Please bring it up if there is anything I missed.
>
> Zhu, regarding your comments:
>
>    1. If we do not set `allSourcesInSamePipelinedRegion` to `false` for
>    DataSet jobs (point 5 in the discussion summary above), then there
>    shouldn't be any regression right?
>    2. I think it makes sense to set the max possible network memory for the
>    JobVertex. When you say parallel instances of the same JobVertex may
> have
>    need different network memory, I guess you mean the rescale scenarios
> where
>    parallelisms of upstream / downstream vertex cannot be exactly divided
> by
>    parallelism of downstream / upstream vertex? I would say it's
> acceptable to
>    have slight difference between actually needed and allocated network
> memory.
>    3. Yes, by numOpsUseOnHeapManagedMemory I mean
>    numOpsUseOnHeapManagedMemoryInTheSameSharedGroup. I'll update the doc.
>    4. Yes, it should be StreamingJobGraphGenerator. Thanks for the
>    correction.
>
>
> Kurt, regarding your comments:
>
>    1. I think we don't have network memory in ResourceSpec, which is the
>    user facing API. We only have network memory in ResourceProfile, which
> is
>    used internally for scheduling. The reason we do not expose network
> memory
>    to the user is that, currently how many network buffers each task needs
> is
>    decided by the topology of execution graph (how many input / output
>    channels it has).
>    2. In the section "Operator Resource Requirements": "For the first
>    version, we do not support mixing operators with specified / unknown
>    resource requirements in the same job. Either all or none of the
> operators
>    of the same job should specify their resource requirements.
>    StreamGraphGenerator should check this and throw an error when mixing of
>    specified / unknown resource requirements is detected, during the
>    compilation stage."
>    3. If the user set a resource requirement, then it is guaranteed that
>    the task should get at least the much resource, otherwise there should
> be
>    an exception. That should be guaranteed by the "Dynamic Slot Allocation"
>    approach (FLIP-56).
>
>
> I'll update the FLIP document addressing the comments ASAP.
>
>
> Thank you~
>
> Xintong Song
>
>
>
> On Tue, Sep 3, 2019 at 2:42 PM Kurt Young <[hidden email]> wrote:
>
> > Thanks Xingtong for driving this effort, I haven't finished the whole
> > document yet,
> > but have couple of questions:
> >
> > 1. Regarding to network memory, the document said it will be derived by
> > framework
> > automatically. I'm wondering whether we should delete this dimension from
> > user-
> > facing API?
> >
> > 2. Regarding to fraction based quota, I don't quite get the meaning of
> > "slotSharingGroupOnHeapManagedMem" and
> "slotSharingGroupOffHeapManagedMem".
> > What if the sharing group is mixed with specified resource and UNKNOWN
> > resource
> > requirements.
> >
> > 3 IIUC, even user had set resource requirements, lets say 500MB off-heap
> > managed
> > memory, during execution the operator may or may not have 500MB off-heap
> > managed
> > memory, right?
> >
> > Best,
> > Kurt
> >
> >
> > On Mon, Sep 2, 2019 at 8:36 PM Zhu Zhu <[hidden email]> wrote:
> >
> > > Thanks Xintong for proposing this improvement. Fine grained resources
> can
> > > be very helpful when user has good planning on resources.
> > >
> > > I have a few questions:
> > > 1. Currently in a batch job, vertices from different regions can run at
> > the
> > > same time in slots from the same shared group, as long as they do not
> > have
> > > data dependency on each other and available slot count is not smaller
> > than
> > > the *max* of parallelism of all tasks.
> > > With changes in this FLIP however, tasks from different regions cannot
> > > share slots anymore.
> > > Once available slot count is smaller than the *sum* of all parallelism
> of
> > > tasks from all regions, tasks may need to be executed sequentially,
> which
> > > might result in a performance regression.
> > > Is this(performance regression to existing DataSet jobs) considered as
> a
> > > necessary and accepted trade off in this FLIP?
> > >
> > > 2. The network memory depends on the input/output ExecutionEdge count
> and
> > > thus can be different even for parallel instances of the same
> JobVertex.
> > > Does this mean that when adding task resources to calculating the slot
> > > resource for a shared group, the max possible network memory of the
> > vertex
> > > instance shall be used?
> > > This might result in larger resource required than actually needed.
> > >
> > > And some minor comments:
> > > 1. Regarding "fracManagedMemOnHeap = 1 /
> numOpsUseOnHeapManagedMemory", I
> > > guess you mean numOpsUseOnHeapManagedMemoryInTheSameSharedGroup ?
> > > 2. I think the *StreamGraphGenerator* in the #Slot Sharing section and
> > > implementation step 4 should be *StreamingJobGraphGenerator*, as
> > > *StreamGraphGenerator* is not aware of JobGraph and pipelined region.
> > >
> > >
> > > Thanks,
> > > Zhu Zhu
> > >
> > > Xintong Song <[hidden email]> 于2019年9月2日周一 上午11:59写道:
> > >
> > > > Updated the FLIP wiki page [1], with the following changes.
> > > >
> > > >    - Remove the step of converting pipelined edges between different
> > slot
> > > >    sharing groups into blocking edges.
> > > >    - Set `allSourcesInSamePipelinedRegion` to true by default.
> > > >
> > > > Thank you~
> > > >
> > > > Xintong Song
> > > >
> > > >
> > > >
> > > > On Mon, Sep 2, 2019 at 11:50 AM Xintong Song <[hidden email]>
> > > > wrote:
> > > >
> > > > > Regarding changing edge type, I think actually we don't need to do
> > this
> > > > > for batch jobs neither, because we don't have public interfaces for
> > > users
> > > > > to explicitly set slot sharing groups in DataSet API and SQL/Table
> > API.
> > > > We
> > > > > have such interfaces in DataStream API only.
> > > > >
> > > > > Thank you~
> > > > >
> > > > > Xintong Song
> > > > >
> > > > >
> > > > >
> > > > > On Tue, Aug 27, 2019 at 10:16 PM Xintong Song <
> [hidden email]
> > >
> > > > > wrote:
> > > > >
> > > > >> Thanks for the correction, Till.
> > > > >>
> > > > >> Regarding your comments:
> > > > >> - You are right, we should not change the edge type for streaming
> > > jobs.
> > > > >> Then I think we can change the option
> > > 'allSourcesInSamePipelinedRegion'
> > > > in
> > > > >> step 2 to 'isStreamingJob', and implement the current step 2
> before
> > > the
> > > > >> current step 1 so we can use this option to decide whether should
> > > change
> > > > >> the edge type. What do you think?
> > > > >> - Agree. It should be easier to make the default value of
> > > > >> 'allSourcesInSamePipelinedRegion' (or 'isStreamingJob') 'true',
> and
> > > set
> > > > it
> > > > >> to 'false' when using DataSet API or blink planner.
> > > > >>
> > > > >> Thank you~
> > > > >>
> > > > >> Xintong Song
> > > > >>
> > > > >>
> > > > >>
> > > > >> On Tue, Aug 27, 2019 at 8:59 PM Till Rohrmann <
> [hidden email]
> > >
> > > > >> wrote:
> > > > >>
> > > > >>> Thanks for creating the implementation plan Xintong. Overall, the
> > > > >>> implementation plan looks good. I had a couple of comments:
> > > > >>>
> > > > >>> - What will happen if a user has defined a streaming job with two
> > > slot
> > > > >>> sharing groups? Would the code insert a blocking data exchange
> > > between
> > > > >>> these two groups? If yes, then this breaks existing Flink
> streaming
> > > > jobs.
> > > > >>> - How do we detect unbounded streaming jobs to set
> > > > >>> the allSourcesInSamePipelinedRegion to `true`? Wouldn't it be
> > easier
> > > to
> > > > >>> set
> > > > >>> it false if we are using the DataSet API or the Blink planner
> with
> > a
> > > > >>> bounded job?
> > > > >>>
> > > > >>> Cheers,
> > > > >>> Till
> > > > >>>
> > > > >>> On Tue, Aug 27, 2019 at 2:16 PM Till Rohrmann <
> > [hidden email]>
> > > > >>> wrote:
> > > > >>>
> > > > >>> > I guess there is a typo since the link to the FLIP-53 is
> > > > >>> >
> > > > >>>
> > > >
> > >
> >
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-53%3A+Fine+Grained+Operator+Resource+Management
> > > > >>> >
> > > > >>> > Cheers,
> > > > >>> > Till
> > > > >>> >
> > > > >>> > On Tue, Aug 27, 2019 at 1:42 PM Xintong Song <
> > > [hidden email]>
> > > > >>> > wrote:
> > > > >>> >
> > > > >>> >> Added implementation steps for this FLIP on the wiki page [1].
> > > > >>> >>
> > > > >>> >>
> > > > >>> >> Thank you~
> > > > >>> >>
> > > > >>> >> Xintong Song
> > > > >>> >>
> > > > >>> >>
> > > > >>> >> [1]
> > > > >>> >>
> > > > >>> >>
> > > > >>>
> > > >
> > >
> >
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-49%3A+Unified+Memory+Configuration+for+TaskExecutors
> > > > >>> >>
> > > > >>> >> On Mon, Aug 19, 2019 at 10:29 PM Xintong Song <
> > > > [hidden email]>
> > > > >>> >> wrote:
> > > > >>> >>
> > > > >>> >> > Hi everyone,
> > > > >>> >> >
> > > > >>> >> > As Till suggested, the original "FLIP-53: Fine Grained
> > Resource
> > > > >>> >> > Management" splits into two separate FLIPs,
> > > > >>> >> >
> > > > >>> >> >    - FLIP-53: Fine Grained Operator Resource Management [1]
> > > > >>> >> >    - FLIP-56: Dynamic Slot Allocation [2]
> > > > >>> >> >
> > > > >>> >> > We'll continue using this discussion thread for FLIP-53. For
> > > > >>> FLIP-56, I
> > > > >>> >> > just started a new discussion thread [3].
> > > > >>> >> >
> > > > >>> >> > Thank you~
> > > > >>> >> >
> > > > >>> >> > Xintong Song
> > > > >>> >> >
> > > > >>> >> >
> > > > >>> >> > [1]
> > > > >>> >> >
> > > > >>> >>
> > > > >>>
> > > >
> > >
> >
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-53%3A+Fine+Grained+Operator+Resource+Management
> > > > >>> >> >
> > > > >>> >> > [2]
> > > > >>> >> >
> > > > >>> >>
> > > > >>>
> > > >
> > >
> >
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-56%3A+Dynamic+Slot+Allocation
> > > > >>> >> >
> > > > >>> >> > [3]
> > > > >>> >> >
> > > > >>> >>
> > > > >>>
> > > >
> > >
> >
> http://apache-flink-mailing-list-archive.1008284.n3.nabble.com/DISCUSS-FLIP-56-Dynamic-Slot-Allocation-td31960.html
> > > > >>> >> >
> > > > >>> >> > On Mon, Aug 19, 2019 at 2:55 PM Xintong Song <
> > > > [hidden email]
> > > > >>> >
> > > > >>> >> > wrote:
> > > > >>> >> >
> > > > >>> >> >> Thinks for the comments, Yang.
> > > > >>> >> >>
> > > > >>> >> >> Regarding your questions:
> > > > >>> >> >>
> > > > >>> >> >>    1. How to calculate the resource specification of
> > > > TaskManagers?
> > > > >>> Do
> > > > >>> >> they
> > > > >>> >> >>>    have them same resource spec calculated based on the
> > > > >>> >> configuration? I
> > > > >>> >> >>> think
> > > > >>> >> >>>    we still have wasted resources in this situation. Or we
> > > could
> > > > >>> start
> > > > >>> >> >>>    TaskManagers with different spec.
> > > > >>> >> >>>
> > > > >>> >> >> I agree with you that we can further improve the resource
> > > utility
> > > > >>> by
> > > > >>> >> >> customizing task executors with different resource
> > > > specifications.
> > > > >>> >> However,
> > > > >>> >> >> I'm in favor of limiting the scope of this FLIP and leave
> it
> > > as a
> > > > >>> >> future
> > > > >>> >> >> optimization. The plan for that part is to move the logic
> of
> > > > >>> deciding
> > > > >>> >> task
> > > > >>> >> >> executor specifications into the slot manager and make slot
> > > > manager
> > > > >>> >> >> pluggable, so inside the slot manager plugin we can have
> > > > different
> > > > >>> >> logics
> > > > >>> >> >> for deciding the task executor specifications.
> > > > >>> >> >>
> > > > >>> >> >>
> > > > >>> >> >>>    2. If a slot is released and returned to SlotPool, does
> > it
> > > > >>> could be
> > > > >>> >> >>>    reused by other SlotRequest that the request resource
> is
> > > > >>> smaller
> > > > >>> >> than
> > > > >>> >> >>> it?
> > > > >>> >> >>>
> > > > >>> >> >> No, I think slot pool should always return slots if they do
> > not
> > > > >>> exactly
> > > > >>> >> >> match the pending requests, so that resource manager can
> deal
> > > > with
> > > > >>> the
> > > > >>> >> >> extra resources.
> > > > >>> >> >>
> > > > >>> >> >>>       - If it is yes, what happens to the available
> resource
> > > in
> > > > >>> the
> > > > >>> >> >>
> > > > >>> >> >>       TaskManager.
> > > > >>> >> >>>       - What is the SlotStatus of the cached slot in
> > SlotPool?
> > > > The
> > > > >>> >> >>>       AllocationId is null?
> > > > >>> >> >>>
> > > > >>> >> >> The allocation id does not change as long as the slot is
> not
> > > > >>> returned
> > > > >>> >> >> from the job master, no matter its occupied or available in
> > the
> > > > >>> slot
> > > > >>> >> pool.
> > > > >>> >> >> I think we have the same behavior currently. No matter how
> > many
> > > > >>> tasks
> > > > >>> >> the
> > > > >>> >> >> job master deploy into the slot, concurrently or
> > sequentially,
> > > it
> > > > >>> is
> > > > >>> >> one
> > > > >>> >> >> allocation from the cluster to the job until the slot is
> > freed
> > > > from
> > > > >>> >> the job
> > > > >>> >> >> master.
> > > > >>> >> >>
> > > > >>> >> >>>    3. In a session cluster, some jobs are configured with
> > > > operator
> > > > >>> >> >>>    resources, meanwhile other jobs are using UNKNOWN. How
> to
> > > > deal
> > > > >>> with
> > > > >>> >> >>> this
> > > > >>> >> >>>    situation?
> > > > >>> >> >>
> > > > >>> >> >> As long as we do not mix unknown / specified resource
> > profiles
> > > > >>> within
> > > > >>> >> the
> > > > >>> >> >> same job / slot, there shouldn't be a problem. Resource
> > manager
> > > > >>> >> converts
> > > > >>> >> >> unknown resource profiles in slot requests to specified
> > default
> > > > >>> >> resource
> > > > >>> >> >> profiles, so they can be dynamically allocated from task
> > > > executors'
> > > > >>> >> >> available resources just as other slot requests with
> > specified
> > > > >>> resource
> > > > >>> >> >> profiles.
> > > > >>> >> >>
> > > > >>> >> >> Thank you~
> > > > >>> >> >>
> > > > >>> >> >> Xintong Song
> > > > >>> >> >>
> > > > >>> >> >>
> > > > >>> >> >>
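
To make the conversion described in the quoted answer above concrete, the following is a minimal sketch of how a resource manager might replace UNKNOWN slot requests with a configured default profile before matching them against a task executor's free resources. All class and method names here (ResourceProfileSketch, normalizeRequest, fitsInto) are illustrative stand-ins, not Flink's actual classes.

import java.util.Objects;

/** Illustrative only: mapping UNKNOWN slot requests onto a configured default profile. */
final class SlotRequestNormalizerSketch {

    /** Simplified stand-in for a resource profile; UNKNOWN means "not specified by the user". */
    static final class ResourceProfileSketch {
        static final ResourceProfileSketch UNKNOWN = new ResourceProfileSketch(-1.0, -1);

        final double cpuCores;
        final int managedMemoryMb;

        ResourceProfileSketch(double cpuCores, int managedMemoryMb) {
            this.cpuCores = cpuCores;
            this.managedMemoryMb = managedMemoryMb;
        }

        boolean isUnknown() {
            return this == UNKNOWN;
        }
    }

    private final ResourceProfileSketch defaultSlotProfile;

    SlotRequestNormalizerSketch(ResourceProfileSketch defaultSlotProfile) {
        this.defaultSlotProfile = Objects.requireNonNull(defaultSlotProfile);
    }

    /** UNKNOWN requests become the configured default; specified requests pass through unchanged. */
    ResourceProfileSketch normalizeRequest(ResourceProfileSketch requested) {
        return requested.isUnknown() ? defaultSlotProfile : requested;
    }

    /** A task executor can serve the request if the normalized profile fits into its free resources. */
    boolean fitsInto(ResourceProfileSketch freeResources, ResourceProfileSketch requested) {
        ResourceProfileSketch needed = normalizeRequest(requested);
        return needed.cpuCores <= freeResources.cpuCores
                && needed.managedMemoryMb <= freeResources.managedMemoryMb;
    }
}

In this sketch a request with an explicitly specified profile is matched as-is, while an UNKNOWN request is first mapped onto the cluster-wide default slot profile, mirroring the behaviour described in the quoted answer.
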
> > > > >>> >> >> On Mon, Aug 19, 2019 at 11:39 AM Yang Wang <
> > > > [hidden email]>
> > > > >>> >> wrote:
> > > > >>> >> >>
> > > > >>> >> >>> Hi Xintong,
> > > > >>> >> >>>
> > > > >>> >> >>>
> > > > >>> >> >>> Thanks for your detailed proposal. I think many users are
> > > > >>> suffering
> > > > >>> >> from
> > > > >>> >> >>> waste of resources. The resource spec of all task managers
> > are
> > > > >>> same
> > > > >>> >> and
> > > > >>> >> >>> we
> > > > >>> >> >>> have to increase all task managers to make the heavy one
> > more
> > > > >>> stable.
> > > > >>> >> So
> > > > >>> >> >>> we
> > > > >>> >> >>> will benefit from the fine grained resource management a
> > lot.
> > > We
> > > > >>> could
> > > > >>> >> >>> get
> > > > >>> >> >>> better resource utilization and stability.
> > > > >>> >> >>>
> > > > >>> >> >>>
> > > > >>> >> >>> Just to share some thoughts.
> > > > >>> >> >>>
> > > > >>> >> >>>
> > > > >>> >> >>>
> > > > >>> >> >>>    1. How to calculate the resource specification of
> > > > >>> TaskManagers? Do
> > > > >>> >> >>> they
> > > > >>> >> >>>    have them same resource spec calculated based on the
> > > > >>> >> configuration? I
> > > > >>> >> >>> think
> > > > >>> >> >>>    we still have wasted resources in this situation. Or we
> > > could
> > > > >>> start
> > > > >>> >> >>>    TaskManagers with different spec.
> > > > >>> >> >>>    2. If a slot is released and returned to SlotPool, does
> > it
> > > > >>> could be
> > > > >>> >> >>>    reused by other SlotRequest that the request resource
> is
> > > > >>> smaller
> > > > >>> >> than
> > > > >>> >> >>> it?
> > > > >>> >> >>>       - If it is yes, what happens to the available
> resource
> > > in
> > > > >>> the
> > > > >>> >> >>>       TaskManager.
> > > > >>> >> >>>       - What is the SlotStatus of the cached slot in
> > SlotPool?
> > > > The
> > > > >>> >> >>>       AllocationId is null?
> > > > >>> >> >>>    3. In a session cluster, some jobs are configured with
> > > > operator
> > > > >>> >> >>>    resources, meanwhile other jobs are using UNKNOWN. How
> to
> > > > deal
> > > > >>> with
> > > > >>> >> >>> this
> > > > >>> >> >>>    situation?
> > > > >>> >> >>>
> > > > >>> >> >>>
> > > > >>> >> >>>
> > > > >>> >> >>> Best,
> > > > >>> >> >>> Yang
> > > > >>> >> >>>
> > > > >>> >> >>> Xintong Song <[hidden email]> 于2019年8月16日周五
> > 下午8:57写道:
> > > > >>> >> >>>
> > > > >>> >> >>> > Thanks for the feedbacks, Yangze and Till.
> > > > >>> >> >>> >
> > > > >>> >> >>> > Yangze,
> > > > >>> >> >>> >
> > > > >>> >> >>> > I agree with you that we should make scheduling strategy
> > > > >>> pluggable
> > > > >>> >> and
> > > > >>> >> >>> > optimize the strategy to reduce the memory fragmentation
> > > > >>> problem,
> > > > >>> >> and
> > > > >>> >> >>> > thanks for the inputs on the potential algorithmic
> > > solutions.
> > > > >>> >> However,
> > > > >>> >> >>> I'm
> > > > >>> >> >>> > in favor of keep this FLIP focusing on the overall
> > mechanism
> > > > >>> design
> > > > >>> >> >>> rather
> > > > >>> >> >>> > than strategies. Solving the fragmentation issue should
> be
> > > > >>> >> considered
> > > > >>> >> >>> as an
> > > > >>> >> >>> > optimization, and I agree with Till that we probably
> > should
> > > > >>> tackle
> > > > >>> >> this
> > > > >>> >> >>> > afterwards.
> > > > >>> >> >>> >
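
For illustration of the pluggable strategy mentioned in the quoted paragraph above, one possible shape is sketched below, with first-fit and best-fit variants over a single memory dimension. The interface and its names (TaskExecutorMatchingStrategy, findMatch, FreeResources) are invented for this sketch and are not part of the proposal; a real strategy would also need to weigh CPU and other resource dimensions.

import java.util.Comparator;
import java.util.List;
import java.util.Optional;

/** Illustrative only: pluggable strategies for picking the task executor to cut a slot from. */
interface TaskExecutorMatchingStrategy {

    /** Free memory (in MB) currently available on one task executor. */
    final class FreeResources {
        final String taskExecutorId;
        final int freeMemoryMb;

        FreeResources(String taskExecutorId, int freeMemoryMb) {
            this.taskExecutorId = taskExecutorId;
            this.freeMemoryMb = freeMemoryMb;
        }
    }

    Optional<FreeResources> findMatch(List<FreeResources> executors, int requestedMemoryMb);

    /** First-fit: take the first task executor that can hold the request. */
    TaskExecutorMatchingStrategy FIRST_FIT = (executors, requestedMemoryMb) ->
            executors.stream()
                    .filter(e -> e.freeMemoryMb >= requestedMemoryMb)
                    .findFirst();

    /** Best-fit: take the task executor that would be left with the least free memory. */
    TaskExecutorMatchingStrategy BEST_FIT = (executors, requestedMemoryMb) ->
            executors.stream()
                    .filter(e -> e.freeMemoryMb >= requestedMemoryMb)
                    .min(Comparator.comparingInt(e -> e.freeMemoryMb));
}

First-fit keeps selection cheap, while best-fit reduces fragmentation in the single-dimension case at the cost of scanning all task executors.
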
> > > > >>> >> >>> > Till,
> > > > >>> >> >>> >
> > > > >>> >> >>> > - Regarding splitting the FLIP, I think it makes sense.
> > The
> > > > >>> operator
> > > > >>> >> >>> > resource management and dynamic slot allocation do not
> > have
> > > > much
> > > > >>> >> >>> dependency
> > > > >>> >> >>> > on each other.
> > > > >>> >> >>> >
> > > > >>> >> >>> > - Regarding the default slot size, I think this is
> similar
> > > to
> > > > >>> >> FLIP-49
> > > > >>> >> >>> [1]
> > > > >>> >> >>> > where we want all the deriving happens at one place. I
> > think
> > > > it
> > > > >>> >> would
> > > > >>> >> >>> be
> > > > >>> >> >>> > nice to pass the default slot size into the task
> executor
> > in
> > > > the
> > > > >>> >> same
> > > > >>> >> >>> way
> > > > >>> >> >>> > that we pass in the memory pool sizes in FLIP-49 [1].
> > > > >>> >> >>> >
> > > > >>> >> >>> > - Regarding the return value of
> > > > >>> >> TaskExecutorGateway#requestResource, I
> > > > >>> >> >>> > think you're right. We should avoid using null as the
> > return
> > > > >>> value.
> > > > >>> >> I
> > > > >>> >> >>> think
> > > > >>> >> >>> > we probably should thrown an exception here.
> > > > >>> >> >>> >
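
A rough sketch of the point just quoted, using placeholder names (requestResourceSketch, SlotAllocationException) rather than the actual RPC signature: the request either completes with a slot, or completes exceptionally instead of returning null.

import java.util.concurrent.CompletableFuture;

/** Illustrative only: report "cannot fulfil the request" explicitly instead of returning null. */
final class ResourceRequestSketch {

    /** Signals that the task executor does not have enough free resources for the request. */
    static final class SlotAllocationException extends Exception {
        SlotAllocationException(String message) {
            super(message);
        }
    }

    /** Completes with a slot id on success, or exceptionally with SlotAllocationException. */
    static CompletableFuture<String> requestResourceSketch(int requestedMemoryMb, int freeMemoryMb) {
        CompletableFuture<String> result = new CompletableFuture<>();
        if (requestedMemoryMb <= freeMemoryMb) {
            result.complete("slot-1");
        } else {
            result.completeExceptionally(new SlotAllocationException(
                    "Requested " + requestedMemoryMb + " MB but only " + freeMemoryMb + " MB are free"));
        }
        return result;
    }
}

Callers can then handle the failure explicitly, for example via CompletableFuture#exceptionally, instead of null-checking the result.
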
> > > > >>> >> >>> > Thank you~
> > > > >>> >> >>> >
> > > > >>> >> >>> > Xintong Song
> > > > >>> >> >>> >
> > > > >>> >> >>> >
> > > > >>> >> >>> > [1]
> > > > >>> >> >>> >
> > > > >>> >> >>> >
> > > > >>> >> >>>
> > > > >>> >>
> > > > >>>
> > > >
> > >
> >
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-49%3A+Unified+Memory+Configuration+for+TaskExecutors
> > > > >>> >> >>> >
> > > > >>> >> >>> > On Fri, Aug 16, 2019 at 2:18 PM Till Rohrmann <
> > > > >>> [hidden email]
> > > > >>> >> >
> > > > >>> >> >>> > wrote:
> > > > >>> >> >>> >
> > > > >>> >> >>> > > Hi Xintong,
> > > > >>> >> >>> > >
> > > > >>> >> >>> > > thanks for drafting this FLIP. I think your proposal
> > helps
> > > > to
> > > > >>> >> >>> improve the
> > > > >>> >> >>> > > execution of batch jobs more efficiently. Moreover, it
> > > > >>> enables the
> > > > >>> >> >>> proper
> > > > >>> >> >>> > > integration of the Blink planner which is very
> important
> > > as
> > > > >>> well.
> > > > >>> >> >>> > >
> > > > >>> >> >>> > > Overall, the FLIP looks good to me. I was wondering
> > > whether
> > > > it
> > > > >>> >> >>> wouldn't
> > > > >>> >> >>> > > make sense to actually split it up into two FLIPs:
> > > Operator
> > > > >>> >> resource
> > > > >>> >> >>> > > management and dynamic slot allocation. I think these
> > two
> > > > >>> FLIPs
> > > > >>> >> >>> could be
> > > > >>> >> >>> > > seen as orthogonal and it would decrease the scope of
> > each
> > > > >>> >> individual
> > > > >>> >> >>> > FLIP.
> > > > >>> >> >>> > >
> > > > >>> >> >>> > > Some smaller comments:
> > > > >>> >> >>> > >
> > > > >>> >> >>> > > - I'm not sure whether we should pass in the default
> > slot
> > > > size
> > > > >>> >> via an
> > > > >>> >> >>> > > environment variable. Without having unified the way
> how
> > > > Flink
> > > > >>> >> >>> components
> > > > >>> >> >>> > > are configured [1], I think it would be better to pass
> > it
> > > in
> > > > >>> as
> > > > >>> >> part
> > > > >>> >> >>> of
> > > > >>> >> >>> > the
> > > > >>> >> >>> > > configuration.
> > > > >>> >> >>> > > - I would avoid returning a null value from
> > > > >>> >> >>> > > TaskExecutorGateway#requestResource if it cannot be
> > > > fulfilled.
> > > > >>> >> >>> Either we
> > > > >>> >> >>> > > should introduce an explicit return value saying this
> or
> > > > >>> throw an
> > > > >>> >> >>> > > exception.
> > > > >>> >> >>> > >
> > > > >>> >> >>> > > Concerning Yangze's comments: I think you are right
> that
> > > it
> > > > >>> would
> > > > >>> >> be
> > > > >>> >> >>> > > helpful to make the selection strategy pluggable. Also
> > > > >>> batching
> > > > >>> >> slot
> > > > >>> >> >>> > > requests to the RM could be a good optimization. For
> the
> > > > sake
> > > > >>> of
> > > > >>> >> >>> keeping
> > > > >>> >> >>> > > the scope of this FLIP smaller I would try to tackle
> > these
> > > > >>> things
> > > > >>> >> >>> after
> > > > >>> >> >>> > the
> > > > >>> >> >>> > > initial version has been completed (without spoiling
> > these
> > > > >>> >> >>> optimization
> > > > >>> >> >>> > > opportunities). In particular batching the slot
> requests
> > > > >>> depends
> > > > >>> >> on
> > > > >>> >> >>> the
> > > > >>> >> >>> > > current scheduler refactoring and could also be
> realized
> > > on
> > > > >>> the RM
> > > > >>> >> >>> side
> > > > >>> >> >>> > > only.
> > > > >>> >> >>> > >
> > > > >>> >> >>> > > [1]
> > > > >>> >> >>> > >
> > > > >>> >> >>> > >
> > > > >>> >> >>> >
> > > > >>> >> >>>
> > > > >>> >>
> > > > >>>
> > > >
> > >
> >
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-54%3A+Evolve+ConfigOption+and+Configuration
> > > > >>> >> >>> > >
> > > > >>> >> >>> > > Cheers,
> > > > >>> >> >>> > > Till
> > > > >>> >> >>> > >
> > > > >>> >> >>> > >
> > > > >>> >> >>> > >
> > > > >>> >> >>> > > On Fri, Aug 16, 2019 at 11:11 AM Yangze Guo <
> > > > >>> [hidden email]>
> > > > >>> >> >>> wrote:
> > > > >>> >> >>> > >
> > > > >>> >> >>> > > > Hi, Xintong
> > > > >>> >> >>> > > >
> > > > >>> >> >>> > > > Thanks to propose this FLIP. The general design
> looks
> > > good
> > > > >>> to
> > > > >>> >> me,
> > > > >>> >> >>> +1
> > > > >>> >> >>> > > > for this feature.
> > > > >>> >> >>> > > >
> > > > >>> >> >>> > > > Since slots in the same task executor could have
> > > different
> > > > >>> >> resource
> > > > >>> >> >>> > > > profile, we will
> > > > >>> >> >>> > > > meet resource fragment problem. Think about this
> case:
> > > > >>> >> >>> > > >  - request A want 1G memory while request B & C want
> > > 0.5G
> > > > >>> memory
> > > > >>> >> >>> > > >  - There are two task executors T1 & T2 with 1G and
> > 0.5G
> > > > >>> free
> > > > >>> >> >>> memory
> > > > >>> >> >>> > > > respectively
> > > > >>> >> >>> > > > If B come first and we cut a slot from T1 for B, A
> > must
> > > > >>> wait for
> > > > >>> >> >>> the
> > > > >>> >> >>> > > > free resource from
> > > > >>> >> >>> > > > other task. But A could have been scheduled
> > immediately
> > > if
> > > > >>> we
> > > > >>> >> cut a
> > > > >>> >> >>> > > > slot from T2 for B.
> > > > >>> >> >>> > > >
> > > > >>> >> >>> > > > The logic of findMatchingSlot now become finding a
> > task
> > > > >>> executor
> > > > >>> >> >>> which
> > > > >>> >> >>> > > > has enough
> > > > >>> >> >>> > > > resource and then cut a slot from it. Current method
> > > could
> > > > >>> be
> > > > >>> >> seen
> > > > >>> >> >>> as
> > > > >>> >> >>> > > > "First-fit strategy",
> > > > >>> >> >>> > > > which works well in general but sometimes could not
> be
> > > the
> > > > >>> >> >>> optimization
> > > > >>> >> >>> > > > method.
> > > > >>> >> >>> > > >
> > > > >>> >> >>> > > > Actually, this problem could be abstracted as "Bin
> > > Packing
> > > > >>> >> >>> Problem"[1].
> > > > >>> >> >>> > > > Here are
> > > > >>> >> >>> > > > some common approximate algorithms:
> > > > >>> >> >>> > > > - First fit
> > > > >>> >> >>> > > > - Next fit
> > > > >>> >> >>> > > > - Best fit
> > > > >>> >> >>> > > >
> > > > >>> >> >>> > > > But it become multi-dimensional bin packing problem
> if
> > > we
> > > > >>> take
> > > > >>> >> CPU
> > > > >>> >> >>> > > > into account. It hard
> > > > >>> >> >>> > > > to define which one is best fit now. Some research
> > > > addressed
> > > > >>> >> this
> > > > >>> >> >>> > > > problem, such like Tetris[2].
> > > > >>> >> >>> > > >
> > > > >>> >> >>> > > > Here are some thinking about it:
> > > > >>> >> >>> > > > 1. We could make the strategy of finding matching
> task
> > > > >>> executor
> > > > >>> >> >>> > > > pluginable. Let user to config the
> > > > >>> >> >>> > > > best strategy in their scenario.
> > > > >>> >> >>> > > > 2. We could support batch request interface in RM,
> > > because
> > > > >>> we
> > > > >>> >> have
> > > > >>> >> >>> > > > opportunities to optimize
> > > > >>> >> >>> > > > if we have more information. If we know the A, B, C
> at
> > > the
> > > > >>> same
> > > > >>> >> >>> time,
> > > > >>> >> >>> > > > we could always make the best decision.
> > > > >>> >> >>> > > >
> > > > >>> >> >>> > > > [1] http://www.or.deis.unibo.it/kp/Chapter8.pdf
> > > > >>> >> >>> > > > [2]
> > > > >>> >> >>> >
> > > > >>> >>
> > > > https://www.cs.cmu.edu/~xia/resources/Documents/grandl_sigcomm14.pdf
> > > > >>> >> >>> > > >
> > > > >>> >> >>> > > > Best,
> > > > >>> >> >>> > > > Yangze Guo
> > > > >>> >> >>> > > >
> > > > >>> >> >>> > > > On Thu, Aug 15, 2019 at 10:40 PM Xintong Song <
> > > > >>> >> >>> [hidden email]>
> > > > >>> >> >>> > > > wrote:
> > > > >>> >> >>> > > > >
> > > > >>> >> >>> > > > > Hi everyone,
> > > > >>> >> >>> > > > >
> > > > >>> >> >>> > > > > We would like to start a discussion thread on
> > > "FLIP-53:
> > > > >>> Fine
> > > > >>> >> >>> Grained
> > > > >>> >> >>> > > > > Resource Management"[1], where we propose how to
> > > improve
> > > > >>> Flink
> > > > >>> >> >>> > resource
> > > > >>> >> >>> > > > > management and scheduling.
> > > > >>> >> >>> > > > >
> > > > >>> >> >>> > > > > This FLIP mainly discusses the following issues.
> > > > >>> >> >>> > > > >
> > > > >>> >> >>> > > > >    - How to support tasks with fine grained
> resource
> > > > >>> >> >>> requirements.
> > > > >>> >> >>> > > > >    - How to unify resource management for jobs
> with
> > /
> > > > >>> without
> > > > >>> >> >>> fine
> > > > >>> >> >>> > > > grained
> > > > >>> >> >>> > > > >    resource requirements.
> > > > >>> >> >>> > > > >    - How to unify resource management for
> streaming
> > /
> > > > >>> batch
> > > > >>> >> jobs.
> > > > >>> >> >>> > > > >
> > > > >>> >> >>> > > > > Key changes proposed in the FLIP are as follows.
> > > > >>> >> >>> > > > >
> > > > >>> >> >>> > > > >    - Unify memory management for operators with /
> > > > without
> > > > >>> fine
> > > > >>> >> >>> > grained
> > > > >>> >> >>> > > > >    resource requirements by applying a fraction
> > based
> > > > >>> quota
> > > > >>> >> >>> > mechanism.
> > > > >>> >> >>> > > > >    - Unify resource scheduling for streaming and
> > batch
> > > > >>> jobs by
> > > > >>> >> >>> > setting
> > > > >>> >> >>> > > > slot
> > > > >>> >> >>> > > > >    sharing groups for pipelined regions during
> > > compiling
> > > > >>> >> stage.
> > > > >>> >> >>> > > > >    - Dynamically allocate slots from task
> executors'
> > > > >>> available
> > > > >>> >> >>> > > resources.
> > > > >>> >> >>> > > > >
> > > > >>> >> >>> > > > > Please find more details in the FLIP wiki document
> > > [1].
> > > > >>> >> Looking
> > > > >>> >> >>> > forward
> > > > >>> >> >>> > > > to
> > > > >>> >> >>> > > > > your feedbacks.
> > > > >>> >> >>> > > > >
> > > > >>> >> >>> > > > > Thank you~
> > > > >>> >> >>> > > > >
> > > > >>> >> >>> > > > > Xintong Song
> > > > >>> >> >>> > > > >
> > > > >>> >> >>> > > > >
> > > > >>> >> >>> > > > > [1]
> > > > >>> >> >>> > > > >
> > > > >>> >> >>> > > >
> > > > >>> >> >>> > >
> > > > >>> >> >>> >
> > > > >>> >> >>>
> > > > >>> >>
> > > > >>>
> > > >
> > >
> >
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-53%3A+Fine+Grained+Resource+Management
> > > > >>> >> >>> > > >
> > > > >>> >> >>> > >
> > > > >>> >> >>> >
> > > > >>> >> >>>
> > > > >>> >> >>
> > > > >>> >>
> > > > >>> >
> > > > >>>
> > > > >>
> > > >
> > >
> >
>

Re: [DISCUSS] FLIP-53: Fine Grained Resource Management

Xintong Song
@all

The FLIP document [1] has been updated.

Thank you~

Xintong Song


[1]
https://cwiki.apache.org/confluence/display/FLINK/FLIP-53%3A+Fine+Grained+Operator+Resource+Management


Re: [DISCUSS] FLIP-53: Fine Grained Resource Management

Andrey Zagrebin
Thanks for updating the FLIP Xintong. It looks good to me. I would be ok to
start a vote for it.

Best,
Andrey

On Wed, Sep 4, 2019 at 10:03 AM Xintong Song <[hidden email]> wrote:

> @all
>
> The FLIP document [1] has been updated.
>
> Thank you~
>
> Xintong Song
>
>
> [1]
>
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-53%3A+Fine+Grained+Operator+Resource+Management
>
> On Tue, Sep 3, 2019 at 7:20 PM Zhu Zhu <[hidden email]> wrote:
>
> > Thanks Xintong for the explanation.
> >
> > For question #1, I think it's good as long as DataSet job behaviors
> remains
> > the same.
> >
> > For question #2, agreed that the resource difference is small enough(at
> > most 1 edge diff) in current supported point-wise execution edge
> connection
> > patterns.
> >
> > Thanks,
> > Zhu Zhu
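
To illustrate the point above that the per-subtask difference is at most one edge in the currently supported point-wise patterns (the quoted answer further down proposes using the maximum possible network memory for the JobVertex), here is a small sketch of the channel-count estimate. The buffer sizing parameters are assumptions for this example, not Flink's defaults.

/** Illustrative only: estimating the maximum per-subtask network memory of a JobVertex. */
final class NetworkMemoryEstimateSketch {

    /** Max input channels of one downstream subtask for a point-wise (rescale) edge. */
    static int maxPointwiseInputChannels(int upstreamParallelism, int downstreamParallelism) {
        // Subtasks differ by at most one channel when the parallelisms do not divide evenly.
        return (upstreamParallelism + downstreamParallelism - 1) / downstreamParallelism;
    }

    /** For an all-to-all edge every downstream subtask reads from every upstream subtask. */
    static int allToAllInputChannels(int upstreamParallelism) {
        return upstreamParallelism;
    }

    /** Worst-case network memory of one subtask in MB; buffer sizing here is assumed, not Flink's defaults. */
    static double maxNetworkMemoryMb(
            int maxInputChannels, int maxOutputChannels, int buffersPerChannel, int bufferSizeKb) {
        return (maxInputChannels + maxOutputChannels) * buffersPerChannel * bufferSizeKb / 1024.0;
    }
}

Using the ceiling of the parallelism ratio means the reserved network memory can exceed what some subtasks actually need by at most one channel's worth of buffers.
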
> >
> > Xintong Song <[hidden email]> 于2019年9月3日周二 下午6:58写道:
> >
> > >  Thanks for the comments, Zhu & Kurt.
> > >
> > > Andrey and I also had some discussions offline, and I would like to
> first
> > > post a summary of our discussion:
> > >
> > >    1. The motivation of the fraction based approach is to unify
> resource
> > >    management for both operators with specified and unknown resource
> > >    requirements.
> > >    2. The fraction based approach proposed in this FLIP should only
> > affect
> > >    streaming jobs (both bounded and unbounded). For DataSet jobs, there
> > are
> > >    already some fraction based approach (in TaskConfig and
> > ChainedDriver),
> > > and
> > >    we do not make any change to the existing approach.
> > >    3. The scope of this FLIP does not include discussion of how to set
> > >    ResourceSpec for operators.
> > >       1. For blink jobs, the optimizer can set operator resources for
> the
> > >       users, according to their configurations (default: unknown)
> > >       2. For DataStream jobs, there are no method / interface to set
> > >       operator resources at the moment (1.10). We can have in the
> future.
> > >       3. For DataSet jobs, there are existing user interfaces to set
> > >       operator resources.
> > >    4. The FLIP should explain more about how ResourceSpecs works
> > >       1. PhysicalTransformations (deployed with operators into the
> > >       StreamTasks) get ResourceSpec: unknown by default or known (e.g.
> > > from the
> > >       Blink planner)
> > >       2. While generating stream graph, calculate fractions and set to
> > >       StreamConfig
> > >       3. While scheduling, convert ResourceSpec to ResourceProfile
> > >       (ResourceSpec + network memory), and deploy to slots / TMs
> matching
> > > the
> > >       resources
> > >       4. While starting Task in TM, each operator gets fraction
> converted
> > >       back to the original absolute value requested by user or fair
> > > unknown share
> > >       of the slot
> > >       5. We should not set `allSourcesInSamePipelinedRegion` to `false`
> > for
> > >    DataSet jobs. Behaviors of DataSet jobs should not be changed.
> > >    6. The FLIP document should differentiate works planed in this FLIP
> > and
> > >    the future follow-ups more clearly, by put the follow-ups in a
> > separate
> > >    section
> > >    7. Another limitation of the rejected alternative setting fractions
> at
> > >    scheduling time is that, the scheduler implementation does not know
> > > which
> > >    tasks will be deployed into the same slot in advance.
> > >
> > > Andrey, Please bring it up if there is anything I missed.
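
The fraction mechanism summarized in point 4 of the quoted list above can be sketched roughly as follows. The method and parameter names are illustrative, and the exact formulas in the FLIP may differ in details such as rounding and which memory pools are covered.

/** Illustrative only: fraction based managed memory quota for operators sharing a slot. */
final class ManagedMemoryFractionSketch {

    /**
     * Fraction of the slot's managed memory assigned to one operator when the stream graph
     * is generated. Operators with specified requirements get a proportional share of the
     * group's total; operators with UNKNOWN requirements get an equal share
     * (1 / numOpsUseManagedMemoryInTheSameSharedGroup).
     */
    static double fractionForOperator(
            double operatorManagedMemMb,
            double totalSpecifiedManagedMemMbInGroup,
            int numOpsWithUnknownRequirementsInGroup,
            boolean resourcesSpecified) {
        if (resourcesSpecified) {
            return operatorManagedMemMb / totalSpecifiedManagedMemMbInGroup;
        }
        return 1.0 / numOpsWithUnknownRequirementsInGroup;
    }

    /** At task start the fraction is converted back into an absolute quota of the actual slot. */
    static long operatorQuotaBytes(double fraction, long slotManagedMemoryBytes) {
        return (long) (fraction * slotManagedMemoryBytes);
    }
}

The fractions are computed while the stream graph is generated and stored in the operators' StreamConfig; at task start they are applied to the actual managed memory of the slot, so an operator with specified requirements gets back approximately its requested absolute value and operators with unknown requirements get an equal share.
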
> > >
> > > Zhu, regarding your comments:
> > >
> > >    1. If we do not set `allSourcesInSamePipelinedRegion` to `false` for
> > >    DataSet jobs (point 5 in the discussion summary above), then there
> > >    shouldn't be any regression right?
> > >    2. I think it makes sense to set the max possible network memory for
> > the
> > >    JobVertex. When you say parallel instances of the same JobVertex may
> > > have
> > >    need different network memory, I guess you mean the rescale
> scenarios
> > > where
> > >    parallelisms of upstream / downstream vertex cannot be exactly
> divided
> > > by
> > >    parallelism of downstream / upstream vertex? I would say it's
> > > acceptable to
> > >    have slight difference between actually needed and allocated network
> > > memory.
> > >    3. Yes, by numOpsUseOnHeapManagedMemory I mean
> > >    numOpsUseOnHeapManagedMemoryInTheSameSharedGroup. I'll update the
> doc.
> > >    4. Yes, it should be StreamingJobGraphGenerator. Thanks for the
> > >    correction.
> > >
> > >
> > > Kurt, regarding your comments:
> > >
> > >    1. I think we don't have network memory in ResourceSpec, which is
> the
> > >    user facing API. We only have network memory in ResourceProfile,
> which
> > > is
> > >    used internally for scheduling. The reason we do not expose network
> > > memory
> > >    to the user is that, currently how many network buffers each task
> > needs
> > > is
> > >    decided by the topology of execution graph (how many input / output
> > >    channels it has).
> > >    2. In the section "Operator Resource Requirements": "For the first
> > >    version, we do not support mixing operators with specified / unknown
> > >    resource requirements in the same job. Either all or none of the
> > > operators
> > >    of the same job should specify their resource requirements.
> > >    StreamGraphGenerator should check this and throw an error when
> mixing
> > of
> > >    specified / unknown resource requirements is detected, during the
> > >    compilation stage."
> > >    3. If the user sets a resource requirement, then it is guaranteed
> > that
> > >    the task gets at least that much resource, otherwise there
> > should
> > > be
> > >    an exception. That should be guaranteed by the "Dynamic Slot
> > Allocation"
> > >    approach (FLIP-56).
> > >
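To illustrate points 1 and 2 above, here is a rough Java sketch of how the network memory dimension could be derived from the execution topology rather than from the user-facing ResourceSpec, and why taking the maximum over a JobVertex's subtasks only slightly over-allocates. The constants and names are assumptions made for the example, not Flink's actual configuration or API.

    // Assumed values for illustration only; not Flink's actual defaults or API.
    final class NetworkMemorySketch {

        static final int BUFFERS_PER_CHANNEL = 2;
        static final int FLOATING_BUFFERS_PER_GATE = 8;
        static final long BUFFER_SIZE_BYTES = 32 * 1024;

        /** Network memory one subtask needs, based on its channel / gate counts. */
        static long bytesForSubtask(int inputChannels, int inputGates,
                                    int outputChannels, int outputGates) {
            long buffers = (long) (inputChannels + outputChannels) * BUFFERS_PER_CHANNEL
                    + (long) (inputGates + outputGates) * FLOATING_BUFFERS_PER_GATE;
            return buffers * BUFFER_SIZE_BYTES;
        }

        /**
         * For the whole JobVertex, take the maximum over its parallel subtasks;
         * with point-wise connections the counts differ by at most one channel,
         * so the over-allocation stays small.
         */
        static long maxBytesForVertex(int[][] subtaskChannelCounts /* {in, inGates, out, outGates} */) {
            long max = 0L;
            for (int[] c : subtaskChannelCounts) {
                max = Math.max(max, bytesForSubtask(c[0], c[1], c[2], c[3]));
            }
            return max;
        }
    }
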
> > >
> > > I'll update the FLIP document addressing the comments ASAP.
> > >
> > >
> > > Thank you~
> > >
> > > Xintong Song
> > >
> > >
> > >
> > > On Tue, Sep 3, 2019 at 2:42 PM Kurt Young <[hidden email]> wrote:
> > >
> > > > Thanks Xintong for driving this effort, I haven't finished the whole
> > > > document yet,
> > > > but have a couple of questions:
> > > >
> > > > 1. Regarding network memory, the document said it will be derived
> by
> > > > the framework
> > > > automatically. I'm wondering whether we should delete this dimension
> > from
> > > > the user-
> > > > facing API?
> > > >
> > > > 2. Regarding fraction based quota, I don't quite get the meaning
> of
> > > > "slotSharingGroupOnHeapManagedMem" and
> > > "slotSharingGroupOffHeapManagedMem".
> > > > What if the sharing group is mixed with specified resource and
> UNKNOWN
> > > > resource
> > > > requirements?
> > > >
> > > > 3. IIUC, even if the user has set resource requirements, let's say 500MB
> > off-heap
> > > > managed
> > > > memory, during execution the operator may or may not have 500MB
> > off-heap
> > > > managed
> > > > memory, right?
> > > >
> > > > Best,
> > > > Kurt
> > > >
> > > >
> > > > On Mon, Sep 2, 2019 at 8:36 PM Zhu Zhu <[hidden email]> wrote:
> > > >
> > > > > Thanks Xintong for proposing this improvement. Fine grained
> resources
> > > can
> > > > > be very helpful when user has good planning on resources.
> > > > >
> > > > > I have a few questions:
> > > > > 1. Currently in a batch job, vertices from different regions can
> run
> > at
> > > > the
> > > > > same time in slots from the same shared group, as long as they do
> not
> > > > have
> > > > > data dependency on each other and available slot count is not
> smaller
> > > > than
> > > > > the *max* of parallelism of all tasks.
> > > > > With changes in this FLIP however, tasks from different regions
> > cannot
> > > > > share slots anymore.
> > > > > Once available slot count is smaller than the *sum* of all
> > parallelism
> > > of
> > > > > tasks from all regions, tasks may need to be executed sequentially,
> > > which
> > > > > might result in a performance regression.
> > > > > Is this (performance regression to existing DataSet jobs) considered
> > as
> > > a
> > > > > necessary and accepted trade off in this FLIP?
> > > > >
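A toy Java illustration of the slot-count concern in question 1 above; the parallelism numbers are made up and only contrast the max-vs-sum requirement described there.

    // Made-up numbers, just to illustrate the max-vs-sum concern above.
    final class SlotCountExample {
        public static void main(String[] args) {
            int regionAParallelism = 3;
            int regionBParallelism = 4;
            // slots shared across regions: both regions' tasks can run in parallel
            int slotsSharedAcrossRegions = Math.max(regionAParallelism, regionBParallelism); // 4
            // region-scoped sharing: concurrent execution needs the sum of parallelisms
            int slotsRegionScoped = regionAParallelism + regionBParallelism;                 // 7
            System.out.println(slotsSharedAcrossRegions + " vs " + slotsRegionScoped);
        }
    }
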
> > > > > 2. The network memory depends on the input/output ExecutionEdge
> count
> > > and
> > > > > thus can be different even for parallel instances of the same
> > > JobVertex.
> > > > > Does this mean that when adding task resources to calculate the
> > slot
> > > > > resource for a shared group, the max possible network memory of the
> > > > vertex
> > > > > instance shall be used?
> > > > > This might result in a larger resource requirement than actually needed.
> > > > >
> > > > > And some minor comments:
> > > > > 1. Regarding "fracManagedMemOnHeap = 1 /
> > > numOpsUseOnHeapManagedMemory", I
> > > > > guess you mean numOpsUseOnHeapManagedMemoryInTheSameSharedGroup ?
> > > > > 2. I think the *StreamGraphGenerator* in the #Slot Sharing section
> > and
> > > > > implementation step 4 should be *StreamingJobGraphGenerator*, as
> > > > > *StreamGraphGenerator* is not aware of JobGraph and pipelined
> region.
> > > > >
> > > > >
> > > > > Thanks,
> > > > > Zhu Zhu
> > > > >
> > > > > On Mon, Sep 2, 2019 at 11:59 AM, Xintong Song <[hidden email]> wrote:
> > > > >
> > > > > > Updated the FLIP wiki page [1], with the following changes.
> > > > > >
> > > > > >    - Remove the step of converting pipelined edges between
> > different
> > > > slot
> > > > > >    sharing groups into blocking edges.
> > > > > >    - Set `allSourcesInSamePipelinedRegion` to true by default.
> > > > > >
> > > > > > Thank you~
> > > > > >
> > > > > > Xintong Song
> > > > > >
> > > > > >
> > > > > >
> > > > > > On Mon, Sep 2, 2019 at 11:50 AM Xintong Song <
> > [hidden email]>
> > > > > > wrote:
> > > > > >
> > > > > > > Regarding changing edge type, I think actually we don't need to
> > do
> > > > this
> > > > > > > for batch jobs either, because we don't have public interfaces
> > for
> > > > > users
> > > > > > > to explicitly set slot sharing groups in DataSet API and
> > SQL/Table
> > > > API.
> > > > > > We
> > > > > > > have such interfaces in DataStream API only.
> > > > > > >
> > > > > > > Thank you~
> > > > > > >
> > > > > > > Xintong Song
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > > On Tue, Aug 27, 2019 at 10:16 PM Xintong Song <
> > > [hidden email]
> > > > >
> > > > > > > wrote:
> > > > > > >
> > > > > > >> Thanks for the correction, Till.
> > > > > > >>
> > > > > > >> Regarding your comments:
> > > > > > >> - You are right, we should not change the edge type for
> > streaming
> > > > > jobs.
> > > > > > >> Then I think we can change the option
> > > > > 'allSourcesInSamePipelinedRegion'
> > > > > > in
> > > > > > >> step 2 to 'isStreamingJob', and implement the current step 2
> > > before
> > > > > the
> > > > > > >> current step 1 so we can use this option to decide whether
> > should
> > > > > change
> > > > > > >> the edge type. What do you think?
> > > > > > >> - Agree. It should be easier to make the default value of
> > > > > > >> 'allSourcesInSamePipelinedRegion' (or 'isStreamingJob')
> 'true',
> > > and
> > > > > set
> > > > > > it
> > > > > > >> to 'false' when using DataSet API or blink planner.
> > > > > > >>
> > > > > > >> Thank you~
> > > > > > >>
> > > > > > >> Xintong Song
> > > > > > >>
> > > > > > >>
> > > > > > >>
> > > > > > >> On Tue, Aug 27, 2019 at 8:59 PM Till Rohrmann <
> > > [hidden email]
> > > > >
> > > > > > >> wrote:
> > > > > > >>
> > > > > > >>> Thanks for creating the implementation plan Xintong. Overall,
> > the
> > > > > > >>> implementation plan looks good. I had a couple of comments:
> > > > > > >>>
> > > > > > >>> - What will happen if a user has defined a streaming job with
> > two
> > > > > slot
> > > > > > >>> sharing groups? Would the code insert a blocking data
> exchange
> > > > > between
> > > > > > >>> these two groups? If yes, then this breaks existing Flink
> > > streaming
> > > > > > jobs.
> > > > > > >>> - How do we detect unbounded streaming jobs to set
> > > > > > >>> the allSourcesInSamePipelinedRegion to `true`? Wouldn't it be
> > > > easier
> > > > > to
> > > > > > >>> set
> > > > > > >>> it false if we are using the DataSet API or the Blink planner
> > > with
> > > > a
> > > > > > >>> bounded job?
> > > > > > >>>
> > > > > > >>> Cheers,
> > > > > > >>> Till
> > > > > > >>>
> > > > > > >>> On Tue, Aug 27, 2019 at 2:16 PM Till Rohrmann <
> > > > [hidden email]>
> > > > > > >>> wrote:
> > > > > > >>>
> > > > > > >>> > I guess there is a typo since the link to the FLIP-53 is
> > > > > > >>> >
> > > > > > >>>
> > > > > >
> > > > >
> > > >
> > >
> >
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-53%3A+Fine+Grained+Operator+Resource+Management
> > > > > > >>> >
> > > > > > >>> > Cheers,
> > > > > > >>> > Till
> > > > > > >>> >
> > > > > > >>> > On Tue, Aug 27, 2019 at 1:42 PM Xintong Song <
> > > > > [hidden email]>
> > > > > > >>> > wrote:
> > > > > > >>> >
> > > > > > >>> >> Added implementation steps for this FLIP on the wiki page
> > [1].
> > > > > > >>> >>
> > > > > > >>> >>
> > > > > > >>> >> Thank you~
> > > > > > >>> >>
> > > > > > >>> >> Xintong Song
> > > > > > >>> >>
> > > > > > >>> >>
> > > > > > >>> >> [1]
> > > > > > >>> >>
> > > > > > >>> >>
> > > > > > >>>
> > > > > >
> > > > >
> > > >
> > >
> >
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-49%3A+Unified+Memory+Configuration+for+TaskExecutors
> > > > > > >>> >>
> > > > > > >>> >> On Mon, Aug 19, 2019 at 10:29 PM Xintong Song <
> > > > > > [hidden email]>
> > > > > > >>> >> wrote:
> > > > > > >>> >>
> > > > > > >>> >> > Hi everyone,
> > > > > > >>> >> >
> > > > > > >>> >> > As Till suggested, the original "FLIP-53: Fine Grained
> > > > Resource
> > > > > > >>> >> > Management" splits into two separate FLIPs,
> > > > > > >>> >> >
> > > > > > >>> >> >    - FLIP-53: Fine Grained Operator Resource Management
> > [1]
> > > > > > >>> >> >    - FLIP-56: Dynamic Slot Allocation [2]
> > > > > > >>> >> >
> > > > > > >>> >> > We'll continue using this discussion thread for FLIP-53.
> > For
> > > > > > >>> FLIP-56, I
> > > > > > >>> >> > just started a new discussion thread [3].
> > > > > > >>> >> >
> > > > > > >>> >> > Thank you~
> > > > > > >>> >> >
> > > > > > >>> >> > Xintong Song
> > > > > > >>> >> >
> > > > > > >>> >> >
> > > > > > >>> >> > [1]
> > > > > > >>> >> >
> > > > > > >>> >>
> > > > > > >>>
> > > > > >
> > > > >
> > > >
> > >
> >
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-53%3A+Fine+Grained+Operator+Resource+Management
> > > > > > >>> >> >
> > > > > > >>> >> > [2]
> > > > > > >>> >> >
> > > > > > >>> >>
> > > > > > >>>
> > > > > >
> > > > >
> > > >
> > >
> >
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-56%3A+Dynamic+Slot+Allocation
> > > > > > >>> >> >
> > > > > > >>> >> > [3]
> > > > > > >>> >> >
> > > > > > >>> >>
> > > > > > >>>
> > > > > >
> > > > >
> > > >
> > >
> >
> http://apache-flink-mailing-list-archive.1008284.n3.nabble.com/DISCUSS-FLIP-56-Dynamic-Slot-Allocation-td31960.html
> > > > > > >>> >> >
> > > > > > >>> >> > On Mon, Aug 19, 2019 at 2:55 PM Xintong Song <
> > > > > > [hidden email]
> > > > > > >>> >
> > > > > > >>> >> > wrote:
> > > > > > >>> >> >
> > > > > > >>> >> >> Thanks for the comments, Yang.
> > > > > > >>> >> >>
> > > > > > >>> >> >> Regarding your questions:
> > > > > > >>> >> >>
> > > > > > >>> >> >>    1. How to calculate the resource specification of
> > > > > > TaskManagers?
> > > > > > >>> Do
> > > > > > >>> >> they
> > > > > > >>> >> >>>    have them same resource spec calculated based on
> the
> > > > > > >>> >> configuration? I
> > > > > > >>> >> >>> think
> > > > > > >>> >> >>>    we still have wasted resources in this situation.
> Or
> > we
> > > > > could
> > > > > > >>> start
> > > > > > >>> >> >>>    TaskManagers with different spec.
> > > > > > >>> >> >>>
> > > > > > >>> >> >> I agree with you that we can further improve the
> resource
> > > > > utility
> > > > > > >>> by
> > > > > > >>> >> >> customizing task executors with different resource
> > > > > > specifications.
> > > > > > >>> >> However,
> > > > > > >>> >> >> I'm in favor of limiting the scope of this FLIP and
> leave
> > > it
> > > > > as a
> > > > > > >>> >> future
> > > > > > >>> >> >> optimization. The plan for that part is to move the
> logic
> > > of
> > > > > > >>> deciding
> > > > > > >>> >> task
> > > > > > >>> >> >> executor specifications into the slot manager and make
> > slot
> > > > > > manager
> > > > > > >>> >> >> pluggable, so inside the slot manager plugin we can
> have
> > > > > > different
> > > > > > >>> >> logics
> > > > > > >>> >> >> for deciding the task executor specifications.
> > > > > > >>> >> >>
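A hedged sketch of the pluggable slot manager idea mentioned above. The interface, record, and method names are hypothetical (the real SlotManager looks different); the only point is that the logic for deciding task executor specifications becomes a strategy that different plugins can implement, with a homogeneous-spec default.

    // Hypothetical names; only sketches the "pluggable specification logic" idea.
    interface TaskExecutorSpecStrategy {
        /** Decide the spec of the next task executor to start, given pending requests. */
        TaskExecutorSpec decideSpec(java.util.List<ResourceRequest> pendingRequests);
    }

    record ResourceRequest(long memBytes, double cpuCores) {}
    record TaskExecutorSpec(long memBytes, double cpuCores) {}

    /** Default behaviour: every task executor gets the same configured spec. */
    final class HomogeneousSpecStrategy implements TaskExecutorSpecStrategy {
        private final TaskExecutorSpec configured;
        HomogeneousSpecStrategy(TaskExecutorSpec configured) { this.configured = configured; }
        @Override
        public TaskExecutorSpec decideSpec(java.util.List<ResourceRequest> pendingRequests) {
            return configured;
        }
    }

A future plugin could instead size task executors to the pending requests, which is the optimization left out of scope here.
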
> > > > > > >>> >> >>
> > > > > > >>> >> >>>    2. If a slot is released and returned to SlotPool,
> > does
> > > > it
> > > > > > >>> could be
> > > > > > >>> >> >>>    reused by other SlotRequest that the request
> resource
> > > is
> > > > > > >>> smaller
> > > > > > >>> >> than
> > > > > > >>> >> >>> it?
> > > > > > >>> >> >>>
> > > > > > >>> >> >> No, I think slot pool should always return slots if
> they
> > do
> > > > not
> > > > > > >>> exactly
> > > > > > >>> >> >> match the pending requests, so that resource manager
> can
> > > deal
> > > > > > with
> > > > > > >>> the
> > > > > > >>> >> >> extra resources.
> > > > > > >>> >> >>
> > > > > > >>> >> >>>       - If it is yes, what happens to the available
> > > resource
> > > > > in
> > > > > > >>> the
> > > > > > >>> >> >>
> > > > > > >>> >> >>       TaskManager.
> > > > > > >>> >> >>>       - What is the SlotStatus of the cached slot in
> > > > SlotPool?
> > > > > > The
> > > > > > >>> >> >>>       AllocationId is null?
> > > > > > >>> >> >>>
> > > > > > >>> >> >> The allocation id does not change as long as the slot
> is
> > > not
> > > > > > >>> returned
> > > > > > >>> >> >> from the job master, no matter its occupied or
> available
> > in
> > > > the
> > > > > > >>> slot
> > > > > > >>> >> pool.
> > > > > > >>> >> >> I think we have the same behavior currently. No matter
> > how
> > > > many
> > > > > > >>> tasks
> > > > > > >>> >> the
> > > > > > >>> >> >> job master deploy into the slot, concurrently or
> > > > sequentially,
> > > > > it
> > > > > > >>> is
> > > > > > >>> >> one
> > > > > > >>> >> >> allocation from the cluster to the job until the slot
> is
> > > > freed
> > > > > > from
> > > > > > >>> >> the job
> > > > > > >>> >> >> master.
> > > > > > >>> >> >>
> > > > > > >>> >> >>>    3. In a session cluster, some jobs are configured
> > with
> > > > > > operator
> > > > > > >>> >> >>>    resources, meanwhile other jobs are using UNKNOWN.
> > How
> > > to
> > > > > > deal
> > > > > > >>> with
> > > > > > >>> >> >>> this
> > > > > > >>> >> >>>    situation?
> > > > > > >>> >> >>
> > > > > > >>> >> >> As long as we do not mix unknown / specified resource
> > > > profiles
> > > > > > >>> within
> > > > > > >>> >> the
> > > > > > >>> >> >> same job / slot, there shouldn't be a problem. Resource
> > > > manager
> > > > > > >>> >> converts
> > > > > > >>> >> >> unknown resource profiles in slot requests to specified
> > > > default
> > > > > > >>> >> resource
> > > > > > >>> >> >> profiles, so they can be dynamically allocated from
> task
> > > > > > executors'
> > > > > > >>> >> >> available resources just as other slot requests with
> > > > specified
> > > > > > >>> resource
> > > > > > >>> >> >> profiles.
> > > > > > >>> >> >>
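A small sketch of the conversion described above, using stand-in types rather than Flink's real ResourceProfile API: the resource manager swaps UNKNOWN request profiles for a configured default slot profile before matching them against a task executor's free resources.

    // Stand-in types; not Flink's actual ResourceProfile.
    final class UnknownProfileConversionSketch {

        record Profile(long managedMemBytes, double cpuCores) {
            static final Profile UNKNOWN = new Profile(-1, -1);
            boolean isUnknown() { return managedMemBytes < 0 || cpuCores < 0; }
        }

        /** Requests with UNKNOWN profiles fall back to the configured default slot profile. */
        static Profile effectiveProfile(Profile requested, Profile defaultSlotProfile) {
            return requested.isUnknown() ? defaultSlotProfile : requested;
        }

        /** The converted profile is then matched against a task executor's free resources. */
        static boolean fits(Profile requested, Profile freeOnTaskExecutor) {
            return requested.managedMemBytes() <= freeOnTaskExecutor.managedMemBytes()
                    && requested.cpuCores() <= freeOnTaskExecutor.cpuCores();
        }
    }
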
> > > > > > >>> >> >> Thank you~
> > > > > > >>> >> >>
> > > > > > >>> >> >> Xintong Song
> > > > > > >>> >> >>
> > > > > > >>> >> >>
> > > > > > >>> >> >>
> > > > > > >>> >> >> On Mon, Aug 19, 2019 at 11:39 AM Yang Wang <
> > > > > > [hidden email]>
> > > > > > >>> >> wrote:
> > > > > > >>> >> >>
> > > > > > >>> >> >>> Hi Xintong,
> > > > > > >>> >> >>>
> > > > > > >>> >> >>>
> > > > > > >>> >> >>> Thanks for your detailed proposal. I think many users
> > are
> > > > > > >>> suffering
> > > > > > >>> >> from
> > > > > > >>> >> >>> waste of resources. The resource spec of all task
> > managers
> > > > are
> > > > > > >>> same
> > > > > > >>> >> and
> > > > > > >>> >> >>> we
> > > > > > >>> >> >>> have to increase all task managers to make the heavy
> one
> > > > more
> > > > > > >>> stable.
> > > > > > >>> >> So
> > > > > > >>> >> >>> we
> > > > > > >>> >> >>> will benefit from the fine grained resource
> management a
> > > > lot.
> > > > > We
> > > > > > >>> could
> > > > > > >>> >> >>> get
> > > > > > >>> >> >>> better resource utilization and stability.
> > > > > > >>> >> >>>
> > > > > > >>> >> >>>
> > > > > > >>> >> >>> Just to share some thoughts.
> > > > > > >>> >> >>>
> > > > > > >>> >> >>>
> > > > > > >>> >> >>>
> > > > > > >>> >> >>>    1. How to calculate the resource specification of
> > > > > > >>> TaskManagers? Do
> > > > > > >>> >> >>> they
> > > > > > >>> >> >>>    have them same resource spec calculated based on
> the
> > > > > > >>> >> configuration? I
> > > > > > >>> >> >>> think
> > > > > > >>> >> >>>    we still have wasted resources in this situation.
> Or
> > we
> > > > > could
> > > > > > >>> start
> > > > > > >>> >> >>>    TaskManagers with different spec.
> > > > > > >>> >> >>>    2. If a slot is released and returned to SlotPool,
> > does
> > > > it
> > > > > > >>> could be
> > > > > > >>> >> >>>    reused by other SlotRequest that the request
> resource
> > > is
> > > > > > >>> smaller
> > > > > > >>> >> than
> > > > > > >>> >> >>> it?
> > > > > > >>> >> >>>       - If it is yes, what happens to the available
> > > resource
> > > > > in
> > > > > > >>> the
> > > > > > >>> >> >>>       TaskManager.
> > > > > > >>> >> >>>       - What is the SlotStatus of the cached slot in
> > > > SlotPool?
> > > > > > The
> > > > > > >>> >> >>>       AllocationId is null?
> > > > > > >>> >> >>>    3. In a session cluster, some jobs are configured
> > with
> > > > > > operator
> > > > > > >>> >> >>>    resources, meanwhile other jobs are using UNKNOWN.
> > How
> > > to
> > > > > > deal
> > > > > > >>> with
> > > > > > >>> >> >>> this
> > > > > > >>> >> >>>    situation?
> > > > > > >>> >> >>>
> > > > > > >>> >> >>>
> > > > > > >>> >> >>>
> > > > > > >>> >> >>> Best,
> > > > > > >>> >> >>> Yang
> > > > > > >>> >> >>>
> > > > > > >>> >> >>> On Fri, Aug 16, 2019 at 8:57 PM, Xintong Song <[hidden email]>
> > > > wrote:
> > > > > > >>> >> >>>
> > > > > > >>> >> >>> > Thanks for the feedbacks, Yangze and Till.
> > > > > > >>> >> >>> >
> > > > > > >>> >> >>> > Yangze,
> > > > > > >>> >> >>> >
> > > > > > >>> >> >>> > I agree with you that we should make scheduling
> > strategy
> > > > > > >>> pluggable
> > > > > > >>> >> and
> > > > > > >>> >> >>> > optimize the strategy to reduce the memory
> > fragmentation
> > > > > > >>> problem,
> > > > > > >>> >> and
> > > > > > >>> >> >>> > thanks for the inputs on the potential algorithmic
> > > > > solutions.
> > > > > > >>> >> However,
> > > > > > >>> >> >>> I'm
> > > > > > >>> >> >>> > in favor of keeping this FLIP focusing on the overall
> > > > mechanism
> > > > > > >>> design
> > > > > > >>> >> >>> rather
> > > > > > >>> >> >>> > than strategies. Solving the fragmentation issue
> > should
> > > be
> > > > > > >>> >> considered
> > > > > > >>> >> >>> as an
> > > > > > >>> >> >>> > optimization, and I agree with Till that we probably
> > > > should
> > > > > > >>> tackle
> > > > > > >>> >> this
> > > > > > >>> >> >>> > afterwards.
> > > > > > >>> >> >>> >
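A sketch of what a pluggable matching strategy could look like, as mentioned above. The interface and the simple "first task executor with enough free memory" default are purely illustrative assumptions, not part of the proposal's API, and only memory is considered to keep the example one-dimensional.

    import java.util.List;
    import java.util.Optional;

    // Illustrative only; names do not correspond to the actual proposal.
    interface TaskExecutorMatchingStrategy {
        Optional<Integer> findTaskExecutor(long requestedMemBytes, List<Long> freeMemPerExecutor);
    }

    /** Simple default: pick the first task executor with enough free memory. */
    final class FirstFitStrategy implements TaskExecutorMatchingStrategy {
        @Override
        public Optional<Integer> findTaskExecutor(long requestedMemBytes, List<Long> freeMemPerExecutor) {
            for (int i = 0; i < freeMemPerExecutor.size(); i++) {
                if (freeMemPerExecutor.get(i) >= requestedMemBytes) {
                    return Optional.of(i);
                }
            }
            return Optional.empty();
        }
    }

A fragmentation-aware plugin could, for example, prefer the executor whose free memory most closely matches the request; that kind of strategy is exactly what is deferred to a follow-up here.
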
> > > > > > >>> >> >>> > Till,
> > > > > > >>> >> >>> >
> > > > > > >>> >> >>> > - Regarding splitting the FLIP, I think it makes
> > sense.
> > > > The
> > > > > > >>> operator
> > > > > > >>> >> >>> > resource management and dynamic slot allocation do
> not
> > > > have
> > > > > > much
> > > > > > >>> >> >>> dependency
> > > > > > >>> >> >>> > on each other.
> > > > > > >>> >> >>> >
> > > > > > >>> >> >>> > - Regarding the default slot size, I think this is
> > > similar
> > > > > to
> > > > > > >>> >> FLIP-49
> > > > > > >>> >> >>> [1]
> > > > > > >>> >> >>> > where we want all the deriving happens at one
> place. I
> > > > think
> > > > > > it
> > > > > > >>> >> would
> > > > > > >>> >> >>> be
> > > > > > >>> >> >>> > nice to pass the default slot size into the task
> > > executor
> > > > in
> > > > > > the
> > > > > > >>> >> same
> > > > > > >>> >> >>> way
> > > > > > >>> >> >>> > that we pass in the memory pool sizes in FLIP-49
> [1].
> > > > > > >>> >> >>> >
> > > > > > >>> >> >>> > - Regarding the return value of
> > > > > > >>> >> TaskExecutorGateway#requestResource, I
> > > > > > >>> >> >>> > think you're right. We should avoid using null as
> the
> > > > return
> > > > > > >>> value.
> > > > > > >>> >> I
> > > > > > >>> >> >>> think
> > > > > > >>> >> >>> > we probably should throw an exception here.
> > > > > > >>> >> >>> >
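A tiny sketch of the two alternatives discussed above for TaskExecutorGateway#requestResource. The types and method shapes are hypothetical (the actual gateway signature is not defined here); the snippet only contrasts an explicit "cannot be fulfilled" return value with throwing a dedicated exception instead of returning null.

    import java.util.Optional;

    // Hypothetical shapes only; not the actual TaskExecutorGateway signature.
    final class RequestResourceSketch {

        record Profile(long memBytes, double cpuCores) {}
        record SlotHandle(String slotId, Profile profile) {}

        static final class ResourceNotAvailableException extends RuntimeException {
            ResourceNotAvailableException(String msg) { super(msg); }
        }

        private Profile free = new Profile(1024L << 20, 2.0);

        /** Alternative 1: an explicit return value that can express "cannot be fulfilled". */
        Optional<SlotHandle> tryRequestResource(Profile requested) {
            if (requested.memBytes() <= free.memBytes() && requested.cpuCores() <= free.cpuCores()) {
                free = new Profile(free.memBytes() - requested.memBytes(),
                        free.cpuCores() - requested.cpuCores());
                return Optional.of(new SlotHandle("slot-" + System.nanoTime(), requested));
            }
            return Optional.empty();
        }

        /** Alternative 2: throw instead of returning null. */
        SlotHandle requestResourceOrThrow(Profile requested) {
            return tryRequestResource(requested).orElseThrow(
                    () -> new ResourceNotAvailableException("Not enough free resources for " + requested));
        }
    }
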
> > > > > > >>> >> >>> > Thank you~
> > > > > > >>> >> >>> >
> > > > > > >>> >> >>> > Xintong Song
> > > > > > >>> >> >>> >
> > > > > > >>> >> >>> >
> > > > > > >>> >> >>> > [1]
> > > > > > >>> >> >>> >
> > > > > > >>> >> >>> >
> > > > > > >>> >> >>>
> > > > > > >>> >>
> > > > > > >>>
> > > > > >
> > > > >
> > > >
> > >
> >
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-49%3A+Unified+Memory+Configuration+for+TaskExecutors
> > > > > > >>> >> >>> >
> > > > > > >>> >> >>> > On Fri, Aug 16, 2019 at 2:18 PM Till Rohrmann <
> > > > > > >>> [hidden email]
> > > > > > >>> >> >
> > > > > > >>> >> >>> > wrote:
> > > > > > >>> >> >>> >
> > > > > > >>> >> >>> > > Hi Xintong,
> > > > > > >>> >> >>> > >
> > > > > > >>> >> >>> > > thanks for drafting this FLIP. I think your
> proposal
> > > > helps
> > > > > > to
> > > > > > >>> >> >>> improve the
> > > > > > >>> >> >>> > > execution of batch jobs more efficiently.
> Moreover,
> > it
> > > > > > >>> enables the
> > > > > > >>> >> >>> proper
> > > > > > >>> >> >>> > > integration of the Blink planner which is very
> > > important
> > > > > as
> > > > > > >>> well.
> > > > > > >>> >> >>> > >
> > > > > > >>> >> >>> > > Overall, the FLIP looks good to me. I was
> wondering
> > > > > whether
> > > > > > it
> > > > > > >>> >> >>> wouldn't
> > > > > > >>> >> >>> > > make sense to actually split it up into two FLIPs:
> > > > > Operator
> > > > > > >>> >> resource
> > > > > > >>> >> >>> > > management and dynamic slot allocation. I think
> > these
> > > > two
> > > > > > >>> FLIPs
> > > > > > >>> >> >>> could be
> > > > > > >>> >> >>> > > seen as orthogonal and it would decrease the scope
> > of
> > > > each
> > > > > > >>> >> individual
> > > > > > >>> >> >>> > FLIP.
> > > > > > >>> >> >>> > >
> > > > > > >>> >> >>> > > Some smaller comments:
> > > > > > >>> >> >>> > >
> > > > > > >>> >> >>> > > - I'm not sure whether we should pass in the
> default
> > > > slot
> > > > > > size
> > > > > > >>> >> via an
> > > > > > >>> >> >>> > > environment variable. Without having unified the
> way
> > > how
> > > > > > Flink
> > > > > > >>> >> >>> components
> > > > > > >>> >> >>> > > are configured [1], I think it would be better to
> > pass
> > > > it
> > > > > in
> > > > > > >>> as
> > > > > > >>> >> part
> > > > > > >>> >> >>> of
> > > > > > >>> >> >>> > the
> > > > > > >>> >> >>> > > configuration.
> > > > > > >>> >> >>> > > - I would avoid returning a null value from
> > > > > > >>> >> >>> > > TaskExecutorGateway#requestResource if it cannot
> be
> > > > > > fulfilled.
> > > > > > >>> >> >>> Either we
> > > > > > >>> >> >>> > > should introduce an explicit return value saying
> > this
> > > or
> > > > > > >>> throw an
> > > > > > >>> >> >>> > > exception.
> > > > > > >>> >> >>> > >
> > > > > > >>> >> >>> > > Concerning Yangze's comments: I think you are
> right
> > > that
> > > > > it
> > > > > > >>> would
> > > > > > >>> >> be
> > > > > > >>> >> >>> > > helpful to make the selection strategy pluggable.
> > Also
> > > > > > >>> batching
> > > > > > >>> >> slot
> > > > > > >>> >> >>> > > requests to the RM could be a good optimization.
> For
> > > the
> > > > > > sake
> > > > > > >>> of
> > > > > > >>> >> >>> keeping
> > > > > > >>> >> >>> > > the scope of this FLIP smaller I would try to
> tackle
> > > > these
> > > > > > >>> things
> > > > > > >>> >> >>> after
> > > > > > >>> >> >>> > the
> > > > > > >>> >> >>> > > initial version has been completed (without
> spoiling
> > > > these
> > > > > > >>> >> >>> optimization
> > > > > > >>> >> >>> > > opportunities). In particular batching the slot
> > > requests
> > > > > > >>> depends
> > > > > > >>> >> on
> > > > > > >>> >> >>> the
> > > > > > >>> >> >>> > > current scheduler refactoring and could also be
> > > realized
> > > > > on
> > > > > > >>> the RM
> > > > > > >>> >> >>> side
> > > > > > >>> >> >>> > > only.
> > > > > > >>> >> >>> > >
> > > > > > >>> >> >>> > > [1]
> > > > > > >>> >> >>> > >
> > > > > > >>> >> >>> > >
> > > > > > >>> >> >>> >
> > > > > > >>> >> >>>
> > > > > > >>> >>
> > > > > > >>>
> > > > > >
> > > > >
> > > >
> > >
> >
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-54%3A+Evolve+ConfigOption+and+Configuration
> > > > > > >>> >> >>> > >
> > > > > > >>> >> >>> > > Cheers,
> > > > > > >>> >> >>> > > Till
> > > > > > >>> >> >>> > >
> > > > > > >>> >> >>> > >
> > > > > > >>> >> >>> > >
> > > > > > >>> >> >>> > > On Fri, Aug 16, 2019 at 11:11 AM Yangze Guo <
> > > > > > >>> [hidden email]>
> > > > > > >>> >> >>> wrote:
> > > > > > >>> >> >>> > >
> > > > > > >>> >> >>> > > > Hi, Xintong
> > > > > > >>> >> >>> > > >
> > > > > > >>> >> >>> > > > Thanks to propose this FLIP. The general design
> > > looks
> > > > > good
> > > > > > >>> to
> > > > > > >>> >> me,
> > > > > > >>> >> >>> +1
> > > > > > >>> >> >>> > > > for this feature.
> > > > > > >>> >> >>> > > >
> > > > > > >>> >> >>> > > > Since slots in the same task executor could have
> > > > > different
> > > > > > >>> >> resource
> > > > > > >>> >> >>> > > > profile, we will
> > > > > > >>> >> >>> > > > meet resource fragment problem. Think about this
> > > case:
> > > > > > >>> >> >>> > > >  - request A want 1G memory while request B & C
> > want
> > > > > 0.5G
> > > > > > >>> memory
> > > > > > >>> >> >>> > > >  - There are two task executors T1 & T2 with 1G
> > and
> > > > 0.5G
> > > > > > >>> free
> > > > > > >>> >> >>> memory
> > > > > > >>> >> >>> > > > respectively
> > > > > > >>> >> >>> > > > If B come first and we cut a slot from T1 for
> B, A
> > > > must
> > > > > > >>> wait for
> > > > > > >>> >> >>> the
> > > > > > >>> >> >>> > > > free resource from
> > > > > > >>> >> >>> > > > other task. But A could have been scheduled
> > > > immediately
> > > > > if
> > > > > > >>> we
> > > > > > >>> >> cut a
> > > > > > >>> >> >>> > > > slot from T2 for B.
> > > > > > >>> >> >>> > > >
> > > > > > >>> >> >>> > > > The logic of findMatchingSlot now become
> finding a
> > > > task
> > > > > > >>> executor
> > > > > > >>> >> >>> which
> > > > > > >>> >> >>> > > > has enough
> > > > > > >>> >> >>> > > > resource and then cut a slot from it. Current
> > method
> > > > > could
> > > > > > >>> be
> > > > > > >>> >> seen
> > > > > > >>> >> >>> as
> > > > > > >>> >> >>> > > > "First-fit strategy",
> > > > > > >>> >> >>> > > > which works well in general but sometimes could
> > not
> > > be
> > > > > the
> > > > > > >>> >> >>> optimization
> > > > > > >>> >> >>> > > > method.
> > > > > > >>> >> >>> > > >
> > > > > > >>> >> >>> > > > Actually, this problem could be abstracted as
> "Bin
> > > > > Packing
> > > > > > >>> >> >>> Problem"[1].
> > > > > > >>> >> >>> > > > Here are
> > > > > > >>> >> >>> > > > some common approximate algorithms:
> > > > > > >>> >> >>> > > > - First fit
> > > > > > >>> >> >>> > > > - Next fit
> > > > > > >>> >> >>> > > > - Best fit
> > > > > > >>> >> >>> > > >
> > > > > > >>> >> >>> > > > But it become multi-dimensional bin packing
> > problem
> > > if
> > > > > we
> > > > > > >>> take
> > > > > > >>> >> CPU
> > > > > > >>> >> >>> > > > into account. It hard
> > > > > > >>> >> >>> > > > to define which one is best fit now. Some
> research
> > > > > > addressed
> > > > > > >>> >> this
> > > > > > >>> >> >>> > > > problem, such like Tetris[2].
> > > > > > >>> >> >>> > > >
> > > > > > >>> >> >>> > > > Here are some thinking about it:
> > > > > > >>> >> >>> > > > 1. We could make the strategy of finding
> matching
> > > task
> > > > > > >>> executor
> > > > > > >>> >> >>> > > > pluginable. Let user to config the
> > > > > > >>> >> >>> > > > best strategy in their scenario.
> > > > > > >>> >> >>> > > > 2. We could support batch request interface in
> RM,
> > > > > because
> > > > > > >>> we
> > > > > > >>> >> have
> > > > > > >>> >> >>> > > > opportunities to optimize
> > > > > > >>> >> >>> > > > if we have more information. If we know the A,
> B,
> > C
> > > at
> > > > > the
> > > > > > >>> same
> > > > > > >>> >> >>> time,
> > > > > > >>> >> >>> > > > we could always make the best decision.
> > > > > > >>> >> >>> > > >
> > > > > > >>> >> >>> > > > [1] http://www.or.deis.unibo.it/kp/Chapter8.pdf
> > > > > > >>> >> >>> > > > [2]
> > > > > > >>> >> >>> >
> > > > > > >>> >>
> > > > > >
> > https://www.cs.cmu.edu/~xia/resources/Documents/grandl_sigcomm14.pdf
> > > > > > >>> >> >>> > > >
> > > > > > >>> >> >>> > > > Best,
> > > > > > >>> >> >>> > > > Yangze Guo
> > > > > > >>> >> >>> > > >
> > > > > > >>> >> >>> > > > On Thu, Aug 15, 2019 at 10:40 PM Xintong Song <
> > > > > > >>> >> >>> [hidden email]>
> > > > > > >>> >> >>> > > > wrote:
> > > > > > >>> >> >>> > > > >
> > > > > > >>> >> >>> > > > > Hi everyone,
> > > > > > >>> >> >>> > > > >
> > > > > > >>> >> >>> > > > > We would like to start a discussion thread on
> > > > > "FLIP-53:
> > > > > > >>> Fine
> > > > > > >>> >> >>> Grained
> > > > > > >>> >> >>> > > > > Resource Management"[1], where we propose how
> to
> > > > > improve
> > > > > > >>> Flink
> > > > > > >>> >> >>> > resource
> > > > > > >>> >> >>> > > > > management and scheduling.
> > > > > > >>> >> >>> > > > >
> > > > > > >>> >> >>> > > > > This FLIP mainly discusses the following
> issues.
> > > > > > >>> >> >>> > > > >
> > > > > > >>> >> >>> > > > >    - How to support tasks with fine grained
> > > resource
> > > > > > >>> >> >>> requirements.
> > > > > > >>> >> >>> > > > >    - How to unify resource management for jobs
> > > with
> > > > /
> > > > > > >>> without
> > > > > > >>> >> >>> fine
> > > > > > >>> >> >>> > > > grained
> > > > > > >>> >> >>> > > > >    resource requirements.
> > > > > > >>> >> >>> > > > >    - How to unify resource management for
> > > streaming
> > > > /
> > > > > > >>> batch
> > > > > > >>> >> jobs.
> > > > > > >>> >> >>> > > > >
> > > > > > >>> >> >>> > > > > Key changes proposed in the FLIP are as
> follows.
> > > > > > >>> >> >>> > > > >
> > > > > > >>> >> >>> > > > >    - Unify memory management for operators
> with
> > /
> > > > > > without
> > > > > > >>> fine
> > > > > > >>> >> >>> > grained
> > > > > > >>> >> >>> > > > >    resource requirements by applying a
> fraction
> > > > based
> > > > > > >>> quota
> > > > > > >>> >> >>> > mechanism.
> > > > > > >>> >> >>> > > > >    - Unify resource scheduling for streaming
> and
> > > > batch
> > > > > > >>> jobs by
> > > > > > >>> >> >>> > setting
> > > > > > >>> >> >>> > > > slot
> > > > > > >>> >> >>> > > > >    sharing groups for pipelined regions during
> > > > > compiling
> > > > > > >>> >> stage.
> > > > > > >>> >> >>> > > > >    - Dynamically allocate slots from task
> > > executors'
> > > > > > >>> available
> > > > > > >>> >> >>> > > resources.
> > > > > > >>> >> >>> > > > >
> > > > > > >>> >> >>> > > > > Please find more details in the FLIP wiki
> > document
> > > > > [1].
> > > > > > >>> >> Looking
> > > > > > >>> >> >>> > forward
> > > > > > >>> >> >>> > > > to
> > > > > > >>> >> >>> > > > > your feedbacks.
> > > > > > >>> >> >>> > > > >
> > > > > > >>> >> >>> > > > > Thank you~
> > > > > > >>> >> >>> > > > >
> > > > > > >>> >> >>> > > > > Xintong Song
> > > > > > >>> >> >>> > > > >
> > > > > > >>> >> >>> > > > >
> > > > > > >>> >> >>> > > > > [1]
> > > > > > >>> >> >>> > > > >
> > > > > > >>> >> >>> > > >
> > > > > > >>> >> >>> > >
> > > > > > >>> >> >>> >
> > > > > > >>> >> >>>
> > > > > > >>> >>
> > > > > > >>>
> > > > > >
> > > > >
> > > >
> > >
> >
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-53%3A+Fine+Grained+Resource+Management
> > > > > > >>> >> >>> > > >
> > > > > > >>> >> >>> > >
> > > > > > >>> >> >>> >
> > > > > > >>> >> >>>
> > > > > > >>> >> >>
> > > > > > >>> >>
> > > > > > >>> >
> > > > > > >>>
> > > > > > >>
> > > > > >
> > > > >
> > > >
> > >
> >
>
Reply | Threaded
Open this post in threaded view
|

Re: [DISCUSS] FLIP-53: Fine Grained Resource Management

Xintong Song
Thanks all for joining the discussion.
It seems to me that there is a consensus on the current FLIP document. So
if there is no objection, I would like to start the voting process for this
FLIP.

Thank you~

Xintong Song



On Wed, Sep 4, 2019 at 8:23 PM Andrey Zagrebin <[hidden email]> wrote:

> Thanks for updating the FLIP Xintong. It looks good to me. I would be ok to
> start a vote for it.
>
> Best,
> Andrey
>
> On Wed, Sep 4, 2019 at 10:03 AM Xintong Song <[hidden email]>
> wrote:
>
> > @all
> >
> > The FLIP document [1] has been updated.
> >
> > Thank you~
> >
> > Xintong Song
> >
> >
> > [1]
> >
> >
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-53%3A+Fine+Grained+Operator+Resource+Management
> >
> > On Tue, Sep 3, 2019 at 7:20 PM Zhu Zhu <[hidden email]> wrote:
> >
> > > Thanks Xintong for the explanation.
> > >
> > > For question #1, I think it's good as long as DataSet job behavior
> > remains
> > > the same.
> > >
> > > For question #2, agreed that the resource difference is small enough (at
> > > most 1 edge diff) in the currently supported point-wise execution edge
> > connection
> > > patterns.
> > >
> > > Thanks,
> > > Zhu Zhu
> > >
> > > Xintong Song <[hidden email]> 于2019年9月3日周二 下午6:58写道:
> > >
> > > >  Thanks for the comments, Zhu & Kurt.
> > > >
> > > > Andrey and I also had some discussions offline, and I would like to
> > first
> > > > post a summary of our discussion:
> > > >
> > > >    1. The motivation of the fraction based approach is to unify
> > resource
> > > >    management for both operators with specified and unknown resource
> > > >    requirements.
> > > >    2. The fraction based approach proposed in this FLIP should only
> > > affect
> > > >    streaming jobs (both bounded and unbounded). For DataSet jobs,
> there
> > > are
> > > >    already some fraction based approach (in TaskConfig and
> > > ChainedDriver),
> > > > and
> > > >    we do not make any change to the existing approach.
> > > >    3. The scope of this FLIP does not include discussion of how to
> set
> > > >    ResourceSpec for operators.
> > > >       1. For blink jobs, the optimizer can set operator resources for
> > the
> > > >       users, according to their configurations (default: unknown)
> > > >       2. For DataStream jobs, there are no method / interface to set
> > > >       operator resources at the moment (1.10). We can have in the
> > future.
> > > >       3. For DataSet jobs, there are existing user interfaces to set
> > > >       operator resources.
> > > >    4. The FLIP should explain more about how ResourceSpecs works
> > > >       1. PhysicalTransformations (deployed with operators into the
> > > >       StreamTasks) get ResourceSpec: unknown by default or known
> (e.g.
> > > > from the
> > > >       Blink planner)
> > > >       2. While generating stream graph, calculate fractions and set
> to
> > > >       StreamConfig
> > > >       3. While scheduling, convert ResourceSpec to ResourceProfile
> > > >       (ResourceSpec + network memory), and deploy to slots / TMs
> > matching
> > > > the
> > > >       resources
> > > >       4. While starting Task in TM, each operator gets fraction
> > converted
> > > >       back to the original absolute value requested by user or fair
> > > > unknown share
> > > >       of the slot
> > > >       5. We should not set `allSourcesInSamePipelinedRegion` to
> `false`
> > > for
> > > >    DataSet jobs. Behaviors of DataSet jobs should not be changed.
> > > >    6. The FLIP document should differentiate works planed in this
> FLIP
> > > and
> > > >    the future follow-ups more clearly, by put the follow-ups in a
> > > separate
> > > >    section
> > > >    7. Another limitation of the rejected alternative setting
> fractions
> > at
> > > >    scheduling time is that, the scheduler implementation does not
> know
> > > > which
> > > >    tasks will be deployed into the same slot in advance.
> > > >
> > > > Andrey, Please bring it up if there is anything I missed.
> > > >
> > > > Zhu, regarding your comments:
> > > >
> > > >    1. If we do not set `allSourcesInSamePipelinedRegion` to `false`
> for
> > > >    DataSet jobs (point 5 in the discussion summary above), then there
> > > >    shouldn't be any regression right?
> > > >    2. I think it makes sense to set the max possible network memory
> for
> > > the
> > > >    JobVertex. When you say parallel instances of the same JobVertex
> may
> > > > have
> > > >    need different network memory, I guess you mean the rescale
> > scenarios
> > > > where
> > > >    parallelisms of upstream / downstream vertex cannot be exactly
> > divided
> > > > by
> > > >    parallelism of downstream / upstream vertex? I would say it's
> > > > acceptable to
> > > >    have slight difference between actually needed and allocated
> network
> > > > memory.
> > > >    3. Yes, by numOpsUseOnHeapManagedMemory I mean
> > > >    numOpsUseOnHeapManagedMemoryInTheSameSharedGroup. I'll update the
> > doc.
> > > >    4. Yes, it should be StreamingJobGraphGenerator. Thanks for the
> > > >    correction.
> > > >
> > > >
> > > > Kurt, regarding your comments:
> > > >
> > > >    1. I think we don't have network memory in ResourceSpec, which is
> > the
> > > >    user facing API. We only have network memory in ResourceProfile,
> > which
> > > > is
> > > >    used internally for scheduling. The reason we do not expose
> network
> > > > memory
> > > >    to the user is that, currently how many network buffers each task
> > > needs
> > > > is
> > > >    decided by the topology of execution graph (how many input /
> output
> > > >    channels it has).
> > > >    2. In the section "Operator Resource Requirements": "For the first
> > > >    version, we do not support mixing operators with specified /
> unknown
> > > >    resource requirements in the same job. Either all or none of the
> > > > operators
> > > >    of the same job should specify their resource requirements.
> > > >    StreamGraphGenerator should check this and throw an error when
> > mixing
> > > of
> > > >    specified / unknown resource requirements is detected, during the
> > > >    compilation stage."
> > > >    3. If the user set a resource requirement, then it is guaranteed
> > that
> > > >    the task should get at least the much resource, otherwise there
> > should
> > > > be
> > > >    an exception. That should be guaranteed by the "Dynamic Slot
> > > Allocation"
> > > >    approach (FLIP-56).
> > > >
> > > >
> > > > I'll update the FLIP document addressing the comments ASAP.
> > > >
> > > >
> > > > Thank you~
> > > >
> > > > Xintong Song
> > > >
> > > >
> > > >
> > > > On Tue, Sep 3, 2019 at 2:42 PM Kurt Young <[hidden email]> wrote:
> > > >
> > > > > Thanks Xingtong for driving this effort, I haven't finished the
> whole
> > > > > document yet,
> > > > > but have couple of questions:
> > > > >
> > > > > 1. Regarding to network memory, the document said it will be
> derived
> > by
> > > > > framework
> > > > > automatically. I'm wondering whether we should delete this
> dimension
> > > from
> > > > > user-
> > > > > facing API?
> > > > >
> > > > > 2. Regarding to fraction based quota, I don't quite get the meaning
> > of
> > > > > "slotSharingGroupOnHeapManagedMem" and
> > > > "slotSharingGroupOffHeapManagedMem".
> > > > > What if the sharing group is mixed with specified resource and
> > UNKNOWN
> > > > > resource
> > > > > requirements.
> > > > >
> > > > > 3 IIUC, even user had set resource requirements, lets say 500MB
> > > off-heap
> > > > > managed
> > > > > memory, during execution the operator may or may not have 500MB
> > > off-heap
> > > > > managed
> > > > > memory, right?
> > > > >
> > > > > Best,
> > > > > Kurt
> > > > >
> > > > >
> > > > > On Mon, Sep 2, 2019 at 8:36 PM Zhu Zhu <[hidden email]> wrote:
> > > > >
> > > > > > Thanks Xintong for proposing this improvement. Fine grained
> > resources
> > > > can
> > > > > > be very helpful when user has good planning on resources.
> > > > > >
> > > > > > I have a few questions:
> > > > > > 1. Currently in a batch job, vertices from different regions can
> > run
> > > at
> > > > > the
> > > > > > same time in slots from the same shared group, as long as they do
> > not
> > > > > have
> > > > > > data dependency on each other and available slot count is not
> > smaller
> > > > > than
> > > > > > the *max* of parallelism of all tasks.
> > > > > > With changes in this FLIP however, tasks from different regions
> > > cannot
> > > > > > share slots anymore.
> > > > > > Once available slot count is smaller than the *sum* of all
> > > parallelism
> > > > of
> > > > > > tasks from all regions, tasks may need to be executed
> sequentially,
> > > > which
> > > > > > might result in a performance regression.
> > > > > > Is this(performance regression to existing DataSet jobs)
> considered
> > > as
> > > > a
> > > > > > necessary and accepted trade off in this FLIP?
> > > > > >
> > > > > > 2. The network memory depends on the input/output ExecutionEdge
> > count
> > > > and
> > > > > > thus can be different even for parallel instances of the same
> > > > JobVertex.
> > > > > > Does this mean that when adding task resources to calculating the
> > > slot
> > > > > > resource for a shared group, the max possible network memory of
> the
> > > > > vertex
> > > > > > instance shall be used?
> > > > > > This might result in larger resource required than actually
> needed.
> > > > > >
> > > > > > And some minor comments:
> > > > > > 1. Regarding "fracManagedMemOnHeap = 1 /
> > > > numOpsUseOnHeapManagedMemory", I
> > > > > > guess you mean numOpsUseOnHeapManagedMemoryInTheSameSharedGroup ?
> > > > > > 2. I think the *StreamGraphGenerator* in the #Slot Sharing
> section
> > > and
> > > > > > implementation step 4 should be *StreamingJobGraphGenerator*, as
> > > > > > *StreamGraphGenerator* is not aware of JobGraph and pipelined
> > region.
> > > > > >
> > > > > >
> > > > > > Thanks,
> > > > > > Zhu Zhu
> > > > > >
> > > > > > Xintong Song <[hidden email]> 于2019年9月2日周一 上午11:59写道:
> > > > > >
> > > > > > > Updated the FLIP wiki page [1], with the following changes.
> > > > > > >
> > > > > > >    - Remove the step of converting pipelined edges between
> > > different
> > > > > slot
> > > > > > >    sharing groups into blocking edges.
> > > > > > >    - Set `allSourcesInSamePipelinedRegion` to true by default.
> > > > > > >
> > > > > > > Thank you~
> > > > > > >
> > > > > > > Xintong Song
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > > On Mon, Sep 2, 2019 at 11:50 AM Xintong Song <
> > > [hidden email]>
> > > > > > > wrote:
> > > > > > >
> > > > > > > > Regarding changing edge type, I think actually we don't need
> to
> > > do
> > > > > this
> > > > > > > > for batch jobs neither, because we don't have public
> interfaces
> > > for
> > > > > > users
> > > > > > > > to explicitly set slot sharing groups in DataSet API and
> > > SQL/Table
> > > > > API.
> > > > > > > We
> > > > > > > > have such interfaces in DataStream API only.
> > > > > > > >
> > > > > > > > Thank you~
> > > > > > > >
> > > > > > > > Xintong Song
> > > > > > > >
> > > > > > > >
> > > > > > > >
> > > > > > > > On Tue, Aug 27, 2019 at 10:16 PM Xintong Song <
> > > > [hidden email]
> > > > > >
> > > > > > > > wrote:
> > > > > > > >
> > > > > > > >> Thanks for the correction, Till.
> > > > > > > >>
> > > > > > > >> Regarding your comments:
> > > > > > > >> - You are right, we should not change the edge type for
> > > streaming
> > > > > > jobs.
> > > > > > > >> Then I think we can change the option
> > > > > > 'allSourcesInSamePipelinedRegion'
> > > > > > > in
> > > > > > > >> step 2 to 'isStreamingJob', and implement the current step 2
> > > > before
> > > > > > the
> > > > > > > >> current step 1 so we can use this option to decide whether
> > > should
> > > > > > change
> > > > > > > >> the edge type. What do you think?
> > > > > > > >> - Agree. It should be easier to make the default value of
> > > > > > > >> 'allSourcesInSamePipelinedRegion' (or 'isStreamingJob')
> > 'true',
> > > > and
> > > > > > set
> > > > > > > it
> > > > > > > >> to 'false' when using DataSet API or blink planner.
> > > > > > > >>
> > > > > > > >> Thank you~
> > > > > > > >>
> > > > > > > >> Xintong Song
> > > > > > > >>
> > > > > > > >>
> > > > > > > >>
> > > > > > > >> On Tue, Aug 27, 2019 at 8:59 PM Till Rohrmann <
> > > > [hidden email]
> > > > > >
> > > > > > > >> wrote:
> > > > > > > >>
> > > > > > > >>> Thanks for creating the implementation plan Xintong.
> Overall,
> > > the
> > > > > > > >>> implementation plan looks good. I had a couple of comments:
> > > > > > > >>>
> > > > > > > >>> - What will happen if a user has defined a streaming job
> with
> > > two
> > > > > > slot
> > > > > > > >>> sharing groups? Would the code insert a blocking data
> > exchange
> > > > > > between
> > > > > > > >>> these two groups? If yes, then this breaks existing Flink
> > > > streaming
> > > > > > > jobs.
> > > > > > > >>> - How do we detect unbounded streaming jobs to set
> > > > > > > >>> the allSourcesInSamePipelinedRegion to `true`? Wouldn't it
> be
> > > > > easier
> > > > > > to
> > > > > > > >>> set
> > > > > > > >>> it false if we are using the DataSet API or the Blink
> planner
> > > > with
> > > > > a
> > > > > > > >>> bounded job?
> > > > > > > >>>
> > > > > > > >>> Cheers,
> > > > > > > >>> Till
> > > > > > > >>>
> > > > > > > >>> On Tue, Aug 27, 2019 at 2:16 PM Till Rohrmann <
> > > > > [hidden email]>
> > > > > > > >>> wrote:
> > > > > > > >>>
> > > > > > > >>> > I guess there is a typo since the link to the FLIP-53 is
> > > > > > > >>> >
> > > > > > > >>>
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-53%3A+Fine+Grained+Operator+Resource+Management
> > > > > > > >>> >
> > > > > > > >>> > Cheers,
> > > > > > > >>> > Till
> > > > > > > >>> >
> > > > > > > >>> > On Tue, Aug 27, 2019 at 1:42 PM Xintong Song <
> > > > > > [hidden email]>
> > > > > > > >>> > wrote:
> > > > > > > >>> >
> > > > > > > >>> >> Added implementation steps for this FLIP on the wiki
> page
> > > [1].
> > > > > > > >>> >>
> > > > > > > >>> >>
> > > > > > > >>> >> Thank you~
> > > > > > > >>> >>
> > > > > > > >>> >> Xintong Song
> > > > > > > >>> >>
> > > > > > > >>> >>
> > > > > > > >>> >> [1]
> > > > > > > >>> >>
> > > > > > > >>> >>
> > > > > > > >>>
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-49%3A+Unified+Memory+Configuration+for+TaskExecutors
> > > > > > > >>> >>
> > > > > > > >>> >> On Mon, Aug 19, 2019 at 10:29 PM Xintong Song <
> > > > > > > [hidden email]>
> > > > > > > >>> >> wrote:
> > > > > > > >>> >>
> > > > > > > >>> >> > Hi everyone,
> > > > > > > >>> >> >
> > > > > > > >>> >> > As Till suggested, the original "FLIP-53: Fine Grained
> > > > > Resource
> > > > > > > >>> >> > Management" splits into two separate FLIPs,
> > > > > > > >>> >> >
> > > > > > > >>> >> >    - FLIP-53: Fine Grained Operator Resource
> Management
> > > [1]
> > > > > > > >>> >> >    - FLIP-56: Dynamic Slot Allocation [2]
> > > > > > > >>> >> >
> > > > > > > >>> >> > We'll continue using this discussion thread for
> FLIP-53.
> > > For
> > > > > > > >>> FLIP-56, I
> > > > > > > >>> >> > just started a new discussion thread [3].
> > > > > > > >>> >> >
> > > > > > > >>> >> > Thank you~
> > > > > > > >>> >> >
> > > > > > > >>> >> > Xintong Song
> > > > > > > >>> >> >
> > > > > > > >>> >> >
> > > > > > > >>> >> > [1]
> > > > > > > >>> >> >
> > > > > > > >>> >>
> > > > > > > >>>
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-53%3A+Fine+Grained+Operator+Resource+Management
> > > > > > > >>> >> >
> > > > > > > >>> >> > [2]
> > > > > > > >>> >> >
> > > > > > > >>> >>
> > > > > > > >>>
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-56%3A+Dynamic+Slot+Allocation
> > > > > > > >>> >> >
> > > > > > > >>> >> > [3]
> > > > > > > >>> >> >
> > > > > > > >>> >>
> > > > > > > >>>
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
> http://apache-flink-mailing-list-archive.1008284.n3.nabble.com/DISCUSS-FLIP-56-Dynamic-Slot-Allocation-td31960.html
> > > > > > > >>> >> >
> > > > > > > >>> >> > On Mon, Aug 19, 2019 at 2:55 PM Xintong Song <
> > > > > > > [hidden email]
> > > > > > > >>> >
> > > > > > > >>> >> > wrote:
> > > > > > > >>> >> >
> > > > > > > >>> >> >> Thinks for the comments, Yang.
> > > > > > > >>> >> >>
> > > > > > > >>> >> >> Regarding your questions:
> > > > > > > >>> >> >>
> > > > > > > >>> >> >>    1. How to calculate the resource specification of
> > > > > > > TaskManagers?
> > > > > > > >>> Do
> > > > > > > >>> >> they
> > > > > > > >>> >> >>>    have them same resource spec calculated based on
> > the
> > > > > > > >>> >> configuration? I
> > > > > > > >>> >> >>> think
> > > > > > > >>> >> >>>    we still have wasted resources in this situation.
> > Or
> > > we
> > > > > > could
> > > > > > > >>> start
> > > > > > > >>> >> >>>    TaskManagers with different spec.
> > > > > > > >>> >> >>>
> > > > > > > >>> >> >> I agree with you that we can further improve the
> > resource
> > > > > > utility
> > > > > > > >>> by
> > > > > > > >>> >> >> customizing task executors with different resource
> > > > > > > specifications.
> > > > > > > >>> >> However,
> > > > > > > >>> >> >> I'm in favor of limiting the scope of this FLIP and
> > leave
> > > > it
> > > > > > as a
> > > > > > > >>> >> future
> > > > > > > >>> >> >> optimization. The plan for that part is to move the
> > logic
> > > > of
> > > > > > > >>> deciding
> > > > > > > >>> >> task
> > > > > > > >>> >> >> executor specifications into the slot manager and
> make
> > > slot
> > > > > > > manager
> > > > > > > >>> >> >> pluggable, so inside the slot manager plugin we can
> > have
> > > > > > > different
> > > > > > > >>> >> logics
> > > > > > > >>> >> >> for deciding the task executor specifications.
> > > > > > > >>> >> >>
> > > > > > > >>> >> >>
> > > > > > > >>> >> >>>    2. If a slot is released and returned to
> SlotPool,
> > > does
> > > > > it
> > > > > > > >>> could be
> > > > > > > >>> >> >>>    reused by other SlotRequest that the request
> > resource
> > > > is
> > > > > > > >>> smaller
> > > > > > > >>> >> than
> > > > > > > >>> >> >>> it?
> > > > > > > >>> >> >>>
> > > > > > > >>> >> >> No, I think slot pool should always return slots if
> > they
> > > do
> > > > > not
> > > > > > > >>> exactly
> > > > > > > >>> >> >> match the pending requests, so that resource manager
> > can
> > > > deal
> > > > > > > with
> > > > > > > >>> the
> > > > > > > >>> >> >> extra resources.
> > > > > > > >>> >> >>
> > > > > > > >>> >> >>>       - If it is yes, what happens to the available
> > > > resource
> > > > > > in
> > > > > > > >>> the
> > > > > > > >>> >> >>
> > > > > > > >>> >> >>       TaskManager.
> > > > > > > >>> >> >>>       - What is the SlotStatus of the cached slot in
> > > > > SlotPool?
> > > > > > > The
> > > > > > > >>> >> >>>       AllocationId is null?
> > > > > > > >>> >> >>>
> > > > > > > >>> >> >> The allocation id does not change as long as the slot
> > is
> > > > not
> > > > > > > >>> returned
> > > > > > > >>> >> >> from the job master, no matter its occupied or
> > available
> > > in
> > > > > the
> > > > > > > >>> slot
> > > > > > > >>> >> pool.
> > > > > > > >>> >> >> I think we have the same behavior currently. No
> matter
> > > how
> > > > > many
> > > > > > > >>> tasks
> > > > > > > >>> >> the
> > > > > > > >>> >> >> job master deploy into the slot, concurrently or
> > > > > sequentially,
> > > > > > it
> > > > > > > >>> is
> > > > > > > >>> >> one
> > > > > > > >>> >> >> allocation from the cluster to the job until the slot
> > is
> > > > > freed
> > > > > > > from
> > > > > > > >>> >> the job
> > > > > > > >>> >> >> master.
> > > > > > > >>> >> >>
> > > > > > > >>> >> >>>    3. In a session cluster, some jobs are configured
> > > with
> > > > > > > operator
> > > > > > > >>> >> >>>    resources, meanwhile other jobs are using
> UNKNOWN.
> > > How
> > > > to
> > > > > > > deal
> > > > > > > >>> with
> > > > > > > >>> >> >>> this
> > > > > > > >>> >> >>>    situation?
> > > > > > > >>> >> >>
> > > > > > > >>> >> >> As long as we do not mix unknown / specified resource
> > > > > profiles
> > > > > > > >>> within
> > > > > > > >>> >> the
> > > > > > > >>> >> >> same job / slot, there shouldn't be a problem.
> Resource
> > > > > manager
> > > > > > > >>> >> converts
> > > > > > > >>> >> >> unknown resource profiles in slot requests to
> specified
> > > > > default
> > > > > > > >>> >> resource
> > > > > > > >>> >> >> profiles, so they can be dynamically allocated from
> > task
> > > > > > > executors'
> > > > > > > >>> >> >> available resources just as other slot requests with
> > > > > specified
> > > > > > > >>> resource
> > > > > > > >>> >> >> profiles.
> > > > > > > >>> >> >>
> > > > > > > >>> >> >> Thank you~
> > > > > > > >>> >> >>
> > > > > > > >>> >> >> Xintong Song
> > > > > > > >>> >> >>
> > > > > > > >>> >> >>
> > > > > > > >>> >> >>
> > > > > > > >>> >> >> On Mon, Aug 19, 2019 at 11:39 AM Yang Wang <
> > > > > > > [hidden email]>
> > > > > > > >>> >> wrote:
> > > > > > > >>> >> >>
> > > > > > > >>> >> >>> Hi Xintong,
> > > > > > > >>> >> >>>
> > > > > > > >>> >> >>>
> > > > > > > >>> >> >>> Thanks for your detailed proposal. I think many
> users
> > > are
> > > > > > > >>> suffering
> > > > > > > >>> >> from
> > > > > > > >>> >> >>> waste of resources. The resource spec of all task
> > > managers
> > > > > are
> > > > > > > >>> same
> > > > > > > >>> >> and
> > > > > > > >>> >> >>> we
> > > > > > > >>> >> >>> have to increase all task managers to make the heavy
> > one
> > > > > more
> > > > > > > >>> stable.
> > > > > > > >>> >> So
> > > > > > > >>> >> >>> we
> > > > > > > >>> >> >>> will benefit from the fine grained resource
> > management a
> > > > > lot.
> > > > > > We
> > > > > > > >>> could
> > > > > > > >>> >> >>> get
> > > > > > > >>> >> >>> better resource utilization and stability.
> > > > > > > >>> >> >>>
> > > > > > > >>> >> >>>
> > > > > > > >>> >> >>> Just to share some thoughts.
> > > > > > > >>> >> >>>
> > > > > > > >>> >> >>>
> > > > > > > >>> >> >>>
> > > > > > > >>> >> >>>    1. How to calculate the resource specification of
> > > > > > > >>> TaskManagers? Do
> > > > > > > >>> >> >>> they
> > > > > > > >>> >> >>>    have them same resource spec calculated based on
> > the
> > > > > > > >>> >> configuration? I
> > > > > > > >>> >> >>> think
> > > > > > > >>> >> >>>    we still have wasted resources in this situation.
> > Or
> > > we
> > > > > > could
> > > > > > > >>> start
> > > > > > > >>> >> >>>    TaskManagers with different spec.
> > > > > > > >>> >> >>>    2. If a slot is released and returned to
> SlotPool,
> > > does
> > > > > it
> > > > > > > >>> could be
> > > > > > > >>> >> >>>    reused by other SlotRequest that the request
> > resource
> > > > is
> > > > > > > >>> smaller
> > > > > > > >>> >> than
> > > > > > > >>> >> >>> it?
> > > > > > > >>> >> >>>       - If it is yes, what happens to the available
> > > > resource
> > > > > > in
> > > > > > > >>> the
> > > > > > > >>> >> >>>       TaskManager.
> > > > > > > >>> >> >>>       - What is the SlotStatus of the cached slot in
> > > > > SlotPool?
> > > > > > > The
> > > > > > > >>> >> >>>       AllocationId is null?
> > > > > > > >>> >> >>>    3. In a session cluster, some jobs are configured
> > > with
> > > > > > > operator
> > > > > > > >>> >> >>>    resources, meanwhile other jobs are using
> UNKNOWN.
> > > How
> > > > to
> > > > > > > deal
> > > > > > > >>> with
> > > > > > > >>> >> >>> this
> > > > > > > >>> >> >>>    situation?
> > > > > > > >>> >> >>>
> > > > > > > >>> >> >>>
> > > > > > > >>> >> >>>
> > > > > > > >>> >> >>> Best,
> > > > > > > >>> >> >>> Yang
> > > > > > > >>> >> >>>
> > > > > > > >>> >> >>> Xintong Song <[hidden email]> 于2019年8月16日周五
> > > > > 下午8:57写道:
> > > > > > > >>> >> >>>
> > > > > > > >>> >> >>> > Thanks for the feedbacks, Yangze and Till.
> > > > > > > >>> >> >>> >
> > > > > > > >>> >> >>> > Yangze,
> > > > > > > >>> >> >>> >
> > > > > > > >>> >> >>> > I agree with you that we should make scheduling
> > > strategy
> > > > > > > >>> pluggable
> > > > > > > >>> >> and
> > > > > > > >>> >> >>> > optimize the strategy to reduce the memory
> > > fragmentation
> > > > > > > >>> problem,
> > > > > > > >>> >> and
> > > > > > > >>> >> >>> > thanks for the inputs on the potential algorithmic
> > > > > > solutions.
> > > > > > > >>> >> However,
> > > > > > > >>> >> >>> I'm
> > > > > > > >>> >> >>> > in favor of keep this FLIP focusing on the overall
> > > > > mechanism
> > > > > > > >>> design
> > > > > > > >>> >> >>> rather
> > > > > > > >>> >> >>> > than strategies. Solving the fragmentation issue
> > > should
> > > > be
> > > > > > > >>> >> considered
> > > > > > > >>> >> >>> as an
> > > > > > > >>> >> >>> > optimization, and I agree with Till that we
> probably
> > > > > should
> > > > > > > >>> tackle
> > > > > > > >>> >> this
> > > > > > > >>> >> >>> > afterwards.
> > > > > > > >>> >> >>> >
> > > > > > > >>> >> >>> > Till,
> > > > > > > >>> >> >>> >
> > > > > > > >>> >> >>> > - Regarding splitting the FLIP, I think it makes
> > > sense.
> > > > > The
> > > > > > > >>> operator
> > > > > > > >>> >> >>> > resource management and dynamic slot allocation do
> > not
> > > > > have
> > > > > > > much
> > > > > > > >>> >> >>> dependency
> > > > > > > >>> >> >>> > on each other.
> > > > > > > >>> >> >>> >
> > > > > > > >>> >> >>> > - Regarding the default slot size, I think this is
> > > > similar
> > > > > > to
> > > > > > > >>> >> FLIP-49
> > > > > > > >>> >> >>> [1]
> > > > > > > >>> >> >>> > where we want all the deriving happens at one
> > place. I
> > > > > think
> > > > > > > it
> > > > > > > >>> >> would
> > > > > > > >>> >> >>> be
> > > > > > > >>> >> >>> > nice to pass the default slot size into the task
> > > > executor
> > > > > in
> > > > > > > the
> > > > > > > >>> >> same
> > > > > > > >>> >> >>> way
> > > > > > > >>> >> >>> > that we pass in the memory pool sizes in FLIP-49
> > [1].
> > > > > > > >>> >> >>> >
> > > > > > > >>> >> >>> > - Regarding the return value of
> > > > > > > >>> >> TaskExecutorGateway#requestResource, I
> > > > > > > >>> >> >>> > think you're right. We should avoid using null as
> > the
> > > > > return
> > > > > > > >>> value.
> > > > > > > >>> >> I
> > > > > > > >>> >> >>> think
> > > > > > > >>> >> >>> > we probably should thrown an exception here.
> > > > > > > >>> >> >>> >
> > > > > > > >>> >> >>> > Thank you~
> > > > > > > >>> >> >>> >
> > > > > > > >>> >> >>> > Xintong Song
> > > > > > > >>> >> >>> >
> > > > > > > >>> >> >>> >
> > > > > > > >>> >> >>> > [1]
> > > > > > > >>> >> >>> >
> > > > > > > >>> >> >>> >
> > > > > > > >>> >> >>>
> > > > > > > >>> >>
> > > > > > > >>>
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-49%3A+Unified+Memory+Configuration+for+TaskExecutors
> > > > > > > >>> >> >>> >
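
To illustrate the point about TaskExecutorGateway#requestResource, here is a
minimal sketch of how an explicit failure could look instead of a null return
value. The names other than requestResource (ResourceRequestGateway,
SlotDescriptor, ResourceNotAvailableException) are assumptions for this
example, not the actual FLIP-53 interfaces:

    import java.util.concurrent.CompletableFuture;

    // Sketch only: the gateway either completes the future with a slot
    // descriptor or fails it with a dedicated exception, never with null.
    interface ResourceRequestGateway {
        CompletableFuture<SlotDescriptor> requestResource(double cpuCores, int memoryMB);
    }

    // Hypothetical value describing the slot that was cut from the task executor.
    record SlotDescriptor(double cpuCores, int memoryMB) {}

    // Hypothetical exception signalling that the request cannot be fulfilled.
    class ResourceNotAvailableException extends Exception {
        ResourceNotAvailableException(String message) {
            super(message);
        }
    }

    class SketchTaskExecutor implements ResourceRequestGateway {
        private double freeCpu = 2.0;
        private int freeMemoryMB = 2048;

        @Override
        public CompletableFuture<SlotDescriptor> requestResource(double cpuCores, int memoryMB) {
            if (cpuCores > freeCpu || memoryMB > freeMemoryMB) {
                // Fail explicitly instead of returning null.
                return CompletableFuture.failedFuture(
                        new ResourceNotAvailableException("not enough free resources"));
            }
            freeCpu -= cpuCores;
            freeMemoryMB -= memoryMB;
            return CompletableFuture.completedFuture(new SlotDescriptor(cpuCores, memoryMB));
        }
    }

Failing the future (or throwing) makes the "cannot fulfill" case explicit to
the caller, which matches the suggestion of either a dedicated return value or
an exception.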
> >
> > On Fri, Aug 16, 2019 at 2:18 PM Till Rohrmann <[hidden email]> wrote:
> > > Hi Xintong,
> > >
> > > thanks for drafting this FLIP. I think your proposal helps to execute
> > > batch jobs more efficiently. Moreover, it enables the proper integration
> > > of the Blink planner, which is very important as well.
> > >
> > > Overall, the FLIP looks good to me. I was wondering whether it wouldn't
> > > make sense to actually split it up into two FLIPs: operator resource
> > > management and dynamic slot allocation. I think these two FLIPs could be
> > > seen as orthogonal, and it would decrease the scope of each individual
> > > FLIP.
> > >
> > > Some smaller comments:
> > >
> > > - I'm not sure whether we should pass in the default slot size via an
> > > environment variable. Without having unified the way Flink components are
> > > configured [1], I think it would be better to pass it in as part of the
> > > configuration.
> > > - I would avoid returning a null value from
> > > TaskExecutorGateway#requestResource if it cannot be fulfilled. Either we
> > > should introduce an explicit return value saying this or throw an
> > > exception.
> > >
> > > Concerning Yangze's comments: I think you are right that it would be
> > > helpful to make the selection strategy pluggable. Also, batching slot
> > > requests to the RM could be a good optimization. For the sake of keeping
> > > the scope of this FLIP smaller, I would try to tackle these things after
> > > the initial version has been completed (without spoiling these
> > > optimization opportunities). In particular, batching the slot requests
> > > depends on the current scheduler refactoring and could also be realized
> > > on the RM side only.
> > >
> > > [1]
> > > https://cwiki.apache.org/confluence/display/FLINK/FLIP-54%3A+Evolve+ConfigOption+and+Configuration
> > >
> > > Cheers,
> > > Till
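
Just to sketch what a pluggable selection strategy could look like: the example
below models a task executor only by its free memory, and none of the names
(TaskExecutorMatchingStrategy, FirstFitStrategy, BestFitStrategy) are actual
Flink interfaces; they are made up for illustration:

    import java.util.List;
    import java.util.Optional;

    // Sketch only: the strategy returns the index of a matching task executor,
    // or empty if none has enough free memory.
    interface TaskExecutorMatchingStrategy {
        Optional<Integer> findMatching(int requestedMemoryMB, List<Integer> freeMemoryMB);
    }

    // First-fit: take the first task executor with enough free memory.
    class FirstFitStrategy implements TaskExecutorMatchingStrategy {
        @Override
        public Optional<Integer> findMatching(int requestedMemoryMB, List<Integer> freeMemoryMB) {
            for (int i = 0; i < freeMemoryMB.size(); i++) {
                if (freeMemoryMB.get(i) >= requestedMemoryMB) {
                    return Optional.of(i);
                }
            }
            return Optional.empty();
        }
    }

    // Best-fit: take the task executor whose free memory exceeds the request by
    // the smallest margin, which tends to reduce fragmentation.
    class BestFitStrategy implements TaskExecutorMatchingStrategy {
        @Override
        public Optional<Integer> findMatching(int requestedMemoryMB, List<Integer> freeMemoryMB) {
            int bestIndex = -1;
            int bestLeftover = Integer.MAX_VALUE;
            for (int i = 0; i < freeMemoryMB.size(); i++) {
                int leftover = freeMemoryMB.get(i) - requestedMemoryMB;
                if (leftover >= 0 && leftover < bestLeftover) {
                    bestLeftover = leftover;
                    bestIndex = i;
                }
            }
            return bestIndex >= 0 ? Optional.of(bestIndex) : Optional.empty();
        }
    }

Swapping the strategy implementation (for example, behind a configuration
option) would not require changes to the rest of the scheduling code, which is
the point of making it pluggable.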