[DISCUSS] FLIP 116: Unified Memory Configuration for Job Managers

Andrey Zagrebin-4
Hi All,

As you may have noticed, the 1.10 release included extensive improvements to
memory management and configuration of Task Managers, FLIP-49 [1]. The
memory configuration of Job Managers was not touched in 1.10.

Although the Job Manager's memory model is not as sophisticated as the
Task Managers', it makes sense to align the Job Manager memory model and
settings with those of Task Managers. Therefore, we propose to reconsider it
as well in 1.11, and I have prepared FLIP 116 [2] for that.

Any feedback is appreciated.

So far, there is one discussion point: how to address native non-direct
memory usage of user code. User code can run within the JM process, e.g. in
certain job submission scenarios. For simplicity, the FLIP proposes only an
option for direct memory, which is translated into the JVM direct memory
limit.
Although we documented for the TM that similar parameters can also account
for native non-direct memory usage [3], this can make the JVM direct memory
limit work incorrectly: the limit is set to the whole option value, even
though part of that budget may actually be consumed by native allocations
the JVM does not track. The direct memory option in the JM could also be
given a more general name, e.g. off-heap memory, but that name would partly
hide the fact that it sets the JVM direct memory limit.
JVM Overhead, on the other hand, does not suffer from this problem: it
affects only the container/worker memory size, which is what matters most
when addressing native non-direct memory consumption. The caveat here is
that JVM Overhead was not intended to be used by any Flink or user
components.
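A minimal Java sketch of the mapping described above, assuming the JM simply turns each memory component into a JVM startup flag. The class and method names and the example sizes are hypothetical, not Flink's implementation; only `-Xmx`, `-Xms`, `-XX:MaxDirectMemorySize` and `-XX:MaxMetaspaceSize` are standard HotSpot options, and capping Metaspace with its own flag is an assumption here. It shows why JVM Overhead touches only the container/worker size: there is no JVM flag for it.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch (not Flink's actual code): maps the JM memory
// components proposed in FLIP-116 to JVM startup flags.
class JmJvmArgsSketch {

    static final long MB = 1024L * 1024L;

    // Assumed JM memory components, all in bytes.
    static final class JmMemory {
        final long heap;
        final long direct;      // the proposed direct memory option
        final long metaspace;
        final long jvmOverhead;

        JmMemory(long heap, long direct, long metaspace, long jvmOverhead) {
            this.heap = heap;
            this.direct = direct;
            this.metaspace = metaspace;
            this.jvmOverhead = jvmOverhead;
        }

        // Memory requested for the container/worker process.
        long totalProcessSize() {
            return heap + direct + metaspace + jvmOverhead;
        }
    }

    static List<String> jvmArgs(JmMemory m) {
        List<String> args = new ArrayList<>();
        // Fixed-size heap; a plain number is interpreted as bytes.
        args.add("-Xmx" + m.heap);
        args.add("-Xms" + m.heap);
        // The direct memory option becomes the JVM direct memory limit.
        args.add("-XX:MaxDirectMemorySize=" + m.direct);
        // Assumption: Metaspace is capped with its own JVM flag.
        args.add("-XX:MaxMetaspaceSize=" + m.metaspace);
        // JVM Overhead has no corresponding flag: it only increases
        // totalProcessSize(), i.e. the container/worker size.
        return args;
    }

    public static void main(String[] args) {
        // Illustrative values only.
        JmMemory m = new JmMemory(1024 * MB, 128 * MB, 256 * MB, 192 * MB);
        System.out.println(String.join(" ", jvmArgs(m)));
        System.out.println("container size: " + m.totalProcessSize() / MB + " MB");
    }
}
```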

Thanks,
Andrey

[1] https://cwiki.apache.org/confluence/display/FLINK/FLIP-49%3A+Unified+Memory+Configuration+for+TaskExecutors
[2] https://cwiki.apache.org/confluence/display/FLINK/FLIP+116%3A+Unified+Memory+Configuration+for+Job+Managers
[3] https://ci.apache.org/projects/flink/flink-docs-release-1.10/ops/memory/mem_detail.html#overview

Re: [DISCUSS] FLIP 116: Unified Memory Configuration for Job Managers

Xintong Song
Thanks, Andrey, for kicking off this discussion.

Regarding "direct" vs. "off-heap", I'm personally in favor of renaming the
"direct" memory in the current FLIP-116 [1] to "off-heap" memory, and making
it also account for user native memory usage.

On the one hand, I think it would be good for JM & TM to provide consistent
concepts and terminology to users. IIUC, this is exactly the purpose of
this FLIP. For TMs, we already have "off-heap" memory accounting for both
direct and native memory usage, and we did this so that users do not need
to understand the differences between the two kinds.

On the other hand, while for TMs it is hard to tell which kind of memory is
needed, mostly due to the variety of applications, I believe that for the JM
the major memory consumer is heap memory in most cases. That means we can
probably rely on heap activity to trigger GC in most cases, with the max
direct memory limit acting as a safety net. Moreover, I think cases where
user code needs native memory should be very rare. Therefore, we probably
should not break JM/TM consistency over potential risks in such rare
cases.
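For context on the direct vs. native distinction: the JVM accounts only for the direct buffers it allocates itself (e.g. via `ByteBuffer.allocateDirect`), and that accounting is what `-XX:MaxDirectMemorySize` caps; memory that user code allocates natively, e.g. through JNI libraries, is invisible to it. The sketch below, with a hypothetical class name, reads the standard `BufferPoolMXBean` named "direct" to observe that accounting; by construction it cannot observe native allocations. Buffers from `ByteBuffer.allocateDirect` are released only when the garbage collector reclaims the owning `ByteBuffer` objects, which is why heap activity matters for the safety-net argument.

```java
import java.lang.management.BufferPoolMXBean;
import java.lang.management.ManagementFactory;
import java.nio.ByteBuffer;

// Hypothetical sketch: observe the JVM's own accounting of direct buffers.
class DirectMemoryTrackingSketch {

    // Total capacity of direct ByteBuffers not yet freed, as tracked by the
    // JVM; returns -1 if no "direct" buffer pool is exposed.
    static long directBytesInUse() {
        for (BufferPoolMXBean pool : ManagementFactory.getPlatformMXBeans(BufferPoolMXBean.class)) {
            if ("direct".equals(pool.getName())) {
                return pool.getTotalCapacity();
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        long before = directBytesInUse();
        // This allocation is counted against -XX:MaxDirectMemorySize.
        ByteBuffer buffer = ByteBuffer.allocateDirect(1 << 20);
        long after = directBytesInUse();
        System.out.println("direct buffer capacity grew by " + (after - before)
            + " bytes for a buffer of " + buffer.capacity() + " bytes");
        // Memory malloc'ed by native libraries would not show up here at all.
    }
}
```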

WDYT?

Thank you~

Xintong Song


[1] https://cwiki.apache.org/confluence/display/FLINK/FLIP+116%3A+Unified+Memory+Configuration+for+Job+Managers

On Wed, Mar 11, 2020 at 8:56 PM Andrey Zagrebin <[hidden email]>
wrote:

Re: [DISCUSS] FLIP 116: Unified Memory Configuration for Job Managers

Till Rohrmann
Thanks for creating this FLIP, Andrey.

I agree with Xintong that we should rename jobmanager.memory.direct.size
to jobmanager.memory.off-heap.size, which accounts for both native and direct
memory usage. I think it should be good enough, and it is easier for users
to understand.

Concerning the default value for the metaspace size: did we take the
lessons learned from the TM metaspace size into account? IIRC, we are about
to change that default value to 256 MB.

Feel free to start a vote once these last two questions have been resolved.

Cheers,
Till

On Thu, Mar 12, 2020 at 4:25 AM Xintong Song <[hidden email]> wrote:

Re: [DISCUSS] FLIP 116: Unified Memory Configuration for Job Managers

Andrey Zagrebin-4
Hi all,

Thanks for the feedback, Xintong and Till.

> rename jobmanager.memory.direct.size into jobmanager.memory.off-heap.size

I am OK with that, as it aligns the JM with the TM and avoids further
complications for users.
I will adjust the FLIP.

> change the default value of JM Metaspace size to 256 MB

Indeed, there is no reason to assume that user code would need less
Metaspace in the JM.
I will change it unless someone brings a better argument for another value.

I think all concerns have been resolved, so I am starting the vote in a
separate thread.

Best,
Andrey

On Tue, Mar 17, 2020 at 6:16 PM Till Rohrmann <[hidden email]> wrote:

Re: [DISCUSS] FLIP 116: Unified Memory Configuration for Job Managers

Andrey Zagrebin-4
Hi all,

One more thing to mention: the current calculations can lead to an
arbitrarily small JVM heap, maybe even zero.
I suggest introducing a check that at least recommends setting the JVM
heap to, e.g., 128 MB.

Additionally, we could demand some minimum value for the JM to function
and fail if it is not fulfilled.
We could experiment to find the working minimum, but it is hard to come
up with this limit because it again depends on the job and environment.
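To illustrate how the heap can end up arbitrarily small, here is a hypothetical sketch that derives the heap from the total process size by subtracting the other components. It assumes fixed sizes for Metaspace, JVM Overhead and off-heap memory (in the TM model, JVM Overhead is a fraction of the total bounded by a min and max, which is ignored here) and rejects only a negative remainder, so zero or a few MB pass unnoticed. Names and numbers are illustrative.

```java
// Hypothetical sketch (not Flink's actual code): derive the JM heap from the
// total process size by subtracting the other, fixed-size components.
class DerivedHeapSketch {

    static final long MB = 1024L * 1024L;

    static long derivedHeap(long processSize, long metaspace, long jvmOverhead, long offHeap) {
        long heap = processSize - metaspace - jvmOverhead - offHeap;
        if (heap < 0) {
            // The other components alone exceed the total: clearly invalid.
            throw new IllegalArgumentException(
                "Configured components exceed the total process size by " + (-heap / MB) + " MB");
        }
        // Zero or a few MB is accepted here, which is exactly the problem.
        return heap;
    }

    public static void main(String[] args) {
        long metaspace = 256 * MB;
        long overhead = 192 * MB;
        long offHeap = 128 * MB;
        System.out.println(derivedHeap(1600 * MB, metaspace, overhead, offHeap) / MB); // 1024
        System.out.println(derivedHeap(600 * MB, metaspace, overhead, offHeap) / MB);  // 24
        System.out.println(derivedHeap(576 * MB, metaspace, overhead, offHeap) / MB);  // 0
    }
}
```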

Best,
Andrey

On Wed, Mar 18, 2020 at 5:03 PM Andrey Zagrebin <[hidden email]>
wrote:

Re: [DISCUSS] FLIP 116: Unified Memory Configuration for Job Managers

Xintong Song
I think recommending a minimum value in the docs and logging a warning if
the heap size is too small should be good enough.
I am not sure about failing if the minimum heap is not fulfilled. As already
mentioned, it would be hard to determine the minimum heap size. And if we
made the minimum heap configurable, then in any case where users need to
configure the minimum heap, they could configure the heap size directly.
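A minimal sketch of the warn-only check proposed here, assuming the 128 MB recommendation mentioned earlier in the thread. The class and method names are hypothetical, and `java.util.logging` stands in for whatever logging framework Flink actually uses.

```java
import java.util.logging.Logger;

// Hypothetical sketch (not Flink's actual code): recommend a minimum JM heap
// and only log a warning when it is not met.
class MinHeapCheckSketch {

    private static final Logger LOG = Logger.getLogger(MinHeapCheckSketch.class.getName());

    static final long MB = 1024L * 1024L;
    // Recommended, not enforced; 128 MB is the example value from the thread.
    static final long RECOMMENDED_MIN_HEAP = 128 * MB;

    // Returns true if the heap meets the recommendation; otherwise logs a
    // warning and returns false, but never aborts startup.
    static boolean checkHeap(long heapBytes) {
        if (heapBytes >= RECOMMENDED_MIN_HEAP) {
            return true;
        }
        LOG.warning(String.format(
            "The JobManager JVM heap (%d MB) is below the recommended minimum of %d MB; "
                + "consider increasing the total memory or configuring the heap size explicitly.",
            heapBytes / MB, RECOMMENDED_MIN_HEAP / MB));
        return false;
    }

    public static void main(String[] args) {
        System.out.println(checkHeap(1024 * MB)); // true
        System.out.println(checkHeap(24 * MB));   // false, plus a logged warning
    }
}
```

If many users turned out to hit this, escalating the warning into a failure would amount to throwing an exception instead of logging.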

Thank you~

Xintong Song



On Wed, Mar 18, 2020 at 10:55 PM Andrey Zagrebin <[hidden email]>
wrote:

Re: [DISCUSS] FLIP 116: Unified Memory Configuration for Job Managers

Till Rohrmann
I agree with Xintong's proposal. If we see that many users run into this
problem, then one could think about escalating the warning message into a
failure.

Cheers,
Till

On Thu, Mar 19, 2020 at 4:23 AM Xintong Song <[hidden email]> wrote:

> I think recommend a minimum value in docs and throw a warning if the heap
> size is too small should be good enough.
> Not sure about failing job if the min heap is not fulfilled. As already
> mentioned, it would be hard to determine the min heap size. And if we make
> the min heap configurable, then in any case that users need to configure
> the min heap, they can configure the heap size directly.
>
> Thank you~
>
> Xintong Song
>
>
>
> On Wed, Mar 18, 2020 at 10:55 PM Andrey Zagrebin <[hidden email]>
> wrote:
>
> > Hi all,
> >
> > One thing more thing to mention, the current calculations can lead to
> > arbitrary small JVM Heap, maybe even zero.
> > I suggest to introduce a check where we at least recommend to set the JVM
> > heap to e.g. 128Mb.
> >
> > Additionally, we can demand some minimum value to function and fail if it
> > is not fulfilled.
> > We could experiment with what is the working minimum but It is hard to
> come
> > up with this limit because it again can depend on the job and
> environment.
> >
> > Best,
> > Andrey
> >
> > On Wed, Mar 18, 2020 at 5:03 PM Andrey Zagrebin <[hidden email]>
> > wrote:
> >
> > > Hi all,
> > >
> > > Thanks for the feedback, Xintong and Till.
> > >
> > > > rename jobmanager.memory.direct.size into
> > jobmanager.memory.off-heap.size
> > >
> > > I am ok with that to align it with TM and avoid further complications
> for
> > > users.
> > > I will adjust the FLIP.
> > >
> > > > change the default value of JM Metaspace size to 256 MB
> > >
> > > Indeed, no reason to assume that the user code would need less
> Metaspace
> > > in JM.
> > > I will change it unless a better argument is reported for another
> value.
> > >
> > > I think all concerns has been resolved so I am starting the voting in a
> > > separate thread.
> > >
> > > Best,
> > > Andrey
> > >
> > > On Tue, Mar 17, 2020 at 6:16 PM Till Rohrmann <[hidden email]>
> > > wrote:
> > >
> > >> Thanks for creating this FLIP Andrey.
> > >>
> > >> I agree with Xintong that we should rename
> jobmanager.memory.direct.size
> > >> into jobmanager.memory.off-heap.size which accounts for native and
> > direct
> > >> memory usage. I think it should be good enough and is easier to
> > understand
> > >> for the user.
> > >>
> > >> Concerning the default value for the metaspace size. Did we take the
> > >> lessons learned from the TM metaspace size into account? IIRC we are
> > about
> > >> to change the default value to 256 MB.
> > >>
> > >> Feel free to start a vote once these last two questions have been
> > >> resolved.
> > >>
> > >> Cheers,
> > >> Till
> > >>
> > >> On Thu, Mar 12, 2020 at 4:25 AM Xintong Song <[hidden email]>
> > >> wrote:
> > >>
> > >> > Thanks Andrey for kicking this discussion off.
> > >> >
> > >> > Regarding "direct" vs. "off-heap", I'm personally in favor of
> renaming
> > >> the
> > >> > "direct" memory in the current FLIP-116[1] to "off-heap" memory, and
> > >> making
> > >> > it also account for user native memory usage.
> > >> >
> > >> > On one hand, I think it would be good that JM & TM provide
> consistent
> > >> > concepts and terminologies to users. IIUC, this is exactly the
> purpose
> > >> of
> > >> > this FLIP. For TMs, we already have "off-heap" memory accounting for
> > >> both
> > >> > direct and native memory usages, and we did this so that users do
> not
> > >> need
> > >> > to understand the differences between the two kinds.
> > >> >
> > >> > On the other hand, while for TMs it is hard to tell which kind of
> > >> memory is
> > >> > needed mostly due to variety of applications, I believe for JM the
> > major
> > >> > memory consumption is heap memory in most cases. That means we
> > probably
> > >> can
> > >> > rely on the heap activities to trigger GC in most cases, and the max
> > >> direct
> > >> > memory limit can act as a safe net. Moreover, I think the cases
> should
> > >> be
> > >> > very rare that we need native memory for user codes. Therefore, we
> > >> probably
> > >> > should not break the JM/TM consistency for potential risks in such
> > rare
> > >> > cases.
> > >> >
> > >> > WDYT?
> > >> >
> > >> > Thank you~
> > >> >
> > >> > Xintong Song
> > >> >
> > >> >
> > >> > [1]
> > >> >
> > >> >
> > >>
> >
> https://cwiki.apache.org/confluence/display/FLINK/FLIP+116%3A+Unified+Memory+Configuration+for+Job+Managers
> > >> >
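To make the "direct" vs. "off-heap" point above more concrete, here is a minimal Python sketch, assuming a simplified JM model with only four components (JVM heap, off-heap, JVM metaspace, JVM overhead) and assuming the off-heap size is passed to the JVM as its direct memory limit. The 256 MB metaspace default is the value proposed later in this thread; the helper names `jvm_args` and `process_size_mb` and the example sizes are hypothetical and not Flink code. The JVM flags themselves (`-Xmx`, `-Xms`, `-XX:MaxDirectMemorySize`, `-XX:MaxMetaspaceSize`) are standard HotSpot options.

```python
# Hypothetical sketch: map simplified JM memory components (in MB) to JVM flags.
# Assumes off-heap size == JVM direct memory limit and a 256 MB metaspace default.
from typing import List


def jvm_args(heap_mb: int, off_heap_mb: int, metaspace_mb: int = 256) -> List[str]:
    """Return JVM flags for the given (hypothetical) JM memory components."""
    return [
        f"-Xmx{heap_mb}m",  # maximum JVM heap
        f"-Xms{heap_mb}m",  # initial heap set equal to the maximum
        # Caps direct ByteBuffer allocations only; native memory allocated by
        # user code (e.g. through JNI libraries) is NOT limited by this flag.
        f"-XX:MaxDirectMemorySize={off_heap_mb}m",
        f"-XX:MaxMetaspaceSize={metaspace_mb}m",  # caps class metadata
    ]


def process_size_mb(heap_mb: int, off_heap_mb: int, metaspace_mb: int, overhead_mb: int) -> int:
    """Total size the container/worker must provide, including JVM Overhead,
    which the JVM itself does not cap."""
    return heap_mb + off_heap_mb + metaspace_mb + overhead_mb


if __name__ == "__main__":
    print(" ".join(jvm_args(heap_mb=1024, off_heap_mb=128)))
    print(process_size_mb(1024, 128, 256, 192))
```

Running it prints `-Xmx1024m -Xms1024m -XX:MaxDirectMemorySize=128m -XX:MaxMetaspaceSize=256m` and `1600`. The trade-off discussed above is visible here: folding native usage into "off-heap" makes the direct memory limit looser than strictly needed, but the native bytes are still counted in the process size the container must provide.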

Re: [DISCUSS] FLIP 116: Unified Memory Configuration for Job Managers

Andrey Zagrebin-5
Alright, thanks for the feedback. I also agree with it. Then this is resolved.

> On 19 Mar 2020, at 14:14, Till Rohrmann <[hidden email]> wrote:
>
> I agree with Xintong's proposal. If we see that many users run into this
> problem, then one could think about escalating the warning message into a
> failure.
>
> Cheers,
> Till
>
> On Thu, Mar 19, 2020 at 4:23 AM Xintong Song <[hidden email]> wrote:
>
>> I think recommending a minimum value in the docs and logging a warning if
>> the heap size is too small should be good enough.
>> I'm not sure about failing the job if the minimum heap is not fulfilled. As
>> already mentioned, it would be hard to determine the minimum heap size. And
>> if we make the minimum heap configurable, then in any case where users need
>> to configure it, they could just configure the heap size directly.
>>
>> Thank you~
>>
>> Xintong Song
>>
>>
>>
>> On Wed, Mar 18, 2020 at 10:55 PM Andrey Zagrebin <[hidden email]>
>> wrote:
>>
>>> Hi all,
>>>
>>> One more thing to mention: the current calculations can lead to an
>>> arbitrarily small JVM heap, maybe even zero.
>>> I suggest introducing a check where we at least recommend setting the JVM
>>> heap to e.g. 128 MB.
>>>
>>> Additionally, we could demand some minimum value to function and fail if
>>> it is not fulfilled.
>>> We could experiment with what the working minimum is, but it is hard to
>>> come up with this limit because it again can depend on the job and
>>> environment.
>>>
>>> Best,
>>> Andrey
>>>
>>> On Wed, Mar 18, 2020 at 5:03 PM Andrey Zagrebin <[hidden email]>
>>> wrote:
>>>
>>>> Hi all,
>>>>
>>>> Thanks for the feedback, Xintong and Till.
>>>>
>>>>> rename jobmanager.memory.direct.size into jobmanager.memory.off-heap.size
>>>>
>>>> I am OK with that, to align it with the TM and avoid further complications
>>>> for users.
>>>> I will adjust the FLIP.
>>>>
>>>>> change the default value of JM Metaspace size to 256 MB
>>>>
>>>> Indeed, there is no reason to assume that the user code would need less
>>>> Metaspace in the JM.
>>>> I will change it unless a better argument is made for another value.
>>>>
>>>> I think all concerns have been resolved, so I am starting the vote in a
>>>> separate thread.
>>>>
>>>> Best,
>>>> Andrey
>>>>
>>>> On Tue, Mar 17, 2020 at 6:16 PM Till Rohrmann <[hidden email]>
>>>> wrote:
>>>>
>>>>> Thanks for creating this FLIP Andrey.
>>>>>
>>>>> I agree with Xintong that we should rename jobmanager.memory.direct.size
>>>>> into jobmanager.memory.off-heap.size, which accounts for both native and
>>>>> direct memory usage. I think it should be good enough and is easier for
>>>>> users to understand.
>>>>>
>>>>> Concerning the default value for the metaspace size: did we take the
>>>>> lessons learned from the TM metaspace size into account? IIRC we are
>>>>> about to change the default value to 256 MB.
>>>>>
>>>>> Feel free to start a vote once these last two questions have been
>>>>> resolved.
>>>>>
>>>>> Cheers,
>>>>> Till
>>>>>
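The outcome of the minimum-heap sub-thread (recommend a minimum in the docs and log a warning rather than fail the job) could look roughly like the following Python sketch. It assumes, as a simplification, that the JVM heap is whatever remains of the configured total Flink memory after off-heap memory is subtracted. The 128 MB threshold is the example value from the thread; `derive_heap_mb` and `RECOMMENDED_MIN_HEAP_MB` are hypothetical names, and this is not Flink's actual derivation logic.

```python
import logging

LOG = logging.getLogger("jm-memory-sketch")

# Example threshold suggested in the thread: a recommendation, not a hard limit.
RECOMMENDED_MIN_HEAP_MB = 128


def derive_heap_mb(total_flink_mb: int, off_heap_mb: int) -> int:
    """Hypothetical derivation: JVM heap = total Flink memory - off-heap memory.

    A heap below the recommended minimum only logs a warning (the option agreed
    on in the thread); a configuration that cannot fit at all is rejected.
    """
    heap_mb = total_flink_mb - off_heap_mb
    if heap_mb < 0:
        raise ValueError(
            f"off-heap memory ({off_heap_mb} MB) exceeds "
            f"total Flink memory ({total_flink_mb} MB)"
        )
    if heap_mb < RECOMMENDED_MIN_HEAP_MB:
        LOG.warning(
            "Derived JVM heap is %d MB, below the recommended minimum of %d MB; "
            "consider configuring the heap size explicitly.",
            heap_mb,
            RECOMMENDED_MIN_HEAP_MB,
        )
    return heap_mb


if __name__ == "__main__":
    logging.basicConfig(level=logging.INFO)
    print(derive_heap_mb(1024, 128))  # no warning
    print(derive_heap_mb(200, 128))   # logs a warning
```

Escalating the warning into a failure, as Till suggested might be done later if many users hit this problem, would amount to raising an exception in place of the `LOG.warning` call.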