Feifan Wang created FLINK-21986:
-----------------------------------

Summary: taskmanager native memory not release timely after restart
Key: FLINK-21986
URL: https://issues.apache.org/jira/browse/FLINK-21986
Project: Flink
Issue Type: Bug
Components: Runtime / State Backends
Affects Versions: 1.12.1
Environment: flink version: 1.12.1
    run: yarn session
    job type: mock source -> regular join
    checkpoint interval: 3m
    Taskmanager memory: 16G
Reporter: Feifan Wang
Attachments: image-2021-03-25-15-53-44-214.png, image-2021-03-25-16-07-29-083.png, image-2021-03-26-11-46-06-828.png, image-2021-03-26-11-47-21-388.png

I ran a regular join job on Flink 1.12.1 and found that TaskManager native memory is not released promptly after a restart caused by exceeding the tolerable checkpoint failure threshold.

*problem job information:*
# The job first restarted because the tolerable checkpoint failure threshold was exceeded.
# After that, the TaskManagers were killed by YARN many times.
# In this case the TM heap is set to 7.68G, but the heap size of every TM stayed under 4.2G.
!image-2021-03-25-15-53-44-214.png|width=496,height=103!
# The non-heap size increased after the restart, but stayed under 160M.
!https://km.sankuai.com/api/file/cdn/706284607/716474606?contentType=1&isNewContent=false&isNewContent=false|width=493,height=102!
# TaskManager process memory increased by 3-4G after the restart (this figure shows one of the TaskManagers).
!image-2021-03-25-16-07-29-083.png|width=493,height=107!

*my guess:*

The [RocksDB wiki|https://github.com/facebook/rocksdb/wiki/RocksJava-Basics#memory-management] mentions: "Many of the Java Objects used in the RocksJava API will be backed by C++ objects for which the Java Objects have ownership. As C++ has no notion of automatic garbage collection for its heap in the way that Java does, we must explicitly free the memory used by the C++ objects when we are finished with them."

So, is it possible that RocksDBStateBackend does not call AbstractNativeReference#close() to release the memory used by the RocksDB C++ objects?

*I made a change:*

Actively call System.gc() and System.runFinalization() every minute.

*And ran the test again:*
# TaskManager process memory showed no obvious increase.
!image-2021-03-26-11-46-06-828.png|width=495,height=93!
# The job ran for several days and restarted many times, but no TaskManager was killed by YARN as before.

*Summary:*
# There is some native memory that is not released promptly after a restart in this situation.
# I guess it may be RocksDB C++ objects, but I have not checked this against the source code of RocksDBStateBackend.

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
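For readers who want to reproduce the experiment, the workaround described in the report (calling System.gc() and System.runFinalization() every minute) could look roughly like the sketch below. The class name PeriodicGcTrigger and its methods are hypothetical and are not part of Flink or of any patch for this issue; only the two System calls come from the report. It is meant as a diagnostic stopgap, not as a fix for the underlying leak.

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicLong;

/** Hypothetical helper (not a Flink class): periodically requests GC and finalization. */
final class PeriodicGcTrigger implements AutoCloseable {
    private final ScheduledExecutorService scheduler;
    private final AtomicLong runs = new AtomicLong();

    PeriodicGcTrigger(long period, TimeUnit unit) {
        // Daemon thread so the trigger never keeps the JVM alive on its own.
        scheduler = Executors.newSingleThreadScheduledExecutor(r -> {
            Thread t = new Thread(r, "periodic-gc-trigger");
            t.setDaemon(true);
            return t;
        });
        scheduler.scheduleAtFixedRate(this::collect, period, period, unit);
    }

    /** One round of the workaround from the report. */
    void collect() {
        System.gc();              // only a request; the JVM may ignore it
        System.runFinalization(); // deprecated in newer JDKs, available on Java 8/11
        runs.incrementAndGet();
    }

    /** Number of completed rounds, useful for logging. */
    long runCount() {
        return runs.get();
    }

    @Override
    public void close() {
        scheduler.shutdownNow();
    }
}
```

Under these assumptions, `new PeriodicGcTrigger(1, TimeUnit.MINUTES)` started once per TaskManager process would match the one-minute interval in the report. Note that this only helps if the native memory is held by Java objects that are already unreachable and release it on finalization; it does not replace an explicit close() on the native objects.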
Hi community,

I raised this issue about three weeks ago. After several weeks of investigation, I found the root cause and explained it in the issue comments. I have also opened a PR to fix the problem (I'm sorry, I did not know that a PR should only be opened after the issue has been assigned to me; I will pay attention next time).

I would now like to ask a committer of the relevant module to check this issue, assign it to me, and review the PR.

Issue URL: https://issues.apache.org/jira/browse/FLINK-21986
PR URL: https://github.com/apache/flink/pull/15619

Best wishes,
Feifan Wang

--
Name: Feifan Wang
Email: [hidden email]

On 03/26/2021 12:00, Feifan Wang (Jira) <[hidden email]> wrote:
> Feifan Wang created FLINK-21986:
> Summary: taskmanager native memory not release timely after restart
> URL: https://issues.apache.org/jira/browse/FLINK-21986
Thanks for raising this issue, Feifan. I think it is very important to fix it. I will take a look at your PR.

Cheers,
Till

On Tue, Apr 20, 2021 at 5:35 AM zoltar9264 <[hidden email]> wrote:
> Now I request the committer of the relevant module to check this
> issue, assign this issue to me, and review this PR.
>
> Issue URL: https://issues.apache.org/jira/browse/FLINK-21986
> PR URL: https://github.com/apache/flink/pull/15619