Possible reference loops lead to high memory usage when idle

Bug #1570974 reported by Steven Hardy
This bug affects 1 person
Affects: OpenStack Heat
Status: Fix Released
Importance: High
Assigned to: Zane Bitter
Milestone: (none)

Bug Description

Deploying a TripleO overcloud uses a lot of memory, and the heat-engine process is one of the top consumers.

However, it seems that we hold on to the memory after the deployment (until heat-engine is restarted), so I think we may have more reference loops similar to bug #1454873.

Before (just after a heat-engine restart):
12932 heat 20 0 368408 71504 6564 S 0.0 0.9 0:02.43 heat-engine
12942 heat 20 0 372536 70972 2076 S 0.0 0.9 0:00.27 heat-engine
12943 heat 20 0 372504 70932 2076 S 0.0 0.9 0:00.26 heat-engine
12944 heat 20 0 372612 70928 2076 S 0.0 0.9 0:00.24 heat-engine
12945 heat 20 0 372600 70936 2076 S 0.0 0.9 0:00.25 heat-engine
24436 heat 20 0 364336 66760 4100 S 0.0 0.8 0:49.87 heat-api-cfn
24510 heat 20 0 357592 59436 3584 S 0.0 0.7 0:00.56 heat-api
24542 heat 20 0 369376 69624 2024 S 0.0 0.9 3:32.44 heat-api
24543 heat 20 0 369812 70360 2012 S 0.0 0.9 3:32.41 heat-api

After:
[root@instack ~]# top -b -n1 | grep heat
12932 heat 20 0 368408 71504 6564 S 0.0 0.9 0:22.23 heat-engine
12942 heat 20 0 602812 296144 3164 S 0.0 3.7 10:43.63 heat-engine
12943 heat 20 0 579648 269740 3276 S 0.0 3.3 7:02.62 heat-engine
12944 heat 20 0 647048 323432 4280 S 0.0 4.0 12:23.91 heat-engine
12945 heat 20 0 697316 372644 4196 S 0.0 4.6 10:50.47 heat-engine
24436 heat 20 0 364340 66764 4100 S 0.0 0.8 0:52.44 heat-api-cfn
24510 heat 20 0 357592 59436 3584 S 0.0 0.7 0:00.56 heat-api
24542 heat 20 0 369376 69624 2024 S 0.0 0.9 4:02.81 heat-api
24543 heat 20 0 369812 70360 2012 S 0.0 0.9 4:02.74 heat-api

We can see that each heat-engine worker has gone from ~70MB resident to roughly 270-370MB, which means nearly a gigabyte in total is not freed after the deployment completes.

I've done some investigation with objgraph and heapy, but have not yet isolated the cause(s).

Tags: tripleo
Steven Hardy (shardy)
tags: added: tripleo
Revision history for this message
Steve Baker (steve-stevebaker) wrote :

There may be reference loops but they may not be the cause of this high memory use in heat.

Instead, it is most likely a high-water-mark / heap-fragmentation effect. This thread is interesting:
http://www.gossamer-threads.com/lists/python/python/1162114
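
To illustrate what that means in practice, here is a minimal, Linux-only sketch (not Heat code, and the numbers are made up) of the high-water-mark effect: once a few live objects remain scattered across the allocator's arenas, freeing everything else does not let those arenas be returned to the OS, so RSS stays near its peak.

    import gc

    def rss_kb():
        # Linux-only: resident set size as reported by the kernel
        with open('/proc/self/status') as f:
            for line in f:
                if line.startswith('VmRSS:'):
                    return int(line.split()[1])

    print('baseline RSS: %s kB' % rss_kb())

    # Simulate a deployment holding lots of small parsed-template objects
    blob = [{'key-%d' % i: 'x' * 100} for i in range(500000)]
    print('peak RSS:     %s kB' % rss_kb())

    # Keep 1% of the objects, scattered across the allocator's arenas,
    # and free the rest -- roughly what happens when a stack completes
    # but some long-lived references survive.
    survivors = blob[::100]
    del blob
    gc.collect()

    # RSS typically stays close to the peak: most arenas still contain a
    # live object, so they cannot be handed back to the kernel.
    print('after free:   %s kB' % rss_kb())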

Our best options for memory-consuming stacks may be:
1. move to process-based workers which exit at the end of their work
2. use less memory in the first place

We'll be discussing both at the optimisation session in Austin.

Revision history for this message
Steven Hardy (shardy) wrote :

Yeah, it may well be partly due to Python's memory management, but I think this behaviour has got worse recently, hence my suspicion that there may be reference loops like bug #1454873 to fix.

It's hard to prove without bisecting from a much earlier heat version, but I guess I could possibly do that if other profiling doesn't identify the issue.

I've attempted some profiling via heapy and objgraph, and it's tough to isolate potential causes due to the volume of data produced by those tools, combined with the fact that it's hard to know which candidate objects to profile (in the case of objgraph).

Anyone know of tools which are able to detect and flag reference loops?

I have also tried running with the gc module and DEBUG_LEAK; again, the volume of data here is immense and there's nothing on the garbage list to help. I do see the number of tracked objects increasing significantly over time, so I'm pretty sure the data we need is in there, I'm just not sure of the best way to filter it.
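
For what it's worth, one way to cut down that volume (a sketch of the general technique, not the exact workflow used in this bug) is to let gc.DEBUG_SAVEALL keep everything the cycle collector would have freed in gc.garbage, then summarise it by type rather than inspecting objects one by one:

    import gc
    from collections import Counter

    gc.set_debug(gc.DEBUG_SAVEALL)

    # ... exercise the code under test here, e.g. create and delete a stack ...

    gc.collect()

    # With DEBUG_SAVEALL, gc.garbage holds everything the collector found
    # unreachable, i.e. objects involved in (or only reachable from)
    # reference loops; counting them by type points at the worst offenders.
    counts = Counter(type(obj).__name__ for obj in gc.garbage)
    for name, num in counts.most_common(20):
        print('%6d  %s' % (num, name))

    # With objgraph installed, the back-reference chains of a suspicious
    # object can then be drawn to make the actual loop visible, e.g.:
    #   import objgraph
    #   objgraph.show_backrefs(gc.garbage[:3], max_depth=5,
    #                          filename='backrefs.png')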

Revision history for this message
Zane Bitter (zaneb) wrote :

Is it bad that my first reaction was that 300M per worker sounded pretty good?

Still, https://bugs.launchpad.net/heat/+bug/1454873/comments/9 suggests that we had it down to 130MB at one point, so there could be something to the reference loop thing.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to heat (master)

Fix proposed to branch: master
Review: https://review.openstack.org/308541

Changed in heat:
assignee: nobody → Zane Bitter (zaneb)
status: New → In Progress
Revision history for this message
OpenStack Infra (hudson-openstack) wrote :

Fix proposed to branch: master
Review: https://review.openstack.org/308542

Changed in heat:
importance: Undecided → High
Revision history for this message
Steven Hardy (shardy) wrote :

@zaneb thanks for the patches - I did a quick test and memory usage seems about the same or slightly reduced (still over a gig peak for a minimal overcloud deployment with only 2 nodes, which ends up being 108 stacks).

I'm basing my assumption that this is bad on the previous 130M figure, but it also just seems wrong that we'd use over a gig of RAM for a single tree of "only" 100-ish stacks, many of which are actually empty (zero resources) because I'm not using network isolation; other than the endpoint map and overcloud.py, most only contain a handful of resources.

Maybe it's just my embedded background, but being able to chew through so much RAM seems wrong for such a small number of stacks (relative to a public cloud with a non-trivial number of users), though I know the heavy nesting and large files map are probably working against us here as well.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to heat (master)

Reviewed: https://review.openstack.org/308541
Committed: https://git.openstack.org/cgit/openstack/heat/commit/?id=ec5e5c57a2f084ede7626df57426f9b69adf3541
Submitter: Jenkins
Branch: master

commit ec5e5c57a2f084ede7626df57426f9b69adf3541
Author: Zane Bitter <email address hidden>
Date: Wed Apr 20 14:41:40 2016 -0400

    Break reference cycle between Environment and ResourceRegistry

    The commit 08431c7c0601f64e6f0477dd502bf912eba8529b added a reference to
    the Environment object from the ResourceRegistry object, which is itself
    an attribute of the environment. This causes a reference cycle that
    means that the environment and everything referenced by it (including
    the potentially very large files dict) will only be deallocated when it
    is garbage collected since the reference counts will never hit zero.
    This has probably been contributing substantially to memory
    fragmentation.

    Change-Id: Ib251b5f5ffc07fe1a06f5e44124024831d4ba1b2
    Partial-Bug: #1570974
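
The shape of the cycle being described is roughly the following (a hypothetical sketch whose names mirror the commit message, not the actual Heat classes):

    class ResourceRegistry(object):
        def __init__(self, environment):
            # the back-reference added in 08431c7c creates the cycle
            self.environment = environment

    class Environment(object):
        def __init__(self, files):
            self.files = files                      # potentially very large dict
            self.registry = ResourceRegistry(self)  # child points back at parent

    env = Environment(files={'overcloud.yaml': '...'})
    del env  # reference counts never reach zero; the pair (and the files
             # dict) lingers until the garbage collector runs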

Revision history for this message
OpenStack Infra (hudson-openstack) wrote :

Reviewed: https://review.openstack.org/308542
Committed: https://git.openstack.org/cgit/openstack/heat/commit/?id=7f801dfd24436f73235a1653690c98dcd4e90075
Submitter: Jenkins
Branch: master

commit 7f801dfd24436f73235a1653690c98dcd4e90075
Author: Zane Bitter <email address hidden>
Date: Wed Apr 20 15:08:12 2016 -0400

    Break reference cycle between ResourceRegistry and ResourceInfo

    ResourceInfo should hold only a weak reference to the ResourceRegistry
    it is held in, otherwise there is a cycle that means the entire registry
    has to wait for garbage collection in order to be deallocated.

    Change-Id: I708aa1e9bac696eb16e9ef21089ca6e14e84e067
    Partial-Bug: #1570974
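
The weak-reference pattern the fix describes looks roughly like this (again a hypothetical sketch, not the real Heat code):

    import weakref

    class ResourceInfo(object):
        def __init__(self, registry):
            # hold the registry weakly so no reference cycle is formed
            self._registry = weakref.ref(registry)

        @property
        def registry(self):
            # returns None once the registry has been deallocated
            return self._registry()

    # With only weak back-references, the registry (and everything it holds)
    # is freed as soon as its last strong reference goes away, without
    # waiting for garbage collection.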

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to heat (master)

Fix proposed to branch: master
Review: https://review.openstack.org/315773

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to heat (master)

Reviewed: https://review.openstack.org/315773
Committed: https://git.openstack.org/cgit/openstack/heat/commit/?id=c64b2cd1f337b263f1a34a1b532d86165c295d29
Submitter: Jenkins
Branch: master

commit c64b2cd1f337b263f1a34a1b532d86165c295d29
Author: Zane Bitter <email address hidden>
Date: Fri May 13 11:30:22 2016 -0400

    Break reference cycle in KeystoneClient plugins

    Change-Id: Ie8ddd132c3ce02a01b77242ce86f219ce4f86249
    Partial-Bug: #1570974
    Related-Bug: #1508134

Revision history for this message
Zane Bitter (zaneb) wrote :

Thomas reported that as of the Newton release, there are no further loops during stack create. There may be some in the update and delete paths, but as far as I can tell the only ones left are due to the use of closures, which would be painful to get rid of.
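
For context, a small illustration (not taken from Heat) of how a closure ends up creating such a cycle:

    class Stack(object):
        def __init__(self):
            def on_complete():
                self.complete = True      # the closure cell holds a strong ref to self
            self.callback = on_complete   # ...and self holds the closure

    s = Stack()
    del s  # only the cycle collector can reclaim this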

While memory usage could still improve, it seems to be back under control. Extensive investigation of bug 1626675 revealed that the growth was almost entirely due to complexity increases in TripleO. Most of the gains came from addressing the use of multiple yaql contexts, the creation of enormous numbers of ResourceInfo objects, loading many stacks simultaneously to calculate outputs (bug 1626675), and continuously recalculating those outputs while polling for events (bug 1638908). It's likely that reference cycles no longer play a large part in whatever remaining memory problems we have (if they ever did).

Accordingly, I'm closing this bug.

Changed in heat:
milestone: none → ocata-1
status: In Progress → Fix Released
milestone: ocata-1 → none