Further memory usage issues with big stacks
| Affects | Status | Importance | Assigned to | Milestone |
|---|---|---|---|---|
| OpenStack Heat | Fix Released | High | Zane Bitter | ocata-1 |
Bug Description
Earlier in Newton we fixed several issues which took TripleO undercloud heat memory usage down considerably, but now it's increased again (a lot):
http://
I don't yet have an accurate estimate of when the problems started, and we have started using some more functions (such as yaql, which I know may be expensive), but the steps in the plot suggest this is a bigger issue.
The main thing we do regularly during the deployment is a bunch of SoftwareDeployment resources.
Emilien Macchi (emilienm) wrote : | #1 |
Puppet OpenStack CI has also recently been failing very often on Heat Tempest tests. I reported this bug a few days ago: https://bugs.launchpad.net/heat/+bug/1622979
I did a bit of research in logstash and found that both TripleO and Puppet CI started having performance issues with Heat around September 12th.
I saw a few commits that might be related:
https://github.com/openstack/heat/commit/e417fc3b86e6371def4cd4b24480c6c44c2598fc
https://github.com/openstack/heat/commit/f18e57e004e65faf0ed2d043384709007f83b2b0
From my research, the Puppet CI Heat timeouts started on September 12th:
http://logstash.openstack.org/#/dashboard/file/logstash.json?query=build_name:%20*tripleo-ci*%20AND%20build_status:%20FAILURE%20AND%20message:%20%5C%22503%20Service%20Unavailable%5C%22
and TripleO's on September 13th:
http://logstash.openstack.org/#/dashboard/file/logstash.json?query=build_name:%20*tripleo-ci*%20AND%20build_status:%20FAILURE%20AND%20message:%20%5C%22503%20Service%20Unavailable%5C%22
I really think something happened around September 11th-13th that degraded Heat performance.
I hope this investigation helps.
Zane Bitter (zaneb) wrote : | #2 |
Bug 1622979 is a duplicate of bug 1626173, so it's now resolved. It was nothing to do with memory usage.
Changed in heat:
importance: Undecided → Critical
status: New → Triaged
milestone: none → newton-rc2
OpenStack Infra (hudson-openstack) wrote : Fix proposed to heat (master) | #3 |
Fix proposed to branch: master
Review: https:/
Changed in heat:
assignee: nobody → Zane Bitter (zaneb)
status: Triaged → In Progress
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to heat (master) | #4 |
Related fix proposed to branch: master
Review: https:/
OpenStack Infra (hudson-openstack) wrote : | #5 |
Related fix proposed to branch: master
Review: https:/
Changed in heat:
assignee: Zane Bitter (zaneb) → nobody
Changed in heat:
assignee: nobody → Zane Bitter (zaneb)
OpenStack Infra (hudson-openstack) wrote : Fix merged to heat (master) | #6 |
Reviewed: https:/
Committed: https:/
Submitter: Jenkins
Branch: master
commit 82b8fd8c17d94e5
Author: Zane Bitter <email address hidden>
Date: Thu Sep 22 17:16:26 2016 -0400
Get rid of circular reference in Event class
This would have been causing the entire stack to remain in memory until
garbage-collected.
Change-Id: If965b4415d7640
Partial-Bug: #1626675
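For illustration, here is a minimal sketch of the kind of cycle this patch removes (generic names, not Heat's actual Event class): an object holding a strong reference back to its owner keeps the whole graph alive until the cyclic garbage collector runs, whereas a weak reference lets plain refcounting free it immediately.

```python
import weakref


class Stack(object):
    def __init__(self):
        self.events = []

    def add_event(self, name):
        self.events.append(Event(self, name))


class Event(object):
    def __init__(self, stack, name):
        # A plain attribute (self.stack = stack) would close the loop
        # stack -> events -> event -> stack, so nothing is freed until
        # gc.collect(). A weak reference breaks the cycle.
        self._stack = weakref.ref(stack)
        self.name = name

    @property
    def stack(self):
        return self._stack()  # None once the stack has been collected
```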
OpenStack Infra (hudson-openstack) wrote : Fix proposed to heat (stable/newton) | #7 |
Fix proposed to branch: stable/newton
Review: https:/
OpenStack Infra (hudson-openstack) wrote : Related fix merged to heat (master) | #8 |
Reviewed: https:/
Committed: https:/
Submitter: Jenkins
Branch: master
commit 4d109558fe3c257
Author: Zane Bitter <email address hidden>
Date: Thu Sep 22 17:53:57 2016 -0400
Use save_and_reraise_exception
Storing sys.exc_info() in a local variable in HeatException.__init__() may
have caused a reference loop in cases where formatting the exception
message failed.
Change-Id: I29502344713e5d
Related-Bug: #1626675
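As a hedged sketch of the pattern this change adopts (oslo.utils is a real Heat dependency; `risky_operation` is a made-up placeholder): holding `sys.exc_info()` in a local variable ties the traceback — and every frame on it — to the current frame, which is exactly the reference loop described above.

```python
import sys

from oslo_utils import excutils


def risky_operation():
    raise RuntimeError('boom')


def leaky():
    try:
        risky_operation()
    except Exception:
        exc_info = sys.exc_info()  # traceback references this frame: a loop
        raise


def better():
    try:
        risky_operation()
    except Exception:
        # The context manager re-raises on exit and drops its own reference
        # to the exc_info tuple, avoiding the loop even if the body raises.
        with excutils.save_and_reraise_exception():
            pass  # logging/cleanup would go here
```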
OpenStack Infra (hudson-openstack) wrote : | #9 |
Reviewed: https:/
Committed: https:/
Submitter: Jenkins
Branch: master
commit 38483c56fcef08c
Author: Zane Bitter <email address hidden>
Date: Thu Sep 22 18:24:05 2016 -0400
Avoid circular refs more aggressively in DependencyTaskGroup
Be ultra-careful to make sure that we can't end up with a local reference
to a traceback that contains the current function, thus causing a reference
loop where everything on the call stack has to be garbage collected.
Also, ignore exceptions from cancelling threads in _cancel_
just as we do in cancel_all() since
2ffbd913a64
Change-Id: I635c41faab4b54
Related-Bug: #1626675
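The defensive pattern this describes looks roughly like the following (a sketch, not the actual DependencyTaskGroup code): swallow errors from cancellation, and make sure no traceback object outlives the except block.

```python
import sys


def cancel_all(runners):
    for runner in runners:
        try:
            runner.cancel()
        except Exception:
            exc_info = sys.exc_info()
            try:
                pass  # at most, log the failure here
            finally:
                # Explicitly drop the local so the traceback (which
                # references this frame) can't form a reference loop.
                del exc_info
```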
OpenStack Infra (hudson-openstack) wrote : Fix merged to heat (stable/newton) | #10 |
Reviewed: https:/
Committed: https:/
Submitter: Jenkins
Branch: stable/newton
commit e59f9275625e675
Author: Zane Bitter <email address hidden>
Date: Thu Sep 22 17:16:26 2016 -0400
Get rid of circular reference in Event class
This would have been causing the entire stack to remain in memory until
garbage-
Change-Id: If965b4415d7640
Partial-Bug: #1626675
(cherry picked from commit 82b8fd8c17d94e5
tags: added: in-stable-newton
Changed in heat:
milestone: newton-rc2 → ocata-1
Zane Bitter (zaneb) wrote : | #11 |
I ran a test creating approximately 150 nested stacks - 25 in parallel at the first level, and 6 in series at the next level, with each stack taking ~10s - with a large (~800KB) files dict. I had hoped that this could form the basis of an automated test that we could use to (a) bisect the repo now, and (b) perhaps incorporate in the gate in the future.
The results were as you might hope - with the patch https:/
I can try adding features to exercise in the test, but there'd be a lot of guesswork involved. I think our best chance at narrowing this down is probably to bisect the Heat repo, testing against tripleo-heat-templates.
(Incidentally, for completeness I also tried the same test with convergence enabled, and the increase was slightly higher - around 23MB - but not vastly so. Also, the delete phase appears to pile on a higher memory increase than the create phase - it grows the memory by an additional 91MB for the legacy path and 63MB for the convergence path. This might be partly because it happens faster: the deepest nested resources don't take 10s to delete, only to create.)
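A harness like the one described here could sample heat-engine's resident memory while the stacks are being created and deleted; a rough sketch (the process-matching and reporting are assumptions, and psutil is used purely for illustration, not something Heat depends on for this):

```python
import time

import psutil


def engine_rss_bytes():
    """Sum the resident set size of all heat-engine worker processes."""
    return sum(p.memory_info().rss
               for p in psutil.process_iter(['cmdline'])
               if 'heat-engine' in ' '.join(p.info['cmdline'] or []))


def watch(duration=600, interval=10):
    """Sample RSS for `duration` seconds and report the net growth."""
    samples = []
    for _ in range(int(duration / interval)):
        samples.append(engine_rss_bytes())
        time.sleep(interval)
    growth = samples[-1] - samples[0]
    print('RSS growth over run: %.1f MiB' % (growth / 1048576.0))
    return growth
```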
Steve Baker (steve-stevebaker) wrote : | #12 |
I can make the memory use rise a lot if I run the following
while true; do os stack show overcloud ; done
whereas the following does not cause memory growth
while true; do os stack event list --nested-depth=2 overcloud ; done
The step rises in shardy's chart may be due to tripleoclient's status polling during periods when no events are created (such as when nodes are booting). If an event has not occurred recently, tripleoclient does a stack show to get the status (just in case an event was lost).
So, doing a repeated stack show on Zane's test *might* reveal something.
Steve Baker (steve-stevebaker) wrote : | #13 |
BTW the testing in #12 was done on current RDO master tripleo, which doesn't yet have *any* of the 3 fixes which Zane has merged.
Steve Baker (steve-stevebaker) wrote : | #14 |
Hmm, with heat master, heat-engine is still getting OOMed before the deployment completes on an 8GB undercloud
Zane Bitter (zaneb) wrote : | #15 |
I can't reproduce the stack-show issue with my test templates, even after adding an output that grabs some data from all 150 child stacks. So that helps narrow it down, but it still sounds like it is related to some specific feature that is exercised.
Crag Wolfe (cwolfe) wrote : | #16 |
Likewise, I'm unable to reproduce the stack-show issue on master/devstack with depth-5 nested stacks, with or without a ResourceGroup.
Steven Hardy (shardy) wrote : | #17 |
>Hmm, with heat master, heat-engine is still getting OOMed before the deployment completes on an 8GB undercloud
Yeah - that's consistent with what I've been seeing - I've been testing locally for most of the cycle with an 8G undercloud (no swap), without OOM issues, but fairly recently noticed I sometimes hit OOM so added some swap and started investigating memory usage again.
I'll see if I can help narrow down this issue with some further local testing, but anyone with access to an upstream tripleo environment should be able to reproduce it.
Here's how my TripleO dev environment is set up, for anyone who wants to replicate it (running on a desktop box with 32G RAM & CentOS 7):
Crag Wolfe (cwolfe) wrote : | #18 |
I've tried Steve Baker's "while true; do os stack show overcloud ; done" test (#12), using an environment from Steven Hardy's paste above (#17) [awesome dev env instructions, btw]. I'm not seeing memory usage grow at all after almost 3 hours / ~ 600 calls to "openstack stack show overcloud". heat-engine remains at 2.1gb according to ps_mem. Total used memory in the system is at 6.7gb. Version of the rpm installed on the undercloud vm is openstack-
Zane Bitter (zaneb) wrote : | #19 |
I did some analysis on historical memory usage data from the gate-tripleo-
http://
Basically, we've only seen it increase once since August 9th, and that was when the undercloud VM size was increased from 6GB to 8GB.
Zane Bitter (zaneb) wrote : | #20 |
I ran the same analysis on the periodic job. The data is much more sparse, but it appears there is a jump between the 3rd and the 19th of August (before the RAM increase):
https:/
(Note that this may understate the increase, which looks like ~0.75GiB in the flat part of the graph, because there seems to be a hard ceiling just below 6GiB that likely affects the later jobs - presumably swap is enabled, but we don't track that.)
The two closest builds showing different behaviour are http://
However, assuming that a problem lies between those two Heat commits, the most suspicious one appears to be https:/
The second-most suspicious commit is probably https:/
The other patches all seem fairly benign to me, but I would welcome more eyeballs.
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to heat (master) | #21 |
Related fix proposed to branch: master
Review: https:/
Crag Wolfe (cwolfe) wrote : | #22 |
Another observation (continuing #18): with 11 concurrent calls to "openstack stack show overcloud" (in a loop), memory usage of heat-engine eventually gets to 6.7gb, where it has been stable for ~8 hours.
Crag Wolfe (cwolfe) wrote : | #23 |
Using the same test as in #22, if I disable eager loading of raw_template (#20) in stack_get(), I still get to around 6gb consumption in 45 mins vs. 35 mins without. Since raw_templates in general only reference the file_id of raw_template_files, there shouldn't be much of an issue with the raw_template_files caching. Looking at the object model of raw_template, I wonder if raw_template.
OpenStack Infra (hudson-openstack) wrote : | #24 |
Related fix proposed to branch: master
Review: https:/
OpenStack Infra (hudson-openstack) wrote : Fix proposed to heat (master) | #25 |
Fix proposed to branch: master
Review: https:/
Crag Wolfe (cwolfe) wrote : | #26 |
Ruling out more commits as the cause for the memory issue. These are three different tests with the commit indicated removed (the third test wholesale-removed a dozen commits).
bc3b84f A context cache for Resource objects
5.5gb after 15 min
3ab0ede Always eager load the raw_template for a stack / tl-minus-
5.7gb after 50 min
a2f5b5c Perform str_replace trying to match longest string first
4090dfe Refactor boolean condition functions
97483d5 Do str_replace in a single pass
b67605d Refactor resource definition parsing in HOT/cfn
8262265 Make cfn functions inherit from HOT
4a8ad39 Allow reference conditions by name
e417fc3 Revert "Allow reference conditions by name"
4a92678 Allows condition name using boolean or function
fbc0021 Make get_attr consistent across template versions
7b129f6 Copy correct definition to the backup stack
bca8b8e Allow referencing conditions by name
5.1gb after 38 mins
5.6gb after 1hr, 5mins
Crag Wolfe (cwolfe) wrote : | #27 |
I think #24 and #25 are both good, but I'm especially seeing improvement with #25.
https:/
https:/
OpenStack Infra (hudson-openstack) wrote : Related fix merged to heat (master) | #28 |
Reviewed: https:/
Committed: https:/
Submitter: Jenkins
Branch: master
commit 260b79ed28b5dc4
Author: Zane Bitter <email address hidden>
Date: Tue Oct 4 08:25:01 2016 -0400
Don't always eagerly load the raw_template for a stack
Always loading the raw template in situations where we didn't need it -
e.g. in identify_stack, where we just want the name + id (given one of
them), or when getting the summary stack list - uses up DB bandwidth and
memory unnecessarily.
This partially reverts commit 3ab0ede98c6dc0c
* The eager_load option to get_stack() is reinstated, but with the default
flipped to True. In places where we explicitly do not want to load the
template, we pass False.
* stack_get_by_name() no longer eagerly loads the template. There were no
instances of this where we subsequently use the template.
* stack_get_all() acquires an eager_load option, with the default set to
False. Direct users of objects.
load by default, but users of engine.
template eagerly loaded. This practically always corresponds to what you
want.
Change-Id: I1f156c25ea2632
Related-Bug: #1626675
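In generic SQLAlchemy terms (placeholder models, not Heat's actual schema or DB API), the eager_load flag the commit describes toggles whether the potentially large template row is joined into the same query or left to lazy loading:

```python
from sqlalchemy import Column, ForeignKey, Integer, String, Text
from sqlalchemy.ext.declarative import declarative_base
from sqlalchemy.orm import joinedload, relationship

Base = declarative_base()


class RawTemplate(Base):
    __tablename__ = 'raw_template'
    id = Column(Integer, primary_key=True)
    template = Column(Text)  # can be very large


class Stack(Base):
    __tablename__ = 'stack'
    id = Column(Integer, primary_key=True)
    name = Column(String(255))
    raw_template_id = Column(Integer, ForeignKey('raw_template.id'))
    raw_template = relationship(RawTemplate)


def stack_get(session, stack_id, eager_load=True):
    query = session.query(Stack)
    if eager_load:
        # Pull the template in the same query; skip this when the caller
        # only needs the stack's name and id.
        query = query.options(joinedload(Stack.raw_template))
    return query.get(stack_id)
```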
OpenStack Infra (hudson-openstack) wrote : | #29 |
Reviewed: https:/
Committed: https:/
Submitter: Jenkins
Branch: master
commit 0830318707b6b12
Author: Thomas Herve <email address hidden>
Date: Tue Oct 4 11:52:40 2016 +0200
Don't create yaql context
In the yaql function, we create and store a yaql context object that we
keep during the lifetime of the function. This is only needed for
evaluation, so let yaql create the context itself, and don't reference
it so that it's garbage collected.
Change-Id: If3015cf85dfe96
Related-Bug: #1626675
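A sketch of the resulting pattern using the real yaql library (the surrounding names are illustrative, not Heat's actual function plugin): nothing long-lived holds the context, so it can be garbage-collected after each evaluation.

```python
import yaql
from yaql import factory

ENGINE = factory.YaqlFactory().create()


def evaluate(expression, data):
    # A fresh context per evaluation; no long-lived reference keeps it alive.
    return ENGINE(expression).evaluate(data=data,
                                       context=yaql.create_context())


print(evaluate('len($.items)', {'items': [1, 2, 3]}))  # -> 3
```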
Crag Wolfe (cwolfe) wrote : | #30 |
One more observation for https:/
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to heat (stable/newton) | #31 |
Related fix proposed to branch: stable/newton
Review: https:/
OpenStack Infra (hudson-openstack) wrote : Fix merged to heat (master) | #32 |
Reviewed: https:/
Committed: https:/
Submitter: Jenkins
Branch: master
commit adb8629a90eff29
Author: Zane Bitter <email address hidden>
Date: Tue Oct 4 16:46:58 2016 -0400
Use __slots__ in ResourceInfo classes
A TripleO environment typically contains hundreds of resource type
mappings. And a TripleO deployment typically contains hundreds of nested
stacks. The result is typically tens of thousands of ResourceInfo objects
all loaded in memory at the same time.
This change saves memory by using slots for these classes instead of
__dict__. I'd expect this to save on the order of tens of megabytes of RAM
in a TripleO deployment - comparatively modest, but an easy win given that
it is such a simple change.
Change-Id: Ia0f17be794618d
Partial-Bug: #1626675
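The technique itself is plain Python; a small self-contained illustration (not the actual ResourceInfo code): with __slots__, instances store attributes in fixed slots instead of a per-instance __dict__, which adds up when tens of thousands of instances are alive at once.

```python
import sys


class WithDict(object):
    def __init__(self, name, value):
        self.name = name
        self.value = value


class WithSlots(object):
    __slots__ = ('name', 'value')

    def __init__(self, name, value):
        self.name = name
        self.value = value


a = WithDict('x', 1)
b = WithSlots('x', 1)
print(hasattr(b, '__dict__'))      # False: no per-instance dict at all
print(sys.getsizeof(a.__dict__))   # the per-instance overhead being saved
```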
Steve Baker (steve-stevebaker) wrote : | #33 |
Using heat master including the recent memory improvements, I've just bisected tripleo-
There is nothing suspicious-looking merging around the 25th; commits on either side of this change[1] could be investigated.
It could be that the bisect shows gradual memory growth over the changes, and the 25th is when the threshold is reached. I've captured memory use logs for each run so my next step is to graph these and see if there is a trend. tripleo-
OpenStack Infra (hudson-openstack) wrote : Fix proposed to heat (stable/newton) | #34 |
Fix proposed to branch: stable/newton
Review: https:/
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to heat (master) | #35 |
Related fix proposed to branch: master
Review: https:/
Steve Baker (steve-stevebaker) wrote : | #36 |
I did another bisect with a more predictable environment (fake nova virt, script to fake deployment signals)
This time it bisected to this change[1] on August 24th. With this change the number of yaql uses goes from 2 to 3. By the time stable/newton is branched there are ~25 uses of the yaql function.
therve's latest yaql fix[2] made a huge difference - I'm not seeing OOMs on an 8GB undercloud deploying stable/newton with a fresh heat-engine. (Doing multiple deploys without engine restarts still OOMs, so we can't call this fixed.)
[1] http://
[2] https:/
tags: removed: in-stable-newton
OpenStack Infra (hudson-openstack) wrote : Related fix merged to heat (master) | #37 |
Reviewed: https:/
Committed: https:/
Submitter: Jenkins
Branch: master
commit 21990655b609a3a
Author: Zane Bitter <email address hidden>
Date: Thu Oct 6 09:40:45 2016 -0400
Use __slots__ in Parameter classes
A typical stack may easily have dozens of parameters, so Parameter objects
are very common in memory. They're also very simple and change rarely, all
of which makes them a good candidate for being made lighter-weight using
slots to avoid creating __dict__.
Change-Id: I23e07876054cba
Related-Bug: #1626675
OpenStack Infra (hudson-openstack) wrote : Fix merged to heat (stable/newton) | #38 |
Reviewed: https:/
Committed: https:/
Submitter: Jenkins
Branch: stable/newton
commit 2830c83fd20a9ab
Author: Zane Bitter <email address hidden>
Date: Tue Oct 4 16:46:58 2016 -0400
Use __slots__ in ResourceInfo classes
A TripleO environment typically contains hundreds of resource type
mappings. And a TripleO deployment typically contains hundreds of nested
stacks. The result is typically tens of thousands of ResourceInfo objects
all loaded in memory at the same time.
This change saves memory by using slots for these classes instead of
__dict__. I'd expect this to save on the order of tens of megabytes of RAM
in a TripleO deployment - comparatively modest, but an easy win given that
it is such a simple change.
Change-Id: Ia0f17be794618d
Partial-Bug: #1626675
(cherry picked from commit adb8629a90eff29
tags: added: in-stable-newton
OpenStack Infra (hudson-openstack) wrote : Fix proposed to heat (stable/newton) | #39 |
Fix proposed to branch: stable/newton
Review: https:/
OpenStack Infra (hudson-openstack) wrote : Change abandoned on heat (stable/newton) | #40 |
Change abandoned by Zane Bitter (<email address hidden>) on branch: stable/newton
Review: https:/
Reason: Created https:/
Steve Baker (steve-stevebaker) wrote : | #41 |
We may be reaching diminishing returns in finding reference loops and leaks - it could be that heat's object creation pattern will always lead to a fragmented heap and memory that isn't returned to the OS (at least on python-2.7).
We have a worker process model and a polite EngineService.stop implementation. Why don't we keep a counter of some indicative metric (RPC calls, stacks loaded) and, once a configured limit is reached, stop the current worker? Memory would be returned to the OS, and a new worker would be automatically spawned.
The config value specifying the limit could default to -1 to disable this stopping behaviour - that would retain current behaviour and keep our CI useful for catching other memory problems.
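A rough sketch of what that could look like (all names hypothetical; a real implementation would hook into Heat's existing worker and EngineService machinery rather than a standalone class):

```python
class EngineWorker(object):
    def __init__(self, max_requests=-1):
        # -1 (the proposed default) disables the restart behaviour.
        self.max_requests = max_requests
        self._handled = 0
        self._running = True

    def handle_rpc_call(self, request):
        self._handled += 1
        result = self.dispatch(request)  # placeholder for real dispatch
        if 0 < self.max_requests <= self._handled:
            # Polite shutdown: the fragmented heap goes back to the OS and
            # the parent process respawns a fresh worker.
            self.stop()
        return result

    def dispatch(self, request):
        return request  # stand-in for real RPC handling

    def stop(self):
        self._running = False
```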
OpenStack Infra (hudson-openstack) wrote : Fix merged to heat (stable/newton) | #42 |
Reviewed: https:/
Committed: https:/
Submitter: Jenkins
Branch: stable/newton
commit 2ca3df862d6eaee
Author: Thomas Herve <email address hidden>
Date: Tue Oct 4 11:52:40 2016 +0200
Create a root Yaql context
In the yaql function, we create and store a yaql context object that we
keep during the lifetime of the function. This is only needed for
evaluation. Not keeping track of the context is an improvement on
memory, but require registration of the library every time. The most
efficient way to use yaql contexts seems to be to create a root context,
and then pass a child one for each evaluation.
Change-Id: I12ea701e51a4c3
Partial-Bug: #1626675
(cherry picked from commits 0830318707b6b12
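The root-plus-child pattern this backport describes looks roughly like the following with the real yaql API (surrounding names are illustrative): library registration happens once on the root context, and each evaluation gets a cheap, collectable child.

```python
import yaql
from yaql import factory

ENGINE = factory.YaqlFactory().create()
ROOT_CONTEXT = yaql.create_context()  # standard library registered once


def evaluate(expression, data):
    # Child contexts are cheap to create and garbage-collected after use,
    # while the expensive registration work is never repeated.
    return ENGINE(expression).evaluate(
        data=data, context=ROOT_CONTEXT.create_child_context())
```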
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to heat (master) | #43 |
Related fix proposed to branch: master
Review: https:/
OpenStack Infra (hudson-openstack) wrote : Related fix merged to heat (master) | #44 |
Reviewed: https:/
Committed: https:/
Submitter: Jenkins
Branch: master
commit df889488ede2d3c
Author: Zane Bitter <email address hidden>
Date: Mon Oct 10 15:11:42 2016 -0400
Avoid loading nested stacks in memory where possible
Prior to changing StackResource to do stack operations over RPC, we made
liberal use of the StackResource.
that was likely always loaded in memory. Now that that is no longer
required, it adds additional memory overhead that we need not have. We can
now obtain the stack identifier without loading the stack, and that is
sufficient for performing operations over RPC.
The exceptions are prepare_abandon(), which cannot be done over RPC at
present, and get_output(), which may be addressed in a separate patch. The
gratuitous loading of the nested stack in TemplateResource
is eliminated, so although it still ends up loading the nested stack in
many cases, it will no longer do so once get_output() stops doing it.
Change-Id: I669d2a077381d7
Co-Authored-By: Thomas Herve <email address hidden>
Related-Bug: #1626675
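Purely as an illustration of the idea (all names hypothetical, not Heat's actual StackResource API): the parent engine keeps only an identifier and delegates the operation over RPC, so the nested stack is only ever instantiated in the remote engine's memory.

```python
class StackResource(object):
    def __init__(self, context, rpc_client, nested_stack_id):
        self.context = context
        self.rpc = rpc_client
        self._nested_id = nested_stack_id

    def nested_identifier(self):
        # Cheap: just an identifier, no template parsing, no resource graph.
        return self._nested_id

    def delete_nested(self):
        # The remote engine loads the stack; this process never does.
        self.rpc.delete_stack(self.context, self.nested_identifier())
```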
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to heat (stable/newton) | #45 |
Related fix proposed to branch: stable/newton
Review: https:/
Zane Bitter (zaneb) wrote : | #46 |
Here's the smoking gun for t-h-t: an analysis of the ps.txt file in the logs (which records the memory usage of each process at the end of the test):
https:/
It shows the memory use creeping up gradually from 1.3GiB in early August to 2.4GiB by late September.
It looks like the "Use __slots__ in ResourceInfo classes" and "Create a root Yaql context" patches have knocked things back considerably (to 1.6GiB), and I expect now that the "Use RPC to retrieve nested stack output" patch has merged we'll see it drop back even further in future, to around 1.0GiB.
Changed in heat:
importance: Critical → High
OpenStack Infra (hudson-openstack) wrote : Related fix merged to heat (master) | #47 |
Reviewed: https:/
Committed: https:/
Submitter: Jenkins
Branch: master
commit 3a3e6a884091b28
Author: Thomas Herve <email address hidden>
Date: Mon Oct 10 17:35:36 2016 -0400
Use RPC to retrieve nested stack output
Instead of loading the stack in memory, use RPC to get the stack and its
output. It releases memory pressure from the main engine in legacy mode.
Change-Id: Id3da88e8c5d9b6
Related-Bug: #1626675
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to heat (stable/newton) | #48 |
Related fix proposed to branch: stable/newton
Review: https:/
Steven Hardy (shardy) wrote : | #49 |
Thanks for all the fixes here guys - I've just re-tested and I can confirm things look much better now:
http://
We're still above where we were mid-Newton after the previous round of memory-related fixes, but that could easily be related to t-h-t complexity changes since then.
This was heat master at 06fe8d89ff78779
OpenStack Infra (hudson-openstack) wrote : Related fix merged to heat (stable/newton) | #50 |
Reviewed: https:/
Committed: https:/
Submitter: Jenkins
Branch: stable/newton
commit 53137cb228768b2
Author: Zane Bitter <email address hidden>
Date: Mon Oct 10 15:11:42 2016 -0400
Avoid loading nested stacks in memory where possible
Prior to changing StackResource to do stack operations over RPC, we made
liberal use of the StackResource.
that was likely always loaded in memory. Now that that is no longer
required, it adds additional memory overhead that we need not have. We can
now obtain the stack identifier without loading the stack, and that is
sufficient for performing operations over RPC.
The exceptions are prepare_abandon(), which cannot be done over RPC at
present, and get_output(), which may be addressed in a separate patch. The
gratuitous loading of the nested stack in TemplateResource
is eliminated, so although it still ends up loading the nested stack in
many cases, it will no longer do so once get_output() stops doing it.
Change-Id: I669d2a077381d7
Co-Authored-By: Thomas Herve <email address hidden>
Related-Bug: #1626675
(cherry picked from commit df889488ede2d3c
OpenStack Infra (hudson-openstack) wrote : | #51 |
Reviewed: https:/
Committed: https:/
Submitter: Jenkins
Branch: stable/newton
commit 92cadc352a67ab2
Author: Thomas Herve <email address hidden>
Date: Mon Oct 10 17:35:36 2016 -0400
Use RPC to retrieve nested stack output
Instead of loading the stack in memory, use RPC to get the stack and its
output. It releases memory pressure from the main engine in legacy mode.
Change-Id: Id3da88e8c5d9b6
Related-Bug: #1626675
(cherry picked from commit 3a3e6a884091b28
Thomas Herve (therve) wrote : | #52 |
I'd like to close this bug. We still have numerous improvements to make, but we've put some stop-gap measures in place, so I think this particular iteration is done.
Changed in heat:
status: In Progress → Fix Released