Further memory usage issues with big stacks
| Affects | Status | Importance | Assigned to | Milestone |
|---|---|---|---|---|
| OpenStack Heat | Fix Released | High | Zane Bitter | ocata-1 |
Bug Description
Earlier in Newton we fixed several issues which took TripleO undercloud heat memory usage down considerably, but now it's increased again (a lot):
http://
I don't yet have an accurate estimate of when the problems started, and we have started using some more functions (such as yaql, which I know may be expensive), but the steps in the plot suggest this is a bigger issue.
The main thing we do regularly during the deployment is a bunch of SoftwareDeployment resources.
Emilien Macchi (emilienm) wrote : | #1 |
Puppet OpenStack CI has also recently been failing very often on Heat Tempest tests. I reported this bug a few days ago: https://bugs.launchpad.net/heat/+bug/1622979
I did a bit of research in logstash and found that both TripleO and Puppet CI started having performance issues with Heat around September 12th.
I saw a few commits that might be related:
https://github.com/openstack/heat/commit/e417fc3b86e6371def4cd4b24480c6c44c2598fc
https://github.com/openstack/heat/commit/f18e57e004e65faf0ed2d043384709007f83b2b0
From my research, the Puppet CI Heat timeouts started on September 12th:
http://logstash.openstack.org/#/dashboard/file/logstash.json?query=build_name:%20*tripleo-ci*%20AND%20build_status:%20FAILURE%20AND%20message:%20%5C%22503%20Service%20Unavailable%5C%22
and TripleO's on September 13th:
http://logstash.openstack.org/#/dashboard/file/logstash.json?query=build_name:%20*tripleo-ci*%20AND%20build_status:%20FAILURE%20AND%20message:%20%5C%22503%20Service%20Unavailable%5C%22
I really think something happened around September 11th-13th that degraded Heat performance.
I hope this investigation helps.
Zane Bitter (zaneb) wrote : | #2 |
Bug 1622979 is a duplicate of bug 1626173, so it's now resolved. It was nothing to do with memory usage.
Changed in heat:
importance: Undecided → Critical
status: New → Triaged
milestone: none → newton-rc2
OpenStack Infra (hudson-openstack) wrote : Fix proposed to heat (master) | #3 |
Fix proposed to branch: master
Review: https:/
Changed in heat:
assignee: nobody → Zane Bitter (zaneb)
status: Triaged → In Progress
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to heat (master) | #4 |
Related fix proposed to branch: master
Review: https:/
OpenStack Infra (hudson-openstack) wrote : | #5 |
Related fix proposed to branch: master
Review: https:/
Changed in heat:
assignee: Zane Bitter (zaneb) → nobody
Changed in heat:
assignee: nobody → Zane Bitter (zaneb)
OpenStack Infra (hudson-openstack) wrote : Fix merged to heat (master) | #6 |
Reviewed: https:/
Committed: https:/
Submitter: Jenkins
Branch: master
commit 82b8fd8c17d94e5
Author: Zane Bitter <email address hidden>
Date: Thu Sep 22 17:16:26 2016 -0400
Get rid of circular reference in Event class
This would have been causing the entire stack to remain in memory until
garbage-collected.
Change-Id: If965b4415d7640
Partial-Bug: #1626675
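For illustration, here is a minimal sketch of the kind of cycle this patch removes (generic names, not Heat's actual Event class): an object holding a strong reference back to its owner keeps the whole graph alive until the cyclic garbage collector runs, whereas a weak reference lets plain refcounting free it immediately.

```python
import weakref


class Stack(object):
    def __init__(self):
        self.events = []

    def add_event(self, name):
        self.events.append(Event(self, name))


class Event(object):
    def __init__(self, stack, name):
        # A plain attribute (self.stack = stack) would close the loop
        # stack -> events -> event -> stack, so nothing is freed until
        # gc.collect(). A weak reference breaks the cycle.
        self._stack = weakref.ref(stack)
        self.name = name

    @property
    def stack(self):
        return self._stack()  # None once the stack has been collected
```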
OpenStack Infra (hudson-openstack) wrote : Fix proposed to heat (stable/newton) | #7 |
Fix proposed to branch: stable/newton
Review: https:/
OpenStack Infra (hudson-openstack) wrote : Related fix merged to heat (master) | #8 |
Reviewed: https:/
Committed: https:/
Submitter: Jenkins
Branch: master
commit 4d109558fe3c257
Author: Zane Bitter <email address hidden>
Date: Thu Sep 22 17:53:57 2016 -0400
Use save_and_reraise_exception
Storing sys.exc_info() in a local variable in HeatException.__init__() may
have caused a reference loop in cases where formatting the exception
message failed.
Change-Id: I29502344713e5d
Related-Bug: #1626675
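As a hedged sketch of the pattern this change adopts (oslo.utils is a real Heat dependency; `risky_operation` is a made-up placeholder): holding `sys.exc_info()` in a local variable ties the traceback — and every frame on it — to the current frame, which is exactly the reference loop described above.

```python
import sys

from oslo_utils import excutils


def risky_operation():
    raise RuntimeError('boom')


def leaky():
    try:
        risky_operation()
    except Exception:
        exc_info = sys.exc_info()  # traceback references this frame: a loop
        raise


def better():
    try:
        risky_operation()
    except Exception:
        # The context manager re-raises on exit and drops its own reference
        # to the exc_info tuple, avoiding the loop even if the body raises.
        with excutils.save_and_reraise_exception():
            pass  # logging/cleanup would go here
```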
OpenStack Infra (hudson-openstack) wrote : | #9 |
Reviewed: https:/
Committed: https:/
Submitter: Jenkins
Branch: master
commit 38483c56fcef08c
Author: Zane Bitter <email address hidden>
Date: Thu Sep 22 18:24:05 2016 -0400
Avoid circular refs more aggressively in DependencyTaskGroup
Be ultra-careful to make sure that we can't end up with a local reference
to a traceback that contains the current function, thus causing a reference
loop where everything on the call stack has to be garbage collected.
Also, ignore exceptions from cancelling threads in _cancel_
just as we do in cancel_all() since
2ffbd913a64
Change-Id: I635c41faab4b54
Related-Bug: #1626675
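The defensive pattern this describes looks roughly like the following (a sketch, not the actual DependencyTaskGroup code): swallow errors from cancellation, and make sure no traceback object outlives the except block.

```python
import sys


def cancel_all(runners):
    for runner in runners:
        try:
            runner.cancel()
        except Exception:
            exc_info = sys.exc_info()
            try:
                pass  # at most, log the failure here
            finally:
                # Explicitly drop the local so the traceback (which
                # references this frame) can't form a reference loop.
                del exc_info
```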
OpenStack Infra (hudson-openstack) wrote : Fix merged to heat (stable/newton) | #10 |
Reviewed: https:/
Committed: https:/
Submitter: Jenkins
Branch: stable/newton
commit e59f9275625e675
Author: Zane Bitter <email address hidden>
Date: Thu Sep 22 17:16:26 2016 -0400
Get rid of circular reference in Event class
This would have been causing the entire stack to remain in memory until
garbage-
Change-Id: If965b4415d7640
Partial-Bug: #1626675
(cherry picked from commit 82b8fd8c17d94e5
tags: added: in-stable-newton
Changed in heat:
milestone: newton-rc2 → ocata-1
Zane Bitter (zaneb) wrote : | #11 |
I ran a test creating approximately 150 nested stacks - 25 in parallel at the first level, and 6 in series at the next level, with each stack taking ~10s - with a large (~800KB) files dict. I had hoped that this could form the basis of an automated test that we could use to (a) bisect the repo now, and (b) perhaps incorporate in the gate in the future.
The results were as you might hope - with the patch https:/
I can try adding features to exercise in the test, but there'd be a lot of guesswork involved. I think our best chance at narrowing this down is probably to bisect the Heat repo, testing against tripleo-heat-templates.
(Incidentally, for completeness I also tried the same test with convergence enabled, and the increase was slightly higher - around 23MB - but not vastly so. Also, the delete phase appears to pile on a higher memory increase than the create phase - it grows the memory by an additional 91MB for the legacy path and 63MB for the convergence path. This might be partly because it happens faster: the deepest nested resources don't take 10s to delete, only to create.)
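A harness like the one described here could sample heat-engine's resident memory while the stacks are being created and deleted; a rough sketch (the process-matching and reporting are assumptions, and psutil is used purely for illustration, not something Heat depends on for this):

```python
import time

import psutil


def engine_rss_bytes():
    """Sum the resident set size of all heat-engine worker processes."""
    return sum(p.memory_info().rss
               for p in psutil.process_iter(['cmdline'])
               if 'heat-engine' in ' '.join(p.info['cmdline'] or []))


def watch(duration=600, interval=10):
    """Sample RSS for `duration` seconds and report the net growth."""
    samples = []
    for _ in range(int(duration / interval)):
        samples.append(engine_rss_bytes())
        time.sleep(interval)
    growth = samples[-1] - samples[0]
    print('RSS growth over run: %.1f MiB' % (growth / 1048576.0))
    return growth
```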
Steve Baker (steve-stevebaker) wrote : | #12 |
I can make the memory use rise a lot if I run the following
while true; do os stack show overcloud ; done
whereas the following does not cause memory growth
while true; do os stack event list --nested-depth=2 overcloud ; done
The step rises in shardy's chart may be due to tripleoclient's status polling during periods when no events are created (such as when nodes are booting). If an event has not occurred recently, tripleoclient does a stack show to get the status (just in case an event was lost).
So, doing a repeated stack show on Zane's test *might* reveal something.
Steve Baker (steve-stevebaker) wrote : | #13 |
BTW the testing in #12 was done on current RDO master tripleo, which doesn't yet have *any* of the 3 fixes which Zane has merged.
Steve Baker (steve-stevebaker) wrote : | #14 |
Hmm, with heat master, heat-engine is still getting OOMed before the deployment completes on an 8GB undercloud
Zane Bitter (zaneb) wrote : | #15 |
I can't reproduce the stack-show issue with my test templates, even after adding an output that grabs some data from all 150 child stacks. So that helps narrow it down, but it still sounds like it is related to some specific feature that is exercised.
Crag Wolfe (cwolfe) wrote : | #16 |
Likewise, I'm unable to reproduce the stack-show issue on master/devstack with depth-5 nested stacks, with or without a ResourceGroup.
Steven Hardy (shardy) wrote : | #17 |
>Hmm, with heat master, heat-engine is still getting OOMed before the deployment completes on an 8GB undercloud
Yeah - that's consistent with what I've been seeing - I've been testing locally for most of the cycle with an 8G undercloud (no swap), without OOM issues, but fairly recently noticed I sometimes hit OOM so added some swap and started investigating memory usage again.
I'll see if I can help narrow down this issue with some further local testing, but anyone with access to an upstream tripleo environment should be able to reproduce it.
Here's how my TripleO dev environment is set up, for anyone who wants to replicate it (running on a desktop box with 32G RAM & CentOS 7):
Crag Wolfe (cwolfe) wrote : | #18 |
I've tried Steve Baker's "while true; do os stack show overcloud ; done" test (#12), using an environment from Steven Hardy's paste above (#17) [awesome dev env instructions, btw]. I'm not seeing memory usage grow at all after almost 3 hours / ~ 600 calls to "openstack stack show overcloud". heat-engine remains at 2.1gb according to ps_mem. Total used memory in the system is at 6.7gb. Version of the rpm installed on the undercloud vm is openstack-
Zane Bitter (zaneb) wrote : | #19 |
I did some analysis on historical memory usage data from the gate-tripleo-
http://
Basically, we've only seen it increase once since August 9th, and that was when the undercloud VM size was increased from 6GB to 8GB.
Zane Bitter (zaneb) wrote : | #20 |
I ran the same analysis on the periodic job. The data is much more sparse, but it appears there is a jump between the 3rd and the 19th of August (before the RAM increase):
https:/
(Note that this may understate the increase, which looks like ~0.75GiB in the flat part of the graph, because there seems to be a hard ceiling just below 6GiB that likely affects the later jobs - presumably swap is enabled, but we don't track that.)
The two closest builds showing different behaviour are http://
However, assuming that a problem lies between those two Heat commits, the most suspicious one appears to be https:/
The second-most suspicious commit is probably https:/
The other patches all seem fairly benign to me, but I would welcome more eyeballs.
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to heat (master) | #21 |
Related fix proposed to branch: master
Review: https:/
Crag Wolfe (cwolfe) wrote : | #22 |
Another observation (continuing #18): with 11 concurrent calls to "openstack stack show overcloud" (in a loop), memory usage of heat-engine eventually gets to 6.7gb, where it has been stable for ~8 hours.
Crag Wolfe (cwolfe) wrote : | #23 |
Using the same test as in #22, if I disable eager loading of raw_template (#20) in stack_get(), I still get to around 6gb consumption in 45 mins vs. 35 mins without. Since raw_templates in general only reference the file_id of raw_template_files, there shouldn't be much of an issue with the raw_template_files caching. Looking at the object model of raw_template, I wonder if raw_template.
OpenStack Infra (hudson-openstack) wrote : | #24 |
Related fix proposed to branch: master
Review: https:/
OpenStack Infra (hudson-openstack) wrote : Fix proposed to heat (master) | #25 |
Fix proposed to branch: master
Review: https:/
Crag Wolfe (cwolfe) wrote : | #26 |
Ruling out more commits as the cause for the memory issue. These are three different tests with the commit indicated removed (the third test wholesale-removed a dozen commits).
bc3b84f A context cache for Resource objects
5.5gb after 15 min
3ab0ede Always eager load the raw_template for a stack / tl-minus-
5.7gb after 50 min
a2f5b5c Perform str_replace trying to match longest string first
4090dfe Refactor boolean condition functions
97483d5 Do str_replace in a single pass
b67605d Refactor resource definition parsing in HOT/cfn
8262265 Make cfn functions inherit from HOT
4a8ad39 Allow reference conditions by name
e417fc3 Revert "Allow reference conditions by name"
4a92678 Allows condition name using boolean or function
fbc0021 Make get_attr consistent across template versions
7b129f6 Copy correct definition to the backup stack
bca8b8e Allow referencing conditions by name
5.1gb after 38 mins
5.6gb after 1hr, 5mins
Crag Wolfe (cwolfe) wrote : | #27 |
I think #24 and #25 are both good, but I'm especially seeing improvement with #25.
https:/
https:/
OpenStack Infra (hudson-openstack) wrote : Related fix merged to heat (master) | #28 |
Reviewed: https:/
Committed: https:/
Submitter: Jenkins
Branch: master
commit 260b79ed28b5dc4
Author: Zane Bitter <email address hidden>
Date: Tue Oct 4 08:25:01 2016 -0400
Don't always eagerly load the raw_template for a stack
Always loading the raw template in situations where we didn't need it -
e.g. in identify_stack, where we just want the name + id (given one of
them), or when getting the summary stack list - uses up DB bandwidth and
memory unnecessarily.
This partially reverts commit 3ab0ede98c6dc0c
* The eager_load option to get_stack() is reinstated, but with the default
flipped to True. In places where we explicitly do not want to load the
template, we pass False.
* stack_get_by_name() no longer eagerly loads the template. There were no
instances of this where we subsequently use the template.
* stack_get_all() acquires an eager_load option, with the default set to
False. Direct users of objects.
load by default, but users of engine.
template eagerly loaded. This practically always corresponds to what you
want.
Change-Id: I1f156c25ea2632
Related-Bug: #1626675
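In generic SQLAlchemy terms (placeholder models, not Heat's actual schema or DB API), the eager_load flag the commit describes toggles whether the potentially large template row is joined into the same query or left to lazy loading:

```python
from sqlalchemy import Column, ForeignKey, Integer, String, Text
from sqlalchemy.ext.declarative import declarative_base
from sqlalchemy.orm import joinedload, relationship

Base = declarative_base()


class RawTemplate(Base):
    __tablename__ = 'raw_template'
    id = Column(Integer, primary_key=True)
    template = Column(Text)  # can be very large


class Stack(Base):
    __tablename__ = 'stack'
    id = Column(Integer, primary_key=True)
    name = Column(String(255))
    raw_template_id = Column(Integer, ForeignKey('raw_template.id'))
    raw_template = relationship(RawTemplate)


def stack_get(session, stack_id, eager_load=True):
    query = session.query(Stack)
    if eager_load:
        # Pull the template in the same query; skip this when the caller
        # only needs the stack's name and id.
        query = query.options(joinedload(Stack.raw_template))
    return query.get(stack_id)
```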
OpenStack Infra (hudson-openstack) wrote : | #29 |
Reviewed: https:/
Committed: https:/
Submitter: Jenkins
Branch: master
commit 0830318707b6b12
Author: Thomas Herve <email address hidden>
Date: Tue Oct 4 11:52:40 2016 +0200
Don't create yaql context
In the yaql function, we create and store a yaql context object that we
keep during the lifetime of the function. This is only needed for
evaluation, so let yaql create the context itself, and don't reference
it so that it's garbage collected.
Change-Id: If3015cf85dfe96
Related-Bug: #1626675
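A sketch of the resulting pattern using the real yaql library (the surrounding names are illustrative, not Heat's actual function plugin): nothing long-lived holds the context, so it can be garbage-collected after each evaluation.

```python
import yaql
from yaql import factory

ENGINE = factory.YaqlFactory().create()


def evaluate(expression, data):
    # A fresh context per evaluation; no long-lived reference keeps it alive.
    return ENGINE(expression).evaluate(data=data,
                                       context=yaql.create_context())


print(evaluate('len($.items)', {'items': [1, 2, 3]}))  # -> 3
```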
Crag Wolfe (cwolfe) wrote : | #30 |
One more observation for https:/
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to heat (stable/newton) | #31 |
Related fix proposed to branch: stable/newton
Review: https:/
OpenStack Infra (hudson-openstack) wrote : Fix merged to heat (master) | #32 |
Reviewed: https:/
Committed: https:/
Submitter: Jenkins
Branch: master
commit adb8629a90eff29
Author: Zane Bitter <email address hidden>
Date: Tue Oct 4 16:46:58 2016 -0400
Use __slots__ in ResourceInfo classes
A TripleO environment typically contains hundreds of resource type
mappings. And a TripleO deployment typically contains hundreds of nested
stacks. The result is typically tens of thousands of ResourceInfo objects
all loaded in memory at the same time.
This change saves memory by using slots for these classes instead of
__dict__. I'd expect this to save on the order of tens of megabytes of RAM
in a TripleO deployment - comparatively modest, but an easy win given that
it is such a simple change.
Change-Id: Ia0f17be794618d
Partial-Bug: #1626675
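The technique itself is plain Python; a small self-contained illustration (not the actual ResourceInfo code): with __slots__, instances store attributes in fixed slots instead of a per-instance __dict__, which adds up when tens of thousands of instances are alive at once.

```python
import sys


class WithDict(object):
    def __init__(self, name, value):
        self.name = name
        self.value = value


class WithSlots(object):
    __slots__ = ('name', 'value')

    def __init__(self, name, value):
        self.name = name
        self.value = value


a = WithDict('x', 1)
b = WithSlots('x', 1)
print(hasattr(b, '__dict__'))      # False: no per-instance dict at all
print(sys.getsizeof(a.__dict__))   # the per-instance overhead being saved
```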
Steve Baker (steve-stevebaker) wrote : | #33 |
Using heat master including the recent memory improvements, I've just bisected tripleo-
There is nothing suspicious-looking merging around the 25th; commits on either side of this change[1] could be investigated.
It could be that the bisect shows gradual memory growth over the changes, and the 25th is when the threshold is reached. I've captured memory use logs for each run so my next step is to graph these and see if there is a trend. tripleo-
OpenStack Infra (hudson-openstack) wrote : Fix proposed to heat (stable/newton) | #34 |
Fix proposed to branch: stable/newton
Review: https:/
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to heat (master) | #35 |
Related fix proposed to branch: master
Review: https:/
Steve Baker (steve-stevebaker) wrote : | #36 |
I did another bisect with a more predictable environment (fake nova virt, script to fake deployment signals)
This time it bisected to this change[1] on August 24th. With this change the number of yaql uses goes from 2 to 3. By the time stable/newton is branched there are ~25 uses of the yaql function.
therve's latest yaql fix[2] made a huge difference - I'm not seeing OOMs on an 8GB undercloud deploying stable/newton with a fresh heat-engine. (Doing multiple deploys without engine restarts still OOMs, so we can't call this fixed.)
[1] http://
[2] https:/
tags: removed: in-stable-newton
OpenStack Infra (hudson-openstack) wrote : Related fix merged to heat (master) | #37 |
Reviewed: https:/
Committed: https:/
Submitter: Jenkins
Branch: master
commit 21990655b609a3a
Author: Zane Bitter <email address hidden>
Date: Thu Oct 6 09:40:45 2016 -0400
Use __slots__ in Parameter classes
A typical stack may easily have dozens of parameters, so Parameter objects
are very common in memory. They're also very simple and change rarely, all
of which makes them a good candidate for being made lighter-weight using
slots to avoid creating __dict__.
Change-Id: I23e07876054cba
Related-Bug: #1626675
OpenStack Infra (hudson-openstack) wrote : Fix merged to heat (stable/newton) | #38 |
Reviewed: https:/
Committed: https:/
Submitter: Jenkins
Branch: stable/newton
commit 2830c83fd20a9ab
Author: Zane Bitter <email address hidden>
Date: Tue Oct 4 16:46:58 2016 -0400
Use __slots__ in ResourceInfo classes
A TripleO environment typically contains hundreds of resource type
mappings. And a TripleO deployment typically contains hundreds of nested
stacks. The result is typically tens of thousands of ResourceInfo objects
all loaded in memory at the same time.
This change saves memory by using slots for these classes instead of
__dict__. I'd expect this to save on the order of tens of megabytes of RAM
in a TripleO deployment - comparatively modest, but an easy win given that
it is such a simple change.
Change-Id: Ia0f17be794618d
Partial-Bug: #1626675
(cherry picked from commit adb8629a90eff29
tags: added: in-stable-newton
OpenStack Infra (hudson-openstack) wrote : Fix proposed to heat (stable/newton) | #39 |
Fix proposed to branch: stable/newton
Review: https:/
OpenStack Infra (hudson-openstack) wrote : Change abandoned on heat (stable/newton) | #40 |
Change abandoned by Zane Bitter (<email address hidden>) on branch: stable/newton
Review: https:/
Reason: Created https:/
Steve Baker (steve-stevebaker) wrote : | #41 |
We may be reaching diminishing returns in finding reference loops and leaks - it could be that heat's object creation pattern will always lead to a fragmented heap and memory that isn't returned to the OS (at least on python-2.7).
We have a worker process model and a polite EngineService.stop implementation. Why don't we keep a counter of some indicative metric (RPC calls, stacks loaded) and, once a configured limit is reached, stop the current worker? Memory would be returned to the OS, and a new worker would be automatically spawned.
The config value specifying the limit could default to -1 to disable this stopping behaviour - that would retain current behaviour and keep our CI useful for catching other memory problems.
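A rough sketch of what that could look like (all names hypothetical; a real implementation would hook into Heat's existing worker and EngineService machinery rather than a standalone class):

```python
class EngineWorker(object):
    def __init__(self, max_requests=-1):
        # -1 (the proposed default) disables the restart behaviour.
        self.max_requests = max_requests
        self._handled = 0
        self._running = True

    def handle_rpc_call(self, request):
        self._handled += 1
        result = self.dispatch(request)  # placeholder for real dispatch
        if 0 < self.max_requests <= self._handled:
            # Polite shutdown: the fragmented heap goes back to the OS and
            # the parent process respawns a fresh worker.
            self.stop()
        return result

    def dispatch(self, request):
        return request  # stand-in for real RPC handling

    def stop(self):
        self._running = False
```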
OpenStack Infra (hudson-openstack) wrote : Fix merged to heat (stable/newton) | #42 |
Reviewed: https:/
Committed: https:/
Submitter: Jenkins
Branch: stable/newton
commit 2ca3df862d6eaee
Author: Thomas Herve <email address hidden>
Date: Tue Oct 4 11:52:40 2016 +0200
Create a root Yaql context
In the yaql function, we create and store a yaql context object that we
keep during the lifetime of the function. This is only needed for
evaluation. Not keeping track of the context is an improvement on
memory, but require registration of the library every time. The most
efficient way to use yaql contexts seems to be to create a root context,
and then pass a child one for each evaluation.
Change-Id: I12ea701e51a4c3
Partial-Bug: #1626675
(cherry picked from commits 0830318707b6b12
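The root-plus-child pattern this backport describes looks roughly like the following with the real yaql API (surrounding names are illustrative): library registration happens once on the root context, and each evaluation gets a cheap, collectable child.

```python
import yaql
from yaql import factory

ENGINE = factory.YaqlFactory().create()
ROOT_CONTEXT = yaql.create_context()  # standard library registered once


def evaluate(expression, data):
    # Child contexts are cheap to create and garbage-collected after use,
    # while the expensive registration work is never repeated.
    return ENGINE(expression).evaluate(
        data=data, context=ROOT_CONTEXT.create_child_context())
```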
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to heat (master) | #43 |
Related fix proposed to branch: master
Review: https:/
OpenStack Infra (hudson-openstack) wrote : Related fix merged to heat (master) | #44 |
Reviewed: https:/
Committed: https:/
Submitter: Jenkins
Branch: master
commit df889488ede2d3c
Author: Zane Bitter <email address hidden>
Date: Mon Oct 10 15:11:42 2016 -0400
Avoid loading nested stacks in memory where possible
Prior to changing StackResource to do stack operations over RPC, we made
liberal use of the StackResource.
that was likely always loaded in memory. Now that that is no longer
required, it adds additional memory overhead that we need not have. We can
now obtain the stack identifier without loading the stack, and that is
sufficient for performing operations over RPC.
The exceptions are prepare_abandon(), which cannot be done over RPC at
present, and get_output(), which may be addressed in a separate patch. The
gratuitous loading of the nested stack in TemplateResource
is eliminated, so although it still ends up loading the nested stack in
many cases, it will no longer do so once get_output() stops doing it.
Change-Id: I669d2a077381d7
Co-Authored-By: Thomas Herve <email address hidden>
Related-Bug: #1626675
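Purely as an illustration of the idea (all names hypothetical, not Heat's actual StackResource API): the parent engine keeps only an identifier and delegates the operation over RPC, so the nested stack is only ever instantiated in the remote engine's memory.

```python
class StackResource(object):
    def __init__(self, context, rpc_client, nested_stack_id):
        self.context = context
        self.rpc = rpc_client
        self._nested_id = nested_stack_id

    def nested_identifier(self):
        # Cheap: just an identifier, no template parsing, no resource graph.
        return self._nested_id

    def delete_nested(self):
        # The remote engine loads the stack; this process never does.
        self.rpc.delete_stack(self.context, self.nested_identifier())
```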
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to heat (stable/newton) | #45 |
Related fix proposed to branch: stable/newton
Review: https:/
Zane Bitter (zaneb) wrote : | #46 |
Here's the smoking gun for t-h-t: an analysis of the ps.txt file in the logs (which records the memory usage of each process at the end of the test):
https:/
It shows the memory use creeping up gradually from 1.3GiB in early August to 2.4GiB by late September.
It looks like the "Use __slots__ in ResourceInfo classes" and "Create a root Yaql context" patches have knocked things back considerably (to 1.6GiB), and I expect now that the "Use RPC to retrieve nested stack output" patch has merged we'll see it drop back even further in future, to around 1.0GiB.
Changed in heat:
importance: Critical → High
OpenStack Infra (hudson-openstack) wrote : Related fix merged to heat (master) | #47 |
Reviewed: https:/
Committed: https:/
Submitter: Jenkins
Branch: master
commit 3a3e6a884091b28
Author: Thomas Herve <email address hidden>
Date: Mon Oct 10 17:35:36 2016 -0400
Use RPC to retrieve nested stack output
Instead of loading the stack in memory, use RPC to get the stack and its
output. It releases memory pressure from the main engine in legacy mode.
Change-Id: Id3da88e8c5d9b6
Related-Bug: #1626675
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to heat (stable/newton) | #48 |
Related fix proposed to branch: stable/newton
Review: https:/
Steven Hardy (shardy) wrote : | #49 |
Thanks for all the fixes here guys - I've just re-tested and I can confirm things look much better now:
http://
We're still above where we were mid-Newton after the previous round of memory-related fixes, but that could easily be related to t-h-t complexity changes since then.
This was heat master at 06fe8d89ff78779
OpenStack Infra (hudson-openstack) wrote : Related fix merged to heat (stable/newton) | #50 |
Reviewed: https:/
Committed: https:/
Submitter: Jenkins
Branch: stable/newton
commit 53137cb228768b2
Author: Zane Bitter <email address hidden>
Date: Mon Oct 10 15:11:42 2016 -0400
Avoid loading nested stacks in memory where possible
Prior to changing StackResource to do stack operations over RPC, we made
liberal use of the StackResource.
that was likely always loaded in memory. Now that that is no longer
required, it adds additional memory overhead that we need not have. We can
now obtain the stack identifier without loading the stack, and that is
sufficient for performing operations over RPC.
The exceptions are prepare_abandon(), which cannot be done over RPC at
present, and get_output(), which may be addressed in a separate patch. The
gratuitous loading of the nested stack in TemplateResource
is eliminated, so although it still ends up loading the nested stack in
many cases, it will no longer do so once get_output() stops doing it.
Change-Id: I669d2a077381d7
Co-Authored-By: Thomas Herve <email address hidden>
Related-Bug: #1626675
(cherry picked from commit df889488ede2d3c
OpenStack Infra (hudson-openstack) wrote : | #51 |
Reviewed: https:/
Committed: https:/
Submitter: Jenkins
Branch: stable/newton
commit 92cadc352a67ab2
Author: Thomas Herve <email address hidden>
Date: Mon Oct 10 17:35:36 2016 -0400
Use RPC to retrieve nested stack output
Instead of loading the stack in memory, use RPC to get the stack and its
output. It releases memory pressure from the main engine in legacy mode.
Change-Id: Id3da88e8c5d9b6
Related-Bug: #1626675
(cherry picked from commit 3a3e6a884091b28
Thomas Herve (therve) wrote : | #52 |
I'd like to close this bug. We still have numerous improvements to make, but we've put some stop-gap measures in place, so I think this particular iteration is done.
Changed in heat:
status: In Progress → Fix Released