10 minute increase in overcloud deploy/update time

Bug #1684272 reported by Ben Nemec
Affects          Status         Importance  Assigned to
OpenStack Heat   Fix Released   High        Zane Bitter
tripleo          Fix Released   High        Unassigned

Bug Description

Sometime on April 18th something merged that increased our overcloud update time by about 10 minutes in the ovb-updates job. This is a significant performance regression that will affect our CI capacity and turnaround times. It's already causing spurious timeouts in updates jobs as well.

You can see in graphite that over the past 21 days our update times have been fairly consistent, then on April 18th they jump by 500+ seconds: https://66.187.229.172/S/I The ocata branch did not jump, so this appears to be something which merged to master and not an environmental issue.

Revision history for this message
Ben Nemec (bnemec) wrote :

Also worth noting that the post-increase graph is somewhat misleading in that it only shows the jobs which passed. The actual average update time would be somewhat higher if you include the jobs that timed out due to the increase.

Revision history for this message
Ben Nemec (bnemec) wrote :

Likely cause:

April 17th updates job http://logs.openstack.org/50/454650/1/check-tripleo/gate-tripleo-ci-centos-7-ovb-updates/e516b56/logs/postci.txt.gz:
ControllerServiceChain 37.0

April 19th updates job http://logs.openstack.org/36/458036/2/check-tripleo/gate-tripleo-ci-centos-7-ovb-updates/0a15c0f/logs/postci.txt.gz:
ControllerServiceChain 547.0

That's a huge increase that approximately corresponds to the increase in stack update time.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to tripleo-heat-templates (master)

Related fix proposed to branch: master
Review: https://review.openstack.org/458218

Revision history for this message
OpenStack Infra (hudson-openstack) wrote :

Related fix proposed to branch: master
Review: https://review.openstack.org/458219

Revision history for this message
Ben Nemec (bnemec) wrote : Re: 10 minute increase in overcloud update time

Looks like this might be a problem with the latest promotion: https://www.diffchecker.com/9SDDKxaR

Two jobs in different repos that ran around the same time. One has the problem, the other does not. The most obvious difference is that all of the OpenStack packages got updated between them.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on tripleo-heat-templates (master)

Change abandoned by Ben Nemec (<email address hidden>) on branch: master
Review: https://review.openstack.org/458218
Reason: I no longer think this is related to the performance issue. It looks like a promotion happened at the same time and is more likely the cause. Apologies for the noise.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote :

Change abandoned by Ben Nemec (<email address hidden>) on branch: master
Review: https://review.openstack.org/458219
Reason: I no longer think this is related to the performance issue. It looks like a promotion happened at the same time and is more likely the cause. Apologies for the noise.

Revision history for this message
Christian Schwede (cschwede) wrote : Re: 10 minute increase in overcloud update time

Proposed temporary fix for the upgrade jobs: https://review.openstack.org/#/c/458364/

Revision history for this message
Emilien Macchi (emilienm) wrote :

I'm working on tripleo-ci to use infra mirrors for upgrade jobs.

Changed in tripleo:
assignee: nobody → Emilien Macchi (emilienm)
status: Triaged → In Progress
Revision history for this message
Ben Nemec (bnemec) wrote :

This doesn't appear to be a repo speed issue. There was a regression in how long it takes to do a stack update. That presumably affects the upgrades jobs too, but even a basic updates job is timing out.

Revision history for this message
Emilien Macchi (emilienm) wrote :

Ben, do you have metrics that you can show here about your theory?

Changed in tripleo:
assignee: Emilien Macchi (emilienm) → nobody
Revision history for this message
Ben Nemec (bnemec) wrote :

Turns out this isn't just updates. All stack deployments are taking about 10 minutes longer. It's worse for updates because it's taking the 10 minute hit twice, but as you can see in this new graph all of the jobs had a bump in overcloud deployment time at the same time: https://66.187.229.172/S/J (Ocata jobs are also graphed to show that they didn't change, so this is something master-related).

summary: - 10 minute increase in overcloud update time
+ 10 minute increase in overcloud deploy/update time
Revision history for this message
Ben Nemec (bnemec) wrote :

Trying a temprevert (temporary revert) of a heat change here: https://review.openstack.org/458570

Revision history for this message
Ben Nemec (bnemec) wrote :

I'm doing some pinning in ci to see if we can nail down which commits caused performance regressions, and to confirm that there wasn't also a tht change that contributed. Here are the results so far:

The commit we reverted, to confirm that the full regression was present after this merged:
f94d76cb322f14a001e7990c0918a67f1a09ea16 ControllerServiceChain 722.0

It was. Another commit about halfway between that one and the commit used for the promotion prior to the regression showing up:
b5359cedf10dab59a11effb0ebed383ad23c73a6 ControllerServiceChain 169.0

Better, but still not back to the 30 seconds or so that it used to be. I've pushed two new pins starting at https://review.openstack.org/#/c/460188 to test another halfway point between the last known good, and the last known good itself to verify it still yields the same results it did before the promotion.

Revision history for this message
Ben Nemec (bnemec) wrote :

Hmm, last known good is no longer good. Better, but not good.

2cbd4593261ba90ed61d2d293e69a930ef0b59b7 ControllerServiceChain 189.0

Looking at a pre-regression job, that's the commit we had in delorean at the time: openstack-heat-common-8.0.0-0.20170405074820.2cbd459.el7.centos.noarch

So something outside of heat has also impacted this regression. Based on Zane's findings in https://bugs.launchpad.net/tripleo/+bug/1666072/comments/31 maybe oslo.messaging? I don't see anything particularly suspicious from around that time though. :-/

I don't see a significant amount of swapping on the undercloud during the regressed jobs. Our resident memory usage seems to top out around 6 GB on 8 GB VMs. There is some swap being used, but it's minimal -- ~100 MB.

Revision history for this message
Ben Nemec (bnemec) wrote :

I don't think the latest results make any sense. :-/

666930d1958cc34a85da9712a7e09200ad0ec030 ControllerServiceChain 454.0

c10fa9be2b4727a9fbd5b19cf3cbf2570246fc9a ControllerServiceChain 699.0

These are back-to-back commits. c10 is only a unit test change (Unit tests: fix signal URL immutability tests) so I don't see how it could possibly have had any impact on runtime, yet it's showing a 250 second increase.

I have to be missing something here, but I can't see what it is. The rpm -qa output in the jobs suggests that they are indeed using the expected commit hashes. I'm going to recheck the c10 patch and see if we had a fluke slow job or something.

Revision history for this message
Ben Nemec (bnemec) wrote :

It actually got worse on the recheck: ControllerServiceChain 736.0

http://logs.openstack.org/89/460189/4/check-tripleo/gate-tripleo-ci-centos-7-ovb-nonha/e481ff0/logs/undercloud/var/log/heat-deploy-times.txt.gz

I have yet to come up with a rational explanation for this.

Revision history for this message
Steven Hardy (shardy) wrote :

So I did a little digging locally, with a heatclient modified to show the time delta between IN_PROGRESS and COMPLETE:

http://paste.openstack.org/show/608682/

The results are interesting, in that the ServiceChain seems effectively serialized despite setting concurrent: true. We put all the nested stacks IN_PROGRESS, then get a serialized series of events while either RPC or the heat scheduler catches up and works through them, not very concurrently. For example, compare:

0:00:08 [overcloud-ControllerServiceChain-zdhipc4gtgfs-ServiceChain-cybm26brd2kv.4]: CREATE_COMPLETE state changed

0:02:16 [overcloud-ControllerServiceChain-zdhipc4gtgfs-ServiceChain-cybm26brd2kv.80]: CREATE_COMPLETE state changed

The fastest nested stack completed in 8 seconds and the longest took well over 2 minutes, yet the longest elapsed time belongs to a nested stack that took zero seconds to actually create:

(undercloud) [stack@undercloud ~]$ openstack stack resource show 9e3270bc-603b-46d7-81ad-caa63c912395 80 | grep physical_resource_id
| physical_resource_id | 49f032a0-7ea2-4e8c-8633-34a19b16036d |

(undercloud) [stack@undercloud ~]$ openstack stack event list 49f032a0-7ea2-4e8c-8633-34a19b16036d
0:00:00 [overcloud-ControllerServiceChain-zdhipc4gtgfs-ServiceChain-cybm26brd2kv-80-vcwt5iqm4rvy]: CREATE_IN_PROGRESS Stack CREATE started
0:00:00 [overcloud-ControllerServiceChain-zdhipc4gtgfs-ServiceChain-cybm26brd2kv-80-vcwt5iqm4rvy]: CREATE_COMPLETE Stack CREATE completed successfully
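The time deltas above can be reproduced from an event list with a small helper along these lines (an illustrative sketch only; the event tuples and function name are hypothetical, not the actual modified-heatclient output):

```python
from datetime import datetime

def stack_elapsed(events):
    """Given (timestamp, stack_name, status) event tuples, return the
    elapsed seconds between CREATE_IN_PROGRESS and CREATE_COMPLETE for
    each stack, mirroring the time-delta column in the paste above."""
    started, elapsed = {}, {}
    for ts, name, status in events:
        t = datetime.strptime(ts, "%Y-%m-%d %H:%M:%S")
        if status == "CREATE_IN_PROGRESS":
            started[name] = t
        elif status == "CREATE_COMPLETE" and name in started:
            elapsed[name] = (t - started[name]).total_seconds()
    return elapsed

# Hypothetical events matching the two ServiceChain members quoted above:
# member 4 finishes in 8s, member 80 in 2m16s, both started together.
events = [
    ("2017-04-20 10:00:00", "ServiceChain.4", "CREATE_IN_PROGRESS"),
    ("2017-04-20 10:00:00", "ServiceChain.80", "CREATE_IN_PROGRESS"),
    ("2017-04-20 10:00:08", "ServiceChain.4", "CREATE_COMPLETE"),
    ("2017-04-20 10:02:16", "ServiceChain.80", "CREATE_COMPLETE"),
]
print(stack_elapsed(events))
```

The large spread between members that started simultaneously is what suggests the work is being serialized behind the scenes rather than run concurrently.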

Revision history for this message
Ben Nemec (bnemec) wrote :

I'm dropping the alert on this one. The right people are aware of it and investigating, and I don't want to leave alert bugs open for a long period of time. It would just make people tune out the alerts.

This is still a fairly serious issue so I'm going to leave it at critical though.

tags: removed: alert
Revision history for this message
Thomas Herve (therve) wrote :

I made another test here https://review.openstack.org/#/c/466311/ removing another suspect Heat commit (https://review.openstack.org/#/c/435171/). We can see the result here: http://logs.openstack.org/11/466311/1/check-tripleo/gate-tripleo-ci-centos-7-ovb-updates/a3a49fc/logs/postci.txt.gz

We can see that it has a noticeable effect on the service chain creation:

ControllerServiceChain 108.0

Compared to 384 on a similar job previously. But the overall deployment time is not impacted. That's because the commit moves calculation of the attributes to resource creation instead of later on. So while the commit we already reverted had an effect, I don't think this one does, and the change in timing on the chain creation is a false alarm.

Changed in tripleo:
milestone: pike-2 → pike-3
Revision history for this message
Ben Nemec (bnemec) wrote :

We've still got a performance regression here, but it's not blocking things AFAIK so it shouldn't be critical. It may be that the eventual solution is a move to the convergence engine.

Changed in tripleo:
importance: Critical → High
Zane Bitter (zaneb)
Changed in heat:
assignee: nobody → Zane Bitter (zaneb)
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to heat (master)

Fix proposed to branch: master
Review: https://review.openstack.org/477031

Changed in heat:
status: New → In Progress
Zane Bitter (zaneb)
Changed in heat:
status: In Progress → Triaged
importance: Undecided → High
milestone: none → pike-3
Revision history for this message
OpenStack Infra (hudson-openstack) wrote :

Fix proposed to branch: master
Review: https://review.openstack.org/477046

Changed in heat:
status: Triaged → In Progress
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to heat (master)

Related fix proposed to branch: master
Review: https://review.openstack.org/477047

Revision history for this message
OpenStack Infra (hudson-openstack) wrote :

Related fix proposed to branch: master
Review: https://review.openstack.org/480755

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on heat (master)

Change abandoned by Zane Bitter (<email address hidden>) on branch: master
Review: https://review.openstack.org/480755
Reason: Accidentally regenerated the change ID. This patch is now at I234a0762a00815af86b5711e7494366c16e717c2.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to puppet-tripleo (master)

Related fix proposed to branch: master
Review: https://review.openstack.org/481083

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to tripleo-heat-templates (master)

Related fix proposed to branch: master
Review: https://review.openstack.org/481084

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to puppet-tripleo (master)

Reviewed: https://review.openstack.org/481083
Committed: https://git.openstack.org/cgit/openstack/puppet-tripleo/commit/?id=6462e2e3c37d4778e27c26e1286e905009318aef
Submitter: Jenkins
Branch: master

commit 6462e2e3c37d4778e27c26e1286e905009318aef
Author: Steven Hardy <email address hidden>
Date: Thu Jul 6 12:29:50 2017 +0100

    Remove dependency on memcached_node_ips_v6

    This is set via all_nodes_config in t-h-t, but it's a special case for
    this service, so it'll be better if we handle the ipv6 transformation
    in puppet instead of relying on the service specific list mangling in
    t-h-t (one aspect of which has been identified as a potential performance
    problem).

    Related-Bug: #1684272
    Change-Id: Iccb9089db4b382db3adb9340f18f6d2364ca7f58

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to tripleo-heat-templates (master)

Reviewed: https://review.openstack.org/481084
Committed: https://git.openstack.org/cgit/openstack/tripleo-heat-templates/commit/?id=b108289aa610b7d378d4461d942698eda35b4f31
Submitter: Jenkins
Branch: master

commit b108289aa610b7d378d4461d942698eda35b4f31
Author: Steven Hardy <email address hidden>
Date: Thu Jul 6 14:37:31 2017 +0100

    Remove special-case of memcache node ips for ipv6

    This should be handled in puppet-tripleo, as is done for some other
    services e.g ceph. This has also been identified as a possible
    performance problem due to the nested get_attr calls.

    Change-Id: I7e14f0219c28c023c4e8e1d4693f0bfa9674d801
    Related-Bug: #1684272
    Depends-On: Iccb9089db4b382db3adb9340f18f6d2364ca7f58

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to heat (master)

Reviewed: https://review.openstack.org/477031
Committed: https://git.openstack.org/cgit/openstack/heat/commit/?id=c889f08ab5279caae3d6fe4e0614fa938fb2c9ff
Submitter: Jenkins
Branch: master

commit c889f08ab5279caae3d6fe4e0614fa938fb2c9ff
Author: Zane Bitter <email address hidden>
Date: Wed Jul 19 17:35:40 2017 -0400

    Optimise the function.dep_attrs() function

    Due to unfortunate copy-pasta, the original implementation of
    function.dep_attrs() iterates over items in a dictionary instead of values.
    Dictionary keys cannot have dependencies - they must be hashable and
    Function objects are not. So for each dictionary key we call dep_attrs()
    recursively 3 times as many times as we need to (for the key/value tuple,
    the key, and the value). For template snippets that are dominated by
    dictionaries, this means it took three times as long and used three times
    as much memory and twice as much stack space as it needed to.

    Change-Id: I13781540483daf88202d221a9f517746eebf0346
    Partial-Bug: #1684272
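The pattern the commit message describes can be sketched as follows (a minimal illustration, not Heat's actual dep_attrs() code): recursing over dict.items() visits each (key, value) tuple, then the key, then the value, i.e. three recursive calls per entry, while recursing over dict.values() visits only the values, which are the only place unhashable Function objects can appear.

```python
def count_visits_items(snippet):
    """Buggy variant: recurses over dict.items(), visiting each
    (key, value) tuple plus the key plus the value."""
    count = 0
    if isinstance(snippet, dict):
        for item in snippet.items():   # yields (key, value) tuples
            count += count_visits_items(item)
    elif isinstance(snippet, (list, tuple)):
        for elem in snippet:
            count += count_visits_items(elem)
    return count + 1  # count this node itself

def count_visits_values(snippet):
    """Fixed variant: recurses over dict.values() only, since dict
    keys must be hashable and so cannot contain Function objects."""
    count = 0
    if isinstance(snippet, dict):
        for value in snippet.values():
            count += count_visits_values(value)
    elif isinstance(snippet, (list, tuple)):
        for elem in snippet:
            count += count_visits_values(elem)
    return count + 1

# A dictionary-dominated snippet, like the template sections mentioned above.
snippet = {"a": {"b": 1, "c": 2}, "d": 3}
print(count_visits_items(snippet), count_visits_values(snippet))
```

For this toy snippet the buggy variant makes 13 visits against 5 for the fixed one, which matches the roughly 3x blow-up in time and memory described in the commit message for dictionary-heavy templates.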

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to heat (stable/ocata)

Fix proposed to branch: stable/ocata
Review: https://review.openstack.org/487605

Changed in tripleo:
milestone: pike-3 → pike-rc1
Rico Lin (rico-lin)
Changed in heat:
milestone: pike-3 → pike-rc1
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to heat (master)

Reviewed: https://review.openstack.org/477046
Committed: https://git.openstack.org/cgit/openstack/heat/commit/?id=3c13cb82a6fc2b6205b65f1cc114578499922227
Submitter: Jenkins
Branch: master

commit 3c13cb82a6fc2b6205b65f1cc114578499922227
Author: Zane Bitter <email address hidden>
Date: Mon Jul 10 13:48:01 2017 -0400

    Cache dep_attrs for all resources in definitions

    When the dep_attrs function was written, it was used only in convergence
    after checking a single resource. However, now we also use it to generate
    data for the ResourceProxy objects, which is often done for all resources
    simultaneously. That means doing O(n^2) dep_attrs() calls, which can really
    slow things down when there is a large number of resources with complex
    properties (or metadata).

    This change adds an all_dep_attrs() call, which runs as fast as dep_attrs()
    on everything except get_attr functions and their arguments, but only needs
    to be called once instead of once per resource. (The get_attr function can
    in future override the default implementation of all_dep_attrs() to be as
    efficient as dep_attrs() as well.) The resulting data is cached in the
    ResourceDefinition or OutputDefinition so that subsequent calls to their
    get_attr() methods with different (or the same) resource names will use the
    existing data.

    Change-Id: If95f4c04b841519ce3d7492211f2696588c0ed48
    Partially-Implements: blueprint stack-definition
    Closes-Bug: #1684272
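The optimisation the commit describes, collecting attribute dependencies for every resource in one template walk instead of one walk per resource, can be sketched like this (names and template shape are hypothetical, not Heat's real API):

```python
from collections import defaultdict

def all_dep_attrs(template_resources):
    """Single pass over all resource definitions, collecting every
    {"get_attr": [resource, attr, ...]} reference keyed by the target
    resource. One O(n) walk replaces n per-resource walks (O(n^2))."""
    attrs = defaultdict(set)

    def walk(node):
        if isinstance(node, dict):
            ga = node.get("get_attr")
            if len(node) == 1 and isinstance(ga, list) and ga:
                # Record target resource -> attribute path referenced.
                attrs[ga[0]].add(tuple(ga[1:]))
            for value in node.values():
                walk(value)
        elif isinstance(node, list):
            for elem in node:
                walk(elem)

    for defn in template_resources.values():
        walk(defn)
    return attrs

# Toy template: two resources reference attributes of other resources.
template = {
    "server": {"properties": {"flavor": "m1.small"}},
    "port": {"properties": {"network": {"get_attr": ["net", "id"]}}},
    "fip": {"properties": {"port_id": {"get_attr": ["port", "id"]}}},
}

cache = all_dep_attrs(template)  # computed once, then reused per resource
print(sorted(cache))
```

Subsequent per-resource lookups become dictionary reads against the cached result rather than fresh template traversals, which is the caching behaviour the commit adds to ResourceDefinition and OutputDefinition.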

Changed in heat:
status: In Progress → Fix Released
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to heat (master)

Reviewed: https://review.openstack.org/477047
Committed: https://git.openstack.org/cgit/openstack/heat/commit/?id=ee5dc7c120d793b66163815961b6a74b20cb8c5a
Submitter: Jenkins
Branch: master

commit ee5dc7c120d793b66163815961b6a74b20cb8c5a
Author: Zane Bitter <email address hidden>
Date: Mon Jul 10 13:48:01 2017 -0400

    Implement all_dep_attrs() more efficiently for get_attr

    Override the default implementation of all_dep_attrs() (which calls
    dep_attrs() in a loop). This makes all_dep_attrs() for GetAtt and its
    arguments as efficient as dep_attrs(), even if we only need the data for a
    single resource (as in convergence after we update a resource).

    Change-Id: Ia6a4d132e655ef24fb158959d05967d7f1c7fef8
    Related-Bug: #1684272

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/heat 9.0.0.0rc1

This issue was fixed in the openstack/heat 9.0.0.0rc1 release candidate.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to heat (stable/ocata)

Reviewed: https://review.openstack.org/487605
Committed: https://git.openstack.org/cgit/openstack/heat/commit/?id=0b934bf8609b631e24bdb0a09c090dabeef75e13
Submitter: Jenkins
Branch: stable/ocata

commit 0b934bf8609b631e24bdb0a09c090dabeef75e13
Author: Zane Bitter <email address hidden>
Date: Wed Jul 19 17:35:40 2017 -0400

    Optimise the function.dep_attrs() function

    Due to unfortunate copy-pasta, the original implementation of
    function.dep_attrs() iterates over items in a dictionary instead of values.
    Dictionary keys cannot have dependencies - they must be hashable and
    Function objects are not. So for each dictionary key we call dep_attrs()
    recursively 3 times as many times as we need to (for the key/value tuple,
    the key, and the value). For template snippets that are dominated by
    dictionaries, this means it took three times as long and used three times
    as much memory and twice as much stack space as it needed to.

    Change-Id: I13781540483daf88202d221a9f517746eebf0346
    Partial-Bug: #1684272
    (cherry picked from commit c889f08ab5279caae3d6fe4e0614fa938fb2c9ff)

tags: added: in-stable-ocata
Changed in tripleo:
milestone: pike-rc1 → pike-rc2
Changed in tripleo:
milestone: pike-rc2 → queens-1
Changed in tripleo:
status: In Progress → Fix Released