HA deployments taking close to the CI timeout

Bug #1664418 reported by Ben Nemec on 2017-02-14
Affects: tripleo
Importance: High
Assigned to: Emilien Macchi

Bug Description

I've noticed some HA jobs timing out with no apparent errors. Looking at the Graphite data, this is likely because our HA deployments are simply taking close to the 1 hour 20 minute (4800 second) timeout, so any slowdown in a job is going to cause it to hit the limit.

Performance seems to have regressed pretty significantly over the past month or so. You can see this in the graph at https://66.187.229.172/S/F (give it time, it's processing a month of CI metrics). The undercloud install time (which did increase some due to additional services) is also included to demonstrate that it's not just an environmental problem.

Unfortunately, until a few days ago we didn't split things out by release, so the older metrics probably look somewhat better than they should because stable branch jobs are also included, and those are consistently faster to deploy than master (see below for details on this).

I do have some suspicion that part of the regression may be related to moving around the nova and keystone initialization in https://github.com/openstack/puppet-tripleo/commit/3b00ffc728b47e132b3ed8bc460f8697ddb32047 and https://github.com/openstack/puppet-tripleo/commit/bb63f514d22ea82d17947a5972b4da16e66b5a36. It's hard to say for sure from the numbers available though because of the branch issue in the older data.

Note that we don't collect metrics on failed jobs, so any job whose deploy exceeded 4800 seconds will not show up here; the averages may therefore be skewed a little low.

Further supporting the possibility that the step changes are involved, here are the deployment times for the longest four steps as of a week ago:

2017-02-07 06:50:21.000 | ControllerDeployment_Step5 961.0
2017-02-07 06:50:21.000 | ControllerDeployment_Step4 683.0
2017-02-07 06:50:21.000 | ControllerDeployment_Step3 509.0
2017-02-07 06:50:21.000 | ControllerDeployment_Step2 469.0

And a recent job:

2017-02-13 21:10:27.000 | ControllerDeployment_Step3 1109.0
2017-02-13 21:10:27.000 | ControllerDeployment_Step4 1070.0
2017-02-13 21:10:27.000 | ControllerDeployment_Step5 950.0
2017-02-13 21:10:27.000 | ControllerDeployment_Step2 535.0

Steps 3 and 4 have nearly doubled in length. Step 2 is a little longer, which suggests the testenv for this job may have been somewhat slower too, but even if you subtract a corresponding amount of time from each of the other three steps, the run still ends up much longer than before.

So, while the immediate concern is CI jobs timing out, this is also a pretty significant performance regression in a release that was already running noticeably slower than Newton (which, in turn, was noticeably slower than Mitaka - sensing a trend here?). For reference, here's a graph including the Newton- and Mitaka-specific HA deploy times: https://66.187.229.172/S/G . Mitaka jobs on average deploy over 25 minutes faster, and even Newton, which is more similar feature-wise, is around 18 minutes faster right now.

I'm not sure if we have time to do anything in Ocata so late in the cycle, but it's something we definitely need to look into.

tags: added: alert

I think the oooq jobs are experiencing the same problem; half of them finish with a timeout. I'm tracking it in https://bugs.launchpad.net/tripleo/+bug/1663310 because those jobs have an additional problem: connecting to the undercloud fails in the postci function.

Ben Nemec (bnemec) wrote :

I pushed https://review.openstack.org/#/c/433755/ to see if those two puppet commits are actually slowing us down. It will fail the ping test, but we should be able to see the deployment times anyway.

Ben Nemec (bnemec) wrote :

From the temprevert patch:

2017-02-14 18:56:59.000 | ControllerDeployment_Step4 1315.0
2017-02-14 18:56:59.000 | ControllerDeployment_Step5 948.0
2017-02-14 18:56:59.000 | ControllerDeployment_Step3 587.0
2017-02-14 18:56:59.000 | ControllerDeployment_Step2 538.0

It moved the extra time to step 4, but that's still taking an extremely long time. It knocked about 4.5 minutes total off these four steps from the bad case, but it's still a lot slower than the run from the 7th.

Emilien Macchi (emilienm) wrote :

Before the steps refactoring:
2017-02-07 06:50:21.000 | ControllerDeployment_Step2 469.0
2017-02-07 06:50:21.000 | ControllerDeployment_Step3 509.0
2017-02-07 06:50:21.000 | ControllerDeployment_Step4 683.0
2017-02-07 06:50:21.000 | ControllerDeployment_Step5 961.0

After the steps refactoring:
2017-02-13 21:10:27.000 | ControllerDeployment_Step2 535.0
2017-02-13 21:10:27.000 | ControllerDeployment_Step3 1109.0
2017-02-13 21:10:27.000 | ControllerDeployment_Step4 1070.0
2017-02-13 21:10:27.000 | ControllerDeployment_Step5 950.0

With https://review.openstack.org/433954 (on HA job):
01:10:53.000 | ControllerDeployment_Step2 490.0
01:10:53.000 | ControllerDeployment_Step3 946.0
01:10:53.000 | ControllerDeployment_Step4 694.0
01:10:53.000 | ControllerDeployment_Step5 606.0

It's pretty clear it improved.

Changed in tripleo:
assignee: nobody → Emilien Macchi (emilienm)
Emilien Macchi (emilienm) wrote :

Also trying to optimize Apache configuration within one step: https://review.openstack.org/#/c/434016

Reviewed: https://review.openstack.org/433954
Committed: https://git.openstack.org/cgit/openstack/puppet-tripleo/commit/?id=d545621e613f867ef84ba073588da79943f3bb61
Submitter: Jenkins
Branch: master

commit d545621e613f867ef84ba073588da79943f3bb61
Author: Emilien Macchi <email address hidden>
Date: Tue Feb 14 17:17:39 2017 -0500

    tuning: manage keystone resources only at step3

    1. Manage Keystone resources only at step 3. Don't verify them
       at step 4 and 5, it's a huge loss of time.
    2. Don't require Keystone resources for Gnocchi services, they are
       already ready at Step 5.

    Related-Bug: #1664418
    Change-Id: I9879718a1a86b862e5eb97e6f938533c96c9f5c8
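
The step-gating idea in this commit can be sketched roughly as follows. This is a hedged illustration, not the actual puppet-tripleo profile: the class name is hypothetical, and `keystone::roles::admin` / `keystone::endpoint` stand in for the Keystone resources being managed.

```puppet
# Hypothetical sketch of step-gating (not the real puppet-tripleo code).
# The Keystone resources are declared only when step == 3, so the puppet
# runs at steps 4 and 5 no longer re-verify each resource against the
# Keystone API on every pass.
class tripleo::profile::example::keystone (
  $step = hiera('step'),
) {
  if $step == 3 {
    include ::keystone::roles::admin   # admin user/role/project
    include ::keystone::endpoint       # service catalog endpoints
  }
  # Steps 4 and 5: nothing declared here, so no per-resource
  # API round-trips on those passes.
}
```

The saving comes from the fact that "managing" an already-correct keystone_* resource is not free: puppet still has to query the API to verify each one, and doing that at three separate steps multiplied the cost.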

Changed in tripleo:
milestone: ocata-rc1 → ocata-rc2
Ben Nemec (bnemec) wrote :

https://review.openstack.org/433954 has made a huge difference with this. The ha and nonha jobs have dropped about 1000 seconds on average since it merged, and the updates job (which was having to run these steps twice) is down about 1500 seconds. That's about 17 and 25 minutes off the jobs.

Emilien Macchi (emilienm) wrote :

I'm closing this one until we hit it again :-)

Changed in tripleo:
status: Triaged → Fix Released

Reviewed: https://review.openstack.org/434016
Committed: https://git.openstack.org/cgit/openstack/puppet-tripleo/commit/?id=2272bcabba8752cd1876f85b1f9b83b0c7592c94
Submitter: Jenkins
Branch: master

commit 2272bcabba8752cd1876f85b1f9b83b0c7592c94
Author: Emilien Macchi <email address hidden>
Date: Wed Mar 29 17:42:32 2017 -0400

    Deploy WSGI apps at the same step (3)

    So we avoid useless apache restart and save time during the deployment.

    Related-Bug: #1664418
    Change-Id: Ie00b717a6741e215e59d219710154f0d2ce6b39e
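
The single-step WSGI idea can be sketched as below. This is a hedged illustration rather than the real profile code; the `*::wsgi::apache` class names are the stock classes from the corresponding puppet modules, and the surrounding structure is simplified.

```puppet
# Hypothetical sketch: declare every Apache/mod_wsgi-hosted API in the
# same step so all vhost configuration converges in a single puppet run
# and Apache is restarted once, instead of once per step.
if $step == 3 {
  include ::apache
  include ::aodh::wsgi::apache
  include ::gnocchi::wsgi::apache
  include ::ceilometer::wsgi::apache
}
```

Spreading these declarations across steps meant each step's puppet run rewrote Apache configuration and triggered its own service restart; colocating them removes the redundant restarts.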

Reviewed: https://review.openstack.org/453219
Committed: https://git.openstack.org/cgit/openstack/puppet-tripleo/commit/?id=7ac3d9d8a4dbdb9dd31354c2bdab62be032df0c7
Submitter: Jenkins
Branch: stable/ocata

commit 7ac3d9d8a4dbdb9dd31354c2bdab62be032df0c7
Author: Emilien Macchi <email address hidden>
Date: Wed Mar 29 17:42:32 2017 -0400

    Deploy WSGI apps at the same step (3)

    So we avoid useless apache restart and save time during the deployment.

    Note: the backport is not 100% clean as Heat API was not deployed in
    WSGI during Ocata cycle, so now, it's only for Aodh.

    Related-Bug: #1664418
    Change-Id: Ie00b717a6741e215e59d219710154f0d2ce6b39e
    (cherry picked from commit 2272bcabba8752cd1876f85b1f9b83b0c7592c94)

tags: added: in-stable-ocata

Reviewed: https://review.openstack.org/453221
Committed: https://git.openstack.org/cgit/openstack/puppet-tripleo/commit/?id=4f776e7709a696812697ba02def34175ff6238af
Submitter: Jenkins
Branch: stable/newton

commit 4f776e7709a696812697ba02def34175ff6238af
Author: Emilien Macchi <email address hidden>
Date: Wed Mar 29 17:42:32 2017 -0400

    Deploy WSGI apps at the same step (3)

    So we avoid useless apache restart and save time during the deployment.

    Note: the backport is not 100% clean as Heat API was not deployed in
    WSGI during Ocata cycle, so now, it's only for Aodh.

    Related-Bug: #1664418
    Depends-On: Ibc184a50cf16b7048e0f7249f8894d8661bb76fe
    Change-Id: Ie00b717a6741e215e59d219710154f0d2ce6b39e
    (cherry picked from commit 2272bcabba8752cd1876f85b1f9b83b0c7592c94)

tags: added: in-stable-newton

Related fix proposed to branch: master
Review: https://review.openstack.org/456293

Reviewed: https://review.openstack.org/456293
Committed: https://git.openstack.org/cgit/openstack/puppet-tripleo/commit/?id=9de4c92571fdbe342a20a68e4ee44feb55464007
Submitter: Jenkins
Branch: master

commit 9de4c92571fdbe342a20a68e4ee44feb55464007
Author: Alex Schultz <email address hidden>
Date: Wed Apr 12 10:34:07 2017 -0600

    Move gnocchi wsgi configuration to step 3

    We configure apache in step3 so we need to configure the gnocchi api in
    step 3 as well to prevent unnecessary service restarts during updates.

    Change-Id: I30010c9cf0b0c23fde5d00b67472979d519a15be
    Related-Bug: #1664418

Reviewed: https://review.openstack.org/456276
Committed: https://git.openstack.org/cgit/openstack/puppet-tripleo/commit/?id=890178bd6f6f465ffcb8cf4ad9b8019a1d6dc653
Submitter: Jenkins
Branch: master

commit 890178bd6f6f465ffcb8cf4ad9b8019a1d6dc653
Author: Alex Schultz <email address hidden>
Date: Wed Apr 12 10:03:22 2017 -0600

    Move ceilometer wsgi to step 3

    Apache is configured in step 3 so if we configure ceilometer in step 4,
    the configuration is removed on updates. We need to configure it in step
    3 with the other apache services to ensure we don't have issues on
    updates.

    Change-Id: Icc9d03cd8904c93cb6e17f662f141c6e4c0bf423
    Related-Bug: #1664418

Related fix proposed to branch: stable/ocata
Review: https://review.openstack.org/457259

Reviewed: https://review.openstack.org/457259
Committed: https://git.openstack.org/cgit/openstack/puppet-tripleo/commit/?id=9e81a1bcec901596136ee1f7353f2bd66008c98a
Submitter: Jenkins
Branch: stable/ocata

commit 9e81a1bcec901596136ee1f7353f2bd66008c98a
Author: Alex Schultz <email address hidden>
Date: Wed Apr 12 10:03:22 2017 -0600

    Move ceilometer wsgi to step 3

    Apache is configured in step 3 so if we configure ceilometer in step 4,
    the configuration is removed on updates. We need to configure it in step
    3 with the other apache services to ensure we don't have issues on
    updates.

    Change-Id: Icc9d03cd8904c93cb6e17f662f141c6e4c0bf423
    Related-Bug: #1664418
    (cherry picked from commit 890178bd6f6f465ffcb8cf4ad9b8019a1d6dc653)

Reviewed: https://review.openstack.org/457258
Committed: https://git.openstack.org/cgit/openstack/puppet-tripleo/commit/?id=243e3bdc20a5e5d21c828d0727c898ee7de9a05b
Submitter: Jenkins
Branch: stable/ocata

commit 243e3bdc20a5e5d21c828d0727c898ee7de9a05b
Author: Alex Schultz <email address hidden>
Date: Wed Apr 12 10:34:07 2017 -0600

    Move gnocchi wsgi configuration to step 3

    We configure apache in step3 so we need to configure the gnocchi api in
    step 3 as well to prevent unnecessary service restarts during updates.

    Change-Id: I30010c9cf0b0c23fde5d00b67472979d519a15be
    Related-Bug: #1664418
    (cherry picked from commit 9de4c92571fdbe342a20a68e4ee44feb55464007)
