Newton to Ocata upgrade failure because of ceilometer-upgrade

Bug #1724328 reported by David Manchado on 2017-10-17
This bug affects 1 person
Affects: tripleo
Importance: Critical
Assigned to: Pradeep Kilambi

Bug Description

Description
===========
While upgrading from Newton to Ocata, the upgrade fails with an error from running ceilometer-upgrade:

Error: /Stage[main]/Tripleo::Profile::Base::Ceilometer::Collector/Exec[ceilometer-db-upgrade]/returns: change from notrun to 0 failed: ceilometer-upgrade --skip-metering-database returned 1 instead of one of [0]

Before upgrading to Ocata, Newton was updated to the latest Newton bits available on rdo-trunk-ocata-tested (https://trunk.rdoproject.org/centos7-ocata/current-passed-ci/ )

Seems similar to https://bugs.launchpad.net/tripleo/+bug/1703444

Steps to reproduce
==================
* Deploy Newton OpenStack
* Update overcloud to latest Newton bits available
* Upgrade to Ocata

The upgrade script is the same as the deploy one, with the following added:
-e templates/environments/major-upgrade-composable-steps.yaml \
-e overcloud-repos.yaml \
-e skip-validation-upgrade.yaml \

Expected result
===============
UPDATE_COMPLETE

Actual result
=============
UPDATE_FAILED

Re-running ceilometer-upgrade on one of the controllers fails.
The logs show an HTTP 503 error while trying to reach keystone-admin.

All the resources in the controller pcs cluster are reported as unmanaged.
Running openstack endpoint list fails with:
Failed to contact the endpoint at http://A.B.C.D:35357 for discovery. Fallback to using that endpoint as the base url.
Unable to establish connection to http://A.B.C.D:35357/endpoints: ('Connection aborted.', BadStatusLine("''",))

Taking the cluster out of maintenance mode does not bring keystone-admin back, so ceilometer-upgrade cannot be retested.
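
The two symptoms above (an HTTP 503 in the logs versus an aborted connection from the client) can be told apart with a small probe. This is a minimal sketch using only the Python 3 standard library; the function name probe and the address in the usage comment are illustrative and not part of the report.

```python
import http.client
import urllib.error
import urllib.request


def probe(url, timeout=5.0):
    """GET url and return (status, detail); status is None if no HTTP answer arrived."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status, "ok"
    except urllib.error.HTTPError as exc:
        # The server (or a proxy in front of it) answered with an error status, e.g. 503.
        return exc.code, str(exc.reason)
    except (urllib.error.URLError, http.client.HTTPException, OSError) as exc:
        # Refused, reset, timed out, or an unparsable status line (BadStatusLine).
        return None, repr(exc)


# Usage (placeholder address from the report; substitute the keystone-admin VIP):
#   status, detail = probe("http://A.B.C.D:35357/")
```

A numeric status means something answered on the port; None means the connection itself failed, which matches the BadStatusLine case shown above.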

Environment
===========
Three HA controllers (

Tripleo-related RPM:
puppet-tripleo-6.5.4-0.20171015123804.d9f056e.el7.centos.noarch
openstack-tripleo-common-6.1.2-1.el7.noarch
openstack-tripleo-puppet-elements-6.2.3-1.el7.noarch
python-tripleoclient-6.2.1-1.el7.noarch
openstack-tripleo-image-elements-6.1.0-1.el7.noarch
openstack-tripleo-0.0.8-0.3.4de13b3git.el7.noarch
openstack-tripleo-heat-templates-6.2.4-0.20171011085158.cf73cd2.el7.centos.noarch
openstack-tripleo-ui-3.2.2-1.el7.noarch
openstack-tripleo-validations-5.6.1-1.el7.noarch

Logs & Configs
==============
ceilometer-upgrade.log http://paste.openstack.org/show/623875/
openstack stack failure list --long http://paste.openstack.org/show/623876/

Changed in tripleo:
importance: Undecided → High
milestone: none → queens-2
status: New → Triaged
tags: added: upgrade
David Manchado (dmanchad) wrote :

In case it helps, here are some responses I am getting when using openstackclient:

$ openstack endpoint list
Failed to contact the endpoint at http://A.B.C.D:35357 for discovery. Fallback to using that endpoint as the base url.
Unable to establish connection to http://A.B.C.D:35357/endpoints: ('Connection aborted.', BadStatusLine("''",))
----
$ openstack server list
The server is currently unavailable. Please try again at a later time.<br /><br />

 (HTTP 503) (Request-ID: req-6b0f1bc8-457c-4ed5-8978-b50e47df6164)
----
$ openstack token issue
+------------+----------------------------------+
| Field | Value |
+------------+----------------------------------+
| expires | 2017-10-18T14:02:19+0000 |
| id | 447a16ef8bf6412aa47b3b062bd9c2cc |
| project_id | 86dcd37905ec440b873422092b923eaf |
| user_id | 72f88c9866d14fc3aa2ce1eb03ffb4c2 |
+------------+----------------------------------+

David Manchado (dmanchad) wrote :

The output of the deploy has been uploaded too [1]

[1] http://paste.openstack.org/show/623940/

Pradeep Kilambi (pkilambi) wrote :

This is due to httpd being bounced during step 4. I think this should be resolved by a backport that was recently merged:

https://review.openstack.org/#/c/489437/

David Manchado (dmanchad) wrote :

I've just confirmed that the installed puppet-tripleo RPM [1] already had that patch included.
That is the version installed on both the controller and the undercloud.

[1] puppet-tripleo-6.5.4-0.20171015123804.d9f056e.el7.centos.noarch

Changed in tripleo:
importance: High → Critical

Increased the priority because it's one of the blockers for moving OVB jobs to RDO cloud.

Pradeep Kilambi (pkilambi) wrote :

Hmm, if that patch is there, then I doubt this is a ceilometer issue. Based on comment #1, it seems keystone is down? If keystone is down, ceilometer-upgrade will not be able to authenticate to talk to gnocchi. So it seems to me the root cause here is keystone.

wes hayutin (weshayutin) wrote :

Making this critical, alert, promotion blocker as it's blocking the upgrade of RDO-Cloud, which blocks the migration of jobs off RH1, which blocks everyone's patches because jobs in RH1 are timing out.

:))

tags: added: alert promotion-blocker
David Manchado (dmanchad) wrote :

All,

In the end, the keystone issue should not be related to the upgrade.
Some time ago we had to switch keystone admin to SSL because it is internet facing. Since we had to do it right away, we changed the templates [1] and filed an LP bug [2] and a BZ [3].

We did a minor update (Newton) after that change, which overwrote the template change, so the keystone admin mismatch should actually have surfaced at the Ocata upgrade.

We are still testing the right setup on the staging environment, along with potential issues we may hit when we try the upgrade on production.

[1] https://code.engineering.redhat.com/gerrit/#/c/107413/
[2] https://bugs.launchpad.net/tripleo/+bug/1639996
[3] https://bugzilla.redhat.com/show_bug.cgi?id=1416225

wes hayutin (weshayutin) on 2017-11-09
Changed in tripleo:
assignee: nobody → mathieu bultel (mat-bultel)
David Manchado (dmanchad) wrote :

We tested again after solving the keystone-admin issue and we are still hitting the same failure.

Note that keystone was up and running before the upgrade and after the failure.

Logs are still reporting the same issues as a month ago.

RPMs (related to tripleo, gnocchi and ceilometer) used during today's upgrade:
openstack-tripleo-ui-3.2.2-1.el7.noarch
openstack-gnocchi-metricd-3.1.11-1.el7.noarch
openstack-ceilometer-notification-8.1.2-0.20171102233300.600bd6a.el7.centos.noarch
openstack-ceilometer-api-8.1.2-0.20171102233300.600bd6a.el7.centos.noarch
openstack-tripleo-validations-5.6.1-1.el7.noarch
openstack-gnocchi-common-3.1.11-1.el7.noarch
openstack-tripleo-common-6.1.3-0.20171105015427.7b93bc1.el7.centos.noarch
puppet-gnocchi-10.3.2-0.20171031003416.48b6bca.el7.centos.noarch
openstack-gnocchi-indexer-sqlalchemy-3.1.11-1.el7.noarch
openstack-tripleo-puppet-elements-6.2.3-1.el7.noarch
python-ceilometer-8.1.2-0.20171102233300.600bd6a.el7.centos.noarch
openstack-ceilometer-central-8.1.2-0.20171102233300.600bd6a.el7.centos.noarch
openstack-gnocchi-api-3.1.11-1.el7.noarch
openstack-tripleo-heat-templates-6.2.5-0.20171105124759.fdcb5c6.el7.centos.noarch
openstack-tripleo-0.0.8-0.3.4de13b3git.el7.noarch
python2-ceilometerclient-2.8.1-1.el7.noarch
openstack-ceilometer-polling-8.1.2-0.20171102233300.600bd6a.el7.centos.noarch
puppet-tripleo-6.5.5-0.20171106204741.56b8111.el7.centos.noarch
puppet-ceilometer-10.3.2-0.20171102222201.4f6eb57.el7.centos.noarch
python-gnocchi-3.1.11-1.el7.noarch
python-tripleoclient-6.2.1-1.el7.noarch
python-ceilometermiddleware-1.0.3-1.el7.noarch
openstack-tripleo-image-elements-6.1.1-1.el7.noarch
openstack-gnocchi-statsd-3.1.11-1.el7.noarch
python2-gnocchiclient-3.1.0-1.el7.noarch
openstack-ceilometer-common-8.1.2-0.20171102233300.600bd6a.el7.centos.noarch
openstack-ceilometer-collector-8.1.2-0.20171102233300.600bd6a.el7.centos.noarch
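
To see which packages moved between the first report and this retest, the two lists can be compared by name and version. The sketch below is illustrative only: the helper names are made up, only three entries from each list are included, and versions are compared for inequality rather than with RPM's own ordering rules.

```python
def split_nvra(package):
    """Split 'name-version-release.arch' into its four parts."""
    name, version, release_arch = package.rsplit("-", 2)
    release, arch = release_arch.rsplit(".", 1)
    return name, version, release, arch


def version_changes(old_packages, new_packages):
    """Map package name to (old_version, new_version) where the version differs."""
    old = {}
    for package in old_packages:
        name, version, _release, _arch = split_nvra(package)
        old[name] = version
    changes = {}
    for package in new_packages:
        name, version, _release, _arch = split_nvra(package)
        if name in old and old[name] != version:
            changes[name] = (old[name], version)
    return changes


# Subset of the package list from the original bug description.
FIRST_REPORT = [
    "puppet-tripleo-6.5.4-0.20171015123804.d9f056e.el7.centos.noarch",
    "openstack-tripleo-heat-templates-6.2.4-0.20171011085158.cf73cd2.el7.centos.noarch",
    "python-tripleoclient-6.2.1-1.el7.noarch",
]

# Subset of the package list from this retest.
RETEST = [
    "puppet-tripleo-6.5.5-0.20171106204741.56b8111.el7.centos.noarch",
    "openstack-tripleo-heat-templates-6.2.5-0.20171105124759.fdcb5c6.el7.centos.noarch",
    "python-tripleoclient-6.2.1-1.el7.noarch",
]
```

With this subset, version_changes(FIRST_REPORT, RETEST) reports puppet-tripleo and openstack-tripleo-heat-templates as changed and leaves python-tripleoclient out.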

David Manchado (dmanchad) wrote :

We have been able to deploy and upgrade when the initial deploy had telemetry disabled.
This confirms that the issue is related to telemetry/gnocchi.

Pradeep Kilambi (pkilambi) wrote :

Yeah, sounds like we might have an ordering issue during upgrade, but it's hard to say where without logs. Can you get me /var/log/gnocchi/*, /var/log/ceilometer/*, /var/log/httpd/* and the upgrade logs from the time of failure?

Changed in tripleo:
assignee: mathieu bultel (mat-bultel) → Pradeep Kilambi (pkilambi)
status: Triaged → In Progress

Reviewed: https://review.openstack.org/521621
Committed: https://git.openstack.org/cgit/openstack/tripleo-heat-templates/commit/?id=60925faefc58d76adf3914f96c636ca2a5b8c783
Submitter: Zuul
Branch: master

commit 60925faefc58d76adf3914f96c636ca2a5b8c783
Author: Pradeep Kilambi <email address hidden>
Date: Mon Nov 20 13:10:25 2017 -0500

    Add upgrade task to run gnocchi upgrade

    Closes-bug: #1724328

    Change-Id: Id7fed3746733c0ea0804532beda627c69e4ce078

Changed in tripleo:
status: In Progress → Fix Released

Reviewed: https://review.openstack.org/521886
Committed: https://git.openstack.org/cgit/openstack/tripleo-heat-templates/commit/?id=771189e91d50fd28d2be44d9b003ff061482b90d
Submitter: Zuul
Branch: stable/ocata

commit 771189e91d50fd28d2be44d9b003ff061482b90d
Author: Pradeep Kilambi <email address hidden>
Date: Mon Nov 20 13:10:25 2017 -0500

    Add upgrade task to run gnocchi upgrade

    Closes-bug: #1724328

    Change-Id: Id7fed3746733c0ea0804532beda627c69e4ce078
    (cherry picked from commit 60925faefc58d76adf3914f96c636ca2a5b8c783)

tags: added: in-stable-ocata

Reviewed: https://review.openstack.org/521890
Committed: https://git.openstack.org/cgit/openstack/tripleo-heat-templates/commit/?id=aab7bdd6fbab0158cf2b57a63b4422cd3156beed
Submitter: Zuul
Branch: stable/pike

commit aab7bdd6fbab0158cf2b57a63b4422cd3156beed
Author: Pradeep Kilambi <email address hidden>
Date: Mon Nov 20 13:10:25 2017 -0500

    Add upgrade task to run gnocchi upgrade

    Closes-bug: #1724328

    Change-Id: Id7fed3746733c0ea0804532beda627c69e4ce078
    (cherry picked from commit 60925faefc58d76adf3914f96c636ca2a5b8c783)

tags: added: in-stable-pike

This issue was fixed in the openstack/tripleo-heat-templates 8.0.0.0b2 development milestone.

This issue was fixed in the openstack/tripleo-heat-templates 6.2.7 release.

This issue was fixed in the openstack/tripleo-heat-templates 7.0.6 release.

Reviewed: https://review.openstack.org/527709
Committed: https://git.openstack.org/cgit/openstack/puppet-tripleo/commit/?id=1c49fbe08d1f764975ce8ef952b055ad25effd65
Submitter: Zuul
Branch: master

commit 1c49fbe08d1f764975ce8ef952b055ad25effd65
Author: Mehdi Abaakouk <email address hidden>
Date: Wed Dec 13 16:07:16 2017 +0100

    gnocchi: ensure upgrade run after swift setup

    The original fix created a dependency on a Class, so
    it doesn't work and fails silently.

    This changes it to the Anchor.

    Change-Id: I2ed6e328a9a4915844f699784dd87dc99078fb23
    Closes-bug: #1724328

Related fix proposed to branch: master
Review: https://review.openstack.org/527940

Change abandoned by Mehdi Abaakouk (sileht) (<email address hidden>) on branch: master
Review: https://review.openstack.org/527934
Reason: replaced by https://review.openstack.org/#/c/527940/

Reviewed: https://review.openstack.org/527759
Committed: https://git.openstack.org/cgit/openstack/puppet-tripleo/commit/?id=e09db0d8c306959a78f3ec51b5afa92b760a814d
Submitter: Zuul
Branch: stable/pike

commit e09db0d8c306959a78f3ec51b5afa92b760a814d
Author: Mehdi Abaakouk <email address hidden>
Date: Wed Dec 13 16:07:16 2017 +0100

    gnocchi: ensure upgrade run after swift setup

    The original fix created a dependency on a Class, so
    it doesn't work and fails silently.

    This changes it to the Anchor.

    Change-Id: I2ed6e328a9a4915844f699784dd87dc99078fb23
    Closes-bug: #1724328
    (cherry picked from commit 1c49fbe08d1f764975ce8ef952b055ad25effd65)

Reviewed: https://review.openstack.org/527760
Committed: https://git.openstack.org/cgit/openstack/puppet-tripleo/commit/?id=5e1b5cafa6bb7a22f75375266688c86a73b157f2
Submitter: Zuul
Branch: stable/ocata

commit 5e1b5cafa6bb7a22f75375266688c86a73b157f2
Author: Mehdi Abaakouk <email address hidden>
Date: Wed Dec 13 16:07:16 2017 +0100

    gnocchi: ensure upgrade run after swift setup

    The original fix created a dependency on a Class, so
    it doesn't work and fails silently.

    This changes it to the Anchor.

    Change-Id: I2ed6e328a9a4915844f699784dd87dc99078fb23
    Closes-bug: #1724328
    (cherry picked from commit 1c49fbe08d1f764975ce8ef952b055ad25effd65)

Reviewed: https://review.openstack.org/527940
Committed: https://git.openstack.org/cgit/openstack/puppet-tripleo/commit/?id=5b1a139fa0bfda2c5b38754b3d31ee28023cbef5
Submitter: Zuul
Branch: master

commit 5b1a139fa0bfda2c5b38754b3d31ee28023cbef5
Author: Mehdi Abaakouk <email address hidden>
Date: Thu Dec 14 12:36:26 2017 +0100

    gnocchi/ceilometer upgrade workflow fix

    The current workflow for gnocchi/ceilometer upgrade doesn't
    work well with swift backend.

    The notification agent pushes data into Gnocchi on step4, but
    ceilometer-upgrade runs only on step5, so Gnocchi has not been populated
    with the latest resource schemas.

    Gnocchi-api is started in step3, but the gnocchi::storage configuration
    and the database upgrade have not been done yet.

    When configuration is done on step4, httpd will be restarted.

    This change will fix this issue by:

    * Doing only the Gnocchi database upgrade on step3 because swift is
      ready only on step4.
    * Configuring gnocchi::storage on step3 to avoid gnocchi-api restart on
      step4.
    * Adding dependencies between ceilometer-upgrade and gnocchi-api in case
      of non-multinode deployment.

    This ensures:

    * gnocchi-api will be correctly configured at the end of
      step3 (configuration+database-sync).
    * No new measures will be pushed to Gnocchi before ceilometer-upgrade has
      upgraded the Gnocchi resource schemas.
    * Gnocchi-api has its database updated before ceilometer-upgrade needs it.
    * We continue to upgrade storage/incoming data of Gnocchi on step4 after
      swift is up.

    Closes-bug: #1724328
    Change-Id: I3f9a784e507e03454b335ba8319601fba208ba0a
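
The ordering problem described in this commit message can be modelled as a check of step numbers against "must not run later than" constraints. This is only an illustrative model, not the Puppet code from the fix: the action labels, the CONSTRAINTS list and the violations helper are assumptions, while the step numbers are the pre-fix ones stated in the message.

```python
# Step at which each action ran before the fix, taken from the commit message.
BEFORE_FIX = {
    "gnocchi-api started": 3,
    "gnocchi::storage configured": 4,
    "notification agent pushes measures": 4,
    "ceilometer-upgrade": 5,
}

# (first, second): `first` must not run in a later step than `second`.
CONSTRAINTS = [
    ("gnocchi::storage configured", "gnocchi-api started"),
    ("ceilometer-upgrade", "notification agent pushes measures"),
]


def violations(schedule, constraints):
    """Return the constraints whose first action runs in a later step than the second."""
    return [
        (first, second)
        for first, second in constraints
        if schedule[first] > schedule[second]
    ]
```

In this model both constraints are violated by BEFORE_FIX. Setting "gnocchi::storage configured" to step 3, as the commit does, clears the first one; ordering inside a single step is left to Puppet dependencies and is not captured here.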

Reviewed: https://review.openstack.org/528077
Committed: https://git.openstack.org/cgit/openstack/puppet-tripleo/commit/?id=062313628c334a5ea4a812b4a6176d3e1c02d8b2
Submitter: Zuul
Branch: stable/ocata

commit 062313628c334a5ea4a812b4a6176d3e1c02d8b2
Author: Mehdi Abaakouk <email address hidden>
Date: Thu Dec 14 12:36:26 2017 +0100

    gnocchi/ceilometer upgrade workflow fix

    The current workflow for gnocchi/ceilometer upgrade doesn't
    work well with swift backend.

    The notification agent pushes data into Gnocchi on step4, but
    ceilometer-upgrade runs only on step5, so Gnocchi has not been populated
    with the latest resource schemas.

    Gnocchi-api is started in step3, but the gnocchi::storage configuration
    and the database upgrade have not been done yet.

    When configuration is done on step4, httpd will be restarted.

    This change will fix this issue by:

    * Doing only the Gnocchi database upgrade on step3 because swift is
      ready only on step4.
    * Configuring gnocchi::storage on step3 to avoid gnocchi-api restart on
      step4.
    * Moving ceilometer-notification on step4 to ensure ceilometer-upgrade
      has been run.

    This ensures:

    * gnocchi-api will be correctly configured at the end of
      step3 (configuration+database-sync).
    * No new measures will be pushed to Gnocchi before ceilometer-upgrade has
      upgraded the Gnocchi resource schemas.
    * Gnocchi-api has its database updated before ceilometer-upgrade needs it.
    * We continue to upgrade storage/incoming data of Gnocchi on step4 after
      swift is up.

    Closes-bug: #1724328
    Change-Id: I3f9a784e507e03454b335ba8319601fba208ba0a
    (cherry picked from commit 4e6939c1a874a06f336321a9d44d9991872f74cf)

Change abandoned by Emilien Macchi (<email address hidden>) on branch: stable/pike
Review: https://review.openstack.org/528078
Reason: The gate is currently timing out; we need https://review.openstack.org/#/c/531352/ to improve the situation. I'll restore the patch once the gate is stable again. Please do not recheck or restore this patch, I'll take care of it. Thanks for your patience.

This issue was fixed in the openstack/puppet-tripleo 6.5.7 release.

This issue was fixed in the openstack/puppet-tripleo 7.4.7 release.

Reviewed: https://review.openstack.org/528078
Committed: https://git.openstack.org/cgit/openstack/puppet-tripleo/commit/?id=50429f2cfbb0696ac134b15254a4277398836f49
Submitter: Zuul
Branch: stable/pike

commit 50429f2cfbb0696ac134b15254a4277398836f49
Author: Mehdi Abaakouk <email address hidden>
Date: Thu Dec 14 12:36:26 2017 +0100

    gnocchi/ceilometer upgrade workflow fix

    The current workflow for gnocchi/ceilometer upgrade doesn't
    work well with swift backend.

    The notification agent pushes data into Gnocchi on step4, but
    ceilometer-upgrade runs only on step5, so Gnocchi has not been populated
    with the latest resource schemas.

    Gnocchi-api is started in step3, but the gnocchi::storage configuration
    and the database upgrade have not been done yet.

    When configuration is done on step4, httpd will be restarted.

    This change will fix this issue by:

    * Doing only the Gnocchi database upgrade on step3 because swift is
      ready only on step4.
    * Configuring gnocchi::storage on step3 to avoid gnocchi-api restart on
      step4.
    * Moving ceilometer-notification on step4 to ensure ceilometer-upgrade
      has been run.

    This ensures:

    * gnocchi-api will be correctly configured at the end of
      step3 (configuration+database-sync).
    * No new measures will be pushed to Gnocchi before ceilometer-upgrade has
      upgraded the Gnocchi resource schemas.
    * Gnocchi-api has its database updated before ceilometer-upgrade needs it.
    * We continue to upgrade storage/incoming data of Gnocchi on step4 after
      swift is up.

    Closes-bug: #1724328
    Change-Id: I3f9a784e507e03454b335ba8319601fba208ba0a
    (cherry picked from commit 4e6939c1a874a06f336321a9d44d9991872f74cf)

This issue was fixed in the openstack/puppet-tripleo 7.4.8 release.

This issue was fixed in the openstack/puppet-tripleo 8.2.0 release.
