ocata deploy randomly fails with ceilometer-upgrade failing against keystone

Bug #1703444 reported by Pradeep Kilambi
Affects: tripleo
Status: Fix Released
Importance: High
Assigned to: Emilien Macchi

Bug Description

Description of problem:

Deployments randomly fail in overcloud.AllNodesDeploySteps.ControllerDeployment_Step5.0 with this error (visible in "openstack stack failures list"):

Error: ceilometer-upgrade --skip-metering-database returned 1 instead of one of [0]

If you re-run the deployment command, it resumes from where it failed and finishes successfully. It appears that Keystone is not running when ceilometer needs it: ceilometer-upgrade tries to authenticate with keystone to create resource types in gnocchi, but keystone throws a 503:

10.35.191.20 - - [02/Jul/2017:14:48:06 +0000] "GET
/v1/resource_type/instance_disk HTTP/1.1" 503 170 "-"
"ceilometer-upgrade keystoneauth1/2.18.0 python-requests/2.11.1
CPython/2.7.5"

2017-07-02 14:48:11.800 116449 ERROR keystonemiddleware.auth_token [-]
Bad response code while validating token: 503
2017-07-02 14:48:11.800 116449 WARNING keystonemiddleware.auth_token
[-] Identity response: <html><body><h1>503 Service Unavailable</h1>
No server is available to handle this request.

and ceilometer-upgrade fails with:

2017-07-02 14:48:11.803 123807 CRITICAL ceilometer [-]
ClientException: {"message": "The server is currently unavailable.
Please try again at a later time.<br /><br />\n\n\n", "code": "503
Service Unavailable", "title": "Service Unavailable"} (HTTP 503)

How reproducible:
randomly

Steps to Reproduce:
1. Deploy the default plan. I deployed a topology of 3 controllers + 1 compute + 3 ceph with this command:
   openstack overcloud deploy --templates --ntp-server clock.redhat.com -e /usr/share/openstack-tripleo-heat-templates/environments/storage-environment.yaml -e counters.yaml
2. If you hit this error, re-run the command to resume from where it failed; it passes the second time.

Expected results:
Deployment should pass the first time; you shouldn't hit errors caused by keystone not being up and running in time.

Changed in tripleo:
importance: Undecided → High
status: New → Triaged
Changed in tripleo:
milestone: none → pike-3
tags: added: ocata-backport-potential
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to puppet-tripleo (master)

Fix proposed to branch: master
Review: https://review.openstack.org/482707

Changed in tripleo:
assignee: nobody → Alex Schultz (alex-schultz)
status: Triaged → In Progress
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to puppet-tripleo (stable/ocata)

Fix proposed to branch: stable/ocata
Review: https://review.openstack.org/482712

Revision history for this message
Gonéri Le Bouder (goneri) wrote :

The problem is still here. What I did:

On the undercloud:
curl 'https://review.openstack.org/gitweb?p=openstack/puppet-tripleo.git;a=patch;h=75933b180e34be9eaeee4983866068a8516ea769'|patch /usr/share/openstack-puppet/modules/tripleo/manifests/profile/base/ceilometer/collector.pp
upload-puppet-modules -d /usr/share/openstack-puppet

Then I injected collector.pp into the overcloud-full image and re-uploaded the image in glance.

After the deployment, I double-checked: the fix is present on the controller node, but it obviously has no impact.

Revision history for this message
Alex Schultz (alex-schultz) wrote :

So this seems to be what is happening:

controller0 runs the ceilometer-upgrade process, which talks to gnocchi on any of the controllers (controller0, controller1, controller2). These gnocchi processes have trouble talking to keystone at some point during this, most likely because httpd gets restarted on one of the other controllers (controller1 or controller2) without being removed from haproxy, so gnocchi gets a 503. One option is to enable redispatch in haproxy so a failed request is retried against a different server. Alternatively, we could add retries to ceilometer-upgrade, but I'm not sure of the impact of a partial run.
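For reference, the redispatch option mentioned above would amount to something like this in the haproxy backend configuration (a sketch only; the backend name, server names, and addresses here are illustrative, not taken from an actual tripleo-generated config):

```
# Illustrative haproxy backend; server names and addresses are hypothetical.
backend gnocchi
    balance roundrobin
    # Re-queue a request to another server if the chosen one fails
    option redispatch
    # Number of connection attempts before giving up
    retries 3
    server controller0 172.16.0.10:8041 check
    server controller1 172.16.0.11:8041 check
    server controller2 172.16.0.12:8041 check
```

With `option redispatch`, a request that hits a controller whose httpd is mid-restart would be retried on another backend server instead of returning the 503 to gnocchi's client.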

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to puppet-tripleo (stable/ocata)

Reviewed: https://review.openstack.org/482712
Committed: https://git.openstack.org/cgit/openstack/puppet-tripleo/commit/?id=ebebbd7df9a037aaf39bf137e103061ff0b10cbd
Submitter: Jenkins
Branch: stable/ocata

commit ebebbd7df9a037aaf39bf137e103061ff0b10cbd
Author: Alex Schultz <email address hidden>
Date: Tue Jul 11 15:14:00 2017 -0600

    Retry ceilometer-upgrade

    When the ceilometer-upgrade command is run in step5, it talks to gnocchi
    and keystone on all the controllers. Since these other nodes might have
    httpd restarted mid-upgrade we should retry if we get a failure.

    Change-Id: I874cf9c34b41d055a258704dabe9150eab0f7968
    Closes-Bug: #1703444
    (cherry picked from commit 6aa27b457c366271295048522f214d99d6e3a3c2)
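The retry approach the commit describes can be expressed in puppet roughly as follows. This is a hedged sketch, not the actual diff: the resource title and the tries/try_sleep values are illustrative, relying only on the standard Exec resource parameters.

```
# Hypothetical sketch of retrying ceilometer-upgrade in a puppet manifest.
# 'tries' and 'try_sleep' are standard Exec parameters: puppet re-runs the
# command up to 'tries' times, sleeping 'try_sleep' seconds between attempts,
# so a transient 503 during an httpd restart no longer fails the step.
exec { 'ceilometer-upgrade':
  command   => 'ceilometer-upgrade --skip-metering-database',
  path      => ['/usr/bin', '/usr/sbin'],
  tries     => 10,   # illustrative value
  try_sleep => 30,   # illustrative value, in seconds
}
```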

tags: added: in-stable-ocata
Changed in tripleo:
assignee: Alex Schultz (alex-schultz) → Emilien Macchi (emilienm)
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to puppet-tripleo (master)

Reviewed: https://review.openstack.org/482707
Committed: https://git.openstack.org/cgit/openstack/puppet-tripleo/commit/?id=661757a16f485567403470b38b403cccaf60bdb2
Submitter: Jenkins
Branch: master

commit 661757a16f485567403470b38b403cccaf60bdb2
Author: Alex Schultz <email address hidden>
Date: Tue Jul 11 15:14:00 2017 -0600

    Retry ceilometer-upgrade

    When the ceilometer-upgrade command is run in step5, it talks to gnocchi
    and keystone on all the controllers. Since these other nodes might have
    httpd restarted mid-upgrade we should retry if we get a failure.

    Change-Id: I874cf9c34b41d055a258704dabe9150eab0f7968
    Closes-Bug: #1703444

Changed in tripleo:
status: In Progress → Fix Released
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/puppet-tripleo 7.2.0

This issue was fixed in the openstack/puppet-tripleo 7.2.0 release.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/puppet-tripleo 6.5.1

This issue was fixed in the openstack/puppet-tripleo 6.5.1 release.
