periodic mitaka and liberty jobs fail with 503 service unavailable after deploy

Bug #1634823 reported by Gabriele Cerami on 2016-10-19
Affects: tripleo
Importance: High
Assigned to: Ben Nemec

Bug Description

Both the mitaka and liberty periodic jobs are failing today with this error:

2016-10-19 07:47:33.939580 | 2016-10-19 07:47:17 [overcloud]: CREATE_COMPLETE Stack CREATE completed successfully
2016-10-19 07:47:34.417456 | Stack overcloud CREATE_COMPLETE
2016-10-19 08:58:36.390192 | Service Unavailable (HTTP 503)

right after an apparently "successful" deployment.
The liberty job times out after waiting one hour to contact the endpoint; the mitaka job fails right away.

Michele Baldessari (michele) wrote :

I noticed that too here while trying to backport a fix in mitaka:
https://review.openstack.org/#/c/388568/

Michele Baldessari (michele) wrote :

One thing I noticed is the following on controller-0:
keystone/keystone.log:2016-10-19 10:49:03.523 16145 ERROR keystone.auth.plugins.core [req-a910cef1-4202-41dd-a895-d5d04b3076ae - - - - -] Could not find user: ceilometer
keystone/keystone.log:2016-10-19 10:49:03.523 16145 ERROR keystone.auth.plugins.core Traceback (most recent call last):
keystone/keystone.log:2016-10-19 10:49:03.523 16145 ERROR keystone.auth.plugins.core File "/usr/lib/python2.7/site-packages/keystone/auth/plugins/core.py", line 173, in _validate_and_normalize_auth_data
keystone/keystone.log:2016-10-19 10:49:03.523 16145 ERROR keystone.auth.plugins.core user_name, domain_ref['id'])
keystone/keystone.log:2016-10-19 10:49:03.523 16145 ERROR keystone.auth.plugins.core File "/usr/lib/python2.7/site-packages/keystone/common/manager.py", line 124, in wrapped
keystone/keystone.log:2016-10-19 10:49:03.523 16145 ERROR keystone.auth.plugins.core __ret_val = __f(*args, **kwargs)
keystone/keystone.log:2016-10-19 10:49:03.523 16145 ERROR keystone.auth.plugins.core File "/usr/lib/python2.7/site-packages/keystone/identity/core.py", line 433, in wrapper
keystone/keystone.log:2016-10-19 10:49:03.523 16145 ERROR keystone.auth.plugins.core return f(self, *args, **kwargs)
keystone/keystone.log:2016-10-19 10:49:03.523 16145 ERROR keystone.auth.plugins.core File "/usr/lib/python2.7/site-packages/keystone/identity/core.py", line 443, in wrapper
keystone/keystone.log:2016-10-19 10:49:03.523 16145 ERROR keystone.auth.plugins.core return f(self, *args, **kwargs)
keystone/keystone.log:2016-10-19 10:49:03.523 16145 ERROR keystone.auth.plugins.core File "/usr/lib/python2.7/site-packages/dogpile/cache/region.py", line 1053, in decorate
keystone/keystone.log:2016-10-19 10:49:03.523 16145 ERROR keystone.auth.plugins.core should_cache_fn)
keystone/keystone.log:2016-10-19 10:49:03.523 16145 ERROR keystone.auth.plugins.core File "/usr/lib/python2.7/site-packages/dogpile/cache/region.py", line 657, in get_or_create
keystone/keystone.log:2016-10-19 10:49:03.523 16145 ...


Gabriele Cerami (gcerami) wrote :

After the error, the resulting deployment is missing key components such as the keystone service endpoints and the users in its database.

We identified the problem in the post-config phase. Apparently the post-config step runs before some services have had a chance to settle, so a 503 is returned. Repeating the post-config after the error works and correctly creates the missing pieces.
I am trying to add a delay between stack deploy and post-config to see what happens.

Michele Baldessari (michele) wrote :

There are two separate issues we will need to address:
1) Today in Liberty and Mitaka, when we create the endpoints after the heat stack has completed the deployment, we actually have no guarantee whatsoever that keystone is really up, running, and ready to serve. We do not have this issue in Newton because the endpoint creation is done via puppet. What we probably want to do here for L/M is to add reasonable retry logic in tripleoclient when we get a 503 error from keystone.
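A minimal sketch of the kind of retry logic described in 1). The helper name is hypothetical, not the actual tripleoclient code; it relies only on the fact that keystoneauth HTTP exceptions carry an http_status attribute:

```python
import time

def retry_on_503(func, max_retries=20, delay=30, sleep=time.sleep):
    """Call func(), retrying while it raises an exception carrying an
    http_status of 503 (keystoneauth HTTP errors expose this attribute)."""
    for attempt in range(max_retries):
        try:
            return func()
        except Exception as exc:
            if getattr(exc, "http_status", None) != 503:
                raise  # not Service Unavailable: fail immediately
            if attempt == max_retries - 1:
                raise  # out of retries
            sleep(delay)
```

Any other exception propagates immediately, so only the transient "keystone not yet serving" case is retried.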

2) The reason we seem to get a 503, at least on Mitaka, is that keystone runs under httpd with two virtual hosts, and the main one listening on port 5000 starts 4-5 *minutes* after the admin one. For example:
http://logs.openstack.org/periodic/periodic-tripleo-ci-centos-7-ovb-ha-mitaka/8b298bc/logs/overcloud-controller-0/var/log/keystone/keystone.txt.gz#_2016-10-24_07_24_46_294 has the following:

2016-10-24 07:24:46.294 19600 WARNING oslo_config.cfg [req-dd11202b-309d-448c-b8dd-25dbc140dfe2 - - - - -] Option "admin_bind_host" from group "DEFAULT" is deprecated. Use option "admin_bind_host" from group "eventlet_server".
2016-10-24 07:24:46.295 19600 WARNING oslo_config.cfg [req-dd11202b-309d-448c-b8dd-25dbc140dfe2 - - - - -] Option "admin_bind_host" from group "eventlet_server" is deprecated for removal. Its value may be silently ignored in the future.
....
2016-10-24 07:30:19.383 19601 WARNING oslo_config.cfg [req-60a32b5e-86da-40f2-bb35-15fe759876d9 - - - - -] Option "admin_bind_host" from group "DEFAULT" is deprecated. Use option "admin_bind_host" from group "eventlet_server".
2016-10-24 07:30:19.383 19601 WARNING oslo_config.cfg [req-60a32b5e-86da-40f2-bb35-15fe759876d9 - - - - -] Option "admin_bind_host" from group "eventlet_server" is deprecated for removal. Its value may be silently ignored in the future.

The first log line is from the admin instance on port 35357 and the second is from the instance on port 5000; the gap between them is almost *five* minutes.

I still need to investigate what has changed with liberty recently, because we do not have keystone running under httpd there. We definitely do have the problem that when the heat stack is completed we have no guarantee whatsoever that keystone is ready to serve (see problem 1 above).
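Given problem 2), the client would have to wait until the virtual host on port 5000 actually accepts connections. A rough readiness poll, illustrative only and not part of any proposed patch, could look like:

```python
import socket
import time

def wait_for_port(host, port, timeout=600, interval=5):
    """Poll until a TCP connection to host:port succeeds, or raise
    TimeoutError after `timeout` seconds."""
    deadline = time.monotonic() + timeout
    while True:
        try:
            socket.create_connection((host, port), timeout=interval).close()
            return
        except OSError:
            if time.monotonic() >= deadline:
                raise TimeoutError("%s:%d not ready" % (host, port))
            time.sleep(interval)
```

Note that a TCP connect only proves the listener is up, not that keystone can answer requests, so a retry on 503 would still be needed on top of this.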

Fix proposed to branch: master
Review: https://review.openstack.org/390860

Changed in tripleo:
assignee: nobody → Giulio Fidente (gfidente)
status: New → In Progress
tags: added: newton-backport-potential
Gabriele Cerami (gcerami) wrote :

I tried this patch (before seeing the ones proposed here)

diff --git a/tripleoclient/v1/overcloud_deploy.py b/tripleoclient/v1/overcloud_deploy.py
index 11d236a..144ba87 100644
--- a/tripleoclient/v1/overcloud_deploy.py
+++ b/tripleoclient/v1/overcloud_deploy.py
@@ -984,8 +984,19 @@ class DeployOvercloud(command.Command):
             utils.create_overcloudrc(stack, parsed_args.no_proxy)
             utils.create_tempest_deployer_input()

-            if (stack_create or parsed_args.force_postconfig):
-                self._deploy_postconfig(stack, parsed_args)
+            max_retries = 20
+            retries = 0
+            delay = 30
+            while retries < max_retries:
+                try:
+                    if (stack_create or parsed_args.force_postconfig):
+                        self._deploy_postconfig(stack, parsed_args)
+                    break
+                except Exception:
+                    print(sys.exc_info()[0])
+                    print("postconfig attempt {0} failed. Waiting {1} second ".format(retries, delay))
+                    retries += 1
+                    time.sleep(delay)

             overcloud_endpoint = utils.get_overcloud_endpoint(stack)
             print("Overcloud Endpoint: {0}".format(overcloud_endpoint))

It essentially does the same thing as the proposed patch, but retries the entire post-config instead of just the keystone part.

In the CI devenv this does NOT fix anything; this is what happens:

2016-10-26 20:26:33 [overcloud-BlockStorageNodesPostDeployment-szuwqeh4o3xp]: CREATE_COMPLETE Stack CREATE completed successfully
Stack overcloud CREATE_COMPLETE
<class 'keystoneauth1.exceptions.http.ServiceUnavailable'>
postconfig attempt 0 failed. Waiting 30 second
<class 'keystoneauth1.exceptions.http.ServiceUnavailable'>
postconfig attempt 1 failed. Waiting 30 second
<class 'keystoneauth1.exceptions.http.ServiceUnavailable'>
postconfig attempt 2 failed. Waiting 30 second
<class 'keystoneauth1.exceptions.http.ServiceUnavailable'>
postconfig attempt 3 failed. Waiting 30 second
<class 'keystoneauth1.exceptions.http.ServiceUnavailable'>
postconfig attempt 4 failed. Waiting 30 second
<class 'keystoneauth1.exceptions.http.ServiceUnavailable'>
postconfig attempt 5 failed. Waiting 30 second
<class 'keystoneauth1.exceptions.http.ServiceUnavailable'>
postconfig attempt 6 failed. Waiting 30 second
<class 'keystoneauth1.exceptions.http.ServiceUnavailable'>
postconfig attempt 7 failed. Waiting 30 second
<class 'keystoneauth1.exceptions.http.ServiceUnavailable'>
postconfig attempt 8 failed. Waiting 30 second
<class 'keystoneauth1.exceptions.http.ServiceUnavailable'>
postconfig attempt 9 failed. Waiting 30 second
<class 'keystoneauth1.exceptions.http.ServiceUnavailable'>
postconfig attempt 10 failed. Waiting 30 second
<class 'keystoneauth1.exceptions.http.ServiceUnavailable'>
postconfig attempt 11 failed. Waiting 30 second
<class 'keystoneauth1.exceptions.http.ServiceUnavailable'>
postconfig attempt 12 failed. Waiting 30 second
<class 'keystoneauth1.exceptions.http.ServiceUnavailable'>
postconfig attempt 13 failed. Waiting 30 second
<class 'keystoneauth1.exceptions.http.ServiceUnavailable'>...


Gabriele Cerami (gcerami) wrote :

Root cause found. The keystone client is actually trying to use the internal CI proxy, as shown here:

http://logs.openstack.org/67/390767/1/check-tripleo/gate-tripleo-ci-centos-7-ovb-ha-mitaka/36b6d0e/console.html#_2016-10-26_09_50_35_642256

A squid is answering the keystone requests.

Gabriele Cerami (gcerami) wrote :

proposed workaround here: https://review.openstack.org/#/c/391190
and a more articulated fix here: https://review.openstack.org/#/c/391197

Changed in tripleo:
assignee: Giulio Fidente (gfidente) → Gabriele Cerami (gcerami)

Change abandoned by Giulio Fidente (<email address hidden>) on branch: master
Review: https://review.openstack.org/390860

Change abandoned by Giulio Fidente (<email address hidden>) on branch: stable/mitaka
Review: https://review.openstack.org/390862

Changed in tripleo:
assignee: Gabriele Cerami (gcerami) → Ben Nemec (bnemec)
Steven Hardy (shardy) on 2016-11-16
Changed in tripleo:
milestone: ocata-1 → ocata-2

Reviewed: https://review.openstack.org/391197
Committed: https://git.openstack.org/cgit/openstack/python-tripleoclient/commit/?id=0b2cf71a84362127abf92efa76b2fc1215f18097
Submitter: Jenkins
Branch: master

commit 0b2cf71a84362127abf92efa76b2fc1215f18097
Author: Gabriele Cerami <email address hidden>
Date: Thu Oct 27 15:51:08 2016 +0200

    Refactor creation of no_proxy list for postconfig

    Adds keystone admin ip to the list, create the list more closely
    to what utils already does.

    Change-Id: I6937bf0e46867b71972a322d33c0ac68e226b079
    Closes-Bug: 1634823

Changed in tripleo:
status: In Progress → Fix Released

This issue was fixed in the openstack/python-tripleoclient 5.6.0 release.
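The committed fix extends the no_proxy list used for postconfig so the keystone admin IP is included. A hypothetical sketch of the idea (the real change lives in review 391197):

```python
def build_no_proxy(existing, *endpoint_ips):
    """Merge user-supplied no_proxy entries with additional endpoint IPs
    (e.g. the keystone admin IP), dropping empties and duplicates while
    preserving order."""
    hosts = [h.strip() for h in existing.split(",")] + list(endpoint_ips)
    seen, result = set(), []
    for h in hosts:
        if h and h not in seen:
            seen.add(h)
            result.append(h)
    return ",".join(result)
```

With the admin endpoint in the list, the client talks to keystone directly instead of the CI squid, which is what resolved the 503s.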
