Creating the Keystone service and endpoint Failed: Internal Server Error (HTTP 500)

Bug #1846789 reported by Larry Lile on 2019-10-04
20
This bug affects 3 people
Affects Status Importance Assigned to Milestone
kolla-ansible
Status tracked in Ussuri
Stein
Undecided
Unassigned
Train
Undecided
Radosław Piliszek
Ussuri
High
Michal Nasiadka

Bug Description

OS: Red Hat Enterprise Linux Server release 7.6 (Maipo)
Kernel: 3.10.0-957.27.2.el7.x86_64 #1
Docker Version: Client: Docker Engine - Community 19.03.2
Kolla version: 8.0.2.dev20

When deploying stein with the control nodes on a separate network from network, monitoring and compute keystone fails to create the service and endpoint with the following error.

TASK [keystone : Creating the Keystone service and endpoint] **************************************************************************************
failed: [odt-rsd-ost-ctr-a1.example.com] (item={u'interface': u'admin', u'url': u'http://10.239.195.200:35357'}) => {"changed": true, "item": {"interface": "admin", "url": "http://10.239.195.200:35357"}, "msg": "'Traceback (most recent call last):\\n File \"/tmp/ansible_Ybr9ij/ansible_module_kolla_keystone_service.py\", line 58, in main\\n for _service in cloud.keystone_client.services.list():\\n File \"/opt/ansible/lib/python2.7/site-packages/shade/_legacy_clients.py\", line 95, in keystone_client\\n self._identity_client\\n File \"/opt/ansible/lib/python2.7/site-packages/shade/openstackcloud.py\", line 616, in _identity_client\\n \\'identity\\', min_version=2, max_version=\\'3.latest\\')\\n File \"/opt/ansible/lib/python2.7/site-packages/shade/openstackcloud.py\", line 506, in _get_versioned_client\\n if adapter.get_endpoint():\\n File \"/opt/ansible/lib/python2.7/site-packages/keystoneauth1/adapter.py\", line 282, in get_endpoint\\n return self.session.get_endpoint(auth or self.auth, **kwargs)\\n File \"/opt/ansible/lib/python2.7/site-packages/keystoneauth1/session.py\", line 1200, in get_endpoint\\n return auth.get_endpoint(self, **kwargs)\\n File \"/opt/ansible/lib/python2.7/site-packages/keystoneauth1/identity/base.py\", line 380, in get_endpoint\\n allow_version_hack=allow_version_hack, **kwargs)\\n File \"/opt/ansible/lib/python2.7/site-packages/keystoneauth1/identity/base.py\", line 271, in get_endpoint_data\\n service_catalog = self.get_access(session).service_catalog\\n File \"/opt/ansible/lib/python2.7/site-packages/keystoneauth1/identity/base.py\", line 134, in get_access\\n self.auth_ref = self.get_auth_ref(session)\\n File \"/opt/ansible/lib/python2.7/site-packages/keystoneauth1/identity/generic/base.py\", line 208, in get_auth_ref\\n return self._plugin.get_auth_ref(session, **kwargs)\\n File \"/opt/ansible/lib/python2.7/site-packages/keystoneauth1/identity/v3/base.py\", line 184, in get_auth_ref\\n authenticated=False, log=False, **rkwargs)\\n File \"/opt/ansible/lib/python2.7/site-packages/keystoneauth1/session.py\", line 1106, in post\\n return self.request(url, \\'POST\\', **kwargs)\\n File \"/opt/ansible/lib/python2.7/site-packages/keystoneauth1/session.py\", line 943, in request\\n raise exceptions.from_response(resp, method, url)\\nInternalServerError: Internal Server Error (HTTP 500)\\n'"}

This is the inventory configuration (truncated) used. [control] is in a different network than everything else, the kolla_internal_vip_address resides in the same network as [compute] and [network].

[control]
odt-rsd-ost-ctr-a1.example.com ansible_user=root ansible_password=password ansible_become=true
odt-rsd-ost-ctr-a2.example.com ansible_user=root ansible_password=password ansible_become=true
odt-rsd-ost-ctr-a3.example.com ansible_user=root ansible_password=password ansible_become=true

[network]
hv-10-017.example.com ansible_user=root ansible_password=password ansible_become=true
hv-10-022.example.com ansible_user=root ansible_password=password ansible_become=true
hv-10-027.example.com ansible_user=root ansible_password=password ansible_become=true

[compute]
hv-10-017.example.com ansible_user=root ansible_password=password ansible_become=true
hv-10-022.example.com ansible_user=root ansible_password=password ansible_become=true
hv-10-027.example.com ansible_user=root ansible_password=password ansible_become=true

[monitoring]
hv-10-017.example.com ansible_user=root ansible_password=password ansible_become=true
hv-10-022.example.com ansible_user=root ansible_password=password ansible_become=true
hv-10-027.example.com ansible_user=root ansible_password=password ansible_become=true

[storage]
hv-10-017.example.com ansible_user=root ansible_password=password ansible_become=true

[deployment]
localhost ansible_connection=local

[haproxy:children]
network

Keystone is reporting the following in both the admin and public error log:

2019-10-03 17:15:29.755102 mod_wsgi (pid=25): Target WSGI script '/usr/bin/keystone-wsgi-admin' cannot be loaded as Python module.
2019-10-03 17:15:29.755148 mod_wsgi (pid=25): Exception occurred processing WSGI script '/usr/bin/keystone-wsgi-admin'.
2019-10-03 17:15:29.755199 Traceback (most recent call last):
2019-10-03 17:15:29.755247 File "/usr/bin/keystone-wsgi-admin", line 52, in <module>
2019-10-03 17:15:29.755328 application = initialize_admin_application()
2019-10-03 17:15:29.755363 File "/usr/lib/python2.7/site-packages/keystone/server/wsgi.py", line 24, in initialize_public_application
2019-10-03 17:15:29.755420 name='public', config_files=flask_core._get_config_files())
2019-10-03 17:15:29.755452 File "/usr/lib/python2.7/site-packages/keystone/server/flask/core.py", line 151, in initialize_application
2019-10-03 17:15:29.755508 keystone.server.configure(config_files=config_files)
2019-10-03 17:15:29.755540 File "/usr/lib/python2.7/site-packages/keystone/server/__init__.py", line 28, in configure
2019-10-03 17:15:29.755595 keystone.conf.configure()
2019-10-03 17:15:29.755628 File "/usr/lib/python2.7/site-packages/keystone/conf/__init__.py", line 139, in configure
2019-10-03 17:15:29.755689 deprecated_since=versionutils.deprecated.STEIN))
2019-10-03 17:15:29.755722 File "/usr/lib/python2.7/site-packages/oslo_config/cfg.py", line 2045, in __inner
2019-10-03 17:15:29.755778 result = f(self, *args, **kwargs)
2019-10-03 17:15:29.755809 File "/usr/lib/python2.7/site-packages/oslo_config/cfg.py", line 2323, in register_cli_opt
2019-10-03 17:15:29.755913 raise ArgsAlreadyParsedError("cannot register CLI option")
2019-10-03 17:15:29.755983 ArgsAlreadyParsedError: arguments already parsed: cannot register CLI option

Radosław Piliszek (yoctozepto) wrote :

Please attach (from the controller nodes):
* full keystone logs (those under /var/log/kolla/keystone)
* docker logs for the keystone container
* /var/log/kolla/ansible.log
(you can archive that together for a cleaner upload)

And also:
* globals.yml

ArgsAlreadyParsedError is a weird error to get from that.

Changed in kolla:
status: New → Incomplete
Larry Lile (llile) wrote :

The requested logs have been provided, with slight redactions to domain names.

md5sum cabcf4456e6077ed20035e288abcb226 1846789.tar.gz

Larry Lile (llile) wrote :

Just to add some data points here for reference:

1. Train using stable/train exhibits this problem

2. Stein using stable/stein exhibits this problem

3. Rocky using stable/rocky works as expected
   [control] on network 1
   [monitoring] on network 1
   [compute,network] on network 2
   [haproxy:children]: control (on network 1)

The scenario described in 3 is our target, where control and haproxy are co-located on hosts, on network 1, monitoring is on network 1 while compute, network, storage are on network 2.

The reasoning behind this split is control, haproxy and monitoring are in a protected data center, compute, network, storage are in a utility data center without power protection.

Changed in kolla:
status: Incomplete → New
Radosław Piliszek (yoctozepto) wrote :

Were there no ansible.log in logs dir?

Train is on master. stable/train is not there yet.

Are you using kolla-ansible from branch stable/stein when deploying those stein images?

Also, your def of haproxy:children differs between report (network) and comment (control) - that changes behavior.

Changed in kolla:
status: New → Incomplete
Larry Lile (llile) wrote :

Hi,

Yes, to clarify, on train we are using master and on rocky we are using stable/rocky. Train and Stein exhibit the same behavior, Rocky works successfully.

With regard to the logs and current bug report:

Yes, we are using kolla-ansible stable/stein deploying the official stein images.

Yes, currently we are deploying haproxy:children to (network) not (control).

Once we get past the keystone issue we intend to change haproxy:children over to (control) but that seems to cause a different failure. However, I assumed it would be easier to figure out the keystone issue if I simplified the configuration.

Sorry for any confusion.

Larry Lile (llile) wrote :

The ansible.log files are attached as requested.

Changed in kolla:
status: Incomplete → New
affects: kolla → kolla-ansible
Changed in kolla-ansible:
importance: Undecided → High
Radosław Piliszek (yoctozepto) wrote :

I'm puzzled. Could you also include the generated configs on the deployed nodes? The /etc/kolla stuff.

Changed in kolla-ansible:
status: New → Incomplete
Larry Lile (llile) wrote :

The cluster has been redeployed since my last update, however the same issue persists.

I have gathered all deployment and configuration details along with logs for all subsystems.

Please let me know what else I can do to assist with debugging this issue.

Larry Lile (llile) wrote :

With regard to the data uploaded in 1846789_2.tar.gz, this is with haproxy running on the control nodes.

Larry Lile (llile) wrote :

The logs attached in 1846789_3.tar.gz show the same failure, but in this iteration I have haproxy running on network. So it appears that the location of haproxy does not affect the issue.

Larry Lile (llile) wrote :

I don't know what has changed but I just had a successful deployment with kolla-ansible 8.0.2.dev24 using the stein docker images. Mon Oct 14 17:03:22 EDT 2019

Mark Goddard (mgoddard) wrote :

Is it working reliably for you now?

Larry Lile (llile) wrote :

I tried moving haproxy to control and it failed, I am retrying with haproxy on network to see if I can reproduce the original result.

Larry Lile (llile) wrote :
Download full text (9.3 KiB)

I was able to re-deploy with haproxy on the network nodes. I did hit an issue with the heat versioning API failure during bootstrap, this seems to be an intermittent problem. Performing an additional deploy got past the issue.

TASK [heat : Running Heat bootstrap container] ***********************************************************************************************************************************
fatal: [odt-rsd-ost-ctr-a1.example.com -> odt-rsd-ost-ctr-a1.example.com]: FAILED! => {"changed": true, "msg": "Container exited with non-zero return code 1", "rc": 1, "stderr": "+ sudo -E kolla_set_configs\nINFO:__main__:Loading config file at /var/lib/kolla/config_files/config.json\nINFO:__main__:Validating config file\nINFO:__main__:Kolla config strategy set to: COPY_ALWAYS\nINFO:__main__:Copying service configuration files\nINFO:__main__:Deleting /etc/heat/heat.conf\nINFO:__main__:Copying /var/lib/kolla/config_files/heat.conf to /etc/heat/heat.conf\nINFO:__main__:Setting permission for /etc/heat/heat.conf\nINFO:__main__:Writing out command to execute\n++ cat /run_command\n+ CMD=heat-api\n+ ARGS=\n+ [[ ! -n '' ]]\n+ . kolla_extend_start\n++ [[ ! -d /var/log/kolla/heat ]]\n++ mkdir -p /var/log/kolla/heat\n+++ stat -c %a /var/log/kolla/heat\n++ [[ 2755 != \\7\\5\\5 ]]\n++ chmod 755 /var/log/kolla/heat\n+++ whoami\n++ [[ heat == \\r\\o\\o\\t ]]\n++ . /usr/local/bin/kolla_heat_extend_start\n+++ [[ -n 0 ]]\n+++ heat-manage db_sync\n++++ openstack domain list\n++++ awk '{print $4}'\n++++ grep heat\nFailed to contact the endpoint at http://10.239.195.200:5000 for discovery. Fallback to using that endpoint as the base url.\nNot Found (HTTP 404) (Request-ID: req-1e9bea57-85a9-4a84-ac4c-b1d17dde6c2e)\n+++ CURRENT_HEAT_DOMAIN_NAME=\n+++ [[ heat_user_domain != '' ]]\n+++ openstack domain create heat_user_domain\n+++ openstack user create --domain heat_user_domain heat_domain_admin --password Op3nSt@ck\nget() takes exactly 1 argument (2 given)\n", "stderr_lines": ["+ sudo -E kolla_set_configs", "INFO:__main__:Loading config file at /var/lib/kolla/config_files/config.json", "INFO:__main__:Validating config file", "INFO:__main__:Kolla config strategy set to: COPY_ALWAYS", "INFO:__main__:Copying service configuration files", "INFO:__main__:Deleting /etc/heat/heat.conf", "INFO:__main__:Copying /var/lib/kolla/config_files/heat.conf to /etc/heat/heat.conf", "INFO:__main__:Setting permission for /etc/heat/heat.conf", "INFO:__main__:Writing out command to execute", "++ cat /run_command", "+ CMD=heat-api", "+ ARGS=", "+ [[ ! -n '' ]]", "+ . kolla_extend_start", "++ [[ ! -d /var/log/kolla/heat ]]", "++ mkdir -p /var/log/kolla/heat", "+++ stat -c %a /var/log/kolla/heat", "++ [[ 2755 != \\7\\5\\5 ]]", "++ chmod 755 /var/log/kolla/heat", "+++ whoami", "++ [[ heat == \\r\\o\\o\\t ]]", "++ . /usr/local/bin/kolla_heat_extend_start", "+++ [[ -n 0 ]]", "+++ heat-manage db_sync", "++++ openstack domain list", "++++ awk '{print $4}'", "++++ grep heat", "Failed to contact the endpoint at http://10.239.195.200:5000 for discovery. Fallback to using that endpoint as the base url.", "Not Found (HTTP 404) (Request-ID: req-1e9bea57-85a9-4a84-ac4c-b1d17dde6c2e)", "+++ CURRENT_H...

Read more...

Larry Lile (llile) wrote :

This is currently working on Stein.

However, I have tried this on master (Train) with kolla-ansible 8.1.0.dev589 and it is still broken but less broken. The kolla-ansible deploy completes successfully, but horizon or keystone seem to be broken. Attempting to login through the horizon UI as admin results in "Unable to retrieve authorized projects."

Does that give us any insight into what fixed Stein?

Errors on master from keystone seem the same

2019-10-14 11:13:24.616538 mod_wsgi (pid=24): Target WSGI script '/usr/bin/keystone-wsgi-public' cannot be loaded as Python module.
2019-10-14 11:13:24.616648 mod_wsgi (pid=24): Exception occurred processing WSGI script '/usr/bin/keystone-wsgi-public'.
2019-10-14 11:13:24.616710 Traceback (most recent call last):
2019-10-14 11:13:24.616766 File "/usr/bin/keystone-wsgi-public", line 52, in <module>
2019-10-14 11:13:24.617004 application = initialize_public_application()
2019-10-14 11:13:24.617040 File "/usr/lib/python2.7/site-packages/keystone/server/wsgi.py", line 24, in initialize_public_application
2019-10-14 11:13:24.617102 name='public', config_files=flask_core._get_config_files())
2019-10-14 11:13:24.617133 File "/usr/lib/python2.7/site-packages/keystone/server/flask/core.py", line 151, in initialize_application
2019-10-14 11:13:24.617239 keystone.server.configure(config_files=config_files)
2019-10-14 11:13:24.617274 File "/usr/lib/python2.7/site-packages/keystone/server/__init__.py", line 28, in configure
2019-10-14 11:13:24.617368 keystone.conf.configure()
2019-10-14 11:13:24.617406 File "/usr/lib/python2.7/site-packages/keystone/conf/__init__.py", line 139, in configure
2019-10-14 11:13:24.617470 deprecated_since=versionutils.deprecated.STEIN))
2019-10-14 11:13:24.617500 File "/usr/lib/python2.7/site-packages/oslo_config/cfg.py", line 2045, in __inner
2019-10-14 11:13:24.617557 result = f(self, *args, **kwargs)
2019-10-14 11:13:24.617587 File "/usr/lib/python2.7/site-packages/oslo_config/cfg.py", line 2323, in register_cli_opt
2019-10-14 11:13:24.617667 raise ArgsAlreadyParsedError("cannot register CLI option")
2019-10-14 11:13:24.617735 ArgsAlreadyParsedError: arguments already parsed: cannot register CLI option

Larry Lile (llile) wrote :

My string of successful deployments seems to have ended abruptly. This is failing again.

Larry Lile (llile) wrote :

I'm still having consistent failures, with the occasional successful deployment. However, even when the deploy appears to be successful keystone is still producing the error. Any ideas?

2019-10-22 10:51:13.797434 mod_wsgi (pid=18): Target WSGI script '/usr/bin/keystone-wsgi-public' cannot be loaded as Python module.
2019-10-22 10:51:13.797531 mod_wsgi (pid=18): Exception occurred processing WSGI script '/usr/bin/keystone-wsgi-public'.
2019-10-22 10:51:13.797596 Traceback (most recent call last):
2019-10-22 10:51:13.797652 File "/usr/bin/keystone-wsgi-public", line 52, in <module>
2019-10-22 10:51:13.797741 application = initialize_public_application()
2019-10-22 10:51:13.797776 File "/usr/lib/python2.7/site-packages/keystone/server/wsgi.py", line 24, in initialize_public_application
2019-10-22 10:51:13.797840 name='public', config_files=flask_core._get_config_files())
2019-10-22 10:51:13.797873 File "/usr/lib/python2.7/site-packages/keystone/server/flask/core.py", line 151, in initialize_application
2019-10-22 10:51:13.797933 keystone.server.configure(config_files=config_files)
2019-10-22 10:51:13.797966 File "/usr/lib/python2.7/site-packages/keystone/server/__init__.py", line 28, in configure
2019-10-22 10:51:13.798024 keystone.conf.configure()
2019-10-22 10:51:13.798057 File "/usr/lib/python2.7/site-packages/keystone/conf/__init__.py", line 139, in configure
2019-10-22 10:51:13.798174 deprecated_since=versionutils.deprecated.STEIN))
2019-10-22 10:51:13.798273 File "/usr/lib/python2.7/site-packages/oslo_config/cfg.py", line 2045, in __inner
2019-10-22 10:51:13.798344 result = f(self, *args, **kwargs)
2019-10-22 10:51:13.798377 File "/usr/lib/python2.7/site-packages/oslo_config/cfg.py", line 2323, in register_cli_opt
2019-10-22 10:51:13.798434 raise ArgsAlreadyParsedError("cannot register CLI option")
2019-10-22 10:51:13.798490 ArgsAlreadyParsedError: arguments already parsed: cannot register CLI option

Mark Goddard (mgoddard) wrote :

That suggests that this error might not be the problematic one.

Klemen Pogacnik (kemo-p) wrote :

I'm working on rocky release and have the same problem. It happens on each deploy on three-node cluster. I haven't notice this problem on single-node deployment.
After deploying keystone, its endpoint (port 5000) is quite unstable. It works five, six times and then fails with "500 Internal Server Error". Then it works again some time and again fails. When Internal server error is returned, the keystone error logs are the same as Larry had:

mod_wsgi (pid=20): Target WSGI script '/var/lib/kolla/venv/bin/keystone-wsgi-admin' cannot be loaded as Python module.
mod_wsgi (pid=20): Exception occurred processing WSGI script '/var/lib/kolla/venv/bin/keystone-wsgi-admin'.

I checked on keystone container on each node and there is no rule, sometimes only one has this problem, sometimes two.
Cloud deploymend then fails when some other module needs keystone endpoint.

I found out, that restarting keystone containers and redeploying cloud resolve this situation.

One more note, I did't have those problems with the rocky version from Nov 7th and I and my collegues deployed single and multicloud many times.

Michal Nasiadka (mnasiadka) wrote :

Probably found the cause - it seems we have some race-ish condition, and sometimes public keystone wsgi comes up without fernet keys in /etc/keystone/fernet-keys/ - visible in this logfile:
https://88763c12f13d1aeca43c-63681721353a54dab1064b012b97b3cb.ssl.cf1.rackcdn.com/698452/2/check/kolla-ansible-ubuntu-source-ceph/09da3d0/primary/logs/kolla/keystone/keystone-apache-public-error.txt.gz

And then persists on not coming up properly.

Working on patches in Kolla (fix fernet bootstrap so it properly sets changed flag) and Kolla-Ansible (restart keystone containers after fernet bootstrap or a similar solution).

Changed in kolla-ansible:
status: Incomplete → Triaged
assignee: nobody → Michal Nasiadka (mnasiadka)

Fix proposed to branch: master
Review: https://review.opendev.org/698710

Changed in kolla-ansible:
status: Triaged → In Progress
Larry Lile (llile) wrote :

Hi Michal,

I have been tracking the updates to change 698710 on OpenDev. We are successfully using one of the intermediate patches you supplied earlier with great success.

I wanted to thank you, and the reviewers, for all your hard work on this issue so far!

I guessing your final patch will not only address this presentation of the problem but many others that have yet to be reported or identified.

Thank you!

Reviewed: https://review.opendev.org/698710
Committed: https://git.openstack.org/cgit/openstack/kolla-ansible/commit/?id=0799782ce83d1057f262b44c979a15f9a1b05c72
Submitter: Zuul
Branch: master

commit 0799782ce83d1057f262b44c979a15f9a1b05c72
Author: Michal Nasiadka <email address hidden>
Date: Thu Dec 12 13:19:48 2019 +0100

    Fix keystone fernet bootstrap

    There are cases when a multinode deployment ends up in unusable
    keystone public wsgi on some nodes.

    The root cause is that keystone public wsgi doesn't find fernet
    keys on startup - and then persists on sending 500 errors to any
    requests - due to a race condition between
    fernet_setup/fernet-push.sh and keystone startup.

    Depends-On: https://review.opendev.org/703742/
    Change-Id: I63709c2e3f6a893db82a05640da78f492bf8440f
    Closes-Bug: #1846789

Changed in kolla-ansible:
status: In Progress → Fix Released
Doug Szumski (dszumski) wrote :

I believe I have just hit this on Train also.

Very likely, this is Train-affecting.

To post a comment you must log in.
This report contains Public information  Edit
Everyone can see this information.

Duplicates of this bug

Other bug subscribers