nova-compute service is not starting when TLS certs are supplied via ssl_* options

Bug #1915504 reported by Vladimir Grevtsev
8
This bug affects 1 person
Affects Status Importance Assigned to Milestone
OpenStack Nova Cloud Controller Charm
Fix Released
High
Unassigned

Bug Description

Although charm is showing "Unit is ready", the service itself never gets up:

ubuntu@node06:~$ service nova-compute status
● nova-compute.service - OpenStack Compute
   Loaded: loaded (/lib/systemd/system/nova-compute.service; enabled; vendor preset: enabled)
   Active: active (running) since Fri 2021-02-12 09:09:21 UTC; 1s ago
 Main PID: 24411 (nova-compute)
    Tasks: 1 (limit: 4915)
   CGroup: /system.slice/nova-compute.service
           └─24411 /usr/bin/python3 /usr/bin/nova-compute --config-file=/etc/nova/nova.conf --config-file=/etc/nova/nova-compute.conf --log-file=/var/log/nova/nova-compute.log

Feb 12 09:09:21 node06 systemd[1]: Started OpenStack Compute.
Feb 12 09:09:23 node06 nova-compute[24411]: Deprecated: Option "logdir" from group "DEFAULT" is deprecated. Use option "log-dir" from group "DEFAULT".
ubuntu@node06:~$

<again>

ubuntu@node06:~$ service nova-compute status
● nova-compute.service - OpenStack Compute
   Loaded: loaded (/lib/systemd/system/nova-compute.service; enabled; vendor preset: enabled)
   Active: active (running) since Fri 2021-02-12 09:09:32 UTC; 760ms ago
 Main PID: 24553 (nova-compute)
    Tasks: 1 (limit: 4915)
   CGroup: /system.slice/nova-compute.service
           └─24553 /usr/bin/python3 /usr/bin/nova-compute --config-file=/etc/nova/nova.conf --config-file=/etc/nova/nova-compute.conf --log-file=/var/log/nova/nova-compute.log

Feb 12 09:09:32 node06 systemd[1]: Started OpenStack Compute.
ubuntu@node06:~$

Sometimes, there is a "Services not running that should be: nova-compute", but this error message is transient.

The service is never getting online, due to:

2021-02-12 09:09:12.823 24283 WARNING keystoneauth.identity.generic.base [req-664b95e2-8490-49c6-9790-16dbdd57e2d4 - - - - -] Failed to discover available identity versions when contacting https://keystone-int.orangebox84.ru:35357. Attempting to parse version from URL.: keystoneauth1.exceptions.connection.SSLError: SSL exception connecting to https://keystone-int.orangebox84.ru:35357: HTTPSConnectionPool(host='keystone-int.orangebox84.ru', port=35357): Max retries exceeded with url: / (Caused by SSLError(SSLError(1, '[SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed (_ssl.c:852)'),))
2021-02-12 09:09:12.826 24283 CRITICAL nova [req-664b95e2-8490-49c6-9790-16dbdd57e2d4 - - - - -] Unhandled error: keystoneauth1.exceptions.discovery.DiscoveryFailure: Could not find versioned identity endpoints when attempting to authenticate. Please check that your auth_url is correct. SSL exception connecting to https://keystone-int.orangebox84.ru:35357: HTTPSConnectionPool(host='keystone-int.orangebox84.ru', port=35357): Max retries exceeded with url: / (Caused by SSLError(SSLError(1, '[SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed (_ssl.c:852)'),))
2021-02-12 09:09:12.826 24283 ERROR nova Traceback (most recent call last):
2021-02-12 09:09:12.826 24283 ERROR nova File "/usr/lib/python3/dist-packages/urllib3/connectionpool.py", line 601, in urlopen
2021-02-12 09:09:12.826 24283 ERROR nova chunked=chunked)
2021-02-12 09:09:12.826 24283 ERROR nova File "/usr/lib/python3/dist-packages/urllib3/connectionpool.py", line 346, in _make_request
2021-02-12 09:09:12.826 24283 ERROR nova self._validate_conn(conn)
2021-02-12 09:09:12.826 24283 ERROR nova File "/usr/lib/python3/dist-packages/urllib3/connectionpool.py", line 852, in _validate_conn
2021-02-12 09:09:12.826 24283 ERROR nova conn.connect()
2021-02-12 09:09:12.826 24283 ERROR nova File "/usr/lib/python3/dist-packages/urllib3/connection.py", line 340, in connect
2021-02-12 09:09:12.826 24283 ERROR nova ssl_context=context)
2021-02-12 09:09:12.826 24283 ERROR nova File "/usr/lib/python3/dist-packages/urllib3/util/ssl_.py", line 332, in ssl_wrap_socket
2021-02-12 09:09:12.826 24283 ERROR nova return context.wrap_socket(sock, server_hostname=server_hostname)
2021-02-12 09:09:12.826 24283 ERROR nova File "/usr/lib/python3/dist-packages/eventlet/green/ssl.py", line 449, in wrap_socket
2021-02-12 09:09:12.826 24283 ERROR nova return GreenSSLSocket(sock, *a, _context=self, **kw)
2021-02-12 09:09:12.826 24283 ERROR nova File "/usr/lib/python3/dist-packages/eventlet/green/ssl.py", line 139, in __init__
2021-02-12 09:09:12.826 24283 ERROR nova self.do_handshake()
2021-02-12 09:09:12.826 24283 ERROR nova File "/usr/lib/python3/dist-packages/eventlet/green/ssl.py", line 312, in do_handshake
2021-02-12 09:09:12.826 24283 ERROR nova super(GreenSSLSocket, self).do_handshake)
2021-02-12 09:09:12.826 24283 ERROR nova File "/usr/lib/python3/dist-packages/eventlet/green/ssl.py", line 161, in _call_trampolining
2021-02-12 09:09:12.826 24283 ERROR nova return func(*a, **kw)
2021-02-12 09:09:12.826 24283 ERROR nova File "/usr/lib/python3.6/ssl.py", line 1077, in do_handshake
2021-02-12 09:09:12.826 24283 ERROR nova self._sslobj.do_handshake()
2021-02-12 09:09:12.826 24283 ERROR nova File "/usr/lib/python3.6/ssl.py", line 689, in do_handshake
2021-02-12 09:09:12.826 24283 ERROR nova self._sslobj.do_handshake()
2021-02-12 09:09:12.826 24283 ERROR nova ssl.SSLError: [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed (_ssl.c:852)
2021-02-12 09:09:12.826 24283 ERROR nova
2021-02-12 09:09:12.826 24283 ERROR nova During handling of the above exception, another exception occurred:
2021-02-12 09:09:12.826 24283 ERROR nova
2021-02-12 09:09:12.826 24283 ERROR nova Traceback (most recent call last):
2021-02-12 09:09:12.826 24283 ERROR nova File "/usr/lib/python3/dist-packages/requests/adapters.py", line 440, in send
2021-02-12 09:09:12.826 24283 ERROR nova timeout=timeout
2021-02-12 09:09:12.826 24283 ERROR nova File "/usr/lib/python3/dist-packages/urllib3/connectionpool.py", line 639, in urlopen
2021-02-12 09:09:12.826 24283 ERROR nova _stacktrace=sys.exc_info()[2])
2021-02-12 09:09:12.826 24283 ERROR nova File "/usr/lib/python3/dist-packages/urllib3/util/retry.py", line 398, in increment
2021-02-12 09:09:12.826 24283 ERROR nova raise MaxRetryError(_pool, url, error or ResponseError(cause))
2021-02-12 09:09:12.826 24283 ERROR nova urllib3.exceptions.MaxRetryError: HTTPSConnectionPool(host='keystone-int.orangebox84.ru', port=35357): Max retries exceeded with url: / (Caused by SSLError(SSLError(1, '[SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed (_ssl.c:852)'),))
2021-02-12 09:09:12.826 24283 ERROR nova

bundle: https://pastebin.canonical.com/p/5d4gNcHnmg/
juju status: https://paste.ubuntu.com/p/5wVnTHH7v2/
nova-compute unit log: https://paste.ubuntu.com/p/FM8W3m3836/

This has been observed at the two separate environments, so I suspect it has nothing to do with the particular certificate chain configuration on the specific environment. Before the new charm release it has worked fine - so we suspect there is some regression after the 21.01 charm release.

Tags: field-crit
Revision history for this message
Vladimir Grevtsev (vlgrevtsev) wrote :

Subscribing field-critical as this issue has no reliable workaround and is blocking our ongoing customer delivery.

tags: added: field-critical
tags: added: field-crit
removed: field-critical
description: updated
Changed in charm-nova-compute:
assignee: nobody → Aurelien Lourot (aurelien-lourot)
Changed in charm-nova-compute:
status: New → In Progress
importance: Undecided → High
Revision history for this message
Aurelien Lourot (aurelien-lourot) wrote :

After SSHing to that system I confirm that the certificate is fine as `wget https://keystone-int.orangebox84.ru:35357` works well on nova-cloud-controller/0 but not on nova-compute/0.

/usr/local/share/ca-certificates/keystone_juju_ca_cert.crt exists on nova-cloud-controller/0 but not on nova-compute/0. There is no vault in this setup so nova-compute wants to receive this certificate as `ca_cert` value on its `cloud-credentials` relation to nova-cloud-controller. [0] This data hasn't been passed onto the relation. This can be checked with:

$ juju run -u nova-compute-kvm/0 -- relation-ids cloud-compute
cloud-compute:60
$ juju run -u nova-compute-kvm/0 -- relation-get -r 60 - nova-cloud-controller/0

nova-cloud-controller's code for setting `ca_cert` on the relation to nova-compute [1] is called only when the relation is joined, and never ever again. This may be a problem, although this isn't what is happening here because removing and re-adding the relation doesn't help. Still digging.

[0] https://github.com/openstack/charm-nova-compute/blob/master/hooks/nova_compute_utils.py#L747
[1] https://github.com/openstack/charm-nova-cloud-controller/blob/master/hooks/nova_cc_hooks.py#L680

Revision history for this message
Aurelien Lourot (aurelien-lourot) wrote :

s/cloud-credentials/cloud-compute/

Revision history for this message
Aurelien Lourot (aurelien-lourot) wrote :

Here is the issue on nova-cloud-controller:
2021-02-12 11:07:46 INFO juju-log cloud-compute:69: CA Certificate not found at expected location ("keystone_juju_ca_cert"): "[Errno 2] No such file or directory: 'keystone_juju_ca_cert'"

This comes from get_ca_cert_b64() [0] where it tries to open the file 'keystone_juju_ca_cert' instead of `/usr/local/share/ca-certificates/keystone_juju_ca_cert.crt`. So the code "forgot" to concatenate both the path and the extensions before calling open().

Working on a fix.

[0] https://github.com/openstack/charm-nova-cloud-controller/blob/master/hooks/nova_cc_utils.py#L1072

Changed in charm-nova-cloud-controller:
status: New → In Progress
importance: Undecided → High
assignee: nobody → Aurelien Lourot (aurelien-lourot)
no longer affects: charm-nova-compute
Revision history for this message
Aurelien Lourot (aurelien-lourot) wrote :

FWIW, I grepped all the OpenStack charms and this is the only spot I could find with such a mistake.

Revision history for this message
Aurelien Lourot (aurelien-lourot) wrote :
Revision history for this message
David Ames (thedac) wrote :

This was a bit of a research project. I don't want to keep compounding fixes on fixes. So I would like to put the brakes on stable back-ports until we are all in agreement.

This is continued fallout from the keystone_juju_ca_cert.crt to vault_juju_ca_cert.crt rename which I believe began with [0] for CH and [1] for NCC.

Frode then filed LP Bug#1911902 [2] and a fix in [3]. Now Aurelien is working on Vladamir's LP Bug#1915504 [4] with a refactor out of NCC [5] into charm-helpers [6].

I got involved when requested to review the stable back port [7] and [8]. Which is where I am calling stop the line. Where [5] and [6] might be fine for master, [7] and [8] represents a lot of change for a back-port including a "refactor."

Questions:

NCC will not be the last charm that needs to determine its CA from config or the certificates relation, is [6] a generic enough and robust enough solution? For example, I don't quite understand the ca_cert_absolute_path taking in the basename with out extension.

Rather than back porting a refactor, can we use a simpler fix for stable? A potential 2 line change for stable NCC is [9].

[0] https://github.com/juju/charm-helpers/commit/a3a8ce071fa3d49d7c7d2cadad2a51ddc9ed4efe
[1] https://github.com/openstack/charm-nova-cloud-controller/commit/f4e3c8f21cc5f3241d61547c9dbb9f6ee4969d60
[2] https://bugs.launchpad.net/charm-nova-compute/+bug/1911902
[3] https://review.opendev.org/c/openstack/charm-nova-cloud-controller/+/770938
[4] https://bugs.launchpad.net/charm-nova-cloud-controller/+bug/1915504
[5] https://review.opendev.org/c/openstack/charm-nova-cloud-controller/+/775602
[6] https://github.com/juju/charm-helpers/pull/570
[7] https://review.opendev.org/c/openstack/charm-nova-cloud-controller/+/775900
[8] https://github.com/juju/charm-helpers/pull/571
[9] https://pastebin.ubuntu.com/p/k3fRNNkPmG/

Revision history for this message
David Ames (thedac) wrote :

One more question:

Have we tested [5] and [6] in an environment with ssl_* config options and not vault?

Revision history for this message
Aurelien Lourot (aurelien-lourot) wrote :

@vlgrevtsev do you confirm that the latest https://jaas.ai/u/openstack-charmers-next/nova-cloud-controller now fixes the issue in your setup with ssl_* config options and without vault?

@thedac yes, your simple fix sounds good for the stable branch.

Changed in charm-nova-cloud-controller:
status: In Progress → Fix Committed
Revision history for this message
Vladimir Grevtsev (vlgrevtsev) wrote :

Hi Aurelien,

The charm upgrade didn't help.

I did this: juju upgrade-charm nova-cloud-controller --switch cs:~openstack-charmers-next/nova-cloud-controller-514

But:

a) nova-compute services didn't get restarted

ubuntu@node09:~$ service nova-compute status
● nova-compute.service - OpenStack Compute
   Loaded: loaded (/lib/systemd/system/nova-compute.service; enabled; vendor preset: enabled)
   Active: failed (Result: exit-code) since Tue 2021-02-16 18:14:04 UTC; 14h ago
  Process: 9074 ExecStart=/etc/init.d/nova-compute systemd-start (code=exited, status=1/FAILURE)
 Main PID: 9074 (code=exited, status=1/FAILURE)

Feb 16 18:14:04 node09 systemd[1]: nova-compute.service: Service hold-off time over, scheduling restart.
Feb 16 18:14:04 node09 systemd[1]: nova-compute.service: Scheduled restart job, restart counter is at 6464.
Feb 16 18:14:04 node09 systemd[1]: Stopped OpenStack Compute.
Feb 16 18:14:04 node09 systemd[1]: nova-compute.service: Start request repeated too quickly.
Feb 16 18:14:04 node09 systemd[1]: nova-compute.service: Failed with result 'exit-code'.
Feb 16 18:14:04 node09 systemd[1]: Failed to start OpenStack Compute.

b) after I've restarted it manually - it failed again with the same error.

I'm going to break the relation and add it back to see if that helps.

Revision history for this message
Aurelien Lourot (aurelien-lourot) wrote :

Confirmed together with vlgrevtsev: upgrading the charm to cs:~openstack-charmers-next/nova-cloud-controller-514, then deleting and re-creating the cloud-compute relation fixed the original issue.

Revision history for this message
Aurelien Lourot (aurelien-lourot) wrote :

@thedac, actually your one-liner proposal [0] is good but is also missing the '.crt' extension. There are two problems to be solved: both the folder-path AND the extension are missing. Putting up a review as we speak.

[0] https://pastebin.ubuntu.com/p/k3fRNNkPmG/

Changed in charm-nova-cloud-controller:
assignee: Aurelien Lourot (aurelien-lourot) → nobody
Revision history for this message
Aurelien Lourot (aurelien-lourot) wrote :
Changed in charm-nova-cloud-controller:
assignee: nobody → Aurelien Lourot (aurelien-lourot)
Revision history for this message
Aurelien Lourot (aurelien-lourot) wrote :

Note: I have updated that same review by accident. That's OK but I actually intended to create a new review and abandon the former one.

Revision history for this message
Aurelien Lourot (aurelien-lourot) wrote :

Note: stable/21.01 backport has been temporarily published here for validation: https://jaas.ai/u/aurelien-lourot/nova-cloud-controller

Revision history for this message
Aurelien Lourot (aurelien-lourot) wrote :

stable/21.01 backport [0] validated by vlgrevtsev, we're good!

[0] https://jaas.ai/u/aurelien-lourot/nova-cloud-controller/0

Changed in charm-nova-cloud-controller:
status: Fix Committed → Fix Released
milestone: none → 21.01
assignee: Aurelien Lourot (aurelien-lourot) → nobody
To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.