octavia not receiving OVN updated certificates after vault re-issues them

Bug #1952279 reported by Andre Ruiz
This bug affects 1 person
Affects                  Status       Importance  Assigned to   Milestone
OpenStack Octavia Charm  In Progress  High        Felipe Reyes
charm-layer-ovn          In Progress  High        Felipe Reyes
charm-ovn-chassis        Triaged      High        Unassigned

Bug Description

I did an OpenStack Xena deployment where vault uses an internal self-signed CA and issues certificates to all charms through the certificates relation interface. It worked well: ovn and octavia got their certificates from vault (for the APIs).

After this, I swapped the CA in vault for an externally signed intermediate CA. Vault re-issued certificates for all charms (I confirmed the APIs now use the new ones).

But octavia is now broken, and the problem seems to be that it is still trying to use the old certs to talk to OVN, which were not updated over the relation after the new certs were issued.

Just to avoid confusion -- If I understand correctly octavia uses 3 sets of certificates:

(a) - API certificates (this comes from vault "certificate" endpoint relation)
(b) - amphora CA + certs (these come from a manually generated set of certs passed to the charm via the lb-mgmt-* options)
(c) - Certs used to talk to OVN which are downloaded via relation

When I swapped the CA on vault, it re-issued (a) for both OVN and Octavia APIs. (b) is not relevant to this bug. (c) is the API certs from OVN which should have been updated in octavia via relation. I still see the old certs (ovn_ca_cert.pem ovn_certificate.pem ovn_private_key.pem).

If this is the case, is there a workaround (manually copying the files -- and which files)?

Andre Ruiz (andre-ruiz) wrote :

Subscribing field-critical.

Felipe Reyes (freyes) wrote :

I'm testing a workaround that requires copying the certificates manually from ovn-central into the octavia units. I will follow up with the instructions.

Changed in charm-octavia:
assignee: nobody → Felipe Reyes (freyes)
Andre Ruiz (andre-ruiz) wrote :

More details:

When accessing the octavia API, I'm getting a 500 Internal Error. Tracing that, it seems that haproxy (port 9876) is relaying to apache (port 9866), which is offloading SSL and proxying back to apache (port 9856), which tries to run the WSGI application, and that one times out.

I'm not sure how to debug that. Running that WSGI on the console, I get just a few messages:

2021-11-25 16:55:23.724 3430891 INFO octavia.common.config [-] Logging enabled!
2021-11-25 16:55:23.724 3430891 INFO octavia.common.config [-] /usr/bin/octavia-wsgi version 9.0.0

(after a LONG wait)

2021-11-25 16:57:41.240 3430891 ERROR octavia.api.drivers.driver_factory [-] Unable to load provider driver ovn due to: Unable to open the driver agent socket: /var/run/octavia/status.sock: octavia_lib.api.drivers.exceptions.DriverAgentNotFound: Unable to open the driver agent socket: /var/run/octavia/status.sock
2021-11-25 16:57:41.241 3430891 CRITICAL octavia [-] Unhandled error: octavia.common.exceptions.ProviderNotFound: Provider 'ovn' was not found.

(python trace)

Whereas on a healthy system I get more output (this is from a different deployment, just for comparison):

ubuntu@juju-8d30fd-5-lxd-2:~$ sudo octavia-wsgi
2021-11-25 15:21:42.338 243186 INFO octavia.common.config [-] Logging enabled!
2021-11-25 15:21:42.338 243186 INFO octavia.common.config [-] /usr/bin/octavia-wsgi version 6.2.1
********************************************************************************
STARTING test server octavia.api.app.setup_app
Available at http://juju-8d30fd-5-lxd-2.maas:8000/
DANGER! For testing only, do not use in production
********************************************************************************

(I omitted a few deprecation warnings in between).

I suspect octavia is blocking on something, the most obvious thing being talking to ovn maybe?
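The error above names the driver agent socket /var/run/octavia/status.sock, so one quick check is whether that socket exists at all before digging into the WSGI process. The following is only a sketch: wait_for_socket is a hypothetical helper, not part of octavia or the charm, and the timeout value is arbitrary.

```shell
# Hypothetical helper (not part of octavia): wait up to TIMEOUT seconds
# for PATH to exist as a UNIX socket. Returns 0 if it appears, 1 otherwise.
wait_for_socket() {
    local path=$1 timeout=${2:-10} waited=0
    while [ ! -S "$path" ]; do
        # Give up once the timeout has elapsed.
        [ "$waited" -ge "$timeout" ] && return 1
        sleep 1
        waited=$((waited + 1))
    done
}
```

For example, `wait_for_socket /var/run/octavia/status.sock 5 || echo "driver agent socket missing"` would report a missing socket after five seconds. A missing socket only says that the driver agent is not listening; it does not by itself say why.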

Andre Ruiz (andre-ruiz) wrote :

I suspect this situation was aggravated after I added a relation between octavia and ovn-central. This relation was always supposed to exist, but it has been missing from our official bundles (the ones built via fce bundle-builder, not the github example bundles, which do have it). There's a bug for that which was recently fixed.

That relation was not strictly necessary on Ussuri (or even Victoria and Wallaby) but is definitely needed in Xena, or else all loadbalancers will always be offline. The added benefit of adding it is that it will enable the OVN octavia provider/driver.

But, as it was missing (and not needed) before, octavia was not enabling the ovn driver and would not get into this situation. This would explain why nobody has seen this at large scale to date (with ussuri & fce up to 2.13).

Felipe Reyes (freyes) wrote :

Andre,

Please verify that the following pairs of files have the same content within each octavia unit:

/etc/ovn/cert_host /etc/octavia/ovn_certificate.pem
/etc/ovn/key_host /etc/octavia/ovn_private_key.pem
/etc/ovn/ovn-chassis.crt /etc/octavia/ovn_ca_cert.pem

Running md5sum against them won't do the trick, since the files under /etc/ovn/ don't have a '\n' at the end, so please inspect them manually.
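As an alternative to inspecting the files by eye, the comparison can be made insensitive to the trailing '\n'. The sketch below relies on the fact that shell command substitution strips trailing newlines; same_ignoring_trailing_newline and check_ovn_copies are hypothetical helper names, and the paths are the ones listed above.

```shell
# Hypothetical helper: succeeds when both files are readable and differ
# at most by trailing newlines. $(...) strips trailing newlines, which is
# what hides the missing '\n' in the /etc/ovn files.
same_ignoring_trailing_newline() {
    [ -r "$1" ] && [ -r "$2" ] && [ "$(cat "$1")" = "$(cat "$2")" ]
}

# Walk the three pairs, written as "<ovn file>:<octavia copy>".
check_ovn_copies() {
    local pair src dst
    for pair in \
        /etc/ovn/cert_host:/etc/octavia/ovn_certificate.pem \
        /etc/ovn/key_host:/etc/octavia/ovn_private_key.pem \
        /etc/ovn/ovn-chassis.crt:/etc/octavia/ovn_ca_cert.pem
    do
        src=${pair%%:*}
        dst=${pair#*:}
        if same_ignoring_trailing_newline "$src" "$dst"; then
            echo "OK    $dst"
        else
            echo "STALE $dst"   # also printed when a file is unreadable
        fi
    done
}
```

Running `check_ovn_copies` as root on an octavia unit prints one line per file. A STALE line means the copy differs from the /etc/ovn file or one of the two could not be read.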

If they are not the same, please make a backup of the ones in the /etc/octavia directory (e.g. cp /etc/octavia/ovn_ca_cert.pem /etc/octavia/ovn_ca_cert.pem.bak) and copy the ones in /etc/ovn over the ones in /etc/octavia:

cp /etc/octavia/ovn_certificate.pem /etc/octavia/ovn_certificate.pem.bak
cp /etc/octavia/ovn_private_key.pem /etc/octavia/ovn_private_key.pem.bak
cp /etc/octavia/ovn_ca_cert.pem /etc/octavia/ovn_ca_cert.pem.bak

cp /etc/ovn/cert_host /etc/octavia/ovn_certificate.pem
cp /etc/ovn/key_host /etc/octavia/ovn_private_key.pem
cp /etc/ovn/ovn-chassis.crt /etc/octavia/ovn_ca_cert.pem

And then restart octavia daemons.
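The backup-and-copy steps can also be wrapped in a small function, sketched here under the assumption that the destination files already exist. replace_keeping_perms is a hypothetical helper name. It overwrites the destination through a redirection, so the existing file (and therefore its owner, group and mode) is reused rather than recreated.

```shell
# Hypothetical helper: back up DST as DST.bak, then overwrite the contents
# of DST with those of SRC. The redirection writes into the existing DST
# file, so its owner, group and mode are left unchanged.
replace_keeping_perms() {
    local src=$1 dst=$2
    # Refuse to run unless SRC is readable and DST is an existing file.
    { [ -r "$src" ] && [ -f "$dst" ]; } || return 1
    # -p keeps mode and timestamps (and ownership, when run as root).
    cp -p "$dst" "$dst.bak" || return 1
    cat "$src" > "$dst"
}
```

For example, `replace_keeping_perms /etc/ovn/cert_host /etc/octavia/ovn_certificate.pem`, repeated for the other two pairs. The octavia daemons still need to be restarted afterwards.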

Felipe Reyes (freyes) wrote :

I was able to reproduce the problem in a lab environment. After running reissue-certificates I can see in octavia/0 that ovn updated the certificates, but the copies living under /etc/octavia didn't get updated.

root@juju-d93572-ovn-17:/etc/ovn# ls -la key_host cert_host ovn-chassis.crt
-rw-r----- 1 root root 1508 Nov 25 19:12 cert_host
-rw-r----- 1 root root 1678 Nov 25 19:12 key_host
-rw-r--r-- 1 root root 1244 Nov 25 19:12 ovn-chassis.crt

root@juju-d93572-ovn-17:/etc/ovn# ls -la /etc/octavia/ovn_*
-rw-r----- 1 root octavia 1245 Nov 25 16:45 /etc/octavia/ovn_ca_cert.pem
-rw-r----- 1 root octavia 1509 Nov 25 16:45 /etc/octavia/ovn_certificate.pem
-rw-r----- 1 root octavia 1679 Nov 25 16:45 /etc/octavia/ovn_private_key.pem
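The listing shows the /etc/ovn files modified at 19:12 while the /etc/octavia copies still date from 16:45. A rough way to spot that condition is to compare modification times, as in the sketch below. is_stale is a hypothetical helper, and a timestamp is only a hint, not proof that the contents differ.

```shell
# Hypothetical helper: succeeds when COPY is missing or older than SOURCE.
# Usage: is_stale SOURCE COPY
is_stale() {
    [ ! -e "$2" ] || [ "$1" -nt "$2" ]
}
```

For example, `is_stale /etc/ovn/cert_host /etc/octavia/ovn_certificate.pem && echo "copy is older than the ovn file"`.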

Felipe Reyes (freyes) wrote :

The chassis-certificates available to octavia via the ovsdb-subordinate relation didn't get updated, so I'm adding a task for charm-layer-ovn.

Changed in charm-layer-ovn:
status: New → Triaged
Changed in charm-ovn-chassis:
status: New → Triaged
Changed in charm-layer-ovn:
importance: Undecided → High
Changed in charm-ovn-chassis:
importance: Undecided → High
Changed in charm-octavia:
importance: Undecided → High
Andre Ruiz (andre-ruiz) wrote :

The proposed workaround did fix the issue. It took a small fight to convince octavia to use the new files (I restarted a lot of services and it would still not work) but after restarting all three lxc containers it came back working.

Note to anyone else doing this: pay attention to the owner and permissions when copying the files.

Andre Ruiz (andre-ruiz) wrote :

Removing field-critical.

Felipe Reyes (freyes)
Changed in charm-octavia:
status: New → Triaged
Changed in charm-octavia:
status: Triaged → In Progress
Felipe Reyes (freyes) wrote :
Changed in charm-layer-ovn:
assignee: nobody → Felipe Reyes (freyes)
status: Triaged → In Progress
OpenStack Infra (hudson-openstack) wrote : Change abandoned on charm-octavia (master)

Change abandoned by "James Page <email address hidden>" on branch: master
Review: https://review.opendev.org/c/openstack/charm-octavia/+/819438
Reason: This review is > 12 weeks without comment, and failed testing the last time it was checked. We are abandoning this for now. Feel free to reactivate the review by pressing the restore button and leaving a 'recheck' comment to get fresh test results.

Andrei Fer (andreifer20) wrote :

Thanks for the workaround. It works for me on OpenStack Victoria installed with Juju on LXD containers. To explain what I did:

0) connect to the LXD containers
1) cp /etc/octavia/ovn_certificate.pem /etc/octavia/ovn_certificate.pem.bkp
2) cp /etc/octavia/ovn_private_key.pem /etc/octavia/ovn_private_key.pem.bkp
3) cp /etc/ovn/cert_host /etc/octavia/ovn_certificate.pem
4) cp /etc/ovn/key_host /etc/octavia/ovn_private_key.pem
5) systemctl restart apache2
6) check the logs -> tail -n 400 -f /var/log/apache2/octavia_access.log -> you should see 200 response codes instead of 503
Repeat all the steps on each LXD container.
