Unable to refresh certificates with reissue-certificates

Bug #1940549 reported by Steven Parker
This bug affects 1 person
Affects                                  Status         Importance  Assigned to    Milestone
OpenStack Neutron API OVN Plugin Charm   Fix Committed  High        Unassigned
charm-ovn-central                        Fix Committed  High        Unassigned
charm-ovn-chassis                        Fix Committed  High        Unassigned
vault-charm                              Fix Committed  High        Martin Kalcok

Bug Description

While attempting to refresh certificates for a k8s installation, no units other than the client leaders updated.

Steps to replicate:
  1. Deploy the k8s stack and vault with replication count 3 (HA).
  2. Delete the vault unit which is the leader and add another.
  3. Execute the refresh certificates action.
  4. Confirm whether the k8s client.crt was actually updated:
       juju ssh kubernetes-worker/0 sudo openssl x509 -in /root/cdk/client.crt -text
  5. Repeat a few times.

  The underlying issue is that multiple vault units each share their own copy of relation data with client applications, instead of there being one source of truth (the leader).
We have one vault leader, which provides the correct data when we re-issue certificates.
However, older vault units that were leader at some point still retain stale certificate data shared with all the clients.
That stale data conflicts with the newly provided certificates: the clients see the stale data (which contains the original certs), conclude nothing has changed,
and thus do not write the new client certificates to disk.

== work-around below ==

We cleared the stale data from the non-leaders to solve the issue.
For example, here is vault/0, which is a non-leader (vault/1 is the current leader):
  juju run -u vault/0 "relation-set -r certificates:61
       kubernetes-master_0.server.key='' kubernetes-master_0.server.cert='' kubernetes-master_0.processed_client_requests=''
       kubernetes-master_1.server.key='' kubernetes-master_1.server.cert='' kubernetes-master_1.processed_client_requests=''
       kubernetes-master_2.server.key='' kubernetes-master_2.server.cert='' kubernetes-master_2.processed_client_requests=''
       kubernetes-master_5.server.key='' kubernetes-master_5.server.cert='' kubernetes-master_5.processed_client_requests=''
    "

Once the stale data was cleared the clients saw the new certificates and updated correctly.
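The clearing command above is tedious to write by hand when there are many client units. As a minimal sketch, the helper below builds the `juju run` invocation for one non-leader unit, assuming the `<client>_<n>.server.key` / `.server.cert` / `.processed_client_requests` key-naming pattern shown in the workaround; the function name and parameters are illustrative, not part of any charm.

```python
# Sketch: build the "juju run" command that blanks stale certificate keys on a
# non-leader vault unit. Assumes the key-naming pattern from the workaround
# above; unit name, relation id, and client indices are illustrative.

def build_clear_command(unit, relation_id, client, indices):
    """Return a juju run invocation that blanks stale cert keys for `client`."""
    keys = []
    for i in indices:
        for suffix in ("server.key", "server.cert", "processed_client_requests"):
            keys.append(f"{client}_{i}.{suffix}=''")
    return f'juju run -u {unit} "relation-set -r {relation_id} {" ".join(keys)}"'

cmd = build_clear_command("vault/0", "certificates:61", "kubernetes-master", [0, 1, 2, 5])
print(cmd)
```

Running this against each non-leader vault unit in turn reproduces the workaround without retyping every key.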

Revision history for this message
Cory Johns (johnsca) wrote :

As mentioned, this arises because the interface protocol in question was created before application-level relation data was available, so the leader has no choice but to write the response data in its unit data bucket, potentially leading to conflicting data being presented on the relation. The requesting side has no way to know which unit is the leader and thus which data is authoritative, but it could perhaps parse the cert data and pick the best one based on the effective and expiration dates. However, there are many more clients than providers for this relation and this issue impacts all of them, not just Kubernetes.
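The client-side heuristic described here (parse the cert data from each unit and prefer the best candidate by effective and expiration dates) can be sketched as follows; actual certificate parsing is elided, so each candidate is represented just by its validity window, and all names are illustrative.

```python
# Sketch of the client-side heuristic: when several provider units present
# conflicting cert data, prefer the currently-valid certificate with the
# latest effective (notBefore) date. Each candidate is reduced to its
# (not_before, not_after) validity window; real code would parse the PEM.
from datetime import datetime

def pick_authoritative(candidates, now):
    """candidates: dict of unit name -> (not_before, not_after)."""
    valid = {u: (nb, na) for u, (nb, na) in candidates.items() if nb <= now <= na}
    if not valid:
        return None
    # The freshest issuance is most likely the current leader's response.
    return max(valid, key=lambda u: valid[u][0])

now = datetime(2021, 8, 20)
candidates = {
    "vault/0": (datetime(2020, 1, 1), datetime(2021, 12, 31)),  # stale ex-leader data
    "vault/1": (datetime(2021, 8, 19), datetime(2022, 8, 19)),  # re-issued by leader
}
print(pick_authoritative(candidates, now))  # -> vault/1
```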

Possible solutions:

1) Migrate the interface protocol to app-level rel data. This would be the best solution for a new interface, but migrating to it now would require updating every charm which uses this interface on either side of the relation. It might be possible to do incrementally by writing the data in both buckets and applying one of the other fixes and then gradually updating the client charms to prefer the app-level data.

2) Make provider units clear their relation data whenever they see that they are not the leader. This requires no updates to the clients, and possibly no communication between the leaders and non-leaders of the provider, except that there is a chance for the non-leaders to wipe out relation data before the leader has written the new data, so that may need to be coordinated using a leader data field.

3) Make provider units all write the latest data as soon as it's available. I think this should be possible for Vault if the non-leader units can read the secrets out, but they'll need some trigger to know when the leader has generated the initial or updated data. For EasyRSA, the cert data will need to be copied to leader data if it isn't already. This is a bit more complex than 2 but ensures that the correct data is always available on the relation.
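Solution 2's coordination concern can be sketched as follows. The hook tools (is-leader, leader-get, relation-set) are simulated here with plain dicts, and the function and flag names are illustrative, not the charm's actual API.

```python
# Sketch of solution 2: non-leader provider units blank their relation data,
# gated on a leader-data flag so they don't wipe it before the leader has
# written the fresh certs. Hook tools are simulated with plain dicts.

def update_relation(is_leader, leader_data, unit_relation_data, fresh_certs):
    if is_leader:
        unit_relation_data.update(fresh_certs)
        leader_data["certs-published"] = True   # signal non-leaders it's safe
    elif leader_data.get("certs-published"):
        # Only clear once the leader has published, so clients always see
        # exactly one authoritative copy of the data on the relation.
        unit_relation_data.clear()
    return unit_relation_data

leader_data = {}
leader_bag = update_relation(True, leader_data, {}, {"client_0.server.cert": "NEW"})
follower_bag = update_relation(False, leader_data, {"client_0.server.cert": "OLD"}, {})
print(leader_bag, follower_bag)
```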

Cory Johns (johnsca)
no longer affects: charm-kubernetes-master
no longer affects: charm-kubernetes-worker
Changed in vault-charm:
status: New → Triaged
importance: Undecided → Medium
Revision history for this message
Chris Johnston (cjohnston) wrote :

See also: https://bugs.launchpad.net/charm-kubernetes-master/+bug/1899706
I believe these are the same underlying issue.

Revision history for this message
James Troup (elmo) wrote :

I've subscribed ~field-high. This also affects the OVN charms; workaround in the bug description works great.

Changed in vault-charm:
importance: Medium → High
Revision history for this message
Corey Bryant (corey.bryant) wrote :

Bumping to high impact. There's a work-around as mentioned in the bug description, which typically leads to triaging as medium. However, after spending some time today working with bootstack to work around this, it's clear this may impact more clouds so we should aim to get it fixed soon.

description: updated
description: updated
Changed in vault-charm:
importance: High → Medium
importance: Medium → High
Revision history for this message
Corey Bryant (corey.bryant) wrote :

I think this can affect any client that gets certificates from vault, but I've added charm-ovn-chassis and charm-ovn-central for now; however, the fix will likely be on the vault side.

Revision history for this message
Andrea Ieri (aieri) wrote :

This bug affected neutron-api-plugin-ovn as well for us today. I'll add the project for completeness, but I agree the fix will probably not involve it.

Changed in charm-ovn-chassis:
status: New → Triaged
importance: Undecided → High
Changed in charm-ovn-central:
status: New → Triaged
importance: Undecided → High
Changed in charm-neutron-api-plugin-ovn:
status: New → Triaged
importance: Undecided → High
Changed in vault-charm:
assignee: nobody → Martin Kalcok (martin-kalcok)
status: Triaged → In Progress
Revision history for this message
Martin Kalcok (martin-kalcok) wrote :

Correct me if I'm wrong, but EasyRSA does not really support HA deployments. There's no "peers" relation endpoint defined in metadata.yaml and no code that would handle relations with other EasyRSA units, so I don't think this bug is applicable to the EasyRSA charm.

Revision history for this message
Martin Kalcok (martin-kalcok) wrote :

In addition to your suggested solutions, @johnsca, I think there's also an option to implement "leader storage" as a cache. The leader unit can use "leader-set" to store certs/keys when they are issued. When leadership changes and the new leader unit processes requests from "clients", it first checks the cache for a certificate previously issued for that particular request and, if one exists, reuses it.

This approach is a bit of a spin on your suggestion #3, except that it does not require synchronization between all the vault (provider) units via peer relations when a new certificate is issued.

The current approach of re-issuing certificates on leadership change is unnecessary, as the root CA is shared between the vault (provider) units and does not change when leadership changes. IMO we should aim to remove the re-issuing as part of this fix.

A potential downside of using leader storage is that all the secrets are stored in the controller database. However, I checked with @aluria and he said that even data from regular (provide-require) relations is stored on the controller, so it should not make much difference security-wise, as the secrets are already stored in the provide-require relation data anyway.
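The leader-cache idea can be sketched as below. leader-set/leader-get are simulated by a shared dict, and `issue_cert` is a stand-in for the real call into Vault; all names are illustrative.

```python
# Sketch of the leader-storage cache idea: the leader stores issued certs
# under a key derived from the client request, and a new leader reuses a
# cached entry instead of re-issuing on leadership change.

def handle_request(leader_cache, request_id, issue_cert):
    cached = leader_cache.get(request_id)
    if cached is not None:
        return cached              # reuse: no re-issue on leadership change
    cert = issue_cert(request_id)  # first time only
    leader_cache[request_id] = cert
    return cert

issued = []
def issue_cert(request_id):
    issued.append(request_id)
    return f"CERT-for-{request_id}"

cache = {}
first = handle_request(cache, "kubernetes-master_0", issue_cert)   # leader issues
second = handle_request(cache, "kubernetes-master_0", issue_cert)  # new leader reuses
print(first == second, len(issued))  # -> True 1
```

The key property is that clients then see the same certificate from the relation regardless of how many leadership changes have happened.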

Revision history for this message
Martin Kalcok (martin-kalcok) wrote :

I opened a PR with the implementation of the leader cache and synchronization of certificates across the non-leader units. It's marked as a work-in-progress because I haven't looked at the functional tests (or whether they are needed at all), but overall it's ready for review.

PR: https://review.opendev.org/c/openstack/charm-vault/+/828885

tags: added: review-needed
Revision history for this message
Martin Kalcok (martin-kalcok) wrote :

At the engineering sprint, it came to my attention that leader-set is to be deprecated and the "application databag" should be used instead, so this will need a rework.

It's also worth mentioning that both the application and leader databags persist through scaling to zero and back up again. This is an edge case, but we should make sure that the cache gets erased when the last unit leaves, because a new vault instance will probably initialize a new CA.
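The scale-to-zero edge case can be sketched as follows; `planned_units` stands in for Juju's goal state and the dict stands in for the persistent databag, so the names are illustrative only.

```python
# Sketch of the scale-to-zero edge case: because the application/leader databag
# persists when all units are removed, the departing last unit should erase
# the cert cache, since a fresh vault deployment will likely create a new CA.

def on_unit_departing(planned_units, leader_cache):
    if planned_units == 0:
        # Last unit leaving: stale certs from the old CA must not survive.
        leader_cache.clear()
    return leader_cache

cache = {"kubernetes-master_0": "CERT"}
print(on_unit_departing(2, dict(cache)))  # ordinary scale-down: cache kept
print(on_unit_departing(0, cache))        # scale to zero: cache erased -> {}
```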

tags: removed: review-needed
Revision history for this message
Corey Bryant (corey.bryant) wrote :

Per review comments, this is the bundle I'm testing with, in case it helps others iterate on testing faster.

Revision history for this message
Corey Bryant (corey.bryant) wrote :

I believe we can mark this as Fix Committed for Vault.

commit f55055b8783ca6f3f569209b4f82285377f5ac64
Author: Martin Kalcok <email address hidden>
Date: Fri Feb 11 15:13:41 2022 +0100

    Implement cert cache for vault units (v2)

    This cache is used to store certificates and keys
    issued by the leader unit. Non-leader units read
    these certificates and keep data in their
    "tls-certificates" relations up to date.
    This ensures that charm units that receive certs
    from vault can read from relation data of any
    vault unit and receive correct data.

    This patch is the same as
    1159e547dd755af97d5eab578cdfe90abad93843
    but improved to avoid LP#1970888

    Change-Id: Ic4dd009cc18c52e1667391b00ebba9928acc5937
    Closes-Bug: #1940549
    Closes-Bug: #1970888

Changed in vault-charm:
status: In Progress → Fix Committed
Revision history for this message
Corey Bryant (corey.bryant) wrote :

I'll mark as Fix Committed for the other affected charms, though they didn't receive and won't be receiving fixes. The fix is in vault.

Changed in charm-ovn-chassis:
status: Triaged → Fix Committed
Changed in charm-ovn-central:
status: Triaged → Fix Committed
Changed in charm-neutron-api-plugin-ovn:
status: Triaged → Fix Committed
Changed in charm-easyrsa:
status: New → Fix Committed
George Kraft (cynerva)
no longer affects: charm-easyrsa