Possible collisions when two apps with the same name, from different models, relate to vault

Bug #1949913 reported by Paul Goins
Affects                          Status         Importance  Assigned to  Milestone
Kubernetes Control Plane Charm   Fix Released   Critical    Adam Dyess
Vault KV Charm Layer             Fix Released   Critical    Adam Dyess
charm-interface-vault-kv         Fix Released   Undecided   Adam Dyess
vault-charm (status tracked in Trunk):
  1.5                            New            Undecided   Unassigned
  1.6                            New            Undecided   Unassigned
  1.7                            New            Undecided   Unassigned
  1.8                            New            Undecided   Unassigned
  Trunk                          New            Undecided   Unassigned

Bug Description

Hello,

I have an issue where I have 2 K8s clusters, both with apps named "kubernetes-master", on two separate models. They're connected to vault via a cross-model relationship.

After recent maintenance, we're unable to get the kubernetes-master units to fully come up on one of the clusters.

I've noted that the layer code (https://github.com/juju-solutions/layer-vault-kv/blob/e22c18b133070ce354cebbda864a5aa8a4b60398/lib/charms/layer/vault_kv.py#L101) appears to store keys based upon the app name and the unit numeric suffix, but there is no accommodation for multiple models using the same application name. I suspect this is causing collisions between the applications.
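
To make this concrete, here is a minimal sketch (hypothetical function name, not the actual layer code) of how a backend name derived only from the application name collides across models:

def secret_backend_name(app_name):
    # Hypothetical stand-in for the layer's backend naming: the backend is
    # keyed on the app name alone, with no model component.
    return "charm-{}".format(app_name)

# Two clusters deployed as "kubernetes-master" in different models resolve
# to the same Vault backend and therefore share (and clobber) state:
backend_model_a = secret_backend_name("kubernetes-master")  # "charm-kubernetes-master"
backend_model_b = secret_backend_name("kubernetes-master")  # "charm-kubernetes-master"
assert backend_model_a == backend_model_b  # same backend -> collision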

Revision history for this message
Paul Goins (vultaire) wrote :

After hitting this bug again, and reading into the sources a bit, this seems like a critical bug. Apps relying on VaultAppKV (e.g. kubernetes-master, which I think is now kubernetes-control-plane) can end up clobbering each other's data.

Likely a similar issue occurs on VaultUnitKV, albeit mitigated somewhat if the unit IDs from the different apps don't collide. That doesn't matter in my current case re: Charmed Kubernetes, but it's worth calling out as well.

The core of the collision appears to be here: https://github.com/juju-solutions/layer-vault-kv/blob/39b0c7b38d59e0133d7fffe3849b4783c431e16d/lib/charms/layer/vault_kv.py#L246
(This is on the tip of the master branch at the time of writing.)

The only way I know of to avoid this issue at present is to avoid having multiple apps of the same name (via different models) using the same vault.

Revision history for this message
George Kraft (cynerva) wrote :

Can you clarify the impact of this? What happens when the KVs collide? Does it lead to permanent loss of cluster data? Does it lead to clusters that cannot be recovered?

George Kraft (cynerva)
Changed in charm-kubernetes-master:
status: New → Confirmed
Changed in charm-layer-vault-kv:
status: New → Confirmed
Revision history for this message
Paul Goins (vultaire) wrote (last edit ):

Hi George,

It can lead to permanent data loss. And indeed, while the immediate bug is in layer-vault-kv in my opinion, we have perhaps a second issue related to this in kubernetes-master. I'll try to lay this out simply below.

1. 2 kubernetes-master apps are deployed in different models, connected to the same vault via a CMR. Each app sees itself as "kubernetes-master", and thus the _get_secret_backend() function mentioned in my last comment returns "charm-kubernetes-master".

2. Each app stores its encryption_secret at charm-kubernetes-master/kv/app. (Source from current main branch tip: https://github.com/charmed-kubernetes/charm-kubernetes-control-plane/blob/54e02bb/reactive/kubernetes_control_plane.py#L3307) The last one to call the generate_encryption_key handler will clobber whatever value was previously stored in vault at that path (see the sketch after this list).

3. Running the create_secure_storage() function - directly, or because the kubernetes-control-plane.secure-storage.created flag was cleared (perhaps during attempted debugging, or as a follow-up to the layer.vaultlocker.ready flag being cleared) - will cause the /var/snap/kube-apiserver/common/encryption/encryption_config.yaml file to be rewritten with whatever value is currently in vault. This appears to be the moment that the previous key is irrevocably lost, as it may have persisted in this config file despite having already been overwritten in vault.

4. After the config is rewritten based upon the current vault values, you find yourself in a situation where you cannot read any existing secrets. "kubectl get secrets" will fail with cryptic errors like "Internal error occurred: unable to transform key "/registry/secrets/kube-system/attachdetach-controller-token-4bcd3": invalid padding on input" since the apiserver cannot decrypt these secrets. However, creating and reading new secrets via kubectl works fine, and pulling those secrets via etcdctl confirms that they are indeed being encrypted. In other words: most likely, the config file's key has changed and can no longer decrypt any secrets created prior to the accidental change.

5. As a slight tangent, but related to the above: the kubernetes-control-plane charm doesn't appear to handle key rotation. Upon key change, secrets are not re-encrypted with the new key; the old key is simply dropped and existing secrets become inaccessible. This is a use case which should also be considered, at least to limit the damage of a collision or other unexpected changes to the data in vault.
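
To illustrate points 1-4, here is a toy reproduction (assumed names only, not the actual charm code) of the clobbering sequence:

import secrets

vault_kv = {}  # stands in for the shared Vault KV store

def backend_path(app_name):
    return "charm-{}/kv/app".format(app_name)  # no model UUID in the path

def generate_encryption_key(app_name):
    # What the leader handler effectively does on each cluster.
    vault_kv[backend_path(app_name)] = {"encryption_key": secrets.token_hex(32)}

def write_encryption_config(app_name):
    # What create_secure_storage() effectively does: write whatever is
    # currently in vault into encryption_config.yaml.
    return vault_kv[backend_path(app_name)]["encryption_key"]

generate_encryption_key("kubernetes-master")              # cluster A (model 1)
key_a_on_disk = write_encryption_config("kubernetes-master")

generate_encryption_key("kubernetes-master")              # cluster B (model 2) clobbers A
key_a_after_rewrite = write_encryption_config("kubernetes-master")

assert key_a_on_disk != key_a_after_rewrite  # cluster A can no longer decrypt old secrets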

Hope this helps - please reach out if you wish for more details or clarification.

Revision history for this message
George Kraft (cynerva) wrote :

Thanks for the details. Given the delayed but inevitable loss of Kubernetes secret data, I agree with your assessment that this is a critical issue. Even worse, it seems the cluster can appear healthy at first, and the damage only surfaces much later, when the encryption config file gets rewritten.

I believe we should be able to fix the collision by adding the model UUID to the secret backend. On upgrade, we'll have to be careful to ensure that existing data is migrated properly.

When migrating the encryption key, instead of grabbing it from Vault, perhaps we should consider grabbing it from the local encryption config file if it exists. This might heal existing clusters that have been damaged but are not showing symptoms yet.
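
A rough sketch of both ideas (hypothetical helper names; assumes the standard Kubernetes EncryptionConfiguration layout with an aescbc provider listed first):

import os
import yaml

ENCRYPTION_CONFIG = "/var/snap/kube-apiserver/common/encryption/encryption_config.yaml"

def secret_backend_name(model_uuid, app_name):
    # Include the model UUID so identically named apps in different models
    # no longer share a backend.
    return "charm-{}-{}".format(model_uuid, app_name)

def key_from_local_config(path=ENCRYPTION_CONFIG):
    # Prefer the key already on disk over the (possibly clobbered) vault value.
    if not os.path.exists(path):
        return None
    with open(path) as f:
        config = yaml.safe_load(f)
    try:
        providers = config["resources"][0]["providers"]
        return providers[0]["aescbc"]["keys"][0]["secret"]
    except (KeyError, IndexError, TypeError):
        return None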

Regarding point 5, I agree that key rotation would be good to have and would have lessened the impact here. The upstream Kubernetes documentation describes how to do this[1]. It will be tricky to implement in the charm. To keep the scope of this issue small, I think we will have to consider key rotation separately.

For my own slight tangent, from a security standpoint, it seems unfortunate that units using the vault-kv relation have full read and write access to a database shared by other models. I don't think a kubernetes-control-plane app in one model should have read access to the secret data for a kubernetes-control-plane app in a different model. Does vault have access controls that we're not taking advantage of in the charms? Or is this an indication that each cluster should have its own instance of vault? I'm not sure, but it seems worth looking into.

[1]: https://kubernetes.io/docs/tasks/administer-cluster/encrypt-data/#rotating-a-decryption-key

Changed in charm-kubernetes-master:
importance: Undecided → Critical
Changed in charm-layer-vault-kv:
importance: Undecided → Critical
Changed in charm-kubernetes-master:
status: Confirmed → Triaged
Changed in charm-layer-vault-kv:
status: Confirmed → Triaged
Revision history for this message
Adam Dyess (addyess) wrote :

The control-plane charm uses vault-kv in the following ways to store state encrypted in vault:

Diagnosis
----------

1) The kubernetes-control-plane charm stores the "encryption_key" in a vault secrets backend named charm-{app-name}, at the path charm-{app-name}/kv/app. (Source from the release_1.27 branch [1])

2) There are no uses of VaultUnitKV (individualized unit KV store)

Reactive Flag Events (per 1.27/stable release)
--------------

1) Each unit stores a hash at the path charm-{app-name}/kv/app-hashes/{unit-num}: an md5 hexdigest of each key's data in */kv/app/. Its purpose is to let the other units react when one unit changes an item under /kv/app/*, by watching for flags such as the following (see the sketch after this list):
* layer.vault-kv.app-kv.changed
* layer.vault-kv.app-kv.changed.{key}
* layer.vault-kv.app-kv.set.{key}

2) Only the leader unit generates an encryption_key, when the vault relation is ready and "layer.vault-kv.app-kv.set.encryption_key" is cleared.

3) Each unit writes the encryption_key to disk when the vault relation is ready, "layer.vault-kv.app-kv.set.encryption_key" is set, and "kubernetes-control-plane.secure-storage.created" is unset.

4) Each unit clears "kubernetes-control-plane.secure-storage.created" if the vault relation is no longer ready and "kubernetes-control-plane.secure-storage.created" is currently set.
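
A hedged sketch (assumed helper names) of the per-unit hashing described in item 1, which is what drives those flags:

import hashlib
import json

def app_hashes(app_data):
    # One digest per key, mirroring charm-{app-name}/kv/app-hashes/{unit-num}.
    return {
        key: hashlib.md5(json.dumps(value, sort_keys=True).encode()).hexdigest()
        for key, value in app_data.items()
    }

def changed_keys(old_hashes, new_hashes):
    # Keys whose digests differ are the ones that would trigger
    # layer.vault-kv.app-kv.changed.{key} on the other units.
    return {k for k in new_hashes if old_hashes.get(k) != new_hashes.get(k)}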

Assurances
----
* A charm-upgrade must confirm each unit maintains the correct encryption key located in /var/snap/kube-apiserver/common/encryption/encryption_config.yaml so that secrets can be decrypted

* A charm-upgrade in one cluster must not disturb another cluster (upgraded or not)

* A charm-upgrade should store the encryption_key to a new secrets store named charm-{model-uuid}-{app-name} at /kv/app

* A charm-upgrade should NOT generate a new encryption_key when the key is unset in charm-{model-uuid}-{app-name}/kv/app but is available in encryption_config.yaml. Rather, it should read the key from the yaml and push it as the restored encryption_key value for the new secret store (sketched below).
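
A sketch of that upgrade decision (hypothetical helper, not the charm's actual code):

def migrate_encryption_key(new_store, key_on_disk):
    # new_store: dict-like view of charm-{model-uuid}-{app-name}/kv/app
    # key_on_disk: key parsed from encryption_config.yaml, or None
    if new_store.get("encryption_key"):
        return new_store["encryption_key"]          # already migrated
    if key_on_disk:
        new_store["encryption_key"] = key_on_disk   # restore; do not regenerate
        return key_on_disk
    # Neither the new store nor the local config has a key; only in this case
    # would the leader fall back to generating one, as on a fresh deployment.
    return None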

Links
------------------
[1]: https://github.com/charmed-kubernetes/charm-kubernetes-control-plane/blob/release_1.27/reactive/kubernetes_control_plane.py#L3307

Revision history for this message
Adam Dyess (addyess) wrote :

Beginning changes for layer-vault-kv to support a new secrets-backend

https://github.com/juju-solutions/layer-vault-kv/pull/20

Revision history for this message
Adam Dyess (addyess) wrote :

I've run into a new issue related to this. The vault charm authorizes a vault role to a certain IP CIDR based on its "unit-name" [1]. When multiple `kubernetes-control-plane/0` units from different models try to create a new vault role, the role cannot be created with a different CIDR range than the first one. I've included a stack trace for this new issue [2], but I believe each unit which is assigned a token will need its role and token scoped by model and unit.

[1] https://github.com/openstack/charm-vault/blob/stable/1.8/src/reactive/vault_handlers.py#L652-L666
[2] https://paste.ubuntu.com/p/xthqdJWmKw/
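
To illustrate the collision (hypothetical naming, not the vault charm's actual scheme): a role keyed on the bare unit name is shared across models, while one scoped with the model UUID is not.

def role_name(unit_name, model_uuid=None):
    base = unit_name.replace("/", "-")  # e.g. "kubernetes-control-plane-0"
    if model_uuid:
        return "charm-{}-{}".format(model_uuid, base)
    return "charm-{}".format(base)

# Without the model UUID, unit 0 in two different models maps to one role,
# so the second model cannot bind its own CIDR range:
assert role_name("kubernetes-control-plane/0") == role_name("kubernetes-control-plane/0")

# With the model UUID, each model gets its own role and bound CIDR:
assert (role_name("kubernetes-control-plane/0", "aaaa-1111")
        != role_name("kubernetes-control-plane/0", "bbbb-2222"))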

Revision history for this message
Adam Dyess (addyess) wrote :

This was discovered on the following charm revisions:

K8s model #1: cs:~containers/kubernetes-master-1034
K8s model #2: cs:~containers/kubernetes-master-1034 (yes, same)
Model holding vault: cs:vault-49

However, there are still limits that block multiple models with the same app names and the same unit names from relating to the same vault.

Adam Dyess (addyess)
Changed in charm-kubernetes-master:
milestone: none → 1.27+ck1
Changed in charm-layer-vault-kv:
milestone: none → 1.27+ck1
Changed in charm-kubernetes-master:
assignee: nobody → Adam Dyess (addyess)
Changed in charm-layer-vault-kv:
assignee: nobody → Adam Dyess (addyess)
Adam Dyess (addyess)
Changed in charm-interface-vault-kv:
status: New → In Progress
assignee: nobody → Adam Dyess (addyess)
Changed in charm-layer-vault-kv:
status: Triaged → In Progress
Changed in charm-kubernetes-master:
status: Triaged → In Progress
Adam Dyess (addyess)
Changed in charm-layer-vault-kv:
status: In Progress → Fix Committed
Changed in charm-kubernetes-master:
status: In Progress → Fix Committed
Adam Dyess (addyess)
no longer affects: vault-charm
Revision history for this message
Adam Dyess (addyess) wrote :

backport to layer-vault-kv release_1.27 complete

tags: added: backport-needed
Revision history for this message
Adam Dyess (addyess) wrote :

backport to charm-kubernetes-control-plane release_1.27 completed

new 1.27/candidate charm available
https://charmhub.io/kubernetes-control-plane?channel=1.27/candidate

tags: removed: backport-needed
Revision history for this message
Adam Dyess (addyess) wrote :

I've re-added the backport-needed tag because the PR [1] on interface-vault-kv is mandatory to resolve one more issue here. Do not mark this bug as "released" unless this PR is also merged.

[1] https://github.com/openstack-charmers/charm-interface-vault-kv/pull/14

Changed in charm-kubernetes-master:
status: Fix Committed → In Progress
tags: added: backport-needed
Adam Dyess (addyess)
Changed in charm-interface-vault-kv:
status: In Progress → Fix Committed
Adam Dyess (addyess)
Changed in charm-kubernetes-master:
status: In Progress → Fix Committed
tags: removed: backport-needed
Revision history for this message
Alex Kavanagh (ajkavanagh) wrote :

Adding the vault charm as it's based on the Vault KV layer and will therefore need to be rebuilt.

Changed in charm-interface-vault-kv:
status: Fix Committed → In Progress
Adam Dyess (addyess)
Changed in charm-interface-vault-kv:
status: In Progress → Fix Committed
Changed in charm-kubernetes-master:
status: Fix Committed → Fix Released
Changed in charm-layer-vault-kv:
status: Fix Committed → Fix Released
Changed in charm-interface-vault-kv:
status: Fix Committed → Fix Released