kubernetes-control-plane errors with hook failed: "vault-kv-relation-changed"

Bug #1989362 reported by Bas de Bruijne
This bug affects 2 people
| Affects | Status | Importance | Assigned to | Milestone |
|---|---|---|---|---|
| Kubernetes Control Plane Charm | Fix Released | High | Adam Dyess | |
| Vault KV Charm Layer | Fix Released | High | Adam Dyess | |

Bug Description

In test run https://solutions.qa.canonical.com/testruns/testRun/a84cd3a5-9279-48ac-8455-3d32cfcad411, a release test for Kubernetes 1.25 on AWS with Ubuntu 22.04 (Jammy), kubernetes-control-plane fails with:

```
kubernetes-control-plane/0 waiting idle 12 54.81.238.200 6443/tcp Waiting for auth-webhook tokens
  calico/4 waiting idle 54.81.238.200 Waiting to retry Calico node configuration
  containerd/4 active idle 54.81.238.200 Container runtime available
  filebeat/17 active idle 54.81.238.200 Filebeat ready.
  ntp/17 active idle 54.81.238.200 123/udp chrony: Ready
  telegraf/17 active idle 54.81.238.200 9103/tcp Monitoring kubernetes-control-plane/0 (source version/commit 76901fd)
kubernetes-control-plane/1* error idle 13 35.168.18.90 hook failed: "vault-kv-relation-changed"
  calico/1 waiting idle 35.168.18.90 Waiting to retry Calico node configuration
  containerd/1 active idle 35.168.18.90 Container runtime available
  filebeat/2 active idle 35.168.18.90 Filebeat ready.
  ntp/2 active idle 35.168.18.90 123/udp chrony: Ready
  telegraf/2 active idle 35.168.18.90 9103/tcp Monitoring kubernetes-control-plane/1 (source version/commit 76901fd)
```

The unit logs show an internal server error returned by Vault:
```
/vaultlocker.py:176: DeprecationWarning: Call to deprecated function '_post'. This method will be removed in version '0.8.0' Please use the 'post' method on the 'hvac.adapters' class moving forward.
unit-kubernetes-control-plane-1: 22:35:38 WARNING unit.kubernetes-control-plane/1.vault-kv-relation-changed response = client._post('/v1/sys/wrapping/unwrap')
unit-kubernetes-control-plane-1: 22:35:41 ERROR unit.kubernetes-control-plane/1.juju-log vault-kv:58: Hook error:
Traceback (most recent call last):
  File "/var/lib/juju/agents/unit-kubernetes-control-plane-1/.venv/lib/python3.10/site-packages/charms/reactive/__init__.py", line 73, in main
    hookenv._run_atstart()
  File "/var/lib/juju/agents/unit-kubernetes-control-plane-1/.venv/lib/python3.10/site-packages/charmhelpers/core/hookenv.py", line 1348, in _run_atstart
    callback(*args, **kwargs)
  File "/var/lib/juju/agents/unit-kubernetes-control-plane-1/charm/reactive/vault_kv.py", line 46, in manage_app_kv_flags
    app_kv = vault_kv.VaultAppKV()
  File "/var/lib/juju/agents/unit-kubernetes-control-plane-1/charm/lib/charms/layer/vault_kv.py", line 33, in __call__
    cls._singleton_instance = super().__call__(*args, **kwargs)
  File "/var/lib/juju/agents/unit-kubernetes-control-plane-1/charm/lib/charms/layer/vault_kv.py", line 127, in __init__
    self._path = "{}/kv/app".format(self._config["secret_backend"])
  File "/var/lib/juju/agents/unit-kubernetes-control-plane-1/charm/lib/charms/layer/vault_kv.py", line 72, in _config
    _VaultBaseKV._config = get_vault_config()
  File "/var/lib/juju/agents/unit-kubernetes-control-plane-1/charm/lib/charms/layer/vault_kv.py", line 238, in get_vault_config
    "secret_id": _get_secret_id(vault),
  File "/var/lib/juju/agents/unit-kubernetes-control-plane-1/charm/lib/charms/layer/vault_kv.py", line 257, in _get_secret_id
    secret_id = retrieve_secret_id(vault_url, token)
  File "/var/lib/juju/agents/unit-kubernetes-control-plane-1/.venv/lib/python3.10/site-packages/charmhelpers/contrib/openstack/vaultlocker.py", line 176, in retrieve_secret_id
    response = client._post('/v1/sys/wrapping/unwrap')
  File "/var/lib/juju/agents/unit-kubernetes-control-plane-1/.venv/lib/python3.10/site-packages/hvac/utils.py", line 201, in new_func
    return method(*args, **kwargs)
  File "/var/lib/juju/agents/unit-kubernetes-control-plane-1/.venv/lib/python3.10/site-packages/hvac/v1/__init__.py", line 3034, in _post
    return self._adapter.post(*args, **kwargs)
  File "/var/lib/juju/agents/unit-kubernetes-control-plane-1/.venv/lib/python3.10/site-packages/hvac/adapters.py", line 126, in post
    return self.request("post", url, **kwargs)
  File "/var/lib/juju/agents/unit-kubernetes-control-plane-1/.venv/lib/python3.10/site-packages/hvac/adapters.py", line 330, in request
    utils.raise_for_error(
  File "/var/lib/juju/agents/unit-kubernetes-control-plane-1/.venv/lib/python3.10/site-packages/hvac/utils.py", line 49, in raise_for_error
    raise exceptions.InternalServerError(
hvac.exceptions.InternalServerError: 1 error occurred:
 * internal error

, on post http://172.31.44.158:8200/v1/sys/wrapping/unwrap
```

Crashdumps for this run can be found here:
https://oil-jenkins.canonical.com/artifacts/a84cd3a5-9279-48ac-8455-3d32cfcad411/index.html

George Kraft (cynerva) wrote:

Thanks for the report. This is closely related to https://bugs.launchpad.net/bugs/1988448. See the comments there for an explanation of why Vault raised the InternalServerError.

However, this time it was raised in a different code path in layer-vault-kv, in _get_secret_id[1]. We need to catch the exception there.

[1]: https://github.com/juju-solutions/layer-vault-kv/blob/b0265ff2efed76da594abc07a279d2454f224bfd/lib/charms/layer/vault_kv.py#L259-L264
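
A minimal sketch of what catching the exception in `_get_secret_id` could look like, assuming the fix maps Vault's transient HTTP 500 to the layer's `VaultNotReady` exception (the same treatment the `_client` function gives it). The wrapper name `get_secret_id` is hypothetical, `VaultNotReady` is a local stand-in, and `retrieve_secret_id` is passed in as a callable so the sketch runs without a live Vault server; this is not the layer's actual code.

```python
# Sketch only: translate Vault's transient 500 during secret_id unwrap
# into "not ready yet" instead of letting it fail the hook.
try:
    from hvac.exceptions import InternalServerError
except ImportError:
    # Stand-in so the sketch runs without hvac installed.
    class InternalServerError(Exception):
        pass


class VaultNotReady(Exception):
    """Local stand-in for the VaultNotReady exception in layer-vault-kv."""


def get_secret_id(retrieve_secret_id, vault_url, token):
    """Hypothetical wrapper around charmhelpers' retrieve_secret_id(url, token).

    Returns the unwrapped secret_id, or raises VaultNotReady if Vault
    answers the unwrap request with a 500 Internal Server Error.
    """
    try:
        return retrieve_secret_id(vault_url, token)
    except InternalServerError as exc:
        # e.g. Vault's storage backend is briefly unavailable
        raise VaultNotReady("Vault could not unwrap the secret_id") from exc
```

Chaining with `from exc` keeps the original hvac error (and its `on post .../v1/sys/wrapping/unwrap` context) visible in the logs while callers only need to handle one exception type.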

Changed in charm-kubernetes-master:
milestone: none → 1.25+ck2
importance: Undecided → High
status: New → Triaged
George Kraft (cynerva) wrote:

I think a potential workaround would be to wait until mysql-innodb-cluster has finished its rolling restarts before relating kubernetes-control-plane:vault-kv to vault.
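
One way to script that wait is to poll `juju status --format=json` until every mysql-innodb-cluster unit looks settled. This is a sketch under an assumption: it treats "workload active and agent idle on every unit" as a proxy for "rolling restarts finished", which the comment above does not state. The helper names and the 30-second poll interval are hypothetical.

```python
import json
import subprocess
import time


def read_juju_status():
    """Return `juju status --format=json` parsed into a dict (needs the juju CLI)."""
    out = subprocess.run(
        ["juju", "status", "--format=json"],
        capture_output=True, text=True, check=True,
    ).stdout
    return json.loads(out)


def cluster_settled(status, app="mysql-innodb-cluster"):
    """Heuristic: True when every unit of `app` is active (workload) and idle (agent)."""
    units = status.get("applications", {}).get(app, {}).get("units", {})
    if not units:
        return False
    return all(
        unit.get("workload-status", {}).get("current") == "active"
        and unit.get("juju-status", {}).get("current") == "idle"
        for unit in units.values()
    )


def wait_for_cluster(poll_seconds=30):
    """Block until cluster_settled() holds for the live model."""
    while not cluster_settled(read_juju_status()):
        time.sleep(poll_seconds)
```

After `wait_for_cluster()` returns, the relation can be added, e.g. `juju add-relation kubernetes-control-plane:vault-kv vault:secrets` on Juju 2.9 (`juju integrate` on Juju 3.x); the `secrets` endpoint name on the vault side is an assumption here.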

summary:
- [1.25/beta] kubernetes-control-plane errors with hook failed: "vault-kv-relation-changed"
+ kubernetes-control-plane errors with hook failed: "vault-kv-relation-changed"
Adam Dyess (addyess)
Changed in charm-kubernetes-master:
milestone: 1.25+ck2 → 1.26
George Kraft (cynerva)
Changed in charm-kubernetes-master:
assignee: nobody → George Kraft (cynerva)
status: Triaged → Won't Fix
status: Won't Fix → In Progress
George Kraft (cynerva) wrote:
George Kraft (cynerva)
Changed in charm-kubernetes-master:
status: In Progress → Fix Committed
Moises Emilio Benzan Mora (moisesbenzan) wrote:
George Kraft (cynerva) wrote:

Traceback from the occurrence in comment #4: https://paste.ubuntu.com/p/nX3Jm85qvB/

OK, so once again: same root cause, same presenting symptom, different code path. In this case, the InternalServerError was caught properly in the _client function and then re-raised as VaultNotReady. The problem now is that kubernetes-control-plane didn't catch the VaultNotReady exception.

It looks like both the generate_encryption_key handler[1] and the _write_encryption_config function[2] (or its caller) may need to be updated to catch VaultNotReady.

[1]: https://github.com/charmed-kubernetes/charm-kubernetes-control-plane/blob/7de2ed17e09c1709c49d72572eab299c5a64fcf0/reactive/kubernetes_control_plane.py#L3229
[2]: https://github.com/charmed-kubernetes/charm-kubernetes-control-plane/blob/7de2ed17e09c1709c49d72572eab299c5a64fcf0/reactive/kubernetes_control_plane.py#L3243
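
A sketch of how a caller such as the generate_encryption_key handler could guard a Vault-dependent step like `_write_encryption_config`, assuming the desired behavior is "log and bail out" rather than failing the hook. The helper `call_if_vault_ready` is hypothetical and `VaultNotReady` is a local stand-in for the layer-vault-kv exception; any other exception still propagates, so genuine bugs keep surfacing as hook errors.

```python
import logging

log = logging.getLogger(__name__)


class VaultNotReady(Exception):
    """Local stand-in for the VaultNotReady exception raised by layer-vault-kv."""


def call_if_vault_ready(func, *args, **kwargs):
    """Hypothetical helper: run a Vault-dependent step, swallowing VaultNotReady.

    Returns (True, result) on success and (False, None) when Vault is not
    ready, so the calling handler can return early without crashing the hook.
    """
    try:
        return True, func(*args, **kwargs)
    except VaultNotReady as exc:
        log.warning("Vault not ready; will retry on a later hook: %s", exc)
        return False, None
```

A handler would then do something like `ok, _ = call_if_vault_ready(_write_encryption_config)` and `return` early when `ok` is false.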

Changed in charm-kubernetes-master:
status: Fix Committed → Triaged
milestone: 1.26 → 1.26+ck1
George Kraft (cynerva) wrote:

Removed from the 1.26 milestone and reverted to the Triaged state. We'll try to fix this again in 1.26+ck1.

Adam Dyess (addyess)
Changed in charm-kubernetes-master:
milestone: 1.26+ck1 → 1.26+ck2
Changed in charm-kubernetes-master:
milestone: 1.26+ck2 → 1.26+ck3
Adam Dyess (addyess)
Changed in charm-layer-vault-kv:
status: New → Triaged
importance: Undecided → High
milestone: none → 1.26+ck3
Adam Dyess (addyess) wrote:

Vault connection errors are now caught and logged to the kubernetes-control-plane (KCP) logs; the operation will be retried:
https://github.com/charmed-kubernetes/charm-kubernetes-control-plane/pull/276
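
In a charms.reactive charm, "will be retried" typically falls out of flag handling: a handler guarded by `@when_not(flag)` runs again on every hook until it sets that flag, so a handler that catches the error and returns without setting the flag is retried automatically. The simulation below sketches that behavior without the reactive framework; the flag name `encryption-config.written`, the handler, and the list-based log (standing in for `hookenv.log`) are all hypothetical.

```python
class VaultNotReady(Exception):
    """Local stand-in for layer-vault-kv's VaultNotReady."""


FLAG = "encryption-config.written"  # hypothetical flag name


def encryption_config_handler(flags, write_config, log):
    """Reactive-style handler: acts only while FLAG is unset (think @when_not)."""
    if FLAG in flags:
        return
    try:
        write_config()
    except VaultNotReady as exc:
        # Leave FLAG unset so the handler runs again on the next hook.
        log.append(f"Vault not ready, will retry: {exc}")
        return
    flags.add(FLAG)


def simulate_hooks(n_failures, n_hooks):
    """Run the handler once per hook; the first n_failures writes fail."""
    attempts = {"n": 0}

    def write_config():
        attempts["n"] += 1
        if attempts["n"] <= n_failures:
            raise VaultNotReady("internal error")

    flags, log = set(), []
    for _ in range(n_hooks):
        encryption_config_handler(flags, write_config, log)
    return flags, log, attempts["n"]
```

With `simulate_hooks(2, 5)`, the first two hooks log a warning, the third succeeds and sets the flag, and the remaining hooks skip the handler, so the write is attempted exactly three times.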

Changed in charm-kubernetes-master:
status: Triaged → In Progress
Changed in charm-layer-vault-kv:
status: Triaged → In Progress
assignee: nobody → Adam Dyess (addyess)
Changed in charm-kubernetes-master:
assignee: George Kraft (cynerva) → Adam Dyess (addyess)
Adam Dyess (addyess)
Changed in charm-layer-vault-kv:
status: In Progress → Fix Committed
Adam Dyess (addyess)
Changed in charm-kubernetes-master:
status: In Progress → Fix Committed
tags: added: backport-needed
Adam Dyess (addyess) wrote:
tags: removed: backport-needed
Adam Dyess (addyess)
Changed in charm-kubernetes-master:
status: Fix Committed → Fix Released
Changed in charm-layer-vault-kv:
status: Fix Committed → Fix Released