[1.25/beta] kubernetes-control-plane errors with hook failed: "certificates-relation-changed"

Bug #1988448 reported by Bas de Bruijne
This bug affects 1 person

Affects: Kubernetes Control Plane Charm
Status: Fix Released
Importance: High
Assigned to: George Kraft
Milestone: 1.25+ck1

Bug Description

In testrun https://solutions.qa.canonical.com/testruns/testRun/5f543256-9548-4b86-9f9a-ab03d1018e16, k8s cp fails with:

```
kubernetes-control-plane/1* error idle 13 18.232.60.34 hook failed: "certificates-relation-changed"
  calico/4 waiting idle 18.232.60.34 Waiting to retry Calico node configuration
  containerd/4 active idle 18.232.60.34 Container runtime available
  filebeat/15 active idle 18.232.60.34 Filebeat ready.
  ntp/15 active idle 18.232.60.34 123/udp chrony: Ready
  telegraf/15 active idle 18.232.60.34 9103/tcp Monitoring kubernetes-control-plane/1 (source version/commit 0380c15)
```

In the log we see:

```
Traceback (most recent call last):
  File "/var/lib/juju/agents/unit-kubernetes-control-plane-1/.venv/lib/python3.10/site-packages/charms/reactive/__init__.py", line 73, in main
    hookenv._run_atstart()
  File "/var/lib/juju/agents/unit-kubernetes-control-plane-1/.venv/lib/python3.10/site-packages/charmhelpers/core/hookenv.py", line 1348, in _run_atstart
    callback(*args, **kwargs)
  File "/var/lib/juju/agents/unit-kubernetes-control-plane-1/charm/reactive/vault_kv.py", line 46, in manage_app_kv_flags
    app_kv = vault_kv.VaultAppKV()
  File "/var/lib/juju/agents/unit-kubernetes-control-plane-1/charm/lib/charms/layer/vault_kv.py", line 33, in __call__
    cls._singleton_instance = super().__call__(*args, **kwargs)
  File "/var/lib/juju/agents/unit-kubernetes-control-plane-1/charm/lib/charms/layer/vault_kv.py", line 131, in __init__
    super().__init__()
  File "/var/lib/juju/agents/unit-kubernetes-control-plane-1/charm/lib/charms/layer/vault_kv.py", line 41, in __init__
    response = self._client.read(self._path)
  File "/var/lib/juju/agents/unit-kubernetes-control-plane-1/charm/lib/charms/layer/vault_kv.py", line 60, in _client
    client.auth_approle(self._config["role_id"], self._config["secret_id"])
  File "/var/lib/juju/agents/unit-kubernetes-control-plane-1/.venv/lib/python3.10/site-packages/hvac/utils.py", line 201, in new_func
    return method(*args, **kwargs)
  File "/var/lib/juju/agents/unit-kubernetes-control-plane-1/.venv/lib/python3.10/site-packages/hvac/v1/__init__.py", line 1805, in auth_approle
    return self.login(
  File "/var/lib/juju/agents/unit-kubernetes-control-plane-1/.venv/lib/python3.10/site-packages/hvac/v1/__init__.py", line 1495, in login
    return self._adapter.login(url=url, use_token=use_token, **kwargs)
  File "/var/lib/juju/agents/unit-kubernetes-control-plane-1/.venv/lib/python3.10/site-packages/hvac/adapters.py", line 197, in login
    response = self.post(url, **kwargs)
  File "/var/lib/juju/agents/unit-kubernetes-control-plane-1/.venv/lib/python3.10/site-packages/hvac/adapters.py", line 126, in post
    return self.request("post", url, **kwargs)
  File "/var/lib/juju/agents/unit-kubernetes-control-plane-1/.venv/lib/python3.10/site-packages/hvac/adapters.py", line 364, in request
    response = super(JSONAdapter, self).request(*args, **kwargs)
  File "/var/lib/juju/agents/unit-kubernetes-control-plane-1/.venv/lib/python3.10/site-packages/hvac/adapters.py", line 330, in request
    utils.raise_for_error(
  File "/var/lib/juju/agents/unit-kubernetes-control-plane-1/.venv/lib/python3.10/site-packages/hvac/utils.py", line 49, in raise_for_error
    raise exceptions.InternalServerError(
hvac.exceptions.InternalServerError: internal error, on post http://172.31.46.20:8200/v1/auth/approle/login
```

We have seen other testruns of the same SKU pass this point.

Crashdumps can be found here:
https://oil-jenkins.canonical.com/artifacts/5f543256-9548-4b86-9f9a-ab03d1018e16/index.html

tags: added: cdo-qa foundations-engine
Revision history for this message
Alexander Balderson (asbalderson) wrote :

We get to this state by:
1) deploying CDK 1.25/beta with vault, relating vault as both the certificate manager and the vault-kv store backend
2) once vault settles, unsealing it following the steps in the charm documentation
3) running the "reissue-certificates" action on vault - https://charmhub.io/vault/actions?channel=1.7/stable#reissue-certificates

Once vault has been unsealed and certificates are reissued, kubernetes-control-plane sometimes reaches this state; based on our small sample size, I would guess it happens about 30% of the time.
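For reference, the reproduction steps above look roughly like this (a sketch only; the application names, channels, and relation endpoint names are assumptions based on the description, not the exact SQA bundle):

```shell
# 1) Deploy CDK 1.25/beta with vault; relate vault as both the
#    certificate manager and the vault-kv backend.
juju deploy charmed-kubernetes --channel 1.25/beta
juju deploy vault --channel 1.7/stable
juju relate vault:certificates kubernetes-control-plane:certificates
juju relate vault:secrets kubernetes-control-plane:vault-kv

# 2) Once vault settles, unseal it following the charm documentation
#    (vault operator init / unseal, then authorize the charm).

# 3) Reissue certificates from vault.
juju run-action vault/leader reissue-certificates --wait
```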

Revision history for this message
George Kraft (cynerva) wrote :

It looks like vault had an internal server error due to a dropped mysql connection:

Sep 1 07:46:51 ip-172-31-46-20 vault[50291]: [mysql] 2022/09/01 07:46:51 packets.go:36: unexpected EOF
Sep 1 07:46:51 ip-172-31-46-20 vault[50291]: 2022-09-01T07:46:51.279Z [ERROR] core: failed to create token: error="failed to persist accessor index entry: invalid connection"
Sep 1 07:46:52 ip-172-31-46-20 vault[50291]: [mysql] 2022/09/01 07:46:52 packets.go:122: closing bad idle connection: EOF

and I think the mysql connection dropped because the primary was in the process of shutting down at that time:

2022-09-01T07:46:07.179532Z 144 [System] [MY-011510] [Repl] Plugin group_replication reported: 'This server is working as primary member.'
2022-09-01T07:46:50.168380Z 0 [System] [MY-013172] [Server] Received SHUTDOWN from user <via user signal>. Shutting down mysqld (Version: 8.0.30-0ubuntu0.22.04.1).
2022-09-01T07:46:53.271039Z 0 [System] [MY-011504] [Repl] Plugin group_replication reported: 'Group membership changed: This member has left the group.'
2022-09-01T07:46:55.272634Z 0 [Warning] [MY-010909] [Server] /usr/sbin/mysqld: Forcing close of thread 219 user: 'vault'.

and it looks like the primary was shutting down because the mysql-innodb-cluster charm was in the middle of a rolling restart across its units, to add certificates to their configuration.

This looks like a transient issue that kubernetes-control-plane got hung up on. We probably just need to make kubernetes-control-plane a little more resilient against this InternalServerError exception. We can catch it and retry.
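The catch-and-retry approach could be sketched as below. This is a minimal illustration, not the actual charm code: `read_with_retry`, `flaky_read`, and the stand-in `InternalServerError` class are hypothetical (the real exception is `hvac.exceptions.InternalServerError`, raised when Vault returns an HTTP 500 such as the dropped-mysql-connection error above).

```python
import time

# Stand-in for hvac.exceptions.InternalServerError (Vault returned HTTP 500).
class InternalServerError(Exception):
    pass

def read_with_retry(read_fn, path, attempts=3, delay=1.0):
    """Call read_fn(path), retrying on transient Vault server errors.

    Retries with a fixed delay between attempts; re-raises the
    exception if every attempt fails.
    """
    for attempt in range(1, attempts + 1):
        try:
            return read_fn(path)
        except InternalServerError:
            if attempt == attempts:
                raise
            time.sleep(delay)

# Example: a flaky backend that fails twice, then succeeds.
calls = {"n": 0}
def flaky_read(path):
    calls["n"] += 1
    if calls["n"] < 3:
        raise InternalServerError("internal error")
    return {"data": {"path": path}}

result = read_with_retry(flaky_read, "secret/app", attempts=5, delay=0)
print(result["data"]["path"])  # secret/app
print(calls["n"])              # 3
```

A bounded retry like this rides out a brief backend outage (e.g. a rolling mysql restart) without masking a genuinely broken Vault, since the exception still propagates once the attempts are exhausted.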

George Kraft (cynerva)
Changed in charm-kubernetes-master:
importance: Undecided → High
status: New → Triaged
assignee: nobody → George Kraft (cynerva)
status: Triaged → In Progress
milestone: none → 1.25+ck1
Revision history for this message
Kevin W Monroe (kwmonroe) wrote :

This should be included in the latest edge version of the k-c-p charm (rev 187):

juju deploy kubernetes-control-plane --channel edge

SQA, I'm giving it a spin on vsphere now; could you try this in your env as well?

Changed in charm-kubernetes-master:
status: In Progress → Fix Committed
tags: added: backport-needed
Adam Dyess (addyess)
tags: removed: backport-needed
Adam Dyess (addyess)
Changed in charm-kubernetes-master:
status: Fix Committed → Fix Released