[1.7/stable focal] Vault fails with hook failed: "certificates-relation-changed"

Bug #2003744 reported by Bas de Bruijne
This bug affects 1 person
| Affects | Status | Importance | Assigned to | Milestone |
|---|---|---|---|---|
| vault-charm | Incomplete | Undecided | Unassigned | |

Bug Description

In test run https://solutions.qa.canonical.com/v2/testruns/3989aa33-ea60-44f6-9742-481a8f290484, which deploys Kubernetes on AWS (focal), vault fails in the certificates-relation-changed hook. The vault logs show:

```
Traceback (most recent call last):
  File "/var/lib/juju/agents/unit-vault-0/.venv/lib/python3.8/site-packages/charms/reactive/__init__.py", line 74, in main
    bus.dispatch(restricted=restricted_mode)
  File "/var/lib/juju/agents/unit-vault-0/.venv/lib/python3.8/site-packages/charms/reactive/bus.py", line 390, in dispatch
    _invoke(other_handlers)
  File "/var/lib/juju/agents/unit-vault-0/.venv/lib/python3.8/site-packages/charms/reactive/bus.py", line 359, in _invoke
    handler.invoke()
  File "/var/lib/juju/agents/unit-vault-0/.venv/lib/python3.8/site-packages/charms/reactive/bus.py", line 181, in invoke
    self._action(*args)
  File "/var/lib/juju/agents/unit-vault-0/charm/reactive/vault_handlers.py", line 1100, in tune_pki_backend_config_changed
    vault_pki.update_roles(max_ttl=max_ttl)
  File "/var/lib/juju/agents/unit-vault-0/charm/lib/charm/vault_pki.py", line 339, in update_roles
    write_roles(client, **local)
  File "/var/lib/juju/agents/unit-vault-0/charm/lib/charm/vault_pki.py", line 314, in write_roles
    client.write(
  File "/var/lib/juju/agents/unit-vault-0/.venv/lib/python3.8/site-packages/hvac/v1/__init__.py", line 189, in write
    response = self._adapter.post('/v1/{0}'.format(path), json=kwargs, wrap_ttl=wrap_ttl)
  File "/var/lib/juju/agents/unit-vault-0/.venv/lib/python3.8/site-packages/hvac/adapters.py", line 103, in post
    return self.request('post', url, **kwargs)
  File "/var/lib/juju/agents/unit-vault-0/.venv/lib/python3.8/site-packages/hvac/adapters.py", line 233, in request
    utils.raise_for_error(response.status_code, text, errors=errors)
  File "/var/lib/juju/agents/unit-vault-0/.venv/lib/python3.8/site-packages/hvac/utils.py", line 39, in raise_for_error
    raise exceptions.InternalServerError(message, errors=errors)
hvac.exceptions.InternalServerError: 1 error occurred:
 * invalid connection
```

I can't track down why this is happening. The crashdump doesn't show any additional information that I can find at first glance.

Crashdumps and configs can be found here:
https://oil-jenkins.canonical.com/artifacts/3989aa33-ea60-44f6-9742-481a8f290484/index.html
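The traceback shows that the hook dies because `client.write()` inside `write_roles()` raises `hvac.exceptions.InternalServerError`, i.e. Vault answered with an HTTP 500 (`invalid connection`). If the root cause turns out to be a transient backend outage, one possible mitigation is to retry the write a few times before failing the hook. The sketch below is a generic helper, not code from the vault charm; `retry_on` and its parameters are hypothetical. In the charm it would presumably be called as `retry_on(lambda: client.write(path, **params), (hvac.exceptions.InternalServerError,))`.

```python
import time
from typing import Callable, Tuple, Type, TypeVar

T = TypeVar("T")


def retry_on(
    func: Callable[[], T],
    exceptions: Tuple[Type[BaseException], ...],
    attempts: int = 3,
    delay: float = 2.0,
    backoff: float = 2.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call func(), retrying when it raises one of `exceptions`.

    Hypothetical helper: waits `delay` seconds after the first failure and
    multiplies the wait by `backoff` after each further failure. Once
    `attempts` calls have failed, the last exception is re-raised so the
    hook still fails loudly.
    """
    if attempts < 1:
        raise ValueError("attempts must be >= 1")
    wait = delay
    for attempt in range(1, attempts + 1):
        try:
            return func()
        except exceptions:
            if attempt == attempts:
                raise  # out of retries: surface the original error
            sleep(wait)
            wait *= backoff
    raise AssertionError("unreachable")
```

Injecting `sleep` keeps the helper testable without real delays. A retry only papers over short outages; it does not explain why the backend connection was invalid in the first place.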

Jeffrey Chang (modern911) wrote:

Seeing another failure in https://solutions.qa.canonical.com/v2/testruns/fb1016df-2473-40da-b897-e40a382ec87d.

The vault logs show:

```
unit-vault-1: 06:39:35 INFO unit.vault/1.juju-log certificates:64: Get installed key for snap vault
unit-vault-1: 06:39:40 WARNING unit.vault/1.certificates-relation-changed Traceback (most recent call last):
unit-vault-1: 06:39:40 WARNING unit.vault/1.certificates-relation-changed File "/var/lib/juju/agents/unit-vault-1/charm/hooks/certificates-relation-changed", line 22, in <module>
unit-vault-1: 06:39:40 WARNING unit.vault/1.certificates-relation-changed main()
unit-vault-1: 06:39:40 WARNING unit.vault/1.certificates-relation-changed File "/var/lib/juju/agents/unit-vault-1/.venv/lib/python3.8/site-packages/charms/reactive/__init__.py", line 84, in main
unit-vault-1: 06:39:40 WARNING unit.vault/1.certificates-relation-changed hookenv._run_atexit()
unit-vault-1: 06:39:40 WARNING unit.vault/1.certificates-relation-changed File "/var/lib/juju/agents/unit-vault-1/.venv/lib/python3.8/site-packages/charmhelpers/core/hookenv.py", line 1357, in _run_atexit
unit-vault-1: 06:39:40 WARNING unit.vault/1.certificates-relation-changed callback(*args, **kwargs)
unit-vault-1: 06:39:40 WARNING unit.vault/1.certificates-relation-changed File "/var/lib/juju/agents/unit-vault-1/charm/reactive/vault_handlers.py", line 867, in _assess_status
unit-vault-1: 06:39:40 WARNING unit.vault/1.certificates-relation-changed if not client.ha_status['ha_enabled']:
unit-vault-1: 06:39:40 WARNING unit.vault/1.certificates-relation-changed File "/var/lib/juju/agents/unit-vault-1/.venv/lib/python3.8/site-packages/hvac/v1/__init__.py", line 491, in ha_status
unit-vault-1: 06:39:40 WARNING unit.vault/1.certificates-relation-changed return self._adapter.get('/v1/sys/leader').json()
unit-vault-1: 06:39:40 WARNING unit.vault/1.certificates-relation-changed File "/var/lib/juju/agents/unit-vault-1/.venv/lib/python3.8/site-packages/hvac/adapters.py", line 90, in get
unit-vault-1: 06:39:40 WARNING unit.vault/1.certificates-relation-changed return self.request('get', url, **kwargs)
unit-vault-1: 06:39:40 WARNING unit.vault/1.certificates-relation-changed File "/var/lib/juju/agents/unit-vault-1/.venv/lib/python3.8/site-packages/hvac/adapters.py", line 233, in request
unit-vault-1: 06:39:40 WARNING unit.vault/1.certificates-relation-changed utils.raise_for_error(response.status_code, text, errors=errors)
unit-vault-1: 06:39:40 WARNING unit.vault/1.certificates-relation-changed File "/var/lib/juju/agents/unit-vault-1/.venv/lib/python3.8/site-packages/hvac/utils.py", line 39, in raise_for_error
unit-vault-1: 06:39:40 WARNING unit.vault/1.certificates-relation-changed raise exceptions.InternalServerError(message, errors=errors)
unit-vault-1: 06:39:40 WARNING unit.vault/1.certificates-relation-changed hvac.exceptions.InternalServerError: context deadline exceeded
unit-vault-1: 06:39:41 ERROR juju.worker.uniter.operation hook "certificates-relation-changed" (via explicit, bespoke hook script) failed: exit status 1
```
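In this second trace the failure is not in the PKI code: `_assess_status` runs as an atexit callback and reads `client.ha_status['ha_enabled']`, which queries `/v1/sys/leader`, so a Vault timeout during a mere status check fails the whole hook. A status check could instead degrade gracefully, as sketched below. `ha_enabled_or_none`, `describe_status` and the status messages are hypothetical; in the charm, `get_ha_status` would presumably be `lambda: client.ha_status` and `errors` would be `(hvac.exceptions.InternalServerError,)`.

```python
from typing import Any, Callable, Mapping, Optional, Tuple, Type


def ha_enabled_or_none(
    get_ha_status: Callable[[], Mapping[str, Any]],
    errors: Tuple[Type[BaseException], ...],
) -> Optional[bool]:
    """Return Vault's 'ha_enabled' flag, or None if the query failed.

    Hypothetical helper: only the exception types in `errors` are swallowed;
    anything else (e.g. a programming error) still propagates.
    """
    try:
        status = get_ha_status()
    except errors:
        return None
    return bool(status.get("ha_enabled", False))


def describe_status(ha_enabled: Optional[bool]) -> Tuple[str, str]:
    """Map the HA query result to a (workload status, message) pair.

    The messages are illustrative placeholders, not the vault charm's own.
    """
    if ha_enabled is None:
        return ("waiting", "Vault API not responding; storage backend may be unavailable")
    if ha_enabled:
        return ("active", "Unit is ready (HA enabled)")
    return ("active", "Unit is ready (HA disabled)")
```

Reporting `waiting` lets the next hook re-run the assessment once the backend recovers, instead of leaving the unit in an error state that needs a manual `juju resolved`.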

Alex Kavanagh (ajkavanagh) wrote:

So the error is raised by hvac (it wraps the HTTP 500 returned by Vault), and this HashiCorp support article gives an interesting perspective on it: https://support.hashicorp.com/hc/en-us/articles/4404634420755-Why-am-I-seeing-context-deadline-exceeded-errors

An interesting section:

> The error 'context deadline exceeded' means that we ran into a situation where a given action was not completed in an expected timeframe. For Vault this is typically going to be related to a network connection made to an external system such as a database or even a storage backend such as Consul.

--

I'm going to hazard a guess and suggest that we look at the health/status of the MySQL database at the time the error occurred; e.g. the /var/log/mysql/error.log files should indicate the state of the cluster when the error above was raised.
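Pulling the relevant part out of `/var/log/mysql/error.log` mostly means filtering by time. The sketch below assumes the MySQL 8.0 default error-log line shape with UTC timestamps (`log_timestamps=UTC`), e.g. `2023-10-26T17:00:05.123456Z 12 [ERROR] [MY-011526] [Repl] ...`; `parse_ts` and `lines_near` are hypothetical helpers. Note that the vault unit log above only carries a time of day, so confirm the date and time zone of the hook failure before comparing.

```python
import re
from datetime import datetime, timedelta, timezone
from typing import Iterable, Iterator, Optional

# Assumed MySQL 8.0 error-log prefix: ISO 8601 UTC timestamp ending in "Z".
_TS_RE = re.compile(r"^(\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}(?:\.\d{1,6})?)Z\s")


def parse_ts(line: str) -> Optional[datetime]:
    """Return the leading UTC timestamp of a log line, or None if absent."""
    m = _TS_RE.match(line)
    if not m:
        return None
    raw = m.group(1)
    fmt = "%Y-%m-%dT%H:%M:%S.%f" if "." in raw else "%Y-%m-%dT%H:%M:%S"
    return datetime.strptime(raw, fmt).replace(tzinfo=timezone.utc)


def lines_near(lines: Iterable[str], when: datetime, window: timedelta) -> Iterator[str]:
    """Yield lines whose timestamp is within `window` of `when` (timezone-aware).

    Continuation lines without a timestamp inherit the decision made for the
    preceding timestamped line, so multi-line entries stay together.
    """
    keep = False
    for line in lines:
        ts = parse_ts(line)
        if ts is not None:
            keep = abs(ts - when) <= window
        if keep:
            yield line.rstrip("\n")
```

Usage: open the log file and pass the file object, the (UTC) time of the vault hook failure and a window of a few minutes, then print whatever `lines_near` yields from each `mysql-innodb-cluster` unit.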

Changed in vault-charm:
status: New → Incomplete
Amjad Chami (amjad-chami) wrote:

For this run: https://solutions.qa.canonical.com/testruns/f6381b6b-4b4c-4631-9516-d5cad6b55470

In the logs, mysql-innodb-cluster/leader goes down for maintenance during:
```
juju-unit executing running certificates-relation-changed hook for mysql-innodb-cluster/1
```

Crashdump and configs: https://oil-jenkins.canonical.com/artifacts/f6381b6b-4b4c-4631-9516-d5cad6b55470/index.html
The relevant status log is in this file: `juju-crashdump-kubernetes-aws-2023-10-26-17.17.31/5a4a5540-08a8-450e-828f-fafe6f2de7a8/mysql-innodb-cluster_1/juju-show-status-log/mysql-innodb-cluster_1`

This validates Alex's hypothesis.
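The check above amounts to asking whether the vault hook failure falls inside a window during which a mysql-innodb-cluster unit was not `active`. That correlation can be sketched as follows, assuming the status transitions have already been extracted from the `juju-show-status-log` file into time-ordered `(timestamp, status)` pairs; the parsing step is left out because the exact log layout is not reproduced here, and `StatusChange`, `unhealthy_windows` and `failure_overlaps` are hypothetical names.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List, Optional, Sequence, Tuple

Window = Tuple[datetime, Optional[datetime]]  # (start, stop); stop None = still open


@dataclass(frozen=True)
class StatusChange:
    at: datetime
    status: str  # workload status, e.g. "active" or "maintenance"


def unhealthy_windows(changes: Sequence[StatusChange], healthy: str = "active") -> List[Window]:
    """Return [start, stop) intervals during which the status was not `healthy`.

    `changes` must be sorted by time. A window still open after the last
    change is returned with stop=None.
    """
    windows: List[Window] = []
    start: Optional[datetime] = None
    for ch in changes:
        if ch.status != healthy and start is None:
            start = ch.at  # unit left the healthy state
        elif ch.status == healthy and start is not None:
            windows.append((start, ch.at))  # unit recovered
            start = None
    if start is not None:
        windows.append((start, None))
    return windows


def failure_overlaps(failure_at: datetime, windows: Sequence[Window],
                     slack: timedelta = timedelta(0)) -> bool:
    """True if `failure_at` lies in any window, widened by `slack` on both sides."""
    for start, stop in windows:
        if start - slack <= failure_at and (stop is None or failure_at < stop + slack):
            return True
    return False
```

A small `slack` is useful because status-log entries and hook failures are recorded by different agents, so their timestamps may not line up exactly.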
