hook failed: "certificates-relation-changed" (error in send_cluster_details function)

Bug #2003396 reported by Samuel Allan
6
This bug affects 1 person
Affects Status Importance Assigned to Milestone
Etcd Charm
Incomplete
Undecided
Unassigned

Bug Description

Sometimes a new etcd unit fails on the certificates-relation-changed hook. On stable channel. Seems to be only when there is > 1 unit.

juju debug log:

```
unit-etcd-4: 02:02:36 INFO unit.etcd/4.juju-log certificates:75: Invoking reactive handler: reactive/etcd.py:341:send_cluster_details
unit-etcd-4: 02:02:36 ERROR unit.etcd/4.juju-log certificates:75: ['/snap/bin/etcd.etcdctl', 'member', 'list']
unit-etcd-4: 02:02:36 ERROR unit.etcd/4.juju-log certificates:75: {'ETCDCTL_API': '2', 'ETCDCTL_CA_FILE': '/var/snap/etcd/common/ca.crt', 'ETCDCTL_CERT_FILE': '/var/snap/etcd/common/server.crt', 'ETCDCTL_KEY_FILE': '/var/snap/etcd/common/server.key'}
unit-etcd-4: 02:02:36 ERROR unit.etcd/4.juju-log certificates:75: b'Error: client: etcd cluster is unavailable or misconfigured; error #0: dial tcp 127.0.0.1:2379: connect: connection refused\n; error #1: dial tcp 127.0.0.1:4001: connect: connection refused\n\nerror #0: dial tcp 127.0.0.1:2379: connect: connection refused\nerror #1: dial tcp 127.0.0.1:4001: connect: connection refused\n\n'
unit-etcd-4: 02:02:36 ERROR unit.etcd/4.juju-log certificates:75: None
unit-etcd-4: 02:02:36 ERROR unit.etcd/4.juju-log certificates:75: Hook error:
Traceback (most recent call last):
  File "/var/lib/juju/agents/unit-etcd-4/charm/lib/etcdctl.py", line 182, in run
    return check_output(command, env=env, stderr=STDOUT).decode("utf-8")
  File "/usr/lib/python3.8/subprocess.py", line 415, in check_output
    return run(*popenargs, stdout=PIPE, timeout=timeout, check=True,
  File "/usr/lib/python3.8/subprocess.py", line 516, in run
    raise CalledProcessError(retcode, process.args,
subprocess.CalledProcessError: Command '['/snap/bin/etcd.etcdctl', 'member', 'list']' returned non-zero exit status 2.

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/var/lib/juju/agents/unit-etcd-4/.venv/lib/python3.8/site-packages/charms/reactive/__init__.py", line 74, in main
    bus.dispatch(restricted=restricted_mode)
  File "/var/lib/juju/agents/unit-etcd-4/.venv/lib/python3.8/site-packages/charms/reactive/bus.py", line 390, in dispatch
    _invoke(other_handlers)
  File "/var/lib/juju/agents/unit-etcd-4/.venv/lib/python3.8/site-packages/charms/reactive/bus.py", line 359, in _invoke
    handler.invoke()
  File "/var/lib/juju/agents/unit-etcd-4/.venv/lib/python3.8/site-packages/charms/reactive/bus.py", line 181, in invoke
    self._action(*args)
  File "/var/lib/juju/agents/unit-etcd-4/charm/reactive/etcd.py", line 354, in send_cluster_details
    peers = etcdctl.member_list()
  File "/var/lib/juju/agents/unit-etcd-4/charm/lib/etcdctl.py", line 75, in member_list
    out = self.run(command, endpoints=leader_address, api=2)
  File "/var/lib/juju/agents/unit-etcd-4/charm/lib/etcdctl.py", line 188, in run
    raise EtcdCtl.CommandFailed() from e
etcdctl.EtcdCtl.CommandFailed

unit-etcd-4: 02:02:36 WARNING unit.etcd/4.certificates-relation-changed Traceback (most recent call last):
unit-etcd-4: 02:02:36 WARNING unit.etcd/4.certificates-relation-changed File "/var/lib/juju/agents/unit-etcd-4/charm/lib/etcdctl.py", line 182, in run
unit-etcd-4: 02:02:36 WARNING unit.etcd/4.certificates-relation-changed return check_output(command, env=env, stderr=STDOUT).decode("utf-8")
unit-etcd-4: 02:02:36 WARNING unit.etcd/4.certificates-relation-changed File "/usr/lib/python3.8/subprocess.py", line 415, in check_output
unit-etcd-4: 02:02:36 WARNING unit.etcd/4.certificates-relation-changed return run(*popenargs, stdout=PIPE, timeout=timeout, check=True,
unit-etcd-4: 02:02:36 WARNING unit.etcd/4.certificates-relation-changed File "/usr/lib/python3.8/subprocess.py", line 516, in run
unit-etcd-4: 02:02:36 WARNING unit.etcd/4.certificates-relation-changed raise CalledProcessError(retcode, process.args,
unit-etcd-4: 02:02:36 WARNING unit.etcd/4.certificates-relation-changed subprocess.CalledProcessError: Command '['/snap/bin/etcd.etcdctl', 'member', 'list']' returned non-zero exit status 2.
unit-etcd-4: 02:02:36 WARNING unit.etcd/4.certificates-relation-changed
unit-etcd-4: 02:02:36 WARNING unit.etcd/4.certificates-relation-changed The above exception was the direct cause of the following exception:
unit-etcd-4: 02:02:36 WARNING unit.etcd/4.certificates-relation-changed
unit-etcd-4: 02:02:36 WARNING unit.etcd/4.certificates-relation-changed Traceback (most recent call last):
unit-etcd-4: 02:02:36 WARNING unit.etcd/4.certificates-relation-changed File "/var/lib/juju/agents/unit-etcd-4/charm/hooks/certificates-relation-changed", line 22, in <module>
unit-etcd-4: 02:02:36 WARNING unit.etcd/4.certificates-relation-changed main()
unit-etcd-4: 02:02:36 WARNING unit.etcd/4.certificates-relation-changed File "/var/lib/juju/agents/unit-etcd-4/.venv/lib/python3.8/site-packages/charms/reactive/__init__.py", line 74, in main
unit-etcd-4: 02:02:36 WARNING unit.etcd/4.certificates-relation-changed bus.dispatch(restricted=restricted_mode)
unit-etcd-4: 02:02:36 WARNING unit.etcd/4.certificates-relation-changed File "/var/lib/juju/agents/unit-etcd-4/.venv/lib/python3.8/site-packages/charms/reactive/bus.py", line 390, in dispatch
unit-etcd-4: 02:02:36 WARNING unit.etcd/4.certificates-relation-changed _invoke(other_handlers)
unit-etcd-4: 02:02:36 WARNING unit.etcd/4.certificates-relation-changed File "/var/lib/juju/agents/unit-etcd-4/.venv/lib/python3.8/site-packages/charms/reactive/bus.py", line 359, in _invoke
unit-etcd-4: 02:02:36 WARNING unit.etcd/4.certificates-relation-changed handler.invoke()
unit-etcd-4: 02:02:36 WARNING unit.etcd/4.certificates-relation-changed File "/var/lib/juju/agents/unit-etcd-4/.venv/lib/python3.8/site-packages/charms/reactive/bus.py", line 181, in invoke
unit-etcd-4: 02:02:36 WARNING unit.etcd/4.certificates-relation-changed self._action(*args)
unit-etcd-4: 02:02:36 WARNING unit.etcd/4.certificates-relation-changed File "/var/lib/juju/agents/unit-etcd-4/charm/reactive/etcd.py", line 354, in send_cluster_details
unit-etcd-4: 02:02:36 WARNING unit.etcd/4.certificates-relation-changed peers = etcdctl.member_list()
unit-etcd-4: 02:02:36 WARNING unit.etcd/4.certificates-relation-changed File "/var/lib/juju/agents/unit-etcd-4/charm/lib/etcdctl.py", line 75, in member_list
unit-etcd-4: 02:02:36 WARNING unit.etcd/4.certificates-relation-changed out = self.run(command, endpoints=leader_address, api=2)
unit-etcd-4: 02:02:36 WARNING unit.etcd/4.certificates-relation-changed File "/var/lib/juju/agents/unit-etcd-4/charm/lib/etcdctl.py", line 188, in run
unit-etcd-4: 02:02:36 WARNING unit.etcd/4.certificates-relation-changed raise EtcdCtl.CommandFailed() from e
unit-etcd-4: 02:02:36 WARNING unit.etcd/4.certificates-relation-changed etcdctl.CommandFailed
```

It appears that the etcd service has failed for some reason (invalid config from the charm?):

```
ubuntu@juju-d2a365-3-lxd-5:~$ snap list
Name Version Rev Tracking Publisher Notes
core 16-2.58 14447 latest/stable canonical✓ core
core18 20221212 2667 latest/stable canonical✓ base
core20 20221212 1778 latest/stable canonical✓ base
etcd 3.4.22 233 3.4/stable canonical✓ -
lxd 4.0.9-a29c6f1 24061 4.0/stable/… canonical✓ -
snapd 2.57.6 17883 latest/stable/… canonical✓ snapd
ubuntu@juju-d2a365-3-lxd-5:~$ sudo snap logs etcd
2023-01-20T02:04:27Z etcd[23259]: data dir = /var/snap/etcd/current
2023-01-20T02:04:27Z etcd[23259]: member dir = /var/snap/etcd/current/member
2023-01-20T02:04:27Z etcd[23259]: heartbeat = 100ms
2023-01-20T02:04:27Z etcd[23259]: election = 1000ms
2023-01-20T02:04:27Z etcd[23259]: snapshot count = 10000
2023-01-20T02:04:27Z etcd[23259]: advertise client URLs = https://192.168.151.113:2379
2023-01-20T02:04:27Z etcd[23259]: check file permission: directory "/var/snap/etcd/current" exist, but the permission is "drwxr-xr-x". The recommended permission is "-rwx------" to prevent possible unprivileged access to the data.
2023-01-20T02:04:27Z etcd[23259]: cannot fetch cluster info from peer urls: could not retrieve cluster information from the given URLs
2023-01-20T02:04:27Z systemd[1]: snap.etcd.etcd.service: Main process exited, code=exited, status=1/FAILURE
2023-01-20T02:04:27Z systemd[1]: snap.etcd.etcd.service: Failed with result 'exit-code'.
ubuntu@juju-d2a365-3-lxd-5:~$ sudo etcd.etcdctl member list
{"level":"warn","ts":"2023-01-20T02:04:50.079Z","caller":"clientv3/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"endpoint://client-262f2f03-fb39-4d99-aefc-297a18f1c635/127.0.0.1:2379","attempt":0,"error":"rpc error: code = DeadlineExceeded desc = latest balancer error: all SubConns are in TransientFailure, latest connection error: connection error: desc = \"transport: Error while dialing dial tcp 127.0.0.1:2379: connect: connection refused\""}
Error: context deadline exceeded
```

Restarting the service with `sudo snap restart etcd` results in the same failure.

summary: - hook failed: "certificates-relation-changed"
+ hook failed: "certificates-relation-changed" (error in
+ send_cluster_details function)
Revision history for this message
George Kraft (cynerva) wrote :

We're not seeing this issue in CI runs, so we don't currently have a way to reproduce this. We will need more info before we can investigate further.

If you see this issue again, can you run juju-crashdump and attach the resulting .tar.gz file? You can install and run juju-crashdump with the following commands:

sudo snap install juju-crashdump --classic
juju-crashdump -a debug-layer -a config

Changed in charm-etcd:
status: New → Incomplete
To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.