Sometimes a new etcd unit fails on the certificates-relation-changed hook. On stable channel. Seems to be only when there is > 1 unit.
juju debug log:
```
unit-etcd-4: 02:02:36 INFO unit.etcd/4.juju-log certificates:75: Invoking reactive handler: reactive/etcd.py:341:send_cluster_details
unit-etcd-4: 02:02:36 ERROR unit.etcd/4.juju-log certificates:75: ['/snap/bin/etcd.etcdctl', 'member', 'list']
unit-etcd-4: 02:02:36 ERROR unit.etcd/4.juju-log certificates:75: {'ETCDCTL_API': '2', 'ETCDCTL_CA_FILE': '/var/snap/etcd/common/ca.crt', 'ETCDCTL_CERT_FILE': '/var/snap/etcd/common/server.crt', 'ETCDCTL_KEY_FILE': '/var/snap/etcd/common/server.key'}
unit-etcd-4: 02:02:36 ERROR unit.etcd/4.juju-log certificates:75: b'Error: client: etcd cluster is unavailable or misconfigured; error #0: dial tcp 127.0.0.1:2379: connect: connection refused\n; error #1: dial tcp 127.0.0.1:4001: connect: connection refused\n\nerror #0: dial tcp 127.0.0.1:2379: connect: connection refused\nerror #1: dial tcp 127.0.0.1:4001: connect: connection refused\n\n'
unit-etcd-4: 02:02:36 ERROR unit.etcd/4.juju-log certificates:75: None
unit-etcd-4: 02:02:36 ERROR unit.etcd/4.juju-log certificates:75: Hook error:
Traceback (most recent call last):
File "/var/lib/juju/agents/unit-etcd-4/charm/lib/etcdctl.py", line 182, in run
return check_output(command, env=env, stderr=STDOUT).decode("utf-8")
File "/usr/lib/python3.8/subprocess.py", line 415, in check_output
return run(*popenargs, stdout=PIPE, timeout=timeout, check=True,
File "/usr/lib/python3.8/subprocess.py", line 516, in run
raise CalledProcessError(retcode, process.args,
subprocess.CalledProcessError: Command '['/snap/bin/etcd.etcdctl', 'member', 'list']' returned non-zero exit status 2.
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/var/lib/juju/agents/unit-etcd-4/.venv/lib/python3.8/site-packages/charms/reactive/__init__.py", line 74, in main
bus.dispatch(restricted=restricted_mode)
File "/var/lib/juju/agents/unit-etcd-4/.venv/lib/python3.8/site-packages/charms/reactive/bus.py", line 390, in dispatch
_invoke(other_handlers)
File "/var/lib/juju/agents/unit-etcd-4/.venv/lib/python3.8/site-packages/charms/reactive/bus.py", line 359, in _invoke
handler.invoke()
File "/var/lib/juju/agents/unit-etcd-4/.venv/lib/python3.8/site-packages/charms/reactive/bus.py", line 181, in invoke
self._action(*args)
File "/var/lib/juju/agents/unit-etcd-4/charm/reactive/etcd.py", line 354, in send_cluster_details
peers = etcdctl.member_list()
File "/var/lib/juju/agents/unit-etcd-4/charm/lib/etcdctl.py", line 75, in member_list
out = self.run(command, endpoints=leader_address, api=2)
File "/var/lib/juju/agents/unit-etcd-4/charm/lib/etcdctl.py", line 188, in run
raise EtcdCtl.CommandFailed() from e
etcdctl.EtcdCtl.CommandFailed
unit-etcd-4: 02:02:36 WARNING unit.etcd/4.certificates-relation-changed Traceback (most recent call last):
unit-etcd-4: 02:02:36 WARNING unit.etcd/4.certificates-relation-changed File "/var/lib/juju/agents/unit-etcd-4/charm/lib/etcdctl.py", line 182, in run
unit-etcd-4: 02:02:36 WARNING unit.etcd/4.certificates-relation-changed return check_output(command, env=env, stderr=STDOUT).decode("utf-8")
unit-etcd-4: 02:02:36 WARNING unit.etcd/4.certificates-relation-changed File "/usr/lib/python3.8/subprocess.py", line 415, in check_output
unit-etcd-4: 02:02:36 WARNING unit.etcd/4.certificates-relation-changed return run(*popenargs, stdout=PIPE, timeout=timeout, check=True,
unit-etcd-4: 02:02:36 WARNING unit.etcd/4.certificates-relation-changed File "/usr/lib/python3.8/subprocess.py", line 516, in run
unit-etcd-4: 02:02:36 WARNING unit.etcd/4.certificates-relation-changed raise CalledProcessError(retcode, process.args,
unit-etcd-4: 02:02:36 WARNING unit.etcd/4.certificates-relation-changed subprocess.CalledProcessError: Command '['/snap/bin/etcd.etcdctl', 'member', 'list']' returned non-zero exit status 2.
unit-etcd-4: 02:02:36 WARNING unit.etcd/4.certificates-relation-changed
unit-etcd-4: 02:02:36 WARNING unit.etcd/4.certificates-relation-changed The above exception was the direct cause of the following exception:
unit-etcd-4: 02:02:36 WARNING unit.etcd/4.certificates-relation-changed
unit-etcd-4: 02:02:36 WARNING unit.etcd/4.certificates-relation-changed Traceback (most recent call last):
unit-etcd-4: 02:02:36 WARNING unit.etcd/4.certificates-relation-changed File "/var/lib/juju/agents/unit-etcd-4/charm/hooks/certificates-relation-changed", line 22, in <module>
unit-etcd-4: 02:02:36 WARNING unit.etcd/4.certificates-relation-changed main()
unit-etcd-4: 02:02:36 WARNING unit.etcd/4.certificates-relation-changed File "/var/lib/juju/agents/unit-etcd-4/.venv/lib/python3.8/site-packages/charms/reactive/__init__.py", line 74, in main
unit-etcd-4: 02:02:36 WARNING unit.etcd/4.certificates-relation-changed bus.dispatch(restricted=restricted_mode)
unit-etcd-4: 02:02:36 WARNING unit.etcd/4.certificates-relation-changed File "/var/lib/juju/agents/unit-etcd-4/.venv/lib/python3.8/site-packages/charms/reactive/bus.py", line 390, in dispatch
unit-etcd-4: 02:02:36 WARNING unit.etcd/4.certificates-relation-changed _invoke(other_handlers)
unit-etcd-4: 02:02:36 WARNING unit.etcd/4.certificates-relation-changed File "/var/lib/juju/agents/unit-etcd-4/.venv/lib/python3.8/site-packages/charms/reactive/bus.py", line 359, in _invoke
unit-etcd-4: 02:02:36 WARNING unit.etcd/4.certificates-relation-changed handler.invoke()
unit-etcd-4: 02:02:36 WARNING unit.etcd/4.certificates-relation-changed File "/var/lib/juju/agents/unit-etcd-4/.venv/lib/python3.8/site-packages/charms/reactive/bus.py", line 181, in invoke
unit-etcd-4: 02:02:36 WARNING unit.etcd/4.certificates-relation-changed self._action(*args)
unit-etcd-4: 02:02:36 WARNING unit.etcd/4.certificates-relation-changed File "/var/lib/juju/agents/unit-etcd-4/charm/reactive/etcd.py", line 354, in send_cluster_details
unit-etcd-4: 02:02:36 WARNING unit.etcd/4.certificates-relation-changed peers = etcdctl.member_list()
unit-etcd-4: 02:02:36 WARNING unit.etcd/4.certificates-relation-changed File "/var/lib/juju/agents/unit-etcd-4/charm/lib/etcdctl.py", line 75, in member_list
unit-etcd-4: 02:02:36 WARNING unit.etcd/4.certificates-relation-changed out = self.run(command, endpoints=leader_address, api=2)
unit-etcd-4: 02:02:36 WARNING unit.etcd/4.certificates-relation-changed File "/var/lib/juju/agents/unit-etcd-4/charm/lib/etcdctl.py", line 188, in run
unit-etcd-4: 02:02:36 WARNING unit.etcd/4.certificates-relation-changed raise EtcdCtl.CommandFailed() from e
unit-etcd-4: 02:02:36 WARNING unit.etcd/4.certificates-relation-changed etcdctl.CommandFailed
```
It appears that the etcd service has failed for some reason (invalid config from the charm?):
```
ubuntu@juju-d2a365-3-lxd-5:~$ snap list
Name Version Rev Tracking Publisher Notes
core 16-2.58 14447 latest/stable canonical✓ core
core18 20221212 2667 latest/stable canonical✓ base
core20 20221212 1778 latest/stable canonical✓ base
etcd 3.4.22 233 3.4/stable canonical✓ -
lxd 4.0.9-a29c6f1 24061 4.0/stable/… canonical✓ -
snapd 2.57.6 17883 latest/stable/… canonical✓ snapd
ubuntu@juju-d2a365-3-lxd-5:~$ sudo snap logs etcd
2023-01-20T02:04:27Z etcd[23259]: data dir = /var/snap/etcd/current
2023-01-20T02:04:27Z etcd[23259]: member dir = /var/snap/etcd/current/member
2023-01-20T02:04:27Z etcd[23259]: heartbeat = 100ms
2023-01-20T02:04:27Z etcd[23259]: election = 1000ms
2023-01-20T02:04:27Z etcd[23259]: snapshot count = 10000
2023-01-20T02:04:27Z etcd[23259]: advertise client URLs = https://192.168.151.113:2379
2023-01-20T02:04:27Z etcd[23259]: check file permission: directory "/var/snap/etcd/current" exist, but the permission is "drwxr-xr-x". The recommended permission is "-rwx------" to prevent possible unprivileged access to the data.
2023-01-20T02:04:27Z etcd[23259]: cannot fetch cluster info from peer urls: could not retrieve cluster information from the given URLs
2023-01-20T02:04:27Z systemd[1]: snap.etcd.etcd.service: Main process exited, code=exited, status=1/FAILURE
2023-01-20T02:04:27Z systemd[1]: snap.etcd.etcd.service: Failed with result 'exit-code'.
ubuntu@juju-d2a365-3-lxd-5:~$ sudo etcd.etcdctl member list
{"level":"warn","ts":"2023-01-20T02:04:50.079Z","caller":"clientv3/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"endpoint://client-262f2f03-fb39-4d99-aefc-297a18f1c635/127.0.0.1:2379","attempt":0,"error":"rpc error: code = DeadlineExceeded desc = latest balancer error: all SubConns are in TransientFailure, latest connection error: connection error: desc = \"transport: Error while dialing dial tcp 127.0.0.1:2379: connect: connection refused\""}
Error: context deadline exceeded
```
Restarting the service with `sudo snap restart etcd` results in the same failure.
We're not seeing this issue in CI runs, so we don't currently have a way to reproduce this. We will need more info before we can investigate further.
If you see this issue again, can you run juju-crashdump and attach the resulting .tar.gz file? You can install and run juju-crashdump with the following commands:
sudo snap install juju-crashdump --classic
juju-crashdump -a debug-layer -a config