Etcd errors without juju detecting error state but displaying "Errored with 0 known peers"
Affects | Status | Importance | Assigned to | Milestone | |
---|---|---|---|---|---|
Etcd Charm | New | Undecided | Unassigned | | |
Bug Description
Etcd appears to be a tire fire in this snap/lxd mess.
I already have a bug open about it losing the freezer cgroup randomly at runtime, and now I've lost an entire OpenStack deployment because etcd went insane (again, related to Canonical's proprietary confinement strategy).
OpenStack was fine for weeks, then all of a sudden etcd started showing:
```
$ juju status etcd
Model Controller Cloud/Region Version SLA Timestamp
openstack maas-controller cumulostratus/
App Version Status Scale Charm Store Channel Rev OS Message
etcd 3.4.5 active 3 etcd charmstore stable 594 ubuntu Errored with 0 known peers
Unit Workload Agent Machine Public address Ports Message
etcd/0 active idle 0/lxd/1 10.217.242.17 2379/tcp Errored with 0 known peers
etcd/1* active idle 1/lxd/0 10.217.242.8 2379/tcp Errored with 0 known peers
etcd/2 active idle 2/lxd/0 10.217.242.2 2379/tcp Errored with 0 known peers
```
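The mismatch in that output (workload status `active` next to a message that begins with "Errored") can be checked for mechanically. The following is only a rough sketch, assuming the tabular `juju status` layout shown above; the function name `find_masked_errors` and the unit-name pattern are hypothetical and not part of juju or the etcd charm.

```python
import re

# Assumption: unit rows start with "<app>/<number>", optionally followed by
# "*" for the leader, exactly as in the status output pasted above.
UNIT_NAME = re.compile(r"^[a-z0-9-]+/\d+\*?$")


def find_masked_errors(status_text):
    """Return (unit, message) pairs for units whose workload status is
    'active' even though the status message contains 'Errored'."""
    flagged = []
    for line in status_text.splitlines():
        tokens = line.split()
        # Skip headers, application rows, and anything too short to be a unit row.
        if len(tokens) < 3 or not UNIT_NAME.match(tokens[0]):
            continue
        workload = tokens[1]
        idx = line.find("Errored")
        if workload == "active" and idx != -1:
            flagged.append((tokens[0].rstrip("*"), line[idx:].strip()))
    return flagged


# Unit rows copied from the status output in this report.
SAMPLE = """\
Unit Workload Agent Machine Public address Ports Message
etcd/0 active idle 0/lxd/1 10.217.242.17 2379/tcp Errored with 0 known peers
etcd/1* active idle 1/lxd/0 10.217.242.8 2379/tcp Errored with 0 known peers
etcd/2 active idle 2/lxd/0 10.217.242.2 2379/tcp Errored with 0 known peers
"""

if __name__ == "__main__":
    for unit, message in find_masked_errors(SAMPLE):
        print(f"{unit}: workload reports active, but message is {message!r}")
```

Something like this could run from a cron job against the text of `juju status etcd`, since the workload status alone would not raise an alert in this situation.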
which the logs say is caused by
```
unit-etcd-0: 14:01:47 DEBUG juju.worker.
unit-etcd-0: 14:01:49 DEBUG unit.etcd/
unit-etcd-0: 14:01:50 WARNING unit.etcd/
unit-etcd-0: 14:01:50 WARNING unit.etcd/
unit-etcd-0: 14:01:50 WARNING unit.etcd/
unit-etcd-0: 14:01:50 WARNING unit.etcd/
unit-etcd-0: 14:01:50 WARNING unit.etcd/
unit-etcd-0: 14:01:50 WARNING unit.etcd/
unit-etcd-0: 14:01:51 WARNING unit.etcd/
unit-etcd-0: 14:01:51 WARNING unit.etcd/
unit-etcd-0: 14:01:51 WARNING unit.etcd/
unit-etcd-0: 14:01:51 WARNING unit.etcd/
unit-etcd-0: 14:01:51 WARNING unit.etcd/
unit-etcd-0: 14:01:51 WARNING unit.etcd/
unit-etcd-0: 14:01:53 INFO juju.worker.
```
on all units, and that update-status hook runs `/snap/
This worked fine for weeks and we were about to go to production with this stack, but it failed like this, costing us a massive amount of time to rebuild everything and start from square one.
I'm about to wipe out this stack and switch over to Kolla/Kayobe or something actually FOSS, so I won't be able to get any debug data after I wipe the hosts.
etcd/common/server.crt, key = /var/snap/etcd/common/server.key, trusted-ca = /var/snap/etcd/common/ca.crt, client-cert-auth = true, crl-file =
etcd.service: Main process exited, code=exited, status=1/FAILURE
etcd.service: Failed with result 'exit-code'.
etcd.service: Scheduled restart job, restart counter is at 5432.
etcd/common/etcd.conf.yml
etcd/common/etcd.conf.yml". Other configuration command line flags and environment variables will be ignored if provided.
However, the permissions errors are also showing up in the journal of the relevant snap service in the lxd container unit:
```
Jun 23 21:42:33 juju-4e82f9-0-lxd-1 etcd[326634]: Go Version: go1.13.10
Jun 23 21:42:33 juju-4e82f9-0-lxd-1 etcd.etcd[326634]: [WARNING] Deprecated '--logger=capnslog' flag is set; use '--logger=zap' flag instead
Jun 23 21:42:33 juju-4e82f9-0-lxd-1 etcd[326634]: Go OS/Arch: linux/amd64
Jun 23 21:42:33 juju-4e82f9-0-lxd-1 etcd[326634]: setting maximum number of CPUs to 16, total number of available CPUs is 16
Jun 23 21:42:33 juju-4e82f9-0-lxd-1 etcd[326634]: failed to detect default host (operation not permitted)
Jun 23 21:42:33 juju-4e82f9-0-lxd-1 etcd[326634]: the server is already initialized as member before, starting as etcd member...
Jun 23 21:42:33 juju-4e82f9-0-lxd-1 etcd[326634]: peerTLS: cert = /var/snap/
Jun 23 21:42:33 juju-4e82f9-0-lxd-1 etcd[326634]: listen tcp 0.0.0.0:2380: socket: permission denied
Jun 23 21:42:33 juju-4e82f9-0-lxd-1 systemd[1]: snap.etcd.
Jun 23 21:42:33 juju-4e82f9-0-lxd-1 systemd[1]: snap.etcd.
Jun 23 21:42:44 juju-4e82f9-0-lxd-1 systemd[1]: snap.etcd.
Jun 23 21:42:44 juju-4e82f9-0-lxd-1 systemd[1]: Stopped Service for snap application etcd.etcd.
Jun 23 21:42:44 juju-4e82f9-0-lxd-1 systemd[1]: Started Service for snap application etcd.etcd.
Jun 23 21:42:44 juju-4e82f9-0-lxd-1 etcd.etcd[326745]: Running as system with data in /var/snap/etcd/230
Jun 23 21:42:44 juju-4e82f9-0-lxd-1 etcd.etcd[326745]: Configuration from /var/snap/
Jun 23 21:42:45 juju-4e82f9-0-lxd-1 etcd.etcd[326745]: [WARNING] Deprecated '--logger=capnslog' flag is set; use '--logger=zap' flag instead
Jun 23 21:42:45 juju-4e82f9-0-lxd-1 etcd[326745]: Loading server configuration from "/var/snap/
Jun 23 21:42:45 juju-4e82f9-0-lxd-1 etcd[326745]: etcd Version: 3.4.5
Jun 23 21:42:45 juju-4e82f9-0-lxd-1 etcd.etcd[326745]: [WARNING] Deprecated '--logger=capnslog' flag is set; use '--logger=zap' flag instead
Jun 23 21:42:45 juju-4e82f9-0-lxd-1 etcd[326745]: Git SHA: Not provided (use ./build instead of go build)
Jun 23 21:42:45 juju-4e82f9-0-lxd-1 etcd[326745]: Go Version: go1.13.10
Jun 23 21:42:45 juju-4e82f9-0-lxd-1 etcd[326745]: Go OS/Arch: linux/amd64
Jun 23 21:42:45 juju-4e82f9-0-lxd-1 etcd[326745]: setting maximum number of CPUs to 16, total number of available CPUs is 16
Jun 23 21:42:45 juju-4e82f9-0-lxd-1 etcd[326745]: failed to detect default host (operation not permitted)
Jun 23 21:42:45 juju-4e82f9-0-lxd-1 etcd[326745]: the server is already initialized as member before, starting as etcd member...
Jun 23 21:4...