Etcd errors without juju detecting error state but displaying "Errored with 0 known peers"

Bug #1933355 reported by Boris Lukashev
This bug affects 1 person
| Affects | Status | Importance | Assigned to | Milestone |
| --- | --- | --- | --- | --- |
| Etcd Charm | New | Undecided | Unassigned | |

Bug Description

Etcd appears to be a tire fire in this snap/lxd mess.
I already have a bug open about it randomly losing the freezer cgroup at runtime, and now I've lost an entire OpenStack deployment because etcd failed again (once more relating to Canonical's proprietary confinement strategy).

OpenStack was fine for weeks, then all of a sudden etcd started showing:
```
$ juju status etcd
Model      Controller       Cloud/Region           Version  SLA          Timestamp
openstack  maas-controller  cumulostratus/default  2.9.0    unsupported  14:01:29Z

App   Version  Status  Scale  Charm  Store       Channel  Rev  OS      Message
etcd  3.4.5    active      3  etcd   charmstore  stable   594  ubuntu  Errored with 0 known peers

Unit     Workload  Agent  Machine  Public address  Ports     Message
etcd/0   active    idle   0/lxd/1  10.217.242.17   2379/tcp  Errored with 0 known peers
etcd/1*  active    idle   1/lxd/0  10.217.242.8    2379/tcp  Errored with 0 known peers
etcd/2   active    idle   2/lxd/0  10.217.242.2    2379/tcp  Errored with 0 known peers
```
which the logs say is caused by
```
unit-etcd-0: 14:01:47 DEBUG juju.worker.uniter.runner starting jujuc server  {unix @/var/lib/juju/agents/unit-etcd-0/agent.socket <nil>}
unit-etcd-0: 14:01:49 DEBUG unit.etcd/0.update-status lxc
unit-etcd-0: 14:01:50 WARNING unit.etcd/0.update-status Error:  client: etcd cluster is unavailable or misconfigured; error #0: dial tcp 127.0.0.1:4001: socket: permission denied
unit-etcd-0: 14:01:50 WARNING unit.etcd/0.update-status ; error #1: dial tcp 127.0.0.1:2379: socket: permission denied
unit-etcd-0: 14:01:50 WARNING unit.etcd/0.update-status
unit-etcd-0: 14:01:50 WARNING unit.etcd/0.update-status error #0: dial tcp 127.0.0.1:4001: socket: permission denied
unit-etcd-0: 14:01:50 WARNING unit.etcd/0.update-status error #1: dial tcp 127.0.0.1:2379: socket: permission denied
unit-etcd-0: 14:01:50 WARNING unit.etcd/0.update-status
unit-etcd-0: 14:01:51 WARNING unit.etcd/0.update-status Error:  client: etcd cluster is unavailable or misconfigured; error #0: dial tcp 127.0.0.1:2379: socket: permission denied
unit-etcd-0: 14:01:51 WARNING unit.etcd/0.update-status ; error #1: dial tcp 127.0.0.1:4001: socket: permission denied
unit-etcd-0: 14:01:51 WARNING unit.etcd/0.update-status
unit-etcd-0: 14:01:51 WARNING unit.etcd/0.update-status error #0: dial tcp 127.0.0.1:2379: socket: permission denied
unit-etcd-0: 14:01:51 WARNING unit.etcd/0.update-status error #1: dial tcp 127.0.0.1:4001: socket: permission denied
unit-etcd-0: 14:01:51 WARNING unit.etcd/0.update-status
unit-etcd-0: 14:01:53 INFO juju.worker.uniter.operation ran "update-status" hook (via explicit, bespoke hook script
```
on all units, and that update-status hook runs `/snap/etcd/230/bin/etcd --config-file /var/snap/etcd/common/etcd.conf.yml`, a command which succeeds when run by the operator: etcd creates a cluster and things work in the foreground of the shell. Let it run natively under its snap service and it breaks, then vault breaks, then mysql breaks because it needs vault, and then OpenStack is gone.
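For anyone who wants to catch this state automatically, since Juju itself keeps reporting the units as `active`: below is a minimal sketch, not a tested tool. It assumes that `juju status --format=json` nests units under `applications.<app>.units` and that each unit carries a `workload-status` object with `current` and `message` keys. The function names `find_masked_errors` and `load_juju_status` are hypothetical, and matching on the "Errored" prefix is only based on the message shown above.

```python
from __future__ import annotations

import json
import subprocess


def find_masked_errors(status: dict) -> list[tuple[str, str, str]]:
    """Return (unit, workload_state, message) for every unit whose workload
    state is 'active' while its status message reports an error."""
    suspicious = []
    for app in status.get("applications", {}).values():
        # "units" can be absent (e.g. subordinate applications), so default to {}.
        for unit_name, unit in (app.get("units") or {}).items():
            workload = unit.get("workload-status", {})
            state = workload.get("current", "")
            message = workload.get("message", "")
            # Assumption: the charm prefixes failure messages with "Errored",
            # as in "Errored with 0 known peers".
            if state == "active" and message.lower().startswith("errored"):
                suspicious.append((unit_name, state, message))
    return sorted(suspicious)


def load_juju_status(application: str) -> dict:
    """Fetch the status of one application as parsed JSON.

    Requires a Juju client that is logged in to the controller.
    """
    result = subprocess.run(
        ["juju", "status", application, "--format=json"],
        check=True,
        capture_output=True,
        text=True,
    )
    return json.loads(result.stdout)
```

With a working Juju client, `find_masked_errors(load_juju_status("etcd"))` should list every unit in the state shown above, which could then feed an external alert instead of relying on the workload state alone.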

This worked fine for weeks and we were about to go to production with this stack, but it failed like this, costing us a massive amount of time to rebuild everything and start from square one.

Boris Lukashev (rageltman) wrote:

I'm about to wipe out this stack and switch over to Kolla/Kayobe or something actually FOSS, so I won't be able to get any debug data after I wipe the hosts.
However, the permission errors are also showing up in the journal of the relevant snap service in the LXD container unit:
```
Jun 23 21:42:33 juju-4e82f9-0-lxd-1 etcd[326634]: Go Version: go1.13.10
Jun 23 21:42:33 juju-4e82f9-0-lxd-1 etcd.etcd[326634]: [WARNING] Deprecated '--logger=capnslog' flag is set; use '--logger=zap' flag instead
Jun 23 21:42:33 juju-4e82f9-0-lxd-1 etcd[326634]: Go OS/Arch: linux/amd64
Jun 23 21:42:33 juju-4e82f9-0-lxd-1 etcd[326634]: setting maximum number of CPUs to 16, total number of available CPUs is 16
Jun 23 21:42:33 juju-4e82f9-0-lxd-1 etcd[326634]: failed to detect default host (operation not permitted)
Jun 23 21:42:33 juju-4e82f9-0-lxd-1 etcd[326634]: the server is already initialized as member before, starting as etcd member...
Jun 23 21:42:33 juju-4e82f9-0-lxd-1 etcd[326634]: peerTLS: cert = /var/snap/etcd/common/server.crt, key = /var/snap/etcd/common/server.key, trusted-ca = /var/snap/etcd/common/ca.crt, client-cert-auth = true, crl-file =
Jun 23 21:42:33 juju-4e82f9-0-lxd-1 etcd[326634]: listen tcp 0.0.0.0:2380: socket: permission denied
Jun 23 21:42:33 juju-4e82f9-0-lxd-1 systemd[1]: snap.etcd.etcd.service: Main process exited, code=exited, status=1/FAILURE
Jun 23 21:42:33 juju-4e82f9-0-lxd-1 systemd[1]: snap.etcd.etcd.service: Failed with result 'exit-code'.
Jun 23 21:42:44 juju-4e82f9-0-lxd-1 systemd[1]: snap.etcd.etcd.service: Scheduled restart job, restart counter is at 5432.
Jun 23 21:42:44 juju-4e82f9-0-lxd-1 systemd[1]: Stopped Service for snap application etcd.etcd.
Jun 23 21:42:44 juju-4e82f9-0-lxd-1 systemd[1]: Started Service for snap application etcd.etcd.
Jun 23 21:42:44 juju-4e82f9-0-lxd-1 etcd.etcd[326745]: Running as system with data in /var/snap/etcd/230
Jun 23 21:42:44 juju-4e82f9-0-lxd-1 etcd.etcd[326745]: Configuration from /var/snap/etcd/common/etcd.conf.yml
Jun 23 21:42:45 juju-4e82f9-0-lxd-1 etcd.etcd[326745]: [WARNING] Deprecated '--logger=capnslog' flag is set; use '--logger=zap' flag instead
Jun 23 21:42:45 juju-4e82f9-0-lxd-1 etcd[326745]: Loading server configuration from "/var/snap/etcd/common/etcd.conf.yml". Other configuration command line flags and environment variables will be ignored if provided.
Jun 23 21:42:45 juju-4e82f9-0-lxd-1 etcd[326745]: etcd Version: 3.4.5
Jun 23 21:42:45 juju-4e82f9-0-lxd-1 etcd.etcd[326745]: [WARNING] Deprecated '--logger=capnslog' flag is set; use '--logger=zap' flag instead
Jun 23 21:42:45 juju-4e82f9-0-lxd-1 etcd[326745]: Git SHA: Not provided (use ./build instead of go build)
Jun 23 21:42:45 juju-4e82f9-0-lxd-1 etcd[326745]: Go Version: go1.13.10
Jun 23 21:42:45 juju-4e82f9-0-lxd-1 etcd[326745]: Go OS/Arch: linux/amd64
Jun 23 21:42:45 juju-4e82f9-0-lxd-1 etcd[326745]: setting maximum number of CPUs to 16, total number of available CPUs is 16
Jun 23 21:42:45 juju-4e82f9-0-lxd-1 etcd[326745]: failed to detect default host (operation not permitted)
Jun 23 21:42:45 juju-4e82f9-0-lxd-1 etcd[326745]: the server is already initialized as member before, starting as etcd member...
Jun 23 21:4...
```

Boris Lukashev (rageltman) wrote:

Due to this bug, we have had to abandon the Canonical OpenStack approach. A mission-critical bug that breaks the undercloud completely, and thus the overcloud as well, has gone without triage, and it comes from the proprietary Canonical snap engine no less. That does not make for a stable ecosystem on which OpenStack can be run.
We've now converted our Ceph bundle to eschew all of this as well, along with vault and mysql, because those were destroyed in our OpenStack build. There are still snaps in play, though, so we might go to Ansible Ceph.
Snaps are killing all of the good work Canonical does; it is about time to kill that project, like Unity and Ubuntu for mobiles before it, and get on the bandwagon of FOSS tooling. It's clear as day that snaps are a poor attempt at vendor lock-in, and they're causing customers to actually leave instead. Canonical has a very small window to claim market share now that RL is live to replace CentOS, and snaps are the #1 reason people are not adopting Ubuntu (based on discussions with client engineers; we work across a few verticals).
