etcd shows update-status hook errors after host reboot

Bug #1934108 reported by Przemyslaw Hausman
This bug affects 2 people
Affects    | Status    | Importance | Assigned to | Milestone
Etcd Charm | Confirmed | Undecided  | Unassigned  |
Etcd Snaps | New       | Undecided  | Unassigned  |

Bug Description

I rebooted the host machine, and now an etcd unit is stuck in an error state:

juju debug-log:

unit-etcd-1: 09:03:00 INFO juju.worker.uniter awaiting error resolution for "start" hook
unit-etcd-1: 09:03:01 WARNING unit.etcd/1.start cannot perform operation: mount --rbind /dev /tmp/snap.rootfs_hHlR10//dev: No such file or directory
unit-etcd-1: 09:03:01 WARNING unit.etcd/1.start cannot perform operation: mount --rbind /dev /tmp/snap.rootfs_sFhD0d//dev: No such file or directory
unit-etcd-1: 09:03:02 WARNING unit.etcd/1.start cannot perform operation: mount --rbind /dev /tmp/snap.rootfs_1oHvMf//dev: No such file or directory
unit-etcd-1: 09:03:02 WARNING unit.etcd/1.start cannot perform operation: mount --rbind /dev /tmp/snap.rootfs_Rzv1kQ//dev: No such file or directory
unit-etcd-1: 09:03:02 WARNING unit.etcd/1.start Traceback (most recent call last):
unit-etcd-1: 09:03:02 WARNING unit.etcd/1.start File "/var/lib/juju/agents/unit-etcd-1/charm/hooks/start", line 22, in <module>
unit-etcd-1: 09:03:02 WARNING unit.etcd/1.start main()
unit-etcd-1: 09:03:02 WARNING unit.etcd/1.start File "/var/lib/juju/agents/unit-etcd-1/.venv/lib/python3.8/site-packages/charms/reactive/__init__.py", line 74, in main
unit-etcd-1: 09:03:02 WARNING unit.etcd/1.start bus.dispatch(restricted=restricted_mode)
unit-etcd-1: 09:03:02 WARNING unit.etcd/1.start File "/var/lib/juju/agents/unit-etcd-1/.venv/lib/python3.8/site-packages/charms/reactive/bus.py", line 390, in dispatch
unit-etcd-1: 09:03:02 WARNING unit.etcd/1.start _invoke(other_handlers)
unit-etcd-1: 09:03:02 WARNING unit.etcd/1.start File "/var/lib/juju/agents/unit-etcd-1/.venv/lib/python3.8/site-packages/charms/reactive/bus.py", line 359, in _invoke
unit-etcd-1: 09:03:02 WARNING unit.etcd/1.start handler.invoke()
unit-etcd-1: 09:03:02 WARNING unit.etcd/1.start File "/var/lib/juju/agents/unit-etcd-1/.venv/lib/python3.8/site-packages/charms/reactive/bus.py", line 181, in invoke
unit-etcd-1: 09:03:02 WARNING unit.etcd/1.start self._action(*args)
unit-etcd-1: 09:03:02 WARNING unit.etcd/1.start File "/var/lib/juju/agents/unit-etcd-1/charm/reactive/etcd.py", line 279, in send_cluster_connection_details
unit-etcd-1: 09:03:02 WARNING unit.etcd/1.start db.set_connection_string(connection_string, version=etcdctl.version())
unit-etcd-1: 09:03:02 WARNING unit.etcd/1.start File "lib/etcdctl.py", line 193, in version
unit-etcd-1: 09:03:02 WARNING unit.etcd/1.start out = check_output(
unit-etcd-1: 09:03:02 WARNING unit.etcd/1.start File "/usr/lib/python3.8/subprocess.py", line 411, in check_output
unit-etcd-1: 09:03:02 WARNING unit.etcd/1.start return run(*popenargs, stdout=PIPE, timeout=timeout, check=True,
unit-etcd-1: 09:03:02 WARNING unit.etcd/1.start File "/usr/lib/python3.8/subprocess.py", line 512, in run
unit-etcd-1: 09:03:02 WARNING unit.etcd/1.start raise CalledProcessError(retcode, process.args,
unit-etcd-1: 09:03:02 WARNING unit.etcd/1.start subprocess.CalledProcessError: Command '['/snap/bin/etcd.etcdctl', 'version']' returned non-zero exit status 1.
unit-etcd-1: 09:03:02 ERROR juju.worker.uniter.operation hook "start" (via explicit, bespoke hook script) failed: exit status 1
unit-etcd-1: 09:03:02 INFO juju.worker.uniter awaiting error resolution for "start" hook
unit-etcd-1: 09:03:29 INFO juju.worker.uniter awaiting error resolution for "start" hook

I have 3 etcd units deployed in total. Only one unit is in error state. Etcd units are deployed in lxd containers.

etcd charm revision: 594

Drew Freiberger (afreiberger) wrote :

For further info, I'm seeing this on a couple of etcd units as well.

Running the command manually produces this error:

root@juju-f98bb9-2-lxd-1:/sys/fs/cgroup/freezer# etcd.etcdctl version
cannot open cgroup hierarchy /sys/fs/cgroup/freezer: No such file or directory

Oddly, the cgroup exists and should be readable, but it may not be accessible because of snap confinement. My guess is that cgroups got a new plug in upstream snapd, which would explain why the effect only shows up after a restart. The problem seems to be the charm's attempt to run the etcdctl version command; etcd itself is running and functioning.

root@juju-f98bb9-2-lxd-1:/sys/fs/cgroup/freezer# find -ls
       32 0 drwxrwxr-x 4 nobody root 0 Jul 20 20:25 .
       33 0 -rw-rw-r-- 1 nobody root 0 Oct 1 21:03 ./cgroup.procs
       38 0 -r--r--r-- 1 nobody nogroup 0 Jan 7 23:31 ./freezer.self_freezing
      120 0 drwxr-xr-x 2 root root 0 Jul 20 20:25 ./snap.etcd
      121 0 -rw-r--r-- 1 root root 0 Jul 20 20:25 ./snap.etcd/cgroup.procs
      126 0 -r--r--r-- 1 root root 0 Jul 20 20:25 ./snap.etcd/freezer.self_freezing
      123 0 -rw-r--r-- 1 root root 0 Jul 20 20:25 ./snap.etcd/tasks
      127 0 -r--r--r-- 1 root root 0 Jul 20 20:25 ./snap.etcd/freezer.parent_freezing
      125 0 -rw-r--r-- 1 root root 0 Dec 20 00:00 ./snap.etcd/freezer.state
      124 0 -rw-r--r-- 1 root root 0 Jul 20 20:25 ./snap.etcd/notify_on_release
      122 0 -rw-r--r-- 1 root root 0 Jul 20 20:25 ./snap.etcd/cgroup.clone_children
       35 0 -rw-rw-r-- 1 nobody root 0 Jul 20 20:22 ./tasks
       39 0 -r--r--r-- 1 nobody nogroup 0 Jan 7 23:31 ./freezer.parent_freezing
       37 0 -rw-r--r-- 1 nobody nogroup 0 Jan 7 23:31 ./freezer.state
       88 0 drwxr-xr-x 2 root root 0 Jul 20 20:23 ./snap.lxd
       89 0 -rw-r--r-- 1 root root 0 Jul 20 20:23 ./snap.lxd/cgroup.procs
       94 0 -r--r--r-- 1 root root 0 Jul 20 20:23 ./snap.lxd/freezer.self_freezing
       91 0 -rw-r--r-- 1 root root 0 Jul 20 20:23 ./snap.lxd/tasks
       95 0 -r--r--r-- 1 root root 0 Jul 20 20:23 ./snap.lxd/freezer.parent_freezing
       93 0 -rw-r--r-- 1 root root 0 Dec 20 00:00 ./snap.lxd/freezer.state
       92 0 -rw-r--r-- 1 root root 0 Jul 20 20:23 ./snap.lxd/notify_on_release
       90 0 -rw-r--r-- 1 root root 0 Jul 20 20:23 ./snap.lxd/cgroup.clone_children
       36 0 -rw-r--r-- 1 nobody nogroup 0 Jan 7 23:31 ./notify_on_release
       34 0 -rw-r--r-- 1 nobody nogroup 0 Jan 7 23:31 ./cgroup.clone_children
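
Since etcd itself keeps running and only the `etcdctl version` probe fails, one possible charm-side mitigation is to treat that probe as best-effort instead of letting `CalledProcessError` abort the hook. The following is a minimal sketch under that assumption, not the charm's actual code: `probe_etcdctl_version` is a hypothetical helper, and returning `None` on failure is an illustrative design choice.

```python
import logging
import subprocess

log = logging.getLogger(__name__)


def probe_etcdctl_version(cmd=("/snap/bin/etcd.etcdctl", "version"), timeout=30):
    """Return the command's stdout as text, or None if it cannot be run.

    Hypothetical helper for illustration only. stderr is captured so the
    real cause (for example the snap mount or cgroup message) is logged
    rather than just "returned non-zero exit status 1".
    """
    try:
        result = subprocess.run(
            list(cmd),
            stdout=subprocess.PIPE,
            stderr=subprocess.PIPE,
            timeout=timeout,
            check=True,
        )
    except subprocess.CalledProcessError as exc:
        stderr = (exc.stderr or b"").decode("utf-8", errors="replace").strip()
        log.warning("%s exited with status %s: %s", cmd[0], exc.returncode, stderr)
        return None
    except (FileNotFoundError, subprocess.TimeoutExpired) as exc:
        log.warning("%s could not be run: %s", cmd[0], exc)
        return None
    return result.stdout.decode("utf-8", errors="replace").strip()
```

A caller would then decide what to do when the result is `None`, for example skipping the version field or retrying on a later hook. This only hides the symptom in the hook; it does not address whatever prevents the snap command from running.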

Changed in charm-etcd:
status: New → Confirmed
summary: - etcd stuck in error state after host reboot
+ etcd shows update-status hook errors after host reboot
George Kraft (cynerva) wrote :

The cgroup/freezer error is a different bug, being tracked here: https://bugs.launchpad.net/bugs/1933128

Please see the last few comments of that bug for potential workarounds.

I think these are two different bugs. The debug-log output from this bug's description clearly shows the command failing with:

mount --rbind /dev /tmp/snap.rootfs_hHlR10//dev: No such file or directory

Not the freezer cgroup thing.
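
To tell the two failure modes apart when triaging affected units, the command's stderr can be matched against the two messages quoted in this report. A minimal sketch: `classify_snap_failure` and its return labels are hypothetical names chosen for illustration, and the substrings are taken verbatim from the output above.

```python
def classify_snap_failure(stderr: str) -> str:
    """Map snap command stderr to one of the failure modes seen in this report.

    Hypothetical helper; the labels are illustrative, not part of any tool.
    """
    # Failure from this bug's description (debug-log output).
    if "mount --rbind" in stderr and "snap.rootfs_" in stderr:
        return "mount-rbind"
    # Failure seen when running etcd.etcdctl manually (tracked in bug 1933128).
    if "cannot open cgroup hierarchy" in stderr and "freezer" in stderr:
        return "cgroup-freezer"
    return "unknown"
```

Feeding it the stderr captured from `etcd.etcdctl version` on each unit would show which of the two bugs that unit is hitting.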
