config-changed can hang indefinitly if it misses a previous upgrade key

Bug #1821113 reported by Wouter van Bommel
10
This bug affects 1 person
Affects Status Importance Assigned to Milestone
Ceph OSD Charm
New
Undecided
Unassigned

Bug Description

When performing some maintenance on a cloud, bumped into a host that was holding a machine lock.

I could trace back with the logging that it was waiting for an upgrade to finish. But it looks like the 'timeout' that is supposed to happen after 10 minutes, never kicked in.

Code that I suspect is located around: https://github.com/openstack/charm-ceph-osd/blob/master/lib/ceph/utils.py#L2205

Logging:
2019-03-19 15:08:57 DEBUG juju.cmd supercommand.go:57 args: []string{"/var/lib/juju/tools/unit-ceph-osd-6/jujud", "unit", "--data-dir", "/var/lib/juju", "--unit-name", "ceph-osd/6", "--debug"}
2019-03-19 15:08:57 DEBUG juju.agent agent.go:545 read agent config, format "2.0"
2019-03-19 15:08:57 INFO juju.cmd.jujud agent.go:133 setting logging config to "<root>=WARNING;unit=DEBUG"
2019-03-19 15:09:22 DEBUG juju-log Hardening function 'config_changed'
2019-03-19 15:09:22 DEBUG juju-log No hardening applied to 'config_changed'
2019-03-19 15:09:22 INFO juju-log old_version: luminous
2019-03-19 15:09:23 INFO juju-log new_version: luminous
2019-03-19 15:09:24 DEBUG juju-log Directory "/var/lib/ceph/osd/ceph-58" needs its ownership updated
2019-03-19 15:09:24 INFO juju-log Attempting to resume possibly failed upgrade.
2019-03-19 15:09:27 INFO juju-log Making dir /var/lib/charm/ceph-osd ceph:ceph 555
2019-03-19 15:09:27 INFO juju-log Monitor hosts are ['10.63.4.43:6789', '10.63.4.49:6789', '10.63.4.57:6789']
2019-03-19 15:09:35 DEBUG juju-log Writing file /var/lib/charm/ceph-osd/ceph.conf ceph:ceph 644
2019-03-19 15:09:35 INFO juju-log roll_osd_cluster called with luminous
2019-03-19 15:09:36 INFO juju-log osd_sorted_list: [<ceph.utils.CrushLocation object at 0x7f7166cc9080>, <ceph.utils.CrushLocation object at 0x7f7166cc90b8>, <ceph.utils.CrushLocation object at 0x7f7166cc90f0>, <ceph.utils.CrushLocation object at 0x7f7166cc9128>, <ceph.utils.CrushLocation object at 0x7f7166cc9160>, <ceph.utils.CrushLocation object at 0x7f7166cc9198>, <ceph.utils.CrushLocation object at 0x7f7166cc91d0>, <ceph.utils.CrushLocation object at 0x7f7166cc9208>, <ceph.utils.CrushLocation object at 0x7f7166cc9240>]
2019-03-19 15:09:36 INFO juju-log upgrade position: 5
2019-03-19 15:09:36 INFO juju-log Previous node is: os-s005
2019-03-19 15:09:37 DEBUG config-changed Error ENOENT: key 'osd_os-s005_luminous_done' doesn't exist
2019-03-19 15:09:37 INFO juju-log os-s005 is not finished. Waiting
2019-03-19 15:09:37 DEBUG config-changed Error ENOENT: error obtaining 'osd_os-s005_luminous_start': (2) No such file or directory
2019-03-19 15:09:37 INFO juju-log Monitor config-key get failed with message: b''
2019-03-19 15:09:37 INFO juju-log waiting for 11 seconds
2019-03-19 15:09:48 DEBUG config-changed Error ENOENT: key 'osd_os-s005_luminous_done' doesn't exist
2019-03-19 15:09:48 INFO juju-log os-s005 is not finished. Waiting
2019-03-19 15:09:49 DEBUG config-changed Error ENOENT: error obtaining 'osd_os-s005_luminous_start': (2) No such file or directory
2019-03-19 15:09:49 INFO juju-log Monitor config-key get failed with message: b''
2019-03-19 15:09:49 INFO juju-log waiting for 19 seconds
2019-03-19 15:10:08 DEBUG config-changed Error ENOENT: key 'osd_os-s005_luminous_done' doesn't exist
2019-03-19 15:10:08 INFO juju-log os-s005 is not finished. Waiting
2019-03-19 15:10:08 DEBUG config-changed Error ENOENT: error obtaining 'osd_os-s005_luminous_start': (2) No such file or directory
2019-03-19 15:10:08 INFO juju-log Monitor config-key get failed with message: b''
2019-03-19 15:10:08 INFO juju-log waiting for 11 seconds
2019-03-19 15:10:20 DEBUG config-changed Error ENOENT: key 'osd_os-s005_luminous_done' doesn't exist
2019-03-19 15:10:20 INFO juju-log os-s005 is not finished. Waiting
2019-03-19 15:10:20 DEBUG config-changed Error ENOENT: error obtaining 'osd_os-s005_luminous_start': (2) No such file or directory
2019-03-19 15:10:20 INFO juju-log Monitor config-key get failed with message: b''
2019-03-19 15:10:20 INFO juju-log waiting for 7 seconds
2019-03-19 15:10:27 DEBUG config-changed Error ENOENT: key 'osd_os-s005_luminous_done' doesn't exist
.
.
.
2019-03-19 19:37:59 DEBUG config-changed Error ENOENT: key 'osd_os-s005_luminous_done' doesn't exist
2019-03-19 19:37:59 INFO juju-log os-s005 is not finished. Waiting
2019-03-19 19:38:00 DEBUG config-changed Error ENOENT: error obtaining 'osd_os-s005_luminous_start': (2) No such file or directory
2019-03-19 19:38:00 INFO juju-log Monitor config-key get failed with message: b''
2019-03-19 19:38:00 INFO juju-log waiting for 5 seconds
2019-03-19 19:38:05 DEBUG config-changed Error ENOENT: key 'osd_os-s005_luminous_done' doesn't exist
2019-03-19 19:38:05 INFO juju-log os-s005 is not finished. Waiting
2019-03-19 19:38:05 DEBUG config-changed Error ENOENT: error obtaining 'osd_os-s005_luminous_start': (2) No such file or directory
2019-03-19 19:38:05 INFO juju-log Monitor config-key get failed with message: b''
2019-03-19 19:38:05 INFO juju-log waiting for 17 seconds
2019-03-19 19:38:23 DEBUG config-changed Error ENOENT: key 'osd_os-s005_luminous_done' doesn't exist
2019-03-19 19:38:23 INFO juju-log os-s005 is not finished. Waiting
.
.
.
2019-03-21 01:35:05 INFO juju-log os-s005 is not finished. Waiting
2019-03-21 01:35:05 DEBUG config-changed Error ENOENT: error obtaining 'osd_os-s005_luminous_start': (2) No such file or directory
2019-03-21 01:35:05 INFO juju-log Monitor config-key get failed with message: b''
2019-03-21 01:35:05 INFO juju-log waiting for 12 seconds
2019-03-21 01:35:17 DEBUG config-changed Error ENOENT: key 'osd_os-s005_luminous_done' doesn't exist
2019-03-21 01:35:17 INFO juju-log os-s005 is not finished. Waiting
2019-03-21 01:35:18 DEBUG config-changed Error ENOENT: error obtaining 'osd_os-s005_luminous_start': (2) No such file or directory
2019-03-21 01:35:18 INFO juju-log Monitor config-key get failed with message: b''
2019-03-21 01:35:18 INFO juju-log waiting for 28 seconds

tags: added: canonical-bootstack
Revision history for this message
Wouter van Bommel (woutervb) wrote :

After looking a bit closer to the code. The issue is caused by the fact the for some reason (edge case) the start signal (osd_os-s005_luminous_start in this case) is never received. So the 10 minuted loop never even starts.

Revision history for this message
Trent Lloyd (lathiat) wrote :

This is a duplicate of my bug here:
https://bugs.launchpad.net/charm-ceph-osd/+bug/1779828

Revision history for this message
Trent Lloyd (lathiat) wrote :

The failure to timeout is also here:
https://bugs.launchpad.net/charms.ceph/+bug/1664435

But it shouldn't have needed to timeout in the first place, so the main bug is #1779828

To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.