Re-adding the ceph-osd charm to a host with down/out (but not deleted) OSDs fails in the install hook

Bug #1728161 reported by Drew Freiberger
This bug report is a duplicate of: Bug #1629679: remove-unit doesn't take OSDs down.
Affects: Ceph OSD Charm
Status: New
Importance: Undecided
Assigned to: Unassigned

Bug Description

Scenario:

ceph-osd is deployed on metal with a dummy charm, with 4 osd-devices defined as paths to directories (/srv/ceph/ceph0 through /srv/ceph/ceph3) rather than device paths.
The 4 OSDs then hang in the kernel (stuck in I/O wait).
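For reference, directory-backed OSDs like these would typically be configured with something along these lines (the application name ceph-osd-bcache is taken from the unit name in the log below; the exact invocation is an assumption, not copied from this deployment):

  juju config ceph-osd-bcache osd-devices='/srv/ceph/ceph0 /srv/ceph/ceph1 /srv/ceph/ceph2 /srv/ceph/ceph3'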

The ceph-osd unit is removed to stop the services, but the OSD daemons cannot actually stop because they are hung in I/O wait.
The unit is then forced, via debug-hooks, to fail, remove itself, and stop properly.

At this point the host is running Ubuntu with no ceph-osd unit, but it still has 4 prepared OSD data directories that were previously activated and are still defined in the OSD map and CRUSH map.

Run xfs_repair against the filesystems backing the OSD directories.
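For example, per OSD directory (the backing block device /dev/bcache0 is hypothetical; substitute whatever actually backs each directory):

  umount /srv/ceph/ceph0
  xfs_repair /dev/bcache0
  mount /dev/bcache0 /srv/ceph/ceph0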

Re-add the ceph-osd charm to the machine, expecting it to pick up the existing disks and rejoin the cluster.

Problem:

During the initial hook run, ceph-osd installs its software and runs ceph-disk prepare and ceph-disk activate against data directories that already contain whoami and fsid files from the earlier deployment. The activate command fails, however, because /etc/ceph/ceph.conf does not yet have the fsid or the monitor hosts defined:

2017-10-27 21:43:23 DEBUG config-changed kernel.pid_max = 2097152
2017-10-27 21:43:23 DEBUG juju-log got journal devs: set([])
2017-10-27 21:43:23 DEBUG juju-log read zapped: set([])
2017-10-27 21:43:23 DEBUG juju-log write zapped: set([])
2017-10-27 21:43:23 INFO juju-log ceph bootstrapped, rescanning disks
2017-10-27 21:43:25 INFO juju-log Making dir /var/lib/charm/ceph-osd-bcache ceph:ceph 555
2017-10-27 21:43:25 INFO juju-log Monitor hosts are []
2017-10-27 21:43:31 INFO juju-log Making dir /srv/ceph/ceph0 ceph:ceph 755
2017-10-27 21:43:37 INFO juju-log osdize dir cmd: ['sudo', '-u', 'ceph', 'ceph-disk', 'prepare', '--data-dir', u'/srv/ceph/ceph0']
2017-10-27 21:43:44 INFO juju-log Making dir /srv/ceph/ceph1 ceph:ceph 755
2017-10-27 21:43:50 INFO juju-log osdize dir cmd: ['sudo', '-u', 'ceph', 'ceph-disk', 'prepare', '--data-dir', u'/srv/ceph/ceph1']
2017-10-27 21:43:53 INFO juju-log Making dir /srv/ceph/ceph2 ceph:ceph 755
2017-10-27 21:44:03 INFO juju-log osdize dir cmd: ['sudo', '-u', 'ceph', 'ceph-disk', 'prepare', '--data-dir', u'/srv/ceph/ceph2']
2017-10-27 21:44:07 INFO juju-log Making dir /srv/ceph/ceph3 ceph:ceph 755
2017-10-27 21:44:13 INFO juju-log osdize dir cmd: ['sudo', '-u', 'ceph', 'ceph-disk', 'prepare', '--data-dir', u'/srv/ceph/ceph3']
2017-10-27 21:44:15 DEBUG config-changed ceph-disk: Error: No cluster conf found in /etc/ceph with fsid ca9451f1-5c4f-4e85-bb14-a08dfc0568f7
2017-10-27 21:44:15 DEBUG config-changed Traceback (most recent call last):
2017-10-27 21:44:15 DEBUG config-changed File "/var/lib/juju/agents/unit-ceph-osd-bcache-35/charm/hooks/config-changed", line 524, in <module>
2017-10-27 21:44:15 DEBUG config-changed hooks.execute(sys.argv)
2017-10-27 21:44:15 DEBUG config-changed File "/var/lib/juju/agents/unit-ceph-osd-bcache-35/charm/hooks/charmhelpers/core/hookenv.py", line 731, in execute
2017-10-27 21:44:15 DEBUG config-changed self._hooks[hook_name]()
2017-10-27 21:44:15 DEBUG config-changed File "/var/lib/juju/agents/unit-ceph-osd-bcache-35/charm/hooks/charmhelpers/contrib/hardening/harden.py", line 79, in _harden_inner2
2017-10-27 21:44:15 DEBUG config-changed return f(*args, **kwargs)
2017-10-27 21:44:15 DEBUG config-changed File "/var/lib/juju/agents/unit-ceph-osd-bcache-35/charm/hooks/config-changed", line 335, in config_changed
2017-10-27 21:44:15 DEBUG config-changed prepare_disks_and_activate()
2017-10-27 21:44:15 DEBUG config-changed File "/var/lib/juju/agents/unit-ceph-osd-bcache-35/charm/hooks/config-changed", line 362, in prepare_disks_and_activate
2017-10-27 21:44:15 DEBUG config-changed ceph.start_osds(get_devices())
2017-10-27 21:44:15 DEBUG config-changed File "lib/ceph/__init__.py", line 810, in start_osds
2017-10-27 21:44:15 DEBUG config-changed subprocess.check_call(['ceph-disk', 'activate', dev_or_path])
2017-10-27 21:44:15 DEBUG config-changed File "/usr/lib/python2.7/subprocess.py", line 541, in check_call
2017-10-27 21:44:15 DEBUG config-changed raise CalledProcessError(retcode, cmd)
2017-10-27 21:44:15 DEBUG config-changed subprocess.CalledProcessError: Command '['ceph-disk', 'activate', u'/srv/ceph/ceph0']' returned non-zero exit status 1
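The mismatch ceph-disk is complaining about can be seen by comparing the cluster fsid recorded in the prepared data dir with what is (not yet) in /etc/ceph/ceph.conf. The file names below are the ones ceph-disk writes during prepare, and the output comments show what I would expect at this point rather than a captured transcript:

  cat /srv/ceph/ceph0/ceph_fsid                  # ca9451f1-5c4f-4e85-bb14-a08dfc0568f7
  grep -E 'fsid|mon host' /etc/ceph/ceph.conf    # no matches yet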

I believe this is the 17.02 ceph-osd charm.

To resolve this, I copied the "mon host = X.Y.Z.A:PORT" and "fsid" entries into this unit's /etc/ceph/ceph.conf from another working ceph-osd unit.
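Roughly, the entries added to /etc/ceph/ceph.conf looked like this (the fsid is the one from the ceph-disk error above; the monitor address and port are placeholders, not real values):

  [global]
  fsid = ca9451f1-5c4f-4e85-bb14-a08dfc0568f7
  mon host = X.Y.Z.A:PORT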

On the hook retry the charm then proceeded to activate the disks, after which it went on to add its relations, etc.

When pre-prepared ceph disks are present, it would be handy for the charm to check the mon relation (i.e. bail out while "Monitor hosts are []") before performing the ceph-disk activate commands; a sketch of such a guard follows.
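In the charm itself the check would presumably live around prepare_disks_and_activate()/start_osds() (names taken from the traceback above) and key off the mon relation data. The standalone sketch below only approximates that by inspecting /etc/ceph/ceph.conf, and every function name in it is mine, not the charm's:

  # Sketch of the proposed guard -- an illustration, not the charm's actual code.
  # Don't run 'ceph-disk activate' on a pre-prepared data dir while the charm is
  # still logging "Monitor hosts are []", i.e. before the mon relation has
  # populated /etc/ceph/ceph.conf with an fsid and monitor addresses.
  import subprocess

  def ceph_conf_ready(conf_path='/etc/ceph/ceph.conf'):
      """True once ceph.conf carries both an fsid and at least one mon host."""
      try:
          with open(conf_path) as f:
              conf = f.read()
      except IOError:
          return False
      return 'fsid' in conf and ('mon host' in conf or 'mon_host' in conf)

  def activate(path):
      # Defer instead of letting 'ceph-disk activate' fail as in the traceback.
      if not ceph_conf_ready():
          print('No fsid/mon host in ceph.conf yet; deferring activation of %s '
                'until the mon relation has provided them.' % path)
          return
      subprocess.check_call(['ceph-disk', 'activate', path])

  if __name__ == '__main__':
      for data_dir in ['/srv/ceph/ceph0', '/srv/ceph/ceph1',
                       '/srv/ceph/ceph2', '/srv/ceph/ceph3']:
          activate(data_dir)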
