Re-adding the ceph-osd charm to a host with down/out (but not deleted) OSDs fails in the install hook
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
Ceph OSD Charm | New | Undecided | Unassigned |
Bug Description
Scenario:
ceph-osd deployed on metal with a dummy charm, with 4 osd-devices defined as paths to directories (/srv/ceph/ceph0 through /srv/ceph/ceph3) rather than device paths.
The 4 OSDs then hang in the kernel.
Remove the ceph-osd unit to stop its services, but ceph-osd cannot actually stop because it is stuck in I/O wait.
Force the ceph-osd unit to fail, then remove it and stop it properly through debug-hooks.
You now have a host running Ubuntu with no ceph-osd unit, but with 4 prepared Ceph OSD disks that were previously activated and are still defined in the OSD map and CRUSH map.
xfs_repair the Ceph disks.
Re-add the ceph-osd charm to the unit, hoping it will pick up the disks and rejoin the cluster.
Problem:
During the install hook, ceph-osd installs the software and runs ceph-disk prepare and ceph-disk activate on disks that already contain a whoami and fsid file; however, the activate command fails because ceph.conf does not yet have the fsid or the monitors defined.
```
2017-10-27 21:43:23 DEBUG config-changed kernel.pid_max = 2097152
2017-10-27 21:43:23 DEBUG juju-log got journal devs: set([])
2017-10-27 21:43:23 DEBUG juju-log read zapped: set([])
2017-10-27 21:43:23 DEBUG juju-log write zapped: set([])
2017-10-27 21:43:23 INFO juju-log ceph bootstrapped, rescanning disks
2017-10-27 21:43:25 INFO juju-log Making dir /var/lib/
2017-10-27 21:43:25 INFO juju-log Monitor hosts are []
2017-10-27 21:43:31 INFO juju-log Making dir /srv/ceph/ceph0 ceph:ceph 755
2017-10-27 21:43:37 INFO juju-log osdize dir cmd: ['sudo', '-u', 'ceph', 'ceph-disk', 'prepare', '--data-dir', u'/srv/ceph/ceph0']
2017-10-27 21:43:44 INFO juju-log Making dir /srv/ceph/ceph1 ceph:ceph 755
2017-10-27 21:43:50 INFO juju-log osdize dir cmd: ['sudo', '-u', 'ceph', 'ceph-disk', 'prepare', '--data-dir', u'/srv/ceph/ceph1']
2017-10-27 21:43:53 INFO juju-log Making dir /srv/ceph/ceph2 ceph:ceph 755
2017-10-27 21:44:03 INFO juju-log osdize dir cmd: ['sudo', '-u', 'ceph', 'ceph-disk', 'prepare', '--data-dir', u'/srv/ceph/ceph2']
2017-10-27 21:44:07 INFO juju-log Making dir /srv/ceph/ceph3 ceph:ceph 755
2017-10-27 21:44:13 INFO juju-log osdize dir cmd: ['sudo', '-u', 'ceph', 'ceph-disk', 'prepare', '--data-dir', u'/srv/ceph/ceph3']
2017-10-27 21:44:15 DEBUG config-changed ceph-disk: Error: No cluster conf found in /etc/ceph with fsid ca9451f1-
2017-10-27 21:44:15 DEBUG config-changed Traceback (most recent call last):
2017-10-27 21:44:15 DEBUG config-changed File "/var/lib/
2017-10-27 21:44:15 DEBUG config-changed hooks.execute(
2017-10-27 21:44:15 DEBUG config-changed File "/var/lib/
2017-10-27 21:44:15 DEBUG config-changed self._hooks[
2017-10-27 21:44:15 DEBUG config-changed File "/var/lib/
2017-10-27 21:44:15 DEBUG config-changed return f(*args, **kwargs)
2017-10-27 21:44:15 DEBUG config-changed File "/var/lib/
2017-10-27 21:44:15 DEBUG config-changed prepare_
2017-10-27 21:44:15 DEBUG config-changed File "/var/lib/
2017-10-27 21:44:15 DEBUG config-changed ceph.start_
2017-10-27 21:44:15 DEBUG config-changed File "lib/ceph/
2017-10-27 21:44:15 DEBUG config-changed subprocess.
2017-10-27 21:44:15 DEBUG config-changed File "/usr/lib/
2017-10-27 21:44:15 DEBUG config-changed raise CalledProcessEr
2017-10-27 21:44:15 DEBUG config-changed subprocess.
```
I believe this is the 17.02 ceph-osd charm.
To resolve this, I copied the `mon host = X.Y.Z.A:PORT` and `fsid` entries from another working ceph-osd unit's /etc/ceph/
The charm then proceeded to activate the disks on the install-hook retry, and went on to add relations, etc.
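For reference, the copied entries took roughly this shape in the `[global]` section of the config (the fsid and monitor address below are placeholders, not values from the affected cluster):

```
[global]
# hypothetical values; must match the cluster's actual fsid and monitor address
fsid = ca9451f1-xxxx-xxxx-xxxx-xxxxxxxxxxxx
mon host = 10.0.0.1:6789
```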
It would be handy, when pre-existing Ceph disks are present, for the charm to check the mon relation (i.e. whether "Monitor hosts are []" is still empty) before performing the ceph-disk activate commands.
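The suggested guard could be sketched as follows; this is a minimal illustration, not the charm's actual code, and the function and parameter names are hypothetical:

```python
def safe_to_activate(monitor_hosts, conf_path="/etc/ceph/ceph.conf"):
    """Return True only if monitors are known and ceph.conf carries cluster identity.

    Hypothetical helper: the real charm would obtain monitor_hosts from the
    mon relation and defer osdize/activate until this check passes, instead
    of failing inside ceph-disk activate.
    """
    # Corresponds to the "Monitor hosts are []" line in the log above.
    if not monitor_hosts:
        return False
    try:
        with open(conf_path) as f:
            conf = f.read()
    except OSError:
        # No ceph.conf yet, so ceph-disk activate would fail anyway.
        return False
    # ceph-disk activate needs the fsid and monitors to be defined.
    return "fsid" in conf and "mon host" in conf
```

With a check like this, the install hook would simply wait for the mon relation rather than crashing on disks that already carry a whoami and fsid.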