charm osdizes /dev/sdb, then fails with "looks like /dev/sdb is in use, skipping"

Bug #1751127 reported by Jason Hobbs
This bug affects 1 person
Affects: Ceph OSD Charm
Status: Fix Released
Importance: High
Assigned to: Chris MacNaughton
Milestone: 18.05

Bug Description

A ceph-osd unit is ending up in "No block devices detected using current configuration".

We set our block device to /dev/sdb.

In the unit log we see:

2018-02-22 12:58:50 INFO juju-log mon:37: osdize cmd: ['ceph-disk', 'prepare', '--zap-disk', '--bluestore', '/dev/sdb']

Then later we see:
2018-02-22 13:01:08 INFO juju-log mon:37: Looks like /dev/sdb is in use, skipping.

Here is the bundle and overlay:
http://paste.ubuntu.com/p/JqmwxGhGCt/

crashdump is attached.

Here is a second bundle that reproduced this (this one with bluestore off):
http://paste.ubuntu.com/p/G6WH6MHM83/
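
For context, the "in use, skipping" message means the charm decided the block device (or one of its partitions) was already mounted and moved on. A minimal sketch of that kind of mount check follows; it is illustrative only, not the charm's actual osdize code, and the helper name device_in_use is made up here:

import sys

def device_in_use(dev):
    """Return True if dev (e.g. /dev/sdb) or any of its partitions
    shows up as a mount source in /proc/mounts."""
    with open('/proc/mounts') as mounts:
        sources = [line.split()[0] for line in mounts if line.strip()]
    # /dev/sdb counts as "in use" if /dev/sdb itself or /dev/sdb1,
    # /dev/sdb2, ... is currently mounted somewhere.
    return any(src == dev or src.startswith(dev) for src in sources)

if __name__ == '__main__':
    dev = sys.argv[1] if len(sys.argv) > 1 else '/dev/sdb'
    if device_in_use(dev):
        print('Looks like %s is in use, skipping.' % dev)
    else:
        print('%s looks free; ceph-disk prepare could proceed.' % dev)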

Revision history for this message
Jason Hobbs (jason-hobbs) wrote :

We hit this again today on a run that was not bluestore, so it doesn't seem to be bluestore-specific.

Revision history for this message
Dmitrii Shcherbakov (dmitriis) wrote :

Jason, just FYI: I was able to deploy Ceph clusters with bluestore just fine, without similar errors. So if this happens, it is certainly intermittent and probably not bluestore-specific.

From looking at the ceph-disk code I can say that the steps it does are very similar for FileStore and BlueStore.

Revision history for this message
Chris Gregan (cgregan) wrote :

Escalated to Field High to increase visibility.

Revision history for this message
Chris MacNaughton (chris.macnaughton) wrote :

Digging through the ceph logs on monitor 0 and osd.2 (machine 8), it looks like as soon as osd.2 came online, osd.5 tried to ping it (a normal health check), immediately failed, and started reporting it to the monitors as down, causing osd.2 to eventually kill itself. Is it possible that there are networking issues between the machines on the storage network, potentially MTU issues?

Alternately, if the machine were under heavy enough load and couldn't respond to healthchecks fast enough, that could also cause it to get marked as down.
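
If the MTU hypothesis needs checking, one rough way to probe it from an OSD host is to send non-fragmenting pings at sizes near the expected MTU. The sketch below is illustrative only; the peer address is just osd.5's storage address taken from the mon.log below, and the payload sizes assume 1500-byte and 9000-byte MTUs:

import subprocess

# Rough sketch: probe path MTU between storage-network peers with
# non-fragmenting pings (iputils "ping -M do").  1472 = 1500 - 28 bytes
# of IP/ICMP headers; 8972 = 9000 - 28 for jumbo frames.
peer = '192.168.36.4'  # osd.5's address, taken from the mon.log below
for payload in (1472, 8972):
    rc = subprocess.call(
        ['ping', '-M', 'do', '-c', '3', '-s', str(payload), peer])
    status = 'ok' if rc == 0 else 'failed (fragmentation needed or unreachable)'
    print('payload %d bytes: %s' % (payload, status))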

Attaching a couple of log snippets below:

mon.log

2018-02-22 13:05:10.825046 7f5e88c16700 0 mon.juju-6ba979-0-lxd-0@0(leader) e2 handle_command mon_command({"prefix": "osd ls", "format": "json"} v 0) v1
2018-02-22 13:05:10.825084 7f5e88c16700 0 log_channel(audit) log [DBG] : from='client.? 192.168.36.11:0/3533814954' entity='client.admin' cmd=[{"prefix": "osd ls", "format": "json"}]: dispatch
2018-02-22 13:05:11.397144 7f5e8b41b700 0 log_channel(cluster) log [INF] : Health check cleared: OSD_DOWN (was: 1 osds down)
2018-02-22 13:05:11.397207 7f5e8b41b700 0 log_channel(cluster) log [INF] : Health check cleared: OSD_HOST_DOWN (was: 1 host (1 osds) down)
2018-02-22 13:05:11.401213 7f5e8440d700 1 mon.juju-6ba979-0-lxd-0@0(leader).osd e130 e130: 9 total, 9 up, 9 in
2018-02-22 13:05:11.405136 7f5e8440d700 0 log_channel(cluster) log [INF] : osd.2 192.168.36.6:6800/182399 boot
2018-02-22 13:05:11.405220 7f5e8440d700 0 log_channel(cluster) log [DBG] : osdmap e130: 9 total, 9 up, 9 in
2018-02-22 13:05:11.411304 7f5e88c16700 1 mon.juju-6ba979-0-lxd-0@0(leader).osd e130 prepare_failure osd.2 192.168.36.6:6800/182399 from osd.5 192.168.36.4:6800/501269 is reporting failure:1
2018-02-22 13:05:11.411316 7f5e88c16700 0 log_channel(cluster) log [DBG] : osd.2 192.168.36.6:6800/182399 reported immediately failed by osd.5 192.168.36.4:6800/501269
2018-02-22 13:05:11.411338 7f5e88c16700 1 mon.juju-6ba979-0-lxd-0@0(leader).osd e130 we're forcing failure of osd.2
2018-02-22 13:05:11.411360 7f5e88c16700 0 log_channel(cluster) log [INF] : osd.2 failed (root=default,rack=zone2,host=ariados) (connection refused reported by osd.5)
2018-02-22 13:05:11.614894 7f5e88c16700 1 mon.juju-6ba979-0-lxd-0@0(leader).osd e130 prepare_failure osd.2 192.168.36.6:6800/182399 from osd.5 192.168.36.4:6800/501269 is reporting failure:1
2018-02-22 13:05:11.614912 7f5e88c16700 0 log_channel(cluster) log [DBG] : osd.2 192.168.36.6:6800/182399 reported immediately failed by osd.5 192.168.36.4:6800/501269
2018-02-22 13:05:11.682861 7f5e88c16700 0 mon.juju-6ba979-0-lxd-0@0(leader) e2 handle_command mon_command({"prefix": "config-key put", "key": "cephx.groups.objects", "val": "{\"pools\": [\"default.rgw.buckets\", \"default.rgw\", \"default.rgw.root\", \"default.rgw.control\", \"default.rgw.gc\", \"default.rgw.buckets.index\", \"default.rgw.buckets.extra\", \"default.log\", \"default.intent-log\", \"default.usage\", \"default.users\", \"default.users.email\", \"default.users.swift\", \"default.users.uid\", \".rgw.root\"], \"services\": []}"} v 0) v1
2018-02-22 13:05:11.682907 7f5e88c16700 0 log_channel(audit) log [INF] : from='client.15830 -...


Revision history for this message
Ryan Beisner (1chb1n) wrote :

Bluestore support in the ceph charms is classified as preview/experimental, but is intended for further feature development as soon as possible.

Changed in charm-ceph-osd:
importance: Undecided → Wishlist
Revision history for this message
Jason Hobbs (jason-hobbs) wrote :

Attached a crashdump from the reproduction with bluestore off.

description: updated
Ryan Beisner (1chb1n)
Changed in charm-ceph-osd:
importance: Wishlist → High
assignee: nobody → Chris MacNaughton (chris.macnaughton)
milestone: none → 18.05
Ryan Beisner (1chb1n)
Changed in charm-ceph-osd:
status: New → In Progress
Revision history for this message
Chris MacNaughton (chris.macnaughton) wrote :

The message in the title of this bug is not really useful, as it means that the charm correctly determined that the block device was mounted and moved on gracefully. What is interesting to me about this bug is the disks on a machine crashing (machine 8 in the original crashdump).

The next time that we see this error, I'd like to get a new crashdump with the load information, as load or networking issues[1] are the only cases that I can think of that would cause the cluster to mark an OSD down almost immediately.

[1]: If the network is to blame, I'd suspect a routing issue, as the OSD ping message is tiny, well under a size that would cause problems with an MTU mismatch.

Revision history for this message
Ryan Beisner (1chb1n) wrote :

If this recurs, please include a new crashdump so that we can re-analyze with the new /var/log/load information. Thank you.

Changed in charm-ceph-osd:
status: In Progress → Incomplete
Revision history for this message
Jason Hobbs (jason-hobbs) wrote :

Posting the bundle/crashdump from a reproduction that includes load logging.

bundle:
http://paste.ubuntu.com/p/3rG3XyCrs7/

Changed in charm-ceph-osd:
status: Incomplete → New
Revision history for this message
Chris MacNaughton (chris.macnaughton) wrote :

The error in the crashdump attached in #10 is entirely different, as there was a hook error in the mon-relation-changed hook.

Changed in charm-ceph-osd:
status: New → Incomplete
Revision history for this message
Chris MacNaughton (chris.macnaughton) wrote :

Given the stack trace:

2018-03-02 22:26:13 INFO juju-log mon:38: osdize cmd: ['ceph-disk', 'prepare', '--fs-type', 'xfs', '--zap-disk', '--filestore', '/dev/sdb']
2018-03-02 22:26:14 DEBUG mon-relation-changed ceph-disk: Error: Device is mounted: /dev/sdb1
2018-03-02 22:26:14 DEBUG worker.uniter.jujuc server.go:178 running hook tool "juju-log"
2018-03-02 22:26:14 ERROR juju-log mon:38: Unable to initialize device: /dev/sdb
2018-03-02 22:26:14 DEBUG mon-relation-changed Traceback (most recent call last):
2018-03-02 22:26:14 DEBUG mon-relation-changed File "/var/lib/juju/agents/unit-ceph-osd-6/charm/hooks/mon-relation-changed", line 559, in <module>
2018-03-02 22:26:14 DEBUG mon-relation-changed hooks.execute(sys.argv)
2018-03-02 22:26:14 DEBUG mon-relation-changed File "/var/lib/juju/agents/unit-ceph-osd-6/charm/hooks/charmhelpers/core/hookenv.py", line 800, in execute
2018-03-02 22:26:14 DEBUG mon-relation-changed self._hooks[hook_name]()
2018-03-02 22:26:14 DEBUG mon-relation-changed File "/var/lib/juju/agents/unit-ceph-osd-6/charm/hooks/mon-relation-changed", line 486, in mon_relation
2018-03-02 22:26:14 DEBUG mon-relation-changed prepare_disks_and_activate()
2018-03-02 22:26:14 DEBUG mon-relation-changed File "/var/lib/juju/agents/unit-ceph-osd-6/charm/hooks/mon-relation-changed", line 389, in prepare_disks_and_activate
2018-03-02 22:26:14 DEBUG mon-relation-changed config('bluestore'))
2018-03-02 22:26:14 DEBUG mon-relation-changed File "lib/ceph/utils.py", line 1439, in osdize
2018-03-02 22:26:14 DEBUG mon-relation-changed bluestore)
2018-03-02 22:26:14 DEBUG mon-relation-changed File "lib/ceph/utils.py", line 1507, in osdize_dev
2018-03-02 22:26:14 DEBUG mon-relation-changed subprocess.check_call(cmd)
2018-03-02 22:26:14 DEBUG mon-relation-changed File "/usr/lib/python3.5/subprocess.py", line 581, in check_call
2018-03-02 22:26:14 DEBUG mon-relation-changed raise CalledProcessError(retcode, cmd)
2018-03-02 22:26:14 DEBUG mon-relation-changed subprocess.CalledProcessError: Command '['ceph-disk', 'prepare', '--fs-type', 'xfs', '--zap-disk', '--filestore', '/dev/sdb']' returned non-zero exit status 1

It looks like this is actually a reproduction of bug #1746118.
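
A minimal sketch of how the call site in the traceback above could surface ceph-disk's own error text (e.g. "Error: Device is mounted: /dev/sdb1") in the unit log rather than only an unhandled CalledProcessError; this is illustrative only, not the actual lib/ceph/utils.py code:

import subprocess

def prepare_osd_device(dev, fs_type='xfs', zap=True, bluestore=False):
    """Illustrative only -- not the charm's osdize_dev().  Builds a
    ceph-disk prepare command like the one above and logs ceph-disk's
    output when it fails, so messages such as
    'Error: Device is mounted: /dev/sdb1' are not lost."""
    cmd = ['ceph-disk', 'prepare']
    if bluestore:
        cmd.append('--bluestore')
    else:
        cmd.extend(['--fs-type', fs_type, '--filestore'])
    if zap:
        cmd.append('--zap-disk')
    cmd.append(dev)
    try:
        subprocess.check_output(cmd, stderr=subprocess.STDOUT)
    except subprocess.CalledProcessError as exc:
        print('Unable to initialize device %s: %s'
              % (dev, exc.output.decode(errors='replace')))
        raise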

Revision history for this message
Chris MacNaughton (chris.macnaughton) wrote :

@jason-hobbs, @cgregan: is this bug still occurring? If so, can we get a fresh crashdump from a reproduction that includes the new load information?

While we wait on more information for this bug, there are reviews in progress to improve logging under error conditions during disk changes, as well as to improve disk management in general.

Revision history for this message
Eric Vasquez (envas) wrote :

@icey
We had this occur again. ceph-osd was:
    charm: cs:~chris.macnaughton/ceph-osd-7

Attaching juju crash dump from the failing run.

Revision history for this message
Chris MacNaughton (chris.macnaughton) wrote :

That crashdump has a very partial juju status, and almost none of the machine logs?

Revision history for this message
Eric Vasquez (envas) wrote :

Apologies, attached the wrong crash dump.
Here is the right one.

Chris Gregan (cgregan)
Changed in charm-ceph-osd:
status: Incomplete → Confirmed
Revision history for this message
Chris MacNaughton (chris.macnaughton) wrote :

Looking at the logs (with additional logging, yay!), I can see that /dev/sdb1 is mounted (briefly) after the charm executes:

2018-04-06 02:32:02 INFO juju-log mon:42: osdize cmd: ['ceph-disk', 'prepare', '--fs-type', 'xfs', '--zap-disk', '--filestore', '/dev/sdb']

Looking at the timestamps, we can see that preparing and zapping the disk took around 30 seconds. Several hook executions later, we see the output of lsblk again with /dev/sdb1 _not_ mounted.

2018-04-06 02:32:35 DEBUG juju-log mon:42: NAME="sdb" MAJ:MIN="8:16" RM="0" SIZE="931.5G" RO="0" TYPE="disk" MOUNTPOINT=""
NAME="sdb1" MAJ:MIN="8:17" RM="0" SIZE="930.5G" RO="0" TYPE="part" MOUNTPOINT="/var/lib/ceph/tmp/mnt.fw3oZa"
NAME="sdb2" MAJ:MIN="8:18" RM="0" SIZE="1G" RO="0" TYPE="part" MOUNTPOINT=""
...snip...
2018-04-06 02:32:52 DEBUG juju-log mon:42: NAME="sdb" MAJ:MIN="8:16" RM="0" SIZE="931.5G" RO="0" TYPE="disk" MOUNTPOINT=""
NAME="sdb1" MAJ:MIN="8:17" RM="0" SIZE="930.5G" RO="0" TYPE="part" MOUNTPOINT=""
NAME="sdb2" MAJ:MIN="8:18" RM="0" SIZE="1G" RO="0" TYPE="part" MOUNTPOINT=""

I wonder if, per the suggestion at the end of ceph-disk's prepare stage, we need to add a call to partprobe or kpartx:

2018-04-06 02:32:33 DEBUG mon-relation-changed Warning: The kernel is still using the old partition table.
2018-04-06 02:32:33 DEBUG mon-relation-changed The new table will be used at the next reboot or after you
2018-04-06 02:32:33 DEBUG mon-relation-changed run partprobe(8) or kpartx(8)
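
If the kernel really is holding on to the old partition table after the zap, a re-read between the zap and the prepare would look roughly like the sketch below. This assumes partprobe is appropriate here (per the warning above) and is not a committed fix:

import subprocess

def reread_partition_table(dev):
    """Sketch: ask the kernel to re-read dev's partition table after
    zapping, then wait for udev to finish processing the change, so a
    following 'ceph-disk prepare' sees the new layout."""
    # partprobe(8) is what the warning above suggests; kpartx(8) would
    # be the alternative for device-mapper setups.
    subprocess.check_call(['partprobe', dev])
    # Wait until udev has settled so /dev entries match the new table.
    subprocess.check_call(['udevadm', 'settle', '--timeout=30'])

reread_partition_table('/dev/sdb')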

Revision history for this message
James Page (james-page) wrote :

$ grep "Monitor hosts are" unit-ceph-osd-4.log
2018-04-06 02:31:46 INFO juju-log mon:42: Monitor hosts are ['10.244.40.228:6789', '10.244.41.5:6789', '192.168.36.14:6789']
2018-04-06 02:31:55 INFO juju-log mon:42: Monitor hosts are ['10.244.40.228:6789', '10.244.41.5:6789', '192.168.36.14:6789']
2018-04-06 02:32:42 INFO juju-log mon:42: Monitor hosts are ['10.244.40.228:6789', '192.168.36.14:6789', '192.168.36.23:6789']
2018-04-06 02:32:49 INFO juju-log mon:42: Monitor hosts are ['10.244.40.228:6789', '192.168.36.14:6789', '192.168.36.23:6789']

syslog on the unit indicates that the udev trigger action to activate the OSD fails until 02:53, just after ceph.conf is written with 2/3 mons actually on the right network address.

The get_mon_hosts function has a fallback which I think is causing this issue; as that fallback was there to deal with upgrades, we should just drop it and rely on a populated set of ceph-public-address values to address the mon cluster.
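
A rough sketch of what dropping that fallback might look like, using charmhelpers' relation accessors and keeping only units that actually publish ceph-public-address; names and details here are illustrative, not the charm's exact code:

from charmhelpers.core.hookenv import (
    relation_ids,
    related_units,
    relation_get,
)

def get_mon_hosts():
    """Sketch only: build the mon host list purely from the
    ceph-public-address values published on the mon relation, with no
    fallback to other addresses."""
    hosts = []
    for relid in relation_ids('mon'):
        for unit in related_units(relid):
            addr = relation_get('ceph-public-address', unit, relid)
            if addr:
                hosts.append('{}:6789'.format(addr))
    return sorted(set(hosts))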

Revision history for this message
Chris MacNaughton (chris.macnaughton) wrote :

There is a new version in my namespace, including the proposal James is making above as well as some additional logging on failure, at cs:~chris.macnaughton/ceph-osd-8. Can you please update your testing to use that branch for a bit to see if it resolves this issue?

Revision history for this message
Chris MacNaughton (chris.macnaughton) wrote :

Ok, while I'd love to see if the above change fixes this bug, the fix in https://review.openstack.org/#/c/559964 should prevent it from being a problem to begin with.

Changed in charm-ceph-osd:
status: Confirmed → Fix Committed
Revision history for this message
Ryan Beisner (1chb1n) wrote :

This is a tricky backport: a straight backport would change the type of a config option. We will have to look further into how/if we can selectively backport the change in a way that remains backward-compatible with the current stable config types.

Revision history for this message
Chris MacNaughton (chris.macnaughton) wrote :

As an addendum to Ryan's point: this change, excepting the config change, has been backported in https://review.openstack.org/#/c/562488/

Changed in charm-ceph-osd:
status: Fix Committed → Fix Released