is_active_bluestore_device (add-disk, zap-disk, list-disks) crashes if device is initialized as a PV with no VG created

Bug #1832444 reported by Trent Lloyd on 2019-06-12
Affects: OpenStack ceph-osd charm

Bug Description

== Problem description ==

If a device is initialized as an LVM PV but does not yet have a VG created on it, is_active_bluestore_device crashes with "list index out of range". This causes various tasks to fail, such as setting up an OSD from any hook (config_changed, etc.), as well as the actions add-disk, zap-disk and list-disks.

This occurs specifically after destroying an OSD using "ceph-volume lvm zap --osd-id N --destroy", which cleans up both the LV and the VG as appropriate but does not run "pvremove".

In this case, "pvdisplay" shows the header "--- NEW Physical volume ---" and a "VG Name" which is blank.
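
The blank "VG Name" state can also be detected programmatically. The following is a minimal sketch, not the charm's code. It assumes output produced by "pvs --noheadings --separator '|' -o pv_name,vg_name" (pvs rather than pvdisplay, because it is easier to parse); the helper name pvs_without_vg and the sample text are hypothetical.

```python
def pvs_without_vg(pvs_output):
    """Return PV device paths whose VG name column is blank."""
    orphans = []
    for line in pvs_output.splitlines():
        line = line.strip()
        if not line:
            continue
        # Each line is "<pv_name>|<vg_name>"; vg_name is empty for a PV
        # that has no VG created on it.
        pv_name, _, vg_name = line.partition("|")
        if not vg_name.strip():
            orphans.append(pv_name.strip())
    return orphans


# Illustrative sample only, not captured from a real system.
SAMPLE = """
  /dev/vdb|ceph-0a1b2c
  /dev/vdc|
"""
print(pvs_without_vg(SAMPLE))
```

Any device returned by such a check would be a candidate for the pvremove workaround.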

This can affect the DB device as well as the OSD device, as the VG is also removed from the DB device if it was the last one.

We also cannot use the zap-disk action to fix this situation, since that action is broken by this issue.

== Workaround ==

Manually remove the PV header using pvremove:

pvremove /dev/xxx

== Reproducer ==

(1) Deploy a luminous ceph-osd cluster (pike or newer)
(2) [ceph-osd/0] systemctl stop ceph-osd@*
(3) [ceph-osd/0] ceph-volume lvm zap --osd-id N --destroy # Replace N with one of the OSD IDs on this machine; you can get the ID list from "sudo mount"
(4) juju run-action ceph-osd/0 zap-disk devices="/dev/xxx" i-really-mean-it=true --wait # Replace /dev/xxx with the matching raw block device path, e.g. /dev/vdb, /dev/sdb, etc.
(5) Check the juju log for the traceback

Tags: seg
Trent Lloyd (lathiat) wrote :

== Cause ==

The cause is that is_active_bluestore_device first checks whether the device is a PV, then assumes that a VG name will be returned without checking that it is not blank.

Additionally, once that is fixed, it also assumes that lvm.list_logical_volumes returns a valid lv_name, which may not be true (the VG may be empty).

That blank lv_name is then passed to target.endswith(lv_name), which always returns True for an empty string and results in all devices being marked as active when they are not.
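
The empty-suffix behaviour is easy to confirm in plain Python. This is only an illustration: the target path and the matches_lv helper are hypothetical and do not come from the charm.

```python
target = "/dev/ceph-vg/osd-block-1234"  # hypothetical symlink target

# Every string ends with the empty string, so a blank name matches anything.
print(target.endswith(""))
print("/dev/some/other/device".endswith(""))


def matches_lv(target, lv_name):
    """Treat the target as a match only when lv_name is non-empty."""
    return bool(lv_name) and target.endswith(lv_name)


print(matches_lv(target, ""))
print(matches_lv(target, "osd-block-1234"))
```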

Various other functions also call lvm.list_lvm_volume_group and lvm.list_logical_volumes without checking for a blank return value. Most likely all such calls should be fixed.
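
One possible shape for a guarded check is sketched below. It is an assumption about how the checks could be ordered, not the charm's actual fix. The LVM lookups are passed in as stand-in callables (is_pv, vg_for, lvs_in, all hypothetical) so that the sketch runs without LVM installed.

```python
def is_active_bluestore_device_sketch(dev, is_pv, vg_for, lvs_in, block_targets):
    """Return True only if an OSD block symlink target ends with one of dev's LVs.

    Hypothetical stand-ins for the real lookups:
      is_pv(dev) -> bool
      vg_for(dev) -> VG name, or "" / None when no VG exists
      lvs_in(vg_name) -> list of LV names (possibly empty)
      block_targets -> iterable of symlink target paths
    """
    if not is_pv(dev):
        return False
    vg_name = vg_for(dev)
    if not vg_name:
        return False  # PV exists but no VG has been created on it
    lv_names = [name for name in lvs_in(vg_name) if name]
    if not lv_names:
        return False  # VG holds no LVs; avoids indexing an empty list
    return any(t.endswith(lv) for t in block_targets for lv in lv_names)


# A PV with a blank VG name is reported as inactive instead of crashing.
print(is_active_bluestore_device_sketch(
    "/dev/vdc", lambda d: True, lambda d: "", lambda vg: [], ["/dev/vg0/lv0"]))
```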

Trent Lloyd (lathiat) wrote :

I think this was partly fixed by some other recent commits.
