Disks become unusable if the add-disk action fails

Bug #1945843 reported by Tolga Kaprol
This bug affects 3 people
Affects: Ceph OSD Charm
Status: New
Importance: Undecided
Assigned to: Unassigned

Bug Description

My Ceph deployment had a problem with a removed disk. Although the OSD was no longer listed in Ceph, its auth key still existed.

Therefore, every attempt to add a new OSD to the deployment failed, because Ceph tried to reuse the same OSD number.
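
For reference, the stale entry can usually be confirmed and cleared on the MON side with the plain Ceph CLI before retrying; this is a hedged sketch, with <id> as a placeholder for the OSD number that Ceph keeps trying to reuse:

# check whether the removed OSD still has an auth key registered
ceph auth ls | grep osd
# purge the leftover entry so the ID can be reallocated cleanly
ceph osd crush remove osd.<id>
ceph auth del osd.<id>
ceph osd rm <id>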

lsblk
NAME MAJ:MIN RM SIZE RO TYPE MOUNTPOINT
loop0 7:0 0 55.5M 1 loop /snap/core18/2074
loop1 7:1 0 55.3M 1 loop
loop2 7:2 0 70.6M 1 loop
loop3 7:3 0 55.5M 1 loop
loop4 7:4 0 32.3M 1 loop
loop5 7:5 0 70.3M 1 loop /snap/lxd/21029
loop6 7:6 0 32.3M 1 loop /snap/snapd/12883
loop7 7:7 0 67.6M 1 loop
loop8 7:8 0 55.4M 1 loop /snap/core18/2128
loop10 7:10 0 32.3M 1 loop /snap/snapd/13170
loop11 7:11 0 61.8M 1 loop /snap/core20/1081
loop12 7:12 0 67.3M 1 loop /snap/lxd/21545
sda 8:0 1 232.9G 0 disk
└─sda1 8:1 1 232.9G 0 part /
sdb 8:16 1 894.3G 0 disk
└─ceph--746cc89e--b2aa--4fab--b2fb--066b1532489f-osd--block--746cc89e--b2aa--4fab--b2fb--066b1532489f
                                                                 253:0 0 894.3G 0 lvm

juju run-action --wait ceph-osd/0 add-disk osd-devices="/dev/sdb"
unit-ceph-osd-0:
  UnitId: ceph-osd/0
  id: "1533"
  message: exit status 1
  results:
    ReturnCode: 1
    Stderr: |
      partx: /dev/sdb: failed to read partition table
        Failed to find physical volume "/dev/sdb".
        Failed to find physical volume "/dev/sdb".
        Can't open /dev/sdb exclusively. Mounted filesystem?
        Can't open /dev/sdb exclusively. Mounted filesystem?
      Traceback (most recent call last):
        File "/var/lib/juju/agents/unit-ceph-osd-0/charm/actions/add-disk", line 79, in <module>
          request = add_device(request=request,
        File "/var/lib/juju/agents/unit-ceph-osd-0/charm/actions/add-disk", line 34, in add_device
          charms_ceph.utils.osdize(device_path, hookenv.config('osd-format'),
        File "/var/lib/juju/agents/unit-ceph-osd-0/charm/lib/charms_ceph/utils.py", line 1498, in osdize
          osdize_dev(dev, osd_format, osd_journal,
        File "/var/lib/juju/agents/unit-ceph-osd-0/charm/lib/charms_ceph/utils.py", line 1571, in osdize_dev
          cmd = _ceph_volume(dev,
        File "/var/lib/juju/agents/unit-ceph-osd-0/charm/lib/charms_ceph/utils.py", line 1706, in _ceph_volume
          cmd.append(_allocate_logical_volume(dev=dev,
        File "/var/lib/juju/agents/unit-ceph-osd-0/charm/lib/charms_ceph/utils.py", line 1960, in _allocate_logical_volume
          lvm.create_lvm_physical_volume(pv_dev)
        File "/var/lib/juju/agents/unit-ceph-osd-0/charm/hooks/charmhelpers/contrib/storage/linux/lvm.py", line 92, in create_lvm_physical_volume
          check_call(['pvcreate', block_device])
        File "/usr/lib/python3.8/subprocess.py", line 364, in check_call
          raise CalledProcessError(retcode, cmd)
      subprocess.CalledProcessError: Command '['pvcreate', '/dev/sdb']' returned non-zero exit status 5.
  status: failed
  timing:
    completed: 2021-10-02 00:44:47 +0000 UTC
    enqueued: 2021-10-02 00:44:45 +0000 UTC
    started: 2021-10-02 00:44:45 +0000 UTC
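
The "Can't open /dev/sdb exclusively" messages and the pvcreate exit status 5 are consistent with the stale LVM mapping still holding the disk open. A hedged way to confirm this with standard device-mapper and sysfs tooling (not charm commands):

# show which device-mapper targets still sit on top of sdb
dmsetup ls --tree
ls /sys/block/sdb/holders/
# list mapped devices with their open counts
dmsetup info -c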

This is a deadlock situation for ceph-osd: neither the zap-disk nor the add-disk action works.

I discovered that the volume was still listed by lsblk, while vgs, pvs, and lvs returned nothing. There was a metadata backup for the VG, but vgcfgrestore refused to restore it as well.

vgcfgrestore ceph-359ab4d2-15df-4583-add7-b05e9cb36055
  Volume group ceph-359ab4d2-15df-4583-add7-b05e9cb36055 has active volume: osd-block-359ab4d2-15df-4583-add7-b05e9cb36055.
  WARNING: Found 1 active volume(s) in volume group "ceph-359ab4d2-15df-4583-add7-b05e9cb36055".
  Restoring VG with active LVs, may cause mismatch with its metadata.
Do you really want to proceed with restore of volume group "ceph-359ab4d2-15df-4583-add7-b05e9cb36055", while 1 volume(s) are active? [y/n]: y
  WARNING: Couldn't find device with uuid 1E7aEI-mNj3-fZXN-762y-Ul2o-vCmc-GkNkDp.
  Cannot restore Volume Group ceph-359ab4d2-15df-4583-add7-b05e9cb36055 with 1 PVs marked as missing.
  Restore failed.
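
The restore fails because LVM cannot find any device carrying the PV UUID it expects, so there is nothing to restore the metadata onto. For completeness, the archived metadata versions can be listed first with plain LVM tooling (the VG name is the one from the output above):

vgcfgrestore --list ceph-359ab4d2-15df-4583-add7-b05e9cb36055
ls /etc/lvm/archive/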

The solution is to find and remove the leftover device-mapper entry for the VG manually. After that, the zap-disk and add-disk actions start working again.

dmsetup info
dmsetup remove <failed vg name>
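
A hedged, concrete version of that sequence, using the mapping name as it appears in the lsblk output above (the exact name differs per deployment), followed by the charm's zap-disk and add-disk actions:

# identify the leftover mapping that still holds /dev/sdb
dmsetup info -c | grep ceph
# remove it (name taken from the dmsetup/lsblk output; substitute your own)
dmsetup remove ceph--746cc89e--b2aa--4fab--b2fb--066b1532489f-osd--block--746cc89e--b2aa--4fab--b2fb--066b1532489f
# the charm actions should now succeed again
juju run-action --wait ceph-osd/0 zap-disk devices=/dev/sdb i-really-mean-it=true
juju run-action --wait ceph-osd/0 add-disk osd-devices=/dev/sdb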

These commands should be implemented in the ceph-osd charm, at least as an additional action to clear leftover volumes properly.

Revision history for this message
Chris Valean (cvalean) wrote :

We have also been observing this error and found this bug report.
A related bug is https://bugs.launchpad.net/charm-ceph-osd/+bug/1858519

juju version 2.9.44.1
ceph-osd charm: 15.2.17
channel: octopus/stable
