2021-10-02 00:48:02 |
Tolga Kaprol |
description |
My Ceph deployment had a problem with a removed disk. Although the OSD was no longer listed by Ceph, its auth key still existed.
Therefore, every attempt to add a new OSD to the deployment failed, because Ceph tried to reuse the same OSD number.
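For context, checking for and clearing the stale identity might look like the following; the OSD id is a hypothetical placeholder and these commands are destructive, so treat this as a sketch rather than part of the original report:
ceph auth ls                  # the removed OSD's key is still listed here
ceph auth del osd.<id>        # delete the stale key so the id can be reused
ceph osd rm osd.<id>          # clear any leftover OSD entry, if one remains
Meanwhile, the disk itself still showed the stale LVM volume: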
lsblk
NAME MAJ:MIN RM SIZE RO TYPE MOUNTPOINT
loop0 7:0 0 55.5M 1 loop /snap/core18/2074
loop1 7:1 0 55.3M 1 loop
loop2 7:2 0 70.6M 1 loop
loop3 7:3 0 55.5M 1 loop
loop4 7:4 0 32.3M 1 loop
loop5 7:5 0 70.3M 1 loop /snap/lxd/21029
loop6 7:6 0 32.3M 1 loop /snap/snapd/12883
loop7 7:7 0 67.6M 1 loop
loop8 7:8 0 55.4M 1 loop /snap/core18/2128
loop10 7:10 0 32.3M 1 loop /snap/snapd/13170
loop11 7:11 0 61.8M 1 loop /snap/core20/1081
loop12 7:12 0 67.3M 1 loop /snap/lxd/21545
sda 8:0 1 232.9G 0 disk
└─sda1 8:1 1 232.9G 0 part /
sdb 8:16 1 894.3G 0 disk
└─ceph--746cc89e--b2aa--4fab--b2fb--066b1532489f-osd--block--746cc89e--b2aa--4fab--b2fb--066b1532489f
253:0 0 894.3G 0 lvm
juju run-action --wait ceph-osd/0 add-disk osd-devices="/dev/sdb"
unit-ceph-osd-0:
UnitId: ceph-osd/0
id: "1533"
message: exit status 1
results:
ReturnCode: 1
Stderr: |
partx: /dev/sdb: failed to read partition table
Failed to find physical volume "/dev/sdb".
Failed to find physical volume "/dev/sdb".
Can't open /dev/sdb exclusively. Mounted filesystem?
Can't open /dev/sdb exclusively. Mounted filesystem?
Traceback (most recent call last):
File "/var/lib/juju/agents/unit-ceph-osd-0/charm/actions/add-disk", line 79, in <module>
request = add_device(request=request,
File "/var/lib/juju/agents/unit-ceph-osd-0/charm/actions/add-disk", line 34, in add_device
charms_ceph.utils.osdize(device_path, hookenv.config('osd-format'),
File "/var/lib/juju/agents/unit-ceph-osd-0/charm/lib/charms_ceph/utils.py", line 1498, in osdize
osdize_dev(dev, osd_format, osd_journal,
File "/var/lib/juju/agents/unit-ceph-osd-0/charm/lib/charms_ceph/utils.py", line 1571, in osdize_dev
cmd = _ceph_volume(dev,
File "/var/lib/juju/agents/unit-ceph-osd-0/charm/lib/charms_ceph/utils.py", line 1706, in _ceph_volume
cmd.append(_allocate_logical_volume(dev=dev,
File "/var/lib/juju/agents/unit-ceph-osd-0/charm/lib/charms_ceph/utils.py", line 1960, in _allocate_logical_volume
lvm.create_lvm_physical_volume(pv_dev)
File "/var/lib/juju/agents/unit-ceph-osd-0/charm/hooks/charmhelpers/contrib/storage/linux/lvm.py", line 92, in create_lvm_physical_volume
check_call(['pvcreate', block_device])
File "/usr/lib/python3.8/subprocess.py", line 364, in check_call
raise CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command '['pvcreate', '/dev/sdb']' returned non-zero exit status 5.
status: failed
timing:
completed: 2021-10-02 00:44:47 +0000 UTC
enqueued: 2021-10-02 00:44:45 +0000 UTC
started: 2021-10-02 00:44:45 +0000 UTC
This is a deadlock situation for ceph-osd: neither the zap-disk nor the add-disk action works, because pvcreate cannot open /dev/sdb exclusively while the stale device-mapper mapping still holds it.
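For reference, the zap-disk attempt that fails the same way would be invoked like this (parameter names are taken from the ceph-osd charm's documented zap-disk action, so treat the exact syntax as an assumption):
juju run-action --wait ceph-osd/0 zap-disk devices="/dev/sdb" i-really-mean-it=true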
I discovered that the volume was still listed by lsblk, while vgs, pvs, and lvs all returned nothing. There was a backup of the VG metadata, but vgcfgrestore refused to restore it as well:
vgcfgrestore ceph-359ab4d2-15df-4583-add7-b05e9cb36055
Volume group ceph-359ab4d2-15df-4583-add7-b05e9cb36055 has active volume: osd-block-359ab4d2-15df-4583-add7-b05e9cb36055.
WARNING: Found 1 active volume(s) in volume group "ceph-359ab4d2-15df-4583-add7-b05e9cb36055".
Restoring VG with active LVs, may cause mismatch with its metadata.
Do you really want to proceed with restore of volume group "ceph-359ab4d2-15df-4583-add7-b05e9cb36055", while 1 volume(s) are active? [y/n]: y
WARNING: Couldn't find device with uuid 1E7aEI-mNj3-fZXN-762y-Ul2o-vCmc-GkNkDp.
Cannot restore Volume Group ceph-359ab4d2-15df-4583-add7-b05e9cb36055 with 1 PVs marked as missing.
Restore failed.
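The mismatch can be confirmed at the device-mapper level, which still knows about the mapping even though LVM's own metadata does not (a minimal sketch; the mapping name is the ceph--...-osd--block--... device from the lsblk output above):
dmsetup ls            # the stale ceph--... mapping is still listed here
ls -l /dev/mapper/    # and its device node still exists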
The solution is to find the stale mapping and remove it manually. After that, the zap-disk and add-disk actions start working again.
dmsetup info                     # identify the stale mapping left behind by the removed VG
dmsetup remove <failed vg name>  # the name to pass is the dm device shown by lsblk
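Put together, using the stale mapping name from the lsblk output above (shown here as an illustration of the fix, not verbatim from the original report):
dmsetup remove ceph--746cc89e--b2aa--4fab--b2fb--066b1532489f-osd--block--746cc89e--b2aa--4fab--b2fb--066b1532489f
juju run-action --wait ceph-osd/0 add-disk osd-devices="/dev/sdb"
Once the mapping is gone, pvcreate can open /dev/sdb exclusively and the add-disk action succeeds.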
These commands should be implemented in the ceph-osd charm, at least as an additional action, so that stale volumes can be cleared properly. |