2021-10-02 00:48:02 |
Tolga Kaprol |
description |
My Ceph deployment had a problem with a removed disk. Although the OSD was no longer listed by Ceph, its auth key still existed.
Therefore, every attempt to add a new OSD to the deployment failed, because Ceph tried to reuse the same OSD number.
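For context, checking for and clearing the stale identity might look like the following; the OSD id is a hypothetical placeholder and these commands are destructive, so treat this as a sketch rather than part of the original report:
ceph auth ls                  # the removed OSD's key is still listed here
ceph auth del osd.<id>        # delete the stale key so the id can be reused
ceph osd rm osd.<id>          # clear any leftover OSD entry, if one remains
Meanwhile, the disk itself still showed the stale LVM volume: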
lsblk
NAME MAJ:MIN RM SIZE RO TYPE MOUNTPOINT
loop0 7:0 0 55.5M 1 loop /snap/core18/2074
loop1 7:1 0 55.3M 1 loop
loop2 7:2 0 70.6M 1 loop
loop3 7:3 0 55.5M 1 loop
loop4 7:4 0 32.3M 1 loop
loop5 7:5 0 70.3M 1 loop /snap/lxd/21029
loop6 7:6 0 32.3M 1 loop /snap/snapd/12883
loop7 7:7 0 67.6M 1 loop
loop8 7:8 0 55.4M 1 loop /snap/core18/2128
loop10 7:10 0 32.3M 1 loop /snap/snapd/13170
loop11 7:11 0 61.8M 1 loop /snap/core20/1081
loop12 7:12 0 67.3M 1 loop /snap/lxd/21545
sda 8:0 1 232.9G 0 disk
└─sda1 8:1 1 232.9G 0 part /
sdb 8:16 1 894.3G 0 disk
└─ceph--746cc89e--b2aa--4fab--b2fb--066b1532489f-osd--block--746cc89e--b2aa--4fab--b2fb--066b1532489f
253:0 0 894.3G 0 lvm
juju run-action --wait ceph-osd/0 add-disk osd-devices="/dev/sdb"
unit-ceph-osd-0:
UnitId: ceph-osd/0
id: "1533"
message: exit status 1
results:
ReturnCode: 1
Stderr: |
partx: /dev/sdb: failed to read partition table
Failed to find physical volume "/dev/sdb".
Failed to find physical volume "/dev/sdb".
Can't open /dev/sdb exclusively. Mounted filesystem?
Can't open /dev/sdb exclusively. Mounted filesystem?
Traceback (most recent call last):
File "/var/lib/juju/agents/unit-ceph-osd-0/charm/actions/add-disk", line 79, in <module>
request = add_device(request=request,
File "/var/lib/juju/agents/unit-ceph-osd-0/charm/actions/add-disk", line 34, in add_device
charms_ceph.utils.osdize(device_path, hookenv.config('osd-format'),
File "/var/lib/juju/agents/unit-ceph-osd-0/charm/lib/charms_ceph/utils.py", line 1498, in osdize
osdize_dev(dev, osd_format, osd_journal,
File "/var/lib/juju/agents/unit-ceph-osd-0/charm/lib/charms_ceph/utils.py", line 1571, in osdize_dev
cmd = _ceph_volume(dev,
File "/var/lib/juju/agents/unit-ceph-osd-0/charm/lib/charms_ceph/utils.py", line 1706, in _ceph_volume
cmd.append(_allocate_logical_volume(dev=dev,
File "/var/lib/juju/agents/unit-ceph-osd-0/charm/lib/charms_ceph/utils.py", line 1960, in _allocate_logical_volume
lvm.create_lvm_physical_volume(pv_dev)
File "/var/lib/juju/agents/unit-ceph-osd-0/charm/hooks/charmhelpers/contrib/storage/linux/lvm.py", line 92, in create_lvm_physical_volume
check_call(['pvcreate', block_device])
File "/usr/lib/python3.8/subprocess.py", line 364, in check_call
raise CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command '['pvcreate', '/dev/sdb']' returned non-zero exit status 5.
status: failed
timing:
completed: 2021-10-02 00:44:47 +0000 UTC
enqueued: 2021-10-02 00:44:45 +0000 UTC
started: 2021-10-02 00:44:45 +0000 UTC
This is a deadlock situation for ceph-osd: neither the zap-disk nor the add-disk action works, because pvcreate cannot open /dev/sdb exclusively while the stale device-mapper mapping still holds it.
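For reference, the zap-disk attempt that fails the same way would be invoked like this (parameter names are taken from the ceph-osd charm's documented zap-disk action, so treat the exact syntax as an assumption):
juju run-action --wait ceph-osd/0 zap-disk devices="/dev/sdb" i-really-mean-it=true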
I discovered that the volume was still listed by lsblk, while vgs, pvs, and lvs all returned nothing. There was a backup of the VG metadata, but vgcfgrestore refused to restore it as well:
vgcfgrestore ceph-359ab4d2-15df-4583-add7-b05e9cb36055
Volume group ceph-359ab4d2-15df-4583-add7-b05e9cb36055 has active volume: osd-block-359ab4d2-15df-4583-add7-b05e9cb36055.
WARNING: Found 1 active volume(s) in volume group "ceph-359ab4d2-15df-4583-add7-b05e9cb36055".
Restoring VG with active LVs, may cause mismatch with its metadata.
Do you really want to proceed with restore of volume group "ceph-359ab4d2-15df-4583-add7-b05e9cb36055", while 1 volume(s) are active? [y/n]: y
WARNING: Couldn't find device with uuid 1E7aEI-mNj3-fZXN-762y-Ul2o-vCmc-GkNkDp.
Cannot restore Volume Group ceph-359ab4d2-15df-4583-add7-b05e9cb36055 with 1 PVs marked as missing.
Restore failed.
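The mismatch can be confirmed at the device-mapper level, which still knows about the mapping even though LVM's own metadata does not (a minimal sketch; the mapping name is the ceph--...-osd--block--... device from the lsblk output above):
dmsetup ls            # the stale ceph--... mapping is still listed here
ls -l /dev/mapper/    # and its device node still exists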
The solution is to find the stale mapping and remove it manually. After that, the zap-disk and add-disk actions start working again.
dmsetup info                     # identify the stale mapping left behind by the removed VG
dmsetup remove <failed vg name>  # the name to pass is the dm device shown by lsblk
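Put together, using the stale mapping name from the lsblk output above (shown here as an illustration of the fix, not verbatim from the original report):
dmsetup remove ceph--746cc89e--b2aa--4fab--b2fb--066b1532489f-osd--block--746cc89e--b2aa--4fab--b2fb--066b1532489f
juju run-action --wait ceph-osd/0 add-disk osd-devices="/dev/sdb"
Once the mapping is gone, pvcreate can open /dev/sdb exclusively and the add-disk action succeeds.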
These commands should be implemented in the ceph-osd charm, at least as an additional action, so that stale volumes can be cleared properly. |