zap-disk action fails when OSD is locked up due to I/O errors
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
Ceph OSD Charm | New | Undecided | Unassigned |
Bug Description
When performing the zap-disk action, the pvdisplay and blockdev --getsz commands fail when the disk is returning I/O errors, and the whole action aborts.
It would be useful to be able to purge, via a charm action, all information related to a disk that has suffered an I/O failure from the running node. Currently, to keep a failed disk from rejoining the Ceph cluster on machine reboot, one has to reboot the server hosting the D-state processes and then re-run purge-osd/zap-disk while the I/O errors are absent from the kernel. Otherwise, there is a risk of the failing disk rejoining the cluster and causing I/O interruptions and latency for the workloads using Ceph.
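To illustrate the failure mode: the read-only probes (pvdisplay, blockdev --getsz) exit non-zero on a disk that returns I/O errors, and that non-zero exit propagates as the action's "exit status 1". A minimal sketch of a more tolerant approach, assuming a hypothetical helper (this is not the charm's actual code), would collect failures from the read-only probes instead of aborting, and only then proceed to the destructive steps:

```python
import subprocess

def run_tolerant(cmds):
    """Run each command, collecting failures instead of raising.
    Returns the list of commands that failed."""
    failed = []
    for cmd in cmds:
        try:
            subprocess.check_output(cmd, stderr=subprocess.STDOUT)
        except (subprocess.CalledProcessError, OSError):
            failed.append(cmd)
    return failed

def zap_disk_tolerant(device):
    """Hypothetical best-effort zap: pvdisplay and blockdev only read
    the device, so their failure on a dying disk should not abort the
    whole action."""
    failed = run_tolerant([
        ['pvdisplay', device],           # LVM metadata lookup (read-only)
        ['blockdev', '--getsz', device], # size query (read-only)
    ])
    for cmd in failed:
        print('ignoring failure of: %s' % ' '.join(cmd))
    # ...the destructive steps (partition-table wipe, etc.) would follow here...
```

This is only a sketch of the error-handling shape; the real fix would live in the charm's zap-disk action code.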
juju run-action --wait $OSD_UNIT zap-disk devices=
unit-ceph-osd-18:
  UnitId: ceph-osd/18
  id: "11668"
  message: exit status 1
  results:
    ReturnCode: 1
    Stderr: |2
      /dev/sdf: read failed after 0 of 4096 at 0: Input/output error
      /dev/sdf: read failed after 0 of 4096 at 6001175035904: Input/output error
      Device /dev/disk/
      Problem opening /dev/disk/
      Problem opening '' for writing! Program will now terminate.
      Warning! MBR not overwritten! Error is 2!
      Problem opening /dev/disk/
      Caution! Secondary header was placed beyond the disk's limits! Moving the
      header, but other problems may occur!
      Unable to open device '' for writing! Errno is 2! Aborting write!
      blockdev: cannot open /dev/disk/
      Traceback (most recent call last):
        File "/var/lib/
          zap()
        File "/var/lib/
        File "hooks/
        File "/usr/lib/
        File "/usr/lib/
      subproces
    Stdout: |
      Information: Creating fresh partition table; will override earlier problems!
  status: failed