zap-disk action fails when OSD is locked up due to I/O errors

Bug #1928705 reported by Drew Freiberger
This bug affects 1 person
Affects: Ceph OSD Charm
Status: New
Importance: Undecided
Assigned to: Unassigned
Milestone: (none)

Bug Description

When performing the zap-disk action, the pvdisplay and blockdev --getsz commands fail if the underlying disk is returning I/O errors.

It would be useful to be able to purge, via a charm action, all information on the running node about a disk that has failed with I/O errors. Currently, to keep a failed disk from rejoining the Ceph cluster on machine reboot, one has to reboot the server (clearing the D-state processes) and then re-run purge-disk/zap-disk while the I/O errors are no longer present in the kernel. Otherwise there is a risk of the failing disk rejoining the cluster and causing I/O interruptions/lag for the workloads using Ceph.
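One possible direction (a sketch only, not the charm's actual code): the traceback below shows zap_disk dying inside subprocess.check_output when `blockdev --getsz` fails. A tolerant wrapper could return None for an unreadable device so the action can skip the size-dependent wipe steps instead of aborting. The helper name safe_getsz is hypothetical.

```python
import subprocess


def safe_getsz(block_device):
    """Return the device size in 512-byte sectors, or None if the
    device cannot be read (e.g. it is failing with I/O errors or
    has disappeared). Hypothetical helper, for illustration only."""
    try:
        out = subprocess.check_output(
            ['blockdev', '--getsz', block_device],
            stderr=subprocess.DEVNULL).decode('UTF-8')
        return int(out.strip())
    except (subprocess.CalledProcessError, OSError, ValueError):
        # blockdev exited non-zero, is missing, or printed no number:
        # treat the device as unreadable rather than crashing the action.
        return None
```

With a wrapper like this, the zap action could still best-effort wipe partition tables and metadata when the size lookup fails, rather than exiting with status 1 as in the log below.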

juju run-action --wait $OSD_UNIT zap-disk devices=/dev/disk/by-dname/bcache1-osd-5 i-really-mean-it=true
unit-ceph-osd-18:
  UnitId: ceph-osd/18
  id: "11668"
  message: exit status 1
  results:
    ReturnCode: 1
    Stderr: |2
        /dev/mapper/crypt-90930824-9648-4de9-8c8c-9c8db46f0e12: read failed after 0 of 4096 at 0: Input/output error
        /dev/mapper/crypt-90930824-9648-4de9-8c8c-9c8db46f0e12: read failed after 0 of 4096 at 6001172938752: Input/output error
        /dev/mapper/crypt-90930824-9648-4de9-8c8c-9c8db46f0e12: read failed after 0 of 4096 at 6001173012480: Input/output error
        /dev/mapper/crypt-90930824-9648-4de9-8c8c-9c8db46f0e12: read failed after 0 of 4096 at 4096: Input/output error
        /dev/ceph-90930824-9648-4de9-8c8c-9c8db46f0e12/osd-block-90930824-9648-4de9-8c8c-9c8db46f0e12: read failed after 0 of 4096 at 0: Input/output error
        /dev/ceph-90930824-9648-4de9-8c8c-9c8db46f0e12/osd-block-90930824-9648-4de9-8c8c-9c8db46f0e12: read failed after 0 of 4096 at 6001168154624: Input/output error
        /dev/ceph-90930824-9648-4de9-8c8c-9c8db46f0e12/osd-block-90930824-9648-4de9-8c8c-9c8db46f0e12: read failed after 0 of 4096 at 6001168211968: Input/output error
        /dev/ceph-90930824-9648-4de9-8c8c-9c8db46f0e12/osd-block-90930824-9648-4de9-8c8c-9c8db46f0e12: read failed after 0 of 4096 at 4096: Input/output error
        /dev/sdf: read failed after 0 of 4096 at 0: Input/output error
        /dev/sdf: read failed after 0 of 4096 at 6001175035904: Input/output error
      Device /dev/disk/by-dname/bcache1-osd-5 doesn't exist or access denied.
      Problem opening /dev/disk/by-dname/bcache1-osd-5 for reading! Error is 6.
      Problem opening '' for writing! Program will now terminate.
      Warning! MBR not overwritten! Error is 2!
      Problem opening /dev/disk/by-dname/bcache1-osd-5 for reading! Error is 6.
      Caution! Secondary header was placed beyond the disk's limits! Moving the
      header, but other problems may occur!
      Unable to open device '' for writing! Errno is 2! Aborting write!
      blockdev: cannot open /dev/disk/by-dname/bcache1-osd-5: No such device or address
      Traceback (most recent call last):
        File "/var/lib/juju/agents/unit-ceph-osd-18/charm/actions/zap-disk", line 94, in <module>
          zap()
        File "/var/lib/juju/agents/unit-ceph-osd-18/charm/actions/zap-disk", line 79, in zap
          zap_disk(device)
        File "hooks/charmhelpers/contrib/storage/linux/utils.py", line 90, in zap_disk
          block_device]).decode('UTF-8')
        File "/usr/lib/python3.6/subprocess.py", line 356, in check_output
          **kwargs).stdout
        File "/usr/lib/python3.6/subprocess.py", line 438, in run
          output=stdout, stderr=stderr)
      subprocess.CalledProcessError: Command '['blockdev', '--getsz', '/dev/disk/by-dname/bcache1-osd-5']' returned non-zero exit status 1.
    Stdout: |
      Information: Creating fresh partition table; will override earlier problems!
  status: failed
