Multipath disconnect fails if path just went down

Bug #1794829 reported by Gorka Eguileor
Affects: os-brick
Status: Fix Released
Importance: Undecided
Assigned to: Gorka Eguileor

Bug Description

If the iSCSI connection to a device goes down right after we flush it, or if one of the paths of a multipath device goes down right before we start disconnecting, the detach will fail even though it should succeed.

An extract of the error we'll see in the logs is:

  2018-09-12 10:30:52.013 1 ERROR oslo_messaging.rpc.server return r.call(f, *args, **kwargs)
  2018-09-12 10:30:52.013 1 ERROR oslo_messaging.rpc.server File "/usr/lib/python2.7/site-packages/retrying.py", line 229, in call
  2018-09-12 10:30:52.013 1 ERROR oslo_messaging.rpc.server raise attempt.get()
  2018-09-12 10:30:52.013 1 ERROR oslo_messaging.rpc.server File "/usr/lib/python2.7/site-packages/retrying.py", line 261, in get
  2018-09-12 10:30:52.013 1 ERROR oslo_messaging.rpc.server six.reraise(self.value[0], self.value[1], self.value[2])
  2018-09-12 10:30:52.013 1 ERROR oslo_messaging.rpc.server File "/usr/lib/python2.7/site-packages/retrying.py", line 217, in call
  2018-09-12 10:30:52.013 1 ERROR oslo_messaging.rpc.server attempt = Attempt(fn(*args, **kwargs), attempt_number, False)
  2018-09-12 10:30:52.013 1 ERROR oslo_messaging.rpc.server File "/usr/lib/python2.7/site-packages/os_brick/initiator/linuxscsi.py", line 89, in wait_for_volumes_removal
  2018-09-12 10:30:52.013 1 ERROR oslo_messaging.rpc.server raise exception.VolumePathNotRemoved(volume_path=exist)
  2018-09-12 10:30:52.013 1 ERROR oslo_messaging.rpc.server VolumePathNotRemoved: Volume path [u'sdd'] was not removed in time.
  2018-09-12 10:30:52.013 1 ERROR oslo_messaging.rpc.server

This happens because, under those circumstances, it may take up to 30 seconds for the SCSI device to be removed from /dev, but we expect it to disappear within 6 seconds (the first check happens immediately, then another after 2 seconds, and another 4 seconds after that).

If we wait a little longer, the device will be properly removed.
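
The timing mismatch can be illustrated with a little arithmetic. The snippet below is only a sketch: `check_times` is a hypothetical helper, not part of os-brick, and the sleep values are taken from the description above.

```python
from itertools import accumulate


def check_times(sleeps):
    """Return the elapsed time (seconds) at which each check runs.

    The first check runs immediately; each later check runs after the
    corresponding sleep.
    """
    return list(accumulate(sleeps, initial=0))


# Old behaviour as described in this report: check, sleep 2, check, sleep 4, check.
old_checks = check_times([2, 4])
last_check = old_checks[-1]

# A device that takes up to 30 seconds to disappear can outlive the last check.
worst_case_removal = 30
gives_up_too_early = last_check < worst_case_removal
```

With these numbers the last check runs 6 seconds in, so a removal that completes any time after that is reported as a failure.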

Gorka Eguileor (gorka)
Changed in os-brick:
assignee: nobody → Gorka Eguileor (gorka)
OpenStack Infra (hudson-openstack) wrote : Fix proposed to os-brick (master)

Fix proposed to branch: master
Review: https://review.openstack.org/605802

Changed in os-brick:
status: New → In Progress
OpenStack Infra (hudson-openstack) wrote : Fix merged to os-brick (master)

Reviewed: https://review.openstack.org/605802
Committed: https://git.openstack.org/cgit/openstack/os-brick/commit/?id=b9c7bc2b597d944cbc404d6bf5fedc35d095a897
Submitter: Zuul
Branch: master

commit b9c7bc2b597d944cbc404d6bf5fedc35d095a897
Author: Gorka Eguileor <email address hidden>
Date: Thu Sep 27 17:55:00 2018 +0200

    Succeed on iSCSI detach when path just went down

    If the iSCSI connection to a device goes down right after we flush it,
    or if one of the paths of a multipath device goes down right before we
    start disconnecting, the detach will fail even though it should succeed.

    We'll see a VolumePathNotRemoved exception listing volumes that had not
    disappeared.

    This happens because, under those circumstances, it may take up to 30
    seconds for the SCSI device to be removed from /dev, but we expect it
    to disappear within 6 seconds (the first check happens immediately,
    then another after 2 seconds, and another 4 seconds after that).

    Since the device will be removed if we wait a bit more, this patch makes
    it so that we wait for up to 30 seconds for the removal.

    To ensure we wait as little time as possible, we change the way we wait
    for the devices to be removed. Instead of checking, sleeping for 2 and
    then for 4 seconds, and then checking again, we just sleep 500ms between
    checks, and we do the DEBUG log every 5 seconds.

    Change-Id: If801dfc2462c0d3f986eebd4108087139934610d
    Closes-Bug: #1794829
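
The waiting strategy described in the commit message (poll every 500ms, give up after 30 seconds, emit a DEBUG log every 5 seconds) could look roughly like the sketch below. This is not the code from the patch: `wait_for_removal`, `PathNotRemoved`, and all parameter names are hypothetical, and the `clock` and `sleep` arguments exist only so the loop can be exercised without real delays.

```python
import logging
import time

LOG = logging.getLogger(__name__)


class PathNotRemoved(Exception):
    """Hypothetical stand-in for os-brick's VolumePathNotRemoved."""


def wait_for_removal(paths, exists, timeout=30.0, interval=0.5,
                     log_every=5.0, clock=time.monotonic, sleep=time.sleep):
    """Poll until none of `paths` exist; raise after `timeout` seconds."""
    start = clock()
    next_log = start + log_every
    while True:
        # Check first, so an already-removed device returns without sleeping.
        remaining = [p for p in paths if exists(p)]
        if not remaining:
            return
        now = clock()
        if now - start >= timeout:
            raise PathNotRemoved(
                'Paths %s were not removed in time.' % remaining)
        if now >= next_log:
            # Log only every `log_every` seconds to avoid flooding the logs.
            LOG.debug('Still waiting for removal of %s', remaining)
            next_log += log_every
        sleep(interval)
```

A caller might pass `os.path.exists` as `exists` together with a list of `/dev` paths. Short sleeps keep the common case fast, while the longer timeout covers the slow removal described above.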

Changed in os-brick:
status: In Progress → Fix Released
OpenStack Infra (hudson-openstack) wrote : Fix proposed to os-brick (stable/rocky)

Fix proposed to branch: stable/rocky
Review: https://review.openstack.org/607041

OpenStack Infra (hudson-openstack) wrote : Fix merged to os-brick (stable/rocky)

Reviewed: https://review.openstack.org/607041
Committed: https://git.openstack.org/cgit/openstack/os-brick/commit/?id=b75411de2b2aadd1eafd2f8f8b1579df357bf09f
Submitter: Zuul
Branch: stable/rocky

commit b75411de2b2aadd1eafd2f8f8b1579df357bf09f
Author: Gorka Eguileor <email address hidden>
Date: Thu Sep 27 17:55:00 2018 +0200

    Succeed on iSCSI detach when path just went down

    Change-Id: If801dfc2462c0d3f986eebd4108087139934610d
    Closes-Bug: #1794829
    (cherry-picked from b9c7bc2b597d944cbc404d6bf5fedc35d095a897)

tags: added: in-stable-rocky
OpenStack Infra (hudson-openstack) wrote : Fix proposed to os-brick (stable/queens)

Fix proposed to branch: stable/queens
Review: https://review.openstack.org/607632

OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/os-brick 2.6.1

This issue was fixed in the openstack/os-brick 2.6.1 release.

OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/os-brick 2.5.4

This issue was fixed in the openstack/os-brick 2.5.4 release.

OpenStack Infra (hudson-openstack) wrote : Fix merged to os-brick (stable/queens)

Reviewed: https://review.openstack.org/607632
Committed: https://git.openstack.org/cgit/openstack/os-brick/commit/?id=9722aa7db81b1d67b9e4c7804034b680aeb10b17
Submitter: Zuul
Branch: stable/queens

commit 9722aa7db81b1d67b9e4c7804034b680aeb10b17
Author: Gorka Eguileor <email address hidden>
Date: Thu Sep 27 17:55:00 2018 +0200

    Succeed on iSCSI detach when path just went down

    Change-Id: If801dfc2462c0d3f986eebd4108087139934610d
    Closes-Bug: #1794829
    (cherry-picked from commit b9c7bc2b597d944cbc404d6bf5fedc35d095a897)
    (cherry picked from commit b75411de2b2aadd1eafd2f8f8b1579df357bf09f)

tags: added: in-stable-queens
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/os-brick 2.3.5

This issue was fixed in the openstack/os-brick 2.3.5 release.

OpenStack Infra (hudson-openstack) wrote : Fix proposed to os-brick (stable/pike)

Fix proposed to branch: stable/pike
Review: https://review.openstack.org/647777

OpenStack Infra (hudson-openstack) wrote : Fix merged to os-brick (stable/pike)

Reviewed: https://review.openstack.org/647777
Committed: https://git.openstack.org/cgit/openstack/os-brick/commit/?id=97c1da1230261ac79c6045b6aa8690db597fcc8c
Submitter: Zuul
Branch: stable/pike

commit 97c1da1230261ac79c6045b6aa8690db597fcc8c
Author: Gorka Eguileor <email address hidden>
Date: Thu Sep 27 17:55:00 2018 +0200

    Succeed on iSCSI detach when path just went down

    Change-Id: If801dfc2462c0d3f986eebd4108087139934610d
    Closes-Bug: #1794829
    (cherry-picked from commit b9c7bc2b597d944cbc404d6bf5fedc35d095a897)
    (cherry picked from commit b75411de2b2aadd1eafd2f8f8b1579df357bf09f)
    (cherry picked from commit 9722aa7db81b1d67b9e4c7804034b680aeb10b17)

tags: added: in-stable-pike
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/os-brick 1.15.9

This issue was fixed in the openstack/os-brick 1.15.9 release.
