Multipath disconnect may fail when a path is down

Bug #1785669 reported by Gorka Eguileor
Affects: os-brick
Status: Fix Released
Importance: Undecided
Assigned to: Gorka Eguileor

Bug Description

Under certain conditions, detaching a multipath device may fail when flushing one of the individual paths, even though the disconnect should have succeeded because other paths were available to flush all the data.
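
The failing step is the flush of an individual path. As a minimal illustration (a Python sketch of the problematic pattern, not os-brick's actual code; the device names are made up), flushing every single path aborts the whole disconnect as soon as one path is unreachable:

import subprocess

def flush_single_paths(devices):
    # Flush each individual path before deleting it. If any one of these
    # paths is already down, blockdev exits non-zero with "No such device
    # or address" and the whole disconnect fails, even though the data
    # could still be flushed through the surviving paths.
    for dev in devices:
        subprocess.run(['blockdev', '--flushbufs', dev], check=True)

# Example (hypothetical devices):
# flush_single_paths(['/dev/sdam', '/dev/sdan'])  # fails if /dev/sdan is down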

Here's an example of the failure during a live migration; the same error can happen on a normal multipathed volume detach.

2018-07-06 12:57:29,570.570 32255 DEBUG oslo_messaging._drivers.amqpdriver [req-896a0a06-2810-42d3-a0c1-dff100fd6762 1eaf607c59da4f3c93930252dd1d4fe6 6da93dd1cfc8407f9f8f6693dbb0c606 - - -] CAST unique_id: 6d93265823de41af99b2f83f4d27c9b0 NOTIFY exchange 'nova' topic 'versioned_notifications.error' _send /usr/lib/python2.7/site-packages/oslo_messaging/_drivers/amqpdriver.py:479
2018-07-06 12:57:29,572.572 32255 WARNING nova.virt.libvirt.driver [req-896a0a06-2810-42d3-a0c1-dff100fd6762 1eaf607c59da4f3c93930252dd1d4fe6 6da93dd1cfc8407f9f8f6693dbb0c606 - - -] [instance: fe9aedfe-eee5-4ebc-926b-05a49dc950ff] Error monitoring migration: Unexpected error while running command.
Command: blockdev --flushbufs /dev/sdan
Exit code: 1
Stdout: u''
Stderr: u'blockdev: cannot open /dev/sdan: No such device or address\n'
2018-07-06 12:57:29,572.572 32255 ERROR nova.virt.libvirt.driver [instance: fe9aedfe-eee5-4ebc-926b-05a49dc950ff] Traceback (most recent call last):
2018-07-06 12:57:29,572.572 32255 ERROR nova.virt.libvirt.driver [instance: fe9aedfe-eee5-4ebc-926b-05a49dc950ff] File "/usr/lib/python2.7/site-packages/nova/virt/libvirt/driver.py", line 6406, in _live_migration
2018-07-06 12:57:29,572.572 32255 ERROR nova.virt.libvirt.driver [instance: fe9aedfe-eee5-4ebc-926b-05a49dc950ff] finish_event, disk_paths)
2018-07-06 12:57:29,572.572 32255 ERROR nova.virt.libvirt.driver [instance: fe9aedfe-eee5-4ebc-926b-05a49dc950ff] File "/usr/lib/python2.7/site-packages/nova/virt/libvirt/driver.py", line 6316, in _live_migration_monitor
2018-07-06 12:57:29,572.572 32255 ERROR nova.virt.libvirt.driver [instance: fe9aedfe-eee5-4ebc-926b-05a49dc950ff] migrate_data)
2018-07-06 12:57:29,572.572 32255 ERROR nova.virt.libvirt.driver [instance: fe9aedfe-eee5-4ebc-926b-05a49dc950ff] File "/usr/lib/python2.7/site-packages/nova/exception_wrapper.py", line 75, in wrapped
2018-07-06 12:57:29,572.572 32255 ERROR nova.virt.libvirt.driver [instance: fe9aedfe-eee5-4ebc-926b-05a49dc950ff] function_name, call_dict, binary)
2018-07-06 12:57:29,572.572 32255 ERROR nova.virt.libvirt.driver [instance: fe9aedfe-eee5-4ebc-926b-05a49dc950ff] File "/usr/lib/python2.7/site-packages/oslo_utils/excutils.py", line 220, in __exit__
2018-07-06 12:57:29,572.572 32255 ERROR nova.virt.libvirt.driver [instance: fe9aedfe-eee5-4ebc-926b-05a49dc950ff] self.force_reraise()
2018-07-06 12:57:29,572.572 32255 ERROR nova.virt.libvirt.driver [instance: fe9aedfe-eee5-4ebc-926b-05a49dc950ff] File "/usr/lib/python2.7/site-packages/oslo_utils/excutils.py", line 196, in force_reraise
2018-07-06 12:57:29,572.572 32255 ERROR nova.virt.libvirt.driver [instance: fe9aedfe-eee5-4ebc-926b-05a49dc950ff] six.reraise(self.type_, self.value, self.tb)
2018-07-06 12:57:29,572.572 32255 ERROR nova.virt.libvirt.driver [instance: fe9aedfe-eee5-4ebc-926b-05a49dc950ff] File "/usr/lib/python2.7/site-packages/nova/exception_wrapper.py", line 66, in wrapped
2018-07-06 12:57:29,572.572 32255 ERROR nova.virt.libvirt.driver [instance: fe9aedfe-eee5-4ebc-926b-05a49dc950ff] return f(self, context, *args, **kw)
2018-07-06 12:57:29,572.572 32255 ERROR nova.virt.libvirt.driver [instance: fe9aedfe-eee5-4ebc-926b-05a49dc950ff] File "/usr/lib/python2.7/site-packages/nova/compute/manager.py", line 216, in decorated_function
2018-07-06 12:57:29,572.572 32255 ERROR nova.virt.libvirt.driver [instance: fe9aedfe-eee5-4ebc-926b-05a49dc950ff] kwargs['instance'], e, sys.exc_info())
2018-07-06 12:57:29,572.572 32255 ERROR nova.virt.libvirt.driver [instance: fe9aedfe-eee5-4ebc-926b-05a49dc950ff] File "/usr/lib/python2.7/site-packages/oslo_utils/excutils.py", line 220, in __exit__
2018-07-06 12:57:29,572.572 32255 ERROR nova.virt.libvirt.driver [instance: fe9aedfe-eee5-4ebc-926b-05a49dc950ff] self.force_reraise()
2018-07-06 12:57:29,572.572 32255 ERROR nova.virt.libvirt.driver [instance: fe9aedfe-eee5-4ebc-926b-05a49dc950ff] File "/usr/lib/python2.7/site-packages/oslo_utils/excutils.py", line 196, in force_reraise
2018-07-06 12:57:29,572.572 32255 ERROR nova.virt.libvirt.driver [instance: fe9aedfe-eee5-4ebc-926b-05a49dc950ff] six.reraise(self.type_, self.value, self.tb)
2018-07-06 12:57:29,572.572 32255 ERROR nova.virt.libvirt.driver [instance: fe9aedfe-eee5-4ebc-926b-05a49dc950ff] File "/usr/lib/python2.7/site-packages/nova/compute/manager.py", line 204, in decorated_function
2018-07-06 12:57:29,572.572 32255 ERROR nova.virt.libvirt.driver [instance: fe9aedfe-eee5-4ebc-926b-05a49dc950ff] return function(self, context, *args, **kwargs)
2018-07-06 12:57:29,572.572 32255 ERROR nova.virt.libvirt.driver [instance: fe9aedfe-eee5-4ebc-926b-05a49dc950ff] File "/usr/lib/python2.7/site-packages/nova/compute/manager.py", line 5437, in _post_live_migration
2018-07-06 12:57:29,572.572 32255 ERROR nova.virt.libvirt.driver [instance: fe9aedfe-eee5-4ebc-926b-05a49dc950ff] migrate_data)
2018-07-06 12:57:29,572.572 32255 ERROR nova.virt.libvirt.driver [instance: fe9aedfe-eee5-4ebc-926b-05a49dc950ff] File "/usr/lib/python2.7/site-packages/nova/virt/libvirt/driver.py", line 6774, in post_live_migration
2018-07-06 12:57:29,572.572 32255 ERROR nova.virt.libvirt.driver [instance: fe9aedfe-eee5-4ebc-926b-05a49dc950ff] self._disconnect_volume(connection_info, disk_dev)
2018-07-06 12:57:29,572.572 32255 ERROR nova.virt.libvirt.driver [instance: fe9aedfe-eee5-4ebc-926b-05a49dc950ff] File "/usr/lib/python2.7/site-packages/nova/virt/libvirt/driver.py", line 1104, in _disconnect_volume
2018-07-06 12:57:29,572.572 32255 ERROR nova.virt.libvirt.driver [instance: fe9aedfe-eee5-4ebc-926b-05a49dc950ff] vol_driver.disconnect_volume(connection_info, disk_dev)
2018-07-06 12:57:29,572.572 32255 ERROR nova.virt.libvirt.driver [instance: fe9aedfe-eee5-4ebc-926b-05a49dc950ff] File "/usr/lib/python2.7/site-packages/nova/virt/libvirt/volume/iscsi.py", line 74, in disconnect_volume
2018-07-06 12:57:29,572.572 32255 ERROR nova.virt.libvirt.driver [instance: fe9aedfe-eee5-4ebc-926b-05a49dc950ff] self.connector.disconnect_volume(connection_info['data'], None)
2018-07-06 12:57:29,572.572 32255 ERROR nova.virt.libvirt.driver [instance: fe9aedfe-eee5-4ebc-926b-05a49dc950ff] File "/usr/lib/python2.7/site-packages/os_brick/utils.py", line 145, in trace_logging_wrapper
2018-07-06 12:57:29,572.572 32255 ERROR nova.virt.libvirt.driver [instance: fe9aedfe-eee5-4ebc-926b-05a49dc950ff] result = f(*args, **kwargs)
2018-07-06 12:57:29,572.572 32255 ERROR nova.virt.libvirt.driver [instance: fe9aedfe-eee5-4ebc-926b-05a49dc950ff] File "/usr/lib/python2.7/site-packages/oslo_concurrency/lockutils.py", line 271, in inner
2018-07-06 12:57:29,572.572 32255 ERROR nova.virt.libvirt.driver [instance: fe9aedfe-eee5-4ebc-926b-05a49dc950ff] return f(*args, **kwargs)
2018-07-06 12:57:29,572.572 32255 ERROR nova.virt.libvirt.driver [instance: fe9aedfe-eee5-4ebc-926b-05a49dc950ff] File "/usr/lib/python2.7/site-packages/os_brick/initiator/connectors/iscsi.py", line 830, in disconnect_volume
2018-07-06 12:57:29,572.572 32255 ERROR nova.virt.libvirt.driver [instance: fe9aedfe-eee5-4ebc-926b-05a49dc950ff] ignore_errors=ignore_errors)
2018-07-06 12:57:29,572.572 32255 ERROR nova.virt.libvirt.driver [instance: fe9aedfe-eee5-4ebc-926b-05a49dc950ff] File "/usr/lib/python2.7/site-packages/os_brick/initiator/connectors/iscsi.py", line 867, in _cleanup_connection
2018-07-06 12:57:29,572.572 32255 ERROR nova.virt.libvirt.driver [instance: fe9aedfe-eee5-4ebc-926b-05a49dc950ff] force, exc)
2018-07-06 12:57:29,572.572 32255 ERROR nova.virt.libvirt.driver [instance: fe9aedfe-eee5-4ebc-926b-05a49dc950ff] File "/usr/lib/python2.7/site-packages/os_brick/initiator/linuxscsi.py", line 226, in remove_connection
2018-07-06 12:57:29,572.572 32255 ERROR nova.virt.libvirt.driver [instance: fe9aedfe-eee5-4ebc-926b-05a49dc950ff] self.remove_scsi_device('/dev/' + device_name, force, exc)
2018-07-06 12:57:29,572.572 32255 ERROR nova.virt.libvirt.driver [instance: fe9aedfe-eee5-4ebc-926b-05a49dc950ff] File "/usr/lib/python2.7/site-packages/os_brick/initiator/linuxscsi.py", line 73, in remove_scsi_device
2018-07-06 12:57:29,572.572 32255 ERROR nova.virt.libvirt.driver [instance: fe9aedfe-eee5-4ebc-926b-05a49dc950ff] self.flush_device_io(device)
2018-07-06 12:57:29,572.572 32255 ERROR nova.virt.libvirt.driver [instance: fe9aedfe-eee5-4ebc-926b-05a49dc950ff] File "/usr/lib/python2.7/site-packages/os_brick/initiator/linuxscsi.py", line 256, in flush_device_io
2018-07-06 12:57:29,572.572 32255 ERROR nova.virt.libvirt.driver [instance: fe9aedfe-eee5-4ebc-926b-05a49dc950ff] interval=10, root_helper=self._root_helper)
2018-07-06 12:57:29,572.572 32255 ERROR nova.virt.libvirt.driver [instance: fe9aedfe-eee5-4ebc-926b-05a49dc950ff] File "/usr/lib/python2.7/site-packages/os_brick/executor.py", line 52, in _execute
2018-07-06 12:57:29,572.572 32255 ERROR nova.virt.libvirt.driver [instance: fe9aedfe-eee5-4ebc-926b-05a49dc950ff] result = self.__execute(*args, **kwargs)
2018-07-06 12:57:29,572.572 32255 ERROR nova.virt.libvirt.driver [instance: fe9aedfe-eee5-4ebc-926b-05a49dc950ff] File "/usr/lib/python2.7/site-packages/os_brick/privileged/rootwrap.py", line 169, in execute
2018-07-06 12:57:29,572.572 32255 ERROR nova.virt.libvirt.driver [instance: fe9aedfe-eee5-4ebc-926b-05a49dc950ff] return execute_root(*cmd, **kwargs)
2018-07-06 12:57:29,572.572 32255 ERROR nova.virt.libvirt.driver [instance: fe9aedfe-eee5-4ebc-926b-05a49dc950ff] File "/usr/lib/python2.7/site-packages/oslo_privsep/priv_context.py", line 204, in _wrap
2018-07-06 12:57:29,572.572 32255 ERROR nova.virt.libvirt.driver [instance: fe9aedfe-eee5-4ebc-926b-05a49dc950ff] return self.channel.remote_call(name, args, kwargs)
2018-07-06 12:57:29,572.572 32255 ERROR nova.virt.libvirt.driver [instance: fe9aedfe-eee5-4ebc-926b-05a49dc950ff] File "/usr/lib/python2.7/site-packages/oslo_privsep/daemon.py", line 187, in remote_call
2018-07-06 12:57:29,572.572 32255 ERROR nova.virt.libvirt.driver [instance: fe9aedfe-eee5-4ebc-926b-05a49dc950ff] raise exc_type(*result[2])
2018-07-06 12:57:29,572.572 32255 ERROR nova.virt.libvirt.driver [instance: fe9aedfe-eee5-4ebc-926b-05a49dc950ff] ProcessExecutionError: Unexpected error while running command.
2018-07-06 12:57:29,572.572 32255 ERROR nova.virt.libvirt.driver [instance: fe9aedfe-eee5-4ebc-926b-05a49dc950ff] Command: blockdev --flushbufs /dev/sdan
2018-07-06 12:57:29,572.572 32255 ERROR nova.virt.libvirt.driver [instance: fe9aedfe-eee5-4ebc-926b-05a49dc950ff] Exit code: 1
2018-07-06 12:57:29,572.572 32255 ERROR nova.virt.libvirt.driver [instance: fe9aedfe-eee5-4ebc-926b-05a49dc950ff] Stdout: u''
2018-07-06 12:57:29,572.572 32255 ERROR nova.virt.libvirt.driver [instance: fe9aedfe-eee5-4ebc-926b-05a49dc950ff] Stderr: u'blockdev: cannot open /dev/sdan: No such device or address\n'
2018-07-06 12:57:29,572.572 32255 ERROR nova.virt.libvirt.driver [instance: fe9aedfe-eee5-4ebc-926b-05a49dc950ff]
2018-07-06 12:57:29,573.573 32255 DEBUG nova.virt.libvirt.driver [req-896a0a06-2810-42d3-a0c1-dff100fd6762 1eaf607c59da4f3c93930252dd1d4fe6 6da93dd1cfc8407f9f8f6693dbb0c606 - - -] [instance: fe9aedfe-eee5-4ebc-926b-05a49dc950ff] Live migration monitoring is all done _live_migration /usr/lib/python2.7/site-packages/nova/virt/libvirt/driver.py:6413
2018-07-06 12:57:29,573.573 32255 ERROR nova.compute.manager [req-896a0a06-2810-42d3-a0c1-dff100fd6762 1eaf607c59da4f3c93930252dd1d4fe6 6da93dd1cfc8407f9f8f6693dbb0c606 - - -] [instance: fe9aedfe-eee5-4ebc-926b-05a49dc950ff] Live migration failed.

Gorka Eguileor (gorka)
affects: cinder → os-brick
Changed in os-brick:
assignee: nobody → Gorka Eguileor (gorka)
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to os-brick (master)

Fix proposed to branch: master
Review: https://review.openstack.org/589235

Changed in os-brick:
status: New → In Progress
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to os-brick (stable/rocky)

Fix proposed to branch: stable/rocky
Review: https://review.openstack.org/592579

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to os-brick (master)

Reviewed: https://review.openstack.org/589235
Committed: https://git.openstack.org/cgit/openstack/os-brick/commit/?id=d866ee75c23e05efee5688a948498831980d98e3
Submitter: Zuul
Branch: master

commit d866ee75c23e05efee5688a948498831980d98e3
Author: Gorka Eguileor <email address hidden>
Date: Mon Aug 6 18:47:27 2018 +0200

    Fix multipath disconnect with path failure

    Under certain conditions detaching a multipath device may result in a
    failure when flushing one of the individual paths, even though the
    disconnect should have succeeded because there were other paths
    available to flush all the data.

    OS-Brick currently follows the standard recommended disconnect
    mechanism for multipath devices:

    - Release all device holders
    - Flush multipath
    - Flush single paths
    - Delete single devices

    The problem is that this procedure includes an unnecessary step,
    flushing the individual single paths, which may result in an error.

    Originally it was thought that the individual flushes were necessary to
    prevent data loss, but upon further study of the multipath-tools and the
    device-mapper code it was discovered that this is not really the case.

    After the multipath flush has completed, we can be sure that the data
    has been successfully sent and acknowledged by the device.

    Closes-Bug: #1785669
    Change-Id: I10f7fea2d69d5d9011f0d5486863a8d9d8a9696e
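
For illustration, here is a rough Python sketch of the disconnect sequence the commit describes once the per-path flush is dropped. The function and device names are hypothetical and do not correspond to os-brick's actual API; the commands require root privileges, and the "release all device holders" step from the list above is omitted for brevity.

import subprocess

def disconnect_multipath(mpath_name, path_devices):
    # 1. Flush the multipath device map. Once this returns successfully,
    #    all pending data has been sent and acknowledged by the backend
    #    through whichever paths are still alive.
    subprocess.run(['multipath', '-f', mpath_name], check=True)

    # 2. Delete each individual SCSI path without flushing it first, so a
    #    path that is already down cannot fail the whole disconnect.
    for dev in path_devices:
        name = dev.rsplit('/', 1)[-1]
        with open('/sys/block/%s/device/delete' % name, 'w') as f:
            f.write('1')

# Example (hypothetical names):
# disconnect_multipath('mpatha', ['/dev/sdam', '/dev/sdan'])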

Changed in os-brick:
status: In Progress → Fix Released
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to os-brick (stable/rocky)

Reviewed: https://review.openstack.org/592579
Committed: https://git.openstack.org/cgit/openstack/os-brick/commit/?id=b253e3169625a9617a528f9f989808fb041780db
Submitter: Zuul
Branch: stable/rocky

commit b253e3169625a9617a528f9f989808fb041780db
Author: Gorka Eguileor <email address hidden>
Date: Mon Aug 6 18:47:27 2018 +0200

    Fix multipath disconnect with path failure

    Under certain conditions detaching a multipath device may result in a
    failure when flushing one of the individual paths, even though the
    disconnect should have succeeded because there were other paths
    available to flush all the data.

    OS-Brick currently follows the standard recommended disconnect
    mechanism for multipath devices:

    - Release all device holders
    - Flush multipath
    - Flush single paths
    - Delete single devices

    The problem is that this procedure includes an unnecessary step,
    flushing the individual single paths, which may result in an error.

    Originally it was thought that the individual flushes were necessary to
    prevent data loss, but upon further study of the multipath-tools and the
    device-mapper code it was discovered that this is not really the case.

    After the multipath flush has completed, we can be sure that the data
    has been successfully sent and acknowledged by the device.

    Closes-Bug: #1785669
    Change-Id: I10f7fea2d69d5d9011f0d5486863a8d9d8a9696e
    (cherry picked from commit d866ee75c23e05efee5688a948498831980d98e3)

tags: added: in-stable-rocky
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to os-brick (stable/queens)

Fix proposed to branch: stable/queens
Review: https://review.openstack.org/594436

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to os-brick (stable/queens)

Reviewed: https://review.openstack.org/594436
Committed: https://git.openstack.org/cgit/openstack/os-brick/commit/?id=cd9da93ee12b1ba6145ec5b196cc02f01b84db3f
Submitter: Zuul
Branch: stable/queens

commit cd9da93ee12b1ba6145ec5b196cc02f01b84db3f
Author: Gorka Eguileor <email address hidden>
Date: Mon Aug 6 18:47:27 2018 +0200

    Fix multipath disconnect with path failure

    Under certain conditions detaching a multipath device may result in a
    failure when flushing one of the individual paths, even though the
    disconnect should have succeeded because there were other paths
    available to flush all the data.

    OS-Brick currently follows the standard recommended disconnect
    mechanism for multipath devices:

    - Release all device holders
    - Flush multipath
    - Flush single paths
    - Delete single devices

    The problem is that this procedure includes an unnecessary step,
    flushing the individual single paths, which may result in an error.

    Originally it was thought that the individual flushes were necessary to
    prevent data loss, but upon further study of the multipath-tools and the
    device-mapper code it was discovered that this is not really the case.

    After the multipath flush has completed, we can be sure that the data
    has been successfully sent and acknowledged by the device.

    Closes-Bug: #1785669
    Change-Id: I10f7fea2d69d5d9011f0d5486863a8d9d8a9696e
    (cherry picked from commit d866ee75c23e05efee5688a948498831980d98e3)
    (cherry picked from commit b253e3169625a9617a528f9f989808fb041780db)

tags: added: in-stable-queens
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to os-brick (stable/pike)

Fix proposed to branch: stable/pike
Review: https://review.openstack.org/594777

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to os-brick (stable/pike)

Reviewed: https://review.openstack.org/594777
Committed: https://git.openstack.org/cgit/openstack/os-brick/commit/?id=d8a807d7eabb783ee3e5cd0bb522d3d2c7c35eab
Submitter: Zuul
Branch: stable/pike

commit d8a807d7eabb783ee3e5cd0bb522d3d2c7c35eab
Author: Gorka Eguileor <email address hidden>
Date: Mon Aug 6 18:47:27 2018 +0200

    Fix multipath disconnect with path failure

    Under certain conditions detaching a multipath device may result in a
    failure when flushing one of the individual paths, even though the
    disconnect should have succeeded because there were other paths
    available to flush all the data.

    OS-Brick currently follows the standard recommended disconnect
    mechanism for multipath devices:

    - Release all device holders
    - Flush multipath
    - Flush single paths
    - Delete single devices

    The problem is that this procedure includes an unnecessary step,
    flushing the individual single paths, which may result in an error.

    Originally it was thought that the individual flushes were necessary to
    prevent data loss, but upon further study of the multipath-tools and the
    device-mapper code it was discovered that this is not really the case.

    After the multipath flush has completed, we can be sure that the data
    has been successfully sent and acknowledged by the device.

    Closes-Bug: #1785669
    Change-Id: I10f7fea2d69d5d9011f0d5486863a8d9d8a9696e
    (cherry picked from commit d866ee75c23e05efee5688a948498831980d98e3)
    (cherry picked from commit b253e3169625a9617a528f9f989808fb041780db)
    (cherry picked from commit cd9da93ee12b1ba6145ec5b196cc02f01b84db3f)

tags: added: in-stable-pike
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/os-brick 2.6.0

This issue was fixed in the openstack/os-brick 2.6.0 release.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/os-brick 1.15.6

This issue was fixed in the openstack/os-brick 1.15.6 release.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/os-brick 2.3.4

This issue was fixed in the openstack/os-brick 2.3.4 release.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/os-brick 2.5.4

This issue was fixed in the openstack/os-brick 2.5.4 release.
