During live-migration Nova expects identical IQN from attached volume(s)

Bug #1423772 reported by Sean Severson
This bug affects 4 people
| Affects | Status | Importance | Assigned to | Milestone |
| OpenStack Compute (nova) | Fix Released | Medium | Anthony Lee | |
| Juno | Fix Released | Undecided | Unassigned | |
| Kilo | Fix Released | Undecided | Unassigned | |

Bug Description

When attempting a live-migration of an instance with one or more attached volumes, Nova expects the IQN to be exactly the same when it attaches the volume(s) to the new host. This conflicts with Cinder settings such as "hp3par_iscsi_ips", which allows multiple IPs for the purpose of load balancing.

Example:
An instance on Host A has a volume attached at "/dev/disk/by-path/ip-10.10.220.244:3260-iscsi-iqn.2000-05.com.3pardata:22210002ac002a13-lun-2"
An attempt is made to migrate the instance to Host B.
Cinder sends the request to attach the volume to the new host.
Cinder gives the new host "/dev/disk/by-path/ip-10.10.120.244:3260-iscsi-iqn.2000-05.com.3pardata:22210002ac002a13-lun-2"
Nova looks for the volume on the new host at the old location "/dev/disk/by-path/ip-10.10.220.244:3260-iscsi-iqn.2000-05.com.3pardata:22210002ac002a13-lun-2"
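
The mismatch in the example above can be reproduced with a few lines of Python. This is only a sketch: the helper name iscsi_by_path is hypothetical (it is not a Nova or os-brick function), and it simply assembles the by-path name in the same "ip-<portal>-iscsi-<iqn>-lun-<lun>" layout seen in the paths quoted in this report.

```python
def iscsi_by_path(target_portal: str, target_iqn: str, target_lun: int) -> str:
    """Hypothetical helper: build the /dev/disk/by-path name for an iSCSI LUN."""
    return "/dev/disk/by-path/ip-%s-iscsi-%s-lun-%s" % (
        target_portal, target_iqn, target_lun)


IQN = "iqn.2000-05.com.3pardata:22210002ac002a13"

# Host A reaches the array through one portal IP ...
source_path = iscsi_by_path("10.10.220.244:3260", IQN, 2)
# ... while Cinder hands Host B a different portal IP for the same volume.
dest_path = iscsi_by_path("10.10.120.244:3260", IQN, 2)

# Same IQN and LUN, but the device node name still differs between hosts.
paths_match = source_path == dest_path

print(source_path)
print(dest_path)
print("paths match:", paths_match)
```

Because the portal IP is part of the device node name, a path recorded on Host A cannot be assumed to exist on Host B, even when the IQN and LUN are unchanged.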

The following error appears in n-cpu in this case:

2015-02-19 17:09:05.574 ERROR nova.virt.libvirt.driver [-] [instance: b6fa616f-4e78-42b1-a747-9d081a4701df] Live Migration failure: Failed to open file '/dev/disk/by-path/ip-10.10.220.244:3260-iscsi-iqn.2000-05.com.3pardata:22210002ac002a13-lun-2': No such file or directory
Traceback (most recent call last):
  File "/usr/local/lib/python2.7/dist-packages/eventlet/hubs/poll.py", line 115, in wait
    listener.cb(fileno)
  File "/usr/local/lib/python2.7/dist-packages/eventlet/greenthread.py", line 212, in main
    result = function(*args, **kwargs)
  File "/opt/stack/nova/nova/virt/libvirt/driver.py", line 5426, in _live_migration
    recover_method(context, instance, dest, block_migration)
  File "/opt/stack/nova/nova/openstack/common/excutils.py", line 82, in __exit__
    six.reraise(self.type_, self.value, self.tb)
  File "/opt/stack/nova/nova/virt/libvirt/driver.py", line 5393, in _live_migration
    CONF.libvirt.live_migration_bandwidth)
  File "/usr/local/lib/python2.7/dist-packages/eventlet/tpool.py", line 183, in doit
    result = proxy_call(self._autowrap, f, *args, **kwargs)
  File "/usr/local/lib/python2.7/dist-packages/eventlet/tpool.py", line 141, in proxy_call
    rv = execute(f, *args, **kwargs)
  File "/usr/local/lib/python2.7/dist-packages/eventlet/tpool.py", line 122, in execute
    six.reraise(c, e, tb)
  File "/usr/local/lib/python2.7/dist-packages/eventlet/tpool.py", line 80, in tworker
    rv = meth(*args, **kwargs)
  File "/usr/lib/python2.7/dist-packages/libvirt.py", line 1582, in migrateToURI2
    if ret == -1: raise libvirtError ('virDomainMigrateToURI2() failed', dom=self)
libvirtError: Failed to open file '/dev/disk/by-path/ip-10.10.220.244:3260-iscsi-iqn.2000-05.com.3pardata:22210002ac002a13-lun-2': No such file or directory
Removing descriptor: 3

When looking at the nova DB, this is the state of block_device_mapping prior to the migration attempt:

mysql> select * from block_device_mapping where instance_uuid='b6fa616f-4e78-42b1-a747-9d081a4701df' and deleted=0;

| created_at | updated_at | deleted_at | id | device_name | delete_on_termination | snapshot_id | volume_id | volume_size | no_device | connection_info | instance_uuid | deleted | source_type | destination_type | guest_format | device_type | disk_bus | boot_index | image_id |

| 2015-02-20 00:57:01 | 2015-02-20 01:03:06 | NULL | 3 | /dev/vda | 0 | NULL | e031b804-f824-45f1-a9fa-b9330ba061e0 | NULL | NULL | {"driver_volume_type": "iscsi", "serial": "e031b804-f824-45f1-a9fa-b9330ba061e0", "data": {"host_device": "/dev/disk/by-path/ip-10.10.220.244:3260-iscsi-iqn.2000-05.com.3pardata:22210002ac002a13-lun-1", "target_discovered": true, "qos_specs": null, "target_iqn": "iqn.2000-05.com.3pardata:22210002ac002a13", "target_portal": "10.10.220.244:3260", "target_lun": 1, "access_mode": "rw"}} | b6fa616f-4e78-42b1-a747-9d081a4701df | 0 | volume | volume | NULL | disk | virtio | 0 | NULL |
| 2015-02-20 01:07:11 | 2015-02-20 01:07:19 | NULL | 5 | /dev/vdb | 0 | NULL | c3009a3d-549d-4ee5-b7a6-b0eac6382beb | NULL | NULL | {"driver_volume_type": "iscsi", "serial": "c3009a3d-549d-4ee5-b7a6-b0eac6382beb", "data": {"device_path": "/dev/disk/by-path/ip-10.10.220.244:3260-iscsi-iqn.2000-05.com.3pardata:22210002ac002a13-lun-2", "host_device": "/dev/disk/by-path/ip-10.10.220.244:3260-iscsi-iqn.2000-05.com.3pardata:22210002ac002a13-lun-2", "target_discovered": true, "qos_specs": null, "target_iqn": "iqn.2000-05.com.3pardata:22210002ac002a13", "target_portal": "10.10.220.244:3260", "target_lun": 2, "access_mode": "rw"}} | b6fa616f-4e78-42b1-a747-9d081a4701df | 0 | volume | volume | NULL | NULL | NULL | NULL | NULL |

2 rows in set (0.00 sec)

Then after the error message comes up in n-cpu:


| created_at | updated_at | deleted_at | id | device_name | delete_on_termination | snapshot_id | volume_id | volume_size | no_device | connection_info | instance_uuid | deleted | source_type | destination_type | guest_format | device_type | disk_bus | boot_index | image_id |

| 2015-02-20 00:57:01 | 2015-02-20 01:08:55 | NULL | 3 | /dev/vda | 0 | NULL | e031b804-f824-45f1-a9fa-b9330ba061e0 | NULL | NULL | {"driver_volume_type": "iscsi", "serial": "e031b804-f824-45f1-a9fa-b9330ba061e0", "data": {"target_discovered": true, "qos_specs": null, "target_iqn": "iqn.2000-05.com.3pardata:22210002ac002a13", "target_portal": "10.10.220.244:3260", "target_lun": 0, "access_mode": "rw"}} | b6fa616f-4e78-42b1-a747-9d081a4701df | 0 | volume | volume | NULL | disk | virtio | 0 | NULL |
| 2015-02-20 01:07:11 | 2015-02-20 01:08:57 | NULL | 5 | /dev/vdb | 0 | NULL | c3009a3d-549d-4ee5-b7a6-b0eac6382beb | NULL | NULL | {"driver_volume_type": "iscsi", "serial": "c3009a3d-549d-4ee5-b7a6-b0eac6382beb", "data": {"target_discovered": true, "qos_specs": null, "target_iqn": "iqn.2000-05.com.3pardata:22210002ac002a13", "target_portal": "10.10.220.244:3260", "target_lun": 1, "access_mode": "rw"}} | b6fa616f-4e78-42b1-a747-9d081a4701df | 0 | volume | volume | NULL | NULL | NULL | NULL | NULL |
+---------------------+---------------------+------------+----+-------------+-----------------------+-------------+--------------------------------------+-------------+-----------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+--------------------------------------+---------+-------------+------------------+--------------+-------------+----------+------------+----------+
2 rows in set (0.00 sec)
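
To make the difference between the two dumps easier to see, the connection_info for /dev/vdb can be compared key by key. The snippet below is a minimal sketch using only the JSON values copied from the rows above; the function name diff_connection_data is made up for illustration.

```python
import json

# connection_info of /dev/vdb before the migration attempt (from the first dump).
BEFORE = json.loads("""
{"driver_volume_type": "iscsi",
 "serial": "c3009a3d-549d-4ee5-b7a6-b0eac6382beb",
 "data": {"device_path": "/dev/disk/by-path/ip-10.10.220.244:3260-iscsi-iqn.2000-05.com.3pardata:22210002ac002a13-lun-2",
          "host_device": "/dev/disk/by-path/ip-10.10.220.244:3260-iscsi-iqn.2000-05.com.3pardata:22210002ac002a13-lun-2",
          "target_discovered": true, "qos_specs": null,
          "target_iqn": "iqn.2000-05.com.3pardata:22210002ac002a13",
          "target_portal": "10.10.220.244:3260",
          "target_lun": 2, "access_mode": "rw"}}
""")

# connection_info of /dev/vdb after the failure (from the second dump).
AFTER = json.loads("""
{"driver_volume_type": "iscsi",
 "serial": "c3009a3d-549d-4ee5-b7a6-b0eac6382beb",
 "data": {"target_discovered": true, "qos_specs": null,
          "target_iqn": "iqn.2000-05.com.3pardata:22210002ac002a13",
          "target_portal": "10.10.220.244:3260",
          "target_lun": 1, "access_mode": "rw"}}
""")


def diff_connection_data(old: dict, new: dict) -> dict:
    """Return {key: (old_value, new_value)} for every key whose value differs.

    A key that is absent on one side is reported as None on that side.
    """
    changes = {}
    for key in sorted(set(old) | set(new)):
        if old.get(key) != new.get(key):
            changes[key] = (old.get(key), new.get(key))
    return changes


changes = diff_connection_data(BEFORE["data"], AFTER["data"])
for key, (old_value, new_value) in changes.items():
    print(key, ":", old_value, "->", new_value)
```

For this row, the comparison reports that device_path and host_device are gone and that target_lun changed from 2 to 1, while target_iqn and target_portal are unchanged.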

Changed in nova:
status: New → Confirmed
importance: Undecided → Medium
Sean Severson (sseverson) wrote :

I was able to reproduce this in Kilo (the original discovery was in Juno), with the same error:

2015-02-20 16:12:52.802 ERROR nova.virt.libvirt.driver [-] [instance: 93ba0373-15cd-4e83-845d-4cfaf7c11416] Live Migration failure: Failed to open file '/dev/disk/by-path/ip-10.10.220.244:3260-iscsi-iqn.2000-05.com.3pardata:22210002ac002a13-lun-1': No such file or directory
2015-02-20 16:12:54.381 ERROR root [-] Original exception being dropped: ['Traceback (most recent call last):\n', ' File "/opt/stack/nova/nova/virt/libvirt/driver.py", line 5249, in _live_migration\n CONF.libvirt.live_migration_bandwidth)\n', ' File "/usr/local/lib/python2.7/dist-packages/eventlet/tpool.py", line 183, in doit\n result = proxy_call(self._autowrap, f, *args, **kwargs)\n', ' File "/usr/local/lib/python2.7/dist-packages/eventlet/tpool.py", line 141, in proxy_call\n rv = execute(f, *args, **kwargs)\n', ' File "/usr/local/lib/python2.7/dist-packages/eventlet/tpool.py", line 122, in execute\n six.reraise(c, e, tb)\n', ' File "/usr/local/lib/python2.7/dist-packages/eventlet/tpool.py", line 80, in tworker\n rv = meth(*args, **kwargs)\n', ' File "/usr/lib/python2.7/dist-packages/libvirt.py", line 1582, in migrateToURI2\n if ret == -1: raise libvirtError (\'virDomainMigrateToURI2() failed\', dom=self)\n', "libvirtError: Failed to open file '/dev/disk/by-path/ip-10.10.220.244:3260-iscsi-iqn.2000-05.com.3pardata:22210002ac002a13-lun-1': No such file or directory\n"]
Traceback (most recent call last):
  File "/usr/local/lib/python2.7/dist-packages/eventlet/hubs/poll.py", line 115, in wait
    listener.cb(fileno)
  File "/usr/local/lib/python2.7/dist-packages/eventlet/greenthread.py", line 214, in main
    result = function(*args, **kwargs)
  File "/opt/stack/nova/nova/virt/libvirt/driver.py", line 5282, in _live_migration
    recover_method(context, instance, dest, block_migration)
  File "/opt/stack/nova/nova/exception.py", line 88, in wrapped
    payload)
  File "/usr/local/lib/python2.7/dist-packages/oslo_utils/excutils.py", line 82, in __exit__
    six.reraise(self.type_, self.value, self.tb)
  File "/opt/stack/nova/nova/exception.py", line 71, in wrapped
    return f(self, context, *args, **kw)
  File "/opt/stack/nova/nova/compute/manager.py", line 324, in decorated_function
    kwargs['instance'], e, sys.exc_info())
  File "/usr/local/lib/python2.7/dist-packages/oslo_utils/excutils.py", line 82, in __exit__
    six.reraise(self.type_, self.value, self.tb)
  File "/opt/stack/nova/nova/compute/manager.py", line 312, in decorated_function
    return function(self, context, *args, **kwargs)
  File "/opt/stack/nova/nova/compute/manager.py", line 5297, in _rollback_live_migration
    context, instance, bdm.volume_id, dest)
  File "/opt/stack/nova/nova/compute/rpcapi.py", line 677, in remove_volume_connection
    instance=instance, volume_id=volume_id)
  File "/usr/local/lib/python2.7/dist-packages/oslo_messaging/rpc/client.py", line 156, in call
    retry=self.retry)
  File "/usr/local/lib/python2.7/dist-packages/oslo_messaging/transport.py", line 90, in _send
    timeout=...


Joe T (joe-topjian-v) wrote :

We're running into this same issue on Icehouse. We have a central storage appliance that is attaching volumes over iSCSI. I'm happy to provide further information or logs.

Joe T (joe-topjian-v) wrote :

This patch fixed the issue for us:

https://review.openstack.org/#/c/137466/

We're running Icehouse, but it was fairly trivial to make the changes.

ugvddm (271025598-9)
Changed in nova:
assignee: nobody → ugvddm (271025598-9)
ugvddm (271025598-9)
Changed in nova:
assignee: ugvddm (271025598-9) → nobody
tags: added: live-migrate
Changed in nova:
assignee: nobody → Anthony Lee (anthony-mic-lee)
Changed in nova:
status: Confirmed → In Progress
OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (stable/kilo)

Fix proposed to branch: stable/kilo
Review: https://review.openstack.org/211051

OpenStack Infra (hudson-openstack) wrote : Fix merged to nova (master)

Reviewed: https://review.openstack.org/202770
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=8b649aa86fb26e998d66e75e5cebfd19c396942d
Submitter: Jenkins
Branch: master

commit 8b649aa86fb26e998d66e75e5cebfd19c396942d
Author: Anthony Lee <email address hidden>
Date: Thu Jul 16 13:02:00 2015 -0700

    Fix live-migrations usage of the wrong connector information

    During the post_live_migration step for the Nova libvirt driver
    an incorrect assumption is being made about the connector
    information being sent to _disconnect_volume. It is assumed that
    the connection information on the source and destination is the
    same but that is not always the case. The BDM, where the
    connector information is being retrieved from only contains the
    connection information for the destination. This will not work
    when trying to disconnect volumes from the source during live
    migration as the properties such as the target_lun and
    initiator_target_map could be different. This ends up leaving
    behind dangling LUNs and possibly removing the incorrect
    volume's LUNs.

    The solution proposed here utilizes the connection_info that
    can be retrieved for a host from Cinder's initialize_connection
    API. This connection information contains the correct data for
    the source host and allows volume LUNs to be removed properly.

    Change-Id: I3dfb75eb58dfbc66b218bcee473af4c2ac282eb6
    Closes-Bug: #1475411
    Closes-Bug: #1288039
    Closes-Bug: #1423772
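
The idea described in the commit message can be illustrated with a small, self-contained sketch. This is not the actual Nova change: FakeVolumeAPI, source_connection_info, the host names, and the simplified initialize_connection signature are all hypothetical stand-ins, and the LUN numbers are borrowed from the block_device_mapping dumps in this report purely for illustration.

```python
class FakeVolumeAPI:
    """Hypothetical stand-in for the volume service.

    It returns connection info that depends on which host is asking,
    which is the behaviour the commit message relies on.
    """

    def __init__(self, per_host_info):
        self._per_host_info = per_host_info

    def initialize_connection(self, volume_id, connector):
        # Simplified signature; the connector only carries a host name here.
        return self._per_host_info[(volume_id, connector["host"])]


def source_connection_info(volume_api, volume_id, source_connector):
    """Ask the volume service for the source host's own connection info
    instead of reusing the BDM entry, which describes the destination."""
    return volume_api.initialize_connection(volume_id, source_connector)


VOLUME_ID = "c3009a3d-549d-4ee5-b7a6-b0eac6382beb"

volume_api = FakeVolumeAPI({
    (VOLUME_ID, "host-a"): {"driver_volume_type": "iscsi",
                            "data": {"target_lun": 2}},
    (VOLUME_ID, "host-b"): {"driver_volume_type": "iscsi",
                            "data": {"target_lun": 1}},
})

# After migration the BDM holds the destination's (host-b) view.
bdm_connection_info = volume_api.initialize_connection(
    VOLUME_ID, {"host": "host-b"})

# For cleanup on the source, look up host-a's view explicitly.
info_for_disconnect = source_connection_info(
    volume_api, VOLUME_ID, {"host": "host-a"})

print("BDM (destination) target_lun:", bdm_connection_info["data"]["target_lun"])
print("source target_lun:", info_for_disconnect["data"]["target_lun"])
```

In this toy setup, disconnecting on the source with the BDM's data would target LUN 1 instead of LUN 2, which is the kind of dangling or wrongly removed LUN the commit message warns about.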

Changed in nova:
status: In Progress → Fix Committed
OpenStack Infra (hudson-openstack) wrote : Fix merged to nova (stable/kilo)

Reviewed: https://review.openstack.org/211051
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=587092c909e15e983f7aef31d7bc0862271a32c7
Submitter: Jenkins
Branch: stable/kilo

commit 587092c909e15e983f7aef31d7bc0862271a32c7
Author: Anthony Lee <email address hidden>
Date: Thu Jul 16 13:02:00 2015 -0700

    Fix live-migrations usage of the wrong connector information

    During the post_live_migration step for the Nova libvirt driver
    an incorrect assumption is being made about the connector
    information being sent to _disconnect_volume. It is assumed that
    the connection information on the source and destination is the
    same but that is not always the case. The BDM, where the
    connector information is being retrieved from only contains the
    connection information for the destination. This will not work
    when trying to disconnect volumes from the source during live
    migration as the properties such as the target_lun and
    initiator_target_map could be different. This ends up leaving
    behind dangling LUNs and possibly removing the incorrect
    volume's LUNs.

    The solution proposed here utilizes the connection_info that
    can be retrieved for a host from Cinder's initialize_connection
    API. This connection information contains the correct data for
    the source host and allows volume LUNs to be removed properly.

    --

    NOTE(sahid): The TODO comment in the original change on master is
    omitted here since os-brick wasn't used by nova in kilo so leaving
    it in the backport would be confusing.

    Change-Id: I3dfb75eb58dfbc66b218bcee473af4c2ac282eb6
    Closes-Bug: #1475411
    Closes-Bug: #1288039
    Closes-Bug: #1423772

tags: added: in-stable-kilo
Thierry Carrez (ttx)
Changed in nova:
milestone: none → liberty-3
status: Fix Committed → Fix Released
OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (stable/juno)

Fix proposed to branch: stable/juno
Review: https://review.openstack.org/228517

OpenStack Infra (hudson-openstack) wrote : Fix merged to nova (stable/juno)

Reviewed: https://review.openstack.org/228517
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=9d2abbd9ab60ca873650759feaba98b4d8d35566
Submitter: Jenkins
Branch: stable/juno

commit 9d2abbd9ab60ca873650759feaba98b4d8d35566
Author: Anthony Lee <email address hidden>
Date: Thu Jul 16 13:02:00 2015 -0700

    Fix live-migrations usage of the wrong connector information

    During the post_live_migration step for the Nova libvirt driver
    an incorrect assumption is being made about the connector
    information being sent to _disconnect_volume. It is assumed that
    the connection information on the source and destination is the
    same but that is not always the case. The BDM, where the
    connector information is being retrieved from only contains the
    connection information for the destination. This will not work
    when trying to disconnect volumes from the source during live
    migration as the properties such as the target_lun and
    initiator_target_map could be different. This ends up leaving
    behind dangling LUNs and possibly removing the incorrect
    volume's LUNs.

    The solution proposed here utilizes the connection_info that
    can be retrieved for a host from Cinder's initialize_connection
    API. This connection information contains the correct data for
    the source host and allows volume LUNs to be removed properly.

    Conflicts:
            nova/tests/unit/virt/libvirt/test_driver.py

    NOTE(mriedem): The conflicts are due to the tests being moved
    in Kilo and 41f80226e0a1f73af76c7968617ebfda0aeb40b1 not being
    in stable/juno (renamed conn var to drvr in libvirt tests).

    Change-Id: I3dfb75eb58dfbc66b218bcee473af4c2ac282eb6
    Closes-Bug: #1475411
    Closes-Bug: #1288039
    Closes-Bug: #1423772
    (cherry picked from commit 587092c909e15e983f7aef31d7bc0862271a32c7)

tags: added: in-stable-juno
Thierry Carrez (ttx)
Changed in nova:
milestone: liberty-3 → 12.0.0