Live migration of volume backed instances broken after upgrade to Juno

Bug #1392773 reported by Darren Worrall
56
This bug affects 11 people
Affects Status Importance Assigned to Milestone
OpenStack Compute (nova)
Fix Released
Critical
Dan Genin
Juno
Fix Released
Critical
Matthew Booth

Bug Description

I'm running nova in a virtualenv with a checkout of stable/juno:

root@compute1:/opt/openstack/src/nova# git branch
  stable/icehouse
* stable/juno
root@compute1:/opt/openstack/src/nova# git rev-list stable/juno | head -n 1
54330ce33ee31bbd84162f0af3a6c74003d57329

Since upgrading from icehouse, our iscsi backed instances are no longer able to live migrate, throwing exceptions like:

Traceback (most recent call last):
  File "/opt/openstack/venv/nova/local/lib/python2.7/site-packages/oslo/messaging/rpc/dispatcher.py", line 134, in _dispatch_and_reply
    incoming.message))
  File "/opt/openstack/venv/nova/local/lib/python2.7/site-packages/oslo/messaging/rpc/dispatcher.py", line 177, in _dispatch
    return self._do_dispatch(endpoint, method, ctxt, args)
  File "/opt/openstack/venv/nova/local/lib/python2.7/site-packages/oslo/messaging/rpc/dispatcher.py", line 123, in _do_dispatch
    result = getattr(endpoint, method)(ctxt, **new_args)
  File "/opt/openstack/venv/nova/local/lib/python2.7/site-packages/nova/exception.py", line 88, in wrapped
    payload)
  File "/opt/openstack/venv/nova/local/lib/python2.7/site-packages/nova/openstack/common/excutils.py", line 82, in __exit__
    six.reraise(self.type_, self.value, self.tb)
  File "/opt/openstack/venv/nova/local/lib/python2.7/site-packages/nova/exception.py", line 71, in wrapped
    return f(self, context, *args, **kw)
  File "/opt/openstack/venv/nova/local/lib/python2.7/site-packages/nova/compute/manager.py", line 326, in decorated_function
    kwargs['instance'], e, sys.exc_info())
  File "/opt/openstack/venv/nova/local/lib/python2.7/site-packages/nova/openstack/common/excutils.py", line 82, in __exit__
    six.reraise(self.type_, self.value, self.tb)
  File "/opt/openstack/venv/nova/local/lib/python2.7/site-packages/nova/compute/manager.py", line 314, in decorated_function
    return function(self, context, *args, **kwargs)
  File "/opt/openstack/venv/nova/local/lib/python2.7/site-packages/nova/compute/manager.py", line 4882, in check_can_live_migrate_source
    dest_check_data)
  File "/opt/openstack/venv/nova/local/lib/python2.7/site-packages/nova/virt/libvirt/driver.py", line 5040, in check_can_live_migrate_source
    raise exception.InvalidSharedStorage(reason=reason, path=source)
InvalidSharedStorage: compute2 is not on shared storage: Live migration can not be used without shared storage.

Looking back through the code, given dest_check_data like this:

{u'disk_over_commit': False, u'disk_available_mb': None, u'image_type': u'default', u'filename': u'tmpyrUUg1', u'block_migration': False, 'is_volume_backed': True}

In Icehouse the code to validate the request skipped this[0]:
     elif not shared and (not is_volume_backed or has_local_disks):

In Juno, it matches this[1]:

 if (dest_check_data.get('is_volume_backed') and
  not bool(jsonutils.loads(
    self.get_instance_disk_info(instance['name'])))):

In Juno at least, get_instance_disk_info returns something like this:

[{u'disk_size': 10737418240, u'type': u'raw', u'virt_disk_size': 10737418240, u'path': u'/dev/disk/by-path/ip-10.0.0.1:3260-iscsi-iqn.2010-10.org.openstack:volume-10f2302c-26b6-44e0-a3ea-7033d1091470-lun-1', u'backing_file': u'', u'over_committed_disk_size': 0}]

I wonder if that was previously an empty return value in Icehouse, I'm unable to test right now, but if it returned the same then I'm not sure how it ever worked before.

This is a lab environment, the volume storage is an LVM+ISCSI cinder service. nova.conf and cinder.conf here[2]

[0]: https://github.com/openstack/nova/blob/stable/icehouse/nova/virt/libvirt/driver.py#L4299
[1]: https://github.com/openstack/nova/blob/stable/juno/nova/virt/libvirt/driver.py#L5073
[2]: https://gist.github.com/DazWorrall/b1b1e906a6dc2338f6c1

Revision history for this message
Darren Worrall (dazworrall) wrote :

Ok, I rolled the whole thing back to icehouse, and confirmed that get_instance_disk_info returns an empty list for a volume backed image, whereas in Juno it does not.

Revision history for this message
Darren Worrall (dazworrall) wrote :

Created a short test case in this gist:

https://gist.github.com/DazWorrall/790bb41dbbbee2c24eef

Revision history for this message
Darren Worrall (dazworrall) wrote :

To also confirm, if I drop a hack in to emulate the old behaviour I am able to migrate instances successfully again.

Joe Gordon (jogo)
Changed in nova:
status: New → Confirmed
importance: Undecided → High
tags: added: juno-backport-potential
Changed in nova:
importance: High → Critical
Revision history for this message
Darren Worrall (dazworrall) wrote :

This is the commit that changed the behaviour:

https://github.com/openstack/nova/commit/5fa74bc0b2ab6fe5149a9b684b4beadb67877622#diff-f4019782d93a196a0d026479e6aa61b1R5313

Specifically:

- path = path_node.get('file')
+ path = path_node.get('file') or path_node.get('dev')

Revision history for this message
Darren Worrall (dazworrall) wrote :

I subscribed the commit author, I hope that's ok.

Revision history for this message
Dan Genin (daniel-genin) wrote :

I think the problem is that the get_instance_disk_info() call is not passing the block_device_info that is used to filter out volume backed disks.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (master)

Fix proposed to branch: master
Review: https://review.openstack.org/135074

Changed in nova:
assignee: nobody → Dan Genin (daniel-genin)
status: Confirmed → In Progress
Revision history for this message
Dan Genin (daniel-genin) wrote :

Darren, can you please try this patch https://review.openstack.org/#/c/135074/. There is no Tempest testing for migration and I could never get live migration to work in my DevStack.

Revision history for this message
Darren Worrall (dazworrall) wrote :

The patch applies but is incomplete on stable/juno [0], the missing bits are on master but I cant even start instances on master at the moment (unrelated to this patch I'm sure, nova-compute is incorrectly looking up endpoints as being on 127.0.0.1 for some reason). Does the patch need to change for stable/juno or will that branch just merge with master at some point?

[0]: http://pastebin.com/0x4ERypG

Revision history for this message
Dan Genin (daniel-genin) wrote :

The patch is slightly different for stable/juno. Try this http://pastebin.com/95tk0NcL.

Revision history for this message
Darren Worrall (dazworrall) wrote :

Applies but incomplete again I think: http://pastebin.com/ByUCgFzJ

Revision history for this message
Dan Genin (daniel-genin) wrote :
Revision history for this message
Darren Worrall (dazworrall) wrote :

Yeah I just had time to look more closely and spotted the typo :)

It's gotten me past the validation step ok, I'm now getting a cinder problem but I dont think it's related (the second compute note is having issues authenticating with the target).

Revision history for this message
Dan Genin (daniel-genin) wrote :

OK, we'll count this as progress. Thank you for the quick turn around on testing.

Revision history for this message
Darren Worrall (dazworrall) wrote :

Ok, I rolled the whole thing back to icehouse again to collect some debug data while it was working, rolled forward to Juno and applied the patch and everything is working fine - the other problems must have been self inflicted while I was hacking around trying to figure out the initial problem.

So to clarify, on a fresh checkout and install of stable/juno, this patch fixes the issue.

Revision history for this message
Dan Genin (daniel-genin) wrote :

Sweeeet! Will work up a patch for master that will hopefully get backported.

Revision history for this message
John Griffith (john-griffith) wrote :

The pastebin seems to be inop; I worked up a quick *almost* complete version if Dan wants to submit it, otherwise I'm happy to finish it up and submit with him listed as Author.

The patch works, but needs updates to Xen and HyperV as well as some unit testing.

Revision history for this message
Dan Genin (daniel-genin) wrote :

Thank you, John! Your help is much appreciated. Please, feel free to submit this patch with me as an author or otherwise. I have little experience with Xen and HyperV, so I am not sure how much help I can be in adding unit tests for these drivers.

Revision history for this message
Luo Gangyi (luogangyi) wrote :

Besides this bug, I may find another problem in code.

The methon _get_instance_disk_info (nova/nova/virt/libvirt/driver.py) in Juno has piece of code as

"
        for cnt, path_node in enumerate(path_nodes):
            disk_type = disk_nodes[cnt].get('type')
            path = path_node.get('file') or path_node.get('dev')
            target = target_nodes[cnt].attrib['dev']

            if not path:
                LOG.debug('skipping disk for %s as it does not have a path',
                          instance_name)
                continue

            if disk_type not in ['file', 'block']:
                LOG.debug('skipping disk because it looks like a volume', path)
                continue

            if target in volume_devices:
                LOG.debug('skipping disk %(path)s (%(target)s) as it is a '
                          'volume', {'path': path, 'target': target})
                continue
"

However, in Icehouse, the code looks like as

"
        for cnt, path_node in enumerate(path_nodes):
            disk_type = disk_nodes[cnt].get('type')
            path = path_node.get('file')
            target = target_nodes[cnt].attrib['dev']

            if not path:
                LOG.debug(_('skipping disk for %s as it does not have a path'),
                          instance_name)
                continue

            if disk_type != 'file':
                LOG.debug(_('skipping %s since it looks like volume'), path)
                continue

            if target in volume_devices:
                LOG.debug(_('skipping disk %(path)s (%(target)s) as it is a '
                            'volume'), {'path': path, 'target': target})
                continue
"

Since both in Juno and Icehouse, the variable 'volume_devices' are None, so it should step into the code

"
            if disk_type != 'file':
                LOG.debug(_('skipping %s since it looks like volume'), path)
                continue
"
to prevent copy volume device.

But, in Juno, this code was changed to
''
            if disk_type not in ['file', 'block']:
                LOG.debug('skipping disk because it looks like a volume', path)
                continue
"
I wander volume device will be copied in this code which is definitely wrong.

Revision history for this message
Dan Genin (daniel-genin) wrote :

Luo, I modified that line in Juno to make get_instance_disk_info() return correct information for LVM backed disks, which have type 'block'. However, as long as the caller provides correct block_device_info parameter, the volume devices will be filtered out. The volume_devices variable is set based on block_device_info.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to nova (master)

Reviewed: https://review.openstack.org/135074
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=6ad7e17e19b9809306b0b96ae7aa9cdfda91fcbb
Submitter: Jenkins
Branch: master

commit 6ad7e17e19b9809306b0b96ae7aa9cdfda91fcbb
Author: Daniel Genin <email address hidden>
Date: Mon Nov 17 15:16:14 2014 -0500

    libvirt: Fixes live migration for volume backed instances

    Live migration fails for volume backed instances in LibvirtDriver
    because _is_shared_block_storage() incorrectly identifies volume
    backed disks as local. This is fixed by passing the block_device_info
    parameter to get_instance_disk_info, which allows for correct
    filtering of volume backed disks.

    Change-Id: I1437b2a7d5a62615b0099114ed1b5b1110f56de2
    Closes-bug: 1392773

Changed in nova:
status: In Progress → Fix Committed
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (stable/juno)

Fix proposed to branch: stable/juno
Review: https://review.openstack.org/141124

Thierry Carrez (ttx)
Changed in nova:
milestone: none → kilo-1
status: Fix Committed → Fix Released
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on nova (stable/juno)

Change abandoned by John Griffith (<email address hidden>) on branch: stable/juno
Review: https://review.openstack.org/141124

Revision history for this message
Darren Worrall (dazworrall) wrote :

This is marked as Fix Released but is still broken in stable/juno. I got an email saying the change had been abandoned?

Changed in nova:
status: Fix Released → Confirmed
Revision history for this message
OpenStack Infra (hudson-openstack) wrote :

Change abandoned by John Griffith (<email address hidden>) on branch: stable/juno
Review: https://review.openstack.org/141124
Reason: Looks like this is all done actually.

Revision history for this message
Davanum Srinivas (DIMS) (dims-v) wrote :

Looks like this is fixed for master in https://review.openstack.org/#/c/135074/

Changed in nova:
status: Confirmed → Fix Committed
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to nova (stable/juno)

Reviewed: https://review.openstack.org/141124
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=5a0711dbffe3d68ee9be39c85307b19ea5efee7a
Submitter: Jenkins
Branch: stable/juno

commit 5a0711dbffe3d68ee9be39c85307b19ea5efee7a
Author: Daniel Genin <email address hidden>
Date: Mon Nov 17 15:16:14 2014 -0500

    libvirt: Fixes live migration for volume backed instances

    Live migration fails for volume backed instances in LibvirtDriver
    because _is_shared_block_storage() incorrectly identifies volume
    backed disks as local. This is fixed by passing the block_device_info
    parameter to get_instance_disk_info, which allows for correct
    filtering of volume backed disks.

    (cherry picked from commit 6ad7e17e19b9809306b0b96ae7aa9cdfda91fcbb)

    Conflicts:
     nova/tests/unit/virt/libvirt/test_driver.py

    Tests moved to tests/unit and 1 other minor conflict.

    objects.Instance(**self.test_instance) replaced with
    self.create_instance_obj(self.context) to handle PciDeviceList.

    Use explicit field name in format() for 2.6 compatibility.

    Change-Id: I1437b2a7d5a62615b0099114ed1b5b1110f56de2
    Closes-bug: 1392773

Revision history for this message
Darren Worrall (dazworrall) wrote :

Just updated our stable/juno install and this issue has been fixed, thanks all.

Changed in nova:
status: Fix Committed → Fix Released
Thierry Carrez (ttx)
Changed in nova:
milestone: kilo-1 → 2015.1.0
tags: added: live-migrate
To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.