Xenapi: Migration failure of Volume Backed VHDs

Bug #1745072 reported by Brooks Kaminski
This bug affects 1 person
Affects                   Status         Importance  Assigned to      Milestone
OpenStack Compute (nova)  Fix Released   Low         Brooks Kaminski
Queens                    Confirmed      Low         Unassigned
Rocky                     Fix Committed  Low         Matt Riedemann

Bug Description

Description
===========
The current XenAPI resize code in _process_ephemeral_chain_recursive attempts to migrate volume-backed VHD chains instead of detaching them from the source hypervisor and re-attaching them to the destination. This code path exists because ephemeral drives may be present, and their VHD chains must be migrated after the initial set of base VHD chains. Attempting to snapshot the VDI of a volume-backed drive makes XenAPI raise SR_OPERATION_NOT_SUPPORTED, and the migration fails. These VDIs do not have the allowed_operation needed for a snapshot, and the VHD associated with such a VDI is only a stub.
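
As a rough illustration of the allowed_operation point above, the sketch below asks XenAPI which operations a VDI permits before attempting a snapshot. It uses the call_xenapi calling style seen in the traceback and the XenAPI VDI.get_allowed_operations getter. FakeSession, the VDI refs and the operation lists are hypothetical stand-ins, and the empty list for the volume VDI is an assumption based on this report, not captured data.

```python
def vdi_supports_snapshot(session, vdi_ref):
    """Return True if 'snapshot' is among the VDI's allowed operations."""
    allowed = session.call_xenapi("VDI.get_allowed_operations", vdi_ref)
    return "snapshot" in allowed


class FakeSession:
    """Hypothetical stand-in for a XenAPI session, for illustration only."""

    def __init__(self, allowed_by_vdi):
        self._allowed_by_vdi = allowed_by_vdi

    def call_xenapi(self, method, *args):
        # Only the single call used by this sketch is emulated.
        if method != "VDI.get_allowed_operations":
            raise NotImplementedError(method)
        (vdi_ref,) = args
        return self._allowed_by_vdi[vdi_ref]


session = FakeSession({
    "local-vdi": ["clone", "copy", "snapshot"],
    "volume-vdi": [],  # assumed: stub VDI of an attached volume
})

print(vdi_supports_snapshot(session, "local-vdi"))
print(vdi_supports_snapshot(session, "volume-vdi"))
```

A check like this would only avoid the failing call; it does not by itself decide whether a disk should be detached and re-attached instead.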

Steps to reproduce
==================
1. Create a server of any size or flavor
2. Attach enough volumes to create VBD Userdevice=4 or greater.
3. Attempt to migrate the server (Not live-migration)
4. Migration will fail during ephemeral snapshot process.

Expected result
===============
During a migration, the volumes should be detached from the source server and then attached to the destination without any real attempt to "migrate" them beyond switching their connection points.

Actual result
=============
The current resize migration code treats every volume with VBD userdevice 4 or higher as an ephemeral drive and attempts to snapshot and migrate its VHD, which causes errors when the device is actually a volume-backed drive.
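
The misclassification can be shown with a small, self-contained sketch. It is not the nova code: the Vbd class, the volume_backed flag and both helper functions are hypothetical, and only the userdevice threshold of 4 comes from this report.

```python
from dataclasses import dataclass

MIN_EPHEMERAL_USERDEVICE = 4  # threshold described in this report


@dataclass(frozen=True)
class Vbd:
    """Hypothetical, simplified view of a VBD attached to the server."""
    userdevice: int
    volume_backed: bool


def ephemeral_by_userdevice(vbds):
    """Selection by userdevice alone: attached volumes are picked up too."""
    return [v.userdevice for v in vbds
            if v.userdevice >= MIN_EPHEMERAL_USERDEVICE]


def ephemeral_excluding_volumes(vbds):
    """Same selection, but volume-backed devices are left out."""
    return [v.userdevice for v in vbds
            if v.userdevice >= MIN_EPHEMERAL_USERDEVICE
            and not v.volume_backed]


# Root disk, one ephemeral disk, and one attached volume at userdevice 5.
vbds = [Vbd(0, False), Vbd(4, False), Vbd(5, True)]

print(ephemeral_by_userdevice(vbds))
print(ephemeral_excluding_volumes(vbds))
```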

Environment
===========
1. Exact version of OpenStack you are running:

   Liberty. The issue still exists in the current code, however.

2. Which hypervisor did you use?

   XenServer hypervisors, all versions 6.0+

3. Which storage type did you use?

   Local SSD RAID10 + iSCSI NAS

4. Which networking type did you use?

   Neutron with OpenVSwitch

Logs & Configs
==============

Please note that the nova-compute and XenAPI logs below come from two different samples, but the failure presents the same way in both.

-----------------------------
- Logs from Nova-Compute:
-----------------------------

2015-11-23 20:44:59.396 10060 ERROR nova.virt.xenapi.vmops [instance: df05e626-246b-4676-990f-d67bcc9c0415] File "/opt/rackstack/rackstack.399.15/nova/lib/python2.7/site-packages/nova/virt/xenapi/vmops.py", line 212, in inner
2015-11-23 20:44:59.396 10060 ERROR nova.virt.xenapi.vmops [instance: df05e626-246b-4676-990f-d67bcc9c0415] rv = f(*args, **kwargs)
2015-11-23 20:44:59.396 10060 ERROR nova.virt.xenapi.vmops [instance: df05e626-246b-4676-990f-d67bcc9c0415] File "/opt/rackstack/rackstack.399.15/nova/lib/python2.7/site-packages/nova/virt/xenapi/vmops.py", line 1205, in transfer_ephemeral_disks_then_all_leaf_vdis
2015-11-23 20:44:59.396 10060 ERROR nova.virt.xenapi.vmops [instance: df05e626-246b-4676-990f-d67bcc9c0415] _process_ephemeral_chain_recursive(ephemeral_chains, [])
2015-11-23 20:44:59.396 10060 ERROR nova.virt.xenapi.vmops [instance: df05e626-246b-4676-990f-d67bcc9c0415] File "/opt/rackstack/rackstack.399.15/nova/lib/python2.7/site-packages/nova/virt/xenapi/vmops.py", line 1170, in _process_ephemeral_chain_recursive
2015-11-23 20:44:59.396 10060 ERROR nova.virt.xenapi.vmops [instance: df05e626-246b-4676-990f-d67bcc9c0415] vm_ref, label, str(userdevice)) as chain_vdi_uuids:
2015-11-23 20:44:59.396 10060 ERROR nova.virt.xenapi.vmops [instance: df05e626-246b-4676-990f-d67bcc9c0415] File "/usr/lib/python2.7/contextlib.py", line 17, in __enter__
2015-11-23 20:44:59.396 10060 ERROR nova.virt.xenapi.vmops [instance: df05e626-246b-4676-990f-d67bcc9c0415] return self.gen.next()
2015-11-23 20:44:59.396 10060 ERROR nova.virt.xenapi.vmops [instance: df05e626-246b-4676-990f-d67bcc9c0415] File "/opt/rackstack/rackstack.399.15/nova/lib/python2.7/site-packages/nova/virt/xenapi/vm_utils.py", line 748, in _snapshot_attached_here_impl
2015-11-23 20:44:59.396 10060 ERROR nova.virt.xenapi.vmops [instance: df05e626-246b-4676-990f-d67bcc9c0415] snapshot_ref = _vdi_snapshot(session, vm_vdi_ref)
2015-11-23 20:44:59.396 10060 ERROR nova.virt.xenapi.vmops [instance: df05e626-246b-4676-990f-d67bcc9c0415] File "/opt/rackstack/rackstack.399.15/nova/lib/python2.7/site-packages/nova/virt/xenapi/vm_utils.py", line 646, in _vdi_snapshot
2015-11-23 20:44:59.396 10060 ERROR nova.virt.xenapi.vmops [instance: df05e626-246b-4676-990f-d67bcc9c0415] return session.call_xenapi("VDI.snapshot", vdi_ref, {})
2015-11-23 20:44:59.396 10060 ERROR nova.virt.xenapi.vmops [instance: df05e626-246b-4676-990f-d67bcc9c0415] File "/opt/rackstack/rackstack.399.15/nova/lib/python2.7/site-packages/nova/virt/xenapi/client/session.py", line 212, in call_xenapi
2015-11-23 20:44:59.396 10060 ERROR nova.virt.xenapi.vmops [instance: df05e626-246b-4676-990f-d67bcc9c0415] return session.xenapi_request(method, args)
2015-11-23 20:44:59.396 10060 ERROR nova.virt.xenapi.vmops [instance: df05e626-246b-4676-990f-d67bcc9c0415] File "/opt/rackstack/rackstack.399.15/nova/lib/python2.7/site-packages/XenAPI.py", line 133, in xenapi_request
2015-11-23 20:44:59.396 10060 ERROR nova.virt.xenapi.vmops [instance: df05e626-246b-4676-990f-d67bcc9c0415] result = _parse_result(getattr(self, methodname)(*full_params))
2015-11-23 20:44:59.396 10060 ERROR nova.virt.xenapi.vmops [instance: df05e626-246b-4676-990f-d67bcc9c0415] File "/opt/rackstack/rackstack.399.15/nova/lib/python2.7/site-packages/XenAPI.py", line 203, in _parse_result
2015-11-23 20:44:59.396 10060 ERROR nova.virt.xenapi.vmops [instance: df05e626-246b-4676-990f-d67bcc9c0415] raise Failure(result['ErrorDescription'])
2015-11-23 20:44:59.396 10060 ERROR nova.virt.xenapi.vmops [instance: df05e626-246b-4676-990f-d67bcc9c0415] Failure: ['SR_OPERATION_NOT_SUPPORTED', 'OpaqueRef:33dd3d47-abd4-0be7-e83b-a1ffd290292a']
2015-11-23 20:44:59.396 10060 ERROR nova.virt.xenapi.vmops [instance: df05e626-246b-4676-990f-d67bcc9c0415]
2015-11-23 20:44:59.502 10060 INFO nova.compute.manager [req-c17b3702-077c-40bb-9487-5598b6ac0a9a 3742 391232 - - -] [instance: df05e626-246b-4676-990f-d67bcc9c0415] Setting instance back to ACTIVE after: Instance rollback performed due to: ['SR_OPERATION_NOT_SUPPORTED', 'OpaqueRef:33dd3d47-abd4-0be7-e83b-a1ffd290292a']

------------------------------
- Logs from Xensource.log
------------------------------

/var/log/xensource.log:Oct 12 02:46:20 localhost xapi: [debug|24-46-53-471027|36242831 INET 0.0.0.0:80|VDI.snapshot R:a42bf8a64bd2|xapi] Caught exception while SR_OPERATION_NOT_SUPPORTED: [ OpaqueRef:8c63e240-d80c-4354-4a58-d0395abf7425 ] in message forwarder: marking SR for VDI.snapshot
/var/log/xensource.log:Oct 12 02:46:20 localhost xapi: [debug|24-46-53-471027|36242831 INET 0.0.0.0:80|VDI.snapshot R:a42bf8a64bd2|dispatcher] Server_helpers.exec exception_handler: Got exception SR_OPERATION_NOT_SUPPORTED: [ OpaqueRef:8c63e240-d80c-4354-4a58-d0395abf7425 ]

Tags: xenserver
Brooks Kaminski (brooks-kaminski) wrote :

Sorry for the wrong order of operations here; I have already committed the fix for this:

https://review.openstack.org/#/c/533168/

Changed in nova:
status: New → Fix Committed
Changed in nova:
assignee: nobody → Brooks Kaminski (brooks-kaminski)
Changed in nova:
status: Fix Committed → In Progress
Changed in nova:
assignee: Brooks Kaminski (brooks-kaminski) → Brooks Kaminski (bhkaminski)
Changed in nova:
importance: Undecided → Low
OpenStack Infra (hudson-openstack) wrote : Fix merged to nova (master)

Reviewed: https://review.openstack.org/533168
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=eefb20e4658e17f91fa76b74fef6ff899babe51b
Submitter: Zuul
Branch: master

commit eefb20e4658e17f91fa76b74fef6ff899babe51b
Author: Brooks Kaminski <email address hidden>
Date: Fri Jan 12 06:05:36 2018 -0600

    XenAPI/Stops the migration of volume backed VHDS

    This commit aims to correct problems with the resize_up codebase that allows
    the snapshot and migration of volume backed VDI/VHDs. Since these are empty
    stub disks, and the XenAPI does not allow these VDIs to be snapped, this results
    in an SR_OPERATION_NOT_ALLOWED or similar error on attempt.

    This change adds a check into the _process_ephemeral_chain_recursive method to
    run the current userdevice through volume_utils.is_booted_from_volume. To
    achieve this, the method has been opened in scope to accept custom user_device
    objects. In a future commit we will need to rename this method for clarity
    and correct its dependancies that call it. I have added a TODO for this to be
    done by myself. The check will ensure that the userdevice is not volume backed
    and then continue to snapshot and migrate the disk as needed, else increment
    and move on.

    Closes-Bug: #1745072
    Change-Id: I7cd2977c8268c1f73062b5d0b2b68ea686db99fe
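
The check described in the commit message above can be sketched roughly as follows. This is not the merged change: it is written as a plain loop rather than the recursive method, and is_volume_backed and snapshot_and_migrate are hypothetical callbacks standing in for volume_utils.is_booted_from_volume and the existing snapshot/transfer logic. The device numbers are made up.

```python
def migrate_ephemeral_disks(userdevices, is_volume_backed,
                            snapshot_and_migrate):
    """Snapshot and migrate only the devices that are not volume backed."""
    migrated, skipped = [], []
    for userdevice in userdevices:
        if is_volume_backed(userdevice):
            # Volume-backed stub disk: skip it and move on to the next one.
            skipped.append(userdevice)
            continue
        snapshot_and_migrate(userdevice)
        migrated.append(userdevice)
    return migrated, skipped


VOLUME_DEVICES = {5}  # assumed: userdevice 5 is an attached volume


def fake_snapshot_and_migrate(userdevice):
    """Stand-in that fails the way the report describes for volumes."""
    if userdevice in VOLUME_DEVICES:
        raise RuntimeError("SR_OPERATION_NOT_SUPPORTED")


result = migrate_ephemeral_disks(
    [4, 5, 6],
    lambda device: device in VOLUME_DEVICES,
    fake_snapshot_and_migrate,
)
print(result)
```

With the guard in place the stand-in snapshot function is never called for userdevice 5, so the simulated failure does not occur.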

Changed in nova:
status: In Progress → Fix Released
OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (stable/rocky)

Fix proposed to branch: stable/rocky
Review: https://review.openstack.org/604203

OpenStack Infra (hudson-openstack) wrote : Fix merged to nova (stable/rocky)

Reviewed: https://review.openstack.org/604203
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=dba00dbe11361a12d2602dd9601ee365a0755051
Submitter: Zuul
Branch: stable/rocky

commit dba00dbe11361a12d2602dd9601ee365a0755051
Author: Brooks Kaminski <email address hidden>
Date: Fri Jan 12 06:05:36 2018 -0600

    XenAPI/Stops the migration of volume backed VHDS

    This commit aims to correct problems with the resize_up codebase that allows
    the snapshot and migration of volume backed VDI/VHDs. Since these are empty
    stub disks, and the XenAPI does not allow these VDIs to be snapped, this results
    in an SR_OPERATION_NOT_ALLOWED or similar error on attempt.

    This change adds a check into the _process_ephemeral_chain_recursive method to
    run the current userdevice through volume_utils.is_booted_from_volume. To
    achieve this, the method has been opened in scope to accept custom user_device
    objects. In a future commit we will need to rename this method for clarity
    and correct its dependancies that call it. I have added a TODO for this to be
    done by myself. The check will ensure that the userdevice is not volume backed
    and then continue to snapshot and migrate the disk as needed, else increment
    and move on.

    Closes-Bug: #1745072
    Change-Id: I7cd2977c8268c1f73062b5d0b2b68ea686db99fe
    (cherry picked from commit eefb20e4658e17f91fa76b74fef6ff899babe51b)

OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/nova 18.0.2

This issue was fixed in the openstack/nova 18.0.2 release.

OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/nova 19.0.0.0rc1

This issue was fixed in the openstack/nova 19.0.0.0rc1 release candidate.
