Instance can't run normally after volume migration (with swap volume) fails.

Bug #1550639 reported by YaoZheng_ZTE
This bug affects 1 person
+--------------------------+-----------+------------+-------------+-----------+
| Affects                  | Status    | Importance | Assigned to | Milestone |
+--------------------------+-----------+------------+-------------+-----------+
| Cinder                   | New       | Undecided  | Unassigned  |           |
| OpenStack Compute (nova) | Confirmed | Medium     | Unassigned  |           |
+--------------------------+-----------+------------+-------------+-----------+

Bug Description

Steps to reproduce:
1. Create a volume from an image:
[root@2C5_10_DELL05 ~(keystone_admin)]# cinder create --image-id fd8330b3-a307-4140-8fe0-01341b583e26 --name test_image_volume --volume-type KSIP 1
+---------------------------------------+--------------------------------------+
| Property | Value |
+---------------------------------------+--------------------------------------+
| attachments | [] |
| availability_zone | nova |
| bootable | false |
| consistencygroup_id | None |
| created_at | 2016-02-27T04:20:37.000000 |
| description | None |
| encrypted | False |
| id | a0dae16a-2669-49c7-a118-250c31adc655 |
| metadata | {} |
| multiattach | False |
| name | test_image_volume |
| os-vol-host-attr:host | None |
| os-vol-mig-status-attr:migstat | None |
| os-vol-mig-status-attr:name_id | None |
| os-vol-tenant-attr:tenant_id | 181a578bc97642f2b9e153bec622f130 |
| os-volume-replication:driver_data | None |
| os-volume-replication:extended_status | None |
| replication_status | disabled |
| size | 1 |
| snapshot_id | None |
| source_volid | None |
| status | creating |
| user_id | 8b34e1ab75024fcba0ea69a6fd0937c3 |
| volume_type | KSIP |
+---------------------------------------+--------------------------------------+
2. Boot an instance from the volume created in step 1:
[root@2C5_10_DELL05 ~(keystone_admin)]# nova boot --flavor 1 --block-device id=a0dae16a-2669-49c7-a118-250c31adc655,source=volume,dest=volume,bootindex=0 --nic net-id=5c8f7e7a-5a75-48eb-9c68-096278585c18 test_vm
+--------------------------------------+--------------------------------------------------+
| Property | Value |
+--------------------------------------+--------------------------------------------------+
| OS-DCF:diskConfig | MANUAL |
| OS-EXT-AZ:availability_zone | nova |
| OS-EXT-SRV-ATTR:host | - |
| OS-EXT-SRV-ATTR:hypervisor_hostname | - |
| OS-EXT-SRV-ATTR:instance_name | instance-00000647 |
| OS-EXT-STS:power_state | 0 |
| OS-EXT-STS:task_state | scheduling |
| OS-EXT-STS:vm_state | building |
| OS-SRV-USG:launched_at | - |
| OS-SRV-USG:terminated_at | - |
| accessIPv4 | |
| accessIPv6 | |
| adminPass | JEeW4BR4WL3a |
| autostart | TRUE |
| boot_index_type | |
| config_drive | |
| created | 2016-02-27T04:22:42Z |
| flavor | m1.tiny (1) |
| hostId | |
| id | a740b3da-42e7-4cba-9408-8df3b4846dcc |
| image | Attempt to boot from volume - no image supplied |
| key_name | - |
| metadata | {} |
| move | TRUE |
| name | test_vm |
| novnc | TRUE |
| os-extended-volumes:volumes_attached | [{"id": "a0dae16a-2669-49c7-a118-250c31adc655"}] |
| priority | 50 |
| progress | 0 |
| qos | |
| security_groups | default |
| status | BUILD |
| tenant_id | 181a578bc97642f2b9e153bec622f130 |
| updated | 2016-02-27T04:22:43Z |
| user_id | 8b34e1ab75024fcba0ea69a6fd0937c3 |
+--------------------------------------+--------------------------------------------------+
3. Migrate the in-use volume:
[root@2C5_10_DELL05 ~(keystone_admin)]# cinder migrate a0dae16a-2669-49c7-a118-250c31adc655 2C5_10_DELL05@KS3200ISCSIDriver-2#KS3200_IPSAN
4. The volume migration fails; nova-compute.log contains the following traceback (see the note after step 5 for an analysis of the final TypeError):
2016-02-27 11:32:47.986 29370 TRACE oslo_messaging.rpc.dispatcher Traceback (most recent call last):
2016-02-27 11:32:47.986 29370 TRACE oslo_messaging.rpc.dispatcher File "/usr/lib/python2.7/site-packages/oslo_messaging/rpc/dispatcher.py", line 142, in _dispatch_and_reply
2016-02-27 11:32:47.986 29370 TRACE oslo_messaging.rpc.dispatcher executor_callback))
2016-02-27 11:32:47.986 29370 TRACE oslo_messaging.rpc.dispatcher File "/usr/lib/python2.7/site-packages/oslo_messaging/rpc/dispatcher.py", line 186, in _dispatch
2016-02-27 11:32:47.986 29370 TRACE oslo_messaging.rpc.dispatcher executor_callback)
2016-02-27 11:32:47.986 29370 TRACE oslo_messaging.rpc.dispatcher File "/usr/lib/python2.7/site-packages/oslo_messaging/rpc/dispatcher.py", line 130, in _do_dispatch
2016-02-27 11:32:47.986 29370 TRACE oslo_messaging.rpc.dispatcher result = func(ctxt, **new_args)
2016-02-27 11:32:47.986 29370 TRACE oslo_messaging.rpc.dispatcher File "/usr/lib/python2.7/site-packages/nova/compute/manager.py", line 8699, in swap_volume
2016-02-27 11:32:47.986 29370 TRACE oslo_messaging.rpc.dispatcher new_volume_id)
2016-02-27 11:32:47.986 29370 TRACE oslo_messaging.rpc.dispatcher File "/usr/lib/python2.7/site-packages/nova/exception.py", line 88, in wrapped
2016-02-27 11:32:47.986 29370 TRACE oslo_messaging.rpc.dispatcher payload)
2016-02-27 11:32:47.986 29370 TRACE oslo_messaging.rpc.dispatcher File "/usr/lib/python2.7/site-packages/oslo_utils/excutils.py", line 85, in __exit__
2016-02-27 11:32:47.986 29370 TRACE oslo_messaging.rpc.dispatcher six.reraise(self.type_, self.value, self.tb)
2016-02-27 11:32:47.986 29370 TRACE oslo_messaging.rpc.dispatcher File "/usr/lib/python2.7/site-packages/nova/exception.py", line 71, in wrapped
2016-02-27 11:32:47.986 29370 TRACE oslo_messaging.rpc.dispatcher return f(self, context, *args, **kw)
2016-02-27 11:32:47.986 29370 TRACE oslo_messaging.rpc.dispatcher File "/usr/lib/python2.7/site-packages/nova/compute/manager.py", line 379, in decorated_function
2016-02-27 11:32:47.986 29370 TRACE oslo_messaging.rpc.dispatcher LOG.warning(msg, e, instance_uuid=instance_uuid)
2016-02-27 11:32:47.986 29370 TRACE oslo_messaging.rpc.dispatcher File "/usr/lib/python2.7/site-packages/oslo_utils/excutils.py", line 85, in __exit__
2016-02-27 11:32:47.986 29370 TRACE oslo_messaging.rpc.dispatcher six.reraise(self.type_, self.value, self.tb)
2016-02-27 11:32:47.986 29370 TRACE oslo_messaging.rpc.dispatcher File "/usr/lib/python2.7/site-packages/nova/compute/manager.py", line 350, in decorated_function
2016-02-27 11:32:47.986 29370 TRACE oslo_messaging.rpc.dispatcher return function(self, context, *args, **kwargs)
2016-02-27 11:32:47.986 29370 TRACE oslo_messaging.rpc.dispatcher File "/usr/lib/python2.7/site-packages/nova/compute/manager.py", line 407, in decorated_function
2016-02-27 11:32:47.986 29370 TRACE oslo_messaging.rpc.dispatcher kwargs['instance'], e, sys.exc_info())
2016-02-27 11:32:47.986 29370 TRACE oslo_messaging.rpc.dispatcher File "/usr/lib/python2.7/site-packages/oslo_utils/excutils.py", line 85, in __exit__
2016-02-27 11:32:47.986 29370 TRACE oslo_messaging.rpc.dispatcher six.reraise(self.type_, self.value, self.tb)
2016-02-27 11:32:47.986 29370 TRACE oslo_messaging.rpc.dispatcher File "/usr/lib/python2.7/site-packages/nova/compute/manager.py", line 395, in decorated_function
2016-02-27 11:32:47.986 29370 TRACE oslo_messaging.rpc.dispatcher return function(self, context, *args, **kwargs)
2016-02-27 11:32:47.986 29370 TRACE oslo_messaging.rpc.dispatcher File "/usr/lib/python2.7/site-packages/nova/compute/manager.py", line 5965, in swap_volume
2016-02-27 11:32:47.986 29370 TRACE oslo_messaging.rpc.dispatcher new_volume_id)
2016-02-27 11:32:47.986 29370 TRACE oslo_messaging.rpc.dispatcher File "/usr/lib/python2.7/site-packages/nova/compute/manager.py", line 5932, in _swap_volume
2016-02-27 11:32:47.986 29370 TRACE oslo_messaging.rpc.dispatcher self.volume_api.unreserve_volume(context, new_volume_id)
2016-02-27 11:32:47.986 29370 TRACE oslo_messaging.rpc.dispatcher File "/usr/lib/python2.7/site-packages/oslo_utils/excutils.py", line 85, in __exit__
2016-02-27 11:32:47.986 29370 TRACE oslo_messaging.rpc.dispatcher six.reraise(self.type_, self.value, self.tb)
2016-02-27 11:32:47.986 29370 TRACE oslo_messaging.rpc.dispatcher File "/usr/lib/python2.7/site-packages/nova/compute/manager.py", line 5913, in _swap_volume
2016-02-27 11:32:47.986 29370 TRACE oslo_messaging.rpc.dispatcher resize_to)
2016-02-27 11:32:47.986 29370 TRACE oslo_messaging.rpc.dispatcher File "/usr/lib/python2.7/site-packages/nova/virt/libvirt/driver.py", line 1241, in swap_volume
2016-02-27 11:32:47.986 29370 TRACE oslo_messaging.rpc.dispatcher self._disconnect_volume(old_connection_info, disk_dev)
2016-02-27 11:32:47.986 29370 TRACE oslo_messaging.rpc.dispatcher File "/usr/lib/python2.7/site-packages/nova/virt/libvirt/driver.py", line 1092, in _disconnect_volume
2016-02-27 11:32:47.986 29370 TRACE oslo_messaging.rpc.dispatcher raise
2016-02-27 11:32:47.986 29370 TRACE oslo_messaging.rpc.dispatcher TypeError: exceptions must be old-style classes or derived from BaseException, not NoneType
5. Afterwards, the instance still shows as running/ACTIVE, but logging in to the guest reveals that its file system has been remounted read-only.
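
A note on the final TypeError in the traceback: the cleanup path in nova/virt/libvirt/driver.py's _disconnect_volume executes a bare "raise" at a point where no exception is active any more, and in Python 2 a bare "raise" with nothing to re-raise fails with exactly the message logged above. A minimal Python 2 sketch of those language semantics (an illustration only, not the actual nova code):

    # Python 2: a bare ``raise`` re-raises the currently active exception.
    def _disconnect_volume_sketch():
        # No exception is active in this scope, so there is nothing to
        # re-raise and Python 2 fails with:
        #   TypeError: exceptions must be old-style classes or derived
        #   from BaseException, not NoneType
        raise

    _disconnect_volume_sketch()

The usual defensive pattern in OpenStack code is oslo_utils.excutils.save_and_reraise_exception(), which captures the original exception before cleanup code can clear it.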

Changed in nova:
assignee: nobody → YaoZheng_ZTE (zheng-yao1)
Revision history for this message
YaoZheng_ZTE (zheng-yao1) wrote :

The failure in step 4 itself is not the concern; it was caused by an environmental problem on my side. The point I want to make is that even when the volume migration fails, we should be able to roll back to a normal state and keep the virtual machine's storage usable as far as possible. To solve this problem, the exception should be caught and handled in disconnect_volume.
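
A minimal sketch of the handling proposed above (the method shape follows the libvirt driver's swap_volume path, but the placement and the decision to log-and-continue are assumptions):

    from oslo_log import log as logging

    LOG = logging.getLogger(__name__)

    def swap_volume(self, old_connection_info, new_connection_info,
                    instance, disk_dev, resize_to):
        # ... pivot the guest onto the new volume ...
        try:
            self._disconnect_volume(old_connection_info, disk_dev)
        except Exception:
            # Failing to tear down the old connection should not abort the
            # whole operation and leave the guest on a dead block device;
            # log it and let the rollback/cleanup continue.
            LOG.exception("Failed to disconnect old volume %s", disk_dev)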

Revision history for this message
Matt Riedemann (mriedem) wrote :

What version of nova and cinder are you using?

I'm not familiar with the volume migration API in Cinder, but it looks like it calls out to Nova to perform a swap volume operation.

Per your comment #1, you're expecting that Cinder will handle a failure from the Nova swap volume operation and roll it back? That's reasonable, but the Cinder team would need to weigh in here.

Do you have the Cinder logs from the failure so we can see why the rollback doesn't happen?

Changed in nova:
status: New → Incomplete
tags: added: migration volumes
Revision history for this message
Matt Riedemann (mriedem) wrote :

It seems like at the very least the instance in nova should go to ERROR state if the swap volume operation fails.
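
A sketch of that suggestion against the compute manager's _swap_volume error path (vm_states and excutils are real nova/oslo modules; the method signature and surrounding body are assumed/elided):

    from oslo_utils import excutils

    from nova.compute import vm_states

    def _swap_volume(self, context, instance, old_volume_id, new_volume_id):
        try:
            pass  # ... existing attach-new/copy/detach-old logic ...
        except Exception:
            with excutils.save_and_reraise_exception():
                # Surface the failure instead of leaving the instance
                # ACTIVE on top of broken storage.
                instance.vm_state = vm_states.ERROR
                instance.save()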

Matt Riedemann (mriedem)
summary: - After migrate volume being attached instance, the instance cann't run
- normally
+ Instance can't run normally after volume migration (with swap volume)
+ fails.
Revision history for this message
Matt Riedemann (mriedem) wrote :

Another issue that worries me: nova doesn't change the task_state on the instance during a swap_volume operation, so the owner of the server could delete it while volumes are being swapped, which could leave the volumes in a garbage state on the cinder side. This is especially bad because swap_volume is an admin-only API by default and is initiated from cinder when doing a volume migration. So the user might not even know their volume backends are being migrated; if they decide they no longer need the instance and delete it, the delete either fails, or it succeeds but leaves the volumes stuck or orphaned in cinder.
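
One possible guard, sketched here: record the operation in the instance's task_state for the duration of the swap so concurrent operations can detect and refuse it (nova.compute.task_states is a real module, but the 'swapping_volume' value is hypothetical and would have to be added there):

    def swap_volume(self, context, instance, old_volume_id, new_volume_id):
        # 'swapping_volume' is a hypothetical new task state; today nova
        # leaves task_state at None for the whole swap.
        instance.task_state = 'swapping_volume'
        instance.save()
        try:
            self._swap_volume(context, instance, old_volume_id,
                              new_volume_id)
        finally:
            instance.task_state = None
            instance.save()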

Revision history for this message
Matt Riedemann (mriedem) wrote :

It also doesn't look like we set any instance action events or emit notifications when performing a swap_volume operation.
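
For reference, a sketch of emitting usage notifications around the operation; _notify_about_instance_usage is an existing ComputeManager helper, while the "volume_swap" event name is an assumption:

    def swap_volume(self, context, instance, old_volume_id, new_volume_id):
        # Start/end notifications let operators and the instance owner see
        # that a swap is in flight ("volume_swap" is a hypothetical name).
        self._notify_about_instance_usage(context, instance,
                                          "volume_swap.start")
        self._swap_volume(context, instance, old_volume_id, new_volume_id)
        self._notify_about_instance_usage(context, instance,
                                          "volume_swap.end")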

Matt Riedemann (mriedem)
Changed in nova:
status: Incomplete → Confirmed
importance: Undecided → Medium
Changed in nova:
status: Confirmed → In Progress
Revision history for this message
Sean Dague (sdague) wrote :

There are no currently open reviews on this bug, changing
the status back to the previous state and unassigning. If
there are active reviews related to this bug, please include
links in comments.

Changed in nova:
status: In Progress → Confirmed
assignee: YaoZheng_ZTE (zheng-yao1) → nobody