BUG: when live migration fails, the lun-id is not rolled back

Bug #1416314 reported by Hyun Ha on 2015-01-30
This bug affects 1 person
Affects: OpenStack Compute (nova)
Importance: Low
Assigned to: Unassigned

Bug Description

Hi, guys

I'm testing live migration with OpenStack Juno.

When a live migration fails with an error, the lun-id in the connection_info column of the bdm table is not rolled back.

My test environment is as follows:

OpenStack version: Juno (2014.2.1)
Compute node OS: 3.13.0-44-generic #73-Ubuntu SMP Tue Dec 16 00:22:43 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux
Compute node multipath: multipath-tools 0.4.9-3ubuntu7.2
Backend storage: EMC VNX 5400

The test steps are:

1) Create 2 compute nodes (host#1 and host#2).
2) Create 1 VM on host#1 (vm01).
3) Create 2 cinder volumes (vol01, vol02).
4) Attach both volumes to vm01 (vdb, vdc).
5) Bring down host#2's iSCSI interface.
    - This situation can occur frequently in production.
6) Live-migrate vm01 from host#1 to host#2.
7) The live migration fails.
     - Check the connection_info (lun-id) of the bdm at this point: the lun-id of the cinder volume has not been rolled back.
     - Check the LUN's storage group in Unisphere: the LUN belongs to two storage groups.
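
The check in step 7 can be sketched roughly as follows. This is only an illustration, assuming connection_info is stored as a JSON-serialized dict with the LUN under data.target_lun (typical for iSCSI volumes). The helper name lun_changes and the sample rows are hypothetical and are not taken from the affected deployment.

```python
import json


def lun_changes(before_rows, after_rows):
    """Return {device_name: (old_lun, new_lun)} for volumes whose LUN changed.

    Each row is a (device_name, connection_info_json) pair, e.g. as read
    from the block_device_mapping table before and after the failed migration.
    """
    def luns(rows):
        return {
            dev: json.loads(info)["data"].get("target_lun")
            for dev, info in rows
        }

    old, new = luns(before_rows), luns(after_rows)
    return {
        dev: (old[dev], new[dev])
        for dev in old
        if dev in new and old[dev] != new[dev]
    }


# Hypothetical sample data: vdb keeps its LUN, vdc does not.
before = [
    ("/dev/vdb", json.dumps({"driver_volume_type": "iscsi",
                             "data": {"target_lun": 1}})),
    ("/dev/vdc", json.dumps({"driver_volume_type": "iscsi",
                             "data": {"target_lun": 2}})),
]
after = [
    ("/dev/vdb", json.dumps({"driver_volume_type": "iscsi",
                             "data": {"target_lun": 1}})),
    ("/dev/vdc", json.dumps({"driver_volume_type": "iscsi",
                             "data": {"target_lun": 5}})),
]

print(lun_changes(before, after))
```

Any device reported here after a failed migration would be a candidate for the mismatch described in this report.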

This bug is critical because the VM can end up with different LUN mappings when this happens, which can corrupt the volume's filesystem.

This actually happened to me, and my VM's filesystem was broken.
I think every cinder backend storage can have the same problem, because this is a bug in the live-migration rollback process.

Please fix this bug ASAP.

Thank you.

Hyun Ha (raymon-ha) on 2015-02-03
tags: added: vnx
Hyun Ha (raymon-ha) on 2015-02-03
Changed in cinder:
assignee: nobody → Hahyun (hfamily15)
assignee: Hahyun (hfamily15) → nobody
Hyun Ha (raymon-ha) on 2015-02-03
Changed in cinder:
assignee: nobody → Hahyun (hfamily15)
status: New → In Progress
Robert Esker (esker) on 2015-02-06
Changed in cinder:
assignee: Hahyun (hfamily15) → Robert Esker (esker)
assignee: Robert Esker (esker) → NetApp (netapp)
Robert Esker (esker) on 2015-02-24
tags: added: security
Xing Yang (xing-yang) wrote :

Rob,

This was assigned to NetApp. Is anyone from NetApp looking into this? Thanks.

Yogesh (ykshirsa) wrote :

Based on my investigation, I did not find any issue on the cinder side. After a failed migration, the volume is properly unmapped from the host and terminate_connection is called as expected.
I rebooted the VM after the failed migration to make sure the LUNs are properly mapped to the host. I didn't see any discrepancy there either.

However, I did see an issue with the BDM table in Nova, where the "controller-info" column does not roll back the "target-id" information properly.
Therefore, I am moving this bug to Nova.

Yogesh (ykshirsa) wrote :

Issue with nova BDM table.

affects: cinder → nova
Changed in nova:
assignee: NetApp (netapp) → nobody
Changed in nova:
status: In Progress → Confirmed
importance: Undecided → Low
Hyun Ha (raymon-ha) wrote :

Hi, Yogesh
Thank you for your comment.

I agree that this is a nova bug.

I have one question.
Did you reboot the VM from inside the guest, or by using the nova CLI, e.g. 'nova reboot --hard [uuid]'?

You will find that the VM can end up with a different volume when you hard reboot the rolled-back VM.
When the VM is hard rebooted, nova requests the block device mapping info to regenerate the XML, but the BDM table in Nova has
the wrong lun-id in the connection_info column, so the VM gets a different volume mapping in this situation.
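
To illustrate why a stale lun-id matters on hard reboot, here is a minimal sketch. It assumes the host locates the iSCSI device through a /dev/disk/by-path name built from the portal, IQN and LUN, which is how udev commonly names such devices on Linux. The function name iscsi_by_path and all portal, IQN and LUN values are hypothetical.

```python
def iscsi_by_path(connection_info):
    """Build the by-path device name an iSCSI LUN would typically get."""
    data = connection_info["data"]
    return "/dev/disk/by-path/ip-{}-iscsi-{}-lun-{}".format(
        data["target_portal"], data["target_iqn"], data["target_lun"]
    )


# Hypothetical values: the LUN the VM was originally attached to ...
original = {
    "driver_volume_type": "iscsi",
    "data": {
        "target_portal": "192.0.2.10:3260",
        "target_iqn": "iqn.2015-01.example:target0",
        "target_lun": 2,
    },
}

# ... and the stale record left behind by the failed migration.
stale = {
    "driver_volume_type": "iscsi",
    "data": dict(original["data"], target_lun=5),
}

print(iscsi_by_path(original))
print(iscsi_by_path(stale))
```

If the domain XML is regenerated from the stale record, the disk source points at a different LUN than the one the guest was using.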

Thank you.

Yogesh (ykshirsa) wrote :

Hi Hyun,

Yes, I purposely hard rebooted the VM to see the behavior you reported.
However, I did not see it happen the way you described.
For me, the size of the VM and the mapping to the host remained as expected (i.e. the original size and host mapping).
So, on that front, I could not reproduce your issue. Maybe someone from the nova team could try reproducing it on their end.

Regards,
Yogesh

Matt Riedemann (mriedem) on 2015-05-26
tags: added: libvirt live-migration
tags: added: volum
tags: added: volumes
removed: volum
Matt Riedemann (mriedem) wrote :

Is this still an issue on master (liberty) level code? I'm assuming the virt driver here is libvirt - can someone confirm?

I'm also confused about the 'controller-info' comment in the block_device_mapping table in comment 2 - do you mean the connection_info column which is a serialized dict?

Looking at the nova.virt.libvirt.driver.pre_live_migration() method, I see it's connecting to a volume and the connection_info dictionary is updated in the nova.virt.libvirt.volume code, but I don't see where that connection_info dict comes back to the virt driver's pre_live_migration method and persists the change to the database.

What I do see is that pre_live_migration returns a pre_live_migration_result dict to the compute manager which gets passed to live_migration in the virt driver and that uses it to update the domain xml here:

http://git.openstack.org/cgit/openstack/nova/tree/nova/virt/libvirt/driver.py?id=2015.1.0#n5431

which eventually gets here:

http://git.openstack.org/cgit/openstack/nova/tree/nova/virt/libvirt/driver.py?id=2015.1.0#n5306

It seems like that could cause issues, but I still don't see where anything is persisted to the database that requires rollback.
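
If the destination-side connection_info were persisted before the migration completes (which could not be confirmed from the code paths above), a rollback would need to restore the original value. A minimal sketch of that pattern follows. FakeBDM and migrate_with_rollback are hypothetical names for illustration, not nova APIs, and the LUN values are made up.

```python
import copy


class FakeBDM:
    """Hypothetical stand-in for a block device mapping record."""

    def __init__(self, connection_info):
        self.connection_info = connection_info
        # What the "database" currently holds.
        self.saved = copy.deepcopy(connection_info)

    def save(self):
        self.saved = copy.deepcopy(self.connection_info)


def migrate_with_rollback(bdm, pre_migrate, migrate):
    """Run both migration phases; restore connection_info on any failure."""
    original = copy.deepcopy(bdm.connection_info)
    try:
        pre_migrate(bdm)   # may rewrite connection_info for the destination
        bdm.save()
        migrate(bdm)
    except Exception:
        bdm.connection_info = original
        bdm.save()         # persist the rolled-back value
        raise


def fake_pre_migrate(bdm):
    # The destination host hands out a different LUN (made-up value).
    bdm.connection_info["data"]["target_lun"] = 5


def fake_migrate(bdm):
    raise RuntimeError("destination iSCSI interface is down")


bdm = FakeBDM({"driver_volume_type": "iscsi", "data": {"target_lun": 2}})
try:
    migrate_with_rollback(bdm, fake_pre_migrate, fake_migrate)
except RuntimeError as exc:
    print("migration failed:", exc)

print("persisted LUN after rollback:", bdm.saved["data"]["target_lun"])
```

The deep copy is taken before the pre-migration step so that in-place changes to the dict cannot leak into the restored value.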

Matt Riedemann (mriedem) wrote :

This appears to be a duplicate of bug 1419577 which has already gone through the security team.
