Live migration of UEFI booted instance failing unexpectedly

Bug #1792999 reported by Wendy Mitchell on 2018-09-17
This bug affects 1 person
Affects: OpenStack Compute (nova) | Importance: Undecided | Assigned to: Unassigned
Affects: StarlingX | Importance: Medium | Assigned to: YU CHENGDE

Bug Description

Brief Description
-----------------
Live migration of a UEFI-booted instance fails unexpectedly (rolls back)

Severity
--------
Major

Steps to Reproduce
------------------
1: Get/Create ge_edge image
2: Create a flavor with 1 vcpu
3: Add following extra specs: {'hw:cpu_policy': 'dedicated'}
4: Create a volume from ge_edge image
5: Boot a ge_edge VM with above flavor from volume
6: Check topology of vm b69eb997-0477-43f5-a2dc-6840e13cadcd on controller, hypervisor and vm
7: Live migrate ge_edge VM
[2018-09-15 21:38:21,722]
8: Ping vm from NatBox after live migration
9: Check topology of vm b69eb997-0477-43f5-a2dc-6840e13cadcd on controller, hypervisor and vm
10: Cold migrate vm and check vm is moved to different host
11: Ping vm from NatBox after cold migration
[2018-09-15 21:40:10,847]
12: Check topology of vm b69eb997-0477-43f5-a2dc-6840e13cadcd on controller, hypervisor and vm
[2018-09-15 21:40:43,316]
13: Swact active controller
[2018-09-15 21:41:05,746]
14: Ensure ge_edge vm can still be live-migrated after swact
[2018-09-15 21:42:36,155] 262 DEBUG MainThread ssh.send :: Send 'nova --os-username 'admin' --os-password 'Li69nux*' --os-project-name admin --os-auth-url http://192.168.204.2:5000/v3 --os-user-domain-name Default --os-project-domain-name Default --os-region-name RegionOne live-migration b69eb997-0477-43f5-a2dc-6840e13cadcd'
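
The reproduction steps above could be scripted roughly as follows. This is only a sketch that uses the standard `openstack` and `nova` clients and the StarlingX `system` client, not the test framework that produced the logs. The function name `build_repro_commands`, the resource names, the RAM and volume sizes, the network argument and the controller name are all hypothetical, credentials are assumed to come from the environment, and the topology checks and NatBox pings are omitted. The function only builds argument lists; it does not run anything.

```python
from typing import List


def build_repro_commands(
    image: str,
    flavor: str,
    volume: str,
    server: str,
    network: str,
    active_controller: str = "controller-0",  # assumed name of the active controller
    ram_mb: int = 1024,        # assumed; the report only fixes the vCPU count at 1
    volume_size_gb: int = 4,   # assumed; the report does not state a volume size
) -> List[List[str]]:
    """Return CLI invocations for steps 2-5, 7, 10, 13 and 14 as argument lists."""
    return [
        # Step 2: flavor with 1 vCPU (root disk 0 because the VM boots from a volume).
        ["openstack", "flavor", "create", "--vcpus", "1",
         "--ram", str(ram_mb), "--disk", "0", flavor],
        # Step 3: pin the vCPU with the dedicated CPU policy.
        ["openstack", "flavor", "set",
         "--property", "hw:cpu_policy=dedicated", flavor],
        # Step 4: bootable volume created from the image.
        ["openstack", "volume", "create",
         "--image", image, "--size", str(volume_size_gb), volume],
        # Step 5: boot the VM from that volume.
        ["openstack", "server", "create", "--flavor", flavor,
         "--volume", volume, "--network", network, server],
        # Step 7: first live migration.
        ["nova", "live-migration", server],
        # Step 10: cold migration (normally has to be confirmed afterwards; omitted).
        ["nova", "migrate", server],
        # Step 13: swact the active controller.
        ["system", "host-swact", active_controller],
        # Step 14: the live migration that fails in this report.
        ["nova", "live-migration", server],
    ]


if __name__ == "__main__":
    # Hypothetical resource names, for illustration only.
    for cmd in build_repro_commands("ge_edge", "flv-1cpu", "vol-ge", "vm-ge", "tenant-net"):
        print(" ".join(cmd))
```

Returning argument lists rather than shell strings keeps quoting out of the picture and lets a caller pass each entry directly to `subprocess.run`.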

Expected Behavior
------------------
Live migration in step 14 is expected to succeed

Actual Behavior
----------------
[2018-09-15 21:43:26,974] 53 DEBUG MainThread conftest.update_results::
***Failure at test call: /home/svc-cgcsauto/wassp-repos.new/testcases/cgcs/CGCSAuto/keywords/vm_helper.py:1049: utils.exceptions.VMPostCheckFailed: Check failed post VM operation.

instance: b69eb997-0477-43f5-a2dc-6840e13cadcd
id=instance-00000063, name=tenant2-ge_edge-dedicated-migrate-120

The instance is on compute-0 and attempts to live migrate to compute-2, but the migration unexpectedly aborts.
See nova-conductor.log:
2018-09-15 21:42:41.042 84948 INFO nova.conductor.tasks.live_migrate [req-5f9433b4-ff15-4886-8dcb-b20c88798125 1476c2312b6144a9ae513a66b4666a14 a7f0ca6d827847c883e47b00a2e5cb4c - default default] Live migrating instance b69eb997-0477-43f5-a2dc-6840e13cadcd: source:compute-0 dest:compute-2

2018-09-15 21:42:47.317 54712 ERROR nova.virt.libvirt.driver [req-5f9433b4-ff15-4886-8dcb-b20c88798125 1476c2312b6144a9ae513a66b4666a14 a7f0ca6d827847c883e47b00a2e5cb4c - default default] [instance: b69eb997-0477-43f5-a2dc-6840e13cadcd] Live Migration failure: Failed to open file '/etc/nova/instances/b69eb997-0477-43f5-a2dc-6840e13cadcd/instance-00000063_VARS.fd': No such file or directory

2018-09-15 21:42:47.573 54712 WARNING nova.compute.manager [req-92548739-631c-4ddc-bc13-d3b535690142 20fad0edb889469cab36e8977901c9cd 29b6c88a54a14ca8a2ff424dc0e7c1f3 - default default] [instance: b69eb997-0477-43f5-a2dc-6840e13cadcd] Received unexpected event network-vif-unplugged-b3fff5d5-6f44-48b0-8789-e5cc6592bf0d for instance
2018-09-15 21:42:47.737 54712 ERROR nova.virt.libvirt.driver [req-5f9433b4-ff15-4886-8dcb-b20c88798125 1476c2312b6144a9ae513a66b4666a14 a7f0ca6d827847c883e47b00a2e5cb4c - default default] [instance: b69eb997-0477-43f5-a2dc-6840e13cadcd] Migration operation has aborted

nova-compute.log (on destination compute-2) - reports rollback
2018-09-15 21:42:49.506 60988 WARNING nova.compute.manager [req-952fcd77-943e-4077-98b6-0efdacc3bc88 20fad0edb889469cab36e8977901c9cd 29b6c88a54a14ca8a2ff424dc0e7c1f3 - default default] [instance: b69eb997-0477-43f5-a2dc-6840e13cadcd] Received unexpected event network-vif-plugged-b3fff5d5-6f44-48b0-8789-e5cc6592bf0d for instance
2018-09-15 21:42:50.504 60988 INFO nova.compute.manager [req-5f9433b4-ff15-4886-8dcb-b20c88798125 1476c2312b6144a9ae513a66b4666a14 a7f0ca6d827847c883e47b00a2e5cb4c - default default] [instance: b69eb997-0477-43f5-a2dc-6840e13cadcd] Rollback live-migration at destination.
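
When triaging logs like the ones above, it can help to pull the failing instance and the missing file out of the "Live Migration failure" line automatically. The following is a minimal sketch, assuming the message format shown in this report; the names `parse_failure`, `MigrationFailure` and `is_nvram_vars_file` are hypothetical helpers, not part of nova. The `_VARS.fd` suffix is treated here as the per-guest UEFI variable store, which is an assumption based on the usual libvirt/OVMF naming.

```python
import os
import re
from typing import NamedTuple, Optional

# Matches the failure message format seen in this report (assumed to be stable).
FAILURE_RE = re.compile(
    r"\[instance: (?P<uuid>[0-9a-f-]{36})\] Live Migration failure: "
    r"Failed to open file '(?P<path>[^']+)': (?P<reason>.+)$"
)


class MigrationFailure(NamedTuple):
    instance_uuid: str
    missing_path: str
    reason: str


def parse_failure(line: str) -> Optional[MigrationFailure]:
    """Return the failure details from one log line, or None if it does not match."""
    match = FAILURE_RE.search(line.rstrip())
    if match is None:
        return None
    return MigrationFailure(match.group("uuid"), match.group("path"), match.group("reason"))


def is_nvram_vars_file(path: str) -> bool:
    """Heuristic: True if the file name looks like a UEFI variable store (assumption)."""
    return os.path.basename(path).endswith("_VARS.fd")


# The ERROR line quoted in this report.
SAMPLE = (
    "2018-09-15 21:42:47.317 54712 ERROR nova.virt.libvirt.driver "
    "[req-5f9433b4-ff15-4886-8dcb-b20c88798125 1476c2312b6144a9ae513a66b4666a14 "
    "a7f0ca6d827847c883e47b00a2e5cb4c - default default] "
    "[instance: b69eb997-0477-43f5-a2dc-6840e13cadcd] Live Migration failure: "
    "Failed to open file '/etc/nova/instances/b69eb997-0477-43f5-a2dc-6840e13cadcd/"
    "instance-00000063_VARS.fd': No such file or directory"
)

if __name__ == "__main__":
    failure = parse_failure(SAMPLE)
    if failure is not None and is_nvram_vars_file(failure.missing_path):
        print(f"{failure.instance_uuid}: missing {failure.missing_path} ({failure.reason})")
```

On a compute host, `os.path.exists(failure.missing_path)` could then be used to confirm whether the file is really absent before retrying the migration.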

Reproducibility
---------------
Reproducible

System Configuration
--------------------
2 + X node

Branch/Pull Time/Commit
-----------------------
Master as of date: 2018-09-16_21-38-00

Timestamp/Logs
--------------
See the logs above.

Ghada Khalil (gkhalil) on 2018-09-18
tags: added: stx.distro.openstack
Ghada Khalil (gkhalil) wrote :

Targeting stx.2019.03 as this is not a common guest use-case.
Note: this is a nova upstream bug. A nova bug needs to be opened and the fix proposed there.
Once that is done, the fix can be pushed to stx-staging until we rebase to a new version of openstack which includes the nova fix.

Changed in starlingx:
importance: Undecided → Medium
status: New → Triaged
assignee: nobody → Daniel Chavolla (dchavoll)
tags: added: stx.2019.03
Ghada Khalil (gkhalil) on 2018-09-27
Changed in starlingx:
assignee: Daniel Chavolla (dchavoll) → Jack Ding (jackding)
Ken Young (kenyis) on 2019-01-18
tags: added: stx.2019.05
removed: stx.2019.03
Frank Miller (sensfan22) wrote :

This is an OpenStack nova issue tracked by https://bugs.launchpad.net/nova/+bug/1785123

Ken Young (kenyis) on 2019-04-05
tags: added: stx.2.0
removed: stx.2019.05
Ghada Khalil (gkhalil) on 2019-04-05
Changed in starlingx:
assignee: Jack Ding (jackding) → nobody
tags: added: stx.helpwanted
tags: removed: stx.helpwanted
Changed in starlingx:
assignee: nobody → Frank Miller (sensfan22)
Ghada Khalil (gkhalil) on 2019-04-08
tags: added: stx.retestneeded
Bruce Jones (brucej) on 2019-05-28
Changed in starlingx:
assignee: Frank Miller (sensfan22) → Bruce Jones (brucej)
yong hu (yhu6) wrote :

The default plan is to rebase to Stein before RC1, or to cherry-pick the fix into Stx-Nova.
Chris is the owner of that patch.

yong hu (yhu6) on 2019-07-19
Changed in starlingx:
assignee: Bruce Jones (brucej) → Yao Zhou (yzuzhouyao-x)
yong hu (yhu6) on 2019-07-19
Changed in starlingx:
assignee: Yao Zhou (yzuzhouyao-x) → Shuquan Huang (shuquan)
yao (yaozhou) on 2019-07-22
Changed in starlingx:
assignee: Shuquan Huang (shuquan) → yao (yaozhou)
yong hu (yhu6) wrote :

I see that @Yao is working on a nova patch, https://review.opendev.org/#/c/621646/, for https://bugs.launchpad.net/nova/+bug/1785123, which is a duplicate of this LP.

Ghada Khalil (gkhalil) wrote :

As per agreement with the community, moving all unresolved medium priority bugs from stx.2.0 to stx.3.0

tags: added: stx.3.0
removed: stx.2.0
yong hu (yhu6) wrote :

@chengde, please help check whether this issue still exists in nova with the "Train" release.

Changed in starlingx:
assignee: yao (yaozhou) → YU CHENGDE (chant)
Peng Peng (ppeng) wrote :

Test Result for: testcases/functional/nova/test_migrate_vms.py::test_migrate_vm_various_guest[ge_edge-1-1024-shared-image] - Test Passed

Issue was not reproduced on train
2019-11-21_20-00-00
wcp_3-6

Ghada Khalil (gkhalil) wrote :

@Yong Hu, Is it possible this was addressed in nova train? The associated nova bug seems to still be open.

yong hu (yhu6) wrote :

@chengde, you mentioned this issue was not observed in recent StarlingX with OS Train.
Please post your update here.

YU CHENGDE (chant) wrote :

After live-migrating a UEFI-mode instance with nova on Train,
the issue did not occur.

Ghada Khalil (gkhalil) wrote :

Closing. As per notes above, the issue is no longer reproduced with openstack train.

Changed in starlingx:
status: Triaged → Invalid
tags: removed: stx.retestneeded