VM hard reboot fails after live migration abort on a node with two NUMA sockets
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
OpenStack Compute (nova) | New | Undecided | keerthivasan |
Bug Description
Description
===========
When a live migration is aborted, the new (destination) NUMA topology is mapped to the instance on the source node, while the instance keeps running on the source. A subsequent hard reboot re-calculates the guest XML using the updated NUMA topology, which points at a cell with no free resources, and the VM fails to recover.
Steps to reproduce [100%]
==================
Each compute node should have two NUMA cells (sockets).
The VM flavor has the extra spec below (truncated in the report):
hw:mem_
The flavor requires 100 huge pages.
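The exact extra spec and sizes are truncated above; a minimal sketch of a comparable flavor, assuming the spec is hw:mem_page_size with a 2 MB page size (flavor name and sizes are placeholders, not taken from the report):

$ openstack flavor create --ram 200 --vcpus 2 --disk 10 hugepage-test
$ openstack flavor set hugepage-test --property hw:mem_page_size=2MB

With 2 MB pages, a 200 MB flavor consumes the 100 huge pages mentioned above.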
Before performing the test, make sure the source and destination have the huge page resources shown below.
The VM will be moved from NUMA node 1 on the source (compute1) to NUMA node 0 on the destination (compute2).
Source: [compute1]
~# cat /sys/devices/
Node 0 AnonHugePages: 28672 kB
Node 0 ShmemHugePages: 0 kB
Node 0 FileHugePages: 0 kB
Node 0 HugePages_Total: 210
Node 0 HugePages_Free: 50
Node 0 HugePages_Surp: 0
Node 1 AnonHugePages: 61440 kB
Node 1 ShmemHugePages: 0 kB
Node 1 FileHugePages: 0 kB
Node 1 HugePages_Total: 210
Node 1 HugePages_Free: 50 <source node: the test VM will run on NUMA node 1>
Node 1 HugePages_Surp: 0
Destination: [compute-2]
~# cat /sys/devices/
Node 0 AnonHugePages: 28672 kB
Node 0 ShmemHugePages: 0 kB
Node 0 FileHugePages: 0 kB
Node 0 HugePages_Total: 210
Node 0 HugePages_Free: 130 <destination node: 130 free huge pages>
Node 0 HugePages_Surp: 0
Node 1 AnonHugePages: 61440 kB
Node 1 ShmemHugePages: 0 kB
Node 1 FileHugePages: 0 kB
Node 1 HugePages_Total: 210
Node 1 HugePages_Free: 50
Node 1 HugePages_Surp: 0
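The path in the commands above is truncated; one way to obtain the same per-node breakdown is reading the per-NUMA-node meminfo files (whether this is the exact file the reporter used is an assumption):

~# cat /sys/devices/system/node/node*/meminfo | grep -i huge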
NUMA topology details before the live migration:
MariaDB [nova]> select numa_topology from instance_extra where instance_
| numa_topology |
| {"nova_
select migration_context from instance_extra where instance_
<empty>
-----END of DB-----
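For reference, a complete form of these checks (the instance UUID is a placeholder, and it is an assumption that the truncated WHERE clause above filters on instance_uuid):

~# mysql nova -e "select numa_topology, migration_context from instance_extra where instance_uuid = '<instance-uuid>'\G"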
# Trigger the live migration
# Apply stress inside the VM so the migration takes longer
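A hedged sketch of this step using python-openstackclient and stress-ng (server name, target host, and stress parameters are placeholders):

$ openstack server migrate --live-migration --host compute2 test-vm
# inside the guest, keep memory busy so the migration does not converge quickly
$ stress-ng --vm 4 --vm-bytes 75% --timeout 600s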
A migration context has been created for the VM:
MariaDB [nova]> select migration_context from instance_extra where instance_
+------
| migration_context |
| {"nova_
The old NUMA cell is 1, the new NUMA cell is 0.
#trigger abort
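The abort can be issued through the server migration API (it requires compute API microversion 2.24 or later); the server name and migration ID below are placeholders:

$ openstack server migration list --server test-vm
$ openstack server migration abort test-vm <migration-id>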
Feb 13 20:59:00 cdc-appblx095-36 nova-compute[
After the abort, the instance NUMA topology has been updated to NUMA cell 0, which belongs to the destination:
| {"nova_
Migration context is not deleted
Expected result
===============
The VM's NUMA topology should be rolled back to its original state after the abort, and a subsequent hard reboot of the VM should work as expected. Instead, the hard reboot fails because no resources are available on the new NUMA node.
Actual result
=============
After the abort, the VM keeps the new NUMA topology calculated for the destination.
A hard reboot of the VM fails.
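The failing step corresponds to a hard reboot request, for example (server name is a placeholder):

$ openstack server reboot --hard test-vm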
Environment
===========
OpenStack Antelope on Ubuntu 22.04, kernel 6.5.0-15-generic #15~22.04.1-Ubuntu SMP PREEMPT_DYNAMIC Fri Jan 12 18:54:30 UTC 2024 x86_64 GNU/Linux
Changed in nova:
assignee: nobody → keerthivasan (keerthivassan86)
description: updated
The line https://github.com/openstack/nova/blob/master/nova/compute/manager.py#L9787 is not executed, and the migration_context is not dropped.
https://github.com/openstack/nova/blob/master/nova/compute/manager.py#L9788 is executed before the call at https://github.com/openstack/nova/blob/master/nova/compute/manager.py#L9780 has completed, and it overrides the instance numa_topology via the cleanup at the destination.