VM hard reboot fails after live migration abort on a node with two NUMA sockets

Bug #2053163 reported by keerthivasan
Affects: OpenStack Compute (nova)
Status: New
Importance: Undecided
Assigned to: keerthivasan

Bug Description

Description
===========
When a live migration is aborted, the new NUMA topology is mapped to the instance on the source host, even though the instance keeps running on the source. When a hard reboot later re-calculates the domain XML, it uses the updated NUMA topology, whose cell has no free resources, and the VM fails to recover.

Steps to reproduce [100%]
==================

Each compute node should have two NUMA cells (sockets).

The VM flavor has the following extra specs:

hw:mem_page_size='1048576', hw:numa_nodes='1'

We need 100 huge pages for this specific flavor.
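
As a minimal sketch (not from the report) of how such a flavor can be created with python-novaclient; the flavor name, disk size, endpoint and credentials below are assumptions for illustration only:

# Minimal sketch (assumed names/credentials) of a flavor with the extra specs
# used in this reproduction: 1 GiB huge pages confined to a single guest NUMA node.
from keystoneauth1 import loading, session
from novaclient import client

loader = loading.get_plugin_loader('password')
auth = loader.load_from_options(
    auth_url='http://controller:5000/v3',    # assumed endpoint
    username='admin', password='secret',     # assumed credentials
    project_name='admin',
    user_domain_name='Default', project_domain_name='Default')
nova = client.Client('2.1', session=session.Session(auth=auth))

# 80 GiB of RAM (matching the 81920 MB in the DB dump) backed by 1 GiB huge
# pages, all placed on one guest NUMA node; disk size is an arbitrary assumption
flavor = nova.flavors.create(name='numa-1g-hugepages',   # assumed flavor name
                             ram=81920, vcpus=16, disk=20)
flavor.set_keys({'hw:mem_page_size': '1048576',
                 'hw:numa_nodes': '1'})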

Before performing the test, make sure the source and destination have the huge page resources shown below.

We will live migrate the VM from the source node (from NUMA node 1 on compute1 to NUMA node 0 on compute2).

Source: [compute1]

~# cat /sys/devices/system/node/node*/meminfo | grep -i huge
Node 0 AnonHugePages: 28672 kB
Node 0 ShmemHugePages: 0 kB
Node 0 FileHugePages: 0 kB
Node 0 HugePages_Total: 210
Node 0 HugePages_Free: 50
Node 0 HugePages_Surp: 0
Node 1 AnonHugePages: 61440 kB
Node 1 ShmemHugePages: 0 kB
Node 1 FileHugePages: 0 kB
Node 1 HugePages_Total: 210
Node 1 HugePages_Free: 50 <source node: our test VM runs on NUMA node 1>
Node 1 HugePages_Surp: 0

Destination: [compute-2]

~# cat /sys/devices/system/node/node*/meminfo | grep -i huge
Node 0 AnonHugePages: 28672 kB
Node 0 ShmemHugePages: 0 kB
Node 0 FileHugePages: 0 kB
Node 0 HugePages_Total: 210
Node 0 HugePages_Free: 130 <destination node has 130 free huge pages>
Node 0 HugePages_Surp: 0
Node 1 AnonHugePages: 61440 kB
Node 1 ShmemHugePages: 0 kB
Node 1 FileHugePages: 0 kB
Node 1 HugePages_Total: 210
Node 1 HugePages_Free: 50
Node 1 HugePages_Surp: 0
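
As a side note, the same per-node figures can also be read from the 1 GiB hugepage counters in sysfs. The following is a small helper sketched for this report, not part of the original reproduction:

# Sketch: report total/free 1 GiB huge pages per NUMA node from sysfs,
# as an alternative to grepping /sys/devices/system/node/node*/meminfo.
import glob
import os

PAGE_SIZE_KB = 1048576  # 1 GiB pages, matching hw:mem_page_size above

for node_dir in sorted(glob.glob('/sys/devices/system/node/node[0-9]*')):
    node = os.path.basename(node_dir)
    hp_dir = os.path.join(node_dir, 'hugepages', f'hugepages-{PAGE_SIZE_KB}kB')
    with open(os.path.join(hp_dir, 'nr_hugepages')) as f:
        total = int(f.read())
    with open(os.path.join(hp_dir, 'free_hugepages')) as f:
        free = int(f.read())
    print(f'{node}: total={total} free={free}')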

NUMA topology details before the live migration:

MariaDB [nova]> select numa_topology from instance_extra where instance_uuid='4b115eb3-59f7-4e27-b877-2e326ef017b3';

| numa_topology |

| {"nova_object.name": "InstanceNUMATopology", "nova_object.namespace": "nova", "nova_object.version": "1.3", "nova_object.data": {"cells": [{"nova_object.name": "InstanceNUMACell", "nova_object.namespace": "nova", "nova_object.version": "1.6", "nova_object.data": {"id": 1, "cpuset": [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15], "pcpuset": [], "cpuset_reserved": null, "memory": 81920, "pagesize": 1048576, "cpu_pinning_raw": null, "cpu_policy": null, "cpu_thread_policy": null}, "nova_object.changes": ["pagesize", "id"]}], "emulator_threads_policy": null}, "nova_object.changes": ["emulator_threads_policy", "cells"]} |

select migration_context from instance_extra where instance_uuid='4b115eb3-59f7-4e27-b877-2e326ef017b3';
    <empty>

-----END of DB-----

#Trigger live migration

#Apply stress inside the VM so that the migration takes longer to complete

A migration context is created for the specific VM:

MariaDB [nova]> select migration_context from instance_extra where instance_uuid='4b115eb3-59f7-4e27-b877-2e326ef017b3';
+-----------------------------------------------------------------------------------------------------------------------------------------------------------------
| migration_context |

| {"nova_object.name": "MigrationContext", "nova_object.namespace": "nova", "nova_object.version": "1.2", "nova_object.data": {"instance_uuid": "4b115eb3-59f7-4e27-b877-2e326ef017b3", "migration_id": 283, "new_numa_topology": {"nova_object.name": "InstanceNUMATopology", "nova_object.namespace": "nova", "nova_object.version": "1.3", "nova_object.data": {"cells": [{"nova_object.name": "InstanceNUMACell", "nova_object.namespace": "nova", "nova_object.version": "1.6", "nova_object.data": {"id": 0, "cpuset": [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15], "pcpuset": [], "cpuset_reserved": null, "memory": 81920, "pagesize": 1048576, "cpu_pinning_raw": null, "cpu_policy": null, "cpu_thread_policy": null}, "nova_object.changes": ["cpuset_reserved", "id", "pcpuset", "pagesize", "cpu_pinning_raw", "cpu_policy", "cpu_thread_policy", "memory", "cpuset"]}], "emulator_threads_policy": null}, "nova_object.changes": ["emulator_threads_policy", "cells"]}, "old_numa_topology": {"nova_object.name": "InstanceNUMATopology", "nova_object.namespace": "nova", "nova_object.version": "1.3", "nova_object.data": {"cells": [{"nova_object.name": "InstanceNUMACell", "nova_object.namespace": "nova", "nova_object.version": "1.6", "nova_object.data": {"id": 1, "cpuset": [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15], "pcpuset": [], "cpuset_reserved": null, "memory": 81920, "pagesize": 1048576, "cpu_pinning_raw": null, "cpu_policy": null, "cpu_thread_policy": null}, "nova_object.changes": ["id", "pagesize"]}], "emulator_threads_policy": null}, "nova_object.changes": ["emulator_threads_policy", "cells"]}

The old NUMA cell is 1; the new NUMA cell is 0.
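
For reference, the cell ids were extracted from the serialized MigrationContext above with a small helper. This is a sketch written for this report (the function name is arbitrary), not nova code:

import json

def numa_cell_ids(migration_context_json):
    """Return (old_cell_ids, new_cell_ids) from a serialized MigrationContext."""
    ctx = json.loads(migration_context_json)['nova_object.data']

    def cell_ids(topology):
        cells = topology['nova_object.data']['cells']
        return [c['nova_object.data']['id'] for c in cells]

    return (cell_ids(ctx['old_numa_topology']),
            cell_ids(ctx['new_numa_topology']))

# For the record shown above this returns ([1], [0]): the claimed (new)
# topology points at destination cell 0 while the instance still runs on
# source cell 1.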

#Trigger abort

Feb 13 20:59:00 cdc-appblx095-36 nova-compute[638201]: 2024-02-13 20:59:00.991 638201 ERROR nova.virt.libvirt.driver [None req-05850c05-ba5b-40ae-a37c-5ccdde8ded47 4807f132b7bb47bbabbe50de9bd974c8 b61fc56101024f498d4d95e863c7333f - - default default] [instance: 4b115eb3-59f7-4e27-b877-2e326ef017b3] Migration operation has aborted

After the abort, the instance NUMA topology was updated to NUMA cell 0, which belongs to the destination:

| {"nova_object.name": "InstanceNUMATopology", "nova_object.namespace": "nova", "nova_object.version": "1.3", "nova_object.data": {"cells": [{"nova_object.name": "InstanceNUMACell", "nova_object.namespace": "nova", "nova_object.version": "1.6", "nova_object.data": {"id": 0, "cpuset": [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15], "pcpuset": [], "cpuset_reserved": null, "memory": 81920, "pagesize": 1048576, "cpu_pinning_raw": null, "cpu_policy": null, "cpu_thread_policy": null}, "nova_object.changes": ["cpu_thread_policy", "cpuset_reserved", "cpu_pinning_raw", "cpuset", "cpu_policy", "memory", "pagesize", "pcpuset", "id"]}], "emulator_threads_policy": null}, "nova_object.changes": ["emulator_threads_policy", "cells"]} |

The migration context is not deleted.

Expected result
===============
The NUMA topology of the VM should be properly rolled back to its original state after the abort. Currently a subsequent hard reboot fails because no resources are available on the recorded NUMA node; the hard reboot should work as expected.

Actual result
=============
After the abort, the VM keeps the newer NUMA topology calculated for the destination.

A hard reboot of the VM then fails: the regenerated domain XML places guest memory on the NUMA cell recorded in the new topology, which does not have enough free 1 GiB huge pages, so qemu cannot allocate the guest RAM backing store:

2024-02-14 21:56:41.877 790638 ERROR oslo_messaging.rpc.server raise libvirtError('virDomainCreateWithFlags() failed')
2024-02-14 21:56:41.877 790638 ERROR oslo_messaging.rpc.server libvirt.libvirtError: internal error: qemu unexpectedly closed the monitor: 2024-02-14T21:56:40.684445Z qemu-system-x86_64: unable to map backing store for guest RAM: Cannot allocate memory
2024-02-14 21:56:41.877 790638 ERROR oslo_messaging.rpc.server

Environment
===========

Using the OpenStack Antelope release on Ubuntu; kernel: 6.5.0-15-generic #15~22.04.1-Ubuntu SMP PREEMPT_DYNAMIC Fri Jan 12 18:54:30 UTC 2 x86_64 x86_64 x86_64 GNU/Linux

Changed in nova:
assignee: nobody → keerthivasan (keerthivassan86)
description: updated
Revision history for this message
keerthivasan (keerthivassan86) wrote :

This line, https://github.com/openstack/nova/blob/master/nova/compute/manager.py#L9787, is not executing; we can see that the migration_context is not dropped.

The line at https://github.com/openstack/nova/blob/master/nova/compute/manager.py#L9788 executes before the function at https://github.com/openstack/nova/blob/master/nova/compute/manager.py#L9780 has completed, and the destination-side cleanup then overrides the instance numa_topology.
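
To make the ordering problem concrete, below is a small self-contained simulation of the last-writer-wins effect described here. It is only an illustrative sketch; the class and function names are made up and this is not nova's actual code:

# Two rollback paths both read-modify-write the same instance record; if the
# destination-side cleanup saves after the source-side rollback, the "new"
# NUMA topology (destination cell 0) wins and the migration context survives.

class FakeInstance:
    """Stand-in for the persisted instance_extra row."""
    def __init__(self):
        self.numa_topology = {'cell_id': 1}       # original, source cell 1
        self.migration_context = {
            'old_numa_topology': {'cell_id': 1},
            'new_numa_topology': {'cell_id': 0},  # computed for the destination
        }

    def save(self, numa_topology, migration_context):
        # last writer wins, like an UPDATE on the instance_extra row
        self.numa_topology = numa_topology
        self.migration_context = migration_context


def rollback_on_source(instance):
    # intended source-side rollback: restore the old topology, drop the context
    instance.save(numa_topology={'cell_id': 1}, migration_context=None)


def rollback_at_destination(instance, migration_context):
    # destination cleanup works on the instance with the migration context
    # applied, so it sees and re-saves the *new* (destination) topology and
    # the stale context
    instance.save(numa_topology=migration_context['new_numa_topology'],
                  migration_context=migration_context)


instance = FakeInstance()
ctx = dict(instance.migration_context)   # context sent along with the RPC cast
rollback_on_source(instance)             # source-side rollback completes first ...
rollback_at_destination(instance, ctx)   # ... then the async destination save lands
print(instance.numa_topology)       # {'cell_id': 0} -> matches the DB dump above
print(instance.migration_context)   # still present  -> "migration context is not deleted"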

Revision history for this message
keerthivasan (keerthivassan86) wrote :

The in-progress change https://review.opendev.org/c/openstack/nova/+/851832 summarizes this issue.
