[Nova] NUMA scheduler filter demands double CPU resource in VM rebuild scenario

Bug #1853575 reported by Alexander Rubtsov
This bug affects 1 person
Affects: Mirantis OpenStack
Importance: Medium
Assigned to: MOS Maintenance

Bug Description

MOS: 9.2+ (code of Aug 2018)

--- Environment ---
- NUMA and CPU pinning are enabled
- Compute node where 4 CPUs are dedicated to Nova (e.g. vcpu_pin_set="8-9,18-19")
- Flavor with 1 vCPU (vcpus=1) and the extra_spec "hw:cpu_policy": "dedicated"
- 2 different Glance images, neither with any CPU settings (e.g. named "original" and "new")
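
To make the capacity in this environment explicit, here is a minimal Python sketch that expands a CPU list such as "8-9,18-19" into individual CPU ids. The helper name parse_cpu_set is hypothetical and is not Nova's parser; it only handles comma-separated ids and inclusive ranges, not any other syntax the real option may accept.

```python
def parse_cpu_set(spec):
    """Expand a spec such as "8-9,18-19" into a set of CPU ids.

    Hypothetical helper for illustration; handles only single ids and
    inclusive "start-end" ranges separated by commas.
    """
    cpus = set()
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue
        if "-" in part:
            start, end = (int(x) for x in part.split("-", 1))
            cpus.update(range(start, end + 1))
        else:
            cpus.add(int(part))
    return cpus


pinnable = parse_cpu_set("8-9,18-19")
print(sorted(pinnable), len(pinnable))  # [8, 9, 18, 19] 4
```

With four pinnable CPUs and a flavor that pins one CPU per instance, four instances use up every dedicated CPU on the node, which is the state the reproduction steps rely on.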

--- Steps to reproduce ---
1. Launch 4 VM instances on the mentioned Compute node with the mentioned flavor
2. Perform "nova rebuild <VM_instance> <new_image>" against any 1 VM instance

--- Actual behavior ---
The rebuild fails with an error:
root@cic-1:~# nova instance-action-list cd727982-d114-436d-812a-0c9b88e6f6fb
+---------+------------------------------------------+---------+----------------------------+
| Action | Request_ID | Message | Start_Time |
+---------+------------------------------------------+---------+----------------------------+
| create | req-5f5b612f-f596-4371-a84d-2e069bc9c711 | - | 2019-11-18T12:51:36.000000 |
| rebuild | req-248059ec-b283-4342-8d04-54adc4b92974 | Error | 2019-11-18T12:57:59.000000 |
+---------+------------------------------------------+---------+----------------------------+

In nova-scheduler.log:
DEBUG nova.scheduler.filters.numa_topology_filter [req-248059ec-b283-4342-8d04-54adc4b92974 c09b101198cf45f690366ef788518b93 476d94a86e7c487e901d02a003a5f5b6 - - -] [instance: cd727982-d114-436d-812a-0c9b88e6f6fb] compute-0-1.domain.tld, compute-0-1.domain.tld fails NUMA topology requirements. The instance does not fit on this host. host_passes /usr/lib/python2.7/dist-packages/nova/scheduler/filters/numa_topology_filter.py:92

--- Expected behavior ---
- Nova recognizes that a rebuild does not need additional CPUs
- The rebuild succeeds

--- Additional information ---
Upstream bug report that might be related: https://bugs.launchpad.net/nova/+bug/1804502

Revision history for this message
Alexander Rubtsov (arubtsov) wrote :

sla2 for 9.x-updates

Changed in mos:
importance: Undecided → Medium
milestone: none → 9.x-updates
tags: added: customer-found sla2
Changed in mos:
assignee: nobody → MOS Maintenance (mos-maintenance)
Changed in mos:
status: New → Confirmed
milestone: 9.x-updates → 9.2-mu-16
Revision history for this message
Vladimir Khlyunev (vkhlyunev) wrote :

Some notes:
There are also several other bugs in rebuilding pinned instances:
https://bugs.launchpad.net/nova/+bug/1763766
https://bugs.launchpad.net/nova/+bug/1750623
https://bugs.launchpad.net/nova/+bug/1750618
https://bugs.launchpad.net/nova/+bug/1772523

A proper fix will take time because upstream developers are still in the "discussion" stage of the fix. In the meantime we will provide a workaround, which will also be part of the complete fix. The workaround is dangerous: without the additional "image validation", a rebuild that uses a new image with different CPU requirements is unsafe, and the customer must guard against that case themselves. I am also identifying the required fixes from the bugs above and backporting them as needed.

Revision history for this message
Fuel Devops McRobotson (fuel-devops-robot) wrote : Fix merged to openstack/nova (9.0/mitaka)

Reviewed: https://review.fuel-infra.org/41532
Submitter: Pkgs Jenkins <email address hidden>
Branch: 9.0/mitaka

Commit: 5d1620c14cfd7da9a31c38b4d921583b194ae905
Author: Vladimir Khlyunev <email address hidden>
Date: Tue Dec 10 15:05:03 2019

Disable NUMATopologyFilter on rebuild

THIS PATCH IS ONLY PART OF THE FIX. Without the backported
https://review.opendev.org/#/c/687957/ it is just a workaround.

As the new behavior of rebuild enforces that no changes
to the NUMA constraints are allowed on rebuild, we no longer
need to execute the NUMATopologyFilter. Previously
the NUMATopologyFilter would process the rebuild request
as if it was a request to spawn a new instance, as the
numa_fit_instance_to_host function is not rebuild-aware.

As such, prior to this change a rebuild would only succeed
if a host had enough additional capacity for a second instance
on the same host meeting the requirements of the new image and
existing flavor. This behavior was incorrect on two counts, as
a rebuild uses a noop claim. First, the resource usage cannot
change, so it was incorrect to require the additional capacity
to rebuild an instance. Second, it was incorrect not to assert
that the resource usage remained the same.

https://review.opendev.org/#/c/687957/ addressed guarding the
rebuild against altering the resource usage, and this change
allows in-place rebuild. It should be backported as soon as
possible.

This change found a latent bug that will be addressed in a follow-up
change, and updated the functional tests to note the incorrect
behavior.

Change-Id: I32525d1ed71704bd72d78a5ae6274c1f484c2072
Partial-bug: 1853575
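
The capacity argument in the commit message can be illustrated with a small, self-contained Python model. This is only a sketch under stated assumptions, not Nova code: the Host class, fits_as_new_instance, host_passes, and the rebuild_aware flag are hypothetical names that stand in for the scheduler's pinned-CPU accounting, and the instance names are made up.

```python
class Host:
    """Toy model of a compute node with a fixed set of pinnable CPUs."""

    def __init__(self, pinnable_cpus):
        self.free = set(pinnable_cpus)
        self.pinned = {}  # instance id -> set of CPUs pinned to it

    def pin(self, instance_id, vcpus):
        if len(self.free) < vcpus:
            raise RuntimeError("not enough free pinnable CPUs")
        cpus = {self.free.pop() for _ in range(vcpus)}
        self.pinned[instance_id] = cpus
        return cpus


def fits_as_new_instance(host, vcpus):
    # Capacity check that ignores CPUs the instance already holds,
    # i.e. it treats every request as a brand-new instance.
    return len(host.free) >= vcpus


def host_passes(host, instance_id, vcpus, is_rebuild, rebuild_aware):
    # A rebuild keeps its existing pinning, so a rebuild-aware check
    # can accept the host without asking for extra capacity.
    if is_rebuild and rebuild_aware and instance_id in host.pinned:
        return True
    return fits_as_new_instance(host, vcpus)


# Four pinnable CPUs, four instances with one dedicated vCPU each.
host = Host({8, 9, 18, 19})
for name in ("vm-1", "vm-2", "vm-3", "vm-4"):
    host.pin(name, 1)

before = host_passes(host, "vm-1", 1, is_rebuild=True, rebuild_aware=False)
after = host_passes(host, "vm-1", 1, is_rebuild=True, rebuild_aware=True)
print(before, after)  # False True
```

In this model the non-rebuild-aware check rejects the host because no CPU is free, even though the rebuilt instance would keep the CPU it already holds. Skipping the check is only safe if something else guarantees that the new image does not change the CPU requirements, which is the role the commit message assigns to the change at https://review.opendev.org/#/c/687957/.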

Revision history for this message
Vladimir Khlyunev (vkhlyunev) wrote :
Changed in mos:
status: Confirmed → Fix Committed
Revision history for this message
Fuel Devops McRobotson (fuel-devops-robot) wrote : Fix proposed to openstack/nova (mcp/pike)

Fix proposed to branch: mcp/pike
Change author: Vladimir Khlyunev <email address hidden>
Review: https://review.fuel-infra.org/41661

Revision history for this message
Fuel Devops McRobotson (fuel-devops-robot) wrote : Change abandoned on openstack/nova (mcp/pike)

Change abandoned by Pavlo Shchelokovskyy <email address hidden> on branch: mcp/pike
Review: https://review.fuel-infra.org/41661
Reason: wrong gerrit instance for mcp/pike branch, please file your patches at gerrit.mcp.mirantis.com instead
