NUMA instance spawn fails on get_best_cpu_topology when there is no 'threads' preference

Bug #1910466 reported by melanie witt on 2021-01-07
Affects: OpenStack Compute (nova)
Importance: Medium
Assigned to: sean mooney

Bug Description

Seen downstream in a customer environment where a NUMA instance fails in driver.spawn during get_best_cpu_topology when (1) the flavor expressed no preference for CPU threads and (2) every possible topology has more than one CPU thread:

2020-12-27 23:45:12.327 1 ERROR nova.compute.manager [req-321f4757-b777-4b6a-93d9-26352fe343a3 8529101f4f0c482ea4b82f9a955785cf 7a2328acc0a6451c8b23fb8184932506 - default default] [instance: 890325b1-4d73-4a9c-86c0-c4e811690c3f] Instance failed to spawn: IndexError: list index out of range
2020-12-27 23:45:12.327 1 ERROR nova.compute.manager [instance: 890325b1-4d73-4a9c-86c0-c4e811690c3f] Traceback (most recent call last):
2020-12-27 23:45:12.327 1 ERROR nova.compute.manager [instance: 890325b1-4d73-4a9c-86c0-c4e811690c3f] File "/usr/lib/python2.7/site-packages/nova/compute/manager.py", line 2273, in _build_resources
2020-12-27 23:45:12.327 1 ERROR nova.compute.manager [instance: 890325b1-4d73-4a9c-86c0-c4e811690c3f] yield resources
2020-12-27 23:45:12.327 1 ERROR nova.compute.manager [instance: 890325b1-4d73-4a9c-86c0-c4e811690c3f] File "/usr/lib/python2.7/site-packages/nova/compute/manager.py", line 2053, in _build_and_run_instance
2020-12-27 23:45:12.327 1 ERROR nova.compute.manager [instance: 890325b1-4d73-4a9c-86c0-c4e811690c3f] block_device_info=block_device_info)
2020-12-27 23:45:12.327 1 ERROR nova.compute.manager [instance: 890325b1-4d73-4a9c-86c0-c4e811690c3f] File "/usr/lib/python2.7/site-packages/nova/virt/libvirt/driver.py", line 3133, in spawn
2020-12-27 23:45:12.327 1 ERROR nova.compute.manager [instance: 890325b1-4d73-4a9c-86c0-c4e811690c3f] mdevs=mdevs)
2020-12-27 23:45:12.327 1 ERROR nova.compute.manager [instance: 890325b1-4d73-4a9c-86c0-c4e811690c3f] File "/usr/lib/python2.7/site-packages/nova/virt/libvirt/driver.py", line 5447, in _get_guest_xml
2020-12-27 23:45:12.327 1 ERROR nova.compute.manager [instance: 890325b1-4d73-4a9c-86c0-c4e811690c3f] context, mdevs)
2020-12-27 23:45:12.327 1 ERROR nova.compute.manager [instance: 890325b1-4d73-4a9c-86c0-c4e811690c3f] File "/usr/lib/python2.7/site-packages/nova/virt/libvirt/driver.py", line 5202, in _get_guest_config
2020-12-27 23:45:12.327 1 ERROR nova.compute.manager [instance: 890325b1-4d73-4a9c-86c0-c4e811690c3f] instance.numa_topology)
2020-12-27 23:45:12.327 1 ERROR nova.compute.manager [instance: 890325b1-4d73-4a9c-86c0-c4e811690c3f] File "/usr/lib/python2.7/site-packages/nova/virt/libvirt/driver.py", line 3923, in _get_guest_cpu_config
2020-12-27 23:45:12.327 1 ERROR nova.compute.manager [instance: 890325b1-4d73-4a9c-86c0-c4e811690c3f] flavor, image_meta, numa_topology=instance_numa_topology)
2020-12-27 23:45:12.327 1 ERROR nova.compute.manager [instance: 890325b1-4d73-4a9c-86c0-c4e811690c3f] File "/usr/lib/python2.7/site-packages/nova/virt/hardware.py", line 625, in get_best_cpu_topology
2020-12-27 23:45:12.327 1 ERROR nova.compute.manager [instance: 890325b1-4d73-4a9c-86c0-c4e811690c3f] allow_threads, numa_topology)[0]
2020-12-27 23:45:12.327 1 ERROR nova.compute.manager [instance: 890325b1-4d73-4a9c-86c0-c4e811690c3f] IndexError: list index out of range

Note that the IndexError itself is a separate, unrelated bug: it happens because the code unconditionally indexes into the list returned after filtering for NUMA threads, which in this case is empty. This bug is about why the NUMA threads filtering produces an empty list in the first place.

In this example failure we have a request for vcpus=8 with limits cores=2, sockets=2, and threads=8:

2020-12-27 23:45:12.322 1 DEBUG nova.virt.hardware [req-321f4757-b777-4b6a-93d9-26352fe343a3 8529101f4f0c482ea4b82f9a955785cf 7a2328acc0a6451c8b23fb8184932506 - default default] Getting desirable topologies for flavor Flavor(created_at=2020-11-23T08:41:17Z,deleted=False,deleted_at=None,description=None,disabled=False,ephemeral_gb=0,extra_specs={hw:cpu_max_cores='2',hw:cpu_max_sockets='2',hw:cpu_max_threads='8',hw:cpu_policy='dedicated',hw:mem_page_size='large',hw:numa_cpus.0='0,1,2,3',hw:numa_cpus.1='4,5,6,7',hw:numa_mem.0='16384',hw:numa_mem.1='16384',hw:numa_mempolicy='strict',hw:numa_nodes='2'},flavorid='2060ed99-654c-4309-88b8-9bceeb794ba3',id=176,is_public=True,memory_mb=32768,name='test',projects=<?>,root_gb=40,rxtx_factor=1.0,swap=0,updated_at=None,vcpu_weight=0,vcpus=8) and image_meta ImageMeta(checksum='157e26aac48c1a02c08d07f4f7a6d1b6',container_format='bare',created_at=2020-03-08T16:12:44Z,direct_url=<?>,disk_format='qcow2',id=7812d228-07c8-4a8f-9878-459a0093cc34,min_disk=0,min_ram=0,name='UAG_generic-180104',owner='7a2328acc0a6451c8b23fb8184932506',properties=ImageMetaProps,protected=<?>,size=769468416,status='active',tags=<?>,updated_at=2020-03-23T06:11:54Z,virtual_size=<?>,visibility=<?>), allow threads: True _get_desirable_cpu_topologies /usr/lib/python2.7/site-packages/nova/virt/hardware.py:567
2020-12-27 23:45:12.323 1 DEBUG nova.virt.hardware [req-321f4757-b777-4b6a-93d9-26352fe343a3 8529101f4f0c482ea4b82f9a955785cf 7a2328acc0a6451c8b23fb8184932506 - default default] Flavor limits 2:2:8 _get_cpu_topology_constraints /usr/lib/python2.7/site-packages/nova/virt/hardware.py:313
2020-12-27 23:45:12.323 1 DEBUG nova.virt.hardware [req-321f4757-b777-4b6a-93d9-26352fe343a3 8529101f4f0c482ea4b82f9a955785cf 7a2328acc0a6451c8b23fb8184932506 - default default] Image limits 2:2:8 _get_cpu_topology_constraints /usr/lib/python2.7/site-packages/nova/virt/hardware.py:324
2020-12-27 23:45:12.324 1 DEBUG nova.virt.hardware [req-321f4757-b777-4b6a-93d9-26352fe343a3 8529101f4f0c482ea4b82f9a955785cf 7a2328acc0a6451c8b23fb8184932506 - default default] Flavor pref -1:-1:-1 _get_cpu_topology_constraints /usr/lib/python2.7/site-packages/nova/virt/hardware.py:347
2020-12-27 23:45:12.324 1 DEBUG nova.virt.hardware [req-321f4757-b777-4b6a-93d9-26352fe343a3 8529101f4f0c482ea4b82f9a955785cf 7a2328acc0a6451c8b23fb8184932506 - default default] Image pref -1:-1:-1 _get_cpu_topology_constraints /usr/lib/python2.7/site-packages/nova/virt/hardware.py:366
2020-12-27 23:45:12.325 1 DEBUG nova.virt.hardware [req-321f4757-b777-4b6a-93d9-26352fe343a3 8529101f4f0c482ea4b82f9a955785cf 7a2328acc0a6451c8b23fb8184932506 - default default] Chosen -1:-1:-1 limits 2:2:8 _get_cpu_topology_constraints /usr/lib/python2.7/site-packages/nova/virt/hardware.py:395
2020-12-27 23:45:12.325 1 DEBUG nova.virt.hardware [req-321f4757-b777-4b6a-93d9-26352fe343a3 8529101f4f0c482ea4b82f9a955785cf 7a2328acc0a6451c8b23fb8184932506 - default default] Topology preferred VirtCPUTopology(cores=-1,sockets=-1,threads=-1), maximum VirtCPUTopology(cores=2,sockets=2,threads=8) _get_desirable_cpu_topologies /usr/lib/python2.7/site-packages/nova/virt/hardware.py:571
2020-12-27 23:45:12.325 1 DEBUG nova.virt.hardware [req-321f4757-b777-4b6a-93d9-26352fe343a3 8529101f4f0c482ea4b82f9a955785cf 7a2328acc0a6451c8b23fb8184932506 - default default] Build topologies for 8 vcpu(s) 2:2:8 _get_possible_cpu_topologies /usr/lib/python2.7/site-packages/nova/virt/hardware.py:434
2020-12-27 23:45:12.326 1 DEBUG nova.virt.hardware [req-321f4757-b777-4b6a-93d9-26352fe343a3 8529101f4f0c482ea4b82f9a955785cf 7a2328acc0a6451c8b23fb8184932506 - default default] Got 4 possible topologies _get_possible_cpu_topologies /usr/lib/python2.7/site-packages/nova/virt/hardware.py:461
2020-12-27 23:45:12.326 1 DEBUG nova.virt.hardware [req-321f4757-b777-4b6a-93d9-26352fe343a3 8529101f4f0c482ea4b82f9a955785cf 7a2328acc0a6451c8b23fb8184932506 - default default] Possible topologies [VirtCPUTopology(cores=2,sockets=2,threads=2), VirtCPUTopology(cores=1,sockets=2,threads=4), VirtCPUTopology(cores=2,sockets=1,threads=4), VirtCPUTopology(cores=1,sockets=1,threads=8)] _get_desirable_cpu_topologies /usr/lib/python2.7/site-packages/nova/virt/hardware.py:576
2020-12-27 23:45:12.326 1 DEBUG nova.virt.hardware [req-321f4757-b777-4b6a-93d9-26352fe343a3 8529101f4f0c482ea4b82f9a955785cf 7a2328acc0a6451c8b23fb8184932506 - default default] Filtering topologies best for 1 threads _get_desirable_cpu_topologies /usr/lib/python2.7/site-packages/nova/virt/hardware.py:594
2020-12-27 23:45:12.327 1 DEBUG nova.virt.hardware [req-321f4757-b777-4b6a-93d9-26352fe343a3 8529101f4f0c482ea4b82f9a955785cf 7a2328acc0a6451c8b23fb8184932506 - default default] Remaining possible topologies [] _get_desirable_cpu_topologies /usr/lib/python2.7/site-packages/nova/virt/hardware.py:599
2020-12-27 23:45:12.327 1 DEBUG nova.virt.hardware [req-321f4757-b777-4b6a-93d9-26352fe343a3 8529101f4f0c482ea4b82f9a955785cf 7a2328acc0a6451c8b23fb8184932506 - default default] Sorted desired topologies [] _get_desirable_cpu_topologies /usr/lib/python2.7/site-packages/nova/virt/hardware.py:602
2020-12-27 23:45:12.327 1 ERROR nova.compute.manager [req-321f4757-b777-4b6a-93d9-26352fe343a3 8529101f4f0c482ea4b82f9a955785cf 7a2328acc0a6451c8b23fb8184932506 - default default] [instance: 890325b1-4d73-4a9c-86c0-c4e811690c3f] Instance failed to spawn: IndexError: list index out of range

This shows that the flavor and image specified no preference for sockets, cores, or threads (they are all -1) and that 8 vCPUs are required. The possible topologies that satisfy the request for 8 vCPUs while staying within the limits of 2 max cores, 2 max sockets, and 8 max threads are: [VirtCPUTopology(cores=2,sockets=2,threads=2), VirtCPUTopology(cores=1,sockets=2,threads=4), VirtCPUTopology(cores=2,sockets=1,threads=4), VirtCPUTopology(cores=1,sockets=1,threads=8)]. When there is no preference for the number of threads, the code uses a value of 1 as the desired number of threads and then filters for the closest thread count that does not exceed that value. Because every topology that can satisfy 8 vCPUs within these limits needs at least 2 threads, all of the possible topologies were filtered out, leaving an empty list, and the request could not be fulfilled.
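The failure mode can be reproduced with a small standalone sketch. The helper names are hypothetical (this is not nova's actual code), and plain tuples stand in for VirtCPUTopology, but the enumeration and filtering mirror the steps in the log above:

```python
def possible_topologies(vcpus, max_sockets, max_cores, max_threads):
    """Yield every (sockets, cores, threads) combination whose product
    equals the requested vCPU count and stays within the limits."""
    for s in range(1, max_sockets + 1):
        for c in range(1, max_cores + 1):
            for t in range(1, max_threads + 1):
                if s * c * t == vcpus:
                    yield (s, c, t)


def filter_threads(topologies, wanted_threads):
    """Keep only topologies whose thread count does not exceed the
    desired number of threads (nova defaults this to 1 when the
    request expresses no preference)."""
    return [topo for topo in topologies if topo[2] <= wanted_threads]


# Limits 2:2:8 with 8 vCPUs, as in the bug report.
topos = list(possible_topologies(8, max_sockets=2, max_cores=2, max_threads=8))
# Four candidates exist, but every one needs at least 2 threads,
# so filtering against the implicit "1 thread" leaves nothing.
print(topos)                                   # 4 candidate topologies
print(filter_threads(topos, wanted_threads=1))  # -> []
```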

Because the request expressed no preference for the number of threads, one of the four possible CPU topologies should have been chosen instead of all of them being filtered out and an empty list returned. We will need to fix the logic around how requests without a threads preference are handled.
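One way to express the desired behavior is sketched below. This is an illustration only, not the fix that actually merged (which removed the code injecting the implicit thread count); the function name and tuple representation are hypothetical:

```python
def filter_threads_with_fallback(topologies, wanted_threads=None):
    """Filter candidate (sockets, cores, threads) tuples by a desired
    thread count, but never filter everything away.

    If no thread preference was expressed, keep all candidates rather
    than filtering against an implicit default of 1."""
    topologies = list(topologies)
    if wanted_threads is None:
        return topologies
    filtered = [topo for topo in topologies if topo[2] <= wanted_threads]
    # Fall back to the unfiltered candidates instead of returning an
    # empty list that the caller will index into.
    return filtered or topologies
```

With the four candidates from the example above and no thread preference, this returns all four instead of an empty list.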

Revision history for this message
melanie witt (melwitt) wrote :

WIP change is being worked on here:

https://review.opendev.org/c/openstack/nova/+/769614

Changed in nova:
status: Confirmed → In Progress
Revision history for this message
Stephen Finucane (stephenfinucane) wrote :

I've discussed this on the review, but I think this is _almost_ working as expected. In that review, Sean is removing the code that leans on the threading information stored in the guest's NUMA topology when it was generated. As noted there, I think this code still serves a purpose:

  My understanding of this code is that it allows us to best map the host topology to the guest
  topology. Consider a 32 core CPU with hyperthreading (i.e. 16 + 16). If you boot a 4 core guest
  and those cores are running on cores 0,1,16,17, then exposing CPU threads inside the guest best
  allows that OS to optimize for the case.

While you're correct that the request expressed no preference for number of threads, the request for a NUMA topology meant we're getting one (of 1) implicitly. I think the bug here is that this implicit request really only makes sense if the user doesn't specify any CPU topology information and we should be outright ignoring information from 'NUMATopologyCell.cpu_topology' (i.e. [1]) if they have. The fix would be to start doing that, but I don't know how easy that is.

[1] https://github.com/openstack/nova/blob/e6f5e814050a19d6f027037424556b2889514ec3/nova/virt/hardware.py#L616-L636

Revision history for this message
sean mooney (sean-k-mooney) wrote :

The guest NUMA topology should have zero impact on the guest vCPU topology; they are intended to be entirely separate aspects of the VM from the guest's point of view.

At the host assignment level we do take account of the number of guest NUMA nodes to determine what subset of CPUs we can pin to, but that does not impact the virtual guest CPU topology. It affects the mapping of the virtual CPUs to host CPUs.

While we could, as an optimisation, reflect the host CPU topology into the guest for better performance, that is not taken into account when live migrating, so it would regress if the guest was moved off that host: you cannot alter the virtual CPU topology when moving the guest, even though we can update the guest-to-host mapping.

I have put a more detailed response in the review, but basically the code I removed was a partial implementation of the

socket0:cores=2,threads=2,socket1:cores=4,threads=1
syntax described in https://specs.openstack.org/openstack/nova-specs/specs/juno/implemented/virt-driver-vcpu-topology.html#alternatives

which was never approved or supported.

It was a premature implementation of a feature that was assumed to be required for NUMA topology support but which libvirt could not support, since it did not allow asymmetric CPU topologies. That is still not possible today and likely never will be:
https://libvirt.org/formatdomain.html#cpu-model-and-topology

When NUMA support was actually added to nova, we did not couple NUMA nodes to CPU sockets, which was one of the assumptions
https://github.com/openstack/nova/commit/770ab8eeb72b184ac6164aeabb89c4bf45f938a9 made.
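For reference, libvirt's domain XML models the guest CPU topology with a single symmetric <topology> element, so a per-socket layout like the syntax above cannot be expressed. An illustrative fragment matching one of the valid candidates from this bug (2 sockets x 2 cores x 2 threads = 8 vCPUs):

```xml
<cpu>
  <!-- One symmetric topology for the whole guest; every socket has the
       same number of cores, and every core the same number of threads. -->
  <topology sockets='2' cores='2' threads='2'/>
</cpu>
```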

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to nova (master)

Reviewed: https://review.opendev.org/c/openstack/nova/+/769601
Committed: https://opendev.org/openstack/nova/commit/fe25fa13e0572aafb62b28bf7fa843a1860e5010
Submitter: "Zuul (22348)"
Branch: master

commit fe25fa13e0572aafb62b28bf7fa843a1860e5010
Author: Sean Mooney <email address hidden>
Date: Wed Jan 6 18:56:39 2021 +0000

    Test numa and vcpu topologies bug: #1910466

    This change reproduces bug #1910466.
    When hw:cpu_max_[sockets|cores|threads] is configured
    in addition to an explicit NUMA topology and CPU pinning,
    nova is currently incapable of generating the correct
    virtual CPU topology, resulting in an index-out-of-range
    error as we attempt to retrieve the first topology
    from an empty list.

    This change reproduces the error via a new functional
    test.

    Related-Bug: #1910466
    Change-Id: I333b3d85deed971678141307dd06545e308cf989

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to nova (master)

Reviewed: https://review.opendev.org/c/openstack/nova/+/769614
Committed: https://opendev.org/openstack/nova/commit/387823b36d091abbaa37efb930fc98b94a5bbb93
Submitter: "Zuul (22348)"
Branch: master

commit 387823b36d091abbaa37efb930fc98b94a5bbb93
Author: Sean Mooney <email address hidden>
Date: Wed Jan 6 19:49:56 2021 +0000

    Fix max cpu topologies with numa affinity

    Nova has never supported specifying per-NUMA-node
    CPU topologies. Logically the CPU topology of a guest
    is independent of its NUMA topology, and there is no
    way to model different CPU topologies per NUMA node
    or implement that in hardware.

    The code in nova that allowed the generation
    of these invalid configurations has now been removed, as it
    broke the automatic selection of CPU topologies based
    on hw:max_[cpus|sockets|threads] flavor and image properties.

    This change removes the incorrect code and the related unit
    tests that asserted nova could generate invalid topologies.

    Closes-Bug: #1910466
    Change-Id: Ia81a0fdbd950b51dbcc70c65ba492549a224ce2b

Changed in nova:
status: In Progress → Fix Released
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to nova (stable/wallaby)

Related fix proposed to branch: stable/wallaby
Review: https://review.opendev.org/c/openstack/nova/+/797652

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (stable/wallaby)

Fix proposed to branch: stable/wallaby
Review: https://review.opendev.org/c/openstack/nova/+/797653

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to nova (stable/victoria)

Related fix proposed to branch: stable/victoria
Review: https://review.opendev.org/c/openstack/nova/+/797680

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (stable/victoria)

Fix proposed to branch: stable/victoria
Review: https://review.opendev.org/c/openstack/nova/+/797681

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to nova (stable/ussuri)

Related fix proposed to branch: stable/ussuri
Review: https://review.opendev.org/c/openstack/nova/+/797878

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (stable/ussuri)

Fix proposed to branch: stable/ussuri
Review: https://review.opendev.org/c/openstack/nova/+/797879

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to nova (stable/train)

Related fix proposed to branch: stable/train
Review: https://review.opendev.org/c/openstack/nova/+/797880

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (stable/train)

Fix proposed to branch: stable/train
Review: https://review.opendev.org/c/openstack/nova/+/797881

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to nova (stable/stein)

Related fix proposed to branch: stable/stein
Review: https://review.opendev.org/c/openstack/nova/+/797975

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (stable/stein)

Fix proposed to branch: stable/stein
Review: https://review.opendev.org/c/openstack/nova/+/797976

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to nova (stable/rocky)

Related fix proposed to branch: stable/rocky
Review: https://review.opendev.org/c/openstack/nova/+/797981

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (stable/rocky)

Fix proposed to branch: stable/rocky
Review: https://review.opendev.org/c/openstack/nova/+/797982

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to nova (stable/queens)

Related fix proposed to branch: stable/queens
Review: https://review.opendev.org/c/openstack/nova/+/797987

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (stable/queens)

Fix proposed to branch: stable/queens
Review: https://review.opendev.org/c/openstack/nova/+/797988
