Put VMs without PCI-passthrough device to non-affinitized NUMA node

Bug #1614882 reported by egoust on 2016-08-19
This bug affects 3 people
Affects: OpenStack Compute (nova)
Importance: Wishlist
Assigned to: Unassigned

Bug Description

Description
===========

Instances without PCI passthrough requests can fill up host NUMA nodes that have PCI devices dedicated to PCI passthrough. Instances are placed on such NUMA nodes even though plenty of free resources are available in other host NUMA nodes. This scheduling can lead to a situation where a later deployment of an instance with a PCI passthrough request fails because of unavailable capacity in the required NUMA node.

Steps to reproduce
==================
Use a test host with 2 NUMA nodes and a PCI device attached to NUMA node 0.

Create a flavor with hw:cpu_policy=dedicated

Spawn several instances without PCI passthrough whose overall memory allocation equals the memory capacity of NUMA node 0.

Then deploy an instance with an SR-IOV port. Scheduling fails with the following error:

2016-08-18 11:17:15.470 55110 DEBUG nova.compute.manager [req-c6d96425-e98b-4a63-8289-e56c40ac46d9 bb8e586fd1264034885fef3aae39e777 b770743f66c44840a999cc8cf60916cd - - -] [instance: b4470025-2a59-4772-9990-a96b55966214] Build of instance b4470025-2a59-4772-9990-a96b55966214 was re-scheduled: Insufficient compute resources: Requested instance NUMA topology together with requested PCI devices cannot fit the given host NUMA topology. _do_build_and_run_instance /usr/lib/python2.7/site-packages/nova/compute/manager.py:1945
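
The reproduction steps can be illustrated with a toy model. This is only a sketch under stated assumptions: it treats placement as first-fit over host NUMA nodes and tracks memory only, and the names `HostNumaNode` and `place_first_fit` are invented for illustration. It is not nova's code.

```python
from dataclasses import dataclass


@dataclass
class HostNumaNode:
    node_id: int
    memory_mb: int
    has_pci_device: bool
    used_mb: int = 0

    def free_mb(self) -> int:
        return self.memory_mb - self.used_mb


def place_first_fit(nodes, memory_mb, needs_pci=False):
    """Place a single-NUMA-node guest on the first host node that fits.

    Returns the chosen node id, or None if no node can host the guest.
    """
    for node in nodes:
        if needs_pci and not node.has_pci_device:
            continue  # strict affinity: the guest must share a node with the device
        if node.free_mb() >= memory_mb:
            node.used_mb += memory_mb
            return node.node_id
    return None


# Assumed host layout: two equal nodes, PCI device attached to node 0.
host = [
    HostNumaNode(node_id=0, memory_mb=8192, has_pci_device=True),
    HostNumaNode(node_id=1, memory_mb=8192, has_pci_device=False),
]

# Four 2048 MB guests without PCI requests fill node 0 completely.
plain_placements = [place_first_fit(host, 2048) for _ in range(4)]

# The SR-IOV guest now has nowhere to go, although node 1 is empty.
sriov_placement = place_first_fit(host, 2048, needs_pci=True)

print(plain_placements)  # [0, 0, 0, 0]
print(sriov_placement)   # None
```

In this model node 1 stays completely unused while the SR-IOV request is rejected, which mirrors the "cannot fit the given host NUMA topology" error above.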

Environment
===========
Mitaka release
Nova: 13.0.0

egoust (ustinov16) on 2016-08-19
description: updated

This is not a bug, but rather by design. Refer to the second warning note in the document below:

    http://docs.openstack.org/admin-guide/compute-cpu-topologies.html#customizing-instance-numa-placement-policies

You will need to define a two-node NUMA topology for the guest if you wish to balance guests across host nodes.
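
As a rough illustration of what a two-node guest topology means, the sketch below splits a guest's vCPUs and memory across guest NUMA nodes. `hw:cpu_policy` and `hw:numa_nodes` are the flavor extra specs mentioned in this report; the helper `split_guest_across_numa_nodes` is hypothetical, and the even split is an assumption for the case where no per-node sizes are given.

```python
def split_guest_across_numa_nodes(extra_specs, vcpus, memory_mb):
    """Return per-guest-NUMA-node vCPU and memory shares, split evenly.

    Hypothetical helper: assumes an even split when only hw:numa_nodes is set.
    """
    nodes = int(extra_specs.get("hw:numa_nodes", 1))
    if nodes < 1 or vcpus % nodes or memory_mb % nodes:
        raise ValueError("vCPUs and memory must divide evenly across NUMA nodes")
    return [
        {"vcpus": vcpus // nodes, "memory_mb": memory_mb // nodes}
        for _ in range(nodes)
    ]


flavor_extra_specs = {"hw:cpu_policy": "dedicated", "hw:numa_nodes": "2"}
guest_cells = split_guest_across_numa_nodes(
    flavor_extra_specs, vcpus=4, memory_mb=4096
)
print(guest_cells)
```

Each guest node can then be fitted to a different host NUMA node, which is why a guest defined this way no longer has to fit entirely into one host node.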

Changed in nova:
status: New → Invalid
egoust (ustinov16) wrote :

In our case we have specific Virtual Network Function images that must be deployed with a single-NUMA-node topology. Some of them require SR-IOV ports and some do not.

From my point of view, NUMA scheduling could be more advanced: an instance without a PCI device request should be placed on a NUMA node with no PCI device attached. For example, a flavor extra_spec could specify that guest NUMA nodes should be bound to host NUMA nodes that have no associated PCI devices in pci_passthrough_whitelist.
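
A minimal sketch of that idea, purely for illustration. The extra spec name `hw:avoid_pci_numa_nodes`, the `HostNumaNode` type, the helper `allowed_host_nodes` and the PCI address are all hypothetical and invented here; none of them exist in nova.

```python
from typing import NamedTuple


class HostNumaNode(NamedTuple):
    node_id: int
    pci_devices: tuple  # addresses of whitelisted PCI devices on this node


# Hypothetical extra spec name, invented for this sketch; not a real nova key.
AVOID_PCI_NODES_SPEC = "hw:avoid_pci_numa_nodes"


def allowed_host_nodes(host_nodes, extra_specs):
    """Return ids of host NUMA nodes the guest may be bound to."""
    avoid = extra_specs.get(AVOID_PCI_NODES_SPEC, "false").lower() == "true"
    return [n.node_id for n in host_nodes if not (avoid and n.pci_devices)]


host_nodes = [
    HostNumaNode(0, ("0000:81:00.1",)),  # made-up device address
    HostNumaNode(1, ()),
]

print(allowed_host_nodes(host_nodes, {AVOID_PCI_NODES_SPEC: "true"}))  # [1]
print(allowed_host_nodes(host_nodes, {}))                              # [0, 1]
```

With the spec set, only node 1 remains a candidate, so node 0 keeps its capacity for guests that actually need the device.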

Changed in nova:
status: Invalid → New
Stephen Gordon (sgordon) wrote :

First thing I will note is that while you can work around this using hw:numa_nodes=2 (or more) to allow the guest to split over more than one node, this is not ideal, as splitting the guest in this fashion impacts not just device allocation but also vCPU and RAM allocation. It also doesn't address that this may still result in failure if, as in the description, all nodes with PCI devices available have already been filled with guests that aren't using them, while space exists on nodes that don't have PCI devices.

There are, I think, a couple of aspects to resolving this ask, which comes up in a couple of different ways (this bug report is certainly not the first time I have seen it raised):

1) Weighting scheduling requests such that if a given request does *not* require an SR-IOV/PCI device it is weighted towards landing on NUMA nodes that do not have them exposed.

2) Allowing users requesting a VM with an SR-IOV/PCI device a mechanism for specifying whether they have a "hard" requirement for it to be co-located on the same NUMA node(s) as the guest or a "soft" requirement. In the latter case the scheduler would still attempt to place the guest on the same NUMA node(s) as the available devices but if this is not possible, just place them on the same host and still attach the device.
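
The two aspects above can be sketched together in a toy node chooser. This is an illustration under assumptions, not the design of any spec: the `Node` type, the `choose_node` function and the `"hard"`/`"soft"` policy values are hypothetical, only memory is tracked, and the host is assumed to have a free device somewhere whenever one is requested.

```python
from typing import NamedTuple, Optional


class Node(NamedTuple):
    node_id: int
    free_mb: int
    has_pci_device: bool


def choose_node(nodes, memory_mb, needs_pci, pci_policy="hard") -> Optional[int]:
    """Pick a host NUMA node for a single-NUMA-node guest, or None."""
    fitting = [n for n in nodes if n.free_mb >= memory_mb]
    if not needs_pci:
        # (1) Weight towards nodes without devices; the sort is stable and
        # False sorts before True, so device-free nodes come first.
        fitting.sort(key=lambda n: n.has_pci_device)
    else:
        with_device = [n for n in fitting if n.has_pci_device]
        # (2) "hard": only affinitized nodes are acceptable.
        #     "soft": prefer them, but fall back to any fitting node.
        if with_device or pci_policy == "hard":
            fitting = with_device
    return fitting[0].node_id if fitting else None


empty_host = [Node(0, 8192, True), Node(1, 8192, False)]
full_node0 = [Node(0, 0, True), Node(1, 8192, False)]

print(choose_node(empty_host, 2048, needs_pci=False))         # 1
print(choose_node(full_node0, 2048, needs_pci=True))          # None
print(choose_node(full_node0, 2048, needs_pci=True,
                  pci_policy="soft"))                         # 1
```

The weighting keeps node 0 free for as long as possible, and the soft policy still lets a device-using guest boot on node 1 once node 0 is exhausted.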

We need to be mindful of the fact that for some workloads the current behaviour is the expectation, while for others the behaviour requested here is the expectation (just having the device is good enough even if it's across NUMA boundaries, which is particularly common when the guest is using multiple devices), and that both may even manifest in the same cloud.

tags: added: numa
summary: - NUMA node scheduling problem
+ SR-IOV devices should be accessible from a non-affinitized NUMA node

As far as I can tell from the description, they want to place a VM that does not use SR-IOV on a non-affinitized NUMA node.
So I suppose the summary could be "Put VMs without PCI-passthrough device to non-affinitized NUMA node" or similar.

egoust (ustinov16) on 2016-08-29
summary: - SR-IOV devices should be accessible from a non-affinitized NUMA node
+ Put VMs without PCI-passthrough device to non-affinitized NUMA node

The following specs, currently under review, should resolve this issue. Worth reviewing:

* https://review.openstack.org/#/c/361140/
* https://review.openstack.org/#/c/364468/

Sergey Nikitin (snikitin) wrote :

Stephen is right. Implementation of these specs will solve the problem. But I think this is a feature request rather than a bug.

Changed in nova:
importance: Undecided → Wishlist
status: New → Confirmed
Augustina Ragwitz (auggy) wrote :

Marking this bug as invalid because it was determined to be a feature request and would be resolved by current specs under review.

Changed in nova:
status: Confirmed → Invalid