strict NUMA memory allocation for 4K pages leads to OOM-killer

Bug #1792985 reported by Chris Friesen
This bug affects 2 people
Affects: OpenStack Compute (nova)
Status: New
Importance: Undecided
Assigned to: Unassigned

Bug Description

We've seen a case on a resource-constrained compute node where booting multiple instances succeeded, but led to the following error messages from the host kernel:

[ 731.911731] Out of memory: Kill process 133047 (nova-api) score 4 or sacrifice child
[ 731.920377] Killed process 133047 (nova-api) total-vm:374456kB, anon-rss:144708kB, file-rss:1892kB, shmem-rss:0kB

The problem appears to be that currently with libvirt an instance which does not specify a NUMA topology (which implies "shared" CPUs and the default memory pagesize) is allowed to float across the whole compute node. As such, we do not know which host NUMA node its memory is going to be allocated from, and therefore we don't know how much memory is remaining on each host NUMA node.

If we have a similar instance which *is* limited to a particular NUMA node (due to adding a PCI device for example, or in the future by specifying dedicated CPUs) then that allocation will currently use "strict" NUMA affinity. This allocation can fail if there isn't enough memory available on that NUMA node (due to being "stolen" by a floating instance, for example).

I think this means that we cannot use "strict" affinity for the default page size even when we do have a numa_topology, since we can't do accurate per-NUMA-node accounting: we don't know which NUMA node floating instances allocated their memory from.

Logically speaking we want to use a numa mode of "preferred", but that only allows us to specify a single NUMA node. So we need to remove the specification entirely to allow the kernel's default "local allocation" rules to take over.

As such, I propose we change the XML for a VM from something like this:

    <numatune>
      <memory mode='strict' nodeset='0'/>
      <memnode cellid='0' mode='strict' nodeset='0'/>
    </numatune>

to something like this (without the overall "memory" mode):

    <numatune>
      <memnode cellid='0' mode='strict' nodeset='0'/>
    </numatune>
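
To make the shape concrete, here is a minimal sketch (plain ElementTree, not Nova's actual libvirt driver code, and the helper name is made up) that emits the per-cell memnode entries while omitting the guest-wide "memory" mode:

    # Illustrative only: Nova's libvirt driver builds this XML through its
    # own config objects; this just shows the shape of the proposed element.
    import xml.etree.ElementTree as ET

    def build_numatune(cell_nodesets):
        """Emit <numatune> with per-cell <memnode> entries but no guest-wide
        <memory mode=...> element, so 4K allocations fall back to the
        kernel's default "local allocation" policy.

        cell_nodesets: mapping of guest cell id -> host nodeset, e.g. {0: '0'}
        """
        numatune = ET.Element('numatune')
        for cellid, nodeset in sorted(cell_nodesets.items()):
            ET.SubElement(numatune, 'memnode',
                          cellid=str(cellid), mode='strict', nodeset=nodeset)
        return ET.tostring(numatune, encoding='unicode')

    print(build_numatune({0: '0'}))
    # <numatune><memnode cellid="0" mode="strict" nodeset="0" /></numatune>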

The end result is that the 4K memory policy shown in /proc/<pid>/numa_maps will be 'default' instead of 'bind:<x>' for everything except the qemu hugepage memory backing. Those mappings will try to allocate from the "local" NUMA node first, but will fall back to other nodes if the request can't be satisfied locally.

The numa_maps for an instance with 2MB pages would look like this:
7fe3a3400000 default anon=64 dirty=64 N0=64 kernelpagesize_kB=4
7fe3a5600000 default stack:219058 anon=4 dirty=4 N0=4 kernelpagesize_kB=4
7fe3c6000000 bind:0 file=/mnt/huge-2048kB/libvirt/qemu/qemu_back_mem._objects_ram-node0.Z0GkXp\040(deleted) huge dirty=256 mapmax=3 N0=256 kernelpagesize_kB=2048
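
As a quick way to confirm the resulting policies on a running guest, the second field of each numa_maps line can be summarized; a small sketch (assuming /proc/<pid>/numa_maps is readable on the host, helper name made up):

    # Sketch only: summarize the memory policy of each mapping for a qemu
    # process, assuming /proc/<pid>/numa_maps is readable on the host.
    import sys
    from collections import Counter

    def policy_summary(pid):
        """Count policies ('default', 'bind:0', ...) in /proc/<pid>/numa_maps."""
        policies = Counter()
        with open('/proc/%d/numa_maps' % pid) as f:
            for line in f:
                fields = line.split()
                if len(fields) >= 2:
                    # Line format: <address> <policy> [counters and flags...]
                    policies[fields[1]] += 1
        return policies

    if __name__ == '__main__':
        print(policy_summary(int(sys.argv[1])))
        # e.g. Counter({'default': 412, 'bind:0': 2}) for a guest whose
        # hugepage backing is still bound but whose 4K mappings float.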

Tags: compute
Revision history for this message
Chris Friesen (cbf123) wrote :

It has been suggested that operators could minimize the chances of this happening by not allowing NUMA-affined and non-NUMA-affined instances on the same NUMA node.

Revision history for this message
sean mooney (sean-k-mooney) wrote :

If we add a hw:numa_mem_policy=strict|preferred extra spec then I could be OK with this change, but I would be very concerned that it would break placement scheduling in the future if you used "preferred" with hugepages.

If we reshape all memory to be per NUMA node in the future, and divide the reserved memory amount per NUMA node, that may prevent this.

Another option would be to have a filter or weigher to isolate NUMA-affined hosts from non-NUMA-affined hosts.

I would be reluctant to add more technical debt for this edge case without fully considering the edge cases.

Today we tell operators to separate the hosts that will be used for instances with NUMA affinity (hugepages, CPU pinning, device passthrough) from the hosts for VMs without NUMA affinity, e.g. a default instance.

Since the above failure case will only happen if that advice is ignored, it does not seem to be a bug to me, but rather a user error that we may want to make harder to make in the future.

The CPU in placement spec https://review.openstack.org/#/c/555081/18/specs/stein/approved/cpu-resources.rst will help with this by allowing operators to isolate a single NUMA node for floating instances and use the other NUMA nodes for pinned instances.

It won't fully resolve the issue, but the only other viable option I see is a host-wide option for NUMA memory affinity. If we allow instances with strict and non-strict memory modes on the same host we will always have this edge case in one form or another, so either we make it host-wide or we have a filter/weigher to prevent/reduce the likelihood of hitting this edge case.
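
For illustration only, a rough sketch of what such an isolation filter might look like (the class name and the 'numa_affinity' aggregate metadata key are made up; this is not an existing Nova filter):

    # Hypothetical sketch: keep instances that request a NUMA topology on
    # hosts tagged (via aggregate metadata) numa_affinity=true, and floating
    # instances on untagged hosts.  The class name and the 'numa_affinity'
    # key are illustrative only.
    from nova.scheduler import filters

    class NUMAAffinityIsolationFilter(filters.BaseHostFilter):

        def host_passes(self, host_state, spec_obj):
            wants_numa = spec_obj.numa_topology is not None
            tagged = any(agg.metadata.get('numa_affinity') == 'true'
                         for agg in host_state.aggregates)
            # NUMA-affined requests only land on tagged hosts; floating
            # requests only land on untagged hosts.
            return tagged if wants_numa else not tagged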

Revision history for this message
Chris Friesen (cbf123) wrote :

In our docs at https://docs.openstack.org/nova/latest/admin/cpu-topologies.html we have the following:

"Caution

Host aggregates should be used to separate pinned instances from unpinned instances as the latter will not respect the resourcing requirements of the former."

I haven't yet seen anything in the docs that warns operators that the same applies for instances with NUMA affinity and instances without NUMA affinity.

Revision history for this message
Jay Pipes (jaypipes) wrote :

Chris, I'm not sure how much more clear we can be. I mean, this is the warning, as you noted:

"Caution

Host aggregates should be used to separate pinned instances from unpinned instances as the latter will not respect the resourcing requirements of the former."

It is 100% clear to me that the operator should not mix non-NUMA-pinned instances with NUMA-pinned instances. This use case is for a "resource-constrained environment" which I presume means that the operator is, indeed, violating the very warning that was placed into the documentation for this specific reason.

Revision history for this message
Chris Friesen (cbf123) wrote :

Jay, in the document you mention, the bit you quote is specifically in a section on CPU pinning. I don't think it will be obvious to operators that the warning is intended to apply to instances with hugepages or PCI devices as well as pinned CPUs. Also, the reason that warning was added was because we didn't account for the CPU resources. I don't think anyone realized there was a memory issue as well.

On the discussion for the "Support shared and dedicated VMs in one host" spec (https://review.openstack.org/#/c/543805/) Sylvain explicitly redirected discussion towards your "Standardize CPU resource tracking" spec, implying that it would enable the ability to have shared and dedicated VMs on one host.

That, combined with the "NUMA Topology with Resource Providers" spec would make it possible to properly track per-node resources, which would make it possible to reliably mix NUMA-pinned instances and non-NUMA-pinned instances.

We've sorted out the handling of floating vs. pinned CPUs; the only remaining thing is how to handle the 4K memory consumption of instances with floating CPUs. It seems arbitrary to me to leave this one gap since, as far as I can see, it's the last blocker.

Revision history for this message
Stephen Finucane (stephenfinucane) wrote :

Please correct me if I've got this wrong, but this looks like a duplicate of bug #1439247. In both cases, scheduling instances with NUMA topologies alongside instances without can result in issues with memory allocation. The current solution is not to schedule these side-by-side, as noted above. The docs need to be updated to reflect this.

Revision history for this message
Chris Friesen (cbf123) wrote :

Stephen: I think you're probably right that this is a dupe, although I think the problem is stated more clearly in this one than in 1439247.

I think it's a cop-out to say "don't schedule numa-topology and non-numa-topology instances on the same compute node". I mean, the way the code is written currently it's not safe, but I think we *should* try to make it safe.

Specifically for edge scenarios, we may only have a small number of compute nodes (sometimes just one or two) and so any host-aggregate-based solution doesn't really work. We need to be able to have these things co-exist on a single compute node.

Specifically for 4K memory, this means either disabling "strict" NUMA affinity, or else restricting floating instances to a single NUMA node.
