strict NUMA memory allocation for 4K pages leads to OOM-killer
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
OpenStack Compute (nova) | New | Undecided | Unassigned |
Bug Description
We've seen a case on a resource-constrained compute node where the kernel's OOM-killer was triggered:
[ 731.911731] Out of memory: Kill process 133047 (nova-api) score 4 or sacrifice child
[ 731.920377] Killed process 133047 (nova-api) total-vm:374456kB, anon-rss:144708kB, file-rss:1892kB, shmem-rss:0kB
The problem appears to be that currently with libvirt an instance which does not specify a NUMA topology (which implies "shared" CPUs and the default memory pagesize) is allowed to float across the whole compute node. As such, we do not know which host NUMA node its memory is going to be allocated from, and therefore we don't know how much memory is remaining on each host NUMA node.
If we have a similar instance which *is* limited to a particular NUMA node (due to adding a PCI device for example, or in the future by specifying dedicated CPUs) then that allocation will currently use "strict" NUMA affinity. This allocation can fail if there isn't enough memory available on that NUMA node (due to being "stolen" by a floating instance, for example).
I think this means that we cannot use "strict" affinity for the default page size even when we do have a numa_topology since we can't have accurate per-NUMA-node accounting due to the fact that we don't know which NUMA node floating instances allocated their memory from.
Logically speaking we want to use a NUMA memory mode of "preferred", but that mode only allows us to specify a single NUMA node. So we need to remove the mode specification entirely and let the kernel's default "local allocation" policy take over.
As such, I propose we change the XML for a VM from something like this:
<numatune>
<memory mode='strict' nodeset='0'/>
<memnode cellid='0' mode='strict' nodeset='0'/>
</numatune>
to something like this (without the overall "memory" mode):
<numatune>
<memnode cellid='0' mode='strict' nodeset='0'/>
</numatune>
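The transformation above (dropping the global "memory" element while keeping the per-cell "memnode" entries) can be sketched with Python's standard XML library. This is purely illustrative; the function name is hypothetical and not part of nova's code:

```python
import xml.etree.ElementTree as ET

def drop_numatune_memory(xml_str):
    """Remove the global <memory> element from a <numatune> fragment,
    keeping the per-cell <memnode> entries (the change proposed above)."""
    root = ET.fromstring(xml_str)
    numatune = root if root.tag == 'numatune' else root.find('numatune')
    if numatune is not None:
        mem = numatune.find('memory')
        if mem is not None:
            numatune.remove(mem)
    return ET.tostring(root, encoding='unicode')

before = """<numatune>
  <memory mode='strict' nodeset='0'/>
  <memnode cellid='0' mode='strict' nodeset='0'/>
</numatune>"""
print(drop_numatune_memory(before))
```

With the global mode gone, regions not covered by a "memnode" entry fall back to the kernel's default local-allocation policy.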
This will have the end result that /proc/<pid>/numa_maps shows the "default" policy for the instance's 4K-page regions, with explicit bindings only for the hugepage-backed regions.
The numa_maps for an instance with 2MB pages would look like this:
7fe3a3400000 default anon=64 dirty=64 N0=64 kernelpagesize_kB=4
7fe3a5600000 default stack:219058 anon=4 dirty=4 N0=4 kernelpagesize_kB=4
7fe3c6000000 bind:0 file=/mnt/
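To see how the accounting in question could be derived from output like the above, here is a minimal sketch that parses numa_maps-style lines and sums pages per policy and per host NUMA node. The helper name is hypothetical and the parsing covers only the fields shown in the sample:

```python
import re

def parse_numa_maps(lines):
    """Return {policy: {node: pages}} from numa_maps-style lines,
    summing the per-node page counts (the N<node>=<pages> fields)."""
    totals = {}
    for line in lines:
        fields = line.split()
        if len(fields) < 2:
            continue
        policy = fields[1]                     # e.g. 'default' or 'bind:0'
        per_policy = totals.setdefault(policy, {})
        for field in fields[2:]:
            m = re.match(r'N(\d+)=(\d+)$', field)
            if m:
                node, pages = int(m.group(1)), int(m.group(2))
                per_policy[node] = per_policy.get(node, 0) + pages
    return totals

sample = [
    "7fe3a3400000 default anon=64 dirty=64 N0=64 kernelpagesize_kB=4",
    "7fe3a5600000 default stack:219058 anon=4 dirty=4 N0=4 kernelpagesize_kB=4",
]
print(parse_numa_maps(sample))   # {'default': {0: 68}}
```

Regions under the "default" policy are exactly the ones whose node placement nova cannot predict, which is the accounting gap described in this bug.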
It has been suggested that operators could minimize the chances of this happening by not allowing NUMA-affined and non-NUMA-affined instances to share the same NUMA node.