Nova doesn't account for hugepages when scheduling VMs

Bug #1950186 reported by Przemyslaw Lal
This bug affects 3 people
Affects: OpenStack Compute (nova)
Status: Won't Fix
Importance: Undecided
Assigned to: Unassigned

Bug Description

Description
===========

When hugepages are enabled on the host it's possible to schedule VMs using more RAM than available.

On the node with memory usage presented below it was possible to schedule 6 instances using a total of 140G of memory and a non-hugepages-enabled flavor. The same machine has 188G of memory in total, of which 64G were reserved for hugepages. Additional ~4G were used for housekeeping, OpenStack control plane, etc. This resulted in overcommitment of roughly 20G.

After running memory intensive operations on the VMs, some of them got OOM killed.

$ cat /proc/meminfo | egrep "^(Mem|Huge)" # on the compute node
MemTotal: 197784792 kB
MemFree: 115005288 kB
MemAvailable: 116745612 kB
HugePages_Total: 64
HugePages_Free: 64
HugePages_Rsvd: 0
HugePages_Surp: 0
Hugepagesize: 1048576 kB
Hugetlb: 67108864 kB

$ os hypervisor show compute1 -c memory_mb -c memory_mb_used -c free_ram_mb
+----------------+--------+
| Field | Value |
+----------------+--------+
| free_ram_mb | 29309 |
| memory_mb | 193149 |
| memory_mb_used | 163840 |
+----------------+--------+

$ os host show compute1
+----------+----------------------------------+-----+-----------+---------+
| Host | Project | CPU | Memory MB | Disk GB |
+----------+----------------------------------+-----+-----------+---------+
| compute1 | (total) | 0 | 193149 | 893 |
| compute1 | (used_now) | 72 | 163840 | 460 |
| compute1 | (used_max) | 72 | 147456 | 460 |
| compute1 | some_project_id_was_here | 2 | 4096 | 40 |
| compute1 | another_anonymized_id_here | 70 | 143360 | 420 |
+----------+----------------------------------+-----+-----------+---------+

$ os resource provider inventory list uuid_of_compute1_node
+----------------+------------------+----------+----------+----------+-----------+--------+
| resource_class | allocation_ratio | min_unit | max_unit | reserved | step_size | total |
+----------------+------------------+----------+----------+----------+-----------+--------+
| MEMORY_MB | 1.0 | 1 | 193149 | 16384 | 1 | 193149 |
| DISK_GB | 1.0 | 1 | 893 | 0 | 1 | 893 |
| PCPU | 1.0 | 1 | 72 | 0 | 1 | 72 |
+----------------+------------------+----------+----------+----------+-----------+--------+
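A quick back-of-the-envelope check with the numbers above shows where the "roughly 20G" figure in the description comes from (the 4 GiB housekeeping figure and the 140G of scheduled instances are taken from this report):

```python
GIB_KB = 1024 * 1024

mem_total_gib = 197784792 / GIB_KB   # MemTotal: ~188.6 GiB
hugepages_gib = 67108864 / GIB_KB    # Hugetlb: 64 GiB of 1 GiB hugepages
housekeeping_gib = 4                 # OS + OpenStack control plane (per the report)
scheduled_gib = 140                  # six instances using a non-hugepage flavor

# memory actually backed by normal 4k pages
small_page_gib = mem_total_gib - hugepages_gib

overcommit_gib = scheduled_gib + housekeeping_gib - small_page_gib
print(round(overcommit_gib))  # ~19 GiB, i.e. the "roughly 20G" overcommitment
```

Placement, meanwhile, only subtracts the 16384 MB `reserved` value from the 193149 MB total, so it never sees the 64 GiB of hugepages as unavailable.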

Steps to reproduce
==================

1. Reserve a large part of memory for hugepages on the hypervisor.
2. Create VMs using a flavor that uses a lot of memory that isn't backed by hugepages.
3. Start memory intensive operations on the VMs, e.g.:
stress-ng --vm-bytes $(awk '/MemAvailable/{printf "%d", $2 * 0.98;}' < /proc/meminfo)k --vm-keep -m 1
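For illustration, steps 1 and 2 might look like the following (the hugepage counts, flavor name, sizes, image and network names are all placeholders):

```shell
# Step 1: reserve 64 x 1 GiB hugepages at boot via the kernel command line:
#   default_hugepagesz=1G hugepagesz=1G hugepages=64

# Step 2: a flavor with a large RAM footprint and no hw:mem_page_size set,
# so its memory floats on normal 4k pages and placement does not
# distinguish it from the memory already reserved for hugepages
openstack flavor create --ram 24576 --vcpus 4 --disk 40 m1.nohuge
openstack server create --flavor m1.nohuge --image focal --network private vm1
```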

Expected result
===============

Nova should not allow overcommitment and should be able to differentiate between hugepages and "normal" memory.

Actual result
=============
Overcommitment resulting in OOM kills.

Environment
===========
nova-api-metadata 2:21.2.1-0ubuntu1~cloud0
nova-common 2:21.2.1-0ubuntu1~cloud0
nova-compute 2:21.2.1-0ubuntu1~cloud0
nova-compute-kvm 2:21.2.1-0ubuntu1~cloud0
nova-compute-libvirt 2:21.2.1-0ubuntu1~cloud0
python3-nova 2:21.2.1-0ubuntu1~cloud0
python3-novaclient 2:17.0.0-0ubuntu1~cloud0

OS: Ubuntu 18.04.5 LTS
Hypervisor: libvirt + KVM

Tags: sts
Revision history for this message
Giuseppe Petralia (peppepetra) wrote :

This can be reproduced on Focal/ussuri:

############ Computes:
$ os resource provider list
+--------------------------------------+-----------------------------------------------------+------------+--------------------------------------+----------------------+
| uuid | name | generation | root_provider_uuid | parent_provider_uuid |
+--------------------------------------+-----------------------------------------------------+------------+--------------------------------------+----------------------+
| ca3fa736-7e60-4365-9cc8-7afc78b53005 | juju-98fb61-zaza-d6f2c7825043-9.project.serverstack | 5 | ca3fa736-7e60-4365-9cc8-7afc78b53005 | None |
| 0605bd29-71d5-40ed-ab8f-eceeaaac59b5 | juju-98fb61-zaza-d6f2c7825043-8.project.serverstack | 4 | 0605bd29-71d5-40ed-ab8f-eceeaaac59b5 | None |
+--------------------------------------+-----------------------------------------------------+------------+--------------------------------------+----------------------+

############ Mem Allocation ratio is 1:
$ openstack resource provider inventory list ca3fa736-7e60-4365-9cc8-7afc78b53005
+----------------+------------------+----------+----------+----------+-----------+-------+-------+
| resource_class | allocation_ratio | min_unit | max_unit | reserved | step_size | total | used |
+----------------+------------------+----------+----------+----------+-----------+-------+-------+
| VCPU | 16.0 | 1 | 8 | 0 | 1 | 8 | 2 |
| MEMORY_MB | 1.0 | 1 | 16008 | 2048 | 1 | 16008 | 13960 |
| DISK_GB | 1.0 | 1 | 77 | 0 | 1 | 77 | 20 |
+----------------+------------------+----------+----------+----------+-----------+-------+-------+

$ openstack resource provider inventory list 0605bd29-71d5-40ed-ab8f-eceeaaac59b5
+----------------+------------------+----------+----------+----------+-----------+-------+------+
| resource_class | allocation_ratio | min_unit | max_unit | reserved | step_size | total | used |
+----------------+------------------+----------+----------+----------+-----------+-------+------+
| VCPU | 16.0 | 1 | 8 | 0 | 1 | 8 | 0 |
| MEMORY_MB | 1.0 | 1 | 16008 | 2048 | 1 | 16008 | 0 |
| DISK_GB | 1.0 | 1 | 77 | 0 | 1 | 77 | 0 |
+----------------+------------------+----------+----------+----------+-----------+-------+------+

######## Hugepages: 1000 * 2M
root@juju-98fb61-zaza-d6f2c7825043-9:~# cat /proc/meminfo | grep -i huge
AnonHugePages: 622592 kB
ShmemHugePages: 0 kB
FileHugePages: 0 kB
HugePages_Total: 1000
HugePages_Free: 1000
HugePages_Rsvd: 0
HugePages_Surp: 0
Hugepagesize: 2048 kB
Hugetlb: 2048000 kB

root@juju-98fb61-zaza-d6f2c7825043-9:~# free -mh
              total used free shared buff/cache available
Mem: 15Gi ...


affects: nova → ubuntu
affects: ubuntu → nova (Ubuntu)
James Page (james-page) wrote :

Discussed with the Nova team and this is a known issue at the moment - mixing instance types with and without NUMA configuration features such as hugepages will create this type of issue.

The placement API (which is used for scheduling) does not track different page sizes, so it can't deal with this scenario today.

Feedback indicated that using flavors with explicit configuration to use small pages might do the trick in terms of triggering the codepath through the NUMA cell configuration in Nova.

James Page (james-page) wrote :

Another suggestion was to limit the 'max_unit' value for hypervisors with this memory configuration to the total memory minus the hugepage-configured memory - this means that the maximum footprint for a single VM is limited.
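With the osc-placement plugin that suggestion would look roughly like this (the UUID is a placeholder, and a later comment in this thread reports the compute node reverting the change on its next update):

```shell
# cap the largest single allocation at total minus hugepage memory:
# 193149 MB total - 65536 MB of 1 GiB hugepages = 127613 MB
openstack resource provider inventory class set <uuid_of_compute1_node> MEMORY_MB \
  --total 193149 --reserved 16384 --max_unit 127613
```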

Giuseppe Petralia (peppepetra) wrote :

Even using a flavor with hw:mem_page_size='small' I am still able to request more memory than what is physically available.

Meanwhile, the update of max_unit gets reverted to the original value when the compute node refreshes its records, so it can't be used as a valid workaround.

Launchpad Janitor (janitor) wrote :

Status changed to 'Confirmed' because the bug affects multiple users.

Changed in nova (Ubuntu):
status: New → Confirmed
Alex Walender (awalende) wrote (last edit ):

I guess the only way would be to work with custom extra specs inside flavors/images, which can be quite a hassle and prone to human error, especially when forgetting to set them for new flavors. Otherwise I can't think of any way to control the scheduling for mixed memory-backend nodes.

+1 for making placement (and/or nova-scheduler) aware about the different memory backends of a compute-node.

Seyeong Kim (seyeongkim)
tags: added: sts
Seyeong Kim (seyeongkim)
affects: nova (Ubuntu) → nova
sean mooney (sean-k-mooney) wrote :

This is not a bug; it is user error.

when using hugepages, if you want to have non-hugepage guests on the same host, then you must use
hw:mem_page_size=small or hw:mem_page_size=4k for all non-hugepage guests

we do not support memory oversubscription when using hw:mem_page_size, and this also gives the guest 1 implicit NUMA node.

we intentionally do not support mixing NUMA and non-NUMA guests on the same host, which is what happens if you do not use hw:mem_page_size=small

when hw:mem_page_size is not set we do not do page-size/NUMA-node-aware scheduling.

the reason that you are having the current issue is that you are mixing NUMA and non-NUMA instances on the same host, which has never been supported in nova.

we may eventually support this in the distant future, but we have no plans to support this in zed and no one has proposed a way to support it upstream yet.

it is a very non-trivial feature and would require us to effectively make all instances NUMA instances.
we cannot support mixing floating instances and NUMA-affined instances on the same host today due to how we do NUMA affinity
and how that interacts with the kernel OOM reaper.
basically, the OOM reaper operates per NUMA node, not globally, so if the kernel needs memory on NUMA node 0, it will kill processes to free memory on that node even if there is free memory on other NUMA nodes.

that will often result in NUMA-affined non-hugepage guests being killed if a floating guest is spawned and triggers an OOM event.
that is not something we can allow to happen, as it's a multi-tenant issue, so we cannot support mixing NUMA and non-NUMA instances on the same host.

the workaround to use hugepage and non-hugepage guests on the same host is therefore to make all the guests have NUMA affinity by using hw:mem_page_size.
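That workaround amounts to setting the page-size extra spec on every flavor used on such hosts, e.g. (flavor names here are illustrative):

```shell
# non-hugepage guests: pin to small (4k) pages so NUMA-aware accounting applies
openstack flavor set m1.normal --property hw:mem_page_size=small
# hugepage guests: request large pages explicitly
openstack flavor set m1.hugepage --property hw:mem_page_size=large
```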

this is a well-known limitation and not a bug, so I'm closing this as Won't Fix.

Changed in nova:
status: Confirmed → Won't Fix
sean mooney (sean-k-mooney) wrote :

@Giuseppe Petralia

in your case, can you confirm that you have enabled the NUMA topology filter?
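Assuming a stock nova.conf, that filter is enabled via enabled_filters on the scheduler (a sketch; the exact filter list varies by deployment):

```ini
[filter_scheduler]
enabled_filters = AvailabilityZoneFilter,ComputeFilter,ComputeCapabilitiesFilter,ImagePropertiesFilter,NUMATopologyFilter
```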

while placement will allow you to schedule to the host, the NUMA topology filter
and host NUMA tracker should prevent the VM from being scheduled if you have used
hw:mem_page_size=small

that will look at the available 4k pages when scheduling the VM and only schedule to hosts where
enough can be claimed.

you also need to use the DEFAULT.reserved_huge_pages config option to reserve 4k pages for the host
https://docs.openstack.org/nova/latest/configuration/config.html#DEFAULT.reserved_huge_pages

e.g. this will reserve 2G of 4k pages for the os on both node 0 and node 1 (2 GiB / 4 KiB = 524288 pages)
[DEFAULT]
reserved_huge_pages = node:0,size:4,count:524288
reserved_huge_pages = node:1,size:4,count:524288

the rest of the 4k pages will be usable by vms
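As a sanity check on that reservation arithmetic (assuming size is given in KiB, so size:4 means 4 KiB pages), the count needed to reserve 2 GiB per NUMA node is:

```python
KIB = 1024

page_size_kib = 4   # size:4 -> 4 KiB pages
target_gib = 2      # 2 GiB of small pages reserved for the host, per node

pages_needed = target_gib * KIB * KIB // page_size_kib
print(pages_needed)  # 524288 -> the count value for a 2 GiB reservation
```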

sean mooney (sean-k-mooney) wrote :

by the way, we do not support NUMA in placement at this time, and we do not expect max_unit to be set to only the available 4k pages; that would prevent hugepage guests from being booted correctly

we have long term plans to track mempages in placement
https://specs.openstack.org/openstack/nova-specs/specs/victoria/approved/numa-topology-with-rps.html
the work was last considered in victoria ^ and we will likely revisit this in the AA or BB release,
but not in the zed cycle.

we chose to pause that work in victoria to focus on simpler-to-implement features that are more generally useful. once we complete the PCI devices in placement work we will likely revisit NUMA, but not before that is completed, given the current capacity of the nova community.

if new contributors want to re-propose that spec and start working on it we can review, but it's a lot of work, especially ensuring we minimise the upgrade impact, so it will likely take 1 to 2 upstream cycles to implement. even then we won't necessarily support mixing NUMA and non-NUMA instances, but we will be able to use placement to represent the maximum memory size properly, including setting max_unit per page size.
