Need NUMA aware RAM reservation to avoid OOM killing host processes

Bug #1844721 reported by Jing Zhang
Affects: OpenStack Compute (nova)
Status: Invalid
Importance: Undecided
Assigned to: Unassigned

Bug Description

Description:
===========

CPU pinning is widely used in VNFs. When a VM's CPUs are pinned, there is currently no way to reserve memory on NUMA node 0 for host processes:

> ram_allocation_ratio is ignored by the nova scheduler when VM CPUs are pinned.
> reserved_host_memory_mb is a global reservation; as long as memory is available globally (on any NUMA node), the VM is scheduled.

This leads to many VMs being scheduled on NUMA 0 (CPUs pinned to NUMA 0) while their memory needs are only met "globally".

When the system starts to take load, the VMs' memory starts to get allocated on NUMA 0 (because they are pinned to NUMA 0), to the extent that a memory shortage occurs on NUMA 0 and the OOM killer kicks in and kills host processes.

Many mitigations have been "invented", but they all come with some form of technical or operational "difficulty". One mitigation, for example, is to enable huge pages and back the VMs with huge pages.

The right solution is for nova to support NUMA-aware RAM reservation, as it already does for the huge pages case, e.g.

reserved_host_memory=node:0, 20G

Steps to reproduce
==================
Create CPU-pinned VMs. The VMs crowd onto NUMA 0 until no more CPU cores are available there, after which they are scheduled on NUMA 1. Then stress the system.

Expected result
===============

The system stays operational.

Actual result
=============
The OOM killer kicks in and kills host processes due to lack of memory on NUMA 0, while there is plenty of memory on NUMA 1.

Tags: numa
Revision history for this message
Matt Riedemann (mriedem) wrote :

Asking stephenfin and sean-k-mooney in IRC about this, they said it's a long-standing known issue that is hard to fix, and agreed that the workaround is to set hw:mem_page_size=small in the flavors that use CPU pinning. There might be duplicate bugs for this. Either way we should document the known limitation alongside the hw:cpu_policy flavor extra spec here:

https://docs.openstack.org/nova/latest/user/flavors.html
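A sketch of applying that workaround via flavor extra specs (the flavor name here is hypothetical):

```console
# Hypothetical flavor name; applies the suggested workaround.
$ openstack flavor set pinned.flavor \
    --property hw:cpu_policy=dedicated \
    --property hw:mem_page_size=small
```

With any hw:mem_page_size set, the scheduler accounts for the VM's memory against a specific NUMA node instead of the global pool.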

tags: added: numa
Revision history for this message
Matt Riedemann (mriedem) wrote :

OK this is a duplicate of bug 1792985 which is itself a duplicate of bug 1439247 so I'm going to duplicate against bug 1439247. We should still update the flavor extra spec docs with the limitation and known workaround.

Revision history for this message
Jing Zhang (jing.zhang.nokia) wrote :

Hi Matt,

Thanks for looking into this. I checked bug 1439247; this bug report is not a strict duplicate of it, since this report is not about mixing VMs that request small pages with VMs that don't on the same compute. That said, the suggested workaround of having all VMs set hw:mem_page_size=small is better than asking all VMs to use huge pages (less operational impact).

But bug 1439247 has been pending for 4+ years, and is preoccupied with NUMA-topology versus non-NUMA-topology guests, which is not the root cause of the issue.

Would it be simpler to address the issue under this ticket, where the issue is presented simply and clearly?

Jing

Revision history for this message
Jing Zhang (jing.zhang.nokia) wrote :

Below is a possible workaround using hw:mem_page_size=small:

(1) All VMs use hw:mem_page_size=small in the flavor, in addition to the existing hw:cpu_policy=dedicated.

This forces the nova scheduler to check both CPU and memory availability on NUMA 0

(2) A "place-holder" VM is created first; it is scheduled on NUMA 0 and its memory is reserved on NUMA 0.

This "place-holder" VM will not take any workload; its sole purpose is to keep that memory from being used by other VMs, hence leaving it available for host processes.
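A sketch of this workaround with the standard openstack CLI (the flavor, image, and network names are hypothetical, as is the 20 GiB size):

```console
# (1) Flavor that forces per-NUMA-node memory accounting.
$ openstack flavor create placeholder.flavor \
    --vcpus 2 --ram 20480 --disk 1 \
    --property hw:cpu_policy=dedicated \
    --property hw:mem_page_size=small

# (2) Boot the place-holder VM first so its RAM is claimed on NUMA 0.
$ openstack server create placeholder-vm \
    --flavor placeholder.flavor --image cirros --network mgmt
```

The place-holder only protects host memory as long as it is booted before the workload VMs fill NUMA 0.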

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (master)

Fix proposed to branch: master
Review: https://review.opendev.org/686079

Changed in nova:
assignee: nobody → Jing Zhang (jing.zhang.nokia)
status: New → In Progress
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on nova (master)

Change abandoned by Jing Zhang (<email address hidden>) on branch: master
Review: https://review.opendev.org/686079

Revision history for this message
Jing Zhang (jing.zhang.nokia) wrote :

Thanks to the bug 532168 fix, this issue can be addressed by setting reserved_huge_pages for small pages per NUMA node in nova.conf, i.e. with a configuration change only. Hence, closing this ticket.
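A hedged nova.conf sketch of that configuration (the node and the amount are illustrative; size is the page size in KiB and count is the number of pages):

```ini
[DEFAULT]
# Reserve ~20 GiB of 4 KiB small pages on NUMA node 0 for host processes:
# 20 GiB / 4 KiB = 5242880 pages. Values are illustrative.
reserved_huge_pages = node:0,size:4,count:5242880
```

Despite the option's name, it accepts small page sizes too, which is what makes this usable for plain RAM reservation.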

Changed in nova:
status: In Progress → New
status: New → Invalid
assignee: Jing Zhang (jing.zhang.nokia) → nobody