Creating a VM may fail with a large-page VM and an ordinary VM on the same NUMA node

Bug #1428551 reported by zhangtralon
Affects: OpenStack Compute (nova)
Status: Invalid
Importance: Medium
Assigned to: Unassigned

Bug Description

Creating an ordinary VM may fail when a huge-page VM and an ordinary VM share the same NUMA node.

The following scenario reproduces the problem (a sketch of the accounting gap follows the list):
1. Assume a host with two NUMA nodes is used to create VMs, and the memory of each NUMA node consists of 5 GB of huge pages and 5 GB of ordinary (small) pages.

2. Create a huge-page VM that uses 3 GB of huge-page memory on host NUMA node 0. The usable memory of host NUMA node 0 now consists of 2 GB of huge pages and 5 GB of ordinary pages.

3. Create an ordinary NUMA VM with 6 GB of memory. The NUMATopologyFilter may select host NUMA node 0; if it does, libvirt reports an OOM error, because only 5 GB of ordinary-page memory is actually available there.
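
To make the accounting gap concrete, here is a minimal Python sketch (illustrative only, not Nova code; the NumaCell class and the two check functions are assumptions for this example). It contrasts a check against the node's total free memory with a check against its small-page memory only:

# Minimal sketch of the accounting gap described above (not Nova code).
from dataclasses import dataclass

GB = 1024  # amounts in MiB, as Nova tracks memory


@dataclass
class NumaCell:
    memory_total: int      # all memory on the node (small + huge pages)
    hugepages_total: int   # memory reserved as huge pages at boot
    hugepages_used: int    # huge-page memory claimed by instances
    small_used: int        # small-page memory claimed by instances

    @property
    def memory_used(self) -> int:
        return self.hugepages_used + self.small_used


def fits_naive(cell: NumaCell, request: int) -> bool:
    """Roughly what the filter does today: compare the request against
    total memory minus claimed memory, regardless of page size."""
    return request <= cell.memory_total - cell.memory_used


def fits_page_aware(cell: NumaCell, request: int) -> bool:
    """What a small-page instance actually needs: enough small-page
    memory, since it can never be backed by the reserved huge pages."""
    small_total = cell.memory_total - cell.hugepages_total
    return request <= small_total - cell.small_used


# Node 0 from the steps above: 5 GB huge pages + 5 GB small pages,
# with a 3 GB huge-page instance already running.
node0 = NumaCell(memory_total=10 * GB, hugepages_total=5 * GB,
                 hugepages_used=3 * GB, small_used=0)

print(fits_naive(node0, 6 * GB))       # True  -> node 0 gets selected
print(fits_page_aware(node0, 6 * GB))  # False -> libvirt would hit OOM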

Tags: libvirt numa
description: updated
Sean Dague (sdague)
tags: added: numa
Changed in nova:
status: New → Confirmed
importance: Undecided → Medium
tags: added: libvirt
Changed in nova:
assignee: nobody → zhangtralon (zhangchunlong1)
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (master)

Fix proposed to branch: master
Review: https://review.openstack.org/167917

Changed in nova:
status: Confirmed → In Progress
wangxiyuan (wangxiyuan)
Changed in nova:
assignee: zhangtralon (zhangchunlong1) → wangxiyuan (wangxiyuan)
wangxiyuan (wangxiyuan)
Changed in nova:
assignee: wangxiyuan (wangxiyuan) → nobody
Changed in nova:
assignee: nobody → Zhenyu Zheng (zhengzhenyu)
Revision history for this message
OpenStack Infra (hudson-openstack) wrote :

Fix proposed to branch: master
Review: https://review.openstack.org/202504

Changed in nova:
assignee: Zhenyu Zheng (zhengzhenyu) → sahid (sahid-ferdjaoui)
Revision history for this message
Nikola Đipanov (ndipanov) wrote :

Copying my comment here from one of the proposed patches, as it describes what I think is the best course of action here.

"
It seems to me that a much easier fix would be to change how we report memory back to the scheduler.

We would just make sure that the available memory reported for non-large-page instances does not include memory reserved as large pages.

It may be best to do this not in the libvirt driver but in the resource tracker, so that if any other driver implements huge page support it gets this behaviour for free.

Alternatively, we could add several more fields to the compute node (memory_huge_pages, memory_total, memory_small), which would be explicit, and then change filters/claims/tracking to update these accordingly.
"

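As a rough illustration of the second alternative in the quoted comment, the sketch below uses hypothetical field and function names (not the actual Nova data model) to show how explicit per-pool fields on the compute node would let a claim draw from the right pool for each instance, with only the small-page pool ever overcommitted:

# Hypothetical compute-node memory fields and a per-pool claim; a sketch,
# not the real Nova resource tracker.
from dataclasses import dataclass


@dataclass
class ComputeNodeMemory:
    memory_total: int         # all host memory (MiB)
    memory_huge_pages: int    # memory reserved as huge pages (MiB)
    huge_pages_used: int = 0
    memory_small_used: int = 0

    @property
    def memory_small(self) -> int:
        # small-page memory is whatever is not reserved for huge pages
        return self.memory_total - self.memory_huge_pages


def claim(node: ComputeNodeMemory, request: int, wants_huge_pages: bool,
          ram_allocation_ratio: float = 1.0) -> bool:
    """Claim memory from the pool matching the requested page size.
    Only the small-page pool is ever overcommitted."""
    if wants_huge_pages:
        if request > node.memory_huge_pages - node.huge_pages_used:
            return False
        node.huge_pages_used += request
    else:
        limit = node.memory_small * ram_allocation_ratio
        if request > limit - node.memory_small_used:
            return False
        node.memory_small_used += request
    return True


# The host node from the bug description: 5 GB huge pages, 5 GB small pages.
node = ComputeNodeMemory(memory_total=10 * 1024, memory_huge_pages=5 * 1024)
print(claim(node, 3 * 1024, wants_huge_pages=True))    # True
print(claim(node, 6 * 1024, wants_huge_pages=False))   # False: only 5 GB small
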
Revision history for this message
Nikola Đipanov (ndipanov) wrote :

Currently, memory overcommit is expressed as a ratio by which the real amount of available memory is multiplied, and it is applied against the total amount of memory on the host (or each NUMA cell). Huge pages are never overcommitted, but memory reserved for huge pages still counts towards the overcommit total, which is what causes this bug.
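
As a concrete example of that interaction (the ram_allocation_ratio value of 1.5 and the variable names are assumptions for illustration, not values taken from this report):

# Worked example: the overcommit limit today vs. one that excludes huge pages.
ram_allocation_ratio = 1.5
cell_total_mb = 10 * 1024          # 5 GB huge pages + 5 GB small pages
hugepages_reserved_mb = 5 * 1024

# Today the ratio is applied to the whole cell, huge pages included:
limit_today = cell_total_mb * ram_allocation_ratio                # 15 GB
# Only the small-page portion can actually be overcommitted:
limit_small_only = (cell_total_mb - hugepages_reserved_mb) * ram_allocation_ratio  # 7.5 GB
print(limit_today, limit_small_only)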

If we look back at the original design document for this code, this is (almost) by design. We envisioned that hosts pre-configured with huge pages would also be separated into a different host aggregate, and that HP-enabled flavors would be marked to go only to a certain aggregate [1]. While not ideal, this limitation allowed us to develop the feature without impacting the code that schedules instances with non-dedicated CPU/memory pages.

So we could think about fixing this bug as actually lifting the limitation described above. That will likely require changing the way we report resources, meaning changes to the data model. Fixing this for huge pages only might be possible without any data model changes, since we already have all the information needed to deduce how much non-HP memory is actually available and used, and to count oversubscription against that chunk rather than against all of the host's memory, which can include dedicated huge pages.

Ultimately, however, we want to remove the limitation for CPU pinning as well and make it possible to drive this through the API, which will definitely require a blueprint.

Since the use case that fixing this bug would enable (mixing instances with and without HP backing on the same compute host, without any support for CPU pinning) is not a critical defect but more of a nice-to-have, it might be better not to add workarounds, and instead to make sure that 1) we are clear in our docs about the limitations of the current huge page support for instances in Liberty and earlier releases, and 2) we design and propose further work to lift the limitation of having to keep a separate aggregate for instances with dedicated resources, and allow the separation of resources on hosts to be handled through the API.

[1] http://specs.openstack.org/openstack/nova-specs/specs/kilo/implemented/virt-driver-large-pages.html#other-deployer-impact

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on nova (master)

Change abandoned by Michael Still (<email address hidden>) on branch: master
Review: https://review.openstack.org/202504
Reason: This patch has been stalled for quite a while, so I am going to abandon it to keep the code review queue sane. Please restore the change when it is ready for review.

Changed in nova:
assignee: sahid (sahid-ferdjaoui) → nobody
status: In Progress → Confirmed
Revision history for this message
OpenStack Infra (hudson-openstack) wrote :

Change abandoned by Michael Still (<email address hidden>) on branch: master
Review: https://review.openstack.org/167917
Reason: This patch has been idle for a long time, so I am abandoning it to keep the review queue sane. If you're still interested in working on this patch, please restore it and upload a new patch set.

Revision history for this message
Daniel Berrange (berrange) wrote :

Mixing guests with huge pages and non-huge pages on the same host opens up a huge can of worms, adding complexity to Nova and resulting in a much less reliable system overall. With our current requirement that guests using huge pages must run on hosts dedicated to huge pages, we can set up hosts such that nearly all RAM is allocated upfront to huge pages, leaving just a little spare for non-guest RAM allocations. To allow effective mixing of huge-page and non-huge-page guests on the same host, the host would need to dynamically switch RAM between huge-page and non-huge-page use. The ability to reconfigure host RAM from small pages to huge pages becomes increasingly problematic over time as RAM becomes fragmented, to the point where you can have many GB of free small pages but be unable to turn them into huge pages. As such, it is far preferable to stick with the model in which hosts are dedicated to huge-page guests only and huge pages are allocated upfront when the host is provisioned.
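
For illustration, the sketch below (assuming Linux with 2 MiB huge pages and root privileges; not part of Nova) shows the usual symptom of that fragmentation: after asking the kernel for more huge pages at runtime, the number it actually manages to allocate can fall short even when plenty of small-page memory is nominally free:

# Sketch: try to grow the 2 MiB huge page pool at runtime and report the
# shortfall. Requires root; the sysfs path exists on hosts with 2 MiB pages.
HP_DIR = "/sys/kernel/mm/hugepages/hugepages-2048kB"


def request_hugepages(count: int) -> int:
    """Ask the kernel for `count` 2 MiB huge pages and return how many it
    could actually reserve once fragmentation is taken into account."""
    with open(f"{HP_DIR}/nr_hugepages", "w") as f:
        f.write(str(count))
    with open(f"{HP_DIR}/nr_hugepages") as f:
        return int(f.read())


requested = 2560  # 5 GB worth of 2 MiB pages
allocated = request_hugepages(requested)
if allocated < requested:
    print(f"only {allocated}/{requested} huge pages allocated; "
          "free small pages are too fragmented to coalesce")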

Changed in nova:
status: Confirmed → Invalid
Revision history for this message
Daniel Berrange (berrange) wrote :

Marked INVALID because this is *not* a bug. It is intended behaviour, so any change would require a blueprint + spec as a feature request. That said, any such feature request will likely be rejected for the reasons explained above.
