Creating a VM may fail with a large-page VM and an ordinary VM on the same NUMA node

Bug #1428551 reported by zhangtralon
Affects: OpenStack Compute (nova)
Status: Invalid
Importance: Medium
Assigned to: Unassigned

Bug Description

Creating an ordinary VM may fail when a huge-page VM and an ordinary VM share the same NUMA node.

The following scenario reproduces the problem (a sketch of the accounting gap follows the list):
1. Assume a host with two NUMA nodes is used to create VMs, and the memory of each NUMA node consists of 5 GB of huge pages and 5 GB of ordinary (small) pages.

2. Create a huge-page VM that uses 3 GB of huge-page memory on host NUMA node 0. The usable memory of host NUMA node 0 now consists of 2 GB of huge pages and 5 GB of ordinary pages.

3. Create an ordinary NUMA VM with 6 GB of memory. The NUMATopologyFilter may select host NUMA node 0; if it does, libvirt reports an OOM error, because only 5 GB of ordinary-page memory is actually available there.
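
To make the accounting gap concrete, here is a minimal Python sketch (illustrative only, not Nova code; the NumaCell class and the two check functions are assumptions for this example). It contrasts a check against the node's total free memory with a check against its small-page memory only:

# Minimal sketch of the accounting gap described above (not Nova code).
from dataclasses import dataclass

GB = 1024  # amounts in MiB, as Nova tracks memory


@dataclass
class NumaCell:
    memory_total: int      # all memory on the node (small + huge pages)
    hugepages_total: int   # memory reserved as huge pages at boot
    hugepages_used: int    # huge-page memory claimed by instances
    small_used: int        # small-page memory claimed by instances

    @property
    def memory_used(self) -> int:
        return self.hugepages_used + self.small_used


def fits_naive(cell: NumaCell, request: int) -> bool:
    """Roughly what the filter does today: compare the request against
    total memory minus claimed memory, regardless of page size."""
    return request <= cell.memory_total - cell.memory_used


def fits_page_aware(cell: NumaCell, request: int) -> bool:
    """What a small-page instance actually needs: enough small-page
    memory, since it can never be backed by the reserved huge pages."""
    small_total = cell.memory_total - cell.hugepages_total
    return request <= small_total - cell.small_used


# Node 0 from the steps above: 5 GB huge pages + 5 GB small pages,
# with a 3 GB huge-page instance already running.
node0 = NumaCell(memory_total=10 * GB, hugepages_total=5 * GB,
                 hugepages_used=3 * GB, small_used=0)

print(fits_naive(node0, 6 * GB))       # True  -> node 0 gets selected
print(fits_page_aware(node0, 6 * GB))  # False -> libvirt would hit OOM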

Tags: libvirt numa
description: updated
Sean Dague (sdague)
tags: added: numa
Changed in nova:
status: New → Confirmed
importance: Undecided → Medium
tags: added: libvirt
Changed in nova:
assignee: nobody → zhangtralon (zhangchunlong1)
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (master)

Fix proposed to branch: master
Review: https://review.openstack.org/167917

Changed in nova:
status: Confirmed → In Progress
wangxiyuan (wangxiyuan)
Changed in nova:
assignee: zhangtralon (zhangchunlong1) → wangxiyuan (wangxiyuan)
wangxiyuan (wangxiyuan)
Changed in nova:
assignee: wangxiyuan (wangxiyuan) → nobody
Changed in nova:
assignee: nobody → Zhenyu Zheng (zhengzhenyu)
Revision history for this message
OpenStack Infra (hudson-openstack) wrote :

Fix proposed to branch: master
Review: https://review.openstack.org/202504

Changed in nova:
assignee: Zhenyu Zheng (zhengzhenyu) → sahid (sahid-ferdjaoui)
Revision history for this message
Nikola Đipanov (ndipanov) wrote :

Copying my comment here from one of the proposed patches, as it describes what I think is the best course of action here.

"
It seems to me that a much easier fix would be to change how we report memory back to the scheduler.

We would just make sure that the available memory reported for non-large-page instances does not include memory reserved as large pages.

It may be best to do this not in the libvirt driver but in the resource tracker, so that if any other driver implements huge page support it gets this behaviour for free.

Alternatively, we could add several more fields to the compute node (memory_huge_pages, memory_total, memory_small), which would be explicit, and then change filters/claims/tracking to update these accordingly.
"

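As a rough illustration of the second alternative in the quoted comment, the sketch below uses hypothetical field and function names (not the actual Nova data model) to show how explicit per-pool fields on the compute node would let a claim draw from the right pool for each instance, with only the small-page pool ever overcommitted:

# Hypothetical compute-node memory fields and a per-pool claim; a sketch,
# not the real Nova resource tracker.
from dataclasses import dataclass


@dataclass
class ComputeNodeMemory:
    memory_total: int         # all host memory (MiB)
    memory_huge_pages: int    # memory reserved as huge pages (MiB)
    huge_pages_used: int = 0
    memory_small_used: int = 0

    @property
    def memory_small(self) -> int:
        # small-page memory is whatever is not reserved for huge pages
        return self.memory_total - self.memory_huge_pages


def claim(node: ComputeNodeMemory, request: int, wants_huge_pages: bool,
          ram_allocation_ratio: float = 1.0) -> bool:
    """Claim memory from the pool matching the requested page size.
    Only the small-page pool is ever overcommitted."""
    if wants_huge_pages:
        if request > node.memory_huge_pages - node.huge_pages_used:
            return False
        node.huge_pages_used += request
    else:
        limit = node.memory_small * ram_allocation_ratio
        if request > limit - node.memory_small_used:
            return False
        node.memory_small_used += request
    return True


# The host node from the bug description: 5 GB huge pages, 5 GB small pages.
node = ComputeNodeMemory(memory_total=10 * 1024, memory_huge_pages=5 * 1024)
print(claim(node, 3 * 1024, wants_huge_pages=True))    # True
print(claim(node, 6 * 1024, wants_huge_pages=False))   # False: only 5 GB small
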
Revision history for this message
Nikola Đipanov (ndipanov) wrote :

Currently, memory overcommit is expressed as a ratio by which the real amount of available memory is multiplied, and it is applied against the total amount of memory on the host (or each NUMA cell). Huge pages are never overcommitted, but memory reserved for huge pages still counts towards the overcommit total, which is what causes this bug.
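
As a concrete example of that interaction (the ram_allocation_ratio value of 1.5 and the variable names are assumptions for illustration, not values taken from this report):

# Worked example: the overcommit limit today vs. one that excludes huge pages.
ram_allocation_ratio = 1.5
cell_total_mb = 10 * 1024          # 5 GB huge pages + 5 GB small pages
hugepages_reserved_mb = 5 * 1024

# Today the ratio is applied to the whole cell, huge pages included:
limit_today = cell_total_mb * ram_allocation_ratio                # 15 GB
# Only the small-page portion can actually be overcommitted:
limit_small_only = (cell_total_mb - hugepages_reserved_mb) * ram_allocation_ratio  # 7.5 GB
print(limit_today, limit_small_only)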

If we look back at the original design document for this code, this is (almost) by design. We envisioned that hosts pre-configured with huge pages would also be separated into a different host aggregate, and that HP-enabled flavors would be marked to go only to a certain aggregate [1]. While not ideal, this limitation allowed us to develop the feature without impacting the code that schedules instances with non-dedicated CPU/memory pages.

So we could think about fixing this bug as actually lifting the limitation described above. That will likely require changing the way we report resources, meaning changes to the data model. Fixing this for huge pages only might be possible without any data model changes, since we already have all the information needed to deduce how much non-HP memory is actually available and used, and to count oversubscription against that chunk rather than against all of the host's memory, which can include dedicated huge pages.

Ultimately, however, we want to remove the limitation for CPU pinning as well and make it possible to drive this through the API, which will definitely require a blueprint.

Since the use case that fixing this bug would enable (mixing instances with and without HP backing on the same compute host, without any support for CPU pinning) is not a critical defect but more of a nice-to-have, it might be better not to add workarounds, and instead to make sure that 1) we are clear in our docs about the limitations of the current huge page support for instances in Liberty and earlier releases, and 2) we design and propose further work to lift the limitation of having to keep a separate aggregate for instances with dedicated resources, and allow the separation of resources on hosts to be handled through the API.

[1] http://specs.openstack.org/openstack/nova-specs/specs/kilo/implemented/virt-driver-large-pages.html#other-deployer-impact

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on nova (master)

Change abandoned by Michael Still (<email address hidden>) on branch: master
Review: https://review.openstack.org/202504
Reason: This patch has been stalled for quite a while, so I am going to abandon it to keep the code review queue sane. Please restore the change when it is ready for review.

Changed in nova:
assignee: sahid (sahid-ferdjaoui) → nobody
status: In Progress → Confirmed
Revision history for this message
OpenStack Infra (hudson-openstack) wrote :

Change abandoned by Michael Still (<email address hidden>) on branch: master
Review: https://review.openstack.org/167917
Reason: This patch has been idle for a long time, so I am abandoning it to keep the review queue sane. If you're still interested in working on this patch, please restore it and upload a new patch set.

Revision history for this message
Daniel Berrange (berrange) wrote :

Mixing guests with huge pages and non-huge pages on the same host opens up a huge can of worms, adding complexity to Nova and resulting in a much less reliable system overall. With our current requirement that guests using huge pages must run on hosts dedicated to huge pages, we can set up hosts such that nearly all RAM is allocated upfront to huge pages, leaving just a little spare for non-guest RAM allocations. To allow effective mixing of huge-page and non-huge-page guests on the same host, the host would need to dynamically switch RAM between huge-page and non-huge-page use. The ability to reconfigure host RAM from small pages to huge pages becomes increasingly problematic over time as RAM becomes fragmented, to the point where you can have many GB of free small pages but be unable to turn them into huge pages. As such, it is far preferable to stick with the model in which hosts are dedicated to huge-page guests only and huge pages are allocated upfront when the host is provisioned.
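
For illustration, the sketch below (assuming Linux with 2 MiB huge pages and root privileges; not part of Nova) shows the usual symptom of that fragmentation: after asking the kernel for more huge pages at runtime, the number it actually manages to allocate can fall short even when plenty of small-page memory is nominally free:

# Sketch: try to grow the 2 MiB huge page pool at runtime and report the
# shortfall. Requires root; the sysfs path exists on hosts with 2 MiB pages.
HP_DIR = "/sys/kernel/mm/hugepages/hugepages-2048kB"


def request_hugepages(count: int) -> int:
    """Ask the kernel for `count` 2 MiB huge pages and return how many it
    could actually reserve once fragmentation is taken into account."""
    with open(f"{HP_DIR}/nr_hugepages", "w") as f:
        f.write(str(count))
    with open(f"{HP_DIR}/nr_hugepages") as f:
        return int(f.read())


requested = 2560  # 5 GB worth of 2 MiB pages
allocated = request_hugepages(requested)
if allocated < requested:
    print(f"only {allocated}/{requested} huge pages allocated; "
          "free small pages are too fragmented to coalesce")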

Changed in nova:
status: Confirmed → Invalid
Revision history for this message
Daniel Berrange (berrange) wrote :

Marked INVALID because this is *not* a bug. It is intended behaviour, so any change would require a blueprint + spec as a feature request. That said, any such feature request will likely be rejected for the reasons explained above.
