Overcommit allowed for pinned instances when using hugepages

Bug #1811886 reported by Stephen Finucane
Affects: OpenStack Compute (nova)
Status: Fix Released
Importance: Undecided
Assigned to: Stephen Finucane

Bug Description

When working on a fix for bug #1811870, it was noted that the check to ensure pinned instances do not overcommit was not pagesize aware. This means that if an instance without hugepages boots on a host with a large number of hugepages allocated, it may not get all of the memory allocated to it. The solution seems to be to make the check pagesize aware. Test cases demonstrating this are provided below.
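
As a rough illustration (a sketch with hypothetical names, not the actual nova check), "pagesize aware" here means the fit test must compare the request against free pages of the size that will actually back the guest, rather than against raw cell memory:

    # Sketch only; helper names are hypothetical, not nova's.
    def fits_pagesize_aware(requested_kib, free_pages, page_size_kib):
        # Compare against the free pages of the size backing the guest.
        return requested_kib <= free_pages * page_size_kib

    def fits_pagesize_unaware(requested_kib, cell_memory_kib):
        # Compares against cell memory regardless of how it is backed, so
        # memory reserved as hugepages is wrongly counted for a 4k guest.
        return requested_kib <= cell_memory_kib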

---

# Host information

The memory capacity (and some other stuff) for our node:

    $ virsh capabilities | xmllint --xpath '/capabilities/host/topology/cells' -
    <cells num="2">
      <cell id="0">
        <memory unit="KiB">16298528</memory>
        <pages unit="KiB" size="4">3075208</pages>
        <pages unit="KiB" size="2048">4000</pages>
        <pages unit="KiB" size="1048576">0</pages>
        ...
      </cell>
      <cell id="1">
        <memory unit="KiB">16512884</memory>
        <pages unit="KiB" size="4">3128797</pages>
        <pages unit="KiB" size="2048">4000</pages>
        <pages unit="KiB" size="1048576">0</pages>
        ...
      </cell>
    </cells>

Clearly there are not 3075208 and 3128797 4k pages on NUMA nodes 0 and 1,
respectively, since, for NUMA node 0, (3075208 * 4) + (4000 * 2048) != 16298528.
We use [1] to resolve this. With that fix, we instead have 16298528 - (4000 * 2048) = 8106528 KiB
of small-page memory (or ~7.73 GiB) for NUMA cell 0 and something similar for cell 1.
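
Worked out in Python, using the numbers from the capabilities output above:

    # Usable small-page (4k) memory per cell = reported cell memory minus
    # the memory reserved as 2 MiB hugepages (4000 pages of 2048 KiB each).
    cell0_kib = 16298528 - 4000 * 2048   # 8106528 KiB, ~7.73 GiB
    cell1_kib = 16512884 - 4000 * 2048   # 8320884 KiB, ~7.94 GiB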

To make things easier, cell 1 is totally disabled by adding the following to 'nova-cpu.conf':

    [DEFAULT]
    vcpu_pin_set = 0-5,12-17

[1] https://review.openstack.org/631038

For all test cases, I create the flavor and then try to create two servers using it.

# Test A, unpinned, implicit small pages, oversubscribed

This should work because we're not using a specific page size.

    $ openstack flavor create --vcpu 2 --disk 0 --ram 7168 test.numa
    $ openstack flavor set test.numa --property hw:numa_nodes=1

    $ openstack server create --flavor test.numa --image cirros-0.3.6-x86_64-disk --wait test1
    $ openstack server create --flavor test.numa --image cirros-0.3.6-x86_64-disk --wait test2

Expect: SUCCESS
Actual: SUCCESS

# Test B, unpinned, explicit small pages, oversubscribed

This should fail because we are requesting a specific page size, even though that size is small pages (4k).

    $ openstack flavor create --vcpu 2 --disk 0 --ram 7168 test.numa
    $ openstack flavor set test.numa --property hw:numa_nodes=1
    $ openstack flavor set test.numa --property hw:mem_page_size=small

    $ openstack server create --flavor test.numa --image cirros-0.3.6-x86_64-disk --wait test1
    $ openstack server create --flavor test.numa --image cirros-0.3.6-x86_64-disk --wait test2

Expect: FAILURE
Actual: FAILURE

# Test C, pinned, implicit small pages, oversubscribed

This should fail because we don't allow oversubscription with CPU pinning.

    $ openstack flavor create --vcpu 2 --disk 0 --ram 7168 test.pinned
    $ openstack flavor set test.pinned --property hw:cpu_policy=dedicated

    $ openstack server create --flavor test.pinned --image cirros-0.3.6-x86_64-disk --wait test1
    $ openstack server create --flavor test.pinned --image cirros-0.3.6-x86_64-disk --wait test2

Expect: FAILURE
Actual: SUCCESS

Interestingly, this fails on the third VM. This is likely because the total
memory for that cell, 16298528 KiB, is sufficient to handle two instances
but not three.
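
The numbers behind that observation (each instance asks for 7168 MiB):

    instance_kib = 7168 * 1024           # 7340032 KiB per instance
    cell0_total_kib = 16298528           # total (not free) memory on cell 0
    two_fit = 2 * instance_kib <= cell0_total_kib    # True
    three_fit = 3 * instance_kib <= cell0_total_kib  # False -> third VM fails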

# Test D, pinned, explicit small pages, oversubscribed

This should fail because we don't allow oversubscription with CPU pinning.

    $ openstack flavor create --vcpu 2 --disk 0 --ram 7168 test.pinned
    $ openstack flavor set test.pinned --property hw:cpu_policy=dedicated
    $ openstack flavor set test.pinned --property hw:mem_page_size=small

    $ openstack server create --flavor test.pinned --image cirros-0.3.6-x86_64-disk --wait test1
    $ openstack server create --flavor test.pinned --image cirros-0.3.6-x86_64-disk --wait test2

Expect: FAILURE
Actual: FAILURE

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (master)

Fix proposed to branch: master
Review: https://review.openstack.org/631053

Changed in nova:
assignee: nobody → Stephen Finucane (stephenfinucane)
status: New → In Progress
Revision history for this message
Alex Xu (xuhj) wrote :

@Stephen, you test those without the patch https://review.openstack.org/#/c/629281/, right?

Revision history for this message
Tetsuro Nakamura (tetsuro0907) wrote :

> # Test C, pinned, implicit small pages, oversubscribed
> This should fail because we don't allow oversubscription with CPU pinning.

Why is this? I know that the CPU pinning request implicitly means having a NUMA topology, but I don't know why it affects memory allocation. Is this addressed somewhere in the documentation?

Revision history for this message
Stephen Finucane (stephenfinucane) wrote :

> @Stephen, you test those without the patch https://review.openstack.org/#/c/629281/, right?

Correct. This was using master on the day I filed the bug.

Revision history for this message
Artom Lifshitz (notartom) wrote :

> # Test C, pinned, implicit small pages, oversubscribed
>
> This should fail because we don't allow oversubscription with CPU pinning.
>
> $ openstack flavor create --vcpu 2 --disk 0 --ram 7168 test.pinned
> $ openstack flavor set test.pinned --property hw:cpu_policy=dedicated
>
> $ openstack server create --flavor test.pinned --image cirros-0.3.6-x86_64-disk --wait test1
> $ openstack server create --flavor test.pinned --image cirros-0.3.6-x86_64-disk --wait test2
>
> Expect: FAILURE
> Actual: SUCCESS

I have trouble grokking why we expect this to fail. We can't oversubscribe CPUs with a dedicated policy, but why can't memory be oversubscribed, at least with the implicit small page size?

Revision history for this message
Artom Lifshitz (notartom) wrote :

Continuing from comment #5 - with an explicit instance page size, we don't allow oversubscription, so is that why we don't want oversubscription with an implicit page size either?

If we look at cpu_policy, we only disallow oversubscription if it's explicitly requested with `dedicated`, but if the instance has a NUMA topology and no CPU pinning, we allow CPU oversubscription, right?

So wouldn't the same thing make sense with pages? If you explicitly request a page size, you're not oversubscribed, but if you get one implicitly, you might get oversubscribed?

Revision history for this message
Stephen Finucane (stephenfinucane) wrote :

I'm not saying it's a good thing, but this is what we do already if you forget about page sizes. See [1]. We're simply not considering hugepages when we should be.

[1] https://github.com/openstack/nova/blob/20.1.0/nova/virt/hardware.py#L1016-L1024
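
Roughly speaking, and only as a paraphrase with made-up names (not the code at [1]), the pinned path compares the requested memory against the cell's memory without ever asking which page sizes back that memory:

    # Paraphrase only; names are hypothetical, not nova's.
    def pinned_cell_fits(instance_mem_kib, host_cell_mem_kib):
        # No page-size information is consulted, so memory backed by
        # hugepages the guest will never touch still counts as available.
        return instance_mem_kib <= host_cell_mem_kib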

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to nova (master)

Reviewed: https://review.opendev.org/c/openstack/nova/+/631053
Committed: https://opendev.org/openstack/nova/commit/f14c16af82ecc44cff07be8589fb020bd6625af2
Submitter: "Zuul (22348)"
Branch: master

commit f14c16af82ecc44cff07be8589fb020bd6625af2
Author: Stephen Finucane <email address hidden>
Date: Tue Sep 15 15:58:40 2020 +0100

    Make overcommit check for pinned instance pagesize aware

    When working on a fix for bug #1811870, it was noted that the check to
    ensure pinned instances do not overcommit was not pagesize aware. This
    means if an instance without hugepages boots on a host with a large
    number of hugepages allocated, it may not get all of the memory
    allocated to it. Put in concrete terms, consider a host with 1 NUMA
    cell, 2 CPUs, 1G of 4k pages, and a single 1G page. If you boot a first
    instance with 1 CPU, CPU pinning, 1G of RAM, and no specific page size,
    the instance should boot successfully. An attempt to boot a second
    instance with the same configuration should fail because there is only
    the single 1G page available, however, this is not currently the case.
    The reason this happens is because we currently have two tests: a first
    that checks total (not free!) host pages and a second that checks free
    memory but with no consideration for page size. The first check passes
    because we have 1G worth of 4K pages configured and the second check
    passes because we have the single 1G page.

    Close this gap.

    Change-Id: I74861a67827dda1ab2b8451967f5cf0ae93a4ad3
    Signed-off-by: Stephen Finucane <email address hidden>
    Closes-Bug: #1811886
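
For reference, the scenario described in the commit message can be worked through numerically (a sketch of the two checks it mentions, not the actual code):

    # One cell, 1G of 4k pages, one free 1G hugepage; two pinned 1G guests
    # with no explicit page size (values from the commit message above).
    guest_kib = 1048576
    total_4k_kib = 1048576
    free_kib = 2097152 - guest_kib         # after guest one: the 1G hugepage
    check_one = guest_kib <= total_4k_kib  # total (not free!) pages: passes
    check_two = guest_kib <= free_kib      # free memory, page size ignored: passes
    # Both checks pass, so the second guest is incorrectly accepted even
    # though no 4k pages remain to back it.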

Changed in nova:
status: In Progress → Fix Released
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/nova 30.0.0.0rc1

This issue was fixed in the openstack/nova 30.0.0.0rc1 release candidate.
