Resize instance will not change the NUMA topology of a running instance to the one from the new flavor

Bug #1370390 reported by Nikola Đipanov
This bug affects 11 people
Affects: OpenStack Compute (nova)
Status: Invalid
Importance: Medium
Assigned to: Stephen Finucane

Bug Description

When we resize (change the flavor of) an instance that has a NUMA topology defined, the NUMA info from the new flavor will not be considered during scheduling. The instance will get re-scheduled based on the old NUMA information, but the claiming on the host will use the new flavor data. Once the instance successfully lands on a host, we will still use the old data when provisioning it on the new host.

We should be considering only the new flavor information in resizes.
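
A minimal reproduction sketch (the flavor names, IDs and hw:numa_nodes extra specs below are illustrative, not taken from the original report):

    $ openstack flavor create numa.one --id 200 --ram 1024 --disk 1 --vcpus 2
    $ openstack flavor set numa.one --property "hw:numa_nodes=1"
    $ openstack flavor create numa.two --id 201 --ram 2048 --disk 1 --vcpus 4
    $ openstack flavor set numa.two --property "hw:numa_nodes=2"
    $ openstack server create --flavor numa.one --image cirros-0.3.4-x86_64-uec --wait test1
    $ openstack server resize --flavor numa.two test1

With the bug present, scheduling and provisioning would still be driven by the single-node topology from numa.one rather than the two-node topology requested by numa.two.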

Sean Dague (sdague)
Changed in nova:
status: New → Confirmed
importance: High → Medium
Tiago Mello (timello)
Changed in nova:
assignee: nobody → Tiago Rodrigues de Mello (timello)
Revision history for this message
Bart Wensley (bartwensley) wrote :

This bug essentially means that resize is not usable for any instances that have a NUMA topology. Is anyone working on this?

Revision history for this message
Nikola Đipanov (ndipanov) wrote :

This is basically the same as https://bugs.launchpad.net/nova/+bug/1417667 but this one is slightly more general, so I will mark the other one as a duplicate of this.

So after investigating this - it seems that there is really not that much work that needs to be done: all the information is passed in to the filter. It's just that we mangle the request_spec and filter_properties dicts so much, and the keys are so generic, that it is really difficult to make sense of it without following the code all the way from the API.

Because of this, it would probably be good to add a method that basically says: when inside a filter, give me the flavor I should be looking at right now.

Revision history for this message
Chris Friesen (cbf123) wrote :

While it's true that this bug would cover the resize case that I mentioned in note #1 of bug #1417667, I think that we still need to keep that bug open for the more general case of live-migration, evacuate, rebuild, etc.

The key difference for that bug is that when using dedicated CPUs we need to recalculate which CPUs to use on the destination compute node (and claim those resources) before actually doing the migration/evacuation/rebuild. As it stands, we'll continue to use the originally-specified vCPU/pCPU mapping, even though it may not be valid on the new host.
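
For example, the stale mapping can be spotted on the destination host with something like the following (domain name and output are illustrative, assuming a libvirt-based compute node):

    $ sudo virsh dumpxml instance-00000001 | grep vcpupin
        <vcpupin vcpu='0' cpuset='2'/>
        <vcpupin vcpu='1' cpuset='3'/>

If the cpuset values still point at the CPUs chosen on the source host, the mapping was carried over rather than recalculated.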

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (master)

Fix proposed to branch: master
Review: https://review.openstack.org/158245

Changed in nova:
assignee: Tiago Rodrigues de Mello (timello) → Nikola Đipanov (ndipanov)
status: Confirmed → In Progress
Revision history for this message
Nikola Đipanov (ndipanov) wrote :

@Chris - well from the POV of the code, fixing this for the general case of CPU pinning is a sub-problem of fixing it for NUMA as such really, since CPU pinning uses the same code paths as NUMA does and relies on the same filter.

Fixing it for live migration with a specified host likely requires a different bug anyway - so we might want to open that and leave this one closed?

Revision history for this message
zhangtralon (zhangchunlong1) wrote :

This is a big problem; I think we need to think through every feature related to NUMA. Right now, when using the huge pages feature, I hit the same problem.
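
For reference, the huge page case can be set up via the hw:mem_page_size extra spec (flavor name illustrative); a resize between flavors with different page sizes exercises the same stale-topology path:

    $ openstack flavor set test.hugepages --property "hw:mem_page_size=2048"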

Revision history for this message
OpenStack Infra (hudson-openstack) wrote :

Fix proposed to branch: master
Review: https://review.openstack.org/160484

Changed in nova:
assignee: Nikola Đipanov (ndipanov) → Ed Leafe (ed-leafe)
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on nova (master)

Change abandoned by Joe Gordon (<email address hidden>) on branch: master
Review: https://review.openstack.org/160484
Reason: This review is > 4 weeks without comment, and failed Jenkins the last time it was checked. We are abandoning this for now. Feel free to reactivate the review by pressing the restore button and leaving a 'recheck' comment to get fresh test results.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote :

Change abandoned by Nikola Dipanov (<email address hidden>) on branch: master
Review: https://review.openstack.org/158245

Revision history for this message
Stephen Finucane (stephenfinucane) wrote :

I undertook some research into this. My findings are below, but tl;dr: it appears that this now works as expected and the bug can be closed.

---

# Problem

There were reports that resizing an instance from a pinned flavor to an unpinned one did not result in the pinning being removed. The opposite is also reportedly true.
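
The resize under test can be driven with something like the following, using the flavors created below:

    $ openstack server resize --flavor test.unpinned test1
    $ openstack server resize --confirm test1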

# Steps

## Create the required flavors

    $ openstack flavor create test.unpinned --id 100 --ram 2048 --disk 0 --vcpus 2
    $ openstack flavor create test.pinned --id 101 --ram 2048 --disk 0 --vcpus 2
    $ openstack flavor set test.pinned --property "hw:cpu_policy=dedicated"
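
As a quick sanity check, the extra spec can be confirmed with (output omitted here):

    $ openstack flavor show test.pinned -c properties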

## Ensure these are available

    $ openstack flavor list
    +-----+---------------+-------+------+-----------+-------+-----------+
    | ID  | Name          | RAM   | Disk | Ephemeral | VCPUs | Is Public |
    +-----+---------------+-------+------+-----------+-------+-----------+
    | 1   | m1.tiny       | 512   | 1    | 0         | 1     | True      |
    | 100 | test.unpinned | 2048  | 0    | 0         | 2     | True      |
    | 101 | test.pinned   | 2048  | 0    | 0         | 2     | True      |
    | 2   | m1.small      | 2048  | 20   | 0         | 1     | True      |
    | 3   | m1.medium     | 4096  | 40   | 0         | 2     | True      |
    | 4   | m1.large      | 8192  | 80   | 0         | 4     | True      |
    | 42  | m1.nano       | 64    | 0    | 0         | 1     | True      |
    | 5   | m1.xlarge     | 16384 | 160  | 0         | 8     | True      |
    | 84  | m1.micro      | 128   | 0    | 0         | 1     | True      |
    +-----+---------------+-------+------+-----------+-------+-----------+

    $ openstack image list
    +--------------------------------------+---------------------------------+--------+
    | ID                                   | Name                            | Status |
    +--------------------------------------+---------------------------------+--------+
    | c44bba29-653e-4ddf-963d-442af4c33a13 | cirros-0.3.4-x86_64-uec         | active |
    | 8b0284ee-ae6c-4e80-b5ee-26895d574717 | cirros-0.3.4-x86_64-uec-ramdisk | active |
    | 855c2971-aedc-4d5f-a366-73bb14707965 | cirros-0.3.4-x86_64-uec-kernel  | active |
    +--------------------------------------+---------------------------------+--------+

## Boot an instance

    $ openstack server create --flavor=test.pinned \
        --image=cirros-0.3.4-x86_64-uec --wait test1

## Validate that the instance is pinned

    $ openstack server list
    +--------------------------------------+-------+--------+--------------------------------------------------------+
    | ID                                   | Name  | Status | Networks                                               |
    +--------------------------------------+-------+--------+--------------------------------------------------------+
    | 857597cb-266b-4032-8030-e3cc76ebf0e7 | test1 | ACTIVE | private=10.0.0.3, fd2a:ec16:99e1:0:f816:3eff:fe99:df9f |
    +--------------------------------------+-------+--------+--------------------------------------------------------+

    $ sudo virsh list
     Id    Name                           State
    ----------------------------------------------------
    ...

Read more...

Changed in nova:
assignee: Ed Leafe (ed-leafe) → Stephen Finucane (sfinucan)
Changed in nova:
status: In Progress → Invalid
Revision history for this message
Tony Walker (tony-walker-h) wrote :

I'm seeing this on Kilo @ 2015.1.0. I have 2 NUMA flavors - one double the size of the other in terms of CPU and memory.
If I boot a new instance of the large type, all is well. If I boot the small and resize to the large, the cputune section gets the correct shares for the large, but the <vcpupin> entries for the old one. To compound the issue, the <numa> section contains the memory size of the smaller flavor, resulting in:

qemu-system-x86_64: total memory for NUMA nodes (0x1c00000000) should equal RAM size (0x3800000000)
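
The mismatch shows up in the guest XML; e.g. (domain name illustrative):

    $ sudo virsh dumpxml instance-000003e8 | grep vcpupin
    $ sudo virsh dumpxml instance-000003e8 | grep -A 3 '<numa>'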

@sfinucan - what version did you find this fixed on?

Revision history for this message
liuxiuli (liu-lixiu) wrote :

@Stephen Finucane - This problem still exists on master. Do you have time to deal with this bug? I hope to see your fix as soon as possible. Thank you.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote :

Change abandoned by Jay Pipes (<email address hidden>) on branch: master
Review: https://review.openstack.org/160484
Reason: The bug appears to now be fixed and Nikola is no longer working on Nova. Abandoning...
