when using dedicated cpus, the guest topology doesn't match the host

Bug #1417723 reported by Chris Friesen
This bug affects 2 people
Affects: OpenStack Compute (nova)
Status: Fix Released
Importance: Medium
Assigned to: Stephen Finucane

Affects: nova (Mitaka)
Status: Fix Released
Importance: Undecided
Assigned to: Stephen Finucane

Bug Description

According to "http://specs.openstack.org/openstack/nova-specs/specs/juno/approved/virt-driver-cpu-pinning.html", the topology of the guest is set up as follows:

"In the absence of an explicit vCPU topology request, the virt drivers typically expose all vCPUs as sockets with 1 core and 1 thread. When strict CPU pinning is in effect the guest CPU topology will be setup to match the topology of the CPUs to which it is pinned."

What I'm seeing is that when strict CPU pinning is in use, the guest seems to be configured with multiple threads even if the host doesn't have threading enabled.

As an example, I set up a flavor with 2 vCPUs and dedicated CPUs enabled. I then booted an instance of this flavor on two separate compute nodes, one with hyperthreading enabled and one with hyperthreading disabled. In both cases, "virsh dumpxml" gave the following topology:

<topology sockets='1' cores='1' threads='2'/>

When running on the system with hyperthreading disabled, this should presumably have been set to "cores=2 threads=1".

Taking this a bit further: even if hyperthreading is enabled on the host, it would only be accurate to specify multiple threads in the guest topology if the vCPUs are actually pinned to multiple threads of the same host core. Otherwise, it would be more accurate to specify the guest topology as multiple cores with one thread each.
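
To illustrate the reasoning above, here is a minimal sketch (a hypothetical helper, not Nova's actual code) of how the guest topology could be derived from the pinning:

    # Hypothetical sketch of the expected behaviour: only expose threads to the
    # guest when the pinned host CPUs really are thread siblings of one core.
    def guest_topology(pinned_cpus, host_siblings):
        """pinned_cpus: set of host CPU ids the vCPUs are pinned to.
        host_siblings: list of sets, each holding the thread siblings of one core.
        Returns a (sockets, cores, threads) tuple for the guest."""
        for siblings in host_siblings:
            overlap = pinned_cpus & siblings
            if len(overlap) > 1:
                # vCPUs share a physical core, so exposing threads is accurate.
                return (1, len(pinned_cpus) // len(overlap), len(overlap))
        # No two pinned CPUs are siblings: expose plain cores, one thread each.
        return (1, len(pinned_cpus), 1)

    # 2-vCPU guest pinned to CPUs 2 and 3 on a host without hyperthreading:
    print(guest_topology({2, 3}, [{0}, {1}, {2}, {3}]))  # -> (1, 2, 1)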

Tags: compute
melanie witt (melwitt)
Changed in nova:
importance: Undecided → Medium
status: New → Confirmed
Changed in nova:
assignee: nobody → lyanchih (lyanchih)
Revision history for this message
Nikola Đipanov (ndipanov) wrote :

Hey Chris - thanks for the bug report!

Would it be possible to get the XML resulting from running `virsh capabilities` on the two hosts (or at least the interesting bits about their CPUs and NUMA topology), along with the resulting instance XML for each host, so that we can see which host CPUs the instances actually got pinned to?
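
Something like the following should capture that (the instance name is just an example; use whatever `virsh list` shows on each host):

    $ virsh capabilities > capabilities.xml
    $ virsh dumpxml instance-00000001 > instance.xml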

Related: There are some patches in flight that change behaviour regarding exposing threads to guests starting at https://review.openstack.org/#/c/229573/. It might be worth trying them out and seeing if they fix this problem.

Revision history for this message
Chris Friesen (cbf123) wrote :

Unfortunately I don't have current master running on a system with multiple NUMA nodes, nor do I have a lot of spare time. I'll try to set up a test, but I don't know for sure when I'll be able to get it done.

The host XML would look something like this for the hyperthreading-disabled case:
    <cpu>
      <arch>x86_64</arch>
      <model>SandyBridge</model>
      <vendor>Intel</vendor>
      <topology sockets='1' cores='10' threads='1'/>
      <feature name='invtsc'/>
      <feature name='erms'/>
      <feature name='smep'/>
      <feature name='fsgsbase'/>
      <feature name='pdpe1gb'/>
      <feature name='rdrand'/>
      <feature name='f16c'/>
      <feature name='osxsave'/>
      <feature name='dca'/>
      <feature name='pcid'/>
      <feature name='pdcm'/>
      <feature name='xtpr'/>
      <feature name='tm2'/>
      <feature name='est'/>
      <feature name='smx'/>
      <feature name='vmx'/>
      <feature name='ds_cpl'/>
      <feature name='monitor'/>
      <feature name='dtes64'/>
      <feature name='pbe'/>
      <feature name='tm'/>
      <feature name='ht'/>
      <feature name='ss'/>
      <feature name='acpi'/>
      <feature name='ds'/>
      <feature name='vme'/>
      <pages unit='KiB' size='4'/>
      <pages unit='KiB' size='2048'/>
      <pages unit='KiB' size='1048576'/>
    </cpu>
    <topology>
      <cells num='2'>
        <cell id='0'>
          <memory unit='KiB'>16696468</memory>
          <pages unit='KiB' size='4'>3287845</pages>
          <pages unit='KiB' size='2048'>5315</pages>
          <pages unit='KiB' size='1048576'>1</pages>
          <distances>
            <sibling id='0' value='10'/>
            <sibling id='1' value='21'/>
          </distances>
          <cpus num='10'>
            <cpu id='0' socket_id='0' core_id='0' siblings='0'/>
            <cpu id='1' socket_id='0' core_id='1' siblings='1'/>
            <cpu id='2' socket_id='0' core_id='2' siblings='2'/>
            <cpu id='3' socket_id='0' core_id='3' siblings='3'/>
            <cpu id='4' socket_id='0' core_id='4' siblings='4'/>
            <cpu id='5' socket_id='0' core_id='8' siblings='5'/>
            <cpu id='6' socket_id='0' core_id='9' siblings='6'/>
            <cpu id='7' socket_id='0' core_id='10' siblings='7'/>
            <cpu id='8' socket_id='0' core_id='11' siblings='8'/>
            <cpu id='9' socket_id='0' core_id='12' siblings='9'/>
          </cpus>
        </cell>
        <cell id='1'>
          <memory unit='KiB'>16777216</memory>
          <pages unit='KiB' size='4'>3814400</pages>
          <pages unit='KiB' size='2048'>6374</pages>
          <pages unit='KiB' size='1048576'>1</pages>
          <distances>
            <sibling id='0' value='21'/>
            <sibling id='1' value='10'/>
          </distances>
          <cpus num='10'>
            <cpu id='10' socket_id='1' core_id='0' siblings='10'/>
            <cpu id='11' socket_id='1' core_id='1' siblings='11'/>
            <cpu id='12' socket_id='1' core_id='2' siblings='12'/>
            <cpu id='13' socket_id='1' core_id='3' siblings='13'/>
            <cpu id='14' socket_id='1' core_id='4' siblings='14'/>
            <cpu id='15' socket_id='1' core_id='8...


Changed in nova:
assignee: Chung Chih, Hung (lyanchih) → nobody
assignee: nobody → Stephen Finucane (sfinucan)
Revision history for this message
Stephen Finucane (stephenfinucane) wrote :

So I investigated this and it seems that it's still a bug. You can sidestep the issue using the `require` CPU thread policy, though that's not a real solution. Findings below.
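
For reference, the workaround would look something like this (the flavor name here is purely illustrative; `hw:cpu_thread_policy=require` is the relevant property):

    $ openstack flavor set pinned.require \
        --property "hw:cpu_policy=dedicated" \
        --property "hw:cpu_thread_policy=require"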

---

# Platform

Testing was conducted on two single-node, Fedora 23-based
(4.3.5-300.fc23.x86_64) OpenStack instances (built with devstack). Each is a
dual-socket, ten-core, HT-enabled system (2 sockets * 10 cores * 2 threads
= 40 "pCPUs"; CPUs 0-9,20-29 = node0, 10-19,30-39 = node1).

Commit `8bafc9` of Nova was used.

# Steps

## Create flavors

    $ openstack flavor create pinned.prefer \
        --id 101 --ram 2048 --disk 0 --vcpus 4
    $ openstack flavor set pinned.prefer \
        --property "hw:cpu_policy=dedicated" \
        --property "hw:cpu_thread_policy=prefer"

## Validate a HT-enabled node

Since we're not running any other instances on this host, the policy should
ensure that thread siblings are preferred and that this information is
reflected in the guest topology. As such, the guest should see two sockets
with one core per socket and two threads per core.

    $ openstack server create --flavor=pinned.prefer \
        --image=cirros-0.3.4-x86_64-uec --wait test1

    $ sudo virsh list
     Id Name State
    ----------------------------------------------------
     1 instance-00000001 running

    $ sudo virsh dumpxml 1
    <domain type='kvm' id='1'>
      <name>instance-00000001</name>
      ...
      <vcpu placement='static'>4</vcpu>
      <cputune>
        <shares>4096</shares>
        <vcpupin vcpu='0' cpuset='1'/>
        <vcpupin vcpu='1' cpuset='21'/>
        <vcpupin vcpu='2' cpuset='0'/>
        <vcpupin vcpu='3' cpuset='20'/>
        <emulatorpin cpuset='0-1,20-21'/>
      </cputune>
      <numatune>
        <memory mode='strict' nodeset='0'/>
        <memnode cellid='0' mode='strict' nodeset='0'/>
      </numatune>
      ...
      <cpu>
        <topology sockets='2' cores='1' threads='2'/>
        <numa>
          <cell id='0' cpus='0-3' memory='2097152' unit='KiB'/>
        </numa>
      </cpu>
      ...
    </domain>

    $ openstack server delete test1

No issues here.

## Validate a HT-disabled node

This is exactly the same configuration as for the HT-enabled node, but it
should result in different output since there are no threads. We should see
four sockets with one core per socket and one thread per core.

    $ openstack server create --flavor=pinned.prefer \
        --image=cirros-0.3.4-x86_64-uec --wait test1

    $ sudo virsh list
     Id Name State
    ----------------------------------------------------
     1 instance-00000001 running

    $ sudo virsh dumpxml 1
    <domain type='kvm' id='1'>
      <name>instance-00000001</name>
      ...
      <vcpu placement='static'>4</vcpu>
      <cputune>
        <shares>4096</shares>
        <vcpupin vcpu='0' cpuset='0'/>
        <vcpupin vcpu='1' cpuset='1'/>
        <vcpupin vcpu='2' cpuset='2'/>
        <vcpupin vcpu='3' cpuset='3'/>
        <emulatorpin cpuset='0-3'/>
      </cputune>
      <numatune>
        <memory mode='strict' nodeset='0'/>
        <memnode cellid='0' mode='strict' nodeset='0'...


Revision history for this message
Stephen Finucane (stephenfinucane) wrote :

In case it's not clear to anyone: the above is incorrect, as the latter case has a 2-1-2 socket-core-thread configuration when it should have a 4-1-1 configuration.
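
That is, the (truncated) dumpxml above would have contained something like the first element below, whereas the second is what should be generated on the HT-disabled host (shown here for illustration, since the actual element was cut off):

    <topology sockets='2' cores='1' threads='2'/>   <!-- actual (2-1-2) -->
    <topology sockets='4' cores='1' threads='1'/>   <!-- expected (4-1-1) -->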

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (master)

Fix proposed to branch: master
Review: https://review.openstack.org/285321

Changed in nova:
status: Confirmed → In Progress
Changed in nova:
assignee: Stephen Finucane (sfinucan) → Waldemar Znoinski (wznoinsk)
Changed in nova:
assignee: Waldemar Znoinski (wznoinsk) → Stephen Finucane (sfinucan)
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to nova (master)

Reviewed: https://review.openstack.org/285321
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=0b2e34f92507fd490faaec3285049b28446dc94c
Submitter: Jenkins
Branch: master

commit 0b2e34f92507fd490faaec3285049b28446dc94c
Author: Stephen Finucane <email address hidden>
Date: Fri Feb 26 13:07:56 2016 +0000

    virt/hardware: Fix 'isolate' case on non-SMT hosts

    The 'isolate' policy is supposed to function on both hosts with an
    SMT architecture (e.g. HyperThreading) and those without. The former
    is true, but the latter is broken due to an underlying implementation
    detail in how vCPUs are "packed" onto pCPUs.

    The '_pack_instance_onto_cores' function expects to work with a list of
    sibling sets. Since non-SMT hosts don't have siblings, the function is
    being given a list of all cores as one big sibling set. However, this
    conflicts with the idea that, in the 'isolate' case, only one sibling
    from each sibling set should be used. Using one sibling from the one
    available sibling set means it is not possible to schedule instances
    with more than one vCPU.

    Resolve this mismatch by instead providing the function with a list of
    multiple sibling sets, each containing a single core.

    This also resolves another bug. When booting instances on a non-HT
    host, the resulting NUMA topology should not define threads. By
    correctly considering the cores on these systems as non-siblings,
    the resulting instance topology will contain multiple cores with only
    a single thread in each.

    Change-Id: I2153f25fdb6382ada8e62fddf4215d9a0e3a6aa7
    Closes-bug: #1550317
    Closes-bug: #1417723
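
A rough sketch of the data-shape change this commit describes (hypothetical names, not the actual Nova code):

    # Hypothetical illustration: '_pack_instance_onto_cores' expects a list of
    # sibling sets, so on a non-SMT host hand it one single-core set per pCPU
    # rather than one giant set containing every core.
    def sibling_sets(host_cpus, thread_siblings):
        """host_cpus: usable pCPU ids; thread_siblings: list of sets of
        thread-sibling pCPUs (empty on a non-SMT host)."""
        if thread_siblings:
            return thread_siblings              # e.g. [{0, 20}, {1, 21}, ...]
        # Before the fix: [set(host_cpus)], one pseudo-sibling set from which
        # the 'isolate' policy could only ever take a single CPU.
        # After the fix: one set per core, so every core remains usable.
        return [{cpu} for cpu in host_cpus]

    print(sibling_sets([0, 1, 2, 3], []))       # -> [{0}, {1}, {2}, {3}]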

Changed in nova:
status: In Progress → Fix Released
Revision history for this message
Davanum Srinivas (DIMS) (dims-v) wrote : Fix included in openstack/nova 14.0.0.0b1

This issue was fixed in the openstack/nova 14.0.0.0b1 development milestone.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (stable/mitaka)

Fix proposed to branch: stable/mitaka
Review: https://review.openstack.org/326944

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to nova (stable/mitaka)

Reviewed: https://review.openstack.org/326944
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=57638a8c83f98a8c139378b3b1e920b4f1c49a95
Submitter: Jenkins
Branch: stable/mitaka

commit 57638a8c83f98a8c139378b3b1e920b4f1c49a95
Author: Stephen Finucane <email address hidden>
Date: Fri Feb 26 13:07:56 2016 +0000

    virt/hardware: Fix 'isolate' case on non-SMT hosts

    The 'isolate' policy is supposed to function on both hosts with an
    SMT architecture (e.g. HyperThreading) and those without. The former
    is true, but the latter is broken due to an underlying implementation
    detail in how vCPUs are "packed" onto pCPUs.

    The '_pack_instance_onto_cores' function expects to work with a list of
    sibling sets. Since non-SMT hosts don't have siblings, the function is
    being given a list of all cores as one big sibling set. However, this
    conflicts with the idea that, in the 'isolate' case, only one sibling
    from each sibling set should be used. Using one sibling from the one
    available sibling set means it is not possible to schedule instances
    with more than one vCPU.

    Resolve this mismatch by instead providing the function with a list of
    multiple sibling sets, each containing a single core.

    This also resolves another bug. When booting instances on a non-HT
    host, the resulting NUMA topology should not define threads. By
    correctly considering the cores on these systems as non-siblings,
    the resulting instance topology will contain multiple cores with only
    a single thread in each.

    Change-Id: I2153f25fdb6382ada8e62fddf4215d9a0e3a6aa7
    Closes-bug: #1550317
    Closes-bug: #1417723
    (cherry picked from commit 0b2e34f92507fd490faaec3285049b28446dc94c)

Revision history for this message
Doug Hellmann (doug-hellmann) wrote : Fix included in openstack/nova 13.1.1

This issue was fixed in the openstack/nova 13.1.1 release.
