_get_host_numa_topology assumes numa cell has memory

Bug #1418187 reported by Scott Moser
8
This bug affects 1 person
Affects Status Importance Assigned to Milestone
OpenStack Compute (nova)
Fix Released
Medium
Sahid Orentino
Juno
Fix Released
Medium
Matt Riedemann
nova (Ubuntu)
Fix Released
Medium
Unassigned

Bug Description

numa cells are not guaranteed to have memory.
libvirt capabilities represent that correctly.
nova's _get_host_numa_topology assumes that it can convert cell's memory to
kilobytes via:
   memory=cell.memory / units.Ki.

but cell.memory ends up being None. for some LibvirtConfigCapsNUMACell.

stack trace is like this:
[-] unsupported operand type(s) for /: 'NoneType' and 'int'
Traceback (most recent call last):
  File "/usr/lib/python2.7/dist-packages/nova/openstack/common/threadgroup.py", line 145, in wait
    x.wait()
  File "/usr/lib/python2.7/dist-packages/nova/openstack/common/threadgroup.py", line 47, in wait
    return self.thread.wait()
  File "/usr/lib/python2.7/dist-packages/eventlet/greenthread.py", line 173, in wait
    return self._exit_event.wait()
  File "/usr/lib/python2.7/dist-packages/eventlet/event.py", line 121, in wait
    return hubs.get_hub().switch()
  File "/usr/lib/python2.7/dist-packages/eventlet/hubs/hub.py", line 293, in switch
    return self.greenlet.switch()
  File "/usr/lib/python2.7/dist-packages/eventlet/greenthread.py", line 212, in main
    result = function(*args, **kwargs)
  File "/usr/lib/python2.7/dist-packages/nova/openstack/common/service.py", line 492, in run_service
    service.start()
  File "/usr/lib/python2.7/dist-packages/nova/service.py", line 181, in start
    self.manager.pre_start_hook()
  File "/usr/lib/python2.7/dist-packages/nova/compute/manager.py", line 1188, in pre_start_hook
    self.update_available_resource(nova.context.get_admin_context())
  File "/usr/lib/python2.7/dist-packages/nova/compute/manager.py", line 6047, in update_available_resource
    rt.update_available_resource(context)
  File "/usr/lib/python2.7/dist-packages/nova/compute/resource_tracker.py", line 313, in update_available_resource
    resources = self.driver.get_available_resource(self.nodename)
  File "/usr/lib/python2.7/dist-packages/nova/virt/libvirt/driver.py", line 4825, in get_available_resource
    numa_topology = self._get_host_numa_topology()
  File "/usr/lib/python2.7/dist-packages/nova/virt/libvirt/driver.py", line 4703, in _get_host_numa_topology
    for cell in topology.cells])
TypeError: unsupported operand type(s) for /: 'NoneType' and 'int'

Revision history for this message
Scott Moser (smoser) wrote :

attachign virsh capabilities output.

Ryan Beisner (1chb1n)
tags: added: openstack uosci
Revision history for this message
Scott Moser (smoser) wrote :

attaching workaround patch

Changed in nova (Ubuntu):
status: New → Confirmed
importance: Undecided → Medium
tags: added: patch
Revision history for this message
Scott Moser (smoser) wrote :

revised patch.

--- driver.py.org 2015-02-05 03:07:39.396827122 +0000
+++ driver.py 2015-02-06 14:41:25.331949566 +0000
@@ -4682,11 +4687,16 @@
         if topology is None or not topology.cells:
             return

+ if len([cell for cell in topology.cells if cell.memory is None]):
+ LOG.warn("odd, some of your numa cells have None memory. %s",
+ [{id: cell.id, cpuset: set(cpu.id for cpu in cell.cpus),
+ memory: cell.memory} for cell in topology.cells])
+
         topology = objects.NUMATopology(
                 cells=[objects.NUMACell(
                     id=cell.id,
                     cpuset=set(cpu.id for cpu in cell.cpus),
- memory=cell.memory / units.Ki,
+ memory=(cell.memory if cell.memory else 0)/ units.Ki,
                     cpu_usage=0, memory_usage=0,
                     mempages=[
                         objects.NUMAPagesTopology(

Revision history for this message
Davanum Srinivas (DIMS) (dims-v) wrote :

I asked Daniel if his review in progress fixes this bug or we'll need to file another after that lands

https://review.openstack.org/#/c/159106/

Changed in nova:
status: New → Confirmed
importance: Undecided → Medium
Changed in nova:
assignee: nobody → sahid (sahid-ferdjaoui)
Revision history for this message
Sahid Orentino (sahid-ferdjaoui) wrote :

If we look to your attachment we can see that: cell ids 1 and 17 do not provide any memory element. We need to figure out how it is possible and then we can fix it correctly.

Revision history for this message
Daniel Berrange (berrange) wrote :

FYI, having _get_host_numa_topology() not generate an exception on reporting is the least of our worries here. The scheduling placement code for NUMA guests assumes that all NUMA cells have both memory & cpus. So with the proposed patch you will have silenced the exception, but Nova will still be broken in practice, because it'll never schedule guests on 1/2 of the CPUs in this host - ie the ones without local memory will never get used. Fixing this is a far more invasive problem.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to nova (master)

Related fix proposed to branch: master
Review: https://review.openstack.org/160904

Revision history for this message
OpenStack Infra (hudson-openstack) wrote :

Related fix proposed to branch: master
Review: https://review.openstack.org/160905

Revision history for this message
OpenStack Infra (hudson-openstack) wrote :

Related fix proposed to branch: master
Review: https://review.openstack.org/160906

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to nova (master)

Reviewed: https://review.openstack.org/160904
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=291c1a1db1ab3ceccfac7a3c8312b6fdce3aaa84
Submitter: Jenkins
Branch: master

commit 291c1a1db1ab3ceccfac7a3c8312b6fdce3aaa84
Author: Sahid Orentino Ferdjaoui <email address hidden>
Date: Tue Mar 3 05:00:40 2015 -0500

    libvirt: make default value of numa cell memory to 0 when not defined

    Some arch can have cells without memory or cpus defined and libvirt
    will return an XML without these elements. Our object defintion of the
    fields cpus and memory cannot let us to make them to None when not
    defined but currently the config representation of a NUMA make it to
    None.

    This patch fix the default value of config memory to 0 when libvirt
    does not return memory element for a cell.

    Also this cannot be considered come a fix for bug 1418187 since we
    have to handle these cases (cpus or memory not defined) during
    scheduling. thse case can be addressed when using distances which will
    be addressed in a next serie of patches.

    Related-Bug: #1418187
    Change-Id: Iac08d1221341a86c081d5e905c704fb1c9dca276

Revision history for this message
Matt Riedemann (mriedem) wrote :

We hit the same thing with powerkvm testing in juno have basically the same patch out of tree, I can't remember exactly why it didn't get upstream now, but I think there was at least a bug reported, I'm trying to dig that up for reference.

tags: added: juno-backport-potential
Revision history for this message
Matt Riedemann (mriedem) wrote :

The other bug was bug 1376307.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to nova (stable/juno)

Related fix proposed to branch: stable/juno
Review: https://review.openstack.org/166004

Revision history for this message
Alan Pevec (apevec) wrote :

Complete fix will likely not be backportable so this bug will not get Juno task in Launchpad!

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to nova (stable/juno)

Reviewed: https://review.openstack.org/166004
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=34f029e68d607c7154adcb6d6616da8da0db3650
Submitter: Jenkins
Branch: stable/juno

commit 34f029e68d607c7154adcb6d6616da8da0db3650
Author: Sahid Orentino Ferdjaoui <email address hidden>
Date: Tue Mar 3 05:00:40 2015 -0500

    libvirt: make default value of numa cell memory to 0 when not defined

    Some arch can have cells without memory or cpus defined and libvirt
    will return an XML without these elements. Our object defintion of the
    fields cpus and memory cannot let us to make them to None when not
    defined but currently the config representation of a NUMA make it to
    None.

    This patch fix the default value of config memory to 0 when libvirt
    does not return memory element for a cell.

    Also this cannot be considered come a fix for bug 1418187 since we
    have to handle these cases (cpus or memory not defined) during
    scheduling. thse case can be addressed when using distances which will
    be addressed in a next serie of patches.

    Conflicts:
            nova/tests/unit/virt/libvirt/test_config.py
            nova/virt/libvirt/config.py

    NOTE(mriedem): The conflict in config.py is due to the mempages
    code added on master with commit 3283e2a42 that's not in juno.
    The test conflict was due to moving the tests in kilo.

    Related-Bug: #1418187
    Change-Id: Iac08d1221341a86c081d5e905c704fb1c9dca276
    (cherry picked from commit 291c1a1db1ab3ceccfac7a3c8312b6fdce3aaa84)

tags: added: in-stable-juno
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on nova (master)

Change abandoned by Michael Still (<email address hidden>) on branch: master
Review: https://review.openstack.org/160905
Reason: This patch as been WIP'ed for a very long time, so I am going to abandon it to keep the review queue sane. Please restore the change when its ready for review.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote :

Change abandoned by Michael Still (<email address hidden>) on branch: master
Review: https://review.openstack.org/160906
Reason: This patch as been WIP'ed for a very long time, so I am going to abandon it to keep the review queue sane. Please restore the change when its ready for review.

Alan Pevec (apevec)
tags: removed: in-stable-juno juno-backport-potential
Changed in nova:
status: Confirmed → Incomplete
Changed in nova (Ubuntu):
status: Confirmed → Incomplete
Revision history for this message
Augustina Ragwitz (auggy) wrote :

This bug was marked Incomplete but it is unclear what further information is needed before it can be Confirmed. What further information is needed for this to be triaged in Nova?

Revision history for this message
Sahid Orentino (sahid-ferdjaoui) wrote :

This issue has been addressed by:

 https://review.openstack.org/gitweb?p=openstack/nova.git;a=commitdiff;h=8df48025c36c8bc595f346f0b76ee010ae86737d

The scheduling part should have been addressed by Nikola during one of his serie to let the kernel decides the placement when nothing has been requested by users.

Changed in nova:
status: Incomplete → Fix Committed
Revision history for this message
Takashi Natsume (natsume-takashi) wrote :

Since the Mitaka cycle we use the direct release model, so change 'Fix Committed' status to 'Fix Released'.

Changed in nova:
status: Fix Committed → Fix Released
Chuck Short (zulcss)
Changed in nova (Ubuntu):
status: Incomplete → Fix Released
To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.