I/O (PCIe) based NUMA scheduling can't really achieve PCI NUMA binding in some cases.

Bug #1551504 reported by Jinquan Ni on 2016-03-01
Affects: OpenStack Compute (nova)
Importance: Medium
Assigned to: Jinquan Ni

Bug Description

1. version
kilo 2015.1.0, liberty

This bug is based on the blueprint:
https://blueprints.launchpad.net/nova/+spec/input-output-based-numa-scheduling

In the current implementation scheme:

/nova/pci/stats.py
################################################################################
def _filter_pools_for_numa_cells(pools, numa_cells):
    # Some systems don't report numa node info for pci devices, in
    # that case None is reported in pci_device.numa_node, by adding None
    # to numa_cells we allow assigning those devices to instances with
    # numa topology
    numa_cells = [None] + [cell.id for cell in numa_cells]
################################################################################

If a compute node doesn't report NUMA node info for its PCI devices, those devices are treated by default as belonging to every NUMA node.

This can lead to a problem: a PCI device may not be on the NUMA node that the instance's CPU and memory are on. In this way, the real purpose of I/O (PCIe) based NUMA scheduling is not achieved. Worse, the user will wrongly believe the PCI devices are on the same NUMA node as the CPU and memory.
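The effect of that `[None]` wildcard can be reproduced with a minimal, self-contained sketch (simplified stand-ins for nova's pool dicts, not the actual nova code):

```python
# Simplified sketch of the filter above: prepending None to the allowed
# cell ids makes devices with unreported NUMA locality match any request.

def filter_pools_for_numa_cells(pools, cell_ids):
    """Keep pools whose numa_node is one of cell_ids, with None as wildcard."""
    allowed = [None] + list(cell_ids)
    return [pool for pool in pools if pool['numa_node'] in allowed]

pools = [
    {'vendor_id': '8086', 'numa_node': 0},     # locality known: node 0
    {'vendor_id': '8086', 'numa_node': None},  # host reported no locality
]

# Instance pinned to NUMA cell 1: the node-0 pool is filtered out, but
# the pool with unknown locality still passes, even though its device may
# actually sit on node 0, which is exactly the mismatch described above.
surviving = filter_pools_for_numa_cells(pools, [1])
print(surviving)  # [{'vendor_id': '8086', 'numa_node': None}]
```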

The truth is, many systems still don't report NUMA node info for PCI devices, so I think this bug needs to be fixed.

Jinquan Ni (ni-jinquan) on 2016-03-01
Changed in nova:
assignee: nobody → jinquanni(ZTE) (ni-jinquan)
tags: added: numa pci
Jinquan Ni (ni-jinquan) on 2016-03-01
description: updated
Matt Riedemann (mriedem) wrote :

Are you able to recreate this on master (mitaka) or liberty to see if this is already fixed?

tags: added: scheduler
tags: added: kilo-backport-potential liberty-backport-potential
Jinquan Ni (ni-jinquan) wrote :

To Matt Riedemann (mriedem):

    Thank you for your attention. I don't have a master development environment, but I have read the code, and the buggy code still exists:

    numa_cells = [None] + [cell.id for cell in numa_cells]

    As long as no check is added, directly prepending [None] to numa_cells will cause this problem.

Jinquan Ni (ni-jinquan) on 2016-03-04
Changed in nova:
status: New → In Progress

Ludovic Beliveau (ludovic-beliveau) wrote :

I'm not sure I understand the issue here correctly.

The use case that seems to be described is: there is a compute that can't report the NUMA node information about its PCI devices. A user requests to launch an instance with some specific PCI resources. Nova picks this specific compute to schedule the instance. The scheduler is able to meet the PCI request (in this case it disregards the NUMA node and picks any PCI devices that at least match the vendor and product info, and maybe some other criteria).

How can nova schedule the instance using the right numa nodes (for numa node affinity) if it doesn't have the numa node information for those PCI devices ?

Are you suggesting that in this case (since we don't have any NUMA node information) we should report an error specifying that the instance can't be launched?

Or are you suggesting that the scheduler should first try to accommodate the request on a compute that *has* PCI NUMA node information, and then, if none are available, try the computes that *can't* report PCI NUMA node information?

What is the desired behavior ?

Jinquan Ni (ni-jinquan) wrote :

To Ludovic Beliveau (ludovic-beliveau):

Your description in the first paragraph is right.

The answer to your first question is that the nova scheduler can't use the right NUMA nodes if it doesn't have the NUMA node information for those PCI devices.

My desired behavior is:

(1) Add a parameter "hw:numa_pci". If hw:numa_pci = True:

in this case (since we don't have any NUMA node information), we should report an error specifying that the instance can't be launched.

(2) If hw:numa_pci is not set:

the scheduler should first try to accommodate the request on a compute that *has* PCI NUMA node information; if none are available, it should try the computes that *can't* report PCI NUMA node information.

(3) If hw:numa_pci = False:

the PCI request has no NUMA requirements.
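The three cases could be sketched roughly like this (hw:numa_pci is a hypothetical flavor extra spec proposed in this comment, not an existing nova option, and the pool filtering is heavily simplified):

```python
# Hedged sketch of the proposed hw:numa_pci behavior (hypothetical flag,
# not actual nova code).

def select_pci_pools(pools, cell_ids, numa_pci=None):
    matching = [p for p in pools if p['numa_node'] in cell_ids]
    unknown = [p for p in pools if p['numa_node'] is None]
    if numa_pci is True:
        # (1) strict affinity: an empty result means the instance
        # can't be launched and an error should be reported
        return matching
    if numa_pci is False:
        # (3) PCI has no NUMA requirement at all
        return list(pools)
    # (2) unset: prefer pools with known locality, fall back to unknown
    return matching if matching else unknown

pools = [{'numa_node': 0}, {'numa_node': None}]
```

For example, with an instance pinned to cell 1, `numa_pci=True` yields no pools (scheduling error), while the unset default falls back to the unknown-locality pool.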

Nikola Đipanov (ndipanov) wrote :

I don't think we should care about libvirt reporting None for a device's NUMA node; this is related (AFAIK) to the libvirt version, and is not something we check when scheduling. People who care about this feature are likely to run the newer bits, so I doubt this is an issue in practice. The most we should do is log a warning in the libvirt driver, IMHO.

There is another question here, IIUC, and that is whether to allow "soft PCI device affinity" when both NUMA affinity and device passthrough are requested. Currently there is no way to allow a device to be on a different NUMA node; instances will simply fail to schedule. I am not convinced that this is a very useful feature, but could be convinced otherwise.

Jinquan Ni (ni-jinquan) wrote :

To Nikola Đipanov (ndipanov):

    This is not related to the libvirt version; all compute nodes' libvirt version is 1.2.21 in my environment. Some nodes can report PCI NUMA info, but other nodes can't. We found the reason is related to the nodes' hardware configuration and BIOS version, but some computes can't update their BIOS version.

    So this is an issue in practice. Sometimes, people who care about this feature will wrongly believe the PCI devices are on the NUMA node that the CPU and memory are on, but in fact they are not.

    And I agree with you that nova should care about systems reporting None for the device NUMA node.

So, my new desired behavior is (@Ludovic Beliveau):

PCI devices without a numa_node will be ignored when launching an instance with some specific PCI resources.

But these PCI devices remain available when launching an instance without specific PCI resources.
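A minimal sketch of this revised proposal (simplified hypothetical helper, not nova code):

```python
# Sketch of the revised proposal: devices with no reported NUMA node are
# skipped for requests with specific PCI resources, but stay usable when
# the instance requests no PCI locality.

def candidate_pools(pools, cell_ids=None):
    if cell_ids is not None:
        # Instance requests specific PCI resources with NUMA affinity:
        # ignore pools whose locality is unknown (numa_node is None)
        return [p for p in pools if p['numa_node'] in cell_ids]
    # No specific PCI request: every device remains available
    return list(pools)

pools = [{'numa_node': 0}, {'numa_node': None}]
```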

Jinquan Ni (ni-jinquan) wrote :

Sorry, comment #6 has a mistake:

"And I agree with you that nova should care about systems reporting None for the device NUMA node"

------->

"And I agree with you that nova should not care about systems reporting None for the device NUMA node"

Jinquan Ni (ni-jinquan) on 2016-03-29
Changed in nova:
importance: Undecided → Medium
Jinquan Ni (ni-jinquan) on 2016-06-21
Changed in nova:
milestone: none → newton-2

Change abandoned by Matt Riedemann (<email address hidden>) on branch: master
Review: https://review.openstack.org/298179

Having discussed this since, it's clear that this isn't a bug but a design decision. nova allows scheduling PCI devices on hosts without an exposed NUMA topology as this satisfies the majority of use cases. It would be possible to make this behavior configurable, but this is a feature request - not a bug - and would require a spec [1].

[1] https://review.openstack.org/#/c/361140/

Changed in nova:
status: In Progress → Invalid