No nova hypervisor can be enabled on workers with QAT devices

Bug #1821938 reported by Yang Liu
This bug affects 2 people
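+--------------------------+--------------+------------+-------------+-----------+
| Affects                  | Status       | Importance | Assigned to | Milestone |
+--------------------------+--------------+------------+-------------+-----------+
| OpenStack Compute (nova) | Fix Released | High       | sean mooney |           |
| StarlingX                | Fix Released | High       | Jim Gauld   |           |
+--------------------------+--------------+------------+-------------+-----------+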

Bug Description

Brief Description
-----------------
A host cannot be enabled as a nova hypervisor if it has QAT devices (C62x or DH895XCC) configured: nova-compute fails with a "PCI device not found" error.

Severity
--------
Major

Steps to Reproduce
------------------
- Install and configure a system where worker nodes have QAT devices configured, for example:
[wrsroot@controller-0 ~(keystone_admin)]$ system host-device-list compute-0
+------------------+--------------+----------+-----------+-----------+---------------------------+---------------------------------+----------------------------------------+-----------+---------+
| name | address | class id | vendor id | device id | class name | vendor name | device name | numa_node | enabled |
+------------------+--------------+----------+-----------+-----------+---------------------------+---------------------------------+----------------------------------------+-----------+---------+
| pci_0000_09_00_0 | 0000:09:00.0 | 0b4000 | 8086 | 0435 | Co-processor | Intel Corporation | DH895XCC Series QAT | 0 | True |
| pci_0000_0c_00_0 | 0000:0c:00.0 | 030000 | 102b | 0522 | VGA compatible controller | Matrox Electronics Systems Ltd. | MGA G200e [Pilot] ServerEngines (SEP1) | 0 | True |
+------------------+--------------+----------+-----------+-----------+---------------------------+---------------------------------+----------------------------------------+-----------+---------+

compute-0:~$ lspci | grep QAT
09:00.0 Co-processor: Intel Corporation DH895XCC Series QAT
09:01.0 Co-processor: Intel Corporation DH895XCC Series QAT Virtual Function
09:01.1 Co-processor: Intel Corporation DH895XCC Series QAT Virtual Function
...

- Check the output of nova hypervisor-list

Expected Behavior
------------------
- Nova hypervisors exist on system

Actual Behavior
----------------
[wrsroot@controller-0 ~(keystone_admin)]$ nova hypervisor-list
+----+---------------------+-------+--------+
| ID | Hypervisor hostname | State | Status |
+----+---------------------+-------+--------+
+----+---------------------+-------+--------+

Reproducibility
---------------
Reproducible

System Configuration
--------------------
Any system type with QAT devices configured on a worker node

Branch/Pull Time/Commit
-----------------------
stx master as of 2019-03-18

Last Pass
--------------
On the f/stein branch in early February

Timestamp/Logs
--------------
# nova-compute pods are spewing errors so they can't register themselves properly as hypervisors:
2019-03-25 18:46:49,899.899 62394 ERROR nova.compute.manager [req-4f652d4c-da7e-4516-9baa-915265c3fdda - - - - -] Error updating resources for node compute-0.: PciDeviceNotFoundById: PCI device 0000:09:02.3 not found
2019-03-25 18:46:49,899.899 62394 ERROR nova.compute.manager Traceback (most recent call last):
2019-03-25 18:46:49,899.899 62394 ERROR nova.compute.manager File "/var/lib/openstack/lib/python2.7/site-packages/nova/compute/manager.py", line 7956, in _update_available_resource_for_node
2019-03-25 18:46:49,899.899 62394 ERROR nova.compute.manager startup=startup)
2019-03-25 18:46:49,899.899 62394 ERROR nova.compute.manager File "/var/lib/openstack/lib/python2.7/site-packages/nova/compute/resource_tracker.py", line 727, in update_available_resource
2019-03-25 18:46:49,899.899 62394 ERROR nova.compute.manager resources = self.driver.get_available_resource(nodename)
2019-03-25 18:46:49,899.899 62394 ERROR nova.compute.manager File "/var/lib/openstack/lib/python2.7/site-packages/nova/virt/libvirt/driver.py", line 7098, in get_available_resource
2019-03-25 18:46:49,899.899 62394 ERROR nova.compute.manager self._get_pci_passthrough_devices()
2019-03-25 18:46:49,899.899 62394 ERROR nova.compute.manager File "/var/lib/openstack/lib/python2.7/site-packages/nova/virt/libvirt/driver.py", line 6102, in _get_pci_passthrough_devices
2019-03-25 18:46:49,899.899 62394 ERROR nova.compute.manager pci_info.append(self._get_pcidev_info(name))
2019-03-25 18:46:49,899.899 62394 ERROR nova.compute.manager File "/var/lib/openstack/lib/python2.7/site-packages/nova/virt/libvirt/driver.py", line 6062, in _get_pcidev_info
2019-03-25 18:46:49,899.899 62394 ERROR nova.compute.manager device.update(_get_device_type(cfgdev, address))
2019-03-25 18:46:49,899.899 62394 ERROR nova.compute.manager File "/var/lib/openstack/lib/python2.7/site-packages/nova/virt/libvirt/driver.py", line 6021, in _get_device_type
2019-03-25 18:46:49,899.899 62394 ERROR nova.compute.manager pci_address, pf_interface=True),
2019-03-25 18:46:49,899.899 62394 ERROR nova.compute.manager File "/var/lib/openstack/lib/python2.7/site-packages/nova/pci/utils.py", line 159, in get_ifname_by_pci_address
2019-03-25 18:46:49,899.899 62394 ERROR nova.compute.manager raise exception.PciDeviceNotFoundById(id=pci_addr)
2019-03-25 18:46:49,899.899 62394 ERROR nova.compute.manager PciDeviceNotFoundById: PCI device 0000:09:02.3 not found
2019-03-25 18:46:49,899.899 62394 ERROR nova.compute.manager
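
The traceback ends in nova/pci/utils.py, where the parent interface name of a VF is looked up. A minimal sketch of why such a lookup fails for a QAT VF follows. It assumes the lookup lists the `net` directory under the VF's `physfn` link in sysfs. The names `parent_ifname`, `sysfs_root`, and `demo`, the use of `LookupError`, and the NIC address 0000:05:10.0 are illustrative only and are not nova's code.

```python
import os
import tempfile


def parent_ifname(pci_addr, sysfs_root="/sys/bus/pci/devices"):
    """Return the netdev name of the PF that owns the VF at pci_addr.

    Hypothetical helper: raises LookupError when the parent PF exposes no
    'net' directory, which is the case for non-NIC devices such as QAT.
    """
    net_dir = os.path.join(sysfs_root, pci_addr, "physfn", "net")
    try:
        names = os.listdir(net_dir)
    except OSError:
        raise LookupError("PCI device %s not found" % pci_addr)
    if not names:
        raise LookupError("PCI device %s not found" % pci_addr)
    return names[0]


def demo():
    """Build a fake sysfs tree with one NIC VF and one QAT-like VF."""
    root = tempfile.mkdtemp()
    # NIC VF: the parent PF has a net/<ifname> entry.
    os.makedirs(os.path.join(root, "0000:05:10.0", "physfn", "net", "ens1f0"))
    # QAT-like VF: the parent PF exists but has no net directory.
    os.makedirs(os.path.join(root, "0000:09:02.3", "physfn"))
    results = {"nic": parent_ifname("0000:05:10.0", sysfs_root=root)}
    try:
        parent_ifname("0000:09:02.3", sysfs_root=root)
    except LookupError as exc:
        results["qat"] = str(exc)
    return results
```

On a real worker, checking whether /sys/bus/pci/devices/<VF address>/physfn/net exists is a quick way to tell a NIC VF from a generic one.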

Revision history for this message
Ghada Khalil (gkhalil) wrote :

Marking as release gating; high priority given this appears to be impacting the use of systems w/ qat devices

Changed in starlingx:
importance: Undecided → High
status: New → Triaged
tags: added: stx.2019.05 stx.distro.openstack stx.helpwanted
Changed in nova:
importance: Undecided → High
assignee: nobody → sean mooney (sean-k-mooney)
status: New → In Progress
tags: added: stein-rc-potential
Revision history for this message
Eric Fried (efried) wrote :

Nova fix at https://review.openstack.org/#/c/649409/

Will be backported to stein, likely to be included in the next RC.

Not sure if anything needs to be done on the stx side.

Ghada Khalil (gkhalil)
tags: removed: stx.helpwanted
Changed in starlingx:
assignee: nobody → Chris Friesen (cbf123)
assignee: Chris Friesen (cbf123) → Jim Gauld (jgauld)
status: Triaged → In Progress
Revision history for this message
Ghada Khalil (gkhalil) wrote :

Having both stx and nova as impacted projects works well. It gives the stx team visibility to the resolution of the underlying issue. Once the nova fix is available in the stein branch, we will pick it up and update the stx status of the bug accordingly.

description: updated
Revision history for this message
Bruce Jones (brucej) wrote :

Email from Eric Fried 4/2/19

Nova fix at https://review.openstack.org/#/c/649409/

Will be backported to stein, likely to be included in the next RC.

Not sure if anything needs to be done on the stx side.

Revision history for this message
melanie witt (melwitt) wrote :

Because this is a new regression in Stein, I think it is a candidate for another RC.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (stable/stein)

Fix proposed to branch: stable/stein
Review: https://review.openstack.org/649630

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to nova (master)

Reviewed: https://review.openstack.org/649409
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=e7ae6c65cd24fb3e0776fac80fbab2ab16e9d9ed
Submitter: Zuul
Branch: master

commit e7ae6c65cd24fb3e0776fac80fbab2ab16e9d9ed
Author: Sean Mooney <email address hidden>
Date: Tue Apr 2 18:27:24 2019 +0100

    Libvirt: gracefully handle non-nic VFs

    As part of adding support for bandwidth based scheduling
    I038867c4094d79ae4a20615ab9c9f9e38fcc2e0a introduced
    automatic discovery of parent netdev names for PCIe
    virtual functions.

    Nova's PCI passthrough support was originally developed for
    Intel QAT devices and other generic PCI devices. Later support
    for Neutron based SR-IOV NIC was added.

    The PCI-SIG SR-IOV specification while most often used by NIC
    vendors to virtualise a NIC in hardware was designed for devices
    of any PCIe class. Support for Intel's QAT device and other
    accelerators like AMD's SRIOV based vGPU have therefore been
    regressed by the introduction of the new parent_ifname lookup code.

    This change simply catches the exception that would be raised
    when pci_utils.get_ifname_by_pci_address is called on generic
    VFs allowing a graceful fallback to the previous behaviour.

    Change-Id: Ib3811f828246311d90b0e3ba71c162c03fb8fe5a
    Closes-Bug: #1821938
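
The fallback described in the commit message can be sketched as follows. This is a simplified illustration under stated assumptions, not the merged patch: `lookup_ifname`, `DeviceNotFound`, and `describe_vf` are hypothetical stand-ins for `pci_utils.get_ifname_by_pci_address`, `PciDeviceNotFoundById`, and the libvirt driver's device-type logic, and the dictionary keys are illustrative.

```python
class DeviceNotFound(Exception):
    """Stand-in for nova's PciDeviceNotFoundById."""


def lookup_ifname(pci_addr, netdevs):
    """Stand-in lookup: netdevs maps a VF address to its parent netdev name."""
    try:
        return netdevs[pci_addr]
    except KeyError:
        raise DeviceNotFound(pci_addr)


def describe_vf(pci_addr, parent_addr, netdevs):
    """Build the VF description, adding parent_ifname only when available."""
    info = {"dev_type": "type-VF", "parent_addr": parent_addr}
    try:
        info["parent_ifname"] = lookup_ifname(pci_addr, netdevs)
    except DeviceNotFound:
        # Generic (non-NIC) VF such as QAT: keep the pre-regression
        # behaviour and report the device without a parent interface name.
        pass
    return info
```

With this pattern a NIC VF still gets its `parent_ifname`, while a QAT VF is reported without one instead of aborting the whole resource update.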

Changed in nova:
status: In Progress → Fix Released
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to nova (stable/stein)

Reviewed: https://review.openstack.org/649630
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=97a806b7900d9e55ad15c10184d2bd48f23efff0
Submitter: Zuul
Branch: stable/stein

commit 97a806b7900d9e55ad15c10184d2bd48f23efff0
Author: Sean Mooney <email address hidden>
Date: Tue Apr 2 18:27:24 2019 +0100

    Libvirt: gracefully handle non-nic VFs

    As part of adding support for bandwidth based scheduling
    I038867c4094d79ae4a20615ab9c9f9e38fcc2e0a introduced
    automatic discovery of parent netdev names for PCIe
    virtual functions.

    Nova's PCI passthrough support was originally developed for
    Intel QAT devices and other generic PCI devices. Later support
    for Neutron based SR-IOV NIC was added.

    The PCI-SIG SR-IOV specification while most often used by NIC
    vendors to virtualise a NIC in hardware was designed for devices
    of any PCIe class. Support for Intel's QAT device and other
    accelerators like AMD's SRIOV based vGPU have therefore been
    regressed by the introduction of the new parent_ifname lookup code.

    This change simply catches the exception that would be raised
    when pci_utils.get_ifname_by_pci_address is called on generic
    VFs allowing a graceful fallback to the previous behaviour.

    Change-Id: Ib3811f828246311d90b0e3ba71c162c03fb8fe5a
    Closes-Bug: #1821938
    (cherry picked from commit e7ae6c65cd24fb3e0776fac80fbab2ab16e9d9ed)

tags: added: in-stable-stein
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/nova 19.0.0.0rc2

This issue was fixed in the openstack/nova 19.0.0.0rc2 release candidate.

Revision history for this message
Ghada Khalil (gkhalil) wrote :

StarlingX will pick up the fix in the next nova docker image build.

Changed in starlingx:
status: In Progress → Fix Committed
Ken Young (kenyis)
tags: added: stx.2.0
removed: stx.2019.05
Ghada Khalil (gkhalil)
tags: added: stx.retestneeded
Revision history for this message
Maria Guadalupe Perez Ibara (maria-gp) wrote :

The problem is still present in all bare metal configurations.

Iso: 20190408T233001Z

Revision history for this message
Ghada Khalil (gkhalil) wrote :

The stx docker image build on April 8 failed for other reasons. The images should be rebuilt now. You will need to re-test with a new set of docker images.

Changed in starlingx:
status: Fix Committed → Fix Released
Revision history for this message
Yang Liu (yliu12) wrote :

Verified nova hypervisors are enabled properly on systems with C62x and DH895XCC QAT pci devices.
Load used: "20190410T013000Z"

tags: removed: stx.retestneeded
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/nova 20.0.0.0rc1

This issue was fixed in the openstack/nova 20.0.0.0rc1 release candidate.
