PCI whitelist exception causes the resource tracker to stop and will not allow us to spawn further SR-IOV/PCIPT VMs when SR-IOV PF is assigned to a VM.

Bug #1605549 reported by MANJUNATH PATIL
Affects | Status | Importance | Assigned to | Milestone
OpenStack Compute (nova) | Fix Released | High | MANJUNATH PATIL |

Bug Description

An exception raised while parsing the PCI whitelist causes the resource tracker to stop and blocks the user/admin from spawning further VMs.

we have the following pci_whitelist to support both SRIOV and PCIPT on
pci_passthrough_whitelist =
[{"devname": "eth1", "physical_network": "physnet1"},
{"physical_network": "physnet1", "address": "*:04:00.0"},
{"physical_network": "physnet2", "address": "*:04:00.1"}]
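
To make the role of each entry explicit, here is a minimal sketch that parses the whitelist value above and groups the entries by how they select a device. The helper name `split_by_selector` is hypothetical and is not part of nova; it only illustrates that the first entry depends on a network device name while the other two use PCI address patterns.

```python
import json

# The whitelist value from the report, written as a single JSON string.
WHITELIST = """
[{"devname": "eth1", "physical_network": "physnet1"},
 {"physical_network": "physnet1", "address": "*:04:00.0"},
 {"physical_network": "physnet2", "address": "*:04:00.1"}]
"""


def split_by_selector(raw):
    """Group whitelist entries by how they identify the device.

    Hypothetical helper for illustration only, not a nova function.
    """
    by_devname, by_address = [], []
    for entry in json.loads(raw):
        if "devname" in entry:
            by_devname.append(entry)
        elif "address" in entry:
            by_address.append(entry)
    return by_devname, by_address


by_devname, by_address = split_by_selector(WHITELIST)
print("selected by devname:", [e["devname"] for e in by_devname])
print("selected by address:", [e["address"] for e in by_address])
```

Only the `devname` entry needs the named interface to be visible on the host in order to be turned into a PCI address, which is what the rest of this report is about.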

Once we boot the PCI passthrough VM on physnet1 using eth1,
the device eth1 is no longer available to the hypervisor.
So when we try to boot another PCI passthrough VM using eth2,
the current code tries to validate the pci_whitelist and
throws an error saying that device eth1 is not found.
This is because the pci_whitelist contains devname eth1 and
the code tries to get the PCI address of that device, which is no longer available.
We also found that, with the above pci_whitelist,
the periodic resource tracker stops as soon as we boot a PCI
passthrough VM. We analysed further and found that any
misconfiguration of pci_whitelist can cause the periodic
resource tracker to stop.
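
The failure mode described above can be reproduced in miniature. The sketch below is not nova code: `EagerSpec`, `DeviceNotFound`, `periodic_update`, and the address `0000:04:00.0` are all hypothetical stand-ins. It assumes only what the report states, namely that the whitelist is rebuilt on each periodic update and that a `devname` entry is resolved to a PCI address while the spec is being constructed.

```python
class DeviceNotFound(Exception):
    """Hypothetical stand-in for nova's PciDeviceNotFoundById."""


class EagerSpec:
    """Hypothetical spec that resolves devname to an address in __init__."""

    def __init__(self, entry, host_netdevs):
        self.address = entry.get("address")
        devname = entry.get("devname")
        if devname is not None:
            if devname not in host_netdevs:
                # Resolution fails as soon as the interface is gone.
                raise DeviceNotFound("PCI device %s not found" % devname)
            self.address = host_netdevs[devname]


def periodic_update(whitelist, host_netdevs):
    """One simulated resource tracker cycle: rebuild every spec."""
    specs = [EagerSpec(entry, host_netdevs) for entry in whitelist]
    return len(specs)


whitelist = [
    {"devname": "eth1", "physical_network": "physnet1"},
    {"physical_network": "physnet2", "address": "*:04:00.1"},
]

# Before any VM boots, eth1 is visible on the host (address is made up).
before = periodic_update(whitelist, {"eth1": "0000:04:00.0"})

# After eth1 is passed through to a VM, it disappears from the host,
# so every later cycle fails while rebuilding the whitelist.
try:
    periodic_update(whitelist, {})
    after = "ok"
except DeviceNotFound as exc:
    after = str(exc)

print(before, "|", after)
```

Because the exception is raised before any device is examined, one missing interface is enough to abort the whole update, including the entries that do not use `devname` at all.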

We get the following error in the nova compute log if eth1 is not present. The compute service still shows as up, but the periodic hypervisor update stops working.

2016-07-13 09:22:42.146 28800 ERROR nova.compute.manager [req-0e7e62d5-23c9-48f2-8ca4-b47b763c29df None None] Error updating resources for node padawan-cp1-comp0001-mgmt.
2016-07-13 09:22:42.146 28800 ERROR nova.compute.manager Traceback (most recent call last):
2016-07-13 09:22:42.146 28800 ERROR nova.compute.manager File "/opt/stack/venv/nova-20160607T195234Z/lib/python2.7/site-packages/nova/compute/manager.py", line 6472, in update_available_resource
2016-07-13 09:22:42.146 28800 ERROR nova.compute.manager rt.update_available_resource(context)
2016-07-13 09:22:42.146 28800 ERROR nova.compute.manager File "/opt/stack/venv/nova-20160607T195234Z/lib/python2.7/site-packages/nova/compute/resource_tracker.py", line 531, in update_available_resource
2016-07-13 09:22:42.146 28800 ERROR nova.compute.manager self._update_available_resource(context, resources)
2016-07-13 09:22:42.146 28800 ERROR nova.compute.manager File "/opt/stack/venv/nova-20160607T195234Z/lib/python2.7/site-packages/oslo_concurrency/lockutils.py", line 271, in inner
2016-07-13 09:22:42.146 28800 ERROR nova.compute.manager return f(*args, **kwargs)
2016-07-13 09:22:42.146 28800 ERROR nova.compute.manager File "/opt/stack/venv/nova-20160607T195234Z/lib/python2.7/site-packages/nova/compute/resource_tracker.py", line 564, in _update_available_resource
2016-07-13 09:22:42.146 28800 ERROR nova.compute.manager node_id=n_id)
2016-07-13 09:22:42.146 28800 ERROR nova.compute.manager File "/opt/stack/venv/nova-20160607T195234Z/lib/python2.7/site-packages/nova/pci/manager.py", line 68, in __init__
2016-07-13 09:22:42.146 28800 ERROR nova.compute.manager self.dev_filter = whitelist.Whitelist(CONF.pci_passthrough_whitelist)
2016-07-13 09:22:42.146 28800 ERROR nova.compute.manager File "/opt/stack/venv/nova-20160607T195234Z/lib/python2.7/site-packages/nova/pci/whitelist.py", line 78, in __init__
2016-07-13 09:22:42.146 28800 ERROR nova.compute.manager self.specs = self._parse_white_list_from_config(whitelist_spec)
2016-07-13 09:22:42.146 28800 ERROR nova.compute.manager File "/opt/stack/venv/nova-20160607T195234Z/lib/python2.7/site-packages/nova/pci/whitelist.py", line 59, in _parse_white_list_from_config
2016-07-13 09:22:42.146 28800 ERROR nova.compute.manager spec = devspec.PciDeviceSpec(ds)
2016-07-13 09:22:42.146 28800 ERROR nova.compute.manager File "/opt/stack/venv/nova-20160607T195234Z/lib/python2.7/site-packages/nova/pci/devspec.py", line 134, in __init__
2016-07-13 09:22:42.146 28800 ERROR nova.compute.manager self._init_dev_details()
2016-07-13 09:22:42.146 28800 ERROR nova.compute.manager File "/opt/stack/venv/nova-20160607T195234Z/lib/python2.7/site-packages/nova/pci/devspec.py", line 155, in _init_dev_details
2016-07-13 09:22:42.146 28800 ERROR nova.compute.manager raise exception.PciDeviceNotFoundById(id=self.dev_name)
2016-07-13 09:22:42.146 28800 ERROR nova.compute.manager PciDeviceNotFoundById: PCI device eth1 not found
2016-07-13 09:22:42.146 28800 ERROR nova.compute.manager

Tags: compute pci
MANJUNATH PATIL (mpatil)
Changed in nova:
assignee: nobody → MANJUNATH PATIL (mpatil)
OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (master)

Fix proposed to branch: master
Review: https://review.openstack.org/345925

Changed in nova:
status: New → In Progress
MANJUNATH PATIL (mpatil) wrote :

Hi,

This bug is not a duplicate of https://bugs.launchpad.net/bugs/1603034

In the discussion at https://review.openstack.org/#/c/342301/ we agreed to have two separate patches: one for syntax and regex-type validation, and a second one that fixes user-driven failures caused by a VM being assigned a PF.

Quoting from discussion --> "Yeah, having two separate patches is what I'm arguing for... one (this one) to move syntax and regex-type validation of the whitelist into the nova-compute startup and a second patch that fixes the user-driven failures that can occur due to a VM being assigned a PF and the whitelist value no longer being fully acceptable on the host."

Matt Riedemann (mriedem)
Changed in nova:
importance: Undecided → High
OpenStack Infra (hudson-openstack) wrote : Fix merged to nova (master)

Reviewed: https://review.openstack.org/345925
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=433fe514e8d166345b2bd8fa0b3055724285406d
Submitter: Jenkins
Branch: master

commit 433fe514e8d166345b2bd8fa0b3055724285406d
Author: Manjunath Patil <email address hidden>
Date: Fri Jul 22 14:31:11 2016 +0530

    Resolve PCI devices on the host during Guest boot-up.

    When devname is used in Whitelist configuration,
    resolve the address of devname when trying to
    match a device in the whitelist.

    Change-Id: I7a65857454cc132d97df9abb8297d350514cf2df
    Closes-Bug: #1605549
    Co-Authored-By: Raghuveer Shenoy <email address hidden>
    Co-Authored-By: Sonu <email address hidden>
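
The idea in the commit message, resolving the devname when a device is matched rather than when the whitelist is parsed, might look roughly like the sketch below. This is not the merged patch: `LazySpec`, its `match` signature, and the `host_netdevs` mapping are hypothetical, the shell-style matching of address patterns is a simplification, and treating an absent interface as "no match" is an assumption about the intended behaviour.

```python
from fnmatch import fnmatchcase


class LazySpec:
    """Hypothetical spec that defers devname resolution to match time."""

    def __init__(self, entry):
        # Construction only records the selector; it never touches the host.
        self.devname = entry.get("devname")
        self.address = entry.get("address")

    def match(self, pci_address, host_netdevs):
        """Return True if pci_address is selected by this spec."""
        if self.devname is not None:
            # Resolve the name now. If the interface is absent (assumed
            # here to mean "assigned to a guest"), report no match
            # instead of raising.
            resolved = host_netdevs.get(self.devname)
            return resolved is not None and resolved == pci_address
        # Address entries such as "*:04:00.1" are treated as shell-style
        # patterns in this sketch.
        return fnmatchcase(pci_address, self.address)


by_name = LazySpec({"devname": "eth1", "physical_network": "physnet1"})
by_addr = LazySpec({"physical_network": "physnet2", "address": "*:04:00.1"})

# eth1 present on the host (address is made up for the example).
present = by_name.match("0000:04:00.0", {"eth1": "0000:04:00.0"})
# eth1 absent: the spec still builds and simply does not match.
absent = by_name.match("0000:04:00.0", {})
# Address-based entries keep working regardless of eth1.
other = by_addr.match("0000:04:00.1", {})

print(present, absent, other)
```

With this shape, a whitelist entry whose interface has gone away no longer prevents the remaining entries from being evaluated.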

Changed in nova:
status: In Progress → Fix Released
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/nova 14.0.0.0b3

This issue was fixed in the openstack/nova 14.0.0.0b3 development milestone.
