pci whitelist exception will kill the periodic update of the hypervisor statistics

Bug #1603034 reported by Raghuveer Shenoy on 2016-07-14
This bug affects 2 people
Affects: OpenStack Compute (nova)
  Importance: Medium
  Assigned to: Matt Riedemann
Affects: Mitaka
  Importance: Medium
  Assigned to: Unassigned

Bug Description

An exception raised while loading the PCI whitelist causes the periodic hypervisor update loop to terminate, and it is never tried again. Retries should continue at the normal interval.
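To illustrate the expected behaviour, here is a minimal sketch of a periodic loop that logs a failure and carries on to the next interval. It is not nova's periodic task code; `run_periodic` and `flaky_update` are hypothetical names, and the sleep between iterations is left out so the sketch runs instantly.

```python
import logging

LOG = logging.getLogger(__name__)


def run_periodic(task, iterations):
    """Call task() `iterations` times, logging failures instead of stopping.

    Returns the number of iterations that raised an exception.
    """
    failures = 0
    for _ in range(iterations):
        try:
            task()
        except Exception:
            # Log the traceback and keep going so the next interval still runs.
            LOG.exception("Periodic task failed; retrying at the next interval")
            failures += 1
    return failures


calls = []


def flaky_update():
    """Hypothetical stand-in for the resource update: fails twice, then works."""
    calls.append(1)
    if len(calls) <= 2:
        raise RuntimeError("PCI device hed1 not found")


failed = run_periodic(flaky_update, iterations=5)
```

With this shape, a bad whitelist would produce an error on every interval rather than silently ending the updates.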

Scenario 1:

Set pci_passthrough_whitelist in nova.conf as follows:
pci_passthrough_whitelist = [ {"devname": "hed1", "physical_network": "physnet1"},{"physical_network": "physnet1", "address": "*:04:00.0"},{"physical_network": "physnet2", "address": "*:04:00.1"}]
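A quick way to spot this kind of misconfiguration before restarting the service might look like the sketch below. It only parses the value above as JSON and checks whether each devname exists as a network interface; it assumes a Linux host where interfaces appear under /sys/class/net, and `missing_devnames` is a hypothetical helper, not part of nova.

```python
import json
import os

# The whitelist value from the nova.conf line above.
RAW = (
    '[ {"devname": "hed1", "physical_network": "physnet1"},'
    '{"physical_network": "physnet1", "address": "*:04:00.0"},'
    '{"physical_network": "physnet2", "address": "*:04:00.1"}]'
)


def missing_devnames(raw, sys_net="/sys/class/net"):
    """Return the devname values that have no entry under sys_net."""
    specs = json.loads(raw)
    return [
        spec["devname"]
        for spec in specs
        if "devname" in spec
        and not os.path.exists(os.path.join(sys_net, spec["devname"]))
    ]


if __name__ == "__main__":
    print(missing_devnames(RAW))
```

Entries that use only an address are skipped, since this check is about interface names.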

If hed1 is not present, the following error appears in the nova-compute log. The compute service still shows as up, but the periodic hypervisor update stops working.

2016-07-13 09:22:42.146 28800 ERROR nova.compute.manager [req-0e7e62d5-23c9-48f2-8ca4-b47b763c29df None None] Error updating resources for node padawan-cp1-comp0001-mgmt.
2016-07-13 09:22:42.146 28800 ERROR nova.compute.manager Traceback (most recent call last):
2016-07-13 09:22:42.146 28800 ERROR nova.compute.manager File "/opt/stack/venv/nova-20160607T195234Z/lib/python2.7/site-packages/nova/compute/manager.py", line 6472, in update_available_resource
2016-07-13 09:22:42.146 28800 ERROR nova.compute.manager rt.update_available_resource(context)
2016-07-13 09:22:42.146 28800 ERROR nova.compute.manager File "/opt/stack/venv/nova-20160607T195234Z/lib/python2.7/site-packages/nova/compute/resource_tracker.py", line 531, in update_available_resource
2016-07-13 09:22:42.146 28800 ERROR nova.compute.manager self._update_available_resource(context, resources)
2016-07-13 09:22:42.146 28800 ERROR nova.compute.manager File "/opt/stack/venv/nova-20160607T195234Z/lib/python2.7/site-packages/oslo_concurrency/lockutils.py", line 271, in inner
2016-07-13 09:22:42.146 28800 ERROR nova.compute.manager return f(*args, **kwargs)
2016-07-13 09:22:42.146 28800 ERROR nova.compute.manager File "/opt/stack/venv/nova-20160607T195234Z/lib/python2.7/site-packages/nova/compute/resource_tracker.py", line 564, in _update_available_resource
2016-07-13 09:22:42.146 28800 ERROR nova.compute.manager node_id=n_id)
2016-07-13 09:22:42.146 28800 ERROR nova.compute.manager File "/opt/stack/venv/nova-20160607T195234Z/lib/python2.7/site-packages/nova/pci/manager.py", line 68, in __init__
2016-07-13 09:22:42.146 28800 ERROR nova.compute.manager self.dev_filter = whitelist.Whitelist(CONF.pci_passthrough_whitelist)
2016-07-13 09:22:42.146 28800 ERROR nova.compute.manager File "/opt/stack/venv/nova-20160607T195234Z/lib/python2.7/site-packages/nova/pci/whitelist.py", line 78, in __init__
2016-07-13 09:22:42.146 28800 ERROR nova.compute.manager self.specs = self._parse_white_list_from_config(whitelist_spec)
2016-07-13 09:22:42.146 28800 ERROR nova.compute.manager File "/opt/stack/venv/nova-20160607T195234Z/lib/python2.7/site-packages/nova/pci/whitelist.py", line 59, in _parse_white_list_from_config
2016-07-13 09:22:42.146 28800 ERROR nova.compute.manager spec = devspec.PciDeviceSpec(ds)
2016-07-13 09:22:42.146 28800 ERROR nova.compute.manager File "/opt/stack/venv/nova-20160607T195234Z/lib/python2.7/site-packages/nova/pci/devspec.py", line 134, in __init__
2016-07-13 09:22:42.146 28800 ERROR nova.compute.manager self._init_dev_details()
2016-07-13 09:22:42.146 28800 ERROR nova.compute.manager File "/opt/stack/venv/nova-20160607T195234Z/lib/python2.7/site-packages/nova/pci/devspec.py", line 155, in _init_dev_details
2016-07-13 09:22:42.146 28800 ERROR nova.compute.manager raise exception.PciDeviceNotFoundById(id=self.dev_name)
2016-07-13 09:22:42.146 28800 ERROR nova.compute.manager PciDeviceNotFoundById: PCI device hed1 not found
2016-07-13 09:22:42.146 28800 ERROR nova.compute.manager

Changed in nova:
assignee: nobody → Raghuveer Shenoy (rshenoy)
Matt Riedemann (mriedem) wrote :

What version of nova are you using?

tags: added: compute pci
Matt Riedemann (mriedem) wrote :

Never mind, it's still a problem in master from the looks of the code. We shouldn't bring down the resource tracker in the compute service just because of a misconfiguration in the whitelist.

Changed in nova:
importance: Undecided → High
importance: High → Medium
status: New → Triaged
Matt Riedemann (mriedem) wrote :

Man, looking at this code, there are several different ways that the pci whitelist can be wrong and totally blow up your resource tracker, just from initializing the PciDevTracker.

Matt Riedemann (mriedem) wrote :
Changed in nova:
assignee: Raghuveer Shenoy (rshenoy) → Matt Riedemann (mriedem)
status: Triaged → In Progress
tags: added: mitaka-backport-potential
Moshe Levi (moshele) wrote :

Regarding validation of the PCI whitelist, we have this bug:
https://bugs.launchpad.net/nova/+bug/1466451 and this proposed
patch: https://review.openstack.org/#/c/306054/

Matt Riedemann (mriedem) wrote :

https://review.openstack.org/#/c/306054/ just adds more things that can fail when building the pci whitelist. The point of this bug is to not kill the available resource task on the compute when there are problems in the whitelist.

Reviewed: https://review.openstack.org/342301
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=3a61ae35d4b713f423219c7b714126e1584694e8
Submitter: Jenkins
Branch: master

commit 3a61ae35d4b713f423219c7b714126e1584694e8
Author: Matt Riedemann <email address hidden>
Date: Thu Jul 14 13:37:05 2016 -0400

    Validate pci_passthrough_whitelist when starting n-cpu

    Loading up CONF.pci_passthrough_whitelist in the Whitelist
    object performs a bunch of validation and can fail in several
    different ways (invalid json, invalid values, invalid combinations
    of keys, devices not found, etc). This happens today when
    creating the PciDevTracker in the ResourceTracker when updating
    available resources. If the configuration is bad, it kills the
    periodic task to update available resources on the compute node.

    We should just load up the pci_passthrough_whitelist (if set)
    when starting the nova-compute service so we can fail fast and
    kill the service on any misconfiguration rather than run with
    a broken service.

    Change-Id: If50fb837b490042bb5ef20e9ad843b28f871a44e
    Closes-Bug: #1603034
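
The fail-fast idea in the commit message can be sketched roughly as below. This is not the committed change: `WhitelistConfigError`, `validate_whitelist` and `start_service` are hypothetical names, and the rule that devname and address cannot be combined is an illustrative assumption standing in for the real validation done by the Whitelist object.

```python
import json


class WhitelistConfigError(Exception):
    """Hypothetical error raised for any whitelist misconfiguration."""


def validate_whitelist(raw):
    """Parse the raw whitelist string and reject obviously bad input."""
    try:
        specs = json.loads(raw)
    except ValueError as exc:
        raise WhitelistConfigError("invalid JSON: %s" % exc)
    if not isinstance(specs, list):
        raise WhitelistConfigError("whitelist must be a JSON list")
    for spec in specs:
        if not isinstance(spec, dict):
            raise WhitelistConfigError("each entry must be a JSON object")
        # Assumed rule for illustration: an entry names a device either
        # by devname or by address, not both.
        if "devname" in spec and "address" in spec:
            raise WhitelistConfigError("devname and address cannot be combined")
    return specs


def start_service(raw_whitelist):
    """Validate the whitelist (if set) before anything else starts."""
    if raw_whitelist:
        validate_whitelist(raw_whitelist)  # raises, so startup aborts
    return "started"
```

Validating once at startup means a bad value stops the service immediately instead of surfacing later inside the periodic task.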

Changed in nova:
status: In Progress → Fix Released

This issue was fixed in the openstack/nova 14.0.0.0b3 development milestone.
