[SRIOV] When a VF is bound to a VM, Nova can't retrieve the PCI info

Bug #1883671 reported by Rodolfo Alonso on 2020-06-16
This bug affects 1 person
Affects                   Importance  Assigned to
OpenStack Compute (nova)  Low         sean mooney
Pike                      Undecided   Unassigned
Queens                    Undecided   Unassigned
Rocky                     Undecided   Unassigned
Stein                     Undecided   Unassigned
Train                     Low         sean mooney
Ussuri                    Low         sean mooney

Bug Description

Nova periodically updates the available resources per hypervisor [1]. That implies the reporting of the PCI devices [2]->[3].

In [4], a new feature was introduced to read from libvirt the NIC capabilities (gso, tso, tx, etc.). But when the NIC interface is bound to the VM and the MAC address is not the one assigned by the driver (Nova changes the MAC address according to the info provided by Neutron), libvirt fails reading the non-existing device: http://paste.openstack.org/show/794799/.

This call should be avoided or, at least, if it fails, the exception should be caught and suppressed.

[1] https://github.com/openstack/nova/blob/master/nova/compute/manager.py#L9642
[2] https://github.com/openstack/nova/blob/master/nova/virt/libvirt/driver.py#L6980
[3] https://github.com/openstack/nova/blob/master/nova/virt/libvirt/driver.py#L6898
[4] Ia5b6abbbf4e5f762e0df04167c32c6135781d305

tags: added: pci
tags: added: resource-tracker
Balazs Gibizer (balazs-gibizer) wrote :

I was not able to reproduce the issue locally.

I tried the following:
* create a neutron port with vnic_type direct
* change the mac address of the port in neutron
* create a server with the above port

I only see a periodic warning in the nova-compute logs, and it appears whether or not the VF is in use. It depends only on whether the VF is whitelisted.

WARNING nova.pci.utils [None req-9fbb125a-021f-4bb2-9688-f47eae5bd564 None None] No net device was found for VF 0000:81:10.2: nova.exception.PciDeviceNotFoundById: PCI device 0000:81:10.2 not found

Looking at the code [1], I guess that in my env the code bails out with the above WARNING, and therefore I never hit the self._host.device_lookup_by_name(devname) call that blows up in your env.

Based on this, I think it is OK to put a try-except around self._host.device_lookup_by_name(devname), log a WARNING there, and bail out with None.

[1] https://github.com/openstack/nova/blob/63a03d848196320912bcc70eb2a8e75425fdea84/nova/virt/libvirt/driver.py#L6910
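
The try-except suggested above could look roughly like the following. This is a minimal sketch, not Nova's actual code: `FakeHost`, `NodeDeviceNotFound` and `lookup_device_or_none` are hypothetical stand-ins (in Nova the call is made on `self._host`, and the error would come from libvirt), used so that the example runs without libvirt installed.

```python
import logging

LOG = logging.getLogger(__name__)


class NodeDeviceNotFound(Exception):
    """Hypothetical stand-in for the error libvirt raises on a failed lookup."""


class FakeHost:
    """Hypothetical stand-in for the object behind self._host."""

    def __init__(self, known_devices):
        self._known = dict(known_devices)

    def device_lookup_by_name(self, devname):
        try:
            return self._known[devname]
        except KeyError:
            raise NodeDeviceNotFound(
                "no node device with matching name '%s'" % devname)


def lookup_device_or_none(host, devname):
    """Return the node device, or log a WARNING and return None."""
    try:
        return host.device_lookup_by_name(devname)
    except NodeDeviceNotFound as exc:
        LOG.warning("Node device %s could not be looked up: %s", devname, exc)
        return None
```

With this shape, a failed lookup no longer propagates to the caller; the periodic task would simply see None for that device.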

Changed in nova:
status: New → Triaged
importance: Undecided → Low
sean mooney (sean-k-mooney) wrote :

Yes, I'm aware of this; I'm currently working on a fix.

My working theory is that there is a race condition between libvirt updating its nodedev list and the MAC address on a VF being reset.

downstream we are treating this as high/urgent as once it
I have made some of the comments public on https://bugzilla.redhat.com/show_bug.cgi?id=1847924
but this breaks cold migration, live migration and likely the ability to correctly create VMs on the affected host.

Changed in nova:
assignee: nobody → sean mooney (sean-k-mooney)
sean mooney (sean-k-mooney) wrote :

A try/catch-and-ignore is not really correct. It's true that we don't actually use the info we retrieve right now, but I have been told that libvirt does not consider the way devices are named to be a stable part of the API, so I'm working on a better short-term fix. We will need to assess where else we depend on the naming scheme and remove that dependency later.

Fix proposed to branch: master
Review: https://review.opendev.org/739017

Changed in nova:
status: Triaged → In Progress

Reviewed: https://review.opendev.org/739017
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=af80c3ffd116837b296e79858595d42c893708a6
Submitter: Zuul
Branch: master

commit af80c3ffd116837b296e79858595d42c893708a6
Author: Sean Mooney <email address hidden>
Date: Thu Jul 2 10:43:08 2020 +0000

    catch libvirt exception when nodedev not found.

    This is a minimal fix to work around instances where libvirt
    returns stale data due to internal caching. In some cases
    libvirt can return stale data via the nodedev API when the
    MAC address of an interface such as an SR-IOV virtual function
    changes, i.e. when a MAC address is reset after a VM with
    a virtual function is migrated.

    Change-Id: Ic5e60c8e28263365fad5867e483b6ad55cee7281
    Partial-Bug: #1883671

Reviewed: https://review.opendev.org/739593
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=247e33af8fa1705b037e3343ec3b72a1b520c0c5
Submitter: Zuul
Branch: stable/ussuri

commit 247e33af8fa1705b037e3343ec3b72a1b520c0c5
Author: Sean Mooney <email address hidden>
Date: Thu Jul 2 10:43:08 2020 +0000

    catch libvirt exception when nodedev not found.

    This is a minimal fix to work around instances where libvirt
    returns stale data due to internal caching. In some cases
    libvirt can return stale data via the nodedev API when the
    MAC address of an interface such as an SR-IOV virtual function
    changes, i.e. when a MAC address is reset after a VM with
    a virtual function is migrated.

    Change-Id: Ic5e60c8e28263365fad5867e483b6ad55cee7281
    Partial-Bug: #1883671
    (cherry picked from commit af80c3ffd116837b296e79858595d42c893708a6)

tags: added: in-stable-ussuri

Reviewed: https://review.opendev.org/739131
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=efc27ff84c3f38fbcbf75b0dc230963c58d093e4
Submitter: Zuul
Branch: master

commit efc27ff84c3f38fbcbf75b0dc230963c58d093e4
Author: Sean Mooney <email address hidden>
Date: Fri Jul 3 15:58:02 2020 +0000

    Lookup nic feature by PCI address

    In some environments the libvirt nodedev list can become out of sync
    with the current MAC address assigned to a netdev. As a result the
    nodedev lookup can fail. This results in an uncaught libvirt exception
    which breaks the update_available_resource function, resulting in an
    incorrect resource view in the database.

    e.g. libvirt.libvirtError: Node device not found:
    no node device with matching name 'net_enp7s0f3v1_ea_60_77_1f_21_50'

    This change removes the dependency on the nodedev name when looking up
    nic feature flags.

    Change-Id: Ibf8dca4bd57b3bddb39955b53cc03564506f5754
    Closes-Bug: #1883671
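
To illustrate why a name-based lookup is fragile, here is a small sketch. The `net_<ifname>_<mac>` pattern is inferred from the example name in the commit message above; the `pci_<address>` pattern is an assumption about how libvirt names PCI node devices; both helper functions are hypothetical and are not Nova's API.

```python
def net_nodedev_name(ifname, mac):
    """Build a net nodedev name; it embeds the MAC, so it changes with it."""
    return "net_%s_%s" % (ifname, mac.replace(":", "_"))


def pci_nodedev_name(pci_address):
    """Build a PCI nodedev name (assumed pattern); independent of the MAC."""
    return "pci_" + pci_address.replace(":", "_").replace(".", "_")


# Name libvirt may still have cached for the VF.
cached = net_nodedev_name("enp7s0f3v1", "ea:60:77:1f:21:50")

# Name computed after the MAC was changed (illustrative MAC value).
current = net_nodedev_name("enp7s0f3v1", "fa:16:3e:00:00:01")

# The two names differ, so a lookup by the new name can miss the cached entry.
names_differ = cached != current

# A key derived from the PCI address does not depend on the MAC at all.
stable = pci_nodedev_name("0000:81:10.2")
```

Keying the lookup on the PCI address sidesteps the mismatch, because the address stays the same when the MAC is reassigned.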

Changed in nova:
status: In Progress → Fix Released
sean mooney (sean-k-mooney) wrote :

Reading the NIC feature flags was introduced in Pike:
https://github.com/openstack/nova/commit/e6829f872aca03af6181557260637c8b601e476a

But this only seems to happen on modern versions of libvirt, so I am setting it as won't fix. It can be backported if someone hits the issue and cares to do so.

sean mooney (sean-k-mooney) wrote :

I have marked the affected releases. Feel free to backport this further if desired. I plan to backport the fix to Train, given that older releases don't tend to use a libvirt that is affected.

It's also a little unfair to blame libvirt here, since this is actually caused by a complex interaction between specific NIC drivers, udev and libvirt, but the point stands that it does not seem to happen on older operating system versions.

E.g. it does not happen on CentOS/RHEL 7 or 16.04, but does on 20.04 and RHEL 8.
There are a lot of things that change beyond the libvirt version, so I am just going to backport this to the known broken branches, and others can backport if they hit issues on older branches.

Reviewed: https://review.opendev.org/745116
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=2442344fa1afd8d0a4c58b4df35f1b896129a0a1
Submitter: Zuul
Branch: stable/train

commit 2442344fa1afd8d0a4c58b4df35f1b896129a0a1
Author: Sean Mooney <email address hidden>
Date: Thu Jul 2 10:43:08 2020 +0000

    catch libvirt exception when nodedev not found.

    This is a minimal fix to work around instances where libvirt
    returns stale data due to internal caching. In some cases
    libvirt can return stale data via the nodedev API when the
    MAC address of an interface such as an SR-IOV virtual function
    changes, i.e. when a MAC address is reset after a VM with
    a virtual function is migrated.

    Change-Id: Ic5e60c8e28263365fad5867e483b6ad55cee7281
    Partial-Bug: #1883671
    (cherry picked from commit af80c3ffd116837b296e79858595d42c893708a6)
    (cherry picked from commit 247e33af8fa1705b037e3343ec3b72a1b520c0c5)

tags: added: in-stable-train