[SRIOV] When a VF is bound to a VM, Nova can't retrieve the PCI info

Bug #1883671 reported by Rodolfo Alonso
This bug affects 1 person
Affects                   Status        Importance  Assigned to  Milestone
OpenStack Compute (nova)  Fix Released  Low         sean mooney
Pike                      Won't Fix     Undecided   Unassigned
Queens                    Won't Fix     Undecided   Unassigned
Rocky                     Won't Fix     Undecided   Unassigned
Stein                     Won't Fix     Undecided   Unassigned
Train                     In Progress   Low         sean mooney
Ussuri                    In Progress   Low         sean mooney

Bug Description

Nova periodically updates the available resources per hypervisor [1]. That implies the reporting of the PCI devices [2]->[3].

In [4], a new feature was introduced to read the NIC capabilities (gso, tso, tx, etc.) from libvirt. But when the NIC interface is bound to a VM and the MAC address is not the one assigned by the driver (Nova changes the MAC address according to the info provided by Neutron), libvirt fails to look up the device because no node device with that name exists: http://paste.openstack.org/show/794799/.

This lookup should be avoided or, at least, if it fails, the exception should be caught so that it does not propagate.

[1] https://github.com/openstack/nova/blob/master/nova/compute/manager.py#L9642
[2] https://github.com/openstack/nova/blob/master/nova/virt/libvirt/driver.py#L6980
[3] https://github.com/openstack/nova/blob/master/nova/virt/libvirt/driver.py#L6898
[4] Ia5b6abbbf4e5f762e0df04167c32c6135781d305
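
To illustrate why the lookup can fail, here is a minimal sketch. It assumes the net nodedev name follows the `net_<ifname>_<mac>` pattern seen in the libvirt errors for this bug; the helper name and the second MAC address are made up for the example, and this is not the actual Nova code.

```python
def nodedev_name(ifname: str, mac: str) -> str:
    """Build a net nodedev name of the form net_<ifname>_<mac with underscores>."""
    return "net_%s_%s" % (ifname, mac.replace(":", "_"))


# Hypothetical values: one MAC for the name libvirt currently lists, and a
# different MAC that is on the VF after it has been changed.
known_to_libvirt = {nodedev_name("enp7s0f3v1", "ea:60:77:1f:21:50")}
current_name = nodedev_name("enp7s0f3v1", "fa:16:3e:aa:bb:cc")

# The name built from the current MAC is not in libvirt's list, so a
# lookup by that name cannot succeed.
print(current_name in known_to_libvirt)
```

Because the MAC is part of the name, any mismatch between the MAC on the interface and the MAC libvirt has recorded makes the name-based lookup miss.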

tags: added: pci
tags: added: resource-tracker
Balazs Gibizer (balazs-gibizer) wrote :

I was not able to reproduce the issue locally.

I tried the following:
* create a neutron port with vnic_type direct
* change the mac address of the port in neutron
* create a server with the above port

I only see a periodic warning in the nova-compute logs, but it appears regardless of whether the VF is in use. It only depends on whether the VF is whitelisted.

WARNING nova.pci.utils [None req-9fbb125a-021f-4bb2-9688-f47eae5bd564 None None] No net device was found for VF 0000:81:10.2: nova.exception.PciDeviceNotFoundById: PCI device 0000:81:10.2 not found

Looking at the code [1], I guess that in my env the code bails out with the above WARNING and therefore I never hit the self._host.device_lookup_by_name(devname) call that blows up in your env.

Based on this I think it is OK to put a try-except around self._host.device_lookup_by_name(devname), log a WARNING there and bail out with None.

[1] https://github.com/openstack/nova/blob/63a03d848196320912bcc70eb2a8e75425fdea84/nova/virt/libvirt/driver.py#L6910
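
A rough sketch of that try-except idea, so it runs without libvirt installed. `NodeDeviceNotFound` stands in for `libvirt.libvirtError`, and `FakeHost` and `get_pcinet_info` are hypothetical names for this example only; only `device_lookup_by_name` comes from the code discussed above.

```python
import logging

LOG = logging.getLogger(__name__)


class NodeDeviceNotFound(Exception):
    """Stand-in for libvirt.libvirtError so the sketch runs without libvirt."""


class FakeHost:
    """Hypothetical host wrapper that only knows a fixed set of nodedevs."""

    def __init__(self, known):
        self._known = dict(known)

    def device_lookup_by_name(self, devname):
        try:
            return self._known[devname]
        except KeyError:
            raise NodeDeviceNotFound(
                "no node device with matching name '%s'" % devname) from None


def get_pcinet_info(host, devname):
    """Return the nodedev for devname, or None if the lookup fails."""
    try:
        return host.device_lookup_by_name(devname)
    except NodeDeviceNotFound as exc:
        # Log and bail out instead of letting the exception propagate.
        LOG.warning("Node device %s could not be looked up: %s", devname, exc)
        return None


host = FakeHost({"net_eth0_aa_bb_cc_dd_ee_ff": {"features": ["tso"]}})
print(get_pcinet_info(host, "net_eth0_aa_bb_cc_dd_ee_ff"))
print(get_pcinet_info(host, "net_eth0_11_22_33_44_55_66"))
```

With this shape the caller has to treat None as "no NIC feature info available" rather than as an error.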

Changed in nova:
status: New → Triaged
importance: Undecided → Low
sean mooney (sean-k-mooney) wrote :

Yes, I'm aware of this and I'm currently working on a fix.

My working theory is that there is a race condition between libvirt updating its nodedev list and the MAC address on a VF being reset.

Downstream we are treating this as high/urgent, because this breaks cold migration, live migration and likely the ability to correctly create VMs on the affected host.
I have made some of the comments public on https://bugzilla.redhat.com/show_bug.cgi?id=1847924

Changed in nova:
assignee: nobody → sean mooney (sean-k-mooney)
sean mooney (sean-k-mooney) wrote :

A try/except and ignore is not really correct. It's true that we don't actually use the info we retrieve right now, but I have been told that libvirt does not consider the way devices are named to be a stable part of the API, so I'm working on a better short-term fix. We will need to assess where else we depend on the naming scheme and remove that dependency later.

OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (master)

Fix proposed to branch: master
Review: https://review.opendev.org/739017

Changed in nova:
status: Triaged → In Progress
OpenStack Infra (hudson-openstack) wrote : Fix merged to nova (master)

Reviewed: https://review.opendev.org/739017
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=af80c3ffd116837b296e79858595d42c893708a6
Submitter: Zuul
Branch: master

commit af80c3ffd116837b296e79858595d42c893708a6
Author: Sean Mooney <email address hidden>
Date: Thu Jul 2 10:43:08 2020 +0000

    catch libvirt exception when nodedev not found.

    This is a minimal fix to work around instances where libvirt
    returns stale data due to internal caching. In some cases
    libvirt can return stale data via the nodedev api when the
    mac address of an interface such as an sriov virtual function
    changes, i.e. when a mac address is reset after a vm with
    a virtual function is migrated.

    Change-Id: Ic5e60c8e28263365fad5867e483b6ad55cee7281
    Partial-Bug: #1883671

OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (stable/ussuri)

Fix proposed to branch: stable/ussuri
Review: https://review.opendev.org/739593

OpenStack Infra (hudson-openstack) wrote : Fix merged to nova (stable/ussuri)

Reviewed: https://review.opendev.org/739593
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=247e33af8fa1705b037e3343ec3b72a1b520c0c5
Submitter: Zuul
Branch: stable/ussuri

commit 247e33af8fa1705b037e3343ec3b72a1b520c0c5
Author: Sean Mooney <email address hidden>
Date: Thu Jul 2 10:43:08 2020 +0000

    catch libvirt exception when nodedev not found.

    This is a minimal fix to work around instances where libvirt
    returns stale data due to internal caching. In some cases
    libvirt can return stale data via the nodedev api when the
    mac address of an interface such as an sriov virtual function
    changes, i.e. when a mac address is reset after a vm with
    a virtual function is migrated.

    Change-Id: Ic5e60c8e28263365fad5867e483b6ad55cee7281
    Partial-Bug: #1883671
    (cherry picked from commit af80c3ffd116837b296e79858595d42c893708a6)

tags: added: in-stable-ussuri
OpenStack Infra (hudson-openstack) wrote : Fix merged to nova (master)

Reviewed: https://review.opendev.org/739131
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=efc27ff84c3f38fbcbf75b0dc230963c58d093e4
Submitter: Zuul
Branch: master

commit efc27ff84c3f38fbcbf75b0dc230963c58d093e4
Author: Sean Mooney <email address hidden>
Date: Fri Jul 3 15:58:02 2020 +0000

    Lookup nic feature by PCI address

    In some environments the libvirt nodedev list can become out of sync
    with the current MAC address assigned to a netdev. As a result the
    nodedev lookup can fail. This results in an uncaught libvirt exception
    which breaks the update_available_resource function, resulting in an
    incorrect resource view in the database.

    e.g. libvirt.libvirtError: Node device not found:
    no node device with matching name 'net_enp7s0f3v1_ea_60_77_1f_21_50'

    This change removes the dependency on the nodedev name when looking up
    nic feature flags.

    Change-Id: Ibf8dca4bd57b3bddb39955b53cc03564506f5754
    Closes-Bug: #1883671
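
As a rough illustration of looking up NIC features without relying on the MAC-based name, the sketch below matches a net device to its parent PCI device instead. It assumes libvirt-style PCI nodedev names such as `pci_0000_81_10_2`; the `NodeDev` class and both helper functions are hypothetical and only model the idea, they are not the code from the commit.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class NodeDev:
    """Hypothetical, simplified view of a libvirt node device."""
    name: str
    parent: Optional[str] = None
    features: List[str] = field(default_factory=list)


def pci_nodedev_name(pci_address: str) -> str:
    """Map '0000:81:10.2' to the libvirt-style name 'pci_0000_81_10_2'."""
    return "pci_" + pci_address.replace(":", "_").replace(".", "_")


def nic_features_by_pci_address(pci_address: str,
                                net_devs: List[NodeDev]) -> Optional[List[str]]:
    """Return features of the net device whose parent is the given PCI device."""
    parent = pci_nodedev_name(pci_address)
    for dev in net_devs:
        if dev.parent == parent:
            return dev.features
    return None


# The net device name below embeds an old MAC, but the match is made on
# the parent PCI device, so the name no longer matters.
net_devs = [
    NodeDev(name="net_enp7s0f3v1_ea_60_77_1f_21_50",
            parent="pci_0000_81_10_2",
            features=["gso", "tso", "tx"]),
]
print(nic_features_by_pci_address("0000:81:10.2", net_devs))
```

The PCI address of a VF does not change when its MAC address is reset, which is why it makes a more stable lookup key than the net device name.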

Changed in nova:
status: In Progress → Fix Released
OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (stable/train)

Fix proposed to branch: stable/train
Review: https://review.opendev.org/745116

sean mooney (sean-k-mooney) wrote :

Reading the NIC feature flags was introduced in Pike:
https://github.com/openstack/nova/commit/e6829f872aca03af6181557260637c8b601e476a

But this only seems to happen on modern versions of libvirt, so I'm setting the older releases to Won't Fix. The fix can be backported if someone hits the issue and cares to do so.

sean mooney (sean-k-mooney) wrote :

I have marked the affected releases. Feel free to backport this further if desired; I plan to backport the fix to Train, given that older releases don't tend to use a libvirt that is affected.

It's also a little unfair to blame libvirt here, since this is actually caused by a complex interaction between specific NIC drivers, udev and libvirt, but the point stands that it does not seem to happen on older operating system versions.

E.g. it does not happen on CentOS/RHEL 7 or Ubuntu 16.04, but it does on Ubuntu 20.04 and RHEL 8.
A lot of things changed beyond the libvirt version, so I'm just going to backport this to the known broken branches, and others can backport it if they hit issues on older branches.

OpenStack Infra (hudson-openstack) wrote : Fix merged to nova (stable/train)

Reviewed: https://review.opendev.org/745116
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=2442344fa1afd8d0a4c58b4df35f1b896129a0a1
Submitter: Zuul
Branch: stable/train

commit 2442344fa1afd8d0a4c58b4df35f1b896129a0a1
Author: Sean Mooney <email address hidden>
Date: Thu Jul 2 10:43:08 2020 +0000

    catch libvirt exception when nodedev not found.

    This is a minimal fix to work around instances where libvirt
    returns stale data due to internal caching. In some cases
    libvirt can return stale data via the nodedev api when the
    mac address of an interface such as an sriov virtual function
    changes, i.e. when a mac address is reset after a vm with
    a virtual function is migrated.

    Change-Id: Ic5e60c8e28263365fad5867e483b6ad55cee7281
    Partial-Bug: #1883671
    (cherry picked from commit af80c3ffd116837b296e79858595d42c893708a6)
    (cherry picked from commit 247e33af8fa1705b037e3343ec3b72a1b520c0c5)

tags: added: in-stable-train
OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (stable/ussuri)

Fix proposed to branch: stable/ussuri
Review: https://review.opendev.org/c/openstack/nova/+/838042

OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (stable/train)

Fix proposed to branch: stable/train
Review: https://review.opendev.org/c/openstack/nova/+/838050

OpenStack Infra (hudson-openstack) wrote : Change abandoned on nova (stable/train)

Change abandoned by "Elod Illes <email address hidden>" on branch: stable/train
Review: https://review.opendev.org/c/openstack/nova/+/838050
Reason: stable/train branch of nova projects' have been tagged as End of Life. All open patches have to be abandoned in order to be able to delete the branch.

OpenStack Infra (hudson-openstack) wrote : Change abandoned on nova (stable/ussuri)

Change abandoned by "Elod Illes <email address hidden>" on branch: stable/ussuri
Review: https://review.opendev.org/c/openstack/nova/+/838042
Reason: stable/ussuri branch of openstack/nova transitioned to End of Life and is about to be deleted. To be able to do that, all open patches need to be abandoned.
