SRIOV agent error when VM booted with direct-physical port

Bug #1616442 reported by Eran Kuris
This bug affects 3 people
Affects: neutron
Status: Fix Released
Importance: High
Assigned to: Brent Eagles

Bug Description

When a neutron port is assigned to a PF (neutron port type direct-physical),
the VM boots and becomes active, but there are errors in the SR-IOV agent log.
The errors are in the attached file.

Version-Release number of selected component (if applicable):
RHOS-10
[root@controller1 ~(keystone_admin)]# rpm -qa |grep neutron
python-neutron-lib-0.3.0-0.20160803002107.405f896.el7ost.noarch
openstack-neutron-9.0.0-0.20160817153328.b9169e3.el7ost.noarch
puppet-neutron-9.1.0-0.20160813031056.7cf5e07.el7ost.noarch
python-neutron-9.0.0-0.20160817153328.b9169e3.el7ost.noarch
openstack-neutron-lbaas-9.0.0-0.20160816191643.4e7301e.el7ost.noarch
python-neutron-fwaas-9.0.0-0.20160817171450.e1ac68f.el7ost.noarch
python-neutron-lbaas-9.0.0-0.20160816191643.4e7301e.el7ost.noarch
openstack-neutron-ml2-9.0.0-0.20160817153328.b9169e3.el7ost.noarch
openstack-neutron-metering-agent-9.0.0-0.20160817153328.b9169e3.el7ost.noarch
openstack-neutron-openvswitch-9.0.0-0.20160817153328.b9169e3.el7ost.noarch
python-neutronclient-5.0.0-0.20160812094704.ec20f7f.el7ost.noarch
openstack-neutron-common-9.0.0-0.20160817153328.b9169e3.el7ost.noarch
openstack-neutron-fwaas-9.0.0-0.20160817171450.e1ac68f.el7ost.noarch
[root@controller1 ~(keystone_admin)]# rpm -qa |grep nova
python-novaclient-5.0.1-0.20160724130722.6b11a1c.el7ost.noarch
openstack-nova-api-14.0.0-0.20160817225441.04cef3b.el7ost.noarch
puppet-nova-9.1.0-0.20160813014843.b94f0a0.el7ost.noarch
openstack-nova-common-14.0.0-0.20160817225441.04cef3b.el7ost.noarch
openstack-nova-novncproxy-14.0.0-0.20160817225441.04cef3b.el7ost.noarch
openstack-nova-conductor-14.0.0-0.20160817225441.04cef3b.el7ost.noarch
python-nova-14.0.0-0.20160817225441.04cef3b.el7ost.noarch
openstack-nova-scheduler-14.0.0-0.20160817225441.04cef3b.el7ost.noarch
openstack-nova-cert-14.0.0-0.20160817225441.04cef3b.el7ost.noarch
openstack-nova-console-14.0.0-0.20160817225441.04cef3b.el7ost.noarch

How reproducible:

Steps to Reproduce:
1. Deploy an SR-IOV setup and enable PF functionality. You can use this guide:
https://docs.google.com/document/d/1qQbJlLI1hSlE4uwKpmVd0BoGSDBd8Z0lTzx5itQ6WL0/edit#
2. Boot a VM and assign it to a PF.
3. Check the SR-IOV agent log on the compute node.

Eran Kuris (ekuris) wrote :
Assaf Muller (amuller) wrote :

Is there an impact?

Eran Kuris (ekuris) wrote :

It doesn't look like it.

Assaf Muller (amuller)
Changed in neutron:
importance: Undecided → Low
Moshe Levi (moshele) wrote :

Can you give me access to
https://docs.google.com/document/d/1qQbJlLI1hSlE4uwKpmVd0BoGSDBd8Z0lTzx5itQ6WL0/edit# ?
Also, can you post the sriov-agent config and the ip link show output?
Thanks.

Moshe Levi (moshele) wrote :

I think the problem is that you put the PF that you pass through in sriov-agent.conf.
Once you pass it through, it disappears from the hypervisor.

Remember my comment here: https://bugs.launchpad.net/neutron/+bug/1614086.
Now I am not sure whether we want to fix the "Allow SR-IOV agent to start when number of vf is 0" issue.

Brent Eagles (beagles) wrote :

Say you have an SR-IOV card that is being used for both physical function and virtual function allocation. While nothing is allocated (PF or VF), or while only a VF is in use, all should be well with the agent. While it doesn't care about PFs, it does care about the VFs. When an allocation of the PF succeeds (i.e. the VM consumes it), won't the VFs also disappear? If so, can the agent handle it?

I agree that the number of VFs being 0 is less of an issue, really: either the device isn't there (the PF is allocated), it has VFs, or it isn't supposed to have VFs, and the agent shouldn't "care" about it since the agent isn't responsible for PFs.

tags: added: sriov-pci-pt
Moshe Levi (moshele) wrote :

A VF interface also disappears when you pass through, but you can still control it with the ip link command when you know the PF (the command looks like ip link set <PF Interface> vf <num VF>).
When you pass through the PF, you can't control it from the hypervisor.
I think that when you deploy an OpenStack cloud, you know whether you want to do PF or VF passthrough before you set up the cloud. So the agent should currently be configured only if you are using VFs.

Are you saying that you sometimes want to pass through the PF and sometimes pass through VFs of the same PF interface?
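
The distinction above (a PF that is passed through is no longer visible on the hypervisor, while a PF that is still present exposes its VFs) can be checked from sysfs. The following is a minimal sketch, not code from the SR-IOV agent. It assumes the usual Linux layout in which a PF netdev appears under /sys/class/net/<name>/device and each VF appears there as a virtfnN entry; the function name pf_state and the sysfs_root parameter are hypothetical and exist only for this illustration.

```python
from pathlib import Path


def pf_state(pf_name, sysfs_root="/sys/class/net"):
    """Report whether a PF netdev is visible on the host and how many VFs it has.

    Hypothetical helper for illustration; returns a dict with
    'present' (bool) and 'num_vfs' (int).
    """
    dev_dir = Path(sysfs_root) / pf_name / "device"
    if not dev_dir.is_dir():
        # The netdev is gone, e.g. because the PF was passed through to a guest.
        return {"present": False, "num_vfs": 0}
    # Each VF shows up as a virtfnN entry under the PF's device directory.
    num_vfs = sum(1 for _ in dev_dir.glob("virtfn*"))
    return {"present": True, "num_vfs": num_vfs}
```

Calling pf_state("em1") on a compute node would report present=False once em1 has been handed to a VM, which is the situation discussed in this thread.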

Eran Kuris (ekuris) wrote :

Yes this is the use case.

Moshe Levi (moshele) wrote :

OK,
so I was planning to deprecate the "physical_device_mappings" option (https://review.openstack.org/#/c/360447/, the patch is still WIP) in favor of dynamically monitoring the VFs according to the direct port's PCI slot.
So when there are no VFs on a PF, the agent will stop monitoring it.

Maybe we should talk at the summit about all the SR-IOV PF/VF issues you are raising.

Assaf Muller (amuller) wrote :

Ricardo Noriega says:

"I'm having the same stacktrace from the SRIOV agent, but I think there is indeed an issue.

Let's say a compute node has two network adapters (em1 and em2) with their corresponding VFs configured. If you start a VM with a direct-physical binding, it will take one of these NICs. At that moment, the SRIOV agent starts to show those ERROR messages, with the "device dictionary" completely empty.

As a consequence, you cannot allocate VMs with VFs even though there is still another NIC available."

I think the use case is interesting and that the issue is not only cosmetic. I'm upping the priority of this bug.
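
The failure mode Ricardo describes (one NIC taken by a direct-physical port, after which the agent's device dictionary is empty for every NIC) can be modelled with a toy example. This is a sketch of one plausible way such an outcome arises and of a per-device alternative; it is not the neutron agent's code and not the merged fix. All names here (DeviceNotFound, scan_pf, discover_strict, discover_tolerant) and the PCI addresses are hypothetical.

```python
class DeviceNotFound(Exception):
    """Raised when a configured PF is no longer visible on the host."""


def scan_pf(pf_name, host_devices):
    """Return the VF list for one PF, or raise if the PF is gone."""
    if pf_name not in host_devices:
        raise DeviceNotFound(pf_name)
    return host_devices[pf_name]


def discover_strict(mappings, host_devices):
    """All-or-nothing discovery: one missing PF empties the whole result."""
    try:
        return {pf: scan_pf(pf, host_devices) for pf in mappings}
    except DeviceNotFound:
        return {}


def discover_tolerant(mappings, host_devices):
    """Per-device discovery: a missing PF is skipped, the rest stay usable."""
    found = {}
    for pf in mappings:
        try:
            found[pf] = scan_pf(pf, host_devices)
        except DeviceNotFound:
            continue  # PF was passed through; ignore it in this scan
    return found


# em1 has been passed through to a VM, so only em2 is visible on the host.
mappings = ["em1", "em2"]
host = {"em2": ["0000:05:10.0", "0000:05:10.2"]}
strict_result = discover_strict(mappings, host)
tolerant_result = discover_tolerant(mappings, host)
```

With em1 gone, strict_result is empty, which matches the reported symptom, while tolerant_result still contains em2 and its VFs.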

Changed in neutron:
importance: Low → Medium
status: New → Confirmed
MANJUNATH PATIL (mpatil)
Changed in neutron:
assignee: nobody → MANJUNATH PATIL (mpatil)
Brent Eagles (beagles) wrote :

@mpatil: in https://review.openstack.org/#/c/377781/ you indicate that Moshe's WIP patch resolves this issue, is this correct?

Changed in neutron:
status: Confirmed → In Progress
Changed in neutron:
assignee: MANJUNATH PATIL (mpatil) → edan david (edand)
Changed in neutron:
assignee: edan david (edand) → Brent Eagles (beagles)
Changed in neutron:
assignee: Brent Eagles (beagles) → edan david (edand)
OpenStack Infra (hudson-openstack) wrote : Fix proposed to neutron (master)

Fix proposed to branch: master
Review: https://review.openstack.org/395045

OpenStack Infra (hudson-openstack) wrote : Change abandoned on neutron (master)

Change abandoned by edan david (<email address hidden>) on branch: master
Review: https://review.openstack.org/360447
Reason: Moved to: https://review.openstack.org/#/c/395044/

Changed in neutron:
assignee: edan david (edand) → Moshe Levi (moshele)
Brent Eagles (beagles) wrote :

This is the content of the aforementioned Google doc that was referenced in this BZ in error. Sorry, I should've rectified it sooner. I was reminded by yet another request for access. While it has been edited some, it was only meant to reflect general steps. Accuracy/correctness is not guaranteed.

Configuration
-------------

This describes a general configuration of OpenStack services to support PCI SR-IOV. It currently does not extend to node preparation, creation of neutron networks, etc. Services that have had their configuration modified will need to be restarted.

Controller Node
---------------

Nova
- Ensure that the PciPassthroughFilter is listed in scheduler_default_filters in nova.conf.

Neutron
- Add sriovnicswitch to the list of mechanism_drivers in the ML2 configuration (e.g. /etc/neutron/plugins/ml2/ml2_conf.ini).
- Add a vendor device entry to [ml2_sriov]/supported_pci_vendor_devs in the ML2 configuration, e.g.: supported_pci_vendor_devs = 8086:154d, 8086:10ed
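
The two ML2 settings above can be sanity-checked with a short script. This is a minimal sketch using Python's standard configparser rather than oslo.config, so it may not handle every construct found in a real ml2_conf.ini. The sample text is an assumption: the openvswitch entry is there only to show a multi-driver list, and the function name sriov_settings is hypothetical.

```python
import configparser

# Assumed sample content; only the sriovnicswitch driver and the
# supported_pci_vendor_devs values come from the steps above.
SAMPLE_ML2_CONF = """
[ml2]
mechanism_drivers = openvswitch,sriovnicswitch

[ml2_sriov]
supported_pci_vendor_devs = 8086:154d, 8086:10ed
"""


def sriov_settings(conf_text):
    """Extract the SR-IOV related ML2 settings from ini-style text."""
    parser = configparser.ConfigParser()
    parser.read_string(conf_text)
    drivers = [d.strip() for d in parser.get("ml2", "mechanism_drivers").split(",")]
    devs = [
        d.strip()
        for d in parser.get("ml2_sriov", "supported_pci_vendor_devs").split(",")
    ]
    return {"sriov_enabled": "sriovnicswitch" in drivers, "vendor_devs": devs}


settings = sriov_settings(SAMPLE_ML2_CONF)
```

Here settings reports whether sriovnicswitch is among the mechanism drivers and lists the vendor:product pairs as separate strings.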

Compute Node

Nova
- For the pci_passthrough_whitelist, use the form [{"vendor_id":"vendor_id_value", "product_id":"product_id_value", "physical_network":"physical_network_label"}]. Note that the product IDs for physical functions and virtual functions are different. If you wish to configure support for both on a node, you will need two separate entries.
For example, in /etc/nova/nova.conf:
pci_passthrough_whitelist = [{"vendor_id":"8086", "product_id":"154d", "physical_network":"physnet"}, {"vendor_id":"8086", "product_id":"10ed", "physical_network":"physnet"} ]

Clarifications and known issues:
Some documentation states that you need to specify device_type="type-PF". This is only required if you use the "alias" method for device allocation.
While using devname seems to work fine for virtual functions, it does not seem to work for physical functions.
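
Since the whitelist value is JSON, the "two separate entries" rule can be checked mechanically. The sketch below parses the value from the nova.conf example above and reports whether both a PF and a VF product ID are listed. Which of the two IDs is the PF and which is the VF is not stated in this document, so the caller passes them in; treating 154d as the PF and 10ed as the VF in the usage line is an assumption. The function name check_whitelist is hypothetical.

```python
import json

# Value copied from the nova.conf example above.
WHITELIST = (
    '[{"vendor_id":"8086", "product_id":"154d", "physical_network":"physnet"}, '
    '{"vendor_id":"8086", "product_id":"10ed", "physical_network":"physnet"} ]'
)

REQUIRED_KEYS = {"vendor_id", "product_id", "physical_network"}


def check_whitelist(raw, pf_product_id, vf_product_id):
    """Parse the whitelist JSON and report which function types it covers."""
    entries = json.loads(raw)
    for entry in entries:
        missing = REQUIRED_KEYS - entry.keys()
        if missing:
            raise ValueError(f"entry {entry} is missing {sorted(missing)}")
    product_ids = {entry["product_id"] for entry in entries}
    return {
        "pf_listed": pf_product_id in product_ids,
        "vf_listed": vf_product_id in product_ids,
    }


# Assumption: 154d is the PF product ID and 10ed is the VF product ID.
coverage = check_whitelist(WHITELIST, pf_product_id="154d", vf_product_id="10ed")
```

Note that json.loads only accepts straight double quotes, so a value pasted with typographic quotes will fail to parse.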

Neutron
   For PFs, the SRIOV agent does not need any additional configuration.

(The original document erroneously describes configuring the SR-IOV agent with the physical function. This is bogus, since the agent isn't meant to manage physical functions. There is an issue, though, if a physical function is allocated, as the virtual functions that the agent *does* know about disappear.)

Changed in neutron:
assignee: Moshe Levi (moshele) → Brent Eagles (beagles)
OpenStack Infra (hudson-openstack) wrote :

Change abandoned by Armando Migliaccio (<email address hidden>) on branch: master
Review: https://review.openstack.org/395045
Reason: This review is > 4 weeks without comment, and failed Jenkins the last time it was checked. We are abandoning this for now. Feel free to reactivate the review by pressing the restore button and leaving a 'recheck' comment to get fresh test results.

Armando Migliaccio (armando-migliaccio) wrote :

Fix needs resurrection.

Changed in neutron:
status: In Progress → Confirmed
assignee: Brent Eagles (beagles) → nobody
status: Confirmed → Incomplete
Armando Migliaccio (armando-migliaccio) wrote :
Changed in neutron:
milestone: none → ocata-rc1
tags: added: ocata-rc-potential
Changed in neutron:
status: Incomplete → In Progress
assignee: nobody → Brent Eagles (beagles)
OpenStack Infra (hudson-openstack) wrote : Fix merged to neutron (master)

Reviewed: https://review.openstack.org/377781
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=1bcdc299ba8ffbf778fb1442cd8f9da59903ffdc
Submitter: Jenkins
Branch: master

commit 1bcdc299ba8ffbf778fb1442cd8f9da59903ffdc
Author: Manjunath Patil <email address hidden>
Date: Tue Sep 27 20:12:39 2016 +0530

    Allow the other nic to allocate VMs post PCI-PT VM creation.

    Let's say a compute node has got 2 network adapters
    (em1 and em2) with its correspondent VFs configured.
    If you start a VM with a direct-physical binding
    it will take one of these NICs.

    At that moment, SRIOV agent starts to show
    ERROR messages including the "device dictionary"
    completely empty.

    In consequence, you cannot allocate VMs with VFs
    even though there is still another NIC available.

    Change-Id: I8bf0dd41f900b69e32fcd416690c089dde7989b9
    Closes-Bug: #1616442

Changed in neutron:
status: In Progress → Fix Released
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/neutron 10.0.0.0rc1

This issue was fixed in the openstack/neutron 10.0.0.0rc1 release candidate.

OpenStack Infra (hudson-openstack) wrote : Fix proposed to neutron (stable/newton)

Fix proposed to branch: stable/newton
Review: https://review.openstack.org/442088

Ihar Hrachyshka (ihar-hrachyshka) wrote :

I believe the following problem statement justifies bumping importance here:

"In consequence, you cannot allocate VMs with VFs even though there is still another NIC available." That seems like a big deal for SR-IOV enabled environments.

Changed in neutron:
importance: Medium → High
OpenStack Infra (hudson-openstack) wrote : Fix merged to neutron (stable/newton)

Reviewed: https://review.openstack.org/442088
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=ea95bf8af89d384783b6a57ec4bd6ddb4ad52716
Submitter: Jenkins
Branch: stable/newton

commit ea95bf8af89d384783b6a57ec4bd6ddb4ad52716
Author: Manjunath Patil <email address hidden>
Date: Tue Sep 27 20:12:39 2016 +0530

    Allow the other nic to allocate VMs post PCI-PT VM creation.

    Let's say a compute node has got 2 network adapters
    (em1 and em2) with its correspondent VFs configured.
    If you start a VM with a direct-physical binding
    it will take one of these NICs.

    At that moment, SRIOV agent starts to show
    ERROR messages including the "device dictionary"
    completely empty.

    In consequence, you cannot allocate VMs with VFs
    even though there is still another NIC available.

    Conflicts:
        neutron/tests/unit/plugins/ml2/drivers/mech_sriov/agent/test_eswitch_manager.py

    Change-Id: I8bf0dd41f900b69e32fcd416690c089dde7989b9
    Closes-Bug: #1616442
    (cherry picked from commit 1bcdc299ba8ffbf778fb1442cd8f9da59903ffdc)

tags: added: in-stable-newton
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/neutron 9.3.0

This issue was fixed in the openstack/neutron 9.3.0 release.
