sriov agent report_state is slow

Bug #1648206 reported by Kevin Benton
This bug affects 3 people
Affects               Status        Importance  Assigned to   Milestone
Ubuntu Cloud Archive  Fix Released  Undecided   Unassigned
Mitaka                Triaged       Undecided   Unassigned
Newton                Triaged       High        Unassigned
Ocata                 Fix Released  Undecided   Unassigned
neutron               Fix Released  High        Kevin Benton
neutron (Ubuntu)      Fix Released  High        Unassigned
Xenial                Confirmed     High        Unassigned
Yakkety               Triaged       High        Unassigned
Zesty                 Fix Released  High        Unassigned

Bug Description

On a system with many PFs and VFs, the SR-IOV agent logs warnings like these:

WARNING oslo.service.loopingcall [-] Function 'neutron.plugins.ml2.drivers.mech_sriov.agent.sriov_nic_agent.SriovNicSwitchAgent._report_state' run outlasted interval by 29.67 sec
WARNING oslo.service.loopingcall [-] Function 'neutron.plugins.ml2.drivers.mech_sriov.agent.sriov_nic_agent.SriovNicSwitchAgent._report_state' run outlasted interval by 45.43 sec
WARNING oslo.service.loopingcall [-] Function 'neutron.plugins.ml2.drivers.mech_sriov.agent.sriov_nic_agent.SriovNicSwitchAgent._report_state' run outlasted interval by 47.64 sec
WARNING oslo.service.loopingcall [-] Function 'neutron.plugins.ml2.drivers.mech_sriov.agent.sriov_nic_agent.SriovNicSwitchAgent._report_state' run outlasted interval by 23.89 sec
WARNING oslo.service.loopingcall [-] Function 'neutron.plugins.ml2.drivers.mech_sriov.agent.sriov_nic_agent.SriovNicSwitchAgent._report_state' run outlasted interval by 30.20 sec

Depending on the agent_down_time configuration, this can cause the Neutron server to think the agent has died.
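
To make the timing concrete, here is a rough sketch of the arithmetic, not Neutron's actual code. It assumes the documented defaults of report_interval = 30 and agent_down_time = 75 seconds (neither value is stated in this report), and it assumes the logged overshoot is added on top of the interval, so one run lasts roughly interval + overshoot. The helper names heartbeat_gap and is_agent_down are hypothetical.

```python
from datetime import datetime, timedelta

# Assumed values, not taken from this report: Neutron's documented defaults.
REPORT_INTERVAL = 30.0   # seconds between _report_state runs
AGENT_DOWN_TIME = 75.0   # seconds without a heartbeat before "down"


def heartbeat_gap(outlasted_by, report_interval=REPORT_INTERVAL):
    """Approximate duration of one _report_state run.

    The warning reports how far a run overshot its interval, so the run
    itself took about interval + overshoot seconds.
    """
    return report_interval + outlasted_by


def is_agent_down(last_heartbeat, now, agent_down_time=AGENT_DOWN_TIME):
    """Simplified liveness check (hypothetical helper, not Neutron's)."""
    return now - last_heartbeat > timedelta(seconds=agent_down_time)


# Overshoot values copied from the warnings above.
overshoots = [29.67, 45.43, 47.64, 23.89, 30.20]
start = datetime(2016, 12, 7, 12, 0, 0)  # arbitrary reference time
for overshoot in overshoots:
    gap = heartbeat_gap(overshoot)
    down = is_agent_down(start, start + timedelta(seconds=gap))
    print(f"overshoot {overshoot:5.2f}s -> gap {gap:5.2f}s -> down: {down}")
```

Under those assumed defaults, the 45.43 and 47.64 second overshoots alone would push the gap between heartbeats past agent_down_time; a deployment with a larger agent_down_time would tolerate them.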

This appears to be caused by _report_state blocking on the eswitch manager on every run to fetch the device count that it includes in the state report.

Changed in neutron:
assignee: nobody → Kevin Benton (kevinbenton)
OpenStack Infra (hudson-openstack) wrote : Fix proposed to neutron (master)

Fix proposed to branch: master
Review: https://review.openstack.org/408281

Changed in neutron:
status: New → In Progress
Changed in neutron:
importance: Undecided → High
OpenStack Infra (hudson-openstack) wrote : Fix merged to neutron (master)

Reviewed: https://review.openstack.org/408281
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=1a2a71baf3904209679fc5448814a0e7940fe44d
Submitter: Jenkins
Branch: master

commit 1a2a71baf3904209679fc5448814a0e7940fe44d
Author: Kevin Benton <email address hidden>
Date: Wed Dec 7 11:33:46 2016 -0800

    SRIOV: don't block report_state with device count

    The device count process can be quite slow on a system with
    lots of interfaces. Doing this during report_state can block
    it long enough that the agent will be reported as dead and
    bindings will fail.

    This adjusts the logic to only update the configuration during
    the normal device retrieval for the scan loop. This will leave
    the report_state loop unblocked by the operation so the agent
    doesn't get reported as dead (which blocks port binding).

    Closes-Bug: #1648206
    Change-Id: Iff45fb6617974b1eceeed238a8d9e958f874f12b
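
The approach described in the commit message above can be illustrated with a minimal sketch. This is not the merged patch: SketchAgent, scan_devices, slow_count, and the state layout are hypothetical stand-ins, and a threading lock is used only to keep the toy example safe if the two methods run on different threads.

```python
import threading
import time


class SketchAgent:
    """Hypothetical stand-in for the SR-IOV agent; not Neutron code."""

    def __init__(self, count_devices):
        self._count_devices = count_devices  # slow call, e.g. eswitch manager
        self._lock = threading.Lock()
        self.agent_state = {"configurations": {"devices": 0}}

    def scan_devices(self):
        """Scan loop: the only place that pays for the slow device count."""
        count = self._count_devices()
        with self._lock:
            self.agent_state["configurations"]["devices"] = count
        return count

    def report_state(self):
        """Heartbeat: returns a copy of the cached state, no device scan."""
        with self._lock:
            return {"configurations": dict(self.agent_state["configurations"])}


def slow_count():
    time.sleep(0.2)  # stands in for walking many PFs and VFs
    return 128


agent = SketchAgent(slow_count)

t0 = time.monotonic()
first = agent.report_state()          # fast: nothing has been counted yet
report_elapsed = time.monotonic() - t0

agent.scan_devices()                  # slow: refreshes the cached count
second = agent.report_state()         # fast: reports the refreshed count
print(first, second, f"report_state took {report_elapsed:.4f}s")
```

The trade-off in this sketch is that the reported device count can lag behind reality by up to one scan cycle, in exchange for a heartbeat that no longer waits on device enumeration.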

Changed in neutron:
status: In Progress → Fix Released
OpenStack Infra (hudson-openstack) wrote : Fix proposed to neutron (stable/newton)

Fix proposed to branch: stable/newton
Review: https://review.openstack.org/408615

OpenStack Infra (hudson-openstack) wrote : Fix proposed to neutron (stable/mitaka)

Fix proposed to branch: stable/mitaka
Review: https://review.openstack.org/408616

Assaf Muller (amuller)
tags: added: mitaka-backport-potential newton-backport-potential
OpenStack Infra (hudson-openstack) wrote : Fix merged to neutron (stable/mitaka)

Reviewed: https://review.openstack.org/408616
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=2580bcb5bcc68e59a602ab9ead8c2191c162f700
Submitter: Jenkins
Branch: stable/mitaka

commit 2580bcb5bcc68e59a602ab9ead8c2191c162f700
Author: Kevin Benton <email address hidden>
Date: Wed Dec 7 11:33:46 2016 -0800

    SRIOV: don't block report_state with device count

    The device count process can be quite slow on a system with
    lots of interfaces. Doing this during report_state can block
    it long enough that the agent will be reported as dead and
    bindings will fail.

    This adjusts the logic to only update the configuration during
    the normal device retrieval for the scan loop. This will leave
    the report_state loop unblocked by the operation so the agent
    doesn't get reported as dead (which blocks port binding).

    Closes-Bug: #1648206
    Change-Id: Iff45fb6617974b1eceeed238a8d9e958f874f12b
    (cherry picked from commit 1a2a71baf3904209679fc5448814a0e7940fe44d)

tags: added: in-stable-mitaka
OpenStack Infra (hudson-openstack) wrote : Fix merged to neutron (stable/newton)

Reviewed: https://review.openstack.org/408615
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=151af89ddc9cf36e6811776f955758957155feac
Submitter: Jenkins
Branch: stable/newton

commit 151af89ddc9cf36e6811776f955758957155feac
Author: Kevin Benton <email address hidden>
Date: Wed Dec 7 11:33:46 2016 -0800

    SRIOV: don't block report_state with device count

    The device count process can be quite slow on a system with
    lots of interfaces. Doing this during report_state can block
    it long enough that the agent will be reported as dead and
    bindings will fail.

    This adjusts the logic to only update the configuration during
    the normal device retrieval for the scan loop. This will leave
    the report_state loop unblocked by the operation so the agent
    doesn't get reported as dead (which blocks port binding).

    Closes-Bug: #1648206
    Change-Id: Iff45fb6617974b1eceeed238a8d9e958f874f12b
    (cherry picked from commit 1a2a71baf3904209679fc5448814a0e7940fe44d)

tags: added: in-stable-newton
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/neutron 10.0.0.0b2

This issue was fixed in the openstack/neutron 10.0.0.0b2 development milestone.

Xav Paice (xavpaice) wrote :

Added Ubuntu Cloud Archive to get this fix ported into the current packages - 8.3.0-0ubuntu1.1 doesn't have this patch.

Alvaro Uria (aluria)
tags: added: canonical-bootstack
Alvaro Uria (aluria)
Changed in cloud-archive:
status: New → Confirmed
Alvaro Uria (aluria) wrote :

I've updated the bug status to Confirmed for "Ubuntu Cloud Archive". The latest python-neutron package in Xenial is: https://launchpad.net/ubuntu/+source/neutron/2:8.3.0-0ubuntu1.2

Thank you.

tags: added: neutron-proactive-backport-potential
tags: removed: neutron-proactive-backport-potential
tags: removed: mitaka-backport-potential newton-backport-potential
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/neutron 9.2.0

This issue was fixed in the openstack/neutron 9.2.0 release.

OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/neutron 8.4.0

This issue was fixed in the openstack/neutron 8.4.0 release.

Changed in neutron (Ubuntu):
status: New → Fix Released
Changed in neutron (Ubuntu Yakkety):
status: New → Confirmed
Changed in neutron (Ubuntu Xenial):
status: New → Confirmed
importance: Undecided → High
Changed in neutron (Ubuntu Yakkety):
importance: Undecided → High
status: Confirmed → Triaged
Changed in neutron (Ubuntu Zesty):
importance: Undecided → High