sriov agent report_state is slow

Bug #1648206 reported by Kevin Benton on 2016-12-07
This bug affects 3 people
Affects                 Importance   Assigned to
Ubuntu Cloud Archive    Undecided    Unassigned
  Mitaka                Undecided    Unassigned
  Newton                High         Unassigned
  Ocata                 Undecided    Unassigned
neutron                 High         Kevin Benton
neutron (Ubuntu)        High         Unassigned
  Xenial                High         Unassigned
  Yakkety               High         Unassigned
  Zesty                 High         Unassigned

Bug Description

On a system with lots of VFs and PFs we get these logs:

WARNING oslo.service.loopingcall [-] Function 'neutron.plugins.ml2.drivers.mech_sriov.agent.sriov_nic_agent.SriovNicSwitchAgent._report_state' run outlasted interval by 29.67 sec
WARNING oslo.service.loopingcall [-] Function 'neutron.plugins.ml2.drivers.mech_sriov.agent.sriov_nic_agent.SriovNicSwitchAgent._report_state' run outlasted interval by 45.43 sec
WARNING oslo.service.loopingcall [-] Function 'neutron.plugins.ml2.drivers.mech_sriov.agent.sriov_nic_agent.SriovNicSwitchAgent._report_state' run outlasted interval by 47.64 sec
WARNING oslo.service.loopingcall [-] Function 'neutron.plugins.ml2.drivers.mech_sriov.agent.sriov_nic_agent.SriovNicSwitchAgent._report_state' run outlasted interval by 23.89 sec
WARNING oslo.service.loopingcall [-] Function 'neutron.plugins.ml2.drivers.mech_sriov.agent.sriov_nic_agent.SriovNicSwitchAgent._report_state' run outlasted interval by 30.20 sec

Depending on the agent_down_time configuration, this can cause the Neutron server to think the agent has died.
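
A rough illustration of why overruns of this size matter: the warning reports how far a run exceeded its interval, so the gap between two consecutive state reports is approximately the interval plus the logged overrun. The sketch below assumes Neutron's default report_interval of 30 seconds and agent_down_time of 75 seconds; neither value is stated in this report, so check your own configuration. The helper name heartbeat_gap is hypothetical.

```python
# Assumed values: Neutron's defaults, not taken from this bug report.
REPORT_INTERVAL = 30.0   # report_interval, seconds
AGENT_DOWN_TIME = 75.0   # agent_down_time, seconds

# Overruns copied from the warnings above, in seconds.
OVERRUNS = [29.67, 45.43, 47.64, 23.89, 30.20]


def heartbeat_gap(overrun, interval=REPORT_INTERVAL):
    """Approximate time between two state reports when a run overran."""
    return interval + overrun


for overrun in OVERRUNS:
    gap = heartbeat_gap(overrun)
    status = "considered dead" if gap > AGENT_DOWN_TIME else "still alive"
    print(f"overrun {overrun:6.2f}s -> gap ~{gap:6.2f}s -> {status}")
```

Under these assumed settings, the longer overruns in the log push the gap past agent_down_time, while the shorter ones do not.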

This appears to be caused by _report_state blocking on the eswitch manager on every run to get a device count to include in the state report.

Changed in neutron:
assignee: nobody → Kevin Benton (kevinbenton)

Fix proposed to branch: master
Review: https://review.openstack.org/408281

Changed in neutron:
status: New → In Progress
Changed in neutron:
importance: Undecided → High

Reviewed: https://review.openstack.org/408281
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=1a2a71baf3904209679fc5448814a0e7940fe44d
Submitter: Jenkins
Branch: master

commit 1a2a71baf3904209679fc5448814a0e7940fe44d
Author: Kevin Benton <email address hidden>
Date: Wed Dec 7 11:33:46 2016 -0800

    SRIOV: don't block report_state with device count

    The device count process can be quite slow on a system with
    lots of interfaces. Doing this during report_state can block
    it long enough that the agent will be reported as dead and
    bindings will fail.

    This adjusts the logic to only update the configuration during
    the normal device retrieval for the scan loop. This will leave
    the report_state loop unblocked by the operation so the agent
    doesn't get reported as dead (which blocks port binding).

    Closes-Bug: #1648206
    Change-Id: Iff45fb6617974b1eceeed238a8d9e958f874f12b
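
The approach described in the commit message can be sketched as follows. This is a minimal illustration of the pattern only, not the actual Neutron change: the class and function names are hypothetical, and it uses plain threading for self-containment, whereas the real agent runs on oslo.service looping calls.

```python
import threading
import time


class AgentSketch:
    """Toy agent; names are illustrative, not Neutron's real classes."""

    def __init__(self, count_devices):
        self._count_devices = count_devices  # potentially slow callable
        self._lock = threading.Lock()
        self.agent_state = {"configurations": {"devices": 0}}

    def scan_devices(self):
        """Scan loop: the only place the slow count is performed."""
        count = self._count_devices()
        with self._lock:
            self.agent_state["configurations"]["devices"] = count

    def report_state(self):
        """Heartbeat: returns the cached state without touching devices."""
        with self._lock:
            return {"configurations": dict(self.agent_state["configurations"])}


def slow_count():
    time.sleep(0.2)  # stand-in for a slow walk over many PFs and VFs
    return 128


agent = AgentSketch(slow_count)
agent.scan_devices()  # pays the cost of the slow count once per scan

start = time.monotonic()
report = agent.report_state()  # reads the cached value only
elapsed = time.monotonic() - start
print(report, f"report_state took {elapsed:.4f}s")
```

The trade-off is that the reported device count can lag behind reality by up to one scan interval, in exchange for a heartbeat that never waits on device enumeration.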

Changed in neutron:
status: In Progress → Fix Released
Assaf Muller (amuller) on 2016-12-08
tags: added: mitaka-backport-potential newton-backport-potential

Reviewed: https://review.openstack.org/408616
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=2580bcb5bcc68e59a602ab9ead8c2191c162f700
Submitter: Jenkins
Branch: stable/mitaka

commit 2580bcb5bcc68e59a602ab9ead8c2191c162f700
Author: Kevin Benton <email address hidden>
Date: Wed Dec 7 11:33:46 2016 -0800

    SRIOV: don't block report_state with device count

    The device count process can be quite slow on a system with
    lots of interfaces. Doing this during report_state can block
    it long enough that the agent will be reported as dead and
    bindings will fail.

    This adjusts the logic to only update the configuration during
    the normal device retrieval for the scan loop. This will leave
    the report_state loop unblocked by the operation so the agent
    doesn't get reported as dead (which blocks port binding).

    Closes-Bug: #1648206
    Change-Id: Iff45fb6617974b1eceeed238a8d9e958f874f12b
    (cherry picked from commit 1a2a71baf3904209679fc5448814a0e7940fe44d)

tags: added: in-stable-mitaka

Reviewed: https://review.openstack.org/408615
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=151af89ddc9cf36e6811776f955758957155feac
Submitter: Jenkins
Branch: stable/newton

commit 151af89ddc9cf36e6811776f955758957155feac
Author: Kevin Benton <email address hidden>
Date: Wed Dec 7 11:33:46 2016 -0800

    SRIOV: don't block report_state with device count

    The device count process can be quite slow on a system with
    lots of interfaces. Doing this during report_state can block
    it long enough that the agent will be reported as dead and
    bindings will fail.

    This adjusts the logic to only update the configuration during
    the normal device retrieval for the scan loop. This will leave
    the report_state loop unblocked by the operation so the agent
    doesn't get reported as dead (which blocks port binding).

    Closes-Bug: #1648206
    Change-Id: Iff45fb6617974b1eceeed238a8d9e958f874f12b
    (cherry picked from commit 1a2a71baf3904209679fc5448814a0e7940fe44d)

tags: added: in-stable-newton

This issue was fixed in the openstack/neutron 10.0.0.0b2 development milestone.

Xav Paice (xavpaice) wrote :

Added Ubuntu Cloud Archive to get this fix ported into the current packages - 8.3.0-0ubuntu1.1 doesn't have this patch.

Alvaro Uria (aluria) on 2017-01-10
tags: added: canonical-bootstack
Changed in cloud-archive:
status: New → Confirmed
Alvaro Uria (aluria) wrote :

I've updated the bug status to Confirmed for "Ubuntu Cloud Archive". The latest python-neutron package in Xenial is: https://launchpad.net/ubuntu/+source/neutron/2:8.3.0-0ubuntu1.2

Thank you.

tags: added: neutron-proactive-backport-potential
tags: removed: neutron-proactive-backport-potential
tags: removed: mitaka-backport-potential newton-backport-potential

This issue was fixed in the openstack/neutron 9.2.0 release.

This issue was fixed in the openstack/neutron 8.4.0 release.

Changed in neutron (Ubuntu):
status: New → Fix Released
Changed in neutron (Ubuntu Yakkety):
status: New → Confirmed
Changed in neutron (Ubuntu Xenial):
status: New → Confirmed
importance: Undecided → High
Changed in neutron (Ubuntu Yakkety):
importance: Undecided → High
status: Confirmed → Triaged
Changed in neutron (Ubuntu Zesty):
importance: Undecided → High