[RFE] set inactivity_probe and max_backoff for OVS bridge controller

Bug #1817022 reported by s10 on 2019-02-21
This bug affects 5 people
Affects: neutron
Importance: Wishlist
Assigned to: Darragh O'Reilly

Bug Description

It would be useful to have the option to specify inactivity_probe and max_backoff for OVS bridge controllers in neutron config.

OVS documentation says (https://github.com/openvswitch/ovs/blob/master/ovn/TODO.rst):
The default 5 seconds inactivity_probe value is not sufficient and ovsdb-server drops the client IDL connections for openstack deployments when the neutron server is heavily loaded.

This can indeed happen when neutron-ovs-agent is under heavy load. This was discussed in http://eavesdrop.openstack.org/irclogs/%23openstack-neutron/%23openstack-neutron.2017-01-27.log.html#t2017-01-27T02:46:22 , and the solution was to increase inactivity_probe.

The alternative is to set these settings manually after each neutron-ovs-agent restart:
ovs-vsctl set Controller br-tun inactivity_probe=30000
ovs-vsctl set Controller br-int inactivity_probe=30000
ovs-vsctl set Controller br-ex inactivity_probe=30000
ovs-vsctl set Controller br-tun max_backoff=5000
ovs-vsctl set Controller br-int max_backoff=5000
ovs-vsctl set Controller br-ex max_backoff=5000
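
The six commands above follow one pattern, so they can be generated instead of typed by hand. Below is a minimal sketch in Python, assuming the bridge names and millisecond values from this report. The function names `build_controller_commands` and `apply_controller_settings` are hypothetical helpers, not part of neutron or OVS. By default the sketch only prints the commands; it calls `ovs-vsctl` only when `dry_run=False`.

```python
import subprocess

# Bridge names and millisecond values are taken from the commands in this report.
BRIDGES = ("br-tun", "br-int", "br-ex")
SETTINGS = {"inactivity_probe": 30000, "max_backoff": 5000}


def build_controller_commands(bridges=BRIDGES, settings=SETTINGS):
    """Return one `ovs-vsctl set Controller` argv list per (setting, bridge)."""
    commands = []
    for column, value_ms in settings.items():
        for bridge in bridges:
            commands.append(
                ["ovs-vsctl", "set", "Controller", bridge, f"{column}={value_ms}"]
            )
    return commands


def apply_controller_settings(dry_run=True):
    """Print each command, and execute it only when dry_run is False."""
    for argv in build_controller_commands():
        print(" ".join(argv))
        if not dry_run:
            # check=True raises CalledProcessError if ovs-vsctl fails.
            subprocess.run(argv, check=True)
```

Calling `apply_controller_settings()` prints the same six lines as above, in the same order. Because the agent resets the controllers on restart, such a script would still have to be rerun after each restart, which is the motivation for this RFE.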

s10 (vlad-esten) on 2019-03-05
summary: - RFE: set inactivity_probe and max_backoff for OVS bridge controller
+ [RFE] set inactivity_probe and max_backoff for OVS bridge controller
Changed in neutron:
assignee: nobody → Darragh O'Reilly (darragh-oreilly)
status: New → In Progress
Brian Haley (brian-haley) wrote :

Let's talk about this at the drivers meeting, but from the logs you linked it seems valid.

tags: added: rfe-triaged
Miguel Lavalle (minsel) on 2019-03-08
Changed in neutron:
importance: Undecided → Wishlist

The bug description mixes two distinct connection types:
1. the manager connection for ovsdb: ovs-agent<->ovsdb-server:6640.
2. the per bridge openflow controller connection: ovs-vswitchd<->ovs-agent:6633

The inactivity_probe for the first should already be configurable with this patch:
https://git.openstack.org/cgit/openstack/neutron/commit/?id=1698bee770b84a2663ba940a6ded5d4b9733101a

For reference, http://www.openvswitch.org/support/dist-docs/ovs-vswitchd.conf.db.5.html
Controller TABLE
...
     Controller Failure Detection and Handling:

       max_backoff: optional integer, at least 1,000
              Maximum number of milliseconds to wait between connection
              attempts. Default is implementation-specific.

       inactivity_probe: optional integer
              Maximum number of milliseconds of idle time on connection to
              controller before sending an inactivity probe message. If Open
              vSwitch does not communicate with the controller for the
              specified number of seconds, it will send a probe. If a response
              is not received for the same additional amount of time, Open
              vSwitch assumes the connection has been broken and attempts to
              reconnect. Default is implementation-specific. A value of 0
              disables inactivity probes.

When they are not set, `ovs-vsctl --columns=_uuid,inactivity_probe,max_backoff list controller` shows them as '[]', and OVS defaults to 5000ms for inactivity_probe, and 8000ms for max_backoff.
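
To make the "unset means default" rule concrete, here is a minimal sketch in Python that reads the output of the `ovs-vsctl ... list controller` command above and reports the effective values. It assumes the output consists of "column : value" lines with a blank line between records. The sample text and its UUIDs are made-up placeholders, and the helper names are hypothetical. The defaults are the 5000 ms and 8000 ms figures quoted in this comment.

```python
# Defaults quoted in the comment above, used when a column is unset ('[]').
OVS_DEFAULTS_MS = {"inactivity_probe": 5000, "max_backoff": 8000}


def parse_controller_list(text):
    """Parse `ovs-vsctl list controller` style output into a list of dicts.

    Assumes "column : value" lines with a blank line between records.
    """
    records, current = [], {}
    for line in text.splitlines():
        if not line.strip():
            # A blank line closes the current record, if any.
            if current:
                records.append(current)
                current = {}
            continue
        column, _, value = line.partition(":")
        current[column.strip()] = value.strip()
    if current:
        records.append(current)
    return records


def effective_ms(record, column):
    """Return the configured value in ms, or the OVS default if unset."""
    raw = record.get(column, "[]")
    return OVS_DEFAULTS_MS[column] if raw == "[]" else int(raw)


# Made-up sample output; the UUIDs are placeholders, not real records.
SAMPLE = """\
_uuid               : 00000000-0000-0000-0000-000000000001
inactivity_probe    : []
max_backoff         : []

_uuid               : 00000000-0000-0000-0000-000000000002
inactivity_probe    : 30000
max_backoff         : 5000
"""

for rec in parse_controller_list(SAMPLE):
    print(rec["_uuid"],
          effective_ms(rec, "inactivity_probe"),
          effective_ms(rec, "max_backoff"))
```

The first sample record has both columns unset and therefore resolves to the defaults, while the second resolves to the explicitly configured values.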

s10 (vlad-esten) wrote :

RFE is related to the second connection type (the per bridge openflow controller connection: ovs-vswitchd<->ovs-agent:6633), which can't be configured by ovs-agent now, and the errors under heavy load are:

2018-04-12T10:57:00.934Z|276022|rconn|ERR|br-tun<->tcp:127.0.0.1:6633: no response to inactivity probe after 5 seconds, disconnecting
2018-04-12T10:57:00.934Z|276023|rconn|ERR|br-ex<->tcp:127.0.0.1:6633: no response to inactivity probe after 5 seconds, disconnecting
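
To check how often each bridge hits this timeout, the log can be scanned for these lines. The following is a minimal sketch in Python, assuming the line layout of the two messages above. The regular expression and the helper name `count_probe_timeouts` are illustrative, not part of OVS or neutron.

```python
import re
from collections import Counter

# Matches the rconn "no response to inactivity probe" lines quoted above.
PROBE_TIMEOUT = re.compile(
    r"^(?P<timestamp>[^|]+)\|(?P<seq>\d+)\|rconn\|ERR\|"
    r"(?P<bridge>[^<]+)<->(?P<target>\S+): "
    r"no response to inactivity probe after (?P<seconds>\d+) seconds, "
    r"disconnecting$"
)


def count_probe_timeouts(lines):
    """Count inactivity-probe disconnects per (bridge, timeout in seconds)."""
    counts = Counter()
    for line in lines:
        match = PROBE_TIMEOUT.match(line.strip())
        if match:
            counts[(match["bridge"], int(match["seconds"]))] += 1
    return counts


# The two log lines from this comment.
LOG_LINES = [
    "2018-04-12T10:57:00.934Z|276022|rconn|ERR|br-tun<->tcp:127.0.0.1:6633: "
    "no response to inactivity probe after 5 seconds, disconnecting",
    "2018-04-12T10:57:00.934Z|276023|rconn|ERR|br-ex<->tcp:127.0.0.1:6633: "
    "no response to inactivity probe after 5 seconds, disconnecting",
]

for (bridge, seconds), count in count_probe_timeouts(LOG_LINES).items():
    print(f"{bridge}: {count} disconnect(s) after {seconds}s probe timeout")
```

The timeout in the message is the effective inactivity_probe, so a count keyed on it also shows whether a changed setting has taken effect.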

Miguel Lavalle (minsel) wrote :

During the drivers meeting discussion, it was suggested that this script can be used to simulate load: http://paste.openstack.org/show/745685/

Did some more scale testing and found increasing inactivity_probe can stop InvalidDatapath errors.

Using OVS firewall driver.
Add 100 rules to default security group.
Add 100 ports to host, one at a time slowly.
This results in about 24.5K flows in br-int. No problems so far.

Stop ovs-agent. Wait a couple of min.
Start ovs-agent.
Get InvalidDatapath errors and sync never completes. CPU very high.

Stop ovs-agent.
Set new of_inactivity_probe=10
Start ovs-agent
No errors and syncs after a couple of min. CPU drops to very low.

Stop ovs-agent
Set debug=False and of_inactivity_probe back to default 5.
Start ovs-agent. No problems.

Rebuilt OVS to allow inactivity_probe < 5 sec.
Set of_inactivity_probe=1 and debug=False
Start ovs-agent
Get InvalidDatapath errors and sync never completes. CPU very high.

2019-03-11T16:35:25.258Z|25948|rconn|ERR|br-int<->tcp:127.0.0.1:6633: no response to inactivity probe after 1 seconds, disconnecting
2019-03-11T16:35:25.258Z|25949|rconn|ERR|br-ex<->tcp:127.0.0.1:6633: no response to inactivity probe after 1 seconds, disconnecting
2019-03-11T16:35:25.258Z|25950|rconn|ERR|br-tun<->tcp:127.0.0.1:6633: no response to inactivity probe after 1 seconds, disconnecting

Mar 11 16:35:23 ubuntu neutron-openvswitch-agent[9883]: INFO neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent [None req-aadec15d-63f8-4f9f-b135-ec34c2d852a4 None None] Cleaning stale br-int flows
Mar 11 16:35:25 ubuntu neutron-openvswitch-agent[9883]: INFO neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent [None req-aadec15d-63f8-4f9f-b135-ec34c2d852a4 None None] Cleaning stale br-ex flows
Mar 11 16:35:25 ubuntu neutron-openvswitch-agent[9883]: ERROR OfctlService [-] unknown dpid 104209607453507
Mar 11 16:35:25 ubuntu neutron-openvswitch-agent[9883]: ERROR neutron.plugins.ml2.drivers.openvswitch.agent.openflow.native.ofswitch [None req-aadec15d-63f8-4f9f-b135-ec34c2d852a4 None None] ofctl request version=None,msg_type=None,msg_len=None,xid=None,OFPFlowStatsRequest(cookie=0,cookie_mask=0,flags=0,match=OFPMatch(oxm_fields={}),out_group=4294967295,out_port=4294967295,table_id=255,type=1) error Datapath Invalid 104209607453507: InvalidDatapath: Datapath Invalid 104209607453507
Mar 11 16:35:25 ubuntu neutron-openvswitch-agent[9883]: ERROR neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent [None req-aadec15d-63f8-4f9f-b135-ec34c2d852a4 None None] Error while processing VIF ports: RuntimeError: ofctl request version=None,msg_type=None,msg_len=None,xid=None,OFPFlowStatsRequest(cookie=0,cookie_mask=0,flags=0,match=OFPMatch(oxm_fields={}),out_group=4294967295,out_port=4294967295,table_id=255,type=1) error Datapath Invalid 104209607453507
Mar 11 16:35:25 ubuntu neutron-openvswitch-agent[9883]: ERROR neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent Traceback (most recent call last):
Mar 11 16:35:25 ubuntu neutron-openvswitch-agent[9883]: ERROR neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent File "/opt/stack/neutron/neutron/plugins/ml2/drivers/openvswitch/agent/ovs_neutron_agent.py", line 2230, in rpc_loop
Mar 11 16:35:25 ubuntu neutron-openvswitch-agent[9883]: ERROR neutron.plugins.ml2.driv...

Reran with a LOG.exception(e) to show the exception from os_ken.

InvalidDatapath: Datapath Invalid 104209607453507

Actually, datapath_id 104209607453507 corresponds to br-ex.

Retested with https://git.openstack.org/cgit/openstack/neutron/commit/?id=f898ffd71fba4f9b8fd9f4cb851fc3976d72396a and it makes a difference. It now takes about 400 ports (~98K flows) before the problem happens. This is with OVS rebuilt to allow a 1 sec inactivity probe.
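
The two flow counts in this thread (about 24.5K flows for 100 ports, and ~98K for about 400 ports) are consistent with each other if the flow count grows linearly with the number of ports. That linearity is an assumption, not something measured here. A quick check in Python:

```python
# Figures quoted in this thread; both are approximate ("about").
ports_first, flows_first = 100, 24_500   # 100 ports, about 24.5K flows in br-int
ports_second = 400                       # about 400 ports in the retest

# Assumes the flow count scales linearly with the number of ports.
flows_per_port = flows_first / ports_first
estimated_flows = flows_per_port * ports_second

print(f"{flows_per_port:.0f} flows per port")
print(f"{estimated_flows:.0f} flows estimated for {ports_second} ports")
```

Under that assumption each port accounts for roughly 245 flows, which gives about 98,000 flows at 400 ports and matches the ~98K reported.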

tags: added: ovs
Miguel Lavalle (minsel) wrote :

This RFE was revisited today during the drivers meeting and it was approved.

tags: added: rfe-approved
removed: rfe-triaged

Reviewed: https://review.opendev.org/641681
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=540d00f68ecf1c11b0296922f965ef60d15d1a86
Submitter: Zuul
Branch: master

commit 540d00f68ecf1c11b0296922f965ef60d15d1a86
Author: Darragh O'Reilly <email address hidden>
Date: Thu Mar 7 14:33:26 2019 +0000

    Make OVS controller inactivity_probe configurable

    This parameter applies to the OVSDB Controller table when the
    native openflow driver is used. There are reports that increasing
    it can reduce errors on busy systems. This patch also sets the
    default value to 10s which is more than the OVS default of 5s.
    See the ovs-vswitchd.conf.db man page for full description.

    Change-Id: If0d42919412dac75deb4d7f484c42cea630fbc59
    Partial-Bug: #1817022

Reviewed: https://review.opendev.org/660074
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=6e661ecd2de3994de25af0337f7e2ae781c2cca9
Submitter: Zuul
Branch: stable/stein

commit 6e661ecd2de3994de25af0337f7e2ae781c2cca9
Author: Darragh O'Reilly <email address hidden>
Date: Thu Mar 7 14:33:26 2019 +0000

    Make OVS controller inactivity_probe configurable

    This parameter applies to the OVSDB Controller table when the
    native openflow driver is used. There are reports that increasing
    it can reduce errors on busy systems. This patch also sets the
    default value to 10s which is more than the OVS default of 5s.
    See the ovs-vswitchd.conf.db man page for full description.

    Change-Id: If0d42919412dac75deb4d7f484c42cea630fbc59
    Partial-Bug: #1817022
    (cherry picked from commit 540d00f68ecf1c11b0296922f965ef60d15d1a86)

tags: added: in-stable-stein

Reviewed: https://review.opendev.org/663024
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=3e555347956acd8841951e5b18c2c6b59325e197
Submitter: Zuul
Branch: stable/rocky

commit 3e555347956acd8841951e5b18c2c6b59325e197
Author: Darragh O'Reilly <email address hidden>
Date: Thu Mar 7 14:33:26 2019 +0000

    Make OVS controller inactivity_probe configurable

    This parameter applies to the OVSDB Controller table when the
    native openflow driver is used. There are reports that increasing
    it can reduce errors on busy systems. This patch also sets the
    default value to 10s which is more than the OVS default of 5s.
    See the ovs-vswitchd.conf.db man page for full description.

    Conflicts:
        neutron/tests/functional/agent/common/test_ovs_lib.py

    Change-Id: If0d42919412dac75deb4d7f484c42cea630fbc59
    Partial-Bug: #1817022
    (cherry picked from commit 540d00f68ecf1c11b0296922f965ef60d15d1a86)
    (cherry picked from commit 6e661ecd2de3994de25af0337f7e2ae781c2cca9)

tags: added: in-stable-rocky

Reviewed: https://review.opendev.org/663034
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=7891c7cd05ad5cff12b1f386cd116db408708371
Submitter: Zuul
Branch: stable/queens

commit 7891c7cd05ad5cff12b1f386cd116db408708371
Author: Darragh O'Reilly <email address hidden>
Date: Thu Mar 7 14:33:26 2019 +0000

    Make OVS controller inactivity_probe configurable

    This parameter applies to the OVSDB Controller table when the
    native openflow driver is used. There are reports that increasing
    it can reduce errors on busy systems. This patch also sets the
    default value to 10s which is more than the OVS default of 5s.
    See the ovs-vswitchd.conf.db man page for full description.

    Conflicts:
        neutron/tests/functional/agent/common/test_ovs_lib.py

    Change-Id: If0d42919412dac75deb4d7f484c42cea630fbc59
    Partial-Bug: #1817022
    (cherry picked from commit 540d00f68ecf1c11b0296922f965ef60d15d1a86)
    (cherry picked from commit 6e661ecd2de3994de25af0337f7e2ae781c2cca9)
    (cherry picked from commit eaad77758daeefb69bd60d48b8ca1e2814604f8a)

tags: added: in-stable-queens

Reviewed: https://review.opendev.org/663050
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=4ba9f694c63fcd2a9a3b1091e67a91d5496fe087
Submitter: Zuul
Branch: stable/pike

commit 4ba9f694c63fcd2a9a3b1091e67a91d5496fe087
Author: Darragh O'Reilly <email address hidden>
Date: Thu Mar 7 14:33:26 2019 +0000

    Make OVS controller inactivity_probe configurable

    This parameter applies to the OVSDB Controller table when the
    native openflow driver is used. There are reports that increasing
    it can reduce errors on busy systems. This patch also sets the
    default value to 10s which is more than the OVS default of 5s.
    See the ovs-vswitchd.conf.db man page for full description.

    Conflicts:
        neutron/tests/functional/agent/common/test_ovs_lib.py
        neutron/agent/common/ovs_lib.py

    Change-Id: If0d42919412dac75deb4d7f484c42cea630fbc59
    Partial-Bug: #1817022
    (cherry picked from commit 540d00f68ecf1c11b0296922f965ef60d15d1a86)
    (cherry picked from commit 6e661ecd2de3994de25af0337f7e2ae781c2cca9)
    (cherry picked from commit eaad77758daeefb69bd60d48b8ca1e2814604f8a)
    (cherry picked from commit 7891c7cd05ad5cff12b1f386cd116db408708371)

tags: added: in-stable-pike

Change abandoned by Slawek Kaplonski (<email address hidden>) on branch: stable/ocata
Review: https://review.opendev.org/663059
Reason: This review is > 4 weeks without comment, and failed Jenkins the last time it was checked. We are abandoning this for now. Feel free to reactivate the review by pressing the restore button and leaving a 'recheck' comment to get fresh test results.
