neutron-openvswitch-agent crashes on start with firewall config of br-int

Bug #1944201 reported by Julia Kreger
This bug affects 2 people
Affects: neutron
Importance: Critical
Assigned to: Unassigned

Bug Description

In upstream CI, Ironic jobs have been failing because the networking is never stood up by neutron. Investigation led us to find the neutron-openvswitch-agent in a failed state: it exited due to a RuntimeError just a few seconds after the service was started.

neutron-openvswitch-agent[78787]: DEBUG neutron.agent.securitygroups_rpc [None req-b18a79b7-7258-44f0-9a69-fa92a490bc26 None None] Init firewall settings (driver=openvswitch) {{(pid=78787) init_firewall /opt/stack/neutron/neutron/agent/securitygroups_rpc.py:118}}
neutron-openvswitch-agent[78787]: DEBUG ovsdbapp.backend.ovs_idl.transaction [-] Running txn n=1 command(idx=0): DbAddCommand(table=Bridge, record=br-int, column=protocols, values=('OpenFlow10', 'OpenFlow11', 'OpenFlow12', 'OpenFlow13', 'OpenFlow14')) {{(pid=78787) do_commit /usr/local/lib/python3.8/dist-packages/ovsdbapp/backend/ovs_idl/transaction.py:90}}
neutron-openvswitch-agent[78787]: ERROR OfctlService [-] unknown dpid 90695823979334
neutron-openvswitch-agent[78787]: ERROR neutron.plugins.ml2.drivers.openvswitch.agent.openflow.native.ofswitch [None req-b18a79b7-7258-44f0-9a69-fa92a490bc26 None None] ofctl request version=None,msg_type=None,msg_len=None,xid=None,OFPFlowStatsRequest(cookie=0,cookie_mask=0,flags=0,match=OFPMatch(oxm_fields={}),out_group=4294967295,out_port=4294967295,table_id=71,type=1) error Datapath Invalid 90695823979334: os_ken.app.ofctl.exception.InvalidDatapath: Datapath Invalid 90695823979334
neutron-openvswitch-agent[78787]: ERROR neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent [None req-b18a79b7-7258-44f0-9a69-fa92a490bc26 None None] ofctl request version=None,msg_type=None,msg_len=None,xid=None,OFPFlowStatsRequest(cookie=0,cookie_mask=0,flags=0,match=OFPMatch(oxm_fields={}),out_group=4294967295,out_port=4294967295,table_id=71,type=1) error Datapath Invalid 90695823979334 agent terminated!: RuntimeError: ofctl request version=None,msg_type=None,msg_len=None,xid=None,OFPFlowStatsRequest(cookie=0,cookie_mask=0,flags=0,match=OFPMatch(oxm_fields={}),out_group=4294967295,out_port=4294967295,table_id=71,type=1) error Datapath Invalid 90695823979334
systemd[1]: <email address hidden>: Main process exited, code=exited, status=1/FAILURE
systemd[1]: <email address hidden>: Failed with result 'exit-code'.
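
When checking whether the datapath in these messages is really br-int, note that the log prints the dpid in decimal, while `ovs-vsctl get Bridge br-int datapath_id` prints it as a 16-digit hex string. A minimal sketch of the conversion follows; the helper names are made up for illustration and are not part of neutron or os-ken.

```python
def dpid_to_hex(dpid: int) -> str:
    """Format a decimal dpid as the 16-digit hex string stored by OVS."""
    return format(dpid, "016x")


def hex_to_dpid(datapath_id: str) -> int:
    """Parse the hex datapath_id string back into the decimal dpid."""
    return int(datapath_id, 16)


if __name__ == "__main__":
    # dpid taken from the "unknown dpid" error line above.
    print(dpid_to_hex(90695823979334))  # 0000527cc45ef346
```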

Originally, this was thought to be related to https://bugs.launchpad.net/neutron/+bug/1817022; however, this happens at service startup on a relatively lightly loaded machine where the only activity at that time is neutron starting. Also, at startup the connections have not existed long enough for inactivity (idle) triggers to occur.

Investigation allowed us to identify the general path of what is occurring, but we do not understand why it happens, at least in the Ironic community.

init_firewall() invocation: https://github.com/openstack/neutron/blob/79445f12be3a9ca892672fe0e016336ef60877a2/neutron/agent/securitygroups_rpc.py#L70
Firewall class launch: https://github.com/openstack/neutron/blob/79445f12be3a9ca892672fe0e016336ef60877a2/neutron/agent/securitygroups_rpc.py#L121

The default firewall driver sends us into the openvswitch firewall code:

https://github.com/openstack/neutron/blob/79445f12be3a9ca892672fe0e016336ef60877a2/neutron/agent/linux/openvswitch_firewall/firewall.py#L548
https://github.com/openstack/neutron/blob/79445f12be3a9ca892672fe0e016336ef60877a2/neutron/agent/linux/openvswitch_firewall/firewall.py#L628

This eventually ends up in https://github.com/openstack/neutron/blob/master/neutron/plugins/ml2/drivers/openvswitch/agent/openflow/native/ofswitch.py#L91, where a RuntimeError is raised and the service exits.

Julia Kreger (juliaashleykreger) wrote :
Lajos Katona (lajos-katona) wrote :
Changed in neutron:
status: New → Confirmed
Slawek Kaplonski (slaweq) wrote :
Changed in neutron:
importance: Undecided → Critical
tags: added: gate-failure ovs
Rodolfo Alonso (rodolfo-alonso-hernandez) wrote :

Hello:

I reviewed the kibana logs and other successful executions. I think the problem is in [1]: when we change the OF service, the controller needs to be re-initialized. E.g., in [2] we can see that when we change some parameters of the controller, it needs to be restarted. Snippet: [3].

In order to mitigate (not really prevent) those errors, I propose to add a retry decorator on those methods that force the OF controller to be restarted. I've identified these two:
- set_fail_mode (that calls ovsdbapp SetFailModeCommand): this command sets the bridge fail_mode.
- add_protocols: this method adds new OF protocols to the bridge (it makes sense that the OF controller needs to be re-initialized).

I'll push a patch to retry those methods in case they raise this "InvalidDatapath" exception.
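
The retry idea described above could be sketched roughly as follows. This is only an illustration under stated assumptions, not the patch itself: `InvalidDatapath` here is a local stand-in for `os_ken.app.ofctl.exception.InvalidDatapath`, and the decorator name, attempt count, and delay are hypothetical.

```python
import functools
import time


class InvalidDatapath(Exception):
    """Local stand-in for os_ken.app.ofctl.exception.InvalidDatapath."""


def retry_on_invalid_datapath(attempts=3, delay=0.5):
    """Retry the wrapped call while it raises InvalidDatapath.

    Hypothetical helper: the name and default values are assumptions.
    """
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            for attempt in range(1, attempts + 1):
                try:
                    return func(*args, **kwargs)
                except InvalidDatapath:
                    # Give up once the last attempt has also failed.
                    if attempt == attempts:
                        raise
                    # Leave the OF controller some time to reconnect.
                    time.sleep(delay)
        return wrapper
    return decorator
```

Any other exception is propagated immediately, so only the transient "datapath not known yet" case is retried.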

Regards.

[1]https://github.com/openvswitch/ovs/blob/849a40ccfb9c7c6bba635b517caac4f12ab63eee/ofproto/connmgr.c#L605-L612
[2]https://e36beaa2ff297ebe7d5f-5944c3d62ed334b8cdf50b534c246731.ssl.cf5.rackcdn.com/805849/9/check/neutron-ovs-tempest-dvr-ha-multinode-full/f83fa96/compute1/logs/openvswitch/ovs-vswitchd_log.txt
[3]https://paste.opendev.org/show/809505/

P.S.: just for the record, this bug is related to:
- https://bugs.launchpad.net/neutron/+bug/1817022
- https://bugs.launchpad.net/neutron/+bug/1672610
- https://bugs.launchpad.net/neutron/+bug/1627106
- https://bugs.launchpad.net/neutron/+bug/1666731

Rodolfo Alonso (rodolfo-alonso-hernandez) wrote :

Hello:

Actually, the methods that modify the DB (using the ovsdbapp library) are not affected; the one that fails is the subsequent method, which uses the os-ken library. This is why the OfctlService cannot find the datapath ID of br-int even though this ID is valid.

The change could be more difficult as I commented in c#4.

Regards.

OpenStack Infra (hudson-openstack) wrote : Related fix proposed to neutron-tempest-plugin (master)

Related fix proposed to branch: master
Review: https://review.opendev.org/c/openstack/neutron-tempest-plugin/+/810541

OpenStack Infra (hudson-openstack) wrote : Related fix proposed to neutron (master)

Related fix proposed to branch: master
Review: https://review.opendev.org/c/openstack/neutron/+/810542

OpenStack Infra (hudson-openstack) wrote : Fix proposed to neutron (master)

Fix proposed to branch: master
Review: https://review.opendev.org/c/openstack/neutron/+/810592

Changed in neutron:
status: Confirmed → In Progress
OpenStack Infra (hudson-openstack) wrote : Change abandoned on neutron (master)

Change abandoned by "Lajos Katona <email address hidden>" on branch: master
Review: https://review.opendev.org/c/openstack/neutron/+/810542

OpenStack Infra (hudson-openstack) wrote : Change abandoned on neutron-tempest-plugin (master)

Change abandoned by "Lajos Katona <email address hidden>" on branch: master
Review: https://review.opendev.org/c/openstack/neutron-tempest-plugin/+/810541

OpenStack Infra (hudson-openstack) wrote : Fix merged to neutron (master)

Reviewed: https://review.opendev.org/c/openstack/neutron/+/810592
Committed: https://opendev.org/openstack/neutron/commit/57629dc05122b01a9ba76606b8d75cce9da40776
Submitter: "Zuul (22348)"
Branch: master

commit 57629dc05122b01a9ba76606b8d75cce9da40776
Author: Rodolfo Alonso Hernandez <email address hidden>
Date: Thu Sep 23 09:15:09 2021 +0000

    Add retry when executing OF commands if "InvalidDatapath"

    When using the OF API (currently Neutron only uses the native
    implementation via the "os-ken" library), retry the command in case of
    an "InvalidDatapath" exception.

    As commented in the related bug, some operations could restart the
    OF controller (set the OF protocols, set the bridge fail mode). During
    the controller restart, a command can return an "InvalidDatapath"
    exception.

    Closes-Bug: #1944201
    Change-Id: Ia8d202f8a38362272e9519c1cbd9d6ba9359e0a1

Changed in neutron:
status: In Progress → Fix Released
OpenStack Infra (hudson-openstack) wrote : Fix proposed to neutron (stable/xena)

Fix proposed to branch: stable/xena
Review: https://review.opendev.org/c/openstack/neutron/+/810972

OpenStack Infra (hudson-openstack) wrote : Fix merged to neutron (stable/xena)

Reviewed: https://review.opendev.org/c/openstack/neutron/+/810972
Committed: https://opendev.org/openstack/neutron/commit/e471a40a418dc743fb533fedfe69cff356859da5
Submitter: "Zuul (22348)"
Branch: stable/xena

commit e471a40a418dc743fb533fedfe69cff356859da5
Author: Rodolfo Alonso Hernandez <email address hidden>
Date: Thu Sep 23 09:15:09 2021 +0000

    Add retry when executing OF commands if "InvalidDatapath"

    When using the OF API (currently Neutron only uses the native
    implementation via the "os-ken" library), retry the command in case of
    an "InvalidDatapath" exception.

    As commented in the related bug, some operations could restart the
    OF controller (set the OF protocols, set the bridge fail mode). During
    the controller restart, a command can return an "InvalidDatapath"
    exception.

    Closes-Bug: #1944201
    Change-Id: Ia8d202f8a38362272e9519c1cbd9d6ba9359e0a1
    (cherry picked from commit 57629dc05122b01a9ba76606b8d75cce9da40776)

tags: added: in-stable-xena
Rodolfo Alonso (rodolfo-alonso-hernandez) wrote :

Hello Julia:

Did you try the latest Neutron version with patch [1]?

Let me know if you still have problems (nick ralonsoh in #openstack-neutron channel).

Regards.

[1]https://review.opendev.org/q/Ia8d202f8a38362272e9519c1cbd9d6ba9359e0a1

OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/neutron 19.0.0.0rc2

This issue was fixed in the openstack/neutron 19.0.0.0rc2 release candidate.

OpenStack Infra (hudson-openstack) wrote : Related fix proposed to os-ken (master)

Related fix proposed to branch: master
Review: https://review.opendev.org/c/openstack/os-ken/+/812309

OpenStack Infra (hudson-openstack) wrote :

Related fix proposed to branch: master
Review: https://review.opendev.org/c/openstack/os-ken/+/812330

OpenStack Infra (hudson-openstack) wrote : Related fix merged to os-ken (master)

Reviewed: https://review.opendev.org/c/openstack/os-ken/+/812309
Committed: https://opendev.org/openstack/os-ken/commit/f4e5e04aac0b90282a56c030fcc79f298d6f977f
Submitter: "Zuul (22348)"
Branch: master

commit f4e5e04aac0b90282a56c030fcc79f298d6f977f
Author: Slawek Kaplonski <email address hidden>
Date: Mon Oct 4 11:26:53 2021 +0200

    app/ofctl: fix possible deadlock when the datapath disconnects

    Backport of the Ryu patch [1].

    Related-Bug: #1944201

    [1] https://github.com/faucetsdn/ryu/commit/e3aa55872b70b97e2fd0946cbd404ef29e2cd092

    Change-Id: I13b4ee7e6790b2d45f8cc6b33daac6222f45dfa3

OpenStack Infra (hudson-openstack) wrote :

Reviewed: https://review.opendev.org/c/openstack/os-ken/+/812330
Committed: https://opendev.org/openstack/os-ken/commit/2e8cf5bed22f6be0d3ab51ea81aeda76f39d14b7
Submitter: "Zuul (22348)"
Branch: master

commit 2e8cf5bed22f6be0d3ab51ea81aeda76f39d14b7
Author: Slawek Kaplonski <email address hidden>
Date: Mon Oct 4 11:30:01 2021 +0200

    app/ofctl: fix possible deadlock

    When datapath reconnects/disconnects, waiting requests on old
    datapath should be canceled. Otherwise, threads that sent
    requests may fall into deadlock because replies possibly never
    come.

    This patch is a backport of the Ryu patch [1].

    Related-Bug: #1944201

    [1] https://github.com/faucetsdn/ryu/commit/6f906e72c92e10bd0264c9b91a2f7bb85b97780c

    Change-Id: I0bde9df2cd651b88e60631b53d69ad50f8d86d02
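
The cancellation described in this commit message can be illustrated with a small standalone sketch. It is not the os-ken or Ryu code: the class, method, and exception names below are hypothetical, and it uses `concurrent.futures.Future` in place of the library's own event handling.

```python
import threading
from concurrent.futures import Future


class DatapathGone(Exception):
    """Hypothetical error delivered to requests whose datapath went away."""


class PendingRequests:
    """Track requests waiting for a reply, grouped by datapath ID."""

    def __init__(self):
        self._lock = threading.Lock()
        self._waiters = {}  # dpid -> {xid: Future}

    def register(self, dpid, xid):
        """Record a sent request and return a Future for its reply."""
        fut = Future()
        with self._lock:
            self._waiters.setdefault(dpid, {})[xid] = fut
        return fut

    def reply(self, dpid, xid, msg):
        """Deliver a reply to the matching waiter, if it still exists."""
        with self._lock:
            fut = self._waiters.get(dpid, {}).pop(xid, None)
        if fut is not None:
            fut.set_result(msg)

    def datapath_disconnected(self, dpid):
        """Cancel every waiter of the old datapath so that no caller
        blocks forever on a reply that will never come."""
        with self._lock:
            waiters = self._waiters.pop(dpid, {})
        for fut in waiters.values():
            fut.set_exception(DatapathGone(dpid))
```

A caller blocked in `fut.result()` then gets `DatapathGone` as soon as the disconnect is noticed, instead of waiting indefinitely.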

OpenStack Infra (hudson-openstack) wrote : Related fix proposed to neutron (master)

Related fix proposed to branch: master
Review: https://review.opendev.org/c/openstack/neutron/+/813666

OpenStack Infra (hudson-openstack) wrote : Related fix merged to neutron (master)

Reviewed: https://review.opendev.org/c/openstack/neutron/+/813666
Committed: https://opendev.org/openstack/neutron/commit/5665fc437b42bc732515f1920105fd0910558796
Submitter: "Zuul (22348)"
Branch: master

commit 5665fc437b42bc732515f1920105fd0910558796
Author: Slawek Kaplonski <email address hidden>
Date: Tue Oct 12 17:53:16 2021 +0200

    Bump os-ken to 2.2.0

    Related-Bug: #1944201
    Change-Id: Ib699fd096cf7e7d5df83ee4942e287eecf5a0b8f
