[ovn] race condition with add/remove router interface

Bug #2060163 reported by Mohammed Naser
Affects: neutron
Status: New
Importance: High
Assigned to: Unassigned

Bug Description

We're running into an issue in our CI with Atmosphere where we frequently see failures when an interface is removed from a router; the traceback is the following:

==============================
2024-04-03 21:13:09.804 10 ERROR neutron.api.v2.resource [None req-e5d08cdd-28e6-4231-a50c-7eafc1b8f942 70fc3b55af8c4386b80207dad11db5da dcec54844db44eedbd9667951a5ceb6b - - - -] remove_router_interface failed: No details.: ovsdbapp.backend.ovs_idl.idlutils.RowNotFound: Cannot find Logical_Router_Port with name=lrp-7e0debbb-893c-420a-8569-d8fb98e6a16e
2024-04-03 21:13:09.804 10 ERROR neutron.api.v2.resource Traceback (most recent call last):
2024-04-03 21:13:09.804 10 ERROR neutron.api.v2.resource File "/var/lib/openstack/lib/python3.10/site-packages/neutron/api/v2/resource.py", line 98, in resource
2024-04-03 21:13:09.804 10 ERROR neutron.api.v2.resource result = method(request=request, **args)
2024-04-03 21:13:09.804 10 ERROR neutron.api.v2.resource File "/var/lib/openstack/lib/python3.10/site-packages/neutron_lib/db/api.py", line 140, in wrapped
2024-04-03 21:13:09.804 10 ERROR neutron.api.v2.resource with excutils.save_and_reraise_exception():
2024-04-03 21:13:09.804 10 ERROR neutron.api.v2.resource File "/var/lib/openstack/lib/python3.10/site-packages/oslo_utils/excutils.py", line 227, in __exit__
2024-04-03 21:13:09.804 10 ERROR neutron.api.v2.resource self.force_reraise()
2024-04-03 21:13:09.804 10 ERROR neutron.api.v2.resource File "/var/lib/openstack/lib/python3.10/site-packages/oslo_utils/excutils.py", line 200, in force_reraise
2024-04-03 21:13:09.804 10 ERROR neutron.api.v2.resource raise self.value
2024-04-03 21:13:09.804 10 ERROR neutron.api.v2.resource File "/var/lib/openstack/lib/python3.10/site-packages/neutron_lib/db/api.py", line 138, in wrapped
2024-04-03 21:13:09.804 10 ERROR neutron.api.v2.resource return f(*args, **kwargs)
2024-04-03 21:13:09.804 10 ERROR neutron.api.v2.resource File "/var/lib/openstack/lib/python3.10/site-packages/oslo_db/api.py", line 144, in wrapper
2024-04-03 21:13:09.804 10 ERROR neutron.api.v2.resource with excutils.save_and_reraise_exception() as ectxt:
2024-04-03 21:13:09.804 10 ERROR neutron.api.v2.resource File "/var/lib/openstack/lib/python3.10/site-packages/oslo_utils/excutils.py", line 227, in __exit__
2024-04-03 21:13:09.804 10 ERROR neutron.api.v2.resource self.force_reraise()
2024-04-03 21:13:09.804 10 ERROR neutron.api.v2.resource File "/var/lib/openstack/lib/python3.10/site-packages/oslo_utils/excutils.py", line 200, in force_reraise
2024-04-03 21:13:09.804 10 ERROR neutron.api.v2.resource raise self.value
2024-04-03 21:13:09.804 10 ERROR neutron.api.v2.resource File "/var/lib/openstack/lib/python3.10/site-packages/oslo_db/api.py", line 142, in wrapper
2024-04-03 21:13:09.804 10 ERROR neutron.api.v2.resource return f(*args, **kwargs)
2024-04-03 21:13:09.804 10 ERROR neutron.api.v2.resource File "/var/lib/openstack/lib/python3.10/site-packages/neutron_lib/db/api.py", line 186, in wrapped
2024-04-03 21:13:09.804 10 ERROR neutron.api.v2.resource with excutils.save_and_reraise_exception():
2024-04-03 21:13:09.804 10 ERROR neutron.api.v2.resource File "/var/lib/openstack/lib/python3.10/site-packages/oslo_utils/excutils.py", line 227, in __exit__
2024-04-03 21:13:09.804 10 ERROR neutron.api.v2.resource self.force_reraise()
2024-04-03 21:13:09.804 10 ERROR neutron.api.v2.resource File "/var/lib/openstack/lib/python3.10/site-packages/oslo_utils/excutils.py", line 200, in force_reraise
2024-04-03 21:13:09.804 10 ERROR neutron.api.v2.resource raise self.value
2024-04-03 21:13:09.804 10 ERROR neutron.api.v2.resource File "/var/lib/openstack/lib/python3.10/site-packages/neutron_lib/db/api.py", line 184, in wrapped
2024-04-03 21:13:09.804 10 ERROR neutron.api.v2.resource return f(*dup_args, **dup_kwargs)
2024-04-03 21:13:09.804 10 ERROR neutron.api.v2.resource File "/var/lib/openstack/lib/python3.10/site-packages/neutron/api/v2/base.py", line 253, in _handle_action
2024-04-03 21:13:09.804 10 ERROR neutron.api.v2.resource ret_value = getattr(self._plugin, name)(*arg_list, **kwargs)
2024-04-03 21:13:09.804 10 ERROR neutron.api.v2.resource File "/var/lib/openstack/lib/python3.10/site-packages/neutron/services/ovn_l3/plugin.py", line 260, in remove_router_interface
2024-04-03 21:13:09.804 10 ERROR neutron.api.v2.resource with excutils.save_and_reraise_exception():
2024-04-03 21:13:09.804 10 ERROR neutron.api.v2.resource File "/var/lib/openstack/lib/python3.10/site-packages/oslo_utils/excutils.py", line 227, in __exit__
2024-04-03 21:13:09.804 10 ERROR neutron.api.v2.resource self.force_reraise()
2024-04-03 21:13:09.804 10 ERROR neutron.api.v2.resource File "/var/lib/openstack/lib/python3.10/site-packages/oslo_utils/excutils.py", line 200, in force_reraise
2024-04-03 21:13:09.804 10 ERROR neutron.api.v2.resource raise self.value
2024-04-03 21:13:09.804 10 ERROR neutron.api.v2.resource File "/var/lib/openstack/lib/python3.10/site-packages/neutron/services/ovn_l3/plugin.py", line 257, in remove_router_interface
2024-04-03 21:13:09.804 10 ERROR neutron.api.v2.resource self._ovn_client.delete_router_port(
2024-04-03 21:13:09.804 10 ERROR neutron.api.v2.resource File "/var/lib/openstack/lib/python3.10/site-packages/neutron/plugins/ml2/drivers/ovn/mech_driver/ovsdb/ovn_client.py", line 1904, in delete_router_port
2024-04-03 21:13:09.804 10 ERROR neutron.api.v2.resource with self._nb_idl.transaction(check_error=True) as txn:
2024-04-03 21:13:09.804 10 ERROR neutron.api.v2.resource File "/usr/lib/python3.10/contextlib.py", line 142, in __exit__
2024-04-03 21:13:09.804 10 ERROR neutron.api.v2.resource next(self.gen)
2024-04-03 21:13:09.804 10 ERROR neutron.api.v2.resource File "/var/lib/openstack/lib/python3.10/site-packages/neutron/plugins/ml2/drivers/ovn/mech_driver/ovsdb/impl_idl_ovn.py", line 272, in transaction
2024-04-03 21:13:09.804 10 ERROR neutron.api.v2.resource with super(OvsdbNbOvnIdl, self).transaction(*args, **kwargs) as t:
2024-04-03 21:13:09.804 10 ERROR neutron.api.v2.resource File "/usr/lib/python3.10/contextlib.py", line 142, in __exit__
2024-04-03 21:13:09.804 10 ERROR neutron.api.v2.resource next(self.gen)
2024-04-03 21:13:09.804 10 ERROR neutron.api.v2.resource File "/var/lib/openstack/lib/python3.10/site-packages/ovsdbapp/api.py", line 104, in transaction
2024-04-03 21:13:09.804 10 ERROR neutron.api.v2.resource with self.create_transaction(
2024-04-03 21:13:09.804 10 ERROR neutron.api.v2.resource File "/var/lib/openstack/lib/python3.10/site-packages/ovsdbapp/api.py", line 61, in __exit__
2024-04-03 21:13:09.804 10 ERROR neutron.api.v2.resource self.result = self.commit()
2024-04-03 21:13:09.804 10 ERROR neutron.api.v2.resource File "/var/lib/openstack/lib/python3.10/site-packages/ovsdbapp/backend/ovs_idl/transaction.py", line 64, in commit
2024-04-03 21:13:09.804 10 ERROR neutron.api.v2.resource raise result.ex
2024-04-03 21:13:09.804 10 ERROR neutron.api.v2.resource File "/var/lib/openstack/lib/python3.10/site-packages/ovsdbapp/backend/ovs_idl/connection.py", line 118, in run
2024-04-03 21:13:09.804 10 ERROR neutron.api.v2.resource txn.results.put(txn.do_commit())
2024-04-03 21:13:09.804 10 ERROR neutron.api.v2.resource File "/var/lib/openstack/lib/python3.10/site-packages/ovsdbapp/backend/ovs_idl/transaction.py", line 92, in do_commit
2024-04-03 21:13:09.804 10 ERROR neutron.api.v2.resource command.run_idl(txn)
2024-04-03 21:13:09.804 10 ERROR neutron.api.v2.resource File "/var/lib/openstack/lib/python3.10/site-packages/ovsdbapp/backend/ovs_idl/command.py", line 341, in run_idl
2024-04-03 21:13:09.804 10 ERROR neutron.api.v2.resource entity = self.api.lookup(self.table, self.entity)
2024-04-03 21:13:09.804 10 ERROR neutron.api.v2.resource File "/var/lib/openstack/lib/python3.10/site-packages/ovsdbapp/backend/ovs_idl/__init__.py", line 181, in lookup
2024-04-03 21:13:09.804 10 ERROR neutron.api.v2.resource return self._lookup(table, record)
2024-04-03 21:13:09.804 10 ERROR neutron.api.v2.resource File "/var/lib/openstack/lib/python3.10/site-packages/ovsdbapp/backend/ovs_idl/__init__.py", line 228, in _lookup
2024-04-03 21:13:09.804 10 ERROR neutron.api.v2.resource row = idlutils.row_by_value(self, rl.table, rl.column, record)
2024-04-03 21:13:09.804 10 ERROR neutron.api.v2.resource File "/var/lib/openstack/lib/python3.10/site-packages/ovsdbapp/backend/ovs_idl/idlutils.py", line 114, in row_by_value
2024-04-03 21:13:09.804 10 ERROR neutron.api.v2.resource raise RowNotFound(table=table, col=column, match=match)
2024-04-03 21:13:09.804 10 ERROR neutron.api.v2.resource ovsdbapp.backend.ovs_idl.idlutils.RowNotFound: Cannot find Logical_Router_Port with name=lrp-7e0debbb-893c-420a-8569-d8fb98e6a16e
2024-04-03 21:13:09.804 10 ERROR neutron.api.v2.resource 
2024-04-03 21:13:09.807 10 INFO neutron.wsgi [None req-e5d08cdd-28e6-4231-a50c-7eafc1b8f942 70fc3b55af8c4386b80207dad11db5da dcec54844db44eedbd9667951a5ceb6b - - - -] 10.0.0.73 "PUT /v2.0/routers/1a3225bc-725d-400f-b9ef-3ffe7d4fb8d8/remove_router_interface HTTP/1.1" status: 500 len: 363 time: 2.0234911
==============================

I can't tell what's going on exactly, but it seems like for some reason the `lrp` is missing when we try to remove it. The add seems to happen just fine, though:

==============================
2024-04-03 21:12:58.396 9 INFO neutron.wsgi [None req-0b71c77f-3844-4549-af1e-541f344f302e 70fc3b55af8c4386b80207dad11db5da dcec54844db44eedbd9667951a5ceb6b - - - -] 10.0.0.73 "PUT /v2.0/routers/1a3225bc-725d-400f-b9ef-3ffe7d4fb8d8/add_router_interface HTTP/1.1" status: 200 len: 503 time: 3.1546130
==============================

https://logs.atmosphere.dev/61/1061/46e91a5e83194215e35f7209ff92bf86597b4cd6/check/atmosphere-molecule-aio-ovn/5f98342/ubuntu-jammy/pod-logs/openstack/neutron-server-746d84b879-6q8kt/neutron-server.txt

Since this is a CI environment, in order to reduce the resources we consume, we limit neutron to 2 workers. Is it possible that it's not keeping up with the events coming in from OVN and we end up in this mismatch?

Tags: ovsdb-lib
Revision history for this message
Miro Tomaska (mtomaska) wrote :

Hi Mohammed,

Is there a way you can reproduce this in a script and post the script here?

"The Cannot find Logical_Router_Port" means that at the time of that transactionthe lrp was not present in the OVN north DB [1]. Yes that could indicate some race condition since right before that there is log

2024-04-03 21:13:08.460 9 INFO neutron.db.ovn_revision_numbers_db [None req-ef11d6a3-b42d-42a6-866e-398525320292 70fc3b55af8c4386b80207dad11db5da dcec54844db44eedbd9667951a5ceb6b - - - -] Successfully bumped revision number for resource 7e0debbb-893c-420a-8569-d8fb98e6a16e (type: ports) to 1

Side note: I also noticed a bunch of "No more DHCP agents" / "unable to schedule network" warnings, but given that this is ML2/OVN I am wondering why a DHCP agent is used at all.

[1] https://man7.org/linux/man-pages/man5/ovn-nb.5.html#Logical_Router_Port_TABLE
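For context on the failure mode: ovsdbapp resolves the port by looking up a row by column value in its local in-memory replica of the NB DB, and raises RowNotFound when the row is absent (not yet replicated to this worker, or already deleted). A toy stand-in for that lookup path, not the real ovsdbapp code:

```python
class RowNotFound(Exception):
    """Stand-in for ovsdbapp.backend.ovs_idl.idlutils.RowNotFound."""


def row_by_value(table, column, match):
    """Simplified sketch of idlutils.row_by_value: scan the worker's
    local replica of a table for a row whose column equals `match`,
    raising if no such row is visible."""
    for row in table:
        if row.get(column) == match:
            return row
    raise RowNotFound(
        "Cannot find Logical_Router_Port with %s=%s" % (column, match))


# The replica as seen by the worker handling the delete. If the add was
# processed by another worker and this worker's copy has not caught up,
# the lookup below fails exactly like the traceback in this bug.
lrp_table = [{"name": "lrp-aaaa"}]

row = row_by_value(lrp_table, "name", "lrp-aaaa")
try:
    row_by_value(lrp_table, "name", "lrp-7e0debbb")
except RowNotFound as exc:
    print(exc)
```

This is only meant to illustrate why a delete can fail moments after a successful add when two workers have different views of the NB DB.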

Mohammed Naser (mnaser) wrote :

Hi Miro!

The reproducer is actually just running Tempest:

====================================================================
==============================
Failed 1 tests - output below:
==============================

tearDownClass (tempest.api.compute.servers.test_server_addresses.ServerAddressesTestJSON)
-----------------------------------------------------------------------------------------

Captured traceback:
~~~~~~~~~~~~~~~~~~~
    Traceback (most recent call last):

      File "/var/lib/openstack/lib/python3.10/site-packages/tempest/test.py", line 267, in tearDownClass
    raise value.with_traceback(trace)

      File "/var/lib/openstack/lib/python3.10/site-packages/tempest/test.py", line 231, in tearDownClass
    teardown()

      File "/var/lib/openstack/lib/python3.10/site-packages/tempest/test.py", line 778, in clear_credentials
    cls._creds_provider.clear_creds()

      File "/var/lib/openstack/lib/python3.10/site-packages/tempest/lib/common/dynamic_creds.py", line 594, in clear_creds
    self._clear_isolated_net_resources()

      File "/var/lib/openstack/lib/python3.10/site-packages/tempest/lib/common/dynamic_creds.py", line 574, in _clear_isolated_net_resources
    client.remove_router_interface(

      File "/var/lib/openstack/lib/python3.10/site-packages/tempest/lib/services/network/routers_client.py", line 82, in remove_router_interface
    return self.update_resource(uri, kwargs)

      File "/var/lib/openstack/lib/python3.10/site-packages/tempest/lib/services/network/base.py", line 77, in update_resource
    resp, body = self.put(req_uri, req_post_data)

      File "/var/lib/openstack/lib/python3.10/site-packages/tempest/lib/common/rest_client.py", line 372, in put
    return self.request('PUT', url, extra_headers, headers, body, chunked)

      File "/var/lib/openstack/lib/python3.10/site-packages/tempest/lib/common/rest_client.py", line 742, in request
    self._error_checker(resp, resp_body)

      File "/var/lib/openstack/lib/python3.10/site-packages/tempest/lib/common/rest_client.py", line 922, in _error_checker
    raise exceptions.ServerFault(resp_body, resp=resp,

    tempest.lib.exceptions.ServerFault: Got server fault
Details: Request Failed: internal server error while processing your request.
====================================================================

Unfortunately, it doesn't fail consistently :(

For the DHCP agents, I wonder if it's because we set `dhcp_agents_per_network: 3` and Neutron keeps on trying to find agents to get it up to 3?

Brian Haley (brian-haley) wrote :

So why are there DHCP agents if using OVN? For bare-metal?

It does seem like it was created, even if just a split second before the remove call, but see my comment in <> below:

2024-04-03 21:12:58.396 9 INFO neutron.wsgi [None req-0b71c77f-3844-4549-af1e-541f344f302e 70fc3b55af8c4386b80207dad11db5da dcec54844db44eedbd9667951a5ceb6b - - - -] 10.0.0.73 "PUT /v2.0/routers/1a3225bc-725d-400f-b9ef-3ffe7d4fb8d8/add_router_interface HTTP/1.1" status: 200 len: 503 time: 3.1546130

2024-04-03 21:13:09.099 10 INFO neutron.plugins.ml2.drivers.ovn.mech_driver.mech_driver [None req-0a53b0f9-af0e-4fa9-b455-9ce17341aba0 - - - - - -] OVN reports status up for port: 7e0debbb-893c-420a-8569-d8fb98e6a16e

2024-04-03 21:13:09.112 10 INFO neutron.plugins.ml2.plugin [None req-0a53b0f9-af0e-4fa9-b455-9ce17341aba0 - - - - - -] Attempt 1 to provision port 7e0debbb-893c-420a-8569-d8fb98e6a16e

<unfortunately there is no INFO message that it completed, but did not see a DEBUG message it was retrying, although I don't see any DEBUG level messages here, maybe you can enable that?>

2024-04-03 21:13:09.807 10 INFO neutron.wsgi [None req-e5d08cdd-28e6-4231-a50c-7eafc1b8f942 70fc3b55af8c4386b80207dad11db5da dcec54844db44eedbd9667951a5ceb6b - - - -] 10.0.0.73 "PUT /v2.0/routers/1a3225bc-725d-400f-b9ef-3ffe7d4fb8d8/remove_router_interface HTTP/1.1" status: 500 len: 363 time: 2.0234911

Miro Tomaska (mtomaska) wrote :

Mohammed,

Which release of OpenStack are you using, and does it happen on any Tempest test that removes an interface from a router, or just the one you posted? I am trying to gather information to reproduce it outside of your environment (if possible).

I am still curious why you need a DHCP agent in an ML2/OVN environment. Are you missing some DHCP features that OVN cannot provide? Maybe it is for baremetal nodes?

That might be the missing piece explaining why I have not seen this failure in the tempest tests that run periodically.

Miro Tomaska (mtomaska) wrote :

I have actually seen a similar issue in the downstream CI today (it happened once), also during one of the Tempest tearDown stages, when remove_router_interface is called but the logical_router_port resource is not found in the OVN NB DB.
BTW, there was no DHCP agent deployed, so that is not related. I think this does look like a race condition due to system load, the number of workers, or a combination of both. Tempest tests also do a lot of quick setups and teardowns, which is why it shows up there.
Unfortunately, I have not yet identified the secret sauce for this race condition.

Miro Tomaska (mtomaska)
tags: added: ovsdb-lib
Miro Tomaska (mtomaska)
Changed in neutron:
importance: Undecided → High
Mohammed Naser (mnaser) wrote :

Hi Miro,

In my case, this is running against `stable/zed` and it is almost _always_ reproducible. In order to save memory and resources in our CI AIO, we tune the number of workers down to 2:

https://github.com/vexxhost/atmosphere/blob/stable/zed/molecule/aio/group_vars/all/molecule.yml#L159-L162

It almost always fails (fails more than it works) in that scenario:

https://ci.atmosphere.dev/t/atmosphere/builds?job_name=atmosphere-molecule-aio-ovn&project=vexxhost/atmosphere

What number of workers are you using in your CI? I wonder if a simple reproducer could be to set workers down to 2 and then just spam add/remove router interface.
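That spam loop could be sketched like this. The helper takes the add/remove calls as plain callables (for a real run they would wrap openstacksdk's `add_interface_to_router` / `remove_interface_from_router` bound to a router and subnet; those bindings are assumptions, not something from the logs):

```python
def hammer_router_interface(add_interface, remove_interface, iterations=100):
    """Repeatedly add and immediately remove a router interface to
    widen the window for the suspected OVN NB race.

    `add_interface` and `remove_interface` are zero-argument callables
    wrapping the Neutron API calls. Returns how many removals failed,
    since a 500 from the API surfaces as an exception in the client.
    """
    failures = 0
    for _ in range(iterations):
        add_interface()
        try:
            remove_interface()
        except Exception:
            failures += 1
    return failures
```

With only 2 API workers, running several of these loops in parallel (e.g. one thread per router) should maximize the chance that the add and the remove land on different workers, which is where the stale-replica hypothesis predicts the failure.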

Mohammed Naser (mnaser) wrote :

Hmm, upstream CI does this:

https://github.com/openstack/devstack/blob/master/lib/neutron#L444-L448
https://github.com/openstack/devstack/blob/master/stackrc#L778-L782

If I understand this correctly, it will take max(2, nproc/4). Upstream CI nodes have 8 vCPUs, and by my math that puts us at 2 workers, similar to what we do. But it seems nothing tunes rpc_workers (though I guess that doesn't matter in an OVN situation?)
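Spelling out that arithmetic (my reading of the linked devstack snippet, not the devstack code itself):

```python
def api_workers(nproc):
    # At least 2 API workers, otherwise a quarter of the available
    # CPUs, as read from the devstack defaults linked above.
    return max(2, nproc // 4)


# 8-vCPU upstream CI nodes land on the same 2-worker floor we use.
print(api_workers(8))
```

So on anything up to 11 CPUs the floor of 2 workers wins, which is why upstream CI and our AIO end up with the same worker count.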

https://docs.openstack.org/neutron/latest/admin/config-wsgi.html

However, the docs do say...

> If OVN ML2 plugin is used without any additional agents, neutron requires no worker for RPC message processing. Set both rpc_workers and rpc_state_report_workers to 0, to disable RPC workers.

Which means that doesn't matter. So upstream CI is also using 2 API workers and, I assume, not hitting this. I don't know. We do run in Kubernetes/containerized, so I'm not sure if that has any impact.

yatin (yatinkarel) wrote :

This looks quite similar to https://bugs.launchpad.net/neutron/+bug/2065701, but I am not sure it's the same (having debug logs and the OVN NB DBs would help to isolate it); for now I will tag this issue in the fix for the other one. Symptom-wise it matches, as the Atmosphere CI is also using ovn_emit_need_to_frag = true.

OpenStack Infra (hudson-openstack) wrote : Related fix proposed to neutron (master)

Related fix proposed to branch: master
Review: https://review.opendev.org/c/openstack/neutron/+/919699
