Network filtering by provider attributes has a race condition with network removal

Bug #1990561 reported by Anton Kurbatov
Affects: neutron
Status: New
Importance: Undecided
Assigned to: Unassigned

Bug Description

I ran into a problem where the list of networks filtered by provider segment ID does not match the expected result.
The important precondition is that another network is being removed in parallel.

Here is a demo:

Console 1:
$ while :; do openstack network create test-net --provider-segment 200 --provider-network-type vxlan >/dev/null; openstack network delete test-net; done

Console 2:
$ for i in {0..1000}; do net=$(openstack network list --provider-segment 100); [[ -n "${net}" ]] && echo "${net}" && echo "Iter=$i" && break; done
+--------------------------------------+----------+---------+
| ID                                   | Name     | Subnets |
+--------------------------------------+----------+---------+
| 64ccd339-c669-4b8b-9d11-758e98295955 | test-net |         |
+--------------------------------------+----------+---------+
Iter=81
$

The listing filters on segment 100 while the only network ever created uses segment 200, so the list should always come back empty; at iteration 81 it nevertheless returned the network. The neutron-server log contains this message:

2022-09-22 20:13:15.706 25 DEBUG neutron.plugins.ml2.managers [None req-4c379e00-4794-4625-afe7-64643aa801cf 4f5e975fb1044192a4930fd01ca7d9d7 1958e62e718f468299ae302a12364c08 - default default] Network 64ccd339-c669-4b8b-9d11-758e98295955 has no segments extend_network_with_provider_segments /usr/lib64/python3.6/site-packages/neutron/plugins/ml2/managers.py:169

So, it looks like there is a race condition.
OS version: Xena
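
Below is a minimal toy model of the suspected interleaving. It is plain Python, not Neutron code, and every name in it is invented for illustration: the listing side reads the network rows first and the segment rows second, and a delete that lands between those two reads leaves the listed network with no provider attributes, so a naive provider filter no longer excludes it.

import threading
import time

# Toy "tables": one network whose only segment has segmentation_id 200.
networks = {"net-1": {"name": "test-net"}}
segments = {"net-1": {"segmentation_id": 200}}

def list_networks(wanted_segment_id):
    # Step 1: read the network rows.
    nets = [dict(id=net_id, **attrs) for net_id, attrs in list(networks.items())]
    time.sleep(0.2)  # window between the two reads
    # Step 2: read the segment rows and extend the dicts with provider attributes.
    for net in nets:
        seg = segments.get(net["id"])
        net["provider:segmentation_id"] = seg["segmentation_id"] if seg else None
    # Step 3: a naive provider filter. A network whose attribute was never
    # populated (None) slips through, which is the suspected leak.
    return [n for n in nets
            if n["provider:segmentation_id"] in (None, wanted_segment_id)]

def delete_network(net_id):
    # Remove the segment rows and then the network row.
    segments.pop(net_id, None)
    networks.pop(net_id, None)

lister = threading.Thread(target=lambda: print(list_networks(100)))
lister.start()
time.sleep(0.05)          # let the lister finish its first read
delete_network("net-1")   # the delete lands between the lister's two reads
lister.join()
# Prints the network even though it was created with segment 200, not 100.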

Bence Romsics (bence-romsics) wrote :

If I run the reproduction steps in a simplistic all-in-one devstack environment, they do not seem to trigger the bug (zero hits out of 2000 network listings; tried both on master and on stable/xena). So could you please share more about your environment? How many neutron-servers do you have? What DB backend do you use, and in what configuration? Any other details that may be relevant to reproducing this bug?

Changed in neutron:
status: New → Incomplete
Anton Kurbatov (akurbatov) wrote :

Hello Bence,
I am attaching my configs (configs.tar).
The essential points, in my opinion, are the following:

1) I am using a PostgreSQL database.

2) 10 neutron-server processes are running inside the kolla-ansible Docker container:

/usr/bin/python3 /usr/bin/neutron-server --config-file /etc/neutron/neutron.conf --config-file /etc/neutron/plugins/ml2/ml2_conf.ini --config-file /etc/neutron/neutron_vpnaas.conf

[root@node1 ~]# docker exec -tiu root neutron_server bash
(neutron-server)[root@node1 /]# ps axf | grep neutron-server | grep -v grep | wc -l
11
(neutron-server)[root@node1 /]# grep workers /etc/neutron/neutron.conf
api_workers = 4
rpc_workers = 4
metadata_workers = 1
(neutron-server)[root@node1 /]#

The bug is easy to reproduce with a small hack (and if you have 2+ neutron-server processes that can handle requests):

diff --git a/neutron/plugins/ml2/plugin.py b/neutron/plugins/ml2/plugin.py
index ea54f8f1c3..74226d112a 100644
--- a/neutron/plugins/ml2/plugin.py
+++ b/neutron/plugins/ml2/plugin.py
@@ -1211,6 +1211,9 @@ class Ml2Plugin(db_base_plugin_v2.NeutronDbPluginV2,
             for net in nets_db:
                 net_data.append(self._make_network_dict(net, context=context))

+            import time
+            time.sleep(3)
+
             self.type_manager.extend_networks_dict_provider(context, net_data)
             nets = self._filter_nets_provider(context, net_data, filters)
         return [db_utils.resource_fields(net, fields) for net in nets]
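
The sleep lands inside the get_networks() database context, after the network rows have been turned into dicts but before extend_networks_dict_provider() reads their segments, so a concurrent delete gets a three-second window between the two reads.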

And then:

[root@node1 ~]# net_id=$(openstack network create test-net --provider-segment 200 --provider-network-type vxlan -c id -f value); openstack network list --provider-segment 100 & openstack network delete $net_id
[1] 628975
[root@node1 ~]# +--------------------------------------+----------+---------+
| ID                                   | Name     | Subnets |
+--------------------------------------+----------+---------+
| 9249ed81-865e-44d2-bc22-c5d6c47786e5 | test-net |         |
+--------------------------------------+----------+---------+

[1]+ Done openstack --insecure network list --provider-segment 100
[root@node1 ~]#

Changed in neutron:
status: Incomplete → New
Bence Romsics (bence-romsics) wrote :

Hi Anton,

This is strange. You have very good reproduction information, but in my environment I still cannot reproduce the bug.

I have two devstack environments, one for master, another for stable/xena. Both are single-host environments. I changed the neutron worker settings (api, rpc, metadata) to match yours. Also, the default DB in devstack is MySQL, not PostgreSQL.

Beyond the sleep() you suggested, I even added another sleep(3) before this line:
https://opendev.org/openstack/neutron/src/commit/d847b52c4f36e17c9c360320d160a0d05330a71c/neutron/db/db_base_plugin_v2.py#L523

In the master environment I even enabled vpnaas, since it was enabled in your config.

But in none of these variations was I able to reproduce the bug.

I am running out of ideas about what could differ between your environment and mine. A few more thoughts, though:

* In a production environment you likely have neutron-server running on multiple hosts. If you turn neutron-server off on all hosts except one, can you still reproduce the bug?
* I'm no DB expert (especially not for PostgreSQL), but if you use any DB replication cluster, can you still reproduce the bug if you turn that off and use a single-host DB?

Just for later reference:

delete-side DB transaction:
https://opendev.org/openstack/neutron/src/branch/stable/xena/neutron/db/db_base_plugin_v2.py#L511-L523
segments are deleted from the NETWORK PRECOMMIT_DELETE hook:
https://opendev.org/openstack/neutron/src/branch/stable/xena/neutron/services/segments/db.py#L344-L370

list-side DB transaction:
https://opendev.org/openstack/neutron/src/branch/stable/xena/neutron/plugins/ml2/plugin.py#L1235-L1244
the debug log message in the original report comes from here:
https://opendev.org/openstack/neutron/src/branch/stable/xena/neutron/plugins/ml2/managers.py#L160-L184
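
If the race is what it looks like, a plausible interleaving given the code paths above is: the list side reads the network rows inside its transaction; the delete side removes the segments in the PRECOMMIT_DELETE hook, deletes the network row, and commits; the list side then reads the segment rows, finds none, logs the "has no segments" message from the original report, and the provider filter no longer excludes the half-deleted network. Whether the second read can observe the committed delete at all depends on the DB's transaction isolation, which may be why it does not reproduce everywhere.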

Bence Romsics (bence-romsics) wrote :

Plus one more idea (which is likely a lot of work for quite possibly little gain): if you can, replace PostgreSQL with another DB like MySQL and see whether the bug still persists...

If any of these ideas helps to make the bug reproducible, please also share the exact versions of neutron, oslo.db, and the DB engine.
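
(One possibly relevant difference between the backends: with default settings, MySQL/InnoDB uses REPEATABLE READ, so the listing transaction keeps a consistent snapshot across both reads, while PostgreSQL defaults to READ COMMITTED, where a later statement in the same transaction can already see a concurrently committed delete. That would fit the bug reproducing on PostgreSQL but not on a default devstack MySQL, though this is only a hypothesis.)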

Anton Kurbatov (akurbatov) wrote :

I will try to reproduce the issue on a devstack deployment.
