Failure running neutron-ovn-db-sync-util

Bug #1871272 reported by Sam Morrison
Affects: networking-ovn
Status: Fix Released
Importance: Undecided
Assigned to: Terry Wilson

Bug Description

When running neutron-ovn-db-sync-util we get the following error:

Unhandled error: RuntimeError: dictionary changed size during iteration
2020-04-06 23:33:38.954 3072 ERROR neutron_ovn_db_sync_util Traceback (most recent call last):
2020-04-06 23:33:38.954 3072 ERROR neutron_ovn_db_sync_util File "/root/.virtualenvs/neutron-git/bin/neutron-ovn-db-sync-util", line 10, in <module>
2020-04-06 23:33:38.954 3072 ERROR neutron_ovn_db_sync_util sys.exit(main())
2020-04-06 23:33:38.954 3072 ERROR neutron_ovn_db_sync_util File "/root/jake/networking-ovn/networking_ovn/cmd/neutron_ovn_db_sync_util.py", line 205, in main
2020-04-06 23:33:38.954 3072 ERROR neutron_ovn_db_sync_util synchronizer.do_sync()
2020-04-06 23:33:38.954 3072 ERROR neutron_ovn_db_sync_util File "/root/jake/networking-ovn/networking_ovn/ovn_db_sync.py", line 92, in do_sync
2020-04-06 23:33:38.954 3072 ERROR neutron_ovn_db_sync_util self.sync_networks_ports_and_dhcp_opts(ctx)
2020-04-06 23:33:38.954 3072 ERROR neutron_ovn_db_sync_util File "/root/jake/networking-ovn/networking_ovn/ovn_db_sync.py", line 1091, in sync_networks_ports_and_dhcp_opts
2020-04-06 23:33:38.954 3072 ERROR neutron_ovn_db_sync_util ovn_all_dhcp_options['ports_v6'])
2020-04-06 23:33:38.954 3072 ERROR neutron_ovn_db_sync_util File "/root/jake/networking-ovn/networking_ovn/ovn_db_sync.py", line 857, in _sync_port_dhcp_options
2020-04-06 23:33:38.954 3072 ERROR neutron_ovn_db_sync_util port, ip_v))
2020-04-06 23:33:38.954 3072 ERROR neutron_ovn_db_sync_util File "/root/jake/networking-ovn/networking_ovn/common/ovn_client.py", line 169, in _get_port_dhcp_options
2020-04-06 23:33:38.954 3072 ERROR neutron_ovn_db_sync_util port, ip_version)
2020-04-06 23:33:38.954 3072 ERROR neutron_ovn_db_sync_util File "/root/jake/networking-ovn/networking_ovn/common/ovn_client.py", line 133, in _get_subnet_dhcp_options_for_port
2020-04-06 23:33:38.954 3072 ERROR neutron_ovn_db_sync_util get_opts = self._nb_idl.get_subnets_dhcp_options(subnets)
2020-04-06 23:33:38.954 3072 ERROR neutron_ovn_db_sync_util File "/root/jake/networking-ovn/networking_ovn/ovsdb/impl_idl_ovn.py", line 487, in get_subnets_dhcp_options
2020-04-06 23:33:38.954 3072 ERROR neutron_ovn_db_sync_util for row in self._tables['DHCP_Options'].rows.values():
2020-04-06 23:33:38.954 3072 ERROR neutron_ovn_db_sync_util File "/root/.virtualenvs/neutron-git/lib/python3.6/_collections_abc.py", line 761, in __iter__
2020-04-06 23:33:38.954 3072 ERROR neutron_ovn_db_sync_util for key in self._mapping:
2020-04-06 23:33:38.954 3072 ERROR neutron_ovn_db_sync_util RuntimeError: dictionary changed size during iteration
2020-04-06 23:33:38.954 3072 ERROR neutron_ovn_db_sync_util
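
For reference, this is the generic CPython behaviour when a dict is resized while it is being iterated; here the sync code iterates the IDL's table rows (a dict) while, presumably, the ovsdbapp connection thread is still applying updates to it (the thread race suspected later in this report). A minimal, standalone sketch (illustration only, not OVN code, using made-up names) that reproduces the same error:

import threading
import time

# A dict playing the role of table.rows, plus a thread playing the role of
# the IDL update thread that keeps inserting rows.
rows = {i: 'row-%d' % i for i in range(10000)}

def writer():
    i = 10000
    while True:
        rows[i] = 'row-%d' % i   # resizes the dict under the reader
        i += 1
        time.sleep(0)            # yield so the iterating thread runs

threading.Thread(target=writer, daemon=True).start()

try:
    for value in rows.values():  # mirrors "for row in ...rows.values()"
        time.sleep(0)            # give the writer a chance to resize the dict
except RuntimeError as exc:
    print(exc)                   # dictionary changed size during iteration

A common mitigation is to iterate over a snapshot, e.g. list(rows.values()), rather than the live mapping.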

Revision history for this message
Sam Morrison (sorrison) wrote :

I should mention we are using the Stein version.

Revision history for this message
Jakub Libosvar (libosvar) wrote :

I'm no db sync expert, but could it be that a subnet was created while the sync was running? Was the API shut down during the migration?

Revision history for this message
Sam Morrison (sorrison) wrote :

This is running on a snapshot of our production neutron server that is read-only for this purpose, so nothing is changing on the neutron side.

Revision history for this message
Daniel Alvarez (dalvarezs) wrote :

Hi Sam, do you always see this around the DHCP_Options sync?

IIUC from your email [0] it always ends up failing, so I'm wondering if that's always around this code.

[0] http://lists.openstack.org/pipermail/openstack-discuss/2020-April/013946.html

Revision history for this message
Jake Yip (waipengyip) wrote :

Hi Daniel,

Answering for Sam since we work together...

Yes, `RuntimeError: dictionary changed size during iteration` is seen while doing sync_networks_ports_and_dhcp_opts().

In addition, if we skip this sync, other syncs like _sync_acls_port_groups() fail too, with different errors (posted at the end of this comment).

I'm not sure if these two errors are symptoms of the same cause or totally different issues.

I've been trying to trace the code, but it lost me when it got into threading. Could these errors be due to a race condition between threads?

Error below:

2020-04-14 05:55:56.852 26321 CRITICAL neutron_ovn_db_sync_util [req-b9f552ef-b842-46d7-bf6a-ab9242634dbf - - - - -] Unhandled error: RuntimeError: ACL (to-lport, 1002, outport == @pg_0fefc4f9_772e_42f3_b69c_8a243744f7c0 && ip4 && ip4.src == 0.0.0.0/0 && tcp && tcp.dst == 22) already exists
2020-04-14 05:55:56.852 26321 ERROR neutron_ovn_db_sync_util Traceback (most recent call last):
2020-04-14 05:55:56.852 26321 ERROR neutron_ovn_db_sync_util File "/usr/lib/python3/dist-packages/ovsdbapp/api.py", line 104, in transaction
2020-04-14 05:55:56.852 26321 ERROR neutron_ovn_db_sync_util yield self._nested_txns_map[cur_thread_id]
2020-04-14 05:55:56.852 26321 ERROR neutron_ovn_db_sync_util KeyError: 140466353522496
2020-04-14 05:55:56.852 26321 ERROR neutron_ovn_db_sync_util
2020-04-14 05:55:56.852 26321 ERROR neutron_ovn_db_sync_util During handling of the above exception, another exception occurred:
2020-04-14 05:55:56.852 26321 ERROR neutron_ovn_db_sync_util
2020-04-14 05:55:56.852 26321 ERROR neutron_ovn_db_sync_util Traceback (most recent call last):
2020-04-14 05:55:56.852 26321 ERROR neutron_ovn_db_sync_util File "/usr/bin/neutron-ovn-db-sync-util", line 10, in <module>
2020-04-14 05:55:56.852 26321 ERROR neutron_ovn_db_sync_util sys.exit(main())
2020-04-14 05:55:56.852 26321 ERROR neutron_ovn_db_sync_util File "/usr/lib/python3/dist-packages/networking_ovn/cmd/neutron_ovn_db_sync_util.py", line 205, in main
2020-04-14 05:55:56.852 26321 ERROR neutron_ovn_db_sync_util synchronizer.do_sync()
2020-04-14 05:55:56.852 26321 ERROR neutron_ovn_db_sync_util File "/usr/lib/python3/dist-packages/networking_ovn/ovn_db_sync.py", line 94, in do_sync
2020-04-14 05:55:56.852 26321 ERROR neutron_ovn_db_sync_util self.sync_acls(ctx)
2020-04-14 05:55:56.852 26321 ERROR neutron_ovn_db_sync_util File "/usr/lib/python3/dist-packages/networking_ovn/ovn_db_sync.py", line 456, in sync_acls
2020-04-14 05:55:56.852 26321 ERROR neutron_ovn_db_sync_util self._sync_acls_port_groups(ctx)
2020-04-14 05:55:56.852 26321 ERROR neutron_ovn_db_sync_util File "/usr/lib/python3/dist-packages/networking_ovn/ovn_db_sync.py", line 359, in _sync_acls_port_groups
2020-04-14 05:55:56.852 26321 ERROR neutron_ovn_db_sync_util txn.add(self.ovn_api.pg_acl_add(**acla))
2020-04-14 05:55:56.852 26321 ERROR neutron_ovn_db_sync_util File "/usr/lib/python3.6/contextlib.py", line 88, in __exit__
2020-04-14 05:55:56.852 26321 ERROR neutron_ovn_db_sync_util next(self.gen)
2020-04-14 05:55:56.852 26321 ERROR neutron_ovn_db_sync_util File "/usr/lib/python3/dist-packages/networking_ovn/ovsdb/impl_idl_...

Revision history for this message
Jake Yip (waipengyip) wrote :

Please ignore the previous message; I found out that error was due to a duplicate rule in a security group.

Revision history for this message
Jake Yip (waipengyip) wrote :

As discussed on IRC, I am reverting to clean upstream code without our optimisations, and we'll start from there.

So first of all, without changes, the port-group sync tries to add all 53,605 port groups in one transaction, but OVSDB doesn't like it. Logs attached.

Revision history for this message
Jake Yip (waipengyip) wrote :

We have around

~250 networks/subnets
~11,000 ports
~53,000 security groups
~200,000 security group rules

Revision history for this message
Daniel Alvarez (dalvarezs) wrote :

Hm... it looks like the server is refusing to commit, as per "OVSDB transaction returned TRY_AGAIN".

@Jake, can you check the ovsdb-server logs for the NB database? You may want to enable debug logging as well:

ovs-appctl -t <unix_socket_path_for_NB_db> vlog/set dbg

and then set it back to info :)

Revision history for this message
Jake Yip (waipengyip) wrote :

Thanks for your help! Here's the output

May 04 18:38:36 jake-ovn-train ovsdb-server[4573]: ovs|01952|poll_loop|DBG|wakeup due to 53-ms timeout at ../ovsdb/raft.c:1934 (0% CPU usage)
May 04 18:38:36 jake-ovn-train ovsdb-server[4573]: ovs|01953|raft|DBG|raft_is_connected: true
May 04 18:38:36 jake-ovn-train ovsdb-server[4573]: ovs|01954|poll_loop|DBG|wakeup due to 332-ms timeout at ../ovsdb/raft.c:1934 (0% CPU usage)
May 04 18:38:36 jake-ovn-train ovsdb-server[4573]: ovs|01955|raft|DBG|raft_is_connected: true
May 04 18:38:36 jake-ovn-train ovsdb-server[4573]: ovs|01956|poll_loop|DBG|wakeup due to 334-ms timeout at ../ovsdb/raft.c:1934 (0% CPU usage)
May 04 18:38:36 jake-ovn-train ovsdb-server[4573]: ovs|01957|raft|DBG|raft_is_connected: true
May 04 18:38:36 jake-ovn-train ovsdb-server[4573]: ovs|01958|poll_loop|DBG|wakeup due to [POLLIN] on fd 18 (0.0.0.0:6641<->) at ../lib/stream-fd.c:274 (0% CPU usage)
May 04 18:38:36 jake-ovn-train ovsdb-server[4573]: ovs|01959|reconnect|DBG|tcp:127.0.0.1:51024: connecting...
May 04 18:38:36 jake-ovn-train ovsdb-server[4573]: ovs|01960|reconnect|DBG|tcp:127.0.0.1:51024: entering CONNECTING
May 04 18:38:36 jake-ovn-train ovsdb-server[4573]: ovs|01961|reconnect|DBG|tcp:127.0.0.1:51024: connected
May 04 18:38:36 jake-ovn-train ovsdb-server[4573]: ovs|01962|reconnect|DBG|tcp:127.0.0.1:51024: entering ACTIVE
May 04 18:38:36 jake-ovn-train ovsdb-server[4573]: ovs|01963|raft|DBG|raft_is_connected: true
May 04 18:38:37 jake-ovn-train ovsdb-server[4573]: ovs|01964|poll_loop|DBG|wakeup due to 301-ms timeout at ../ovsdb/raft.c:1934 (0% CPU usage)
May 04 18:38:37 jake-ovn-train ovsdb-server[4573]: ovs|01965|raft|DBG|raft_is_connected: true
May 04 18:38:37 jake-ovn-train ovsdb-server[4573]: ovs|01966|poll_loop|DBG|wakeup due to 334-ms timeout at ../ovsdb/raft.c:1934 (0% CPU usage)

This happened at the time of TRY_AGAIN.

2020-05-04 18:38:34.785 5935 DEBUG ovsdbapp.backend.ovs_idl.transaction [-] Running txn n=182 command(idx=53602): PgAddCommand(name=pg_32c23e96_5bd9_4bb7_ae54_a9e1192c5e90, may_exist=False, columns={'acls': [], 'external_ids': {'neutron:security_group_id': '32c23e96-5bd9-4bb7-ae54-a9e1192c5e90'}}) do_commit /root/.virtualenvs/neutron/lib/python3.6/site-packages/ovsdbapp/backend/ovs_idl/transaction.py:87
2020-05-04 18:38:34.786 5935 DEBUG ovsdbapp.backend.ovs_idl.transaction [-] Running txn n=182 command(idx=53603): PgAddCommand(name=pg_aea9eebf_8cce_4b15_b098_7579ca923a8f, may_exist=False, columns={'acls': [], 'external_ids': {'neutron:security_group_id': 'aea9eebf-8cce-4b15-b098-7579ca923a8f'}}) do_commit /root/.virtualenvs/neutron/lib/python3.6/site-packages/ovsdbapp/backend/ovs_idl/transaction.py:87
2020-05-04 18:38:34.786 5935 DEBUG ovsdbapp.backend.ovs_idl.transaction [-] Running txn n=182 command(idx=53604): PgAddCommand(name=pg_d81957eb_e262_4e43_8318_0fac29e01eae, may_exist=False, columns={'acls': [], 'external_ids': {'neutron:security_group_id': 'd81957eb-e262-4e43-8318-0fac29e01eae'}}) do_commit /root/.virtualenvs/neutron/lib/python3.6/site-packages/ovsdbapp/backend/ovs_idl/transaction.py:87
2020-05-04 18:38:36.860 5935 DEBUG ovsdbapp.backend.ovs_idl.transaction [-] OVSDB transaction returned...

Revision history for this message
Daniel Alvarez (dalvarezs) wrote :

I'm not very familiar with the OVSDB server in A/A config (RAFT), so I'm not sure whether it could be the source of the issue. I'd recommend trying with a non-clustered setup first to isolate things.

To avoid those probe-interval reconnections, you can bump ovn-remote-probe-interval to 180000.

Anyways, just to recap from the IRC conversation:

You have a patch that breaks the transactions down into chunks, and with it you're no longer getting the TRY_AGAIN error. You may post this patch soon.

However, with that patch, the error that you always see is 'RuntimeError: dictionary changed size during iteration', around different parts of the code.

Can you confirm that I understood the issue properly?
Thanks!

Revision history for this message
Jake Yip (waipengyip) wrote :

Hi Daniel,

Thanks so much for your help so far.

I've done a run with a single (non-clustered) DB. The reconnects still happen; I found out that the reconnects are for the neutron-ovn-db-sync-util client:

root@jake-ovn:/var/run/openvswitch# netstat -alntp | grep 6641
tcp 0 0 0.0.0.0:6641 0.0.0.0:* LISTEN 5981/ovsdb-server
tcp 0 0 127.0.0.1:53686 127.0.0.1:6641 ESTABLISHED 6092/neutron-ovn-db
tcp 0 0 127.0.0.1:6641 127.0.0.1:53686 ESTABLISHED 5981/ovsdb-server
root@jake-ovn:/var/run/openvswitch#

In `journalctl -u ovn-central.service -f`

May 04 20:59:51 jake-ovn ovsdb-server[5981]: ovs|00009|reconnect|ERR|tcp:127.0.0.1:53686: no response to inactivity probe after 5 seconds, disconnecting

I will try to figure out how to bump ovn-remote-probe-interval.

Our patch doesn't fix the TRY_AGAIN error for good, but it helps a lot. For example, by committing 50 commands at a time, we can move the sync along (each transaction may get TRY_AGAIN a few times but eventually succeeds).
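
For illustration only (this is not the actual patch), the batching idea with the ovsdbapp API looks roughly like the sketch below; nb_api, security_groups and add_port_groups_in_chunks are made-up names, while pg_add, transaction() and the 'pg_<uuid>' naming mirror what appears in the logs above.

# Sketch: commit port-group additions in small chunks rather than one huge
# transaction. Illustrative only; not the real sync-util code.
CHUNK_SIZE = 50

def add_port_groups_in_chunks(nb_api, security_groups, chunk_size=CHUNK_SIZE):
    for start in range(0, len(security_groups), chunk_size):
        chunk = security_groups[start:start + chunk_size]
        # One small transaction per chunk, so a TRY_AGAIN retries ~50
        # commands instead of a single ~53,000-command transaction.
        with nb_api.transaction(check_error=True) as txn:
            for sg in chunk:
                pg_name = 'pg_' + sg['id'].replace('-', '_')
                txn.add(nb_api.pg_add(
                    name=pg_name, acls=[],
                    external_ids={'neutron:security_group_id': sg['id']}))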

I will run another sync with our patch and dump the errors that we get here.

Revision history for this message
Jake Yip (waipengyip) wrote :

So with our patch (partial example at https://review.opendev.org/#/c/725474/), we can get the sync to progress.

This, however, throws up a host of different errors [1][2][3]. We are able to get around the errors by running the sync in a loop, progressing a bit further each time.

I'm not sure if our patch is any good; I would really appreciate it if someone could take a look.

[1] http://paste.openstack.org/show/kKAtTq2SPV4RMSp4KkGd/
[2] http://paste.openstack.org/show/4cmJ7kh47AeOUYFTEWsj/
[3] http://paste.openstack.org/show/kPvktZUYRHzLv6SBTj2I/

Revision history for this message
Jake Yip (waipengyip) wrote :

We have a test database dump (neutron ussuri) up at https://swift.rc.nectar.org.au/v1/AUTH_f42f9588576c43969760d81384b83b1f/jake-public/neutron-ussuri.sql.gz

This is fake data, but similar in size to our production data. Testing with this gives the failures mentioned, so I think it'll be helpful as a reference dataset.

Revision history for this message
Terry Wilson (otherwiseguy) wrote :

When I use the test db dump on devstack ussuri, I don't get the errors above. I did run into some other issues past the point where you were failing, which I'll post a patch for tomorrow after chatting with Lucas. I'll need to switch to my real test machine: this one has 8 GB of RAM, and after about 3 hours the process was killed for running out of memory.

Revision history for this message
Terry Wilson (otherwiseguy) wrote :

This was fixed by adding custom indexing to python-ovs and adding support for it to ovsdbapp in https://review.opendev.org/c/openstack/ovsdbapp/+/678076.
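
For context, the rough idea behind the indexing fix, in plain Python (conceptual sketch only, not the actual python-ovs/ovsdbapp API; the 'subnet_id' external_ids key and function names are assumptions):

# Build a lookup table once from a snapshot of the rows, keyed by the value
# being queried, so each query is a dict lookup instead of a scan over the
# live table.rows mapping.
def build_subnet_index(dhcp_options_rows):
    index = {}
    for row in list(dhcp_options_rows.values()):       # iterate a stable snapshot
        subnet_id = row.external_ids.get('subnet_id')  # assumed key name
        if subnet_id:
            index.setdefault(subnet_id, []).append(row)
    return index

def get_subnets_dhcp_options_from_index(index, subnet_ids):
    # Index-based equivalent of the per-call scan in get_subnets_dhcp_options.
    result = []
    for subnet_id in subnet_ids:
        result.extend(index.get(subnet_id, []))
    return result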

Changed in networking-ovn:
assignee: nobody → Terry Wilson (otherwiseguy)
status: New → Fix Released