Failure running neutron-ovn-db-sync-util

Bug #1871272 reported by Sam Morrison
Affects: networking-ovn
Status: Fix Released
Importance: Undecided
Assigned to: Terry Wilson

Bug Description

When running neutron-ovn-db-sync-util we get the following error:

Unhandled error: RuntimeError: dictionary changed size during iteration
2020-04-06 23:33:38.954 3072 ERROR neutron_ovn_db_sync_util Traceback (most recent call last):
2020-04-06 23:33:38.954 3072 ERROR neutron_ovn_db_sync_util File "/root/.virtualenvs/neutron-git/bin/neutron-ovn-db-sync-util", line 10, in <module>
2020-04-06 23:33:38.954 3072 ERROR neutron_ovn_db_sync_util sys.exit(main())
2020-04-06 23:33:38.954 3072 ERROR neutron_ovn_db_sync_util File "/root/jake/networking-ovn/networking_ovn/cmd/neutron_ovn_db_sync_util.py", line 205, in main
2020-04-06 23:33:38.954 3072 ERROR neutron_ovn_db_sync_util synchronizer.do_sync()
2020-04-06 23:33:38.954 3072 ERROR neutron_ovn_db_sync_util File "/root/jake/networking-ovn/networking_ovn/ovn_db_sync.py", line 92, in do_sync
2020-04-06 23:33:38.954 3072 ERROR neutron_ovn_db_sync_util self.sync_networks_ports_and_dhcp_opts(ctx)
2020-04-06 23:33:38.954 3072 ERROR neutron_ovn_db_sync_util File "/root/jake/networking-ovn/networking_ovn/ovn_db_sync.py", line 1091, in sync_networks_ports_and_dhcp_opts
2020-04-06 23:33:38.954 3072 ERROR neutron_ovn_db_sync_util ovn_all_dhcp_options['ports_v6'])
2020-04-06 23:33:38.954 3072 ERROR neutron_ovn_db_sync_util File "/root/jake/networking-ovn/networking_ovn/ovn_db_sync.py", line 857, in _sync_port_dhcp_options
2020-04-06 23:33:38.954 3072 ERROR neutron_ovn_db_sync_util port, ip_v))
2020-04-06 23:33:38.954 3072 ERROR neutron_ovn_db_sync_util File "/root/jake/networking-ovn/networking_ovn/common/ovn_client.py", line 169, in _get_port_dhcp_options
2020-04-06 23:33:38.954 3072 ERROR neutron_ovn_db_sync_util port, ip_version)
2020-04-06 23:33:38.954 3072 ERROR neutron_ovn_db_sync_util File "/root/jake/networking-ovn/networking_ovn/common/ovn_client.py", line 133, in _get_subnet_dhcp_options_for_port
2020-04-06 23:33:38.954 3072 ERROR neutron_ovn_db_sync_util get_opts = self._nb_idl.get_subnets_dhcp_options(subnets)
2020-04-06 23:33:38.954 3072 ERROR neutron_ovn_db_sync_util File "/root/jake/networking-ovn/networking_ovn/ovsdb/impl_idl_ovn.py", line 487, in get_subnets_dhcp_options
2020-04-06 23:33:38.954 3072 ERROR neutron_ovn_db_sync_util for row in self._tables['DHCP_Options'].rows.values():
2020-04-06 23:33:38.954 3072 ERROR neutron_ovn_db_sync_util File "/root/.virtualenvs/neutron-git/lib/python3.6/_collections_abc.py", line 761, in __iter__
2020-04-06 23:33:38.954 3072 ERROR neutron_ovn_db_sync_util for key in self._mapping:
2020-04-06 23:33:38.954 3072 ERROR neutron_ovn_db_sync_util RuntimeError: dictionary changed size during iteration
2020-04-06 23:33:38.954 3072 ERROR neutron_ovn_db_sync_util
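
For reference, this is the generic CPython behaviour when a dict is resized while it is being iterated; here the sync code iterates the IDL's table rows (a dict) while, presumably, the ovsdbapp connection thread is still applying updates to it (the thread race suspected later in this report). A minimal, standalone sketch (illustration only, not OVN code, using made-up names) that reproduces the same error:

import threading
import time

# A dict playing the role of table.rows, plus a thread playing the role of
# the IDL update thread that keeps inserting rows.
rows = {i: 'row-%d' % i for i in range(10000)}

def writer():
    i = 10000
    while True:
        rows[i] = 'row-%d' % i   # resizes the dict under the reader
        i += 1
        time.sleep(0)            # yield so the iterating thread runs

threading.Thread(target=writer, daemon=True).start()

try:
    for value in rows.values():  # mirrors "for row in ...rows.values()"
        time.sleep(0)            # give the writer a chance to resize the dict
except RuntimeError as exc:
    print(exc)                   # dictionary changed size during iteration

A common mitigation is to iterate over a snapshot, e.g. list(rows.values()), rather than the live mapping.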

Revision history for this message
Sam Morrison (sorrison) wrote :

I should mention we are using the Stein version.

Revision history for this message
Jakub Libosvar (libosvar) wrote :

I'm no db sync expert, but could it be that a subnet was created while the sync was running? Was the API shut down during the migration?

Revision history for this message
Sam Morrison (sorrison) wrote :

This is running on a snapshot of our production neutron server that is read-only for this purpose, so nothing is changing on the neutron side.

Revision history for this message
Daniel Alvarez (dalvarezs) wrote :

Hi Sam, do you always see this around the DHCP_Options sync?

IIUC from your email [0] it always ends up failing, so I'm wondering if that's always around this code.

[0] http://lists.openstack.org/pipermail/openstack-discuss/2020-April/013946.html

Revision history for this message
Jake Yip (waipengyip) wrote :

Hi Daniel,

Answering for Sam since we work together...

Yes, `RuntimeError: dictionary changed size during iteration` is seen while doing sync_networks_ports_and_dhcp_opts().

In addition, if we skip this sync, other syncs like _sync_acls_port_groups() fail too, with different errors (posted at the end of this comment).

I'm not sure if these two errors are symptoms of the same cause or totally different issues.

I've been trying to trace the code, but it lost me when it got into threading. Could these errors be due to a race condition between threads?

Error below:

2020-04-14 05:55:56.852 26321 CRITICAL neutron_ovn_db_sync_util [req-b9f552ef-b842-46d7-bf6a-ab9242634dbf - - - - -] Unhandled error: RuntimeError: ACL (to-lport, 1002, outport == @pg_0fefc4f9_772e_42f3_b69c_8a243744f7c0 && ip4 && ip4.src == 0.0.0.0/0 && tcp && tcp.dst == 22) already exists
2020-04-14 05:55:56.852 26321 ERROR neutron_ovn_db_sync_util Traceback (most recent call last):
2020-04-14 05:55:56.852 26321 ERROR neutron_ovn_db_sync_util File "/usr/lib/python3/dist-packages/ovsdbapp/api.py", line 104, in transaction
2020-04-14 05:55:56.852 26321 ERROR neutron_ovn_db_sync_util yield self._nested_txns_map[cur_thread_id]
2020-04-14 05:55:56.852 26321 ERROR neutron_ovn_db_sync_util KeyError: 140466353522496
2020-04-14 05:55:56.852 26321 ERROR neutron_ovn_db_sync_util
2020-04-14 05:55:56.852 26321 ERROR neutron_ovn_db_sync_util During handling of the above exception, another exception occurred:
2020-04-14 05:55:56.852 26321 ERROR neutron_ovn_db_sync_util
2020-04-14 05:55:56.852 26321 ERROR neutron_ovn_db_sync_util Traceback (most recent call last):
2020-04-14 05:55:56.852 26321 ERROR neutron_ovn_db_sync_util File "/usr/bin/neutron-ovn-db-sync-util", line 10, in <module>
2020-04-14 05:55:56.852 26321 ERROR neutron_ovn_db_sync_util sys.exit(main())
2020-04-14 05:55:56.852 26321 ERROR neutron_ovn_db_sync_util File "/usr/lib/python3/dist-packages/networking_ovn/cmd/neutron_ovn_db_sync_util.py", line 205, in main
2020-04-14 05:55:56.852 26321 ERROR neutron_ovn_db_sync_util synchronizer.do_sync()
2020-04-14 05:55:56.852 26321 ERROR neutron_ovn_db_sync_util File "/usr/lib/python3/dist-packages/networking_ovn/ovn_db_sync.py", line 94, in do_sync
2020-04-14 05:55:56.852 26321 ERROR neutron_ovn_db_sync_util self.sync_acls(ctx)
2020-04-14 05:55:56.852 26321 ERROR neutron_ovn_db_sync_util File "/usr/lib/python3/dist-packages/networking_ovn/ovn_db_sync.py", line 456, in sync_acls
2020-04-14 05:55:56.852 26321 ERROR neutron_ovn_db_sync_util self._sync_acls_port_groups(ctx)
2020-04-14 05:55:56.852 26321 ERROR neutron_ovn_db_sync_util File "/usr/lib/python3/dist-packages/networking_ovn/ovn_db_sync.py", line 359, in _sync_acls_port_groups
2020-04-14 05:55:56.852 26321 ERROR neutron_ovn_db_sync_util txn.add(self.ovn_api.pg_acl_add(**acla))
2020-04-14 05:55:56.852 26321 ERROR neutron_ovn_db_sync_util File "/usr/lib/python3.6/contextlib.py", line 88, in __exit__
2020-04-14 05:55:56.852 26321 ERROR neutron_ovn_db_sync_util next(self.gen)
2020-04-14 05:55:56.852 26321 ERROR neutron_ovn_db_sync_util File "/usr/lib/python3/dist-packages/networking_ovn/ovsdb/impl_idl_...

Revision history for this message
Jake Yip (waipengyip) wrote :

Please ignore the previous message; I found out that error was due to a duplicate rule in a security group.

Revision history for this message
Jake Yip (waipengyip) wrote :

As discussed on IRC, I am reverting to clean upstream code without our optimisations, and we'll start from there.

So first of all, without changes, the port-group sync tries to add all 53,605 port groups in one transaction, but OVSDB doesn't like it. Logs attached.

Revision history for this message
Jake Yip (waipengyip) wrote :

We have around

~250 networks/subnets
~11,000 ports
~53,000 security groups
~200,000 security group rules

Revision history for this message
Daniel Alvarez (dalvarezs) wrote :

Hm... it looks like the server is refusing to commit, as per "OVSDB transaction returned TRY_AGAIN".

@Jake, can you check the ovsdb-server logs for the NB database? You may want to enable debug logging as well:

ovs-appctl -t <unix_socket_path_for_NB_db> vlog/set dbg

and then set it back to info :)

Revision history for this message
Jake Yip (waipengyip) wrote :

Thanks for your help! Here's the output

May 04 18:38:36 jake-ovn-train ovsdb-server[4573]: ovs|01952|poll_loop|DBG|wakeup due to 53-ms timeout at ../ovsdb/raft.c:1934 (0% CPU usage)
May 04 18:38:36 jake-ovn-train ovsdb-server[4573]: ovs|01953|raft|DBG|raft_is_connected: true
May 04 18:38:36 jake-ovn-train ovsdb-server[4573]: ovs|01954|poll_loop|DBG|wakeup due to 332-ms timeout at ../ovsdb/raft.c:1934 (0% CPU usage)
May 04 18:38:36 jake-ovn-train ovsdb-server[4573]: ovs|01955|raft|DBG|raft_is_connected: true
May 04 18:38:36 jake-ovn-train ovsdb-server[4573]: ovs|01956|poll_loop|DBG|wakeup due to 334-ms timeout at ../ovsdb/raft.c:1934 (0% CPU usage)
May 04 18:38:36 jake-ovn-train ovsdb-server[4573]: ovs|01957|raft|DBG|raft_is_connected: true
May 04 18:38:36 jake-ovn-train ovsdb-server[4573]: ovs|01958|poll_loop|DBG|wakeup due to [POLLIN] on fd 18 (0.0.0.0:6641<->) at ../lib/stream-fd.c:274 (0% CPU usage)
May 04 18:38:36 jake-ovn-train ovsdb-server[4573]: ovs|01959|reconnect|DBG|tcp:127.0.0.1:51024: connecting...
May 04 18:38:36 jake-ovn-train ovsdb-server[4573]: ovs|01960|reconnect|DBG|tcp:127.0.0.1:51024: entering CONNECTING
May 04 18:38:36 jake-ovn-train ovsdb-server[4573]: ovs|01961|reconnect|DBG|tcp:127.0.0.1:51024: connected
May 04 18:38:36 jake-ovn-train ovsdb-server[4573]: ovs|01962|reconnect|DBG|tcp:127.0.0.1:51024: entering ACTIVE
May 04 18:38:36 jake-ovn-train ovsdb-server[4573]: ovs|01963|raft|DBG|raft_is_connected: true
May 04 18:38:37 jake-ovn-train ovsdb-server[4573]: ovs|01964|poll_loop|DBG|wakeup due to 301-ms timeout at ../ovsdb/raft.c:1934 (0% CPU usage)
May 04 18:38:37 jake-ovn-train ovsdb-server[4573]: ovs|01965|raft|DBG|raft_is_connected: true
May 04 18:38:37 jake-ovn-train ovsdb-server[4573]: ovs|01966|poll_loop|DBG|wakeup due to 334-ms timeout at ../ovsdb/raft.c:1934 (0% CPU usage)

This happened at the time of TRY_AGAIN.

2020-05-04 18:38:34.785 5935 DEBUG ovsdbapp.backend.ovs_idl.transaction [-] Running txn n=182 command(idx=53602): PgAddCommand(name=pg_32c23e96_5bd9_4bb7_ae54_a9e1192c5e90, may_exist=False, columns={'acls': [], 'external_ids': {'neutron:security_group_id': '32c23e96-5bd9-4bb7-ae54-a9e1192c5e90'}}) do_commit /root/.virtualenvs/neutron/lib/python3.6/site-packages/ovsdbapp/backend/ovs_idl/transaction.py:87
2020-05-04 18:38:34.786 5935 DEBUG ovsdbapp.backend.ovs_idl.transaction [-] Running txn n=182 command(idx=53603): PgAddCommand(name=pg_aea9eebf_8cce_4b15_b098_7579ca923a8f, may_exist=False, columns={'acls': [], 'external_ids': {'neutron:security_group_id': 'aea9eebf-8cce-4b15-b098-7579ca923a8f'}}) do_commit /root/.virtualenvs/neutron/lib/python3.6/site-packages/ovsdbapp/backend/ovs_idl/transaction.py:87
2020-05-04 18:38:34.786 5935 DEBUG ovsdbapp.backend.ovs_idl.transaction [-] Running txn n=182 command(idx=53604): PgAddCommand(name=pg_d81957eb_e262_4e43_8318_0fac29e01eae, may_exist=False, columns={'acls': [], 'external_ids': {'neutron:security_group_id': 'd81957eb-e262-4e43-8318-0fac29e01eae'}}) do_commit /root/.virtualenvs/neutron/lib/python3.6/site-packages/ovsdbapp/backend/ovs_idl/transaction.py:87
2020-05-04 18:38:36.860 5935 DEBUG ovsdbapp.backend.ovs_idl.transaction [-] OVSDB transaction returned...

Revision history for this message
Daniel Alvarez (dalvarezs) wrote :

I'm not very familiar with the OVSDB server in A/A config (RAFT), so I'm not sure whether it could be the source of the issue. I'd recommend trying with a non-clustered setup first to isolate things.

To avoid those probe-interval reconnections, you can bump ovn-remote-probe-interval to 180000.

Anyways, just to recap from the IRC conversation:

You have a patch that breaks the transactions down into chunks, and with it you're no longer getting the TRY_AGAIN error. You may post this patch soon.

However, with that patch, the error that you always see is 'RuntimeError: dictionary changed size during iteration', around different parts of the code.

Can you confirm that I understood the issue properly?
Thanks!

Revision history for this message
Jake Yip (waipengyip) wrote :

Hi Daniel,

Thanks so much for your help so far.

I've done a run with a single (non-clustered) DB. The reconnects still happen; I found out that the reconnects are for the neutron-ovn-db-sync-util client:

root@jake-ovn:/var/run/openvswitch# netstat -alntp | grep 6641
tcp 0 0 0.0.0.0:6641 0.0.0.0:* LISTEN 5981/ovsdb-server
tcp 0 0 127.0.0.1:53686 127.0.0.1:6641 ESTABLISHED 6092/neutron-ovn-db
tcp 0 0 127.0.0.1:6641 127.0.0.1:53686 ESTABLISHED 5981/ovsdb-server
root@jake-ovn:/var/run/openvswitch#

In `journalctl -u ovn-central.service -f`

May 04 20:59:51 jake-ovn ovsdb-server[5981]: ovs|00009|reconnect|ERR|tcp:127.0.0.1:53686: no response to inactivity probe after 5 seconds, disconnecting

I will try to figure out how to bump ovn-remote-probe-interval.

Our patch doesn't fix the TRY_AGAIN error for good, but it helps a lot. For example, by committing 50 commands at a time, we can move the sync along (each transaction may get TRY_AGAIN a few times but eventually succeeds).
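
For illustration only (this is not the actual patch), the batching idea with the ovsdbapp API looks roughly like the sketch below; nb_api, security_groups and add_port_groups_in_chunks are made-up names, while pg_add, transaction() and the 'pg_<uuid>' naming mirror what appears in the logs above.

# Sketch: commit port-group additions in small chunks rather than one huge
# transaction. Illustrative only; not the real sync-util code.
CHUNK_SIZE = 50

def add_port_groups_in_chunks(nb_api, security_groups, chunk_size=CHUNK_SIZE):
    for start in range(0, len(security_groups), chunk_size):
        chunk = security_groups[start:start + chunk_size]
        # One small transaction per chunk, so a TRY_AGAIN retries ~50
        # commands instead of a single ~53,000-command transaction.
        with nb_api.transaction(check_error=True) as txn:
            for sg in chunk:
                pg_name = 'pg_' + sg['id'].replace('-', '_')
                txn.add(nb_api.pg_add(
                    name=pg_name, acls=[],
                    external_ids={'neutron:security_group_id': sg['id']}))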

I will run another sync with our patch and dump the errors that we get here.

Revision history for this message
Jake Yip (waipengyip) wrote :

So with our patch (partial example at https://review.opendev.org/#/c/725474/), we can get the sync to progress.

This, however, throws up a host of different errors [1][2][3]. We are able to get around the errors by running the sync in a loop, progressing a bit further each time.

I'm not sure if our patch is any good; I would really appreciate it if someone could take a look.

[1] http://paste.openstack.org/show/kKAtTq2SPV4RMSp4KkGd/
[2] http://paste.openstack.org/show/4cmJ7kh47AeOUYFTEWsj/
[3] http://paste.openstack.org/show/kPvktZUYRHzLv6SBTj2I/

Revision history for this message
Jake Yip (waipengyip) wrote :

We have a test database dump (neutron ussuri) up at https://swift.rc.nectar.org.au/v1/AUTH_f42f9588576c43969760d81384b83b1f/jake-public/neutron-ussuri.sql.gz

This is fake data, but similar in size to our production data. Testing with this gives the failures mentioned, so I think it'll be helpful as a reference dataset.

Revision history for this message
Terry Wilson (otherwiseguy) wrote :

When I use the test db dump on devstack ussuri, I don't get the errors above. I did run into some other issues past the point where you were failing, which I'll post a patch for tomorrow after chatting with Lucas. I'll need to switch to my real test machine: this one has 8 GB of RAM, and after about 3 hours the process was killed for running out of memory.

Revision history for this message
Terry Wilson (otherwiseguy) wrote :

This was fixed by adding custom indexing to python-ovs and adding support for it to ovsdbapp in https://review.opendev.org/c/openstack/ovsdbapp/+/678076.
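
For context, the rough idea behind the indexing fix, in plain Python (conceptual sketch only, not the actual python-ovs/ovsdbapp API; the 'subnet_id' external_ids key and function names are assumptions):

# Build a lookup table once from a snapshot of the rows, keyed by the value
# being queried, so each query is a dict lookup instead of a scan over the
# live table.rows mapping.
def build_subnet_index(dhcp_options_rows):
    index = {}
    for row in list(dhcp_options_rows.values()):       # iterate a stable snapshot
        subnet_id = row.external_ids.get('subnet_id')  # assumed key name
        if subnet_id:
            index.setdefault(subnet_id, []).append(row)
    return index

def get_subnets_dhcp_options_from_index(index, subnet_ids):
    # Index-based equivalent of the per-call scan in get_subnets_dhcp_options.
    result = []
    for subnet_id in subnet_ids:
        result.extend(index.get(subnet_id, []))
    return result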

Changed in networking-ovn:
assignee: nobody → Terry Wilson (otherwiseguy)
status: New → Fix Released