OVN metadata randomly stops working

Bug #1996594 reported by Hua Zhang
This bug affects 2 people
Affects: neutron
Status: Fix Released
Importance: Medium
Assigned to: Terry Wilson

Bug Description

We found that OVN metadata will not work randomly when OVN is writing a snapshot.

1. At 12:30:35, OVN started to transfer leadership in order to write a snapshot

$ find sosreport-juju-2752e1-*/var/log/ovn/* |xargs zgrep -i -E 'Transferring leadership'
sosreport-juju-2752e1-6-lxd-24-xxx-2022-08-18-entowko/var/log/ovn/ovsdb-server-sb.log:2022-08-18T12:30:35.322Z|80962|raft|INFO|Transferring leadership to write a snapshot.
sosreport-juju-2752e1-6-lxd-24-xxx-2022-08-18-entowko/var/log/ovn/ovsdb-server-sb.log:2022-08-18T17:52:53.024Z|82382|raft|INFO|Transferring leadership to write a snapshot.
sosreport-juju-2752e1-7-lxd-27-xxx-2022-08-18-hhxxqci/var/log/ovn/ovsdb-server-sb.log:2022-08-18T12:30:35.330Z|92698|raft|INFO|Transferring leadership to write a snapshot.

2. At 12:30:36, neutron-ovn-metadata-agent reported an 'OVSDB Error'

$ find sosreport-srv1*/var/log/neutron/* |xargs zgrep -i -E 'OVSDB Error'
sosreport-srv1xxx2d-xxx-2022-08-18-cuvkufw/var/log/neutron/neutron-ovn-metadata-agent.log:2022-08-18 12:30:36.103 75556 ERROR ovsdbapp.backend.ovs_idl.transaction [-] OVSDB Error: no error details available
sosreport-srv1xxx6d-xxx-2022-08-18-bgnovqu/var/log/neutron/neutron-ovn-metadata-agent.log:2022-08-18 12:30:36.104 2171 ERROR ovsdbapp.backend.ovs_idl.transaction [-] OVSDB Error: no error details available

3. At 12:57:53, we saw the error 'No port found in network'; from then on, OVN metadata randomly stops working

2022-08-18 12:57:53.800 3730 ERROR neutron.agent.ovn.metadata.server [-] No port found in network 63e2c276-60dd-40e3-baa1-c16342eacce2 with IP address 100.94.98.135
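Presumably, the metadata agent resolves the requesting instance by matching network and source IP against its local, in-memory view of the OVN SB database, so a stale or still-resyncing copy after a leadership transfer can yield no match. The snippet below is an illustrative sketch only (hypothetical function and data shapes, not the actual neutron code):

# Illustrative sketch only: a lookup of a port by network and IP, similar in
# spirit to what the metadata proxy does against its local view of the SB DB.
# The `ports` structure and its field names are hypothetical.
def find_port(ports, network_id, ip_address):
    """Return the first port bound to network_id that carries ip_address, else None."""
    for port in ports:
        if port["network_id"] == network_id and ip_address in port["ips"]:
            return port
    return None  # -> logged as "No port found in network ... with IP address ..."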

After the problem occurs, restarting neutron-ovn-metadata-agent or restarting the haproxy instance as follows can be used as a workaround.

/usr/bin/neutron-rootwrap /etc/neutron/rootwrap.conf ip netns exec ovnmeta-63e2c276-60dd-40e3-baa1-c16342eacce2 haproxy -f /var/lib/neutron/ovn-metadata-proxy/63e2c276-60dd-40e3-baa1-c16342eacce2.conf

LP bug #1990978 [1] is trying to reduce the frequency of leadership transfers, which should help with this problem.
However, it only reduces how often the problem occurs rather than avoiding it completely. I wonder if we need to add some retry logic on the neutron side.
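For illustration only, such retry logic might look roughly like the sketch below; lookup_port_by_ip() is a hypothetical placeholder for the metadata agent's SB DB lookup, not an existing neutron API:

import time

def retry_on_ovsdb_error(fn, retries=3, delay=0.5):
    """Call fn(), retrying a few times with a growing back-off if it raises."""
    for attempt in range(retries):
        try:
            return fn()
        except Exception:
            if attempt == retries - 1:
                raise
            time.sleep(delay * (attempt + 1))

# Hypothetical usage; lookup_port_by_ip() is a placeholder, not a real neutron call:
# port = retry_on_ovsdb_error(lambda: lookup_port_by_ip(network_id, ip_address))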

NOTE: The OpenStack version we are using is focal-xena, and the openvswitch version is 2.16.0-0ubuntu2.1~cloud0

[1] https://bugs.launchpad.net/ubuntu/+source/openvswitch/+bug/1990978

Tags: ovn
tags: added: ovn
Revision history for this message
Bernard Cafarelli (bcafarel) wrote :

Do you use OVN in clustered (RAFT) mode? OVS 2.17+ is strongly recommended in that case.

Revision history for this message
Terry Wilson (otherwiseguy) wrote :

I'll look into this.

Changed in neutron:
importance: Undecided → Medium
assignee: nobody → Terry Wilson (otherwiseguy)
Revision history for this message
Hua Zhang (zhhuabj) wrote :

Hi @bcafarel, thanks for your reply. Yes, we are using OVN RAFT mode.

Why is OVS 2.17+ strongly recommended in this case? Can you share more context?
Also, LP bug #1953591 seems to be a related issue; one comment [1] in it
mentions that this issue can also be hit with ovs 2.17.1.post1.

[1] https://bugs.launchpad.net/networking-ovn/+bug/1953591/comments/8

Revision history for this message
Hua Zhang (zhhuabj) wrote :

Hi @bcafarel, I checked all the raft-related commits between 2.16.0 and 2.17.0; the commit bf07cc9 looks suspicious.

$ git cherry v2.16.0 v2.17.0 -v |grep raft
+ 0de882954032aa37dc943bafd72c33324aa0c95a raft: Don't keep full json objects in memory if no longer needed.
+ bf07cc9cdb2f37fede8c0363937f1eb9f4cfd730 raft: Only allow followers to snapshot.

But after checking the data, it does not seem to be caused by that commit.

5-lxd-23: f801 100.94.0.99:6644 follower term=297
6-lxd-24: 0f3c 100.94.0.158:6644 follower term=297
7-lxd-27: 9b15 100.94.0.204:6644 leader term=297 leader

(old leader, 6-lxd-24, 0f3c) - sosreport-juju-2752e1-6-lxd-24-xxx-2022-08-18-entowko/var/log/ovn/ovsdb-server-sb.log
2022-08-18T17:52:53.024Z|82382|raft|INFO|Transferring leadership to write a snapshot.
2022-08-18T17:52:53.367Z|82383|raft|INFO|rejected append_reply (not leader)
2022-08-18T17:52:53.378Z|82384|raft|INFO|server 9b15 is leader for term 297

(follower, 5-lxd-23: f801) - sosreport-juju-2752e1-5-lxd-23-xxx-2022-08-18-bnsdhsj/var/log/ovn/ovsdb-server-sb.log
2022-08-18T17:52:53.379Z|32327|raft|INFO|server 9b15 is leader for term 297

(new leader, 7-lxd-27: 9b15)
$ find sosreport-juju-2752e1-*/var/log/ovn/* |xargs zgrep -i -E 'received leadership transfer' -A2 |tail -n3
sosreport-juju-2752e1-7-lxd-27-xxx-2022-08-18-hhxxqci/var/log/ovn/ovsdb-server-sb.log:2022-08-18T17:52:53.025Z|92893|raft|INFO|received leadership transfer from 0f3c in term 296
sosreport-juju-2752e1-7-lxd-27-xxx-2022-08-18-hhxxqci/var/log/ovn/ovsdb-server-sb.log:2022-08-18T17:52:53.025Z|92894|raft|INFO|term 297: starting election
sosreport-juju-2752e1-7-lxd-27-xxx-2022-08-18-hhxxqci/var/log/ovn/ovsdb-server-sb.log:2022-08-18T17:52:53.378Z|92895|raft|INFO|term 297: elected leader by 2+ of 3 servers

We see that the new leader-to-be (7-lxd-27) receives the leadership transfer, initiates the election and, immediately after, starts a snapshot that takes about 0.353 seconds (17:52:53.378 - 17:52:53.025). During this time the follower (5-lxd-23) votes for 7-lxd-27, electing it as cluster leader, but 7-lxd-27 doesn't effectively become leader until it finishes snapshotting, essentially leaving the cluster without a leader for up to 0.353 seconds. So the snapshot itself actually takes less than 0.353 seconds, which also means that the amount of data written into the snapshot is not large.

Revision history for this message
Terry Wilson (otherwiseguy) wrote :

The ovs python library does not have support for RAFT's update3/monitor_cond_since primitives before 2.17. This means that every disconnect (which, as you point out, can happen with every leadership election on leader-only connections) involves syncing all DB data to the client. This is untenable on a large production deployment. With update3/monitor_cond_since, a reconnect only has to pull the transactions that were missed while the client was disconnected.

There are several patches and fixes related to adding update3/monitor_cond_since support, and the ovs team declined to backport them. So it is an extremely bad idea to try to use RAFT with OpenStack on anything earlier than ovs 2.17.
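As a quick sanity check, one can confirm which python-ovs version the agents are actually importing; a minimal sketch, assuming the library was installed with package metadata available:

# Print the installed python-ovs version; monitor_cond_since support is only
# available with 2.17+, older versions re-sync the full SB DB on every reconnect.
from importlib.metadata import version

print(version("ovs"))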

Revision history for this message
yatin (yatinkarel) wrote (last edit):

There have been many fixes related to the reported issue in neutron and ovs since this bug report; some that I quickly caught are:
- https://patchwork.ozlabs.org<email address hidden>/
- https://review.opendev.org/c/openstack/ovsdbapp/+/856200
- https://review.opendev.org/c/openstack/ovsdbapp/+/862524
- https://review.opendev.org/c/openstack/neutron/+/857775
- https://review.opendev.org/c/openstack/neutron/+/871825

Closing it based on the above and comment #5. If the issue is still seen with python-ovs >= 2.17 and the above fixes included, please feel free to reopen it along with ovsdb-server SB logs and neutron server and metadata agent debug logs.

Changed in neutron:
status: New → Fix Released
Revision history for this message
Hua Zhang (zhhuabj) wrote :

Hi @otherwiseguy @yatinkarel, I confirm this issue has been resolved by using ovs 2.17. Thanks for all your help.
