OVN metadata randomly stops working

Bug #1996594 reported by Hua Zhang
This bug affects 2 people
Affects: neutron
Status: Fix Released
Importance: Medium
Assigned to: Terry Wilson

Bug Description

We found that OVN metadata will not work randomly when OVN is writing a snapshot.

1. At 12:30:35, OVN started to transfer leadership in order to write a snapshot

$ find sosreport-juju-2752e1-*/var/log/ovn/* |xargs zgrep -i -E 'Transferring leadership'
sosreport-juju-2752e1-6-lxd-24-xxx-2022-08-18-entowko/var/log/ovn/ovsdb-server-sb.log:2022-08-18T12:30:35.322Z|80962|raft|INFO|Transferring leadership to write a snapshot.
sosreport-juju-2752e1-6-lxd-24-xxx-2022-08-18-entowko/var/log/ovn/ovsdb-server-sb.log:2022-08-18T17:52:53.024Z|82382|raft|INFO|Transferring leadership to write a snapshot.
sosreport-juju-2752e1-7-lxd-27-xxx-2022-08-18-hhxxqci/var/log/ovn/ovsdb-server-sb.log:2022-08-18T12:30:35.330Z|92698|raft|INFO|Transferring leadership to write a snapshot.

2. At 12:30:36, neutron-ovn-metadata-agent reported an 'OVSDB Error'

$ find sosreport-srv1*/var/log/neutron/* |xargs zgrep -i -E 'OVSDB Error'
sosreport-srv1xxx2d-xxx-2022-08-18-cuvkufw/var/log/neutron/neutron-ovn-metadata-agent.log:2022-08-18 12:30:36.103 75556 ERROR ovsdbapp.backend.ovs_idl.transaction [-] OVSDB Error: no error details available
sosreport-srv1xxx6d-xxx-2022-08-18-bgnovqu/var/log/neutron/neutron-ovn-metadata-agent.log:2022-08-18 12:30:36.104 2171 ERROR ovsdbapp.backend.ovs_idl.transaction [-] OVSDB Error: no error details available

3. At 12:57:53, we saw the error 'No port found in network'; from then on, OVN metadata randomly stops working

2022-08-18 12:57:53.800 3730 ERROR neutron.agent.ovn.metadata.server [-] No port found in network 63e2c276-60dd-40e3-baa1-c16342eacce2 with IP address 100.94.98.135
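Presumably, the metadata agent resolves the requesting instance by matching network and source IP against its local, in-memory view of the OVN SB database, so a stale or still-resyncing copy after a leadership transfer can yield no match. The snippet below is an illustrative sketch only (hypothetical function and data shapes, not the actual neutron code):

# Illustrative sketch only: a lookup of a port by network and IP, similar in
# spirit to what the metadata proxy does against its local view of the SB DB.
# The `ports` structure and its field names are hypothetical.
def find_port(ports, network_id, ip_address):
    """Return the first port bound to network_id that carries ip_address, else None."""
    for port in ports:
        if port["network_id"] == network_id and ip_address in port["ips"]:
            return port
    return None  # -> logged as "No port found in network ... with IP address ..."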

After the problem occurs, restarting neutron-ovn-metadata-agent or restarting the haproxy instance as follows can be used as a workaround.

/usr/bin/neutron-rootwrap /etc/neutron/rootwrap.conf ip netns exec ovnmeta-63e2c276-60dd-40e3-baa1-c16342eacce2 haproxy -f /var/lib/neutron/ovn-metadata-proxy/63e2c276-60dd-40e3-baa1-c16342eacce2.conf

LP bug #1990978 [1] is trying to reduce the frequency of leadership transfers, which should help with this problem.
However, it only reduces how often the problem occurs rather than avoiding it completely. I wonder if we need to add some retry logic on the neutron side.
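For illustration only, such retry logic might look roughly like the sketch below; lookup_port_by_ip() is a hypothetical placeholder for the metadata agent's SB DB lookup, not an existing neutron API:

import time

def retry_on_ovsdb_error(fn, retries=3, delay=0.5):
    """Call fn(), retrying a few times with a growing back-off if it raises."""
    for attempt in range(retries):
        try:
            return fn()
        except Exception:
            if attempt == retries - 1:
                raise
            time.sleep(delay * (attempt + 1))

# Hypothetical usage; lookup_port_by_ip() is a placeholder, not a real neutron call:
# port = retry_on_ovsdb_error(lambda: lookup_port_by_ip(network_id, ip_address))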

NOTE: The OpenStack version we are using is focal-xena, and the openvswitch version is 2.16.0-0ubuntu2.1~cloud0

[1] https://bugs.launchpad.net/ubuntu/+source/openvswitch/+bug/1990978

Tags: ovn
tags: added: ovn
Revision history for this message
Bernard Cafarelli (bcafarel) wrote :

Do you use OVN in clustered (RAFT) mode? OVS 2.17+ is strongly recommended in that case.

Revision history for this message
Terry Wilson (otherwiseguy) wrote :

I'll look into this.

Changed in neutron:
importance: Undecided → Medium
assignee: nobody → Terry Wilson (otherwiseguy)
Revision history for this message
Hua Zhang (zhhuabj) wrote :

Hi @bcafarel, thanks for your reply. Yes, we are using OVN RAFT mode.

Why is OVS 2.17+ strongly recommended in this case? Can you share more context?
Also, LP bug #1953591 seems to be a related issue; one comment [1] in it
mentions that this issue can also be hit with ovs 2.17.1.post1.

[1] https://bugs.launchpad.net/networking-ovn/+bug/1953591/comments/8

Revision history for this message
Hua Zhang (zhhuabj) wrote :

Hi @bcafarel, I checked all the raft-related commits between 2.16.0 and 2.17.0; the commit bf07cc9 looks suspicious.

$ git cherry v2.16.0 v2.17.0 -v |grep raft
+ 0de882954032aa37dc943bafd72c33324aa0c95a raft: Don't keep full json objects in memory if no longer needed.
+ bf07cc9cdb2f37fede8c0363937f1eb9f4cfd730 raft: Only allow followers to snapshot.

But after checking the data, it does not seem to be caused by that commit.

5-lxd-23: f801 100.94.0.99:6644 follower term=297
6-lxd-24: 0f3c 100.94.0.158:6644 follower term=297
7-lxd-27: 9b15 100.94.0.204:6644 leader term=297 leader

(old leader, 6-lxd-24, 0f3c) - sosreport-juju-2752e1-6-lxd-24-xxx-2022-08-18-entowko/var/log/ovn/ovsdb-server-sb.log
2022-08-18T17:52:53.024Z|82382|raft|INFO|Transferring leadership to write a snapshot.
2022-08-18T17:52:53.367Z|82383|raft|INFO|rejected append_reply (not leader)
2022-08-18T17:52:53.378Z|82384|raft|INFO|server 9b15 is leader for term 297

(follower, 5-lxd-23: f801) - sosreport-juju-2752e1-5-lxd-23-xxx-2022-08-18-bnsdhsj/var/log/ovn/ovsdb-server-sb.log
2022-08-18T17:52:53.379Z|32327|raft|INFO|server 9b15 is leader for term 297

(new leader, 7-lxd-27: 9b15)
$ find sosreport-juju-2752e1-*/var/log/ovn/* |xargs zgrep -i -E 'received leadership transfer' -A2 |tail -n3
sosreport-juju-2752e1-7-lxd-27-xxx-2022-08-18-hhxxqci/var/log/ovn/ovsdb-server-sb.log:2022-08-18T17:52:53.025Z|92893|raft|INFO|received leadership transfer from 0f3c in term 296
sosreport-juju-2752e1-7-lxd-27-xxx-2022-08-18-hhxxqci/var/log/ovn/ovsdb-server-sb.log:2022-08-18T17:52:53.025Z|92894|raft|INFO|term 297: starting election
sosreport-juju-2752e1-7-lxd-27-xxx-2022-08-18-hhxxqci/var/log/ovn/ovsdb-server-sb.log:2022-08-18T17:52:53.378Z|92895|raft|INFO|term 297: elected leader by 2+ of 3 servers

We see that the new leader-to-be (7-lxd-27) receives the leadership transfer, initiates the election and, immediately after, starts a snapshot that takes about 0.353 seconds (17:52:53.378 - 17:52:53.025). During this time the follower (5-lxd-23) votes for 7-lxd-27, electing it as cluster leader, but 7-lxd-27 doesn't effectively become leader until it finishes snapshotting, essentially leaving the cluster without a leader for up to 0.353 seconds. So the snapshot itself actually takes less than 0.353 seconds, which also means that the amount of data written into the snapshot is not large.

Revision history for this message
Terry Wilson (otherwiseguy) wrote :

The ovs python library does not have support for RAFT's update3/monitor_cond_since primitives before 2.17. This means that every disconnect (which, as you point out, can happen with every leadership election on leader-only connections) involves syncing all DB data to the client. This is untenable on a large production deployment. With update3/monitor_cond_since, a reconnect only has to pull the transactions that were missed while the client was disconnected.

There are several patches and fixes related to adding update3/monitor_cond_since support, and the ovs team declined to backport them. So it is an extremely bad idea to try to use RAFT with OpenStack on anything earlier than ovs 2.17.
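As a quick sanity check, one can confirm which python-ovs version the agents are actually importing; a minimal sketch, assuming the library was installed with package metadata available:

# Print the installed python-ovs version; monitor_cond_since support is only
# available with 2.17+, older versions re-sync the full SB DB on every reconnect.
from importlib.metadata import version

print(version("ovs"))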

Revision history for this message
yatin (yatinkarel) wrote (last edit):

There have been many fixes related to the reported issue in neutron and ovs since this bug report; some that I quickly caught are:
- https://patchwork.ozlabs.org<email address hidden>/
- https://review.opendev.org/c/openstack/ovsdbapp/+/856200
- https://review.opendev.org/c/openstack/ovsdbapp/+/862524
- https://review.opendev.org/c/openstack/neutron/+/857775
- https://review.opendev.org/c/openstack/neutron/+/871825

Closing it based on the above and comment #5. If the issue is still seen with python-ovs >= 2.17 and the above fixes included, please feel free to reopen it along with ovsdb-server SB logs and neutron server and metadata agent debug logs.

Changed in neutron:
status: New → Fix Released
Revision history for this message
Hua Zhang (zhhuabj) wrote :

Hi @otherwiseguy @yatinkarel, I confirm this issue has been resolved by using ovs 2.17. Thanks for all your help.
