ovn-central config ovsdb-server-election-timer default value should be higher

Bug #2013344 reported by Nishant Dash
Affects: charm-ovn-central
Status: New
Importance: Undecided
Assigned to: Unassigned
Milestone: (none)

Bug Description

Hello,

Environment: Focal-Ussuri, juju 2.9.37, MAAS 3.0/stable (3.0.0-10029-g.986ea3e45)
Process: OVN upgrade from 20.03 to 22.03

The default value of ovsdb-server-election-timer is 4s. This value is very small, and I can hit timeouts even in a test environment with no load on it.

When upgrading OVN on a customer environment with decent load, the upgrade fails during the SB DB migration with the following error:

Mar 24 08:52:38 juju-5f7845-5-lxd-21 ovsdb-server[3319803]: ovs|00142|raft|INFO|term 33897: 4425 ms timeout expired, starting election
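
To check how far past the configured timer such elections fire, the message can be pulled out of the journal programmatically. The following is a minimal sketch, assuming the log format matches the single line quoted above; the function name and the regular expression are illustrative and not part of OVN or the charm.

```python
import re

# Pattern for the ovsdb-server raft message quoted in this report.
# Assumption: other log lines of interest follow the same layout.
RAFT_TIMEOUT_RE = re.compile(
    r"raft\|INFO\|term (?P<term>\d+): "
    r"(?P<elapsed_ms>\d+) ms timeout expired, starting election"
)


def parse_election_timeout(line):
    """Return (term, elapsed_ms) for an expired election timeout, else None."""
    match = RAFT_TIMEOUT_RE.search(line)
    if match is None:
        return None
    return int(match.group("term")), int(match.group("elapsed_ms"))


line = (
    "Mar 24 08:52:38 juju-5f7845-5-lxd-21 ovsdb-server[3319803]: "
    "ovs|00142|raft|INFO|term 33897: 4425 ms timeout expired, starting election"
)
print(parse_election_timeout(line))  # (33897, 4425)
```

Here the election started after 4425 ms, shortly after the 4s default expired.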

This, combined with the neutron api packages not being upgraded (see [1]), can result in a data plane outage, as the network agents will go down.

Recovering from this situation required using the /usr/share/ovn/scripts/ovn-ctl script to partially recover the state of OVN, then increasing ovsdb-server-election-timer to 60s, and then re-attempting the SB DB migration.
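
Going from 4s to 60s may not be possible in one step when the timer is changed directly on ovsdb-server. The sketch below assumes the limit, as I understand the ovsdb-server documentation, that a single change cannot more than double the current value; the helper name is hypothetical, and whether the charm applies the steps automatically is not confirmed in this report.

```python
def election_timer_steps(current_ms, target_ms):
    """List the successive timer values needed to reach target_ms.

    Assumption: each individual increase may at most double the
    current election timer value.
    """
    if current_ms <= 0:
        raise ValueError("current_ms must be positive")
    steps = []
    while current_ms < target_ms:
        # Double the timer, but never overshoot the requested target.
        current_ms = min(current_ms * 2, target_ms)
        steps.append(current_ms)
    return steps


print(election_timer_steps(4000, 60000))  # [8000, 16000, 32000, 60000]
```

Under that assumption, raising the timer from the 4s default to 60s takes four successive changes.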

[1] https://bugs.launchpad.net/charm-neutron-api-plugin-ovn/+bug/1992770

Nishant Dash (dash3)
summary: - ovn-central's ovsdb-server-election-timer's default value should be
+ ovn-central config ovsdb-server-election-timer default value should be
higher
Frode Nordahl (fnordahl) wrote :

Hello, Nishant, and thank you for the bug report.

While there is no doubt the default election timer could be adjusted, it is a separate issue from the clustered database upgrade failure.

No matter what value you set it to, your upgrade can still fail if the conversion time does not fit within a single election timer window. How to approach that fact is a stream of work by itself, which should be split out into a bug of its own.

One issue per bug, please :)

Frode Nordahl (fnordahl) wrote :

Let's use bug 1999605 for the clustered db schema conversion issue.

Frode Nordahl (fnordahl) wrote :

For regular runtime purposes it looks like a value of 16 [0][1] might be a good option as it has come up repeatedly in conversations with upstream.

0: https://github.com/ovn-org/ovn-heater/blob/05bc4b12217f62109995c3eb90cbc0d58ac96714/ovn-fake-multinode-utils/translate_yaml.py#L116
1: https://mail.openvswitch.org/pipermail/ovs-dev/2023-March/403138.html
