This issue can cause major outages when the SB and NB clusters can't elect a leader because of stale ovn-central units in the cluster.
Starting from 3 ovn-central units, we had to remove 2 of them (because of hardware maintenance) and then added two back. We didn't manually run cluster/leave. The two Raft clusters were unable to elect a leader because both SB and NB had 4 members, of which 2 were down (a 4-member cluster needs 3 votes for a majority), and the 5th unit could not join the cluster.
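To illustrate why the election was stuck, here is a minimal sketch of the Raft majority arithmetic. The function names are made up for illustration and are not part of OVN or the charm; the only assumption is the usual Raft rule that a leader needs votes from more than half of the configured members.

```shell
#!/bin/sh
# Votes needed for a majority in a Raft cluster of N configured members:
# floor(N / 2) + 1. (Hypothetical helper, for illustration only.)
raft_quorum() {
    echo $(( $1 / 2 + 1 ))
}

# can_elect MEMBERS UP: succeeds if UP reachable members form a majority.
can_elect() {
    [ "$2" -ge "$(raft_quorum "$1")" ]
}

# The situation described above: 4 configured members, 2 of them down.
members=4
up=2
if can_elect "$members" "$up"; then
    echo "leader election possible"
else
    echo "no quorum: need $(raft_quorum "$members") of $members, only $up up"
fi
# prints: no quorum: need 3 of 4, only 2 up
```

The stale members still count towards the configured size, which is why removing units without cluster/leave raises the bar for a majority.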
Subscribing field-medium.
The 3 remaining ovn-central units were /2, /3 and /4.
To recreate the clusters we followed these recovery steps:
1. stop all units:
juju run-action ovn-central/2 pause --wait
juju run-action ovn-central/3 pause --wait
juju run-action ovn-central/4 pause --wait
2. Create standalone databases on ovn-central/2:
# ovsdb-tool cluster-to-standalone /tmp/standalone_ovnsb_db.db /var/lib/ovn/ovnsb_db.db
# ovsdb-tool cluster-to-standalone /tmp/standalone_ovnnb_db.db /var/lib/ovn/ovnnb_db.db
3. Create clusters:
ovsdb-tool create-cluster /var/lib/ovn/ovnsb_db.db /tmp/standalone_ovnsb_db.db ssl:<ovn-central-2-ip>:6644
ovsdb-tool create-cluster /var/lib/ovn/ovnnb_db.db /tmp/standalone_ovnnb_db.db ssl:<ovn-central-2-ip>:6643
4. Resume ovn-central/2
5. Join cluster from ovn-central/3:
ovsdb-tool --cid=<new-sb-cid-took-from-ovn-central-2> join-cluster /var/lib/ovn/ovnsb_db.db OVN_Southbound ssl:<ovn-central-3-ip>:6644 ssl:<ovn-central-2-ip>:6644
ovsdb-tool --cid=<new-nb-cid-took-from-ovn-central-2> join-cluster /var/lib/ovn/ovnnb_db.db OVN_Northbound ssl:<ovn-central-3-ip>:6643 ssl:<ovn-central-2-ip>:6643
6. Resuming /3
juju run-action ovn-central/3 resume --wait
7. Fixing leader-set
juju run -u ovn-central/leader leader-set nb_cid="<new-nb-cid-took-from-ovn-central-2>"
juju run -u ovn-central/leader leader-set sb_cid="<new-sb-cid-took-from-ovn-central-2>"
8. Resuming /4
juju run-action ovn-central/4 resume --wait
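The join commands in step 5 are easy to get wrong (SB uses port 6644, NB uses 6643, and the cluster ID must be the new one from ovn-central/2), so it may help to print them for review before running anything. The sketch below is a dry run only: `join_cmd` and `is_uuid` are hypothetical helpers, and the cluster ID and IP addresses are made-up placeholders. On ovn-central/2 the new cluster ID can be read from the database file with `ovsdb-tool db-cid`.

```shell
#!/bin/sh
# Dry-run helper (hypothetical): prints the join-cluster command for one
# database instead of executing it, so the arguments can be reviewed first.
# Usage: join_cmd CID DB_FILE DB_NAME LOCAL_ADDR REMOTE_ADDR
join_cmd() {
    printf 'ovsdb-tool --cid=%s join-cluster %s %s %s %s\n' \
        "$1" "$2" "$3" "$4" "$5"
}

# Succeeds if the argument looks like a lowercase UUID, which is the form
# in which ovsdb-tool prints cluster IDs.
is_uuid() {
    printf '%s\n' "$1" |
        grep -Eq '^[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}$'
}

# Placeholder values, NOT taken from the original report. On ovn-central/2
# the real cluster ID could be obtained with, for example:
#   ovsdb-tool db-cid /var/lib/ovn/ovnsb_db.db
SB_CID=11111111-2222-3333-4444-555555555555
LOCAL_IP=192.0.2.3     # hypothetical address of the joining unit (/3)
REMOTE_IP=192.0.2.2    # hypothetical address of ovn-central/2

# Print the SB join command only if the cluster ID is well formed.
is_uuid "$SB_CID" && join_cmd "$SB_CID" /var/lib/ovn/ovnsb_db.db \
    OVN_Southbound "ssl:$LOCAL_IP:6644" "ssl:$REMOTE_IP:6644"
```

The NB command is built the same way with ovnnb_db.db, OVN_Northbound and port 6643. After each resume, the cluster state can be inspected with `ovs-appctl -t <ctl-socket> cluster/status OVN_Southbound`; the control socket path depends on the installation (assumed here to be something like /var/run/ovn/ovnsb_db.ctl).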