Removing one ovn-central unit doesn't cluster/leave SB and NB clusters

Bug #1948680 reported by Giuseppe Petralia
Affects                  Status          Importance   Assigned to       Milestone
charm-interface-ovsdb    New             Undecided    Martin Kalcok
charm-ovn-central        Fix Committed   High         Martin Kalcok

Bug Description

charm ovn-central rev. 7

After removing one ovn-central unit, the server was not removed from the cluster.

The removed node had IP 10.10.240.102.

# ovs-appctl -t /var/run/ovn/ovnsb_db.ctl cluster/status OVN_Southbound
670a
Name: OVN_Southbound
Cluster ID: 5ff3 (5ff30308-1df3-40c0-8cd9-0a048f915017)
Server ID: 670a (670afac0-12e5-4a3c-92e1-9c826a2e9dc2)
Address: ssl:10.10.241.226:6644
Status: cluster member
Role: leader
Term: 7537
Leader: self
Vote: self

Election timer: 4000
Log: [54683521, 54683813]
Entries not yet committed: 0
Entries not yet applied: 0
Connections: ->ef67 (->e7aa) ->3d49 <-3d49 <-ef67
Servers:
    670a (670a at ssl:10.10.241.226:6644) (self) next_index=54682242 match_index=54683812
    ef67 (ef67 at ssl:10.10.240.88:6644) next_index=54683813 match_index=54683812
    e7aa (e7aa at ssl:10.10.240.102:6644) next_index=54683813 match_index=0
    3d49 (3d49 at ssl:10.10.240.187:6644) next_index=54683813 match_index=54683812

We had to remove it manually with:
# ovs-appctl -t /var/run/ovn/ovnsb_db.ctl cluster/kick OVN_Southbound e7aa
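
To confirm the removal, cluster/status can be re-run and the kicked server (e7aa) should no longer appear in the Servers list:
# ovs-appctl -t /var/run/ovn/ovnsb_db.ctl cluster/status OVN_Southbound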

Giuseppe Petralia (peppepetra) wrote : Re: Removing one ovn-central unit doesn't remove the server from the Southbound cluster

Subscribing field-medium.

This issue can cause major outages when the SB and NB clusters can't elect a leader because of stale ovn-central members left in the cluster.

Starting from 3 ovn-central units, we had to remove 2 of them (because of hardware maintenance) and then added 2 back. We did not manually run cluster/leave. The two RAFT clusters were then unable to elect a leader, because both SB and NB had 4 registered members, 2 of which were down, and the 5th unit could not join the cluster.

The 3 remaining ovn-central units were /2, /3 and /4.

To recreate the clusters we followed these recovery steps:

1. Stop all units:
juju run-action ovn-central/2 pause --wait
juju run-action ovn-central/3 pause --wait
juju run-action ovn-central/4 pause --wait

2. Create standalone database files on ovn-central/2:
# ovsdb-tool cluster-to-standalone /tmp/standalone_ovnsb_db.db /var/lib/ovn/ovnsb_db.db
# ovsdb-tool cluster-to-standalone /tmp/standalone_ovnnb_db.db /var/lib/ovn/ovnnb_db.db

3. Create the new clusters from the standalone files (still on ovn-central/2):
ovsdb-tool create-cluster /var/lib/ovn/ovnsb_db.db /tmp/standalone_ovnsb_db.db ssl:<ovn-central-2-ip>:6644
ovsdb-tool create-cluster /var/lib/ovn/ovnnb_db.db /tmp/standalone_ovnnb_db.db ssl:<ovn-central-2-ip>:6643
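
The new cluster IDs needed in steps 5 and 7 can be read back from the freshly created database files, for example (a sketch, not part of the original procedure, assuming the database paths used above):
# ovsdb-tool db-cid /var/lib/ovn/ovnsb_db.db
# ovsdb-tool db-cid /var/lib/ovn/ovnnb_db.db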

4. Resume ovn-central/2:
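The exact command is not in the original report; by analogy with steps 6 and 8 it would presumably be:
juju run-action ovn-central/2 resume --wait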

5. Join the clusters from ovn-central/3:

ovsdb-tool --cid=<new-sb-cid-taken-from-ovn-central-2> join-cluster /var/lib/ovn/ovnsb_db.db OVN_Southbound ssl:<ovn-central-3-ip>:6644 ssl:<ovn-central-2-ip>:6644
ovsdb-tool --cid=<new-nb-cid-taken-from-ovn-central-2> join-cluster /var/lib/ovn/ovnnb_db.db OVN_Northbound ssl:<ovn-central-3-ip>:6643 ssl:<ovn-central-2-ip>:6643

6. Resume ovn-central/3:
juju run-action ovn-central/3 resume --wait

7. Fix the leader settings:
juju run -u ovn-central/leader leader-set nb_cid="<new-nb-cid-taken-from-ovn-central-2>"
juju run -u ovn-central/leader leader-set sb_cid="<new-sb-cid-taken-from-ovn-central-2>"

8. Resume ovn-central/4:
juju run-action ovn-central/4 resume --wait
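
After all units are resumed, cluster membership and leadership can be verified again with the status command quoted earlier (the NB control socket path is assumed to mirror the SB one):
# ovs-appctl -t /var/run/ovn/ovnsb_db.ctl cluster/status OVN_Southbound
# ovs-appctl -t /var/run/ovn/ovnnb_db.ctl cluster/status OVN_Northbound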

Frode Nordahl (fnordahl) wrote :

Definitely something the charm should handle. I think we would need to implement two entry points to cluster member removal:

    an opportunistic one running on the departing unit in its stop hook (a departing member sending a leave message is the preferred approach according to the ovsdb-server documentation); see the sketch below

    an action which can be issued by the operator on the remaining units in the event that a unit is abruptly removed from the cluster without removing itself
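
For reference, the graceful departure mentioned in the first point roughly corresponds to running cluster/leave on the departing unit; a sketch using the control socket paths from this report (not the charm's actual implementation):
# ovs-appctl -t /var/run/ovn/ovnsb_db.ctl cluster/leave OVN_Southbound
# ovs-appctl -t /var/run/ovn/ovnnb_db.ctl cluster/leave OVN_Northbound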

Changed in charm-ovn-central:
status: New → Triaged
importance: Undecided → High
Martin Kalcok (martin-kalcok) wrote :
Changed in charm-ovn-central:
assignee: nobody → Martin Kalcok (martin-kalcok)
Changed in charm-interface-ovsdb:
assignee: nobody → Martin Kalcok (martin-kalcok)
OpenStack Infra (hudson-openstack) wrote : Fix proposed to charm-ovn-central (master)

Fix proposed to branch: master
Review: https://review.opendev.org/c/x/charm-ovn-central/+/859720

Changed in charm-ovn-central:
status: Triaged → In Progress
OpenStack Infra (hudson-openstack) wrote : Fix merged to charm-ovn-central (master)

Reviewed: https://review.opendev.org/c/x/charm-ovn-central/+/859720
Committed: https://opendev.org/x/charm-ovn-central/commit/34a82bc763991c7a2b6f2c23b58e3239c8058ef2
Submitter: "Zuul (22348)"
Branch: master

commit 34a82bc763991c7a2b6f2c23b58e3239c8058ef2
Author: Martin Kalcok <email address hidden>
Date: Wed Sep 28 22:34:54 2022 +0200

    Implementation of ovn-central downscaling.

    This change includes:
    * attempt to leave cluster gracefully when removing unit
    * cluster-status action that shows status of SB and NB clusters
    * cluster-kick action that allows user to remove cluster members

    Associated spec: https://opendev.org/openstack/charm-specs/src/branch/master/specs/yoga/approved/ovn-central-downscaling.rst

    Closes-Bug: #1948680
    func-test-pr: https://github.com/openstack-charmers/zaza-openstack-tests/pull/933
    Change-Id: I40ae08669d00b3b1fa567a45db2ce51425e6d1cb

Changed in charm-ovn-central:
status: In Progress → Fix Committed
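
With the fix in place, the new actions from the commit message can be used roughly as follows (a sketch; the cluster-kick parameter name is an assumption, not taken from this report):
juju run-action ovn-central/0 cluster-status --wait
juju run-action ovn-central/0 cluster-kick server-id=e7aa --wait   # "server-id" is an assumed parameter name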