Connectivity issues between ovn-central cluster units during deployment will cause unrecoverable split brain cluster state
Affects | Status | Importance | Assigned to | Milestone | |
---|---|---|---|---|---|
charm-ovn-central | New | Undecided | Unassigned | | |
Bug Description
Deploying Charmed OpenStack - Jammy series, Yoga release
Deployed over 2 nodes with the following ovn-central charm configuration (the units are split across the 2 nodes):
ovn-central:
  charm: ch:ovn-central
  num_units: 3
  options:
    source: *openstack-origin
  to:
  - lxd:0
  - lxd:1
  - lxd:1
  channel: 22.03/stable
  constraints: *space-constr
  bindings:
    "": *internal-space
While deploying the cluster I had connectivity issues on the internal space due to another issue (https:/
ovn-central/0* waiting idle 0/lxd/3 10.7.208.37 6641/tcp,6642/tcp 'ovsdb-peer' incomplete, 'certificates' awaiting server certificate data
ovn-central/1 waiting idle 1/lxd/5 10.7.208.58 6641/tcp,6642/tcp 'ovsdb-peer' incomplete, 'certificates' awaiting server certificate data
ovn-central/2 waiting idle 1/lxd/6 10.7.208.59 6641/tcp,6642/tcp 'ovsdb-peer' incomplete, 'certificates' awaiting server certificate data
show-unit output indicates ovn-central/0 formed a cluster with itself as a leader, while the other 2 units on the second node were not able to identify/join the cluster:
$ juju run --application ovn-central 'sudo ovn-appctl -t /var/run/
- Stdout: |
3574
Name: OVN_Southbound
Cluster ID: 5060 (5060fdb0-
Server ID: 3574 (35748432-
Address: ssl:172.
Status: cluster member
Role: leader
Term: 1
Leader: self
Vote: self
Last Election started 57477371 ms ago, reason: timeout
Last Election won: 57477371 ms ago
Election timer: 1000
Log: [2, 4]
Entries not yet committed: 0
Entries not yet applied: 0
Connections:
Disconnections: 0
Servers:
3574 (3574 at ssl:172.
UnitId: ovn-central/0
- ReturnCode: 1
Stderr: |
2022-
ovn-appctl: cannot connect to "/var/run/
Stdout: ""
UnitId: ovn-central/1
- ReturnCode: 1
Stderr: |
2022-
ovn-appctl: cannot connect to "/var/run/
Stdout: ""
UnitId: ovn-central/2
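To make the symptom easier to spot, the status text can be checked mechanically. The following is a minimal sketch, assuming the Stdout of a single unit is passed in as a string; `parse_cluster_status`, `is_lone_leader` and `SAMPLE` are hypothetical names introduced here, and the sample keeps the truncated values exactly as they appear in the output above. It flags a unit that is leader of a cluster whose only server is itself.

```python
def parse_cluster_status(text):
    """Parse one unit's cluster/status text into a dict of fields."""
    status = {"servers": []}
    in_servers = False
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line == "Servers:":
            in_servers = True
            continue
        if in_servers:
            # Server lines start with the short server ID, e.g. "3574 (3574 at ...".
            status["servers"].append(line.split()[0])
            continue
        key, sep, value = line.partition(":")
        if sep:
            status[key] = value.strip()
    return status


def is_lone_leader(status):
    """True when the unit leads a cluster that contains only itself."""
    return status.get("Role") == "leader" and len(status["servers"]) == 1


# Abbreviated copy of the ovn-central/0 output, truncations left as-is.
SAMPLE = """\
3574
Name: OVN_Southbound
Cluster ID: 5060 (5060fdb0-
Server ID: 3574 (35748432-
Status: cluster member
Role: leader
Term: 1
Leader: self
Servers:
3574 (3574 at ssl:172.
"""

print(is_lone_leader(parse_cluster_status(SAMPLE)))
```

A single-server leader in a 3-unit application is not proof of a problem by itself, but at this stage of the deployment it matches the state described above.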
After fixing the cause of the connectivity issues on node 0 and rebooting it, we can see that unit ovn-central/1 formed a different cluster, so we now have 2 clusters with 2 leaders. The juju status output does not indicate any problem other than waiting for Vault initialization:
ovn-central/0 waiting idle 0/lxd/3 10.7.208.37 6641/tcp,6642/tcp 'ovsdb-peer' incomplete, 'certificates' awaiting server certificate data
ovn-central/1* waiting idle 1/lxd/5 10.7.208.58 6641/tcp,6642/tcp 'ovsdb-peer' incomplete, 'certificates' awaiting server certificate data
ovn-central/2 waiting idle 1/lxd/6 10.7.208.59 6641/tcp,6642/tcp 'ovsdb-peer' incomplete, 'certificates' awaiting server certificate data
$ juju run --application ovn-central 'sudo ovn-appctl -t /var/run/
- Stdout: |
3574
Name: OVN_Southbound
Cluster ID: 5060 (5060fdb0-
Server ID: 3574 (35748432-
Address: ssl:172.
Status: cluster member
Role: leader
Term: 2
Leader: self
Vote: self
Last Election started 340646 ms ago, reason: timeout
Last Election won: 340646 ms ago
Election timer: 1000
Log: [2, 5]
Entries not yet committed: 0
Entries not yet applied: 0
Connections:
Disconnections: 0
Servers:
3574 (3574 at ssl:172.
UnitId: ovn-central/0
- Stdout: |
abfc
Name: OVN_Southbound
Cluster ID: 7b58 (7b58e296-
Server ID: abfc (abfcf295-
Address: ssl:172.
Status: cluster member
Role: leader
Term: 1
Leader: self
Vote: self
Last Election started 620728 ms ago, reason: timeout
Last Election won: 620728 ms ago
Election timer: 1000
Log: [2, 3]
Entries not yet committed: 0
Entries not yet applied: 0
Connections:
Disconnections: 0
Servers:
abfc (abfc at ssl:172.
UnitId: ovn-central/1
- ReturnCode: 1
Stderr: |
2022-
ovn-appctl: cannot connect to "/var/run/
Stdout: ""
UnitId: ovn-central/2
After fixing the cause of the connectivity issues on node 1, rebooting it, and initializing Vault, we are left with 2 OVN clusters with 2 leaders on the same space, both part of a single ovn-central application, as can also be seen in the juju status output. Unit ovn-central/2 will keep jumping between the 2 clusters as both leaders try to add it.
ovn-central/0* active idle 0/lxd/3 10.7.208.37 6641/tcp,6642/tcp Unit is ready (leader: ovnnb_db, ovnsb_db northd: active)
ovn-central/1 active idle 1/lxd/5 10.7.208.58 6641/tcp,6642/tcp Unit is ready (leader: ovnnb_db, ovnsb_db)
ovn-central/2 active idle 1/lxd/6 10.7.208.59 6641/tcp,6642/tcp Unit is ready
$ juju run --application ovn-central 'sudo ovn-appctl -t /var/run/
- Stdout: |
3574
Name: OVN_Southbound
Cluster ID: 5060 (5060fdb0-
Server ID: 3574 (35748432-
Address: ssl:172.
Status: cluster member
Adding server e2c2 (e2c2 at ssl:172.
Role: leader
Term: 2
Leader: self
Vote: self
Last Election started 1648247 ms ago, reason: timeout
Last Election won: 1648247 ms ago
Election timer: 4000
Log: [2, 17]
Entries not yet committed: 0
Entries not yet applied: 0
Connections:
Disconnections: 1
Servers:
3574 (3574 at ssl:172.
UnitId: ovn-central/0
- Stdout: |
abfc
Name: OVN_Southbound
Cluster ID: 7b58 (7b58e296-
Server ID: abfc (abfcf295-
Address: ssl:172.
Status: cluster member
Role: leader
Term: 3
Leader: self
Vote: self
Last Election started 118110 ms ago, reason: timeout
Last Election won: 118110 ms ago
Election timer: 4000
Log: [2, 14]
Entries not yet committed: 0
Entries not yet applied: 0
Connections: <-e2c2 ->e2c2
Disconnections: 0
Servers:
e2c2 (e2c2 at ssl:172.
abfc (abfc at ssl:172.
UnitId: ovn-central/1
- Stdout: |
e2c2
Name: OVN_Southbound
Cluster ID: 7b58 (7b58e296-
Server ID: e2c2 (e2c2a6a0-
Address: ssl:172.
Status: cluster member
Role: follower
Term: 3
Leader: abfc
Vote: unknown
Election timer: 4000
Log: [2, 14]
Entries not yet committed: 0
Entries not yet applied: 0
Connections: ->0000 <-abfc
Disconnections: 1
Servers:
e2c2 (e2c2 at ssl:172.
abfc (abfc at ssl:172.
UnitId: ovn-central/2
This state is not recoverable on its own, and will cause many issues with OpenStack operations because 2 OVN DB sets are running in parallel on the same cluster.
The only way to recover is to manually remove and recreate one of the leader units so that it joins the same cluster as the other 2 units.
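The split described in this report shows up as units of one application reporting different Cluster IDs. A minimal sketch of that check follows, assuming the short Cluster ID of each unit has already been collected by hand from the status output; `find_split_brain` and `reported` are hypothetical names, and the IDs are the ones reported by the three units in the final output.

```python
from collections import defaultdict


def find_split_brain(cluster_ids):
    """Group units by reported Cluster ID.

    Returns a dict of cluster ID -> units when more than one cluster
    exists, or an empty dict when all units agree.
    """
    groups = defaultdict(list)
    for unit, cluster_id in sorted(cluster_ids.items()):
        groups[cluster_id].append(unit)
    return dict(groups) if len(groups) > 1 else {}


# Short Cluster IDs as reported by each unit in this bug.
reported = {
    "ovn-central/0": "5060",
    "ovn-central/1": "7b58",
    "ovn-central/2": "7b58",
}

for cluster_id, units in find_split_brain(reported).items():
    print(cluster_id, units)
```

Any non-empty result means the application is running more than one OVN DB cluster, which juju status alone does not surface here since all three units report "Unit is ready".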
Changed in linux (Ubuntu): | |
status: | New → Incomplete |
no longer affects: | linux (Ubuntu) |
This bug is missing log files that will aid in diagnosing the problem. While running an Ubuntu kernel (not a mainline or third-party kernel) please enter the following command in a terminal window:
apport-collect 1979188
and then change the status of the bug to 'Confirmed'.
If, due to the nature of the issue you have encountered, you are unable to run this command, please add a comment stating that fact and change the bug status to 'Confirmed'.
This change has been made by an automated script, maintained by the Ubuntu Kernel Team.