Connectivity issues between ovn-central cluster units during deployment cause an unrecoverable split-brain cluster state

Bug #1979188 reported by Itai Levy
This bug affects 1 person
Affects: charm-ovn-central
Status: New
Importance: Undecided
Assigned to: Unassigned

Bug Description

Deploying Charmed OpenStack - Jammy series, Yoga release

Deploying over 2 nodes with the following ovn-central charm configuration (the units are split across the 2 nodes):
  ovn-central:
    charm: ch:ovn-central
    num_units: 3
    options:
      source: *openstack-origin
    to:
    - lxd:0
    - lxd:1
    - lxd:1
    channel: 22.03/stable
    constraints: *space-constr
    bindings:
      "": *internal-space

While deploying the cluster I had connectivity issues on the internal space due to another issue (https://bugs.launchpad.net/ubuntu/+source/plan/+bug/1978820). After the deployment completed, and before initializing Vault, this was the juju status output:

ovn-central/0* waiting idle 0/lxd/3 10.7.208.37 6641/tcp,6642/tcp 'ovsdb-peer' incomplete, 'certificates' awaiting server certificate data
ovn-central/1 waiting idle 1/lxd/5 10.7.208.58 6641/tcp,6642/tcp 'ovsdb-peer' incomplete, 'certificates' awaiting server certificate data
ovn-central/2 waiting idle 1/lxd/6 10.7.208.59 6641/tcp,6642/tcp 'ovsdb-peer' incomplete, 'certificates' awaiting server certificate data
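
While the units were in this state, a quick way to confirm whether the peers can actually reach each other over the internal space is to probe the OVN southbound RAFT port from each unit. This is a minimal sketch only; it assumes nc is available in the containers, and the 172.17.0.13:6644 target is the address ovn-central/0 reports in the cluster status output below:

$ juju run --application ovn-central 'nc -vz -w 3 172.17.0.13 6644'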

The show-unit and cluster/status outputs indicate that ovn-central/0 formed a cluster with itself as the leader, while the other 2 units on the second node were not able to discover or join the cluster:

$ juju run --application ovn-central 'sudo ovn-appctl -t /var/run/ovn/ovnsb_db.ctl cluster/status OVN_Southbound'
- Stdout: |
    3574
    Name: OVN_Southbound
    Cluster ID: 5060 (5060fdb0-8d1f-4eb9-a419-b69b297146bd)
    Server ID: 3574 (35748432-bae1-41df-91ac-c620cb843d60)
    Address: ssl:172.17.0.13:6644
    Status: cluster member
    Role: leader
    Term: 1
    Leader: self
    Vote: self

    Last Election started 57477371 ms ago, reason: timeout
    Last Election won: 57477371 ms ago
    Election timer: 1000
    Log: [2, 4]
    Entries not yet committed: 0
    Entries not yet applied: 0
    Connections:
    Disconnections: 0
    Servers:
        3574 (3574 at ssl:172.17.0.13:6644) (self) next_index=2 match_index=3
  UnitId: ovn-central/0
- ReturnCode: 1
  Stderr: |
    2022-06-20T07:41:26Z|00001|unixctl|WARN|failed to connect to /var/run/ovn/ovnsb_db.ctl
    ovn-appctl: cannot connect to "/var/run/ovn/ovnsb_db.ctl" (No such file or directory)
  Stdout: ""
  UnitId: ovn-central/1
- ReturnCode: 1
  Stderr: |
    2022-06-20T07:41:26Z|00001|unixctl|WARN|failed to connect to /var/run/ovn/ovnsb_db.ctl
    ovn-appctl: cannot connect to "/var/run/ovn/ovnsb_db.ctl" (No such file or directory)
  Stdout: ""
  UnitId: ovn-central/2

After fixing the cause of the connectivity issues on node 0 and rebooting it, we can see that unit ovn-central/1 has formed a different cluster, so we now have 2 clusters with 2 leaders. The juju status output does not indicate any problem other than waiting for Vault initialization:

ovn-central/0 waiting idle 0/lxd/3 10.7.208.37 6641/tcp,6642/tcp 'ovsdb-peer' incomplete, 'certificates' awaiting server certificate data
ovn-central/1* waiting idle 1/lxd/5 10.7.208.58 6641/tcp,6642/tcp 'ovsdb-peer' incomplete, 'certificates' awaiting server certificate data
ovn-central/2 waiting idle 1/lxd/6 10.7.208.59 6641/tcp,6642/tcp 'ovsdb-peer' incomplete, 'certificates' awaiting server certificate data

$ juju run --application ovn-central 'sudo ovn-appctl -t /var/run/ovn/ovnsb_db.ctl cluster/status OVN_Southbound'
- Stdout: |
    3574
    Name: OVN_Southbound
    Cluster ID: 5060 (5060fdb0-8d1f-4eb9-a419-b69b297146bd)
    Server ID: 3574 (35748432-bae1-41df-91ac-c620cb843d60)
    Address: ssl:172.17.0.13:6644
    Status: cluster member
    Role: leader
    Term: 2
    Leader: self
    Vote: self

    Last Election started 340646 ms ago, reason: timeout
    Last Election won: 340646 ms ago
    Election timer: 1000
    Log: [2, 5]
    Entries not yet committed: 0
    Entries not yet applied: 0
    Connections:
    Disconnections: 0
    Servers:
        3574 (3574 at ssl:172.17.0.13:6644) (self) next_index=4 match_index=4
  UnitId: ovn-central/0
- Stdout: |
    abfc
    Name: OVN_Southbound
    Cluster ID: 7b58 (7b58e296-0c96-4235-8e0e-3f1916bf90bc)
    Server ID: abfc (abfcf295-1946-4cf1-a4a4-319e240dc3f9)
    Address: ssl:172.17.0.17:6644
    Status: cluster member
    Role: leader
    Term: 1
    Leader: self
    Vote: self

    Last Election started 620728 ms ago, reason: timeout
    Last Election won: 620728 ms ago
    Election timer: 1000
    Log: [2, 3]
    Entries not yet committed: 0
    Entries not yet applied: 0
    Connections:
    Disconnections: 0
    Servers:
        abfc (abfc at ssl:172.17.0.17:6644) (self) next_index=2 match_index=2
  UnitId: ovn-central/1
- ReturnCode: 1
  Stderr: |
    2022-06-20T07:54:57Z|00001|unixctl|WARN|failed to connect to /var/run/ovn/ovnsb_db.ctl
    ovn-appctl: cannot connect to "/var/run/ovn/ovnsb_db.ctl" (No such file or directory)
  Stdout: ""
  UnitId: ovn-central/2
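
At this point the split is already visible in the two different Cluster IDs (5060 vs 7b58). A quick way to compare them across all units is to filter the same cluster/status output used above (a minimal sketch; units whose database has not started yet will still return an error):

$ juju run --application ovn-central 'sudo ovn-appctl -t /var/run/ovn/ovnsb_db.ctl cluster/status OVN_Southbound | grep "Cluster ID"'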

After fixing the cause of the connectivity issues on node 1, rebooting it, and initializing Vault, we are left with 2 OVN clusters with 2 leaders on the same space, both part of a single ovn-central application, as can also be seen in the juju status output. Unit ovn-central/2 keeps jumping between the 2 clusters, as both leaders try to add it.

ovn-central/0* active idle 0/lxd/3 10.7.208.37 6641/tcp,6642/tcp Unit is ready (leader: ovnnb_db, ovnsb_db northd: active)
ovn-central/1 active idle 1/lxd/5 10.7.208.58 6641/tcp,6642/tcp Unit is ready (leader: ovnnb_db, ovnsb_db)
ovn-central/2 active idle 1/lxd/6 10.7.208.59 6641/tcp,6642/tcp Unit is ready

$ juju run --application ovn-central 'sudo ovn-appctl -t /var/run/ovn/ovnsb_db.ctl cluster/status OVN_Southbound'
- Stdout: |
    3574
    Name: OVN_Southbound
    Cluster ID: 5060 (5060fdb0-8d1f-4eb9-a419-b69b297146bd)
    Server ID: 3574 (35748432-bae1-41df-91ac-c620cb843d60)
    Address: ssl:172.17.0.13:6644
    Status: cluster member
    Adding server e2c2 (e2c2 at ssl:172.17.0.18:6644) (adding: catchup)
    Role: leader
    Term: 2
    Leader: self
    Vote: self

    Last Election started 1648247 ms ago, reason: timeout
    Last Election won: 1648247 ms ago
    Election timer: 4000
    Log: [2, 17]
    Entries not yet committed: 0
    Entries not yet applied: 0
    Connections:
    Disconnections: 1
    Servers:
        3574 (3574 at ssl:172.17.0.13:6644) (self) next_index=4 match_index=16
  UnitId: ovn-central/0
- Stdout: |
    abfc
    Name: OVN_Southbound
    Cluster ID: 7b58 (7b58e296-0c96-4235-8e0e-3f1916bf90bc)
    Server ID: abfc (abfcf295-1946-4cf1-a4a4-319e240dc3f9)
    Address: ssl:172.17.0.17:6644
    Status: cluster member
    Role: leader
    Term: 3
    Leader: self
    Vote: self

    Last Election started 118110 ms ago, reason: timeout
    Last Election won: 118110 ms ago
    Election timer: 4000
    Log: [2, 14]
    Entries not yet committed: 0
    Entries not yet applied: 0
    Connections: <-e2c2 ->e2c2
    Disconnections: 0
    Servers:
        e2c2 (e2c2 at ssl:172.17.0.18:6644) next_index=14 match_index=13 last msg 231 ms ago
        abfc (abfc at ssl:172.17.0.17:6644) (self) next_index=4 match_index=13
  UnitId: ovn-central/1
- Stdout: |
    e2c2
    Name: OVN_Southbound
    Cluster ID: 7b58 (7b58e296-0c96-4235-8e0e-3f1916bf90bc)
    Server ID: e2c2 (e2c2a6a0-b3c7-4e9f-8493-6d9f16f16df5)
    Address: ssl:172.17.0.18:6644
    Status: cluster member
    Role: follower
    Term: 3
    Leader: abfc
    Vote: unknown

    Election timer: 4000
    Log: [2, 14]
    Entries not yet committed: 0
    Entries not yet applied: 0
    Connections: ->0000 <-abfc
    Disconnections: 1
    Servers:
        e2c2 (e2c2 at ssl:172.17.0.18:6644) (self)
        abfc (abfc at ssl:172.17.0.17:6644) last msg 316 ms ago
  UnitId: ovn-central/2

This state is not recoverable automatically and will cause many OpenStack operational issues, because two sets of OVN DBs run in parallel inside the same deployment.
The only way to recover is to manually remove and recreate one of the leader units so that it joins the same cluster as the other 2 units.
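
For reference, a minimal sketch of that manual recovery, assuming ovn-central/1 is the leader of the stray cluster and should be recreated on the same machine (unit names and placement are illustrative and must be adapted to the actual deployment):

# Remove the unit that leads the stray cluster; removing its container should also discard its local OVN DBs.
$ juju remove-unit ovn-central/1
# Once it is fully removed, add a replacement on the same host; it should join the surviving cluster.
$ juju add-unit ovn-central --to lxd:1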

Revision history for this message
Ubuntu Kernel Bot (ubuntu-kernel-bot) wrote : Missing required logs.

This bug is missing log files that will aid in diagnosing the problem. While running an Ubuntu kernel (not a mainline or third-party kernel) please enter the following command in a terminal window:

apport-collect 1979188

and then change the status of the bug to 'Confirmed'.

If, due to the nature of the issue you have encountered, you are unable to run this command, please add a comment stating that fact and change the bug status to 'Confirmed'.

This change has been made by an automated script, maintained by the Ubuntu Kernel Team.

Revision history for this message
Amir Tzin (amirtz) wrote :

Fixing commit, currently in linux stable (hash from the stable tree):
4a333ec73dee net/mlx5e: TC NIC mode, fix tc chains miss table.

(internal mellanox/nvidia issue 3106704)

Changed in linux (Ubuntu):
status: New → Incomplete
Revision history for this message
Itai Levy (etlvnvda) wrote :

Amir, this kernel patch will solve the connectivity issues; however, it is not related to this bug, which is about how the OVN cluster handles connectivity issues.
We have another bug for the root cause of the connectivity issues: https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1978820

Revision history for this message
Itai Levy (etlvnvda) wrote :

apport-collect output attached

Changed in linux (Ubuntu):
status: Incomplete → Confirmed
Frode Nordahl (fnordahl)
no longer affects: linux (Ubuntu)
Revision history for this message
Bayani Carbone (bcarbone) wrote (last edit):

Facing the same issue. Using ovn-central 22.03/stable revision 119.
juju-crashdump prior to initializing vault: https://files.support.canonical.com/preview/triage/tmp-ovn-central/juju-crashdump-2dd7a5ed-7e22-464a-a631-6f51d17e150a.tar.xz

In my case this seems to be related to the use of the sysconfig charm and the timing of the reboots necessary to apply kernel options for hugepages and IOMMU. I was performing these reboots (one node at a time) prior to Vault initialization, and this resulted in the situation described in the bug report.

However, when performing the Vault initialization step first and then performing the reboots, this situation is avoided (a rough ordering sketch follows at the end of this comment).

Note: after Vault initialization and prior to the reboots, the ovn-chassis charms will be stuck in maintenance with the status message "Installation complete - awaiting next status"; this is because OVS will not start due to the lack of hugepages. After the reboots the charms transition to active.
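
A rough sketch of the ordering that avoided the issue (machine numbers are illustrative; the Vault initialization itself follows the vault charm documentation):

# 1. Initialize and unseal Vault and authorize the charm first (per the vault charm docs).
# 2. Wait until every ovn-central unit reports 'Unit is ready':
$ juju status ovn-central
# 3. Only then reboot the nodes one at a time to apply the sysconfig kernel options,
#    letting the OVN cluster settle between reboots:
$ juju ssh 0 sudo reboot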

Revision history for this message
Gábor Mészáros (gabor.meszaros) wrote (last edit):

Just encountered this again on a greenfield Jammy/Yoga deployment. I had a split brain with 2 units clustered and the third one on its own.

Possible cause: the units were rebooted while the status was still settling.
