Connectivity issues between ovn-central cluster units during deployment cause an unrecoverable split-brain cluster state

Bug #1979188 reported by Itai Levy
This bug affects 1 person
Affects: charm-ovn-central
Status: New
Importance: Undecided
Assigned to: Unassigned

Bug Description

Deploying Charmed OpenStack - Jammy series, Yoga release

Deploying over 2 nodes with the following ovn-central charm configuration (the units are split across the 2 nodes):
  ovn-central:
    charm: ch:ovn-central
    num_units: 3
    options:
      source: *openstack-origin
    to:
    - lxd:0
    - lxd:1
    - lxd:1
    channel: 22.03/stable
    constraints: *space-constr
    bindings:
      "": *internal-space

While deploying the cluster I had connectivity issues on the internal space due to another issue (https://bugs.launchpad.net/ubuntu/+source/plan/+bug/1978820). After the deployment completed, and before initializing Vault, this was the juju status output:

ovn-central/0* waiting idle 0/lxd/3 10.7.208.37 6641/tcp,6642/tcp 'ovsdb-peer' incomplete, 'certificates' awaiting server certificate data
ovn-central/1 waiting idle 1/lxd/5 10.7.208.58 6641/tcp,6642/tcp 'ovsdb-peer' incomplete, 'certificates' awaiting server certificate data
ovn-central/2 waiting idle 1/lxd/6 10.7.208.59 6641/tcp,6642/tcp 'ovsdb-peer' incomplete, 'certificates' awaiting server certificate data
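
While the units were in this state, a quick way to confirm whether the peers can actually reach each other over the internal space is to probe the OVN southbound RAFT port from each unit. This is a minimal sketch only; it assumes nc is available in the containers, and the 172.17.0.13:6644 target is the address ovn-central/0 reports in the cluster status output below:

$ juju run --application ovn-central 'nc -vz -w 3 172.17.0.13 6644'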

The show-unit and cluster/status outputs indicate that ovn-central/0 formed a cluster with itself as the leader, while the other 2 units on the second node were not able to discover or join the cluster:

$ juju run --application ovn-central 'sudo ovn-appctl -t /var/run/ovn/ovnsb_db.ctl cluster/status OVN_Southbound'
- Stdout: |
    3574
    Name: OVN_Southbound
    Cluster ID: 5060 (5060fdb0-8d1f-4eb9-a419-b69b297146bd)
    Server ID: 3574 (35748432-bae1-41df-91ac-c620cb843d60)
    Address: ssl:172.17.0.13:6644
    Status: cluster member
    Role: leader
    Term: 1
    Leader: self
    Vote: self

    Last Election started 57477371 ms ago, reason: timeout
    Last Election won: 57477371 ms ago
    Election timer: 1000
    Log: [2, 4]
    Entries not yet committed: 0
    Entries not yet applied: 0
    Connections:
    Disconnections: 0
    Servers:
        3574 (3574 at ssl:172.17.0.13:6644) (self) next_index=2 match_index=3
  UnitId: ovn-central/0
- ReturnCode: 1
  Stderr: |
    2022-06-20T07:41:26Z|00001|unixctl|WARN|failed to connect to /var/run/ovn/ovnsb_db.ctl
    ovn-appctl: cannot connect to "/var/run/ovn/ovnsb_db.ctl" (No such file or directory)
  Stdout: ""
  UnitId: ovn-central/1
- ReturnCode: 1
  Stderr: |
    2022-06-20T07:41:26Z|00001|unixctl|WARN|failed to connect to /var/run/ovn/ovnsb_db.ctl
    ovn-appctl: cannot connect to "/var/run/ovn/ovnsb_db.ctl" (No such file or directory)
  Stdout: ""
  UnitId: ovn-central/2

After fixing the cause of the connectivity issues on node 0 and rebooting it, we can see that unit ovn-central/1 has formed a different cluster, so we now have 2 clusters with 2 leaders. The juju status output does not indicate any problem other than waiting for Vault initialization:

ovn-central/0 waiting idle 0/lxd/3 10.7.208.37 6641/tcp,6642/tcp 'ovsdb-peer' incomplete, 'certificates' awaiting server certificate data
ovn-central/1* waiting idle 1/lxd/5 10.7.208.58 6641/tcp,6642/tcp 'ovsdb-peer' incomplete, 'certificates' awaiting server certificate data
ovn-central/2 waiting idle 1/lxd/6 10.7.208.59 6641/tcp,6642/tcp 'ovsdb-peer' incomplete, 'certificates' awaiting server certificate data

$ juju run --application ovn-central 'sudo ovn-appctl -t /var/run/ovn/ovnsb_db.ctl cluster/status OVN_Southbound'
- Stdout: |
    3574
    Name: OVN_Southbound
    Cluster ID: 5060 (5060fdb0-8d1f-4eb9-a419-b69b297146bd)
    Server ID: 3574 (35748432-bae1-41df-91ac-c620cb843d60)
    Address: ssl:172.17.0.13:6644
    Status: cluster member
    Role: leader
    Term: 2
    Leader: self
    Vote: self

    Last Election started 340646 ms ago, reason: timeout
    Last Election won: 340646 ms ago
    Election timer: 1000
    Log: [2, 5]
    Entries not yet committed: 0
    Entries not yet applied: 0
    Connections:
    Disconnections: 0
    Servers:
        3574 (3574 at ssl:172.17.0.13:6644) (self) next_index=4 match_index=4
  UnitId: ovn-central/0
- Stdout: |
    abfc
    Name: OVN_Southbound
    Cluster ID: 7b58 (7b58e296-0c96-4235-8e0e-3f1916bf90bc)
    Server ID: abfc (abfcf295-1946-4cf1-a4a4-319e240dc3f9)
    Address: ssl:172.17.0.17:6644
    Status: cluster member
    Role: leader
    Term: 1
    Leader: self
    Vote: self

    Last Election started 620728 ms ago, reason: timeout
    Last Election won: 620728 ms ago
    Election timer: 1000
    Log: [2, 3]
    Entries not yet committed: 0
    Entries not yet applied: 0
    Connections:
    Disconnections: 0
    Servers:
        abfc (abfc at ssl:172.17.0.17:6644) (self) next_index=2 match_index=2
  UnitId: ovn-central/1
- ReturnCode: 1
  Stderr: |
    2022-06-20T07:54:57Z|00001|unixctl|WARN|failed to connect to /var/run/ovn/ovnsb_db.ctl
    ovn-appctl: cannot connect to "/var/run/ovn/ovnsb_db.ctl" (No such file or directory)
  Stdout: ""
  UnitId: ovn-central/2
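
At this point the split is already visible in the two different Cluster IDs (5060 vs 7b58). A quick way to compare them across all units is to filter the same cluster/status output used above (a minimal sketch; units whose database has not started yet will still return an error):

$ juju run --application ovn-central 'sudo ovn-appctl -t /var/run/ovn/ovnsb_db.ctl cluster/status OVN_Southbound | grep "Cluster ID"'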

After fixing the cause of the connectivity issues on node 1, rebooting it, and initializing Vault, we are left with 2 OVN clusters with 2 leaders on the same space, both part of a single ovn-central application, as can also be seen in the juju status output. Unit ovn-central/2 keeps jumping between the 2 clusters, as both leaders try to add it.

ovn-central/0* active idle 0/lxd/3 10.7.208.37 6641/tcp,6642/tcp Unit is ready (leader: ovnnb_db, ovnsb_db northd: active)
ovn-central/1 active idle 1/lxd/5 10.7.208.58 6641/tcp,6642/tcp Unit is ready (leader: ovnnb_db, ovnsb_db)
ovn-central/2 active idle 1/lxd/6 10.7.208.59 6641/tcp,6642/tcp Unit is ready

$ juju run --application ovn-central 'sudo ovn-appctl -t /var/run/ovn/ovnsb_db.ctl cluster/status OVN_Southbound'
- Stdout: |
    3574
    Name: OVN_Southbound
    Cluster ID: 5060 (5060fdb0-8d1f-4eb9-a419-b69b297146bd)
    Server ID: 3574 (35748432-bae1-41df-91ac-c620cb843d60)
    Address: ssl:172.17.0.13:6644
    Status: cluster member
    Adding server e2c2 (e2c2 at ssl:172.17.0.18:6644) (adding: catchup)
    Role: leader
    Term: 2
    Leader: self
    Vote: self

    Last Election started 1648247 ms ago, reason: timeout
    Last Election won: 1648247 ms ago
    Election timer: 4000
    Log: [2, 17]
    Entries not yet committed: 0
    Entries not yet applied: 0
    Connections:
    Disconnections: 1
    Servers:
        3574 (3574 at ssl:172.17.0.13:6644) (self) next_index=4 match_index=16
  UnitId: ovn-central/0
- Stdout: |
    abfc
    Name: OVN_Southbound
    Cluster ID: 7b58 (7b58e296-0c96-4235-8e0e-3f1916bf90bc)
    Server ID: abfc (abfcf295-1946-4cf1-a4a4-319e240dc3f9)
    Address: ssl:172.17.0.17:6644
    Status: cluster member
    Role: leader
    Term: 3
    Leader: self
    Vote: self

    Last Election started 118110 ms ago, reason: timeout
    Last Election won: 118110 ms ago
    Election timer: 4000
    Log: [2, 14]
    Entries not yet committed: 0
    Entries not yet applied: 0
    Connections: <-e2c2 ->e2c2
    Disconnections: 0
    Servers:
        e2c2 (e2c2 at ssl:172.17.0.18:6644) next_index=14 match_index=13 last msg 231 ms ago
        abfc (abfc at ssl:172.17.0.17:6644) (self) next_index=4 match_index=13
  UnitId: ovn-central/1
- Stdout: |
    e2c2
    Name: OVN_Southbound
    Cluster ID: 7b58 (7b58e296-0c96-4235-8e0e-3f1916bf90bc)
    Server ID: e2c2 (e2c2a6a0-b3c7-4e9f-8493-6d9f16f16df5)
    Address: ssl:172.17.0.18:6644
    Status: cluster member
    Role: follower
    Term: 3
    Leader: abfc
    Vote: unknown

    Election timer: 4000
    Log: [2, 14]
    Entries not yet committed: 0
    Entries not yet applied: 0
    Connections: ->0000 <-abfc
    Disconnections: 1
    Servers:
        e2c2 (e2c2 at ssl:172.17.0.18:6644) (self)
        abfc (abfc at ssl:172.17.0.17:6644) last msg 316 ms ago
  UnitId: ovn-central/2

This state is not recoverable automatically and will cause many OpenStack operational issues, because two sets of OVN DBs run in parallel inside the same deployment.
The only way to recover is to manually remove and recreate one of the leader units so that it joins the same cluster as the other 2 units.
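
For reference, a minimal sketch of that manual recovery, assuming ovn-central/1 is the leader of the stray cluster and should be recreated on the same machine (unit names and placement are illustrative and must be adapted to the actual deployment):

# Remove the unit that leads the stray cluster; removing its container should also discard its local OVN DBs.
$ juju remove-unit ovn-central/1
# Once it is fully removed, add a replacement on the same host; it should join the surviving cluster.
$ juju add-unit ovn-central --to lxd:1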

Revision history for this message
Ubuntu Kernel Bot (ubuntu-kernel-bot) wrote : Missing required logs.

This bug is missing log files that will aid in diagnosing the problem. While running an Ubuntu kernel (not a mainline or third-party kernel) please enter the following command in a terminal window:

apport-collect 1979188

and then change the status of the bug to 'Confirmed'.

If, due to the nature of the issue you have encountered, you are unable to run this command, please add a comment stating that fact and change the bug status to 'Confirmed'.

This change has been made by an automated script, maintained by the Ubuntu Kernel Team.

Revision history for this message
Amir Tzin (amirtz) wrote :

Fixing commit, currently in linux stable (hash from the stable tree):
4a333ec73dee net/mlx5e: TC NIC mode, fix tc chains miss table.

(internal mellanox/nvidia issue 3106704)

Changed in linux (Ubuntu):
status: New → Incomplete
Revision history for this message
Itai Levy (etlvnvda) wrote :

Amir, this kernel patch will solve the connectivity issues; however, it is not related to this bug, which is about how the OVN cluster handles connectivity issues.
We have another bug for the root cause of the connectivity issues: https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1978820

Revision history for this message
Itai Levy (etlvnvda) wrote :

apport-collect output attached

Changed in linux (Ubuntu):
status: Incomplete → Confirmed
Frode Nordahl (fnordahl)
no longer affects: linux (Ubuntu)
Revision history for this message
Bayani Carbone (bcarbone) wrote (last edit):

Facing the same issue. Using ovn-central 22.03/stable revision 119.
juju-crashdump prior to initializing vault: https://files.support.canonical.com/preview/triage/tmp-ovn-central/juju-crashdump-2dd7a5ed-7e22-464a-a631-6f51d17e150a.tar.xz

In my case this seems to be related to the use of the sysconfig charm and the timing of the reboots necessary to apply kernel options for hugepages and IOMMU. I was performing these reboots (one node at a time) prior to Vault initialization, and this resulted in the situation described in the bug report.

However, when performing the Vault initialization step first and then performing the reboots, this situation is avoided (a rough ordering sketch follows at the end of this comment).

Note: after Vault initialization and prior to the reboots, the ovn-chassis charms will be stuck in maintenance with the status message "Installation complete - awaiting next status"; this is because OVS will not start due to the lack of hugepages. After the reboots the charms transition to active.
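
A rough sketch of the ordering that avoided the issue (machine numbers are illustrative; the Vault initialization itself follows the vault charm documentation):

# 1. Initialize and unseal Vault and authorize the charm first (per the vault charm docs).
# 2. Wait until every ovn-central unit reports 'Unit is ready':
$ juju status ovn-central
# 3. Only then reboot the nodes one at a time to apply the sysconfig kernel options,
#    letting the OVN cluster settle between reboots:
$ juju ssh 0 sudo reboot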

Revision history for this message
Gábor Mészáros (gabor.meszaros) wrote (last edit):

Just encountered this again on a greenfield Jammy/Yoga deployment. I had a split brain with 2 units clustered and the third one on its own.

Possible cause: the units were rebooted while the status was still settling.
