Raft bug: OVSDB leadership transfers every 10-20 min after initial compaction

Bug #1990978 reported by Alin-Gabriel Serdean
This bug affects 6 people
Affects                   Status        Importance  Assigned to       Milestone
Ubuntu Cloud Archive      Invalid       Undecided   Unassigned
  Yoga                    Fix Released  Undecided   Unassigned
openvswitch (Ubuntu)      Fix Released  Undecided   Dariusz Gadomski
  Focal                   Fix Released  Undecided   Unassigned
  Jammy                   Fix Released  Undecided   Dariusz Gadomski
  Kinetic                 Fix Released  Undecided   Dariusz Gadomski

Bug Description

The first compaction starts after 24 hours, or earlier if the DB size doubles.

After that, compactions are triggered every 10-20 minutes, and each one causes a Raft leadership transfer.
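The schedule described above can be modelled with a small shell function. This is a simplified sketch, not the ovsdb-server code: the function name `compaction_due` and the fixed 10-minute lower bound (chosen only to match the observed 10-20 minute period) are assumptions for illustration. It also shows one way the symptom could arise: if the size recorded after a compaction were wrongly zero, the doubling test would always pass, so a compaction (and its leadership transfer) would fire as soon as the lower bound elapsed.

```bash
#!/usr/bin/env bash
# Simplified model of the compaction schedule described in this report.
# This is NOT the ovsdb-server implementation; the function name and the
# fixed 600 s lower bound are illustrative assumptions.
#   $1 = seconds since the last compaction
#   $2 = current database file size (bytes)
#   $3 = database file size right after the last compaction (bytes)
# Prints "yes" if a compaction would be due under this model, else "no".
compaction_due() {
    local elapsed=$1 size=$2 base=$3
    local min_interval=600     # assumed lower bound (10 min)
    local max_interval=86400   # 24 hours, as stated in the description
    if (( elapsed < min_interval )); then
        echo no
    elif (( elapsed >= max_interval || size >= 2 * base )); then
        echo yes
    else
        echo no
    fi
}
```

For example, `compaction_due 3600 150 100` prints `no` (the DB has not doubled and 24 hours have not passed), while `compaction_due 700 150 0` prints `yes`, because any size passes the doubling test against a base of zero.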

The OVS version hitting this issue:
ovs-vsctl (Open vSwitch) 2.17.2

Upstream commit that fixes the issue: https://github.com/openvswitch/ovs/commit/a32a4e1fa2d3fad284834d4b7bccc2e71d33f9da

Backport to the 2.17 branch: https://github.com/openvswitch/ovs/commit/dfc3e65c8191f5dc375337c23aed128b5c0d7781

Reproducer:
Trigger compactions manually with the command-line tool:
ovs-appctl -t /var/run/ovn/ovnsb_db.ctl ovsdb-server/compact
or by creating DB pressure, e.g.:
#!/bin/bash
# Create 5000 logical switches with 2000 ports each, then delete them all.
for i in {1..5000}; do
    if ! ovn-nbctl ls-add "sw$i"; then
        echo "Failed on ls-add i: $i"
        exit 1
    fi
    for j in {1..2000}; do
        echo "Iteration i: $i and j: $j"
        # The "-p" separator keeps port names unique: without it, switch 1
        # port 11 and switch 11 port 1 would both be named "sw111".
        if ! ovn-nbctl lsp-add "sw$i" "sw$i-p$j"; then
            echo "Failed on lsp-add i: $i and j: $j"
            exit 1
        fi
    done
done
for i in {1..5000}; do
    echo "Delete iteration i: $i"
    if ! ovn-nbctl ls-del "sw$i"; then
        echo "Failed on ls-del i: $i"
        exit 1
    fi
done

Check for leadership transfers using:
sudo grep "Transferring leadership" /var/log/ovn/ov* | grep ovsdb-server-sb.log
On an affected version, a new entry appears every 10-20 minutes.
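To see the period directly rather than eyeballing timestamps, the gaps between consecutive transfers can be computed from the log. This is a minimal sketch, not part of the original report: the function name `leadership_gaps` is hypothetical, and it assumes each log line starts with an ISO-8601 UTC timestamp followed by `|` (the usual ovsdb-server file log layout) and that GNU `date` is available.

```bash
#!/usr/bin/env bash
# Print the gap, in whole minutes, between consecutive
# "Transferring leadership" entries read from stdin.
# ASSUMPTION: lines look like "2022-09-27T10:00:00.123Z|00042|raft|INFO|...".
# Requires GNU date for "date -d".
leadership_gaps() {
    local prev="" line ts now
    while IFS= read -r line; do
        [[ $line == *"Transferring leadership"* ]] || continue
        ts=${line%%|*}     # first "|"-separated field: the timestamp
        ts=${ts%Z}         # drop the trailing UTC marker
        ts=${ts%%.*}       # drop fractional seconds
        now=$(date -u -d "${ts/T/ } UTC" +%s) || continue
        if [[ -n $prev ]]; then
            echo $(( (now - prev) / 60 ))
        fi
        prev=$now
    done
}
```

Usage, reading a single log file so that `grep`-style filename prefixes do not get in the way: `sudo cat /var/log/ovn/ovsdb-server-sb.log | leadership_gaps`. On an affected build the printed gaps would be expected to cluster between 10 and 20.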

=== Ubuntu SRU Details ===

[Impact]
See the bug description above.

[Test Case]
* Deploy OpenStack Yoga.
* Connect to the NB DB leader and run the script above to generate DB pressure. A compaction occurs once the DB doubles in size.
* After one hour, check for subsequent leadership transfers with the following command:
sudo grep "Transferring leadership" /var/log/ovn/ov* | grep ovsdb-server-sb.log
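The one-hour check in the test case can also be scripted by counting matching log lines before and after the wait. A minimal sketch, assuming the log is not rotated or truncated during the wait; the function names `count_transfers` and `new_transfers_after_wait` are hypothetical.

```bash
#!/usr/bin/env bash
# Count "Transferring leadership" lines in a log file (prints 0 if none;
# "|| true" hides grep's non-zero exit status when nothing matches).
count_transfers() {
    grep -c "Transferring leadership" "$1" || true
}

# Print how many new transfers were logged while waiting.
#   $1 = log file, $2 = seconds to wait (default: one hour)
# ASSUMPTION: the log is not rotated or truncated during the wait.
new_transfers_after_wait() {
    local log=$1 wait_s=${2:-3600} before after
    before=$(count_transfers "$log")
    sleep "$wait_s"
    after=$(count_transfers "$log")
    echo $(( after - before ))
}
```

Run it as a user that can read the log, e.g. `new_transfers_after_wait /var/log/ovn/ovsdb-server-sb.log`. On an affected 2.17.2 build this would be expected to report several transfers; with the fix it should report none unless the DB has doubled in size during the hour.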

[Where things could go wrong]
A regression is not expected, since the fix reduces the frequency of leadership transfers.
The fix has also been applied upstream (https://github.com/openvswitch/ovs/commit/dfc3e65c8191f5dc375337c23aed128b5c0d778); however, a new upstream version containing it has not been released yet.

Alin-Gabriel Serdean (alin-serdean) wrote:

Alternatively, the script in the Reproducer section of the bug description can be used to generate DB pressure.

Ubuntu Foundations Team Bug Bot (crichton) wrote:

The attachment "openvswitch_2.17.2-0ubuntu0.22.04.1~cloud0ubuntu2.debdiff" seems to be a debdiff. The ubuntu-sponsors team has been subscribed to the bug report so that they can review and hopefully sponsor the debdiff. If the attachment isn't a patch, please remove the "patch" flag from the attachment, remove the "patch" tag, and if you are a member of the ~ubuntu-sponsors, unsubscribe the team.

[This is an automated message performed by a Launchpad user owned by ~brian-murray, for any issue please contact him.]

tags: added: patch
Changed in openvswitch (Ubuntu Kinetic):
status: New → Fix Released
Edward Hope-Morley (hopem) wrote:

My bad: the patches are in branches but not in tags, so they are not released yet.

description: updated
Changed in openvswitch (Ubuntu Kinetic):
status: Fix Released → New
Edward Hope-Morley (hopem) wrote:
Edward Hope-Morley (hopem) wrote:
description: updated
description: updated
description: updated
James Page (james-page) wrote:

This fix is included in openvswitch 2.17.3 and 3.0.1. Do we want to cherry-pick this fix, or wait for the next round of point-release SRUs for this package?

Edward Hope-Morley (hopem) wrote:

@james-page We consider this fix reasonably urgent: with 2.17.2 it manifests as API timeouts and/or slowness when creating ports, networks, etc., so an SRU would be preferable.

Changed in openvswitch (Ubuntu Kinetic):
assignee: nobody → Dariusz Gadomski (dgadomski)
Changed in openvswitch (Ubuntu Jammy):
assignee: nobody → Dariusz Gadomski (dgadomski)
tags: added: sts sts-sponsor-dgadomski
Launchpad Janitor (janitor) wrote:

Status changed to 'Confirmed' because the bug affects multiple users.

Changed in openvswitch (Ubuntu Jammy):
status: New → Confirmed
Changed in openvswitch (Ubuntu Kinetic):
status: New → Confirmed
Changed in openvswitch (Ubuntu):
status: New → Confirmed
Frode Nordahl (fnordahl) wrote:

Thank you for providing the patches, however we will be resolving this issue by SRU'ing upstream point releases. This work is tracked in bug 1995289 and bug 1995288.

Hua Zhang (zhhuabj) wrote:

According to LP bug #1996594, this problem seems to affect focal-xena as well. Can we also nominate it for Focal and Xena?

Edward Hope-Morley (hopem) wrote:

Marking as Fix Released for Focal to Jammy, since 2.17.3 is now released.

Changed in openvswitch (Ubuntu Kinetic):
status: Confirmed → Fix Released
Changed in openvswitch (Ubuntu Jammy):
status: Confirmed → Fix Released
Changed in openvswitch (Ubuntu Focal):
status: New → Fix Released
Stefan Lupsa (stefanlupsacbsl) wrote:

Hello, as mentioned above, this is also affecting focal-xena.

The commit addressing the issue (https://github.com/openvswitch/ovs/commit/a32a4e1fa2d3fad284834d4b7bccc2e71d33f9da) has been backported to 2.16 in the OVS repo and is available in the v2.16.5 release tag; however, the latest Cloud Archive package in the latest Kolla Xena image is still 2.16.4:
openvswitch-switch:
  Installed: 2.16.4-0ubuntu1~cloud0
  Candidate: 2.16.4-0ubuntu1~cloud0
  Version table:
 *** 2.16.4-0ubuntu1~cloud0 500
        500 http://ubuntu-cloud.archive.canonical.com/ubuntu focal-updates/xena/main amd64 Packages
        100 /var/lib/dpkg/status
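A quick way to tell whether an installed package is at least the first upstream release carrying the fix is to compare the upstream part of its version string. A minimal sketch: the function name `has_raft_compaction_fix` is hypothetical, the fixed releases (2.16.5, 2.17.3, 3.0.1) are taken from the comments in this report, and it relies on GNU `sort -V`. A distro package could carry a backported patch without bumping the upstream version, so a "no" here means "check the package changelog", not proof that the fix is missing.

```bash
#!/usr/bin/env bash
# Exit 0 if the upstream part of a Debian/Ubuntu version is >= the first
# release that (per this report) contains the fix, 1 if it is older,
# 2 if the release series is not covered by this report.
has_raft_compaction_fix() {
    local ver=$1 upstream series min
    upstream=${ver#*:}          # drop epoch, if any
    upstream=${upstream%%-*}    # drop Debian revision, e.g. "-0ubuntu1~cloud0"
    series=$(cut -d. -f1,2 <<< "$upstream")
    case $series in
        2.16) min=2.16.5 ;;
        2.17) min=2.17.3 ;;
        3.0)  min=3.0.1 ;;
        *)    return 2 ;;
    esac
    # With version sort, the smaller of the two comes first.
    [[ $(printf '%s\n' "$min" "$upstream" | sort -V | head -n1) == "$min" ]]
}
```

For example: `has_raft_compaction_fix "$(dpkg-query -W -f='${Version}' openvswitch-switch)" && echo fixed || echo "not fixed or unknown"`. For the 2.16.4-0ubuntu1~cloud0 package shown above, the check fails because 2.16.4 is older than 2.16.5.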

Frode Nordahl (fnordahl)
Changed in openvswitch (Ubuntu):
status: Confirmed → Fix Released
Changed in cloud-archive:
status: New → Invalid