Upgrade from OVN 20.03 to newer OVN version will cause data plane outage

Bug #1940043 reported by Frode Nordahl
This bug affects 4 people
Affects                        Status         Importance  Assigned to    Milestone
Ubuntu Cloud Archive           Fix Released   Undecided   Unassigned
  Wallaby                      Triaged        High        Frode Nordahl
charm-layer-ovn                Fix Released   High        Frode Nordahl
charm-ovn-chassis              Fix Released   High        Frode Nordahl
  20.03                        Fix Released   Undecided   Unassigned
  20.12                        Fix Released   Undecided   Unassigned
  21.09                        Fix Released   Undecided   Unassigned
charm-ovn-dedicated-chassis    Fix Released   High        Frode Nordahl
  20.03                        Fix Released   Undecided   Unassigned
  20.12                        Fix Released   Undecided   Unassigned
  21.09                        Fix Released   Undecided   Unassigned
ovn (Ubuntu)                   Fix Released   High        Unassigned
  Focal                        Fix Released   High        Frode Nordahl
  Hirsute                      Won't Fix      Undecided   Frode Nordahl
  Impish                       Fix Released   High        Unassigned

Bug Description

[Impact]
When upgrading from OVN 20.03, as made available in Ubuntu Focal, to a newer version of OVN, it is currently not possible to upgrade without causing a data plane outage.

If the user attempts to upgrade the central components first, the ovn-controller will tear down connectivity to running instances as it may not fully understand the data structure of a newer database.

If the user attempts to upgrade the ovn-controller first, recent releases are not guaranteed to understand the older database and connectivity may remain down until all hypervisors and central components have been upgraded.

If the user attempts to manually stop the ovn-controller during the upgrade to avoid it inadvertently tearing down connectivity on central component upgrade, cloud instances will be deprived of vital services such as DNS lookup and DHCP.

To fix this situation two changes are needed:
1) Backport of an upstream feature [0] that allows the ovn-controller to detect version mismatch and subsequently refrain from making further changes to the local Open vSwitch instance until the version mismatch is corrected.

2) Make ovn-controller not clear out runtime flow state in Open vSwitch on exit by updating the ovn-controller systemd service to pass the `--restart` argument when stopping the controller. This flag tells the ovn-controller process that it should not clear out Open vSwitch flows and OVN SB database records on exit, which allows already installed state to continue operation until the new instance of the ovn-controller process starts. [1][2][3]

It does not mean that the service will be restarted as opposed to being stopped, as one might think based on the name of the argument.

This change serves two purposes:

2a) Allow upgrading the ovn-controller to a newer version than the central components, while retaining connectivity to running instances until the central components are upgraded.

2b) Minimize the downtime on package upgrade.
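
One way to check whether an installed unit already carries change 2) is to look at its ExecStop line. This is only a sketch: the helper name is hypothetical, and the install path of the unit file is an assumption (`systemctl cat ovn-controller` shows the file systemd actually uses).

```shell
#!/bin/sh
# Hypothetical helper: succeed if the given systemd unit file stops the
# controller via `ovn-ctl stop_controller` with the --restart option.
unit_passes_restart() {
    # $1: path to a systemd unit file
    grep -E '^ExecStop=.*stop_controller' "$1" | grep -q -- '--restart'
}

# Assumed install path; adjust to wherever the unit lives on the system.
unit=/lib/systemd/system/ovn-controller.service
if [ -r "$unit" ]; then
    if unit_passes_restart "$unit"; then
        echo "ExecStop passes --restart"
    else
        echo "ExecStop does not pass --restart"
    fi
fi
```

On a host without the unit file the script prints nothing.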

[Test Plan]

1. Deploy OpenStack Ussuri from the Focal archive.
2. Launch an instance and confirm connectivity.
3. Add UCA or other PPA with a newer version of OVN and perform upgrade of the OVN components on relevant units in the deployment.
4. Confirm that the new version of the central components makes the ovn-controller log a version mismatch, and that connectivity to the test instance continues.
5. Upgrade the data plane units and confirm that the version mismatch is resolved and that instances retain connectivity, with minimal downtime during the upgrade.
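
"Minimal downtime" in step 5 is easier to judge with a number. One rough approach, sketched here, is to leave ping running against the test instance for the duration of the upgrade and count lost packets from its summary line. The helper name is hypothetical, and the parsing assumes the iputils-style summary ("N packets transmitted, M received, ...").

```shell
#!/bin/sh
# Hypothetical helper: read ping output on stdin and print how many
# packets were lost, based on the iputils-style summary line.
lost_packets() {
    awk '/packets transmitted/ { print $1 - $4 }'
}
```

Usage would be along the lines of `ping -c 300 <instance address> | lost_packets`. With ping's default one-second interval, each lost packet corresponds to roughly one second without connectivity.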

[Regression Potential]

The backported feature is optional and enabled by specifically entering a key-value pair into the local Open vSwitch database to enable it. It has also been available upstream for several releases.
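
For illustration, the key in question is external_ids:ovn-match-northd-version, as named in the charm commit messages attached to this bug. A minimal sketch of setting it with ovs-vsctl on a chassis follows; the wrapper function and the OVS_VSCTL variable are hypothetical conveniences, not part of any package.

```shell
#!/bin/sh
# OVS_VSCTL lets the command be swapped out, e.g. `echo` for a dry run.
OVS_VSCTL=${OVS_VSCTL:-ovs-vsctl}

# Hypothetical wrapper: opt the local ovn-controller in to (or out of)
# version-mismatch detection. $1 is "true" or "false".
set_match_northd_version() {
    "$OVS_VSCTL" set open_vswitch . "external_ids:ovn-match-northd-version=$1"
}
```

Calling `set_match_northd_version true` on each chassis enables the check; with `OVS_VSCTL=echo` the arguments are printed instead of being applied.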

The change to the ovn-controller systemd service has been in Ubuntu since Impish [3] and we have had no reports of side effects of this change.

[Original Bug Description]
The upstream recommendation for upgrades of OVN is to first upgrade the data plane components (chassis aka. ovn-controller), and then upgrade the central components (the database schema and ovn-northd). The rationale for this is that the new version of the ovn-controller is required to cope with any changes to database schema or how northd programs flows.

However, in the course of rapid OVN development, changes have also been introduced that prevent the new ovn-controller from coping with an old database schema, breaking the recommended upgrade procedure.

To cope with this, upstream has introduced a new optional configuration for the ovn-controller that allows it to detect version inconsistencies and, when they are present, stops it from making changes to the data plane until the version inconsistency is resolved [0].

For the above-mentioned configuration to be effective, we also need the package to call `ovn-ctl stop_controller` with the `--restart` option so that the ovn-controller does not flush the installed flows on exit.

We should make required changes to packages and charms to allow upgrades to progress with less data plane outage.

0: https://github.com/ovn-org/ovn/commit/1dd27ea7aea40122c1edbff845e14abaa70c0413
1: https://github.com/ovn-org/ovn/commit/f508fcc14abfaaa13e9f1bf3b5b6bac59bd27a5f
2: https://github.com/ovn-org/ovn/commit/45c7a85dc7f2af56191a47f1357d16b8af618e20
3: https://git.launchpad.net/~ubuntu-server-dev/ubuntu/+source/ovn/commit/debian/ovn-host.ovn-controller.service?id=3c601ecc13724d3f13ec0cc989f6ffd838f787f8

Frode Nordahl (fnordahl)
Changed in charm-ovn-chassis:
status: New → In Progress
importance: Undecided → High
assignee: nobody → Frode Nordahl (fnordahl)
Frode Nordahl (fnordahl)
description: updated
Changed in ovn (Ubuntu Impish):
status: New → In Progress
importance: Undecided → High
assignee: nobody → Frode Nordahl (fnordahl)
Frode Nordahl (fnordahl)
Changed in charm-layer-ovn:
status: New → Incomplete
status: Incomplete → In Progress
importance: Undecided → High
assignee: nobody → Frode Nordahl (fnordahl)
Changed in charm-ovn-dedicated-chassis:
status: New → In Progress
importance: Undecided → High
assignee: nobody → Frode Nordahl (fnordahl)
status: In Progress → Triaged
Changed in charm-ovn-chassis:
status: In Progress → Triaged
Revision history for this message
Frode Nordahl (fnordahl) wrote :
Changed in charm-layer-ovn:
status: In Progress → Fix Committed
milestone: none → 21.10
OpenStack Infra (hudson-openstack) wrote : Fix proposed to charm-ovn-chassis (master)

Fix proposed to branch: master
Review: https://review.opendev.org/c/x/charm-ovn-chassis/+/804743

Changed in charm-ovn-chassis:
status: Triaged → In Progress
Changed in charm-ovn-dedicated-chassis:
status: Triaged → In Progress
OpenStack Infra (hudson-openstack) wrote : Fix proposed to charm-ovn-dedicated-chassis (master)
OpenStack Infra (hudson-openstack) wrote : Fix merged to charm-ovn-dedicated-chassis (master)

Reviewed: https://review.opendev.org/c/x/charm-ovn-dedicated-chassis/+/804744
Committed: https://opendev.org/x/charm-ovn-dedicated-chassis/commit/33ed06cb238ea6b7f031a6500c054d522d538d88
Submitter: "Zuul (22348)"
Branch: master

commit 33ed06cb238ea6b7f031a6500c054d522d538d88
Author: Frode Nordahl <email address hidden>
Date: Mon Aug 16 15:27:46 2021 +0200

    Improve handling of major version upgrades

    Setting the external_ids:ovn-match-northd-version value to
    'true' will make the ovn-controller refrain from making updates to
    the data plane tables in the event of a version mismatch.

    This in combination with stopping the ovn-controller with the
    ovn-ctl stop_controller --restart command will allow upgrades
    to progress with little or no data plane downtime.
    (Note that we will accomplish this by a separate proposal to the
    OVN package itself in Ubuntu.)

    As soon as the central components are upgraded ovn-controller will
    notice and resume (re-)programming of the local Open vSwitch data
    plane.

    Closes-Bug: #1940043
    Change-Id: I16fedbc455e25bec0de4a475a9daa55b700ab3a0

Changed in charm-ovn-dedicated-chassis:
status: In Progress → Fix Committed
Changed in charm-ovn-chassis:
status: In Progress → Fix Committed
OpenStack Infra (hudson-openstack) wrote : Fix merged to charm-ovn-chassis (master)

Reviewed: https://review.opendev.org/c/x/charm-ovn-chassis/+/804743
Committed: https://opendev.org/x/charm-ovn-chassis/commit/8cbffde5f7dcdd3ab69c892454daf67dc61124f3
Submitter: "Zuul (22348)"
Branch: master

commit 8cbffde5f7dcdd3ab69c892454daf67dc61124f3
Author: Frode Nordahl <email address hidden>
Date: Mon Aug 16 15:26:35 2021 +0200

    Improve handling of major version upgrades

    Setting the external_ids:ovn-match-northd-version value to
    'true' will make the ovn-controller refrain from making updates to
    the data plane tables in the event of a version mismatch.

    This in combination with stopping the ovn-controller with the
    ovn-ctl stop_controller --restart command will allow upgrades
    to progress with little or no data plane downtime.
    (Note that we will accomplish this by a separate proposal to the
    OVN package itself in Ubuntu.)

    As soon as the central components are upgraded ovn-controller will
    notice and resume (re-)programming of the local Open vSwitch data
    plane.

    Closes-Bug: #1940043
    Change-Id: I1e2bb031a970597c5e5b587b3219e4c4f7db2e36

Frode Nordahl (fnordahl)
Changed in ovn (Ubuntu Impish):
status: In Progress → Fix Committed
assignee: Frode Nordahl (fnordahl) → nobody
Changed in ovn (Ubuntu Hirsute):
assignee: nobody → Frode Nordahl (fnordahl)
Changed in ovn (Ubuntu Focal):
assignee: nobody → Frode Nordahl (fnordahl)
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package ovn - 21.09.0~git20210806.d08f89e21-0ubuntu1.1

---------------
ovn (21.09.0~git20210806.d08f89e21-0ubuntu1.1) impish; urgency=medium

  * Allow upgrades without data plane outage.
    - d/ovn-host.ovn-controller.service: Pass --restart option when
      calling `ovn-ctl stop_controller` (LP: #1940043).

 -- Frode Nordahl <email address hidden> Mon, 16 Aug 2021 11:53:37 +0200

Changed in ovn (Ubuntu Impish):
status: Fix Committed → Fix Released
Changed in charm-ovn-chassis:
milestone: none → 21.10
Changed in charm-ovn-dedicated-chassis:
milestone: none → 21.10
Changed in charm-ovn-chassis:
status: Fix Committed → Fix Released
Changed in charm-layer-ovn:
status: Fix Committed → Fix Released
Changed in charm-ovn-dedicated-chassis:
status: Fix Committed → Fix Released
Brian Murray (brian-murray) wrote :

The Hirsute Hippo has reached End of Life, so this bug will not be fixed for that release.

Changed in ovn (Ubuntu Hirsute):
status: New → Won't Fix
Frode Nordahl (fnordahl)
Changed in cloud-archive:
status: New → Fix Released
Changed in ovn (Ubuntu Focal):
status: New → Triaged
Frode Nordahl (fnordahl)
description: updated
Frode Nordahl (fnordahl)
Changed in ovn (Ubuntu Focal):
importance: Undecided → High
Steve Langasek (vorlon) wrote :

Please explain how the changes to debian/ovn-host.ovn-controller.service relate to this issue. The change is listed in the changelog but there is no explanation for why a systemd 'stop' command should be passed an argument of '--restart'.

Changed in ovn (Ubuntu Focal):
status: Triaged → Incomplete
Frode Nordahl (fnordahl) wrote :

Steve, thank you for pointing out the lack of commentary around the need for updating the ovn-controller systemd service as part of this SRU. I have updated the bug description to include the reasoning behind it.

description: updated
Changed in ovn (Ubuntu Focal):
status: Incomplete → Triaged
Frode Nordahl (fnordahl)
description: updated
Timo Aaltonen (tjaalton) wrote : Please test proposed package

Hello Frode, or anyone else affected,

Accepted ovn into focal-proposed. The package will build now and be available at https://launchpad.net/ubuntu/+source/ovn/20.03.2-0ubuntu0.20.04.4 in a few hours, and then in the -proposed repository.

Please help us by testing this new package. See https://wiki.ubuntu.com/Testing/EnableProposed for documentation on how to enable and use -proposed. Your feedback will aid us getting this update out to other Ubuntu users.

If this package fixes the bug for you, please add a comment to this bug, mentioning the version of the package you tested, what testing has been performed on the package and change the tag from verification-needed-focal to verification-done-focal. If it does not fix the bug for you, please add a comment stating that, and change the tag to verification-failed-focal. In either case, without details of your testing we will not be able to proceed.

Further information regarding the verification process can be found at https://wiki.ubuntu.com/QATeam/PerformingSRUVerification . Thank you in advance for helping!

N.B. The updated package will be released to -updates after the bug(s) fixed by this package have been verified and the package has been in -proposed for a minimum of 7 days.

Changed in ovn (Ubuntu Focal):
status: Triaged → Fix Committed
tags: added: verification-needed verification-needed-focal
Frode Nordahl (fnordahl) wrote :

Package versions before we start:
$ juju run --application ovn-central 'dpkg -l |grep ovn'
- Stdout: |
    ii ovn-central 20.03.2-0ubuntu0.20.04.4 amd64 OVN central components
    ii ovn-common 20.03.2-0ubuntu0.20.04.4 amd64 OVN common components
  UnitId: ovn-central/0
- Stdout: |
    ii ovn-central 20.03.2-0ubuntu0.20.04.4 amd64 OVN central components
    ii ovn-common 20.03.2-0ubuntu0.20.04.4 amd64 OVN common components
  UnitId: ovn-central/1
- Stdout: |
    ii ovn-central 20.03.2-0ubuntu0.20.04.4 amd64 OVN central components
    ii ovn-common 20.03.2-0ubuntu0.20.04.4 amd64 OVN common components
  UnitId: ovn-central/2

$ juju run --application ovn-chassis 'dpkg -l |grep ovn'
- Stdout: |
    ii neutron-ovn-metadata-agent 2:16.4.2-0ubuntu4 all Neutron is a virtual network service for Openstack - OVN metadata agent
    ii ovn-common 20.03.2-0ubuntu0.20.04.4 amd64 OVN common components
    ii ovn-host 20.03.2-0ubuntu0.20.04.4 amd64 OVN host components
  UnitId: ovn-chassis/0
- Stdout: |
    ii neutron-ovn-metadata-agent 2:16.4.2-0ubuntu4 all Neutron is a virtual network service for Openstack - OVN metadata agent
    ii ovn-common 20.03.2-0ubuntu0.20.04.4 amd64 OVN common components
    ii ovn-host 20.03.2-0ubuntu0.20.04.4 amd64 OVN host components
  UnitId: ovn-chassis/1

Ping running instances:
$ ping 10.78.95.55
PING 10.78.95.55 (10.78.95.55) 56(84) bytes of data.
64 bytes from 10.78.95.55: icmp_seq=1 ttl=63 time=1.80 ms
64 bytes from 10.78.95.55: icmp_seq=2 ttl=63 time=1.22 ms
64 bytes from 10.78.95.55: icmp_seq=3 ttl=63 time=1.06 ms
...

$ ping 10.78.95.162
PING 10.78.95.162 (10.78.95.162) 56(84) bytes of data.
64 bytes from 10.78.95.162: icmp_seq=1 ttl=63 time=1.08 ms
64 bytes from 10.78.95.162: icmp_seq=2 ttl=63 time=0.545 ms
64 bytes from 10.78.95.162: icmp_seq=3 ttl=63 time=0.516 ms
...

Ensure OVN DNS interception/resolution is enabled and working:
ubuntu@zaza-neutrontests-ins-1:~$ dig zaza-neutrontests-ins-2 @10.78.95.1
...
;; ADDITIONAL SECTION:
zaza-neutrontests-ins-2. 3600 IN A 192.168.0.180

Ensure the ovn-controllers pick up the version mismatch prior to upgrade:

Note that the backported version mismatch handling code does not have the
additional version mismatch check in the incremental processing engine that
later versions have. This means that we need to ensure the main loop version
mismatch check has run prior to allowing northd to fill database tables after
an upgrade. We will just have to deal with this in the charms and/or as part
of upgrade documentation.

Force the version mismatch to happen before we act...

tags: added: verification-done verification-done-focal
removed: verification-needed verification-needed-focal
Frode Nordahl (fnordahl) wrote :

The required upgrade steps have now been added to the upstream OVN documentation:
https://github.com/ovn-org/ovn/commit/e7ed121ee0f851289057851272e13c6d02d4ce02

Chris Halse Rogers (raof) wrote : Update Released

The verification of the Stable Release Update for ovn has completed successfully and the package is now being released to -updates. Subsequently, the Ubuntu Stable Release Updates Team is being unsubscribed and will not receive messages about this bug report. In the event that you encounter a regression using the package from -updates please report a new bug using ubuntu-bug and tag the bug report regression-update so we can easily find any regressions.

Launchpad Janitor (janitor) wrote :

This bug was fixed in the package ovn - 20.03.2-0ubuntu0.20.04.4

---------------
ovn (20.03.2-0ubuntu0.20.04.4) focal; urgency=medium

  * Adapt to changes made in previous OVS point release (LP: #1980213):
    - d/control: Update required openvswitch build requirement.
    - d/p/lp-1980213-treewide-bump-ovs-and-fix-problematic-loops.patch
  * Fix upgrade from OVN 20.03 to newer OVN versions (LP: #1940043):
    - d/ovn-host.ovn-controller.service: Pass --restart option when
      calling `ovn-ctl stop_controller`
    - d/p/lp-1940043-0001-Provide-the-option-to-pin-ovn-controller-and-ovn-nor.patch
    - d/p/lp-1940043-0002-controller-Allow-pinctrl-thread-to-handle-packet-ins.patch
  * d/rules, d/testlist.py, d/flaky-tests.txt:
    - Dynamically build list of tests to run from list of test descriptions.

 -- Frode Nordahl <email address hidden> Wed, 20 Jul 2022 11:42:49 +0200

Changed in ovn (Ubuntu Focal):
status: Fix Committed → Fix Released
James Page (james-page)
Changed in cloud-archive:
status: Fix Released → Fix Committed
James Page (james-page) wrote :

This bug was fixed in the package ovn - 23.03.0-1~cloud0
---------------

 ovn (23.03.0-1~cloud0) jammy-antelope; urgency=medium
 .
   * New upstream release for the Ubuntu Cloud Archive.
 .
 ovn (23.03.0-1) unstable; urgency=medium
 .
   * Team upload.
   * Update upstream source from tag 'upstream/23.03.0'
 .
 ovn (23.03.0~git20230221.038cfb1-1) unstable; urgency=medium
 .
   * Team upload.
 .
   [ Frode Nordahl ]
   * Update upstream source from tag 'upstream/23.03.0_git20230221.038cfb1'
   * d/gbp.conf: Set snapshot branch to branch-23.03
   * d/p/a810bd80f572eefb2c14096a1413a45d7673314d.patch: Drop, included in
     snapshot.
   * d/skip-tests.txt: Extend list of flaky tests (LP: #2007923).
   * d/control: Drop dependencies on lsb-base, deprecated.
   * d/control: Bump Standards-Version to 4.6.2, no changes.
   * d/control: Bump openvswitch-source build requirement.
 .
   [ Luca Boccassi ]
   * Set Rules-Requires-Root
 .
 ovn (22.12.0-4) unstable; urgency=medium
 .
   * Extend list of tests to skip (LP: #2002475, LP: #2002476, LP:
     #2002477).
 .
 ovn (22.12.0-3) unstable; urgency=medium
 .
   * d/ovn-host.preinst: Do not clear runtime state on package upgrade
     (LP: #1940043).
   * d/control: Bump version requirement for bash (LP: #1997093).
   * d/skip-tests.txt, d/flaky-tests-amd64.txt: Re-enable "ovn-controller
     incremental processing" test (LP: #1997093).
   * d/rules: Re-enable test suite.
   * d/p/a810bd80f572eefb2c14096a1413a45d7673314d.patch: Fix OVN VIF Python
     3.11 build failure.
   * d/skip-tests.txt: Skip Check NB-SB mirrors sync (LP: #2002406).
 .
 ovn (22.12.0-2) unstable; urgency=medium
 .
   * Team upload.
 .
   [ Luca Boccassi ]
   * Upload to unstable.
 .
 ovn (22.12.0-1) experimental; urgency=medium
 .
   * Team upload.
 .
   [ Luca Boccassi ]
   * Update upstream source from tag 'upstream/22.12.0'
   * Bump dependency on openvswitch-source
   * Temporarily disable tests

Changed in cloud-archive:
status: Fix Committed → Fix Released
Michele Palazzi (m1kcloud) wrote :

Would it be possible to port this fix to cloud:focal-wallaby in order to allow OpenStack upgrades without downtime?

Currently the repo has 20.12.0-0ubuntu3.1~cloud0, and the only way I found to work around the issue involves doing nasty things which I would like to avoid in production environments.

Edward Hope-Morley (hopem) wrote :

@m1kcloud for versions of OpenStack older than Yoga, you have the option to use the focal-22.03 OVN backport archive, which provides the Yoga version of OVN (22.03.x) for all releases of OpenStack from Ussuri to Xena on Focal (using Ubuntu Cloud Archive). See https://docs.openstack.org/charm-guide/latest/project/procedures/ovn-upgrade-2203.html for more information.
