Upgrade from Queens to Rocky results in dead ovs-vswitchd services

Bug #1923668 reported by Michael Skalka
This bug affects 1 person
Affects                               Status         Importance  Assigned to        Milestone
OpenStack Neutron Open vSwitch Charm  Invalid        Undecided   Unassigned
Ubuntu Cloud Archive                  Invalid        Undecided   Unassigned
  Rocky                               Fix Committed  High        Chris MacNaughton
openvswitch (Ubuntu)                  Fix Released   Undecided   Unassigned
  Focal                               Fix Released   Undecided   Unassigned

Bug Description

While upgrading a cloud from Queens to Rocky, I attempted to flush a hypervisor using live-migrate, to avoid service disruption on the final unit of nova-compute. The action queues up in the dashboard, but it completes with the instance remaining on the same host. The nova-compute logs for that instance suggest that the target host could not create the tap:

/var/log/nova/nova-compute.log:

2021-04-13 21:12:50.464 1286276 WARNING nova.compute.resource_tracker [req-b1cea8db-be1e-4252-9e31-c78d097ad671 - - - - -] [instance: e341e106-5bec-4048-a76e-03ef0c70441c] Instance not resizing, skipping migration.
2021-04-13 21:12:50.658 1286276 INFO nova.compute.resource_tracker [req-b1cea8db-be1e-4252-9e31-c78d097ad671 - - - - -] Final resource view: name=flagler.playground.solutionsqa phys_ram=32123MB used_ram=18432MB phys_disk=361GB used_disk=20GB total_vcpus=12 used_vcpus=1 pci_stats=[]
2021-04-13 21:13:02.025 1286276 ERROR nova.virt.libvirt.driver [req-06db27eb-b304-4969-b1e2-cbd0d80094ca d966ea789bfe431fb5863da1e72d6e49 80545c41a5db45d98d6adf7083c4914b - 9580fece017f4adf9b4ff1aa2bf836c8 9580fece017f4adf9b4ff1aa2bf836c8] [instance: e341e106-5bec-4048-a76e-03ef0c70441c] Live Migration failure: internal error: Unable to add port tap9c8d13c9-8a to OVS bridge br-int: libvirtError: internal error: Unable to add port tap9c8d13c9-8a to OVS bridge br-int
2021-04-13 21:13:02.187 1286276 ERROR nova.virt.libvirt.driver [req-06db27eb-b304-4969-b1e2-cbd0d80094ca d966ea789bfe431fb5863da1e72d6e49 80545c41a5db45d98d6adf7083c4914b - 9580fece017f4adf9b4ff1aa2bf836c8 9580fece017f4adf9b4ff1aa2bf836c8] [instance: e341e106-5bec-4048-a76e-03ef0c70441c] Migration operation has aborted
2021-04-13 21:13:02.364 1286276 INFO nova.compute.manager [req-06db27eb-b304-4969-b1e2-cbd0d80094ca d966ea789bfe431fb5863da1e72d6e49 80545c41a5db45d98d6adf7083c4914b - 9580fece017f4adf9b4ff1aa2bf836c8 9580fece017f4adf9b4ff1aa2bf836c8] [instance: e341e106-5bec-4048-a76e-03ef0c70441c] Swapping old allocation on 5a94928b-fb98-401f-bdd9-aa2f9f08602c held by migration 44727a6b-3417-4df3-9ca9-5b52e2e0f487 for instance
2021-04-13 21:13:04.381 1286276 WARNING nova.compute.manager [req-2f77835b-38ab-45b9-8acd-38a98ff3fcfc 6cad752c2b9744d6aac17fb26522004c d1aed1922a5a4a7899cae3e3afb6bc90 - c1a08b45ef134260be7501e96bc9ee3d c1a08b45ef134260be7501e96bc9ee3d] [instance: e341e106-5bec-4048-a76e-03ef0c70441c] Received unexpected event network-vif-unplugged-9c8d13c9-8a96-49e0-834a-3c512f1990cb for instance with vm_state active and task_state None.
2021-04-13 21:13:05.836 1286276 WARNING nova.compute.manager [req-66d4ddc6-4ac8-4c1a-8007-582d599da366 6cad752c2b9744d6aac17fb26522004c d1aed1922a5a4a7899cae3e3afb6bc90 - c1a08b45ef134260be7501e96bc9ee3d c1a08b45ef134260be7501e96bc9ee3d] [instance: e341e106-5bec-4048-a76e-03ef0c70441c] Received unexpected event network-vif-plugged-9c8d13c9-8a96-49e0-834a-3c512f1990cb for instance with vm_state active and task_state None.

Looking at the target unit, the ovs-vswitchd service is not even running on a number of the units: https://pastebin.ubuntu.com/p/YhdTQRRGb4/

Restarting the ovs-vswitchd service on those hosts restores the ability to migrate.

In each attempt, the source of the instance was flagler and the destination was everitt, which are machines 6 and 3, respectively, in the attached crashdump.

Revision history for this message
Michael Skalka (mskalka) wrote :

crashdump

tags: added: openstack-upgrade
Michael Skalka (mskalka)
description: updated
Revision history for this message
Michael Skalka (mskalka) wrote :

To add some more complexity, the other canary instance running during this upgrade lost connectivity a few minutes before the service restart was performed: https://pastebin.ubuntu.com/p/YWWykHPTWC/

The hosts of the router serving the floating-ip network are machine 6 (the source of the instance, which had a pending upgrade) and machine 5.

Logged into the instance via the nova console and confirmed that it has zero outbound access as well.

Michael Skalka (mskalka)
summary: - Upgrade from Queens to Rocky results in dead ovs-vsswitchd services
+ Upgrade from Queens to Rocky results in dead ovs-vswitchd services
Revision history for this message
Billy Olsen (billy-olsen) wrote :

OVS crashed; here's the backtrace:

[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
Core was generated by `ovs-vswitchd unix:/var/run/openvswitch/db.sock -vconsole:emer -vsyslog:err -vfi'.
Program terminated with signal SIGABRT, Aborted.
#0 __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:51
51 ../sysdeps/unix/sysv/linux/raise.c: No such file or directory.
[Current thread is 1 (Thread 0x7fa11946de00 (LWP 1752502))]
(gdb) bt
#0 __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:51
#1 0x00007fa117adb921 in __GI_abort () at abort.c:79
#2 0x0000560f25780568 in ofputil_protocol_to_ofp_version (protocol=<optimized out>) at ../lib/ofp-protocol.c:123
#3 0x0000560f2577b5ee in ofputil_encode_port_status (ps=ps@entry=0x7ffcd9da6ce0, protocol=<optimized out>) at ../lib/ofp-port.c:938
#4 0x0000560f256eb472 in connmgr_send_port_status (mgr=0x560f260c2100, source=source@entry=0x0, pp=pp@entry=0x560f2617e8f0, reason=reason@entry=1 '\001') at ../ofproto/connmgr.c:1654
#5 0x0000560f256af784 in ofport_remove (ofport=0x560f2617e8d0) at ../ofproto/ofproto.c:2439
#6 0x0000560f256b2b0f in ofport_remove_with_name (name=0x560f260f6c00 "ha-49c4020c-85", ofproto=0x560f26137580) at ../ofproto/ofproto.c:2454
#7 update_port (ofproto=ofproto@entry=0x560f26137580, name=name@entry=0x560f260f6c00 "ha-49c4020c-85") at ../ofproto/ofproto.c:2669
#8 0x0000560f256b348b in ofproto_port_del (ofproto=0x560f26137580, ofp_port=<optimized out>) at ../ofproto/ofproto.c:2072
#9 0x0000560f256a19f0 in bridge_delete_or_reconfigure_ports (br=br@entry=0x560f260ee360) at ../vswitchd/bridge.c:884
#10 0x0000560f256a3d32 in bridge_reconfigure (ovs_cfg=ovs_cfg@entry=0x560f26117e60) at ../vswitchd/bridge.c:636
#11 0x0000560f256a6159 in bridge_run () at ../vswitchd/bridge.c:3023
#12 0x0000560f2569c6c5 in main (argc=<optimized out>, argv=<optimized out>) at ../vswitchd/ovs-vswitchd.c:125

It looks like ofputil_protocol_to_ofp_version cannot map the provided protocol to an OpenFlow version, so none of the switch cases match and execution reaches OVS_NOT_REACHED():

/* Returns the OpenFlow protocol version number (e.g. OFP10_VERSION,
 * etc.) that corresponds to 'protocol'. */
enum ofp_version
ofputil_protocol_to_ofp_version(enum ofputil_protocol protocol)
{
    switch (protocol) {
    case OFPUTIL_P_OF10_STD:
    case OFPUTIL_P_OF10_STD_TID:
    case OFPUTIL_P_OF10_NXM:
    case OFPUTIL_P_OF10_NXM_TID:
        return OFP10_VERSION;
    case OFPUTIL_P_OF11_STD:
        return OFP11_VERSION;
    case OFPUTIL_P_OF12_OXM:
        return OFP12_VERSION;
    case OFPUTIL_P_OF13_OXM:
        return OFP13_VERSION;
    case OFPUTIL_P_OF14_OXM:
        return OFP14_VERSION;
    case OFPUTIL_P_OF15_OXM:
        return OFP15_VERSION;
    case OFPUTIL_P_OF16_OXM:
        return OFP16_VERSION;
    }

    OVS_NOT_REACHED(); /* <--- CRASH point */
}

Revision history for this message
Billy Olsen (billy-olsen) wrote :

The backtrace from the coredump is in line with this mail thread: https://mail.openvswitch.org/pipermail/ovs-discuss/2018-December/047876.html, which was addressed by this commit: https://github.com/openvswitch/ovs/commit/30e699b7ec43cc70d0d20f0969a2714bfb78c7c8. However, looking at the history around that, this commit is also relevant: https://github.com/openvswitch/ovs/commit/903f6c4f8a9bce51984435ca3990f2717c63f703. In fact, the version of openvswitch in the rocky archives includes the problematic commit 476d255, which is referenced in the latter commit.

I think both patches are likely good candidates to prevent the coredump and variations. IIUC, this happens when a port event (add/del/mod) is being sent while still establishing the protocol for the remote connection.
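
To make the suspected failure mode concrete, here is a minimal, self-contained sketch. It is only a model: every demo_* name is hypothetical and none of this is the OVS API, nor is it necessarily what the referenced commits do. It assumes that a connection whose protocol has not been negotiated yet carries a value matching none of the switch cases (modelled here as 0). Where the real function aborts, the model returns 0, so a caller can skip such a connection instead of crashing.

```c
/* Illustrative model only; all demo_* names are hypothetical, not OVS code. */
#include <stdbool.h>

/* Bitmask-style protocol values; 0 models "nothing negotiated yet". */
enum demo_protocol {
    DEMO_P_NONE     = 0,
    DEMO_P_OF10_STD = 1 << 0,
    DEMO_P_OF13_OXM = 1 << 1
};

/* Returns the OpenFlow wire version for 'protocol', or 0 if no case matches.
 * The function quoted in the earlier comment aborts at that point instead. */
int
demo_protocol_to_version(enum demo_protocol protocol)
{
    switch (protocol) {
    case DEMO_P_OF10_STD:
        return 0x01;            /* OpenFlow 1.0 */
    case DEMO_P_OF13_OXM:
        return 0x04;            /* OpenFlow 1.3 */
    case DEMO_P_NONE:
        break;                  /* no protocol established yet */
    }
    return 0;
}

/* Hypothetical guard: only send a port status message to a connection
 * whose protocol maps to a real OpenFlow version. */
bool
demo_can_send_port_status(enum demo_protocol protocol)
{
    return demo_protocol_to_version(protocol) != 0;
}
```

In this model, demo_can_send_port_status(DEMO_P_NONE) is false, so a port add/del/mod event arriving during connection setup would be skipped for that connection rather than reaching the abort.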

I'm building a version of openvswitch with these patches and uploading to my ppa at https://launchpad.net/~billy-olsen/+archive/ubuntu/bionic-rocky-testppa

Revision history for this message
Billy Olsen (billy-olsen) wrote :

This is a problem with the openvswitch packages, so I'm adding Ubuntu/openvswitch and cloud-archive for bug tracking. The corresponding Ubuntu release is well out of support, and so is the cloud-archive, but users could still run into this on upgrade from Queens->Rocky.

Changed in openvswitch (Ubuntu Focal):
status: New → Fix Released
Changed in openvswitch (Ubuntu):
status: New → Fix Released
Changed in cloud-archive:
status: New → Triaged
Changed in charm-neutron-openvswitch:
status: New → Invalid
Revision history for this message
Chris MacNaughton (chris.macnaughton) wrote : Please test proposed package

Hello Michael, or anyone else affected,

Accepted openvswitch into rocky-proposed. The package will build now and be available in the Ubuntu Cloud Archive in a few hours, and then in the -proposed repository.

Please help us by testing this new package. To enable the -proposed repository:

  sudo add-apt-repository cloud-archive:rocky-proposed
  sudo apt-get update

Your feedback will aid us in getting this update out to other Ubuntu users.

If this package fixes the bug for you, please add a comment to this bug, mentioning the version of the package you tested, and change the tag from verification-rocky-needed to verification-rocky-done. If it does not fix the bug for you, please add a comment stating that, and change the tag to verification-rocky-failed. In either case, details of your testing will help us make a better decision.

Further information regarding the verification process can be found at https://wiki.ubuntu.com/QATeam/PerformingSRUVerification . Thank you in advance!

tags: added: verification-rocky-needed
Revision history for this message
Corey Bryant (corey.bryant) wrote :

This fix is included in openvswitch 2.10.0-0ubuntu3~cloud0 in rocky-proposed. If anyone can verify that the fix works, please let us know and we'll get it promoted to rocky-updates.

Changed in cloud-archive:
status: Triaged → Invalid