Loss of network connectivity after upgrading dpdk packages from 17.11.10-0ubuntu0.1 to 17.11.10-0ubuntu0.2 on bionic

Bug #2015533 reported by Przemyslaw Lal
This bug affects 2 people
Affects               Status   Importance  Assigned to  Milestone
dpdk (Ubuntu)         Expired  Critical    Unassigned
openvswitch (Ubuntu)  Invalid  Undecided   Unassigned

Bug Description

We upgraded the following packages on a number of hosts running on bionic-queens:
* dpdk packages (dpdk and librte_*) from 17.11.10-0ubuntu0.1 to 17.11.10-0ubuntu0.2
* openvswitch-switch and openvswitch-switch-dpdk from 2.9.5-0ubuntu0.18.04.1 to 2.9.8-0ubuntu0.18.04.4

It was just a plain `apt dist-upgrade` which upgraded a number of other packages - I can provide a full list of upgraded packages if needed.

This resulted in a complete dataplane outage on a production cloud.

Symptoms:

1. Loss of network connectivity on virtual machines using dpdkvhostuser ports.

VMs were unable to send any packets. Using `virsh console` we observed the following line printed a few times per second:

net eth0: unexpected txq (0) queue failure: -5

At the same time we also observed the following messages in OVS logs:

Apr 06 13:45:27 brtlvmrs0613co ovs-vswitchd[45321]: ovs|00727|dpdk|ERR|VHOST_CONFIG: recvmsg failed
Apr 06 13:45:27 brtlvmrs0613co ovs-vswitchd[45321]: ovs|00732|dpdk|ERR|VHOST_CONFIG: recvmsg failed

rx_* counters on the vhu port in OVS (rx from ovs = tx from VM's point of view) were not increasing.

2. Segmentation faults in ovs/dpdk libraries.

This was another symptom. After restarting OVS it would run fine for a while, but on upgraded hosts it would crash again after approx. 5-60 minutes.
OVS itself produced no logs showing the crash; the only output was always a single line in dmesg. Examples:

[22985566.641329] ovs-vswitchd[55077]: segfault at 0 ip 00007f3b570ad7a5 sp 00007f3b41b59660 error 6 in librte_eal.so.17.11[7f3b57094000+26000]
[22996115.925645] ovs-vswitchd[10442]: segfault at 0 ip 00007fd4065617a5 sp 00007fd3f0eb7660 error 6 in librte_eal.so.17.11[7fd406548000+26000]

Or on another host:
[22994791.103748] ovs-vswitchd[41066]: segfault at 0 ip 00007ff937ba27a5 sp 00007ff922ffc660 error 6 in librte_eal.so.17.11[7ff937b89000+26000]
[22995667.342714] ovs-vswitchd[56761]: segfault at 0 ip 00007feb1fe10740 sp 00007feb0ab5b530 error 6 in librte_eal.so.17.11[7feb1fdf7000+26000]
[22996548.675879] ovs-vswitchd[30376]: segfault at 0 ip 00007f077a11d7a5 sp 00007f0768eb4660 error 6 in librte_eal.so.17.11[7f077a104000+26000]
[23002220.725328] pmd6[33609]: segfault at 2 ip 00007f0cfa00700e sp 00007f0ce7b5be80 error 4 in librte_vhost.so.17.11[7f0cf9ff9000+14000]
[23004983.523060] pmd7[79951]: segfault at e5c ip 00007fdd500807de sp 00007fdd41101c80 error 4 in librte_vhost.so.17.11[7fdd50075000+14000]
[23005350.737746] pmd6[17073]: segfault at 2 ip 00007fe9718df00e sp 00007fe9635ffe80 error 4 in librte_vhost.so.17.11[7fe9718d1000+14000]
[ 639.857893] ovs-vswitchd[4106]: segfault at 0 ip 00007f8e3227d7a5 sp 00007f8e14eb7660 error 6 in librte_eal.so.17.11[7f8e32264000+26000]
[ 2208.666437] pmd6[11788]: segfault at 2 ip 00007ff2e941100e sp 00007ff2db131e80 error 4 in librte_vhost.so.17.11[7ff2e9403000+14000]
[ 2966.124634] pmd6[48678]: segfault at 2 ip 00007feed53bd00e sp 00007feec70dde80 error 4 in librte_vhost.so.17.11[7feed53af000+14000]
[ 4411.636755] pmd6[28285]: segfault at 2 ip 00007fcae075d00e sp 00007fcad247de80 error 4 in librte_vhost.so.17.11[7fcae074f000+14000]

This was sort of "stabilized" by a full restart of OVS and the neutron agents while not touching any VMs, but on one machine we still saw librte_vhost.so segfaults. Even without segfaults we still faced the "net eth0: unexpected txq (0) queue failure: -5" issue and did not have working connectivity.

The issue was also easy to trigger by attempting a live migration of a VM that was using a vhu port, although it also crashed randomly on its own.

Failed attempts to restore the dataplane included:
1. Restart of ovs and neutron agents.
2. Restart of ovs and neutron agents, restart of libvirtd, nova-compute and hard reboot of VMs.
3. Reboot of the hosts.
4. Rollback of ovs packages to 2.9.5 without rolling back dpdk/librte_* packages.

Solution:

After analyzing the diff between dpdk 17.11.10-0ubuntu0.1 and 17.11.10-0ubuntu0.2 packages [0] we decided to perform a rollback by manually reinstalling 17.11.10-0ubuntu0.1 versions of dpdk/librte_* debs (63 packages in total). Full list of rolled back packages: [1]
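The bulk downgrade can be sketched roughly as follows (a dry-run sketch, not the exact commands used; the `echo` makes the script print the apt command instead of running it, and the generated package list should be checked against the actual rollback list [1]):

```shell
# Sketch: build and print (dry-run) the apt command that would pin every
# installed dpdk/librte_* package back to 17.11.10-0ubuntu0.1.
# Remove the `echo` only after reviewing the generated command.
OLD=17.11.10-0ubuntu0.1
dpkg-query -W -f '${Package}\n' 'dpdk*' 'librte*' 2>/dev/null \
  | sed "s/\$/=${OLD}/" \
  | xargs -r echo sudo apt-get install -y --allow-downgrades
```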

Please note that we also re-installed the latest available OVS (both openvswitch-switch and openvswitch-switch-dpdk) version before rolling back dpdk: 2.9.8-0ubuntu0.18.04.4.

Actions taken after the downgrade:
1. Stopped all VMs.
2. Restarted OVS.
3. Restarted neutron agents.
4. Started all VMs.
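The steps above can be sketched as a runbook (unit names such as neutron-openvswitch-agent are assumptions for a typical bionic OpenStack compute node; the script only prints the steps, it does not execute them):

```shell
# Print the recovery runbook; nothing here executes the commands themselves.
runbook=$(cat <<'EOF'
# 1. Stop all VMs on the hypervisor
virsh list --name | xargs -r -n1 virsh shutdown
# 2. Restart Open vSwitch
systemctl restart openvswitch-switch
# 3. Restart the neutron agent
systemctl restart neutron-openvswitch-agent
# 4. Start the VMs again
virsh start <domain>
EOF
)
printf '%s\n' "$runbook"
```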

Rollback of the 63 dpdk/librte_* packages plus the service restarts were the only actions needed to restore connectivity on all machines. The error messages disappeared from the VMs' console logs (no more "net eth0: unexpected txq (0) queue failure: -5"), OVS started to report rx_* counters rising on vhu ports, and the segmentation faults from ovs-vswitchd and the pmd threads stopped as well.

[0] http://launchpadlibrarian.net/623207263/dpdk_17.11.10-0ubuntu0.1_17.11.10-0ubuntu0.2.diff.gz
[1] https://pastebin.ubuntu.com/p/Fx9dpQZwqM/

description: updated
Revision history for this message
Przemyslaw Lal (przemeklal) wrote :

Subscribed field-high since a simple `apt dist-upgrade` results in a complete dataplane outage and requires rolling back the package upgrade and rebooting VMs to restore connectivity. It probably affects both bionic-queens and bionic-rocky, since rocky seems to use the same dpdk version.

Revision history for this message
Launchpad Janitor (janitor) wrote :

Status changed to 'Confirmed' because the bug affects multiple users.

Changed in dpdk (Ubuntu):
status: New → Confirmed
Changed in openvswitch (Ubuntu):
status: New → Confirmed
Revision history for this message
Billy Olsen (billy-olsen) wrote :

Removing field-high subscription for process reasons. However, bug is critical and marking as such.

Changed in dpdk (Ubuntu):
importance: Undecided → Critical
Revision history for this message
Christian Ehrhardt  (paelzer) wrote :

Hi,
this was falling through the cracks for quite some time.
I didn't remember touching DPDK in ages and indeed this [1] was a security update.
I marked it regression-updates and subscribed Leonidas who did that upload.
Maybe he has more context on what/why might fail.

Did you manage to break down the crashes to code touched by that update?
Which was:
    - debian/patches/CVE-2022-2132-*.patch: discard too small descriptor
      chains and fix header spanned across more than two descriptors, use
      buffer vectors in dequeue path in lib/librte_vhost/vhost.h,
      lib/librte_vhost/virtio_net.c. (LP: #1975764)
    - CVE-2022-2132

[1]: https://launchpad.net/ubuntu/+source/dpdk/17.11.10-0ubuntu0.2

tags: added: regression-update
Revision history for this message
Mauricio Faria de Oliveira (mfo) wrote :

The traffic interruption in dpdkvhostuserclient ports is probably
due to an issue in the package upgrade of `openvswitch-switch-dpdk`,
associated with the later OVS restart by `openvswitch-switch`.

1) When `openvswitch-switch-dpdk` is upgraded, the old version
is affected by bug 1836713 [1], and resets OVS back to non-DPDK:

@ sosreport-brtlvmrs0613co-00358062-2023-04-08-ltxuscc:var/log/apt/term.log

 Preparing to unpack .../190-openvswitch-switch-dpdk_2.9.8-0ubuntu0.18.04.4_amd64.deb ...
 update-alternatives: removing manually selected alternative - switching ovs-vswitchd to auto mode
 update-alternatives: using /usr/lib/openvswitch-switch/ovs-vswitchd to provide /usr/sbin/ovs-vswitchd (ovs-vswitchd) in auto mode
 Unpacking openvswitch-switch-dpdk (2.9.8-0ubuntu0.18.04.4) over (2.9.5-0ubuntu0.18.04.1) ...

2) When `openvswitch-switch` is upgraded, it restarts OVS:

@ openvswitch-switch.postinst:

 # summary of how this script can be called:
 # * <postinst> `configure' <most-recently-configured-version>
 ...
 if [ "$1" = "configure" ] || ... ; then
          ...
          if [ -n "$2" ]; then
                  _dh_action=restart
   ...
          invoke-rc.d openvswitch-switch $_dh_action || exit 1
   ...
 fi

 Apr 06 07:18:14 brtlvmrs0613co ovs-ctl[49562]: * Exiting ovs-vswitchd (3657)
 ...
 Apr 06 07:18:15 brtlvmrs0613co ovs-vswitchd[49757]: ovs|00007|dpdk|ERR|DPDK not supported in this copy of Open ...
 ...
 Apr 06 07:18:16 brtlvmrs0613co ovs-ctl[49717]: * Starting ovs-vswitchd

3) When OVS (non-DPDK) restarts, the dpdkvhostuserclient ports
cannot be added back:

 2023-04-06T07:18:15.683Z|00029|netdev|WARN|could not create netdev vhu698a70de-9a of unknown type dpdkvhostuserclient
 ...
 2023-04-06T07:18:15.683Z|00031|netdev|WARN|could not create netdev vhu9471d4d7-5b of unknown type dpdkvhostuserclient
 ...
 2023-04-06T07:18:15.683Z|00033|netdev|WARN|could not create netdev vhu6caa02dd-b2 of unknown type dpdkvhostuserclient

4) Now the VMs have vhost-user ports in non-functional state,
waiting for OVS DPDK (which is not running) to start ports w/
vhost-user client to connect to the vhost-user server in QEMU.

Revision history for this message
Mauricio Faria de Oliveira (mfo) wrote :

I could reproduce the traffic interruption in a test environment
with the commands executed by the packaging scripts:

$ sudo update-alternatives --remove ovs-vswitchd /usr/lib/openvswitch-switch-dpdk/ovs-vswitchd-dpdk
$ sudo update-alternatives --install /usr/sbin/ovs-vswitchd ovs-vswitchd /usr/lib/openvswitch-switch-dpdk/ovs-vswitchd-dpdk 50
$ sudo invoke-rc.d openvswitch-switch restart

The VMs stopped pinging each other.
(There were no errors in the VMs' consoles, but this may simply
differ from the reported symptoms due to virtio-net driver
differences in their VMs.)
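A quick way to check which binary the alternatives system points at, and whether the running OVS reports DPDK (a hedged sketch; the `dpdk_initialized` column of the Open_vSwitch table is available on the 2.9.x series used here):

```shell
# Report the current ovs-vswitchd alternative and the DPDK init state.
alt=$(readlink -f /etc/alternatives/ovs-vswitchd 2>/dev/null || echo "no alternative configured")
echo "ovs-vswitchd alternative -> $alt"
# Falls back gracefully on machines without OVS installed.
ovs-vsctl get Open_vSwitch . dpdk_initialized 2>/dev/null \
  || echo "ovs-vsctl not available here"
```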

The fix-up probably happened when Bootstack re-installed
the latest ovs dpdk package, or along the way, when the
update-alternatives setting was fixed back to ovs-dpdk and
OVS was restarted.
[As per: "Please note that we also re-installed the latest
available OVS (both openvswitch-switch and openvswitch-switch-dpdk)
version before rolling back dpdk: 2.9.8-0ubuntu0.18.04.4."]

Revision history for this message
Mauricio Faria de Oliveira (mfo) wrote :

For OVS package upgrades from versions earlier than
`2.9.8-0ubuntu0.18.04.1` (this upgrade started from
`2.9.5-0ubuntu0.18.04.1`; note .5 vs .8),
please manually remove the update-alternatives call in the prerm script,
as documented in bug 1836713.

[This broke other DPDK clouds back then; see comment #3.]
Per comment #4 and the description:

$ sudo sed -i "/update-alternatives/d" /var/lib/dpkg/info/openvswitch-switch-dpdk.prerm

and then upgrade openvswitch-switch-dpdk (via `apt upgrade` or `apt dist-upgrade`).
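Putting the workaround together defensively (a sketch, under the assumption stated above that the fix landed in `2.9.8-0ubuntu0.18.04.1`; it only patches the prerm when the installed version is older):

```shell
#!/bin/sh
# Apply the bug 1836713 prerm workaround only when needed.
set -eu

PRERM=/var/lib/dpkg/info/openvswitch-switch-dpdk.prerm
FIXED=2.9.8-0ubuntu0.18.04.1

# Installed version, or empty if the package (or dpkg) is absent.
cur=$(dpkg-query -W -f '${Version}' openvswitch-switch-dpdk 2>/dev/null || true)

if [ -n "$cur" ] && dpkg --compare-versions "$cur" lt "$FIXED"; then
    # Drop the update-alternatives call so the upgrade does not switch
    # ovs-vswitchd back to the non-DPDK binary.
    sed -i "/update-alternatives/d" "$PRERM"
    echo "patched $PRERM (was $cur)"
else
    echo "no workaround needed (installed: ${cur:-none})"
fi
```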

Revision history for this message
Mauricio Faria de Oliveira (mfo) wrote :

Now, I still haven't looked at the segfaults, but I initially
couldn't reproduce them with just the dpdk pkg version change.

(And they didn't always happen / correlate with the traffic
interruption, so it does seem to be another problem.)

So "just" a library version incompatibility/interface breakage
between the different ends (QEMU/OVS/DPDK/RTE) doesn't seem to be
the issue (this is reinforced by the fact that there have been no
related bugs/fixes to the DPDK packages in Bionic for a long time
now; so it seems to be a corner case).

Thus it _might_ be some weird state left in the VMs virtio-net
driver, that eventually got to talk back to DPDK vhost ports,
and provided it wrong pointers... or some unexpected state
as part of restarts of different components.

This needs more assessment and information to determine next steps.

For starters, it'd be nice to know what kernel version is
running in the VMs, to check the virtio-net driver level and the
features negotiated with the hypervisor's vhost side.
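For reference, that information can be gathered inside a guest like so (a sketch; `/sys/bus/virtio/devices/*/features` is the standard sysfs attribute exposing the negotiated virtio feature bits, and the loop prints nothing on non-virtio machines):

```shell
# Report guest kernel version and negotiated virtio feature bits.
kver=$(uname -r)
echo "guest kernel: $kver"
for d in /sys/bus/virtio/devices/virtio*; do
    [ -e "$d/features" ] || continue
    echo "$d features=$(cat "$d/features")"
done
```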

Marking dpdk as Incomplete.

Changed in openvswitch (Ubuntu):
status: Confirmed → Invalid
Changed in dpdk (Ubuntu):
status: Confirmed → Incomplete
Revision history for this message
Przemyslaw Lal (przemeklal) wrote :

We encountered the issue with OVS using the wrong (non-DPDK) binary, but applied the update-alternatives workaround as soon as we noticed it. We then restarted OVS, and the no-traffic issue was still there despite OVS being able to initialize DPDK ports.

We also tried rebooting two hosts; after the reboots OVS was reporting DPDK support, but the no-traffic issue and the segfaults were still there, since the DPDK 17.11.10-0ubuntu0.2 packages were still present.

We were also able to reproduce the issue on the next day by running:
0. Starting point: the rollback of the DPDK upgrade (downgrade from 17.11.10-0ubuntu0.2 to 17.11.10-0ubuntu0.1) done one day earlier to recover from the outage.
1. OVS-DPDK 2.9.8-0ubuntu0.18.04.1 + DPDK 17.11.10-0ubuntu0.1 running with DPDK support enabled in OVS. Network connectivity is confirmed to work.
2. Upgrade DPDK packages only to 17.11.10-0ubuntu0.2 on one host. Loss of network connectivity, despite DPDK support still being present in OVS (`ovs-vsctl list open_vswitch` showed dpdk support and `ovs-vsctl show` looked good).
3. Downgrade only DPDK packages to 17.11.10-0ubuntu0.1. Connectivity restored.

The kernel version in one of the VMs was 4.15.0-121-generic (ubuntu) but we also saw the issue with non-Ubuntu VMs.

This environment uses jumbo frames on physical and vhostuser ports; is it possible that patch [0] changed something in how such packets are handled? Especially this part of the commit message:
> This patch also has the advantage of requesting the exact packets sizes for the mbufs.

[0] http://launchpadlibrarian.net/623207263/dpdk_17.11.10-0ubuntu0.1_17.11.10-0ubuntu0.2.diff.gz

Revision history for this message
Launchpad Janitor (janitor) wrote :

[Expired for dpdk (Ubuntu) because there has been no activity for 60 days.]

Changed in dpdk (Ubuntu):
status: Incomplete → Expired