Loss of network connectivity after upgrading dpdk packages from 17.11.10-0ubuntu0.1 to 17.11.10-0ubuntu0.2 on bionic
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
dpdk (Ubuntu) | Expired | Critical | Unassigned |
openvswitch (Ubuntu) | Invalid | Undecided | Unassigned |
Bug Description
We upgraded the following packages on a number of hosts running on bionic-queens:
* dpdk packages (dpdk and librte_*) from 17.11.10-0ubuntu0.1 to 17.11.10-0ubuntu0.2
* openvswitch-switch and openvswitch-
It was just a plain `apt dist-upgrade` which upgraded a number of other packages - I can provide a full list of upgraded packages if needed.
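For reference, the list of upgraded packages can usually be recovered from apt's own history log. The following is a minimal sketch, assuming the standard `/var/log/apt/history.log` format in which each run records a line of the form `Upgrade: pkg:arch (old, new), ...`; the function name `parse_upgrades` is hypothetical and was not part of the original investigation.

```python
import re

# One "pkg:arch (old_version, new_version)" item inside an "Upgrade:" line.
UPGRADE_ITEM_RE = re.compile(
    r"(?P<pkg>[^\s:,()]+)(?::(?P<arch>[^\s(),]+))? "
    r"\((?P<old>[^,()]+), (?P<new>[^,()]+)\)"
)


def parse_upgrades(history_text):
    """Return (package, old_version, new_version) tuples from apt history text."""
    upgrades = []
    for line in history_text.splitlines():
        if not line.startswith("Upgrade: "):
            continue  # ignore Start-Date, Commandline, Install, etc.
        for m in UPGRADE_ITEM_RE.finditer(line[len("Upgrade: "):]):
            upgrades.append((m["pkg"], m["old"].strip(), m["new"].strip()))
    return upgrades


if __name__ == "__main__":
    # Path is the usual default on Ubuntu; adjust if logs were rotated.
    with open("/var/log/apt/history.log") as fh:
        for pkg, old, new in parse_upgrades(fh.read()):
            print(f"{pkg}: {old} -> {new}")
```

Filtering the result for names starting with `dpdk` or `librte` would narrow the list to the packages discussed in this report.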
This resulted in a complete dataplane outage on a production cloud.
Symptoms:
1. Loss of network connectivity on virtual machines using dpdkvhostuser ports.
VMs were unable to send any packets. Using `virsh console` we observed the following line printed a few times per second:
net eth0: unexpected txq (0) queue failure: -5
At the same time we also observed the following messages in OVS logs:
Apr 06 13:45:27 brtlvmrs0613co ovs-vswitchd[
Apr 06 13:45:27 brtlvmrs0613co ovs-vswitchd[
The rx_* counters on the vhu port in OVS (rx on the OVS side, i.e. tx from the VM's point of view) were not increasing.
2. Segmentation faults in ovs/dpdk libraries.
On upgraded hosts, OVS ran fine for a while after each restart but then crashed, typically within 5 to 60 minutes.
OVS itself logged nothing about the crashes; the only trace was a single dmesg line per crash, for example:
[22985566.641329] ovs-vswitchd[
[22996115.925645] ovs-vswitchd[
Or on another host:
[22994791.103748] ovs-vswitchd[
[22995667.342714] ovs-vswitchd[
[22996548.675879] ovs-vswitchd[
[23002220.725328] pmd6[33609]: segfault at 2 ip 00007f0cfa00700e sp 00007f0ce7b5be80 error 4 in librte_
[23004983.523060] pmd7[79951]: segfault at e5c ip 00007fdd500807de sp 00007fdd41101c80 error 4 in librte_
[23005350.737746] pmd6[17073]: segfault at 2 ip 00007fe9718df00e sp 00007fe9635ffe80 error 4 in librte_
[ 639.857893] ovs-vswitchd[4106]: segfault at 0 ip 00007f8e3227d7a5 sp 00007f8e14eb7660 error 6 in librte_
[ 2208.666437] pmd6[11788]: segfault at 2 ip 00007ff2e941100e sp 00007ff2db131e80 error 4 in librte_
[ 2966.124634] pmd6[48678]: segfault at 2 ip 00007feed53bd00e sp 00007feec70dde80 error 4 in librte_
[ 4411.636755] pmd6[28285]: segfault at 2 ip 00007fcae075d00e sp 00007fcad247de80 error 4 in librte_
A full restart of OVS and the neutron agents, with no VMs touched afterwards, more or less "stabilized" this, but on one machine we still saw librte_vhost.so segfaults. Even without segfaults, VMs kept logging "net eth0: unexpected txq (0) queue failure: -5" and had no working connectivity.
The issue was also easy to trigger by attempting a live migration of a VM that was using a vhu port, although OVS also crashed randomly on its own.
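The dmesg lines quoted above can be triaged with a short script. This is a minimal sketch, not something used in the original investigation: it parses complete segfault lines, skips truncated ones, and decodes the low three bits of the x86 page-fault error code (bit 0: protection violation vs. page not present, bit 1: write vs. read, bit 2: user vs. kernel mode). The function names are hypothetical.

```python
import re

# Matches kernel lines such as:
#   pmd6[33609]: segfault at 2 ip 00007f0cfa00700e sp 00007f0ce7b5be80 error 4 in librte_
SEGFAULT_RE = re.compile(
    r"(?P<comm>[^\s\[\]]+)\[(?P<pid>\d+)\]: segfault at (?P<addr>[0-9a-f]+) "
    r"ip (?P<ip>[0-9a-f]+) sp (?P<sp>[0-9a-f]+) error (?P<err>\d+)"
    r"(?: in (?P<obj>\S+))?"
)


def decode_error(code):
    """Decode the low three bits of an x86 page-fault error code."""
    cause = "protection violation" if code & 1 else "page not present"
    access = "write" if code & 2 else "read"
    mode = "user" if code & 4 else "kernel"
    return f"{mode}-mode {access}, {cause}"


def parse_segfaults(text):
    """Return one dict per complete segfault line found in dmesg text."""
    events = []
    for line in text.splitlines():
        m = SEGFAULT_RE.search(line)
        if m is None:
            continue  # truncated or unrelated line
        events.append({
            "comm": m["comm"],
            "pid": int(m["pid"]),
            "addr": int(m["addr"], 16),
            # Low 12 bits of ip: stable across ASLR for page-aligned mappings.
            "ip_page_offset": int(m["ip"], 16) & 0xFFF,
            "error": decode_error(int(m["err"])),
            "object": m["obj"],
        })
    return events
```

Applied to the lines quoted above, `error 4` decodes as a user-mode read of an unmapped page and `error 6` as a user-mode write to an unmapped page. The pmd6 faults at address 2 all share the same page offset (0x00e) in `ip`, which is consistent with the same instruction faulting each time, since shared libraries are mapped at page-aligned addresses.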
Failed attempts to restore the dataplane included:
1. Restart of ovs and neutron agents.
2. Restart of ovs and neutron agents, restart of libvirtd, nova-compute and hard reboot of VMs.
3. Reboot of the hosts.
4. Rollback of ovs packages to 2.9.5 without rolling back dpdk/librte_* packages.
Solution:
After analyzing the diff between dpdk 17.11.10-0ubuntu0.1 and 17.11.10-0ubuntu0.2 packages [0] we decided to perform a rollback by manually reinstalling 17.11.10-0ubuntu0.1 versions of dpdk/librte_* debs (63 packages in total). Full list of rolled back packages: [1]
Please note that we also re-installed the latest available OVS (both openvswitch-switch and openvswitch-
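A rollback across this many packages is easier to review when the version pins are generated rather than typed by hand. The sketch below is an illustration only, not the procedure we actually ran: it assumes the package list comes from `dpkg-query -W -f='${Package} ${Version}\n'` and that the older version can still be installed from a configured repository (otherwise the downloaded .deb files would have to be installed directly). The helper names `downgrade_pins` and `apt_command` are hypothetical.

```python
def downgrade_pins(dpkg_listing, bad_version, good_version):
    """Return sorted 'package=good_version' pins for packages at bad_version.

    dpkg_listing is text with one 'package version' pair per line.
    """
    pins = []
    for line in dpkg_listing.splitlines():
        fields = line.split()
        if len(fields) != 2:
            continue  # skip blank or malformed lines
        package, version = fields
        if version == bad_version:
            pins.append(f"{package}={good_version}")
    return sorted(pins)


def apt_command(pins):
    """Build (but do not run) an apt-get invocation that permits downgrades."""
    return ["apt-get", "install", "--allow-downgrades", *pins]
```

Printing the result of `apt_command(downgrade_pins(listing, "17.11.10-0ubuntu0.2", "17.11.10-0ubuntu0.1"))` gives a command that can be reviewed before anything is executed; the number of pins should match the number of packages expected to be rolled back.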
Actions taken after the downgrade:
1. Stopped all VMs.
2. Restarted OVS.
3. Restarted neutron agents.
4. Started all VMs.
Rollback of 63 dpdk/librte_* packages and service restarts were the only actions that we needed to restore the connectivity on all machines. Error messages disappeared from VMs' console log (no more "net eth0: unexpected txq (0) queue failure: -5"). OVS started to report rx_* counters rising on vhu ports. Segmentation faults from ovs and pmd have stopped as well.
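The counter check described above can be scripted by comparing two snapshots of a port's statistics. This is a minimal sketch under assumptions: the input text is taken to be the map printed by `ovs-vsctl get Interface <port> statistics` (of the form `{collisions=0, rx_bytes=..., ...}`), `rx_packets` and `rx_bytes` are used as the representative rx_* counters, and both function names are hypothetical.

```python
import re


def parse_statistics(ovs_map_text):
    """Parse an OVS '{key=value, ...}' statistics map into a dict of ints."""
    # Keys may or may not be wrapped in double quotes.
    pairs = re.findall(r'"?(\w+)"?=(-?\d+)', ovs_map_text)
    return {key: int(value) for key, value in pairs}


def rx_advanced(before, after, keys=("rx_packets", "rx_bytes")):
    """Return True if every listed counter increased between the snapshots."""
    return all(after.get(k, 0) > before.get(k, 0) for k in keys)
```

Taking one snapshot, waiting a few seconds while the VM sends traffic, and taking a second snapshot is enough: `rx_advanced` returning False while the VM is transmitting matches the failure described in this report, and True matches the state after the rollback.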
[0] http://
[1] https:/
Subscribed field-high since a simple `apt dist-upgrade` results in a complete dataplane outage and requires a package rollback and VM reboots to restore connectivity. It probably affects both bionic-queens and bionic-rocky, since rocky seems to use the same dpdk version.