libvirt - vnc port selection regression with newer kernels

Bug #1722702 reported by Justin Mammarella
This bug affects 3 people
Affects            Status         Importance  Assigned to  Milestone
libvirt (Ubuntu)   Fix Released   Undecided   Unassigned
  Artful           Won't Fix      Undecided   Unassigned
linux (Ubuntu)     Fix Released   Critical    Unassigned
  Artful           Fix Released   Undecided   Unassigned

Bug Description

SRU Justification:

Impact: Bug appears in linux-image-4.11.x kernels on 16.04.3.
4.8.x and 4.10.x are okay.

When libvirtd is restarted on kernel 4.11.0-x-generic or above, it loses all information regarding existing port bindings and is unable to correctly re-identify VNC ports that are currently in use. libvirt attempts to bind to an existing port and fails.

instance-000a6096.log:2017-10-09T01:42:16.017220Z qemu-system-x86_64: -vnc 0.0.0.0:0: Failed to start VNC server: Failed to listen on socket: Address already in use
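
For illustration only, here is a minimal sketch of bind-based port probing, assuming the port allocator decides whether a VNC port is taken by trying to bind() to it. This is not libvirt's actual code; the function names, the base port constant and the defaults are hypothetical. On an affected kernel a probe of this kind can give the wrong answer, so a port that is already in use gets handed to a new guest and qemu then fails as in the log line above.

```python
import errno
import socket

# Assumption: VNC display :N conventionally maps to TCP port 5900 + N.
VNC_BASE_PORT = 5900


def port_in_use(port, host="0.0.0.0"):
    """Return True if a probe bind to (host, port) fails with EADDRINUSE."""
    probe = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        # SO_REUSEADDR keeps lingering TIME_WAIT connections from counting
        # as "in use"; a socket that is actively listening still conflicts.
        probe.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
        probe.bind((host, port))
    except OSError as exc:
        if exc.errno == errno.EADDRINUSE:
            return True
        raise
    finally:
        probe.close()
    return False


def first_free_port(base=VNC_BASE_PORT, count=100, host="0.0.0.0"):
    """Return the first port in [base, base + count) that can be bound, or None."""
    for port in range(base, base + count):
        if not port_in_use(port, host):
            return port
    return None
```

A caller would start the guest on the port returned by first_free_port(); the whole scheme relies on the kernel reporting bind conflicts correctly.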

Fairly certain this is the same bug that was reported and patched in kernel-4.13.4-200.fc26 for Fedora: https://bugzilla.redhat.com/show_bug.cgi?id=1432684

Fix: Cherry picking a set of 3 patches (merged as the same set by David Miller) from upstream Linux.

Testcase: Restart of libvirt.

Risk for Regressions: Looking at the changes those 3 patches introduce, the delta is minimal and makes sense with respect to the explanations given in the commit message. Overall risk should be low.

Justin Mammarella (jmamma) wrote :

Forgot to add libvirt version:

ii libvirt-bin 1.3.1-1ubuntu10.13 amd64 programs for the libvirt library
ii qemu-system-common 1:2.5+dfsg-5ubuntu10.14 amd64 QEMU full system emulation binaries (common files)

Christian Ehrhardt  (paelzer) wrote :

Hi,
I have seen this issue once but never found a root cause when trying to reproduce it. I ended up with one system showing the behavior you mentioned and others not, without being able to see what made the difference.

Thanks a lot for the info that in your case it is the kernel version change that breaks this. I didn't compare kernel versions when I looked at it.

In my cases I had the issue on multiple versions of libvirt, but it would be really great if you had a chance to test newer libvirt stacks, e.g. from [1]. That way you can stick to your 16.04.3 setup and switch between your kernels as before, but bump the qemu/libvirt versions.
I want to avoid starting to work on it only to end up unable to reproduce it again; this cross-check would help a lot to start with the right setup.

Setting to Confirmed as I have seen it in the past, hoping that this bug and the further tests will help to identify and eventually fix the issue's root cause.

Since the kernel version seems to matter, I'm adding a kernel task in case we end up with known issues or a fix there instead of in libvirt.

[1]: https://wiki.ubuntu.com/OpenStack/CloudArchive

Changed in libvirt (Ubuntu):
status: New → Confirmed
summary: - bind port regression
+ libvirt - vnc port selection regression with newer kernels
Ubuntu Kernel Bot (ubuntu-kernel-bot) wrote : Missing required logs.

This bug is missing log files that will aid in diagnosing the problem. While running an Ubuntu kernel (not a mainline or third-party kernel) please enter the following command in a terminal window:

apport-collect 1722702

and then change the status of the bug to 'Confirmed'.

If, due to the nature of the issue you have encountered, you are unable to run this command, please add a comment stating that fact and change the bug status to 'Confirmed'.

This change has been made by an automated script, maintained by the Ubuntu Kernel Team.

Changed in linux (Ubuntu):
status: New → Incomplete
Christian Ehrhardt  (paelzer) wrote :

Duping my bug (with the failed repro) onto this.

Daniel Berrange (berrange) wrote :

This is almost certainly the kernel bug mentioned in that Fedora BZ.

There is, however, a workaround added in libvirt now for distros/users who can't fix their kernel: https://www.redhat.com/archives/libvir-list/2017-September/msg00519.html

Christian Ehrhardt  (paelzer) wrote :

Thanks Daniel,
after discussing briefly on IRC and checking all the references together:
- https://www.redhat.com/archives/libvir-list/2017-September/msg00519.html
- https://bugzilla.redhat.com/show_bug.cgi?id=1432684
- https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=863266
- http://www.spinics.net/lists/netdev/msg454644.html

OTOH 4.13 is still iterating on Artful and should be the only >4.10 kernel available to Xenial eventually.

Since there will be no valid 4.11-4.12 kernel anywhere, and in 4.13 this might actually break any other user of SO_REUSEPORT, IMHO it should be fixed in the kernel to avoid all related regressions instead of working around it in libvirt.

Therefore I'm bumping the Linux kernel bug task.
Priority is Critical as we are rather close to the Artful release, but this is obviously open to retriage by the kernel team.

Christian Ehrhardt  (paelzer) wrote :

No apport-collect is needed; this is known upstream and applies to all kernels >= 4.11. The fix is in 4.14-rc2 but should be part of 4.13 on release to avoid regressions in more programs.
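
As a rough aid for triage, the affected range stated here (introduced in 4.11, fixed upstream in 4.14-rc2) can be checked against a kernel release string. This is only a sketch with hypothetical helper names, and it looks at the upstream major.minor version alone, so it cannot tell whether a distro kernel already carries the backported fix (as Ubuntu's 4.13.0-17.20 does, per this report).

```python
import platform
import re


def upstream_version(release):
    """Extract (major, minor) from a release string like '4.13.0-16-generic'."""
    match = re.match(r"(\d+)\.(\d+)", release)
    if match is None:
        raise ValueError("unrecognised kernel release: %r" % release)
    return int(match.group(1)), int(match.group(2))


def in_affected_upstream_range(release):
    """True if the upstream base version is >= 4.11 and older than 4.14.

    A True result only means the kernel is worth checking; distro kernels
    in this range may already include the backported fix.
    """
    return (4, 11) <= upstream_version(release) < (4, 14)


if __name__ == "__main__":
    running = platform.release()
    print(running, "in affected upstream range:", in_affected_upstream_range(running))
```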

Changed in linux (Ubuntu):
status: Incomplete → Confirmed
Changed in libvirt (Ubuntu):
status: Confirmed → Won't Fix
Christian Ehrhardt  (paelzer) wrote :

I'd ask the kernel Team to integrate the following changes into our 4.13 kernel release to avoid any sort of update-regressions related to SO_REUSEPORT [1]
- Reproducer [2] without libvirt, how to use in [3]
- Fixes are submitted as [3] - [6]
- If I tracked them correctly they went upstream without [5]+[6] but with [7] instead
- Overall the upstream change that I'd ask to pull in as a fix into any 4.13 release is in the merge [8] as part of 4.14-rc2

[1]: https://lwn.net/Articles/542629/
[2]: https://bugzilla.redhat.com/attachment.cgi?id=1314915
[3]: http://www.spinics.net/lists/netdev/msg454647.html
[4]: http://www.spinics.net/lists/netdev/msg454769.html
[5]: http://www.spinics.net/lists/netdev/msg454786.html
[6]: http://www.spinics.net/lists/netdev/msg454933.html
[7]: https://github.com/torvalds/linux/commit/fbed24bcc69d3e48c5402c371f19f5c7688871e5
[8]: https://github.com/torvalds/linux/commit/4e683f499a15cd777d3cb51aaebe48d72334c852
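
For readers unfamiliar with SO_REUSEPORT [1], the following sketch shows the bind behaviour that programs rely on: sockets that all set SO_REUSEPORT may share a port, while a socket that does not set it must get "Address already in use". This is not the reproducer from [2]; the helper names are hypothetical, and the result described is what a kernel with correct bind-conflict handling is expected to give.

```python
import errno
import socket


def bind_tcp(port, reuseport, host="127.0.0.1"):
    """Bind a TCP socket to (host, port); the caller must close it."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        if reuseport:
            sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEPORT, 1)
        sock.bind((host, port))
    except OSError:
        sock.close()
        raise
    return sock


def reuseport_semantics_hold():
    """Check that only SO_REUSEPORT sockets can share an already bound port."""
    first = bind_tcp(0, reuseport=True)  # port 0: let the kernel pick a port
    port = first.getsockname()[1]
    try:
        # A second SO_REUSEPORT socket may share the port.
        second = bind_tcp(port, reuseport=True)
        second.close()
        # A socket without the option must be refused.
        try:
            third = bind_tcp(port, reuseport=False)
        except OSError as exc:
            return exc.errno == errno.EADDRINUSE
        third.close()
        return False
    finally:
        first.close()
```

If reuseport_semantics_hold() returns False, the third bind was accepted although the port was taken, which is the kind of wrong answer that misleads a bind-based port check.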

Changed in linux (Ubuntu):
importance: Undecided → Critical
Christian Ehrhardt  (paelzer) wrote :

Pinged the kernel team (smb) to take a look, as I'd be worried about this and further consequences of the regression once Xenial HWE kernels and Artful are released.

Stefan Bader (smb) wrote :

From the comments I conclude that the following three patches from the upstream Linux tree need to be picked:

cbb2fb5 net: set tb->fast_sk_family
7a56673 net: use inet6_rcv_saddr to compare sockets
fbed24b inet: fix improper empty comparison

Patches fix: 637bc8bbe6c0 ("inet: reset tb->fastreuseport when adding a reuseport sk") which was added in 4.11.

Christian Ehrhardt  (paelzer) wrote :

@smb - Yes that is the correct summary

Stefan Bader (smb)
description: updated
Changed in linux (Ubuntu Artful):
status: New → Fix Committed
Launchpad Janitor (janitor) wrote :

Status changed to 'Confirmed' because the bug affects multiple users.

Changed in libvirt (Ubuntu Artful):
status: New → Confirmed
Kleber Sacilotto de Souza (kleber-souza) wrote :

This bug is awaiting verification that the kernel in -proposed solves the problem. Please test the kernel and update this bug with the results. If the problem is solved, change the tag 'verification-needed-artful' to 'verification-done-artful'. If the problem still exists, change the tag 'verification-needed-artful' to 'verification-failed-artful'.

If verification is not done within 5 working days from today, this fix will be dropped from the source code, and this bug will be closed.

See https://wiki.ubuntu.com/Testing/EnableProposed for documentation on how to enable and use -proposed. Thank you!

tags: added: verification-needed-artful
Christian Ehrhardt  (paelzer) wrote :

Test using the reproducer [2] from comment #8.

4.13.0-16-generic
$ gcc bind-collision.c && ./a.out
bind: Address already in use
AF_INET check failed.
$ gcc -D CHECK_IPV6 bind-collision.c && ./a.out
AF_INET6 success
AF_INET success
$ gcc bind-collision.c && ./a.out
AF_INET success

From proposed:
4.13.0-17-generic
$ gcc bind-collision.c && ./a.out
AF_INET success

=> verified

tags: added: verification-done-artful
removed: verification-needed-artful
Christian Ehrhardt  (paelzer) wrote :

Waiting for the kernel fix to be fully released, but as we won't change libvirt, I'm updating the task.

Changed in libvirt (Ubuntu Artful):
status: Confirmed → Won't Fix
Changed in libvirt (Ubuntu):
status: Won't Fix → Fix Released
Christian Ehrhardt  (paelzer) wrote :

Since you wrote on 8th Nov "in 5 days or it will be dropped" (I verified and it was good), I expected the fix to be released already.

Not super important, but can you share an ETA on this?

Launchpad Janitor (janitor) wrote :

This bug was fixed in the package linux - 4.13.0-17.20

---------------
linux (4.13.0-17.20) artful; urgency=low

  * linux: 4.13.0-17.20 -proposed tracker (LP: #1728927)

  [ Seth Forshee ]
  * thunderx2 ahci errata workaround needs additional delays (LP: #1724117)
    - SAUCE: ahci: thunderx2: stop engine fix update

  * usb 3-1: 2:1: cannot get freq at ep 0x1 (LP: #1708499)
    - ALSA: usb-audio: Add sample rate quirk for Plantronics C310/C520-M

  * Plantronics Blackwire C520-M - Cannot get freq at ep 0x1, 0x81
    (LP: #1709282)
    - ALSA: usb-audio: Add sample rate quirk for Plantronics C310/C520-M

  * TSC_DEADLINE incorrectly disabled inside virtual guests (LP: #1724912)
    - x86/apic: Silence "FW_BUG TSC_DEADLINE disabled due to Errata" on CPUs
      without the feature
    - x86/apic: Silence "FW_BUG TSC_DEADLINE disabled due to Errata" on
      hypervisors

  * x86/apic: Update TSC_DEADLINE quirk with additional SKX stepping
    (LP: #1724612)
    - x86/apic: Update TSC_DEADLINE quirk with additional SKX stepping

  * [Artful] Add support for Dell/Wyse 3040 audio codec (LP: #1723916)
    - SAUCE: ASoC: rt5670: Add support for Wyse 3040

  * [Artful] Some Dell Monitors Doesn't Work Well with Dell/Wyse 3040
    (LP: #1723915)
    - SAUCE: drm/i915: Workaround for DP DPMS D3 on Dell monitor

  * [Artful] Support headset mode for DELL WYSE (LP: #1723913)
    - SAUCE: ALSA: hda/realtek - Add support headset mode for DELL WYSE

  * Touchpad and TrackPoint Dose Not Work on Lenovo X1C6 and X280 (LP: #1723986)
    - SAUCE: Input: synaptics-rmi4 - RMI4 can also use SMBUS version 3
    - SAUCE: Input: synaptics - Lenovo X1 Carbon 5 should use SMBUS/RMI
    - SAUCE: Input: synaptics - add Intertouch support on X1 Carbon 6th and X280

  * Artful update to v4.13.8 stable release (LP: #1724669)
    - USB: dummy-hcd: Fix deadlock caused by disconnect detection
    - MIPS: math-emu: Remove pr_err() calls from fpu_emu()
    - MIPS: bpf: Fix uninitialised target compiler error
    - mei: always use domain runtime pm callbacks.
    - dmaengine: edma: Align the memcpy acnt array size with the transfer
    - dmaengine: ti-dma-crossbar: Fix possible race condition with dma_inuse
    - NFS: Fix uninitialized rpc_wait_queue
    - nfs/filelayout: fix oops when freeing filelayout segment
    - HID: usbhid: fix out-of-bounds bug
    - crypto: skcipher - Fix crash on zero-length input
    - crypto: shash - Fix zero-length shash ahash digest crash
    - KVM: MMU: always terminate page walks at level 1
    - KVM: nVMX: fix guest CR4 loading when emulating L2 to L1 exit
    - usb: renesas_usbhs: Fix DMAC sequence for receiving zero-length packet
    - pinctrl/amd: Fix build dependency on pinmux code
    - iommu/amd: Finish TLB flush in amd_iommu_unmap()
    - device property: Track owner device of device property
    - Revert "vmalloc: back off when the current task is killed"
    - fs/mpage.c: fix mpage_writepage() for pages with buffers
    - ALSA: usb-audio: Kill stray URB at exiting
    - ALSA: seq: Fix use-after-free at creating a port
    - ALSA: seq: Fix copy_from_user() call inside lock
    - ALSA: caiaq: Fix stray URB at probe error path
    - ALSA: li...

Changed in linux (Ubuntu Artful):
status: Fix Committed → Fix Released
Changed in linux (Ubuntu):
status: Confirmed → Fix Released