libvirtd is unable to configure bridge devices inside of LXD containers

Bug #1784501 reported by Tyler Hicks
16
This bug affects 3 people
Affects Status Importance Assigned to Milestone
linux (Ubuntu)
Fix Released
Medium
Tyler Hicks
Bionic
Fix Released
Medium
Tyler Hicks

Bug Description

[Impact]

libvirtd cannot properly configure the default bridge device when installed inside of unprivileged LXD containers. 'systemctl status libvirtd' shows the following error:

  error : virNetDevBridgeSet:140 : Unable to set bridge virbr0 forward_delay: Permission denied

This is caused due to the files under /sys/class/net/ being owned by init namespace root rather than container root even when the bridge device is created inside of the container. Here's an example from inside of an unprivileged container:

# brctl addbr testbr0
# ls -al /sys/class/net/testbr0/bridge/forward_delay
-rw-r--r-- 1 nobody nogroup 4096 Jul 30 22:33 /sys/class/net/testbr0/bridge/forward_delay

libvirt cannot open this file for writing even though it created the device. Where safe, files under /sys/class/net/ should be owned by container root.

[Test Case]

A simple kernel test is to verify that you can write to the /sys/class/net/<BRIDGE>/ files as root inside of an unprivileged LXD container.

Unpatched kernels will see a Permission denied error:

$ lxc exec c1 -- sh -c 'brctl addbr testbr && \
                        echo 1 > /sys/class/net/testbr/bridge/flush'
sh: 1: cannot create /sys/class/net/testbr/bridge/flush: Permission denied

The echo command will succeed when using a patched kernel.

You can also install libvirt inside of a an unprivileged LXD container, restart the container, and verify that the default bridge (virbr0) is up.

Unpatched kernels will not see the virbr0 bridge:

$ lxc exec c1 -- sh -c 'brctl show virbr0'
bridge name bridge id STP enabled interfaces
virbr0 can't get info No such device

The brctl command will show a valid device when using a patched kerne:

$ lxc exec c1 -- sh -c 'brctl show virbr0'
bridge name bridge id STP enabled interfaces
virbr0 8000.5254005451e8 yes virbr0-nic

[Regression Potential]

The biggest concern with these patches is that they could cause a sensitive /sys/class/net/** file to be read from or written to inside of an unprivileged container. I've (tyhicks) audited all on the in-tree objects exposed to unprivileged containers by this patch set and I don't see any concerns. I did find one file (tx_maxrate) that I couldn't make heads or tails of so I added a CAP_NET_ADMIN check against the init namespace so that it couldn't be modified inside of a container.

These patches were released in 4.19 and also in the Ubuntu 18.10 release kernel. No issues have been reported in those releases.

[Other info]

The following upstream patches have been merged into linux-next which fix this bug:

https://git.kernel.org/pub/scm/linux/kernel/git/next/linux-next.git/commit/?id=c59e18b876da3e466abe5fa066aa69050f5be17c
https://git.kernel.org/pub/scm/linux/kernel/git/next/linux-next.git/commit/?id=d1753390274f7760e5b593cb657ea34f0617e559

Revision history for this message
Tyler Hicks (tyhicks) wrote :

Patches submitted for inclusion in Ubuntu Cosmic:

  https://lists.ubuntu.com/archives/kernel-team/2018-July/094426.html

Seth Forshee (sforshee)
Changed in linux (Ubuntu):
status: In Progress → Fix Committed
Revision history for this message
Stéphane Graber (stgraber) wrote :

Adding a task for bionic as we'll want this fix to be available for our 18.04 users.
No need to backport it to anything older than that though.

Changed in linux (Ubuntu Bionic):
status: New → Triaged
importance: Undecided → Medium
Revision history for this message
Launchpad Janitor (janitor) wrote :
Download full text (30.8 KiB)

This bug was fixed in the package linux - 4.17.0-7.8

---------------
linux (4.17.0-7.8) cosmic; urgency=medium

  * linux: 4.17.0-7.8 -proposed tracker (LP: #1785242)

  * Cosmic update to 4.17.12 stable release (LP: #1785211)
    - spi: spi-s3c64xx: Fix system resume support
    - Input: elan_i2c - add ACPI ID for lenovo ideapad 330
    - Input: i8042 - add Lenovo LaVie Z to the i8042 reset list
    - Input: elan_i2c - add another ACPI ID for Lenovo Ideapad 330-15AST
    - mm: disallow mappings that conflict for devm_memremap_pages()
    - kvm, mm: account shadow page tables to kmemcg
    - delayacct: fix crash in delayacct_blkio_end() after delayacct init failure
    - tracing: Fix double free of event_trigger_data
    - tracing: Fix possible double free in event_enable_trigger_func()
    - kthread, tracing: Don't expose half-written comm when creating kthreads
    - tracing/kprobes: Fix trace_probe flags on enable_trace_kprobe() failure
    - tracing: Quiet gcc warning about maybe unused link variable
    - arm64: fix vmemmap BUILD_BUG_ON() triggering on !vmemmap setups
    - drm/i915/glk: Add Quirk for GLK NUC HDMI port issues.
    - mlxsw: spectrum_switchdev: Fix port_vlan refcounting
    - kcov: ensure irq code sees a valid area
    - mm: check for SIGKILL inside dup_mmap() loop
    - drm/amd/powerplay: Set higher SCLK&MCLK frequency than dpm7 in OD (v2)
    - xen/netfront: raise max number of slots in xennet_get_responses()
    - hv_netvsc: fix network namespace issues with VF support
    - skip LAYOUTRETURN if layout is invalid
    - ixgbe: Fix setting of TC configuration for macvlan case
    - ALSA: emu10k1: add error handling for snd_ctl_add
    - ALSA: fm801: add error handling for snd_ctl_add
    - NFSv4.1: Fix the client behaviour on NFS4ERR_SEQ_FALSE_RETRY
    - nfsd: fix error handling in nfs4_set_delegation()
    - nfsd: fix potential use-after-free in nfsd4_decode_getdeviceinfo
    - vfio: platform: Fix reset module leak in error path
    - vfio/mdev: Check globally for duplicate devices
    - vfio/type1: Fix task tracking for QEMU vCPU hotplug
    - kernel/hung_task.c: show all hung tasks before panic
    - mem_cgroup: make sure moving_account, move_lock_task and stat_cpu in the
      same cacheline
    - mm: /proc/pid/pagemap: hide swap entries from unprivileged users
    - mm: vmalloc: avoid racy handling of debugobjects in vunmap
    - mm/slub.c: add __printf verification to slab_err()
    - rtc: ensure rtc_set_alarm fails when alarms are not supported
    - rxrpc: Fix terminal retransmission connection ID to include the channel
    - perf tools: Fix pmu events parsing rule
    - netfilter: ipset: forbid family for hash:mac sets
    - netfilter: ipset: List timing out entries with "timeout 1" instead of zero
    - irqchip/ls-scfg-msi: Map MSIs in the iommu
    - watchdog: da9063: Fix updating timeout value
    - media: arch: sh: migor: Fix TW9910 PDN gpio
    - printk: drop in_nmi check from printk_safe_flush_on_panic()
    - bpf, arm32: fix inconsistent naming about emit_a32_lsr_{r64,i64}
    - ceph: fix alignment of rasize
    - ceph: fix use-after-free in ceph_statfs()
    - e1000e: Ignore TSYNCRXCTL when getting I219...

Changed in linux (Ubuntu):
status: Fix Committed → Fix Released
Revision history for this message
Mike Pontillo (mpontillo) wrote :

This seems to be a regression on bionic; I have been using libvirtd inside a container for several weeks; today when I tore down and rebuilt the container I hit this bug.

Revision history for this message
Stéphane Graber (stgraber) wrote :

Were you maybe using a privileged container before? Those aren't affected by the /sys ownership issue.

Revision history for this message
Tyler Hicks (tyhicks) wrote :

@mpontillo Bionic has always suffered from this bug. The patch set has only been backport to Cosmic and the fixes have not yet been backported to Bionic.

Have you, by chance, recently reconfigured your container to be an unprivileged container when it was previously a privileged container? (I don't know if this is actually possible to do in LXD but it would cause this bug to start affecting your container)

Revision history for this message
Mike Pontillo (mpontillo) wrote :

Nothing has changed regarding privileged vs. unprivileged settings.

I just set this up again with a privileged container, and I believe the cause of the regression to actually be in libvirt. I think it must have silently ignored the bridge configuration error before and marked the network active (such that it shows up in `virsh net-list` without the `--all` parameter).

Reasoning: yesterday before I rebuilt my test container, MAAS showed only two networks; none of which were virbr* interfaces (I never explicitly deleted the default virsh network). Today when I encountered this bug, MAAS showed three (because I made the container privileged). Additionally, MAAS KVM pods check to see if a `default` or `maas` network is active in virsh before allowing a KVM pod to be composed. Therefore, unless libvirt believed its `default` network was up and running, my previous test environment would not have worked at all.

Conclusion: we're just now seeing this because libvirt (in bionic-updates) began raising an error and failing to mark a network active in the case that it could not configure the bridge STP parameters.

Revision history for this message
Mike Pontillo (mpontillo) wrote :

To be fair, it's also possible that I did something like the following in my previous test container and forgot about it. ;-)

cat << EOF > maas.xml
<network>
  <name>maas</name>
  <forward mode='bridge'/>
  <bridge name='br0'/>
</network>
EOF
virsh net-define maas.xml
rm maas.xml
virsh net-start maas
virsh net-autostart maas

Revision history for this message
Tyler Hicks (tyhicks) wrote :

@mpontillo would you mind testing with the libvirt that's in the release pocket of Bionic?

https://launchpad.net/ubuntu/+source/libvirt/4.0.0-1ubuntu8

That would help tell us if libvirt has had an SRU that changed this behavior in the short time that Bionic has been released.

Revision history for this message
Mike Pontillo (mpontillo) wrote :

My 2nd theory was correct. ;-) Sorry for polluting this bug report with my thinking-out-loud.

Tyler Hicks (tyhicks)
Changed in linux (Ubuntu Bionic):
assignee: nobody → Tyler Hicks (tyhicks)
status: Triaged → In Progress
Tyler Hicks (tyhicks)
description: updated
description: updated
Revision history for this message
Tyler Hicks (tyhicks) wrote :
Changed in linux (Ubuntu Bionic):
status: In Progress → Fix Committed
Revision history for this message
Brad Figg (brad-figg) wrote :

This bug is awaiting verification that the kernel in -proposed solves the problem. Please test the kernel and update this bug with the results. If the problem is solved, change the tag 'verification-needed-bionic' to 'verification-done-bionic'. If the problem still exists, change the tag 'verification-needed-bionic' to 'verification-failed-bionic'.

If verification is not done by 5 working days from today, this fix will be dropped from the source code, and this bug will be closed.

See https://wiki.ubuntu.com/Testing/EnableProposed for documentation how to enable and use -proposed. Thank you!

tags: added: verification-needed-bionic
Revision history for this message
Tyler Hicks (tyhicks) wrote :

I've verified [Test Case] using 4.15.0-42.45-generic.

tags: added: verification-done-bionic
removed: verification-needed-bionic
Revision history for this message
Launchpad Janitor (janitor) wrote :
Download full text (3.1 KiB)

This bug was fixed in the package linux - 4.15.0-42.45

---------------
linux (4.15.0-42.45) bionic; urgency=medium

  * linux: 4.15.0-42.45 -proposed tracker (LP: #1803592)

  * [FEAT] Guest-dedicated Crypto Adapters (LP: #1787405)
    - KVM: s390: reset crypto attributes for all vcpus
    - KVM: s390: vsie: simulate VCPU SIE entry/exit
    - KVM: s390: introduce and use KVM_REQ_VSIE_RESTART
    - KVM: s390: refactor crypto initialization
    - s390: vfio-ap: base implementation of VFIO AP device driver
    - s390: vfio-ap: register matrix device with VFIO mdev framework
    - s390: vfio-ap: sysfs interfaces to configure adapters
    - s390: vfio-ap: sysfs interfaces to configure domains
    - s390: vfio-ap: sysfs interfaces to configure control domains
    - s390: vfio-ap: sysfs interface to view matrix mdev matrix
    - KVM: s390: interface to clear CRYCB masks
    - s390: vfio-ap: implement mediated device open callback
    - s390: vfio-ap: implement VFIO_DEVICE_GET_INFO ioctl
    - s390: vfio-ap: zeroize the AP queues
    - s390: vfio-ap: implement VFIO_DEVICE_RESET ioctl
    - KVM: s390: Clear Crypto Control Block when using vSIE
    - KVM: s390: vsie: Do the CRYCB validation first
    - KVM: s390: vsie: Make use of CRYCB FORMAT2 clear
    - KVM: s390: vsie: Allow CRYCB FORMAT-2
    - KVM: s390: vsie: allow CRYCB FORMAT-1
    - KVM: s390: vsie: allow CRYCB FORMAT-0
    - KVM: s390: vsie: allow guest FORMAT-0 CRYCB on host FORMAT-1
    - KVM: s390: vsie: allow guest FORMAT-1 CRYCB on host FORMAT-2
    - KVM: s390: vsie: allow guest FORMAT-0 CRYCB on host FORMAT-2
    - KVM: s390: device attrs to enable/disable AP interpretation
    - KVM: s390: CPU model support for AP virtualization
    - s390: doc: detailed specifications for AP virtualization
    - KVM: s390: fix locking for crypto setting error path
    - KVM: s390: Tracing APCB changes
    - s390: vfio-ap: setup APCB mask using KVM dedicated function
    - s390/zcrypt: Add ZAPQ inline function.
    - s390/zcrypt: Review inline assembler constraints.
    - s390/zcrypt: Integrate ap_asm.h into include/asm/ap.h.
    - s390/zcrypt: fix ap_instructions_available() returncodes
    - s390/zcrypt: remove VLA usage from the AP bus
    - s390/zcrypt: Remove deprecated ioctls.
    - s390/zcrypt: Remove deprecated zcrypt proc interface.
    - s390/zcrypt: Support up to 256 crypto adapters.
    - [Config:] Enable CONFIG_S390_AP_IOMMU and set CONFIG_VFIO_AP to module.

  * Bypass of mount visibility through userns + mount propagation (LP: #1789161)
    - mount: Retest MNT_LOCKED in do_umount
    - mount: Don't allow copying MNT_UNBINDABLE|MNT_LOCKED mounts

  * CVE-2018-18955: nested user namespaces with more than five extents
    incorrectly grant privileges over inode (LP: #1801924) // CVE-2018-18955
    - userns: also map extents in the reverse map to kernel IDs

  * kdump fail due to an IRQ storm (LP: #1797990)
    - SAUCE: x86/PCI: Export find_cap() to be used in early PCI code
    - SAUCE: x86/quirks: Add parameter to clear MSIs early on boot
    - SAUCE: x86/quirks: Scan all busses for early PCI quirks

 -- Thadeu Lima de Souza Cascardo <email address hidden> Thu, 15 Nov 2018 17:01:46 ...

Read more...

Changed in linux (Ubuntu Bionic):
status: Fix Committed → Fix Released
status: Fix Committed → Fix Released
To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.