unable to allow an app to access all devices with a certain major number via a <majordev>:* device cgroup rule

Bug #1892895 reported by Dmitrii Shcherbakov
8
This bug affects 1 person
Affects Status Importance Assigned to Milestone
snapd
Confirmed
High
Unassigned

Bug Description

I found a race condition which can be avoided by using wildcard rules in device cgroups, however, I do not see a way to enable that in an interface.

There is a use-case for MicroStack where iSCSI targets are added to the host kernel as block devices via iscsid + the iscsi-tcp kernel module.

An immediate idea is to:

* add block-devices interface to nova-compute and libvirtd apps;
* as a result, get major and minor devices of the hot-plugged devices added to device cgroups of Nova and libvirtd (/sys/fs/cgroup/devices/snap.microstack.{nova-compute, libvirtd}/devices.list).
  * This part of the interface makes sure of that: https://github.com/snapcore/snapd/blob/2.46/interfaces/builtin/block_devices.go#L97

As it turns out, this approach is racy since the device is attempted to be used prior to its major and minor number being added to the relevant device cgroup via: /sys/fs/cgroup/devices/snap.microstack.{nova-compute, libvirtd}/devices.allow

snap-device-helper is responsible for that https://github.com/snapcore/snapd/blob/2.46/cmd/snap-confine/snap-device-helper#L73

In essence, the block special file is created and used prior to the time when snapd runs snap-device-helper and confined applications are not synchronized with the operation of the helper in any way.

In the failure mode I observe consistently, I get "Operation not permitted" which is the EPERM returned from the kernel when it enforces accesses based on what is present in the device cgroup:
https://git.launchpad.net/~ubuntu-kernel/ubuntu/+source/linux/+git/focal/tree/security/device_cgroup.c?h=Ubuntu-5.4.0-44.48#n823

Specific to my use-case, what I see is that Nova tells libvirt to use a block device which fails with EPERM. Then Nova tries to remove the volume it just tried to attach and do `blockdev --flushbufs` in the process which fails as well:

* try: virt_driver.attach_volume (Nova) -> virStorageFileReportBrokenChain (libvirt) -> Cannot access storage file '/dev/sde': Operation not permitted -> libvirt.libvirtError Cannot access storage file '/dev/sde': Operation not permitted
* except: "Driver failed to attach volume..." -> volume_api.attachment_delete -> ... -> flush_device_io -> blockdev --flushbufs /dev/sde -> blockdev: cannot open /dev/sde: Operation not permitted
https://opendev.org/openstack/nova/src/branch/stable/ussuri/nova/virt/block_device.py#L498-L510 (Nova code)
https://git.launchpad.net/ubuntu/+source/libvirt/tree/src/util/virstoragefile.c?h=applied/ubuntu/focal#n4877 ("Cannot access storage" in libvirt)

https://paste.ubuntu.com/p/RTgq8XkzY6/ (logs)

If I add a wildcard rule to allow devices with any minor number and a certain major number to be used, this race condition is avoided.

sudo bash -c 'echo b 8:* rwm > /sys/fs/cgroup/devices/snap.microstack.libvirtd/devices.allow'
sudo bash -c 'echo b 8:* rwm > /sys/fs/cgroup/devices/snap.microstack.nova-compute/devices.allow'

---------------------------------------------------------------------

Another simple use-case this is valid for is working with loop devices.

If I have this in an interface:

const connectedPlugAppArmor = `
/dev/loop-control rw,
/dev/loop[0-9]* rw,
`

var microStackConnectedPlugUDev = []string{
 `SUBSYSTEM=="block", KERNEL=="loop[0-9]*"`,
 `SUBSYSTEM=="misc", KERNEL=="loop-control"`,
}

And try to use `losetup -f` when there are no free loop files available:

fallocate -l $loop_file_size $loop_file
losetup -f $loop_file

I will get "Operation not permitted" during the losetup invocation since the device cgroup entry is not added fast enough.

This is a much simpler reproducer then the one with iSCSI.

---------------------------------------------------------------------
Update (09-09-2020):

Found one more use-case which is LV activation after reboot:

* reboot -> LV Status NOT available;
* lvchange -a y <vgname-for-lvs> -> device-mapper: reload ioctl on (253:3) failed: Operation not permitted

description: updated
Changed in snapd:
assignee: nobody → Zygmunt Krynicki (zyga)
Revision history for this message
Zygmunt Krynicki (zyga) wrote :

I've analyzed the problem and I need to discuss my findings with the rest of the snapd team. I have several ides on how to avoid this problem, in addition the the suggestion provided by the reporter.

Revision history for this message
Dmitrii Shcherbakov (dmitriis) wrote :

Thanks for looking into this!

For the future, please also consider that in cgroupv2 there are no interface files for controlling access rules:

https://elixir.bootlin.com/linux/latest/source/Documentation/admin-guide/cgroup-v2.rst#L2018
"Cgroup v2 device controller has no interface files and is implemented on top of cgroup BPF. To control access to device files, a user may create bpf programs of the BPF_CGROUP_DEVICE type and attach them to cgroups. On an attempt to access a device file, corresponding BPF programs will be executed, and depending on the return value the attempt will succeed or fail with -EPERM."

description: updated
Revision history for this message
Zygmunt Krynicki (zyga) wrote :

We are well aware of cgroup v2 device model and plan to support it. There are ongoing patches that need review, which build towards that.

Changed in snapd:
status: New → Confirmed
Changed in snapd:
importance: Undecided → Medium
importance: Medium → High
Changed in snapd:
assignee: Zygmunt Krynicki (zyga) → nobody
Revision history for this message
Dmitrii Shcherbakov (dmitriis) wrote :

Also encountered it during thin LV creation via lvcreate:

https://pastebin.canonical.com/p/3TDh83RgXy/

lvcreate -T -V 4g -n testvol cinder-volumes/cinder-volumes-pool
  device-mapper: reload ioctl on (253:3) failed: Operation not permitted
  Aborting. Failed to locally activate thin pool cinder-volumes/cinder-volumes-pool.

Using bpftrace (with BTF support via libbpf and 5.11 kernel) I was also able to confirm that EPERM (-1) definitely comes from devcgroup_check_permission:
https://paste.ubuntu.com/p/Hpfkd2FJ5t/
# ...
cgroup permission check (devcgroup_check_permission) for: major: 253, minor: 2
kstack:

        bpf_prog_3d4bb0e1b7559353_devcgroup_check+186
        bpf_prog_3d4bb0e1b7559353_devcgroup_check+186
        bpf_trampoline_6442502014_0+85
        devcgroup_check_permission+5
        dm_get_table_device+254
        dm_get_device+437
        linear_ctr+200
        dm_table_add_target+383
        table_load+291
        ctl_ioctl+399
        dm_ctl_ioctl+14
        __x64_sys_ioctl+145
        do_syscall_64+56
        entry_SYSCALL_64_after_hwframe+68

ustack:

        ioctl+11
        dm_task_run+828
        _reload_with_suppression_v4+783
        dm_task_run+349
        _load_node+697
        dm_tree_preload_children+450
        _tree_action+1284
        dev_manager_activate+49
        _lv_activate_lv+159
        _lv_activate+1271
        lv_activate_with_filter+147
        activate_lv+244
        _lv_create_an_lv+2473
        lv_create_single+1131
        _lvcreate_single+1590
        _process_vgnameid_list+1688
        process_each_vg+1274
        lvcreate+543
        lvm_run_command+2912
        lvm2_main+1426
        main+36
        __libc_start_main+243

retval -1

Revision history for this message
Ian Johnson (anonymouse67) wrote :

I opened https://github.com/snapcore/snapd/pull/10975 specifically for microstack which should resolve the bug for microstack, but this bug is indeed a more general problem wherein a snap application knows about a new device node from the kernel but that device node has not been processed by udev yet.

I wonder if applications like microstack which get information directly from the kernel about a given device node before udev has a chance to process the new device node could instead do one of two things to avoid this race entirely:

1. Just always loop trying to access this block device (with reasonable timeouts) if the error is specifically EPERM, with the assumption that it will eventually be asynchronously allowed by udev when udev has a chance to trigger the snap-device-helper which will update the snap's device cgroup to allow the new specific device that the snap knows about before udev has processed.

2. After learning about the new device node from the kernel, before trying to access it at all, synchronously trigger udev to process the new rule and wait for udev to be finished before accessing the device node. This should entirely eliminate the race because the race here is really between the application and udev, not really between the application and the snap-device-helper.

Long term, we may look at other mechanisms to know about these sorts of changes sooner than udev (i.e. a more immediate way that maybe happens before the kernel releases to userspace the information about the device node if that is possible), but I happen to be of the opinion right now that there will always fundamentally be this sort of race between something privileged in userspace asking/telling the kernel to create a device and udev responding to that device creation since udev is in userspace.

Revision history for this message
Dmitrii Shcherbakov (dmitriis) wrote :

(1) and (2) assumes that the snap author has control over the source code. In the MicroStack case a lot of external software is packaged in one snap, including LVM, util-linux for loop devices, open-iscsi initiator.

As such, an approach like this would require patching upstream projects and maintaining those patches in the snap over time.

Avoiding the use of the device cgroup will probably be a reasonable workaround for MicroStack in the short term on Ubuntu classic.

Revision history for this message
Ian Johnson (anonymouse67) wrote :

snapd 2.53.3 now has code to enable the microstack-support interface to disable the udev backend entirely when used, so microstack snap should not have any device cgroup enforced for it and it is free to use any device node that it is permitted to via AppArmor

To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers