Security advisory for four vulnerabilities in Kata Containers
| Affects | Status | Importance | Assigned to | Milestone |
|---|---|---|---|---|
| Kata Containers | Fix Released | Undecided | Unassigned | |
Bug Description
Hello,
I’m Yuval Avrahami, a security researcher at Palo Alto Networks. My team follows cloud-native projects and conducts security audits of containerization platforms. I’ve had the chance to review Kata Containers and am writing to disclose several security vulnerabilities I found.
1. A malicious container can access the guest root filesystem device, potentially allowing for code execution on the guest.
* Affects QEMU and Cloud Hypervisor on the default configuration.
* Initrd based guests aren’t affected.
2. Upon container teardown, a malicious guest can trick the kata runtime into unmounting any mount point on the host and all mount points underneath it. This allows DoSing the host.
* Affects QEMU and Cloud Hypervisor on the default configuration.
3. Cloud Hypervisor writes guest filesystem changes to the underlying image file on the host. A malicious guest running on Cloud Hypervisor can therefore compromise the image file on the host. Since Kata Containers uses the same guest image file with all VMMs by default, this issue can subsequently affect QEMU and Firecracker guests.
* Affects Cloud Hypervisor on the default configuration.
4. A malicious guest compromised before container creation (e.g. a malicious guest image) can trick the kata runtime into mounting the container filesystem on any host path, potentially allowing for code execution on the host.
* Affects QEMU and Cloud Hypervisor on the default configuration.
By chaining the vulnerabilities above, the following attacks are possible:
1. A malicious container can exploit vulnerability #1 to compromise the guest and then chain vulnerability #2 to DoS the host by unmounting the root directory.
2. A malicious container can chain vulnerabilities #1 and #3 to compromise the guest image, and then exploit vulnerability #4 to gain code execution on the host the next time a malicious container image is run.
PoCs can be found in the attached file.
Please assign CVE IDs for these vulnerabilities. I plan to publicly share this advisory, so please do coordinate your announcement time with me. As part of our responsible disclosure policy, we plan to release this advisory within 90 days.
1. Background
-------
Kata Containers is an open-source OCI compatible container runtime that runs each container or pod in a dedicated VM, providing another layer of isolation. It can run under Docker, Podman, and Kubernetes.
The kata-runtime on the host utilizes a Virtual Machine Monitor to run the guest VM. QEMU is the default; Firecracker and Cloud Hypervisor are also supported. Inside the guest VM, the kata-agent runs the container workload.
2. Accessing the guest filesystem device
-------
The container within the guest runs without device cgroup isolation [1] and thus can access the guest’s root filesystem device. With Firecracker, the guest fs device is configured as read-only, preventing modifications. With QEMU and Cloud Hypervisor, the device is read-write, and therefore can be modified. Since guests booted on initramfs (i.e. initrd) don’t have a backing block device, they aren’t affected.
To access guest devices the container must have the CAP_MKNOD capability, which is the default configuration in Docker, Podman, and Kubernetes with containerd, but not in Kubernetes with CRI-O. Hardening the container with seccomp doesn’t mitigate the vulnerability.
A malicious container can create a device file for the guest filesystem device using mknod, and access the filesystem within it through utilities like debugfs [2]:
$ ls /sys/dev/block # get guest fs device major and minor numbers
$ mknod --mode 0600 /dev/guest_fs b $major_num $minor_num
$ debugfs -w /dev/guest_fs
debugfs: ... # access and modify filesystem
debugfs: close -a # write changes to device
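As a small illustration of the first command above, the entries under /sys/dev/block are named MAJOR:MINOR, so the device numbers can be read off the directory listing. The sketch below only parses those names; the helper names are hypothetical and are not part of the attached PoCs.

```python
import os


def parse_dev_numbers(entry: str) -> tuple[int, int]:
    """Split a /sys/dev/block entry name such as "254:1" into (major, minor)."""
    major, minor = entry.split(":")
    return int(major), int(minor)


def list_block_devices(sysfs_dir: str = "/sys/dev/block") -> list[tuple[int, int]]:
    """Return the sorted (major, minor) pairs of every entry in sysfs_dir."""
    return sorted(parse_dev_numbers(name) for name in os.listdir(sysfs_dir))


if __name__ == "__main__":
    # Print every block device number visible from the current environment.
    for major, minor in list_block_devices():
        print(f"block device {major}:{minor}")
```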
The container can then modify the filesystem to try to gain code execution on the guest. A malicious container must take into account that modifications take effect at device level, below the kernel page cache and dentry cache. This means changes made directly to the device aren’t always apparent to guest processes.
2.1. Exploitation with DAX
-------
On all QEMU architectures supported by Kata Containers, excluding s390x, DAX [3] is enabled for the guest root mount. DAX, or Direct Access for files, means that virtual memory mappings of processes map straight to the device, rather than to the page cache. This eases the exploitation of the issue.
A malicious container can perform the following actions to gain code execution on the guest in the default configuration. All actions are done through the debugfs utility:
1. Delete the kata-agent binary on the guest.
2. Replace the /usr/bin/umount binary on the guest with a malicious binary.
3. Write several big files with garbage data to the device so that the blocks previously allocated for the kata-agent binary will be used.
4. Because of DAX, the kata-agent process code region is directly mapped to the device (rather than the cache) and now points to garbage data, resulting in a SIGSEGV.
5. A systemd shutdown sequence is initiated, eventually calling umount.
6. The malicious umount binary is executed on the guest as root; the shutdown sequence is suspended until the malicious binary exits.
2.2. Exploitation without DAX
-------
When running Kata Containers with Cloud Hypervisor or QEMU s390x, the guest filesystem is mounted without DAX. It is also possible to disable DAX by adding custom guest kernel parameters to Kata’s configuration. Without DAX, gaining immediate code execution on the guest is significantly harder, but still possible in certain scenarios.
With Cloud Hypervisor, I haven’t found a way to gain immediate code execution on the guest using only this vulnerability. I wasn’t able to easily set up a testing environment for QEMU s390x and did not test this configuration.
I was able to achieve code execution on QEMU x86_64 guests running without DAX (configured through the guest kernel parameters) using the following method. If the user opted into constraining the container memory via cgroups (not the default on Docker or Kubernetes), this method might not work.
1. Delete the kata-agent binary on the guest through debugfs.
2. Replace the /usr/bin/umount binary on the guest with a malicious binary.
3. Write several big files with garbage data to the device so that the blocks previously allocated for the kata-agent binary will be used.
4. Run a process with the sole purpose of exhausting memory.
5. Because of low memory, the guest kernel frees its page cache, including the pages containing the kata-agent binary. The kernel now needs to remap the code region of the kata-agent process to its executable, and reads the kata-agent binary directly from the compromised guest filesystem device.
6. The kata-agent process code region is now mapped to garbage data, resulting in a SIGSEGV.
7. /usr/bin/umount wasn’t accessed previously by the guest and therefore isn’t in the page cache, and must be read from the device.
8. The following steps are identical to those presented in 2.1.
A PoC is available in the attached file under /guest_nodax_poc. The above method doesn’t work for Cloud Hypervisor or for QEMU with virtio-fs, as the VM freezes on low memory instead of freeing up the cache or calling the OOM killer. I do not know if this method will work for QEMU s390x.
There may be other ways to achieve code execution depending on the scenario (the guest image, configuration, etc.). Below are several examples:
* If /usr/bin/umount is in the page cache (e.g. because a custom guest image uses it), other binaries which are called during the systemd shutdown sequence can be used. /usr/lib/
* In custom guest images where the kata agent is used as the init process (no systemd), a malicious container could still gain code execution by injecting shellcode into blocks allocated to the kata-agent binary (through debugfs). Because of DAX, the shellcode will propagate into the kata-agent process memory. Injecting into libraries loaded by the kata-agent process should work as well. Note that I only tested that a process’s memory can be altered in this method, but haven’t created a full PoC for this attack.
Using the methods presented in sections 2.1 and 2.2, an attacker can break out of the container, the first isolation layer provided by Kata Containers, execute code on the guest, and possibly chain other vulnerabilities to affect the host.
Even though this section focused on exploiting the vulnerability for code execution in the guest, access to the guest filesystem device might cause other issues as illustrated in section 4 (affecting the underlying guest image file on the host).
2.3. Fix suggestions
-------
Apply device cgroups to deny access to the guest filesystem device, as well as to any other devices that the container doesn’t need. Additionally, set the guest filesystem device as read-only in QEMU and in Cloud Hypervisor.
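To make the device cgroup suggestion concrete, here is a minimal sketch of an ordered allow/deny device list, loosely modeled on the shape of the devices list in the OCI runtime spec. It is a simplified model, not the kata-agent's implementation: the default-deny behavior, the "last matching rule wins" evaluation, and the device numbers used in the usage note are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class DeviceRule:
    allow: bool
    dev_type: str          # "b" (block), "c" (char) or "a" (all types)
    major: Optional[int]   # None acts as a wildcard
    minor: Optional[int]   # None acts as a wildcard
    access: str            # any combination of "r", "w", "m" (mknod)


def is_allowed(rules, dev_type: str, major: int, minor: int, access: str) -> bool:
    """Evaluate rules in order; the last matching rule decides.

    Assumption: with no matching rule, access is denied.
    `access` is a single letter: "r", "w" or "m".
    """
    decision = False
    for rule in rules:
        if rule.dev_type not in ("a", dev_type):
            continue
        if rule.major is not None and rule.major != major:
            continue
        if rule.minor is not None and rule.minor != minor:
            continue
        if access not in rule.access:
            continue
        decision = rule.allow
    return decision
```

For example, a list that first denies everything and then allows only /dev/null (character device 1:3) rejects mknod on a hypothetical guest root device 254:1, whereas a single allow-all rule, which roughly corresponds to running without device cgroup isolation, permits it.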
3. Malicious guest umount vulnerability
-------
Upon container teardown, a malicious guest can trick the kata runtime into unmounting any mount point on the host.
The guest and host share a directory, accessible at /run/kata-
Upon container teardown, the kata runtime (on the host side) tries to unmount several paths under this shared directory, one of them being /run/kata-
When Kata Containers is configured with overlay2 as the storage driver, the attack follows the steps below (all actions are executed by the malicious guest):
1. Rename /run/kata-
2. Then, recreate /run/kata-
3. Create a symlink to the host target mount named /run/kata-
For devicemapper:
1. Unmount /run/kata-
2. Create a symlink to the host target mount named /run/kata-
3.1. Impact
-------
The umount operation is done with the MNT_DETACH [4] flag, meaning that mount points under the target mount will be unmounted as well. Because of that, if we target ‘/’ in our attack, then the host mounts underneath it (e.g. /proc, /sys, etc.) will be unmounted as well, resulting in the host being non-functional in most scenarios (e.g. being a Kubernetes node), creating a Denial of Service.
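The reach of a detached unmount can be approximated by listing every mount point at or below the target path. The sketch below works on a plain list of mount point strings; reading the real mount table is left out, the helper name is hypothetical, and path-prefix matching is only an approximation of how the kernel tracks child mounts.

```python
import os


def mounts_under(mount_points, target):
    """Return the mount points located at or beneath `target`."""
    target = os.path.normpath(target)
    # Compare on a trailing slash so "/var" does not match "/variant".
    prefix = target if target.endswith("/") else target + "/"
    result = []
    for mount in mount_points:
        normalized = os.path.normpath(mount)
        if normalized == target or normalized.startswith(prefix):
            result.append(mount)
    return result
```

With a target of "/", every absolute mount point in the list is returned, which matches the impact described above.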
On Kubernetes, a malicious guest can trigger multiple container deletions by simply killing the container processes running on it. With the default Kubernetes restart policy [5], containers will be removed and recreated. This process can be repeated by the malicious guest several times to control multiple unmount operations on the host.
3.2. Fix suggestions
-------
Any operations involving the shared directory on the host side should be carefully designed with the assumption that the guest is malicious and trying to affect the host through the shared directory.
A simple solution specific to the umount operation is to use the UMOUNT_NOFOLLOW [4] flag when unmounting the container rootfs, though this flag only helps with symlinks in the last component of the path. The unmount operation will still follow symlinks which are not in the last path component. A malicious guest could switch the $ctrid component in /run/kata-
The secure but complicated solution is to use a helper binary that chroots to the shared directory before performing operations within it. This will mitigate symlink attacks like those depicted here and in section 5.
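The difference between a symlink in the last path component and one in an intermediate component can be shown with a short sketch. The helper below (a hypothetical name) reports every component beneath a base directory that is a symlink. It is illustrative only: a check like this is itself racy if the guest can swap components after the check, so it is not a substitute for the fixes discussed above.

```python
import os


def symlink_components(base: str, relative: str) -> list[str]:
    """Return the relative prefixes of `relative` (under `base`) that are symlinks."""
    found = []
    seen = []
    for part in relative.strip("/").split("/"):
        seen.append(part)
        candidate = os.path.join(base, *seen)
        # islink() inspects only the final component of `candidate`,
        # so every prefix has to be checked separately.
        if os.path.islink(candidate):
            found.append("/".join(seen))
    return found
```

If the guest replaces an intermediate directory with a symlink, checking only the final path (which is roughly what a last-component no-follow check does) reports nothing, while the per-component walk reports the replaced directory.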
4. Cloud Hypervisor modifies the underlying image file
-------
When running Kata Containers with Cloud Hypervisor, any change made to the root filesystem device is written to the underlying .img file. Since the device is plugged as read-write, a malicious guest could write to it through utilities like debugfs.
Compromising the guest image file allows an attacker to control all subsequent guests that run that image. Since, by default, the same guest image file is used by all VMMs (QEMU, Firecracker and Cloud Hypervisor), the next time any guest is executed, it will be malicious. This immediately compromises all subsequent container runs. Additionally, it can expose the host to attacks that require the guest to be malicious from the moment it boots (as shown in section 5.1).
4.1. Fix Suggestions
-------
Deny CLH guests access to the underlying host backing file of their filesystem. I do not know if Cloud Hypervisor supports this configuration, but in Kata Containers’ use case of Cloud Hypervisor, it is required. Additionally, set the guest filesystem device as read-only.
5. Malicious guest mount vulnerability
-------
When creating a container in a new or existing sandbox, the kata runtime bind mounts several files and directories to the shared directory between the guest and the host:
1. The container filesystem [6] at /run/kata-
2. Container volumes at /run/kata-
3. Additional platform-specific files and directories [7] [8] at /run/kata-
* With Docker, $name is either resolv.conf, hostname or hosts.
* With Kubernetes, these include the termination-log file and the serviceaccount directory.
If the guest is compromised before container creation, it can use symlinks in the shared directory to trick the kata runtime into mounting the aforementioned mounts to any path on the host.
If a user opts into using the devicemapper storage driver, the container filesystem isn’t mounted to the shared directory, and thus cannot be used to attack the host. The other mounts specified above can still be rerouted. Users of Firecracker, which doesn’t support file sharing between the guest and the host, aren’t affected.
5.1. Rerouting the first container mount
-------
A malicious guest can create a symlink in the shared directory at /run/kata-
To create the symlink the guest must know the container id as it’s a part of the mount’s target path. The first container in a sandbox is unique in that regard since its id is the sandbox id, which is known to the guest at the time of the mount.
If a guest is compromised before the first container is added to it (e.g. a malicious guest image), it can execute the following attack:
1. The malicious guest receives the CreateSandbox message and extracts the sandbox id from it. The first container id matches the sandbox id and is derived from that message.
2. The malicious guest creates the malicious symlink at the shared directory, at /run/kata-
3. The malicious guest returns a response for CreateSandbox
4. Once the kata runtime on the host receives the CreateSandbox response, it tries to bind mount the container image at /run/kata-
5. The malicious symlink redirects the mount operation to the target on the host.
Given that the container image is malicious, the guest can gain code execution on the host by mounting over directories such as /bin or /lib. Besides code execution, DoSing the host is trivial (by mounting over crucial directories).
In the case of overlay2 mounts, once the container engine (e.g. Docker) removes the container, it might also delete the lower layers of the container filesystem, rendering the mount done through this attack empty. In the example of mounting the malicious container image over /bin, if no process tried running a binary from /bin before the container is removed, then /bin will become empty, and the attack fails.
To deal with this problem an attacker could target /lib or /lib64, which contains libraries for dynamically linked binaries such as the kata-runtime itself. Under Docker for example, the kata-runtime will most likely be executed again in the process of spawning a container:
1. docker run $image is called
2. Docker invokes kata-runtime create which inadvertently mounts the container image to the target directory on the host.
3. Docker invokes kata-runtime start. Assuming an attack targeting /lib64 occurred in step 2, the libraries loaded and executed by the kata-runtime process are now malicious.
Another interesting approach to deal with the removal of the mount’s lower layers is mounting over one of the components in /var/lib/
With Kubernetes, there can be multiple containers in a guest, but the first is always the pause container [9]. An attack redirecting the pause container is limited to a host DoS since the pause container image isn’t malicious.
5.2. Rerouting the other mounts
-------
The rest of the mounts mentioned in section 5 can also be redirected to the host, but the malicious guest must win a race condition. The race is required because these mounts are performed on paths in the shared directory which the guest cannot fully know. In the case of non-initial container images, the guest doesn’t know the container id. In the case of volumes and platform-specific files and directories, there is also a random string.
To exploit the vulnerability in this scenario, the malicious guest must win a race condition between:
1. ensureDestinati
2. The bind mount [11] done after calling ensureDestinati
Between those steps, the guest must replace the created file or directory with a symlink to a target on the host. An attack redirecting container volumes or platform-specific mounts is most likely limited to a DoS since the content of these mounts isn’t malicious. Successfully redirecting container images can lead to code execution on the host if the image is malicious (as shown in section 5.1).
With Docker, there is only one container in a guest. With Kubernetes, there can be multiple containers in a guest, and there are several scenarios in which a container is added to an existing guest that might already be compromised (by the existing containers in it). With the default pod restart policy, for example, a malicious guest can simply kill an existing container and the Kubelet will recreate it, causing the kata runtime to perform another mount operation on the shared directory. This can be repeated by the guest in order to get several opportunities to win the race.
Nevertheless, the race condition is rather difficult to win. The shared directory is mounted using either virtio-9p or virtio-fs, which don’t support inotify [12], a Linux API that eases exploitation of file related races. I tried winning the race using a malicious guest process and *failed*, though that doesn’t necessarily mean the race cannot be won. The approach below is quite time-consuming to implement but should have a better chance of winning the race.
1. Instead of running as a guest process, run as part of the guest kernel by loading a malicious kernel module.
2. Wait for the virtio-9p/virtio-fs packets generated by ensureDestinati
3. Send back the appropriate virtio-9p/virtio-fs packet to create the malicious symlink at /run/kata-
Since I’m not familiar with the internals of virtio-fs and virtio-9p, I cannot definitively say whether this race is beatable in the above approach, or at all.
5.3. Fix Suggestions
-------
Any operations involving the shared directory on the host side should be carefully designed with the assumption that the guest is malicious and trying to affect the host through the shared directory.
The approach suggested in section 3.2, using a helper binary which chroots to the shared directory, cannot be implemented as is to address this vulnerability. The bind mount operation will not work in a chroot jail as it requires access to the source of the mount, which is outside of the shared directory.
After giving it some thought, the only solution I came up with is to halt the guest, check the target path isn’t a symlink, and only then bind mount to the shared directory.
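A different technique that is sometimes used against symlink attacks, and which is not the fix proposed in this report, is to resolve the target one component at a time through directory file descriptors while refusing symlinks at every step. The sketch below shows the idea with a hypothetical helper; whether it is suitable for the kata runtime's shared directory is an open question that this report does not answer.

```python
import os


def open_beneath(base_dir: str, relative: str) -> int:
    """Open the directory `relative` under `base_dir`, refusing symlinks.

    Each component is opened relative to the previous directory descriptor
    with O_NOFOLLOW, so a symlink anywhere along the path raises OSError.
    The caller owns the returned descriptor and must close it.
    """
    step_flags = os.O_RDONLY | os.O_DIRECTORY | os.O_NOFOLLOW
    fd = os.open(base_dir, os.O_RDONLY | os.O_DIRECTORY)
    try:
        for part in relative.strip("/").split("/"):
            if part in ("", ".", ".."):
                raise ValueError(f"unsupported path component: {part!r}")
            next_fd = os.open(part, step_flags, dir_fd=fd)
            os.close(fd)
            fd = next_fd
    except BaseException:
        # Do not leak the descriptor held at the time of the failure.
        os.close(fd)
        raise
    return fd
```

The returned descriptor refers to the directory that was actually checked, so later operations can go through it instead of resolving the path a second time.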
6. Chained attacks
-------
The following sections describe PoCs for attacks that chain several of the outlined vulnerabilities. The attacks are carried out either by a compromised container or by a container running a malicious image.
6.1. Chained attack #1 - unmount, DoS
-------
Chaining the vulnerabilities described in sections 2 and 3, a malicious container can unmount any path on the host, allowing a DoS attack on the host. The malicious container will first exploit vulnerability #1 to gain control over the guest, and then chain vulnerability #2 to trick the kata runtime into unmounting a target path on the host.
A PoC is available in the attached file under /host_umount.
6.2. Chained attack #2 - mount, code execution
-------
Chaining the vulnerabilities described in sections 2, 4 and 5, code execution on the host can be achieved. This requires a scenario where:
1. A malicious container image is run once with Cloud Hypervisor.
2. The malicious container image is run again with either QEMU or Cloud Hypervisor.
When the malicious container image is first run with Cloud Hypervisor, it exploits vulnerability #1 to access the guest filesystem device. It then replaces the kata-agent binary on the guest device with a malicious version. Because of vulnerability #3 in Kata Containers with Cloud Hypervisor, modifications to the guest file system propagate to the underlying .img file on the host.
The next time a container is run, it runs in a malicious guest. In the second container run, the malicious guest exploits vulnerability #4 to mount the container image on a crucial path on the host, /bin (/lib and /lib64 are good targets as well).
The next time the host attempts to execute a binary from /bin (for example /bin/ls), a binary from the malicious container image is executed on the host instead.
A PoC is available in the attached file under /host_mount.
Please acknowledge receiving this report. If needed, my mail address is <email address hidden>.
Best regards,
Yuval Avrahami | Senior Security Researcher
Palo Alto Networks
Footnotes
-------
[1] https:/
[2] https:/
[3] https:/
[4] http://
[5] https:/
[6] https:/
[7] https:/
[8] https:/
[9] https:/
[10] https:/
[11] https:/
[12] http://
| information type: | Private Security → Public Security |
Have not read all the detailed info.
At least we should fix the "unmounting any mount point on the host and all mount points underneath it" as soon as possible.