Security advisory for four vulnerabilities in Kata Containers

Bug #1863875 reported by Yuval Avrahami
Affects: Kata Containers
Status: Fix Released
Importance: Undecided
Assigned to: Unassigned

Bug Description

Hello,

I’m Yuval Avrahami, a security researcher at Palo Alto Networks. My team follows cloud-native projects and conducts security audits of containerization platforms. I’ve had the chance to review Kata Containers and am writing to disclose several security vulnerabilities I found.

 1. A malicious container can access the guest root filesystem device, potentially allowing for code execution on the guest.
    * Affects QEMU and Cloud Hypervisor on the default configuration.
    * Initrd-based guests aren’t affected.

 2. Upon container teardown, a malicious guest can trick the kata runtime into unmounting any mount point on the host and all mount points underneath it. This allows DoSing the host.
    * Affects QEMU and Cloud Hypervisor on the default configuration.

 3. Cloud Hypervisor writes guest filesystem changes to the underlying image file on the host. A malicious guest running on Cloud Hypervisor can therefore compromise the image file on the host. Since Kata Containers uses the same guest image file with all VMMs by default, this issue can subsequently affect QEMU and Firecracker guests.
    * Affects Cloud Hypervisor on the default configuration.

 4. A malicious guest compromised before container creation (e.g. a malicious guest image) can trick the kata runtime into mounting the container filesystem on any host path, potentially allowing for code execution on the host.
    * Affects QEMU and Cloud Hypervisor on the default configuration.

By chaining the vulnerabilities above, the following attacks are possible:

 1. A malicious container can exploit vulnerability #1 to compromise the guest and then chain vulnerability #2 to DoS the host by unmounting the root directory.

 2. A malicious container can chain vulnerabilities #1 and #3 to compromise the guest image, and then exploit vulnerability #4 to gain code execution on the host the next time a malicious container image is run.

PoCs can be found in the attached file.

Please assign CVE IDs to these vulnerabilities. I plan to publicly share this advisory, so please coordinate your announcement timing with me. As part of our responsible disclosure policy, we plan to release this advisory within 90 days.

1. Background
--------------------------------
Kata Containers is an open-source, OCI-compatible container runtime that runs each container or pod in a dedicated VM, providing an additional layer of isolation. It can run under Docker, Podman, and Kubernetes.

The kata-runtime on the host uses a Virtual Machine Monitor (VMM) to run the guest VM; QEMU is the default, and Firecracker and Cloud Hypervisor are also supported. Inside the guest VM, the kata-agent runs the container workload.

2. Accessing the guest filesystem device
--------------------------------
The container within the guest runs without device cgroup isolation [1] and thus can access the guest’s root filesystem device. With Firecracker, the guest filesystem device is configured as read-only, preventing modifications. With QEMU and Cloud Hypervisor, the device is read-write and can therefore be modified. Since guests booted from an initramfs (i.e. initrd) don’t have a backing block device, they aren’t affected.

To access guest devices the container must have the CAP_MKNOD capability, which is the default configuration in Docker, Podman, and Kubernetes with containerd, but not in Kubernetes with CRI-O. Hardening the container with seccomp doesn’t mitigate the vulnerability.

A malicious container can create a device file for the guest filesystem device using mknod, and access the filesystem within it through utilities like debugfs [2]:
$ ls /sys/dev/block # get guest fs device major and minor numbers
$ mknod --mode 0600 /dev/guest_fs b $major_num $minor_num
$ debugfs -w /dev/guest_fs
debugfs: ... # access and modify filesystem
debugfs: close -a # write changes to device

The container can then modify the filesystem to try to gain code execution on the guest. A malicious container must take into account that its modifications take effect at the device level, below the kernel page cache and dentry cache. This means changes made directly to the device aren’t always visible to guest processes.

2.1. Exploitation with DAX
--------------------------------
On all QEMU architectures supported by Kata Containers, excluding s390x, DAX [3] is enabled for the guest root mount. DAX, or Direct Access for files, means that virtual memory mappings of processes map straight to the device, rather than to the page cache. This eases the exploitation of the issue.

A malicious container can perform the following actions to gain code execution on the guest in the default configuration. All actions are done through the debugfs utility:
 1. Delete the kata-agent binary on the guest.
 2. Replace the /usr/bin/umount binary on the guest with a malicious binary.
 3. Write several big files with garbage data to the device so that the blocks previously allocated for the kata-agent binary will be used.
 4. Because of DAX, the kata-agent process code region is directly mapped to the device (rather than the cache) and now points to garbage data, resulting in a SIGSEGV.
 5. A systemd shutdown sequence is initiated, eventually calling umount.
 6. The malicious umount binary is executed on the guest as root; the shutdown sequence is suspended until the malicious binary exits.

2.2. Exploitation without DAX
--------------------------------
When running Kata Containers with Cloud Hypervisor or QEMU s390x, the guest filesystem is mounted without DAX. It is also possible to disable DAX by adding custom guest kernel parameters to Kata’s configuration. Without DAX, gaining immediate code execution on the guest is significantly harder, but still possible in certain scenarios.

With Cloud Hypervisor, I haven’t found a way to gain immediate code execution on the guest using only this vulnerability. I wasn’t able to easily set up a testing environment for QEMU s390x and did not test this configuration.

I was able to achieve code execution on QEMU x86_64 guests running without DAX (configured through the guest kernel parameters) using the following method. If the user opted into constraining the container memory via cgroups (not the default on Docker or Kubernetes), this method might not work.
 1. Delete the kata-agent binary on the guest through debugfs.
 2. Replace the /usr/bin/umount binary on the guest with a malicious binary.
 3. Write several big files with garbage data to the device so that the blocks previously allocated for the kata-agent binary will be used.
 4. Run a process with the sole purpose of exhausting memory.
 5. Because of low memory, the guest kernel frees its page cache, including the pages containing the kata-agent binary. The kernel now needs to remap the code region of the kata-agent process to its executable, and reads the kata-agent binary directly from the compromised guest filesystem device.
 6. The kata-agent process code region is now mapped to garbage data, resulting in a SIGSEGV.
 7. /usr/bin/umount wasn’t accessed previously by the guest and therefore isn’t in the page cache, and must be read from the device.
 8. The following steps are identical to those presented in 2.1.

A PoC is available in the attached file under /guest_nodax_poc. The above method doesn’t work for Cloud Hypervisor or for QEMU with virtio-fs, as the VM freezes under memory pressure instead of freeing the cache or invoking the OOM killer. I do not know whether this method works for QEMU s390x.

There may be other ways to achieve code execution depending on the scenario (the guest image, configuration, etc.). Below are several examples:
 * If /usr/bin/umount is in the page cache (e.g. because a custom guest image uses it), other binaries which are called during the systemd shutdown sequence can be used. /usr/lib/systemd/systemd-shutdown is a good candidate since it is only called on shutdown and therefore most likely isn't in cache.
 * In custom guest images where the kata agent is used as the init process (no systemd), a malicious container could still gain code execution by injecting shellcode into blocks allocated to the kata-agent binary (through debugfs). Because of DAX, the shellcode will propagate into the kata-agent process memory. Injecting into libraries loaded by the kata-agent process should work as well. Note that I only tested that a process’s memory can be altered in this method, but haven’t created a full PoC for this attack.

Using the methods presented in sections 2.1 and 2.2, an attacker can break out of the container (the first isolation layer provided by Kata Containers), execute code on the guest, and possibly chain other vulnerabilities to affect the host.

Even though this section focused on exploiting the vulnerability for code execution in the guest, access to the guest filesystem device might cause other issues as illustrated in section 4 (affecting the underlying guest image file on the host).

2.3. Fix suggestions
--------------------------------
Apply device cgroups to deny access to the guest filesystem device, as well as to any other devices the container doesn’t need access to. Additionally, configure the guest filesystem device as read-only in both QEMU and Cloud Hypervisor.
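
For illustration only, here is a minimal sketch (not kata code) of what such a deny rule could look like using the OCI runtime-spec Go types; the major/minor numbers below are placeholders that would be inferred at runtime:

package main

import (
	"fmt"

	specs "github.com/opencontainers/runtime-spec/specs-go"
)

// denyRule returns a device cgroup rule that denies read, write, and mknod
// access to the given block device (intended for the guest rootfs device).
func denyRule(major, minor int64) specs.LinuxDeviceCgroup {
	return specs.LinuxDeviceCgroup{
		Allow:  false, // deny
		Type:   "b",   // block device
		Major:  &major,
		Minor:  &minor,
		Access: "rwm", // read, write, mknod
	}
}

func main() {
	// Placeholder numbers; a real agent would infer them from the root mount.
	r := denyRule(254, 1)
	fmt.Printf("deny %s %d:%d %s\n", r.Type, *r.Major, *r.Minor, r.Access)
}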

3. Malicious guest umount vulnerability
--------------------------------
Upon container teardown, a malicious guest can trick the kata runtime into unmounting any mount point on the host.

The guest and host share a directory, accessible at /run/kata-containers/shared/containers in the guest and at /run/kata-containers/shared/sandbox/$sandbox_id in the host.

Upon container teardown, the kata runtime (on the host side) tries to unmount several paths under this shared directory, one of them being /run/kata-containers/shared/sandbox/$sandbox_id/$ctr_id/rootfs. The guest can turn the last component of this path into a symlink pointing to a target of its choosing, which the umount operation on the host will follow when destroying the container.

When Kata Containers is configured with overlay2 as the storage driver, the attack follows the steps below (all actions are executed by the malicious guest; a rough sketch of these guest-side steps follows the two lists):
 1. Rename /run/kata-containers/shared/containers/${ctrid} to /run/kata-containers/shared/containers/${ctrid}_original
 2. Then, recreate /run/kata-containers/shared/containers/${ctrid}
 3. Create a symlink to the host target mount named /run/kata-containers/shared/containers/${ctrid}/rootfs

For devicemapper:
 1. Unmount /run/kata-containers/shared/containers/${ctrid}
 2. Create a symlink to the host target mount named /run/kata-containers/shared/containers/${ctrid}/rootfs
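
As a minimal illustration of the overlay2 variant above, here is a sketch of the guest-side steps, assuming the container id is already known to the guest and ignoring error handling:

package main

import (
	"os"
	"path/filepath"
)

func main() {
	shared := "/run/kata-containers/shared/containers"
	ctrID := "CONTAINER_ID" // placeholder: the guest knows the real id
	target := "/"           // host mount point the attacker wants unmounted

	ctrDir := filepath.Join(shared, ctrID)

	// 1. Move the real per-container directory out of the way.
	_ = os.Rename(ctrDir, ctrDir+"_original")

	// 2. Recreate an empty directory with the original name.
	_ = os.Mkdir(ctrDir, 0o755)

	// 3. Plant a symlink named "rootfs" pointing at the host target; the
	//    host-side umount of .../$ctr_id/rootfs then resolves it on the host.
	_ = os.Symlink(target, filepath.Join(ctrDir, "rootfs"))
}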

3.1. Impact
--------------------------------
The umount operation is done with the MNT_DETACH [4] flag, meaning that mount points under the target mount are unmounted as well. Because of that, if the attack targets ‘/’, the host mounts underneath it (e.g. /proc, /sys) are also unmounted, leaving the host non-functional in most scenarios (e.g. as a Kubernetes node) and resulting in a denial of service.

On Kubernetes, a malicious guest can trigger multiple container deletions by simply killing the container processes running on it. With the default Kubernetes restart policy [5], containers will be removed and recreated. The malicious guest can repeat this process to trigger multiple unmount operations on the host.

3.2. Fix suggestions
--------------------------------
Any operations involving the shared directory on the host side should be carefully designed with the assumption that the guest is malicious and trying to affect the host through the shared directory.

A simple solution specific to the umount operation is to use the UMOUNT_NOFOLLOW [4] flag when unmounting the container rootfs, though this flag only helps with symlinks in the last component of the path. The unmount operation will still follow symlinks which are not in the last path component. A malicious guest could switch the $ctrid component in /run/kata-containers/shared/containers/${ctrid}/rootfs to a symlink, which will be followed upon sandbox teardown. This allows unmounting any mount point named rootfs on the host.
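
As a rough illustration of that flag (a sketch, not the actual kata-runtime code), the host-side unmount would combine MNT_DETACH with UMOUNT_NOFOLLOW; the path below is a placeholder:

package main

import (
	"fmt"

	"golang.org/x/sys/unix"
)

func main() {
	// Placeholder path for the container rootfs in the shared directory.
	rootfsPath := "/run/kata-containers/shared/sandbox/SANDBOX_ID/CONTAINER_ID/rootfs"

	// MNT_DETACH lazily unmounts the target and everything mounted below it.
	// UMOUNT_NOFOLLOW refuses to operate if the last path component is a
	// symlink, but symlinks in earlier components are still resolved.
	if err := unix.Unmount(rootfsPath, unix.MNT_DETACH|unix.UMOUNT_NOFOLLOW); err != nil {
		fmt.Println("unmount failed:", err)
	}
}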

The secure but complicated solution is to use a helper binary that chroots to the shared directory before performing operations within it. This will mitigate symlink attacks like those depicted here and in section 5.

4. Cloud Hypervisor modifies the underlying image file
--------------------------------
When running Kata Containers with Cloud Hypervisor, any change made to the root filesystem device is written to the underlying .img file. Since the device is attached read-write, a malicious guest can write to it through utilities like debugfs [2].

Compromising the guest image file allows an attacker to control all subsequent guests that run that image. Since, by default, the same guest image file is used by all VMMs (QEMU, Firecracker and Cloud Hypervisor), the next time any guest is executed, it will be malicious. This immediately compromises all subsequent container runs. Additionally, it can expose the host to attacks that require the guest to be malicious from the moment it boots (as shown in section 5.1).

4.1. Fix Suggestions
--------------------------------
Deny CLH guests access to the underlying host backing file of their filesystem. I do not know if Cloud Hypervisor supports this configuration, but in Kata Containers’ use case of Cloud Hypervisor, it is required. Additionally, set the guest filesystem device as read-only.

5. Malicious guest mount vulnerability
--------------------------------
When creating a container in a new or existing sandbox, the kata runtime bind mounts several files and directories to the shared directory between the guest and the host:
 1. The container filesystem [6] at /run/kata-containers/shared/sandbox/$sbx_id/$ctr_id/rootfs
 2. Container volumes at /run/kata-containers/shared/sandbox/$sbx_id/${ctr_id}-${rand_str}-${volume_target_in_ctr}
 3. Additional platform-specific files and directories [7] [8] at /run/kata-containers/shared/sandbox/$sbx_id/${ctr_id}-${rand_str}-$name.
    * With docker, $name is either resolv.conf, hostname or hosts.
    * With Kubernetes, these include the termination-log file and the serviceaccount directory.

If the guest is compromised before container creation, it can use symlinks in the shared directory to trick the kata runtime into mounting the aforementioned mounts to any path on the host.

If a user opts into using the devicemapper storage driver, the container filesystem isn’t mounted into the shared directory and thus cannot be used to attack the host. The other mounts specified above can still be rerouted. Users of Firecracker, which doesn’t support file sharing between the guest and the host, aren’t affected.

5.1. Rerouting the first container mount
--------------------------------
A malicious guest can create a symlink in the shared directory at /run/kata-containers/shared/containers/${ctrid}/rootfs to a target directory on the host. Upon container creation, the kata runtime will be tricked into bind mounting the container filesystem at the target directory on the host.

To create the symlink the guest must know the container id as it’s a part of the mount’s target path. The first container in a sandbox is unique in that regard since its id is the sandbox id, which is known to the guest at the time of the mount.

If a guest is compromised before the first container is added to it (e.g. a malicious guest image), it can execute the following attack:
 1. The malicious guest receives the CreateSandbox message and extracts the sandbox id from it. The first container id matches the sandbox id and is derived from that message.
 2. The malicious guest creates the malicious symlink at the shared directory, at /run/kata-containers/shared/containers/${first_ctrid}/rootfs
 3. The malicious guest returns a response for CreateSandbox
 4. Once the kata runtime on the host receives the CreateSandbox response, it tries to bind mount the container image at /run/kata-containers/shared/sandbox/$sbx_id/${first_ctrid}/rootfs
 5. The malicious symlink redirects the mount operation to the target on the host.

Given that the container image is malicious, the guest can gain code execution on the host by mounting over directories such as /bin or /lib. Besides code execution, DoSing the host is trivial (by mounting over crucial directories).

In the case of overlay2 mounts, once the container engine (e.g. Docker) removes the container, it might also delete the lower layers of the container filesystem, rendering the mount done through this attack empty. In the example of mounting the malicious container image over /bin, if no process tried running a binary from /bin before the container is removed, then /bin becomes empty and the attack fails.

To deal with this problem an attacker could target /lib or /lib64, which contain libraries used by dynamically linked binaries such as the kata-runtime itself. Under Docker for example, the kata-runtime will most likely be executed again in the process of spawning a container:
 1. docker run $image is called
 2. Docker invokes kata-runtime create which inadvertently mounts the container image to the target directory on the host.
 3. Docker invokes kata-runtime start. Assuming an attack targeting /lib64 occurred in step 2, the libraries loaded and executed by the kata-runtime process are now malicious.

Another interesting approach to deal with the removal of the mount’s lower layers is mounting over one of the components in /var/lib/docker/overlay2 (i.e. over /var, or over /var/lib, etc.). Since the lower layers of the container image are kept there, when docker tries to remove them it will fail (as we mounted over them), and our malicious mount on the host will not be emptied. If an attacker can gain code execution through mounting over /var, then this removes the possibility of the attack failing or turning into a DoS.

With Kubernetes, there can be multiple containers in a guest, but the first is always the pause container [9]. An attack redirecting the pause container is limited to a host DoS since the pause container image isn’t malicious.

5.2. Rerouting the other mounts
--------------------------------
The rest of the mounts mentioned in section 5 can also be redirected to the host, but the malicious guest must win a race condition. The race is required because these mounts are performed on paths in the shared directory which the guest cannot fully know. In the case of non-initial container images, the guest doesn’t know the container id. In the case of volumes and platform-specific files and directories, there is also a random string.

To exploit the vulnerability in this scenario, the malicious guest must win a race condition between:
 1. ensureDestinationExists [10], which creates the file or directory that is to be mounted on; and
 2. The bind mount [11] done after calling ensureDestinationExists.

Between those steps, the guest must replace the created file or directory with a symlink to a target on the host. An attack redirecting container volumes or platform-specific mounts is most likely limited to a DoS since the content of these mounts isn’t malicious. Successfully redirecting container images can lead to code execution on the host if the image is malicious (as shown in section 5.1).
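
A simplified sketch of the window being raced (not the actual runtime code; paths are placeholders): the destination is created first and the bind mount happens afterwards, so a guest that swaps the destination for a symlink in between redirects the mount on the host.

package main

import (
	"os"

	"golang.org/x/sys/unix"
)

// bindMountToShared mimics the two-step pattern described above:
// ensure the destination exists, then bind mount onto it.
func bindMountToShared(source, destination string) error {
	// Step 1 (ensureDestinationExists): create the mount point.
	if err := os.MkdirAll(destination, 0o750); err != nil {
		return err
	}

	// Race window: a malicious guest controlling the shared directory can
	// replace `destination` with a symlink to an arbitrary host path here.

	// Step 2: bind mount; the kernel resolves symlinks in `destination`,
	// so a planted symlink redirects the mount on the host.
	return unix.Mount(source, destination, "", unix.MS_BIND, "")
}

func main() {
	_ = bindMountToShared(
		"/var/lib/docker/overlay2/EXAMPLE/merged", // placeholder source
		"/run/kata-containers/shared/sandbox/SANDBOX_ID/CONTAINER_ID/rootfs",
	)
}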

With Docker, there is only one container per guest. With Kubernetes, there can be multiple containers in a guest, and there are several scenarios in which a container is added to an existing guest that might already be compromised (by the existing containers in it). With the default pod restart policy, for example, a malicious guest can simply kill an existing container and the kubelet will recreate it, causing the kata runtime to perform another mount operation on the shared directory. The guest can repeat this to get several opportunities to win the race.

Nevertheless, the race condition is rather difficult to win. The shared directory is mounted using either virtio-9p or virtio-fs, which don’t support inotify [12], a Linux API that eases exploitation of file-related races. I tried winning the race using a malicious guest process and *failed*, though that doesn’t necessarily mean the race cannot be won. The approach below takes considerably more effort to implement but would have a better chance of winning the race.
 1. Instead of running as a guest process, run as part of the guest kernel by loading a malicious kernel module.
 2. Wait for the virtio-9p/virtio-fs packets generated by ensureDestinationExists, that create the /run/kata-containers/shared/containers/${new_ctrid} path (this example is for a container image mount), and therefore contain ${new_ctrid}.
 3. Send back the appropriate virtio-9p/virtio-fs packet to create the malicious symlink at /run/kata-containers/shared/containers/${new_ctrid}/rootfs

Since I’m not familiar with the internals of virtio-fs and virtio-9p, I cannot definitively say whether this race is beatable in the above approach, or at all.

5.3. Fix Suggestions
--------------------------------
Any operations involving the shared directory on the host side should be carefully designed with the assumption that the guest is malicious and trying to affect the host through the shared directory.

The approach suggested in section 2.3, using a helper binary which chroots to the shared directory, cannot be implemented as is to address this vulnerability. The bind mount operation will not work in a chroot jail as it requires access to the source of the mount, which is outside of the shared directory.

After giving it some thought, the only solution I came up with is to halt the guest, check the target path isn’t a symlink, and only then bind mount to the shared directory.
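
A minimal sketch of that check (placeholder paths; this is only safe if the guest is halted first, otherwise the destination can still be swapped between the check and the mount):

package main

import (
	"fmt"
	"os"

	"golang.org/x/sys/unix"
)

// mountIfNotSymlink refuses to bind mount if the destination is a symlink.
func mountIfNotSymlink(source, destination string) error {
	fi, err := os.Lstat(destination)
	if err != nil {
		return err
	}
	if fi.Mode()&os.ModeSymlink != 0 {
		return fmt.Errorf("%s is a symlink, refusing to mount", destination)
	}
	return unix.Mount(source, destination, "", unix.MS_BIND, "")
}

func main() {
	if err := mountIfNotSymlink("/path/to/source", "/path/to/shared/destination"); err != nil {
		fmt.Println(err)
	}
}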

6. Chained attacks
--------------------------------
The following sections describe PoCs for attacks that chain several of the outlined vulnerabilities. The attacks are carried out either by a compromised container or by a container running a malicious image.

6.1. Chained attack #1 - unmount, DoS
--------------------------------
Chaining the vulnerabilities described in sections 2 and 3, a malicious container can unmount any path on the host, allowing a DoS attack on the host. The malicious container will first exploit vulnerability #1 to gain control over the guest, and then chain vulnerability #2 to trick the kata runtime into unmounting a target path on the host.

A PoC is available in the attached file under /host_umount.

6.2. Chained attack #2 - mount, code execution
--------------------------------
Chaining the vulnerabilities described in sections 2, 4 and 5, code execution on the host can be achieved. This requires a scenario where:
 1. A malicious container image is run once with Cloud Hypervisor.
 2. The malicious container image is run again with either QEMU or Cloud Hypervisor.

When the malicious container image is first run with Cloud Hypervisor, it exploits vulnerability #1 to access the guest filesystem device. It then replaces the kata-agent binary on the guest device with a malicious version. Because of vulnerability #3 in Kata Containers with Cloud Hypervisor, modifications to the guest file system propagate to the underlying .img file on the host.

The next time a container is run, it runs in a malicious guest. In the second container run, the malicious guest exploits vulnerability #4 to mount the container image on a crucial path on the host, /bin (/lib and /lib64 are good targets as well).

The next time the host attempts to execute a binary from /bin (for example /bin/ls), a binary from the malicious container image is executed on the host instead.
A PoC is available in the attached file under /host_mount.

Please acknowledge receiving this report. If needed, my mail address is <email address hidden>.

Best regards,
Yuval Avrahami | Senior Security Researcher
Palo Alto Networks

Footnotes
--------------------------------
[1] https://github.com/kata-containers/runtime/blob/4d443056bf1ce7c3a3e27d76821f69e5ea6e019c/virtcontainers/kata_agent.go#L1025
[2] https://linux.die.net/man/8/debugfs
[3] https://www.kernel.org/doc/Documentation/filesystems/dax.txt
[4] http://man7.org/linux/man-pages/man2/umount.2.html
[5] https://kubernetes.io/docs/concepts/workloads/pods/pod-lifecycle/#restart-policy
[6] https://github.com/kata-containers/runtime/blob/ebe9677f23b574c5defacf57456d221d8ce901f2/virtcontainers/mount.go#L299
[7] https://github.com/kata-containers/runtime/blob/ebe9677f23b574c5defacf57456d221d8ce901f2/virtcontainers/kata_agent.go#L1281
[8] https://github.com/kata-containers/runtime/blob/ebe9677f23b574c5defacf57456d221d8ce901f2/virtcontainers/container.go#L530
[9] https://github.com/kubernetes/kubernetes/tree/master/build/pause
[10] https://github.com/kata-containers/runtime/blob/ebe9677f23b574c5defacf57456d221d8ce901f2/virtcontainers/mount.go#L276
[11] https://github.com/kata-containers/runtime/blob/ebe9677f23b574c5defacf57456d221d8ce901f2/virtcontainers/mount.go#L280
[12] http://man7.org/linux/man-pages/man7/inotify.7.html

Revision history for this message
Yuval Avrahami (yuvalavra) wrote :
summary: - Security advisory for four vulnerabilities in Kata containers
+ Security advisory for four vulnerabilities in Kata Containers
Revision history for this message
Xu Wang (gnawux) wrote :

Have not read all the detailed info.

At least we should fix the "unmounting any mount point on the host and all mount points underneath it" as soon as possible.

Revision history for this message
Peng Tao (bergwolf) wrote :

Thanks a lot for the detailed report! To fix it in Kata Containers, I think we should:
1. make sure we never allow guest to write to the image file
2. make sure when we umount container rootfs on the host, we do not follow any symlink on the way, and $cid should always come from host saved sandbox config instead of scanning the sandbox directory
3. make sure we use UMOUNT_NOFOLLOW to umount container rootfs on the host

For 1), we have already posted some PR for qemu. For the rest, we should start fixing them ASAP.

Revision history for this message
Peng Tao (bergwolf) wrote :

FYI, this is the patch on the intel/govmm side that prepares a read-only guest rootfs image: https://github.com/intel/govmm/pull/113

Revision history for this message
Peng Tao (bergwolf) wrote :
Revision history for this message
Yuval Avrahami (yuvalavra) wrote :

Thank you for the quick reply.

The unmount fix sounds good to me, but note that it doesn't cover other operations done on the shared directory, mainly bind mounts that also follow symlinks.

Regarding the second point, I was under the impression that kata-runtime always fetches the $sbx-id and $ctr-id from the config, when does it scan the shared directory?

Revision history for this message
Peng Tao (bergwolf) wrote :

Add the qemu readonly guest rootfs image PR: https://github.com/kata-containers/runtime/pull/2477

Revision history for this message
Peng Tao (bergwolf) wrote :

firecracker is confirmed to have its guest rootfs readonly in Kata.

cloudhypervisor OTOH doesn't seem to have an option in its DiskConfig to set a disk read-only.

Revision history for this message
Peng Tao (bergwolf) wrote :

@yuvalavra, to exploit bind mounts that follow symlinks, one would need a modified guest image at first, am I understanding it correctly? That one should be mitigated by setting guest image readonly on vmm.

For the second point, I was just listing things we need to ensure we are not tricked by the symlinks. You are right that Kata never scans the shared directory to umount things.

Revision history for this message
Yuval Avrahami (yuvalavra) wrote :

Bind mounts could still be an issue in environments like Kubernetes, where Kata creates containers in existing sandboxes. Since these sandboxes already have running containers, they might be compromised. I'm concerned that the race condition I mentioned in section 5.2 can be won by using more sophisticated methods than the one I tried. If a malicious guest could win the race, it would be able to redirect bind mounts and possibly gain code execution on the host.

Kata should have checks in place to prevent the possibility of a bind mount following a symlink to the host, similar to the unmount PR. The following solutions *should* work, unfortunately they aren't as easy to implement as the unmount fix:

 * Halt the guest so that it cannot replace the mount's target path, verify the target path isn't a symlink, then bind mount. I assume stopping the guest is something you'd like to avoid.

 * Create a new directory named '/run/kata-containers/shared/sandboxes/mounts'. When bind mounting, first bind mount to this directory. Then, chroot to the '/run/kata-containers/shared/sandboxes' directory, and bind mount '/mounts/mount-name' to '/$sbx_id/mount-name'. If '/$sbx_id/mount-name' is a malicious symlink, it is now resolved under the chroot jail with limited impact.

 * The previous solution can be improved by creating a 'mounts' directory for each sandbox, but that will require reorganizing the directory structure a bit:
    - /run/kata-containers/shared/sandboxes/$sbx_id
        - mounts - used for the same purpose shown in the previous solution, but just for this sandbox
        - shared-dir - the actual shared directory
The chroot will then be to '/run/kata-containers/shared/sandboxes/$sbx-id', thus scoped to only one sandbox.

 * Create a chain of symlinks that eventually lead to the mount's target path, so that the number of symlinks in the chain is the maximum amount of symlinks allowed in Linux path resolution (40). Then, if the mount's target path is a symlink as well, the bind mount operation will fail with ELOOP. This is definitely extreme, but theoretically it could work.

I might be missing an easier solution, so if you have additional ideas, please share them.

Revision history for this message
Samuel Ortiz (sameo) wrote :

Many thanks for the detailed report and the work that was put into it.

FYI we pushed the changes for the Cloud Hypervisor fix: https://github.com/kata-containers/runtime/pull/2487

Revision history for this message
Peng Tao (bergwolf) wrote :

Yuval, thanks again for the detailed analysis. In the Kubernetes case, it is quite difficult for a guest to guess a container id except for the first pause container, right? That's why I thought that one would need a modified guest image at first. In a case where a malicious guest succeeds at guessing a container id, it can then possibly create such a symlink for the kata runtime to mount to.

One of my colleagues suggested a solution quite similar to yours, that we introduce a new shared directory structure as:

1. create two directories for each sandbox:
   -. /run/kata-containers/shared/sandboxes/$sbx_id/mounts/, a directory to hold all host/guest shared mounts
   -. /run/kata-containers/shared/sandboxes/$sbx_id/shared/, a host/guest shared directory (9pfs/virtiofs source dir)

2. /run/kata-containers/shared/sandboxes/$sbx_id/mounts/ is bind mounted readonly to /run/kata-containers/shared/sandboxes/$sbx_id/shared/, so guest cannot modify it

3. host-guest shared files/directories are mounted one-level under /run/kata-containers/shared/sandboxes/$sbx_id/mounts/ and thus present to guest at one level under /run/kata-containers/shared/sandboxes/$sbx_id/shared/

What do you think of such an approach?
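
For reference, the read-only bind in step 2 is typically done as a bind mount followed by a read-only remount; a minimal sketch with placeholder paths (not the actual Kata code):

package main

import "golang.org/x/sys/unix"

func main() {
	mountsDir := "/run/kata-containers/shared/sandboxes/SANDBOX_ID/mounts" // placeholder id
	sharedDir := "/run/kata-containers/shared/sandboxes/SANDBOX_ID/shared"

	// Bind the mounts directory onto the shared (9pfs/virtiofs source) directory...
	if err := unix.Mount(mountsDir, sharedDir, "", unix.MS_BIND, ""); err != nil {
		panic(err)
	}
	// ...then remount the bind read-only so the guest cannot modify it.
	if err := unix.Mount("", sharedDir, "",
		unix.MS_REMOUNT|unix.MS_BIND|unix.MS_RDONLY, ""); err != nil {
		panic(err)
	}
}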

Revision history for this message
Yuval Avrahami (yuvalavra) wrote :

The direction of avoiding a chroot will definitely simplify things.

1. How do 9p and virtiofs handle a mount directly on top of them? Will everything still work?

2. How will the container rw layer work? I think currently it's included in the shared directory and all changes propagate to the host. If 'mounts' is bind mounted read-only to the shared directory, will the guest container still be able to create and modify files?

P.S. I'm traveling so I won't have access to my computer for the following weeks, and cannot test any solution ):

Revision history for this message
Yuval Avrahami (yuvalavra) wrote :

Thought about it a bit and I think it should work, very good idea.

Revision history for this message
Peng Tao (bergwolf) wrote :

Thanks Yuval!

Revision history for this message
Yuval Avrahami (yuvalavra) wrote :

Can you please share the timeline for the mount vulnerability fix?

Also, please reserve CVE IDs for the vulnerabilities. I can assist with that if needed, PANW is a CNA so we can reserve them for you.

Thanks, Yuval

Revision history for this message
Yuval Avrahami (yuvalavra) wrote :

Hope you are all feeling well.

I reviewed the patches and noticed the qemu read-only patch (https://github.com/kata-containers/runtime/pull/2477) doesn’t cover NVDIMM devices, meaning vulnerability #1 can still be exploited. You can verify this by running the guest_nodax PoC with kata 1.11.0-alpha1.

Consider further restricting the container with device cgroups. I saw that simply enforcing the standard Docker/Kubernetes profiles caused some issues (https://github.com/kata-containers/runtime/pull/701#issuecomment-422400715). To solve them, the kata-agent could infer the minor and major numbers of crucial devices (e.g. the root fs device) at runtime, and create a device cgroup profile that only restricts access to those devices.
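
As a rough sketch of that suggestion (not actual kata-agent code), the agent could stat the guest root mount and derive the backing device's major/minor numbers, which would then feed a deny rule such as the one sketched under section 2.3:

package main

import (
	"fmt"

	"golang.org/x/sys/unix"
)

func main() {
	// Stat the guest root mount; for a block-backed rootfs, st_dev identifies
	// the backing block device.
	var st unix.Stat_t
	if err := unix.Stat("/", &st); err != nil {
		panic(err)
	}
	major := unix.Major(uint64(st.Dev))
	minor := unix.Minor(uint64(st.Dev))

	// These numbers could then be used to build a device cgroup rule denying
	// the container access to the guest rootfs device ("b major:minor rwm").
	fmt.Printf("guest rootfs device: %d:%d\n", major, minor)
}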

Revision history for this message
Yuval Avrahami (yuvalavra) wrote :

A reminder that 65 days have passed since the initial disclosure. Per our responsible disclosure policy, we publish our findings after 90 days. Given a sensible reason and a concrete target date, this period may be extended.

Revision history for this message
Yuval Avrahami (yuvalavra) wrote :

Palo Alto Networks has assigned the following CVEs to the aforementioned issues:

  * CVE-2020-2023 - Kata Containers doesn't restrict containers from accessing the guest's root filesystem device. Malicious containers can exploit this to gain code execution on the guest and masquerade as the kata-agent.

  * CVE-2020-2024 - Upon container teardown, a malicious guest can trick the kata-runtime into unmounting any mount point on the host and all mount points underneath it, potentially resulting in a host DoS.

  * CVE-2020-2025 - Kata Containers on Cloud Hypervisor persists guest filesystem changes to the underlying image file on the host. A malicious guest can overwrite the image file to gain control of all subsequent guest VMs. Since Kata Containers uses the same VM image file with all VMMs, this issue may also affect QEMU and Firecracker based guests.

  * CVE-2020-2026 - A malicious guest compromised before a container creation (e.g. a malicious guest image or a guest running multiple containers) can trick the kata runtime into mounting the untrusted container filesystem on any host path, potentially allowing for code execution on the host.

CVE-2020-2024 and CVE-2020-2025 were fixed in version 1.11.0. On May 19th, 90 days from the initial disclosure date, these vulnerabilities will be made public. Please alert your users to upgrade to the patched version.

The two other vulnerabilities aren’t fixed at the moment:

  * CVE-2020-2023 - On NVDIMM based guests, containers can still access the guest root filesystem device and escape. I suggested a fix for this issue in a previous comment [1].

  * CVE-2020-2026 - Bind mounts in the shared directory could still be redirected to the host by a malicious guest. A fix for this issue was suggested by Peng Tao in a previous comment [2].

We will wait an additional two weeks before making these two CVEs public, please use that time to fix them. These CVEs will go public on June 2nd.

Yuval

[1] https://bugs.launchpad.net/katacontainers.io/+bug/1863875/comments/17
[2] https://bugs.launchpad.net/katacontainers.io/+bug/1863875/comments/12

information type: Private Security → Private
information type: Private → Private Security
information type: Private Security → Private
information type: Private → Private Security
Revision history for this message
Peng Tao (bergwolf) wrote :

Sorry for the delay. Somehow I didn't get notifications from this thread after `2020-02-22`. I'm working on fixing the remaining part of CVE-2020-2023 and CVE-2020-2026. Will let you know when the fix is out.

Revision history for this message
Peng Tao (bergwolf) wrote :

Hi Yuval, w.r.t the NVDIMM rootfs image exploit, it is set up without the "share=on" qemu nvdimm device option, so any writes to the guest rootfs file system should not be flushed back to the host backing file. I've tried both dax and non-dax (with your guest kernel parameter "rootflags=data=ordered,errors=remount-ro ro") and any writes to the guest file system were not saved on the host.

Qemu's explanation of the "share=on/off" is here https://github.com/qemu/qemu/blob/a20ab81d22300cca80325c284f21eefee99aa740/docs/nvdimm.txt#L38

For detailed steps, I verified it by:
1. run a kata container with (kernel_params = "agent.debug_console rootflags=data=ordered,errors=remount-ro ro") option
2. attach to the container's debug console that would give me guest root access (socat "stdin,raw,echo=0,escape=0x11" "unix-connect:/var/run/vc/vm/${cid}/console.sock")
3. remount the rootfs with `mount -o remount,rw`
4. create a new file on the root directory, sync
5. stop the container
6. mount the guest rootfs image file locally by:
   6.1 losetup -P -f --show rootfs-image-file
   6.2 partprobe -s /dev/loop<N>
   6.3 mount /dev/loop<N>p<n> /mnt
7. check /mnt that the rootfs doesn't contain the just created file

W.r.t your reproducer /guest_nodax_poc, I do not quite understand why it claims the guest rootfs image file is modified on the host by checking a kata's shared directory (which looks to be a poc claim for the 9pfs symlink CVE fixed by https://github.com/kata-containers/runtime/pull/2475). So I was checking the image file directly by mounting it on the host.

Please point out if I missed anything. And thanks again for your excellent findings! I am still working on fixing CVE-2020-2026 and will update on that one later.

Revision history for this message
Yuval Avrahami (yuvalavra) wrote :

Hi,

Regarding NVDIMM based guests, I meant that the container could still use the guest root fs device to escape to the guest VM, meaning CVE-2020-202*3* is still in effect. You can verify this by running the /guest_nodax_poc on Kata's latest version - the container can still compromise the guest VM, as shown by it (the container) being able to write to the shared directory.

Revision history for this message
Peng Tao (bergwolf) wrote :

Ok, I see. So the problem is that a container might be able to escape its own namespace and modify things on the guest rootfs, right? The threat model Kata Containers is following is that we do not trust anything in the guest. Even if the malicious container could escape its namespace constraints, it is still contained by the virtual machine, and that is the defense line we are actively protecting. The namespace isolation is not secure enough and that's why virtualization is used.

For example, a user can even ask for a privileged container that is pretty much owner of the guest kernel by definition. However, Kata Containers still protects the host from being attacked by it.

We certainly still need to fix it. Just want to make sure we are on the same page about Kata Containers threat model.

Revision history for this message
Peng Tao (bergwolf) wrote :

A fix to CVE-2020-2026 is posted at https://github.com/kata-containers/runtime/pull/2713

Revision history for this message
Yuval Avrahami (yuvalavra) wrote :

@bergwolf I see your point. Do you think container to guest breakouts in Kata - which arise from Kata's implementation and are unique to it - should be assigned CVE IDs and treated as vulnerabilities? I initially wanted to discuss this with you, but that didn’t work out because of the comms issue we had.

I believe privilege escalation inside the Kata sandbox should be assigned CVE IDs. CVE-2020-2023 allows malicious containers to escape Kata’s first layer of isolation and gain root control of the guest. I understand that alone, the impact is mostly limited to an attacker masquerading as the kata-agent, which only affects the visibility users have into their workloads. That being said, compromising the guest is likely the first step in an exploit chain attacking the host, and bugs allowing container to guest breakouts shouldn’t be treated lightly.

Let me know what you think. Assigning CVE-2020-2023 for the issue isn’t set in stone - if you (the kata maintainers) believe a CVE shouldn’t be assigned for the issue, then let's discuss it.

Revision history for this message
Archana Shinde (amshinde) wrote :

Yuval, agree with Peng Tao here regarding CVE-2020-2023. While we definitely want to fix it, the attack here is just limited to the particular workload itself. Considering Kata's threat model, I am not sure we should be assigning a CVE here at all. Will try to get input from some more Kata Containers maintainers here.
That being said, I have just started taking a look at the issue, while Peng Tao already has a fix submitted for the other pending CVE. Looks like the notifications for this issue didn't make it to the right people. Do you think we can push disclosing the CVEs by another week, while we work on getting the fixes merged and available in the next release?

Revision history for this message
Yuval Avrahami (yuvalavra) wrote :

Re CVE-2020-2023, I was on the fence on assigning a CVE for this issue, so I'll go with whatever you decide internally. Do note that in multi-container workloads, one container can use the issue to compromise the rest. If I remember correctly, in Kubernetes there is no security guarantee against this kind of attack (intra-pod), but you should decide what Kata's take on this is.

To be clear, I think that regardless of the label, the issue should be fixed as soon as possible.

If you can commit to fixing both CVE-2020-2023 and CVE-2020-2026, then I'll postpone the release of the CVEs by 1 week to June 9th. Please let me know by the end of the day.

Revision history for this message
Archana Shinde (amshinde) wrote :

Yuval, yes, we will be working on fixing those issues this week.

Revision history for this message
Peng Tao (bergwolf) wrote :

FYI, the fix to CVE-2020-2026 (https://github.com/kata-containers/runtime/pull/2713) is merged and will appear in the next release.

Revision history for this message
Peng Tao (bergwolf) wrote :

The fix to CVE-2020-2023 (https://github.com/kata-containers/agent/pull/792) has been merged too.

Revision history for this message
Yuval Avrahami (yuvalavra) wrote :

Thank you.

Please let me know what is your final decision with regard to assigning CVEs to issues like CVE-2020-2023.

Revision history for this message
Peng Tao (bergwolf) wrote :

Let's ask the Kata Architecture Committee for a final decision.

Revision history for this message
Yuval Avrahami (yuvalavra) wrote :

Any updates on CVE-2020-2023?
CVE-2020-2026 will be publicly released tomorrow as discussed.

Revision history for this message
Peng Tao (bergwolf) wrote :

There was some discussion during the weekend but no decision was made yet. Let me try to push for one ASAP. Thanks!

Revision history for this message
Peng Tao (bergwolf) wrote :

Hi Yuval, please go ahead and publish CVE-2020-2023 too. The architecture committee has agreed on marking it as Kata Containers CVE. Thanks!

Revision history for this message
Yuval Avrahami (yuvalavra) wrote :

Will do, thanks for the update.

Revision history for this message
Archana Shinde (amshinde) wrote :

Thanks Yuval. We have tagged release 1.11.1, which includes all the security fixes. Once the CVEs are published we can update the release notes and assign a KCSA.

Revision history for this message
Yuval Avrahami (yuvalavra) wrote :

CVEs are published.

Regarding patches, I believe the agent patch which denies access to the root fs device should be forward-ported to the rust agent.

Revision history for this message
Archana Shinde (amshinde) wrote :

Agree Yuval, we will be porting the fix for the rust agent as well.

Changed in katacontainers.io:
status: New → Fix Released
Revision history for this message
Archana Shinde (amshinde) wrote :

Yuval, we have merged the fix in the rust agent as well: https://github.com/kata-containers/kata-containers/pull/319

description: updated
information type: Private Security → Public Security