Comment 24 for bug 1943049

Michael Hudson-Doyle (mwhudson) wrote:

So to hit this problem, docker's seccomp profile needs to include a syscall which:

 a) has a number higher than clone3
 b) is known by libseccomp (as runc uses libseccomp to translate syscall names into numbers)

I think the syscall that we are hitting here is faccessat2, which was added to the default seccomp profile in 20.10 (in https://github.com/moby/moby/pull/41353) and is understood by libseccomp 2.5.0+, both of which have been backported to all stable releases. There are other syscalls in the default docker profile that could cause problems but they are not understood by any released version of libseccomp afaict.
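
A toy model of the two conditions above, as I understand them from this bug: a syscall missing from the profile gets ENOSYS only if its number is above the highest number runc managed to resolve via libseccomp, and EPERM otherwise. The function name and the tiny table are made up for illustration (numbers are x86_64), and this is a sketch, not runc's actual code.

```python
# x86_64 syscall numbers for a handful of syscalls (illustrative subset).
SYSCALL_NR = {"read": 0, "openat": 257, "clone3": 435, "faccessat2": 439}


def outcome(name, profile, known_to_libseccomp, table=SYSCALL_NR):
    """Return "ALLOW", "EPERM" or "ENOSYS" for `name` under this toy model."""
    # Only names libseccomp can translate into numbers end up in the filter.
    resolved = {s for s in profile if s in known_to_libseccomp and s in table}
    if name in resolved:
        return "ALLOW"
    # Assumption: anything numbered above the highest resolved syscall is
    # treated as "too new" and gets ENOSYS; everything else gets EPERM.
    highest = max((table[s] for s in resolved), default=-1)
    return "ENOSYS" if table[name] > highest else "EPERM"


known = {"read", "openat", "clone3", "faccessat2"}
print(outcome("clone3", {"read", "openat"}, known))                # ENOSYS
print(outcome("clone3", {"read", "openat", "faccessat2"}, known))  # EPERM
# An older libseccomp that cannot resolve faccessat2 hides the problem:
print(outcome("clone3", {"read", "openat", "faccessat2"}, known - {"faccessat2"}))  # ENOSYS
```

The last call is why both conditions are needed: adding faccessat2 to the profile only changes anything once libseccomp can resolve it.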

I think the current version of https://github.com/moby/moby/pull/42836 should fix this (unfortunately I think Tianon found this version just a couple of hours after you were testing things). We don't need to backport runc or containerd to fix docker, but I don't know about, say, k8s. containerd probably needs a patch to _its_ default policy but I don't know who uses that.

I think the reason that podman works in fedora is that fedora has a newer version of github.com/containers/common, newer even than the one vendored into podman's git tree (yay?) -- it looks like v0.40.0 added support for the clone3 syscall. That seems to be in sid, so we could sync that over to fix podman on impish (after a rebuild of course). I'm not sure what we should do for hirsute users.

So, what to do now and what to do in the future.

For now, I feel reasonably confident that we can patch docker in supported releases before impish release, and hopefully there can be an upstream 20.10.9 release with the fix also before impish release. Then we can just tell docker users to update when they hit this and not feel tooooo guilty.

But what about other container runtimes? Don't know. As above, at least some versions of podman have problems.

My feeling currently is to not patch out the use of clone3 in libc. But I am prepared to be persuaded otherwise.

For the future, I'm not sure there's much that can be done other than to really pay attention to seccomp policy changes. Maybe it's possible to write a tool to print out the syscalls that are implicitly getting EPERM (probably using the amazingly useful https://github.com/hrw/syscalls-table/tree/master/tables) for a given runc seccomp policy and have a github action print out any changes to this set...
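
A rough sketch of what such a tool might look like. Everything here is hypothetical: the function names are invented, the table format (whitespace-separated "name number" lines, with the number missing when the arch lacks the syscall) is my assumption about the syscalls-table files, and the EPERM rule is the simplified one described in this comment rather than runc's real logic.

```python
def parse_table(text):
    """Parse "name number" lines; lines without a number are skipped."""
    table = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2 and parts[1].isdigit():
            table[parts[0]] = int(parts[1])
    return table


def implicit_eperm(profile, known_to_libseccomp, table):
    """Names not in the filter whose number is <= the highest resolved one."""
    resolved = {s for s in profile if s in known_to_libseccomp and s in table}
    highest = max((table[s] for s in resolved), default=-1)
    return {n for n, nr in table.items() if n not in resolved and nr <= highest}


def report_change(before, after):
    """Return (newly EPERM, no longer EPERM) as sorted lists for a CI log."""
    return sorted(after - before), sorted(before - after)


# Made-up table excerpt using x86_64 numbers.
table = parse_table(
    "read\t0\nopenat\t257\nclone3\t435\nopenat2\t437\nfaccessat2\t439\nnot_on_this_arch\t\n"
)
known = set(table)
old = implicit_eperm({"read", "openat"}, known, table)
new = implicit_eperm({"read", "openat", "faccessat2"}, known, table)
print(report_change(old, new))  # (['clone3', 'openat2'], [])
```

A github action could run this against the profile before and after a change and fail, or just comment, when the first list is non-empty.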

The dependence on libseccomp versions adds a wrinkle. Unless I'm misunderstanding things quite badly, the runc default policy contains a bunch of syscalls that are not understood by the current release of libseccomp but are in its git, so the next libseccomp release will "activate" these syscalls and possibly flip some others from ENOSYS to EPERM.
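
To make the wrinkle concrete, here is a small sketch under the same simplified rule (unlisted syscalls above the highest resolved number get ENOSYS, the rest EPERM). The function names are invented, and which syscalls sit in the profile or in each libseccomp's known set is purely illustrative; only the x86_64 numbers are real. It also assumes the newer libseccomp knows everything the older one did.

```python
def highest_resolved(profile, known, table):
    """Highest syscall number among profile entries libseccomp can resolve."""
    return max((table[s] for s in profile if s in known and s in table), default=-1)


def libseccomp_upgrade_effects(profile, known_old, known_new, table):
    """Return (activated, flipped) for a libseccomp upgrade.

    activated: profile entries that only the newer libseccomp can resolve.
    flipped:   syscalls absent from the new filter that move from ENOSYS
               to EPERM because the highest resolved number went up.
    """
    old_max = highest_resolved(profile, known_old, table)
    new_max = highest_resolved(profile, known_new, table)
    activated = sorted(
        s for s in profile if s in table and s in known_new and s not in known_old
    )
    flipped = sorted(
        name
        for name, nr in table.items()
        if (name not in profile or name not in known_new) and old_max < nr <= new_max
    )
    return activated, flipped


table = {"clone3": 435, "faccessat2": 439, "process_madvise": 440, "epoll_pwait2": 441}
profile = {"faccessat2", "epoll_pwait2"}           # illustrative profile
known_old = {"clone3", "faccessat2"}               # illustrative "current release"
known_new = known_old | {"process_madvise", "epoll_pwait2"}
print(libseccomp_upgrade_effects(profile, known_old, known_new, table))
# (['epoll_pwait2'], ['process_madvise'])
```

In this made-up case nothing in the profile changed, yet process_madvise starts returning EPERM purely because of the libseccomp upgrade.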