Hello Robie and anyone else affected,

> Are we sure this issue doesn't affect 24.04?

24.04 does have the patched version of libseccomp and is not affected by the issue when using the upstream version of Docker. I tested it on an ARM machine following the test plan I wrote, and it is not affected.

> Presumably this bug isn't currently fixed in 23.10 since 2.5.4-1ubuntu3 is current there.

Correct! I skipped Mantic, thinking that since it is close to EOL it wouldn't be worth it, but since it is a hard requirement I will back-port the patch to it as well.

> Please update the Test Plan to use Docker as shipped by Ubuntu. Or, if this issue only affects a version of Docker external to the Ubuntu archive, then we might still be able to fix libseccomp for you...

The Docker version shipped by Ubuntu (docker.io) is affected by the issue on all Ubuntu releases (including Noble). The fix needed is the following: https://github.com/moby/moby/pull/47341/files and https://github.com/containerd/containerd/commit/a6e52c74fa043a63d7dae4ac6998215f6c1bb6ac
As @mark-elvers pointed out, this fix has been available since Docker 25.0.3.
All this patch does is change the default seccomp profile used by Docker, which a user can already do with the argument --security-opt seccomp=/path/to/seccomp/profile.json.
Of course, I understand that docker.io will still need to be updated even though a workaround is possible. I mention this because the libseccomp patches need to be back-ported and made available before this patch is back-ported to docker.io. Noble and Oracular are the only exceptions, since they already have a libseccomp version that is aware of fchmodat2.
I assume the method to follow would be to get the libseccomp patch into all the Ubuntu releases, verify that it works with the upstream version of Docker, then patch docker.io on those releases and verify that docker.io works?
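
To make the workaround concrete, here is a minimal sketch of preparing a custom profile. It assumes the usual Docker profile layout (a "syscalls" list of rules with "names" and "action") and uses a heavily trimmed, hypothetical profile rather than the real default one; the helper name allow_syscall is mine and is not part of Docker or of the upstream patch.

```python
import copy
import json


def allow_syscall(profile, name):
    """Return a copy of a Docker-style seccomp profile with `name` allowed.

    Illustrative helper only: it appends `name` to the first unconditional
    SCMP_ACT_ALLOW rule, or adds a new allow rule if none exists.
    """
    patched = copy.deepcopy(profile)
    for rule in patched.get("syscalls", []):
        unconditional = not (
            rule.get("args") or rule.get("includes") or rule.get("excludes")
        )
        if rule.get("action") == "SCMP_ACT_ALLOW" and unconditional:
            names = rule.setdefault("names", [])
            if name not in names:
                names.append(name)
            return patched
    patched.setdefault("syscalls", []).append(
        {"names": [name], "action": "SCMP_ACT_ALLOW"}
    )
    return patched


# Hypothetical, heavily trimmed profile used only for this example.
EXAMPLE_PROFILE = {
    "defaultAction": "SCMP_ACT_ERRNO",
    "defaultErrnoRet": 1,  # 1 is EPERM on Linux
    "syscalls": [
        {"names": ["fchmod", "fchmodat"], "action": "SCMP_ACT_ALLOW"},
    ],
}

if __name__ == "__main__":
    # The printed JSON could be saved and passed to
    # --security-opt seccomp=/path/to/seccomp/profile.json
    print(json.dumps(allow_syscall(EXAMPLE_PROFILE, "fchmodat2"), indent=2))
```

Note that such a profile only helps once the libseccomp in use is aware of fchmodat2, which is why the libseccomp back-port has to land first.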

> Will libseccomp allow Docker to return ENOSYS for all non defined syscalls instead? This seems like it would be the correct general fix to me, to avoid future breakages of the same class.

Sadly, no. By default, every syscall that has no rule defined for it in the Docker seccomp profile returns EPERM. This applies both to syscalls that libseccomp is aware of and to those it is not. The return value can be changed, but the change cannot be limited to syscalls that libseccomp does not define.
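
As a rough illustration of the above, here is a toy model of the decision, not real libseccomp or runc code. It assumes that a rule naming a syscall the installed libseccomp cannot resolve never makes it into the filter, so that syscall falls through to the default action; the function name and the example syscall sets are hypothetical.

```python
import errno


def filter_result(syscall, allowed, known_to_libseccomp,
                  default_errno=errno.EPERM):
    """Toy model: return 0 if `syscall` is allowed, else the errno returned.

    Assumption: allow rules for names libseccomp does not know are dropped,
    so those syscalls get the profile-wide default like any other denial.
    """
    effective = {name for name in allowed if name in known_to_libseccomp}
    if syscall in effective:
        return 0
    return default_errno


PROFILE_ALLOWS = {"fchmodat", "fchmodat2"}

# Old libseccomp: the allow rule for fchmodat2 has no effect.
old = filter_result("fchmodat2", PROFILE_ALLOWS, {"fchmodat"})

# Patched libseccomp: the same profile now lets the call through.
new = filter_result("fchmodat2", PROFILE_ALLOWS, {"fchmodat", "fchmodat2"})

# Changing the default affects every denied syscall, known or not.
other = filter_result("some_denied_syscall", PROFILE_ALLOWS, {"fchmodat"},
                      default_errno=errno.ENOSYS)

if __name__ == "__main__":
    print(errno.errorcode.get(old), new, errno.errorcode.get(other))
```

The single default_errno parameter is the point of the sketch: there is one default for everything that is not explicitly matched, with no separate knob for syscalls unknown to libseccomp.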

> Is this possible, and how would doing this instead change the risk to 22.04 in this fix?

Current applications have been running fine in Docker with EPERM as the return value for denied syscalls. Changing this default to ENOSYS would affect any application that uses one of the denied syscalls. The Docker seccomp profile denies syscalls through this default return value, and the few syscalls that need to be denied with a different value have explicit exceptions written in the default profile. Changing the default would require rewriting the profile so that the majority of syscalls, which are denied with EPERM, become the "exception", making the default seccomp profile a mess to read and work with.
Even worse, changing this default upstream would mean testing the change not only for Ubuntu but for every Linux distro, since Ubuntu isn't the only distro that can be deployed in Docker. I cannot see anyone upstream agreeing to such a change.

> If this issue only affects an upstream version of Docker, can that be fixed there please, rather than risking regression to all other libseccomp consumers by working around this in libseccomp in 22.04? Then I think your "use the latest version of upstream Docker" would just start working?

Sadly, as mentioned earlier in this post, it also affects docker.io, and more so than upstream Docker, since upstream Docker already has the fix in place. The second part of the fix is libseccomp, which is the patch I back-ported.


> The proposed patch looks like it adds system call numbers for fchmodat2 for all architectures, as well as for a bunch of other system calls. The architecture-specific changes seem to apply to cacheflush and memfd_secret system calls only. So is this an armhf-specific problem, or is it a general one that affects all architectures?
> why is this SRU patch not minimal, adding the definition for the required system call for the affected architectures only?

Correct, I back-ported the whole patch, covering all architectures as well as the other newer syscalls introduced in Linux 6.7.
The lack of syscall awareness in libseccomp was first discovered on armhf with tar, when fchmodat2 was used. While investigating, it became clear to me that the problem is not limited to tar or to fchmodat2: I see it as a general issue affecting all Noble Docker containers, and tar on armhf just happened to be the first application to expose it. There are most probably other binaries on Noble that use one or more of the syscalls introduced in Linux 6.7 and are also affected, which we just do not know about yet. This became more apparent when @mark-elvers showed that the same issue affects ppc64le, which is why I applied the patch to all architectures.

Of course, I can still reduce the patch to only add support for fchmodat2 on arm and ppc64le. However, the patch only adds syscalls rather than modifying existing ones, and packages in Jammy do not use these syscalls since they did not exist back then (Jammy's glibc does not use them the way Noble's glibc does). That leaves containers as the only users of them.


> Or, if the problem we're solving is wider than this and requires the full patch, then I think we need to include that in our regression analysis please, together with a Test Plan that exercises relevant code paths.

Of course. If my justification for back-porting the entire patch is convincing enough to go ahead, I will redo the test plan to make clear that this is a general missing-syscall issue rather than one isolated to tar on armhf.

> I'd much prefer to see just the fchmodat2 syscall arranged to return ENOSYS rather than EPERM, to meet glibc's expectations. This would seem to be the minimalist safe fix to me. Is it possible to hack that in as a magic number?

I could look into it for docker.io, and it might be possible, but I highly doubt upstream Docker would merge such a hack since, as I mentioned earlier, Ubuntu is not the only distro they support (as a host or as a container). Adding a hack for fchmodat2 also means that every new syscall that causes issues in the future would need to be hacked in too. That makes it even harder to convince upstream, especially when there is a legitimate fix, which is to make libseccomp aware of the syscall (which is how they have been resolving this problem).
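
For context on why the errno matters here, this is a simplified model of the fallback I understand glibc to rely on: try the new syscall, and retry with the older one only when the kernel reports ENOSYS. It is an illustration in Python, not glibc code, and the function names are hypothetical.

```python
import errno
import os


def chmod_via_libc(try_fchmodat2, legacy_fchmodat):
    """Simplified model of a libc wrapper that prefers the newer syscall.

    Both arguments are callables that return 0 on success or raise OSError.
    Only ENOSYS ("syscall not implemented") triggers the legacy fallback.
    """
    try:
        return try_fchmodat2()
    except OSError as exc:
        if exc.errno != errno.ENOSYS:
            raise  # EPERM and friends are treated as real failures
    return legacy_fchmodat()


def denied_with(code):
    """Build a fake syscall that always fails with the given errno."""
    def call():
        raise OSError(code, os.strerror(code))
    return call


if __name__ == "__main__":
    # ENOSYS: the wrapper quietly falls back and the chmod succeeds.
    print(chmod_via_libc(denied_with(errno.ENOSYS), lambda: 0))
    # EPERM: no fallback happens, and the caller sees the failure.
    try:
        chmod_via_libc(denied_with(errno.EPERM), lambda: 0)
    except OSError as exc:
        print(errno.errorcode[exc.errno])
```

With the seccomp default of EPERM, the second case is what a container sees for fchmodat2 when libseccomp is not aware of it.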

> then can we just touch fchmodat2 in libseccomp in 22.04, rather than messing with other system calls as well?

Of course! I haven't tested the patch with only fchmodat2 back-ported, but I do not see why it wouldn't work. I will wait for confirmation on this, as well as on the other questions I've asked in this post, before going ahead with the changes.

Thank you again for reviewing the patch and sorry for the long post.