Cannot launch armhf containers on arm64 host under noble

Bug #2062176 reported by Dave Jones
This bug affects 2 people
Affects               Status        Importance  Assigned to  Milestone
lxd                   Fix Released  Unknown
linux-raspi (Ubuntu)  Confirmed     Undecided   Unassigned
lxd (Ubuntu)          Invalid       Undecided   Unassigned

Bug Description

[Impact]

Under the current noble daily server image for Raspberry Pi (arm64 architecture), I cannot launch an armhf container image using lxd from the channel 5.21/stable/ubuntu-24.04 (build 28284). It *appears* to launch successfully and lxd reports no error, yet afterward the container is stopped:

$ lxc launch ubuntu-daily:n/armhf nobletest
Creating nobletest
Starting nobletest
$ lxc list
+-----------+---------+------+------+-----------+-----------+
| NAME | STATE | IPV4 | IPV6 | TYPE | SNAPSHOTS |
+-----------+---------+------+------+-----------+-----------+
| nobletest | STOPPED | | | CONTAINER | 0 |
+-----------+---------+------+------+-----------+-----------+

In case this was an issue with the current noble daily, I also attempted launching the mantic and jammy armhf images, but with the same symptoms:

$ lxc launch ubuntu:m/armhf mantictest
Creating mantictest
Starting mantictest
$ lxc list
+------------+---------+------+------+-----------+-----------+
| NAME | STATE | IPV4 | IPV6 | TYPE | SNAPSHOTS |
+------------+---------+------+------+-----------+-----------+
| nobletest | STOPPED | | | CONTAINER | 0 |
+------------+---------+------+------+-----------+-----------+
| mantictest | STOPPED | | | CONTAINER | 0 |
+------------+---------+------+------+-----------+-----------+

I attempted the same tests under the 23.10 server for Pi images (also arm64 host) and both containers launched successfully, so this appears to be an issue with lxd under noble specifically.

[Fix]

Re-enable COMPAT_32BIT_TIME in the raspi kernel.
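
As a rough illustration of how the option's state can be checked on a given system, the sketch below scans kernel config text for a tristate option. The helper name `kconfig_state` is hypothetical, and the assumption that the running kernel's config is available as text (for example in /boot/config-<release> on Ubuntu) is mine rather than something stated in this report.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper: look up a tristate option in kernel config text.
 * `option` is given without the CONFIG_ prefix, e.g. "COMPAT_32BIT_TIME".
 * Returns the character after '=' ('y' or 'm' for tristate options),
 * 'n' for a "# CONFIG_<option> is not set" line, or '?' if the option
 * is not mentioned at all.
 */
char kconfig_state(const char *config_text, const char *option)
{
    char set_prefix[128];
    char unset_line[160];

    assert(config_text != NULL && option != NULL);
    snprintf(set_prefix, sizeof set_prefix, "CONFIG_%s=", option);
    snprintf(unset_line, sizeof unset_line, "# CONFIG_%s is not set", option);
    size_t prefix_len = strlen(set_prefix);
    size_t unset_len = strlen(unset_line);

    /* Walk the text one line at a time without modifying it. */
    for (const char *line = config_text; *line != '\0'; ) {
        size_t len = strcspn(line, "\n");
        if (len > prefix_len && strncmp(line, set_prefix, prefix_len) == 0)
            return line[prefix_len];
        if (len == unset_len && strncmp(line, unset_line, unset_len) == 0)
            return 'n';
        line += len;
        if (*line == '\n')
            line++;
    }
    return '?';
}
```

Fed the config of an affected kernel, this should report 'n' for COMPAT_32BIT_TIME, and 'y' once the option has been turned back on.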

[Test Case]

See above.

[Where Problems Could Occur]

Disabling COMPAT_32BIT_TIME in the Noble raspi kernel introduced a regression when running armhf binaries, so re-enabling it restores the previous behavior and is not expected to cause any new breakage.

Dave Jones (waveform) wrote :

This is likely to be a linux-raspi issue since armhf containers are apparently working happily in the autopkgtest cloud. Further, armhf chroots are also failing under linux-raspi with a Futex error from the kernel (will attempt to add some detail on this in due course).

Simon Déziel (sdeziel) wrote :

As I could reproduce the issue I opened https://github.com/canonical/lxd/issues/13512 to track it.

Thanks Dave for bringing this to our attention.

Changed in lxd:
status: Unknown → New
Juerg Haefliger (juergh)
tags: added: kern-10956
Aleksandr Mikhalitsyn (mihalicyn) wrote :

This is the reason:
https://git.launchpad.net/~ubuntu-kernel/ubuntu/+source/linux-raspi/+git/noble/tree/debian.raspi/config/annotations?h=master-next#n155
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/2038582

Minimal reproducer:
# cat test.c
#define _GNU_SOURCE

#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <sys/syscall.h>
#include <stdlib.h>
#include <unistd.h>
#include <linux/futex.h>

#define futex(A, B, C, D, E, F) syscall(__NR_futex, A, B, C, D, E, F)

int main(int argc, char **argv)
{
 unsigned int addr = 0;
 long ret;

 ret = futex(&addr, FUTEX_WAKE, 1, NULL, NULL, 0);
 if (ret) {
  printf("Error! %s", strerror(errno));
  exit(1);
 }

 printf("OK!\n");
 return 0;
}

# uname -a
Linux ubuntu 6.8.0-1004-raspi #4-Ubuntu SMP PREEMPT_DYNAMIC Sat Apr 20 02:29:55 UTC 2024 aarch64 aarch64 aarch64 GNU/Linux

$ arm-linux-gnueabihf-gcc -static test.c
$ strace -f /usr/arm-linux-gnueabihf/lib/ld-linux-armhf.so.3 ./a.out

futex(0xff83679c, FUTEX_WAKE, 1) = -1 ENOSYS (Function not implemented)
statx(1, "", AT_STATX_SYNC_AS_STAT|AT_NO_AUTOMOUNT|AT_EMPTY_PATH, STATX_BASIC_STATS, {stx_mask=STATX_BASIC_STATS|STATX_MNT_ID, stx_attributes=0, stx_mode=S_IFCHR|0620, stx_size=0, ...}) = 0
write(1, "Error! Function not implemented", 31Error! Function not implemented) = 31
exit_group(1) = ?
+++ exited with 1 +++

This code uses futex_time32:
https://github.com/torvalds/linux/blob/4a4be1ad3a6efea16c56615f31117590fd881358/kernel/futex/syscalls.c#L492
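
For code that can still be rebuilt, one way to avoid depending on the time32 entry point is to try the time64 futex syscall first and fall back only when the kernel lacks it. The following is a minimal sketch of that idea, not what glibc or any binary in the affected containers is known to do; `futex_wake_compat` is a hypothetical name. It does nothing for existing armhf binaries, which call __NR_futex directly as in the reproducer above.

```c
#define _GNU_SOURCE

#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>
#include <unistd.h>
#include <sys/syscall.h>
#include <linux/futex.h>

/*
 * Hypothetical helper: wake up to `count` waiters on `word`.
 * Returns the number of waiters woken, or -1 with errno set.
 */
long futex_wake_compat(uint32_t *word, int count)
{
    assert(word != NULL);

#ifdef __NR_futex_time64
    /* Only the headers of 32-bit ABIs define this number (Linux 5.1+). */
    long ret = syscall(__NR_futex_time64, word, FUTEX_WAKE, count, NULL, NULL, 0);
    if (ret != -1 || errno != ENOSYS)
        return ret;
    /* ENOSYS here means an older kernel: fall through to the legacy call. */
#endif

#ifdef __NR_futex
    /* On 64-bit ABIs this is the only futex syscall; on 32-bit ABIs it is
     * the time32 variant that COMPAT_32BIT_TIME=n turns into ENOSYS. */
    return syscall(__NR_futex, word, FUTEX_WAKE, count, NULL, NULL, 0);
#else
    errno = ENOSYS;
    return -1;
#endif
}
```

With no thread waiting on the word, a successful call returns 0, which is the "OK!" path of the reproducer.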

Changed in lxd:
status: New → Fix Released
Launchpad Janitor (janitor) wrote :

Status changed to 'Confirmed' because the bug affects multiple users.

Changed in linux-raspi (Ubuntu):
status: New → Confirmed
Changed in lxd (Ubuntu):
status: New → Confirmed
Steve Langasek (vorlon)
affects: lxd (Ubuntu) → glibc (Ubuntu)
Steve Langasek (vorlon) wrote :

I initially reassigned this from lxd to glibc on the basis that glibc is responsible for deciding whether to use the 64-bit vs 32-bit time_t variants for all the syscalls it wraps, and give preference to the 64-bit variants; HOWEVER,

       Note: glibc provides no wrapper for futex(), necessitating the use of
       syscall(2).

futex(2).

Confirmed that there are no 'futex' symbols in glibc.

So this is not a bug in glibc, and there is insufficient information here or in the linked github issue to say where the problem lies. The test case given in the github issue is invalid, because it builds without the default noble compiler flags of -D_FILE_OFFSET_BITS=64 -D_TIME_BITS=64 so tells us nothing about what code is actually being run as part of noble that depends on the old syscall.

Someone will need to strace this to find the guilty binary.

affects: glibc (Ubuntu) → lxd (Ubuntu)
Changed in lxd (Ubuntu):
status: Confirmed → Incomplete
Aleksandr Mikhalitsyn (mihalicyn) wrote :

Hi Steve,

I described the reason for this issue above:
https://bugs.launchpad.net/ubuntu/+source/lxd/+bug/2062176/comments/3

>So this is not a bug in glibc,

This is not a bug. This is a kernel configuration issue.
Kernel configuration has COMPAT_32BIT_TIME=n, but must have COMPAT_32BIT_TIME=y to support running armhf binaries properly.

>Someone will need to strace this to find the guilty binary.

That's what I did earlier, and it's how I wrote my minimal reproducer for the problem ;-)

Aleksandr Mikhalitsyn (mihalicyn) wrote :

>The test case given in the github issue is invalid, because it builds without the default noble compiler flags of -D_FILE_OFFSET_BITS=64 -D_TIME_BITS=64 so tells us nothing about what code is actually being run as part of noble that depends on the old syscall.

It's not invalid, because we can't require old binaries to be built with the new -D_FILE_OFFSET_BITS=64 -D_TIME_BITS=64 flags. Imagine people who run, let's say, Ubuntu Bionic for armhf inside an LXD container. All the binaries inside that container image will use __NR_futex and not __NR_futex_time64, for obvious reasons. We cannot ask people to rebuild all their software with the new flags; that defeats the idea of running things inside a container, doesn't it?

Also, I have just repeated my experiment with the new flags:
# arm-linux-gnueabihf-gcc -D_FILE_OFFSET_BITS=64 -D_TIME_BITS=64 -static test.c
# strace -n -f /usr/arm-linux-gnueabihf/lib/ld-linux-armhf.so.3 ./a.out
[ 221] execve("/usr/arm-linux-gnueabihf/lib/ld-linux-armhf.so.3", ["/usr/arm-linux-gnueabihf/lib/ld-"..., "./a.out"], 0xffffddf2a718 /* 27 vars */ <unfinished ...>
[ 221] [ Process PID=2166 runs in 32 bit mode. ]
strace: WARNING: Proper structure decoding for this personality is not supported, please consider building strace with mpers support enabled.
[ 221] <... execve resumed>) = 0
...
[ 6] close(3) = 0
[ 11] execve("./a.out", ["./a.out"], 0xffd6a6a0 /* 27 vars */) = 0
[ 45] brk(NULL) = 0x1c96000
...
[ 125] mprotect(0x5f000, 12288, PROT_READ) = 0
[ 240] futex(0xff812a1c, FUTEX_WAKE, 1) = -1 ENOSYS (Function not implemented)
[ 397] statx(1, "", AT_STATX_SYNC_AS_STAT|AT_NO_AUTOMOUNT|AT_EMPTY_PATH, STATX_BASIC_STATS, {stx_mask=STATX_BASIC_STATS|STATX_MNT_ID, stx_attributes=0, stx_mode=S_IFCHR|0620, stx_size=0, ...}) = 0
[ 4] write(1, "Error! Function not implemented", 31Error! Function not implemented) = 31
[ 248] exit_group(1) = ?
[ 248] +++ exited with 1 +++

Obviously, these flags do not change the behavior, because the value of the __NR_futex constant does not depend on _TIME_BITS or _FILE_OFFSET_BITS.
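
To make the point above concrete, the sketch below exposes the two quantities side by side: the width of time_t, which the flags affect on 32-bit glibc targets, and the futex syscall number, which comes from the kernel ABI headers. Both function names are hypothetical, and the expectation that time_t widens to 64 bits on armhf assumes a glibc recent enough to honor _TIME_BITS.

```c
/* Same feature-test macros as the -D flags used in the build above. */
#define _FILE_OFFSET_BITS 64
#define _TIME_BITS 64

#include <assert.h>
#include <limits.h>
#include <stddef.h>
#include <time.h>
#include <sys/syscall.h>

/* Width of time_t in bits: the property _TIME_BITS is meant to change. */
size_t time_t_bits(void)
{
    size_t bits = sizeof(time_t) * CHAR_BIT;
    assert(bits == 32 || bits == 64); /* sanity check for Linux ABIs */
    return bits;
}

/* Syscall number the reproducer passes to syscall(); it is taken from the
 * kernel ABI headers and is not influenced by the macros defined above. */
long futex_syscall_nr(void)
{
#ifdef __NR_futex
    return __NR_futex;
#else
    return -1; /* ABIs that only provide the time64 variant */
#endif
}
```

Building this for armhf with and without the two macros would be expected to change only the first value; the second stays at 240, the number visible in the strace output above.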

=== strace from a real armhf container (Ubuntu 22.04.4 LTS inside):

# strace -o strace.log -n -f lxc-start -F ubuntu-armh

2944 [ 322] openat(4, "systemd", O_RDONLY|O_LARGEFILE|O_NOFOLLOW|O_CLOEXEC|O_PATH) = 5
...
e=4096, ...}) = 0
2944 [ 6] close(4) = 0
2944 [ 322] openat(5, "system.conf.d", O_RDONLY|O_LARGEFILE|O_NOFOLLOW|O_CLOEXEC|O_PATH) = -1 ENOENT (No such file or directory)
...
2944 [ 240] futex(0xf798a4b4, FUTEX_WAKE_PRIVATE, 2147483647) = -1 ENOSYS (Function not implemented)
2944 [ 146] writev(2, [{iov_base="The futex facility returned an u"..., iov_len=54}], 1) = 54
2944 [ 192] mmap2(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0xf79c0000
2944 [ 175] rt_sigprocmask(SIG_UNBLOCK, [ABRT], NULL, 8) = 0
2944 [ 224] gettid() = 1
2944 [ 20] getpid() = 1
2944 [ 268] tgkill(1, 1, SIGABRT) = 0
2944 [ 268] --- SIGABRT {si_signo=SIGABRT, si_code=SI_TKILL, si_pid=1, si_uid=0} ---

As we can see:
2944 [ 240] futex(0xf798a4b4, ...


Juerg Haefliger (juergh)
description: updated
Aleksandr Mikhalitsyn (mihalicyn) wrote :

Another reproducer:

# cat test2.c
#include <pthread.h>
#include <stdio.h>

void *threadfn(void *ptr)
{
    return NULL;
}

int main(int argc, char **argv)
{
    pthread_t thread;
    pthread_create(&thread, NULL, &threadfn, NULL);
    pthread_join(thread, NULL);
    return 0;
}

# arm-linux-gnueabihf-gcc -D_FILE_OFFSET_BITS=64 -D_TIME_BITS=64 -static test2.c

# strace -n -f /usr/arm-linux-gnueabihf/lib/ld-linux-armhf.so.3 ./a.out
...
[pid 3205] [ 338] set_robust_list(0xf7b3180c, 12 <unfinished ...>

[pid 3204] [ 240] <... futex resumed>) = -1 ENOSYS (Function not implemented)

BOOM!

[pid 3205] [ 338] <... set_robust_list resumed>) = 0
[pid 3204] [ 146] writev(2, [{iov_base="The futex facility returned an u"..., iov_len=54}], 1The futex facility returned an unexpected error code.
 <unfinished ...>
[pid 3205] [ 175] rt_sigprocmask(SIG_SETMASK, [], <unfinished ...>
[pid 3204] [ 146] <... writev resumed>) = 54
[pid 3205] [ 175] <... rt_sigprocmask resumed>NULL, 8) = 0
[pid 3204] [ 192] mmap2(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0 <unfinished ...>
[pid 3205] [ 175] rt_sigprocmask(SIG_BLOCK, ~[RT_1], <unfinished ...>
[pid 3204] [ 192] <... mmap2 resumed>) = 0xf7330000
[pid 3205] [ 175] <... rt_sigprocmask resumed>NULL, 8) = 0
[pid 3204] [ 175] rt_sigprocmask(SIG_UNBLOCK, [ABRT], <unfinished ...>
[pid 3205] [ 220] madvise(0xf7331000, 8372224, MADV_DONTNEED <unfinished ...>
[pid 3204] [ 175] <... rt_sigprocmask resumed>NULL, 8) = 0
[pid 3205] [ 220] <... madvise resumed>) = 0
[pid 3205] [ 1] exit(0 <unfinished ...>
[pid 3204] [ 224] gettid( <unfinished ...>
[pid 3205] [ 1] <... exit resumed>) = ?
[pid 3205] [ 1] +++ exited with 0 +++
[ 224] <... gettid resumed>) = 3204
[ 20] getpid() = 3204
[ 268] tgkill(3204, 3204, SIGABRT) = 0
[ 268] --- SIGABRT {si_signo=SIGABRT, si_code=SI_TKILL, si_pid=3204, si_uid=0} ---
[ 268] +++ killed by SIGABRT (core dumped) +++

Dave Jones (waveform) wrote :

I think the consensus at this point is that, while this manifests in LXD (and chroot), it's ultimately a kernel configuration issue. Marking lxd invalid.

Just to fill in a bit of background, there's been some discussion between the kernel team and foundations about what the correct course of action is here, and I think the consensus is "just flip COMPAT_32BIT_TIME back on". The one remaining question, looking at the original bug that flipped that switch in the first place (LP: #2038582), is: should this change be limited to linux-raspi?

Changed in lxd (Ubuntu):
status: Incomplete → Invalid
Juerg Haefliger (juergh) wrote :

I'll turn it back on for raspi. The others are cloud kernels. I'll talk to the respective kernel owners to figure this out with the cloud vendors.

Luke-Jr (luke-jr) wrote :

Any update on this? Or is there a way to get the 1001 kernel back until it's fixed? (apt apparently deleted it on me)

Ubuntu Kernel Bot (ubuntu-kernel-bot) wrote :

This bug is awaiting verification that the linux-raspi/6.8.0-1007.7 kernel in -proposed solves the problem. Please test the kernel and update this bug with the results. If the problem is solved, change the tag 'verification-needed-noble-linux-raspi' to 'verification-done-noble-linux-raspi'. If the problem still exists, change the tag 'verification-needed-noble-linux-raspi' to 'verification-failed-noble-linux-raspi'.

If verification is not done by 5 working days from today, this fix will be dropped from the source code, and this bug will be closed.

See https://wiki.ubuntu.com/Testing/EnableProposed for documentation how to enable and use -proposed. Thank you!

tags: added: kernel-spammed-noble-linux-raspi-v2 verification-needed-noble-linux-raspi
Simon Déziel (sdeziel) wrote :

Verification for Noble of linux-image-6.8.0-1007-raspi:

```
root@ubuntu:~# apt-get dist-upgrade -V -t noble-proposed
Reading package lists... Done
Building dependency tree... Done
Reading state information... Done
Calculating upgrade... Done
The following NEW packages will be installed:
   linux-image-6.8.0-1007-raspi (6.8.0-1007.7)
   linux-modules-6.8.0-1007-raspi (6.8.0-1007.7)
The following packages will be upgraded:
   apparmor (4.0.0-beta3-0ubuntu3 => 4.0.1-0ubuntu0.24.04.2)
   apt (2.7.14build2 => 2.8.0)
   apt-utils (2.7.14build2 => 2.8.0)
   dracut-install (060+5-1ubuntu3 => 060+5-1ubuntu3.1)
   libapparmor1 (4.0.0-beta3-0ubuntu3 => 4.0.1-0ubuntu0.24.04.2)
   libapt-pkg6.0t64 (2.7.14build2 => 2.8.0)
   libgnutls30t64 (3.8.3-1.1ubuntu3 => 3.8.3-1.1ubuntu3.1)
   libpam-modules (1.5.3-5ubuntu5 => 1.5.3-5ubuntu5.1)
   libpam-modules-bin (1.5.3-5ubuntu5 => 1.5.3-5ubuntu5.1)
   libpam-runtime (1.5.3-5ubuntu5 => 1.5.3-5ubuntu5.1)
   libpam0g (1.5.3-5ubuntu5 => 1.5.3-5ubuntu5.1)
   linux-image-raspi (6.8.0-1005.5 => 6.8.0-1007.7)
   login (1:4.13+dfsg1-4ubuntu3 => 1:4.13+dfsg1-4ubuntu3.2)
   lxd-installer (4 => 4ubuntu0.1)
   passwd (1:4.13+dfsg1-4ubuntu3 => 1:4.13+dfsg1-4ubuntu3.2)
   python3-distupgrade (1:24.04.18 => 1:24.04.19)
   python3-software-properties (0.99.48 => 0.99.49)
   software-properties-common (0.99.48 => 0.99.49)
   ubuntu-pro-client (32.3~24.04 => 32.3.1~24.04)
   ubuntu-pro-client-l10n (32.3~24.04 => 32.3.1~24.04)
   ubuntu-release-upgrader-core (1:24.04.18 => 1:24.04.19)
21 upgraded, 2 newly installed, 0 to remove and 0 not upgraded.
...

root@ubuntu:~# uname -a
Linux ubuntu 6.8.0-1007-raspi #7-Ubuntu SMP PREEMPT_DYNAMIC Mon Jun 24 10:21:12 UTC 2024 aarch64 aarch64 aarch64 GNU/Linux

root@ubuntu:~# lxc launch ubuntu-daily:n/armhf nobletest
root@ubuntu:~# lxc exec nobletest -- hostname
nobletest
root@ubuntu:~# lxc ls
+-----------+---------+---------------------+----------------------------------------------+-----------+-----------+
| NAME | STATE | IPV4 | IPV6 | TYPE | SNAPSHOTS |
+-----------+---------+---------------------+----------------------------------------------+-----------+-----------+
| nobletest | RUNNING | 10.123.62.15 (eth0) | fd42:32a5:f6b4:f6f:216:3eff:fe93:f224 (eth0) | CONTAINER | 0 |
+-----------+---------+---------------------+----------------------------------------------+-----------+-----------+
```

tags: added: verification-done-noble-linux-raspi
removed: verification-needed-noble-linux-raspi
Ubuntu Kernel Bot (ubuntu-kernel-bot) wrote :

This bug is awaiting verification that the linux-raspi-realtime/6.8.0-2006.6 kernel in -proposed solves the problem. Please test the kernel and update this bug with the results. If the problem is solved, change the tag 'verification-needed-noble-linux-raspi-realtime' to 'verification-done-noble-linux-raspi-realtime'. If the problem still exists, change the tag 'verification-needed-noble-linux-raspi-realtime' to 'verification-failed-noble-linux-raspi-realtime'.

If verification is not done by 5 working days from today, this fix will be dropped from the source code, and this bug will be closed.

See https://wiki.ubuntu.com/Testing/EnableProposed for documentation how to enable and use -proposed. Thank you!

tags: added: kernel-spammed-noble-linux-raspi-realtime-v2 verification-needed-noble-linux-raspi-realtime