linux-user clone() can't handle glibc posix_spawn() (causes locale-gen to assert)

Bug #1673976 reported by l3iggs on 2017-03-18
46
This bug affects 9 people
Affects Status Importance Assigned to Milestone
QEMU
Undecided
Unassigned

Bug Description

I'm running a command (locale-gen) inside of an armv7h chroot mounted on my x86_64 desktop by putting qemu-arm-static into /usr/bin/ of the chroot file system and I get a core dump.

locale-gen
Generating locales...
  en_US.UTF-8...localedef: ../sysdeps/unix/sysv/linux/spawni.c:360: __spawnix: Assertion `ec >= 0' failed.
qemu: uncaught target signal 6 (Aborted) - core dumped
/usr/bin/locale-gen: line 41: 34 Aborted (core dumped) localedef -i $input -c -f $charset -A /usr/share/locale/locale.alias $locale

I've done this same thing successfully for years, but this breakage has appeared some time in the last 3 or so months. Possibly with the update to qemu version 2.8.

l3iggs (l3iggs) on 2017-03-18
description: updated
Matheus Izvekov (mizvekov) wrote :

I can confirm this. The ninja build system is also affected.

Thomas Huth (th-huth) wrote :

Could you please check whether the problem also occurs with QEMU v2.10?

Changed in qemu:
status: New → Incomplete
Alchemist (xavier-miller) wrote :

Hi,

I can confirm it with QEMU 2.10.0 (running Gentoo Linux)

Portage 2.3.10 (python 2.7.14-final-0, default/linux/amd64/17.0/no-multilib, gcc-7.2.0, glibc-2.25-r5, 4.13.4-gentoo x86_64)

Alchemist (xavier-miller) wrote :
Download full text (3.8 KiB)

# uname -a && locale-gen
Linux **** 4.13.4-gentoo #1 SMP PREEMPT Thu Sep 28 09:41:30 CEST 2017 armv7l Intel(R) Celeron(R) 2957U @ 1.40GHz GNU/Linux
 * Generating 8 locales (this might take a while) with 2 jobs
 * (2/8) Generating en_US.UTF-8 ...
localedef: ../sysdeps/unix/sysv/linux/spawni.c:366: __spawnix: Assertion `ec >= 0' failed.
qemu: uncaught target signal 6 (Aborted) - core dumped [ !! ]
 * (1/8) Generating en_US.ISO-8859-1 ...
localedef: ../sysdeps/unix/sysv/linux/spawni.c:366: __spawnix: Assertion `ec >= 0' failed.
qemu: uncaught target signal 6 (Aborted) - core dumped [ !! ]
 * (3/8) Generating fr_BE.ISO-8859-15@euro ...
localedef: ../sysdeps/unix/sysv/linux/spawni.c:366: __spawnix: Assertion `ec >= 0' failed.
qemu: uncaught target signal 6 (Aborted) - core dumped [ !! ]
 * (4/8) Generating fr_BE.ISO-8859-1 ...
localedef: ../sysdeps/unix/sysv/linux/spawni.c:366: __spawnix: Assertion `ec >= 0' failed.
qemu: uncaught target signal 6 (Aborted) - core dumped [ !! ]
 * (5/8) Generating fr_BE.UTF-8 ...
localedef: ../sysdeps/unix/sysv/linux/spawni.c:366: __spawnix: Assertion `ec >= 0' failed.
qemu: uncaught target signal 6 (Aborted) - core dumped [ !! ]
 * (6/8) Generating fr_FR.ISO-8859-15@euro ...
localedef: ../sysdeps/unix/sysv/linux/spawni.c:366: __spawnix: Assertion `ec >= 0' failed.
qemu: uncaught target signal 6 (Aborted) - core dumped [ !! ]
 * (7/8) Generating fr_FR.ISO-8859-1 ...
localedef: ../sysdeps/unix/sysv/linux/spawni.c:366: __spawnix: Assertion `ec >= 0' failed.
qemu: uncaught target signal 6 (Aborted) - core dumped [ !! ]
 * (8/8) Generating fr_FR.UTF-8 ...
localedef: ../sysdeps/unix/sysv/linux/spawni.c:366: __spawnix: Assertion `ec >= 0' failed.
qemu: uncaught target signal 6 (Aborted) - core dumped [ !! ]
 * Generation co...

Read more...

Thomas Huth (th-huth) on 2017-10-03
Changed in qemu:
status: Incomplete → New
Dominique Belhachemi (domibel) wrote :

Looks like the __clone() call is failing for some reason:

https://sourceware.org/git/?p=glibc.git;a=blob;f=sysdeps/unix/sysv/linux/spawni.c;h=dea1650d08ded5fd848f263aebebe8748e703697;hb=HEAD#l362

Here is a workaround:

cd /usr/share/i18n/charmaps
gunzip --keep UTF-8.gz
locale-gen en_US.UTF-8

Dominique Belhachemi (domibel) wrote :

It is possible to reproduce the issue with a simple clone example taken from

   http://man7.org/linux/man-pages/man2/clone.2.html

# qemu-aarch64-static -strace ./a.out testname
585 brk(NULL) = 0x0000004000013000
585 uname(0x4000812d08) = 0
585 faccessat(AT_FDCWD,"/etc/ld.so.nohwcap",F_OK,0x82e888) = -1 errno=2 (No such file or directory)
585 mmap(NULL,12288,PROT_READ|PROT_WRITE,MAP_PRIVATE|MAP_ANONYMOUS,-1,0) = 0x0000004000843000
585 faccessat(AT_FDCWD,"/etc/ld.so.preload",R_OK,AT_SYMLINK_NOFOLLOW|0x82d848) = -1 errno=2 (No such file or directory)
585 openat(AT_FDCWD,"/etc/ld.so.cache",O_RDONLY|O_CLOEXEC) = 3
585 fstat(3,0x0000004000812680) = 0
585 mmap(NULL,20645,PROT_READ,MAP_PRIVATE,3,0) = 0x0000004000846000
585 close(3) = 0
585 faccessat(AT_FDCWD,"/etc/ld.so.nohwcap",F_OK,0x82e888) = -1 errno=2 (No such file or directory)
585 openat(AT_FDCWD,"/lib/aarch64-linux-gnu/libc.so.6",O_RDONLY|O_CLOEXEC) = 3
585 read(3,0x812830,832) = 832
585 fstat(3,0x00000040008126d0) = 0
585 mmap(NULL,1393456,PROT_EXEC|PROT_READ,MAP_PRIVATE|MAP_DENYWRITE,3,0) = 0x000000400084c000
585 mprotect(0x0000004000987000,65536,PROT_NONE) = 0
585 mmap(0x0000004000997000,24576,PROT_READ|PROT_WRITE,MAP_PRIVATE|MAP_DENYWRITE|MAP_FIXED,3,0x13b000) = 0x0000004000997000
585 mmap(0x000000400099d000,13104,PROT_READ|PROT_WRITE,MAP_PRIVATE|MAP_ANONYMOUS|MAP_FIXED,-1,0) = 0x000000400099d000
585 close(3) = 0
585 mprotect(0x0000004000997000,16384,PROT_READ) = 0
585 mprotect(0x0000004000011000,4096,PROT_READ) = 0
585 mprotect(0x0000004000840000,4096,PROT_READ) = 0
585 munmap(0x0000004000846000,20645) = 0
585 brk(NULL) = 0x0000004000013000
585 brk(0x0000004000034000) = 0x0000004000013000
585 mmap(NULL,1048576,PROT_READ|PROT_WRITE,MAP_PRIVATE|MAP_ANONYMOUS,-1,0) = 0x00000040009a1000
585 mmap(NULL,1052672,PROT_READ|PROT_WRITE,MAP_PRIVATE|MAP_ANONYMOUS,-1,0) = 0x0000004000aa1000
585 clone(CLONE_NEWUTS|0x11,child_stack=0x0000004000ba1010,parent_tidptr=0x0000004000aa1010,tls=0x0000000000000000,child_tidptr=0x0000000000000000) = -1 errno=22 (Invalid argument)
585 dup(2,4222427270,274886578000,22,0,0) = 3
585 fcntl(3,F_GETFL) = 1026
585 fstat(3,0x0000004000812628) = 0
585 write(3,0x9a1490,24)clone: Invalid argument
 = 24
585 close(3) = 0
585 exit_group(1)

# strace ./a.out testname
qemu: Unsupported syscall: 117
qemu: Unsupported syscall: 117
/usr/bin/strace: ptrace(PTRACE_TRACEME, ...): Function not implemented
+++ exited with 1 +++

Peter Maydell (pmaydell) on 2017-11-03
summary: - core dump
+ locale-gen dumps core run under arm-linux-user on x86-64 host
Peter Maydell (pmaydell) on 2017-11-06
tags: added: linux-user

This happens because QEMU is now stricter about its checking of the flags passed to clone() -- previously we would silently allow flags we couldn't support and create new threads with the wrong behaviour. Now we check and fail the clone() syscall if the requested behaviour is something we can't implement. Unfortunately we don't have any way to distinguish "guest program is asking for something odd that it doesn't really need" from "guest program is asking for something odd that it does need". So we err on the safe side and tell the guest we can't do it.

It's particularly unfortunate that the glibc implementation of posix_spawn() runs into this, though.

summary: - locale-gen dumps core run under arm-linux-user on x86-64 host
+ linux-user clone() can't handle glibc posix_spawn() (causes locale-gen
+ to assert)
l3iggs (l3iggs) wrote :

Is there a way I can ask QEMU to not do this strict checking so that my stuff stops breaking?

Peter Maydell (pmaydell) wrote :

Not without messing with the QEMU source, no.

Peter Maydell (pmaydell) wrote :

OK, this can't be as simple as "posix_spawn() fails", because I've just tried the test program from the posix_spawn manpage (http://man7.org/linux/man-pages/man3/posix_spawn.3.html) and that works fine for x86-64 guest, aarch64 guest and armhf guest. In the x86 and armhf cases the libc I have seems to use the NR_vfork syscall, but for aarch64 it uses clone(CLONE_VM | CLONE_VFORK | SIGCHLD, ...) which is what the glibc sources linked in comment #5 do, and that all works fine.

And locale-gen runs fine for my xenial-armhf chroot using current head-of-git QEMU:

root@e104462:/# locale-gen
Generating locales (this might take a while)...
  en_GB.UTF-8... done
Generation complete.

So can I ask that people: (1) please try with current head of git (or with 2.11-rc1, which is almost the same thing); (2) if there's still a problem with localegen or with programs calling posix_spawn() or other real-world code, please provide full repro instructions so I can try to reproduce locally.

I don't think we can make clone() in general work, so oddball demo code like the example program in the clone(2) manpage is out of scope, but there may well be specific cases we can address.

James Cowgill (jcowgill) wrote :

I can reproduce the bug in a mips64el chroot running current Debian unstable - the posix_spawn example you mention fails there. I have tested v2.11.0-rc2 and it fails there as well. I think you need glibc >= 2.25 to trigger the bug (artful / bionic chroot). I only noticed it due to Debian updating to a newer glibc recently.

James Cowgill (jcowgill) wrote :

I think I see the problem. This glibc commit rewrote the posix_spawn implementation on Linux:
https://sourceware.org/git/?p=glibc.git;a=commit;h=9ff72da471a509a8c19791efe469f47fa6977410

It now relies on the exact behavior of clone(CLONE_VM | CLONE_VFORK) - ie:
- That the parent will wait for the child to exec before continuing.
- Writes to memory in the child are later visible in the parent

QEMU emulates this clone using fork() which no longer works properly and causes the assertion failure.

James Cowgill (jcowgill) wrote :

Sorry, this is probably the commit that broke things (not the one above). I was added in glibc 2.25:
https://sourceware.org/git/?p=glibc.git;a=commit;h=4b4d4056bb154603f36c6f8845757c1012758158

Peter Maydell (pmaydell) wrote :

Thanks for tracking down the glibc change; I will try to set up a chroot with a more recent glibc to see whether we can do something that fixes that posix_spawn implementation...

Interestingly, this also affects Microsoft Windows Services For Linux, i.e. Microsoft's Linux emulation layer.

> https://github.com/Microsoft/WSL/issues/1878

I have verified that this patch [1] in glibc_2.25 and glibc_2.26 fixes the assert.

> [1] https://sourceware.org/bugzilla/show_bug.cgi?id=22273

This should probably be put under 'glibc', since this is really an issue with that package, which is fixed by the way since Oct 2017.

https://sourceware.org/git/?p=glibc.git;a=commit;h=fe05e1cb6d64dba6172249c79526f1e9af8f2bfd

This should also be backported to 17.10

Peter Maydell (pmaydell) wrote :

That glibc change has caused the assert to go away, but QEMU's spawn(CLONE_VFORK) still does not have the "always waits for child" semantics that glibc has assumed since glibc commit 4b4d4056bb154. The child and the parent will end up racing each other, and the child will never be able to write to the parent's address space. I think that the effect of that race will be that if the child fails (for instance if a bad filename is passed and exec() fails) the parent will never notice and will return a success code from the spawn function when it should not.

So there remains a QEMU bug here; though it is also the case that I can't see any way we can fix it.

Ok, thank you for clearing that up.

I'm noticing in 4b4d4056bb154 this comment:

"...we just make explicit use of the fact the the child and parent run in the same VM, so the child can write an error code to a field of the posix_spawn_args struct instead of sending it through a pipe. To ensure that this mechanism really works, the parent initializes the field to -1 and the child writes 0 before execing."

So, if the child fail to execute, that error code field of the posix_spawn_args struct will remain -1. Would this ensure that QEMU return error in case of failing exec?

Best Regards,
Eric

Peter Maydell (pmaydell) wrote :

Commit fe05e1cb6d64db changed that, so args.err is initialized to zero.

Download full text (3.4 KiB)

Ok, yes you are right...

I have looked a bit more on the source code, and indeed, I think understand the issue with the VFORK with QEMU. Please correct me if I'm wrong...

- In the syscall trap handler, it has to use the fork() function to emulate the vfork() due to restriction of the vfork() function (as QEMU must continue to control the flow of instruction past the call to vfork(), and do a lot more things in the child thread before ending up performing a execve() or _exit())
- Also, it can not do a wait() for the emulated child, as this child will continue to exist even after it calls execve(), so the parent would stall.
- Then, I taught about doing condition signalling, like waiting for a pthread condition signal that the child would send once it come to the point of performing the _exit() or execve(), but the child would, for example, need to know if execve() was successful, or otherwise the child would continue and set an error flag and then call _exit(). We do need that error flag before continuing the execution on the parent. So we can not signal back to the parent that the 'emulated vfork' is OK before calling execve(), but we can not wait after execve() because if the call is successful, there is no return from that function, and code goes outside the control of QEMU.

So, I taught of an idea... What if, in the TARGET_NR_clone syscall trap, when we are called upon a CLONE_VFORK, we do:
- Do a regular fork, as it's currently done, with CLONE_VM flag (so the child share the same memory as the parent). However, we also set a state flag that we are in this 'vfork emulation' mode just before the fork (more on that bellow...).
- Let the parent wait for the child to terminate (again, more on that bellow...).
- Let the child return normally and continue execution, as if the parent was waiting.

Then, eventually the child will eventually either end up in the TARGET_NR_execve or __NR_exit_group syscall trap. At which point:
- The child check if it is in 'vfork emulation' mode. If not, then there's nothing special, just continue the way the code is currently written. If the flag is set, then follow on with the steps bellow...
- The child set a flag that tell where it is (TARGET_NR_execve or __NR_exit_group, and the arguments passed to that syscall), and that everything is ok (it has not simply died meanwhile).
- The child terminate, which resume the parent's execution.

The parent then:
- Clear the 'vfork emulation' flag.
- Look at where the child left (was it performing TARGET_NR_execve or __NR_exit_group syscall? What was the arguments passed to the syscall?). This is pretty easy since the child was writing to the parent's memory space the whole time (CLONE_VM). The parent could even use a flag allocated on it's stack before the fork(), since the child will have run with it's own stack during that time (so the parent stack is still intact).
- Now that we know what the child wanted to do (what syscall and which parameters), the parent (which at his point has no more 'leftover' child), can then do a *real* vfork, or otherwise return the proper error code.

It's a bit far fetched, and I'm far from implying that I know much about QEM...

Read more...

Peter Maydell (pmaydell) wrote :

Unfortunately that won't work, because if we do a clone(CLONE_VM) in QEMU that will mean that parent and child share not just the guest address space, but also all the QEMU data structures for the emulated CPUs and also the host libc data structures. Then actions done by the child will update those data structures and break execution of the parent when it resumes.

Ok, I taught that could be an issue, but as I said, I don't really know all the internals of QEMU.

Another idea would be to fork the child, without CLONE_VM, on the initial call to the clone syscall, like it's done right now, and then wait for that child until he call execve or exit syscall. Maybe using some shared memory or IPC to pass the relevant status when the child finally invoke those syscalls.

When the child finally call one of those, then after signalling the parent about where it is (and the params to the syscall), the child could exit and the parent actually take action.

Regards,
Eric

Peter Maydell (pmaydell) wrote :

That way round the child doesn't have the shared memory with the parent, so it can't update the parent's status variable. There's no easy way to say "fork, and then share the guest memory mappings and only the guest memory mappings with the child", because QEMU doesn't currently track what memory the guest has mapped at all.

Download full text (3.5 KiB)

Hello

Sorry for the delay...

Actually, you only need the parent to get the status from the child, which can be passed in other way than through common memory.

The idea is to use pipefd to actually wait for the child to either terminate or successfully call execve. As follow:

When the TARGET_NR_clone syscall is trapped, you do:
- Call do_fork(), as currently done
- In do_fork(), at the beginning, if CLONE_VFORK flag is set, keep track of it (i.e. do not clear the flag, just clear the CLONE_VM, as currently done, to do a normal fork, i.e. the child have it's own copy of the memory segments).
- Just before the call to fork(), create a pipefd.
- The parent branch and then (if CLONE_VFORK is set) close the write end of the pipe (it's own copy), and start looping (could be indefinitely, but preferably some sort of timeout logic could be set) on the read fd, waiting continuously for status updates from the child.
- The child branch close the read-end of the pipe (it's own forked copy), set the write-end fd flag FD_CLOEXEC (with fnctl()), and put the write fd into it's QEMU state variables (parent vfork fd).
- The child then move on.

When the TARGET_NR_execve syscall is trapped (this is in child context), you do:
- Do everything as currently done, up to just before the safe_execve() call.
- Just before the call to safe_execve(), check if the QEMU state variable (parent vfork fd) is defined. If so, tell the the parent (through the pipe), that we are good so far, and about to call execve(). Note that the parent just update the child status, but keep looping endlessly.
- Call the execve().
- If the above call return, an error occurred. If this occur, check if the QEMU state variable (parent vfork fd) is defined. If so, tell whatever error status you got to the parent (through the pipe). The parent update it's child status, but again, continue to loop endlessly.
- Continue normally.

That's pretty much the bulk of the work done! What will happen:
- Either the child will eventually call execve, which will succeed, at which point the write end of the pipe will be closed (because we set the pipe to close on execve, with the FD_CLOEXEC flag).
- The child could be playing on us, and try to re-call execve() multiple times (possibly with different arguments, executables path, etc.), but every time, the parent will just receive status update through the pipe. And eventually, the above case will occur (success), and pipe will be closed.
- The child call _exit(), which will close the pipe again.
- The child get some horrible signal, get killed, or whatever else... Pipe still get closed.

The parent, on it's side, just update the status endlessly, UNTIL the other end of the pipe get closed. At this point, the read() of the pipe will get a 'broken pipe' error. This signal the parent to move on, and return whatever status the child last provided.

Note that this status could initially be set to an error state (in case the child die or call _exit() before calling execve()).

The only thing that could make the parent hang is if the child hang (and never call execve() or _exit() or die...). But the beauty is that this is perfectly fine, because that is...

Read more...

Peter Maydell (pmaydell) wrote :

> Actually, you only need the parent to get the status from the child, which can be passed in other way than through common memory.

Certainly, it *can* be, but the glibc code we're trying to run in the guest here doesn't do it in some other way, it uses common memory. Having QEMU effectively pause the parent process until the child has done its execve is certainly possible along the lines you suggest. But that is only half the requirement -- the parent also has to be able to see in its memory space the updates to the status variable that the child has made.

If you're willing to change the guest code the problem is easy (for instance you could just go back to the old glibc approach). But we need to run the code as it stands.

To post a comment you must log in.
This report contains Public information  Edit
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.