QEMU

Bug #1673976
Comment #21

Comment 21 for bug 1673976

Revision history for this message

Éric Hoffman (ehoffman-videotron) wrote on 2018-03-02:

#21

Ok, yes you are right...

I have looked a bit more on the source code, and indeed, I think understand the issue with the VFORK with QEMU. Please correct me if I'm wrong...

- In the syscall trap handler, it has to use the fork() function to emulate the vfork() due to restriction of the vfork() function (as QEMU must continue to control the flow of instruction past the call to vfork(), and do a lot more things in the child thread before ending up performing a execve() or _exit())
- Also, it can not do a wait() for the emulated child, as this child will continue to exist even after it calls execve(), so the parent would stall.
- Then, I taught about doing condition signalling, like waiting for a pthread condition signal that the child would send once it come to the point of performing the _exit() or execve(), but the child would, for example, need to know if execve() was successful, or otherwise the child would continue and set an error flag and then call _exit(). We do need that error flag before continuing the execution on the parent. So we can not signal back to the parent that the 'emulated vfork' is OK before calling execve(), but we can not wait after execve() because if the call is successful, there is no return from that function, and code goes outside the control of QEMU.

So, I taught of an idea... What if, in the TARGET_NR_clone syscall trap, when we are called upon a CLONE_VFORK, we do:
- Do a regular fork, as it's currently done, with CLONE_VM flag (so the child share the same memory as the parent). However, we also set a state flag that we are in this 'vfork emulation' mode just before the fork (more on that bellow...).
- Let the parent wait for the child to terminate (again, more on that bellow...).
- Let the child return normally and continue execution, as if the parent was waiting.

Then, eventually the child will eventually either end up in the TARGET_NR_execve or __NR_exit_group syscall trap. At which point:
- The child check if it is in 'vfork emulation' mode. If not, then there's nothing special, just continue the way the code is currently written. If the flag is set, then follow on with the steps bellow...
- The child set a flag that tell where it is (TARGET_NR_execve or __NR_exit_group, and the arguments passed to that syscall), and that everything is ok (it has not simply died meanwhile).
- The child terminate, which resume the parent's execution.

The parent then:
- Clear the 'vfork emulation' flag.
- Look at where the child left (was it performing TARGET_NR_execve or __NR_exit_group syscall? What was the arguments passed to the syscall?). This is pretty easy since the child was writing to the parent's memory space the whole time (CLONE_VM). The parent could even use a flag allocated on it's stack before the fork(), since the child will have run with it's own stack during that time (so the parent stack is still intact).
- Now that we know what the child wanted to do (what syscall and which parameters), the parent (which at his point has no more 'leftover' child), can then do a *real* vfork, or otherwise return the proper error code.

It's a bit far fetched, and I'm far from implying that I know much about QEMU, but this is an idea :-) Sound like it's pretty straightforward though. Basically we just wait for the code between the _clone() function and the _execve/_exit function to complete, at which point we take action and we are in measure to assess the status code (and do the real vfork).

Regards,
Eric

Ok, yes you are right...

I have looked a bit more on the source code, and indeed, I think understand the issue with the VFORK with QEMU.  Please correct me if I'm wrong...

- In the syscall trap handler, it has to use the fork() function to emulate the vfork() due to restriction of the vfork() function (as QEMU must continue to control the flow of instruction past the call to vfork(), and do a lot more things in the child thread before ending up performing a execve() or _exit())
- Also, it can not do a wait() for the emulated child, as this child will continue to exist even after it calls execve(), so the parent would stall.
- Then, I taught about doing condition signalling, like waiting for a pthread condition signal that the child would send once it come to the point of performing the _exit() or execve(), but the child would, for example, need to know if execve() was successful, or otherwise the child would continue and set an error flag and then call _exit().  We do need that error flag before continuing the execution on the parent.  So we can not signal back to the parent that the 'emulated vfork' is OK before calling execve(), but we can not wait after execve() because if the call is successful, there is no return from that function, and code goes outside the control of QEMU.

So, I taught of an idea...  What if, in the TARGET_NR_clone syscall trap, when we are called upon a CLONE_VFORK, we do:
- Do a regular fork, as it's currently done, with CLONE_VM flag (so the child share the same memory as the parent).  However, we also set a state flag that we are in this 'vfork emulation' mode just before the fork (more on that bellow...).
- Let the parent wait for the child to terminate (again, more on that bellow...).
- Let the child return normally and continue execution, as if the parent was waiting.

Then, eventually the child will eventually either end up in the TARGET_NR_execve or __NR_exit_group syscall trap.  At which point:
- The child check if it is in 'vfork emulation' mode.  If not, then there's nothing special, just continue the way the code is currently written.  If the flag is set, then follow on with the steps bellow...
- The child set a flag that tell where it is (TARGET_NR_execve or __NR_exit_group, and the arguments passed to that syscall), and that everything is ok (it has not simply died meanwhile).
- The child terminate, which resume the parent's execution.

The parent then:
- Clear the 'vfork emulation' flag.
- Look at where the child left (was it performing TARGET_NR_execve or __NR_exit_group syscall?  What was the arguments passed to the syscall?).  This is pretty easy since the child was writing to the parent's memory space the whole time (CLONE_VM).  The parent could even use a flag allocated on it's stack before the fork(), since the child will have run with it's own stack during that time (so the parent stack is still intact).
- Now that we know what the child wanted to do (what syscall and which parameters), the parent (which at his point has no more 'leftover' child), can then do a *real* vfork, or otherwise return the proper error code.

It's a bit far fetched, and I'm far from implying that I know much about QEMU, but this is an idea :-)  Sound like it's pretty straightforward though.  Basically we just wait for the code between the _clone() function and the _execve/_exit function to complete, at which point we take action and we are in measure to assess the status code (and do the real vfork).

Regards,
Eric