Coroutines are racy for risc64 emu on arm64 - crash on Assertion

Bug #1921664 reported by Tommy Thorn
Affects              Status    Importance  Assigned to      Milestone
qemu (Fedora)        Unknown   Unknown
qemu (Ubuntu)        Triaged   Medium      Paride Legovini
  Jammy              Triaged   Undecided   Paride Legovini

Bug Description

Note: this could as well apply to "riscv64 on arm64" in general (slow emulation
on a slow host) and affect other architectures as well.

The following case triggers on a Raspberry Pi4 running with arm64 on
Ubuntu 21.04 [1][2]. It might trigger on other environments as well,
but that is where we have seen it so far.

   $ wget https://github.com/carlosedp/riscv-bringup/releases/download/v1.0/UbuntuFocal-riscv64-QemuVM.tar.gz
   $ tar xzf UbuntuFocal-riscv64-QemuVM.tar.gz
   $ ./run_riscvVM.sh
(wait ~2 minutes)
   [ OK ] Reached target Local File Systems (Pre).
   [ OK ] Reached target Local File Systems.
            Starting udev Kernel Device Manager...
qemu-system-riscv64: ../../util/qemu-coroutine-lock.c:57: qemu_co_queue_wait_impl: Assertion `qemu_in_coroutine()' failed.

This is often, but not 100%, reproducible, and the cases differ slightly; we
see either of:
- qemu-system-riscv64: ../../util/qemu-coroutine-lock.c:57: qemu_co_queue_wait_impl: Assertion `qemu_in_coroutine()' failed.
- qemu-system-riscv64: ../../block/aio_task.c:64: aio_task_pool_wait_one: Assertion `qemu_coroutine_self() == pool->main_co' failed.

Rebuilding working cases has been shown to make them fail, just as rebuilding
(or even reinstalling) bad cases has made them work. Also, the same builds
behave differently on different arm64 CPUs. TL;DR: the full list of conditions
influencing the good/bad cases here is not yet known.

[1]: https://ubuntu.com/tutorials/how-to-install-ubuntu-on-your-raspberry-pi#1-overview
[2]: http://cdimage.ubuntu.com/daily-preinstalled/pending/hirsute-preinstalled-desktop-arm64+raspi.img.xz

--- --- original report --- ---

I regularly run a RISC-V (RV64GC) QEMU VM, but an update a few days ago broke it. Now when I launch it, it hits an assertion:

OpenSBI v0.6
   ____                     _____ ____ _____
  / __ \                   / ____|  _ \_   _|
 | |  | |_ __   ___ _ __  | (___ | |_) || |
 | |  | | '_ \ / _ \ '_ \  \___ \|  _ < | |
 | |__| | |_) |  __/ | | | ____) | |_) || |_
  \____/| .__/ \___|_| |_|_____/|____/_____|
        | |
        |_|

...
Found /boot/extlinux/extlinux.conf
Retrieving file: /boot/extlinux/extlinux.conf
618 bytes read in 2 ms (301.8 KiB/s)
RISC-V Qemu Boot Options
1: Linux kernel-5.5.0-dirty
2: Linux kernel-5.5.0-dirty (recovery mode)
Enter choice: 1: Linux kernel-5.5.0-dirty
Retrieving file: /boot/initrd.img-5.5.0-dirty
qemu-system-riscv64: ../../block/aio_task.c:64: aio_task_pool_wait_one: Assertion `qemu_coroutine_self() == pool->main_co' failed.
./run.sh: line 31: 1604 Aborted (core dumped) qemu-system-riscv64 -machine virt -nographic -smp 8 -m 8G -bios fw_payload.bin -device virtio-blk-device,drive=hd0 -object rng-random,filename=/dev/urandom,id=rng0 -device virtio-rng-device,rng=rng0 -drive file=riscv64-UbuntuFocal-qemu.qcow2,format=qcow2,id=hd0 -device virtio-net-device,netdev=usernet -netdev user,id=usernet,$ports

Interestingly this doesn't happen on the AMD64 version of Ubuntu 21.04 (fully updated).

I think you have everything already, but just in case:

$ lsb_release -rd
Description: Ubuntu Hirsute Hippo (development branch)
Release: 21.04

$ uname -a
Linux minimacvm 5.11.0-11-generic #12-Ubuntu SMP Mon Mar 1 19:27:36 UTC 2021 aarch64 aarch64 aarch64 GNU/Linux
(note this is a VM running on macOS/M1)

$ apt-cache policy qemu
qemu:
  Installed: 1:5.2+dfsg-9ubuntu1
  Candidate: 1:5.2+dfsg-9ubuntu1
  Version table:
 *** 1:5.2+dfsg-9ubuntu1 500
        500 http://ports.ubuntu.com/ubuntu-ports hirsute/universe arm64 Packages
        100 /var/lib/dpkg/status

ProblemType: Bug
DistroRelease: Ubuntu 21.04
Package: qemu 1:5.2+dfsg-9ubuntu1
ProcVersionSignature: Ubuntu 5.11.0-11.12-generic 5.11.0
Uname: Linux 5.11.0-11-generic aarch64
ApportVersion: 2.20.11-0ubuntu61
Architecture: arm64
CasperMD5CheckResult: unknown
CurrentDmesg:
 Error: command ['pkexec', 'dmesg'] failed with exit code 127: polkit-agent-helper-1: error response to PolicyKit daemon: GDBus.Error:org.freedesktop.PolicyKit1.Error.Failed: No session for cookie
 Error executing command as another user: Not authorized

 This incident has been reported.
Date: Mon Mar 29 02:33:25 2021
Dependencies:

KvmCmdLine: COMMAND STAT EUID RUID PID PPID %CPU COMMAND
Lspci-vt:
 -[0000:00]-+-00.0 Apple Inc. Device f020
            +-01.0 Red Hat, Inc. Virtio network device
            +-05.0 Red Hat, Inc. Virtio console
            +-06.0 Red Hat, Inc. Virtio block device
            \-07.0 Red Hat, Inc. Virtio RNG
Lsusb: Error: command ['lsusb'] failed with exit code 1:
Lsusb-t:

Lsusb-v: Error: command ['lsusb', '-v'] failed with exit code 1:
ProcEnviron:
 TERM=screen
 PATH=(custom, no user)
 XDG_RUNTIME_DIR=<set>
 LANG=C.UTF-8
 SHELL=/bin/bash
ProcKernelCmdLine: console=hvc0 root=/dev/vda
SourcePackage: qemu
UpgradeStatus: Upgraded to hirsute on 2020-12-30 (88 days ago)
acpidump:
 Error: command ['pkexec', '/usr/share/apport/dump_acpi_tables.py'] failed with exit code 127: polkit-agent-helper-1: error response to PolicyKit daemon: GDBus.Error:org.freedesktop.PolicyKit1.Error.Failed: No session for cookie
 Error executing command as another user: Not authorized

 This incident has been reported.

Revision history for this message
Tommy Thorn (tommy-ubuntuone) wrote :

FWIW, I just now built qemu-system-riscv64 from git ToT and that works fine.

Revision history for this message
Christian Ehrhardt  (paelzer) wrote :

Hi Tommy,
you reported that against "1:5.2+dfsg-9ubuntu1" which is odd.
The only recent change was around
a) package dependencies
b) CVEs not touching your use-case IMHO

Was the formerly working version 1:5.2+dfsg-6ubuntu2 as I'm assuming or did you upgrade from a different one?

Could you also add the full commandline you use to start your qemu test case?
If there are any images or such involved as far as you can share where one could fetch them please.

And to be clear on your report - with the same 1:5.2+dfsg-9ubuntu1 on amd64 it works fine for you,
and just the emulation of riscv64 on arm64 HW is what now fails, correct?

It also is interesting that you built qemu from git to have it work.
Did you build tag v5.2.0 or the latest commit?
If you built v5.2.0 it might be something in the Ubuntu Delta that I have to look for.
If you've built the latest HEAD of qemu git then most likely the fix is a commit since v5.2.0 - in that case, would you be willing and able to bisect v5.2.0..HEAD to find what the fix was?

Changed in qemu (Ubuntu):
status: New → Incomplete
Revision history for this message
Tommy Thorn (tommy-ubuntuone) wrote :

0. Repro:

   $ wget https://github.com/carlosedp/riscv-bringup/releases/download/v1.0/UbuntuFocal-riscv64-QemuVM.tar.gz
   $ tar xzf UbuntuFocal-riscv64-QemuVM.tar.gz
   $ ./run_riscvVM.sh
(wait ~ 20 s)
   [ OK ] Reached target Local File Systems (Pre).
   [ OK ] Reached target Local File Systems.
            Starting udev Kernel Device Manager...
   qemu-system-riscv64: ../../block/aio_task.c:64: aio_task_pool_wait_one: Assertion `qemu_coroutine_self() == pool->main_co' failed.

  (root password is "riscv" fwiw)

1. "Was the formerly working version 1:5.2+dfsg-6ubuntu2?"

   I'm afraid I don't know, but I update a few times a week.

   If you can tell me how to try individual versions, I'll do that

2. "full commandline you use to start your qemu test case?"

   Probably the repo above is more useful, but FWIW:

   qemu-system-riscv64 \
    -machine virt \
    -nographic \
    -smp 4 \
    -m 4G \
    -bios fw_payload.bin \
    -device virtio-blk-device,drive=hd0 \
    -object rng-random,filename=/dev/urandom,id=rng0 \
    -device virtio-rng-device,rng=rng0 \
    -drive file=riscv64-UbuntuFocal-qemu.qcow2,format=qcow2,id=hd0 \
    -device virtio-net-device,netdev=usernet \
    -netdev user,id=usernet,$ports

3. "the same 1:5.2+dfsg-9ubuntu1 @amd64 it works fine for you? Just the emulation of riscv64 on arm64 HW is what now fails for you correct?"

   Yes x 2, confirmed with the above repro.

   $ apt-cache policy qemu
qemu:
  Installed: 1:5.2+dfsg-9ubuntu1
  Candidate: 1:5.2+dfsg-9ubuntu1
  Version table:
 *** 1:5.2+dfsg-9ubuntu1 500
        500 http://us.archive.ubuntu.com/ubuntu hirsute/universe amd64 Packages
        100 /var/lib/dpkg/status

4. "It also is interesting that you built qemu from git to have it work.
Did you build tag v5.2.0 or the latest commit?"

  latest.

  Rebuilding from the commit tagged with v5.2.0 ...

Revision history for this message
Tommy Thorn (tommy-ubuntuone) wrote :

Self-built v5.2.0 qemu-system-riscv64 does _not_ produce the bug.

Changed in qemu (Ubuntu):
status: Incomplete → New
Revision history for this message
Christian Ehrhardt  (paelzer) wrote :

0. Repro:

> ...
> $ ./run_riscvVM.sh
> ...

Thanks, I was not able to reproduce with that using the most recent
qemu 1:5.2+dfsg-9ubuntu1 on amd64 (just like you)

Trying the same on armhf was slower and a bit odd.
- I first got:
  qemu-system-riscv64: at most 2047 MB RAM can be simulated
  Reducing the memory to 2047M started up the system.
- then I have let it boot, which took quite a while and eventually
  hung at
[ 13.017716] mousedev: PS/2 mouse device common for all mice
[ 13.065889] usbcore: registered new interface driver usbhid
[ 13.070209] usbhid: USB HID core driver
[ 13.092671] NET: Registered protocol family 10

So it hung on armhf, while working on an amd64 host. That isn't good, but there was no crash to be seen :-/

Maybe it depends on what arm platform (as there are often subtle differences) or which storage (as the assert is about storage) you run on.
My CPU is an X-Gene and my Storage is a ZFS (that backs my container running hirsute and Hirsute's qemu).
What is it for you?

I've waited more, but no failure other than the hang was showing up.
Is this failing 100% of the times for you, or just sometimes and maybe racy?

---

1. "Was the formerly working version 1:5.2+dfsg-6ubuntu2?"

> I'm afraid I don't know, but I update a few times a week.

A hint which versions to look at can be derived from
  $ grep -- qemu-system-misc /var/log/dpkg.log

> If you can tell me how to try individual versions, I'll do that

You can go to https://launchpad.net/ubuntu/+source/qemu/+publishinghistory
There you'll see every version of the package that existed. If you click on a version
it allows you to download the debs which you can install with "dpkg -i ....deb"

---

2. "full commandline you use to start your qemu test case?"

> Probably the repo above is more useful, but FWIW:

Indeed, thanks!

3. "the same 1:5.2+dfsg-9ubuntu1 @amd64 it works fine for you? Just the emulation of riscv64 on arm64 HW is what now fails for you correct?"

> Yes x 2, confirmed with the above repro.

Thanks for the confirmation

---

4. "It also is interesting that you built qemu from git to have it work.
Did you build tag v5.2.0 or the latest commit?"

> Rebuilding from the "commit" tagged with v5.2.0 ...

Very interesting - this soon after a release the delta is mostly a few CVEs plus integration of e.g. Ubuntu/Debian specific paths. Still, chances are that you used a different toolchain than the packaging builds.
Could you rebuild what you get with "apt source qemu"? That will be 5.2 plus the delta we have.
If that doesn't fail then your build-env differs from our builds, and therein is the solution.
If it fails we need to check which delta it is.

Furthermore, if that indeed fails while v5.2.0 works: I've pushed all our delta, one commit at a time, to https://code.launchpad.net/~paelzer/ubuntu/+source/qemu/+git/qemu/+ref/hirsute-delta-as-commits-lp1921664 so you could bisect that. But to be sure, build from the first commit in there and verify that it works. If that fails as well, we have to look at what differs in those builds.

Revision history for this message
Christian Ehrhardt  (paelzer) wrote :

FYI my qemu is still busy
   1913 root 20 0 2833396 237768 7640 S 100.7 5.9 25:54.13 qemu-system-ris

And after about 1000 seconds the guest moved a bit forward now reaching
[ 13.070209] usbhid: USB HID core driver
[ 13.092671] NET: Registered protocol family 10
[ 1003.282387] Segment Routing with IPv6
[ 1004.790268] sit: IPv6, IPv4 and MPLS over IPv4 tunneling driver
[ 1009.002716] NET: Registered protocol family 17
[ 1012.612965] 9pnet: Installing 9P2000 support
[ 1012.915223] Key type dns_resolver registered
[ 1015.022864] registered taskstats version 1
[ 1015.324660] Loading compiled-in X.509 certificates
[ 1036.408956] Freeing unused kernel memory: 264K
[ 1036.410322] This architecture does not have kernel memory protection.
[ 1036.710012] Run /init as init process
Loading, please wait...

I'll keep it running to check if I'll hit the assert later ....

Revision history for this message
Tommy Thorn (tommy-ubuntuone) wrote :

> Maybe it depends on what arm platform (as there are often subtle differences) or which storage (as the assert is about storage) you run on.
> My CPU is an X-Gene and my Storage is a ZFS (that backs my container running hirsute and Hirsute's qemu).
> What is it for you?

Sorry, I thought I had already reported that, but it's not clear. My setup is special in a couple of ways:
- I'm running Ubuntu/Arm64 (21.04 beta, fully up-to-date except kernel), but ...
- it's a virtual machine on a macOS/Mac Mini M1 (fully up-to-date)
- It's running the 5.8.0-36-generic kernel, which isn't the latest (for complicated reasons)

I'll try to bring my Raspberry Pi 4 back up on Ubuntu and see if I can reproduce it there.

> Is this failing 100% of the times for you, or just sometimes and maybe racy?

100% consistently reproducible with the official packages. 0% reproducible with my own build

> A hint which versions to look at can be derived from
> $ grep -- qemu-system-misc /var/log/dpkg.log

Alas, I had critical space issues and /var/log was among the casualties

> Could you rebuild what you get with "apt source qemu". That will be 5.2 plus the Delta we have...

TIL. I tried `apt source --compile qemu` but it complains:

  dpkg-checkbuilddeps: error: Unmet build dependencies: gcc-alpha-linux-gnu gcc-powerpc64-linux-gnu

but these packages are not available [anymore?]. I don't currently have the time to figure this out.

> FYI my qemu is still busy

It's hung. The boot takes ~20 seconds on my host. Multi-minute boots are not normal.

If I can reproduce this on a Raspberry Pi 4, then I'll proceed with your suggestions above, otherwise I'll pause this until I can run Ubuntu natively on the Mac Mini.

Revision history for this message
Christian Ehrhardt  (paelzer) wrote :

Ok,
thanks for all the further details.

Let us chase this further down once you got to that test & bisect.
I'll set the state to incomplete until then.

Changed in qemu (Ubuntu):
status: New → Incomplete
Revision history for this message
Tommy Thorn (tommy-ubuntuone) wrote :

On my 4 GB Raspberry Pi 4

  QEMU emulator version 5.2.0 (Debian 1:5.2+dfsg-3ubuntu1)

worked as expected, but

  QEMU emulator version 5.2.0 (Debian 1:5.2+dfsg-9ubuntu1)

*did* reproduce the issue, but it took slightly longer to hit it (a few minutes):

```
...
[ OK ] Started Serial Getty on ttyS0.
[ OK ] Reached target Login Prompts.

Ubuntu 20.04 LTS Ubuntu-riscv64 ttyS0

Ubuntu-riscv64 login: qemu-system-riscv64: ../../util/qemu-coroutine-lock.c:57: qemu_co_queue_wait_impl: Assertion `qemu_in_coroutine()' failed.
./run_riscvVM.sh: line 31: 2304 Aborted (core dumped) qemu-system-riscv64 -machine virt -nographic -smp 4 -m 3G -bios fw_payload.bin -device virtio-blk-device,drive=hd0 -object rng-random,filename=/dev/urandom,id=rng0 -device virtio-rng-device,rng=rng0 -drive file=riscv64-UbuntuFocal-qemu.qcow2,format=qcow2,id=hd0 -device virtio-net-device,netdev=usernet -netdev user,id=usernet,$ports
```

Revision history for this message
Tommy Thorn (tommy-ubuntuone) wrote :

Christian, I think I need some help. Like I said I couldn't build with apt source --compile qemu.
I proceeded with

  $ git clone -b hirsute-delta-as-commits-lp1921664 git+ssh://<email address hidden>/~paelzer/ubuntu/+source/qemu

  (git submodule update --init did nothing)

but the configure step failed with

  $ ../configure
  warn: ignoring non-existent submodule meson
  warn: ignoring non-existent submodule dtc
  warn: ignoring non-existent submodule capstone
  warn: ignoring non-existent submodule slirp
  cross containers no

  NOTE: guest cross-compilers enabled: cc s390x-linux-gnu-gcc cc s390x-linux-gnu-gcc
  /usr/bin/python3: can't open file '/home/tommy/qemu/meson/meson.py': [Errno 2] No such file or directory
  /usr/bin/python3: can't open file '/home/tommy/qemu/meson/meson.py': [Errno 2] No such file or directory

I had no problem building the master branch so I'm not sure what's going on with the submodules in your repo.

Revision history for this message
Tommy Thorn (tommy-ubuntuone) wrote :

I'm not sure how I was _supposed_ to do this, but I checked out the official release and then switched to the hirsute-delta-as-commits-lp1921664 (6c7e3708580ac50f78261a82b2fcdc2f288d6cea) branch, which kept the directories around. I configured with "--target-list=riscv64-softmmu" to save time, and the resulting binary did *not* reproduce the bug.

So in summary:
- Debian 1:5.2+dfsg-9ubuntu1 reproduces the issue on both the RPi4 and my M1 VM.
- So far no version I have built has reproduced the issue.
Definitely makes either _how_ I built it or the _build tools_ I used suspect.

I'm not sure what to do next. I assume I'm supposed to set the bug back to "new"?

Changed in qemu (Ubuntu):
status: Incomplete → New
Revision history for this message
Tommy Thorn (tommy-ubuntuone) wrote :

FWIW: I went full inception and ran QEMU/RISC-V under QEMU/RISC-V but I couldn't reproduce the issue here (that is, everything worked, but very slowly).

Revision history for this message
Christian Ehrhardt  (paelzer) wrote :

Thank you for all your work and these confirmations Tommy!

I was bringing my RPi4 up as well...
Note: My RPi4 is installed as aarch64
I ran userspaces with arm64 and armhf (via LXD).

In the arm64 userspace case I was able to trigger the bug reliably in 3/3 tries under a minute each time
In the armhf userspace case it worked just fine.

So to summarize (on my RPi4)
- RPi4 riscv emulation on arm64 userspace on arm64 kernel - fails (local system)
- RPi4 riscv emulation on armhf userspace on arm64 kernel - TODO (local system)
- XGene riscv emulation on armhf userspace on arm64 kernel - works (Canonistac)
- M1 riscv emulation on armhf userspace on armhf kernel - fails (Tommy)

But I've found a way to recreate this, which is all I needed for now \o/

...
[ OK ] Finished Load/Save Random Seed.
[ OK ] Started udev Kernel Device Manager.
qemu-system-riscv64: ../../util/qemu-coroutine-lock.c:57: qemu_co_queue_wait_impl: Assertion `qemu_in_coroutine()' failed.
./run_riscvVM.sh: line 31: 8302 Aborted (core dumped) qemu-system-riscv64 -machine virt -nographic -smp 2 -m 1G -bios fw_payload.bin -device virtio-blk-device,drive=hd0 -object rng-random,filename=/dev/urandom,id=rng0 -device virtio-rng-device,rng=rng0 -drive file=riscv64-UbuntuFocal-qemu.qcow2,format=qcow2,id=hd0 -device virtio-net-device,netdev=usernet -netdev user,id=usernet,$ports

I need to build & rebuild the different qemu options (git, ubuntu, ubuntu without delta, former ubuntu version) to compare those. And a lot of other tasks fight for having higher prio ... that will take a while ...

Changed in qemu (Ubuntu):
status: New → Confirmed
Revision history for this message
Tommy Thorn (tommy-ubuntuone) wrote :

Small correction: everything I've done has been 64-bit. I don't use armhf.

Revision history for this message
Christian Ehrhardt  (paelzer) wrote :

Ok, thanks Tommy - then my Repro hits exactly what you had.
Good to have that sorted out as well.

Revision history for this message
Christian Ehrhardt  (paelzer) wrote :

Since it was reported to have worked with former builds in Ubuntu I was testing the former builds that were published in Hirsute.

https://launchpad.net/ubuntu/+source/qemu/1:5.2+dfsg-9ubuntu2 - failing
https://launchpad.net/ubuntu/+source/qemu/1:5.2+dfsg-9ubuntu1 - failing
https://launchpad.net/ubuntu/+source/qemu/1:5.2+dfsg-3ubuntu1 - working
https://launchpad.net/ubuntu/+source/qemu/1:5.1+dfsg-4ubuntu3 - working
I had also prepped 1:5.0-5ubuntu11 but didn't go further after the above results.

This absolutely confirms your initial report (changed in a recent version) and gladly leaves us much less to churn through.
OTOH the remaining changes that could be related are mostly CVE fixes, which most of the time are not very debatable.

That was a rebase without changes in Ubuntu, but picking up Debian changes between
1:5.2+dfsg-3 -> 1:5.2+dfsg-9.

Those are:
- virtiofsd changes - not used here
- package dependency changes - not relevant here
- deprecate qemu-debootstrap - not used here
- security fixes
  - arm_gic-fix-interrupt-ID-in-GICD_SGIR-CVE-2021-20221.patch - not used (arm virt)
  - 9pfs-Fully-restart-unreclaim-loop-CVE-2021-20181.patch - not used (9pfs)
  - CVE-2021-20263 - again virtiofsd (not used)
  - CVE-2021-20257 - network for e1000 (not related to the error and nic none works)
  - I'll still unapply these for a test just to be sure
- there also is the chance that this is due to libs/build-toolchain - I'll rebuild a former working version for a re-test

I was trying to further limit the scope, but here things got a bit crazy:

- 1:5.2+dfsg-9ubuntu1 - tried 3 more times as-is - 2 failed 1 worked
So it isn't 100% reproducible :-/

This made me re-check the older builds (maybe some race window got bigger/smaller).

Then I had 3 more tries with "-nic none"
All three failed - so it is unlikely the e1000 fix that could have crept in via a default config.

I have created two PPAs which just started to build:
https://launchpad.net/~paelzer/+archive/ubuntu/lp-1921664-testbuilds-secrevertpatches
https://launchpad.net/~paelzer/+archive/ubuntu/lp-1921664-testbuilds-rebuildold

Once these are complete I can further chase this down ...

Revision history for this message
Christian Ehrhardt  (paelzer) wrote :

From my test PPAs, the version "1:5.2+dfsg-9ubuntu2~hirsuteppa3" - a no-change rebuild of the formerly working "1:5.2+dfsg-9ubuntu1" - failed three times in three tries.

So we are not looking at anything in the qemu source or the Ubuntu/Debian delta applied to it, but at something in the build environment that now creates misbehaving binaries from a source that, built on 2021-03-23, worked fine.
Since I have no idea yet where exactly to look, I'll add "the usual suspects" of glibc, gcc-10 and binutils - also Doko/Rbalint (who look after those packages) have seen a lot and might have an idea about what is going on here.

Revision history for this message
Tommy Thorn (tommy-ubuntuone) wrote :

That would explain why I could reproduce with personal builds. Glibc looks very relevant here.

Revision history for this message
Tommy Thorn (tommy-ubuntuone) wrote :

couldN’T, grr

Revision history for this message
Christian Ehrhardt  (paelzer) wrote :

Before going into any rebuild mania, I wanted to further reduce how many builds I'll need. I've changed "qemu-system-misc" but kept the others like "qemu-block-extra" and "qemu-system-common" - that mostly means /usr/bin/qemu-system-riscv64 is replaced, while all the ROMs and modules stay (and can then not be loaded).
Reminder: all of these are the same effective source.
I've done this two ways:

All good pkg (1:5.2+dfsg-9ubuntu1) + emu bad (1:5.2+dfsg-9ubuntu2~hirsuteppa3):
All bad pkg (1:5.2+dfsg-9ubuntu2~hirsuteppa3) + emu good (1:5.2+dfsg-9ubuntu1): 3/3 fails
That made me wonder and I also got:
All good pkg (1:5.2+dfsg-9ubuntu1): 5/5 fails (formerly this was known good)

Sadly - the formerly seen non-distinct results continued. For example I did at one point end up with all packages of version "1:5.2+dfsg-9ubuntu1" (that is known good) failing in 5/5 tests repeatedly.

So I'm not sure how much the results are worth anymore :-/

Revision history for this message
Christian Ehrhardt  (paelzer) wrote :

Furthermore I've built (again the very same source) in Groovy as 5.2+dfsg-9ubuntu2~groovyppa1 in the same PPA.
This build works as well in my tries.

So I have the same code as in "1:5.2+dfsg-9ubuntu1" three times now:
1. [1] => built 2021-03-23 in Hirsute => works
2. [2] => built 2021-04-12 in Hirsute => fails
3. [3] => built 2021-04-13 in Groovy => works

[1]: https://launchpad.net/ubuntu/+source/qemu/1:5.2+dfsg-9ubuntu1/+build/21196422
[2]: https://launchpad.net/~paelzer/+archive/ubuntu/lp-1921664-testbuilds-rebuildold/+build/21392458
[3]: https://launchpad.net/~paelzer/+archive/ubuntu/lp-1921664-testbuilds-rebuildold/+build/21394457

Given the two results above, my expected next step was to spin up a (git-based)
Groovy and a Hirsute build environment.
I'd do a build from git (and optimize a bit for build speed).
If these builds confirm the above results of [2] and [3] then I should be able
to upgrade the components in the Groovy build environment one by one to Hirsute
To identify which one is causing the breakage...

But unfortunately I have to start to challenge the reproducibility, and that is
what breaks the camel's back here. Without a reliable reproducer I can't make
good progress, and sad as it is (it is a real issue), riscv64 emulation on an
arm64 host really isn't the most common use case. So I'm unsure how much time I can spend on this.

Maybe I have looked at this from the wrong angle, let me try something else before I give up ...

Revision history for this message
Christian Ehrhardt  (paelzer) wrote :

I've continued on one of the former approaches and started a full Ubuntu style
package build of the full source on arm64 in Groovy and Hirsute.
But it fell apart running out of space, and I'm slowly getting hesitant to spend
more HW and time on this without
a) at least asking upstream if it is any known issue
b) not seeing it on a less edge case than risc emulation @ arm64

But I think by now we can drop the former "usual suspects" again, as I have
had plenty of fails with the formerly good builds. It is just racy, and a yet
unknown set of conditions seems to influence this race.

If we are later on finding some evidence we can add them back ...

no longer affects: glibc (Ubuntu)
no longer affects: binutils (Ubuntu)
no longer affects: gcc-10 (Ubuntu)
Revision history for this message
Christian Ehrhardt  (paelzer) wrote :

From the error message this seems to be about concurrency:

qemu-system-riscv64: ../../util/qemu-coroutine-lock.c:57: qemu_co_queue_wait_impl: Assertion `qemu_in_coroutine()' failed.
Aborted (core dumped)

 42 void coroutine_fn qemu_co_queue_wait_impl(CoQueue *queue, QemuLockable *lock)
 43 {
 44     Coroutine *self = qemu_coroutine_self();
 45     QSIMPLEQ_INSERT_TAIL(&queue->entries, self, co_queue_next);
 46
 47     if (lock) {
 48         qemu_lockable_unlock(lock);
 49     }
 50
 51     /* There is no race condition here. Other threads will call
 52      * aio_co_schedule on our AioContext, which can reenter this
 53      * coroutine but only after this yield and after the main loop
 54      * has gone through the next iteration.
 55      */
 56     qemu_coroutine_yield();
 57     assert(qemu_in_coroutine());
 58
 59     /* TODO: OSv implements wait morphing here, where the wakeup
 60      * primitive automatically places the woken coroutine on the
 61      * mutex's queue. This avoids the thundering herd effect.
 62      * This could be implemented for CoMutexes, but not really for
 63      * other cases of QemuLockable.
 64      */
 65     if (lock) {
 66         qemu_lockable_lock(lock);
 67     }
 68 }

I wondered if I could stop this from happening by reducing the guest SMP count
and/or the number of usable host CPUs.

- Running with -smp 1 - 3/3 fails

Arm CPUs are not so easily hot-pluggable, so I wasn't able to run with just
one host CPU yet - but then the number of host CPUs won't change the threads/processes that are executed, just their concurrency.

Revision history for this message
Christian Ehrhardt  (paelzer) wrote :

There are two follow-on changes to this code (in the not yet released qemu 6.0):
 050de36b13 coroutine-lock: Reimplement CoRwlock to fix downgrade bug
 2f6ef0393b coroutine-lock: Store the coroutine in the CoWaitRecord only once

They change how things are done, but neither is a known fix for the current issue.

We might gather more data and report it upstream - it could ring a bell for
someone there.

Attaching gdb to the live qemu ran into further issues:
# Cannot find user-level thread for LWP 29341: generic error
which in the guest led to
# [ 172.294630] watchdog: BUG: soft lockup - CPU#0 stuck for 78s! [systemd-udevd:173]

I'm not sorting this out now, so post-mortem debugging it will be :-/

I've taken a crash dump of the most recent 1:5.2+dfsg-9ubuntu2 which
has debug symbols in Ubuntu and even later one can fetch from
https://launchpad.net/ubuntu/+source/qemu/1:5.2+dfsg-9ubuntu2

(gdb) info threads
  Id Target Id Frame
* 1 Thread 0xffffa98f9010 (LWP 29397) __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:49
  2 Thread 0xffffa904f8b0 (LWP 29398) syscall () at ../sysdeps/unix/sysv/linux/aarch64/syscall.S:38
  3 Thread 0xffffa3ffe8b0 (LWP 29399) 0x0000ffffab022d14 in __GI___sigtimedwait (set=set@entry=0xaaaac2fed320, info=info@entry=0xffffa3ffdd88, timeout=timeout@entry=0x0)
    at ../sysdeps/unix/sysv/linux/sigtimedwait.c:54
  4 Thread 0xffff237ee8b0 (LWP 29407) __futex_abstimed_wait_common64 (cancel=true, private=-1022925096, abstime=0xffff237ede48, clockid=-1022925184, expected=0, futex_word=0xaaaac30766d8)
    at ../sysdeps/nptl/futex-internal.c:74
  5 Thread 0xffff22fde8b0 (LWP 29408) __futex_abstimed_wait_common64 (cancel=true, private=-1022925096, abstime=0xffff22fdde48, clockid=-1022925184, expected=0, futex_word=0xaaaac30766d8)
    at ../sysdeps/nptl/futex-internal.c:74
  6 Thread 0xffff2bee18b0 (LWP 29405) __futex_abstimed_wait_common64 (cancel=true, private=-1022925092, abstime=0xffff2bee0e48, clockid=-1022925184, expected=0, futex_word=0xaaaac30766dc)
    at ../sysdeps/nptl/futex-internal.c:74
  7 Thread 0xffffa27ce8b0 (LWP 29402) futex_wait (private=0, expected=2, futex_word=0xaaaab912d640 <qemu_global_mutex.lto_priv>) at ../sysdeps/nptl/futex-internal.h:146
  8 Thread 0xffffa2fde8b0 (LWP 29401) futex_wait (private=0, expected=2, futex_word=0xaaaab912d640 <qemu_global_mutex.lto_priv>) at ../sysdeps/nptl/futex-internal.h:146
  9 Thread 0xffff23ffe8b0 (LWP 29406) 0x0000ffffab0b9024 in __GI_pwritev64 (fd=<optimized out>, vector=0xaaaac3559fd0, count=2, offset=668794880)
    at ../sysdeps/unix/sysv/linux/pwritev64.c:26
  10 Thread 0xffffa37ee8b0 (LWP 29404) 0x0000ffffab0b9d3c in fdatasync (fd=<optimized out>) at ../sysdeps/unix/sysv/linux/fdatasync.c:28

(gdb) thread apply all bt

Thread 10 (Thread 0xffffa37ee8b0 (LWP 29404)):
#0 0x0000ffffab0b9d3c in fdatasync (fd=<optimized out>) at ../sysdeps/unix/sysv/linux/fdatasync.c:28
#1 0x0000aaaab8b8d3a8 in qemu_fdatasync (fd=<optimized out>) at ../../util/cutils.c:161
#2 handle_aiocb_flush (opaque=<optimized out>) at ../../block/file-posix.c:1350
#3 0x0000aaaab8c57314 in worker_thread (opaque=opaque@entry=0xaaaac307660...

Revision history for this message
Christian Ehrhardt  (paelzer) wrote :

Also, I've rebuilt the most recent master (c1e90def01, about ~55 commits newer than 6.0-rc2).
As in Tommy's experiments, I was unable to reproduce the issue there.
But given the data from the earlier tests, this good result is more likely an
accident of slightly different timing than a fix (to be clear, I'd appreciate
it if there were a fix; I'm just unable to derive from this build being good
anything I could e.g. bisect).

export CFLAGS="-O0 -g -fPIC"
../configure --enable-system --disable-xen --disable-werror --disable-docs --disable-libudev --disable-guest-agent --disable-sdl --disable-gtk --disable-vnc --disable-xen --disable-brlapi --disable-hax --disable-vde --disable-netmap --disable-rbd --disable-libiscsi --disable-libnfs --disable-smartcard --disable-libusb --disable-usb-redir --disable-seccomp --disable-glusterfs --disable-tpm --disable-numa --disable-opengl --disable-virglrenderer --disable-xfsctl --disable-slirp --disable-blobs --disable-rdma --disable-pvrdma --disable-attr --disable-vhost-net --disable-vhost-vsock --disable-vhost-scsi --disable-vhost-crypto --disable-vhost-user --disable-spice --disable-qom-cast-debug --disable-bochs --disable-cloop --disable-dmg --disable-qcow1 --disable-vdi --disable-vvfat --disable-qed --disable-parallels --disable-sheepdog --disable-avx2 --disable-nettle --disable-gnutls --disable-capstone --enable-tools --disable-libssh --disable-libpmem --disable-cap-ng --disable-vte --disable-iconv --disable-curses --disable-linux-aio --disable-linux-io-uring --disable-kvm --disable-replication --audio-drv-list="" --disable-vhost-kernel --disable-vhost-vdpa --disable-live-block-migration --disable-keyring --disable-auth-pam --disable-curl --disable-strip --enable-fdt --target-list="riscv64-softmmu"
make -j10

Just like the package build, this configures as:
   coroutine backend: ucontext
   coroutine pool: YES

5/5 runs with that build were OK.
But since we know the issue is racy, I'm unsure if that implies much :-/

P.S. I have not yet done a build-option bisect, but chances are it could be
related. That would be too much stabbing in the dark though; maybe someone
experienced in the coroutine code can already make sense of all the info we have
gathered so far.
I'll update the bug description and add an upstream task so that all the info we have gets mirrored to the qemu mailing lists.

summary: - Recent update broke qemu-system-riscv64
+ Coroutines are racy for risc64 emu on arm64 - crash on Assertion
description: updated
Changed in qemu (Ubuntu):
importance: Undecided → Low
Revision history for this message
Thomas Huth (th-huth) wrote :

@Christian & Tommy : Could you please check whether the problematic binaries were built with link-time optimization, i.e. with -flto ? If so, does the problem go away when you rebuild the package without LTO?
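One quick way to answer that question is to grep the saved build log for `-flto`. A minimal self-contained sketch follows; `sample-build.log` and its contents are placeholders standing in for a real Launchpad build log, and on the build machine itself `dpkg-buildflags --get CFLAGS` shows whether the distro default flags currently include LTO.

```shell
# Create a stand-in for the real build log so the commands below are
# self-contained; in practice, download the log from Launchpad instead.
printf 'gcc -O2 -flto=auto -ffat-lto-objects -c foo.c\n' > sample-build.log

# Extract any LTO-related compiler flags that appear in the log.
grep -o -- '-flto[^ ]*' sample-build.log | sort -u
```

If the grep prints nothing, the build did not pass `-flto` on its compile lines.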

Changed in qemu:
status: New → Incomplete
Changed in qemu (Ubuntu):
status: Confirmed → Incomplete
Revision history for this message
Christian Ehrhardt  (paelzer) wrote :

Hmm, thanks for the hint Thomas.

Of the two previously referenced builds (same source, different result):

[1] => built 2021-03-23 in Hirsute => works
[2] => built 2021-04-12 in Hirsute => fails

[1]: https://launchpad.net/ubuntu/+source/qemu/1:5.2+dfsg-9ubuntu1/+build/21196422
[2]: https://launchpad.net/~paelzer/+archive/ubuntu/lp-1921664-testbuilds-rebuildold/+build/21392458

The default flags changed in
  https://launchpad.net/ubuntu/+source/dpkg/1.20.7.1ubuntu4
and according to the build logs, both builds ran with that version.
Copy-paste from the log:
  dpkg (= 1.20.7.1ubuntu4),
=> Between those two builds the LTO default flags did not change.

For clarification: LTO is the default nowadays and we are not generally disabling it in qemu. So yes, the builds are with LTO, but both the good and the bad one are.

Although looking at the compiler versions I see:
- good case: gcc 10.2.1-23ubuntu2
- bad case: gcc 10.3.0-1ubuntu1

So maybe - while it wasn't LTO itself - something in gcc 10.3, maybe even LTO-as-of-10.3, is what is broken?

@Tommy - I don't have any of the test systems around anymore. If I built you a no-LTO qemu for testing, which release would you need these days - Hirsute, Impish, ...?

Revision history for this message
Launchpad Janitor (janitor) wrote :

[Expired for qemu (Ubuntu) because there has been no activity for 60 days.]

Changed in qemu (Ubuntu):
status: Incomplete → Expired
Revision history for this message
Launchpad Janitor (janitor) wrote :

[Expired for QEMU because there has been no activity for 60 days.]

Changed in qemu:
status: Incomplete → Expired
Revision history for this message
Dana Goyette (danagoyette) wrote :

I've been having crashes with the same assertion message when trying to run Windows 10 ARM in a VM. But I finally figured out that what's actually crashing it is not that the guest is Windows; it's that I was attaching the virtual drive via virtual USB.

If I do the same thing to an Ubuntu ARM64 guest, it *also* crashes.

qemu-system-aarch64: ../../block/aio_task.c:64: aio_task_pool_wait_one: Assertion `qemu_coroutine_self() == pool->main_co' failed

With the RISC-V guest, does your crash change if you change the type of attachment that's used for the virtual disk?

Also, I tried enabling core dumps in libvirt, but it didn't seem to dump cores to apport. Enabling core dumps would be useful for issues like this.

Revision history for this message
Tommy Thorn (tommy-ubuntuone) wrote :

No, as I described in great detail, it has nothing to do with the attached devices.
I just noticed that the bug was excused away
as being due to the “slow” RPi 4. I'll share that I originally hit it
on Apple's M1, but as I expected my environment might be too unusual, I replicated
it on the RPi 4. I have since switched to building qemu from source, so I don't know if
it still happens.

Revision history for this message
Paride Legovini (paride) wrote :

I am consistently hitting this when trying to install the Ubuntu arm64 ISO image in a VM. A minimal command line that reproduces the problem is (host system is jammy arm64):

qemu-system-aarch64 -enable-kvm -m 2048 -M virt -cpu host -nographic -drive file=flash0.img,if=pflash,format=raw -drive file=flash1.img,if=pflash,format=raw -drive file=image2.qcow2,if=virtio -cdrom jammy-live-server-arm64.iso

The installation never gets to an end, always crashing.

Changed in qemu:
status: Expired → Incomplete
Changed in qemu (Ubuntu):
status: Expired → Incomplete
Revision history for this message
Thomas Huth (th-huth) wrote :

Upstream QEMU bugs are now tracked on https://gitlab.com/qemu-project/qemu/-/issues - so if you can reproduce it with the latest version from upstream QEMU, please report it there.

no longer affects: qemu
Revision history for this message
Paride Legovini (paride) wrote :

I tried the qemu package from Kinetic on a Jammy system

$ qemu-system-aarch64 --version
QEMU emulator version 7.0.0 (Debian 1:7.0+dfsg-7ubuntu1)

and it fails in the same way:

qemu-system-aarch64: ../../util/qemu-coroutine-lock.c:57: qemu_co_queue_wait_impl: Assertion `qemu_in_coroutine()' failed.
Aborted (core dumped)

Revision history for this message
Paride Legovini (paride) wrote :

In the end it looks like it's LTO. I rebuilt Jammy's qemu (1:6.2+dfsg-2ubuntu6.3) with

  DEB_BUILD_MAINT_OPTIONS = optimize=-lto

and it doesn't crash anymore. I can't really tell whether the issue is in QEMU's code or is a compiler bug. The rebuilt package is available in a PPA:

  https://launchpad.net/~paride/+archive/ubuntu/qemu-bpo

which despite the name doesn't actually contain backports.
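Roughly, a no-LTO rebuild like the one above can be reproduced with the steps below. This is a sketch, not the exact commands used: the `sed` placement assumes a conventional debian/rules layout, and the package version may differ.

```shell
# Fetch the source package and its build dependencies
# (requires deb-src entries in sources.list).
apt-get source qemu
sudo apt-get build-dep qemu
cd qemu-*/

# Export the dpkg LTO opt-out near the top of debian/rules;
# debhelper-based packages honour DEB_BUILD_MAINT_OPTIONS.
sed -i '1a export DEB_BUILD_MAINT_OPTIONS = optimize=-lto' debian/rules

# Build unsigned binary packages.
dpkg-buildpackage -us -uc -b
```

Afterwards, the build log can be checked for the absence of `-flto` to confirm the opt-out took effect.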

FWIW Fedora disables LTO on aarch64 (arm64) because of this issue, see:

  https://bugzilla.redhat.com/show_bug.cgi?id=1952483
  https://src.fedoraproject.org/rpms/qemu/c/38b1a6c732bee90f75345c4d07

This is also discussed in this short Fedora mailing list thread:

https://<email address hidden>/msg159665.html

Changed in qemu (Ubuntu):
status: Incomplete → Confirmed
Paride Legovini (paride)
Changed in qemu (Ubuntu):
importance: Low → Medium
Paride Legovini (paride)
tags: added: lto server-todo
Revision history for this message
Paride Legovini (paride) wrote :

@Christian if we agree the path forward here is "disable LTO on non-amd64" I can prepare MPs and uploads for Kinetic and Jammy. I have a reproducer handy which will help with the SRU.

Revision history for this message
Christian Ehrhardt  (paelzer) wrote :

We have recently looked at some coroutine raciness in older versions, but all of those I know of should be fixed in 7.0.

If you see this even in 7.0 (as stated above) and you have a reproducer we can use, then I'd be absolutely happy if you could prep this change.

The upstream bug discussion seems to indicate that !x86 in general is at fault, so I'm just curious whether you found more than riscv64.
@Paride, have you had a chance to check and confirm this on !riscv64 and/or !7.0 qemu?

Changed in qemu (Ubuntu):
assignee: nobody → Paride Legovini (paride)
Revision history for this message
Paride Legovini (paride) wrote :

Hi, all my findings above are based on testing on arm64, not riscv64. I do confirm seeing the coroutine raciness with 7.0, but I tested it on Jammy, not Kinetic, so another round of tests is needed to confirm that Kinetic is affected (I think it is).

In any case Jammy needs to be fixed. The machine where I can reliably reproduce the issue is the same one we use to run the Ubuntu ISO tests, and given that this is point-release week I have to be careful with it, as I don't want to interfere with the ISO testing. After the point release I'll be away from keyboard for a couple of weeks, so the ETA for the fix is end of August.

Revision history for this message
Paride Legovini (paride) wrote :

Confirmed happening on arm64 using a clean Kinetic host system (qemu 1:7.0+dfsg-7ubuntu1).

Changed in qemu (Ubuntu):
status: Confirmed → Triaged
Changed in qemu (Ubuntu Jammy):
status: New → Triaged
assignee: nobody → Paride Legovini (paride)