Coroutines are racy for risc64 emu on arm64 - crash on Assertion

Bug #1921664 reported by Tommy Thorn
This bug affects 2 people
Affects              Status        Importance  Assigned to      Milestone
qemu (Fedora)        In Progress   Medium
qemu (Ubuntu)        Fix Released  Medium      Paride Legovini
qemu (Ubuntu) Jammy  Triaged       Undecided   Paride Legovini

Bug Description

Note: this need not be specific to "riscv64 on arm64"; any similar "slow
emulating slow" combination may be affected, on other architectures as well.

The following case triggers on a Raspberry Pi 4 running arm64
Ubuntu 21.04 [1][2]. It might trigger in other environments as well,
but that is where we have seen it so far.

   $ wget https://github.com/carlosedp/riscv-bringup/releases/download/v1.0/UbuntuFocal-riscv64-QemuVM.tar.gz
   $ tar xzf UbuntuFocal-riscv64-QemuVM.tar.gz
   $ ./run_riscvVM.sh
(wait ~2 minutes)
   [ OK ] Reached target Local File Systems (Pre).
   [ OK ] Reached target Local File Systems.
            Starting udev Kernel Device Manager...
qemu-system-riscv64: ../../util/qemu-coroutine-lock.c:57: qemu_co_queue_wait_impl: Assertion `qemu_in_coroutine()' failed.

This is often, but not 100%, reproducible, and the cases differ slightly; we
see either of:
- qemu-system-riscv64: ../../util/qemu-coroutine-lock.c:57: qemu_co_queue_wait_impl: Assertion `qemu_in_coroutine()' failed.
- qemu-system-riscv64: ../../block/aio_task.c:64: aio_task_pool_wait_one: Assertion `qemu_coroutine_self() == pool->main_co' failed.

Rebuilding working cases has been shown to make them fail, just as rebuilding
(or even reinstalling) bad cases has made them work. The same builds also
behave differently on different arm64 CPUs. TL;DR: the full list of conditions
influencing the good/bad cases here is not yet known.

[1]: https://ubuntu.com/tutorials/how-to-install-ubuntu-on-your-raspberry-pi#1-overview
[2]: http://cdimage.ubuntu.com/daily-preinstalled/pending/hirsute-preinstalled-desktop-arm64+raspi.img.xz

--- --- original report --- ---

I regularly run a RISC-V (RV64GC) QEMU VM, but an update a few days ago broke it. Now when I launch it, it hits an assertion:

OpenSBI v0.6
   ____ _____ ____ _____
  / __ \ / ____| _ \_ _|
 | | | |_ __ ___ _ __ | (___ | |_) || |
 | | | | '_ \ / _ \ '_ \ \___ \| _ < | |
 | |__| | |_) | __/ | | |____) | |_) || |_
  \____/| .__/ \___|_| |_|_____/|____/_____|
        | |
        |_|

...
Found /boot/extlinux/extlinux.conf
Retrieving file: /boot/extlinux/extlinux.conf
618 bytes read in 2 ms (301.8 KiB/s)
RISC-V Qemu Boot Options
1: Linux kernel-5.5.0-dirty
2: Linux kernel-5.5.0-dirty (recovery mode)
Enter choice: 1: Linux kernel-5.5.0-dirty
Retrieving file: /boot/initrd.img-5.5.0-dirty
qemu-system-riscv64: ../../block/aio_task.c:64: aio_task_pool_wait_one: Assertion `qemu_coroutine_self() == pool->main_co' failed.
./run.sh: line 31: 1604 Aborted (core dumped) qemu-system-riscv64 -machine virt -nographic -smp 8 -m 8G -bios fw_payload.bin -device virtio-blk-device,drive=hd0 -object rng-random,filename=/dev/urandom,id=rng0 -device virtio-rng-device,rng=rng0 -drive file=riscv64-UbuntuFocal-qemu.qcow2,format=qcow2,id=hd0 -device virtio-net-device,netdev=usernet -netdev user,id=usernet,$ports

Interestingly this doesn't happen on the AMD64 version of Ubuntu 21.04 (fully updated).

Think you have everything already, but just in case:

$ lsb_release -rd
Description: Ubuntu Hirsute Hippo (development branch)
Release: 21.04

$ uname -a
Linux minimacvm 5.11.0-11-generic #12-Ubuntu SMP Mon Mar 1 19:27:36 UTC 2021 aarch64 aarch64 aarch64 GNU/Linux
(note this is a VM running on macOS/M1)

$ apt-cache policy qemu
qemu:
  Installed: 1:5.2+dfsg-9ubuntu1
  Candidate: 1:5.2+dfsg-9ubuntu1
  Version table:
 *** 1:5.2+dfsg-9ubuntu1 500
        500 http://ports.ubuntu.com/ubuntu-ports hirsute/universe arm64 Packages
        100 /var/lib/dpkg/status

ProblemType: Bug
DistroRelease: Ubuntu 21.04
Package: qemu 1:5.2+dfsg-9ubuntu1
ProcVersionSignature: Ubuntu 5.11.0-11.12-generic 5.11.0
Uname: Linux 5.11.0-11-generic aarch64
ApportVersion: 2.20.11-0ubuntu61
Architecture: arm64
CasperMD5CheckResult: unknown
CurrentDmesg:
 Error: command ['pkexec', 'dmesg'] failed with exit code 127: polkit-agent-helper-1: error response to PolicyKit daemon: GDBus.Error:org.freedesktop.PolicyKit1.Error.Failed: No session for cookie
 Error executing command as another user: Not authorized

 This incident has been reported.
Date: Mon Mar 29 02:33:25 2021
Dependencies:

KvmCmdLine: COMMAND STAT EUID RUID PID PPID %CPU COMMAND
Lspci-vt:
 -[0000:00]-+-00.0 Apple Inc. Device f020
            +-01.0 Red Hat, Inc. Virtio network device
            +-05.0 Red Hat, Inc. Virtio console
            +-06.0 Red Hat, Inc. Virtio block device
            \-07.0 Red Hat, Inc. Virtio RNG
Lsusb: Error: command ['lsusb'] failed with exit code 1:
Lsusb-t:

Lsusb-v: Error: command ['lsusb', '-v'] failed with exit code 1:
ProcEnviron:
 TERM=screen
 PATH=(custom, no user)
 XDG_RUNTIME_DIR=<set>
 LANG=C.UTF-8
 SHELL=/bin/bash
ProcKernelCmdLine: console=hvc0 root=/dev/vda
SourcePackage: qemu
UpgradeStatus: Upgraded to hirsute on 2020-12-30 (88 days ago)
acpidump:
 Error: command ['pkexec', '/usr/share/apport/dump_acpi_tables.py'] failed with exit code 127: polkit-agent-helper-1: error response to PolicyKit daemon: GDBus.Error:org.freedesktop.PolicyKit1.Error.Failed: No session for cookie
 Error executing command as another user: Not authorized

 This incident has been reported.

Revision history for this message
Tommy Thorn (tommy-ubuntuone) wrote :

FWIW, I just now built qemu-system-riscv64 from git ToT and that works fine.

Revision history for this message
Christian Ehrhardt  (paelzer) wrote :

Hi Tommy,
you reported that against "1:5.2+dfsg-9ubuntu1" which is odd.
The only recent change was around
a) package dependencies
b) CVEs not touching your use-case IMHO

Was the formerly working version 1:5.2+dfsg-6ubuntu2 as I'm assuming or did you upgrade from a different one?

Could you also add the full commandline you use to start your qemu test case?
If there are any images or such involved as far as you can share where one could fetch them please.

And to be clear on your report - with the same 1:5.2+dfsg-9ubuntu1 @amd64 it works fine for you.
Just the emulation of riscv64 on arm64 HW is what now fails for you correct?

It also is interesting that you built qemu from git to have it work.
Did you build tag v5.2.0 or the latest commit?
If you built v5.2.0 it might be something in the Ubuntu Delta that I have to look for.
If you've built the latest HEAD of qemu git then most likely the fix is a commit since v5.2.0 - in that case, would you be willing and able to bisect v5.2.0..HEAD to find what the fix was?

Changed in qemu (Ubuntu):
status: New → Incomplete
Revision history for this message
Tommy Thorn (tommy-ubuntuone) wrote :

0. Repro:

   $ wget https://github.com/carlosedp/riscv-bringup/releases/download/v1.0/UbuntuFocal-riscv64-QemuVM.tar.gz
   $ tar xzf UbuntuFocal-riscv64-QemuVM.tar.gz
   $ ./run_riscvVM.sh
(wait ~ 20 s)
   [ OK ] Reached target Local File Systems (Pre).
   [ OK ] Reached target Local File Systems.
            Starting udev Kernel Device Manager...
   qemu-system-riscv64: ../../block/aio_task.c:64: aio_task_pool_wait_one: Assertion `qemu_coroutine_self() == pool->main_co' failed.

  (root password is "riscv" fwiw)

1. "Was the formerly working version 1:5.2+dfsg-6ubuntu2?"

   I'm afraid I don't know, but I update a few times a week.

   If you can tell me how to try individual versions, I'll do that

2. "full commandline you use to start your qemu test case?"

   Probably the repo above is more useful, but FWIW:

   qemu-system-riscv64 \
    -machine virt \
    -nographic \
    -smp 4 \
    -m 4G \
    -bios fw_payload.bin \
    -device virtio-blk-device,drive=hd0 \
    -object rng-random,filename=/dev/urandom,id=rng0 \
    -device virtio-rng-device,rng=rng0 \
    -drive file=riscv64-UbuntuFocal-qemu.qcow2,format=qcow2,id=hd0 \
    -device virtio-net-device,netdev=usernet \
    -netdev user,id=usernet,$ports

3. "the same 1:5.2+dfsg-9ubuntu1 @amd64 it works fine for you? Just the emulation of riscv64 on arm64 HW is what now fails for you correct?"

   Yes x 2, confirmed with the above repro.

   $ apt-cache policy qemu
qemu:
  Installed: 1:5.2+dfsg-9ubuntu1
  Candidate: 1:5.2+dfsg-9ubuntu1
  Version table:
 *** 1:5.2+dfsg-9ubuntu1 500
        500 http://us.archive.ubuntu.com/ubuntu hirsute/universe amd64 Packages
        100 /var/lib/dpkg/status

4. "It also is interesting that you built qemu from git to have it work.
Did you build tag v5.2.0 or the latest commit?"

  latest.

  Rebuilding from the "commit" tagged with v5.2.0 ...

Revision history for this message
Tommy Thorn (tommy-ubuntuone) wrote :

Self-built v5.2.0 qemu-system-riscv64 does _not_ produce the bug.

Changed in qemu (Ubuntu):
status: Incomplete → New
Revision history for this message
Christian Ehrhardt  (paelzer) wrote :

0. Repro:

> ...
> $ ./run_riscvVM.sh
> ...

Thanks, I was not able to reproduce with that using the most recent
qemu 1:5.2+dfsg-9ubuntu1 on amd64 (just like you)

Trying the same on armhf was slower and a bit odd.
- I first got:
  qemu-system-riscv64: at most 2047 MB RAM can be simulated
  Reducing the memory to 2047M started up the system.
- then I have let it boot, which took quite a while and eventually
  hung at
[ 13.017716] mousedev: PS/2 mouse device common for all mice
[ 13.065889] usbcore: registered new interface driver usbhid
[ 13.070209] usbhid: USB HID core driver
[ 13.092671] NET: Registered protocol family 10

So it hung on armhf, while working on an amd64 host. That isn't good, but there was no crash to be seen :-/

Maybe it depends on what arm platform (as there are often subtle differences) or which storage (as the assert is about storage) you run on.
My CPU is an X-Gene and my Storage is a ZFS (that backs my container running hirsute and Hirsute's qemu).
What is it for you?

I've waited more, but no failure other than the hang was showing up.
Is this failing 100% of the times for you, or just sometimes and maybe racy?

---

1. "Was the formerly working version 1:5.2+dfsg-6ubuntu2?"

> I'm afraid I don't know, but I update a few times a week.

A hint which versions to look at can be derived from
  $ grep -- qemu-system-misc /var/log/dpkg.log

> If you can tell me how to try individual versions, I'll do that

You can go to https://launchpad.net/ubuntu/+source/qemu/+publishinghistory
There you'll see every version of the package that existed. If you click on a version
it allows you to download the debs, which you can install with "dpkg -i ....deb"
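As an aside on that workflow: dpkg itself can both list what was installed when and compare Debian version strings, which helps when ordering builds from the publishing history. A small sketch (the version strings are taken from this report; the log path is the standard dpkg one, which may have been rotated away):

```shell
# List qemu-system-misc install/upgrade events with their version strings
# (works only if the dpkg logs still exist - they were lost in this report).
grep -h -- 'qemu-system-misc' /var/log/dpkg.log* 2>/dev/null || true

# dpkg understands epochs (the leading "1:") and suffixes like "+dfsg",
# so it can decide which publishing-history build predates another:
if dpkg --compare-versions '1:5.2+dfsg-3ubuntu1' lt '1:5.2+dfsg-9ubuntu1'; then
    echo '1:5.2+dfsg-3ubuntu1 is the older build'
fi
```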

---

2. "full commandline you use to start your qemu test case?"

> Probably the repo above is more useful, but FWIW:

Indeed, thanks!

3. "the same 1:5.2+dfsg-9ubuntu1 @amd64 it works fine for you? Just the emulation of riscv64 on arm64 HW is what now fails for you correct?"

> Yes x 2, confirmed with the above repro.

Thanks for the confirmation

---

4. "It also is interesting that you built qemu from git to have it work.
Did you build tag v5.2.0 or the latest commit?"

> Rebuilding from the "commit" tagged with v5.2.0 ...

Very interesting; this short after a release the delta is mostly a few CVEs and the integration of e.g. Ubuntu/Debian specific paths. Still, chances are that you used a different toolchain than the packaging builds.
Could you rebuild what you get with "apt source qemu"? That will be 5.2 plus the delta we have...
If that doesn't fail then your build-env differs from our builds, and therein is the solution.
If it fails we need to check which delta it is.

Furthermore, if that indeed fails while v5.2.0 worked: I've pushed all our delta as one commit at a time to https://code.launchpad.net/~paelzer/ubuntu/+source/qemu/+git/qemu/+ref/hirsute-delta-as-commits-lp1921664 so you could maybe bisect that. But to be sure, build from the first commit in there and verify that it works. If that fails as well, we have to look at what differs in those builds.

Revision history for this message
Christian Ehrhardt  (paelzer) wrote :

FYI my qemu is still busy
   1913 root 20 0 2833396 237768 7640 S 100.7 5.9 25:54.13 qemu-system-ris

And after about 1000 seconds the guest moved a bit forward now reaching
[ 13.070209] usbhid: USB HID core driver
[ 13.092671] NET: Registered protocol family 10
[ 1003.282387] Segment Routing with IPv6
[ 1004.790268] sit: IPv6, IPv4 and MPLS over IPv4 tunneling driver
[ 1009.002716] NET: Registered protocol family 17
[ 1012.612965] 9pnet: Installing 9P2000 support
[ 1012.915223] Key type dns_resolver registered
[ 1015.022864] registered taskstats version 1
[ 1015.324660] Loading compiled-in X.509 certificates
[ 1036.408956] Freeing unused kernel memory: 264K
[ 1036.410322] This architecture does not have kernel memory protection.
[ 1036.710012] Run /init as init process
Loading, please wait...

I'll keep it running to check if I'll hit the assert later ....

Revision history for this message
Tommy Thorn (tommy-ubuntuone) wrote :

> Maybe it depends on what arm platform (as there are often subtle differences) or which storage (as the assert is about storage) you run on.
> My CPU is an X-Gene and my Storage is a ZFS (that backs my container running hirsute and Hirsute's qemu).
> What is it for you?

Sorry, I thought I had already reported that, but it's not clear. My setup is special in a couple of ways:
- I'm running Ubuntu/Arm64 (21.04 beta, fully up-to-date except kernel), but ...
- it's a virtual machine on a macOS/Mac Mini M1 (fully up-to-date)
- It's running the 5.8.0-36-generic which isn't the latest (for complicated reasons)

I'll try to bring my Raspberry Pi 4 back up on Ubuntu and see if I can reproduce it there.

> Is this failing 100% of the times for you, or just sometimes and maybe racy?

100% consistently reproducible with the official packages. 0% reproducible with my own build

> A hint which versions to look at can be derived from
> $ grep -- qemu-system-misc /var/log/dpkg.log

Alas, I had critical space issues and /var/log was among the casualties

> Could you rebuild what you get with "apt source qemu". That will be 5.2 plus the Delta we have...

TIL. I tried `apt source --compile qemu` but it complains

  dpkg-checkbuilddeps: error: Unmet build dependencies: gcc-alpha-linux-gnu gcc-powerpc64-linux-gnu

but these packages are not available [anymore?]. I don't currently have the time to figure this out.

> FYI my qemu is still busy

It's hung. The boot takes ~20 seconds on my host. Multi-minute boots are not normal.

If I can reproduce this on a Raspberry Pi 4, then I'll proceed with your suggestions above, otherwise I'll pause this until I can run Ubuntu natively on the Mac Mini.

Revision history for this message
Christian Ehrhardt  (paelzer) wrote :

Ok,
thanks for all the further details.

Let us chase this further down once you got to that test & bisect.
I'll set the state to incomplete until then.

Changed in qemu (Ubuntu):
status: New → Incomplete
Revision history for this message
Tommy Thorn (tommy-ubuntuone) wrote :

On my 4 GB Raspberry Pi 4

  QEMU emulator version 5.2.0 (Debian 1:5.2+dfsg-3ubuntu1)

worked as expected, but

  QEMU emulator version 5.2.0 (Debian 1:5.2+dfsg-9ubuntu1)

*did* reproduce the issue, but it took slightly longer to hit it (a few minutes):

```
...
[ OK ] Started Serial Getty on ttyS0.
[ OK ] Reached target Login Prompts.

Ubuntu 20.04 LTS Ubuntu-riscv64 ttyS0

Ubuntu-riscv64 login: qemu-system-riscv64: ../../util/qemu-coroutine-lock.c:57: qemu_co_queue_wait_impl: Assertion `qemu_in_coroutine()' failed.
./run_riscvVM.sh: line 31: 2304 Aborted (core dumped) qemu-system-riscv64 -machine virt -nographic -smp 4 -m 3G -bios fw_payload.bin -device virtio-blk-device,drive=hd0 -object rng-random,filename=/dev/urandom,id=rng0 -device virtio-rng-device,rng=rng0 -drive file=riscv64-UbuntuFocal-qemu.qcow2,format=qcow2,id=hd0 -device virtio-net-device,netdev=usernet -netdev user,id=usernet,$ports
```

Revision history for this message
Tommy Thorn (tommy-ubuntuone) wrote :

Christian, I think I need some help. Like I said I couldn't build with apt source --compile qemu.
I proceeded with

  $ git clone -b hirsute-delta-as-commits-lp1921664 git+ssh://<email address hidden>/~paelzer/ubuntu/+source/qemu

  (git submodule update --init did nothing)

but the configure step failed with

  $ ../configure
  warn: ignoring non-existent submodule meson
  warn: ignoring non-existent submodule dtc
  warn: ignoring non-existent submodule capstone
  warn: ignoring non-existent submodule slirp
  cross containers no

  NOTE: guest cross-compilers enabled: cc s390x-linux-gnu-gcc cc s390x-linux-gnu-gcc
  /usr/bin/python3: can't open file '/home/tommy/qemu/meson/meson.py': [Errno 2] No such file or directory
  /usr/bin/python3: can't open file '/home/tommy/qemu/meson/meson.py': [Errno 2] No such file or directory

I had no problem building the master branch so I'm not sure what's going on with the submodules in your repo.

Revision history for this message
Tommy Thorn (tommy-ubuntuone) wrote :

I'm not sure how I was _supposed_ to do this, but I checked out the official release and then switched to the hirsute-delta-as-commits-lp1921664 (6c7e3708580ac50f78261a82b2fcdc2f288d6cea) branch, which kept the directories around. I configured with "--target-list=riscv64-softmmu" to save time, and the resulting binary did *not* reproduce the bug.

So in summary:
- Debian 1:5.2+dfsg-9ubuntu1 reproduces the issue on both the RPi4 and my M1 VM.
- So far no version I have built has reproduced the issue.
Definitely makes either _how_ I built it or the _build tools_ I used suspect.

I'm not sure what to do next. I assume I'm supposed to set the bug back to "new"?

Changed in qemu (Ubuntu):
status: Incomplete → New
Revision history for this message
Tommy Thorn (tommy-ubuntuone) wrote :

FWIW: I went full inception and ran QEMU/RISC-V under QEMU/RISC-V but I couldn't reproduce the issue here (that is, everything worked, but very slowly).

Revision history for this message
Christian Ehrhardt  (paelzer) wrote :

Thank you for all your work and these confirmations Tommy!

I was bringing my RPi4 up as well...
Note: My RPi4 is installed as aarch64
I ran userspaces with arm64 and armhf (via LXD).

In the arm64 userspace case I was able to trigger the bug reliably in 3/3 tries under a minute each time
In the armhf userspace case it worked just fine.

So to summarize (on my RPi4)
- RPi4 riscv emulation on arm64 userspace on arm64 kernel - fails (local system)
- RPi4 riscv emulation on armhf userspace on arm64 kernel - TODO (local system)
- XGene riscv emulation on armhf userspace on arm64 kernel - works (Canonistac)
- M1 riscv emulation on armhf userspace on armhf kernel - fails (Tommy)

But I've found a way to recreate this, which is all I needed for now \o/

...
[ OK ] Finished Load/Save Random Seed.
[ OK ] Started udev Kernel Device Manager.
qemu-system-riscv64: ../../util/qemu-coroutine-lock.c:57: qemu_co_queue_wait_impl: Assertion `qemu_in_coroutine()' failed.
./run_riscvVM.sh: line 31: 8302 Aborted (core dumped) qemu-system-riscv64 -machine virt -nographic -smp 2 -m 1G -bios fw_payload.bin -device virtio-blk-device,drive=hd0 -object rng-random,filename=/dev/urandom,id=rng0 -device virtio-rng-device,rng=rng0 -drive file=riscv64-UbuntuFocal-qemu.qcow2,format=qcow2,id=hd0 -device virtio-net-device,netdev=usernet -netdev user,id=usernet,$ports

I need to build & rebuild the different qemu options (git, ubuntu, ubuntu without delta, former ubuntu version) to compare those. And a lot of other tasks fight for having higher prio ... that will take a while ...

Changed in qemu (Ubuntu):
status: New → Confirmed
Revision history for this message
Tommy Thorn (tommy-ubuntuone) wrote :

Small correction: everything I've done has been 64-bit. I don't use armhf.

Revision history for this message
Christian Ehrhardt  (paelzer) wrote :

Ok, thanks Tommy - then my Repro hits exactly what you had.
Good to have that sorted out as well.

Revision history for this message
Christian Ehrhardt  (paelzer) wrote :

Since it was reported to have worked with former builds in Ubuntu I was testing the former builds that were published in Hirsute.

https://launchpad.net/ubuntu/+source/qemu/1:5.2+dfsg-9ubuntu2 - failing
https://launchpad.net/ubuntu/+source/qemu/1:5.2+dfsg-9ubuntu1 - failing
https://launchpad.net/ubuntu/+source/qemu/1:5.2+dfsg-3ubuntu1 - working
https://launchpad.net/ubuntu/+source/qemu/1:5.1+dfsg-4ubuntu3 - working
I had also prepped 1:5.0-5ubuntu11 but didn't go further after the above results.

This absolutely confirms your initial report (something changed in a recent version) and gladly leaves us much less to churn through.
OTOH the remaining changes that could be related are mostly CVEs, which are most of the time not very debatable.

That was a rebase without changes in Ubuntu, but picking up Debian changes between
1:5.2+dfsg-3 -> 1:5.2+dfsg-9.

Those are:
- virtiofsd changes - not used here
- package dependency changes - not relevant here
- deprecate qemu-debootstrap - not used here
- security fixes
  - arm_gic-fix-interrupt-ID-in-GICD_SGIR-CVE-2021-20221.patch - not used (arm virt)
  - 9pfs-Fully-restart-unreclaim-loop-CVE-2021-20181.patch - not used (9pfs)
  - CVE-2021-20263 - again virtiofsd (not used)
  - CVE-2021-20257 - network for e1000 (not related to the error and nic none works)
  - I'll still unapply these for a test just to be sure
- there also is the chance that this is due to libs/build-toolchain - I'll rebuild a former working version for a re-test

I was trying to further limit the scope, but here things got a bit crazy:

- 1:5.2+dfsg-9ubuntu1 - tried 3 more times as-is - 2 failed 1 worked
So it isn't 100% reproducible :-/

This made me re-recheck the older builds (maybe some race window got bigger/smaller).

Then I had 3 more tries with "-nic none"
All three failed - so it is unlikely the e1000 fix that could have crept in via a default config.

I have created two PPAs which just started to build:
https://launchpad.net/~paelzer/+archive/ubuntu/lp-1921664-testbuilds-secrevertpatches
https://launchpad.net/~paelzer/+archive/ubuntu/lp-1921664-testbuilds-rebuildold

Once these are complete I can further chase this down ...

Revision history for this message
Christian Ehrhardt  (paelzer) wrote :

From my test PPAs, the version "1:5.2+dfsg-9ubuntu2~hirsuteppa3", which is a no-change rebuild of the formerly working "1:5.2+dfsg-9ubuntu1", failed in three out of three tries.

So we are not looking at anything in the qemu source or the Ubuntu/Debian delta applied to it, but at something in the build environment that now creates badly behaving binaries - the same source built on 2021-03-23 worked fine.
Since I have no idea yet where exactly to look, I'll add "the usual suspects" of glibc, gcc-10 and binutils - also Doko/Rbalint (who look after those packages) have seen a lot and might have an idea about what is going on here.

Revision history for this message
Tommy Thorn (tommy-ubuntuone) wrote :

That would explain why I could reproduce with personal builds. Glibc looks very relevant here.

Revision history for this message
Tommy Thorn (tommy-ubuntuone) wrote :

couldN’T, grr

Revision history for this message
Christian Ehrhardt  (paelzer) wrote :

Before going into any rebuild mania, I wanted to further reduce how many builds I'll need. I've swapped "qemu-system-misc" but kept the others like "qemu-block-extra" and "qemu-system-common" - that mostly means /usr/bin/qemu-system-riscv64 is replaced while all the ROMs and modules stay (the mismatched modules then can't be loaded).
Reminder: all these are the same effective source
I've done this two ways:

All good pkg (1:5.2+dfsg-9ubuntu1) + emu bad (1:5.2+dfsg-9ubuntu2~hirsuteppa3):
All bad pkg (1:5.2+dfsg-9ubuntu2~hirsuteppa3) + emu good (1:5.2+dfsg-9ubuntu1): 3/3 fails
That made me wonder and I also got:
All good pkg (1:5.2+dfsg-9ubuntu1): 5/5 fails (formerly this was known good)

Sadly - the formerly seen non-distinct results continued. For example I did at one point end up with all packages of version "1:5.2+dfsg-9ubuntu1" (that is known good) failing in 5/5 tests repeatedly.

So I'm not sure how much the results are worth anymore :-/

Revision history for this message
Christian Ehrhardt  (paelzer) wrote :

Furthermore I've built (again the very same source) in groovy as 5.2+dfsg-9ubuntu2~groovyppa1 in the same PPA.
This build works as well in my tries.

So I have the same code as in "1:5.2+dfsg-9ubuntu1" three times now:
1. [1] => built 2021-03-23 in Hirsute => works
2. [2] => built 2021-04-12 in Hirsute => fails
3. [3] => built 2021-04-13 in Groovy => works

[1]: https://launchpad.net/ubuntu/+source/qemu/1:5.2+dfsg-9ubuntu1/+build/21196422
[2]: https://launchpad.net/~paelzer/+archive/ubuntu/lp-1921664-testbuilds-rebuildold/+build/21392458
[3]: https://launchpad.net/~paelzer/+archive/ubuntu/lp-1921664-testbuilds-rebuildold/+build/21394457

With the two results above, my expected next step was to spin up a (git based)
Groovy and a Hirsute build environment.
I'd do a build from git (and optimize a bit for build speed).
If these builds confirm the above results of [2] and [3], then I should be able
to upgrade the components in the Groovy build environment one by one to Hirsute
to identify which one is causing the breakage...

But unfortunately I have to start to question the reproducibility, and that is
the straw that breaks the camel's back here. Without reliable reproduction I
can't make good progress, and as sad as it is (it is a real issue), riscv64
emulation on an arm64 host really isn't the most common use case. So I'm unsure
how much time I can spend on this.

Maybe I have looked at this from the wrong angle, let me try something else before I give up ...

Revision history for this message
Christian Ehrhardt  (paelzer) wrote :

I've continued on one of the former approaches and started a full Ubuntu style
package build of the full source on arm64 in Groovy and Hirsute.
But it fell apart after running out of disk space, and I'm slowly getting
hesitant to spend more HW and time on this without
a) at least asking upstream if it is a known issue
b) seeing it on something less of an edge case than riscv emulation on arm64

But I think by now we can drop the former "usual suspects" again, as I have
had plenty of fails with the formerly good builds. It is just racy, and an as
yet unknown set of conditions seems to influence this race.

If we are later on finding some evidence we can add them back ...

no longer affects: glibc (Ubuntu)
no longer affects: binutils (Ubuntu)
no longer affects: gcc-10 (Ubuntu)
Revision history for this message
Christian Ehrhardt  (paelzer) wrote :

From the error message this seems to be about concurrency:

qemu-system-riscv64: ../../util/qemu-coroutine-lock.c:57: qemu_co_queue_wait_impl: Assertion `qemu_in_coroutine()' failed.
Aborted (core dumped)

 42 void coroutine_fn qemu_co_queue_wait_impl(CoQueue *queue, QemuLockable *lock)
 43 {
 44     Coroutine *self = qemu_coroutine_self();
 45     QSIMPLEQ_INSERT_TAIL(&queue->entries, self, co_queue_next);
 46
 47     if (lock) {
 48         qemu_lockable_unlock(lock);
 49     }
 50
 51     /* There is no race condition here. Other threads will call
 52      * aio_co_schedule on our AioContext, which can reenter this
 53      * coroutine but only after this yield and after the main loop
 54      * has gone through the next iteration.
 55      */
 56     qemu_coroutine_yield();
 57     assert(qemu_in_coroutine());
 58
 59     /* TODO: OSv implements wait morphing here, where the wakeup
 60      * primitive automatically places the woken coroutine on the
 61      * mutex's queue. This avoids the thundering herd effect.
 62      * This could be implemented for CoMutexes, but not really for
 63      * other cases of QemuLockable.
 64      */
 65     if (lock) {
 66         qemu_lockable_lock(lock);
 67     }
 68 }

I wondered if I can stop this from happening by reducing the SMP count and/or
the real CPUs that are usable.

- Running with -smp 1 - 3/3 fails

Arm CPUs are not so easily hot-pluggable, so I wasn't able to run with just
one host CPU yet - but then, the number of host CPUs won't change which threads/processes are executed - just their concurrency.

Revision history for this message
Christian Ehrhardt  (paelzer) wrote :
Download full text (15.1 KiB)

There are two follow on changes to this code (in the not yet released qemu 6.0):
 050de36b13 coroutine-lock: Reimplement CoRwlock to fix downgrade bug
 2f6ef0393b coroutine-lock: Store the coroutine in the CoWaitRecord only once

They change how things are done, but neither is a known fix for the current issue.

We might gather more data and report it upstream - it could ring a bell for
someone there.

Attaching gdb to the live qemu ran into further issues:
# Cannot find user-level thread for LWP 29341: generic error
Which on qemu led to
# [ 172.294630] watchdog: BUG: soft lockup - CPU#0 stuck for 78s! [systemd-udevd:173]

I'm not sorting this out now, so post mortem debugging it will be :-/

I've taken a crash dump of the most recent 1:5.2+dfsg-9ubuntu2, which
has debug symbols in Ubuntu that one can fetch even later from
https://launchpad.net/ubuntu/+source/qemu/1:5.2+dfsg-9ubuntu2

(gdb) info threads
  Id Target Id Frame
* 1 Thread 0xffffa98f9010 (LWP 29397) __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:49
  2 Thread 0xffffa904f8b0 (LWP 29398) syscall () at ../sysdeps/unix/sysv/linux/aarch64/syscall.S:38
  3 Thread 0xffffa3ffe8b0 (LWP 29399) 0x0000ffffab022d14 in __GI___sigtimedwait (set=set@entry=0xaaaac2fed320, info=info@entry=0xffffa3ffdd88, timeout=timeout@entry=0x0)
    at ../sysdeps/unix/sysv/linux/sigtimedwait.c:54
  4 Thread 0xffff237ee8b0 (LWP 29407) __futex_abstimed_wait_common64 (cancel=true, private=-1022925096, abstime=0xffff237ede48, clockid=-1022925184, expected=0, futex_word=0xaaaac30766d8)
    at ../sysdeps/nptl/futex-internal.c:74
  5 Thread 0xffff22fde8b0 (LWP 29408) __futex_abstimed_wait_common64 (cancel=true, private=-1022925096, abstime=0xffff22fdde48, clockid=-1022925184, expected=0, futex_word=0xaaaac30766d8)
    at ../sysdeps/nptl/futex-internal.c:74
  6 Thread 0xffff2bee18b0 (LWP 29405) __futex_abstimed_wait_common64 (cancel=true, private=-1022925092, abstime=0xffff2bee0e48, clockid=-1022925184, expected=0, futex_word=0xaaaac30766dc)
    at ../sysdeps/nptl/futex-internal.c:74
  7 Thread 0xffffa27ce8b0 (LWP 29402) futex_wait (private=0, expected=2, futex_word=0xaaaab912d640 <qemu_global_mutex.lto_priv>) at ../sysdeps/nptl/futex-internal.h:146
  8 Thread 0xffffa2fde8b0 (LWP 29401) futex_wait (private=0, expected=2, futex_word=0xaaaab912d640 <qemu_global_mutex.lto_priv>) at ../sysdeps/nptl/futex-internal.h:146
  9 Thread 0xffff23ffe8b0 (LWP 29406) 0x0000ffffab0b9024 in __GI_pwritev64 (fd=<optimized out>, vector=0xaaaac3559fd0, count=2, offset=668794880)
    at ../sysdeps/unix/sysv/linux/pwritev64.c:26
  10 Thread 0xffffa37ee8b0 (LWP 29404) 0x0000ffffab0b9d3c in fdatasync (fd=<optimized out>) at ../sysdeps/unix/sysv/linux/fdatasync.c:28

(gdb) thread apply all bt

Thread 10 (Thread 0xffffa37ee8b0 (LWP 29404)):
#0 0x0000ffffab0b9d3c in fdatasync (fd=<optimized out>) at ../sysdeps/unix/sysv/linux/fdatasync.c:28
#1 0x0000aaaab8b8d3a8 in qemu_fdatasync (fd=<optimized out>) at ../../util/cutils.c:161
#2 handle_aiocb_flush (opaque=<optimized out>) at ../../block/file-posix.c:1350
#3 0x0000aaaab8c57314 in worker_thread (opaque=opaque@entry=0xaaaac307660...

Revision history for this message
Christian Ehrhardt  (paelzer) wrote :

Also I've rebuilt the most recent master, c1e90def01, about ~55 commits newer than 6.0-rc2.
As in Tommy's experiments, I was unable to reproduce it there.
But given the data from the earlier tests, this is more likely an accident of
slightly different timing than an actual fix (to be clear, I'd appreciate it
if there is a fix; I'm just unable to conclude from this build being good
that I could e.g. bisect).

export CFLAGS="-O0 -g -fPIC"
../configure --enable-system --disable-xen --disable-werror --disable-docs --disable-libudev --disable-guest-agent --disable-sdl --disable-gtk --disable-vnc --disable-xen --disable-brlapi --disable-hax --disable-vde --disable-netmap --disable-rbd --disable-libiscsi --disable-libnfs --disable-smartcard --disable-libusb --disable-usb-redir --disable-seccomp --disable-glusterfs --disable-tpm --disable-numa --disable-opengl --disable-virglrenderer --disable-xfsctl --disable-slirp --disable-blobs --disable-rdma --disable-pvrdma --disable-attr --disable-vhost-net --disable-vhost-vsock --disable-vhost-scsi --disable-vhost-crypto --disable-vhost-user --disable-spice --disable-qom-cast-debug --disable-bochs --disable-cloop --disable-dmg --disable-qcow1 --disable-vdi --disable-vvfat --disable-qed --disable-parallels --disable-sheepdog --disable-avx2 --disable-nettle --disable-gnutls --disable-capstone --enable-tools --disable-libssh --disable-libpmem --disable-cap-ng --disable-vte --disable-iconv --disable-curses --disable-linux-aio --disable-linux-io-uring --disable-kvm --disable-replication --audio-drv-list="" --disable-vhost-kernel --disable-vhost-vdpa --disable-live-block-migration --disable-keyring --disable-auth-pam --disable-curl --disable-strip --enable-fdt --target-list="riscv64-softmmu"
make -j10

Just like the package build that configures as
   coroutine backend: ucontext
   coroutine pool: YES

5/5 runs with that were ok
But since we know it is racy I'm unsure if that implies much :-/

P.S. I have not yet gone into a build-option bisect, though chances are it could be
related. But that is too much stabbing in the dark; maybe someone experienced
in the coroutine code can already make sense of all the info we have gathered so
far.
I'll update the bug description and add an upstream task so that all the info we have gets mirrored to the qemu mailing lists.

summary: - Recent update broke qemu-system-riscv64
+ Coroutines are racy for risc64 emu on arm64 - crash on Assertion
description: updated
Changed in qemu (Ubuntu):
importance: Undecided → Low
Revision history for this message
In , mrezanin (mrezanin-redhat-bugs) wrote :

When running the qemu-kvm build for RHEL 9, test-block-iothread fails during "make check" on the aarch64, ppc64le and s390x architectures for /attach/blockjob (it passes on x86_64):

ERROR test-block-iothread - Bail out! ERROR:../tests/unit/test-block-iothread.c:379:test_job_run: assertion failed: (qemu_get_current_aio_context() == job->aio_context)

The same code passes the test on RHEL 8.

Revision history for this message
In , smitterl (smitterl-redhat-bugs) wrote :

I cannot reproduce this on RHEL 9 Beta compose ID RHEL-9.0.0-20210504.5 on s390x (z15) with qemu master@3e13d8e34b53d8f9a3421a816ccfbdc5fa874e98.
I ran
# ../configure --target-list=s390x-softmmu
# make
# make check-unit
I had to install TAP::Parser through CPAN, though.

Revision history for this message
In , eric.auger (eric.auger-redhat-bugs) wrote :

Hi Kevin,

we seem to have a bunch of issues related to presumed coroutine races:
- https://bugzilla.redhat.com/show_bug.cgi?id=1924014
- https://bugzilla.redhat.com/show_bug.cgi?id=1950192
- https://bugzilla.redhat.com/show_bug.cgi?id=1924974
also I saw https://bugs.launchpad.net/ubuntu/+source/qemu/+bug/1921664

I don't yet understand if/how it relates to this BZ.

Above BZs mention
- qemu-system-xxx: ../../util/qemu-coroutine-lock.c:57: qemu_co_queue_wait_impl: Assertion `qemu_in_coroutine()' failed.
- qemu-system-xxx: ../../block/aio_task.c:64: aio_task_pool_wait_one: Assertion `qemu_coroutine_self() == pool->main_co' failed.

Thanks

Eric

Revision history for this message
In , kwolf (kwolf-redhat-bugs) wrote :

So far it seems we are only seeing this kind of problem on RHEL 9, and only on non-x86. My best guess is still that something is wrong with the TLS implementation there.

If you can reproduce the problems, you could try to figure out which of the components makes the difference: does the problem still occur when compiling and running the RHEL 9 qemu source on RHEL 8, and when building and running RHEL 8 qemu on RHEL 9? If the problem is not in the QEMU source, is it the compiler/toolchain, the kernel, or can we identify any other specific component that causes the difference when changed individually?

Revision history for this message
In , thuth (thuth-redhat-bugs) wrote :

Stefan, could you please have a look at this BZ here? It's easy to reproduce the problem with failing coroutines when compiling with -flto on a non-x86 box here:

 tar -xaf ~/qemu-6.0.0.tar.xz
 cd qemu-6.0.0/
 ./configure --disable-docs --extra-cflags='-O2 -flto=auto -ffat-lto-objects' --target-list='s390x-softmmu'
 cd build/
 make -j8 tests/unit/test-block-iothread
 tests/unit/test-block-iothread

Fails with:

 ERROR:../tests/unit/test-block-iothread.c:379:test_job_run: assertion failed: (qemu_get_current_aio_context() == job->aio_context)
 Bail out! ERROR:../tests/unit/test-block-iothread.c:379:test_job_run: assertion failed: (qemu_get_current_aio_context() == job->aio_context)

Revision history for this message
In , thuth (thuth-redhat-bugs) wrote :

Backtrace looks like this:

(gdb) bt full
#0 0x000003fffca9f4ae in __pthread_kill_internal () from /lib64/libc.so.6
No symbol table info available.
#1 0x000003fffca4fa20 in raise () from /lib64/libc.so.6
No symbol table info available.
#2 0x000003fffca31398 in abort () from /lib64/libc.so.6
No symbol table info available.
#3 0x000003fffde00de6 in g_assertion_message () from /lib64/libglib-2.0.so.0
No symbol table info available.
#4 0x000003fffde00e46 in g_assertion_message_expr () from /lib64/libglib-2.0.so.0
No symbol table info available.
#5 0x000002aa00023c20 in test_job_run (job=0x2aa001f5f50, errp=<optimized out>) at ../tests/unit/test-block-iothread.c:379
        s = 0x2aa001f5f50
        __func__ = "test_job_run"
        __mptr = <optimized out>
#6 0x000002aa0005a546 in job_co_entry (opaque=0x2aa001f5f50) at ../job.c:914
        job = 0x2aa001f5f50
        __PRETTY_FUNCTION__ = "job_co_entry"
#7 0x000002aa000e2d7a in coroutine_trampoline (i0=<optimized out>, i1=<optimized out>) at ../util/coroutine-ucontext.c:173
        arg = {p = <optimized out>, i = {<optimized out>, <optimized out>}}
        self = <optimized out>
        co = 0x2aa001f6110
        fake_stack_save = 0x0
#8 0x000003fffca65092 in __makecontext_ret () from /lib64/libc.so.6

Revision history for this message
In , stefanha (stefanha-redhat-bugs) wrote :

Thanks to Florian Weimer for input from the toolchain side.

It looks grim, unfortunately. The toolchain may cache TLS values regardless of stack/thread switches. This means a coroutine running in thread 1 that switches to thread 2 might see TLS values from thread 1.

To recap the issue:

  static int coroutine_fn test_job_run(Job *job, Error **errp)
  {
      TestBlockJob *s = container_of(job, TestBlockJob, common.job);

      job_transition_to_ready(&s->common.job);
      while (!s->should_complete) {
          s->n++;
          g_assert(qemu_get_current_aio_context() == job->aio_context);

          /* Avoid job_sleep_ns() because it marks the job as !busy. We want to
           * emulate some actual activity (probably some I/O) here so that the
           * drain involved in AioContext switches has to wait for this activity
           * to stop. */
          qemu_co_sleep_ns(QEMU_CLOCK_REALTIME, 1000000);
          ^^^^^^^^^^^^^^^^

          job_pause_point(&s->common.job);
      }

      g_assert(qemu_get_current_aio_context() == job->aio_context);
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Both qemu_co_sleep_ns() and qemu_get_current_aio_context() load the same TLS variable. The compiler may cache the value, since it assumes other threads will not modify the TLS variable.

We have discussed changing how QEMU uses TLS from coroutines, but TLS is used widely so it's not easy to fix this.

Kevin mentioned that he was able to work around the issue by disabling inlining in one case, but the concern is that there's no systematic way of preventing these bugs.

From a QEMU perspective the easiest option would be a "TLS barrier" primitive that tells the toolchain that TLS accesses cannot be cached across a certain point.

Another toolchain option is to disable TLS caching optimizations - i.e. stop assuming that TLS memory can only be modified from the current thread. I wonder how much of a performance impact this has.

Finally, a hack might be to find a way to convert a TLS variable's address into a regular pointer so the compiler can no longer assume other threads don't modify it. Then loads shouldn't be cached across sequence points according to the C standard.
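That third option can be sketched in a few lines of C. In this hedged illustration (the variable and helper names are hypothetical, not QEMU code), a noinline function launders the TLS variable's address into an ordinary pointer, which the compiler must assume other threads can reach:

```c
#include <assert.h>

/* Hypothetical thread-local variable, standing in for e.g. my_aiocontext. */
static __thread int my_value;

/*
 * Laundering sketch: because the function is noinline, the caller only
 * ever sees an ordinary int *. The compiler must assume other threads
 * may modify the pointee, so it cannot cache loads of *get_my_value_ptr()
 * across sequence points the way it can cache direct TLS accesses.
 */
__attribute__((noinline)) static int *get_my_value_ptr(void)
{
    return &my_value;
}
```

Code around a coroutine suspension point would then re-call the getter after resuming instead of touching the TLS variable directly.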

Florian: Are any of these toolchain approaches possible?

Revision history for this message
In , fweimer (fweimer-redhat-bugs) wrote :

I have started a thread on the gcc list:

Disabling TLS address caching to help QEMU on GNU/Linux
https://gcc.gnu.org/pipermail/gcc/2021-July/236831.html

Revision history for this message
In , berrange (berrange-redhat-bugs) wrote :

(In reply to Florian Weimer from comment #9)
> I have started a thread on the gcc list:
>
> Disabling TLS address caching to help QEMU on GNU/Linux
> https://gcc.gnu.org/pipermail/gcc/2021-July/236831.html

This discussion rather implies that we ought to have an RFE bug open against GCC to request a solution and track its progress towards RHEL.

This is complicated in 9 by the fact that we're now using Clang too, so presumably we might need an RFE against Clang instead of, or as well as, GCC.

Revision history for this message
Thomas Huth (th-huth) wrote :

@Christian & Tommy : Could you please check whether the problematic binaries were built with link-time optimization, i.e. with -flto ? If so, does the problem go away when you rebuild the package without LTO?

Changed in qemu:
status: New → Incomplete
Changed in qemu (Ubuntu):
status: Confirmed → Incomplete
Revision history for this message
Christian Ehrhardt  (paelzer) wrote :

Hmm, thanks for the hint Thomas.

Of the two formerly referenced same-source different result builds:

[1] => built 2021-03-23 in Hirsute => works
[2] => built 2021-04-12 in Hirsute => fails

[1]: https://launchpad.net/ubuntu/+source/qemu/1:5.2+dfsg-9ubuntu1/+build/21196422
[2]: https://launchpad.net/~paelzer/+archive/ubuntu/lp-1921664-testbuilds-rebuildold/+build/21392458

The default flags changed in
  https://launchpad.net/ubuntu/+source/dpkg/1.20.7.1ubuntu4
and according to the build logs both ran with that.
Copy-Pasta from the log:
  dpkg (= 1.20.7.1ubuntu4),
=> In between those we did not switch the LTO default flags

For clarification: LTO is the default nowadays and we are not generally disabling it in qemu. So yes, the builds are with LTO, but both the good and the bad one are.

Although looking at the versions I see we have:
- good case: 10.2.1-23ubuntu2
- bad case: 10.3.0-1ubuntu1

So maybe - while it wasn't the LTO flags that changed - something in 10.3, maybe even LTO as of 10.3, is what is broken?

@Tommy - I don't have any of the test systems around anymore. If I built you a no-LTO qemu for testing, which release would you need these days - Hirsute, Impish, ...?

Revision history for this message
In , stefanha (stefanha-redhat-bugs) wrote :

Hi Florian,
Thanks for starting the mailing list thread. Has there been activity in the gcc community?

As Daniel Berrange mentioned, clang has entered the picture. I wanted to check if you see anything happening. If not, then I'll open RFEs as suggested.

Revision history for this message
In , fweimer (fweimer-redhat-bugs) wrote :

(In reply to Stefan Hajnoczi from comment #13)
> Hi Florian,
> Thanks for starting the mailing list thread. Has there been activity in the
> gcc community?

There has been the discussion, but nothing else that I know of.

(If you have switched to clang, this is not really relevant to you anyway, so I don't know what the priority is for the gcc side.)

Revision history for this message
In , thuth (thuth-redhat-bugs) wrote :

We switched to Clang with qemu-kvm-6.0.0-12, but if I've got a comment in another BZ right (https://bugzilla.redhat.com/show_bug.cgi?id=1940132#c47), that build still fails on aarch64. It seems to work on s390x at first glance, though.

Revision history for this message
In , tstellar (tstellar-redhat-bugs) wrote :

(In reply to Thomas Huth from comment #15)
> We switched to Clang with qemu-kvm-6.0.0-12, but if I've got a comment in
> another BZ right (https://bugzilla.redhat.com/show_bug.cgi?id=1940132#c47)
> that build still fails on aarch64. It seems to work on s390x on a first
> glance, though.

It's a little hard to follow through all the different bugzilla links; would someone be able to file a bug against the clang component with a summary of the failures specific to clang? That would make it easier for our team to analyze and track.

Revision history for this message
In , thuth (thuth-redhat-bugs) wrote :

@<email address hidden> : Can you still reproduce the problem with a Clang build (that uses -flto) on aarch64? If so, could you please open a BZ against "clang" as Tom suggested, with the instructions how to reproduce the problem there?

Revision history for this message
In , tstellar (tstellar-redhat-bugs) wrote :

Is there a reduced test case that will demonstrate the problem mentioned in comment 8?

Revision history for this message
In , eric.auger (eric.auger-redhat-bugs) wrote :

(In reply to Tom Stellard from comment #20)
> Is there a reduced test case that will demonstrate the problem mentioned in
> comment 8.

Tom, I opened https://bugzilla.redhat.com/show_bug.cgi?id=2000479 against CLANG as you suggested. Here you will find the qemu configuration and test case I used to trigger the issue. It is basically the same as the one described by Thomas in https://bugzilla.redhat.com/show_bug.cgi?id=1952483#c6, with CLANG.

Revision history for this message
In , tstellar (tstellar-redhat-bugs) wrote :

(In reply to Eric Auger from comment #21)
> (In reply to Tom Stellard from comment #20)
> > Is there a reduced test case that will demonstrate the problem mentioned in
> > comment 8.
>
> Tom, I opened https://bugzilla.redhat.com/show_bug.cgi?id=2000479 against
> CLANG as you suggested. Here you will find the qemu configuration and test
> case I used to trigger the issue. It is basically the same as the one
> described by Thomas in
> https://bugzilla.redhat.com/show_bug.cgi?id=1952483#c6, with CLANG.

OK, thanks. Am I correct that the core problem is sequences like this:

thread_local int *ptr;

read(ptr);
context_switch();
read(ptr);

Where the two read() calls use the same address even though they may run in different threads.

Revision history for this message
In , stefanha (stefanha-redhat-bugs) wrote :

(In reply to Tom Stellard from comment #22)
> (In reply to Eric Auger from comment #21)
> > (In reply to Tom Stellard from comment #20)
> > > Is there a reduced test case that will demonstrate the problem mentioned in
> > > comment 8.
> >
> > Tom, I opened https://bugzilla.redhat.com/show_bug.cgi?id=2000479 against
> > CLANG as you suggested. Here you will find the qemu configuration and test
> > case I used to trigger the issue. It is basically the same as the one
> > described by Thomas in
> > https://bugzilla.redhat.com/show_bug.cgi?id=1952483#c6, with CLANG.
>
> OK, thanks. Am I correct that the core problem is sequences like this:
>
> thread_local int *ptr;
>
> read(ptr);
> context_switch();
> read(ptr);
>
> Where the 2 read functions read the same address even though they may run in
> different threads.

Yes.

Revision history for this message
In , sguelton (sguelton-redhat-bugs) wrote :

This may sound a bit naive, but... if the coroutines are implemented in terms of setjmp/longjmp, then the C11 standard says:

```
7.13.2.1 [The longjmp function]

1 #include <setjmp.h>
             _Noreturn void longjmp(jmp_buf env, int val);
    Description

2 The longjmp function restores the environment saved by the most recent invocation of
    the setjmp macro in the same invocation of the program with the corresponding
    jmp_buf argument. If there has been no such invocation, or **if the invocation was from
    another thread of execution**, or if the function containing the invocation of the setjmp
    macro has terminated execution248) in the interim, or if the invocation of the setjmp
    macro was within the scope of an identifier with variably modified type and execution has
    left that scope in the interim, the behavior is undefined.

```

Then isn't that prone to failure? Wouldn't declaring the thread-local variables volatile change something?

This small godbolt experiment is promising:

https://gcc.godbolt.org/z/6nznnMvTs

Revision history for this message
In , stefanha (stefanha-redhat-bugs) wrote :

(In reply to serge_sans_paille from comment #24)
> This may sound a bit naive, but... if the couroutine are implemented in
> terms of setjmp/longjmp, then the c11 standard says that
>
> ```
> 7.13.2.1 [The longjmp function]
>
> 1 #include <setjmp.h>
> _Noreturn void longjmp(jmp_buf env, int val);
> Description
>
> 2 The longjmp function restores the environment saved by the most recent
> invocation of
> the setjmp macro in the same invocation of the program with the
> corresponding
> jmp_buf argument. If there has been no such invocation, or **if the
> invocation was from
> another thread of execution**, or if the function containing the
> invocation of the setjmp
> macro has terminated execution248) in the interim, or if the invocation
> of the setjmp
> macro was within the scope of an identifier with variably modified type
> and execution has
> left that scope in the interim, the behavior is undefined.
>
> ```
>
> Then isn't that prone to failure?

Yes, according to the spec the behavior is undefined. QEMU has other coroutine implementations too, e.g. assembly or using other OS/runtime APIs.

I don't think setjmp is the culprit here since QEMU could switch to the assembly implementation and it would still have the TLS problem.

> Wouldn't setting the thread local
> variables as volatile change something?
>
> This small godbolt experiment is promising:
>
> https://gcc.godbolt.org/z/6nznnMvTs

I don't think that approach is workable because:
1. It's very tricky to get it right. The relatively innocuous addition I made here is broken: https://gcc.godbolt.org/z/8GP4dTP56. The likelihood of errors like this slipping past code review is high.
2. All __thread variables in QEMU need to be converted to volatile pointers, including auditing and rewriting code that uses the variables.

Revision history for this message
In , kwolf (kwolf-redhat-bugs) wrote :

(In reply to serge_sans_paille from comment #24)
> Wouldn't setting the thread local variables as volatile change something?

When I tried that with the original reproducer, it was not enough. In hindsight that made sense to me: It's not the value of the variable that is volatile and must not be cached, but it's its address that changes between threads. I don't think this can be expressed with volatile.

Also note that the bug didn't reproduce on x86 without -mtls-dialect=gnu2 (which apparently is the default on other architectures).
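Kevin's point - that it is the variable's address, not its value, that differs per thread - is easy to demonstrate. The following standalone sketch (not QEMU code; names are illustrative) records the address of the same __thread variable from two threads; while both threads exist they name distinct storage, so a TLS address cached in one thread is simply wrong in the other:

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

static __thread int tls_var;

static void *record_addr(void *out)
{
    /* Each thread gets its own instance of tls_var. */
    *(int **)out = &tls_var;
    return NULL;
}

/* Returns 1 when the main thread and a worker thread observe different
 * addresses for the same __thread variable, -1 on pthread failure. */
static int tls_addresses_differ(void)
{
    int *worker_addr = NULL;
    pthread_t t;

    if (pthread_create(&t, NULL, record_addr, &worker_addr) != 0) {
        return -1;
    }
    pthread_join(t, NULL);
    return worker_addr != NULL && worker_addr != &tls_var;
}
```

This is exactly the situation a ucontext coroutine creates when it is resumed on another thread: any TLS address computed before the switch points into the old thread's storage.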

Revision history for this message
In , sguelton (sguelton-redhat-bugs) wrote :

Another attempt: if we force an indirect access to the TLS through `-mno-tls-direct-seg-refs`, that should prevent hoisting (?)

Revision history for this message
In , stefanha (stefanha-redhat-bugs) wrote :

Do you have time to try the suggestion in Comment 27? I am on PTO until Sept 29th. Thank you!

Revision history for this message
In , fweimer (fweimer-redhat-bugs) wrote :

(In reply to serge_sans_paille from comment #27)
> Another attempt: if we force an indirect access to the TLS through
> `-mno-tls-direct-seg-refs`, that should prevent hoisting (?)

-mno-tls-direct-seg-refs is an x86 option, introduced for i386 para-virtualization (with i386 host kernels), so it's thoroughly obsolete by now. It also goes in the wrong direction: -mtls-direct-seg-refs (the default) completely avoids materializing the thread pointer for direct accesses to thread-local variables (of the initial-exec or local-exec kind). And if the thread pointer is not loaded into a general-purpose register, it can't be out-of-date after a context switch.

Revision history for this message
In , sguelton (sguelton-redhat-bugs) wrote :

The following diff

```
--- qemu-6.1.0.orig/util/async.c 2021-08-24 13:35:41.000000000 -0400
+++ qemu-6.1.0/util/async.c 2021-09-20 17:48:15.404681749 -0400
@@ -673,6 +673,10 @@

 AioContext *qemu_get_current_aio_context(void)
 {
+    if (qemu_in_coroutine()) {
+        Coroutine *self = qemu_coroutine_self();
+        return self->ctx;
+    }
     if (my_aiocontext) {
         return my_aiocontext;
     }
```

fixes the scenario proposed by Thomas in https://bugzilla.redhat.com/show_bug.cgi?id=1952483#c6 (but it does not fix all tests).

I understand this puts an extra burden on qemu developers, but it also seems sane to me to prevent coroutines from accessing thread-local variables from a different thread than the one they were created in (interesting read on that topic: http://www.crystalclearsoftware.com/soc/coroutine/coroutine/coroutine_thread.html)

Would that be acceptable to enforce that property upstream?

Revision history for this message
In , stefanha (stefanha-redhat-bugs) wrote :

(In reply to serge_sans_paille from comment #30)
> The following diff
>
> ```
> --- qemu-6.1.0.orig/util/async.c 2021-08-24 13:35:41.000000000 -0400
> +++ qemu-6.1.0/util/async.c 2021-09-20 17:48:15.404681749 -0400
> @@ -673,6 +673,10 @@
>
> AioContext *qemu_get_current_aio_context(void)
> {
> + if (qemu_in_coroutine()) {

This uses the `current` TLS variable. Are you sure this works? It seems like the same problem :).

Revision history for this message
In , sguelton (sguelton-redhat-bugs) wrote :

The patch above fixes the LTO issue, and once applied, I've successfully built qemu with LTO with GCC: https://koji.fedoraproject.org/koji/taskinfo?taskID=76803353 (all archs) and with Clang: https://koji.fedoraproject.org/koji/taskinfo?taskID=76802978 (s390x only).

It's a compiler-agnostic patch: it works for any compiler that honors __attribute__((noinline)), as long as the compiler doesn't try to do interprocedural optimization across non-inlinable functions.

Revision history for this message
Launchpad Janitor (janitor) wrote :

[Expired for qemu (Ubuntu) because there has been no activity for 60 days.]

Changed in qemu (Ubuntu):
status: Incomplete → Expired
Revision history for this message
Launchpad Janitor (janitor) wrote :

[Expired for QEMU because there has been no activity for 60 days.]

Changed in qemu:
status: Incomplete → Expired
Revision history for this message
In , stefanha (stefanha-redhat-bugs) wrote :

RFC patch posted upstream based on the patch Serge attached to this BZ:
https://<email address hidden>/

Revision history for this message
Dana Goyette (danagoyette) wrote :

I've been having crashes with the same assertion message, when trying to run Windows 10 ARM under a VM. But I finally figured out that what's actually crashing it is not the fact that it's Windows, it's the fact that I was attaching the virtual drive via virtual USB.

If I do the same thing to an Ubuntu ARM64 guest, it *also* crashes.

qemu-system-aarch64: ../../block/aio_task.c:64: aio_task_pool_wait_one: Assertion `qemu_coroutine_self() == pool->main_co' failed

With the RISC-V guest, does your crash change if you change the type of attachment that's used for the virtual disk?

Also, I tried enabling core dumps in libvirt, but it didn't seem to dump cores to apport. Enabling core dumps would be useful for issues like this.

Revision history for this message
Tommy Thorn (tommy-ubuntuone) wrote :

No, as I described in great detail, it has nothing to do with the attached devices.
I just noticed that the bug was excused away as being due to the "slow" RPi 4.
I'll share that I originally hit it on Apple's M1, but as I expected my environment
might be too unusual, I replicated it on the RPi 4. I have since switched to building
qemu from source, so I don't know if it still happens.

Revision history for this message
In , kkiwi (kkiwi-redhat-bugs) wrote :

Based on recent discussions with Stefan/Thomas and others, I'm moving this to ITR 9.1.0 as a "FutureFeature" since we don't yet enable LTO downstream on non-x86 architectures. We do have an RFC patch upstream, so hopefully this can be added soon.

Revision history for this message
In , stefanha (stefanha-redhat-bugs) wrote :

The following have been merged:
d5d2b15ecf cpus: use coroutine TLS macros for iothread_locked
17c78154b0 rcu: use coroutine TLS macros
47b7446456 util/async: replace __thread with QEMU TLS macros
7d29c341c9 tls: add macros for coroutine-safe TLS variables

I sent another 3 patches as a follow-up series.

Revision history for this message
In , mdeng (mdeng-redhat-bugs) wrote :

(In reply to Miroslav Rezanina from comment #0)
> When running build for qemu-kvm for RHEL 9, test-block-iothread during "make
> check " fails on aarch64, ppc64le and s390x architecture for
> /attach/blockjob (pass on x86_64):
  FYI, qemu-kvm isn't supported on RHEL 9 on Power.
Thanks

Revision history for this message
In , stefanha (stefanha-redhat-bugs) wrote :

The following patches were merged upstream:
c1fe694357 coroutine-win32: use QEMU_DEFINE_STATIC_CO_TLS()
ac387a08a9 coroutine: use QEMU_DEFINE_STATIC_CO_TLS()
34145a307d coroutine-ucontext: use QEMU_DEFINE_STATIC_CO_TLS()
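The upstream macros live in include/qemu/coroutine-tls.h; the core idea can be sketched in a simplified, illustrative form (this is not the actual QEMU implementation, and the macro body here is an assumption): the variable is only reachable through noinline accessors, so its TLS address is re-derived on every call rather than cached across a coroutine's thread switch.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Simplified sketch in the spirit of QEMU_DEFINE_STATIC_CO_TLS().
 * Illustrative only - the real macros in QEMU differ in detail.
 * The noinline accessors act as a "TLS barrier": callers can never
 * hold a cached TLS address across a suspension point.
 */
#define DEFINE_STATIC_CO_TLS(type, var)                          \
    static __thread type co_tls_##var;                           \
    __attribute__((noinline)) static type get_##var(void)        \
    {                                                            \
        return co_tls_##var;                                     \
    }                                                            \
    __attribute__((noinline)) static void set_##var(type v)      \
    {                                                            \
        co_tls_##var = v;                                        \
    }

/* Example instantiation, echoing the iothread_locked conversion above. */
DEFINE_STATIC_CO_TLS(bool, iothread_locked)
```

The design trade-off is an extra function call per access in exchange for correctness under -flto and aggressive TLS optimization.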

Revision history for this message
In , lijin (lijin-redhat-bugs) wrote :

Hi Yihuang and Boqiao,

Could you do the pre-verify on aarch64 and s390x with the fixed version?

Thanks.

Revision history for this message
In , yihyu (yihyu-redhat-bugs) wrote :

Analyzing the build log, "-flto" is still not in the configure settings - is this expected? The full configure is here: http://download.eng.bos.redhat.com/brewroot/vol/rhel-9/packages/qemu-kvm/7.0.0/7.el9/data/logs/aarch64/build.log

I can see that x86 enabled -flto: http://download.eng.bos.redhat.com/brewroot/vol/rhel-9/packages/qemu-kvm/7.0.0/7.el9/data/logs/x86_64/build.log, and I also compiled with -flto myself on aarch64; it passed. Steps as in Eric's bug 2000479:

# ./configure --cc=clang --cxx=/bin/false --prefix=/usr --libdir=/usr/lib64 --datadir=/usr/share --sysconfdir=/etc --interp-prefix=/usr/qemu-%M --localstatedir=/var --docdir=/usr/share/doc --libexecdir=/usr/libexec '--extra-ldflags=-Wl,-z,relro -Wl,--as-needed -Wl,-z,now -flto' '--extra-cflags=-O2 -flto -fexceptions -g -grecord-gcc-switches -pipe -Wall -Werror=format-security -Wp,-D_FORTIFY_SOURCE=2 -Wp,-D_GLIBCXX_ASSERTIONS --config /usr/lib/rpm/redhat/redhat-hardened-clang.cfg -fstack-protector-strong -fasynchronous-unwind-tables ' --target-list=aarch64-softmmu --enable-kvm --extra-cflags=-Wstrict-prototypes --extra-cflags=-Wredundant-decls --enable-trace-backends=log --enable-seccomp --enable-cap-ng --disable-werror --without-default-devices --disable-capstone --target-list='aarch64-softmmu'

# make check-unit -j16
......
......
22/92 qemu:unit / test-block-iothread OK 0.64s 16 subtests passed
......
......
Ok: 92
Expected Fail: 0
Fail: 0
Unexpected Pass: 0
Skipped: 0
Timeout: 0

So in my opinion, maybe we can also enable -flto on other architectures?

Anyway, the test on the official build passed.

Result: PASS as no Critical Regression or TestBlocker found

Test Environment:
Host Distro: RHEL-9.1.0-20220627.0 BaseOS aarch64
Host Kernel: kernel-5.14.0-119.el9.aarch64
QEMU: qemu-kvm-7.0.0-7.el9.aarch64
edk2: edk2-aarch64-20220526git16779ede2d36-1.el9.noarch
Guest: RHEL.9.1.0

Results Analysis:
From 85 tests executed, 84 passed and 0 warned - a success rate of 98.82% (excluding SKIP and CANCEL).
1 test case failed due to an automation issue but passed on retest.

New bugs(0):
Existing bugs(0):

Job link:
http://10.0.136.47/6759356/results.html

Revision history for this message
In , yfu (yfu-redhat-bugs) wrote :

QE bot(pre verify): Set 'Verified:Tested,SanityOnly' as gating/tier1 test pass.

Revision history for this message
In , eric.auger (eric.auger-redhat-bugs) wrote :

(In reply to Yihuang Yu from comment #43)
> Analyzed the build log, "-flto" is still not in the configure setting, is
> this expected? The full configure from here:
> http://download.eng.bos.redhat.com/brewroot/vol/rhel-9/packages/qemu-kvm/7.0.
> 0/7.el9/data/logs/aarch64/build.log
>
> I can see x86 enabled -flto:
> http://download.eng.bos.redhat.com/brewroot/vol/rhel-9/packages/qemu-kvm/7.0.
> 0/7.el9/data/logs/x86_64/build.log, and I also compiled with -flto myself on
> aarch64, it passed. Steps refer to Eric's bug 2000479
>
> # ./configure --cc=clang --cxx=/bin/false --prefix=/usr --libdir=/usr/lib64
> --datadir=/usr/share --sysconfdir=/etc --interp-prefix=/usr/qemu-%M
> --localstatedir=/var --docdir=/usr/share/doc --libexecdir=/usr/libexec
> '--extra-ldflags=-Wl,-z,relro -Wl,--as-needed -Wl,-z,now -flto'
> '--extra-cflags=-O2 -flto -fexceptions -g -grecord-gcc-switches -pipe -Wall
> -Werror=format-security -Wp,-D_FORTIFY_SOURCE=2 -Wp,-D_GLIBCXX_ASSERTIONS
> --config /usr/lib/rpm/redhat/redhat-hardened-clang.cfg
> -fstack-protector-strong -fasynchronous-unwind-tables '
> --target-list=aarch64-softmmu --enable-kvm
> --extra-cflags=-Wstrict-prototypes --extra-cflags=-Wredundant-decls
> --enable-trace-backends=log --enable-seccomp --enable-cap-ng
> --disable-werror --without-default-devices --disable-capstone
> --target-list='aarch64-softmmu'
>
> # make check-unit -j16
> ......
> ......
> 22/92 qemu:unit / test-block-iothread OK 0.64s
> 16 subtests passed
> ......
> ......
> Ok: 92
> Expected Fail: 0
> Fail: 0
> Unexpected Pass: 0
> Skipped: 0
> Timeout: 0
>
> So in my opinion, maybe we can also enable -flto on other architectures?
>
> Anyway, the test result on the official build is passed.
>
> Result: PASS as no Critical Regression or TestBlocker found
>
> Test Environment:
> Host Distro: RHEL-9.1.0-20220627.0 BaseOS aarch64
> Host Kernel: kernel-5.14.0-119.el9.aarch64
> QEMU: qemu-kvm-7.0.0-7.el9.aarch64
> edk2: edk2-aarch64-20220526git16779ede2d36-1.el9.noarch
> Guest: RHEL.9.1.0
>
> Results Analysis:
> From 85 tests executed, 84 passed and 0 warned - success rate of 98.82%
> (excluding SKIP and CANCEL)
> 1 test case failed with an auto issue but retes passed
>
> New bugs(0):
> Existing bugs(0):
>
> Job link:
> http://10.0.136.47/6759356/results.html

While at it, would you have cycles to test with Safestack enabled (https://bugzilla.redhat.com/show_bug.cgi?id=1992968)? We had the same symptoms and maybe Stefan's series also fixes that other BZ. Thank you in advance!

Revision history for this message
In , thuth (thuth-redhat-bugs) wrote :

(In reply to Yihuang Yu from comment #43)
> Analyzed the build log, "-flto" is still not in the configure setting, is
> this expected? The full configure from here:

I think we also need a change to the qemu-kvm.spec file to enable LTO on non-x86 again. There's a hack there at the top of the file that looks like this:

%ifnarch x86_64
     %global _lto_cflags %%{nil}
%endif

Without removing that, we don't get LTO on s390x and aarch64, so I think this cannot be properly verified. @stefanha, could you add such a patch on top, please?

Revision history for this message
In , yihyu (yihyu-redhat-bugs) wrote :

(In reply to Eric Auger from comment #45)
> (In reply to Yihuang Yu from comment #43)
> > Analyzed the build log, "-flto" is still not in the configure setting, is
> > this expected? The full configure from here:
> > http://download.eng.bos.redhat.com/brewroot/vol/rhel-9/packages/qemu-kvm/7.0.
> > 0/7.el9/data/logs/aarch64/build.log
> >
> > I can see x86 enabled -flto:
> > http://download.eng.bos.redhat.com/brewroot/vol/rhel-9/packages/qemu-kvm/7.0.
> > 0/7.el9/data/logs/x86_64/build.log, and I also compiled with -flto myself on
> > aarch64, it passed. Steps refer to Eric's bug 2000479
> >
> > # ./configure --cc=clang --cxx=/bin/false --prefix=/usr --libdir=/usr/lib64
> > --datadir=/usr/share --sysconfdir=/etc --interp-prefix=/usr/qemu-%M
> > --localstatedir=/var --docdir=/usr/share/doc --libexecdir=/usr/libexec
> > '--extra-ldflags=-Wl,-z,relro -Wl,--as-needed -Wl,-z,now -flto'
> > '--extra-cflags=-O2 -flto -fexceptions -g -grecord-gcc-switches -pipe -Wall
> > -Werror=format-security -Wp,-D_FORTIFY_SOURCE=2 -Wp,-D_GLIBCXX_ASSERTIONS
> > --config /usr/lib/rpm/redhat/redhat-hardened-clang.cfg
> > -fstack-protector-strong -fasynchronous-unwind-tables '
> > --target-list=aarch64-softmmu --enable-kvm
> > --extra-cflags=-Wstrict-prototypes --extra-cflags=-Wredundant-decls
> > --enable-trace-backends=log --enable-seccomp --enable-cap-ng
> > --disable-werror --without-default-devices --disable-capstone
> > --target-list='aarch64-softmmu'
> >
> > # make check-unit -j16
> > ......
> > ......
> > 22/92 qemu:unit / test-block-iothread OK 0.64s
> > 16 subtests passed
> > ......
> > ......
> > Ok: 92
> > Expected Fail: 0
> > Fail: 0
> > Unexpected Pass: 0
> > Skipped: 0
> > Timeout: 0
> >
> > So in my opinion, maybe we can also enable -flto on other architectures?
> >
> > Anyway, the test result on the official build is passed.
> >
> > Result: PASS as no Critical Regression or TestBlocker found
> >
> > Test Environment:
> > Host Distro: RHEL-9.1.0-20220627.0 BaseOS aarch64
> > Host Kernel: kernel-5.14.0-119.el9.aarch64
> > QEMU: qemu-kvm-7.0.0-7.el9.aarch64
> > edk2: edk2-aarch64-20220526git16779ede2d36-1.el9.noarch
> > Guest: RHEL.9.1.0
> >
> > Results Analysis:
> > From 85 tests executed, 84 passed and 0 warned - success rate of 98.82%
> > (excluding SKIP and CANCEL)
> > 1 test case failed with an automation issue but the retest passed
> >
> > New bugs(0):
> > Existing bugs(0):
> >
> > Job link:
> > http://10.0.136.47/6759356/results.html
>
> While at it, would you have cycles to test with Safestack enabled
> (https://bugzilla.redhat.com/show_bug.cgi?id=1992968)? We had the same
> symptoms and maybe Stefan's series also fixes that other BZ. Thank you in
> advance!

OK Eric, I will enable both -flto and safe-stack, then trigger a tier1 test run. I will update the test results later.

Revision history for this message
In , yihyu (yihyu-redhat-bugs) wrote :

Unfortunately, I cannot rebuild the qemu-kvm rpm package from the src.rpm with both -flto and safe-stack enabled. So Eric, I don't think now is the right time to enable safe-stack. Maybe we need to tweak some CFLAGS?

# diff /root/rpmbuild/SPECS/qemu-kvm.spec /home/qemu-kvm.spec.backup
7a8,13
> # LTO does not work with the coroutines of QEMU on non-x86 architectures
> # (see BZ 1952483 and 1950192 for more information)
> %ifnarch x86_64
> %global _lto_cflags %%{nil}
> %endif
>
18c24
< %global have_safe_stack 1
---
> %global have_safe_stack 0
22a29,31
> %ifarch x86_64
> %global have_safe_stack 1
> %endif

flto + safe-stack:

 27/128 qemu:unit / test-bdrv-drain ERROR 1.01s killed by signal 11 SIGSEGV
―――――――――――――――――――――――――――――――――――――――― ✀ ――――――――――――――――――――――――――――――――――――――――
stderr:

TAP parsing error: Too few tests run (expected 42, got 20)
(test program exited with status code -11)

 33/128 qemu:unit / test-block-iothread ERROR 1.66s killed by signal 6 SIGABRT
―――――――――――――――――――――――――――――――――――――――― ✀ ――――――――――――――――――――――――――――――――――――――――
stderr:
qemu_aio_coroutine_enter: Co-routine was already scheduled in ''

TAP parsing error: Too few tests run (expected 16, got 10)
(test program exited with status code -6)

Summary of Failures:

 27/128 qemu:unit / test-bdrv-drain ERROR 0.94s killed by signal 11 SIGSEGV
 33/128 qemu:unit / test-block-iothread ERROR 1.54s killed by signal 6 SIGABRT

Ok: 123
Expected Fail: 0
Fail: 2
Unexpected Pass: 0
Skipped: 3
Timeout: 0

Revision history for this message
In , bfu (bfu-redhat-bugs) wrote :

(In reply to lijin from comment #42)
> Hi Yihuang and Boqiao,
>
> Could you do the pre-verify on aarch64 and s390x with the fixed version?
>
> Thanks.

[root@l42 build]# tests/unit/test-block-iothread
# random seed: R02Sdf2c11a84ebf6fa4a3bf33e5f4ba9f5c
1..16
# Start of sync-op tests
ok 1 /sync-op/pread
ok 2 /sync-op/pwrite
ok 3 /sync-op/load_vmstate
ok 4 /sync-op/save_vmstate
ok 5 /sync-op/pdiscard
ok 6 /sync-op/truncate
ok 7 /sync-op/block_status
ok 8 /sync-op/flush
ok 9 /sync-op/check
ok 10 /sync-op/activate
# End of sync-op tests
# Start of attach tests
ok 11 /attach/blockjob
ok 12 /attach/second_node
ok 13 /attach/preserve_blk_ctx
# End of attach tests
# Start of propagate tests
ok 14 /propagate/basic
ok 15 /propagate/diamond
ok 16 /propagate/mirror
# End of propagate tests

I didn't see an error on s390x.

Revision history for this message
In , stefanha (stefanha-redhat-bugs) wrote :

Based on comment 48 there are still issues, probably related to coroutines, that need to be debugged if we want to enable LTO + SafeStack on non-x86 architectures.

The coroutine TLS patches were already merged in 7.0.0-7 for this BZ.

I am on PTO until August. At that time I can investigate the root cause. Let's keep LTO disabled until the root cause is understood.

If someone else wants to take over this BZ while I'm away, feel free.

Revision history for this message
In , yihyu (yihyu-redhat-bugs) wrote :

OK, Stefan.

Then let me move the ITM a bit later, until we decide in which release to fix the compile issue. Thanks for understanding.

Revision history for this message
Paride Legovini (paride) wrote :

I am consistently hitting this when trying to install the Ubuntu arm64 ISO image in a VM. A minimal command line that reproduces the problem is (host system is jammy arm64):

qemu-system-aarch64 -enable-kvm -m 2048 -M virt -cpu host -nographic -drive file=flash0.img,if=pflash,format=raw -drive file=flash1.img,if=pflash,format=raw -drive file=image2.qcow2,if=virtio -cdrom jammy-live-server-arm64.iso

The installation never completes; it always crashes.
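For reference, a sketch of how the two pflash files in the command above are typically prepared on an Ubuntu host. The AAVMF firmware path is the usual Ubuntu location but is an assumption here, not taken from this report; verify it locally:

```shell
# Create two 64 MiB pflash backing files: one for the UEFI firmware
# code, one for persistent UEFI variables.
truncate -s 64M flash0.img flash1.img

# Copy the AAVMF firmware into the first image if the package is
# installed (path assumed; adjust for your system).
if [ -f /usr/share/AAVMF/AAVMF_CODE.fd ]; then
    dd if=/usr/share/AAVMF/AAVMF_CODE.fd of=flash0.img conv=notrunc
fi
```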

Changed in qemu:
status: Expired → Incomplete
Changed in qemu (Ubuntu):
status: Expired → Incomplete
Revision history for this message
Thomas Huth (th-huth) wrote :

Upstream QEMU bugs are now tracked on https://gitlab.com/qemu-project/qemu/-/issues - so if you can reproduce it with the latest version from upstream QEMU, please report it there.

no longer affects: qemu
Revision history for this message
Paride Legovini (paride) wrote :

I tried the qemu package from Kinetic on a Jammy system

$ qemu-system-aarch64 --version
QEMU emulator version 7.0.0 (Debian 1:7.0+dfsg-7ubuntu1)

and it fails in the same way:

qemu-system-aarch64: ../../util/qemu-coroutine-lock.c:57: qemu_co_queue_wait_impl: Assertion `qemu_in_coroutine()' failed.
Aborted (core dumped)

Revision history for this message
Paride Legovini (paride) wrote :

In the end it looks like it's LTO. I rebuilt Jammy's qemu (1:6.2+dfsg-2ubuntu6.3) with

  DEB_BUILD_MAINT_OPTIONS = optimize=-lto

and it doesn't crash anymore. I can't really tell whether the issue is in QEMU's code or due to a compiler bug. The rebuilt package is available in a PPA:

  https://launchpad.net/~paride/+archive/ubuntu/qemu-bpo

which despite the name doesn't actually contain backports.
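A per-architecture variant of this workaround can be sketched as a debian/rules fragment. The variable names follow dpkg conventions (`DEB_HOST_ARCH` comes from dpkg-architecture, `DEB_BUILD_MAINT_OPTIONS` is read by dpkg's buildflags machinery); the exact hunk in any real upload may differ:

```make
# Disable LTO everywhere except amd64, where coroutines are known to
# work with it.
ifneq ($(DEB_HOST_ARCH),amd64)
export DEB_BUILD_MAINT_OPTIONS += optimize=-lto
endif
```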

FWIW Fedora disables LTO on aarch64 (arm64) because of this issue, see:

  https://bugzilla.redhat.com/show_bug.cgi?id=1952483
  https://src.fedoraproject.org/rpms/qemu/c/38b1a6c732bee90f75345c4d07

This is also discussed in this short Fedora mailing list thread:

https://<email address hidden>/msg159665.html

Changed in qemu (Ubuntu):
status: Incomplete → Confirmed
Paride Legovini (paride)
Changed in qemu (Ubuntu):
importance: Low → Medium
Paride Legovini (paride)
tags: added: lto server-todo
Revision history for this message
Paride Legovini (paride) wrote :

@Christian if we agree the path forward here is "disable LTO on non-amd64" I can prepare MPs and uploads for Kinetic and Jammy. I have a reproducer handy which will help with the SRU.

Revision history for this message
Christian Ehrhardt  (paelzer) wrote :

We have recently looked at some coroutine raciness in older versions, but all of the cases I know of should be fixed in 7.0.

If you see this even in 7.0 (as stated above) and have a reproducer we can use, then I'd be absolutely happy if you could prep this change.

The upstream bug discussion seems to indicate !x86 is at fault, so I'm just curious if you found more than riscv.
@Paride, have you had a chance to check and confirm this on !risc64 and/or !7.0 qemu?

Changed in qemu (Ubuntu):
assignee: nobody → Paride Legovini (paride)
Revision history for this message
Paride Legovini (paride) wrote :

Hi, all my findings above are based on testing on arm64, not riscv64. I do confirm seeing the coroutine raciness with 7.0, but I tested it on Jammy, not Kinetic, so another round of tests is needed to confirm Kinetic is affected by this (I think it is).

In any case Jammy needs to be fixed. The machine where I can reliably reproduce the issue is the same we use to run the Ubuntu ISO tests, and given that this is point release week I have to be careful with it, as I don't want to interfere with the ISO testing. After the point release I'll be away from keyboard for a couple of weeks, so the ETA for the fix is end of August.

Revision history for this message
Paride Legovini (paride) wrote :

Confirmed happening on arm64 using a clean Kinetic host system (qemu 1:7.0+dfsg-7ubuntu1).

Changed in qemu (Ubuntu):
status: Confirmed → Triaged
Changed in qemu (Ubuntu Jammy):
status: New → Triaged
assignee: nobody → Paride Legovini (paride)
Revision history for this message
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package qemu - 1:7.0+dfsg-7ubuntu2

---------------
qemu (1:7.0+dfsg-7ubuntu2) kinetic; urgency=medium

  [ Paride Legovini ]
  * d/rules: disable LTO on non-amd64 builds (LP: #1921664)
  * GCC-12 FTBFS (LP: #1988710)
    - d/p/u/lp1988710-silence-openbios-array-bounds-false-positive.patch.
      Silence -Warray-bounds false positive (treated as error)

  [ Christian Ehrhardt ]
  * More on GCC-12 FTBFS (LP 1988710)
    - d/rules: set -O1 for alpha firmware build
    - d/p/u/lp1988710-opensbi-Makefile-fix-build-with-binutils-2.38.patch:
      further FTBFS fixup

 -- Christian Ehrhardt <email address hidden> Mon, 19 Sep 2022 08:07:24 +0200

Changed in qemu (Ubuntu):
status: Triaged → Fix Released
Changed in qemu (Fedora):
importance: Unknown → Medium
status: Unknown → In Progress