autopkgtest-virt-qemu on noble images sometimes hangs doing copydown

Bug #2056461 reported by Paride Legovini
20
This bug affects 3 people
Affects Status Importance Assigned to Milestone
autopkgtest (Ubuntu)
Confirmed
Undecided
Unassigned
linux (Ubuntu)
In Progress
Undecided
Andrea Righi

Bug Description

[Impact]

It seems that kernel 6.8 introduced a regression in the 9pfs related to caching and netfslib, that can cause some user-space apps to read content from files that is not up-to-date (when they are used in a producer/consumer fashion).

It seems that the offending commit is this one:

 80105ed2fd27 ("9p: Use netfslib read/write_iter")

Reverting the commit seems to fix the problem. However the actual bug might be in netfslib or how netfslib is used in the 9p context.

The regression has been reported upstream and we are still investigating (https://lore.kernel.org/lkml/Zj0ErxVBE3DYT2Ea@gpd/).

In the meantime it probably makes sense to temporarily revert the commit as a SAUCE patch. Then we will drop the SAUCE patch once we'll have a proper fix upstream.

[Test case]

The following test should complete correctly without any timeout:

  pull-lp-source -d hello
  autopkgtest-buildvm-ubuntu-cloud -r noble
  autopkgtest -U hello*.dsc -- qemu ./autopkgtest-noble-amd64.img

[Fix]

Revert the following commit (until we have a proper fix upstream):

 80105ed2fd27 ("9p: Use netfslib read/write_iter")

[Regression potential]

We may experience other regressions related to 9pfs with this change, however it's quite unlikely to happen since we are reverting a commit, restoring the previous behavior.

[Original bug report]

autopkgtest-virt-qemu sometimes hangs when running tests on noble images. Originally reported by schopin, who also provided a reproducer:

  pull-lp-source -d hello
  autopkgtest-buildvm-ubuntu-cloud -r noble
  autopkgtest -U hello*.dsc -- qemu ./autopkgtest-noble-amd64.img

I've been able to reproduce it with debugging enabled:

  autopkgtest -ddd -U hello_2.10-3.dsc -- qemu --debug --show-boot /path/to/image

It can get stuck during different stages, but AFAICT always during "copydown" operations, log excerpts follow. It may be a coincidence, but this started happening around the time linux-image-6.8.0-11-generic (6.8.0-11.11) migrated to noble. The testbeds I used booted 6.6 but then rebooted into that 6.8 kernel after being upgraded by autopkgtest.

-- logs --

Removing autopkgtest-satdep (0) ...
[...]
autopkgtest-virt-qemu: DBG: executing copydown /tmp/autopkgtest.output.g8v75e8g/tests-tree/ /t/
autopkgtest-virt-qemu: DBG: ['cmdls', "(['tar', '--directory', '/tmp/autopkgtest.output.g8v75e]
autopkgtest-virt-qemu: DBG: ['srcstdin', "<_io.BufferedReader name='/dev/null'>", 'deststdout']
autopkgtest-virt-qemu: DBG: +< tar --directory /tmp/autopkgtest.output.g8v75e8g/tests-tree/ --
autopkgtest-virt-qemu: DBG: +> /tmp/autopkgtest-qemu.ztmr6f5k/runcmd sh -ec if ! test -d /tmp-
autopkgtest-virt-qemu: DBG: +>?

-- or --

autopkgtest: DBG: sending command to testbed: copydown /tmp/autopkgtest.output.c9utq3bx/tests-tree/ /tmp/autopkgtest.H8NDfW/build.DLR/src/
autopkgtest-virt-qemu: DBG: executing copydown /tmp/autopkgtest.output.c9utq3bx/tests-tree/ /tmp/autopkgtest.H8NDfW/build.DLR/src/
autopkgtest-virt-qemu: DBG: ['cmdls', "(['tar', '--directory', '/tmp/autopkgtest.output.c9utq3bx/tests-tree/', '--warning=none', '-c', '.', '-f', '-'], ['/tmp/autopkgtest-qemu.qtkcgg5l/runcm]
autopkgtest-virt-qemu: DBG: ['srcstdin', "<_io.BufferedReader name='/dev/null'>", 'deststdout', "<_io.BufferedReader name='/dev/null'>", 'devnull_read', <_io.BufferedReader name='/dev/null'>]
autopkgtest-virt-qemu: DBG: +< tar --directory /tmp/autopkgtest.output.c9utq3bx/tests-tree/ --warning=none -c . -f -
autopkgtest-virt-qemu: DBG: +> /tmp/autopkgtest-qemu.qtkcgg5l/runcmd sh -ec if ! test -d /tmp/autopkgtest.H8NDfW/build.DLR/src/; then mkdir -- /tmp/autopkgtest.H8NDfW/build.DLR/src/; fi; cd-
autopkgtest-virt-qemu: DBG: +>?

Revision history for this message
Paride Legovini (paride) wrote :

Communication (including copydown()) with the qemu virt server happens via the serial terminal. That isn't the most solid approach, and I suspect it's racy for some reason.

Revision history for this message
Tim Andersson (andersson123) wrote :

I've discovered this as well. For me, copydown works perfectly fine for individual files, but fails for directories. My reproducer is as follows:

./autopkgtest/runner/autopkgtest -d -o /tmp/autopkgtest-astra-toolbox8/ --apt-pocket=proposed=src:nvidia-cuda-toolkit astra-toolbox/ --env=ADT_TEST_TRIGGERS=nvidia-cuda-toolkit/12.0.1-4build4 -- qemu /home/andersson123/canonical/code/autopkgtest-noble-amd64.img

with failures as follows:
autopkgtest: DBG: sending command to testbed: copydown astra-toolbox// /tmp/autopkgtest.F/
autopkgtest-virt-qemu: DBG: executing copydown astra-toolbox// /tmp/autopkgtest.FixwAf/tr/
autopkgtest-virt-qemu: DBG: ['cmdls', "(['tar', '--directory', 'astra-toolbox//', '--warn]
autopkgtest-virt-qemu: DBG: ['srcstdin', "<_io.BufferedReader name='/dev/null'>", 'destst]
autopkgtest-virt-qemu: DBG: +< tar --directory astra-toolbox// --warning=none -c . -f -
autopkgtest-virt-qemu: DBG: +> /tmp/autopkgtest-qemu.d2mlr22y/runcmd sh -ec if ! test -d-
autopkgtest-virt-qemu: DBG: +>?

Changed in autopkgtest (Ubuntu):
status: New → Confirmed
Revision history for this message
Aleksandr Mikhalitsyn (mihalicyn) wrote :

Looks like I'm facing the same issue:

/testbed-packages
autopkgtest-virt-qemu: DBG: +> cat
autopkgtest-virt-qemu: DBG: +>?
autopkgtest-virt-qemu: DBG: +<?
autopkgtest [19:13:55]: @@@@@@@@@@@@@@@@@@@@ unbuilt-tree lxc-5.0.3/
autopkgtest-virt-qemu: DBG: executing copydown lxc-5.0.3// /tmp/autopkgtest.5ldgRV/ubtree-/
autopkgtest-virt-qemu: DBG: ['cmdls', "(['tar', '--directory', 'lxc-5.0.3//', '--warning=none', '-c', '.', '-f', '-'], ['/tmp/autopkgtest-qemu.1i2y4v3q/runcmd', 'sh', '-ec', 'if ! test -d /tmp/autopkgtest.5ldgRV/ubtree-/; then mkdir -- /tmp/autopkgtest.5ldgRV/ubtree-/; fi; cd /tmp/autopkgtest.5ldgRV/ubtree-/; tar --warning=none --preserve-permissions --extract --no-same-owner -f -'])"]
autopkgtest-virt-qemu: DBG: ['srcstdin', "<_io.BufferedReader name='/dev/null'>", 'deststdout', "<_io.BufferedReader name='/dev/null'>", 'devnull_read', <_io.BufferedReader name='/dev/null'>]
autopkgtest-virt-qemu: DBG: +< tar --directory lxc-5.0.3// --warning=none -c . -f -
autopkgtest-virt-qemu: DBG: +> /tmp/autopkgtest-qemu.1i2y4v3q/runcmd sh -ec if ! test -d /tmp/autopkgtest.5ldgRV/ubtree-/; then mkdir -- /tmp/autopkgtest.5ldgRV/ubtree-/; fi; cd /tmp/autopkgtest.5ldgRV/ubtree-/; tar --warning=none --preserve-permissions --extract --no-same-owner -f -
autopkgtest-virt-qemu: DBG: +>?

Revision history for this message
Tim Andersson (andersson123) wrote :

For anyone encountering this you can use a different virt server other than qemu to workaround the issue

Revision history for this message
Luca Boccassi (bluca) wrote :

I'm hitting this as well, running autopkgtest (from latest main) against an amd64 noble image built with autopkgtest-buildvm-ubuntu-cloud on Debian stable with qemu 7.2.
I can confirm that switching the kernel in the noble image to the 6.5 shipped in Mantic fixes the issue, with no other changes, so it does seem like a problem introduced by the recent noble kernel update.

Revision history for this message
Skia (hyask) wrote :

Okay, I've spend the last 24h digging into that issue, and here are my findings:
* I confirm I can only reproduce the issue with 6.8 kernels, not with 6.5. Just to be sure, I've also tested 6.8.0-32.32 in kernel team's PPA, and it has the issue too.
* Changing `--cpus` or `--ram-size` in `autopkgtest-virt-qemu` doesn't change anything.
* The code responsible for this in autopkgtest is a wonderful piece of black magic: most lives in `virt/autopkgtest-virt-qemu`, and the main part is the `runcmd` script being crafted in the `make_auxverb` function. The hanging is the `# wait until command has exited` loop, that never exits.

The most funny part is the new workaround that I found.
Run the following from your host that runs `autopkgtest`. Obvious change the port number depending on which free port was available for `autopkgtest`.
```
ssh -o UserKnownHostsFile=/dev/null -o StrictHostKeyChecking=no -p 10022 ubuntu@localhost stat /run/autopkgtest/shared/job\*/stdin
```
For some reason, `stat`ing that file in the testbed makes the loop exit and everything continues as normal, meaning you can use the test results normally.
I've seen this work at least 20 times on my machine, but I'd like someone else to confirm that this workaround indeed works.

This bug is becoming more and more interesting, stay tuned for more adventures!

Revision history for this message
Luca Boccassi (bluca) wrote :

Can confirm the stat workaround works for me as well

Revision history for this message
Paride Legovini (paride) wrote :

Confirmed working: the `stat` immediately unblocks the copydown.

Revision history for this message
Luca Boccassi (bluca) wrote :
Revision history for this message
Paride Legovini (paride) wrote :

AIUI that's an issue with a test which expects a timeout, and fails when things happen too fast. At first glance it doesn't seem to be related.

Skia (hyask)
tags: added: rls-nn-incoming
Revision history for this message
Skia (hyask) wrote :

Latest findings:
* The deepest loop that actually never exits is actually the `while not os.path.exists(fexit):` in the `eofcat` script. This is because the `block = os.read(0, 1000000)` always reads nothing.
* From the guest:
  * `head`ing the `stdin` file doesn't help.
  * `tail`ing the `stdin` file does help, it unblocks the loop, just the same way as `stat`ing it.
* From the host:
  * `stat`ing `stdin` doesn't change anything.
  * Appending more data to it doesn't change anything either.
  * This adds up with the fact that it's a kernel change in the guest that triggered the bug.
* That `stdin` file is stored on a 9p shared FS. This might be a related change, but nothing confirmed yet: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=c9b93cafb69cbbbe375de29c1ebf410dbc33ebfc

Revision history for this message
Andrea Righi (arighi) wrote :

I've done some bisecting and it looks like the culprit is this commit: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=80105ed2fd2715fb09a8fdb0655a8bdc86c120db

I'll prepare a test kernel with this one reverted, then if it's confirmed, we can figure out a proper fix.

Revision history for this message
Andrea Righi (arighi) wrote :

I've uploaded a test kernel here:

https://people.canonical.com/~arighi/lp2056461/amd64/

With this one I can't reproduce the problem. It'd be great if someone else could test it as well, to confirm that it's actually fixing the problem.

Note: this kernel is a 6.8.0-32.32 with this commit reverted:

 commit 80105ed2fd27 ("9p: Use netfslib read/write_iter")

Changed in linux (Ubuntu):
status: New → In Progress
assignee: nobody → Andrea Righi (arighi)
Revision history for this message
Paride Legovini (paride) wrote :

I tested the the 6.8.0-32.32+lp2056461 kernel and it appears to reliably fix the problem. I tested it several times making autopkgtest 'copydown' a large source tree, and also verified that using an image with the standard kernel I did hit the problem.

Thanks Andrea for doing the bisect!

Andrea Righi (arighi)
description: updated
To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.