autopkgtest-virt-qemu on noble images sometimes hangs doing copydown

Bug #2056461 reported by Paride Legovini
32
This bug affects 5 people
Affects Status Importance Assigned to Milestone
Linux
Confirmed
Medium
autopkgtest (Debian)
New
Unknown
autopkgtest (Ubuntu)
Confirmed
Undecided
Unassigned
linux (Ubuntu)
In Progress
Undecided
Andrea Righi

Bug Description

[Impact]

It seems that kernel 6.8 introduced a regression in the 9pfs related to caching and netfslib, that can cause some user-space apps to read content from files that is not up-to-date (when they are used in a producer/consumer fashion).

It seems that the offending commit is this one:

 80105ed2fd27 ("9p: Use netfslib read/write_iter")

Reverting the commit seems to fix the problem. However the actual bug might be in netfslib or how netfslib is used in the 9p context.

The regression has been reported upstream and we are still investigating (https://lore.kernel.org/lkml/Zj0ErxVBE3DYT2Ea@gpd/).

In the meantime it probably makes sense to temporarily revert the commit as a SAUCE patch. Then we will drop the SAUCE patch once we'll have a proper fix upstream.

[Test case]

The following test should complete correctly without any timeout:

  pull-lp-source -d hello
  autopkgtest-buildvm-ubuntu-cloud -r noble
  autopkgtest -U hello*.dsc -- qemu ./autopkgtest-noble-amd64.img

[Fix]

Revert the following commit (until we have a proper fix upstream):

 80105ed2fd27 ("9p: Use netfslib read/write_iter")

[Regression potential]

We may experience other regressions related to 9pfs with this change, however it's quite unlikely to happen since we are reverting a commit, restoring the previous behavior.

[Original bug report]

autopkgtest-virt-qemu sometimes hangs when running tests on noble images. Originally reported by schopin, who also provided a reproducer:

  pull-lp-source -d hello
  autopkgtest-buildvm-ubuntu-cloud -r noble
  autopkgtest -U hello*.dsc -- qemu ./autopkgtest-noble-amd64.img

I've been able to reproduce it with debugging enabled:

  autopkgtest -ddd -U hello_2.10-3.dsc -- qemu --debug --show-boot /path/to/image

It can get stuck during different stages, but AFAICT always during "copydown" operations, log excerpts follow. It may be a coincidence, but this started happening around the time linux-image-6.8.0-11-generic (6.8.0-11.11) migrated to noble. The testbeds I used booted 6.6 but then rebooted into that 6.8 kernel after being upgraded by autopkgtest.

-- logs --

Removing autopkgtest-satdep (0) ...
[...]
autopkgtest-virt-qemu: DBG: executing copydown /tmp/autopkgtest.output.g8v75e8g/tests-tree/ /t/
autopkgtest-virt-qemu: DBG: ['cmdls', "(['tar', '--directory', '/tmp/autopkgtest.output.g8v75e]
autopkgtest-virt-qemu: DBG: ['srcstdin', "<_io.BufferedReader name='/dev/null'>", 'deststdout']
autopkgtest-virt-qemu: DBG: +< tar --directory /tmp/autopkgtest.output.g8v75e8g/tests-tree/ --
autopkgtest-virt-qemu: DBG: +> /tmp/autopkgtest-qemu.ztmr6f5k/runcmd sh -ec if ! test -d /tmp-
autopkgtest-virt-qemu: DBG: +>?

-- or --

autopkgtest: DBG: sending command to testbed: copydown /tmp/autopkgtest.output.c9utq3bx/tests-tree/ /tmp/autopkgtest.H8NDfW/build.DLR/src/
autopkgtest-virt-qemu: DBG: executing copydown /tmp/autopkgtest.output.c9utq3bx/tests-tree/ /tmp/autopkgtest.H8NDfW/build.DLR/src/
autopkgtest-virt-qemu: DBG: ['cmdls', "(['tar', '--directory', '/tmp/autopkgtest.output.c9utq3bx/tests-tree/', '--warning=none', '-c', '.', '-f', '-'], ['/tmp/autopkgtest-qemu.qtkcgg5l/runcm]
autopkgtest-virt-qemu: DBG: ['srcstdin', "<_io.BufferedReader name='/dev/null'>", 'deststdout', "<_io.BufferedReader name='/dev/null'>", 'devnull_read', <_io.BufferedReader name='/dev/null'>]
autopkgtest-virt-qemu: DBG: +< tar --directory /tmp/autopkgtest.output.c9utq3bx/tests-tree/ --warning=none -c . -f -
autopkgtest-virt-qemu: DBG: +> /tmp/autopkgtest-qemu.qtkcgg5l/runcmd sh -ec if ! test -d /tmp/autopkgtest.H8NDfW/build.DLR/src/; then mkdir -- /tmp/autopkgtest.H8NDfW/build.DLR/src/; fi; cd-
autopkgtest-virt-qemu: DBG: +>?

Revision history for this message
Paride Legovini (paride) wrote :

Communication (including copydown()) with the qemu virt server happens via the serial terminal. That isn't the most solid approach, and I suspect it's racy for some reason.

Revision history for this message
Tim Andersson (andersson123) wrote :

I've discovered this as well. For me, copydown works perfectly fine for individual files, but fails for directories. My reproducer is as follows:

./autopkgtest/runner/autopkgtest -d -o /tmp/autopkgtest-astra-toolbox8/ --apt-pocket=proposed=src:nvidia-cuda-toolkit astra-toolbox/ --env=ADT_TEST_TRIGGERS=nvidia-cuda-toolkit/12.0.1-4build4 -- qemu /home/andersson123/canonical/code/autopkgtest-noble-amd64.img

with failures as follows:
autopkgtest: DBG: sending command to testbed: copydown astra-toolbox// /tmp/autopkgtest.F/
autopkgtest-virt-qemu: DBG: executing copydown astra-toolbox// /tmp/autopkgtest.FixwAf/tr/
autopkgtest-virt-qemu: DBG: ['cmdls', "(['tar', '--directory', 'astra-toolbox//', '--warn]
autopkgtest-virt-qemu: DBG: ['srcstdin', "<_io.BufferedReader name='/dev/null'>", 'destst]
autopkgtest-virt-qemu: DBG: +< tar --directory astra-toolbox// --warning=none -c . -f -
autopkgtest-virt-qemu: DBG: +> /tmp/autopkgtest-qemu.d2mlr22y/runcmd sh -ec if ! test -d-
autopkgtest-virt-qemu: DBG: +>?

Changed in autopkgtest (Ubuntu):
status: New → Confirmed
Revision history for this message
Aleksandr Mikhalitsyn (mihalicyn) wrote :

Looks like I'm facing the same issue:

/testbed-packages
autopkgtest-virt-qemu: DBG: +> cat
autopkgtest-virt-qemu: DBG: +>?
autopkgtest-virt-qemu: DBG: +<?
autopkgtest [19:13:55]: @@@@@@@@@@@@@@@@@@@@ unbuilt-tree lxc-5.0.3/
autopkgtest-virt-qemu: DBG: executing copydown lxc-5.0.3// /tmp/autopkgtest.5ldgRV/ubtree-/
autopkgtest-virt-qemu: DBG: ['cmdls', "(['tar', '--directory', 'lxc-5.0.3//', '--warning=none', '-c', '.', '-f', '-'], ['/tmp/autopkgtest-qemu.1i2y4v3q/runcmd', 'sh', '-ec', 'if ! test -d /tmp/autopkgtest.5ldgRV/ubtree-/; then mkdir -- /tmp/autopkgtest.5ldgRV/ubtree-/; fi; cd /tmp/autopkgtest.5ldgRV/ubtree-/; tar --warning=none --preserve-permissions --extract --no-same-owner -f -'])"]
autopkgtest-virt-qemu: DBG: ['srcstdin', "<_io.BufferedReader name='/dev/null'>", 'deststdout', "<_io.BufferedReader name='/dev/null'>", 'devnull_read', <_io.BufferedReader name='/dev/null'>]
autopkgtest-virt-qemu: DBG: +< tar --directory lxc-5.0.3// --warning=none -c . -f -
autopkgtest-virt-qemu: DBG: +> /tmp/autopkgtest-qemu.1i2y4v3q/runcmd sh -ec if ! test -d /tmp/autopkgtest.5ldgRV/ubtree-/; then mkdir -- /tmp/autopkgtest.5ldgRV/ubtree-/; fi; cd /tmp/autopkgtest.5ldgRV/ubtree-/; tar --warning=none --preserve-permissions --extract --no-same-owner -f -
autopkgtest-virt-qemu: DBG: +>?

Revision history for this message
Tim Andersson (andersson123) wrote :

For anyone encountering this you can use a different virt server other than qemu to workaround the issue

Revision history for this message
Luca Boccassi (bluca) wrote :

I'm hitting this as well, running autopkgtest (from latest main) against an amd64 noble image built with autopkgtest-buildvm-ubuntu-cloud on Debian stable with qemu 7.2.
I can confirm that switching the kernel in the noble image to the 6.5 shipped in Mantic fixes the issue, with no other changes, so it does seem like a problem introduced by the recent noble kernel update.

Revision history for this message
Skia (hyask) wrote :

Okay, I've spend the last 24h digging into that issue, and here are my findings:
* I confirm I can only reproduce the issue with 6.8 kernels, not with 6.5. Just to be sure, I've also tested 6.8.0-32.32 in kernel team's PPA, and it has the issue too.
* Changing `--cpus` or `--ram-size` in `autopkgtest-virt-qemu` doesn't change anything.
* The code responsible for this in autopkgtest is a wonderful piece of black magic: most lives in `virt/autopkgtest-virt-qemu`, and the main part is the `runcmd` script being crafted in the `make_auxverb` function. The hanging is the `# wait until command has exited` loop, that never exits.

The most funny part is the new workaround that I found.
Run the following from your host that runs `autopkgtest`. Obvious change the port number depending on which free port was available for `autopkgtest`.
```
ssh -o UserKnownHostsFile=/dev/null -o StrictHostKeyChecking=no -p 10022 ubuntu@localhost stat /run/autopkgtest/shared/job\*/stdin
```
For some reason, `stat`ing that file in the testbed makes the loop exit and everything continues as normal, meaning you can use the test results normally.
I've seen this work at least 20 times on my machine, but I'd like someone else to confirm that this workaround indeed works.

This bug is becoming more and more interesting, stay tuned for more adventures!

Revision history for this message
Luca Boccassi (bluca) wrote :

Can confirm the stat workaround works for me as well

Revision history for this message
Paride Legovini (paride) wrote :

Confirmed working: the `stat` immediately unblocks the copydown.

Revision history for this message
Luca Boccassi (bluca) wrote :
Revision history for this message
Paride Legovini (paride) wrote :

AIUI that's an issue with a test which expects a timeout, and fails when things happen too fast. At first glance it doesn't seem to be related.

Skia (hyask)
tags: added: rls-nn-incoming
Revision history for this message
Skia (hyask) wrote :

Latest findings:
* The deepest loop that actually never exits is actually the `while not os.path.exists(fexit):` in the `eofcat` script. This is because the `block = os.read(0, 1000000)` always reads nothing.
* From the guest:
  * `head`ing the `stdin` file doesn't help.
  * `tail`ing the `stdin` file does help, it unblocks the loop, just the same way as `stat`ing it.
* From the host:
  * `stat`ing `stdin` doesn't change anything.
  * Appending more data to it doesn't change anything either.
  * This adds up with the fact that it's a kernel change in the guest that triggered the bug.
* That `stdin` file is stored on a 9p shared FS. This might be a related change, but nothing confirmed yet: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=c9b93cafb69cbbbe375de29c1ebf410dbc33ebfc

Revision history for this message
Andrea Righi (arighi) wrote :

I've done some bisecting and it looks like the culprit is this commit: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=80105ed2fd2715fb09a8fdb0655a8bdc86c120db

I'll prepare a test kernel with this one reverted, then if it's confirmed, we can figure out a proper fix.

Revision history for this message
Andrea Righi (arighi) wrote :

I've uploaded a test kernel here:

https://people.canonical.com/~arighi/lp2056461/amd64/

With this one I can't reproduce the problem. It'd be great if someone else could test it as well, to confirm that it's actually fixing the problem.

Note: this kernel is a 6.8.0-32.32 with this commit reverted:

 commit 80105ed2fd27 ("9p: Use netfslib read/write_iter")

Changed in linux (Ubuntu):
status: New → In Progress
assignee: nobody → Andrea Righi (arighi)
Revision history for this message
Paride Legovini (paride) wrote :

I tested the the 6.8.0-32.32+lp2056461 kernel and it appears to reliably fix the problem. I tested it several times making autopkgtest 'copydown' a large source tree, and also verified that using an image with the standard kernel I did hit the problem.

Thanks Andrea for doing the bisect!

Andrea Righi (arighi)
description: updated
Revision history for this message
Christian Kastner (ckk) wrote :

The Debian ROCm Team has also seen this in their CI instance. One data point that might be relevant is that in our case, all our failures were limited to autopkgtests where there was a reboot between tests.

Revision history for this message
In , linux (linux-linux-kernel-bugs) wrote :
Download full text (3.2 KiB)

Reporting here to track a regression apparently introduced when 9p switched to netfslib in commit 80105ed2fd2715fb09a8fdb0655a8bdc86c120db "9p: Use netfslib read/write_iter" that is being reported on mailing list and in distros:

"Re: [PATCH v5 40/40] 9p: Use netfslib read/write_iter"
https://lore.kernel.org/all/Zj0ErxVBE3DYT2Ea@gpd/

"linux: regression in the 9p protocol in 6.8 breaks autopkgtest qemu jobs (affecting debci)"
https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=1072004

"autopkgtest-virt-qemu on noble images sometimes hangs doing copydown"
https://bugs.launchpad.net/ubuntu/+source/autopkgtest/+bug/2056461

I've been digging into the bug and collecting data 9p developers may better understand. Host is Debian 12 Bookworm amd64 building the example Debian package "hello" and running autopkgtest in a QEMU virtual machine with Debian unstable (sid) and local build of latest mainline kernel (currently 4a4be1ad3a6). The bug was first reported against v6.8.

I was hoping to create a 'simple' reproducer without the complexity of autopkgtest but first need to understand what is going wrong.

After 31 iterations I believe I may finally have found a clue. It appears (in the guest OS of course) v9fs_file_read_iter() doesn't consume the complete content of the host file being shared via 9p.

In the reproducer the host autopkgtest Python scripts control qemu-system-x86 and send commands to the guest OS for it to execute. These include dynamically writing and then executing a python script named "eofcat" [0] that is used as a "funnel" for stdin [1].

"eofcat" spins on a non-blocking read of its fd 0 until a specific file exists or the number of bytes read from fd 0 is >= to a limit value read from the file name passed in as argv[1].

By modifying the guest's mount -t 9p command [2] options to include "debug=0x08" and using "| ts %.T |" in logging pipelines to capture microsecond accurate timestamps I was able to identify the guest kernel log entries when autopkgtest 'hangs' (spins).

What I see is the host file "stdin" being redirected to "eofcat" fd 0 and guest kernel shows a series of v9fs_file_read_iter(count=1000000, ...) calls with "offset" increasing each time: 0, 1000000, 2000000, 3000000, 4000000, 5000000, 5416960. At that point it starts spinning - repeatedly retrying a v9fs_file_read_iter(count=1000000, offset=5416960) and a single read of "stdin_eof" (to get the number of bytes to expect - the "limit" variable) - and that has 8253440 which is also the size of "stdin" file on the host.

In the guest a stat or ls or tail of the shared "stdin" file will break the deadlock and operations continue. No 9p cache options are in use.

In following comments I'll attach run logs of the host autopkgtest and the guest kernel dmesg with microsecond precision timestamps with mount option debug=0x0f ( P9_DEBUG_VFS | P9_DEBUG_9P | DEBUG_CURRENT | P9_DEBUG_ERROR ) with the exact timestamp where the host hangs, which can then be found in the guest kernel log.

[0] https://sources.debian.org/src/autopkgtest/5.35/virt/autopkgtest-virt-qemu/?hl=320#L314

[1] https://sources.debian.org/src/autopkgtest/5.35/virt/autopkgtest-virt-qemu/?hl=492#L487

[2] http...

Read more...

Revision history for this message
In , linux (linux-linux-kernel-bugs) wrote :

Interesting finding: with debug=0x0f the bug is not triggered - I have to reduce the debug flags to trigger it. Currently working up from P9_DEBUG_VFS:

With run #34 and debug=0x08 (P9_DEBUG_VFS) I captured logs (attached). Hang occurs in autopkgtest at timestamp 10:08:37.390685 (UTC+1). In the guest kernel log that maps to timestamp 09:08:37.408129.

The first read on fd 0 is:

09:08:37.481590 v9fs_file_read_iter (1049): fid 6 count 1000000 offset 0

followed by subsequent reads:

09:08:37.517808 v9fs_file_read_iter (1049): fid 6 count 1000000 offset 1000000
...
09:08:37.565864 v9fs_file_read_iter (1049): fid 6 count 1000000 offset 2000000
...
09:08:37.609522 v9fs_file_read_iter (1049): fid 6 count 1000000 offset 3000000
...
09:08:37.653503 v9fs_file_read_iter (1049): fid 6 count 1000000 offset 4000000
...
09:08:37.693647 v9fs_file_read_iter (1049): fid 6 count 1000000 offset 5000000
...
09:08:37.716325 v9fs_file_read_iter (1049): fid 6 count 1000000 offset 5275648

At which point it starts 'spinning' waiting for more data.

On the host we see:

~/tmp/9pfs$ ls -latr autopkgtest-qemu.ngbyoymu/shared/job.mon5m14s/
total 8072
-rw-r--r-- 1 root root 0 2024-05-31 10:08 stdout
-rw-r--r-- 1 root root 0 2024-05-31 10:08 stderr
drwxrwxrwt 3 root root 4096 2024-05-31 10:08 ..
-rwxr-xr-x 1 root root 8253440 2024-05-31 10:08 stdin
-rw-r--r-- 1 root root 7 2024-05-31 10:08 stdin_eof
drwxr-xr-x 2 root root 4096 2024-05-31 10:08 .

Revision history for this message
In , linux (linux-linux-kernel-bugs) wrote :

Created attachment 306383
#34 autopkgtest log

Revision history for this message
In , linux (linux-linux-kernel-bugs) wrote :

Created attachment 306384
#34 kernel log, debug=0x08

Revision history for this message
In , linux (linux-linux-kernel-bugs) wrote :

Run #36 with debug=0x0c ( P9_DEBUG_VFS | P9_DEBUG_9P). Hang occurs in autopkgtest at timestamp 10:56:44.002011 (UTC+1). In the guest kernel log that maps to timestamp 09:56:44.666943 where the first v9fs_file_read_iter(count=1000000, offset=0) occurs.

What is interesting is the reads fetch slightly more data before spinning at:

09:56:56.732407 v9fs_file_read_iter (885): fid 6 count 1000000 offset 7331840

Do we have a timing/race condition here?

Revision history for this message
In , linux (linux-linux-kernel-bugs) wrote :

Created attachment 306386
#36 autopkgtest log

Revision history for this message
In , linux (linux-linux-kernel-bugs) wrote :

Created attachment 306387
#36 kernel log, debug=0x0c

Revision history for this message
In , linux (linux-linux-kernel-bugs) wrote :

Considering the (changed) source-code of fs/9p/vfs_file.c::v9fs_file_read_iter() in commit 80105ed2fd2715fb09a8fdb0655a8bdc86c120db.

Prior to the commit there was a code path specifically for O_NONBLOCK (as well as for p9L_DIRECT). Now there is only a code-path for P9L_DIRECT.

kernel.36.log shows that within this function the p9L_DIRECT path is taken since there is no debug message from the final debug message:

       if (fid->mode & P9L_DIRECT)
               return netfs_unbuffered_read_iter(iocb, to);

       p9_debug(P9_DEBUG_VFS, "(cached)\n");
       return netfs_file_read_iter(iocb, to);
}

I'm not familiar enough with the netfs layer to understand if O_NONBLOCK is being handled (correctly) there - if special handling is indeed required?

Revision history for this message
In , linux (linux-linux-kernel-bugs) wrote :

Update: 6.10-rc2 still fails the same way.

I've been writing a (minimal!) reproducer script that uses the same mechanisms as autopkgtest to copy files across but without any of the autopkgtest infrastructure - just host, qemu guest, and the shared 9P directory. Even with 2500 iterations of the tar-based copy of the "hello" package source and build product so far it has failed to reproduce. I intend to add lots of unique individual file copies since autopkgtest does a lot fo that prior to hitting the tar-based transfer hang.

Revision history for this message
In , dhowells (dhowells-linux-kernel-bugs) wrote :

Can you try turning on some netfs tracepoints?

echo 1 > /sys/kernel/debug/tracing/events/netfs/netfs_rreq/enable
echo 1 > /sys/kernel/debug/tracing/events/netfs/netfs_sreq/enable
echo 1 > /sys/kernel/debug/tracing/events/netfs/netfs_read/enable
echo 1 > /sys/kernel/debug/tracing/events/netfs/netfs_write/enable
echo 1 > /sys/kernel/debug/tracing/events/netfs/netfs_write_iter/enable

and then capturing the result?

Revision history for this message
In , linux (linux-linux-kernel-bugs) wrote :

Created attachment 306436
Contains run #120 autopkgtest and kernel log files

v6.10-rc2 netfs debug logging as requested in run #120. Attachment contains the autopkgtest and virtual machine kernel log files, timestamped to aid synchronisation between them (autopkgtest is UTC+1 though)

Revision history for this message
In , linux (linux-linux-kernel-bugs) wrote :

Created attachment 306437
Test case framework

Although I've not get managed to reproduce the bug with the "minimal" (ha!) reproducer script and tests it is in a state now that others should be able to easily execute it, and add more tests, and otherwise try to replicate the bug.

Once extracted read ./bug218916-reproducer.bash or get it to explain how to use it with:

./bug218917-reproducer.bash

So far it has not reproduced the bug, but it is a framweork and one can easily add additional test scripts (that are source-d) in ./bug218916.d/

Currently it has test_autopkgtest that replicates the sequence of copy{down,up} operations autopkgtest does but without the autopkgtest infrastructure, and test_tar that does the single copydown of a directory (which causes the transfer to be piped via "tar").

It doesn't execute quite as fast native autopkgtest because it has to use a "read -t 0.5" timeout for extracing replies from the controlling ssh bash coproc. That was implemented to avoid repeatedly suffering the delay of creating an ssh session for every transfer operation.

Revision history for this message
In , linux (linux-linux-kernel-bugs) wrote :

Edit: disastrous typos in my last comment! the command name should, of course, be:

./bug218916-reproducer.bash

Revision history for this message
In , linux (linux-linux-kernel-bugs) wrote :

Created attachment 306439
Contains run #121 autopkgtest, kernel, trace log files

Realised I missed the trace log from run #120! This tar archive for run #121 contains it

Revision history for this message
In , linux (linux-linux-kernel-bugs) wrote :

For run #121 the reading of "stdin" by "eofcat" is at dmesg timestamp 13:46:26 with v9fs_read_iter (760) at offset 0 and is at 35.861430 netfs_read s=0 in the trace.

Revision history for this message
In , linux (linux-linux-kernel-bugs) wrote :

Testing different scenarios in autopkgtest scripts, and especially "eofcat" Python that reads from the 9P mount in the guest, I experimented with setting a larger time.sleep(0.2) instead of 0.05, and separately, removing the O_NONBLOCK flag from its stdin FD. Neither showed any difference - the hang occurs.

I'm about to test with enabling internal.h's CONFIG_NETFS_DEBUG=y to capture more clues from the actual IO dispatch functions.

Revision history for this message
In , linux (linux-linux-kernel-bugs) wrote :

Created attachment 306443
Contains run #132 netfs_debug autopkgtest, kernel, trace log files

dmesg.132.log now contains the additional netfs_debug messages. the read of "stdin" file by "eofcat" starts at 10:23:43.

Revision history for this message
In , linux (linux-linux-kernel-bugs) wrote :

Created attachment 306444
Fix option 1

At last I've found the problem (which is, as I thought a week ago, due to 'stat'), but quite subtle.

With extra debugging enabled on the mount (debug=0x0f rather than debug=0x08) it works, so with debug=0x08, after adding additional debug messages in v9fs_stat2inode_dotl(), I can see that in the failing case there are 2 calls to that function immediately before the v9fs_file_read_iter() starts. In those calls the netfs->remote_i_size is updated from stat->st_size and I see st_size has different (increasing) values but smaller than the actual file size.

So, effectively, all the data hasn't arrived before the read and once the read starts the size is no longer updated.

When debugging the additional cumulative delays of writing extreme amounts of debug info to the kernel log gives enough time for the data to arrive so that when v9fs_stat2inode_dotl() is called the final size is correct.

This possible fix updates stat if a read returns 0 bytes.

Tested with 6.8.12 and 6.10-rc2 repeatedly (10+ iterations so far) and no failures with debug logging completely disabled.

Revision history for this message
In , linux (linux-linux-kernel-bugs) wrote :

Extract from the fix log with additional debugging enabled:

9pnet: -- v9fs_file_read_iter (889): fid 6 count 1000000 offset 5580800
[eofcat] ==> netfs_unbuffered_read_iter_locked()
[eofcat] ==> netfs_begin_read(R=111 552800-646a3f)
[eofcat] submit 552800 + 0 >= 552800
9pnet: -- v9fs_file_read_iter (889): iocb->ki_flags 0 netfs_unbuffered_read_iter()=0
9pnet: -- v9fs_file_read_iter (889): no data; refreshing inode
9pnet: -- v9fs_stat2inode_dotl (889): P9_STATS_BASIC=2047 V9FS_STAT2INODE_KEEP_ISIZE=0 P9_STATS_SIZE=512
9pnet: -- v9fs_stat2inode_dotl (889): P9_STATS_BASIC: remote_i_size=5580800 stat->st_size=8253440

Revision history for this message
In , linux (linux-linux-kernel-bugs) wrote :

Created attachment 306445
Fix v2 (deals with .U as well as .L )

Revised patch that should handle .U as well as .L protocol.

10 iterations each of autopkgtests with v6.8.12 and v6.10-rc2 all successful.

Revision history for this message
TJ (tj) wrote :

I added the correct fix to the upstream bug report last weekend (8 June):

https://bugzilla.kernel.org/show_bug.cgi?id=218916

Waiting for a response from David Howells as to whether there is another, lower-level, fix in netfs rather than 9p itself.

Changed in autopkgtest (Debian):
status: Unknown → Fix Committed
Revision history for this message
In , kip (kip-linux-kernel-bugs) wrote :

Thanks TJ. I tried applying your patch and can confirm I am still seeing the same problem with autopkgtest on Ubuntu Noble:

...
qemu-system-x86_64: terminating on signal 15 from pid 5349 (/usr/bin/python3)
autopkgtest [17:52:50]: ERROR: testbed failure: sent `auxverb_debug_fail', got `timeout', expected `ok...'

Christian Kastner (ckk)
Changed in autopkgtest (Debian):
status: Fix Committed → Unknown
Changed in autopkgtest (Debian):
status: Unknown → New
Changed in linux:
importance: Unknown → Medium
status: Unknown → Confirmed
Revision history for this message
In , linux (linux-linux-kernel-bugs) wrote :

@Kip can you provide a complete autopkgtest log ? I use the following:

sudo TMPDIR=$PWD autopkgtest -ddd -U ./hello_2.10-3.dsc -- autopkgtest-virt-qemu --debug --s
how-boot ./sid.img |& ts %.T | /usr/bin/tee ./aptest.log

Revision history for this message
In , kip (kip-linux-kernel-bugs) wrote :
Download full text (3.3 KiB)

I'm having a bit of trouble getting your commands to work. I'm assuming the first two lines are one command wrapped, and the third line is its own?

I'm also assuming I should also replace hello_2.10-3.dsc with hello_2.10-3build2.dsc on my system, and the sid image with autopkgtest-noble-amd64.img.

When I try to run the first command I see:

$ sudo TMPDIR=$PWD autopkgtest -ddd -U ./hello_2.10-3build2.dsc -- autopkgtest-virt-qemu --debug --s

autopkgtest: DBG: autopkgtest options: Namespace(override_control=None, only_tests=[], skip_tests=None, built_binaries=True, packages=['./hello_2.10-3build2.dsc'], output_dir=None, logfile=None, summary=None, verbosity=2, setup_commands=['(O=$(bash -o pipefail -ec \'apt-get update | tee /proc/self/fd/2\') ||{ [ "${O%404*Not Found*}" = "$O" ] || exit 100; sleep 15; apt-get update; } || { sleep 60; apt-get update; } || false) && $(command -v eatmydata || true) apt-get dist-upgrade -y -o Dpkg::Options::="--force-confnew" && $(command -v eatmydata || true) apt-get --purge autoremove -y'], setup_commands_boot=[], add_apt_sources=[], add_apt_releases=[], pin_packages=[], apt_pocket=[], apt_default_release=None, enable_apt_fallback=True, copy=[], env=[], ignore_restrictions=[], user=None, gainroot=None, shell_fail=False, shell=False, timeout=0, timeout_short=None, timeout_copy=None, timeout_install=None, timeout_test=None, timeout_build=None, timeout_factor=1.0, set_lang=None, auto_control=True, build_parallel=None, needs_internet='run', validate=False)
autopkgtest: DBG: virt-runner arguments: ['autopkgtest-virt-qemu', '--debug', '--s']
autopkgtest: DBG: actions: [('source', './hello_2.10-3build2.dsc', True)]
autopkgtest: DBG: build binaries: True
autopkgtest: DBG: testbed init
autopkgtest [10:33:25]: starting date and time: 2024-06-14 10:33:25-0700
autopkgtest [10:33:25]: version 5.35
autopkgtest [10:33:25]: host kip-desktop; command line: /usr/bin/autopkgtest -ddd -U ./hello_2.10-3build2.dsc -- autopkgtest-virt-qemu --debug --s
usage: autopkgtest-virt-qemu [-h] [--qemu-architecture QEMU_ARCHITECTURE]
                             [--dpkg-architecture DPKG_ARCHITECTURE]
                             [-q QEMU_COMMAND] [-o OVERLAY_DIR] [-u USER]
                             [-p PASSWORD] [-c CPUS] [--ram-size RAM_SIZE]
                             [--timeout-reboot SECONDS] [--show-boot]
                             [--timeout-poweroff SECONDS] [-d]
                             [--qemu-options QEMU_OPTIONS] [--baseimage]
                             [--boot {auto,bios,efi,ieee1275,none}] [--efi]
                             images [images ...]
autopkgtest-virt-qemu: error: the following arguments are required: images
autopkgtest: DBG: sending command to testbed: auxverb_debug_fail
autopkgtest: DBG: TestbedFailure cannot send to testbed: [Errno 32] Broken pipe
autopkgtest: DBG: testbed stop
autopkgtest: DBG: testbed close, scratch=None
autopkgtest: DBG: sending command to testbed: quit
autopkgtest: DBG: TestbedFailure cannot send to testbed: [Errno 32] Broken pipe
autopkgtest: DBG: testbed stop
autopkgtest: DBG: auxverb_debug_fail failed: cannot send to testbed: [Errno 32] Broken pipe
autopkgtest: DBG: TestbedFailu...

Read more...

Revision history for this message
In , kip (kip-linux-kernel-bugs) wrote :

Created attachment 306462
Kip's log

Apologies, I misread your command line. I ran the following and this is my generated log.

$ sudo TMPDIR=$PWD autopkgtest -ddd -U ./hello_2.10-3build2.dsc -- qemu --debug --show-boot ~/Projects/sbuild/images/autopkgtest-noble-amd64.img |& ts %.T | /usr/bin/tee ./aptest.log

Revision history for this message
In , linux (linux-linux-kernel-bugs) wrote :

@Kip Thanks for the log. The kernel used there does not have my patch applied that I can tell - unless it is included in the Ubuntu kernel build already - but unlikely given that kernel is dated May 20th:

12:09:49.192706 [ 0.000000] Linux version 6.8.0-35-generic (buildd@lcy02-amd64-020) (x86_64-linux-gnu-gcc-13 (Ubuntu 13.2.0-23ubuntu4) 13.2.0, GNU ld (GNU Binutils for Ubuntu) 2.42) #35-Ubuntu SMP PREEMPT_DYNAMIC Mon May 20 15:51:52 UTC 2024 (Ubuntu 6.8.0-35.35-generic 6.8.4)

In my case I manually boot the image file with QEMU and install patched kernel builds into it (using ssh/scp/rsync), do an `apt update; apt upgrade` to prevent autopkgtest doing that in just the (temporary) overlay image it creates for each run. Then ensure the required GRUB menu entry will be used with 'grub-set-default' [0].

[0] grub-set-default "1>4" is the kind of syntax required; this would select the 5th "menuentry" from the 1st "submenu" (indexes begin at 0). The value is written into `/boot/grub/grubenv` that is read early when GRUB loads the generated grub.cfg at boot-time.

To identify the correct value I do a manual count using this (one-line!) command:

awk '{id=gensub(/.*(gnulinux-[^'"'"']*).*/,"\\1","g")};/menuentry /{prefix="";if(submenu){prefix=submenu ">";}print("ENTRY \"" prefix menu++ "\"",id)} /^submenu/{print("SUBMENU",++submenu,id);menu_saved=menu;menu=0}/^}/ && submenu {--submenu;menu=menu_saved}' /boot/grub/grub.cfg

Revision history for this message
In , kip (kip-linux-kernel-bugs) wrote :

Ah apologies TJ. I thought the patched kernel needed to be running on the host and not in the guest. I will try and build the kernel now and modify the image for noble via QEMU.

But it might be easier, if you were able to create an image for noble for amd64 for me that I could try because otherwise I might end up fiddling with grub incorrectly, etc., and not get the new kernel to be used.

Revision history for this message
In , kip (kip-linux-kernel-bugs) wrote :

TJ, I'm happy to say that the hello package's DEP-8 tests all appears to pass with your patched kernel. I'll just keep the .img around for my own packages' testing.

Do you think your fix will make it into Ubuntu Noble in the near future?

Revision history for this message
In , ckk (ckk-linux-kernel-bugs) wrote :

I've testing this with a Debian unstable image and a patched 6.8.12 kernel and I'm also happy to say that it fixed the issue for me, too.

Revision history for this message
In , kip (kip-linux-kernel-bugs) wrote :

Created attachment 306479
Kip's second test log

TJ, I may have spoke too soon. I've tried running it again on the hello source. My image uses your patched kernel via a PPA with the upgrade performed via the --setup-commands:

https://launchpad.net/~kip/+archive/ubuntu/linux-kernel-6.8.0

I am still seeing the following:

ERROR: testbed failure: sent `auxverb_debug_fail', got `timeout', expected `ok...'

To rule out the possibility of an incorrectly setup image, would you be able to generate an Ubuntu Noble image for me to test? I've been getting the image via:

$ autopkgtest-buildvm-ubuntu-cloud --verbose -r noble -a amd64 --cloud-image-url http://cloud-images.ubuntu.com/daily/server

...followed by running the --setup-commands to a bash script containing:

$add-apt-repository ppa:kip/linux-kernel-6.8.0
$ apt upgrade --no-install-suggests --no-install-recommends -y

I will attach my log.

Revision history for this message
Skia (hyask) wrote :

I didn't have time to investigate further, but I encountered this hang again this morning with an Oracular image, with the PPA kernel (autopkgtest shows this version: Linux 6.8.0-35-generic #35ubuntu1+kip1-Ubuntu), and was still able to unlock the situation with the `stat` trick mentioned above.

Revision history for this message
Paride Legovini (paride) wrote :

I can confirm the "kip1" kernel from the PPA does _not_ fix the issue for me. Here is a log excerpt, with the kernel version visible:

autopkgtest [17:32:41]: testbed running kernel: Linux 6.8.0-35-generic #35ubuntu1+kip1-Ubuntu SMP PREEMPT_DYNAMIC Wed Jun 19 02:36:12 U
autopkgtest: DBG: testbed command ['sh', '-c', 'nproc; cat /proc/cpuinfo 2>/dev/null || true'], kind short, sout pipe, serr pipe, env []
autopkgtest: DBG: testbed command exited with code 0
autopkgtest: DBG: Binaries: initialising
autopkgtest [17:32:41]: @@@@@@@@@@@@@@@@@@@@ unbuilt-tree /home/paride/debian/git/autopkgtest/
autopkgtest: DBG: blame += /home/paride/debian/git/autopkgtest/
autopkgtest: DBG: testbed reset: modified=False, deps_installed=[], deps_new=[]
autopkgtest: DBG: testbed command ['mkdir', '-p', '/tmp/autopkgtest.HkRABQ'], kind short, sout raw, serr pipe, env []
autopkgtest: DBG: testbed command exited with code 0
autopkgtest: DBG: sending command to testbed: copydown /home/paride/debian/git/autopkgtest// /tmp/autopkgtest.HkRABQ/ubtree-/
autopkgtest: DBG: got reply from testbed: timeout

To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.