Reproducible hang in generic/430 with xfstest from upstream

Bug #1755999 reported by Po-Hsu Lin on 2018-03-15
This bug affects 1 person
Affects: linux (Ubuntu), xfsprogs (Ubuntu)

Bug Description

With the latest upstream xfstests, the generic/430 test hangs regardless of the filesystem under test (ext4, xfs or btrfs).

The hang appears to be caused by the following command, which copies a range that extends beyond the end of the source file:
/usr/sbin/xfs_io -i -f -c "copy_range -s 4000 -l 2000 /home/ubuntu/test/test-430/file" "/home/ubuntu/test/test-430/beyond"

The copied file has the expected MD5 checksum:
e68d4a150c4e42f4f9ea3ffe4c9cf4ed beyond

But the command will never return.

The size of test-430/file is 5000 bytes, so a copy_range call with source offset 4000 and length 1000 works, but any length greater than 1000 (which reaches past the end of the file) never returns.
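
Because the failing command never returns, it can help to wrap it in a timeout while trying different lengths. The following is only a minimal sketch and not part of xfstests: the helper names `hangs`, `xfs_io_copy_range` and `probe`, and the 10-second limit, are assumptions made for illustration. The xfs_io invocation simply mirrors the command quoted above.

```python
import subprocess


def hangs(cmd, timeout=10.0):
    """Return True if cmd is still running after `timeout` seconds."""
    try:
        subprocess.run(cmd, timeout=timeout, check=False,
                       stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
    except subprocess.TimeoutExpired:
        # subprocess.run() kills and reaps the child before re-raising.
        return True
    return False


def xfs_io_copy_range(src, dst, offset, length):
    """Build the xfs_io command line in the form quoted in this report."""
    return ["/usr/sbin/xfs_io", "-i", "-f", "-c",
            f"copy_range -s {offset} -l {length} {src}", dst]


def probe(src, dst, offset, lengths, timeout=10.0):
    """Map each length to True (timed out) or False (returned in time)."""
    return {length: hangs(xfs_io_copy_range(src, dst, offset, length), timeout)
            for length in lengths}
```

On an affected system, calling `probe("/home/ubuntu/test/test-430/file", "/home/ubuntu/test/test-430/beyond", 4000, (1000, 2000))` would be expected, going by the description above, to report that length 1000 returns and length 2000 times out.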

Steps to reproduce:
 1. Deploy a node running Bionic (a /dev/sdb must be available for the test)
 2. Run:
    sudo apt-get install git python-minimal -y
    git clone --depth=1 -b kteam-xfstest-upstream
    git clone --depth=1 git://
    rm -fr autotest/client/tests
    ln -sf ~/autotest-client-tests autotest/client/tests
 3. Run the test with the following command:
    AUTOTEST_PATH=/home/ubuntu/autotest sudo -E autotest/client/autotest-local --verbose autotest/client/tests/

(The test suite can be built manually, but it is easier to do this with the autotest framework.)

To run only this test once the test partition has been created on /dev/sdb:
    mkdir /home/ubuntu/test
    cd autotest/client/tmp/xfstests/src/xfstests-bld/xfstests-dev
    sudo su
    export TEST_DIR=/home/ubuntu/test
    export TEST_DEV=/dev/sdb1
    ./check generic/430

Tested with the latest mainline kernel, 4.16.0-041600rc5-generic, and the bug still exists.

ProblemType: Bug
DistroRelease: Ubuntu 18.04
Package: linux-image-4.15.0-10-generic 4.15.0-10.11 [modified: boot/vmlinuz-4.15.0-10-generic]
ProcVersionSignature: Ubuntu 4.15.0-10.11-generic 4.15.3
Uname: Linux 4.15.0-10-generic x86_64
ApportVersion: 2.20.8-0ubuntu10
Architecture: amd64
 /dev/snd/controlC0: ubuntu 1148 F.... pulseaudio
 /dev/snd/controlC1: ubuntu 1148 F.... pulseaudio
Date: Thu Mar 15 14:24:46 2018
InstallationDate: Installed on 2018-03-15 (0 days ago)
InstallationMedia: Ubuntu 18.04 LTS "Bionic Beaver" - Alpha amd64 (20180228)
MachineType: Dell Inc. Dell Precision M3800
ProcFB: 0 inteldrmfb
ProcKernelCmdLine: BOOT_IMAGE=/boot/vmlinuz-4.15.0-10-generic.efi.signed root=UUID=d1980d27-9063-4d92-aa10-1fb240453d8d ro quiet splash vt.handoff=1
 linux-restricted-modules-4.15.0-10-generic N/A
 linux-backports-modules-4.15.0-10-generic N/A
 linux-firmware 1.172
SourcePackage: linux
UpgradeStatus: No upgrade log present (probably fresh install)
dmi.bios.date: 10/14/2014
dmi.bios.vendor: Dell Inc.
dmi.bios.version: A07
dmi.board.name: Dell Precision M3800
dmi.board.vendor: Dell Inc.
dmi.board.version: A07
dmi.chassis.type: 8
dmi.chassis.vendor: Dell Inc.
dmi.chassis.version: Not Specified
dmi.modalias: dmi:bvnDellInc.:bvrA07:bd10/14/2014:svnDellInc.:pnDellPrecisionM3800:pvrA07:rvnDellInc.:rnDellPrecisionM3800:rvrA07:cvnDellInc.:ct8:cvrNotSpecified:
dmi.product.name: Dell Precision M3800
dmi.product.version: A07
dmi.sys.vendor: Dell Inc.

Po-Hsu Lin (cypressyew) wrote :
description: updated

This change was made by a bot.

Changed in linux (Ubuntu):
status: New → Confirmed

A quick-and-dirty gdb session indicates that xfs_io is looping on the copy_file_range syscall (326):

$ grep copy /usr/include/asm/unistd_64.h
#define __NR_copy_file_range 326

gdb --args /usr/sbin/xfs_io -i -f -c "copy_range -s 4000 -l 2000 /home/ubuntu/test/test-430/file" "/home/ubuntu/test/test-430/beyond"

(gdb) catch syscall 326
Catchpoint 1 (syscall 326)
(gdb) run
Starting program: /usr/sbin/xfs_io -i -f -c copy_range\ -s\ 4000\ -l\ 2000\ /home/ubuntu/test/test-430/file /home/ubuntu/test/test-430/beyond
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/".
[New Thread 0x7ffff7036700 (LWP 10450)]

Thread 1 "xfs_io" hit Catchpoint 1 (call to syscall 326), syscall () at ../sysdeps/unix/sysv/linux/x86_64/syscall.S:38
38 ../sysdeps/unix/sysv/linux/x86_64/syscall.S: No such file or directory.
(gdb) continue

Thread 1 "xfs_io" hit Catchpoint 1 (returned from syscall 326), syscall () at ../sysdeps/unix/sysv/linux/x86_64/syscall.S:38
38 in ../sysdeps/unix/sysv/linux/x86_64/syscall.S
(gdb) continue

Thread 1 "xfs_io" hit Catchpoint 1 (call to syscall 326), syscall () at ../sysdeps/unix/sysv/linux/x86_64/syscall.S:38
38 in ../sysdeps/unix/sysv/linux/x86_64/syscall.S
(gdb) continue

Thread 1 "xfs_io" hit Catchpoint 1 (returned from syscall 326), syscall () at ../sysdeps/unix/sysv/linux/x86_64/syscall.S:38
38 in ../sysdeps/unix/sysv/linux/x86_64/syscall.S
(gdb) continue

Thread 1 "xfs_io" hit Catchpoint 1 (call to syscall 326), syscall () at ../sysdeps/unix/sysv/linux/x86_64/syscall.S:38
38 in ../sysdeps/unix/sysv/linux/x86_64/syscall.S
(gdb) continue
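
The alternating "call to" / "returned from" catchpoints are consistent with a caller that keeps retrying copy_file_range once it starts returning 0 at end of file. That is an assumption about the cause; the xfs_io source was not inspected in this report. As a minimal sketch, a copy loop that terminates in that situation could look like the following. It uses Python's `os.copy_file_range` (Python 3.8+, Linux) with a plain pread/write fallback, and the function names are hypothetical.

```python
import os


def _copy_chunk(src_fd, dst_fd, count, offset):
    """Copy up to count bytes from src_fd at offset; return bytes copied (0 at EOF)."""
    try:
        return os.copy_file_range(src_fd, dst_fd, count, offset_src=offset)
    except (AttributeError, OSError):
        # Fallback where copy_file_range is missing or unsupported.
        data = os.pread(src_fd, count, offset)
        return os.write(dst_fd, data)


def copy_range(src_path, dst_path, offset, length):
    """Copy up to length bytes starting at offset; return bytes actually copied."""
    total = 0
    src_fd = os.open(src_path, os.O_RDONLY)
    try:
        dst_fd = os.open(dst_path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
        try:
            while length > 0:
                n = _copy_chunk(src_fd, dst_fd, length, offset)
                if n == 0:
                    # End of the source file. Without this check, a request
                    # that extends past EOF would be retried forever.
                    break
                offset += n
                length -= n
                total += n
        finally:
            os.close(dst_fd)
    finally:
        os.close(src_fd)
    return total
```

With a 5000-byte source file, `copy_range(src, dst, 4000, 2000)` copies the 1000 bytes that are available and then returns instead of spinning.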

Po-Hsu Lin (cypressyew) wrote :

There is a similar bug report [1] for NFS, and the fix has already landed in Bionic kernel [2]

This is not the same issue, though: here copy_range works in every case except the "beyond" copy test.

[2] 6d3b5d8d8dd1c14f991ccab84b40f8425f1ae91b in Bionic tree

Po-Hsu Lin (cypressyew) on 2018-03-15
summary: - Reproducible hang in ext4 generic/430 with xfstest from upstream
+ Reproducible hang in generic/430 with xfstest from upstream
description: updated
Po-Hsu Lin (cypressyew) wrote :

A mainline kernel bisect shows that the hang first appears somewhere between 4.9.87 and 4.10-rc1.

With 4.9.87 the test fails to copy the file, but it does not hang:

# export TEST_DIR=/home/ubuntu/test ;export TEST_DEV=/dev/sdb1 ; ./check generic/430
FSTYP -- btrfs
PLATFORM -- Linux/x86_64 M3800 4.9.87-040987-generic

generic/430 - output mismatch (see /home/ubuntu/autotest/client/tmp/xfstests/src/xfstests-bld/xfstests-dev/results//generic/430.out.bad)
    --- tests/generic/430.out 2018-03-15 12:26:40.285762490 +0800
    +++ /home/ubuntu/autotest/client/tmp/xfstests/src/xfstests-bld/xfstests-dev/results//generic/430.out.bad 2018-03-15 19:13:36.691401239 +0800
    @@ -4,22 +4,27 @@
     e11fbace556cba26bf0076e74cab90a3 TEST_DIR/test-430/file
     e11fbace556cba26bf0076e74cab90a3 TEST_DIR/test-430/copy
     Copy beginning of original file
    +cmp: EOF on /home/ubuntu/test/test-430/beginning which is empty
     md5sums after copying beginning:
     e11fbace556cba26bf0076e74cab90a3 TEST_DIR/test-430/file
    -cabe45dcc9ae5b66ba86600cca6b8ba8 TEST_DIR/test-430/beginning
    (Run 'diff -u tests/generic/430.out /home/ubuntu/autotest/client/tmp/xfstests/src/xfstests-bld/xfstests-dev/results//generic/430.out.bad' to see the entire diff)
Ran: generic/430
Failures: generic/430
Failed 1 of 1 tests

Changed in linux (Ubuntu):
importance: Undecided → Medium
status: Confirmed → Triaged
tags: added: kernel-da-key
Po-Hsu Lin (cypressyew) wrote :

This issue is gone with xfsprogs 4.15.1.

Po-Hsu Lin (cypressyew) on 2019-09-16
tags: added: xfstests
Po-Hsu Lin (cypressyew) wrote :

This issue does not exist in D AMD64
    generic/430 2s

Po-Hsu Lin (cypressyew) wrote :

BTW, this hang can be reproduced on Bionic with the 5.0 kernel, on ext4, btrfs and xfs.
So it might have something to do with the userspace tools as well.

Po-Hsu Lin (cypressyew) on 2019-10-16
tags: added: ubuntu-xfstests-btrfs ubuntu-xfstests-ext4 ubuntu-xfstests-xfs
Sean Feole (sfeole) on 2020-02-12
tags: added: sru-20200127
tags: added: 4.15 5.0
Po-Hsu Lin (cypressyew) wrote :

Passed with Focal 5.4 (5.4.0-31.35), with just 1 second to run:
generic/430 1s
