rm -r dir on USB disk locks up hdparm on different disk

Bug #2018687 reported by Ian! D. Allen
6
This bug affects 1 person
Affects Status Importance Assigned to Milestone
linux-meta-hwe-5.15 (Ubuntu)
New
Undecided
Unassigned

Bug Description

Description: Ubuntu 20.04.6 LTS
Linux 5.15.0-72-generic #79~20.04.1-Ubuntu SMP Thu Apr 20 22:12:07 UTC 2023 x86_64 GNU/Linux
Intel(R) Xeon(R) CPU E5-1620 v4 @ 3.50GHz

Running a "rm -r dir" on a directory with millions of files that resides
on a disk in an external USB-3 hard drive dock locks up an unrelated
hdparm processes running on an internal disk such that the kernel says:

    May 7 04:24:02 kernel: [163606.041862] INFO: task hdparm:1391162 blocked for more than 120 seconds.
    [...]
    May 7 04:26:03 kernel: [163726.876357] INFO: task hdparm:1391162 blocked for more than 241 seconds.
    [...]
    May 7 04:28:04 kernel: [163847.702980] INFO: task hdparm:1391162 blocked for more than 362 seconds.

First a normal run of "hdparm -t /dev/sda" with the offending "rm -r" SIGSTOPped so that
it doesn't affect anything:

    # \time hdparm -t /dev/sda
    /dev/sda:
     Timing buffered disk reads: 1128 MB in 3.00 seconds = 375.50 MB/sec
    0.01user 0.67system 0:06.21elapsed 11%CPU (0avgtext+0avgdata 4584maxresident)k
    2312704inputs+8outputs (0major+664minor)pagefaults 0swaps

Elapsed time is about six seconds, as expected. /dev/sda is an internal SSD drive.

I now run this loop to show the timings and process states below:

    # while sleep 1 ; do date ; ps laxww | grep '[ ]D' | grep -v refrig ; done

(I have some processes stopped in a freezer cgroup ("refrig") that I
don't want to see in the grep output.) I SIGCONT the offending "rm -r"
running on the drive in the USB3 drive dock and you see the rm appear
in uninterruptible sleep along with a couple of kernel processes:

    Sun May 7 05:01:07 EDT 2023
    Sun May 7 05:01:08 EDT 2023
    Sun May 7 05:01:09 EDT 2023
    Sun May 7 05:01:10 EDT 2023
    Sun May 7 05:01:11 EDT 2023
    1 0 447 2 20 0 0 0 usb_sg D ? 3:17 [usb-storage]
    4 0 1423283 11939 20 0 10648 580 wait_o D+ pts/28 0:00 rm -rf 15tb3
    Sun May 7 05:01:12 EDT 2023
    1 0 447 2 20 0 0 0 usb_sg D ? 3:17 [usb-storage]
    1 0 4016 1 20 0 161136 1900 usbhid Dsl ? 1:39 /sbin/apcupsd
    4 0 1423283 11939 20 0 10648 580 wait_o D+ pts/28 0:00 rm -rf 15tb3
    Sun May 7 05:01:13 EDT 2023

The above lines showing those processes in uninterruptible sleep repeat
over and over each second as the "rm -r" continues. I then start up
"hdparm -t /dev/sda" on the internal SSD disk, and it also appears in
uninterruptible sleep and doesn't finish even after minutes of waiting:

    Sun May 7 05:01:25 EDT 2023
    1 0 368 2 20 0 0 0 md_sup D ? 2:57 [md0_raid5]
    1 0 1366783 2 20 0 0 0 blk_mq D ? 0:02 [kworker/u16:2+flush-8:144]
    4 0 1423283 11939 20 0 11260 2544 wait_o D+ pts/28 0:00 rm -rf 15tb3
    4 0 1423501 9975 20 0 4680 4584 wb_wai DL+ pts/4 0:00 hdparm -t /dev/sda
    Sun May 7 05:01:26 EDT 2023
    1 0 447 2 20 0 0 0 usb_sg D ? 3:17 [usb-storage]
    1 0 1366783 2 20 0 0 0 blk_mq D ? 0:02 [kworker/u16:2+flush-8:144]
    4 0 1423283 11939 20 0 11260 2544 wait_o D+ pts/28 0:00 rm -rf 15tb3
    4 0 1423501 9975 20 0 4680 4584 wb_wai DL+ pts/4 0:00 hdparm -t /dev/sda
    Sun May 7 05:01:27 EDT 2023
[...]
    4 0 1423501 9975 20 0 4680 4584 wb_wai DL+ pts/4 0:00 hdparm -t /dev/sda
    Sun May 7 05:01:35 EDT 2023
    1 0 447 2 20 0 0 0 usb_sg D ? 3:17 [usb-storage]
    4 0 1423283 11939 20 0 11260 2544 blk_mq D+ pts/28 0:00 rm -rf 15tb3
    4 0 1423501 9975 20 0 4680 4584 wb_wai DL+ pts/4 0:00 hdparm -t /dev/sda
    Sun May 7 05:01:36 EDT 2023
    1 0 447 2 20 0 0 0 usb_sg D ? 3:17 [usb-storage]
    1 0 4985 2 20 0 0 0 rq_qos D ? 0:24 [jbd2/sdj1-8]
    1 0 1366783 2 20 0 0 0 blk_mq D ? 0:02 [kworker/u16:2+flush-8:144]
    4 0 1423283 11939 20 0 11260 2544 wait_o D+ pts/28 0:00 rm -rf 15tb3
    4 0 1423501 9975 20 0 4680 4584 wb_wai DL+ pts/4 0:00 hdparm -t /dev/sda
    Sun May 7 05:01:37 EDT 2023

I keep waiting. The above lines repeat over and over and the hdparm is
blocked and doesn't finish.

    Sun May 7 05:03:32 EDT 2023
    1 0 447 2 20 0 0 0 usb_sg D ? 3:18 [usb-storage]
    1 0 1366783 2 20 0 0 0 blk_mq D ? 0:02 [kworker/u16:2+flush-8:144]
    4 0 1423283 11939 20 0 11260 2544 wait_o D+ pts/28 0:03 rm -rf 15tb3
    4 0 1423501 9975 20 0 4680 4584 wb_wai DL+ pts/4 0:00 hdparm -t /dev/sda
    Sun May 7 05:03:34 EDT 2023

At this point, I SIGSTOP the offending "rm -r" and it disappears from
uninterruptible sleep, but the other processes are still in it for
several seconds until the lockup clears and the hdparm finally finishes:

    1 0 1366783 2 20 0 0 0 blk_mq D ? 0:02 [kworker/u16:2+flush-8:144]
    4 0 1423501 9975 20 0 4680 4584 wb_wai DL+ pts/4 0:00 hdparm -t /dev/sda
    Sun May 7 05:03:35 EDT 2023
    1 0 1366783 2 20 0 0 0 blk_mq D ? 0:02 [kworker/u16:2+flush-8:144]
    4 0 1423501 9975 20 0 4680 4584 wb_wai DL+ pts/4 0:00 hdparm -t /dev/sda
    Sun May 7 05:03:36 EDT 2023
    1 0 4016 1 20 0 161136 1900 usbhid Dsl ? 1:39 /sbin/apcupsd
    1 0 1366783 2 20 0 0 0 blk_mq D ? 0:02 [kworker/u16:2+flush-8:144]
    4 0 1423501 9975 20 0 4680 4584 wb_wai DL+ pts/4 0:00 hdparm -t /dev/sda
    Sun May 7 05:03:37 EDT 2023
    1 0 1366783 2 20 0 0 0 blk_mq D ? 0:02 [kworker/u16:2+flush-8:144]
    4 0 1423501 9975 20 0 4680 4584 wb_wai DL+ pts/4 0:00 hdparm -t /dev/sda
    Sun May 7 05:03:38 EDT 2023
    1 0 1366783 2 20 0 0 0 blk_mq D ? 0:02 [kworker/u16:2+flush-8:144]
    4 0 1423501 9975 20 0 4680 4584 wb_wai DL+ pts/4 0:00 hdparm -t /dev/sda
    Sun May 7 05:03:39 EDT 2023
    1 0 1366783 2 20 0 0 0 blk_mq D ? 0:02 [kworker/u16:2+flush-8:144]
    4 0 1423501 9975 20 0 4680 4584 wb_wai DL+ pts/4 0:00 hdparm -t /dev/sda
    Sun May 7 05:03:40 EDT 2023
    Sun May 7 05:03:41 EDT 2023
    Sun May 7 05:03:42 EDT 2023
    Sun May 7 05:03:43 EDT 2023
    4 0 1423501 9975 20 0 4680 4584 wait_o DL+ pts/4 0:00 hdparm -t /dev/sda
    Sun May 7 05:03:44 EDT 2023
    4 0 1423501 9975 20 0 4680 4584 wait_o DL+ pts/4 0:00 hdparm -t /dev/sda
    Sun May 7 05:03:45 EDT 2023
    Sun May 7 05:03:46 EDT 2023

And the hdparm is finally done, several minutes after I started it. If I keep
the offending "rm -r" running on that USB3 disk, the "rm -r" locks up an hdparm
(running on a different disk) long enough to cause the above kernel INFO messages.

If I try hdparm on a different disk, e.g. one of the WD Red drives in
my internal RAID5 array, it also hangs until I SIGSTOP or kill the "rm -r":

    1 0 447 2 20 0 0 0 usb_sg D ? 3:51 [usb-storage]
    1 0 4985 2 20 0 0 0 blk_mq D ? 0:31 [jbd2/sdj1-8]
    1 0 1466181 2 20 0 0 0 rq_qos D ? 0:08 [kworker/u16:4+flush-8:144]
    4 0 1469821 11939 20 0 27524 18888 wait_o D+ pts/28 0:44 rm -rf 15tb3
    4 0 1501584 11933 20 0 4680 4584 rq_qos DL+ pts/26 0:00 hdparm -t /dev/sdb

While hung, ^C (interrupt) doesn't affect the hdparm until I SIGSTOP
the offending "rm -r" and then everything eventually unlocks after a few seconds.
Is this the way things are supposed to work? I did not expect this behaviour. Here's the syslog trace:

    [163968.537842] INFO: task hdparm:1391162 blocked for more than 483 seconds.
    [163968.537853] Tainted: G O 5.15.0-72-generic #79~20.04.1-Ubuntu
    [163968.537858] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
    [163968.537861] task:hdparm state:D stack: 0 pid:1391162 ppid:1391130 flags:0x00004004
    [163968.537869] Call Trace:
    [163968.537872] <TASK>
    [163968.537876] __schedule+0x2cd/0x890
    [163968.537889] schedule+0x69/0x110
    [163968.537896] wb_wait_for_completion+0x5a/0x90
    [163968.537907] ? wait_woken+0x60/0x60
    [163968.537928] sync_inodes_sb+0xd8/0x2b0
    [163968.537932] sync_inodes_one_sb+0x19/0x20
    [163968.537933] iterate_supers+0xa3/0x100
    [163968.537937] ? __x64_sys_tee+0xe0/0xe0
    [163968.537939] ksys_sync+0x42/0xa0
    [163968.537941] __do_sys_sync+0xe/0x20
    [163968.537942] do_syscall_64+0x5c/0xc0
    [163968.537946] ? do_syscall_64+0x69/0xc0
    [163968.537948] ? do_syscall_64+0x69/0xc0
    [163968.537950] ? do_syscall_64+0x69/0xc0
    [163968.537953] entry_SYSCALL_64_after_hwframe+0x61/0xcb
    [163968.537956] RIP: 0033:0x7f1d374a822b
    [163968.537958] RSP: 002b:00007ffc973592f8 EFLAGS: 00000202 ORIG_RAX: 00000000000000a2
    [163968.537961] RAX: ffffffffffffffda RBX: 0000000000000d64 RCX: 00007f1d374a822b
    [163968.537962] RDX: 0000000000000000 RSI: 0000000000200000 RDI: 0000000000000003
    [163968.537963] RBP: 0000000000000003 R08: 0000000000000007 R09: 000000000000002a
    [163968.537964] R10: 0000564568bdd017 R11: 0000000000000202 R12: 0000000000000000
    [163968.537965] R13: 00007ffc9735a295 R14: 00007ffc97359370 R15: 00007f1d37193000
    [163968.537967] </TASK>

ProblemType: Bug
DistroRelease: Ubuntu 20.04
Package: linux-image-generic-hwe-20.04 5.15.0.72.79~20.04.33
ProcVersionSignature: Ubuntu 5.15.0-72.79~20.04.1-generic 5.15.98
Uname: Linux 5.15.0-72-generic x86_64
ApportVersion: 2.20.11-0ubuntu27.26
Architecture: amd64
CasperMD5CheckResult: skip
Date: Sun May 7 08:38:10 2023
EcryptfsInUse: Yes
InstallationDate: Installed on 2020-10-07 (941 days ago)
InstallationMedia: Lubuntu 20.04.1 LTS "Focal Fossa" - Release amd64 (20200731)
SourcePackage: linux-meta-hwe-5.15
UpgradeStatus: Upgraded to focal on 2022-08-15 (265 days ago)

Revision history for this message
Ian! D. Allen (idallen) wrote :
Ian! D. Allen (idallen)
description: updated
Ian! D. Allen (idallen)
description: updated
Revision history for this message
Ian! D. Allen (idallen) wrote :
Download full text (4.4 KiB)

I now have a Perl script running that is removing duplicate files by doing thousands of hard links on a different external USB3 disk and it is locking up or timing out many disk-related things on all my other disks. Both this USB3 external drive and the one above are plugged directly into the motherboard (HP Z440 Workstation).

F UID PID PPID PRI NI VSZ RSS WCHAN STAT TTY TIME COMMAND
4 0 2210112 2210111 20 0 53680 42588 blk_mq D+ pts/24 3:00 /usr/bin/perl [...]

May 14 13:37:49 kernel: [259424.745462] INFO: task smartd:2719 blocked for more than 120 seconds.
May 15 00:08:09 kernel: [297244.761855] INFO: task smartd:2719 blocked for more than 120 seconds.
May 15 00:10:10 kernel: [297365.592485] INFO: task smartd:2719 blocked for more than 241 seconds.
May 15 01:08:34 kernel: [300869.682961] INFO: task smartd:2719 blocked for more than 120 seconds.
May 15 01:28:43 kernel: [302077.989582] INFO: task hdparm:2052842 blocked for more than 120 seconds.
May 15 01:30:43 kernel: [302198.820278] INFO: task hdparm:2052842 blocked for more than 241 seconds.
May 15 01:32:44 kernel: [302319.654907] INFO: task hdparm:2052842 blocked for more than 362 seconds.
May 15 01:34:45 kernel: [302440.481601] INFO: task hdparm:2052842 blocked for more than 483 seconds.
May 15 01:36:46 kernel: [302561.316237] INFO: task hdparm:2052842 blocked for more than 604 seconds.
May 15 02:06:58 kernel: [304373.770194] INFO: task smartd:2719 blocked for more than 120 seconds.

From one of the logged events:

May 15 02:06:58 kernel: [304373.770194] INFO: task smartd:2719 blocked for more than 120 seconds.
May 15 02:06:58 kernel: [304373.770209] Tainted: G O 5.15.0-72-generic #79~20.04.1-Ubuntu
May 15 02:06:58 kernel: [304373.770215] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
May 15 02:06:58 kernel: [304373.770218] task:smartd state:D stack: 0 pid: 2719 ppid: 1 flags:0x00000000
May 15 02:06:58 kernel: [304373.770226] Call Trace:
May 15 02:06:58 kernel: [304373.770230] <TASK>
May 15 02:06:58 kernel: [304373.770236] __schedule+0x2cd/0x890
May 15 02:06:58 kernel: [304373.770251] schedule+0x69/0x110
May 15 02:06:58 kernel: [304373.770260] schedule_preempt_disabled+0xe/0x20
May 15 02:06:58 kernel: [304373.770269] __mutex_lock.isra.0+0x20c/0x470
May 15 02:06:58 kernel: [304373.770276] ? iput.part.0+0x61/0x1e0
May 15 02:06:58 kernel: [304373.770287] __mutex_lock_slowpath+0x13/0x20
May 15 02:06:58 kernel: [304373.770294] mutex_lock+0x36/0x40
May 15 02:06:58 kernel: [304373.770299] blkdev_get_by_dev+0x11d/0x2d0
May 15 02:06:58 kernel: [304373.770309] ? blkdev_close+0x30/0x30
May 15 02:06:58 kernel: [304373.770318] blkdev_open+0x50/0x90
May 15 02:06:58 kernel: [304373.770325] do_dentry_open+0x169/0x3e0
May 15 02:06:58 kernel: [304373.770336] vfs_open+0x2d/0x40
May 15 02:06:58 kernel: [304373.770342] do_open.isra.0+0x20d/0x480
May 15 02:06:58 kernel: [304373.770351] path_openat+0x18e/0xe50
May 15 02:06:58 kernel: [304373.770361] ? put_device+0x13/0x20
May 15 02:06:58 kernel: [304373.770371] ? scsi_device_put+0x31/0x40
May 15 02:06:58 kernel: [304373.770380] ? sd_release...

Read more...

Revision history for this message
Ian! D. Allen (idallen) wrote :

If I do this:

    # sysctl -w vm.dirty_ratio=0

the hdparm no longer hangs. It has to be zero; anything non-zero, even 1, causes large delays in disk-related commands such as hdparm, sync, smartctl, etc.

I got this idea from here:

https://serverfault.com/questions/405210/can-high-load-cause-server-hang-and-error-blocked-for-more-than-120-seconds

Revision history for this message
Ian! D. Allen (idallen) wrote :

Unfortunately, an unacceptable side-effect of "sysctl -w vm.dirty_ratio=0" is that disk operations that move a lot of data are taking much too long. An rsync that normally takes less than an hour of real time was still running over 12 hours later (and hasn't finished yet). I'm reverting the vm.dirty_ratio back up to 20 to see if that clears out all the unfinished disk I/O.

To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.