unmounting NTFS causes mount.ntfs process to get stuck in eternal kernel deep sleep

Bug #1398859 reported by Martin Pitt on 2014-12-03
12
This bug affects 2 people
Affects Status Importance Assigned to Milestone
linux (Ubuntu)
High
Unassigned
udisks2 (Ubuntu)
High
Martin Pitt

Bug Description

Since Mid-November the udisks2 test (https://jenkins.qa.ubuntu.com/job/vivid-adt-udisks2/? shows it happened between Nov 13 and 19) exhibits a regression with NTFS: Unmounting causes the ntfs-3g process to go into kernel deep sleep and umount hangs forever.

This can be reproduced with

  sudo apt install ntfs-3g
  sudo modprobe scsi_debug
  # check dmesg which drive this created; e. g. /dev/sda
  sudo mkntfs -F /dev/sda
  sudo mount /dev/sda /mnt

Now you have a process which is happy:
root 12821 0.0 0.0 15272 1948 ? Ss 15:49 0:00 /sbin/mount.ntfs /dev/sda /mnt -o rw

You can work with /mnt, edit files, etc. But now unmount:

  sudo umount /mnt

This never returns, and the process goes into uninterruptible kernel sleep ("D"):

root 12928 0.0 0.0 15272 1948 ? Ds 15:55 0:00 /sbin/mount.ntfs /dev/sda /mnt -o rw

The strace for the umount until the hang is

ubuntu@autopkgtest:~$ sudo strace -vvtts1024 -p 12928
Process 12928 attached
15:56:17.938299 read(4, "0\0\0\0\2\0\0\0\2\0\0\0\0\0\0\0\1\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\1\0\0\0\0\0\0\0", 135168) = 48
15:56:22.338931 read(4, "(\0\0\0&\0\0\0\3\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0", 135168) = 40
15:56:22.339637 sendto(5, "<29>Dec 3 15:56:22 ntfs-3g[12928]: Unmounting /dev/sda ()\n", 59, MSG_NOSIGNAL, NULL, 0) = 59
15:56:22.340509 fsync(3) = 0
15:56:22.341188 fcntl(3, F_SETLK, {type=F_UNLCK, whence=SEEK_SET, start=0, len=0}) = 0
15:56:22.341759 close(3

ProblemType: Bug
DistroRelease: Ubuntu 15.04
Package: ntfs-3g 1:2013.1.13AR.1-2ubuntu3
ProcVersionSignature: Ubuntu 3.16.0-25.33-generic 3.16.7
Uname: Linux 3.16.0-25-generic x86_64
ApportVersion: 2.15-0ubuntu1
Architecture: amd64
CurrentDesktop: Unity
Date: Wed Dec 3 15:48:12 2014
EcryptfsInUse: Yes
InstallationDate: Installed on 2014-11-20 (13 days ago)
InstallationMedia: Ubuntu 15.04 "Vivid Vervet" - Alpha amd64 (20141119)
SourcePackage: ntfs-3g
UpgradeStatus: No upgrade log present (probably fresh install)

Martin Pitt (pitti) wrote :
Martin Pitt (pitti) wrote :

- ntfs-3g itself did not change since utopic.
- The hang happens with utopic's 3.16.0-23-generic and the previous 3.16.0-24-generic kernel too, so not a kernel regression.
- The hang happens with libc{6,-bin} 2.19-10ubuntu2, so not a glibc regression
- The hang happens with systemd+udev 215-5ubuntu2 (which we had from Nov 03 on), so not an udev regression either.

Martin Pitt (pitti) wrote :

It works again with udev from utopic (208-8ubuntu8.1). The first vivid version (215-5ubuntu1) hangs. Curiously the current libudev1 from vivid works fine.

affects: ntfs-3g (Ubuntu) → systemd (Ubuntu)
Changed in systemd (Ubuntu):
assignee: nobody → Martin Pitt (pitti)
status: New → Triaged
summary: - [regression] unmounting NTFS causes mount.ntfs process to get stuck in
- eternal kernel deep sleep
+ [udev regression] unmounting NTFS causes mount.ntfs process to get stuck
+ in eternal kernel deep sleep
Changed in systemd (Ubuntu):
importance: Undecided → High

It's enough to downgrade udev before the "umount /mnt", everything else works with 215. Conversely, mkfs/mount with 208, upgrading to udev 215, and umount then hangs. stracing umount /mnt indeed hangs forever in the umount() call.

Martin Pitt (pitti) wrote :

This works when stopping udevd. When it's running, a worker gets stuck:

root 25350 0.0 0.2 41212 2060 ? Ds 11:14 0:00 /lib/systemd/systemd-udevd --daemon

Martin Pitt (pitti) wrote :

Correction, previous bisect apparently was unclean. I clean/rebuild everything now, and it's pointing to http://cgit.freedesktop.org/systemd/systemd/commit/?id=02ba8fb3357daf5 . Reverting this indeed fixes the problem.

Martin Pitt (pitti) wrote :

Curiously enough, this does not happen on Debian sid nor Fedora, even though I can reproduce it just fine with udevd built straight out of upstream master on ubuntu. So it seems something else gets in the way then (different kernel?)

Martin Pitt (pitti) wrote :

Just in case I need it again, this is the git bisect run script that I used.

Martin Pitt (pitti) wrote :

This works fine on Debian and Fedora, with the same udev upstream version. I verified that the hang still happens with Debian's kernel or Debian's udev+libudev1 packages. If I install just the systemd binary package from Debian, the hang goes away. More specifically, installing Debian's /lib/systemd/systemd as /lib/systemd/systemd-debian and booting that (with init) also makes the bug go away. So something in our Ubuntu patches causes this.

Martin Pitt (pitti) on 2014-12-17
summary: - [udev regression] unmounting NTFS causes mount.ntfs process to get stuck
- in eternal kernel deep sleep
+ unmounting NTFS causes mount.ntfs process to get stuck in eternal kernel
+ deep sleep
Martin Pitt (pitti) wrote :

So this is interesting. Debian's /lib/systemd/systemd works, but the hang happens under upstart too (where that binary isn't being used!). So it must be something which we change in Ubuntu's systemd, but conceptually also do for upstart/cgmanager. The first candidate that I see is to put processes into all cgroups for session LXC, I'll verify that.

Martin Pitt (pitti) on 2014-12-17
Changed in systemd (Ubuntu):
milestone: none → ubuntu-15.01
status: Triaged → In Progress
Martin Pitt (pitti) wrote :

Indeed reverting core-Put-session-scopes-into-all-cgroup-controllers.patch fixes the hang. So putting processes into all cgroup controllers somehow interacts badly with ntfs-3g, and reverting the above udev commit somehow fixes that.

Martin Pitt (pitti) wrote :

For completeness, this is the testntfs fake source package containing an autopkgtest for this bug. That's what I used for bisecting.

Changed in systemd (Ubuntu):
status: In Progress → Fix Committed
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package systemd - 218-2ubuntu3

---------------
systemd (218-2ubuntu3) vivid; urgency=medium

  * core-Put-session-scopes-into-all-cgroup-controllers.patch: Revert upstream
    commit 02ba8fb3357d to fix eternal kernel deep sleep hangs when unmounting
    an ntfs-3g partition. (LP: #1398859)
 -- Martin Pitt <email address hidden> Thu, 18 Dec 2014 13:32:49 +0100

Changed in systemd (Ubuntu):
status: Fix Committed → Fix Released
Martin Pitt (pitti) wrote :

This caused bug 1404143 so I reverted this change.

Changed in systemd (Ubuntu):
status: Fix Released → In Progress
Martin Pitt (pitti) wrote :

It is enough to move the shell process (under which umount runs) out of the logind cgroup user.slice/user-1000.slice/session-1.scope into the / cgroup in the cpu,cpuacct controller:

  echo $$ | sudo tee /sys/fs/cgroup/cpu,cpuacct/tasks

All other cgroup controllers work fine.

Martin Pitt (pitti) wrote :

I'll fix this in udisks2's test suite instead.

affects: systemd (Ubuntu) → udisks2 (Ubuntu)
Martin Pitt (pitti) on 2014-12-30
Changed in udisks2 (Ubuntu):
status: In Progress → Fix Committed
Phillip Susi (psusi) wrote :

Isn't there still an underlying kernel bug that should be fixed here?

Martin Pitt (pitti) wrote :

Quite possibly,but at this point I don't quite understand this any more, at least not yet with the level of debugging that I did so far. :-(

Phillip Susi (psusi) wrote :

A stack trace of the blocked process might help, maybe magic-sysrq-w when it happens?

Martin Pitt (pitti) on 2015-01-01
Changed in udisks2 (Ubuntu):
status: Fix Committed → In Progress
Martin Pitt (pitti) wrote :

strace of udevd while it's hanging:

10134 open("/dev/sda", O_RDONLY|O_NONBLOCK|O_NOFOLLOW|O_CLOEXEC) = 11
10134 flock(11, LOCK_EX|LOCK_NB) = 0
10134 ioctl(11, BLKRRPART, 0

Corresponding kernel trace:

[ 720.228114] INFO: task systemd-udevd:10134 blocked for more than 120 seconds.
[ 720.230656] Not tainted 3.16.0-28-generic #38-Ubuntu
[ 720.232666] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 720.236283] systemd-udevd D ffff88007fc13840 0 10134 1 0x00000007
[ 720.236293] ffff88007b76fcd0 0000000000000086 ffff880037641e90 0000000000013840
[ 720.236298] ffff88007b76ffd8 0000000000013840 ffff880037641e90 ffff880037641e90
[ 720.236302] ffff8800371c5068 ffff8800371c5080 ffff88007be723c0 0000000000000000
[ 720.236306] Call Trace:
[ 720.236320] [<ffffffff81785049>] schedule+0x29/0x70
[ 720.236326] [<ffffffff817888b5>] rwsem_down_read_failed+0xf5/0x190
[ 720.236334] [<ffffffff811fc743>] ? find_inode+0xa3/0xb0
[ 720.236341] [<ffffffff813a9284>] call_rwsem_down_read_failed+0x14/0x30
[ 720.236346] [<ffffffff817881e0>] ? down_read+0x20/0x30
[ 720.236351] [<ffffffff811e3db7>] get_super+0x97/0xe0
[ 720.236357] [<ffffffff812197a4>] fsync_bdev+0x14/0x60
[ 720.236363] [<ffffffff8137b604>] invalidate_partition+0x24/0x50
[ 720.236369] [<ffffffff8137e3cb>] drop_partitions.isra.12+0x1b/0x40
[ 720.236373] [<ffffffff8137e88e>] rescan_partitions+0x3e/0x2a0
[ 720.236377] [<ffffffff8137a4dc>] blkdev_ioctl+0x71c/0x7c0
[ 720.236382] [<ffffffff81218e3d>] block_ioctl+0x3d/0x50
[ 720.236386] [<ffffffff811f5138>] do_vfs_ioctl+0x2c8/0x4a0
[ 720.236393] [<ffffffff81082f30>] ? ptrace_notify+0x60/0x110
[ 720.236397] [<ffffffff811f5391>] SyS_ioctl+0x81/0xa0
[ 720.236402] [<ffffffff8178a0ff>] tracesys+0xe1/0xe6

Martin Pitt (pitti) on 2015-01-02
Changed in udisks2 (Ubuntu):
status: In Progress → Fix Committed
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package udisks2 - 2.1.4-1git2

---------------
udisks2 (2.1.4-1git2) vivid; urgency=medium

  * debian/tests/upstream-system: Work around eternal udev BLKRRPART ioctl
    hang with ntfs-3g (LP: #1398859)
 -- Martin Pitt <email address hidden> Fri, 02 Jan 2015 17:20:05 +0100

Changed in udisks2 (Ubuntu):
status: Fix Committed → Fix Released

This bug is missing log files that will aid in diagnosing the problem. From a terminal window please run:

apport-collect 1398859

and then change the status of the bug to 'Confirmed'.

If, due to the nature of the issue you have encountered, you are unable to run this command, please add a comment stating that fact and change the bug status to 'Confirmed'.

This change has been made by an automated script, maintained by the Ubuntu Kernel Team.

Changed in linux (Ubuntu):
status: New → Incomplete
Phillip Susi (psusi) on 2015-01-05
Changed in linux (Ubuntu):
status: Incomplete → Confirmed
tags: added: bot-stop-nagging
Changed in linux (Ubuntu):
importance: Undecided → High
Martin Pitt (pitti) wrote :

The kernel issue still happens (see http://autopkgtest.ubuntu.com/packages/u/udisks2/xenial/amd64/), so the workaround needs to come back.

Changed in udisks2 (Ubuntu):
milestone: ubuntu-15.01 → ubuntu-16.03
status: Fix Released → In Progress
Martin Pitt (pitti) wrote :

The workaround doesn't work any more, and the test does not actually depend on ntfs-3g any more, so this must rather be fixed in our base images.

Changed in udisks2 (Ubuntu):
milestone: ubuntu-16.03 → none
status: In Progress → Fix Released
Martin Pitt (pitti) wrote :

I also see this with exfat, so it rather seems to be related to fuse (as both ntfs-3g and exfat use fuse). Curiously I do not see this when running the tests on my laptop or on the Scalingstack cloud infra, just when I run them in QEMU.

Martin Pitt (pitti) wrote :
Download full text (3.7 KiB)

> Curiously I do not see this when running the tests on my laptop

And of course in the minute I write this it does happen again (I ran the test maybe 20 times today, and only just now it happened):

[11359.569157] INFO: task systemd-udevd:341 blocked for more than 120 seconds.
[11359.569163] Not tainted 4.8.0-29-generic #31~lp1626436ProposedWithTwoPatches
[11359.569165] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[11359.569167] systemd-udevd D ffffa31788e77ba8 0 341 1 0x00000100
[11359.569173] ffffa31788e77ba8 00ffffff84c71a68 ffffa31499670000 ffffa31788e82ac0
[11359.569177] 0000000088e77db0 ffffa31788e78000 ffffa317893bc070 ffffa317893bc088
[11359.569179] ffffa31788e77be0 ffffa3149cf771c0 ffffa31788e77bc0 ffffffff85296905
[11359.569182] Call Trace:
[11359.569190] [<ffffffff85296905>] schedule+0x35/0x80
[11359.569193] [<ffffffff85299543>] rwsem_down_read_failed+0x103/0x160
[11359.569197] [<ffffffff84e3eb98>] call_rwsem_down_read_failed+0x18/0x30
[11359.569200] [<ffffffff85298da0>] down_read+0x20/0x40
[11359.569202] [<ffffffff84c35e15>] get_super.part.7+0x95/0xd0
[11359.569204] [<ffffffff84c35e76>] get_super+0x26/0x30
[11359.569208] [<ffffffff84c70f98>] fsync_bdev+0x18/0x60
[11359.569211] [<ffffffff84e0f064>] invalidate_partition+0x24/0x50
[11359.569213] [<ffffffff84e12243>] rescan_partitions+0x53/0x2b0
[11359.569216] [<ffffffff84d728d8>] ? security_capable+0x48/0x60
[11359.569218] [<ffffffff84e0d0d5>] __blkdev_reread_part+0x65/0x70
[11359.569220] [<ffffffff84e0d103>] blkdev_reread_part+0x23/0x40
[11359.569222] [<ffffffff84e0db25>] blkdev_ioctl+0x425/0x910
[11359.569225] [<ffffffff84b43907>] ? __seccomp_filter+0x67/0x260
[11359.569228] [<ffffffff84c7090d>] block_ioctl+0x3d/0x50
[11359.569231] [<ffffffff84c47883>] do_vfs_ioctl+0xa3/0x610
[11359.569233] [<ffffffff84b441b3>] ? __secure_computing+0x43/0xe0
[11359.569236] [<ffffffff84a033ae>] ? syscall_trace_enter+0xce/0x2f0
[11359.569239] [<ffffffff84c47e69>] SyS_ioctl+0x79/0x90
[11359.569241] [<ffffffff84a03b9e>] do_syscall_64+0x5e/0xc0
[11359.569244] [<ffffffff8529b165>] entry_SYSCALL64_slow_path+0x25/0x25
[11359.569320] INFO: task mount.ntfs:15670 blocked for more than 120 seconds.
[11359.569323] Not tainted 4.8.0-29-generic #31~lp1626436ProposedWithTwoPatches
[11359.569324] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[11359.569325] mount.ntfs D ffffa314d0d1bd68 0 15670 5268 0x00000000
[11359.569329] ffffa314d0d1bd68 0000000100000003 ffffa3178bef9c80 ffffa31499670000
[11359.569332] 000000003d862ecf ffffa314d0d1c000 ffffa3149cf771dc ffffa31499670000
[11359.569334] 00000000ffffffff ffffa3149cf771e0 ffffa314d0d1bd80 ffffffff85296905
[11359.569337] Call Trace:
[11359.569340] [<ffffffff85296905>] schedule+0x35/0x80
[11359.569342] [<ffffffff85296bae>] schedule_preempt_disabled+0xe/0x10
[11359.569344] [<ffffffff852987e9>] __mutex_lock_slowpath+0xb9/0x130
[11359.569347] [<ffffffff8529887f>] mutex_lock+0x1f/0x30
[11359.569350] [<ffffffff84c71440>] blkdev_put+0x20/0x110
[11359.569352] [<ffffffff84c71564>] blkdev_close+0x34/0x70
[11359.569354] [<ffffffff84c34f77>] __fput+0xe7/...

Read more...

To post a comment you must log in.
This report contains Public information  Edit
Everyone can see this information.

Other bug subscribers