xfs_trans_cancel on mkdir()
Affects | Status | Importance | Assigned to | Milestone | |
---|---|---|---|---|---|
linux-lts-xenial (Ubuntu) |
Confirmed
|
High
|
Unassigned |
Bug Description
Hi,
We have a server (a swift storage node) where a single XFS filesystem repeatedly becomes unusable (the 5 others are fine).
You'll find kern.log messages at the end of this message.
We can reproduce the problem easily : after an error occurs, we just need to umount/mount the filesystem. XFS will replay the log successfully and mount the filesystem. Then the swift processes will notice that the filesystem is back and will start sync'ing some data to it (not much, because the filesystem is full - swift will remove/add data to it) and then in ~1 or 2 minutes, the error pasted below will fire, and then the filesystem becomes unusable :
$ ls /srv/node/sdg
ls: cannot access /srv/node/sdg: Input/output error
It is still mounted though.
A few observations :
- it's always an rsync process that will trigger the error (probably expected, since most (if not all) write operations on the disk are done by rsync)
- it's always on an mkdir()
The "rsync" process (PID 15036) that crashed in the trace below was running under strace -e mkdir, you will find the output of that command below.
I tried reproducing the problem by repeating the "mkdir" after an umount/mount, but I'm unable to do so.
I believe the fact that the filesystem is nearly full is what causes this bug, which we're not seeing on other disks on this machine. We have the same problem on another disk on another machine (running 4.4.0-42-generic - even though, as I reconnect to it to grab the kernel version, it appears that the filesystem has managed to stay up for ~20 min now).
The question is : what information can we provide to help tracking this bug down ? I tried investigating with blktrace and fatrace, but that didn't really help as I'm still unable to reproduce the problem on my own.
Thanks !
== kernel trace ==
Jun 30 10:35:05 cawesh kernel: [29510.321474] XFS (sdg): Internal error xfs_trans_cancel at line 990 of file /build/
Jun 30 10:35:05 cawesh kernel: [29510.345888] CPU: 3 PID: 15036 Comm: rsync Tainted: G I 4.4.0-83-generic #106~14.04.1-Ubuntu
Jun 30 10:35:05 cawesh kernel: [29510.345891] Hardware name: HP ProLiant DL380 G6, BIOS P62 08/16/2015
Jun 30 10:35:05 cawesh kernel: [29510.345892] 0000000000000000 ffff8802627a3d30 ffffffff813ddefc ffff88058d91b790
Jun 30 10:35:05 cawesh kernel: [29510.345894] 0000000000000001 ffff8802627a3d48 ffffffffc03b095b ffffffffc03c0216
Jun 30 10:35:05 cawesh kernel: [29510.345896] ffff8802627a3d70 ffffffffc03cb166 ffff8805b432aec0 ffff88010444b800
Jun 30 10:35:05 cawesh kernel: [29510.345898] Call Trace:
Jun 30 10:35:05 cawesh kernel: [29510.345903] [<ffffffff813dd
Jun 30 10:35:05 cawesh kernel: [29510.345938] [<ffffffffc03b0
Jun 30 10:35:05 cawesh kernel: [29510.345955] [<ffffffffc03c0
Jun 30 10:35:05 cawesh kernel: [29510.345974] [<ffffffffc03cb
Jun 30 10:35:05 cawesh kernel: [29510.345991] [<ffffffffc03c0
Jun 30 10:35:05 cawesh kernel: [29510.346009] [<ffffffffc03bc
Jun 30 10:35:05 cawesh kernel: [29510.346027] [<ffffffffc03bc
Jun 30 10:35:05 cawesh kernel: [29510.346044] [<ffffffffc03bc
Jun 30 10:35:05 cawesh kernel: [29510.346047] [<ffffffff8120d
Jun 30 10:35:05 cawesh kernel: [29510.346049] [<ffffffff81211
Jun 30 10:35:05 cawesh kernel: [29510.346051] [<ffffffff8180c
Jun 30 10:35:05 cawesh kernel: [29510.346055] XFS (sdg): xfs_do_
Jun 30 10:35:05 cawesh kernel: [29510.346385] XFS (sdg): Corruption of in-memory data detected. Shutting down filesystem
Jun 30 10:35:05 cawesh kernel: [29510.346671] XFS (sdg): xfs_imap_to_bp: xfs_trans_
Jun 30 10:35:06 cawesh kernel: [29510.356952] XFS (sdg): Please umount the filesystem and rectify the problem(s)
Jun 30 10:35:06 cawesh kernel: [29510.362245] XFS (sdg): xfs_imap_to_bp: xfs_trans_
Jun 30 10:35:06 cawesh kernel: [29510.389098] XFS (sdg): xfs_imap_to_bp: xfs_trans_
Jun 30 10:35:32 cawesh kernel: [29536.602241] XFS (sdg): xfs_log_force: error -5 returned.
Jun 30 10:36:02 cawesh kernel: [29566.680878] XFS (sdg): xfs_log_force: error -5 returned.
Jun 30 10:36:32 cawesh kernel: [29596.759510] XFS (sdg): xfs_log_force: error -5 returned.
Jun 30 10:37:02 cawesh kernel: [29626.838049] XFS (sdg): xfs_log_force: error -5 returned.
Jun 30 10:37:32 cawesh kernel: [29656.920612] XFS (sdg): xfs_log_force: error -5 returned.
Jun 30 10:38:02 cawesh kernel: [29686.995254] XFS (sdg): xfs_log_force: error -5 returned.
Jun 30 10:38:32 cawesh kernel: [29717.073860] XFS (sdg): xfs_log_force: error -5 returned.
Jun 30 10:39:02 cawesh kernel: [29747.152459] XFS (sdg): xfs_log_force: error -5 returned.
Jun 30 10:39:32 cawesh kernel: [29777.231115] XFS (sdg): xfs_log_force: error -5 returned.
Jun 30 10:39:39 cawesh kernel: [29784.127051] XFS (sdg): xfs_log_force: error -5 returned.
== strace -e mkdir of the rsync process above ==
15036 mkdir("
15036 <... mkdir resumed> ) = 0
15036 mkdir("15f", 0700 <unfinished ...>
15036 <... mkdir resumed> ) = 0
15036 mkdir("201", 0700 <unfinished ...>
15036 <... mkdir resumed> ) = 0
15036 mkdir("254", 0700) = 0
15036 mkdir("282", 0700) = 0
15036 mkdir("365", 0700 <unfinished ...>
15036 <... mkdir resumed> ) = 0
15036 mkdir("47d", 0700 <unfinished ...>
15036 <... mkdir resumed> ) = 0
15036 mkdir("4a2", 0700 <unfinished ...>
15036 <... mkdir resumed> ) = 0
15036 mkdir("4b5", 0700 <unfinished ...>
15036 <... mkdir resumed> ) = 0
15036 mkdir("4e7", 0700) = 0
15036 mkdir("5cf", 0700 <unfinished ...>
15036 <... mkdir resumed> ) = 0
15036 mkdir("5ed", 0700 <unfinished ...>
15036 <... mkdir resumed> ) = 0
15036 mkdir("620", 0700 <unfinished ...>
15036 <... mkdir resumed> ) = 0
15036 mkdir("641", 0700 <unfinished ...>
15036 <... mkdir resumed> ) = -1 ENOSPC (No space left on device)
15036 mkdir("663", 0700) = -1 EIO (Input/output error)
15036 mkdir("695", 0700) = -1 EIO (Input/output error)
15036 mkdir("6aa", 0700) = -1 EIO (Input/output error)
15036 mkdir("74f", 0700) = -1 EIO (Input/output error)
15036 mkdir("7fa", 0700) = -1 EIO (Input/output error)
15036 mkdir("81c", 0700) = -1 EIO (Input/output error)
15036 mkdir("849", 0700) = -1 EIO (Input/output error)
15036 mkdir("859", 0700) = -1 EIO (Input/output error)
15036 mkdir("9d3", 0700) = -1 EIO (Input/output error)
15036 mkdir("a47", 0700) = -1 EIO (Input/output error)
15036 mkdir("a82", 0700) = -1 EIO (Input/output error)
15036 mkdir("a9d", 0700) = -1 EIO (Input/output error)
15036 mkdir("c78", 0700) = -1 EIO (Input/output error)
15036 mkdir("ce7", 0700) = -1 EIO (Input/output error)
15036 mkdir("d03", 0700) = -1 EIO (Input/output error)
15036 mkdir("d3f", 0700) = -1 EIO (Input/output error)
15036 mkdir("e3f", 0700) = -1 EIO (Input/output error)
15036 mkdir("ee9", 0700) = -1 EIO (Input/output error)
15036 mkdir("f19", 0700) = -1 EIO (Input/output error)
15036 mkdir("f9a", 0700) = -1 EIO (Input/output error)
15036 mkdir("15f", 040755) = -1 EEXIST (File exists)
15036 mkdir("
15036 mkdir("
15036 mkdir("201", 040755) = -1 EEXIST (File exists)
15036 mkdir("
15036 mkdir("
15036 mkdir("254", 040755) = -1 EEXIST (File exists)
15036 mkdir("
15036 mkdir("
15036 mkdir("282", 040755) = -1 EEXIST (File exists)
15036 mkdir("
15036 mkdir("
15036 mkdir("365", 040755) = -1 EEXIST (File exists)
15036 mkdir("
15036 mkdir("
15036 mkdir("47d", 040755) = -1 EEXIST (File exists)
15036 mkdir("
15036 mkdir("
15036 mkdir("4a2", 040755) = -1 EEXIST (File exists)
15036 mkdir("
15036 mkdir("
15036 mkdir("4b5", 040755) = -1 EEXIST (File exists)
15036 mkdir("
15036 mkdir("
15036 mkdir("4e7", 040755) = -1 EEXIST (File exists)
15036 mkdir("
15036 mkdir("
15036 mkdir("5cf", 040755) = -1 EEXIST (File exists)
15036 mkdir("
15036 mkdir("
15036 mkdir("5ed", 040755) = -1 EEXIST (File exists)
15036 mkdir("
15036 mkdir("
15036 mkdir("620", 040755) = -1 EEXIST (File exists)
15036 mkdir("
15036 mkdir("
15036 mkdir("641", 040755) = -1 EIO (Input/output error)
15036 mkdir("663", 040755) = -1 EIO (Input/output error)
15036 mkdir("695", 040755) = -1 EIO (Input/output error)
15036 mkdir("6aa", 040755) = -1 EIO (Input/output error)
15036 mkdir("74f", 040755) = -1 EIO (Input/output error)
15036 mkdir("7fa", 040755) = -1 EIO (Input/output error)
15036 mkdir("81c", 040755) = -1 EIO (Input/output error)
15036 mkdir("849", 040755) = -1 EIO (Input/output error)
15036 mkdir("859", 040755) = -1 EIO (Input/output error)
15036 mkdir("9d3", 040755) = -1 EIO (Input/output error)
15036 mkdir("a47", 040755) = -1 EIO (Input/output error)
15036 mkdir("a82", 040755) = -1 EIO (Input/output error)
15036 mkdir("a9d", 040755) = -1 EIO (Input/output error)
15036 mkdir("c78", 040755) = -1 EIO (Input/output error)
15036 mkdir("ce7", 040755) = -1 EIO (Input/output error)
15036 mkdir("d03", 040755) = -1 EIO (Input/output error)
15036 mkdir("d3f", 040755) = -1 EIO (Input/output error)
15036 mkdir("e3f", 040755) = -1 EIO (Input/output error)
15036 mkdir("ee9", 040755) = -1 EIO (Input/output error)
15036 mkdir("f19", 040755) = -1 EIO (Input/output error)
15036 mkdir("f9a", 040755) = -1 EIO (Input/output error)
15046 --- SIGUSR2 {si_signo=SIGUSR2, si_code=SI_USER, si_pid=15036, si_uid=106} ---
15036 --- SIGCHLD {si_signo=SIGCHLD, si_code=CLD_EXITED, si_pid=15046, si_status=0, si_utime=0, si_stime=0} ---
15036 +++ exited with 0 +++
3244 --- SIGCHLD {si_signo=SIGCHLD, si_code=CLD_EXITED, si_pid=15036, si_status=0, si_utime=0, si_stime=2} ---
ProblemType: Bug
DistroRelease: Ubuntu 14.04
Package: linux-image-
ProcVersionSign
Uname: Linux 4.4.0-83-generic x86_64
ApportVersion: 2.14.1-0ubuntu3.24
Architecture: amd64
Date: Fri Jun 30 11:35:33 2017
ProcEnviron:
TERM=screen-
PATH=(custom, no user)
LANG=en_US.UTF-8
SHELL=/bin/bash
SourcePackage: linux-lts-xenial
UpgradeStatus: No upgrade log present (probably fresh install)
Status changed to 'Confirmed' because the bug affects multiple users.