xfs_trans_cancel on mkdir()

Bug #1701542 reported by Junien Fridrick on 2017-06-30
26
This bug affects 5 people
Affects Status Importance Assigned to Milestone
linux-lts-xenial (Ubuntu)
High
Unassigned

Bug Description

Hi,

We have a server (a swift storage node) where a single XFS filesystem repeatedly becomes unusable (the 5 others are fine).

You'll find kern.log messages at the end of this message.

We can reproduce the problem easily : after an error occurs, we just need to umount/mount the filesystem. XFS will replay the log successfully and mount the filesystem. Then the swift processes will notice that the filesystem is back and will start sync'ing some data to it (not much, because the filesystem is full - swift will remove/add data to it) and then in ~1 or 2 minutes, the error pasted below will fire, and then the filesystem becomes unusable :

$ ls /srv/node/sdg
ls: cannot access /srv/node/sdg: Input/output error

It is still mounted though.

A few observations :
- it's always an rsync process that will trigger the error (probably expected, since most (if not all) write operations on the disk are done by rsync)
- it's always on an mkdir()

The "rsync" process (PID 15036) that crashed in the trace below was running under strace -e mkdir, you will find the output of that command below.

I tried reproducing the problem by repeating the "mkdir" after an umount/mount, but I'm unable to do so.

I believe the fact that the filesystem is nearly full is what causes this bug, which we're not seeing on other disks on this machine. We have the same problem on another disk on another machine (running 4.4.0-42-generic - even though, as I reconnect to it to grab the kernel version, it appears that the filesystem has managed to stay up for ~20 min now).

The question is : what information can we provide to help tracking this bug down ? I tried investigating with blktrace and fatrace, but that didn't really help as I'm still unable to reproduce the problem on my own.

Thanks !

== kernel trace ==
Jun 30 10:35:05 cawesh kernel: [29510.321474] XFS (sdg): Internal error xfs_trans_cancel at line 990 of file /build/linux-lts-xenial-ep3zLI/linux-lts-xenial-4.4.0/fs/xfs/xfs_trans.c. Caller xfs_create+0x4c6/0x6c0 [xfs]
Jun 30 10:35:05 cawesh kernel: [29510.345888] CPU: 3 PID: 15036 Comm: rsync Tainted: G I 4.4.0-83-generic #106~14.04.1-Ubuntu
Jun 30 10:35:05 cawesh kernel: [29510.345891] Hardware name: HP ProLiant DL380 G6, BIOS P62 08/16/2015
Jun 30 10:35:05 cawesh kernel: [29510.345892] 0000000000000000 ffff8802627a3d30 ffffffff813ddefc ffff88058d91b790
Jun 30 10:35:05 cawesh kernel: [29510.345894] 0000000000000001 ffff8802627a3d48 ffffffffc03b095b ffffffffc03c0216
Jun 30 10:35:05 cawesh kernel: [29510.345896] ffff8802627a3d70 ffffffffc03cb166 ffff8805b432aec0 ffff88010444b800
Jun 30 10:35:05 cawesh kernel: [29510.345898] Call Trace:
Jun 30 10:35:05 cawesh kernel: [29510.345903] [<ffffffff813ddefc>] dump_stack+0x63/0x87
Jun 30 10:35:05 cawesh kernel: [29510.345938] [<ffffffffc03b095b>] xfs_error_report+0x3b/0x40 [xfs]
Jun 30 10:35:05 cawesh kernel: [29510.345955] [<ffffffffc03c0216>] ? xfs_create+0x4c6/0x6c0 [xfs]
Jun 30 10:35:05 cawesh kernel: [29510.345974] [<ffffffffc03cb166>] xfs_trans_cancel+0xb6/0xe0 [xfs]
Jun 30 10:35:05 cawesh kernel: [29510.345991] [<ffffffffc03c0216>] xfs_create+0x4c6/0x6c0 [xfs]
Jun 30 10:35:05 cawesh kernel: [29510.346009] [<ffffffffc03bc993>] xfs_generic_create+0xc3/0x290 [xfs]
Jun 30 10:35:05 cawesh kernel: [29510.346027] [<ffffffffc03bcb94>] xfs_vn_mknod+0x14/0x20 [xfs]
Jun 30 10:35:05 cawesh kernel: [29510.346044] [<ffffffffc03bcbb6>] xfs_vn_mkdir+0x16/0x20 [xfs]
Jun 30 10:35:05 cawesh kernel: [29510.346047] [<ffffffff8120d3e9>] vfs_mkdir+0x109/0x1b0
Jun 30 10:35:05 cawesh kernel: [29510.346049] [<ffffffff812119c2>] SyS_mkdir+0xb2/0xe0
Jun 30 10:35:05 cawesh kernel: [29510.346051] [<ffffffff8180c9ae>] tracesys_phase2+0x88/0x8d
Jun 30 10:35:05 cawesh kernel: [29510.346055] XFS (sdg): xfs_do_force_shutdown(0x8) called from line 991 of file /build/linux-lts-xenial-ep3zLI/linux-lts-xenial-4.4.0/fs/xfs/xfs_trans.c. Return address = 0xffffffffc03cb17f
Jun 30 10:35:05 cawesh kernel: [29510.346385] XFS (sdg): Corruption of in-memory data detected. Shutting down filesystem
Jun 30 10:35:05 cawesh kernel: [29510.346671] XFS (sdg): xfs_imap_to_bp: xfs_trans_read_buf() returned error -5.
Jun 30 10:35:06 cawesh kernel: [29510.356952] XFS (sdg): Please umount the filesystem and rectify the problem(s)
Jun 30 10:35:06 cawesh kernel: [29510.362245] XFS (sdg): xfs_imap_to_bp: xfs_trans_read_buf() returned error -5.
Jun 30 10:35:06 cawesh kernel: [29510.389098] XFS (sdg): xfs_imap_to_bp: xfs_trans_read_buf() returned error -5.
Jun 30 10:35:32 cawesh kernel: [29536.602241] XFS (sdg): xfs_log_force: error -5 returned.
Jun 30 10:36:02 cawesh kernel: [29566.680878] XFS (sdg): xfs_log_force: error -5 returned.
Jun 30 10:36:32 cawesh kernel: [29596.759510] XFS (sdg): xfs_log_force: error -5 returned.
Jun 30 10:37:02 cawesh kernel: [29626.838049] XFS (sdg): xfs_log_force: error -5 returned.
Jun 30 10:37:32 cawesh kernel: [29656.920612] XFS (sdg): xfs_log_force: error -5 returned.
Jun 30 10:38:02 cawesh kernel: [29686.995254] XFS (sdg): xfs_log_force: error -5 returned.
Jun 30 10:38:32 cawesh kernel: [29717.073860] XFS (sdg): xfs_log_force: error -5 returned.
Jun 30 10:39:02 cawesh kernel: [29747.152459] XFS (sdg): xfs_log_force: error -5 returned.
Jun 30 10:39:32 cawesh kernel: [29777.231115] XFS (sdg): xfs_log_force: error -5 returned.
Jun 30 10:39:39 cawesh kernel: [29784.127051] XFS (sdg): xfs_log_force: error -5 returned.

== strace -e mkdir of the rsync process above ==
15036 mkdir("/sdg/objects/207716", 0777 <unfinished ...>
15036 <... mkdir resumed> ) = 0
15036 mkdir("15f", 0700 <unfinished ...>
15036 <... mkdir resumed> ) = 0
15036 mkdir("201", 0700 <unfinished ...>
15036 <... mkdir resumed> ) = 0
15036 mkdir("254", 0700) = 0
15036 mkdir("282", 0700) = 0
15036 mkdir("365", 0700 <unfinished ...>
15036 <... mkdir resumed> ) = 0
15036 mkdir("47d", 0700 <unfinished ...>
15036 <... mkdir resumed> ) = 0
15036 mkdir("4a2", 0700 <unfinished ...>
15036 <... mkdir resumed> ) = 0
15036 mkdir("4b5", 0700 <unfinished ...>
15036 <... mkdir resumed> ) = 0
15036 mkdir("4e7", 0700) = 0
15036 mkdir("5cf", 0700 <unfinished ...>
15036 <... mkdir resumed> ) = 0
15036 mkdir("5ed", 0700 <unfinished ...>
15036 <... mkdir resumed> ) = 0
15036 mkdir("620", 0700 <unfinished ...>
15036 <... mkdir resumed> ) = 0
15036 mkdir("641", 0700 <unfinished ...>
15036 <... mkdir resumed> ) = -1 ENOSPC (No space left on device)
15036 mkdir("663", 0700) = -1 EIO (Input/output error)
15036 mkdir("695", 0700) = -1 EIO (Input/output error)
15036 mkdir("6aa", 0700) = -1 EIO (Input/output error)
15036 mkdir("74f", 0700) = -1 EIO (Input/output error)
15036 mkdir("7fa", 0700) = -1 EIO (Input/output error)
15036 mkdir("81c", 0700) = -1 EIO (Input/output error)
15036 mkdir("849", 0700) = -1 EIO (Input/output error)
15036 mkdir("859", 0700) = -1 EIO (Input/output error)
15036 mkdir("9d3", 0700) = -1 EIO (Input/output error)
15036 mkdir("a47", 0700) = -1 EIO (Input/output error)
15036 mkdir("a82", 0700) = -1 EIO (Input/output error)
15036 mkdir("a9d", 0700) = -1 EIO (Input/output error)
15036 mkdir("c78", 0700) = -1 EIO (Input/output error)
15036 mkdir("ce7", 0700) = -1 EIO (Input/output error)
15036 mkdir("d03", 0700) = -1 EIO (Input/output error)
15036 mkdir("d3f", 0700) = -1 EIO (Input/output error)
15036 mkdir("e3f", 0700) = -1 EIO (Input/output error)
15036 mkdir("ee9", 0700) = -1 EIO (Input/output error)
15036 mkdir("f19", 0700) = -1 EIO (Input/output error)
15036 mkdir("f9a", 0700) = -1 EIO (Input/output error)
15036 mkdir("15f", 040755) = -1 EEXIST (File exists)
15036 mkdir("15f/cad923122b5b596c7bddab4d3076415f", 0700) = -1 EIO (Input/output error)
15036 mkdir("15f/cad923122b5b596c7bddab4d3076415f", 040755) = -1 EIO (Input/output error)
15036 mkdir("201", 040755) = -1 EEXIST (File exists)
15036 mkdir("201/cad929a69d50826c89cd552a1727e201", 0700) = -1 EIO (Input/output error)
15036 mkdir("201/cad929a69d50826c89cd552a1727e201", 040755) = -1 EIO (Input/output error)
15036 mkdir("254", 040755) = -1 EEXIST (File exists)
15036 mkdir("254/cad905e97142eec0cca1ebb35156e254", 0700) = -1 EIO (Input/output error)
15036 mkdir("254/cad905e97142eec0cca1ebb35156e254", 040755) = -1 EIO (Input/output error)
15036 mkdir("282", 040755) = -1 EEXIST (File exists)
15036 mkdir("282/cad92f4db09debbb03982be4a8d91282", 0700) = -1 EIO (Input/output error)
15036 mkdir("282/cad92f4db09debbb03982be4a8d91282", 040755) = -1 EIO (Input/output error)
15036 mkdir("365", 040755) = -1 EEXIST (File exists)
15036 mkdir("365/cad913ba60589d9abbc0c89610076365", 0700) = -1 EIO (Input/output error)
15036 mkdir("365/cad913ba60589d9abbc0c89610076365", 040755) = -1 EIO (Input/output error)
15036 mkdir("47d", 040755) = -1 EEXIST (File exists)
15036 mkdir("47d/cad918b63ace0228f67afafd7975147d", 0700) = -1 EIO (Input/output error)
15036 mkdir("47d/cad918b63ace0228f67afafd7975147d", 040755) = -1 EIO (Input/output error)
15036 mkdir("4a2", 040755) = -1 EEXIST (File exists)
15036 mkdir("4a2/cad904dcd84ef22d04d8a28d48edb4a2", 0700) = -1 EIO (Input/output error)
15036 mkdir("4a2/cad904dcd84ef22d04d8a28d48edb4a2", 040755) = -1 EIO (Input/output error)
15036 mkdir("4b5", 040755) = -1 EEXIST (File exists)
15036 mkdir("4b5/cad906e15ef0b16971cc5d2a852a74b5", 0700) = -1 EIO (Input/output error)
15036 mkdir("4b5/cad906e15ef0b16971cc5d2a852a74b5", 040755) = -1 EIO (Input/output error)
15036 mkdir("4e7", 040755) = -1 EEXIST (File exists)
15036 mkdir("4e7/cad90235a9717d7dc2c71643961694e7", 0700) = -1 EIO (Input/output error)
15036 mkdir("4e7/cad90235a9717d7dc2c71643961694e7", 040755) = -1 EIO (Input/output error)
15036 mkdir("5cf", 040755) = -1 EEXIST (File exists)
15036 mkdir("5cf/cad92538caf35dc41b708c7d90c8e5cf", 0700) = -1 EIO (Input/output error)
15036 mkdir("5cf/cad92538caf35dc41b708c7d90c8e5cf", 040755) = -1 EIO (Input/output error)
15036 mkdir("5ed", 040755) = -1 EEXIST (File exists)
15036 mkdir("5ed/cad9147196d3644876dba2fbdf2205ed", 0700) = -1 EIO (Input/output error)
15036 mkdir("5ed/cad9147196d3644876dba2fbdf2205ed", 040755) = -1 EIO (Input/output error)
15036 mkdir("620", 040755) = -1 EEXIST (File exists)
15036 mkdir("620/cad932e8be8f62fd6eafa510b9be6620", 0700) = -1 EIO (Input/output error)
15036 mkdir("620/cad932e8be8f62fd6eafa510b9be6620", 040755) = -1 EIO (Input/output error)
15036 mkdir("641", 040755) = -1 EIO (Input/output error)
15036 mkdir("663", 040755) = -1 EIO (Input/output error)
15036 mkdir("695", 040755) = -1 EIO (Input/output error)
15036 mkdir("6aa", 040755) = -1 EIO (Input/output error)
15036 mkdir("74f", 040755) = -1 EIO (Input/output error)
15036 mkdir("7fa", 040755) = -1 EIO (Input/output error)
15036 mkdir("81c", 040755) = -1 EIO (Input/output error)
15036 mkdir("849", 040755) = -1 EIO (Input/output error)
15036 mkdir("859", 040755) = -1 EIO (Input/output error)
15036 mkdir("9d3", 040755) = -1 EIO (Input/output error)
15036 mkdir("a47", 040755) = -1 EIO (Input/output error)
15036 mkdir("a82", 040755) = -1 EIO (Input/output error)
15036 mkdir("a9d", 040755) = -1 EIO (Input/output error)
15036 mkdir("c78", 040755) = -1 EIO (Input/output error)
15036 mkdir("ce7", 040755) = -1 EIO (Input/output error)
15036 mkdir("d03", 040755) = -1 EIO (Input/output error)
15036 mkdir("d3f", 040755) = -1 EIO (Input/output error)
15036 mkdir("e3f", 040755) = -1 EIO (Input/output error)
15036 mkdir("ee9", 040755) = -1 EIO (Input/output error)
15036 mkdir("f19", 040755) = -1 EIO (Input/output error)
15036 mkdir("f9a", 040755) = -1 EIO (Input/output error)
15046 --- SIGUSR2 {si_signo=SIGUSR2, si_code=SI_USER, si_pid=15036, si_uid=106} ---
15036 --- SIGCHLD {si_signo=SIGCHLD, si_code=CLD_EXITED, si_pid=15046, si_status=0, si_utime=0, si_stime=0} ---
15036 +++ exited with 0 +++
3244 --- SIGCHLD {si_signo=SIGCHLD, si_code=CLD_EXITED, si_pid=15036, si_status=0, si_utime=0, si_stime=2} ---

ProblemType: Bug
DistroRelease: Ubuntu 14.04
Package: linux-image-4.4.0-83-generic 4.4.0-83.106~14.04.1
ProcVersionSignature: Ubuntu 4.4.0-83.106~14.04.1-generic 4.4.70
Uname: Linux 4.4.0-83-generic x86_64
ApportVersion: 2.14.1-0ubuntu3.24
Architecture: amd64
Date: Fri Jun 30 11:35:33 2017
ProcEnviron:
 TERM=screen-256color
 PATH=(custom, no user)
 LANG=en_US.UTF-8
 SHELL=/bin/bash
SourcePackage: linux-lts-xenial
UpgradeStatus: No upgrade log present (probably fresh install)

Junien Fridrick (axino) wrote :
Launchpad Janitor (janitor) wrote :

Status changed to 'Confirmed' because the bug affects multiple users.

Changed in linux-lts-xenial (Ubuntu):
status: New → Confirmed
Junien Fridrick (axino) wrote :

Same bug happened with mainline kernel (4.12.0-041200rc7-generic) and also with the regular trusty kernel (3.13.0-123-generic)

Joseph Salisbury (jsalisbury) wrote :

It sounds like this is not a regression and you tested the mainline kernel. Version 4.12 final is now available. For completeness, can you give that one a test:

http://kernel.ubuntu.com/~kernel-ppa/mainline/v4.12

Changed in linux-lts-xenial (Ubuntu):
importance: Undecided → High
tags: added: kernel-da-key
tags: added: kernel-key
removed: kernel-da-key
Junien Fridrick (axino) wrote :

Unfortunately, we managed to free up some space on the disks, so it's going to be complicated to replicate for now.

To post a comment you must log in.
This report contains Public information  Edit
Everyone can see this information.

Other bug subscribers