XFS corruption on machine which never suffered a hard reset or disk failure

Bug #1049267 reported by xor
This bug affects 1 person
Affects: linux (Ubuntu)
Status: Expired
Importance: High
Assigned to: Unassigned
Milestone: (none listed)

Bug Description

Using Ubuntu 12.04 server, we installed a machine using the following disk layout:
XFS => dm-crypt => RAID5.

A *complete* record of the machine's configuration, including the setup steps, can be provided if you need it; we documented everything.

The hard disks are tested weekly with a full SMART self-test and report no errors.
The machine is attached to a UPS and therefore never suffered a hard reset.
Also, the memory was tested with memtest86+.

Nevertheless, the kernel reports XFS problems:

Sep 10 10:01:00 server kernel: [379001.376989] XFS (dm-0): xfs_da_do_buf: bno 0 dir: inode 3045868
Sep 10 10:01:00 server kernel: [379001.377011] XFS (dm-0): [00] br_startoff 0 br_startblock -2 br_blockcount 1 br_state 0
Sep 10 10:01:00 server kernel: [379001.377032] XFS (dm-0): Internal error xfs_da_do_buf(1) at line 2011 of file /build/buildd/linux-3.2.0/fs/xfs/xfs_da_btree.c. Caller 0xffffffffa01feeef
Sep 10 10:01:00 server kernel: [379001.377033]
Sep 10 10:01:00 server kernel: [379001.377069] Pid: 26624, comm: updatedb.mlocat Tainted: G C 3.2.0-30-generic #48-Ubuntu
Sep 10 10:01:00 server kernel: [379001.377071] Call Trace:
Sep 10 10:01:00 server kernel: [379001.377089] [<ffffffffa01cb6bf>] xfs_error_report+0x3f/0x50 [xfs]
Sep 10 10:01:00 server kernel: [379001.377099] [<ffffffffa01feeef>] ? xfs_da_reada_buf+0x2f/0x40 [xfs]
Sep 10 10:01:00 server kernel: [379001.377108] [<ffffffffa01fea12>] xfs_da_do_buf+0x182/0x630 [xfs]
Sep 10 10:01:00 server kernel: [379001.377117] [<ffffffffa01feeef>] xfs_da_reada_buf+0x2f/0x40 [xfs]
Sep 10 10:01:00 server kernel: [379001.377124] [<ffffffffa01cbdc8>] xfs_dir_open+0x68/0x80 [xfs]
Sep 10 10:01:00 server kernel: [379001.377127] [<ffffffff81175bd0>] __dentry_open+0x290/0x360
Sep 10 10:01:00 server kernel: [379001.377133] [<ffffffffa01cbd60>] ? xfs_dir_fsync+0x110/0x110 [xfs]
Sep 10 10:01:00 server kernel: [379001.377136] [<ffffffff8129cdbc>] ? security_inode_permission+0x1c/0x30
Sep 10 10:01:00 server kernel: [379001.377138] [<ffffffff8118389a>] ? inode_permission+0x4a/0x110
Sep 10 10:01:00 server kernel: [379001.377139] [<ffffffff8117624d>] vfs_open+0x3d/0x40
Sep 10 10:01:00 server kernel: [379001.377141] [<ffffffff81177130>] nameidata_to_filp+0x40/0x50
Sep 10 10:01:00 server kernel: [379001.377143] [<ffffffff811860d8>] do_last+0x3f8/0x730
Sep 10 10:01:00 server kernel: [379001.377144] [<ffffffff811877b1>] path_openat+0xd1/0x3f0
Sep 10 10:01:00 server kernel: [379001.377146] [<ffffffff811830f5>] ? putname+0x35/0x50
Sep 10 10:01:00 server kernel: [379001.377147] [<ffffffff81187b53>] ? user_path_at_empty+0x63/0xa0
Sep 10 10:01:00 server kernel: [379001.377149] [<ffffffff81187bf2>] do_filp_open+0x42/0xa0
Sep 10 10:01:00 server kernel: [379001.377152] [<ffffffff81319321>] ? strncpy_from_user+0x31/0x40
Sep 10 10:01:00 server kernel: [379001.377153] [<ffffffff81182f3a>] ? do_getname+0x10a/0x180
Sep 10 10:01:00 server kernel: [379001.377156] [<ffffffff8165a41e>] ? _raw_spin_lock+0xe/0x20
Sep 10 10:01:00 server kernel: [379001.377158] [<ffffffff81194eb7>] ? alloc_fd+0xf7/0x150
Sep 10 10:01:00 server kernel: [379001.377159] [<ffffffff8117722d>] do_sys_open+0xed/0x220
Sep 10 10:01:00 server kernel: [379001.377161] [<ffffffff81177380>] sys_open+0x20/0x30
Sep 10 10:01:00 server kernel: [379001.377163] [<ffffffff81662a02>] system_call_fastpath+0x16/0x1b
Sep 10 10:01:00 server kernel: [379001.377170] BUG: unable to handle kernel paging request at 0000000001000008
Sep 10 10:01:00 server kernel: [379001.377197] IP: [<ffffffff81122869>] file_ra_state_init+0x9/0x30
Sep 10 10:01:00 server kernel: [379001.377215] PGD 176937067 PUD 20eb89067 PMD 0
Sep 10 10:01:00 server kernel: [379001.377230] Oops: 0000 [#1] SMP
Sep 10 10:01:00 server kernel: [379001.377241] CPU 2
Sep 10 10:01:00 server kernel: [379001.377247] Modules linked in: nls_iso8859_1 nls_cp437 vfat fat usb_storage uas nfsd nfs lockd fscache binfmt_misc auth_rpcgss nfs_acl sunrpc psmouse joydev serio_raw mei(C) mac_hid lp parport xfs dm_crypt raid10 raid0 multipath linear aesni_intel cryptd aes_x86_64 usbhid hid raid1 raid456 async_pq async_xor xor async_memcpy async_raid6_recov raid6_pq async_tx i915 drm_kms_helper drm i2c_algo_bit video e1000e
Sep 10 10:01:00 server kernel: [379001.377384]
Sep 10 10:01:00 server kernel: [379001.377390] Pid: 26624, comm: updatedb.mlocat Tainted: G C 3.2.0-30-generic #48-Ubuntu /DH67GD
Sep 10 10:01:00 server kernel: [379001.377419] RIP: 0010:[<ffffffff81122869>] [<ffffffff81122869>] file_ra_state_init+0x9/0x30
Sep 10 10:01:00 server kernel: [379001.377441] RSP: 0018:ffff8801d6a35c98 EFLAGS: 00010206
Sep 10 10:01:00 server kernel: [379001.377454] RAX: ffff880073981bc5 RBX: ffff880157dde800 RCX: 0000000000000001
Sep 10 10:01:00 server kernel: [379001.377471] RDX: 0000000000000001 RSI: 0000000000ffff88 RDI: ffff880157dde870
Sep 10 10:01:00 server kernel: [379001.377489] RBP: ffff8801d6a35c98 R08: 000000000000000a R09: 0000000000000000
Sep 10 10:01:00 server kernel: [379001.377506] R10: 0000000000000000 R11: 0000000000000000 R12: ffff8801d6a35e38
Sep 10 10:01:00 server kernel: [379001.377523] R13: ffff880073981bc0 R14: ffff88013113d000 R15: 0000000000000000
Sep 10 10:01:00 server kernel: [379001.377541] FS: 00007fbaf36b6700(0000) GS:ffff88021f300000(0000) knlGS:0000000000000000
Sep 10 10:01:00 server kernel: [379001.377560] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Sep 10 10:01:00 server kernel: [379001.377575] CR2: 0000000001000008 CR3: 00000001e6fe3000 CR4: 00000000000406e0
Sep 10 10:01:00 server kernel: [379001.377592] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
Sep 10 10:01:00 server kernel: [379001.377609] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Sep 10 10:01:00 server kernel: [379001.377626] Process updatedb.mlocat (pid: 26624, threadinfo ffff8801d6a34000, task ffff88020de15c00)
Sep 10 10:01:00 server kernel: [379001.377647] Stack:
Sep 10 10:01:00 server kernel: [379001.377654] ffff8801d6a35cf8 ffffffff81175bf9 ffffffffa01cbd60 ffffffff8129cdbc
Sep 10 10:01:00 server kernel: [379001.377676] ffff8801de5d4600 ffffffff8118389a ffff8801d6a35d18 ffff8801d6a35e38
Sep 10 10:01:00 server kernel: [379001.377698] 0000000000058000 0000000000000000 ffff88013113d000 0000000000000000
Sep 10 10:01:00 server kernel: [379001.377719] Call Trace:
Sep 10 10:01:00 server kernel: [379001.377727] [<ffffffff81175bf9>] __dentry_open+0x2b9/0x360
Sep 10 10:01:00 server kernel: [379001.377747] [<ffffffffa01cbd60>] ? xfs_dir_fsync+0x110/0x110 [xfs]
Sep 10 10:01:00 server kernel: [379001.377763] [<ffffffff8129cdbc>] ? security_inode_permission+0x1c/0x30
Sep 10 10:01:00 server kernel: [379001.377780] [<ffffffff8118389a>] ? inode_permission+0x4a/0x110
Sep 10 10:01:00 server kernel: [379001.377794] [<ffffffff8117624d>] vfs_open+0x3d/0x40
Sep 10 10:01:00 server kernel: [379001.377807] [<ffffffff81177130>] nameidata_to_filp+0x40/0x50
Sep 10 10:01:00 server kernel: [379001.377822] [<ffffffff811860d8>] do_last+0x3f8/0x730
Sep 10 10:01:00 server kernel: [379001.377835] [<ffffffff811877b1>] path_openat+0xd1/0x3f0
Sep 10 10:01:00 server kernel: [379001.377849] [<ffffffff811830f5>] ? putname+0x35/0x50
Sep 10 10:01:00 server kernel: [379001.377862] [<ffffffff81187b53>] ? user_path_at_empty+0x63/0xa0
Sep 10 10:01:00 server kernel: [379001.377878] [<ffffffff81187bf2>] do_filp_open+0x42/0xa0
Sep 10 10:01:00 server kernel: [379001.377892] [<ffffffff81319321>] ? strncpy_from_user+0x31/0x40
Sep 10 10:01:00 server kernel: [379001.377907] [<ffffffff81182f3a>] ? do_getname+0x10a/0x180
Sep 10 10:01:00 server kernel: [379001.377921] [<ffffffff8165a41e>] ? _raw_spin_lock+0xe/0x20
Sep 10 10:01:00 server kernel: [379001.377935] [<ffffffff81194eb7>] ? alloc_fd+0xf7/0x150
Sep 10 10:01:00 server kernel: [379001.377949] [<ffffffff8117722d>] do_sys_open+0xed/0x220
Sep 10 10:01:00 server kernel: [379001.377963] [<ffffffff81177380>] sys_open+0x20/0x30
Sep 10 10:01:00 server kernel: [379001.377975] [<ffffffff81662a02>] system_call_fastpath+0x16/0x1b
Sep 10 10:01:00 server kernel: [379001.377990] Code: ff ff 48 c7 c2 e0 b7 c3 81 e9 d7 fe ff ff 48 8b 73 30 e9 65 ff ff ff 90 90 90 90 90 90 90 90 90 90 90 55 48 89 e5 66 66 66 66 90 <48> 8b 86 80 00 00 00 5d 48 8b 40 10 48 c7 47 18 ff ff ff ff 89
Sep 10 10:01:00 server kernel: [379001.378112] RIP [<ffffffff81122869>] file_ra_state_init+0x9/0x30
Sep 10 10:01:00 server kernel: [379001.378129] RSP <ffff8801d6a35c98>
Sep 10 10:01:00 server kernel: [379001.378138] CR2: 0000000001000008
Sep 10 10:01:01 server kernel: [379001.501110] ---[ end trace 2e597406c2d3462c ]---
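
A small consistency check on the Oops above, offered as a sketch rather than a diagnosis. The byte marked `<48>` in the `Code:` line starts the sequence `48 8b 86 80 00 00 00`, which decodes as a 64-bit load from `0x80(%rsi)` into `%rax`. If that reading is right, the faulting address in CR2 should equal RSI + 0x80. The helper names below are hypothetical, and the log excerpt is copied from the trace with the syslog prefix dropped and only the tail of the `Code:` line kept. Which structure field lives at that offset is not stated anywhere in this report, so the script makes no claim about it.

```python
import re

# Excerpt of the Oops above (syslog prefix removed, Code: line shortened to its tail).
OOPS = """\
RAX: ffff880073981bc5 RBX: ffff880157dde800 RCX: 0000000000000001
RDX: 0000000000000001 RSI: 0000000000ffff88 RDI: ffff880157dde870
CR2: 0000000001000008 CR3: 00000001e6fe3000 CR4: 00000000000406e0
Code: 55 48 89 e5 66 66 66 66 90 <48> 8b 86 80 00 00 00 5d 48 8b 40 10 48 c7 47 18 ff ff ff ff 89
"""


def parse_registers(text):
    """Return a dict mapping register names (RAX, CR2, ...) to integer values."""
    pairs = re.findall(r"\b([A-Z][A-Z0-9]{1,2}): ([0-9a-f]{16})\b", text)
    return {name: int(value, 16) for name, value in pairs}


def faulting_displacement(text):
    """Decode the marked instruction, assuming it is a load from disp32(%rsi)."""
    code = re.search(r"Code: (.*)", text).group(1).split()
    start = next(i for i, byte in enumerate(code) if byte.startswith("<"))
    insn = [int(byte.strip("<>"), 16) for byte in code[start:start + 7]]
    # 0x48 = REX.W prefix, 0x8b = MOV r64, r/m64,
    # 0x86 = ModRM with mod=10 (disp32), reg=rax, rm=rsi.
    if insn[:3] != [0x48, 0x8B, 0x86]:
        raise ValueError("not the instruction pattern this sketch handles")
    # The 32-bit displacement follows the ModRM byte in little-endian order.
    return int.from_bytes(bytes(insn[3:7]), "little")


if __name__ == "__main__":
    regs = parse_registers(OOPS)
    disp = faulting_displacement(OOPS)
    print("RSI + disp =", hex(regs["RSI"] + disp))
    print("CR2        =", hex(regs["CR2"]))
```

The two printed values can be compared directly. Note also that the RSI value does not resemble the other kernel pointers in the register dump, which all start with `ffff88`.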
---
AlsaDevices:
 total 0
 crw-rw---T 1 root audio 116, 1 Sep 16 21:55 seq
 crw-rw---T 1 root audio 116, 33 Sep 16 21:55 timer
AplayDevices: Error: [Errno 2] No such file or directory
ApportVersion: 2.0.1-0ubuntu13
Architecture: amd64
ArecordDevices: Error: [Errno 2] No such file or directory
AudioDevicesInUse: Error: command ['fuser', '-v', '/dev/snd/seq', '/dev/snd/timer'] failed with exit code 1:
CRDA: Error: [Errno 2] No such file or directory
DistroRelease: Ubuntu 12.04
Package: linux (not installed)
PciMultimedia:

ProcEnviron:
 LANGUAGE=en_US:en
 TERM=linux
 LANG=en_US.UTF-8
 SHELL=/bin/bash
ProcFB: 0 inteldrmfb
ProcKernelCmdLine: BOOT_IMAGE=/vmlinuz-3.2.0-30-generic root=/dev/mapper/md1_crypt ro
ProcVersionSignature: Ubuntu 3.2.0-30.48-generic 3.2.27
RelatedPackageVersions:
 linux-restricted-modules-3.2.0-30-generic N/A
 linux-backports-modules-3.2.0-30-generic N/A
 linux-firmware 1.79.1
RfKill: Error: [Errno 2] No such file or directory
StagingDrivers: mei
Tags: precise staging
Uname: Linux 3.2.0-30-generic x86_64
UpgradeStatus: No upgrade log present (probably fresh install)
UserGroups:

dmi.bios.date: 06/15/2012
dmi.bios.vendor: Intel Corp.
dmi.bios.version: BLH6710H.86A.0156.2012.0615.1908
dmi.board.asset.tag: To be filled by O.E.M.
dmi.board.name: DH67GD
dmi.board.vendor: Intel Corporation
dmi.board.version: AAG10206-210
dmi.chassis.type: 3
dmi.modalias: dmi:bvnIntelCorp.:bvrBLH6710H.86A.0156.2012.0615.1908:bd06/15/2012:svn:pn:pvr:rvnIntelCorporation:rnDH67GD:rvrAAG10206-210:cvn:ct3:cvr:

Revision history for this message
xor (xor) wrote :

After that, the following happened very often:

Sep 10 11:58:11 server kernel: [386031.913144] BUG: soft lockup - CPU#0 stuck for 23s! [kswapd0:35]
Sep 10 11:58:11 server kernel: [386031.913200] Modules linked in: nls_iso8859_1 nls_cp437 vfat fat usb_storage uas nfsd nfs lockd fscache binfmt_misc auth_rpcgss nfs_acl sunrpc psmouse joydev serio_raw mei(C) mac_hid lp parport xfs dm_crypt raid10 raid0 multipath linear aesni_intel cryptd aes_x86_64 usbhid hid raid1 raid456 async_pq async_xor xor async_memcpy async_raid6_recov raid6_pq async_tx i915 drm_kms_helper drm i2c_algo_bit video e1000e
Sep 10 11:58:11 server kernel: [386031.913512] CPU 0
Sep 10 11:58:11 server kernel: [386031.913526] Modules linked in: nls_iso8859_1 nls_cp437 vfat fat usb_storage uas nfsd nfs lockd fscache binfmt_misc auth_rpcgss nfs_acl sunrpc psmouse joydev serio_raw mei(C) mac_hid lp parport xfs dm_crypt raid10 raid0 multipath linear aesni_intel cryptd aes_x86_64 usbhid hid raid1 raid456 async_pq async_xor xor async_memcpy async_raid6_recov raid6_pq async_tx i915 drm_kms_helper drm i2c_algo_bit video e1000e
Sep 10 11:58:11 server kernel: [386031.921028]
Sep 10 11:58:11 server kernel: [386031.923600] Pid: 35, comm: kswapd0 Tainted: G D C 3.2.0-30-generic #48-Ubuntu /DH67GD
Sep 10 11:58:11 server kernel: [386031.926215] RIP: 0010:[<ffffffff8103dc4d>] [<ffffffff8103dc4d>] __ticket_spin_lock+0xd/0x30
Sep 10 11:58:11 server kernel: [386031.928810] RSP: 0018:ffff88020f911b80 EFLAGS: 00000286
Sep 10 11:58:11 server kernel: [386031.931399] RAX: 00000000ed7ded7d RBX: ffff88021f20ec40 RCX: ffff880073983d80
Sep 10 11:58:11 server kernel: [386031.933958] RDX: ffff88013113d740 RSI: 0000000000000001 RDI: ffff88013113d71c
Sep 10 11:58:11 server kernel: [386031.936490] RBP: ffff88020f911b80 R08: 0000000000000001 R09: dead000000200200
Sep 10 11:58:11 server kernel: [386031.939047] R10: 0000000000000000 R11: dead000000200200 R12: 0000000000000000
Sep 10 11:58:11 server kernel: [386031.941586] R13: 0000000000000000 R14: 0000000000000020 R15: ffffffff8112a74f
Sep 10 11:58:11 server kernel: [386031.944133] FS: 0000000000000000(0000) GS:ffff88021f200000(0000) knlGS:0000000000000000
Sep 10 11:58:11 server kernel: [386031.946716] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
Sep 10 11:58:11 server kernel: [386031.949245] CR2: 00007f6cd7267400 CR3: 0000000001c05000 CR4: 00000000000406f0
Sep 10 11:58:11 server kernel: [386031.951695] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
Sep 10 11:58:11 server kernel: [386031.954102] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Sep 10 11:58:11 server kernel: [386031.956484] Process kswapd0 (pid: 35, threadinfo ffff88020f910000, task ffff88020f908000)
Sep 10 11:58:11 server kernel: [386031.958863] Stack:
Sep 10 11:58:11 server kernel: [386031.961214] ffff88020f911b90 ffffffff8165a41e ffff88020f911c00 ffffffff8118eadf
Sep 10 11:58:11 server kernel: [386031.963586] ffff88020c7f1000 ffff880073983d80 ffff88013113d740 ffff88018d115600
Sep 10 11:58:11 server kernel: [386031.965963] ffff88020f911bd0 ffff88013113d740 ffff88020f911c30 ffff8801765034dc
Sep 10 11:58:11 server kerne...

Revision history for this message
Ubuntu Foundations Team Bug Bot (crichton) wrote :

Thank you for taking the time to report this bug and helping to make Ubuntu better. It seems that your bug report is not filed about a specific source package though, rather it is just filed against Ubuntu in general. It is important that bug reports be filed about source packages so that people interested in the package can find the bugs about it. You can find some hints about determining what package your bug might be about at https://wiki.ubuntu.com/Bugs/FindRightPackage. You might also ask for help in the #ubuntu-bugs irc channel on Freenode.

To change the source package that this bug is filed about visit https://bugs.launchpad.net/ubuntu/+bug/1049267/+editstatus and add the package name in the text box next to the word Package.

[This is an automated message. I apologize if it reached you inappropriately; please just reply to this message indicating so.]

tags: added: bot-comment
affects: ubuntu → linux (Ubuntu)
Revision history for this message
Brad Figg (brad-figg) wrote : Missing required logs.

This bug is missing log files that will aid in diagnosing the problem. From a terminal window please run:

apport-collect 1049267

and then change the status of the bug to 'Confirmed'.

If, due to the nature of the issue you have encountered, you are unable to run this command, please add a comment stating that fact and change the bug status to 'Confirmed'.

This change has been made by an automated script, maintained by the Ubuntu Kernel Team.

Changed in linux (Ubuntu):
status: New → Incomplete
tags: added: precise
Revision history for this message
Joseph Salisbury (jsalisbury) wrote :

Would it be possible for you to test the latest upstream kernel? Refer to https://wiki.ubuntu.com/KernelMainlineBuilds . Please test the latest v3.6 kernel[0] (Not a kernel in the daily directory) and install both the linux-image and linux-image-extra .deb packages.

Once you've tested the upstream kernel, please remove the 'needs-upstream-testing' tag. Please only remove that one tag and leave the other tags. This can be done by clicking on the yellow pencil icon next to the tag located at the bottom of the bug description and deleting the 'needs-upstream-testing' text.

If this bug is fixed in the mainline kernel, please add the following tag 'kernel-fixed-upstream'.

If the mainline kernel does not fix this bug, please add the tag: 'kernel-bug-exists-upstream'.

If you are unable to test the mainline kernel, for example it will not boot, please add the tag: 'kernel-unable-to-test-upstream'.
Once testing of the upstream kernel is complete, please mark this bug as "Confirmed".

Thanks in advance.

[0] http://kernel.ubuntu.com/~kernel-ppa/mainline/v3.6-rc5-quantal/

Changed in linux (Ubuntu):
importance: Undecided → High
tags: added: kernel-da-key
tags: added: file-ra-state-init
tags: added: needs-upstream-testing
Revision history for this message
xor (xor) wrote :

(In reply to bot comment #3: We will try to do that. We hope that apport-collect is not a GUI application, since the affected machine does not have an X server.)

In reply to comment #4:
Do you have an actual indication that the upstream kernel would fix this? In other words: Does its changelog contain something about XFS? The machine is a multi-user production machine. We CAN do some testing with it, but it needs to be justified.

Revision history for this message
xor (xor) wrote : AcpiTables.txt

apport information

tags: added: apport-collected staging
description: updated
Revision history for this message
xor (xor) wrote : BootDmesg.txt

apport information

Revision history for this message
xor (xor) wrote : CurrentDmesg.txt

apport information

Revision history for this message
xor (xor) wrote : IwConfig.txt

apport information

Revision history for this message
xor (xor) wrote : Lspci.txt

apport information

Revision history for this message
xor (xor) wrote : Lsusb.txt

apport information

Revision history for this message
xor (xor) wrote : ProcCpuinfo.txt

apport information

Revision history for this message
xor (xor) wrote : ProcInterrupts.txt

apport information

Revision history for this message
xor (xor) wrote : ProcModules.txt

apport information

Revision history for this message
xor (xor) wrote : UdevDb.txt

apport information

Revision history for this message
xor (xor) wrote : UdevLog.txt

apport information

Revision history for this message
xor (xor) wrote : WifiSyslog.txt

apport information

Revision history for this message
xor (xor) wrote :

NOTICE: This happened on the same machine as bug #1051689. After the machine suffered from #1051689, we tried to do a full backup of the machine onto ext4, which then also crashed due to a NULL pointer dereference.
Maybe the underlying issue is a RAID/dm-crypt bug? Both the XFS and the ext4 filesystems were on RAID/dm-crypt.

Revision history for this message
Joseph Salisbury (jsalisbury) wrote :

As requested in comment #4, it would be helpful to know if this bug also exists upstream, as well as bug 1051689. There is no indication that this specific issue is already fixed upstream, but testing the mainline kernel will prove or disprove that.

Revision history for this message
xor (xor) wrote :

I now did the following:
- I put the disks of an affected machine (not the original one in this bug report) into a Debian 6 machine which has been running rock-solid with XFS for years.
- On the Debian 6 machine, I used a script of my own to generate a listing of checksums and file dates for ALL files (~2.5 TB) on the disks.
- I then booted Ubuntu 12.04 from a USB stick and ran xfs_repair on the affected XFS filesystem.
- After the repair finished, I put the disks back into the Debian 6 machine and generated the checksum / file date listing again.
- I diffed the pre-repair and post-repair checksums and file dates. They are absolutely identical.
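
The reporter's own script is not included in this report. As a minimal sketch of the kind of before/after comparison described in the steps above, the following Python walks a directory tree, records a SHA-256 checksum and modification time per file, and compares two such listings. The function names and the choice of SHA-256 are assumptions made for illustration, not details from the original script.

```python
import hashlib
import os


def file_sha256(path, chunk_size=1 << 20):
    """Hash a file in chunks so large files do not have to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def snapshot(root):
    """Map each regular file's path (relative to root) to (sha256, mtime_ns)."""
    listing = {}
    for dirpath, dirnames, filenames in os.walk(root):
        dirnames.sort()  # deterministic traversal order
        for name in sorted(filenames):
            path = os.path.join(dirpath, name)
            # Skip symlinks and special files; only regular files are hashed.
            if os.path.islink(path) or not os.path.isfile(path):
                continue
            rel = os.path.relpath(path, root)
            listing[rel] = (file_sha256(path), os.stat(path).st_mtime_ns)
    return listing


def compare(before, after):
    """Return sorted lists of paths that are (missing, added, changed)."""
    missing = sorted(set(before) - set(after))
    added = sorted(set(after) - set(before))
    changed = sorted(p for p in set(before) & set(after) if before[p] != after[p])
    return missing, added, changed
```

In a workflow like the one above, the first `snapshot()` result would have to be saved to disk (for example with `json.dump`) before running the repair, then reloaded and passed to `compare()` together with the post-repair snapshot. Three empty lists correspond to the "absolutely identical" outcome reported here.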

Conclusion:
Since the Debian 6 machine did not complain about corruption while generating the checksums, and the checksums were not changed by the repair, maybe there is no actual on-disk corruption and this was rather a crash bug?

I will put the affected machine back into operation with a 3.6 kernel as requested.
HOWEVER, I should say that it took multiple weeks of operation until the issue first happened, so I don't think that testing this with 3.6 will disprove anything anytime soon. I think you guys should read the changelogs of the kernels or actually look at the stack trace and see what happened :|

Revision history for this message
xor (xor) wrote :

I was going to install the latest mainline kernel.
HOWEVER

- "dpkg-sig --list" shows that the packages contain no signatures at all.
- Further, there don't seem to be any signature files on the webserver [0].
- The webserver does not accept https connections.
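
For readers who want to reproduce the first check without dpkg-sig installed, here is a rough sketch. It assumes dpkg-sig's convention of storing signatures as extra members of the .deb's ar archive whose names start with `_gpg` (for example `_gpgbuilder`). It only lists such members and does not verify anything cryptographically. The function names are hypothetical.

```python
AR_MAGIC = b"!<arch>\n"
AR_HEADER_SIZE = 60


def ar_member_names(data):
    """List member names of an ar archive given as bytes (a .deb is an ar archive)."""
    if not data.startswith(AR_MAGIC):
        raise ValueError("not an ar archive")
    names = []
    offset = len(AR_MAGIC)
    while offset + AR_HEADER_SIZE <= len(data):
        header = data[offset:offset + AR_HEADER_SIZE]
        # Each member header ends with the two-byte terminator "`\n".
        if header[58:60] != b"`\n":
            raise ValueError("corrupt ar member header")
        name = header[:16].decode("ascii").rstrip().rstrip("/")
        size = int(header[48:58].decode("ascii").strip())
        names.append(name)
        # Member data is padded to an even length.
        offset += AR_HEADER_SIZE + size + (size % 2)
    return names


def signature_members(data):
    """Return members that look like dpkg-sig signatures (assumed `_gpg` prefix)."""
    return [name for name in ar_member_names(data) if name.startswith("_gpg")]
```

Reading a downloaded package with `open(path, "rb").read()` and passing the bytes to `signature_members()` returns an empty list when no such members are present, which would match the `dpkg-sig --list` observation above.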

While installing a release-candidate kernel on a production machine is something which I dislike already, the fact that it doesn't even contain a signature makes this unacceptable.
Please provide signed packages.

[0] http://kernel.ubuntu.com/~kernel-ppa/mainline/v3.6-rc7-quantal/

Revision history for this message
Launchpad Janitor (janitor) wrote :

[Expired for linux (Ubuntu) because there has been no activity for 60 days.]

Changed in linux (Ubuntu):
status: Incomplete → Expired