XFS Oopsing 4.2.0-35-generic Kernel - Block out of range
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
linux (Ubuntu) | Confirmed | High | Unassigned |
Wily | Confirmed | High | Unassigned |
Bug Description
After updating the kernel on two of our Trusty Ceph servers from 3.19.0-33-generic (vivid) to 4.2.0-35-generic (wily), both servers have now Oopsed multiple times while accessing some of their XFS filesystems.
Once this error occurs, the XFS filesystem becomes permanently inaccessible. Even after a reboot, any attempt to access the affected filesystem, whether by mounting it or by running xfs_repair, triggers the same Oops again.
These filesystems did not fail immediately; the first failures came several hours after the upgrade. I suspect that this is related to some housekeeping being triggered within the filesystem.
This is the call trace from the moment a filesystem first failed:
Apr 28 22:47:25 ceph-store5 kernel: [26692.804773] ------------[ cut here ]------------
Apr 28 22:47:25 ceph-store5 kernel: [26692.810046] WARNING: CPU: 8 PID: 5195 at /build/
Apr 28 22:47:25 ceph-store5 kernel: [26692.824715] Modules linked in: bridge openvswitch xfs dcdbas ipmi_devintf intel_rapl x86_pkg_
Apr 28 22:47:25 ceph-store5 kernel: [26692.891939] CPU: 8 PID: 5195 Comm: ceph-osd Tainted: G D W 4.2.0-35-generic #40~14.04.1-Ubuntu
Apr 28 22:47:25 ceph-store5 kernel: [26692.902689] Hardware name: Dell Inc. PowerEdge R720xd/0HJK12, BIOS 2.4.3 07/09/2014
Apr 28 22:47:25 ceph-store5 kernel: [26692.911303] 0000000000000000 ffff8800c7bf74a8 ffffffff817bcbf8 0000000000000000
Apr 28 22:47:25 ceph-store5 kernel: [26692.919677] ffffffffc0a7f300 ffff8800c7bf74e8 ffffffff81079b5a ffff8800c7bf7508
Apr 28 22:47:25 ceph-store5 kernel: [26692.928033] ffff88100a28ed80 0000000000000008 0000000000000000 ffff8800c7bf75f8
Apr 28 22:47:25 ceph-store5 kernel: [26692.936441] Call Trace:
Apr 28 22:47:25 ceph-store5 kernel: [26692.939208] [<ffffffff817bc
Apr 28 22:47:25 ceph-store5 kernel: [26692.944992] [<ffffffff81079
Apr 28 22:47:25 ceph-store5 kernel: [26692.951752] [<ffffffff81079
Apr 28 22:47:25 ceph-store5 kernel: [26692.958346] [<ffffffffc0a3e
Apr 28 22:47:25 ceph-store5 kernel: [26692.965221] [<ffffffffc0a3e
Apr 28 22:47:25 ceph-store5 kernel: [26692.972194] [<ffffffffc0a01
Apr 28 22:47:25 ceph-store5 kernel: [26692.979763] [<ffffffffc0a69
Apr 28 22:47:25 ceph-store5 kernel: [26692.987412] [<ffffffffc0a17
Apr 28 22:47:25 ceph-store5 kernel: [26692.994564] [<ffffffffc0a01
Apr 28 22:47:25 ceph-store5 kernel: [26693.002302] [<ffffffffc0a3d
Apr 28 22:47:25 ceph-store5 kernel: [26693.009379] [<ffffffffc0a32
Apr 28 22:47:25 ceph-store5 kernel: [26693.016229] [<ffffffff813b0
Apr 28 22:47:25 ceph-store5 kernel: [26693.022814] [<ffffffffc0a32
Apr 28 22:47:25 ceph-store5 kernel: [26693.029669] [<ffffffffc0a02
Apr 28 22:47:25 ceph-store5 kernel: [26693.036915] [<ffffffffc0a12
Apr 28 22:47:25 ceph-store5 kernel: [26693.044074] [<ffffffffc0a12
Apr 28 22:47:25 ceph-store5 kernel: [26693.050833] [<ffffffffc0a13
Apr 28 22:47:25 ceph-store5 kernel: [26693.057882] [<ffffffffc0a1d
Apr 28 22:47:25 ceph-store5 kernel: [26693.065403] [<ffffffff811d1
Apr 28 22:47:25 ceph-store5 kernel: [26693.071509] [<ffffffffc0a5b
Apr 28 22:47:25 ceph-store5 kernel: [26693.078078] [<ffffffffc0a1d
Apr 28 22:47:25 ceph-store5 kernel: [26693.085129] [<ffffffffc0a08
Apr 28 22:47:25 ceph-store5 kernel: [26693.093151] [<ffffffffc0a5b
Apr 28 22:47:25 ceph-store5 kernel: [26693.100199] [<ffffffffc0a5b
Apr 28 22:47:25 ceph-store5 kernel: [26693.107251] [<ffffffffc0a04
Apr 28 22:47:25 ceph-store5 kernel: [26693.114009] [<ffffffffc0a5a
Apr 28 22:47:25 ceph-store5 kernel: [26693.120647] [<ffffffff81212
Apr 28 22:47:25 ceph-store5 kernel: [26693.127107] [<ffffffff81213
Apr 28 22:47:25 ceph-store5 kernel: [26693.134045] [<ffffffff81213
Apr 28 22:47:25 ceph-store5 kernel: [26693.145904] [<ffffffff81213
Apr 28 22:47:25 ceph-store5 kernel: [26693.151781] [<ffffffff81213
Apr 28 22:47:25 ceph-store5 kernel: [26693.157626] [<ffffffffc0a73
Apr 28 22:47:25 ceph-store5 kernel: [26693.164599] [<ffffffffc0a74
Apr 28 22:47:25 ceph-store5 kernel: [26693.172451] [<ffffffffc0a5a
Apr 28 22:47:25 ceph-store5 kernel: [26693.179694] [<ffffffff811f1
Apr 28 22:47:25 ceph-store5 kernel: [26693.186278] [<ffffffffc0a5a
Apr 28 22:47:25 ceph-store5 kernel: [26693.193122] [<ffffffff81213
Apr 28 22:47:25 ceph-store5 kernel: [26693.199193] [<ffffffff817c4
Apr 28 22:47:25 ceph-store5 kernel: [26693.206471] ---[ end trace c0a9568b8830fc8b ]---
Apr 28 22:47:25 ceph-store5 kernel: [26693.211904] XFS (bcache7): _xfs_buf_find: Block out of range: block 0x874702fa0, EOFS 0x1d1c0bea0
Apr 28 22:47:25 ceph-store5 kernel: [26693.221992] ------------[ cut here ]------------
I notice xfs_da_grow_inode in the trace. Am I right in thinking that this was a new feature added in the 4.2 kernel?
I expect that the history of these filesystems is important. They would most likely have been formatted on a 3.16 kernel. Although I can't inspect the failed filesystems, I can give information about the other filesystems on the same servers, which I am 99% sure are identical:
root@ceph-store5:~# xfs_info /var/lib/
meta-data=
= sectsz=512 attr=2, projid32bit=1
= crc=1 finobt=1
data = bsize=4096 blocks=976754644, imaxpct=5
= sunit=0 swidth=0 blks
naming =version 2 bsize=4096 ascii-ci=0 ftype=1
log =internal bsize=4096 blocks=476930, version=2
= sectsz=512 sunit=0 blks, lazy-count=1
realtime =none extsz=4096 blocks=0, rtextents=0
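The assumption that the failed filesystems share this geometry can be cross-checked against the warning itself. The sketch below is a rough check rather than part of the original report: it assumes that the "block" and "EOFS" values printed by _xfs_buf_find are counted in 512-byte basic blocks, and the names used (eofs_in_basic_blocks, REPORTED_EOFS and so on) are made up for illustration. It converts the data-section size from xfs_info into basic blocks and compares it with the values from the log.

```python
# Values copied from the xfs_info output and the kernel warning in this report.
DATA_BLOCKS = 976754644       # "blocks=" on the data line of xfs_info
FS_BLOCK_SIZE = 4096          # "bsize=" on the data line of xfs_info
BASIC_BLOCK_SIZE = 512        # ASSUMPTION: the warning counts 512-byte basic blocks

REPORTED_EOFS = 0x1d1c0bea0   # "EOFS" from the _xfs_buf_find warning
REPORTED_BLOCK = 0x874702fa0  # "block" from the _xfs_buf_find warning


def eofs_in_basic_blocks(data_blocks, fs_block_size,
                         basic_block_size=BASIC_BLOCK_SIZE):
    """Convert a data-section size in filesystem blocks to basic blocks."""
    return data_blocks * fs_block_size // basic_block_size


def overshoot_ratio(block, eofs):
    """How many times larger the requested block is than the end of the fs."""
    return block / eofs


if __name__ == "__main__":
    computed = eofs_in_basic_blocks(DATA_BLOCKS, FS_BLOCK_SIZE)
    print("computed EOFS:", hex(computed))
    print("matches reported EOFS:", computed == REPORTED_EOFS)
    print("requested block / EOFS: %.2f"
          % overshoot_ratio(REPORTED_BLOCK, REPORTED_EOFS))
```

With the values above, the computed end of filesystem equals the reported EOFS of 0x1d1c0bea0, which supports the assumption that the failed filesystem has the same geometry. The requested block lies more than four times beyond that end.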
==== Attempt to repair a filesystem (after a reboot) ====
root@ceph-store5:~# xfs_repair /dev/disk/
Phase 1 - find and verify superblock...
Phase 2 - using internal log
- zero log...
ERROR: The filesystem has valuable metadata changes in a log which needs to
be replayed. Mount the filesystem to replay the log, and unmount it before
re-running xfs_repair. If you are unable to mount the filesystem, then use
the -L option to destroy the log and attempt a repair.
Note that destroying the log may cause corruption -- please attempt a mount
of the filesystem before doing this.
root@ceph-store5:~# mount -o noatime,inode64 /dev/disk/
Killed
Dmesg output (dmesg -T):
[Fri Apr 29 10:18:45 2016] XFS (bcache2): Mounting V5 Filesystem
[Fri Apr 29 10:18:45 2016] XFS (bcache2): Starting recovery (logdev: internal)
[Fri Apr 29 10:18:49 2016] XFS (bcache2): _xfs_buf_find: Block out of range: block 0x874702fa0, EOFS 0x1d1c0bea0
[Fri Apr 29 10:18:49 2016] ------------[ cut here ]------------
[Fri Apr 29 10:18:49 2016] WARNING: CPU: 6 PID: 211620 at /build/
[Fri Apr 29 10:18:49 2016] Modules linked in: bridge openvswitch xfs ipmi_devintf dcdbas intel_rapl x86_pkg_
[Fri Apr 29 10:18:49 2016] CPU: 6 PID: 211620 Comm: mount Not tainted 4.2.0-35-generic #40~14.04.1-Ubuntu
[Fri Apr 29 10:18:49 2016] Hardware name: Dell Inc. PowerEdge R720xd/0HJK12, BIOS 2.4.3 07/09/2014
[Fri Apr 29 10:18:49 2016] 0000000000000000 ffff88018e11b8c8 ffffffff817bcbf8 0000000000000000
[Fri Apr 29 10:18:49 2016] ffffffffc0653300 ffff88018e11b908 ffffffff81079b5a ffff88018e11b928
[Fri Apr 29 10:18:49 2016] ffff880806ae8840 0000000000000008 0000000000000000 ffff88018e11ba18
[Fri Apr 29 10:18:49 2016] Call Trace:
[Fri Apr 29 10:18:49 2016] [<ffffffff817bc
[Fri Apr 29 10:18:49 2016] [<ffffffff81079
[Fri Apr 29 10:18:49 2016] [<ffffffff81079
[Fri Apr 29 10:18:49 2016] [<ffffffffc0612
[Fri Apr 29 10:18:49 2016] [<ffffffffc0612
[Fri Apr 29 10:18:49 2016] [<ffffffffc05d5
[Fri Apr 29 10:18:49 2016] [<ffffffffc063d
[Fri Apr 29 10:18:49 2016] [<ffffffffc05eb
[Fri Apr 29 10:18:49 2016] [<ffffffffc05d5
[Fri Apr 29 10:18:49 2016] [<ffffffff811d1
[Fri Apr 29 10:18:49 2016] [<ffffffffc062f
[Fri Apr 29 10:18:49 2016] [<ffffffff813b0
[Fri Apr 29 10:18:49 2016] [<ffffffffc0606
[Fri Apr 29 10:18:49 2016] [<ffffffffc05d6
[Fri Apr 29 10:18:49 2016] [<ffffffffc0637
[Fri Apr 29 10:18:49 2016] [<ffffffffc0639
[Fri Apr 29 10:18:49 2016] [<ffffffffc063c
[Fri Apr 29 10:18:49 2016] [<ffffffffc0631
[Fri Apr 29 10:18:49 2016] [<ffffffffc0628
[Fri Apr 29 10:18:49 2016] [<ffffffffc062b
[Fri Apr 29 10:18:49 2016] [<ffffffff811f1
[Fri Apr 29 10:18:49 2016] [<ffffffffc062b
[Fri Apr 29 10:18:49 2016] [<ffffffffc0629
[Fri Apr 29 10:18:49 2016] [<ffffffff811f2
[Fri Apr 29 10:18:49 2016] [<ffffffff8120d
[Fri Apr 29 10:18:49 2016] [<ffffffff8120f
[Fri Apr 29 10:18:49 2016] [<ffffffff8117d
[Fri Apr 29 10:18:49 2016] [<ffffffff8120f
[Fri Apr 29 10:18:49 2016] [<ffffffff81210
[Fri Apr 29 10:18:49 2016] [<ffffffff817c4
[Fri Apr 29 10:18:49 2016] ---[ end trace ce3e7a80324237e0 ]---
[Fri Apr 29 10:18:49 2016] XFS (bcache2): _xfs_buf_find: Block out of range: block 0x874702fa0, EOFS 0x1d1c0bea0
[Fri Apr 29 10:18:49 2016] ------------[ cut here ]------------
[Fri Apr 29 10:18:49 2016] WARNING: CPU: 6 PID: 211620 at /build/
[Fri Apr 29 10:18:49 2016] Modules linked in: bridge openvswitch xfs ipmi_devintf dcdbas intel_rapl x86_pkg_
[Fri Apr 29 10:18:49 2016] CPU: 6 PID: 211620 Comm: mount Tainted: G W 4.2.0-35-generic #40~14.04.1-Ubuntu
[Fri Apr 29 10:18:49 2016] Hardware name: Dell Inc. PowerEdge R720xd/0HJK12, BIOS 2.4.3 07/09/2014
[Fri Apr 29 10:18:49 2016] 0000000000000000 ffff88018e11b8c8 ffffffff817bcbf8 0000000000000000
[Fri Apr 29 10:18:49 2016] ffffffffc0653300 ffff88018e11b908 ffffffff81079b5a ffff88018e11b928
[Fri Apr 29 10:18:49 2016] ffff880806ae8840 0000000000000008 0000000000000000 ffff88018e11ba18
[Fri Apr 29 10:18:49 2016] Call Trace:
[Fri Apr 29 10:18:49 2016] [<ffffffff817bc
[Fri Apr 29 10:18:49 2016] [<ffffffff81079
[Fri Apr 29 10:18:49 2016] [<ffffffff81079
[Fri Apr 29 10:18:49 2016] [<ffffffffc0612
[Fri Apr 29 10:18:49 2016] [<ffffffffc0612
[Fri Apr 29 10:18:49 2016] [<ffffffffc063d
[Fri Apr 29 10:18:49 2016] [<ffffffffc05eb
[Fri Apr 29 10:18:49 2016] [<ffffffffc05d5
[Fri Apr 29 10:18:49 2016] [<ffffffff811d1
[Fri Apr 29 10:18:49 2016] [<ffffffffc062f
[Fri Apr 29 10:18:49 2016] [<ffffffff813b0
[Fri Apr 29 10:18:49 2016] [<ffffffffc0606
[Fri Apr 29 10:18:49 2016] [<ffffffffc05d6
[Fri Apr 29 10:18:49 2016] [<ffffffffc0637
[Fri Apr 29 10:18:49 2016] [<ffffffffc0639
[Fri Apr 29 10:18:49 2016] [<ffffffffc063c
[Fri Apr 29 10:18:49 2016] [<ffffffffc0631
[Fri Apr 29 10:18:49 2016] [<ffffffffc0628
[Fri Apr 29 10:18:49 2016] [<ffffffffc062b
[Fri Apr 29 10:18:49 2016] [<ffffffff811f1
[Fri Apr 29 10:18:49 2016] [<ffffffffc062b
[Fri Apr 29 10:18:49 2016] [<ffffffffc0629
[Fri Apr 29 10:18:49 2016] [<ffffffff811f2
[Fri Apr 29 10:18:49 2016] [<ffffffff8120d
[Fri Apr 29 10:18:49 2016] [<ffffffff8120f
[Fri Apr 29 10:18:49 2016] [<ffffffff8117d
[Fri Apr 29 10:18:49 2016] [<ffffffff8120f
[Fri Apr 29 10:18:49 2016] [<ffffffff81210
[Fri Apr 29 10:18:49 2016] [<ffffffff817c4
[Fri Apr 29 10:18:49 2016] ---[ end trace ce3e7a80324237e1 ]---
[Fri Apr 29 10:18:49 2016] BUG: unable to handle kernel NULL pointer dereference at 00000000000000f8
[Fri Apr 29 10:18:49 2016] IP: [<ffffffffc063e
[Fri Apr 29 10:18:49 2016] PGD 564eec067 PUD 11408b067 PMD 0
[Fri Apr 29 10:18:49 2016] Oops: 0000 [#1] SMP
[Fri Apr 29 10:18:49 2016] Modules linked in: bridge openvswitch xfs ipmi_devintf dcdbas intel_rapl x86_pkg_
[Fri Apr 29 10:18:49 2016] CPU: 6 PID: 211620 Comm: mount Tainted: G W 4.2.0-35-generic #40~14.04.1-Ubuntu
[Fri Apr 29 10:18:49 2016] Hardware name: Dell Inc. PowerEdge R720xd/0HJK12, BIOS 2.4.3 07/09/2014
[Fri Apr 29 10:18:49 2016] task: ffff88080967ee00 ti: ffff88018e118000 task.ti: ffff88018e118000
[Fri Apr 29 10:18:49 2016] RIP: 0010:[<
[Fri Apr 29 10:18:49 2016] RSP: 0018:ffff88018e
[Fri Apr 29 10:18:49 2016] RAX: 0000000000000000 RBX: ffff88018e11bb18 RCX: 00000000003d2d52
[Fri Apr 29 10:18:49 2016] RDX: 00000000003d2d51 RSI: 0000000000000000 RDI: ffff880afa4e4828
[Fri Apr 29 10:18:49 2016] RBP: ffff88018e11ba28 R08: 000000000001a4f0 R09: ffff88080f8da4f0
[Fri Apr 29 10:18:49 2016] R10: ffffffffc061164a R11: ffffea001281d600 R12: ffff880afa4e4828
[Fri Apr 29 10:18:49 2016] R13: ffff880afa4e4828 R14: 0000000000000000 R15: 0000000000000008
[Fri Apr 29 10:18:49 2016] FS: 00007f42d515f88
[Fri Apr 29 10:18:49 2016] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[Fri Apr 29 10:18:49 2016] CR2: 00000000000000f8 CR3: 0000000139a06000 CR4: 00000000000406e0
[Fri Apr 29 10:18:49 2016] Stack:
[Fri Apr 29 10:18:49 2016] ffff88018e11bb18 ffff880afa4e4828 ffff88075d649ec0 ffff880164bc8800
[Fri Apr 29 10:18:49 2016] ffff88018e11bb08 ffffffffc05d5d26 ffff88015e77eb80 0000000200000000
[Fri Apr 29 10:18:49 2016] ffff88018e11ba98 ffffffff811d1075 ffff88015e77eb80 0000000000000000
[Fri Apr 29 10:18:49 2016] Call Trace:
[Fri Apr 29 10:18:49 2016] [<ffffffffc05d5
[Fri Apr 29 10:18:49 2016] [<ffffffff811d1
[Fri Apr 29 10:18:49 2016] [<ffffffffc062f
[Fri Apr 29 10:18:49 2016] [<ffffffff813b0
[Fri Apr 29 10:18:49 2016] [<ffffffffc0606
[Fri Apr 29 10:18:49 2016] [<ffffffffc05d6
[Fri Apr 29 10:18:49 2016] [<ffffffffc0637
[Fri Apr 29 10:18:49 2016] [<ffffffffc0639
[Fri Apr 29 10:18:49 2016] [<ffffffffc063c
[Fri Apr 29 10:18:49 2016] [<ffffffffc0631
[Fri Apr 29 10:18:49 2016] [<ffffffffc0628
[Fri Apr 29 10:18:49 2016] [<ffffffffc062b
[Fri Apr 29 10:18:49 2016] [<ffffffff811f1
[Fri Apr 29 10:18:49 2016] [<ffffffffc062b
[Fri Apr 29 10:18:49 2016] [<ffffffffc0629
[Fri Apr 29 10:18:49 2016] [<ffffffff811f2
[Fri Apr 29 10:18:49 2016] [<ffffffff8120d
[Fri Apr 29 10:18:49 2016] [<ffffffff8120f
[Fri Apr 29 10:18:49 2016] [<ffffffff8117d
[Fri Apr 29 10:18:50 2016] [<ffffffff8120f
[Fri Apr 29 10:18:50 2016] [<ffffffff81210
[Fri Apr 29 10:18:50 2016] [<ffffffff817c4
[Fri Apr 29 10:18:50 2016] Code: 13 48 85 d2 75 eb e9 5f ff ff ff 66 2e 0f 1f 84 00 00 00 00 00 66 66 66 66 90 55 48 89 e5 41 56 49 89 f6 41 55 49 89 fd 41 54 53 <4c> 8b a6 f8 00 00 00 66 66 66 66 90 41 f6 44 24 78 04 74 4f 5b
[Fri Apr 29 10:18:50 2016] RIP [<ffffffffc063e
[Fri Apr 29 10:18:50 2016] RSP <ffff88018e11ba08>
[Fri Apr 29 10:18:50 2016] CR2: 00000000000000f8
[Fri Apr 29 10:18:50 2016] ---[ end trace ce3e7a80324237e2 ]---
root@ceph-store5:~#
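The NULL pointer dereference at 00000000000000f8 can be tied to the register dump by decoding the instruction marked with angle brackets in the Code: line. The following is a minimal sketch that handles only the one instruction form that appears here (a 64-bit mov load from a base register plus a 32-bit displacement). The function name decode_faulting_mov and the register table are illustrative and do not come from any kernel tool.

```python
# "Code:" line copied from the oops above; <..> marks the faulting byte.
CODE_LINE = (
    "Code: 13 48 85 d2 75 eb e9 5f ff ff ff 66 2e 0f 1f 84 00 00 00 00 00 "
    "66 66 66 66 90 55 48 89 e5 41 56 49 89 f6 41 55 49 89 fd 41 54 53 "
    "<4c> 8b a6 f8 00 00 00 66 66 66 66 90 41 f6 44 24 78 04 74 4f 5b"
)

# x86-64 general-purpose registers in encoding order.
REG64 = ["rax", "rcx", "rdx", "rbx", "rsp", "rbp", "rsi", "rdi",
         "r8", "r9", "r10", "r11", "r12", "r13", "r14", "r15"]


def decode_faulting_mov(code_line):
    """Decode 'REX.W 8B /r' with a disp32 operand at the <..> marker.

    Returns (destination register, base register, displacement).
    Raises ValueError for any other instruction form.
    """
    tokens = code_line.split()
    start = next(i for i, t in enumerate(tokens) if t.startswith("<"))
    raw = bytes(int(t.strip("<>"), 16) for t in tokens[start:start + 7])
    rex, opcode, modrm = raw[0], raw[1], raw[2]
    is_rex_w = (rex & 0xF8) == 0x48      # REX prefix with W=1 (64-bit operand)
    is_disp32 = (modrm >> 6) == 2        # mod=10: base register + disp32
    has_sib = (modrm & 7) == 4           # rm=100 would mean a SIB byte follows
    if not is_rex_w or opcode != 0x8B or not is_disp32 or has_sib:
        raise ValueError("instruction form not handled by this sketch")
    dest = ((modrm >> 3) & 7) | ((rex & 0x4) << 1)   # REX.R extends reg
    base = (modrm & 7) | ((rex & 0x1) << 3)          # REX.B extends rm
    disp = int.from_bytes(raw[3:7], "little", signed=True)
    return REG64[dest], REG64[base], disp


if __name__ == "__main__":
    dest, base, disp = decode_faulting_mov(CODE_LINE)
    print("mov %s(%%%s), %%%s" % (hex(disp), base, dest))
```

For this oops the decode gives a load into r12 from rsi + 0xf8. The register dump shows RSI: 0000000000000000, so the faulting address is 0xf8, which matches CR2. That would be consistent with the function having been handed a NULL pointer in rsi, possibly because the buffer lookup failed after the "Block out of range" warning, although the truncated call trace does not confirm which function this is.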
Thanks to the magic of Ceph, I've so far managed to avoid data-loss and serious downtime, but I suspect there's a timebomb here waiting for anyone who upgrades from an older kernel to 4.2.0+.
Please let me know if there's any further information I might be able to provide to track this down.
---
AlsaDevices:
total 0
crw-rw---- 1 root audio 116, 1 Apr 29 01:29 seq
crw-rw---- 1 root audio 116, 33 Apr 29 01:29 timer
AplayDevices: Error: [Errno 2] No such file or directory
ApportVersion: 2.14.1-0ubuntu3.19
Architecture: amd64
ArecordDevices: Error: [Errno 2] No such file or directory
AudioDevicesInUse: Error: command ['fuser', '-v', '/dev/snd/seq', '/dev/snd/timer'] failed with exit code 1:
CRDA: Error: [Errno 2] No such file or directory
DistroRelease: Ubuntu 14.04
MachineType: Dell Inc. PowerEdge R720xd
Package: linux (not installed)
PciMultimedia:
ProcEnviron:
TERM=xterm-
PATH=(custom, no user)
XDG_RUNTIME_
LANG=en_US.UTF-8
SHELL=/bin/bash
ProcFB:
ProcKernelCmdLine: BOOT_IMAGE=
ProcVersionSign
RelatedPackageV
linux-
linux-
linux-firmware 1.127.22
RfKill: Error: [Errno 2] No such file or directory
Tags: trusty uec-images
Uname: Linux 4.2.0-35-generic x86_64
UpgradeStatus: No upgrade log present (probably fresh install)
UserGroups:
_MarkForUpload: True
dmi.bios.date: 07/09/2014
dmi.bios.vendor: Dell Inc.
dmi.bios.version: 2.4.3
dmi.board.name: 0HJK12
dmi.board.vendor: Dell Inc.
dmi.board.version: A03
dmi.chassis.
dmi.chassis.type: 23
dmi.chassis.vendor: Dell Inc.
dmi.modalias: dmi:bvnDellInc.
dmi.product.name: PowerEdge R720xd
dmi.sys.vendor: Dell Inc.
affects: | mesa (Ubuntu) → linux (Ubuntu) |
This bug is missing log files that will aid in diagnosing the problem. From a terminal window please run:
apport-collect 1576599
and then change the status of the bug to 'Confirmed'.
If, due to the nature of the issue you have encountered, you are unable to run this command, please add a comment stating that fact and change the bug status to 'Confirmed'.
This change has been made by an automated script, maintained by the Ubuntu Kernel Team.