We had a similar incident, but our upgrade path was different. We're also using Ceph (Hammer, 0.94.7).
We updated from 4.1.16 (Gentoo) to 4.4.27 (Gentoo). After the update, the servers came up fine. However, after around 30 minutes, multiple OSDs on multiple hosts hit the same problem:
> Nov 18 22:42:36 cartman07 kernel: [ 4450.796463] XFS (dm-8): _xfs_buf_find: Block out of range: block 0x8cbb5c7f8, EOFS 0xe8cfc000
> Nov 18 22:42:36 cartman07 kernel: [ 4450.796469] ------------[ cut here ]------------
> Nov 18 22:42:36 cartman07 kernel: [ 4450.796477] WARNING: CPU: 3 PID: 5954 at fs/xfs/xfs_buf.c:472 _xfs_buf_find+0x2cc/0x330()
> Nov 18 22:42:36 cartman07 kernel: [ 4450.796478] Modules linked in: deadline_iosched nf_log_ipv4 nf_conntrack_ipv4 nf_defrag_ipv4 nf_log_ipv6 nf_log_common xt_LOG xt_limit nf_conntrack_ipv6 nf_defrag_ipv6 xt_conntrack nf_conntrack_ftp nf_conntrack x86_pkg_temp_thermal kvm_intel kvm irqbypass crc32c_intel nvme ixgbe acpi_cpufreq mdio dm_zero dm_thin_pool dm_persistent_data dm_bio_prison dm_round_robin dm_multipath xts aesni_intel glue_helper lrw ablk_helper cryptd aes_x86_64 fuse dm_snapshot dm_bufio dm_crypt dm_mirror dm_region_hash dm_log
> Nov 18 22:42:36 cartman07 kernel: [ 4450.796517] CPU: 3 PID: 5954 Comm: ceph-osd Not tainted 4.4.27-gentoo #1
> Nov 18 22:42:36 cartman07 kernel: [ 4450.796519] Hardware name: Thomas-Krenn.AG X9DR3-F/X9DR3-F, BIOS 3.0a 07/31/2013
> Nov 18 22:42:36 cartman07 kernel: [ 4450.796521] 0000000000000000 ffff88103ca1b610 ffffffff813a67b8 0000000000000000
> Nov 18 22:42:36 cartman07 kernel: [ 4450.796524] ffffffff81c94276 ffff88103ca1b648 ffffffff81056286 ffff88085a04ba40
> Nov 18 22:42:36 cartman07 kernel: [ 4450.796527] 0000000000000008 ffff88085a04ba40 0000000000000000 00000008cbb5c7f8
> Nov 18 22:42:36 cartman07 kernel: [ 4450.796530] Call Trace:
> Nov 18 22:42:36 cartman07 kernel: [ 4450.796538] [<ffffffff813a67b8>] dump_stack+0x4d/0x65
> Nov 18 22:42:36 cartman07 kernel: [ 4450.796544] [<ffffffff81056286>] warn_slowpath_common+0x86/0xc0
> Nov 18 22:42:36 cartman07 kernel: [ 4450.796548] [<ffffffff8105637a>] warn_slowpath_null+0x1a/0x20
> Nov 18 22:42:36 cartman07 kernel: [ 4450.796550] [<ffffffff812ff87c>] _xfs_buf_find+0x2cc/0x330
> Nov 18 22:42:36 cartman07 kernel: [ 4450.796553] [<ffffffff812ff90a>] xfs_buf_get_map+0x2a/0x280
> Nov 18 22:42:36 cartman07 kernel: [ 4450.796558] [<ffffffff8132bdc6>] xfs_trans_get_buf_map+0x106/0x190
> Nov 18 22:42:36 cartman07 kernel: [ 4450.796563] [<ffffffff812d8def>] xfs_btree_get_bufs+0x4f/0x60
> Nov 18 22:42:36 cartman07 kernel: [ 4450.796567] [<ffffffff812c398f>] xfs_alloc_fix_freelist+0x20f/0x3c0
> Nov 18 22:42:36 cartman07 kernel: [ 4450.796571] [<ffffffff812c2b14>] ? xfs_alloc_update_counters.isra.11+0x44/0x50
> Nov 18 22:42:36 cartman07 kernel: [ 4450.796573] [<ffffffff813abbdd>] ? radix_tree_lookup+0xd/0x10
> Nov 18 22:42:36 cartman07 kernel: [ 4450.796578] [<ffffffff812f4f5a>] ? xfs_perag_get+0x2a/0xb0
> Nov 18 22:42:36 cartman07 kernel: [ 4450.796580] [<ffffffff813abbdd>] ? radix_tree_lookup+0xd/0x10
> Nov 18 22:42:36 cartman07 kernel: [ 4450.796583] [<ffffffff812f4f5a>] ? xfs_perag_get+0x2a/0xb0
> Nov 18 22:42:36 cartman07 kernel: [ 4450.796586] [<ffffffff812c3d54>] xfs_alloc_vextent+0x214/0x690
> Nov 18 22:42:36 cartman07 kernel: [ 4450.796590] [<ffffffff812d425f>] xfs_bmap_btalloc+0x39f/0x6f0
> Nov 18 22:42:36 cartman07 kernel: [ 4450.796594] [<ffffffff812d45be>] xfs_bmap_alloc+0xe/0x10
> Nov 18 22:42:36 cartman07 kernel: [ 4450.796598] [<ffffffff812d4f12>] xfs_bmapi_write+0x452/0x9b0
> Nov 18 22:42:36 cartman07 kernel: [ 4450.796602] [<ffffffff8130cf64>] xfs_iomap_write_allocate+0x154/0x350
> Nov 18 22:42:36 cartman07 kernel: [ 4450.796605] [<ffffffff812f81f2>] xfs_map_blocks+0x152/0x220
> Nov 18 22:42:36 cartman07 kernel: [ 4450.796607] [<ffffffff812f8f59>] xfs_vm_writepage+0x189/0x610
> Nov 18 22:42:36 cartman07 kernel: [ 4450.796613] [<ffffffff811354e3>] __writepage+0x13/0x30
> Nov 18 22:42:36 cartman07 kernel: [ 4450.796616] [<ffffffff81137019>] write_cache_pages+0x219/0x4b0
> Nov 18 22:42:36 cartman07 kernel: [ 4450.796619] [<ffffffff811354d0>] ? domain_dirty_limits+0x150/0x150
> Nov 18 22:42:36 cartman07 kernel: [ 4450.796623] [<ffffffff811372f3>] generic_writepages+0x43/0x60
> Nov 18 22:42:36 cartman07 kernel: [ 4450.796626] [<ffffffff812f878d>] xfs_vm_writepages+0x3d/0x50
> Nov 18 22:42:36 cartman07 kernel: [ 4450.796629] [<ffffffff81137dfe>] do_writepages+0x1e/0x30
> Nov 18 22:42:36 cartman07 kernel: [ 4450.796634] [<ffffffff8112d1e1>] __filemap_fdatawrite_range+0x71/0x90
> Nov 18 22:42:36 cartman07 kernel: [ 4450.796637] [<ffffffff8112d2fa>] filemap_write_and_wait_range+0x2a/0x70
> Nov 18 22:42:36 cartman07 kernel: [ 4450.796641] [<ffffffff813041f4>] xfs_file_fsync+0x54/0x1c0
> Nov 18 22:42:36 cartman07 kernel: [ 4450.796646] [<ffffffff811b6edb>] vfs_fsync_range+0x3b/0xa0
> Nov 18 22:42:36 cartman07 kernel: [ 4450.796652] [<ffffffff810c75f6>] ? SyS_futex+0x76/0x160
> Nov 18 22:42:36 cartman07 kernel: [ 4450.796655] [<ffffffff811b6f9d>] do_fsync+0x3d/0x70
> Nov 18 22:42:36 cartman07 kernel: [ 4450.796660] [<ffffffff81002ae2>] ? syscall_return_slowpath+0x92/0x100
> Nov 18 22:42:36 cartman07 kernel: [ 4450.796664] [<ffffffff811b7233>] SyS_fdatasync+0x13/0x20
> Nov 18 22:42:36 cartman07 kernel: [ 4450.796668] [<ffffffff81800597>] entry_SYSCALL_64_fastpath+0x12/0x6a
> Nov 18 22:42:36 cartman07 kernel: [ 4450.796693] ---[ end trace 0f0a5e131ff32586 ]---
> Nov 18 22:42:36 cartman07 kernel: [ 4450.796707] XFS (dm-8): _xfs_buf_find: Block out of range: block 0x8cbb5c7f8, EOFS 0xe8cfc000
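A minimal sketch for pulling the two values out of the "Block out of range" message and comparing them. It assumes both are hex counts in the same unit (512-byte basic blocks as far as I can tell, but treat that as an assumption). `parse_out_of_range` and the regex are made up for illustration and are not part of any XFS tooling.

```python
import re

# Matches the XFS message "Block out of range: block 0x..., EOFS 0x..."
PATTERN = re.compile(
    r"Block out of range: block (0x[0-9a-fA-F]+), EOFS (0x[0-9a-fA-F]+)"
)


def parse_out_of_range(line):
    """Return (block, eofs) as ints, or None if the line does not match."""
    m = PATTERN.search(line)
    if m is None:
        return None
    return int(m.group(1), 16), int(m.group(2), 16)


# Message text copied from the log above.
line = ("XFS (dm-8): _xfs_buf_find: Block out of range: "
        "block 0x8cbb5c7f8, EOFS 0xe8cfc000")
block, eofs = parse_out_of_range(line)
print(f"block={block} eofs={eofs} past_end={block >= eofs} "
      f"ratio={block / eofs:.1f}")
```

For the values in this log the requested block is well past the end of the filesystem, not just off by a few blocks, which suggests a bogus block number rather than a boundary miscalculation.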
We also had the same problem of not being able to interact with the affected filesystems *at all* after the kernel triggered this. Not all filesystems showed this symptom.
We got out of this by:
- rebooting to 4.1
- running xfs_repair -L
- rebooting to 4.4
- running xfs_repair again (always found an off-by-one in one AGL, don't have the log, sorry)
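The steps above, written out as a dry-run plan. This is a sketch only: it prints the commands and runs nothing. `recovery_plan` is a made-up helper and the device path is a placeholder, since the post does not say which devices were repaired. Note that xfs_repair has to be run on an unmounted filesystem, and that -L zeroes the log, which can throw away recent metadata updates.

```python
def recovery_plan(device, old_kernel="4.1", new_kernel="4.4"):
    """Return the recovery steps as (description, argv-or-None) tuples.

    Nothing is executed here; reboot steps carry no command on purpose.
    """
    return [
        (f"reboot into kernel {old_kernel}", None),
        # -L forces log zeroing before the repair (destructive, last resort).
        ("zero the log and repair", ["xfs_repair", "-L", device]),
        (f"reboot into kernel {new_kernel}", None),
        ("repair again", ["xfs_repair", device]),
    ]


# "/dev/dm-8" is a placeholder taken from the "dm-8" name in the log;
# substitute the device that actually backs the affected OSD.
for description, argv in recovery_plan("/dev/dm-8"):
    print(description if argv is None else f"{description}: {' '.join(argv)}")
```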
We ran tests after that and the systems appear stable now. I have reviewed all commits from the kernel changelog after 4.1.16 that are marked with 'xfs: ' but have not found any commit that triggers my spidey senses. :/
Thought I'd leave this here for future reference. Also, when this happened, we did run into consistency issues with Ceph, which you can see here: http://tracker.ceph.com/issues/13837