Comment 16 for bug 1576599

Christian Theune (ctheune) wrote :

We had a similar incident, but our upgrade path was different. We're also using Ceph (Hammer, 0.94.7).

We updated the kernel from 4.1.16 (Gentoo) to 4.4.27 (Gentoo). After we updated multiple servers, they came up fine. However, after around 30 minutes, multiple OSDs on multiple hosts exhibited the same problem:

> Nov 18 22:42:36 cartman07 kernel: [ 4450.796463] XFS (dm-8): _xfs_buf_find:
> Block out of range: block 0x8cbb5c7f8, EOFS 0xe8cfc000
> Nov 18 22:42:36 cartman07 kernel: [ 4450.796469] ------------[ cut here
> ]------------
> Nov 18 22:42:36 cartman07 kernel: [ 4450.796477] WARNING: CPU: 3 PID: 5954
> at fs/xfs/xfs_buf.c:472 _xfs_buf_find+0x2cc/0x330()
> Nov 18 22:42:36 cartman07 kernel: [ 4450.796478] Modules linked in:
> deadline_iosched nf_log_ipv4 nf_conntrack_ipv4 nf_defrag_ipv4 nf_log_ipv6
> nf_log_common xt_LOG xt_limit nf_conntrack_ipv6 nf_defrag_ipv6 xt_conntrack
> nf_conntrack_ftp nf_conntrack x86_pkg_temp_thermal kvm_intel kvm irqbypass
> crc32c_intel nvme ixgbe acpi_cpufreq mdio dm_zero dm_thin_pool
> dm_persistent_data dm_bio_prison dm_round_robin dm_multipath xts aesni_intel
> glue_helper lrw ablk_helper cryptd aes_x86_64 fuse dm_snapshot dm_bufio
> dm_crypt dm_mirror dm_region_hash dm_log
> Nov 18 22:42:36 cartman07 kernel: [ 4450.796517] CPU: 3 PID: 5954 Comm:
> ceph-osd Not tainted 4.4.27-gentoo #1
> Nov 18 22:42:36 cartman07 kernel: [ 4450.796519] Hardware name:
> Thomas-Krenn.AG X9DR3-F/X9DR3-F, BIOS 3.0a 07/31/2013
> Nov 18 22:42:36 cartman07 kernel: [ 4450.796521] 0000000000000000
> ffff88103ca1b610 ffffffff813a67b8 0000000000000000
> Nov 18 22:42:36 cartman07 kernel: [ 4450.796524] ffffffff81c94276
> ffff88103ca1b648 ffffffff81056286 ffff88085a04ba40
> Nov 18 22:42:36 cartman07 kernel: [ 4450.796527] 0000000000000008
> ffff88085a04ba40 0000000000000000 00000008cbb5c7f8
> Nov 18 22:42:36 cartman07 kernel: [ 4450.796530] Call Trace:
> Nov 18 22:42:36 cartman07 kernel: [ 4450.796538] [<ffffffff813a67b8>]
> dump_stack+0x4d/0x65
> Nov 18 22:42:36 cartman07 kernel: [ 4450.796544] [<ffffffff81056286>]
> warn_slowpath_common+0x86/0xc0
> Nov 18 22:42:36 cartman07 kernel: [ 4450.796548] [<ffffffff8105637a>]
> warn_slowpath_null+0x1a/0x20
> Nov 18 22:42:36 cartman07 kernel: [ 4450.796550] [<ffffffff812ff87c>]
> _xfs_buf_find+0x2cc/0x330
> Nov 18 22:42:36 cartman07 kernel: [ 4450.796553] [<ffffffff812ff90a>]
> xfs_buf_get_map+0x2a/0x280
> Nov 18 22:42:36 cartman07 kernel: [ 4450.796558] [<ffffffff8132bdc6>]
> xfs_trans_get_buf_map+0x106/0x190
> Nov 18 22:42:36 cartman07 kernel: [ 4450.796563] [<ffffffff812d8def>]
> xfs_btree_get_bufs+0x4f/0x60
> Nov 18 22:42:36 cartman07 kernel: [ 4450.796567] [<ffffffff812c398f>]
> xfs_alloc_fix_freelist+0x20f/0x3c0
> Nov 18 22:42:36 cartman07 kernel: [ 4450.796571] [<ffffffff812c2b14>] ?
> xfs_alloc_update_counters.isra.11+0x44/0x50
> Nov 18 22:42:36 cartman07 kernel: [ 4450.796573] [<ffffffff813abbdd>] ?
> radix_tree_lookup+0xd/0x10
> Nov 18 22:42:36 cartman07 kernel: [ 4450.796578] [<ffffffff812f4f5a>] ?
> xfs_perag_get+0x2a/0xb0
> Nov 18 22:42:36 cartman07 kernel: [ 4450.796580] [<ffffffff813abbdd>] ?
> radix_tree_lookup+0xd/0x10
> Nov 18 22:42:36 cartman07 kernel: [ 4450.796583] [<ffffffff812f4f5a>] ?
> xfs_perag_get+0x2a/0xb0
> Nov 18 22:42:36 cartman07 kernel: [ 4450.796586] [<ffffffff812c3d54>]
> xfs_alloc_vextent+0x214/0x690
> Nov 18 22:42:36 cartman07 kernel: [ 4450.796590] [<ffffffff812d425f>]
> xfs_bmap_btalloc+0x39f/0x6f0
> Nov 18 22:42:36 cartman07 kernel: [ 4450.796594] [<ffffffff812d45be>]
> xfs_bmap_alloc+0xe/0x10
> Nov 18 22:42:36 cartman07 kernel: [ 4450.796598] [<ffffffff812d4f12>]
> xfs_bmapi_write+0x452/0x9b0
> Nov 18 22:42:36 cartman07 kernel: [ 4450.796602] [<ffffffff8130cf64>]
> xfs_iomap_write_allocate+0x154/0x350
> Nov 18 22:42:36 cartman07 kernel: [ 4450.796605] [<ffffffff812f81f2>]
> xfs_map_blocks+0x152/0x220
> Nov 18 22:42:36 cartman07 kernel: [ 4450.796607] [<ffffffff812f8f59>]
> xfs_vm_writepage+0x189/0x610
> Nov 18 22:42:36 cartman07 kernel: [ 4450.796613] [<ffffffff811354e3>]
> __writepage+0x13/0x30
> Nov 18 22:42:36 cartman07 kernel: [ 4450.796616] [<ffffffff81137019>]
> write_cache_pages+0x219/0x4b0
> Nov 18 22:42:36 cartman07 kernel: [ 4450.796619] [<ffffffff811354d0>] ?
> domain_dirty_limits+0x150/0x150
> Nov 18 22:42:36 cartman07 kernel: [ 4450.796623] [<ffffffff811372f3>]
> generic_writepages+0x43/0x60
> Nov 18 22:42:36 cartman07 kernel: [ 4450.796626] [<ffffffff812f878d>]
> xfs_vm_writepages+0x3d/0x50
> Nov 18 22:42:36 cartman07 kernel: [ 4450.796629] [<ffffffff81137dfe>]
> do_writepages+0x1e/0x30
> Nov 18 22:42:36 cartman07 kernel: [ 4450.796634] [<ffffffff8112d1e1>]
> __filemap_fdatawrite_range+0x71/0x90
> Nov 18 22:42:36 cartman07 kernel: [ 4450.796637] [<ffffffff8112d2fa>]
> filemap_write_and_wait_range+0x2a/0x70
> Nov 18 22:42:36 cartman07 kernel: [ 4450.796641] [<ffffffff813041f4>]
> xfs_file_fsync+0x54/0x1c0
> Nov 18 22:42:36 cartman07 kernel: [ 4450.796646] [<ffffffff811b6edb>]
> vfs_fsync_range+0x3b/0xa0
> Nov 18 22:42:36 cartman07 kernel: [ 4450.796652] [<ffffffff810c75f6>] ?
> SyS_futex+0x76/0x160
> Nov 18 22:42:36 cartman07 kernel: [ 4450.796655] [<ffffffff811b6f9d>]
> do_fsync+0x3d/0x70
> Nov 18 22:42:36 cartman07 kernel: [ 4450.796660] [<ffffffff81002ae2>] ?
> syscall_return_slowpath+0x92/0x100
> Nov 18 22:42:36 cartman07 kernel: [ 4450.796664] [<ffffffff811b7233>]
> SyS_fdatasync+0x13/0x20
> Nov 18 22:42:36 cartman07 kernel: [ 4450.796668] [<ffffffff81800597>]
> entry_SYSCALL_64_fastpath+0x12/0x6a
> Nov 18 22:42:36 cartman07 kernel: [ 4450.796693] ---[ end trace
> 0f0a5e131ff32586 ]---
> Nov 18 22:42:36 cartman07 kernel: [ 4450.796707] XFS (dm-8): _xfs_buf_find:
> Block out of range: block 0x8cbb5c7f8, EOFS 0xe8cfc000
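To put the two numbers in the alert into perspective, here is a minimal Python sketch that parses the message and compares the requested block with EOFS. It assumes both values are counted in 512-byte basic blocks, which is my understanding of how the check in `_xfs_buf_find` works; the regular expression and the helper name `parse_alert` are made up for illustration.

```python
import re

# Pattern for the XFS alert text as it appears in the log above.
ALERT_RE = re.compile(
    r"XFS \((?P<dev>[^)]+)\): _xfs_buf_find:\s+Block out of range: "
    r"block 0x(?P<block>[0-9a-fA-F]+), EOFS 0x(?P<eofs>[0-9a-fA-F]+)"
)

SECTOR_BYTES = 512  # assumption: both values are in 512-byte basic blocks


def parse_alert(text):
    """Return (device, block, eofs) from an alert line, or None if absent."""
    m = ALERT_RE.search(text)
    if m is None:
        return None
    return m.group("dev"), int(m.group("block"), 16), int(m.group("eofs"), 16)


# The alert from the trace, with the mail-style line wrap joined back up.
line = ("XFS (dm-8): _xfs_buf_find: "
        "Block out of range: block 0x8cbb5c7f8, EOFS 0xe8cfc000")

dev, block, eofs = parse_alert(line)
print(dev, "out of range:", block >= eofs)
print("filesystem size (TiB): %.2f" % (eofs * SECTOR_BYTES / 2**40))
print("requested offset (TiB): %.2f" % (block * SECTOR_BYTES / 2**40))
print("ratio block/EOFS: %.1f" % (block / eofs))
```

Under that assumption the requested block lies several times past the end of the filesystem, so this does not look like a mere off-by-one at the device boundary.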

We also had the same problem: we could not interact with the affected filesystems *at all* after the kernel triggered this. Not all filesystems showed this symptom.

We got out of this by:

- rebooting to 4.1
- running xfs_repair -L
- rebooting to 4.4
- running xfs_repair (it always found an off-by-one in one AGL; we don't have the log, sorry)
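
For reference, the steps above can be written down as a small Python sketch that only prints the sequence for one device and runs nothing. The device path `/dev/dm-8` is an assumption based on the device name in the log, and `repair_plan` is a hypothetical helper, not something we actually used.

```python
import shlex


def repair_plan(device):
    """Return the ordered (note, command) steps from the list above."""
    dev = shlex.quote(device)
    return [
        ("boot kernel 4.1", None),
        # -L zeroes the log, so metadata updates still in it are discarded.
        ("zero the log and repair", "xfs_repair -L " + dev),
        ("boot kernel 4.4", None),
        ("repair again", "xfs_repair " + dev),
    ]


# Assumed device path; the filesystem has to be unmounted for xfs_repair.
for note, cmd in repair_plan("/dev/dm-8"):
    print(note if cmd is None else "%s: %s" % (note, cmd))
```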

We ran tests after that and the systems appear stable now. I have reviewed all commits in the kernel changelog after 4.1.16 that are marked with 'xfs: ' but have not found any commit that triggers my spidey senses. :/

Thought I'd leave this here for future reference. Also, when this happened, we did run into consistency issues with Ceph, which you can see here: http://tracker.ceph.com/issues/13837