btrfs oops on current 3.13

Bug #1384711 reported by Stéphane Graber
This bug affects 4 people
Affects: linux (Ubuntu)
Status: Fix Released
Importance: High
Assigned to: Colin Ian King

Bug Description

I've recently been getting a few kernel panics which I've tracked down to having panic_on_oops set to 1 and which appear to be related to this oops:

[ 2182.341680] general protection fault: 0000 [#1] SMP
[ 2182.341702] Modules linked in: xt_CHECKSUM esp6 xfrm6_mode_transport ipcomp6 xfrm6_tunnel tunnel6 authenc xfrm4_mode_transport xfrm6_mode_tunnel veth xfrm4_mode_tunnel deflate xfrm_user xfrm4_tunnel ipcomp xfrm_ipcomp esp4 ah4 af_key xfrm_algo zram(C) xt_TCPMSS ip6table_mangle iptable_mangle xt_LOG ip6t_REJECT ipt_REJECT xt_nat xt_tcpudp xt_mark nf_conntrack_ipv6 nf_defrag_ipv6 ipt_MASQUERADE xt_conntrack iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat nf_conntrack ip6table_filter ip6_tables iptable_filter ip_tables x_tables bridge stp llc sit tunnel4 ip_tunnel dm_crypt lpc_ich mac_hid intel_rapl x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel lp kvm parport serio_raw crct10dif_pclmul crc32_pclmul ghash_clmulni_intel aesni_intel aes_x86_64 lrw gf128mul glue_helper ablk_helper cryptd btrfs xor raid6_pq libcrc32c ast syscopyarea sysfillrect sysimgblt i2c_algo_bit e1000e ttm drm_kms_helper mpt2sas ahci ptp psmouse drm raid_class libahci pps_core scsi_transport_sas video
[ 2182.342026] CPU: 2 PID: 10845 Comm: java Tainted: G C 3.13.0-37-generic #64-Ubuntu
[ 2182.342053] Hardware name: To be filled by O.E.M. To be filled by O.E.M./P8B-M Series, BIOS 6702 07/23/2013
[ 2182.342086] task: ffff88065fc4b000 ti: ffff880677f12000 task.ti: ffff880677f12000
[ 2182.342111] RIP: 0010:[<ffffffff8136f826>] [<ffffffff8136f826>] memcpy+0x6/0x110
[ 2182.342141] RSP: 0018:ffff880677f139b0 EFLAGS: 00010207
[ 2182.342160] RAX: ffff88008ba80156 RBX: 000000000000010b RCX: 000000000000010b
[ 2182.342187] RDX: 000000000000010b RSI: 0005080000000000 RDI: ffff88008ba80156
[ 2182.342223] RBP: ffff880677f139e8 R08: 0000000000001000 R09: ffff880677f139b8
[ 2182.342248] R10: 0000000000000000 R11: 0000000000000003 R12: ffff8806daea9128
[ 2182.342274] R13: 0000160000000000 R14: ffff88008ba80261 R15: 000000000000010b
[ 2182.342300] FS: 00007fd89af79700(0000) GS:ffff88082fc80000(0000) knlGS:0000000000000000
[ 2182.342329] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 2182.342350] CR2: 00000001009a6400 CR3: 00000006d52b8000 CR4: 00000000001407e0
[ 2182.342378] Stack:
[ 2182.342387] ffffffffa02124cc 0000000000001000 ffff88074f09d800 ffff8808020cfd80
[ 2182.342419] 0000000000000000 ffff880677376df0 ffff880604eebab0 ffff880677f13aa8
[ 2182.342452] ffffffffa01f740c 0000000000003e30 0000000000000000 0000000000001000
[ 2182.342484] Call Trace:
[ 2182.342513] [<ffffffffa02124cc>] ? read_extent_buffer+0xbc/0x110 [btrfs]
[ 2182.342548] [<ffffffffa01f740c>] btrfs_get_extent+0x91c/0x970 [btrfs]
[ 2182.342582] [<ffffffffa020e879>] __do_readpage+0x509/0x730 [btrfs]
[ 2182.342614] [<ffffffffa01f6af0>] ? btrfs_real_readdir+0x5b0/0x5b0 [btrfs]
[ 2182.342644] [<ffffffff811b1986>] ? __mem_cgroup_commit_charge+0x156/0x3d0
[ 2182.342681] [<ffffffffa0209e97>] ? btrfs_lookup_ordered_extent+0x27/0x1e0 [btrfs]
[ 2182.342720] [<ffffffffa020eb65>] __extent_read_full_page+0xc5/0xe0 [btrfs]
[ 2182.342756] [<ffffffffa01f6af0>] ? btrfs_real_readdir+0x5b0/0x5b0 [btrfs]
[ 2182.342792] [<ffffffffa0210807>] extent_read_full_page+0x37/0x60 [btrfs]
[ 2182.342827] [<ffffffffa01f4e25>] btrfs_readpage+0x25/0x30 [btrfs]
[ 2182.342861] [<ffffffffa02028b8>] prepare_uptodate_page+0x38/0x90 [btrfs]
[ 2182.342895] [<ffffffffa0202b20>] prepare_pages.isra.17+0x210/0x340 [btrfs]
[ 2182.342929] [<ffffffffa020380d>] __btrfs_buffered_write+0x28d/0x490 [btrfs]
[ 2182.342963] [<ffffffffa0203c25>] btrfs_file_aio_write+0x215/0x520 [btrfs]
[ 2182.342991] [<ffffffff810d7c08>] ? get_futex_key+0x1d8/0x2c0
[ 2182.343016] [<ffffffff810d8e41>] ? futex_wake+0x1b1/0x1d0
[ 2182.343039] [<ffffffff811bca6a>] do_sync_write+0x5a/0x90
[ 2182.343061] [<ffffffff811bd1f4>] vfs_write+0xb4/0x1f0
[ 2182.344040] [<ffffffff811bdc29>] SyS_write+0x49/0xa0
[ 2182.345042] [<ffffffff8172f82d>] system_call_fastpath+0x1a/0x1f
[ 2182.346056] Code: 43 58 48 2b 43 50 88 43 4e 5b 5d c3 66 0f 1f 84 00 00 00 00 00 e8 fb fb ff ff eb e2 90 90 90 90 90 90 90 90 90 48 89 f8 48 89 d1 <f3> a4 c3 03 83 e2 07 f3 48 a5 89 d1 f3 a4 c3 20 4c 8b 06 4c 8b
[ 2182.348249] RIP [<ffffffff8136f826>] memcpy+0x6/0x110
[ 2182.349288] RSP <ffff880677f139b0>
[ 2182.363925] ---[ end trace c24f17a510504fdc ]---

As far as I can tell, that's happening during a simple read access to a file from my Zimbra server. It is, however, not something I can trivially reproduce.

I also can't get a core because for whatever reason linux-crashdump isn't working on that hardware...

Brad Figg (brad-figg) wrote : Missing required logs.

This bug is missing log files that will aid in diagnosing the problem. From a terminal window please run:

apport-collect 1384711

and then change the status of the bug to 'Confirmed'.

If, due to the nature of the issue you have encountered, you are unable to run this command, please add a comment stating that fact and change the bug status to 'Confirmed'.

This change has been made by an automated script, maintained by the Ubuntu Kernel Team.

Changed in linux (Ubuntu):
status: New → Incomplete
tags: added: trusty
tags: added: kernel-da-key
Changed in linux (Ubuntu):
importance: Undecided → High
Joseph Salisbury (jsalisbury) wrote :

This looks similar to bug 1237833 and bug 1235521. Both were reported against Saucy. One bug expired and a workaround was found for the other.

Comment #5 in bug 1237833 says: "I removed the mount options "compressed" and booted into recovery mode performing an filesystem check. now my problem is gone."

Do you by any chance have btrfs compression turned on?

Changed in linux (Ubuntu):
status: Incomplete → Confirmed
Stéphane Graber (stgraber) wrote :

Yes, this is 3.13 with btrfs on a single encrypted block device with zlib compression.

The oops doesn't appear particularly harmful now that I've turned off panic_on_oops, so it's not something I'm willing to sacrifice 2TB of free space to work around :)

Stéphane Graber (stgraber) wrote :

I have since upgraded that server to the upcoming utopic backport (from the kernel team PPA) and I haven't been able to reproduce this bug. So it may well be that there are some btrfs bugfixes in 3.16 which haven't made it to stable.

Changed in linux (Ubuntu):
assignee: nobody → Colin Ian King (colin-king)
Colin Ian King (colin-king) wrote :

I did originally think "Btrfs: check file extent type before anything else" may have been a suitable fix for this (https://bugzilla.kernel.org/show_bug.cgi?id=60834) however the kernel you were using included that fix. Hrm.

Colin Ian King (colin-king) wrote :

Looks like we have a bad src page when doing the memcpy():

memcpy:
        48 89 f8    mov %rdi,%rax
        48 89 d1    mov %rdx,%rcx
        <f3> a4     rep movsb %ds:(%rsi),%es:(%rdi) // i.e. repeat *rdi++ = *rsi++ rcx times
        c3          retq
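
To make the faulting instruction concrete, here is a rough C model of the three instructions above. The name memcpy_model is made up for this illustration and this is not the kernel's implementation. In the register dump RCX still equals RDX (0x10b) and RAX equals RDI, which is consistent with the fault hitting on the very first byte read through rsi.

```c
#include <stddef.h>

/* Hypothetical userspace model of the disassembled memcpy prologue.
 * rdi = dst, rsi = src, rdx = len, as in the x86-64 calling convention. */
void *memcpy_model(void *dst, const void *src, size_t len)
{
    unsigned char *rdi = dst;
    const unsigned char *rsi = src;
    void *rax = dst;        /* mov %rdi,%rax: the return value is dst */
    size_t rcx = len;       /* mov %rdx,%rcx: byte count for rep */

    while (rcx != 0) {      /* rep movsb: repeat until rcx reaches 0 */
        *rdi++ = *rsi++;    /* the read through rsi is what faulted in the oops */
        rcx--;
    }
    return rax;             /* retq */
}
```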

The GPF error code is 0000. rsi (src) is 0005080000000000, which is not a canonical x86-64 address, so the read through it faults; rdi (dst) looks valid (ffff88008ba80156).
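
One way to check the two register values is to test whether each is a canonical address. The sketch below assumes 48-bit virtual addressing (4-level paging), where bits 63..47 must all be equal; the function name is_canonical_48 is hypothetical. On x86-64, an access through a non-canonical address raises a general protection fault rather than a page fault.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumption: 48-bit virtual addresses, so bits 63..47 must all match. */
bool is_canonical_48(uint64_t addr)
{
    uint64_t top = addr >> 47;          /* the upper 17 bits */
    return top == 0 || top == 0x1ffff;  /* all zeros (user) or all ones (kernel) */
}
```

With the values from the oops, is_canonical_48(0x0005080000000000) is false for rsi, while is_canonical_48(0xffff88008ba80156) is true for rdi.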

...and the memcpy() call in extent_io was:

                kaddr = page_address(page);
                memcpy(dst, kaddr + offset, cur);

so it appears that kaddr, the address derived from page, is wrong somehow.
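
For readers unfamiliar with this kind of copy loop, the following is a simplified userspace model of copying a byte range out of a buffer that is backed by separate pages. It is not the btrfs code: struct eb_model, read_eb_model and MODEL_PAGE_SIZE are hypothetical names, and a plain pointer array stands in for page_address(page). It only illustrates why a bad page pointer turns directly into a bad memcpy() source.

```c
#include <stddef.h>
#include <string.h>

#define MODEL_PAGE_SIZE 4096u   /* assumption: 4 KiB pages */

/* Hypothetical stand-in for an extent buffer: an array of page base pointers. */
struct eb_model {
    unsigned char **pages;
    size_t nr_pages;
};

/* Copy len bytes starting at byte offset start into dstv.
 * Returns 0 on success, -1 if a page is missing or out of range. */
int read_eb_model(const struct eb_model *eb, void *dstv, size_t start, size_t len)
{
    unsigned char *dst = dstv;
    size_t i = start / MODEL_PAGE_SIZE;       /* index of the first page */
    size_t offset = start % MODEL_PAGE_SIZE;  /* offset inside that page */

    while (len > 0) {
        /* Only catches a NULL page; a garbage non-NULL pointer like the
         * one in the oops would still slip through this check. */
        if (i >= eb->nr_pages || eb->pages[i] == NULL)
            return -1;

        size_t cur = MODEL_PAGE_SIZE - offset;  /* bytes left in this page */
        if (cur > len)
            cur = len;

        unsigned char *kaddr = eb->pages[i];    /* plays the role of page_address(page) */
        memcpy(dst, kaddr + offset, cur);

        dst += cur;
        len -= cur;
        offset = 0;                             /* later pages are read from their start */
        i++;
    }
    return 0;
}
```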

Colin Ian King (colin-king) wrote :

I believe the fix landed in 3.14, namely "Btrfs: don't use ram_bytes for uncompressed inline items", and I think it fixes the issue on disk, so that if you reverted to 3.13 you would no longer see the bug. Because we can't test this now, I'm not sure how to proceed.

From what I understand about this fix, it addresses an issue that can occur if the file system is shut down incorrectly, hence tripping this issue. Reproducing it is non-trivial, and without a reproducer I can't easily do a backport and check whether it fixes the problem.

Changed in linux (Ubuntu):
status: Confirmed → Fix Released