trusty amd64 kernel with bcache module panics on unregister

Bug #1445273 reported by Ryan Harper
12
This bug affects 2 people
Affects Status Importance Assigned to Milestone
linux (Ubuntu)
Triaged
High
Unassigned

Bug Description

We have a system with 3 block devices (SATA) split into 2 partitions each, and create RAID0 md0 (sda1, sdb1, sdc1), and RAID0 md1 (sda2, sdb2, sdc2). We then create a bcache device on an SSD and set it to writeback mode, (/dev/bcache0).

When we attempt to release/unregister the bcache device, it panics. Here's the release script and panic.

#!/bin/bash -x

sudo su - -c "echo 1 >/sys/fs/bcache/*-*-*-*-*/unregister"
sudo su - -c "echo 1 >/sys/block/bcache0/bcache/stop"
sudo mdadm --stop /dev/md0
sudo mdadm --stop /dev/md1
for dev in sda sdb sdc; do
        sudo mdadm --zero-superblock /dev/${dev}1
        sudo mdadm --zero-superblock /dev/${dev}2
        END="$(sudo blockdev --getsz /dev/${dev})"
        GPT_END=$(($END - 100))
        sudo dd if=/dev/zero of=/dev/${dev} bs=1M count=1
        sudo dd if=/dev/zero of=/dev/${dev} bs=512 seek=${GPT_END}
        sync
        sudo blockdev --rereadpt /dev/${dev}
done
sudo wipefs -a /dev/sdd
cat /proc/partitions

[ 318.047616] BUG: unable to handle kernel NULL pointer dereference at 0000000000000a00
[ 318.055478] IP: [<ffffffffa0278280>] journal_write_unlocked+0x130/0x540 [bcache]
[ 318.062889] PGD 0
[ 318.064917] Oops: 0000 [#1] SMP
[ 318.068177] Modules linked in: btrfs raid6_pq xor ufs qnx4 hfsplus hfs minix ntfs msdos jfs xfs libcrc32c bcache raid0 raid1 ib_iser rdma_cm iw_cm ib_cm ib_sa ib_mad ib_core
ib_addr dm_crypt gpio_ich intel_powerclamp coretemp kvm_intel kvm joydev ioatdma i7core_edac shpchp serio_raw edac_core lpc_ich mac_hid dca overlayfs iscsi_tcp libiscsi_tcp libi
scsi scsi_transport_iscsi hid_generic e1000e usbhid ptp ahci psmouse pata_acpi hid libahci pps_core
[ 318.108586] CPU: 9 PID: 17196 Comm: kworker/9:2 Not tainted 3.13.0-49-generic #81-Ubuntu
[ 318.116669] Hardware name: Supermicro X8DTT-H/X8DTT-H, BIOS 2.1b 10/28/2011
[ 318.124147] Workqueue: events journal_write_work [bcache]
[ 318.129558] task: ffff880c1a6cc800 ti: ffff880c1b9c2000 task.ti: ffff880c1b9c2000
[ 318.137027] RIP: 0010:[<ffffffffa0278280>] [<ffffffffa0278280>] journal_write_unlocked+0x130/0x540 [bcache]
[ 318.146866] RSP: 0018:ffff880c1b9c3d90 EFLAGS: 00010202
[ 318.152170] RAX: 0000000000000000 RBX: 0000000000000001 RCX: 0000000000000000
[ 318.159295] RDX: ffff880604560c48 RSI: ffff88060456cad8 RDI: ffff880606c70040
[ 318.166419] RBP: ffff880c1b9c3de8 R08: 2018115b2f400000 R09: 5e80000000000000
[ 318.173542] R10: dfe776aad516cbd0 R11: 0000000000003d06 R12: 0000000000000001
[ 318.180669] R13: ffff88060456cba0 R14: ffff88060456ccc8 R15: ffff880604560000
[ 318.187801] FS: 0000000000000000(0000) GS:ffff880c3fc60000(0000) knlGS:0000000000000000
[ 318.195879] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[ 318.201616] CR2: 0000000000000a00 CR3: 0000000001c0e000 CR4: 00000000000007e0
[ 318.208741] Stack:
[ 318.210751] ffff88062780ac00 ffff880604560000 0000000000000000 0000000000000240
[ 318.218205] ffff880c1b9c3dc0 ffffffff81076108 ffff880604560000 ffff88060456cba0
[ 318.225657] ffff88060456cbd0 0000000000000000 0000000000000240 ffff880c1b9c3e08
[ 318.233112] Call Trace:
[ 318.235564] [<ffffffff81076108>] ? add_timer+0x18/0x20
[ 318.240790] [<ffffffffa02786e0>] journal_try_write+0x50/0x60 [bcache]
[ 318.247316] [<ffffffffa0278712>] journal_write_work+0x22/0x30 [bcache]
[ 318.253926] [<ffffffff81083a82>] process_one_work+0x182/0x450
[ 318.259757] [<ffffffff81084871>] worker_thread+0x121/0x410
[ 318.265331] [<ffffffff81084750>] ? rescuer_thread+0x430/0x430
[ 318.271166] [<ffffffff8108b5b2>] kthread+0xd2/0xf0
[ 318.276045] [<ffffffff8108b4e0>] ? kthread_create_on_node+0x1c0/0x1c0
[ 318.282571] [<ffffffff8173258c>] ret_from_fork+0x7c/0xb0
[ 318.287969] [<ffffffff8108b4e0>] ? kthread_create_on_node+0x1c0/0x1c0
[ 318.294494] Code: 00 00 e8 64 9f 0f e1 31 c0 66 41 83 bd 94 38 ff ff 00 49 8b 8d a0 40 ff ff 49 8d 97 48 0c 00 00 74 3c 66 0f 1f 84 00 00 00 00 00 <48> 8b b9 00 0a 00 00 0f b7 89 ce 00 00 00 83 c0 01 49 8b 36 48
[ 318.314445] RIP [<ffffffffa0278280>] journal_write_unlocked+0x130/0x540 [bcache]
[ 318.321943] RSP <ffff880c1b9c3d90>
[ 318.325426] CR2: 0000000000000a00
[ 318.328740] ---[ end trace 66f8467ce2b49a92 ]---
[ 318.333407] BUG: unable to handle kernel paging request at ffffffffffffffd8
[ 318.340394] IP: [<ffffffff8108bc50>] kthread_data+0x10/0x20
[ 318.345985] PGD 1c11067 PUD 1c13067 PMD 0
[ 318.350129] Oops: 0000 [#2] SMP
[ 318.353388] Modules linked in: btrfs raid6_pq xor ufs qnx4 hfsplus hfs minix ntfs msdos jfs xfs libcrc32c bcache raid0 raid1 ib_iser rdma_cm iw_cm ib_cm ib_sa ib_mad ib_core ib_addr dm_crypt gpio_ich intel_powerclamp coretemp kvm_intel kvm joydev ioatdma i7core_edac shpchp serio_raw edac_core lpc_ich mac_hid dca overlayfs iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi hid_generic e1000e usbhid ptp ahci psmouse pata_acpi hid libahci pps_core
[ 318.393796] CPU: 9 PID: 17196 Comm: kworker/9:2 Tainted: G D 3.13.0-49-generic #81-Ubuntu
[ 318.402920] Hardware name: Supermicro X8DTT-H/X8DTT-H, BIOS 2.1b 10/28/2011
[ 318.410401] task: ffff880c1a6cc800 ti: ffff880c1b9c2000 task.ti: ffff880c1b9c2000
[ 318.417873] RIP: 0010:[<ffffffff8108bc50>] [<ffffffff8108bc50>] kthread_data+0x10/0x20
[ 318.425880] RSP: 0018:ffff880c1b9c39e8 EFLAGS: 00010002
[ 318.431184] RAX: 0000000000000000 RBX: 0000000000000009 RCX: 000000000000000f
[ 318.438309] RDX: 0000000000000000 RSI: 0000000000000009 RDI: ffff880c1a6cc800
[ 318.445433] RBP: ffff880c1b9c39e8 R08: 0000000000000000 R09: ffff880c3fc77bc0
[ 318.452558] R10: ffffffff8133ff6a R11: ffffea00306b0ec0 R12: ffff880c3fc734c0
[ 318.459683] R13: 0000000000000009 R14: ffff880c1a6cc7f0 R15: ffff880c1a6cc800
[ 318.466808] FS: 0000000000000000(0000) GS:ffff880c3fc60000(0000) knlGS:0000000000000000
[ 318.474893] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[ 318.480632] CR2: 0000000000000028 CR3: 0000000001c0e000 CR4: 00000000000007e0
[ 318.487764] Stack:
[ 318.489775] ffff880c1b9c3a00 ffffffff81084f81 ffff880c1a6cc800 ffff880c1b9c3a60
[ 318.497229] ffffffff81725cd9 ffff880c1a6cc800 ffff880c1b9c3fd8 00000000000134c0
[ 318.504683] 00000000000134c0 ffff880c1a6cc800 ffff880c1a6cce50 ffff880c1a6cc7f0
[ 318.512137] Call Trace:
[ 318.514583] [<ffffffff81084f81>] wq_worker_sleeping+0x11/0x90
[ 318.520420] [<ffffffff81725cd9>] __schedule+0x589/0x7d0
[ 318.525728] [<ffffffff81725f49>] schedule+0x29/0x70
[ 318.530697] [<ffffffff8106a16f>] do_exit+0x6df/0xa50
[ 318.535750] [<ffffffff8172b0a9>] oops_end+0xa9/0x150
[ 318.540801] [<ffffffff8171a75d>] no_context+0x27e/0x28b
[ 318.546114] [<ffffffff8171a7dd>] __bad_area_nosemaphore+0x73/0x1ca
[ 318.552379] [<ffffffff8171a947>] bad_area_nosemaphore+0x13/0x15
[ 318.558389] [<ffffffff8172da97>] __do_page_fault+0xa7/0x570
[ 318.564047] [<ffffffff8172df7a>] do_page_fault+0x1a/0x70
[ 318.569446] [<ffffffff810744e7>] ? internal_add_timer+0x17/0x40
[ 318.575453] [<ffffffff8172a3a8>] page_fault+0x28/0x30
[ 318.580599] [<ffffffffa0278280>] ? journal_write_unlocked+0x130/0x540 [bcache]
[ 318.587899] [<ffffffff81076108>] ? add_timer+0x18/0x20
[ 318.593129] [<ffffffffa02786e0>] journal_try_write+0x50/0x60 [bcache]
[ 318.599655] [<ffffffffa0278712>] journal_write_work+0x22/0x30 [bcache]
[ 318.606265] [<ffffffff81083a82>] process_one_work+0x182/0x450
[ 318.612098] [<ffffffff81084871>] worker_thread+0x121/0x410
[ 318.617671] [<ffffffff81084750>] ? rescuer_thread+0x430/0x430
[ 318.623504] [<ffffffff8108b5b2>] kthread+0xd2/0xf0
[ 318.628383] [<ffffffff8108b4e0>] ? kthread_create_on_node+0x1c0/0x1c0
[ 318.634910] [<ffffffff8173258c>] ret_from_fork+0x7c/0xb0
[ 318.640309] [<ffffffff8108b4e0>] ? kthread_create_on_node+0x1c0/0x1c0
[ 318.646832] Code: 00 48 89 e5 5d 48 8b 40 c8 48 c1 e8 02 83 e0 01 c3 66 2e 0f 1f 84 00 00 00 00 00 66 66 66 66 90 48 8b 87 c0 03 00 00 55 48 89 e5 <48> 8b 40 d8 5d c3 66 2e 0f 1f 84 00 00 00 00 00 66 66 66 66 90
[ 318.666786] RIP [<ffffffff8108bc50>] kthread_data+0x10/0x20
[ 318.672461] RSP <ffff880c1b9c39e8>
[ 318.675946] CR2: ffffffffffffffd8
[ 318.679259] ---[ end trace 66f8467ce2b49a93 ]---
[ 318.683877] Fixing recursive fault but reboot is needed!

Tags: trusty
Revision history for this message
Brad Figg (brad-figg) wrote : Missing required logs.

This bug is missing log files that will aid in diagnosing the problem. From a terminal window please run:

apport-collect 1445273

and then change the status of the bug to 'Confirmed'.

If, due to the nature of the issue you have encountered, you are unable to run this command, please add a comment stating that fact and change the bug status to 'Confirmed'.

This change has been made by an automated script, maintained by the Ubuntu Kernel Team.

Changed in linux (Ubuntu):
status: New → Incomplete
tags: added: trusty
Revision history for this message
Stefan Bader (smb) wrote :

I am not sure... but maybe first doing the stop on the block device attribute and then unregistering might be better.

Revision history for this message
Stefan Bader (smb) wrote :

Possibly fixed indirectly by this because that at least checks for outstanding io to be done in try_write...

commit cb7a583e6a6ace661a5890803e115d2292a293df
Author: Kent Overstreet <email address hidden>
Date: Mon Dec 16 15:27:25 2013 -0800

    bcache: kill closure locking usage

    Signed-off-by: Kent Overstreet <email address hidden>

Revision history for this message
Stefan Bader (smb) wrote :

And of course this patch was not part of the picks I made from 3.14 because it had dependencies on other patches which seemed rather intrusive and not qualify as bug-fixes. So we would have to fix this on our own.

Revision history for this message
Ryan Harper (raharper) wrote :

A quick test shows stopping and then unregistering helps. I've only tested that a few times. That said, clearly if we do things in the wrong order, the module should refuse instead of panic =).

Changed in linux (Ubuntu):
importance: Undecided → High
status: Incomplete → Triaged
Revision history for this message
Peter Maloney (peter-maloney) wrote :
Download full text (6.8 KiB)

I get the same problem with kernel 3.13.0-103-generic on Ubuntu 14.04.5

I can reproduce it easily with this script (ssd1 is a volume group, and sdb is a whole disk that already has bcache with writeback enabled).

(
csetuuid=$(bcache-super-show /dev/ssd1/bcache_sdb | awk '$1 == "cset.uuid" {print $2}')
echo "$csetuuid" > /sys/block/bcache0/bcache/detach
[ -n "$csetuuid" ] && echo 1 > /sys/fs/bcache/"$csetuuid"/stop
dd if=/dev/zero of=/dev/ssd1/bcache_sdb bs=1M count=1
make-bcache -C /dev/ssd1/bcache_sdb
echo /dev/ssd1/bcache_sdb > /sys/fs/bcache/register
csetuuid=$(bcache-super-show /dev/ssd1/bcache_sdb | awk '$1 == "cset.uuid" {print $2}')
echo "$csetuuid" > /sys/block/bcache0/bcache/attach

bcache-super-show /dev/sdb
)

3,612,50699700,-;bcache: __cached_dev_store() Can't attach 375f533e-0fbf-4fd9-9849-a0e4f10fd11c\x0a: cache set not found
5,613,50720567,-;bcache: run_cache_set() invalidating existing data
6,614,50729841,-;bcache: bch_cached_dev_attach() Caching sdb as bcache0 on set 375f533e-0fbf-4fd9-9849-a0e4f10fd11c
6,615,50729857,-;bcache: register_cache() registered cache device dm-1
6,616,63752822,-;bcache: cached_dev_detach_finish() Caching disabled for sdb
6,617,63756250,-;bcache: cache_set_free() Cache set 375f533e-0fbf-4fd9-9849-a0e4f10fd11c unregistered
1,618,63852168,-;BUG: unable to handle kernel NULL pointer dereference at 0000000000000a00
1,619,63852267,-;IP: [<ffffffffa0070440>] journal_write_unlocked+0x130/0x570 [bcache]
4,620,63852363,-;PGD 0
4,621,63852391,-;Oops: 0000 [#1] SMP
4,622,63852435,-;Modules linked in: ocfs2_dlmfs ocfs2_stack_o2cb ocfs2_dlm ocfs2_nodemanager ocfs2_stackglue configfs ppdev xfs serio_raw lpc_ich shpchp parport_pc mac_hid lp parport btrfs xor bcache raid6_pq libcrc32c psmouse ahci libahci virtio_scsi
4,623,63852702,-;CPU: 0 PID: 32 Comm: kworker/0:1 Not tainted 3.13.0-103-generic #150-Ubuntu
4,624,63852759,-;Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.10.0-20161027_200622-foutrelis 04/01/2014
4,625,63852836,-;Workqueue: events journal_write_work [bcache]
4,626,63852876,-;task: ffff88005c833000 ti: ffff88005c83a000 task.ti: ffff88005c83a000
4,627,63852929,-;RIP: 0010:[<ffffffffa0070440>] [<ffffffffa0070440>] journal_write_unlocked+0x130/0x570 [bcache]
4,628,63853006,-;RSP: 0018:ffff88005c83bd88 EFLAGS: 00010202
4,629,63853044,-;RAX: 0000000000000000 RBX: ffff880058e6cba0 RCX: 0000000000000000
4,630,63853094,-;RDX: ffff880058e60c48 RSI: ffff880058e6cad8 RDI: ffff880041808040
4,631,63853144,-;RBP: ffff88005c83bde0 R08: 2001639b2f400000 R09: 5e80000000000000
4,632,63853193,-;R10: dffe246529a6cbd0 R11: 0000000000000000 R12: 0000000000000001
4,633,63853243,-;R13: ffff880058e6ccc8 R14: 0000000000000001 R15: ffff880058e60000
4,634,63853294,-;FS: 0000000000000000(0000) GS:ffff88005fc00000(0000) knlGS:0000000000000000
4,635,63853350,-;CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
4,636,63853391,-;CR2: 0000000000000a00 CR3: 000000005a349000 CR4: 00000000000006f0
4,637,63853444,-;Stack:
4,638,63853460,-; ffffffff810a0d78 ffff880058e60000 ffff88005c83bde8 ffffffff81013558
4,639,63853522,-; 000000005c83bdf8 ffff88005c8334e8 ffff880058e60000 ffff880058e6cba0
4,640,63853582,-; ffff88005...

Read more...

Revision history for this message
Peter Maloney (peter-maloney) wrote :

oh and it seems if I "unregister" instead of "stop" on the 3rd line in my script, the problem is gone for kernel 4.4.0-51-generic. But kernel 3.13.0-103-generic still dies the same.

To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.