bcache: bch_allocator_thread(): hung task timeout

Bug #1784665 reported by Ryan Harper on 2018-07-31
36
This bug affects 9 people
Affects Status Importance Assigned to Milestone
linux (Ubuntu)
Status tracked in Eoan
Xenial
High
Andrea Righi
Bionic
High
Andrea Righi
Disco
High
Andrea Righi
Eoan
High
Andrea Righi

Bug Description

[Impact]

bcache_allocator() can call the following:

 bch_allocator_thread()
  -> bch_prio_write()
     -> bch_bucket_alloc()
        -> wait on &ca->set->bucket_wait

But the wake up event on bucket_wait is supposed to come from bch_allocator_thread() itself causing a deadlock.

[Test Case]

This is a simple script that can easily trigger the deadlock condition:
https://launchpadlibrarian.net/381282009/bcache-basic-repro.sh

A better test case has been also provided in bug 1796292 (duplicate of this bug):
https://bugs.launchpad.net/curtin/+bug/1796292/+attachment/5280353/+files/curtin-nvme.sh

[Fix]

Fix by making the call to bch_prio_write() non-blocking, so that bch_allocator_thread() never waits on itself. Moreover, make sure to wake up the garbage collector thread when bch_prio_write() is failing to allocate buckets to increase the chance of freeing up more buckets.

In addition to that it would be safe to also import other upstream bcache fixes (all clean cherry picks):

7e865eba00a3df2dc8c4746173a8ca1c1c7f042e bcache: fix potential deadlock in cached_def_free()
80265d8dfd77792e133793cef44a21323aac2908 bcache: acquire bch_register_lock later in cached_dev_free()
ce4c3e19e5201424357a0c82176633b32a98d2ec bcache: Replace bch_read_string_list() by __sysfs_match_string()
ecb37ce9baac653cc09e2b631393dde3df82979f bcache: Move couple of functions to sysfs.c
04cbc21137bfa4d7b8771a5b14f3d6c9b2aee671 bcache: Move couple of string arrays to sysfs.c
5f2b18ec8e1643410a2369f06888951cdedea0bf bcache: Fix a compiler warning in bcache_device_init()
20d3a518713e394efa5a899c84574b4b79ec5098 bcache: Reduce the number of sparse complaints about lock imbalances
42361469ae84c851e40cb1f94c8c9a14cdd94039 bcache: Suppress more warnings about set-but-not-used variables
f0d3814090ac77de94c42b7124c37ece23629197 bcache: Remove an unused variable
47344e330eabc1515cbe6061eb337100a3ab6d37 bcache: Fix kernel-doc warnings
9dfbdec7b7fea1ff1b7b5d5d12980dbc7dca46c7 bcache: Annotate switch fall-through
4a4e443835a43a79113cc237c472c0d268eb1e1c bcache: Add __printf annotation to __bch_check_keys()
fd01991d5c20098c5c1ffc4dca6c821cc60a2f74 bcache: Fix indentation
ca71df31661a0518ed58a1a59cf1993962153ebb bcache: fix using of loop variable in memory shrink
f3641c3abd1da978ee969b0203b71b86ec1bfa93 bcache: fix error return value in memory shrink
688892b3bc05e25da94866e32210e5f503f16f69 bcache: fix incorrect sysfs output value of strip size
09a44ca2114737e0932257619c16a2b50c7807f1 bcache: use pr_info() to inform duplicated CACHE_SET_IO_DISABLE set
c4dc2497d50d9c6fb16aa0d07b6a14f3b2adb1e0 bcache: fix high CPU occupancy during journal
a728eacbbdd229d1d903e46261c57d5206f87a4a bcache: add journal statistic
616486ab52ab7f9739b066d958bdd20e65aefd74 bcache: fix writeback target calc on large devices
1f0ffa67349c56ea54c03ccfd1e073c990e7411e bcache: only set BCACHE_DEV_WB_RUNNING when cached device attached
eb8cbb6df38f6e5124a3d5f1f8a3dbf519537c60 bcache: improve bcache_reboot()
9951379b0ca88c95876ad9778b9099e19a95d566 bcache: never writeback a discard operation

[Regression Potential]

The upstream fixes are all clean cherry picks from stable (most of them are small cleanups), so regression potential is minimal.

The only special patch is "UBUNTU: SAUCE: bcache: fix deadlock in bcache_allocator()" that is addressing the main deadlock bug (that seems to be a mainline bug - not fixed yet). We should spend more time trying to reproduce this deadlock with a mainline kernel and post the patch to the LKML for review / feedback.

However, considering that this patch seems to fix/prevent the specific deadlock problem reported in this bug (tested on the affected platform) it can be considered safe to apply it.

[Original Bug Report]

$ cat /proc/version_signature
Ubuntu 4.15.0-29.31-generic 4.15.18

$ lsb_release -rd
Description: Ubuntu Cosmic Cuttlefish (development branch)
Release: 18.10

$ apt-cache policy linux-image-`uname -r`
linux-image-4.15.0-29-generic:
  Installed: 4.15.0-29.31
  Candidate: 4.15.0-29.31
  Version table:
 *** 4.15.0-29.31 500
        500 http://archive.ubuntu.com/ubuntu cosmic/main amd64 Packages
        100 /var/lib/dpkg/status

3) mkfs.ext4 /dev/bcache0 returns successful creating an ext4 filesystem on top of a bcache device

4) mkfs.ext4 doesn't return and kernel prints hung process info

[ 58.018099] cloud-init[920]: Running command ['mkfs.ext4', '-F', '-L', 'root-fs', '-U', 'f01aec97-9457-11e8-b8d6-525400123401', '/dev/bcache0'] with allowed return codes [0] (capture=True)
[ 242.652018] INFO: task kworker/u4:0:5 blocked for more than 120 seconds.
[ 242.653767] Tainted: P O 4.15.0-29-generic #31-Ubuntu
[ 242.655391] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 242.657397] INFO: task kworker/0:2:410 blocked for more than 120 seconds.
[ 242.659126] Tainted: P O 4.15.0-29-generic #31-Ubuntu
[ 242.660980] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 242.663000] INFO: task bcache_allocato:2326 blocked for more than 120 seconds.
[ 242.664807] Tainted: P O 4.15.0-29-generic #31-Ubuntu
[ 242.666516] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 242.668503] INFO: task bcache_writebac:2345 blocked for more than 120 seconds.
[ 242.670301] Tainted: P O 4.15.0-29-generic #31-Ubuntu
[ 242.671936] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 242.673909] INFO: task mkfs.ext4:2803 blocked for more than 120 seconds.
[ 242.675414] Tainted: P O 4.15.0-29-generic #31-Ubuntu
[ 242.677038] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 363.483998] INFO: task kworker/u4:0:5 blocked for more than 120 seconds.
[ 363.488441] Tainted: P O 4.15.0-29-generic #31-Ubuntu
[ 363.489598] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 363.491043] INFO: task kworker/0:2:410 blocked for more than 120 seconds.
[ 363.492252] Tainted: P O 4.15.0-29-generic #31-Ubuntu
[ 363.494085] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 363.495659] INFO: task bcache_allocato:2326 blocked for more than 120 seconds.
[ 363.496957] Tainted: P O 4.15.0-29-generic #31-Ubuntu
[ 363.498454] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 363.499866] INFO: task bcache_writebac:2345 blocked for more than 120 seconds.
[ 363.501156] Tainted: P O 4.15.0-29-generic #31-Ubuntu
[ 363.502597] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 363.504048] INFO: task mkfs.ext4:2803 blocked for more than 120 seconds.
[ 363.505505] Tainted: P O 4.15.0-29-generic #31-Ubuntu
[ 363.506677] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.

System has two virtio block devices. bcache was created like so:

make-bcache -C /dev/vdb
make-bcache -B /dev/vda2

resulting in /dev/bcache0

ProblemType: Bug
DistroRelease: Ubuntu 18.10
Package: linux-image-4.15.0-29-generic 4.15.0-29.31
ProcVersionSignature: User Name 4.15.0-29.31-generic 4.15.18
Uname: Linux 4.15.0-29-generic x86_64
AlsaDevices:
 total 0
 crw-rw---- 1 root audio 116, 1 Jul 31 15:52 seq
 crw-rw---- 1 root audio 116, 33 Jul 31 15:52 timer
AplayDevices: Error: [Errno 2] No such file or directory: 'aplay': 'aplay'
ApportVersion: 2.20.10-0ubuntu7
Architecture: amd64
ArecordDevices: Error: [Errno 2] No such file or directory: 'arecord': 'arecord'
AudioDevicesInUse: Error: command ['fuser', '-v', '/dev/snd/seq', '/dev/snd/timer'] failed with exit code 1:
CRDA: N/A
Date: Tue Jul 31 15:53:56 2018
IwConfig: Error: [Errno 2] No such file or directory: 'iwconfig': 'iwconfig'
Lsusb: Bus 001 Device 001: ID 1d6b:0001 Linux Foundation 1.1 root hub
MachineType: QEMU Standard PC (i440FX + PIIX, 1996)
PciMultimedia:

ProcEnviron:
 TERM=xterm
 PATH=(custom, no user)
 XDG_RUNTIME_DIR=<set>
 LANG=C.UTF-8
 SHELL=/bin/bash
ProcFB:

ProcKernelCmdLine: BOOT_IMAGE=/boot/vmlinuz-4.15.0-29-generic root=LABEL=cloudimg-rootfs ro console=tty1 console=ttyS0
RelatedPackageVersions:
 linux-restricted-modules-4.15.0-29-generic N/A
 linux-backports-modules-4.15.0-29-generic N/A
 linux-firmware N/A
RfKill: Error: [Errno 2] No such file or directory: 'rfkill': 'rfkill'
SourcePackage: linux
UpgradeStatus: No upgrade log present (probably fresh install)
dmi.bios.date: 04/01/2014
dmi.bios.vendor: SeaBIOS
dmi.bios.version: 1.11.1-1
dmi.chassis.type: 1
dmi.chassis.vendor: QEMU
dmi.chassis.version: pc-i440fx-bionic
dmi.modalias: dmi:bvnSeaBIOS:bvr1.11.1-1:bd04/01/2014:svnQEMU:pnStandardPC(i440FX+PIIX,1996):pvrpc-i440fx-bionic:cvnQEMU:ct1:cvrpc-i440fx-bionic:
dmi.product.name: Standard PC (i440FX + PIIX, 1996)
dmi.product.version: pc-i440fx-bionic
dmi.sys.vendor: QEMU

Ryan Harper (raharper) wrote :

This change was made by a bot.

Changed in linux (Ubuntu):
status: New → Confirmed

Did this issue start happening after an update/upgrade? Was there a prior kernel version where you were not having this particular problem?

Would it be possible for you to test the latest upstream kernel? Refer to https://wiki.ubuntu.com/KernelMainlineBuilds . Please test the latest v4.18 kernel[0].

If this bug is fixed in the mainline kernel, please add the following tag 'kernel-fixed-upstream'.

If the mainline kernel does not fix this bug, please add the tag: 'kernel-bug-exists-upstream'.

Once testing of the upstream kernel is complete, please mark this bug as "Confirmed".

Thanks in advance.

[0] http://kernel.ubuntu.com/~kernel-ppa/mainline/v4.18-rc7

Changed in linux (Ubuntu):
importance: Undecided → High
tags: added: kernel-key
Changed in linux (Ubuntu Cosmic):
status: Confirmed → Triaged
Ryan Harper (raharper) wrote :

> Did this issue start happening after an update/upgrade? Was there a prior kernel version where you were not having this particular problem?

We've had plenty of successful runs on this kernel. I suspect io load or something else may trigger this. We've only seen it once out of tens of not hundreds of runs on Cosmic 4.15.0-29-generic.

We don't currently have an easy way to test upstream kernels in the automation runs. If we see this again, I'll poke a bit more on running mkfs.ext4 over bcache devices on Cosmic to see if we can recreate outside of the automation.

Ryan Harper (raharper) wrote :

We've just had a second occurrence of this, same kernel, but with Xenial. (Xenial HWE Edge kernel 4.15.0-29) with the same scenario, mkfs.ext4 a /dev/bcache0.

I'll see if I can recreate by stressing things.

Joseph Salisbury (jsalisbury) wrote :

It would be great if we can identify a reproducer. It would make it much easier to track down.

Ryan Harper (raharper) wrote :

I think I've got one. It doesn't always hang the mkfs, but it does wedge bcache in a few ways.

[ 484.322546] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 484.326230] INFO: task bcache_allocato:32503 blocked for more than 120 seconds.

1) when it occurs, a backing device never get's released from the kernel, such that a blockdev --rereadpt on the device says the device is busy, but examining /sys/class/block/<dev>/holders shows nothing

2) attempts to open the device exclusively fail (so things like mkfs or make-bcache -B).

The reproducer has two virtio disks, vda, vdb; size doesn't really matter, but they are 10G and 2G respectively. /dev/vdb is a cache set (make-bcache -C /dev/vdb), /dev/vda2 is the backing device (make-bcache -B /dev/vda2) and then they are joined (echo $CSET_UUID > /sys/class/block/vda/vda2/bcache/attach), which results in a /dev/bcache0. Then we mkfs.ext4 it.

Then we tear it down, wipe the partitions, the disks and repeat.

I'll attach the script.

Ryan Harper (raharper) wrote :
Ryan Harper (raharper) wrote :

I'll see if I can reproduce on the mainline kernel.

Ryan Harper (raharper) wrote :

I can recreate on upstream:

# uname -r; dmesg | tail -n 16
4.18.0-041800rc7-generic
[ 967.657492] INFO: task bcache_allocato:14240 blocked for more than 120 seconds.
[ 967.664138] Not tainted 4.18.0-041800rc7-generic #201807292230
[ 967.669641] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 967.673503] bcache_allocato D 0 14240 2 0x80000000
[ 967.673512] Call Trace:
[ 967.673529] __schedule+0x29e/0x840
[ 967.673535] ? reschedule_interrupt+0xa/0x20
[ 967.673539] schedule+0x2c/0x80
[ 967.673555] bch_bucket_alloc+0x19d/0x380 [bcache]
[ 967.673561] ? wait_woken+0x80/0x80
[ 967.673574] bch_prio_write+0x186/0x330 [bcache]
[ 967.673583] bch_allocator_thread+0x482/0xd10 [bcache]
[ 967.673589] kthread+0x120/0x140
[ 967.673598] ? bch_invalidate_one_bucket+0x80/0x80 [bcache]
[ 967.673601] ? kthread_bind+0x40/0x40
[ 967.673605] ret_from_fork+0x35/0x40

tags: added: kernel-bug-exists-upstream
Changed in linux (Ubuntu Cosmic):
status: Triaged → Confirmed
Joseph Salisbury (jsalisbury) wrote :

Thanks for testing mainline. We should be able to bisect this down if we confirm it's a regression.

Can you test the following kernels and report back? We are looking for the first kernel version that exhibits this bug:

4.10 Final: http://kernel.ubuntu.com/~kernel-ppa/mainline/v4.10/
4.13 Final: http://kernel.ubuntu.com/~kernel-ppa/mainline/v4.13/
4.15 Final: http://kernel.ubuntu.com/~kernel-ppa/mainline/v4.15/

You don't have to test every kernel, just up until the kernel that first has this bug. If 4.10 final exhibits the bug, we'll have to test even earlier versions.

Thanks in advance!

Ryan Harper (raharper) wrote :

4.10:
  Status: OK
  Comment: Survived 30 minute loop over recreating with no issues.
4.13:
  Status: OK
  Comment: Survived 30 minute loop over recreating with no issues.
4.15:
  Status: FAIL
  Comment: triggered within less than a minute of the loop.
  Kernel Trace:

[ 122.500943] Oops: 0000 [#24] SMP PTI
[ 122.501612] Modules linked in: bcache nls_utf8 isofs nls_iso8859_1 ppdev kvm_intel kvm irqbypass parport_pc input_leds parport joydev mac_hid serio_raw sch_fq_codel ib_iser rdma_cm iw_cm ib_cm ib_core iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi virtio_rng ip_tables x_tables autofs4 btrfs zstd_compress raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq libcrc32c raid1 raid0 multipath linear psmouse virtio_net virtio_blk virtio_scsi i2c_piix4 pata_acpi floppy
[ 122.508306] CPU: 3 PID: 2212 Comm: kworker/3:249 Tainted: G D 4.15.0-041500-generic #201802011154
[ 122.510459] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Ubuntu-1.8.2-1ubuntu1 04/01/2014
[ 122.512541] Workqueue: bcache bch_data_insert_keys [bcache]
[ 122.513779] RIP: 0010:__bch_bset_search+0x80/0x470 [bcache]
[ 122.515015] RSP: 0018:ffffb49141a83c10 EFLAGS: 00010286
[ 122.516127] RAX: fffffffffff6cacc RBX: fffffffffff6cacc RCX: 0000000000000000
[ 122.516947] RDX: ffffb49141a83e18 RSI: 0000000000000069 RDI: ffff8f6a7ad64cc8
[ 122.517768] RBP: ffffb49141a83c70 R08: 0000000000000001 R09: ffff8f6a7ad64ce0
[ 122.518586] R10: ffff8f6a6735ab80 R11: 0000000000027ec0 R12: ffffb49141a83e18
[ 122.519858] R13: 0000000400656369 R14: 0000000066643165 R15: ffff8f6a7ad64ce0
[ 122.521302] FS: 0000000000000000(0000) GS:ffff8f6a7fd80000(0000) knlGS:0000000000000000
[ 122.522272] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 122.522969] CR2: 0000000400656389 CR3: 0000000038fc2000 CR4: 00000000000006e0
[ 122.524521] Call Trace:
[ 122.525153] ? ttwu_do_activate+0x7a/0x90
[ 122.526147] __bch_btree_iter_init+0x5d/0xc0 [bcache]
[ 122.527411] bch_btree_iter_init+0x12/0x20 [bcache]
[ 122.528578] bch_btree_map_nodes_recurse+0x93/0x190 [bcache]
[ 122.529904] ? __slab_free+0x14d/0x2c0
[ 122.530814] ? bch_btree_insert_check_key+0x1f0/0x1f0 [bcache]
[ 122.532331] __bch_btree_map_nodes+0x142/0x1a0 [bcache]
[ 122.533685] ? bch_btree_insert_check_key+0x1f0/0x1f0 [bcache]
[ 122.534994] bch_btree_insert+0xf9/0x170 [bcache]
[ 122.536059] ? wait_woken+0x80/0x80
[ 122.536876] bch_data_insert_keys+0x36/0x130 [bcache]
[ 122.538004] process_one_work+0x1ef/0x410
[ 122.538928] worker_thread+0x32/0x410
[ 122.539779] kthread+0x121/0x140
[ 122.540538] ? process_one_work+0x410/0x410
[ 122.541487] ? kthread_create_worker_on_cpu+0x70/0x70
[ 122.542619] ret_from_fork+0x35/0x40
[ 122.543547] Code: 41 08 81 e3 ff ff 0f 00 25 ff ff 0f 00 48 39 d8 48 89 5d d0 0f 85 b6 03 00 00 48 8b 42 08 49 2b 41 10 48 85 c0 0f 89 7c 03 00 00 <49> 8b 45 20 25 ff ff 0f 00 48 39 45 d0 0f 85 c0 03 00 00 48 8b
[ 122.546019] RIP: __bch_bset_search+0x80/0x470 [bcache] RSP: ffffb49141a83c10
[ 122.546836] CR2: 0000000400656389
[ 122.547567] ---[ end trace c539406c76c0edd2 ]---

Joseph Salisbury (jsalisbury) wrote :

To narrow it down further, can you test the following two kernels:

4.14 Final: http://kernel.ubuntu.com/~kernel-ppa/mainline/v4.14/
4.15-rc1: http://kernel.ubuntu.com/~kernel-ppa/mainline/v4.15-rc1/

Ryan Harper (raharper) wrote :

4.14 Final: OK
4.15-rc1: OK

I started a further bisect:

4.15-rc9: FAIL
4.15-rc5: OK
4.15-rc7: FAIL
4.15-rc6: OK

So, it appears something between rc6 (last known good) and rc7 (FAIL)

Changed in linux (Ubuntu Bionic):
status: New → In Progress
Changed in linux (Ubuntu Cosmic):
status: Confirmed → In Progress
Changed in linux (Ubuntu Bionic):
importance: Undecided → High
assignee: nobody → Joseph Salisbury (jsalisbury)
Changed in linux (Ubuntu Cosmic):
assignee: nobody → Joseph Salisbury (jsalisbury)
Joseph Salisbury (jsalisbury) wrote :

I started a kernel bisect between v4.15-rc6 and v4.15-rc7. The kernel bisect will require testing of about 7-10 test kernels.

I built the first test kernel, up to the following commit:
d8887f1c7289848e74c92bd4322789a9cd7de699

The test kernel can be downloaded from:
http://kernel.ubuntu.com/~jsalisbury/lp1784665

Can you test that kernel and report back if it has the bug or not? I will build the next test kernel based on your test results.

Thanks in advance

Ryan Harper (raharper) wrote :

test kernel 4.15-rc6 up to d8887f1c7289848e74c92bd4322789a9cd7de699 FAILED

Attaching oops.

Ryan Harper (raharper) wrote :

test kernel 4.15-rc6 up to d8887f1c7289848e74c92bd4322789a9cd7de699 FAILED

Joseph Salisbury (jsalisbury) wrote :

I built the next test kernel, up to the following commit:
abb62c46d4949d44979fa647740feff3f7538799

The test kernel can be downloaded from:
http://kernel.ubuntu.com/~jsalisbury/lp1784665

Can you test that kernel and report back if it has the bug or not? I will build the next test kernel based on your test results.

Thanks in advance

Ryan Harper (raharper) wrote :

test kernel 4.15-rc1 up to abb62c46d4949d44979fa647740feff3f7538799 FAILED

Joseph Salisbury (jsalisbury) wrote :

I built the next test kernel, up to the following commit:
f17b9e764dfcd838dab51572d620a371c05a8e60

The test kernel can be downloaded from:
http://kernel.ubuntu.com/~jsalisbury/lp1784665

Can you test that kernel and report back if it has the bug or not? I will build the next test kernel based on your test results.

Thanks in advance

Ryan Harper (raharper) wrote :

FAILED: Kernel 4.15-rc1 up to f17b9e764dfcd838dab51572d620a371c05a8e60

Attached is the oops of the failure.

Joseph Salisbury (jsalisbury) wrote :

I built the next test kernel, up to the following commit:
f48f66a962a54c3d26d688c3df5441c9d1ba8730

The test kernel can be downloaded from:
http://kernel.ubuntu.com/~jsalisbury/lp1784665

Can you test that kernel and report back if it has the bug or not? I will build the next test kernel based on your test results.

Thanks in advance

Ryan Harper (raharper) wrote :

FAILED: Kernel 4.15-rc1 up to f48f66a962a54c3d26d688c3df5441c9d1ba8730

Attached is the oops of the failure.

Joseph Salisbury (jsalisbury) wrote :

I built the next test kernel, up to the following commit:
6b457409169b7686d293b408da0b6446ccb57a76

The test kernel can be downloaded from:
http://kernel.ubuntu.com/~jsalisbury/lp1784665

Can you test that kernel and report back if it has the bug or not? I will build the next test kernel based on your test results.

Thanks in advance

Ryan Harper (raharper) wrote :

PASS: Kernel 4.15-rc1 (bisect5) up to 6b457409169b7686d293b408da0b6446ccb57a76

I've 2600 seconds of uptime with over 800 loops in the test-case.

Joseph Salisbury (jsalisbury) wrote :

I built the next test kernel, up to the following commit:
87eba0716011e528f7841026f2cc65683219d0ad

The test kernel can be downloaded from:
http://kernel.ubuntu.com/~jsalisbury/lp1784665

Can you test that kernel and report back if it has the bug or not? I will build the next test kernel based on your test results.

Thanks in advance

Ryan Harper (raharper) wrote :

PASS: Kernel 4.15-rc1 (bisect6) up to 87eba0716011e528f7841026f2cc65683219d0ad

Joseph Salisbury (jsalisbury) wrote :

I built the next test kernel, up to the following commit:
bc631943faba6fc3f755748091ada31798fb7d50

The test kernel can be downloaded from:
http://kernel.ubuntu.com/~jsalisbury/lp1784665

Can you test that kernel and report back if it has the bug or not? I will build the next test kernel based on your test results.

Thanks in advance

Ryan Harper (raharper) wrote :

FAILED: Kernel 4.15-rc1 up to bc631943faba6fc3f755748091ada31798fb7d50

Attached is the oops of the failure.

Joseph Salisbury (jsalisbury) wrote :

The bisect reported the following commit as the first bad commit:

bc631943faba ("arm64: dts: rockchip: limit rk3328-rock64 gmac speed to 100MBit for now")

It's strange that this is an arm64 commit. I'll investigate a bit and see why the bisect came back with this result.

https://patchwork.kernel.org/patch/10094201/

diff --git a/arch/arm64/boot/dts/rockchip/rk3328-rock64.dts
b/arch/arm64/boot/dts/rockchip/rk3328-rock64.dts
index d4f80786e7c2..3890468678ce 100644
--- a/arch/arm64/boot/dts/rockchip/rk3328-rock64.dts
+++ b/arch/arm64/boot/dts/rockchip/rk3328-rock64.dts
@@ -132,6 +132,8 @@
  assigned-clocks = <&cru SCLK_MAC2IO>, <&cru SCLK_MAC2IO_EXT>;
  assigned-clock-parents = <&gmac_clkin>, <&gmac_clkin>;
  clock_in_out = "input";
+ /* shows instability at 1GBit right now */
+ max-speed = <100>;
  phy-supply = <&vcc_io>;
  phy-mode = "rgmii";
  pinctrl-names = "default";

That can't be it, which means maybe some of the PASSED didn't pass; I
just didn't wait long enough?
On Tue, Aug 28, 2018 at 2:11 PM Joseph Salisbury
<email address hidden> wrote:
>
> The bisect reported the following commit as the first bad commit:
>
> bc631943faba ("arm64: dts: rockchip: limit rk3328-rock64 gmac speed to
> 100MBit for now")
>
> It's strange that this is an arm64 commit. I'll investigate a bit and
> see why the bisect came back with this result.
>
> --
> You received this bug notification because you are subscribed to the bug
> report.
> https://bugs.launchpad.net/bugs/1784665
>
> Title:
> mkfs.ext4 over /dev/bcache0 hangs
>
> To manage notifications about this bug go to:
> https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1784665/+subscriptions

Ryan Harper (raharper) wrote :

It's always been a locking issue, so either misuse (or missing) locks
in the bcache attach/detach path, or generic locking changes to block
layer path I'd think.

I'll see if I can find any hits on those oops tracebacks too

On Tue, Aug 28, 2018 at 2:29 PM Ryan Harper <email address hidden> wrote:
>
> https://patchwork.kernel.org/patch/10094201/
>
> diff --git a/arch/arm64/boot/dts/rockchip/rk3328-rock64.dts
> b/arch/arm64/boot/dts/rockchip/rk3328-rock64.dts
> index d4f80786e7c2..3890468678ce 100644
> --- a/arch/arm64/boot/dts/rockchip/rk3328-rock64.dts
> +++ b/arch/arm64/boot/dts/rockchip/rk3328-rock64.dts
> @@ -132,6 +132,8 @@
> assigned-clocks = <&cru SCLK_MAC2IO>, <&cru SCLK_MAC2IO_EXT>;
> assigned-clock-parents = <&gmac_clkin>, <&gmac_clkin>;
> clock_in_out = "input";
> + /* shows instability at 1GBit right now */
> + max-speed = <100>;
> phy-supply = <&vcc_io>;
> phy-mode = "rgmii";
> pinctrl-names = "default";
>
>
> That can't be it, which means maybe some of the PASSED didn't pass; I
> just didn't wait long enough?
> On Tue, Aug 28, 2018 at 2:11 PM Joseph Salisbury
> <email address hidden> wrote:
> >
> > The bisect reported the following commit as the first bad commit:
> >
> > bc631943faba ("arm64: dts: rockchip: limit rk3328-rock64 gmac speed to
> > 100MBit for now")
> >
> > It's strange that this is an arm64 commit. I'll investigate a bit and
> > see why the bisect came back with this result.
> >
> > --
> > You received this bug notification because you are subscribed to the bug
> > report.
> > https://bugs.launchpad.net/bugs/1784665
> >
> > Title:
> > mkfs.ext4 over /dev/bcache0 hangs
> >
> > To manage notifications about this bug go to:
> > https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1784665/+subscriptions

Ryan Harper (raharper) wrote :

https://www.spinics.net/lists/linux-bcache/msg04869.html

https://www.spinics.net/lists/linux-bcache/msg05774.html
On Tue, Aug 28, 2018 at 2:30 PM Ryan Harper <email address hidden> wrote:
>
> It's always been a locking issue, so either misuse (or missing) locks
> in the bcache attach/detach path, or generic locking changes to block
> layer path I'd think.
>
> I'll see if I can find any hits on those oops tracebacks too
>
> On Tue, Aug 28, 2018 at 2:29 PM Ryan Harper <email address hidden> wrote:
> >
> > https://patchwork.kernel.org/patch/10094201/
> >
> > diff --git a/arch/arm64/boot/dts/rockchip/rk3328-rock64.dts
> > b/arch/arm64/boot/dts/rockchip/rk3328-rock64.dts
> > index d4f80786e7c2..3890468678ce 100644
> > --- a/arch/arm64/boot/dts/rockchip/rk3328-rock64.dts
> > +++ b/arch/arm64/boot/dts/rockchip/rk3328-rock64.dts
> > @@ -132,6 +132,8 @@
> > assigned-clocks = <&cru SCLK_MAC2IO>, <&cru SCLK_MAC2IO_EXT>;
> > assigned-clock-parents = <&gmac_clkin>, <&gmac_clkin>;
> > clock_in_out = "input";
> > + /* shows instability at 1GBit right now */
> > + max-speed = <100>;
> > phy-supply = <&vcc_io>;
> > phy-mode = "rgmii";
> > pinctrl-names = "default";
> >
> >
> > That can't be it, which means maybe some of the PASSED didn't pass; I
> > just didn't wait long enough?
> > On Tue, Aug 28, 2018 at 2:11 PM Joseph Salisbury
> > <email address hidden> wrote:
> > >
> > > The bisect reported the following commit as the first bad commit:
> > >
> > > bc631943faba ("arm64: dts: rockchip: limit rk3328-rock64 gmac speed to
> > > 100MBit for now")
> > >
> > > It's strange that this is an arm64 commit. I'll investigate a bit and
> > > see why the bisect came back with this result.
> > >
> > > --
> > > You received this bug notification because you are subscribed to the bug
> > > report.
> > > https://bugs.launchpad.net/bugs/1784665
> > >
> > > Title:
> > > mkfs.ext4 over /dev/bcache0 hangs
> > >
> > > To manage notifications about this bug go to:
> > > https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1784665/+subscriptions

Thanks for digging up this additional information. I'll investigate further. While I do that, the v4.19-rc1 kernel is now out. It might be worthwhile to give that one a go, and see if the fix was already committed to mainline:

http://kernel.ubuntu.com/~kernel-ppa/mainline/v4.19-rc1/

On Wed, Aug 29, 2018 at 12:01 PM Joseph Salisbury
<email address hidden> wrote:
>
> Thanks for digging up this additional information. I'll investigate
> further. While I do that, the v4.19-rc1 kernel is now out. It might be
> worthwhile to give that one a go, and see if the fix was already
> committed to mainline:
>
> http://kernel.ubuntu.com/~kernel-ppa/mainline/v4.19-rc1/

v4.19 also hangs like 4.16+ does, but does not oops like 4.15.

Thanks for testing. I was hoping a fix would be in 4.19, but I'll dig some more.

tags: added: kernel-da-key
removed: kernel-key
Joseph Salisbury (jsalisbury) wrote :

The 4.20 kernel is now available. Can you give v4.20-rc3 a test:

http://kernel.ubuntu.com/~kernel-ppa/mainline/v4.20-rc3

Ryan Harper (raharper) wrote :

4.20-rc3 also fails.

Successful runs: 95
root@ubuntu:~# [ 484.565298] INFO: task kworker/3:206:2091 blocked for more than 120 seconds.
[ 484.566976] Not tainted 4.20.0-042000rc3-generic #201811182231
[ 484.568069] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 484.569579] INFO: task bcache_allocato:16294 blocked for more than 120 seconds.
[ 484.571275] Not tainted 4.20.0-042000rc3-generic #201811182231
[ 484.572300] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.

Joseph Salisbury (jsalisbury) wrote :

I built a new Bionic test kernel. The test kernel can be downloaded from:
http://kernel.ubuntu.com/~jsalisbury/lp1784665

Can you test this kernel and see if it resolves this bug?

Note about installing test kernels:
• If the test kernel is prior to 4.15(Bionic) you need to install the linux-image and linux-image-extra .deb packages.
• If the test kernel is 4.15(Bionic) or newer, you need to install the linux-modules, linux-modules-extra and linux-image-unsigned .deb packages.

Thanks in advance!

Ryan Harper (raharper) wrote :

The 4.15.0-39 kernel FAILs:

+ make-bcache -C /dev/sdb
Can't open dev /dev/sdb: Device or resource busy
+ exit 1
Successful runs: 24
root@ubuntu:~# [ 242.655013] INFO: task kworker/3:103:2610 blocked for more than 120 seconds.
[ 242.665486] Not tainted 4.15.0-39-generic #43~lp1784665
[ 242.672393] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 242.682480] INFO: task bcache_allocato:5864 blocked for more than 120 seconds.
[ 242.698149] Not tainted 4.15.0-39-generic #43~lp1784665
[ 242.705883] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.

Joseph Salisbury (jsalisbury) wrote :

Can you confirm that v4.15-rc6 does not exhibit the bug? That will ensure I'm reviewing the correct commits:

http://kernel.ubuntu.com/~kernel-ppa/mainline/v4.15-rc6/

Download full text (3.3 KiB)

It took a while, but v4.15-rc6 fails; so we need to move back a bit.

[ 2541.139904] bcache: bcache_device_free() bcache0 stopped
[ 2541.141848] BUG: unable to handle kernel NULL pointer dereference
at 0000000000000d38
[ 2541.143813] IP: search_free+0ex01-b6435-e1/5f0x40 [bcache]
[ 2541.145676] PGD 0 P4D 0
[ 2541.146493] Oops: 0000 [#58] SMP
[ 2541.147509] Modules linked in: bcache nls_utf8 isofs nls_iso8859_1
ppdev kvm_intel kvm irqbypass joydev input_leds parport_pc parport
mac_hid serio_raw sch_fq_codel ib_iser rdma_cm iw_cm ib_cm ib_core
iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi virtio_rng
ip_tables x_tables autofs4 btrfs zstd_compress raid10 raid456
async_raid6_recov async_memcpy async_pq async_xor async_tx xor
raid6_pq libcrc32c raid1 raid0 multipath linear psmouse virtio_net
virtio_scsi virtio_blk i2c_piix4 pata_acpi floppy
[ 2541.154248] CPU: 3 PID: 13332 Comm: kworker/3:200 Tainted: G D
         4.15.0-041500rc6-generic #201712312330
[ 2541.156921] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996),
BIOS Ubuntu-1.8.2-1ubuntu1 04/01/2014
[ 2541.159381] Workqueue: bcache bch_data_insert_keys [bcache]
[ 2541.160566] RIP: 0010:search_free+0x31/0x40 [bcache]
[ 2541.161300] RSP: 0018:ffffb04e45b6bdf8 EFLAGS: 00010282
[ 2541.162098] RAX: 0000000000000000 RBX: ffff942c22b3b390 RCX: 0000000180100006
[ 2541.163791] RDX: 0000000180100007 RSI: ffffded4c086a7c0 RDI: ffff942c22b3b390
[ 2541.165486] RBP: ffffb04e45b6be00 R08: ffff942c21a9fc00 R09: 0000000180100006
[ 2541.167191] R10: ffffb04e45b6bd70 R11: ffff942c3ffb4000 R12: ffff942c39610010
[ 2541.168897] R13: ffff942c22b3b518 R14: ffff942c3fda3080 R15: ffff942c26b526c0
[ 2541.170637] FS: 0000000000000000(0000) GS:ffff942c3fd80000(0000)
knlGS:0000000000000000
[ 2541.172574] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 2541.174043] CR2: 0000000000000d38 CR3: 0000000011409000 CR4: 00000000000006e0
[ 2541.175847] Call Trace:
[ 2541.176349] cached_dev_cache_miss_done+0x38/0x80 [bcache]
[ 2541.177131] closure_put+0x8b/0xc0 [bcache]
[ 2541.177791] closure_sub+0x6f/0xc0 [bcache]
[ 2541.178924] bch_data_insert_keys+0xd8/0x130 [bcache]
[ 2541.180107] process_one_work+0x1ea/0x410
[ 2541.181176] worker_thread+0x32/0x410
[ 2541.182157] kthread+0x11e/0x140
[ 2541.183043] ? process_one_work+0x410/0x410
[ 2541.184154] ? kthread_create_worker_on_cpu+0x70/0x70
[ 2541.185460] ret_from_fork+0x1f/0x30
[ 2541.186397] Code: 48 89 e5 53 48 89 fb e8 0e ff ff ff 48 8b bb 68
01 00 00 48 85 ff 74 05 e8 cd 0b d0 f5 48 8b 83 e8 00 00 00 48 89 df
48 8b 40 70 <48> 8b b0 38 0d 00 00 e8 c3 18 a9 f5 5b 5d c3 0f 1f 44 00
00 55
[ 2541.189920] RIP: search_free+0x31/0x40 [bcache] RSP: ffffb04e45b6bdf8
[ 2541.191636] CR2: 0000000000000d38
[ 2541.192503] ---[ end trace e2b7818e7472ac25 ]---
On Mon, Nov 26, 2018 at 8:26 PM Joseph Salisbury
<email address hidden> wrote:

>
> Can you confirm that v4.15-rc6 does not exhibit the bug? That will
> ensure I'm reviewing the correct commits:
>
> http://kernel.ubuntu.com/~kernel-ppa/mainline/v4.15-rc6/
>
> --
> You received this bug notification because you are subscribed to the bug
> report.
> https://bugs.launchpad.net/bugs/1784665
>
> Title:
> ...

Read more...

Download full text (3.7 KiB)

I also tested v4.15-rc5, which failed with the same stack trace. Then I returned to 4.14 mainline, and got this one:

[ 953.659489] BUG: unable to handle kernel NULL pointer dereference at
0000000000000d40
[ 953.668707] IP: search_free+0x31/0x40 [bcache]
[ 953.674872] PGD 0 P4D 0
[ 953.677680] Oops: 0000 [#34] SMP
[ 953.678796] Modules linked in: nls_utf8 isofs nls_iso8859_1 ppdev kvm_intel
kvm irqbypass parport_pc joydev input_leds parport serio_raw mac_hid
sch_fq_codel ib_iser rdma_cm iw_cm ib_cm ib_core iscsi_tcp libiscsi_tcp libiscsi
scsi_transport_iscsi virtio_rng ip_tables x_tables autofs4 btrfs zstd_compress
raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor
raid6_pq libcrc32c raid1 raid0 multipath linear bcache psmouse virtio_scsi
virtio_blk virtio_net i2c_piix4 floppy pata_acpi
[ 953.700549] CPU: 1 PID: 3132 Comm: kworker/1:229 Tainted: G D
4.14.0-041400-generic #201711122031
[ 953.704522] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS
Ubuntu-1.8.2-1ubuntu1 04/01/2014
[ 953.707128] Workqueue: bcache bch_data_insert_keys [bcache]
[ 953.708986] task: ffff985fe6050000 task.stack: ffffc01402c4c000
[ 953.711147] RIP: 0010:search_free+0x31/0x40 [bcache]
[ 953.712997] RSP: 0018:ffffc01402c4fdf8 EFLAGS: 00010282
[ 953.714287] RAX: 0000000000000000 RBX: ffff985ffb385e78 RCX: ffffe83580d58e00
[ 953.716883] RDX: 0000000000000007 RSI: 000000000001f540 RDI: ffff985ffb385e78
[ 953.718731] RBP: ffffc01402c4fe00 R08: ffff985ff8121800 R09: 000000018010000f
[ 953.720269] R10: ffffc01402c4fd70 R11: 0000000000000001 R12: ffff985fe3670010
[ 953.721583] R13: ffff985ffb386000 R14: ffff985fffc9bb40 R15: ffff985ff83eab40
[ 953.723315] FS: 0000000000000000(0000) GS:ffff985fffc80000(0000)
knlGS:0000000000000000
[ 953.725738] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 953.727401] CR2: 0000000000000d40 CR3: 0000000028c09000 CR4: 00000000000006e0
[ 953.729737] Call Trace:
[ 953.730687] cached_dev_cache_miss_done+0x38/0x80 [bcache]
[ 953.732939] closure_put+0x8b/0xc0 [bcache]
[ 953.734647] closure_sub+0x6f/0xc0 [bcache]
[ 953.736104] bch_data_insert_keys+0xd8/0x130 [bcache]
[ 953.737311] process_one_work+0x1e7/0x410
[ 953.738233] worker_thread+0x32/0x410
[ 953.739246] kthread+0x125/0x140
[ 953.740231] ? process_one_work+0x410/0x410
[ 953.741407] ? kthread_create_on_node+0x70/0x70
[ 953.742631] ...

Read more...

Ryan Harper (raharper) wrote :

v4.13 shows the same search_free stack as 4.14, but doesn't appear to be fatal (it doesn't result in a hung process nor a block device that cannot be wiped.

I suspect we've got a new upstream issue since 4.20 didn't fix things. Re-running 4.20 to see if we had a stack trace, we don't NULL pointer oops, but there is a stuck worker queue:

[ 242.887966] kworker/0:47 D 0 2165 2 0x80000000
[ 242.888014] Workqueue: events cache_set_flush [bcache]
[ 242.888027] Call Trace:
[ 242.888084] __schedule+0x2a2/0x880
[ 242.888097] ? __switch_to+0xfc/0x4f0
[ 242.888105] schedule+0x2c/0x80
[ 242.888113] schedule_timeout+0x246/0x350
[ 242.888122] wait_for_completion+0xb7/0x140
[ 242.888141] ? wake_up_q+0x80/0x80
[ 242.888151] kthread_stop+0x49/0x110
[ 242.888176] cache_set_flush+0xdb/0x190 [bcache]
[ 242.888186] process_one_work+0x20f/0x410
[ 242.888194] worker_thread+0x34/0x400
[ 242.888202] kthread+0x120/0x140
[ 242.888209] ? process_one_work+0x410/0x410
[ 242.888216] ? kthread_park+0x90/0x90
[ 242.888236] ret_from_fork+0x35/0x40
[ 242.888435] INFO: task bcache_allocato:9015 blocked for more than 120 seconds.
[ 242.893786] Not tainted 4.20.0-042000rc3-generic #201811182231
[ 242.896669] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 242.900428] bcache_allocato D 0 9015 2 0x80000000
[ 242.900434] Call Trace:
[ 242.900448] __schedule+0x2a2/0x880
[ 242.900455] ? __schedule+0x2aa/0x880
[ 242.900462] schedule+0x2c/0x80
[ 242.900480] bch_bucket_alloc+0x19d/0x380 [bcache]
[ 242.900503] ? wait_woken+0x80/0x80
[ 242.900519] bch_prio_write+0x190/0x340 [bcache]
[ 242.900530] bch_allocator_thread+0x482/0xd10 [bcache]
[ 242.900535] kthread+0x120/0x140
[ 242.900546] ? bch_invalidate_one_bucket+0x80/0x80 [bcache]
[ 242.900549] ? kthread_park+0x90/0x90
[ 242.900554] ret_from_fork+0x35/0x40

Changed in linux (Ubuntu):
status: In Progress → Confirmed
Changed in linux (Ubuntu Bionic):
status: In Progress → Confirmed
Changed in linux (Ubuntu Cosmic):
status: In Progress → Confirmed
assignee: Joseph Salisbury (jsalisbury) → nobody
Changed in linux (Ubuntu Bionic):
assignee: Joseph Salisbury (jsalisbury) → nobody
Changed in linux (Ubuntu):
assignee: Joseph Salisbury (jsalisbury) → nobody
Andrea Righi (arighi) wrote :

Possibly a duplicate of bug 1796292, since we've got the same hung task timeout trace in bch_bucket_alloc() with the latest test kernel (4.15.0-55-generic #60+lp796292).

Andrea Righi (arighi) on 2019-07-10
Changed in linux (Ubuntu Bionic):
assignee: nobody → Andrea Righi (arighi)
Changed in linux (Ubuntu Cosmic):
assignee: nobody → Andrea Righi (arighi)
Changed in linux (Ubuntu):
assignee: nobody → Andrea Righi (arighi)
Brad Figg (brad-figg) on 2019-07-24
tags: added: cscc
Andrea Righi (arighi) on 2019-08-05
description: updated
summary: - mkfs.ext4 over /dev/bcache0 hangs
+ bcache: bch_allocator_thread(): hung task timeout
Andrea Righi (arighi) wrote :

SRU pull request sent to the kernel-team mailing list:
https://lists.ubuntu.com/archives/kernel-team/2019-August/102784.html

Andrea Righi (arighi) on 2019-08-06
Changed in linux (Ubuntu Disco):
assignee: nobody → Andrea Righi (arighi)
Andrea Righi (arighi) on 2019-08-07
no longer affects: linux (Ubuntu Cosmic)
Andrea Righi (arighi) on 2019-08-07
Changed in linux (Ubuntu Disco):
status: New → Confirmed
importance: Undecided → High
Andrea Righi (arighi) on 2019-08-07
Changed in linux (Ubuntu Xenial):
status: New → Confirmed
importance: Undecided → High
assignee: nobody → Andrea Righi (arighi)
Seth Forshee (sforshee) on 2019-08-09
Changed in linux (Ubuntu Eoan):
status: Confirmed → Fix Committed
Changed in linux (Ubuntu Bionic):
status: Confirmed → Fix Committed
Changed in linux (Ubuntu Disco):
status: Confirmed → Fix Committed
Changed in linux (Ubuntu Xenial):
status: Confirmed → Fix Committed

This bug is awaiting verification that the kernel in -proposed solves the problem. Please test the kernel and update this bug with the results. If the problem is solved, change the tag 'verification-needed-disco' to 'verification-done-disco'. If the problem still exists, change the tag 'verification-needed-disco' to 'verification-failed-disco'.

If verification is not done by 5 working days from today, this fix will be dropped from the source code, and this bug will be closed.

See https://wiki.ubuntu.com/Testing/EnableProposed for documentation how to enable and use -proposed. Thank you!

tags: added: verification-needed-disco

This bug is awaiting verification that the kernel in -proposed solves the problem. Please test the kernel and update this bug with the results. If the problem is solved, change the tag 'verification-needed-bionic' to 'verification-done-bionic'. If the problem still exists, change the tag 'verification-needed-bionic' to 'verification-failed-bionic'.

If verification is not done by 5 working days from today, this fix will be dropped from the source code, and this bug will be closed.

See https://wiki.ubuntu.com/Testing/EnableProposed for documentation how to enable and use -proposed. Thank you!

tags: added: verification-needed-bionic

This bug is awaiting verification that the kernel in -proposed solves the problem. Please test the kernel and update this bug with the results. If the problem is solved, change the tag 'verification-needed-xenial' to 'verification-done-xenial'. If the problem still exists, change the tag 'verification-needed-xenial' to 'verification-failed-xenial'.

If verification is not done by 5 working days from today, this fix will be dropped from the source code, and this bug will be closed.

See https://wiki.ubuntu.com/Testing/EnableProposed for documentation how to enable and use -proposed. Thank you!

tags: added: verification-needed-xenial
Ryan Harper (raharper) wrote :

I've verified that bionic-proposed linux-virtual kernel succeeds our test-case (curtin-nvme). I've 50 installs with no issue.

ubuntu@ubuntu:~$ apt-cache policy linux-virtual
linux-virtual:
  Installed: (none)
  Candidate: 4.15.0.59.61
  Version table:
     4.15.0.59.61 500
        500 http://archive.ubuntu.com/ubuntu bionic-proposed/main amd64 Packages
     4.15.0.58.60 500
        500 http://archive.ubuntu.com/ubuntu bionic-updates/main amd64 Packages
        500 http://security.ubuntu.com/ubuntu bionic-security/main amd64 Packages
     4.15.0.20.23 500
        500 http://archive.ubuntu.com/ubuntu bionic/main amd64 Packages
ubuntu@ubuntu:~$ lsb_release -rd
Description: Ubuntu 18.04.2 LTS
Release: 18.04
ubuntu@ubuntu:~$ uname -r
4.15.0-59-generic

tags: added: verification-done-bionic
removed: verification-needed-bionic
John George (jog) wrote :

I've also verified Solutions QA tests succeed with the proposed Bionic kernel.

Ryan Harper (raharper) wrote :

I've verified that disco-proposed linux-virtual kernel succeeds our test-case (curtin-nvme). I've 50 installs with no issue.

root@ubuntu:~# apt-cache policy linux-virtual
linux-virtual:
  Installed: 5.0.0.26.27
  Candidate: 5.0.0.26.27
  Version table:
 *** 5.0.0.26.27 500
        500 http://archive.ubuntu.com/ubuntu disco-proposed/main amd64 Packages
        100 /var/lib/dpkg/status
     5.0.0.25.26 500
        500 http://archive.ubuntu.com/ubuntu disco-updates/main amd64 Packages
        500 http://security.ubuntu.com/ubuntu disco-security/main amd64 Packages
     5.0.0.13.14 500
        500 http://archive.ubuntu.com/ubuntu disco/main amd64 Packages
root@ubuntu:~# uname -r
5.0.0-26-generic
root@ubuntu:~# lsb_release -rd
Description: Ubuntu 19.04
Release: 19.04

tags: added: verification-done-disco
removed: verification-needed-disco
Jason Hobbs (jason-hobbs) wrote :

We're still seeing a bcache timeout failure during curtin install 2019-08-22T10:16:40+00:00 spinda

cloud-init[1604]: finish: cmd-install/stage-partitioning/builtin/cmd-block-meta/clear-holders: FAIL: removing previous storage devices
2019-08-22T10:16:40+00:00 spinda cloud-init[1604]: TIMED BLOCK_META: 1203.679

I attached the rsyslog from a unit that failed.

Linux version 4.15.0-59-generic (buildd@lgw01-amd64-035) (gcc version 7.4.0 (Ubuntu 7.4.0-1ubuntu1~18.04.1)) #66-Ubuntu SMP Wed Aug 14 10:56:44 UTC 2019 (Ubuntu 4.15.0-59.66-generic 4.15.18)

tags: added: verification-failed-bionic
removed: verification-done-bionic
Jason Hobbs (jason-hobbs) wrote :
Changed in linux (Ubuntu Bionic):
status: Fix Committed → New
Ryan Harper (raharper) wrote :

I've adjusted my bionic testing with the simpler configuration. I cannot reproduce the failure so far. I'll leave this running over night.

I suspect there's something else going on baremetal that we can't reproduce in a VM.

Ryan Harper (raharper) wrote :

Also, I had some confusion earlier about what kernel I was testing. apt-cache policy shows the package version 4.15.0.59.61, however that's the meta package, the actual kernel is the .66 one.

# dpkg --list | grep linux-image
ii linux-image-4.15.0-59-generic 4.15.0-59.66 amd64 Signed kernel image generic
ii linux-image-virtual 4.15.0.59.61 amd64 Virtual Linux kernel image

# cat /proc/version
Linux version 4.15.0-59-generic (buildd@lgw01-amd64-035) (gcc version 7.4.0 (Ubuntu 7.4.0-1ubuntu1~18.04.1)) #66-Ubuntu SMP Wed Aug 14 10:56:44 UTC 2019

Ryan Harper (raharper) wrote :

Finally, I did verify xenial proposed with our original test. I had over 100 installs with no issue.

@Jason

Have you had any runs on Xenial or Disco? (or do you not test those)?

Download full text (10.5 KiB)

@ Ryan we do not test Xenial or Disco

On Thu, Aug 22, 2019 at 7:41 PM Ryan Harper <email address hidden>
wrote:

> Finally, I did verify xenial proposed with our original test. I had
> over 100 installs with no issue.
>
> @Jason
>
> Have you had any runs on Xenial or Disco? (or do you not test those)?
>
> --
> You received this bug notification because you are a member of Canonical
> Field Critical, which is subscribed to a duplicate bug report (1796292).
> https://bugs.launchpad.net/bugs/1784665
>
> Title:
> bcache: bch_allocator_thread(): hung task timeout
>
> Status in linux package in Ubuntu:
> Fix Committed
> Status in linux source package in Xenial:
> Fix Committed
> Status in linux source package in Bionic:
> New
> Status in linux source package in Disco:
> Fix Committed
> Status in linux source package in Eoan:
> Fix Committed
>
> Bug description:
> [Impact]
>
> bcache_allocator() can call the following:
>
> bch_allocator_thread()
> -> bch_prio_write()
> -> bch_bucket_alloc()
> -> wait on &ca->set->bucket_wait
>
> But the wake up event on bucket_wait is supposed to come from
> bch_allocator_thread() itself causing a deadlock.
>
> [Test Case]
>
> This is a simple script that can easily trigger the deadlock condition:
> https://launchpadlibrarian.net/381282009/bcache-basic-repro.sh
>
> A better test case has been also provided in bug 1796292 (duplicate of
> this bug):
>
> https://bugs.launchpad.net/curtin/+bug/1796292/+attachment/5280353/+files/curtin-nvme.sh
>
> [Fix]
>
> Fix by making the call to bch_prio_write() non-blocking, so that
> bch_allocator_thread() never waits on itself. Moreover, make sure to
> wake up the garbage collector thread when bch_prio_write() is failing
> to allocate buckets to increase the chance of freeing up more buckets.
>
> In addition to that it would be safe to also import other upstream
> bcache fixes (all clean cherry picks):
>
> 7e865eba00a3df2dc8c4746173a8ca1c1c7f042e bcache: fix potential deadlock
> in cached_def_free()
> 80265d8dfd77792e133793cef44a21323aac2908 bcache: acquire
> bch_register_lock later in cached_dev_free()
> ce4c3e19e5201424357a0c82176633b32a98d2ec bcache: Replace
> bch_read_string_list() by __sysfs_match_string()
> ecb37ce9baac653cc09e2b631393dde3df82979f bcache: Move couple of
> functions to sysfs.c
> 04cbc21137bfa4d7b8771a5b14f3d6c9b2aee671 bcache: Move couple of string
> arrays to sysfs.c
> 5f2b18ec8e1643410a2369f06888951cdedea0bf bcache: Fix a compiler warning
> in bcache_device_init()
> 20d3a518713e394efa5a899c84574b4b79ec5098 bcache: Reduce the number of
> sparse complaints about lock imbalances
> 42361469ae84c851e40cb1f94c8c9a14cdd94039 bcache: Suppress more warnings
> about set-but-not-used variables
> f0d3814090ac77de94c42b7124c37ece23629197 bcache: Remove an unused
> variable
> 47344e330eabc1515cbe6061eb337100a3ab6d37 bcache: Fix kernel-doc warnings
> 9dfbdec7b7fea1ff1b7b5d5d12980dbc7dca46c7 bcache: Annotate switch
> fall-through
> 4a4e443835a43a79113cc237c472c0d268eb1e1c bcache: Add __printf annotation
> to __bch_check_keys()
> fd01991d5c20098c5c1ffc4dca6c821cc60a2f74 bcache...

Ryan Harper (raharper) wrote :

Overnight testing of the revised deployment configuration has no errors, 200 runs completed.

Changed in linux (Ubuntu Bionic):
status: New → Fix Committed
Launchpad Janitor (janitor) wrote :
Download full text (27.8 KiB)

This bug was fixed in the package linux - 5.2.0-13.14

---------------
linux (5.2.0-13.14) eoan; urgency=medium

  * eoan/linux: 5.2.0-13.14 -proposed tracker (LP: #1840261)

  * NULL pointer dereference when Inserting the VIMC module (LP: #1840028)
    - media: vimc: fix component match compare

  * Miscellaneous upstream changes
    - selftests/bpf: remove bpf_util.h from BPF C progs

linux (5.2.0-12.13) eoan; urgency=medium

  * eoan/linux: 5.2.0-12.13 -proposed tracker (LP: #1840184)

  * Eoan update: v5.2.8 upstream stable release (LP: #1840178)
    - scsi: fcoe: Embed fc_rport_priv in fcoe_rport structure
    - libnvdimm/bus: Prepare the nd_ioctl() path to be re-entrant
    - libnvdimm/bus: Fix wait_nvdimm_bus_probe_idle() ABBA deadlock
    - ALSA: usb-audio: Sanity checks for each pipe and EP types
    - ALSA: usb-audio: Fix gpf in snd_usb_pipe_sanity_check
    - HID: wacom: fix bit shift for Cintiq Companion 2
    - HID: Add quirk for HP X1200 PIXART OEM mouse
    - atm: iphase: Fix Spectre v1 vulnerability
    - bnx2x: Disable multi-cos feature.
    - drivers/net/ethernet/marvell/mvmdio.c: Fix non OF case
    - ife: error out when nla attributes are empty
    - ip6_gre: reload ipv6h in prepare_ip6gre_xmit_ipv6
    - ip6_tunnel: fix possible use-after-free on xmit
    - ipip: validate header length in ipip_tunnel_xmit
    - mlxsw: spectrum: Fix error path in mlxsw_sp_module_init()
    - mvpp2: fix panic on module removal
    - mvpp2: refactor MTU change code
    - net: bridge: delete local fdb on device init failure
    - net: bridge: mcast: don't delete permanent entries when fast leave is
      enabled
    - net: bridge: move default pvid init/deinit to NETDEV_REGISTER/UNREGISTER
    - net: fix ifindex collision during namespace removal
    - net/mlx5e: always initialize frag->last_in_page
    - net/mlx5: Use reversed order when unregister devices
    - net: phy: fixed_phy: print gpio error only if gpio node is present
    - net: phylink: don't start and stop SGMII PHYs in SFP modules twice
    - net: phylink: Fix flow control for fixed-link
    - net: phy: mscc: initialize stats array
    - net: qualcomm: rmnet: Fix incorrect UL checksum offload logic
    - net: sched: Fix a possible null-pointer dereference in dequeue_func()
    - net sched: update vlan action for batched events operations
    - net: sched: use temporary variable for actions indexes
    - net/smc: do not schedule tx_work in SMC_CLOSED state
    - net: stmmac: Use netif_tx_napi_add() for TX polling function
    - NFC: nfcmrvl: fix gpio-handling regression
    - ocelot: Cancel delayed work before wq destruction
    - tipc: compat: allow tipc commands without arguments
    - tipc: fix unitilized skb list crash
    - tun: mark small packets as owned by the tap sock
    - net/mlx5: Fix modify_cq_in alignment
    - net/mlx5e: Prevent encap flow counter update async to user query
    - r8169: don't use MSI before RTL8168d
    - bpf: fix XDP vlan selftests test_xdp_vlan.sh
    - selftests/bpf: add wrapper scripts for test_xdp_vlan.sh
    - selftests/bpf: reduce time to execute test_xdp_vlan.sh
    - net: fix bpf_xdp_adjust_head regression for generic-XDP
    - hv_sock: Fi...

Changed in linux (Ubuntu Eoan):
status: Fix Committed → Fix Released

Marking verification done for Xenial based on comment #57.

tags: added: verification-done-xenial
removed: verification-needed-xenial
Andrea Righi (arighi) wrote :

@jason-hobbs, I don't see any kernel error in the log attached to comment #53, so it seems to be a different issue, maybe a performance regression that is causing curtin to timeout. I would keep this bug closed / fixed and open a new one.

If you can easily reproduce it, could you open a new bug and collect some info on the box after the problem happens? (in particular uname -a, dmesg, ps axuw). Thanks!

Launchpad Janitor (janitor) wrote :
Download full text (34.4 KiB)

This bug was fixed in the package linux - 5.0.0-27.28

---------------
linux (5.0.0-27.28) disco; urgency=medium

  * disco/linux: 5.0.0-27.28 -proposed tracker (LP: #1840816)

  * [Potential Regression] System crashes when running ftrace test in
    ubuntu_kernel_selftests (LP: #1840750)
    - x86/kprobes: Set instruction page as executable

linux (5.0.0-26.27) disco; urgency=medium

  * disco/linux: 5.0.0-26.27 -proposed tracker (LP: #1839972)

  * Packaging resync (LP: #1786013)
    - [Packaging] update helper scripts

  * alsa/hdmi: add icelake hdmi audio support for a Dell machine (LP: #1836916)
    - ALSA: hda: hdmi - add Icelake support
    - ALSA: hda/hdmi - Remove duplicated define
    - ALSA: hda/hdmi - Fix i915 reverse port/pin mapping

  * input/mouse: alps trackpoint-only device doesn't work (LP: #1836752)
    - Input: alps - don't handle ALPS cs19 trackpoint-only device
    - Input: alps - fix a mismatch between a condition check and its comment

  * [18.04 FEAT] Enhanced hardware support (LP: #1836857)
    - s390: report new CPU capabilities
    - s390: add alignment hints to vector load and store

  * System does not auto detect disconnection of external monitor (LP: #1835001)
    - drm/i915: Add support for retrying hotplug
    - drm/i915: Enable hotplug retry

  * [18.04 FEAT] Enhanced CPU-MF hardware counters - kernel part (LP: #1836860)
    - s390/cpum_cf: Add support for CPU-MF SVN 6
    - s390/cpumf: Add extended counter set definitions for model 8561 and 8562

  * EeePC 1005px laptop backlight is off after system boot up (LP: #1837117)
    - platform/x86: asus-wmi: Only Tell EC the OS will handle display hotkeys from
      asus_nb_wmi

  * br_netfilter: namespace sysctl operations (LP: #1836910)
    - netfilter: bridge: port sysctls to use brnf_net
    - netfilter: bridge: namespace bridge netfilter sysctls
    - netfilter: bridge: prevent UAF in brnf_exit_net()

  * ideapad_laptop disables WiFi/BT radios on Lenovo Y540 (LP: #1837136)
    - platform/x86: ideapad-laptop: Remove no_hw_rfkill_list

  * shiftfs: allow overlayfs (LP: #1838677)
    - SAUCE: shiftfs: enable overlayfs on shiftfs

  * bcache: bch_allocator_thread(): hung task timeout (LP: #1784665)
    - bcache: never writeback a discard operation
    - bcache: improve bcache_reboot()
    - SAUCE: bcache: fix deadlock in bcache_allocator

  * Regressions in CMA allocation rework (LP: #1839395)
    - dma-contiguous: do not overwrite align in dma_alloc_contiguous()
    - dma-contiguous: page-align the size in dma_free_contiguous()

  * CVE-2019-3900
    - vhost: introduce vhost_exceeds_weight()
    - vhost_net: fix possible infinite loop
    - vhost: vsock: add weight support
    - vhost: scsi: add weight support

  * Disco update: 5.0.21 upstream stable release (LP: #1837518)
    - bonding/802.3ad: fix slave link initialization transition states
    - cxgb4: offload VLAN flows regardless of VLAN ethtype
    - inet: switch IP ID generator to siphash
    - ipv4/igmp: fix another memory leak in igmpv3_del_delrec()
    - ipv4/igmp: fix build error if !CONFIG_IP_MULTICAST
    - ipv6: Consider sk_bound_dev_if when binding a raw socket to an address
    - ipv6: Fix redi...

Changed in linux (Ubuntu Disco):
status: Fix Committed → Fix Released
Launchpad Janitor (janitor) wrote :
Download full text (235.3 KiB)

This bug was fixed in the package linux - 4.15.0-60.67

---------------
linux (4.15.0-60.67) bionic; urgency=medium

  * bionic/linux: 4.15.0-60.67 -proposed tracker (LP: #1841086)

  * [Regression] net test from ubuntu_kernel_selftests failed due to bpf test
    compilation issue (LP: #1840935)
    - SAUCE: Fix "bpf: relax verifier restriction on BPF_MOV | BPF_ALU"

  * [Regression] failed to compile seccomp test from ubuntu_kernel_selftests
    (LP: #1840932)
    - Revert "selftests: skip seccomp get_metadata test if not real root"

  * Packaging resync (LP: #1786013)
    - [Packaging] resync getabis

linux (4.15.0-59.66) bionic; urgency=medium

  * bionic/linux: 4.15.0-59.66 -proposed tracker (LP: #1840006)

  * zfs not completely removed from bionic tree (LP: #1840051)
    - SAUCE: (noup) remove completely the zfs code

  * Packaging resync (LP: #1786013)
    - [Packaging] update helper scripts

  * [18.04 FEAT] Enhanced hardware support (LP: #1836857)
    - s390: report new CPU capabilities
    - s390: add alignment hints to vector load and store

  * [18.04 FEAT] Enhanced CPU-MF hardware counters - kernel part (LP: #1836860)
    - s390/cpum_cf: Add support for CPU-MF SVN 6
    - s390/cpumf: Add extended counter set definitions for model 8561 and 8562

  * ideapad_laptop disables WiFi/BT radios on Lenovo Y540 (LP: #1837136)
    - platform/x86: ideapad-laptop: Remove no_hw_rfkill_list

  * Stacked onexec transitions fail when under NO NEW PRIVS restrictions
    (LP: #1839037)
    - SAUCE: apparmor: fix nnp subset check failure when, stacking

  * bcache: bch_allocator_thread(): hung task timeout (LP: #1784665) // Tight
    timeout for bcache removal causes spurious failures (LP: #1796292)
    - SAUCE: bcache: fix deadlock in bcache_allocator

  * bcache: bch_allocator_thread(): hung task timeout (LP: #1784665)
    - bcache: never writeback a discard operation
    - bcache: improve bcache_reboot()
    - bcache: fix writeback target calc on large devices
    - bcache: add journal statistic
    - bcache: fix high CPU occupancy during journal
    - bcache: use pr_info() to inform duplicated CACHE_SET_IO_DISABLE set
    - bcache: fix incorrect sysfs output value of strip size
    - bcache: fix error return value in memory shrink
    - bcache: fix using of loop variable in memory shrink
    - bcache: Fix indentation
    - bcache: Add __printf annotation to __bch_check_keys()
    - bcache: Annotate switch fall-through
    - bcache: Fix kernel-doc warnings
    - bcache: Remove an unused variable
    - bcache: Suppress more warnings about set-but-not-used variables
    - bcache: Reduce the number of sparse complaints about lock imbalances
    - bcache: Fix a compiler warning in bcache_device_init()
    - bcache: Move couple of string arrays to sysfs.c
    - bcache: Move couple of functions to sysfs.c
    - bcache: Replace bch_read_string_list() by __sysfs_match_string()

  * linux hwe i386 kernel 5.0.0-21.22~18.04.1 crashes on Lenovo x220
    (LP: #1838115)
    - x86/mm: Check for pfn instead of page in vmalloc_sync_one()
    - x86/mm: Sync also unmappings in vmalloc_sync_all()
    - mm/vmalloc.c: add priority threshold to __purge_vmap_area_lazy()...

Changed in linux (Ubuntu Bionic):
status: Fix Committed → Fix Released
Launchpad Janitor (janitor) wrote :
Download full text (5.5 KiB)

This bug was fixed in the package linux - 4.4.0-161.189

---------------
linux (4.4.0-161.189) xenial; urgency=medium

  * xenial/linux: 4.4.0-161.189 -proposed tracker (LP: #1841544)

  * flock not mediated by 'k' (LP: 1658219)
    - Revert "UBUNTU: SAUCE: apparmor: flock mediation is not being, enforced on
      cache check"

  * Packaging resync (LP: #1786013)
    - [Packaging] resync getabis

linux (4.4.0-160.188) xenial; urgency=medium

  * xenial/linux: 4.4.0-160.188 -proposed tracker (LP: #1840021)

  * Packaging resync (LP: #1786013)
    - [Packaging] update helper scripts

  * EeePC 1005px laptop backlight is off after system boot up (LP: #1837117)
    - platform/x86: asus-wmi: Only Tell EC the OS will handle display hotkeys from
      asus_nb_wmi

  * CVE-2019-10638
    - [Config] CONFIG_TEST_HASH=n
    - siphash: add cryptographically secure PRF
    - inet: switch IP ID generator to siphash

  * Stacked onexec transitions fail when under NO NEW PRIVS restrictions
    (LP: #1839037)
    - SAUCE: apparmor: fix nnp subset check failure, when stacking

  * AppArmor onexec transition causes WARN kernel stack trace (LP: #1838627)
    - SAUCE: apparmor: fix audit failures when performing profile transitions

  * flock not mediated by 'k' (LP: 1658219) // Ubuntu 16.04: read access
    incorrectly implies 'm' rule (LP: 1838090)
    - SAUCE: apparmor: flock mediation is not being, enforced on cache check

  * bcache: bch_allocator_thread(): hung task timeout (LP: #1784665) // Tight
    timeout for bcache removal causes spurious failures (LP: #1796292)
    - SAUCE: bcache: fix deadlock in bcache_allocator

  * bcache: bch_allocator_thread(): hung task timeout (LP: #1784665)
    - bcache: improve bcache_reboot()
    - bcache: add journal statistic
    - bcache: fix high CPU occupancy during journal
    - bcache: fix incorrect sysfs output value of strip size
    - bcache: fix error return value in memory shrink
    - bcache: fix using of loop variable in memory shrink
    - bcache: Fix indentation
    - bcache: Add __printf annotation to __bch_check_keys()
    - bcache: Annotate switch fall-through
    - bcache: Fix kernel-doc warnings
    - bcache: Remove an unused variable
    - bcache: Suppress more warnings about set-but-not-used variables
    - bcache: Reduce the number of sparse complaints about lock imbalances
    - bcache: Move couple of functions to sysfs.c

  * CVE-2019-3900
    - vhost: introduce vhost_vq_avail_empty()
    - vhost_net: tx batching
    - vhost_net: do not stall on zerocopy depletion
    - vhost-net: set packet weight of tx polling to 2 * vq size
    - vhost_net: use packet weight for rx handler, too
    - vhost_net: introduce vhost_exceeds_weight()
    - vhost: introduce vhost_exceeds_weight()
    - vhost_net: fix possible infinite loop
    - vhost: scsi: add weight support

  * Xenial: ZFS deadlock in shrinker path with xattrs (LP: #1839521)
    - SAUCE: (noup) Update zfs to 0.6.5.6-0ubuntu28

  * CVE-2019-13648
    - powerpc/tm: Fix oops on sigreturn on systems without TM

  * CVE-2018-20856
    - block: blk_init_allocated_queue() set q->fq as NULL in the fail case

  * CVE-2019-14283
    - floppy: fix out-of-bound...

Read more...

Changed in linux (Ubuntu Xenial):
status: Fix Committed → Fix Released
To post a comment you must log in.