Oops and hang when starting LVM snapshots on 5.4.0-47
Affects | Status | Importance | Assigned to | Milestone |
---|---|---|---|---|
linux (Ubuntu) | Fix Released | Undecided | Unassigned | |
Focal | Fix Released | Undecided | Unassigned | |
Bug Description
[Impact]
kmem caches fail to be created right after they have been removed but before they are completely torn down. This prevents some drivers (such as LVM snapshots) from working properly and causes kernel traces to appear in the logs.
[Test case]
See comment #9.
[Regression potential]
The fix reverts a commit, so we go back to the state of a previously released kernel, where a leak was possible. That regression, though, is preferable to the current impact, which also leads to a different leak and prevents users from using LVM snapshots correctly.
=======
One of my bionic servers with HWE 5.4.0 hangs on boot (apparently while starting LVM snapshots) after upgrading from Linux 5.4.0-42 to 5.4.0-47, with the following trace:
[ 29.126292] kobject_
[ 29.138854] BUG: kernel NULL pointer dereference, address: 0000000000000020
[ 29.145977] #PF: supervisor read access in kernel mode
[ 29.145979] #PF: error_code(0x0000) - not-present page
[ 29.145981] PGD 0 P4D 0
[ 29.158800] Oops: 0000 [#1] SMP NOPTI
[ 29.162468] CPU: 6 PID: 2532 Comm: lvm Not tainted 5.4.0-46-generic #50~18.04.1-Ubuntu
[ 29.170378] Hardware name: Supermicro AS -2023US-
[ 29.178038] RIP: 0010:free_
[ 29.183786] Code: 43 64 48 01 d0 49 39 c4 0f 83 71 ff ff ff 65 8b 05 a5 4e bc 58 48 8b 15 0e 4e 20 01 48 98 48 8b 3c c2 4c 01 e7 e8 f0 97 02 00 <48> 8b 58 20 48 8b 53 38 e9 48 ff ff ff f3 c3 48 8b 43 38 48 89 45
[ 29.202530] RSP: 0018:ffffa2f69c
[ 29.209204] RAX: 0000000000000000 RBX: ffff92202ff397c0 RCX: ffffffffa880a000
[ 29.216336] RDX: cf35c0f24f2cc3c0 RSI: 43817c451b92afcb RDI: 0000000000000000
[ 29.223469] RBP: ffffa2f69c3d3918 R08: 0000000000000000 R09: ffffffffa74a5300
[ 29.230609] R10: ffffa2f69c3d3820 R11: 0000000000000000 R12: cf35c0f24f14c3c0
[ 29.237745] R13: cf362fb2a054c3c0 R14: 0000000000000287 R15: 0000000000000008
[ 29.244878] FS: 00007f93a04b090
[ 29.252961] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 29.258707] CR2: 0000000000000020 CR3: 0000003fa9d90000 CR4: 00000000003406e0
[ 29.265883] Call Trace:
[ 29.268346] __kmem_
[ 29.273913] __kmem_
[ 29.278192] ? __kmalloc_
[ 29.282205] ? kvmalloc_
[ 29.285962] create_
[ 29.291003] kmem_cache_
[ 29.295882] kmem_cache_
[ 29.300152] dm_bufio_
[ 29.305644] ? snapshot_
[ 29.310693] persistent_
[ 29.316627] ? _cond_resched+
[ 29.320384] snapshot_
[ 29.325276] dm_table_
[ 29.329552] table_load+
[ 29.333045] ctl_ioctl+
[ 29.336450] ? retrieve_
[ 29.340551] dm_ctl_
[ 29.343958] do_vfs_
[ 29.347547] ? ksys_semctl.
[ 29.352337] ksys_ioctl+
[ 29.355663] __x64_sys_
[ 29.359421] do_syscall_
[ 29.363094] entry_SYSCALL_
[ 29.368144] RIP: 0033:0x7f939f0286d7
[ 29.371732] Code: b3 66 90 48 8b 05 b1 47 2d 00 64 c7 00 26 00 00 00 48 c7 c0 ff ff ff ff c3 66 2e 0f 1f 84 00 00 00 00 00 b8 10 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 81 47 2d 00 f7 d8 64 89 01 48
[ 29.390478] RSP: 002b:00007ffe91
[ 29.398045] RAX: ffffffffffffffda RBX: 0000561c107f672c RCX: 00007f939f0286d7
[ 29.405175] RDX: 0000561c1107c610 RSI: 00000000c138fd09 RDI: 0000000000000009
[ 29.412309] RBP: 00007ffe918df220 R08: 00007f939f59d120 R09: 00007ffe918defd0
[ 29.419442] R10: 0000561c1107c6c0 R11: 0000000000000202 R12: 00007f939f59c4e6
[ 29.426623] R13: 00007f939f59c4e6 R14: 00007f939f59c4e6 R15: 00007f939f59c4e6
[ 29.433778] Modules linked in: dm_snapshot dm_bufio dm_zero nls_iso8859_1 ipmi_ssif input_leds amd64_edac_mod edac_mce_amd joydev kvm_amd kvm ccp k10temp ipmi_si ipmi_devintf ipmi_msghandler mac_hid sch_fq_codel ib_iser rdma_cm iw_cm ib_cm iscsi_tcp libiscsi_tcp libiscsi scsi_transport_
[ 29.507853] CR2: 0000000000000020
[ 29.511174] ---[ end trace 43bd923f80cbdf52 ]---
That :a-0000152 is meant to be /sys/kernel/
$ uname -a
Linux <REDACTED> 5.4.0-42-generic #46~18.04.1-Ubuntu SMP Fri Jul 10 07:21:24 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux
$ ls -l /sys/kernel/slab | grep a-0000152
lrwxrwxrwx 1 root root 0 Sep 8 03:20 dm_bufio_buffer -> :a-0000152
So on 5.4.0-42 the named node doesn't get created, but at least it doesn't crash. The same thing is visible on my 5.8.0-18 desktop, but I can't reproduce the crash on other machines with snapshot thin volumes despite it happening every time (even with maxcpus=1) on the affected system.
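The missing node described above can be checked for mechanically. The following is a minimal sketch, assuming the problem shows up as an alias symlink under /sys/kernel/slab whose target does not exist, as the `ls -l` output suggests. The function name `find_dangling_aliases` is made up for illustration.

```shell
# List slab aliases whose symlink target is missing.
# Usage: find_dangling_aliases [DIR]   (DIR defaults to /sys/kernel/slab)
find_dangling_aliases() {
    dir="${1:-/sys/kernel/slab}"
    for entry in "$dir"/*; do
        # -L: the entry itself is a symlink; ! -e: following it leads nowhere.
        if [ -L "$entry" ] && [ ! -e "$entry" ]; then
            printf '%s -> %s\n' "${entry##*/}" "$(readlink "$entry")"
        fi
    done
}
```

Running `find_dangling_aliases` with no argument inspects the live system. On a machine showing the behaviour reported here, `dm_bufio_buffer` would be expected among the results, though that has not been verified beyond the listing above.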
It should be noted that LVM was not in use on this system until just before it was rebooted into the new kernel, but downgrading to -42 does work so it seems like a coincidence. Before I realised it was a recent regression I dug through mm/slub.c's history and found dde3c6b7 ("mm/slub: fix a memory leak in sysfs_slab_add()") kind of suspicious -- it ostensibly fixes a leak from 80da026a ("mm/slub: fix slab double-free in case of duplicate sysfs filename"), exactly the codepath that seems to crash here.
There's clearly some existing bug causing the slab sysfs node to not be added, and I guess dde3c6b7 turns that into a crash on some systems. This is a test system, so I can do whatever debugging is required to narrow down the trigger.
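When narrowing down the trigger, it may help to see which cache names share a sysfs node. The following is a rough sketch, assuming (as in the `ls -l` output above) that aliases appear as symlinks pointing at `:`-prefixed nodes. `slab_aliases_by_target` is a hypothetical helper name, not an existing tool.

```shell
# Print "TARGET ALIAS" for every symlink in the slab directory, sorted so
# that aliases pointing at the same node end up on adjacent lines.
# Usage: slab_aliases_by_target [DIR]   (DIR defaults to /sys/kernel/slab)
slab_aliases_by_target() {
    dir="${1:-/sys/kernel/slab}"
    for entry in "$dir"/*; do
        if [ -L "$entry" ]; then
            printf '%s %s\n' "$(readlink "$entry")" "${entry##*/}"
        fi
    done | LC_ALL=C sort
}
```

Comparing this output between the working and the crashing kernel could show whether `dm_bufio_buffer` is the only alias of its node or shares it with other caches.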
Changed in linux (Ubuntu):
status: Incomplete → Confirmed
description: updated
Changed in linux (Ubuntu Focal):
status: New → Fix Committed
This bug is missing log files that will aid in diagnosing the problem. While running an Ubuntu kernel (not a mainline or third-party kernel) please enter the following command in a terminal window:
apport-collect 1894780
and then change the status of the bug to 'Confirmed'.
If, due to the nature of the issue you have encountered, you are unable to run this command, please add a comment stating that fact and change the bug status to 'Confirmed'.
This change has been made by an automated script, maintained by the Ubuntu Kernel Team.