5.19.0-50 --- mdadm RAID 5 with write journal segfaults
Affects | Status | Importance | Assigned to | Milestone | |
---|---|---|---|---|---|
mdadm (Ubuntu) | New | Undecided | Unassigned | | |
Bug Description
After upgrading Ubuntu 22.04 LTS from 5.15-series kernels to kernel 5.19.0-50, the new kernel started page-faulting after about 2 hours of uptime. The fault is a kernel NULL pointer dereference in the md RAID 5 thread (md126_raid5), and it relates to a RAID 5 array that has a write-through journal. The RAID 5 array had 4 HDDs, and the journal device was itself a 32 GB RAID 1 mdadm array built from partitions on SSDs. The failure details from the syslog are below.
Once this crash happens, the RAID array in question becomes unresponsive. The array cannot be stopped, and the reboot process does not complete successfully. After rebooting, mdadm reports that 0 data pages and hundreds of thousands of parity pages have to be recovered from the journal. There appears to be no data loss, but that is hard to verify.
For reference, I had previously tried a write-back journal on the same RAID 5 array. With the earlier 5.15-series kernels, mdadm would periodically hang and also prevent a successful reboot of the machine. Upon restarting, mdadm would hang while trying to start the array until I cleared out the write-back journal and added a fresh one. This is similar to bugs reported on the mdadm mailing list in 2020. From that point on, I only used write-through journals, which appeared to work fine. With the 5.19 kernel, the write-through journal started causing the crash described here. The present situation is similar to bugs reported on the mdadm mailing list in May of this year.
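For anyone trying to confirm which journal mode an array is actually running in, the kernel's raid5-cache documentation describes a journal_mode attribute under /sys/block/<array>/md/. The following is a minimal Python sketch, not something from the original setup. It assumes the attribute lists both modes with the active one in square brackets (for example "[write-through] write-back"). The array name md126 is taken from the trace below, and both function names are hypothetical.

```python
from pathlib import Path


def parse_journal_mode(text: str) -> str:
    """Return the active mode from the contents of journal_mode.

    Assumption: the active mode is the token wrapped in square
    brackets, e.g. "[write-through] write-back".
    """
    for token in text.split():
        if token.startswith("[") and token.endswith("]"):
            return token[1:-1]
    raise ValueError(f"no bracketed mode in {text!r}")


def read_journal_mode(array: str = "md126") -> str:
    """Read the journal mode of one md array from sysfs.

    Raises OSError if the array does not exist or has no journal.
    """
    path = Path("/sys/block") / array / "md" / "journal_mode"
    return parse_journal_mode(path.read_text())
```

Calling read_journal_mode("md126") on the affected machine would be expected to return "write-through" for the configuration described above.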
As a workaround, I dropped the old RAID 5 array with its write journal and switched to a RAID 6 array with an internal bitmap.
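The replacement layout could be created along these lines. This is only a sketch of the command, built in Python so it can be inspected before being run by hand. The array name /dev/md0 and the member devices /dev/sda through /dev/sdd are placeholders, not the devices from this report, and the helper name is hypothetical. The mdadm options used (--create, --level, --raid-devices, --bitmap=internal) are standard ones.

```python
from __future__ import annotations

import shlex


def raid6_create_argv(array: str, members: list[str]) -> list[str]:
    """Build an mdadm command line for a RAID 6 array with an internal bitmap.

    No write journal is requested, so the raid5-cache code path is not used.
    """
    if len(members) < 4:
        raise ValueError("RAID 6 needs at least 4 member devices")
    return [
        "mdadm",
        "--create", array,
        "--level=6",
        f"--raid-devices={len(members)}",
        "--bitmap=internal",
        *members,
    ]


# Placeholder device names; substitute the real ones before running anything.
argv = raid6_create_argv("/dev/md0", ["/dev/sda", "/dev/sdb", "/dev/sdc", "/dev/sdd"])
print(shlex.join(argv))
```

The script only prints the command; it does not touch any disks.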
Jul 26 04:02:46 <redacted> kernel: [ 7093.186750] BUG: kernel NULL pointer dereference, address: 0000000000000155
Jul 26 04:02:46 <redacted> kernel: [ 7093.186769] #PF: supervisor read access in kernel mode
Jul 26 04:02:46 <redacted> kernel: [ 7093.186774] #PF: error_code(0x0000) - not-present page
Jul 26 04:02:46 <redacted> kernel: [ 7093.186778] PGD 0 P4D 0
Jul 26 04:02:46 <redacted> kernel: [ 7093.186785] Oops: 0000 [#1] PREEMPT SMP PTI
Jul 26 04:02:46 <redacted> kernel: [ 7093.186793] CPU: 4 PID: 5645 Comm: md126_raid5 Tainted: P OE 5.19.0-50-generic #50-Ubuntu
Jul 26 04:02:46 <redacted> kernel: [ 7093.186800] Hardware name: Supermicro X9DAi/X9DAi, BIOS 3.0 08/05/2013
Jul 26 04:02:46 <redacted> kernel: [ 7093.186804] RIP: 0010:submit_
Jul 26 04:02:46 <redacted> kernel: [ 7093.186819] Code: 8b 9c eb b8 00 00 00 0f 1f 44 00 00 41 80 7c 24 14 00 79 09 f6 83 50 01 00 00 04 74 2f 41 8b 44 24 10 83 e0 01 05 54 01 00 00 <0f> b6 1c 03 80 fb 01 0f 87 2e 56 82 00 83 e3 01 74 10 4c 89 e7 e8
Jul 26 04:02:46 <redacted> kernel: [ 7093.186824] RSP: 0018:ffff9aee4f
Jul 26 04:02:46 <redacted> kernel: [ 7093.186830] RAX: 0000000000000155 RBX: 0000000000000000 RCX: 0000000000000000
Jul 26 04:02:46 <redacted> kernel: [ 7093.186834] RDX: 0000000000040001 RSI: 0000000000000000 RDI: ffff8a76132e70b8
Jul 26 04:02:46 <redacted> kernel: [ 7093.186839] RBP: ffff9aee4fc27ce0 R08: 0000000000000000 R09: 0000000000000000
Jul 26 04:02:46 <redacted> kernel: [ 7093.186842] R10: 0000000000000000 R11: 0000000000000000 R12: ffff8a76132e70b8
Jul 26 04:02:46 <redacted> kernel: [ 7093.186846] R13: ffff8a6e57d88000 R14: ffff8a6e57bad780 R15: 0000000003fef800
Jul 26 04:02:46 <redacted> kernel: [ 7093.186851] FS: 000000000000000
Jul 26 04:02:46 <redacted> kernel: [ 7093.186856] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Jul 26 04:02:46 <redacted> kernel: [ 7093.186860] CR2: 0000000000000155 CR3: 0000000396a10004 CR4: 00000000001706e0
Jul 26 04:02:46 <redacted> kernel: [ 7093.186865] Call Trace:
Jul 26 04:02:46 <redacted> kernel: [ 7093.186870] <TASK>
Jul 26 04:02:46 <redacted> kernel: [ 7093.186876] submit_
Jul 26 04:02:46 <redacted> kernel: [ 7093.186890] r5l_flush_
Jul 26 04:02:46 <redacted> kernel: [ 7093.186913] handle_
Jul 26 04:02:46 <redacted> kernel: [ 7093.186928] ? md_wakeup_
Jul 26 04:02:46 <redacted> kernel: [ 7093.186937] raid5d+0x377/0x5e0 [raid456]
Jul 26 04:02:46 <redacted> kernel: [ 7093.186953] ? schedule_
Jul 26 04:02:46 <redacted> kernel: [ 7093.186964] md_thread+
Jul 26 04:02:46 <redacted> kernel: [ 7093.186971] ? destroy_
Jul 26 04:02:46 <redacted> kernel: [ 7093.186982] ? md_set_
Jul 26 04:02:46 <redacted> kernel: [ 7093.186988] kthread+0xee/0x120
Jul 26 04:02:46 <redacted> kernel: [ 7093.186997] ? kthread_
Jul 26 04:02:46 <redacted> kernel: [ 7093.187006] ret_from_
Jul 26 04:02:46 <redacted> kernel: [ 7093.187018] </TASK>
Jul 26 04:02:46 <redacted> kernel: [ 7093.187021] Modules linked in: tls intel_rapl_msr intel_rapl_common sb_edac x86_pkg_
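To compare this oops against other reports, it can help to reduce the log to the faulting address and the call-trace frames. The following Python sketch does that for syslog lines in the format shown above. The function name and regular expressions are my own and only assume the "kernel: [ timestamp ]" prefix and the <TASK> markers seen in this log. Frames marked with "?" are kept as-is, and truncated symbol names are not reconstructed.

```python
import re

# Strips everything up to and including "kernel: [ 7093.186750] ".
PREFIX = re.compile(r"^.*?kernel: \[\s*\d+\.\d+\]\s?")
ADDRESS = re.compile(r"BUG: kernel NULL pointer dereference, address: ([0-9a-f]+)")


def summarize_oops(lines):
    """Return (faulting_address, frames) from syslog lines of one oops.

    faulting_address is an int, or None if no BUG line was seen.
    frames are the raw lines between <TASK> and </TASK>.
    """
    address = None
    frames = []
    in_trace = False
    for line in lines:
        body = PREFIX.sub("", line.rstrip("\n")).strip()
        match = ADDRESS.search(body)
        if match:
            address = int(match.group(1), 16)
        elif body == "<TASK>":
            in_trace = True
        elif body == "</TASK>":
            in_trace = False
        elif in_trace:
            frames.append(body)
    return address, frames
```

It can be fed the lines of a saved log file, for example summarize_oops(open("oops.log")), where oops.log is a hypothetical file holding the excerpt above.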