[REGRESSION] md/raid0: cannot assemble multi-zone RAID0 with default_layout setting
Affects | Status | Importance | Assigned to | Milestone | |
---|---|---|---|---|---|
linux (Ubuntu) | Fix Released | Undecided | Unassigned | | |
Bionic | Fix Released | Critical | dann frazier | | |
Disco | Fix Released | Undecided | Unassigned | | |
Eoan | Fix Released | Undecided | Unassigned | | |
Focal | Fix Released | Undecided | Unassigned | | |
Bug Description
This bug tracks the temporary revert of the upstream fix for a corruption issue. Bug 1850540 tracks the re-application of that fix once we have a full solution.
Users of RAID0 arrays are susceptible to a corruption issue if:
- The members of the RAID array are not all the same size[*]
- Data has been written to the array while running kernels < 3.14 *and* >= 3.14.
This is because of a change in v3.14 that accidentally changed how data was written - as described in the upstream commit message:
https:/
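The first condition above (members that are not all the same size) can be checked with something like the following. This is only a rough sketch: the function names are hypothetical, and it assumes that md compares member sizes after rounding each one down to a whole number of chunks, so that an array has more than one zone exactly when those rounded sizes differ.

```python
def count_zones(member_sectors, chunk_sectors):
    """Count distinct member sizes after rounding down to whole chunks.

    Assumption: this mirrors how md decides how many zones a RAID0 has.
    """
    if chunk_sectors <= 0:
        raise ValueError("chunk_sectors must be positive")
    rounded = {(size // chunk_sectors) * chunk_sectors for size in member_sectors}
    return len(rounded)


def is_multi_zone(member_sectors, chunk_sectors):
    """True if the members would form more than one zone."""
    return count_zones(member_sectors, chunk_sectors) > 1
```

Under that assumption, an array of identically-sized members always yields a single zone and should not be affected.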
To summarize, upstream is dealing with this by adding a versioned layout in v5.4, and that is being backported to stable kernels - which is why we're now seeing it. Layout version 1 is the pre-3.14 layout, version 2 is post 3.14. Mixing version 1 & version 2 layouts can cause corruption. However, unless a layout-
The user experience is pretty awful here. A user upgrades to the next SRU and all of a sudden their system stops at an (initramfs) prompt. A clueful user can spot something like the following in dmesg:
Here's the message which, as you can see from the log in Comment #1, is hidden in a ton of other messages:
[ 72.720232] md/raid0:md0: cannot assemble multi-zone RAID0 with default_layout setting
[ 72.728149] md/raid0: please set raid.default_layout to 1 or 2
[ 72.733979] md: pers->run() failed ...
mdadm: failed to start array /dev/md0: Unknown error 524
What that is trying to say is that you should determine if your data - specifically the data toward the end of your array - was most likely written with a pre-3.14 or post-3.14 kernel. Based on that, reboot with the kernel parameter raid0.default_
https:/
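Since the error is buried in other boot output, a small filter can pull out which arrays hit it. A minimal sketch, assuming the kernel log text has already been captured into a string and that the message wording matches the lines quoted above; the function name is hypothetical.

```python
import re

# Matches the kernel message quoted in this bug and captures the array name.
LAYOUT_ERROR_RE = re.compile(
    r"md/raid0:(\S+): cannot assemble multi-zone RAID0 with default_layout setting"
)


def arrays_needing_layout(log_text):
    """Return the names of arrays that failed to assemble, in order seen."""
    found = []
    for line in log_text.splitlines():
        match = LAYOUT_ERROR_RE.search(line)
        if match and match.group(1) not in found:
            found.append(match.group(1))
    return found
```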
IMHO, we should work with upstream to create a web page that clearly walks the user through this process, and update the error message to point to that page. I'd also like to see if we can detect this problem *before* the user reboots (debconf?) and help the user fix things. e.g. "We detected that you have RAID0 arrays that may be susceptible to a corruption problem", guide the user to choosing a layout, and update the mdadm initramfs hook to poke the answer in via sysfs before starting the array on reboot.
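As a starting point for that kind of pre-reboot detection, something like the following could list the RAID0 arrays on a running system and their members. It is only a sketch: the function name is hypothetical, it assumes the usual /proc/mdstat layout ("md0 : active raid0 sdc[1] sdb[0]"), and it does not look at member sizes, which a real check would still have to compare (for example by reading /sys/class/block/<member>/size).

```python
import re

# A member token looks like "sdc[1]", optionally followed by flags such as "(S)".
MEMBER_RE = re.compile(r"^([^\[\s]+)\[\d+\]")


def raid0_arrays(mdstat_text):
    """Map each RAID0 array name in /proc/mdstat text to its member devices."""
    arrays = {}
    for line in mdstat_text.splitlines():
        name, sep, rest = line.partition(" : ")
        if not sep or not name.startswith("md"):
            continue  # not an array status line
        tokens = rest.split()
        if "raid0" not in tokens:
            continue  # some other personality, or an inactive array
        members = []
        for token in tokens[tokens.index("raid0") + 1:]:
            match = MEMBER_RE.match(token)
            if match:
                members.append(match.group(1))
        arrays[name] = members
    return arrays
```

On a live system the input would come from reading /proc/mdstat; arrays that are not currently assembled as raid0 are not reported by this sketch.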
Note that it also seems like we should investigate backporting this to < 3.14 kernels. Imagine a user switching between the trusty HWE kernel and the GA kernel.
References from users of other distros:
https:/
https:/
[*] Which surprisingly is not the case reported in this bug - the user here had a raid0 of 8 identically-sized devices. I suspect there's a bug in the detection code somewhere.
description: | updated |
Changed in linux (Ubuntu Bionic): | |
status: | Confirmed → Fix Committed |
Changed in linux (Ubuntu Disco): | |
status: | Incomplete → Fix Committed |
no longer affects: | mdadm (Debian) |
no longer affects: | ubuntu-release-notes |
This bug is missing log files that will aid in diagnosing the problem. While running an Ubuntu kernel (not a mainline or third-party kernel) please enter the following command in a terminal window:
apport-collect 1849682
and then change the status of the bug to 'Confirmed'.
If, due to the nature of the issue you have encountered, you are unable to run this command, please add a comment stating that fact and change the bug status to 'Confirmed'.
This change has been made by an automated script, maintained by the Ubuntu Kernel Team.