[REGRESSION] md/raid0: cannot assemble multi-zone RAID0 with default_layout setting

Bug #1849682 reported by dann frazier
12
This bug affects 1 person
Affects Status Importance Assigned to Milestone
linux (Ubuntu)
Fix Released
Undecided
Unassigned
Bionic
Fix Released
Critical
dann frazier
Disco
Fix Released
Undecided
Unassigned
Eoan
Fix Released
Undecided
Unassigned
Focal
Fix Released
Undecided
Unassigned

Bug Description

This bug tracks the temporary revert of the upstream fix for a corruption issue. Bug 1850540 tracks the re-application of that fix once we have a full solution.

Users of RAID0 arrays are susceptible to a corruption issue if:
 - The members of the RAID array are not all the same size[*]
 - Data has been written to the array while running kernels < 3.14 *and* >= 3.14.

This is because of an change in v3.14 that accidentally changed how data was written - as described in the upstream commit message:
https://github.com/torvalds/linux/commit/c84a1372df929033cb1a0441fb57bd3932f39ac9

To summarize, upstream is dealing with this by adding a versioned layout in v5.4, and that is being backported to stable kernels - which is why we're now seeing it. Layout version 1 is the pre-3.14 layout, version 2 is post 3.14. Mixing version 1 & version 2 layouts can cause corruption. However, unless a layout-version-aware kernel *created* the array, there's no way for the kernel to know which version(s) was used to write the existing data. This undefined mode is considered "Version 0", and the kernel will now refuse to start these arrays w/o user intervention.

The user experience is pretty awful here. A user upgrades to the next SRU and all of a sudden their system stops at an (initramfs) prompt. A clueful user can spot something like the following in dmesg:

Here's the message which , as you can see from the log in Comment #1, is hidden in a ton of other messages:

[ 72.720232] md/raid0:md0: cannot assemble multi-zone RAID0 with default_layout setting
[ 72.728149] md/raid0: please set raid.default_layout to 1 or 2
[ 72.733979] md: pers->run() failed ...
mdadm: failed to start array /dev/md0: Unknown error 524

What that is trying to say is that you should determine if your data - specifically the data toward the end of your array - was most likely written with a pre-3.14 or post-3.14 kernel. Based on that, reboot with the kernel parameter raid0.default_layout=1 or raid0.default_layout=2 on the kernel command line. And note it should be *raid0.default_layout* not *raid.default_layout* as the message says - a fix for that message is now queued for stable:

https://github.com/torvalds/linux/commit/3874d73e06c9b9dc15de0b7382fc223986d75571)

IMHO, we should work with upstream to create a web page that clearly walks the user through this process, and update the error message to point to that page. I'd also like to see if we can detect this problem *before* the user reboots (debconf?) and help the user fix things. e.g. "We detected that you have RAID0 arrays that maybe susceptible to a corruption problem", guide the user to choosing a layout, and update the mdadm initramfs hook to poke the answer in via sysfs before starting the array on reboot.

Note that it also seems like we should investigate backporting this to < 3.14 kernels. Imagine a user switching between the trusty HWE kernel and the GA kernel.

References from users of other distros:
https://blog.icod.de/2019/10/10/caution-kernel-5-3-4-and-raid0-default_layout/
https://www.linuxquestions.org/questions/linux-general-1/raid-arrays-not-assembling-4175662774/

[*] Which surprisingly is not the case reported in this bug - the user here had a raid0 of 8 identically-sized devices. I suspect there's a bug in the detection code somewhere.

Revision history for this message
dann frazier (dannf) wrote :
Changed in linux (Ubuntu Focal):
status: Confirmed → New
importance: Critical → Undecided
Changed in linux (Ubuntu Bionic):
importance: Undecided → Critical
status: New → Confirmed
assignee: nobody → dann frazier (dannf)
Changed in linux (Ubuntu Focal):
assignee: dann frazier (dannf) → nobody
Revision history for this message
Ubuntu Kernel Bot (ubuntu-kernel-bot) wrote : Missing required logs.

This bug is missing log files that will aid in diagnosing the problem. While running an Ubuntu kernel (not a mainline or third-party kernel) please enter the following command in a terminal window:

apport-collect 1849682

and then change the status of the bug to 'Confirmed'.

If, due to the nature of the issue you have encountered, you are unable to run this command, please add a comment stating that fact and change the bug status to 'Confirmed'.

This change has been made by an automated script, maintained by the Ubuntu Kernel Team.

Changed in linux (Ubuntu):
status: New → Incomplete
Changed in linux (Ubuntu Disco):
status: New → Incomplete
Changed in linux (Ubuntu Eoan):
status: New → Incomplete
Revision history for this message
dann frazier (dannf) wrote :

OK - this is a messy one. It is due to the backport of this:
https://github.com/torvalds/linux/commit/c84a1372df929033cb1a0441fb57bd3932f39ac9

Reverting that is probably not the right answer because the point of it is to avoid corruption. But this is a pretty serious usability issue. It is not at all clear from the message that a user needs to do *something* - and what that *something* is is even less clear:

Here's the message, buried in a ton of other messages:
[ 72.720232] md/raid0:md0: cannot assemble multi-zone RAID0 with default_layout setting
[ 72.728149] md/raid0: please set raid.default_layout to 1 or 2
[ 72.733979] md: pers->run() failed ...
mdadm: failed to start array /dev/md0: Unknown error 524

So if you understand from that that you need to pass a kernel parameter, you're more intuitive than I am. And if you understand from that *why*, and *to which one* - well, you probably wrote the patch. And even then, you probably didn't realize the parameter is actually incorrect (HINT: we should backport this as well: https://github.com/torvalds/linux/commit/3874d73e06c9b9dc15de0b7382fc223986d75571).

IMO, the error message should include a URL to page with clear steps on how to proceed which I think is something along the lines of "Use mdadm to figure out when your array was created, figure out what kernel you were running back then (ideally with a mapping to Ubuntu release), and then how to fix it.

That said, it isn't clear to me why we saw this issue on this specific machine. This issue is supposedly restricted to only multi-zone RAID0 configs, which should only happen if not all members are the same size. But I happen to know that all members on this system here *are* the same size! I've tried to reproduce it but, after redeploying the system with MAAS, it upgrades and reboots w/o error :(

dann frazier (dannf)
description: updated
dann frazier (dannf)
description: updated
description: updated
description: updated
dann frazier (dannf)
description: updated
Revision history for this message
dann frazier (dannf) wrote :

I've proposed changing the error message to point to a webpage w/ a better explanation:
  https://marc.info/?l=linux-raid&m=157196348406853&w=2

Changed in linux (Ubuntu Bionic):
status: Confirmed → Fix Committed
Revision history for this message
Ubuntu Kernel Bot (ubuntu-kernel-bot) wrote :

This bug is awaiting verification that the kernel in -proposed solves the problem. Please test the kernel and update this bug with the results. If the problem is solved, change the tag 'verification-needed-bionic' to 'verification-done-bionic'. If the problem still exists, change the tag 'verification-needed-bionic' to 'verification-failed-bionic'.

If verification is not done by 5 working days from today, this fix will be dropped from the source code, and this bug will be closed.

See https://wiki.ubuntu.com/Testing/EnableProposed for documentation how to enable and use -proposed. Thank you!

tags: added: verification-needed-bionic
dann frazier (dannf)
description: updated
Revision history for this message
Ubuntu Kernel Bot (ubuntu-kernel-bot) wrote :

This bug is awaiting verification that the kernel in -proposed solves the problem. Please test the kernel and update this bug with the results. If the problem is solved, change the tag 'verification-needed-disco' to 'verification-done-disco'. If the problem still exists, change the tag 'verification-needed-disco' to 'verification-failed-disco'.

If verification is not done by 5 working days from today, this fix will be dropped from the source code, and this bug will be closed.

See https://wiki.ubuntu.com/Testing/EnableProposed for documentation how to enable and use -proposed. Thank you!

tags: added: verification-needed-disco
Revision history for this message
Khaled El Mously (kmously) wrote :

@dannf: Ping for verification :)

tags: added: verification-done-bionic
removed: verification-needed-bionic
Revision history for this message
dann frazier (dannf) wrote :

= Verification =
I see 2 pieces to this:
 1) The original report, in Comment #1, where the offending patch caused an issue on a system where it shouldn't have - i.e., a raid0 w/ homogenous member sizes. We were never able to reproduce this in subsequent tests w/ the patch applied. I know sfeole was able to perform the same MAAS install/upgrade w/ the current -proposed kernel (I saw a test report from it), so I think we can confidently say it is not reproducible in that build either.

 2) In configs where this patch *should* prevent a raid0 from assembling (heterogenous sizes), I've verified that if I create such an array on an older kernel, then upgrade to the current -proposed kernel, it starts automatically. Now, of course, I continue to be susceptible to corruption, but that's known and tracked in bug 1850540.

$ cat /proc/version
Linux version 4.15.0-66-generic (buildd@lgw01-amd64-044) (gcc version 7.4.0 (Ubuntu 7.4.0-1ubuntu1~18.04.1)) #75-Ubuntu SMP Tue Oct 1 05:24:09 UTC 2019
$ sudo mdadm --create /dev/md0 --run --metadata=default --homehost=akis --level=0 --raid-devices=2 /dev/vdb1 /dev/vdc1
mdadm: /dev/vdb1 appears to be part of a raid array:
       level=raid0 devices=2 ctime=Thu Oct 31 21:53:40 2019
mdadm: /dev/vdc1 appears to be part of a raid array:
       level=raid0 devices=2 ctime=Thu Oct 31 21:53:40 2019
mdadm: array /dev/md0 started.
$ sudo reboot

$ cat /proc/version
Linux version 4.15.0-68-generic (buildd@lgw01-amd64-037) (gcc version 7.4.0 (Ubuntu 7.4.0-1ubuntu1~18.04.1)) #77-Ubuntu SMP Sun Oct 27 06:02:23 UTC 2019
$ cat /proc/mdstat
Personalities : [raid0] [linear] [multipath] [raid1] [raid6] [raid5] [raid4] [raid10]
md127 : active raid0 vdc1[1] vdb1[0]
      1567744 blocks super 1.2 512k chunks

unused devices: <none>

Revision history for this message
dann frazier (dannf) wrote :
Download full text (28.6 KiB)

The disco story is not so pretty:

[ 0.000000] Linux version 5.0.0-34-generic (buildd@lgw01-amd64-051) (gcc version 7.4.0 (Ubuntu 7.4.0-1ubuntu1~18.04.1)) #36~18.04.1-Ubuntu SMP Wed Oct 30 08:08:56 UTC 2019 (Ubuntu 5.0.0-34.36~18.04.1-generic 5.0.21)
[ 0.000000] Command line: BOOT_IMAGE=/boot/vmlinuz-5.0.0-34-generic root=UUID=e3626652-82d8-4e95-a6d7-e4920d7941b6 ro console=tty1 console=ttyS0
[ 0.000000] KERNEL supported cpus:
[ 0.000000] Intel GenuineIntel
[ 0.000000] AMD AuthenticAMD
[ 0.000000] Hygon HygonGenuine
[ 0.000000] Centaur CentaurHauls
[ 0.000000] x86/fpu: Supporting XSAVE feature 0x001: 'x87 floating point registers'
[ 0.000000] x86/fpu: Supporting XSAVE feature 0x002: 'SSE registers'
[ 0.000000] x86/fpu: Supporting XSAVE feature 0x004: 'AVX registers'
[ 0.000000] x86/fpu: xstate_offset[2]: 576, xstate_sizes[2]: 256
[ 0.000000] x86/fpu: Enabled xstate features 0x7, context size is 832 bytes, using 'standard' format.
[ 0.000000] BIOS-provided physical RAM map:
[ 0.000000] BIOS-e820: [mem 0x0000000000000000-0x000000000009fbff] usable
[ 0.000000] BIOS-e820: [mem 0x000000000009fc00-0x000000000009ffff] reserved
[ 0.000000] BIOS-e820: [mem 0x00000000000f0000-0x00000000000fffff] reserved
[ 0.000000] BIOS-e820: [mem 0x0000000000100000-0x000000001ffd7fff] usable
[ 0.000000] BIOS-e820: [mem 0x000000001ffd8000-0x000000001fffffff] reserved
[ 0.000000] BIOS-e820: [mem 0x00000000feffc000-0x00000000feffffff] reserved
[ 0.000000] BIOS-e820: [mem 0x00000000fffc0000-0x00000000ffffffff] reserved
[ 0.000000] NX (Execute Disable) protection: active
[ 0.000000] SMBIOS 2.8 present.
[ 0.000000] DMI: OpenStack Foundation OpenStack Nova, BIOS 1.10.2-1ubuntu1 04/01/2014
[ 0.000000] Hypervisor detected: KVM
[ 0.000000] kvm-clock: Using msrs 4b564d01 and 4b564d00
[ 0.000000] kvm-clock: cpu 0, msr e601001, primary cpu clock
[ 0.000000] kvm-clock: using sched offset of 79214193145 cycles
[ 0.000002] clocksource: kvm-clock: mask: 0xffffffffffffffff max_cycles: 0x1cd42e4dffb, max_idle_ns: 881590591483 ns
[ 0.000004] tsc: Detected 2298.666 MHz processor
[ 0.002336] last_pfn = 0x1ffd8 max_arch_pfn = 0x400000000
[ 0.002446] x86/PAT: Configuration [0-7]: WB WC UC- UC WB WP UC- WT
[ 0.008265] found SMP MP-table at [mem 0x000f6a60-0x000f6a6f]
[ 0.008362] check: Scanning 1 areas for low memory corruption
[ 0.008413] Using GB pages for direct mapping
[ 0.008561] RAMDISK: [mem 0x1d8fb000-0x1ecacfff]
[ 0.008579] ACPI: Early table checksum verification disabled
[ 0.008637] ACPI: RSDP 0x00000000000F6860 000014 (v00 BOCHS )
[ 0.008640] ACPI: RSDT 0x000000001FFE1505 00002C (v01 BOCHS BXPCRSDT 00000001 BXPC 00000001)
[ 0.008646] ACPI: FACP 0x000000001FFE1419 000074 (v01 BOCHS BXPCFACP 00000001 BXPC 00000001)
[ 0.008654] ACPI: DSDT 0x000000001FFE0040 0013D9 (v01 BOCHS BXPCDSDT 00000001 BXPC 00000001)
[ 0.008657] ACPI: FACS 0x000000001FFE0000 000040
[ 0.008660] ACPI: APIC 0x000000001FFE148D 000078 (v01 BOCHS BXPCAPIC 00000001 BXPC 00000001)
[ 0.009155] No NUMA configuration found
[ 0.009157] Faking a node at [mem 0x000000000000...

Revision history for this message
dann frazier (dannf) wrote :

I can 100% reproduce this on 5.0.0-34, but not at all on 5.0.0-32, so marking verification-failed.

tags: added: verification-failed-disco
removed: verification-needed-disco
Changed in linux (Ubuntu Disco):
status: Incomplete → Fix Committed
Revision history for this message
Ubuntu SRU Bot (ubuntu-sru-bot) wrote : Autopkgtest regression report (linux-gcp-5.3/5.3.0-1008.9~18.04.1)

All autopkgtests for the newly accepted linux-gcp-5.3 (5.3.0-1008.9~18.04.1) for bionic have finished running.
The following regressions have been reported in tests triggered by the package:

linux-gcp-5.3/unknown (amd64)

Please visit the excuses page listed below and investigate the failures, proceeding afterwards as per the StableReleaseUpdates policy regarding autopkgtest regressions [1].

https://people.canonical.com/~ubuntu-archive/proposed-migration/bionic/update_excuses.html#linux-gcp-5.3

[1] https://wiki.ubuntu.com/StableReleaseUpdates#Autopkgtest_Regressions

Thank you!

Revision history for this message
Launchpad Janitor (janitor) wrote :
Download full text (53.1 KiB)

This bug was fixed in the package linux - 5.3.0-22.24

---------------
linux (5.3.0-22.24) eoan; urgency=medium

  * [REGRESSION] md/raid0: cannot assemble multi-zone RAID0 with default_layout
    setting (LP: #1849682)
    - Revert "md/raid0: avoid RAID0 data corruption due to layout confusion."

  * refcount underflow and type confusion in shiftfs (LP: #1850867) // CVE-2019-15793
    - SAUCE: shiftfs: Correct id translation for lower fs operations
    - SAUCE: shiftfs: prevent type confusion
    - SAUCE: shiftfs: Fix refcount underflow in btrfs ioctl handling

  * CVE-2018-12207
    - kvm: x86, powerpc: do not allow clearing largepages debugfs entry
    - SAUCE: KVM: vmx, svm: always run with EFER.NXE=1 when shadow paging is
      active
    - SAUCE: x86: Add ITLB_MULTIHIT bug infrastructure
    - SAUCE: kvm: mmu: ITLB_MULTIHIT mitigation
    - SAUCE: kvm: Add helper function for creating VM worker threads
    - SAUCE: kvm: x86: mmu: Recovery of shattered NX large pages
    - SAUCE: cpu/speculation: Uninline and export CPU mitigations helpers
    - SAUCE: kvm: x86: mmu: Apply global mitigations knob to ITLB_MULTIHIT

  * CVE-2019-11135
    - x86/msr: Add the IA32_TSX_CTRL MSR
    - x86/cpu: Add a helper function x86_read_arch_cap_msr()
    - x86/cpu: Add a "tsx=" cmdline option with TSX disabled by default
    - x86/speculation/taa: Add mitigation for TSX Async Abort
    - x86/speculation/taa: Add sysfs reporting for TSX Async Abort
    - kvm/x86: Export MDS_NO=0 to guests when TSX is enabled
    - x86/tsx: Add "auto" option to the tsx= cmdline parameter
    - x86/speculation/taa: Add documentation for TSX Async Abort
    - x86/tsx: Add config options to set tsx=on|off|auto
    - [Config] Disable TSX by default when possible

  * CVE-2019-0154
    - SAUCE: drm/i915: Lower RM timeout to avoid DSI hard hangs
    - SAUCE: drm/i915/gen8+: Add RC6 CTX corruption WA

  * CVE-2019-0155
    - SAUCE: drm/i915: Rename gen7 cmdparser tables
    - SAUCE: drm/i915: Disable Secure Batches for gen6+
    - SAUCE: drm/i915: Remove Master tables from cmdparser
    - SAUCE: drm/i915: Add support for mandatory cmdparsing
    - SAUCE: drm/i915: Support ro ppgtt mapped cmdparser shadow buffers
    - SAUCE: drm/i915: Allow parsing of unsized batches
    - SAUCE: drm/i915: Add gen9 BCS cmdparsing
    - SAUCE: drm/i915/cmdparser: Use explicit goto for error paths
    - SAUCE: drm/i915/cmdparser: Add support for backward jumps
    - SAUCE: drm/i915/cmdparser: Ignore Length operands during command matching

linux (5.3.0-21.22) eoan; urgency=medium

  * eoan/linux: 5.3.0-21.22 -proposed tracker (LP: #1850486)

  * Fix signing of staging modules in eoan (LP: #1850234)
    - [Packaging] Leave unsigned modules unsigned after adding .gnu_debuglink

linux (5.3.0-20.21) eoan; urgency=medium

  * eoan/linux: 5.3.0-20.21 -proposed tracker (LP: #1849064)

  * eoan: alsa/sof: Enable SOF_HDA link and codec (LP: #1848490)
    - [Config] Enable SOF_HDA link and codec

  * Eoan update: 5.3.7 upstream stable release (LP: #1848750)
    - panic: ensure preemption is disabled during panic()
    - [Config] updateconfigs for USB_RIO500
    - USB: rio500: Remove Rio 500 kernel driver
   ...

Changed in linux (Ubuntu Eoan):
status: Incomplete → Fix Released
Revision history for this message
Launchpad Janitor (janitor) wrote :
Download full text (38.6 KiB)

This bug was fixed in the package linux - 5.0.0-35.38

---------------
linux (5.0.0-35.38) disco; urgency=medium

  * [REGRESSION] md/raid0: cannot assemble multi-zone RAID0 with default_layout
    setting (LP: #1849682)
    - SAUCE: Fix revert "md/raid0: avoid RAID0 data corruption due to layout
      confusion."

  * refcount underflow and type confusion in shiftfs (LP: #1850867) // CVE-2019-15793
    - SAUCE: shiftfs: Correct id translation for lower fs operations
    - SAUCE: shiftfs: prevent type confusion
    - SAUCE: shiftfs: Fix refcount underflow in btrfs ioctl handling

  * CVE-2018-12207
    - kvm: Convert kvm_lock to a mutex
    - kvm: x86: Do not release the page inside mmu_set_spte()
    - KVM: x86: make FNAME(fetch) and __direct_map more similar
    - KVM: x86: remove now unneeded hugepage gfn adjustment
    - KVM: x86: change kvm_mmu_page_get_gfn BUG_ON to WARN_ON
    - KVM: x86: add tracepoints around __direct_map and FNAME(fetch)
    - kvm: x86, powerpc: do not allow clearing largepages debugfs entry
    - SAUCE: KVM: vmx, svm: always run with EFER.NXE=1 when shadow paging is
      active
    - SAUCE: x86: Add ITLB_MULTIHIT bug infrastructure
    - SAUCE: kvm: mmu: ITLB_MULTIHIT mitigation
    - SAUCE: kvm: Add helper function for creating VM worker threads
    - SAUCE: kvm: x86: mmu: Recovery of shattered NX large pages
    - SAUCE: cpu/speculation: Uninline and export CPU mitigations helpers
    - SAUCE: kvm: x86: mmu: Apply global mitigations knob to ITLB_MULTIHIT

  * CVE-2019-11135
    - KVM: x86: use Intel speculation bugs and features as derived in generic x86
      code
    - x86/msr: Add the IA32_TSX_CTRL MSR
    - x86/cpu: Add a helper function x86_read_arch_cap_msr()
    - x86/cpu: Add a "tsx=" cmdline option with TSX disabled by default
    - x86/speculation/taa: Add mitigation for TSX Async Abort
    - x86/speculation/taa: Add sysfs reporting for TSX Async Abort
    - kvm/x86: Export MDS_NO=0 to guests when TSX is enabled
    - x86/tsx: Add "auto" option to the tsx= cmdline parameter
    - x86/speculation/taa: Add documentation for TSX Async Abort
    - x86/tsx: Add config options to set tsx=on|off|auto
    - SAUCE: x86/speculation/taa: Call tsx_init()
    - [Config] Disable TSX by default when possible

  * CVE-2019-0154
    - SAUCE: drm/i915: Lower RM timeout to avoid DSI hard hangs
    - SAUCE: drm/i915/gen8+: Add RC6 CTX corruption WA

  * CVE-2019-0155
    - SAUCE: drm/i915: Rename gen7 cmdparser tables
    - SAUCE: drm/i915: Disable Secure Batches for gen6+
    - SAUCE: drm/i915: Remove Master tables from cmdparser
    - SAUCE: drm/i915: Add support for mandatory cmdparsing
    - SAUCE: drm/i915: Support ro ppgtt mapped cmdparser shadow buffers
    - SAUCE: drm/i915: Allow parsing of unsized batches
    - SAUCE: drm/i915: Add gen9 BCS cmdparsing
    - SAUCE: drm/i915/cmdparser: Use explicit goto for error paths
    - SAUCE: drm/i915/cmdparser: Add support for backward jumps
    - SAUCE: drm/i915/cmdparser: Ignore Length operands during command matching

linux (5.0.0-34.36) disco; urgency=medium

  * disco/linux: <version to be filled> -proposed tracker (LP: #1850574)

  * [REGRESSION] md/raid0: cannot as...

Changed in linux (Ubuntu Disco):
status: Fix Committed → Fix Released
Revision history for this message
Launchpad Janitor (janitor) wrote :
Download full text (24.0 KiB)

This bug was fixed in the package linux - 4.15.0-69.78

---------------
linux (4.15.0-69.78) bionic; urgency=medium

  * KVM NULL pointer deref (LP: #1851205)
    - KVM: nVMX: handle page fault in vmread fix

  * CVE-2018-12207
    - KVM: MMU: drop vcpu param in gpte_access
    - kvm: Convert kvm_lock to a mutex
    - kvm: x86: Do not release the page inside mmu_set_spte()
    - KVM: x86: make FNAME(fetch) and __direct_map more similar
    - KVM: x86: remove now unneeded hugepage gfn adjustment
    - KVM: x86: change kvm_mmu_page_get_gfn BUG_ON to WARN_ON
    - KVM: x86: add tracepoints around __direct_map and FNAME(fetch)
    - kvm: x86, powerpc: do not allow clearing largepages debugfs entry
    - SAUCE: KVM: vmx, svm: always run with EFER.NXE=1 when shadow paging is
      active
    - SAUCE: x86: Add ITLB_MULTIHIT bug infrastructure
    - SAUCE: kvm: mmu: ITLB_MULTIHIT mitigation
    - SAUCE: kvm: Add helper function for creating VM worker threads
    - SAUCE: kvm: x86: mmu: Recovery of shattered NX large pages
    - SAUCE: cpu/speculation: Uninline and export CPU mitigations helpers
    - SAUCE: kvm: x86: mmu: Apply global mitigations knob to ITLB_MULTIHIT

  * CVE-2019-11135
    - KVM: x86: use Intel speculation bugs and features as derived in generic x86
      code
    - x86/msr: Add the IA32_TSX_CTRL MSR
    - x86/cpu: Add a helper function x86_read_arch_cap_msr()
    - x86/cpu: Add a "tsx=" cmdline option with TSX disabled by default
    - x86/speculation/taa: Add mitigation for TSX Async Abort
    - x86/speculation/taa: Add sysfs reporting for TSX Async Abort
    - kvm/x86: Export MDS_NO=0 to guests when TSX is enabled
    - x86/tsx: Add "auto" option to the tsx= cmdline parameter
    - x86/speculation/taa: Add documentation for TSX Async Abort
    - x86/tsx: Add config options to set tsx=on|off|auto
    - SAUCE: x86/speculation/taa: Call tsx_init()
    - SAUCE: x86/cpu: Include cpu header from bugs.c
    - [Config] Disable TSX by default when possible

  * CVE-2019-0154
    - SAUCE: drm/i915: Lower RM timeout to avoid DSI hard hangs
    - SAUCE: drm/i915/gen8+: Add RC6 CTX corruption WA

  * CVE-2019-0155
    - drm/i915/gtt: Add read only pages to gen8_pte_encode
    - drm/i915/gtt: Read-only pages for insert_entries on bdw+
    - drm/i915/gtt: Disable read-only support under GVT
    - drm/i915: Prevent writing into a read-only object via a GGTT mmap
    - drm/i915/cmdparser: Check reg_table_count before derefencing.
    - drm/i915/cmdparser: Do not check past the cmd length.
    - drm/i915: Silence smatch for cmdparser
    - drm/i915: Move engine->needs_cmd_parser to engine->flags
    - SAUCE: drm/i915: Rename gen7 cmdparser tables
    - SAUCE: drm/i915: Disable Secure Batches for gen6+
    - SAUCE: drm/i915: Remove Master tables from cmdparser
    - SAUCE: drm/i915: Add support for mandatory cmdparsing
    - SAUCE: drm/i915: Support ro ppgtt mapped cmdparser shadow buffers
    - SAUCE: drm/i915: Allow parsing of unsized batches
    - SAUCE: drm/i915: Add gen9 BCS cmdparsing
    - SAUCE: drm/i915/cmdparser: Use explicit goto for error paths
    - SAUCE: drm/i915/cmdparser: Add support for backward jumps
    - SAUCE: drm/i915/cmdpar...

Changed in linux (Ubuntu Bionic):
status: Fix Committed → Fix Released
dann frazier (dannf)
no longer affects: mdadm (Debian)
Revision history for this message
dann frazier (dannf) wrote :

In Comment #3, I noted that it was mysterious that we were seeing this at all on the reported system - but, after staring at the log, I think I've an explanation for that now.

This system is supposed to be configured to have 8 identical NVMe drives in a raid0 mounted at /raid. There are also 2 other NVMes in this system which are supposed to have partitions configured in a raid1 for /. At least, that is what was *supposed* to be the case.

A filtered version of the log in comment #1 shows:
[ 16.757165] md/raid0:md0: cannot assemble multi-zone RAID0 with default_layout setting
[ 16.757165] md/raid0: please set raid.default_layout to 1 or 2
[ 16.757166] md: pers->run() failed ...
[ 19.051379] md1: detected capacity change from 0 to 30724962910208
[ 72.720232] md/raid0:md0: cannot assemble multi-zone RAID0 with default_layout setting
[ 72.728149] md/raid0: please set raid.default_layout to 1 or 2
[ 72.733979] md: pers->run() failed ...

While not explicit, we see that md1 is ginormous - matching the capacity we'd expect for the 8 drive raid0 that's supposed to be mounted at /raid. However, the md/raid0 driver is actually complaining about md*0*. I'm guessing that md0 is the array of 2 partitions that was supposed to be a raid1 mounted at /, but was misconfigured as a raid0. And it therefore makes sense that it is a multi-zone array, as we see that only one NVMe seems to be partitioned:

[ 16.541847] nvme1n1: p1 p2

Presumably nvme1n1p1 is used as the EFI System Partition, and presumably nvme1n1p2 was combined with the full nvme0 block device to form a heterogenous raid0, which would therefore be multi-zone.

Revision history for this message
dann frazier (dannf) wrote :

Aha - found the curtin installation log - this proves the theory in the previous comment:

Oct 24 00:03:33 akis cloud-init[2796]: mdadm detail scan after assemble:
Oct 24 00:03:33 akis cloud-init[2796]: ARRAY /dev/md/akis:0 level=raid0 num-devices=2 metadata=1.2 name=akis:0 UUID=01fd64cc:287c6b15:429f460f:3d53364b
Oct 24 00:03:33 akis cloud-init[2796]: devices=/dev/nvme0n1p2,/dev/nvme1n1
Oct 24 00:03:33 akis cloud-init[2796]: ARRAY /dev/md/akis:1 level=raid0 num-devices=8 metadata=1.2 name=akis:1 UUID=567cc4bb:9ab3f3ac:5d876c61:320a20ad
Oct 24 00:03:33 akis cloud-init[2796]: devices=/dev/nvme2n1,/dev/nvme3n1,/dev/nvme4n1,/dev/nvme5n1,/dev/nvme6n1,/dev/nvme7n1,/dev/nvme8n1,/dev/nvme9n1

Revision history for this message
Launchpad Janitor (janitor) wrote :
Download full text (33.2 KiB)

This bug was fixed in the package linux - 5.3.0-24.26

---------------
linux (5.3.0-24.26) eoan; urgency=medium

  * eoan/linux: 5.3.0-24.26 -proposed tracker (LP: #1852232)

  * Eoan update: 5.3.9 upstream stable release (LP: #1851550)
    - io_uring: fix up O_NONBLOCK handling for sockets
    - dm snapshot: introduce account_start_copy() and account_end_copy()
    - dm snapshot: rework COW throttling to fix deadlock
    - Btrfs: fix inode cache block reserve leak on failure to allocate data space
    - btrfs: qgroup: Always free PREALLOC META reserve in
      btrfs_delalloc_release_extents()
    - iio: adc: meson_saradc: Fix memory allocation order
    - iio: fix center temperature of bmc150-accel-core
    - libsubcmd: Make _FORTIFY_SOURCE defines dependent on the feature
    - perf tests: Avoid raising SEGV using an obvious NULL dereference
    - perf map: Fix overlapped map handling
    - perf script brstackinsn: Fix recovery from LBR/binary mismatch
    - perf jevents: Fix period for Intel fixed counters
    - perf tools: Propagate get_cpuid() error
    - perf annotate: Propagate perf_env__arch() error
    - perf annotate: Fix the signedness of failure returns
    - perf annotate: Propagate the symbol__annotate() error return
    - perf annotate: Fix arch specific ->init() failure errors
    - perf annotate: Return appropriate error code for allocation failures
    - perf annotate: Don't return -1 for error when doing BPF disassembly
    - staging: rtl8188eu: fix null dereference when kzalloc fails
    - RDMA/siw: Fix serialization issue in write_space()
    - RDMA/hfi1: Prevent memory leak in sdma_init
    - RDMA/iw_cxgb4: fix SRQ access from dump_qp()
    - RDMA/iwcm: Fix a lock inversion issue
    - HID: hyperv: Use in-place iterator API in the channel callback
    - kselftest: exclude failed TARGETS from runlist
    - selftests/kselftest/runner.sh: Add 45 second timeout per test
    - nfs: Fix nfsi->nrequests count error on nfs_inode_remove_request
    - arm64: cpufeature: Effectively expose FRINT capability to userspace
    - arm64: Fix incorrect irqflag restore for priority masking for compat
    - arm64: ftrace: Ensure synchronisation in PLT setup for Neoverse-N1 #1542419
    - tty: serial: owl: Fix the link time qualifier of 'owl_uart_exit()'
    - tty: serial: rda: Fix the link time qualifier of 'rda_uart_exit()'
    - serial/sifive: select SERIAL_EARLYCON
    - tty: n_hdlc: fix build on SPARC
    - misc: fastrpc: prevent memory leak in fastrpc_dma_buf_attach
    - RDMA/core: Fix an error handling path in 'res_get_common_doit()'
    - RDMA/cm: Fix memory leak in cm_add/remove_one
    - RDMA/nldev: Reshuffle the code to avoid need to rebind QP in error path
    - RDMA/mlx5: Do not allow rereg of a ODP MR
    - RDMA/mlx5: Order num_pending_prefetch properly with synchronize_srcu
    - RDMA/mlx5: Add missing synchronize_srcu() for MW cases
    - gpio: max77620: Use correct unit for debounce times
    - fs: cifs: mute -Wunused-const-variable message
    - arm64: vdso32: Fix broken compat vDSO build warnings
    - arm64: vdso32: Detect binutils support for dmb ishld
    - serial: mctrl_gpio: Check for NULL pointer
    - serial: 8250_...

Changed in linux (Ubuntu Focal):
status: Incomplete → Fix Released
dann frazier (dannf)
no longer affects: ubuntu-release-notes
To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Bug attachments

Remote bug watches

Bug watches keep track of this bug in other bug trackers.