RAID5 reshape stuck due to same badblock on multiple devices

Bug #1882312 reported by Frode Sandholtbraaten
This bug affects 1 person
Affects: linux (Ubuntu)
Status: Confirmed
Importance: Undecided
Assigned to: Unassigned

Bug Description

Linux [hostname removed] 5.3.0-55-generic #49-Ubuntu SMP Thu May 21 12:47:19 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux

Ubuntu release: 19.10 (although the same issue is present in 18.04 and 20.04 as well).

A RAID5 reshape from 3 to 4 devices got stuck:

md127 : active raid5 sde1[5] sdd1[4] sdc1[0] sdf1[3]
      7813769216 blocks super 1.2 level 5, 512k chunk, algorithm 2 [4/4] [UUUU]
      [>....................] reshape = 1.8% (72261116/3906884608) finish=1663133.7min speed=38K/sec
      bitmap: 0/30 pages [0KB], 65536KB chunk

with the following stack trace:

[54979.996871] INFO: task md127_reshape:7090 blocked for more than 1208 seconds.
[54979.996922] Tainted: P OE 5.3.0-55-generic #49-Ubuntu
[54979.996967] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[54979.997018] md127_reshape D 0 7090 2 0x80004080
[54979.997019] Call Trace:
[54979.997022] __schedule+0x2b9/0x6c0
[54979.997023] schedule+0x42/0xb0
[54979.997027] reshape_request+0x878/0x950 [raid456]
[54979.997028] ? wait_woken+0x80/0x80
[54979.997030] raid5_sync_request+0x302/0x3b0 [raid456]
[54979.997032] md_do_sync.cold+0x3ef/0x999
[54979.997034] ? ecryptfs_write_begin+0x70/0x280
[54979.997034] ? __switch_to_asm+0x40/0x70
[54979.997035] ? __switch_to_asm+0x34/0x70
[54979.997035] ? __switch_to_asm+0x40/0x70
[54979.997036] ? __switch_to_asm+0x34/0x70
[54979.997036] ? __switch_to_asm+0x40/0x70
[54979.997037] ? __switch_to_asm+0x34/0x70
[54979.997038] md_thread+0x97/0x160
[54979.997040] kthread+0x104/0x140
[54979.997040] ? md_start_sync+0x60/0x60
[54979.997041] ? kthread_park+0x80/0x80
[54979.997042] ret_from_fork+0x35/0x40
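
To put the mdstat progress line above into perspective, here is a minimal Python sketch that parses it and converts the reported finish estimate into days. The regular expression and the helper name parse_reshape_line are my own assumptions, not part of any md tooling, and the pattern is only matched against the single line quoted in this report.

```python
import re

# Pattern for the reshape progress line as it appears in /proc/mdstat above.
# Assumption: the two counters in parentheses are 1 KiB blocks.
MDSTAT_PROGRESS = re.compile(
    r"reshape\s*=\s*(?P<pct>[\d.]+)%\s*"
    r"\((?P<done>\d+)/(?P<total>\d+)\)\s*"
    r"finish=(?P<finish>[\d.]+)min\s*"
    r"speed=(?P<speed>\d+)K/sec"
)


def parse_reshape_line(line):
    """Return progress figures from one mdstat reshape line, or None."""
    m = MDSTAT_PROGRESS.search(line)
    if m is None:
        return None
    done = int(m["done"])
    total = int(m["total"])
    finish_min = float(m["finish"])
    return {
        "percent": 100.0 * done / total,
        "remaining_blocks": total - done,
        "speed_k_per_sec": int(m["speed"]),
        "finish_min": finish_min,
        "eta_days": finish_min / (60 * 24),
    }


line = ("[>....................] reshape = 1.8% (72261116/3906884608) "
        "finish=1663133.7min speed=38K/sec")
info = parse_reshape_line(line)
print(f"{info['percent']:.2f}% done, ETA about {info['eta_days']:.0f} days")
```

At 38K/sec the estimate works out to roughly three years, so the array was effectively stalled rather than merely slow.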

No other hardware errors were reported, and the reshape got stuck at a slightly different block each time it was restarted, always within the same region. It turned out that md had recorded the exact same sector in the bad block log of multiple devices at some point before the reshape was started. This could be seen with "mdadm --examine-badblocks /dev/sdXY". The original cause of the bad block entries was probably a loose cable, as the reported sectors were fully readable with the "dd" and "badblocks" commands.
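
The check described above (the same sector logged as bad on more than one member) can be automated. The following Python sketch assumes that "mdadm --examine-badblocks" prints entries of the form "<sector> for <count> sectors" under a "Bad-blocks on <device>:" header; that format is my recollection, so verify it against your mdadm version. The sample text and sector numbers are made up for illustration and are not taken from the affected array.

```python
import re
from collections import defaultdict

# Assumed entry format: "<first sector> for <count> sectors".
ENTRY = re.compile(r"^\s*(\d+)\s+for\s+(\d+)\s+sectors", re.MULTILINE)


def parse_badblocks(text):
    """Extract (first_sector, sector_count) pairs from one device's output."""
    return [(int(start), int(count)) for start, count in ENTRY.findall(text)]


def shared_bad_sectors(per_device):
    """Map each sector logged on two or more devices to those devices.

    per_device: dict of device name -> list of (first_sector, sector_count).
    Sector numbers are compared exactly as reported, with no offset handling.
    """
    owners = defaultdict(set)
    for dev, ranges in per_device.items():
        for start, count in ranges:
            for sector in range(start, start + count):
                owners[sector].add(dev)
    return {s: sorted(devs) for s, devs in owners.items() if len(devs) >= 2}


# Hypothetical captured output, one string per member device.
SAMPLE = {
    "/dev/sdc1": "Bad-blocks on /dev/sdc1:\n           144263256 for 8 sectors\n",
    "/dev/sdd1": "Bad-blocks on /dev/sdd1:\n           144263256 for 8 sectors\n",
    "/dev/sdf1": "Bad-blocks list is empty in /dev/sdf1\n",
}

parsed = {dev: parse_badblocks(text) for dev, text in SAMPLE.items()}
shared = shared_bad_sectors(parsed)
if shared:
    print(f"{len(shared)} sectors are logged as bad on multiple devices, "
          f"from {min(shared)} to {max(shared)}")
```

Any sector reported by this check is a candidate for the kind of stall described here, since RAID5 cannot reconstruct a stripe when two members are marked bad at the same position.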

The problem was eventually resolved by removing the bad block log on the RAID5 device using "mdadm --assemble /dev/md0 --update=force-no-bbl". With the bad block log removed, the reshape progressed beyond the previously troublesome range of blocks.
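
For reference, the workaround can be written out as an explicit command sequence. This Python sketch only builds and prints the commands and does not execute anything. The "mdadm --stop" step and the explicit member list are my assumptions (the command quoted above names only the array), and the device names are taken from the mdstat output earlier in this report. Note that force-no-bbl discards the bad block records, so it is only reasonable when the logged sectors are known to be readable.

```python
import shlex


def bbl_removal_commands(array, members):
    """Build (but do not run) the commands for dropping the bad block log.

    Assumption: the array has to be stopped first, because --update only
    applies while the array is being assembled.
    """
    return [
        ["mdadm", "--stop", array],
        ["mdadm", "--assemble", array, "--update=force-no-bbl", *members],
    ]


commands = bbl_removal_commands(
    "/dev/md127",
    ["/dev/sdc1", "/dev/sdd1", "/dev/sde1", "/dev/sdf1"],
)
for argv in commands:
    # Printed for review only; run by hand after checking each argument.
    print(shlex.join(argv))
```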

I would have expected at least an error message in the kernel log rather than just a "hung task" message, ideally before the reshape was allowed to start (i.e. early termination). Furthermore, it would be helpful if mdadm allowed the bad block log to be cleared for a single device, rather than only removed on the array with "--update=force-no-bbl".
---
ProblemType: Bug
AlsaVersion: Advanced Linux Sound Architecture Driver Version k5.3.0-55-generic.
AplayDevices: Error: [Errno 2] No such file or directory: 'aplay': 'aplay'
ApportVersion: 2.20.11-0ubuntu8.9
Architecture: amd64
ArecordDevices: Error: [Errno 2] No such file or directory: 'arecord': 'arecord'
AudioDevicesInUse: Error: command ['fuser', '-v', '/dev/snd/by-path', '/dev/snd/controlC0', '/dev/snd/hwC0D2', '/dev/snd/hwC0D0', '/dev/snd/pcmC0D9p', '/dev/snd/pcmC0D8p', '/dev/snd/pcmC0D7p', '/dev/snd/pcmC0D3p', '/dev/snd/pcmC0D2c', '/dev/snd/pcmC0D1p', '/dev/snd/pcmC0D0c', '/dev/snd/pcmC0D0p', '/dev/snd/seq', '/dev/snd/timer'] failed with exit code 1:
Card0.Amixer.info: Error: [Errno 2] No such file or directory: 'amixer': 'amixer'
Card0.Amixer.values: Error: [Errno 2] No such file or directory: 'amixer': 'amixer'
DistroRelease: Ubuntu 19.10
HibernationDevice: RESUME=/dev/mapper/vg0-swap
IwConfig: Error: [Errno 2] No such file or directory: 'iwconfig': 'iwconfig'
Lsusb:
 Bus 004 Device 001: ID 1d6b:0003 Linux Foundation 3.0 root hub
 Bus 003 Device 001: ID 1d6b:0002 Linux Foundation 2.0 root hub
 Bus 002 Device 001: ID 1d6b:0003 Linux Foundation 3.0 root hub
 Bus 001 Device 001: ID 1d6b:0002 Linux Foundation 2.0 root hub
MachineType: System manufacturer System Product Name
NonfreeKernelModules: zfs zunicode zavl icp zcommon znvpair
Package: linux (not installed)
ProcEnviron:
 LC_CTYPE=en_US.UTF-8
 TERM=xterm
 PATH=(custom, no user)
 LANG=en_US.UTF-8
 SHELL=/bin/bash
ProcFB: 0 i915drmfb
ProcKernelCmdLine: BOOT_IMAGE=/vmlinuz-5.3.0-55-generic root=/dev/mapper/vg0-root ro swapaccount=1 acpi_enforce_resources=lax intel_iommu=on pci=assign-busses
ProcVersionSignature: Ubuntu 5.3.0-55.49-generic 5.3.18
RelatedPackageVersions:
 linux-restricted-modules-5.3.0-55-generic N/A
 linux-backports-modules-5.3.0-55-generic N/A
 linux-firmware 1.183.5
RfKill: Error: [Errno 2] No such file or directory: 'rfkill': 'rfkill'
Tags: eoan
Uname: Linux 5.3.0-55-generic x86_64
UnreportableReason: This report is about a package that is not installed.
UpgradeStatus: No upgrade log present (probably fresh install)
UserGroups:

_MarkForUpload: False
dmi.bios.date: 03/15/2018
dmi.bios.vendor: American Megatrends Inc.
dmi.bios.version: 1302
dmi.board.asset.tag: Default string
dmi.board.name: PRIME Z270-A
dmi.board.vendor: ASUSTeK COMPUTER INC.
dmi.board.version: Rev 1.xx
dmi.chassis.asset.tag: Default string
dmi.chassis.type: 3
dmi.chassis.vendor: Default string
dmi.chassis.version: Default string
dmi.modalias: dmi:bvnAmericanMegatrendsInc.:bvr1302:bd03/15/2018:svnSystemmanufacturer:pnSystemProductName:pvrSystemVersion:rvnASUSTeKCOMPUTERINC.:rnPRIMEZ270-A:rvrRev1.xx:cvnDefaultstring:ct3:cvrDefaultstring:
dmi.product.family: To be filled by O.E.M.
dmi.product.name: System Product Name
dmi.product.sku: SKU
dmi.product.version: System Version
dmi.sys.vendor: System manufacturer

Revision history for this message
Ubuntu Kernel Bot (ubuntu-kernel-bot) wrote : Missing required logs.

This bug is missing log files that will aid in diagnosing the problem. While running an Ubuntu kernel (not a mainline or third-party kernel) please enter the following command in a terminal window:

apport-collect 1882312

and then change the status of the bug to 'Confirmed'.

If, due to the nature of the issue you have encountered, you are unable to run this command, please add a comment stating that fact and change the bug status to 'Confirmed'.

This change has been made by an automated script, maintained by the Ubuntu Kernel Team.

Changed in linux (Ubuntu):
status: New → Incomplete
tags: added: eoan
Revision history for this message
Frode Sandholtbraaten (sfrode) wrote : AlsaDevices.txt

apport information

tags: added: apport-collected
description: updated
Revision history for this message
Frode Sandholtbraaten (sfrode) wrote : CRDA.txt

apport information

Revision history for this message
Frode Sandholtbraaten (sfrode) wrote : Card0.Codecs.codec.0.txt

apport information

Revision history for this message
Frode Sandholtbraaten (sfrode) wrote : Card0.Codecs.codec.2.txt

apport information

Revision history for this message
Frode Sandholtbraaten (sfrode) wrote : CurrentDmesg.txt

apport information

Revision history for this message
Frode Sandholtbraaten (sfrode) wrote : Lspci.txt

apport information

Revision history for this message
Frode Sandholtbraaten (sfrode) wrote : PciMultimedia.txt

apport information

Revision history for this message
Frode Sandholtbraaten (sfrode) wrote : ProcCpuinfoMinimal.txt

apport information

Revision history for this message
Frode Sandholtbraaten (sfrode) wrote : ProcInterrupts.txt

apport information

Revision history for this message
Frode Sandholtbraaten (sfrode) wrote : ProcModules.txt

apport information

Revision history for this message
Frode Sandholtbraaten (sfrode) wrote : UdevDb.txt

apport information

Revision history for this message
Frode Sandholtbraaten (sfrode) wrote : WifiSyslog.txt

apport information

Revision history for this message
Frode Sandholtbraaten (sfrode) wrote :

Please note that the apport report is collected AFTER the issue was resolved.

Changed in linux (Ubuntu):
status: Incomplete → Confirmed
Revision history for this message
Kai-Heng Feng (kaihengfeng) wrote :

Please test the latest mainline kernel:
https://kernel.ubuntu.com/~kernel-ppa/mainline/v5.8-rc3/

If the mainline kernel doesn't help, please raise the issue on the mailing list.
