mdmonitor doesn't start recovery immediately

Bug #1888812 reported by Mariusz Tkaczyk
This bug affects 2 people
Affects          Status         Importance   Assigned to   Milestone
mdadm (Ubuntu)   Fix Released   Undecided    Unassigned
Impish           Fix Released   Undecided    Unassigned

Bug Description

mdmonitor reacts to md events by polling the /proc/mdstat file. These events are generated when the kernel observes a change on any md device. This happens asynchronously and can be caused by a user-space process (mdadm called by udev or by the user) or by the kernel itself (for example, a drive is removed because it has too many errors).
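
As a rough illustration of what a monitor sees when it reads /proc/mdstat, the sketch below lists each md device and its state. The function name `list_md_devices` and its optional path argument are hypothetical and exist only for this example; mdmonitor itself is part of mdadm and is not implemented this way.

```shell
#!/bin/sh
# Hypothetical helper: print "<name> <state>" for every md device found in
# an mdstat-style file. Defaults to /proc/mdstat; the optional path argument
# is an assumption added so the function can be tried on a saved copy.
list_md_devices() {
    # Device lines look like: "md126 : active raid10 nvme3n1[3] ..."
    awk '/^md[^ ]* :/ { print $1, $3 }' "${1:-/proc/mdstat}"
}
```

Calling `list_md_devices` before and after an event (for example, a drive removal) shows which devices changed state.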

The problem is that mdmonitor does not coordinate with user space or udev. When a drive with metadata is inserted, udev runs mdadm, which adds the drive to an md device. An md event is then generated, and mdmonitor may try to move the drive to another md device if needed. To do so it relies on by-path links, but the link for the newly appeared device has not been created yet because udev is still processing the device. As a result, recovery doesn't start immediately.
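
The race described above can be pictured with a small sketch that waits for a path, such as a by-path link, to appear before acting on it. The function name `wait_for_path` and its timeout parameter are hypothetical, and this is not the fix that was shipped in mdadm. On a live system, `udevadm settle` serves a similar purpose by waiting for the udev event queue to drain; plain polling is used here only so the example runs anywhere.

```shell
#!/bin/sh
# Hypothetical helper: wait until a path (e.g. a /dev/disk/by-path link) exists.
# $1 = path to wait for
# $2 = timeout in whole seconds (assumed default: 10)
# Returns 0 once the path exists, 1 if the timeout expires first.
wait_for_path() {
    path=$1
    remaining=${2:-10}
    while [ ! -e "$path" ]; do
        [ "$remaining" -le 0 ] && return 1
        sleep 1
        remaining=$((remaining - 1))
    done
    return 0
}
```

A caller would run something like `wait_for_path "$link" 5 || echo "link not ready"`, where `$link` is a placeholder for the by-path link of the newly inserted drive.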

Observed on Ubuntu 20.04.

Steps to reproduce:

1. Create RAID volume:
# mdadm --create /dev/md/imsm0 --metadata=imsm --raid-devices=4 /dev/nvme6n1 /dev/nvme1n1 /dev/nvme7n1 /dev/nvme3n1 --run
# mdadm --create /dev/md/r10d4s64-20_A --level=10 --chunk 64 --raid-devices=4 /dev/nvme6n1 /dev/nvme1n1 /dev/nvme7n1 /dev/nvme3n1 --run

2. Add spare to container:
# mdadm --add /dev/md/imsm0 /dev/nvme0n1

3. Add an appropriate policy line to /etc/mdadm/mdadm.conf:

POLICY domain=RAID_DOMAIN_1 path=* action=spare-same-slot

4. Disconnect spare from container.

5. Start the mdadm monitor with a long delay (the command below uses 6000 seconds, i.e. 100 minutes):
# mdadm --monitor --delay 6000 --scan --mail=root@localhost --daemonize --syslog

6. Hot remove disk from array (physical disconnect).

7. Connect previously prepared spare.

Expected results:
Rebuild should start.

Actual results:
Rebuild does not start; the added spare ends up in a separate container.
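
One way to tell the expected result from the actual one is to look for a recovery progress line in /proc/mdstat after step 7. The sketch below assumes the usual `recovery = N%` progress line that the kernel prints while a rebuild is running; the function name `recovery_in_progress` and its optional path argument are hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: succeed (exit status 0) if an mdstat-style file shows
# a recovery in progress. Defaults to /proc/mdstat; the optional path
# argument is an assumption added so a saved copy can be checked instead.
recovery_in_progress() {
    # A running rebuild shows a line such as:
    #   [>....................]  recovery =  0.5% (...)
    grep -q 'recovery *=' "${1:-/proc/mdstat}"
}
```

For example, `recovery_in_progress && echo "rebuild started"` prints nothing in the failing case reported here.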

Tags: vroc
Mariusz Tkaczyk (mtkaczyk) wrote :

Hi,
mdmonitor needs to deal with other tasks, issue still in development.

Thanks, Mariusz

description: updated
tags: added: vroc
Mariusz Tkaczyk (mtkaczyk) wrote :
Launchpad Janitor (janitor) wrote :

Status changed to 'Confirmed' because the bug affects multiple users.

Changed in mdadm (Ubuntu):
status: New → Confirmed
Jeff Lane  (bladernr) wrote :

Can we please target this for 21.10?

Jeff Lane  (bladernr) wrote :

Is this resolved in 4.2-rc2?

Changed in mdadm (Ubuntu):
status: Confirmed → Incomplete
Mariusz Tkaczyk (mtkaczyk) wrote :

yes, it is resolved.

Jeff Lane  (bladernr)
Changed in mdadm (Ubuntu Impish):
status: Incomplete → Fix Committed
Jeff Lane  (bladernr)
Changed in mdadm (Ubuntu):
status: Fix Committed → Fix Released
Changed in mdadm (Ubuntu Impish):
status: Fix Committed → Fix Released