md: detects stale members ahead of in-sync members
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
grub2 (Ubuntu) | New | Undecided | Unassigned |
Bug Description
My system boots from XFS on RAID10 on GPT partitions (no LVM). The RAID10 uses the "far2" layout, and has three component devices. I use grub-pc for non-EFI booting, because this system is old and doesn't support EFI (Intel DG965WH from 2008).
I added a fourth hard drive and shuffled my data around so I could re-partition the existing drives. (http://
A few weeks after the final `mdadm /dev/md0 --replace /dev/sda1 --with /dev/sdd1`, grub failed to boot. Error messages included "invalid arch-independent ELF magic", and `insmod linux` gave "not a regular file". Booting an Ubuntu live USB showed no problem with the FS, and none of `dpkg-reconfigure grub-pc`, `grub-install /dev/sda`, or `update-grub` helped. Before those attempts to fix it, grub was loading a messed-up menu but not quite booting Linux. After re-running grub-install, it stopped at the `grub rescue>` prompt.
sda is the first BIOS disk, but even having my BIOS boot a different disk didn't help. Presumably that doesn't affect the order GRUB detects them in.
I eventually solved the problem by swapping the SATA cables so that the drive holding only a stale (replaced) member of the boot array was no longer the first BIOS drive. Now everything works perfectly.
I think GRUB's md code is including the first N members it sees, whether they're stale or not. Linux's MD code finds all candidates, and then picks N in-sync ones if available.
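To make the suspected difference concrete, here is a minimal sketch of the two selection policies. It is an illustration of my hypothesis, not GRUB's or the kernel's actual code: the `Member` type, the function names, and the device labels are all made up, and "in sync" is simplified to "has the highest event count seen".

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Member:
    device: str  # hypothetical label, listed in discovery order
    role: int    # "Device Role" slot in the array
    events: int  # superblock event counter


def pick_first_seen(members, raid_devices):
    """Suspected behaviour: take the first N members found, stale or not."""
    return list(members[:raid_devices])


def pick_in_sync(members, raid_devices):
    """Scan every candidate first, then keep only those at the newest event count."""
    if not members:
        return []
    newest = max(m.events for m in members)
    fresh = [m for m in members if m.events == newest]
    return fresh[:raid_devices]


# Discovery order with the stale component on the first disk,
# using the event counts from the mdadm output in this report.
members = [
    Member("disk0", role=2, events=2708),  # stale, left over from --replace
    Member("disk1", role=0, events=2740),
    Member("disk2", role=1, events=2740),
    Member("disk3", role=2, events=2740),
]

first_seen = [m.device for m in pick_first_seen(members, 3)]
in_sync = [m.device for m in pick_in_sync(members, 3)]
```

With the stale component discovered first, the first-seen policy assembles the array with `disk0` in slot 2, while the scan-all policy skips it in favour of `disk3`.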
This was really hard to diagnose, because disk churn hadn't got the data so far out of sync that there were XFS errors. Directory listings of /boot/grub/i386-pc worked from the grub rescue shell, but the actual data in some of the files didn't match. (Some of the inode contents were different too, hence the "not a regular file" error.)
I think wiping the RAID signature would have solved the problem as well (`mdadm --zero-superblock /dev/sda2`, after making sure in the live-USB environment that it was actually the stale device).
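One way to confirm which device is stale before zeroing anything is to compare the `Events` counters reported by `mdadm --examine`. The following is a rough sketch that parses saved output text; the helper names are hypothetical, and it assumes the 1.x superblock output format shown in this report, where `Events` is a plain integer.

```python
import re


def parse_examine(text):
    """Return {device: events} from concatenated `mdadm --examine` output."""
    events = {}
    device = None
    for line in text.splitlines():
        # A section starts with a header line such as "/dev/sdd2:".
        header = re.match(r"^(/dev/\S+):", line.strip())
        if header:
            device = header.group(1)
            continue
        field = re.match(r"^\s*Events\s*:\s*(\d+)", line)
        if field and device is not None:
            events[device] = int(field.group(1))
    return events


def stale_devices(events):
    """List devices whose event count is behind the newest one seen."""
    if not events:
        return []
    newest = max(events.values())
    return sorted(dev for dev, count in events.items() if count < newest)


sample = """/dev/sdd2:
 Events : 2708
 Device Role : Active device 2
/dev/sda2:
 Events : 2740
 Device Role : Active device 1
"""

counts = parse_examine(sample)
stale = stale_devices(counts)
```

This only reads text and never touches a device. Any device it reports should still be double-checked by hand before running `--zero-superblock` on it.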
Here's `mdadm -E` from the stale component (which was sda2 before swapping cables; now it's sdd2).
This is what a component looks like after a --replace and --remove are done with it. It is followed by an in-sync component for comparison, and then by the output of `mdadm --detail /dev/md/root`.
peter@tesla:~$ sudo mdadm --examine /dev/sdd2
/dev/sdd2: #######
Magic : a92b4efc
Version : 1.2
Feature Map : 0x0
Array UUID : e0ad8202:
Name : tesla:root (local to host tesla)
Creation Time : Thu Apr 16 14:26:50 2015 ### note that's 2015, last year.
Raid Level : raid10
Raid Devices : 3
Avail Dev Size : 30703616 (14.64 GiB 15.72 GB)
Array Size : 23027712 (21.96 GiB 23.58 GB)
Data Offset : 16384 sectors
Super Offset : 8 sectors
Unused Space : before=16296 sectors, after=0 sectors
State : clean
Device UUID : 8ae879d7:
Update Time : Wed Mar 16 02:49:17 2016
Bad Block Log : 512 entries available at offset 72 sectors
Checksum : 1c62e134 - correct
Events : 2708
Layout : far=2
Chunk Size : 1024K
Device Role : Active device 2
Array State : AAR ('A' == active, '.' == missing, 'R' == replacing)
/dev/sda2: ##### An in-sync component
Magic : a92b4efc
Version : 1.2
Feature Map : 0x0
Array UUID : e0ad8202:
Name : tesla:root (local to host tesla)
Creation Time : Thu Apr 16 14:26:50 2015
Raid Level : raid10
Raid Devices : 3
Avail Dev Size : 30703616 (14.64 GiB 15.72 GB)
Array Size : 23027712 (21.96 GiB 23.58 GB)
Data Offset : 16384 sectors
Super Offset : 8 sectors
Unused Space : before=16296 sectors, after=0 sectors
State : clean
Device UUID : 5d6bb778:
Update Time : Sat Apr 9 16:48:18 2016
Bad Block Log : 512 entries available at offset 72 sectors
Checksum : 4e39b4c0 - correct
Events : 2740
Layout : far=2
Chunk Size : 1024K
Device Role : Active device 1
Array State : AAA ('A' == active, '.' == missing, 'R' == replacing)
peter@tesla:~$ sudo mdadm --detail /dev/md/root
/dev/md/root:
Version : 1.2
Creation Time : Thu Apr 16 14:26:50 2015
Raid Level : raid10
Array Size : 23027712 (21.96 GiB 23.58 GB)
Used Dev Size : 15351808 (14.64 GiB 15.72 GB)
Raid Devices : 3
Total Devices : 3
Persistence : Superblock is persistent
Update Time : Sat Apr 9 21:19:32 2016
State : clean
Active Devices : 3
Working Devices : 3
Failed Devices : 0
Spare Devices : 0
Layout : far=2
Chunk Size : 1024K
Name : tesla:root (local to host tesla)
UUID : e0ad8202:
Events : 2740
Number Major Minor RaidDevice State
3 8 18 0 active sync /dev/sdb2
4 8 2 1 active sync /dev/sda2
6 8 34 2 active sync /dev/sdc2
ProblemType: Bug
DistroRelease: Ubuntu 15.10
Package: grub-pc 2.02~beta2-
ProcVersionSign
Uname: Linux 4.2.0-35-generic x86_64
ApportVersion: 2.19.1-0ubuntu5
Architecture: amd64
CurrentDesktop: KDE
Date: Sat Apr 9 20:53:19 2016
SourcePackage: grub2
UpgradeStatus: Upgraded to wily on 2015-11-12 (149 days ago)