[gutsy] partitions no longer detected as RAID components after repairing degraded RAID 1 mirror
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
udev | Fix Released | Undecided | Unassigned |
udev (Ubuntu) | Fix Released | Medium | Scott James Remnant (Canonical) |
Bug Description
Binary package hint: udev
My system was correctly booting from a mirrored PATA (via libata) RAID 1 until one of the disks was removed. I then hit the "degraded mode doesn't boot" problem for a while. This all happened under Gutsy (kept up to date daily).
After that I replaced the missing disk, while the system was powered off, and then:
* booted with 'break=mount' on the kernel command line.
* waited until detection of hardware had completed.
* ran 'exec mdadm -As' to detect the degraded RAID and continue to boot
* ran 'sfdisk -d /dev/sda | sfdisk /dev/sdb' to partition the new disk
* ran 'mdadm -a /dev/md0 /dev/sdb1'
* ran 'mdadm -a /dev/md1 /dev/sdb2'
* waited for the system to resync the data
* checked that /proc/mdstat showed all healthy
* rebooted
At this point I expected, naturally, to have the system boot cleanly without any problems or delay. Instead the system simply ground to a halt after the three-minute boot timeout, without the RAID being detected.
After some investigation this looks, to me, like a problem with identifying how the component devices are being used.
The udev rules for mdadm depend on the ENV{ID_FS_*} values that vol_id reports in order to identify a partition as a RAID member. The udev event for /dev/sda1 shows:
UDEV [1187654052.620855] add /block/sda/sda1 (block)
UDEV_LOG=3
ACTION=add
DEVPATH=
SUBSYSTEM=block
SEQNUM=1750
MINOR=1
MAJOR=8
PHYSDEVPATH=
PHYSDEVBUS=scsi
PHYSDEVDRIVER=sd
UDEVD_EVENT=1
DEVTYPE=partition
ID_VENDOR=ATA
ID_MODEL=
ID_REVISION=UE10
ID_SERIAL=
ID_SERIAL_
ID_TYPE=disk
ID_BUS=scsi
ID_ATA_
ID_PATH=
ID_FS_USAGE=
ID_FS_TYPE=ext3
ID_FS_VERSION=1.0
ID_FS_UUID=
ID_FS_UUID_
ID_FS_LABEL=
ID_FS_LABEL_
ID_FS_LABEL_
DEVNAME=/dev/sda1
DEVLINKS=
Note that the 'ID_FS_TYPE' value is ext3, the file system inside the RAID array, rather than anything identifying this partition as a RAID member.
The same misidentification is present for the swap RAID1 and the other component; I can supply logs showing that if it matters.
The RAID array itself is a healthy RAID1 with version 1.0 metadata:
daniel@enki:~$ cat /proc/mdstat
Personalities : [linear] [multipath] [raid0] [raid1] [raid6] [raid5] [raid4] [raid10]
md1 : active raid1 sda2[0] sdb2[1]
1975984 blocks super 1.0 [2/2] [UU]
md0 : active raid1 sdb1[0] sda1[2]
76204224 blocks super 1.0 [2/2] [UU]
unused devices: <none>
daniel@enki:~$ sudo mdadm -D /dev/md0
[sudo] password for daniel:
/dev/md0:
Version : 01.00.03
Creation Time : Wed May 16 01:08:07 2007
Raid Level : raid1
Array Size : 76204224 (72.67 GiB 78.03 GB)
Used Dev Size : 152408448 (72.67 GiB 78.03 GB)
Raid Devices : 2
Total Devices : 2
Preferred Minor : 0
Persistence : Superblock is persistent
Update Time : Tue Aug 21 10:21:04 2007
State : active
Active Devices : 2
Working Devices : 2
Failed Devices : 0
Spare Devices : 0
Name : enki-root
UUID : 3a6b05ca:
Events : 802619
Number Major Minor RaidDevice State
0 8 17 0 active sync /dev/sdb1
2 8 1 1 active sync /dev/sda1
However, there is one very odd factor in this:
daniel@enki:~$ sudo mdadm -E /dev/sda1
/dev/sda1:
Magic : a92b4efc
Version : 01
Feature Map : 0x0
Array UUID : 3a6b05ca:
Name : enki-root
Creation Time : Wed May 16 01:08:07 2007
Raid Level : raid1
Raid Devices : 2
Used Dev Size : 152408448 (72.67 GiB 78.03 GB)
Array Size : 152408448 (72.67 GiB 78.03 GB)
Super Offset : 152408576 sectors
State : active
Device UUID : fc1de0a0:
Update Time : Tue Aug 21 10:21:24 2007
Checksum : 655ae10b - correct
Events : 802619
Array Slot : 2 (0, failed, 1)
Array State : uU 1 failed
daniel@enki:~$ sudo mdadm -E /dev/sdb1
/dev/sdb1:
Magic : a92b4efc
Version : 01
Feature Map : 0x0
Array UUID : 3a6b05ca:
Name : enki-root
Creation Time : Wed May 16 01:08:07 2007
Raid Level : raid1
Raid Devices : 2
Used Dev Size : 152408448 (72.67 GiB 78.03 GB)
Array Size : 152408448 (72.67 GiB 78.03 GB)
Super Offset : 152408576 sectors
State : active
Device UUID : 71575927:
Update Time : Tue Aug 21 10:21:26 2007
Checksum : 73d44564 - correct
Events : 802619
Array Slot : 0 (0, failed, 1)
Array State : Uu 1 failed
It looks like the MD device has a happy RAID1 header but the individual components have metadata that indicates that /both/ of them are part of a failed RAID array.
In any case, the misidentification of how the components are used means that mdadm is never called on my system.
Please let me know if I can assist in debugging this further or in providing any specific testing. I am happy to build whatever tools are needed with debugging enabled, etc.
Changed in udev: | |
assignee: | nobody → keybuk |
status: | New → Confirmed |
Changed in udev: | |
status: | Confirmed → Fix Committed |
Changed in udev: | |
importance: | Undecided → Medium |
Changed in udev: | |
status: | Fix Committed → Fix Released |
Please find attached a patch that resolves this issue.
I have tracked it down to the vol_id code using the wrong superblock offset to locate the metadata within the partition, at least for version 1.0 (new, at end of device) superblocks.
The attached patch implements the correct location calculation for the 1.0 superblock based on the code present in the current gutsy version of mdadm, suitably modified to fit the coding style of the udev helper.
I have tested this and verified that it correctly identifies my devices as RAID members rather than as plain ext3 file system content.
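For reference, a minimal sketch of the two superblock locations involved, assuming byte offsets from the start of a device whose size is a multiple of 512 bytes; the 1.0 calculation follows the logic in mdadm's super1.c, and the function names here are illustrative rather than the actual vol_id identifiers:

#include <stdint.h>

#define MD_RESERVED_BYTES 0x10000ULL  /* 64 KiB reserved for 0.90 metadata */

/* 0.90 superblock: the last 64 KiB-aligned 64 KiB block of the device */
static uint64_t md_sb_offset_0_90(uint64_t dev_size)
{
        return (dev_size & ~(MD_RESERVED_BYTES - 1)) - MD_RESERVED_BYTES;
}

/* 1.0 superblock: at least 8 KiB from the end, rounded down to a 4 KiB boundary */
static uint64_t md_sb_offset_1_0(uint64_t dev_size)
{
        return (dev_size - 0x2000) & ~0xfffULL;
}

In general the two candidates land on different sectors, which is why a probe that only checks the 0.90 location can miss the superblock of a 1.0 array entirely.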
I think my patch is technically in error, in that it uses both the old and new calculations to try to locate the superblock on the device for both version 0.9 and 1.0 metadata. I suspect, but due to illness don't have the time to verify, that we should use the older method only for 0.9 superblocks and the new method only for 1.0 superblocks.
That said, this isn't actually a big problem: when there isn't a valid RAID superblock at a candidate offset, the code simply continues to the next test, so the extra check is harmless.
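To illustrate why trying both locations is harmless, a rough sketch of the probe logic; read_block() and the structure layout here are hypothetical stand-ins rather than the real vol_id API:

#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define MD_SB_MAGIC 0xa92b4efcU   /* the 'Magic' value shown by mdadm -E */

struct md_minimal_sb {
        uint32_t magic;
        uint32_t major_version;
        /* ... remaining superblock fields omitted ... */
};

/* stand-in for whatever buffer helper the probe code already uses */
extern const void *read_block(uint64_t offset, size_t len);

static bool is_raid_member(uint64_t dev_size)
{
        /* candidate offsets from the calculations sketched earlier */
        uint64_t candidates[2] = {
                (dev_size & ~0xffffULL) - 0x10000,   /* 0.90 location */
                (dev_size - 0x2000) & ~0xfffULL,     /* 1.0 location  */
        };

        for (int i = 0; i < 2; i++) {
                const struct md_minimal_sb *sb = read_block(candidates[i], sizeof(*sb));

                /* no magic at this offset: harmless, just try the next one */
                if (sb == NULL || sb->magic != MD_SB_MAGIC)
                        continue;
                return true;    /* report the partition as a RAID member */
        }
        return false;           /* fall through to the next filesystem probe */
}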
This is an upstream bug, so far as I can tell, since the vol_id code is not modified in the Ubuntu/Debian patch applied to the package.
I also think this should be pushed into the gutsy release -- at the moment Gutsy will fail to boot on a software RAID device with 1.0 metadata despite the array being fully healthy and correct.
Regards, Daniel