Comment 6 for bug 1907262

Thimo E (thimoe) wrote :

Hi Matthew and all,

thank you for taking action immediately. I really appreciate your effort.

After investigating the issue further, I have to add that the discard mount option seems to trigger the issue, too.
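
If you want to check whether a filesystem is currently mounted with discard, something like this works (a sketch assuming ext4 on /; findmnt is part of util-linux, and the remount is only temporary, so remove discard from /etc/fstab as well):

findmnt -o TARGET,OPTIONS /
mount -o remount,nodiscard /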

@Trent
The general problem here is that RAID10 can balance a single read stream across all disks. That is probably its major advantage over RAID1: it effectively gives you RAID0 read speed, whereas RAID1 needs parallel reads to achieve the same.
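
You can observe this balancing yourself: start a single sequential read from the array and watch both members serve I/O (a rough illustration; iostat comes from the sysstat package, and the device names are from my setup):

dd if=/dev/md127 of=/dev/null bs=1M count=4096 &
iostat -x 1 nvme0n1 nvme1n1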

That said, it is no big surprise that several machines at our site went read-only after *some time* (probably once some filesystem-relevant data was read from the "bad disk"). Unfortunately, the "first disk is clean" scenario only holds if you act immediately; otherwise you may already have data corruption.
I verified this on one system where the root partition was affected, using the debsums tool (just run debsums -xa) after fixing the filesystem errors.
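
For reference, debsums prints FAILED for every file whose checksum no longer matches its package, so the damaged files can be listed with something like (the grep is just a convenience):

debsums -xa 2>/dev/null | grep 'FAILED$'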

My recovery procedure was as follows:
Assemble the array degraded, from the good member only:
mdadm --assemble /dev/md127 /dev/nvme0n1p2
mdadm --run /dev/md127
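
Before touching anything else it is worth confirming that the array really came up degraded on the good member only: /proc/mdstat should show md127 active with one member missing ([U_] or [_U]), and mdadm --detail should report the state as clean, degraded:

cat /proc/mdstat
mdadm --detail /dev/md127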

Check the filesystem on all partitions (note the -f parameter; some filesystems "think" they are clean):
fsck.ext4 -f /dev/VolGroup/...
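
If you have many logical volumes, a small loop saves typing (a sketch assuming every LV in VolGroup carries ext4; skip swap or other filesystems):

for lv in $(lvs --noheadings -o lv_path VolGroup); do
    fsck.ext4 -f "$lv"
done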

Wipe the stale superblock and re-add the second component:
mdadm --zero-superblock /dev/nvme1n1p2
mdadm --add /dev/md127 /dev/nvme1n1p2
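
The resync can then be monitored until the mirror is consistent again; the State line in mdadm --detail returns to "clean" once the rebuild has finished:

watch -n 5 cat /proc/mdstat
mdadm --detail /dev/md127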

Best regards