hot-add/remove in mixed (IDE/SATA/USB/SD-card/...) RAIDs with device mapper on top => data corruption (bio too big device md0 (248 > 240))

Bug #320638 reported by Stephan Diestelhorst
This bug affects 7 people
Affects                      Status         Importance   Assigned to   Milestone
Linux                        Fix Released   Unknown
mdadm                        Confirmed      Undecided    Unassigned
debian-installer (Ubuntu)    Invalid        Undecided    Unassigned
linux (Ubuntu)               Won't Fix      Critical     Jim Lieb
mdadm (Ubuntu)               Confirmed      Undecided    Unassigned
ubiquity (Ubuntu)            Invalid        Undecided    Unassigned

Bug Description

Problem: md changes the max_sectors setting of an already running and busy md device when a (hot-pluggable) member is added or removed. However, the device mapper and filesystem layers on top of the RAID cannot (always?) cope with that.

Observations:
* "bio too big device mdX (248 > 240)" messages in the syslog
* read/write errors (some dropped silently, no noticable errors reported during operation, until things like dhcpclient looses its IP etc.)

Expected:
Adding and removing members of a running RAID (hotplugging) should not change the RAID device's characteristics. If the new member supports only smaller max_sectors values, buffer and split the data stream until the RAID device can be set up from a clean state with a more appropriate max_sectors value. To avoid buffering and splitting in the future, md could save the smallest max_sectors value of the known members in the superblock and use it when setting up the RAID, even if that member is not present.

Note: This is reproducible in much more common scenarios than the original reporter's (e.g. --add a USB drive (USB 3.0 these days) to an already running SATA RAID1 and grow the number of devices).

Fix:
Upstream has no formal bug tracking, only a mailing list. The response was that ultimately this needs to be "fixed [outside of mdadm] by cleaning up the bio path so that big bios are split by the device that needs the split, not by the fs sending the bio."

However, in the meantime mdadm needs to safeguard against the data corruption:

> > [The mdadm] fix is to reject the added device [if] its limits are
> > too low.
>
> Good idea to avoid the data corruption. MD could save the
> max_sectors default limit for arrays. If the array is modified and the new
> limit gets smaller, postpone the sync until the next assemble/restart.
>
> And of course print a message when postponing that explains when --force would be safe.
> Whatever that would be: no block device abstraction layer (device mapper, LVM, LUKS, ...)
> between an unmounted? ext, FAT?, ...? filesystem and md?

As upstream does not do public bug tracking, though, the status of this need and whether it will be remembered remain unclear.
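
Until such a safeguard exists, an administrator can approximate the check by hand before adding a member. A minimal sketch, assuming the array is /dev/md0 and the candidate is a whole-disk device /dev/sdc (the sysfs paths are the standard block-queue limits; device names are examples, not taken from this report):

$ cat /sys/block/md0/queue/max_sectors_kb   # limit currently exported by the array
$ cat /sys/block/sdc/queue/max_sectors_kb   # limit of the device about to be added
# Only run "mdadm /dev/md0 --add /dev/sdc1" if the second value is not smaller than
# the first; otherwise layers stacked on top of md0 may keep issuing bios that are
# too big for the new member.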

---

This is on a MSI Wind U100 and I've got the following stack running:
HDD & SD card (USB card reader) -> RAID1 -> LUKS -> LVM -> Reiser

Whenever I remove the HDD from the RAID1 for power-saving reasons
> mdadm /dev/md0 --fail /dev/sda2
> mdadm /dev/md0 --remove /dev/sda2
I can no longer run any apt-related tools.

> sudo apt-get update
[...]
Hit http://de.archive.ubuntu.com intrepid-updates/multiverse Sources
Reading package lists... Error!
E: Read error - read (5 Input/output error)
E: The package lists or status file could not be parsed or opened.

Taking a look at the kernel log shows (and many more above):
> dmesg|tail
[ 9479.330550] bio too big device md0 (248 > 240)
[ 9479.331375] bio too big device md0 (248 > 240)
[ 9479.332182] bio too big device md0 (248 > 240)
[ 9611.980294] bio too big device md0 (248 > 240)
[ 9742.929761] bio too big device md0 (248 > 240)
[ 9852.932001] bio too big device md0 (248 > 240)
[ 9852.935395] bio too big device md0 (248 > 240)
[ 9852.938064] bio too big device md0 (248 > 240)
[ 9853.081046] bio too big device md0 (248 > 240)
[ 9853.081688] bio too big device md0 (248 > 240)

$ sudo mdadm --detail /dev/md0
/dev/md0:
        Version : 00.90
  Creation Time : Tue Jan 13 11:25:57 2009
     Raid Level : raid1
     Array Size : 3871552 (3.69 GiB 3.96 GB)
  Used Dev Size : 3871552 (3.69 GiB 3.96 GB)
   Raid Devices : 2
  Total Devices : 1
Preferred Minor : 0
    Persistence : Superblock is persistent

  Intent Bitmap : Internal

    Update Time : Fri Jan 23 21:47:35 2009
          State : active, degraded
 Active Devices : 1
Working Devices : 1
 Failed Devices : 0
  Spare Devices : 0

           UUID : 89863068:bc52a0c0:44a5346e:9d69deca (local to host m-twain)
         Events : 0.8767

    Number   Major   Minor   RaidDevice   State
       0       0        0        0        removed
       1       8       17        1        active sync writemostly   /dev/sdb1

$ sudo ubuntu-bug -p linux-meta
dpkg-query: failed in buffer_read(fd): copy info file `/var/lib/dpkg/status': Input/output error
dpkg-query: failed in buffer_read(fd): copy info file `/var/lib/dpkg/status': Input/output error
[...]

Will provide separate attachments.

Revision history for this message
Stephan Diestelhorst (syon) wrote :

Just stumbled through Linux' source and fooled around a bit.
It seems that the value of 240 is consistent with what I can find in /sys/block/sdb/queue/max_{,hw}_sectors_kb
for the SD card (it returns 120, i.e. 240 sectors).

Any other device I attach through USB and / or the card reader has the same value. Where does this come from?

I've tried to understand how this value propagates through the hierarchy, but was not too successful. What is strange is that the reported 248 is exactly one page (4 kB = 8 sectors) over the maximum. Could this be a rounding issue somewhere?

Most of the stuff just kept copying the value from the lower level or propagated the minimum value.
Which layer comes up with the 248 in the first place?

Just sharing my $0.02; I apparently have no clue about the block layer (stack).
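
For reference, the limits can be read for every layer of the stack from sysfs (a sketch using the device names from this report; dm-0 stands for whichever device-mapper node sits on top of md0 and may be named differently):

$ cat /sys/block/sda/queue/max_sectors_kb    # internal HDD
$ cat /sys/block/sdb/queue/max_sectors_kb    # SD card behind the USB reader (120 here)
$ cat /sys/block/md0/queue/max_sectors_kb    # the RAID1 device
$ cat /sys/block/dm-0/queue/max_sectors_kb   # LUKS/LVM mapping on top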

Revision history for this message
Stephan Diestelhorst (syon) wrote :

After re-adding my HDD to the RAID and waiting for the sync to complete, I retried:
$ sudo LANG=C apt-get update
[..]
Hit http://de.archive.ubuntu.com intrepid-updates/multiverse Sources
Fetched 378B in 0s (1106B/s)
Reading package lists... Error!
E: Read error - read (5 Input/output error)
E: The package lists or status file could not be parsed or opened.

Strace'ing suggests that the file /var/lib/apt/lists/de.archive.ubuntu.com_ubuntu_dists_intrepid_main_binary-i386_Packages is to blame.
And indeed:

$ LANG=C dd if=/var/lib/apt/lists/de.archive.ubuntu.com_ubuntu_dists_intrepid_main_binary-i386_Packages of=/dev/null
dd: reading `/var/lib/apt/lists/de.archive.ubuntu.com_ubuntu_dists_intrepid_main_binary-i386_Packages': Input/output error
320+0 records in
320+0 records out
163840 bytes (164 kB) copied, 0.00293396 s, 55.8 MB/s

$ LANG=C cat /var/lib/apt/lists/de.archive.ubuntu.com_ubuntu_dists_intrepid_main_binary-i386_Packages > /dev/null
cat: /var/lib/apt/lists/de.archive.ubuntu.com_ubuntu_dists_intrepid_main_binary-i386_Packages: Input/output error

Same is true after I remove the SD card from the RAID.

HELP, this thing eats my data!

Revision history for this message
Stephan Diestelhorst (syon) wrote :

Some more stumbling shows that one of the few places where bi_size is increased is
fs/bio.c, function __bio_add_page.
While I'm not sure what this function does exactly, I find the check in line 311 somewhat disturbing:

 if (((bio->bi_size + len) >> 9) > max_sectors)
    return 0;

I can only imagine that max_sectors should encompass the entire range. This, however, is not true, because the lowest bits are rounded off.

if ((bio->bi_size + len) > (max_sectors << 9))

would actually serve this purpose.

But again, I'm not sure whether this is actually the issue. Haven't found the time to compile my own kernel, given the broken package management :-/
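
For what it's worth, the numbers in the error message line up with the sysfs value above; a quick sanity check of the arithmetic (not an analysis of the kernel code):

$ echo $(( 240 * 512 )) $(( 248 * 512 ))
122880 126976
# 240 sectors = 120 KiB, exactly the SD card's max_sectors_kb;
# 248 sectors = 124 KiB, i.e. that limit plus one 4 KiB page.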

Revision history for this message
Stephan Diestelhorst (syon) wrote : Re: Raid1 HDD and SD card -> data corruption (bio too big device md0 (248 > 200))

Mhmhm, I'm replying to myself. Again.
https://kerneltrap.org/mailarchive/linux-kernel/2007/4/10/75875

NeilBrown:
"...
dm doesn't know that md/raid1 has just changed max_sectors and there
is no convenient way for it to find out. So when the filesystem tries
to get the max_sectors for the dm device, it gets the value that dm set
up when it was created, which was somewhat larger than one page.
When the request gets down to the raid1 layer, it caused a problem.
..."

This seems to be exactly the issue I see. For whatever reason, Reiser queries
dm for max_sectors and receives the value that was still valid when the disk was around.
(248 seems to be chosen because it is < 256 and divisible by 8; see this ancient
thread: http://lkml.indiana.edu/hypermail/linux/kernel/0303.1/0880.html )

Various solutions come to mind:
a) Gracefully handle the issue, i.e. split the request once it does not fit into the limit.
  a1) at the caller
  a2) at the callee
b) Let LVM query max_sectors before every (?) request it sends through to the device below.

I feel that something is really wrong here.

Please fix.
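
The stale value can be observed directly, which matches Neil's explanation (a sketch; dm-0 again stands for the mapping on top of md0, and the exact numbers depend on the hardware):

$ cat /sys/block/md0/queue/max_sectors_kb /sys/block/dm-0/queue/max_sectors_kb
$ sudo mdadm /dev/md0 --fail /dev/sda2
$ sudo mdadm /dev/md0 --remove /dev/sda2
$ cat /sys/block/md0/queue/max_sectors_kb /sys/block/dm-0/queue/max_sectors_kb
# md0's limit follows its remaining member, while dm-0 keeps whatever it captured
# when its table was loaded - the mismatch that produces "bio too big".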

Revision history for this message
Andy Whitcroft (apw) wrote :

This is not a bug in the linux-meta package, moving to the linux package.

affects: linux-meta (Ubuntu) → linux (Ubuntu)
Changed in linux (Ubuntu):
importance: Undecided → Critical
Changed in linux (Ubuntu):
status: New → Triaged
Revision history for this message
Jim Lieb (lieb) wrote :

@Stephan, As I understand this, you are using an md raid1 to save power and have a "backup"? This is failure-prone and dangerous to your data. The MD layer is pretty resilient, but it is meant for RAID of like disks. This is why the numbers are weird. It "expects" a scsi/sata hdd and works best if they are matched, i.e. same disk model. Mixing in a usb stick is not the same thing.

You save little, if any, power because an array restore requires a complete disk copy, not an update of some number of out-of-date blocks. I wouldn't be surprised if it consumes even more, since it is a steady-state transfer load until the restore is complete. This restore, depending on subsystem traffic and disk size, can take a significant amount of time. It can also leave your system vulnerable. I have heard of reiser filesystems failing with compromised raid arrays, which is what this is in your power-saving mode. There is power management in the newer kernels to cycle down the hdd to conserve power, but this takes careful tuning.

If you want backup with this mix of devices, use rsync to the usb stick. This will be consistent and only write what is needed. A usb stick is not the same as an SSD. The former is meant as a replacement for floppies, namely lower-than-hdd transaction rates, removability, and expected limits to lifetime. The latter is meant for hdd-type applications. I suggest you reconsider your configuration.

I did not "invalid" this bug yet pending your reply.

Changed in linux (Ubuntu):
assignee: nobody → Jim Lieb (jim-lieb)
Revision history for this message
Stephan Diestelhorst (syon) wrote :

@Jim: Thanks for getting back to me on this one!

Your understanding for my purposes is correct. Let me address your points one by one:

> You save little, if any, power because an array restore requires a complete disk copy, not an update of some number of out-of-date blocks. ...

No. First of all, there are write-intent bitmaps that reduce the amount of synchronisation needed. Second, the SD card is in the RAID all the time, with the write-mostly option. Hence I can just drop the HDD from the RAID, spin it down, and save power that way. When I'm on a power supply again, resyncing copies only the changed blocks to the HDD.
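
For illustration, the setup and power-saving cycle just described look roughly like this (a sketch; the create options and device names are assumptions based on this report, not the reporter's exact commands):

$ sudo mdadm --create /dev/md0 --level=1 --raid-devices=2 --bitmap=internal \
       /dev/sda2 --write-mostly /dev/sdb1
# on battery: drop the HDD and let it spin down
$ sudo mdadm /dev/md0 --fail /dev/sda2
$ sudo mdadm /dev/md0 --remove /dev/sda2
# back on mains: re-add it; the write-intent bitmap limits the resync to changed blocks
$ sudo mdadm /dev/md0 --re-add /dev/sda2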

> It can also leave your system vulnerable. I have heard of reiser filesystems fail w/ compromised raid arrays, which this is in your powersaving mode.

It must not. This is something everybody expects from the two following things: RAID and block device abstraction. Any sane RAID implementation will guarantee that it provides the same semantics at the block layer interface (with performance degradation, sure). The file system in turn must rely only on the semantics provided by the block layer interface. As these semantics remain the same, the FS has no way of telling the difference and malfunctioning because of it. Any such behaviour is clearly a bug in the RAID implementation or the FS.

> There is power management in the newer kernels to cycle down the hdd to conserve power but this takes careful tuning.

This is improving steadily, but cannot save the same power as if the HDD was not used at all.

> If you want backup with this mix of devices, use rsync to the usb stick.

I can do this copying, but I cannot remount the root FS on the fly to the copy on the USB key / SD card.

> A usb stick is not the same as an SSD. ...

It is, for everything the correctness/semantics of the entire FS stack should be concerned with: a block device. All the differences you mention are well understood, but they impact only quantitative aspects of the system, such as performance, MTBF, etc.

After skimming through the kernel source once more I feel that the real problem lies within unclear specifications regarding the 'constantness' of max_sectors. MD assumes that it can adjust the value it reports on _polling_ according to the characteristics of the devices in the RAID. Some layer in the stack (LVM or LUKS, both based on the same block layer abstraction or even the FS) apparently cannot cope with variable max_sectors.

In addition to the possible solutions mentioned in previous comments, I can think of several other ways to deal with the issue:
a) Provide a way for MD to _notify_ the upper layers of changing characteristics of the block device. Then all layers would have the responsibility to chain the notification to their respective parent layers. This may be nasty as it requires reversed connection knowledge.

b) Have more graceful handling of the issue on detection of faulty acceses. Once the upper requesting layer receives a specific error message for an access it can initiate a reprobe of the actual max_sectors value from the underlying block device. The contract would then be that intermediate block layers do not cache this information but rather ask...


Revision history for this message
Jim Lieb (lieb) wrote :

@Stephan,

The purpose of RAID is reliability, not a power saving strategy. It is true that there are bitmaps to minimize the bulk of the re-sync, an optimization, but that is all it is. The re-sync code schedules these so that there is minimal impact on overall performance during the re-sync. On a heavily used system, such as a server, this can take hours. It has been my experience that the disk subsystem gets pounded to death during this time.

There are a number of issues wrt mixing devices in this manner. Whereas HDD storage has access latencies in the msec range, read and write speeds are the same. While an SSD does not have access latency and read performance is in the HDD range, its write speed is not only asymmetrically slower, but significantly slower. The manufacturers are not there *yet* to compete with HDDs. Even in private, NDA discussions they tend to be vague about these important details. SDs, CFs, and USB sticks do not even fit this category. They are low-cost, low-power secondary storage. Your idea of mixing HDD and SSD storage is interesting. However, the mix has problems.

Your comments about vulnerability are true. What RAID *should* do and what it actually does are two different things. This is why Btrfs and ZFS (among others) address these very issues. However, you are treating a degraded RAID as a normal case. In practice, this is not true and unwanted. Case in point, no one uses RAID0 for anything other than low-value, "big bit bucket" storage. Yes, this is a candidate for idempotent, large dataset munching, but not for your only copy of the photo album (or the AMEX transaction database). RAID1 is an interesting idea, but most low-end arrays are, in fact, RAID10 to get striping. As I mentioned above, the rebuild pounds the disk, and the sooner the drive can come back to the array and start the rebuild, the smaller the rebuild queue will be. This also introduces stripes to improve performance, which makes your mix-and-match problematic. RAID arrays *really* want identical units. You really don't even want to mix sizes in the same speed range, because sectors/track are usually different. The mismatch results in asymmetric performance on the array, making performance match the slowest unit. These are the big issues in RAID subsystem design. It is all about redundancy and speed optimizations, given that every write transaction involves extra writes over the single-disk case. Your use case is not on the list for I/O subsystem designers. See below for what they are looking at.

I should address your issues about propagating size changes up and down the stack. The first issue is MD somehow notifying the upper layers that there have been some size changes. This works the wrong way. The application demands determine the various sizings starting with the filesystem. Case in point, a database specifies its own "page" size, often being some multiple of a predominant row size. This in turn determines the write size to the filesystem. This is where ext4 extents come in and where the current linux penchant for one-size-fits-all 4k page gets in the way. This, in turn, mixes with the array ...


Changed in linux (Ubuntu):
status: Triaged → Won't Fix
ceg (ceg)
description: updated
ceg (ceg)
summary: - Raid1 HDD and SD card -> data corruption (bio too big device md0 (248 >
- 200))
+ hot-add/remove in mixed HDD/USB/SD-card RAIDs -> data corruption (bio
+ too big device md0 (248 > 200))
summary: - hot-add/remove in mixed HDD/USB/SD-card RAIDs -> data corruption (bio
- too big device md0 (248 > 200))
+ hot-add/remove in mixed (HDD/USB/SD-card/...) RAIDs -> data corruption
+ (bio too big device md0 (248 > 200))
Changed in debian-installer (Ubuntu):
status: New → Confirmed
Revision history for this message
ceg (ceg) wrote : Re: hot-add/remove in mixed (HDD/USB/SD-card/...) RAIDs -> data corruption (bio too big device md0 (248 > 200))

This is a very severe data-corrupting bug.

Thus, even if the md devs cannot fix the "bio" abstraction or whatever would be necessary to support mixed-interface setups, the md module should definitely have a safeguard and make sure to return a failure on hot add/remove under these circumstances, instead of letting this bug eat its users' data.

Changed in mdadm (Ubuntu):
status: New → Confirmed
ceg (ceg)
summary: - hot-add/remove in mixed (HDD/USB/SD-card/...) RAIDs -> data corruption
- (bio too big device md0 (248 > 200))
+ hot-add/remove in mixed (HDD/USB/SD-card/...) RAIDs with device mapper
+ on top -> data corruption (bio too big device md0 (248 > 200))
Changed in linux:
status: Unknown → Confirmed
ceg (ceg)
description: updated
summary: hot-add/remove in mixed (HDD/USB/SD-card/...) RAIDs with device mapper
- on top -> data corruption (bio too big device md0 (248 > 200))
+ on top => data corruption (bio too big device md0 (248 > 240))
summary: - hot-add/remove in mixed (HDD/USB/SD-card/...) RAIDs with device mapper
- on top => data corruption (bio too big device md0 (248 > 240))
+ hot-add/remove in mixed (IDE/SATA/USB/SD-card/...) RAIDs with device
+ mapper on top => data corruption (bio too big device md0 (248 > 240))
ceg (ceg)
description: updated
description: updated
description: updated
ceg (ceg)
Changed in mdadm:
status: New → Confirmed
Revision history for this message
Dimitri John Ledkov (xnox) wrote :

This is an upstream issue in the kernel's handling of nested block devices (LVM on top of mdadm).
It would be interesting to know whether the patches to "recursively call merge_bvec_fn" have become a reality by now. [1]
It's best to ping the linux-raid mailing list to ask whether anything has changed with respect to this bug.

[1] http://lkml.indiana.edu/hypermail/linux/kernel/0704.1/1008.html

Changed in debian-installer (Ubuntu):
status: Confirmed → Invalid
Changed in ubiquity (Ubuntu):
status: New → Invalid
Revision history for this message
Dimitri John Ledkov (xnox) wrote :

Never mind, Neil recently replied about the state of the art w.r.t. this bug:
http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=624343#106

Revision history for this message
Josef Hopfgartner (josef-netzagentur) wrote :

I'm using SATA drives for LVM on RAID1. The RAID is configured with an internal bitmap, so resyncing is quite fast.

In order to do the best kind of backup, I swap my drives to keep backups of complete systems.
There is a very good reason for this: in the case of a complete disaster I simply take the last backup disk and use it in working hardware.

So it would sometimes be quite useful to also connect a hard drive via external USB storage.
But because of this error, the whole procedure is not possible.

I know it works with eSATA,
but this error always shows up with USB-connected drives.

Using rsync is not as simple as using a hotplug script that initiates a RAID resync.

I don't know how it behaves with USB 3.
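
For the record, the kind of hotplug hook referred to here could look like this (a hypothetical sketch, not an existing script; the rule file, the match on ID_FS_TYPE, and the helper path are assumptions):

/etc/udev/rules.d/99-md-backup.rules (hypothetical):
ACTION=="add", SUBSYSTEM=="block", ENV{ID_FS_TYPE}=="linux_raid_member", RUN+="/usr/local/sbin/md-readd.sh /dev/%k"

/usr/local/sbin/md-readd.sh (hypothetical):
#!/bin/sh
# Re-add the returning backup member; with an internal bitmap the resync
# only copies blocks changed since the disk was removed.
exec mdadm /dev/md0 --re-add "${1:?usage: md-readd.sh /dev/sdXN}"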

Revision history for this message
ceg (ceg) wrote :

> I don't know how it behaves with USB 3.

Same data corruption as well, unfortunately.

ceg (ceg)
description: updated
Changed in linux:
status: Confirmed → Fix Released
Revision history for this message
Frank Ch. Eigler (fche) wrote :

The resolution of "fix released" is incorrect: the kernel bug is still present. The Debian bug was closed due to age rather than due to being fixed.

Changed in linux:
status: Fix Released → Confirmed
Revision history for this message
xor (xor) wrote :

What you people are forgetting is that RAID1 is in fact the PERFECT backup solution:
It takes a low-level copy of the whole system while the system is *running*, and, as opposed to cp/rsync, the copy is *coherent*:
Programmers do NOT design software to be robust against their files being randomly copied one after another by an external cp/rsync. How is this even supposed to work? If you copy file A, then file B, nothing guarantees that the software does not modify one of them in between your copying in a way which makes the files incoherent / breaks their compatibility.
And with regards to backup, we're talking about copying whole operating systems, consisting of *thousands* of programs. I would say the probability among thousands of programs is 100% that at least one of them will have its data corrupted if you copy its files using cp/rsync while it is running.

As opposed to that, a RAID1 is always in complete coherence if you shut down the system before you remove the disk. There is no mismatching data in multiple files. You can immediately boot up again, so backup takes 1 minute.
You could of course take offline the system for cp/rsync as well to get coherent data, but that will give you *hours* of downtime because the copying needs to happen while the system is offline. RAID1 can copy while it is running!

Overall, this bug is the worst issue in software design I've encountered in years, and I'm almost screaming :( I have spent over 20 hours migrating my machines to RAID1 so I can reduce the backup procedure from >5 hours to 5 minutes, and now it just doesn't WORK.
This is infuriating :( Can someone *please* come up with a fix / workaround?
I would be willing to pay a bounty of 50 EUR in Bitcoin for this to be fixed.

Revision history for this message
xor (xor) wrote :

Also, keep in mind that the most commonly used personal computers nowadays don't even *support* adding multiple disks of the same type: laptops. They only have one HDD slot, so I *must* use USB to attach the second.

Revision history for this message
Phillip Susi (psusi) wrote :

RAID is *not* a backup solution. If you delete or overwrite a file, then it's done on both disks, so you can't recover. If you want a rapid and coherent backup, use LVM and take a snapshot and back that up.

Also note that this commentary really isn't helping to fix the bug.
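
For anyone who wants to follow the LVM snapshot route mentioned above, a minimal sketch (the volume group and LV names vg0/root, the snapshot size, and the mount/backup paths are assumptions):

$ sudo lvcreate --snapshot --size 2G --name root-snap /dev/vg0/root
$ sudo mkdir -p /mnt/root-snap
$ sudo mount -o ro /dev/vg0/root-snap /mnt/root-snap
$ sudo rsync -aHAX /mnt/root-snap/ /media/backupdisk/root/
$ sudo umount /mnt/root-snap
$ sudo lvremove -y /dev/vg0/root-snap
# The snapshot freezes the filesystem at one point in time, so the rsync copies
# a consistent state even while the system keeps running.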

Revision history for this message
xor (xor) wrote :

@Phillip Susi / comment #23: Did you actually read what I wrote? :)

I was *NOT* advocating "backup" by having multiple RAID disks constantly connected to the array and in sync. It is completely obvious to me that a hot running copy of data is NOT a backup.
I was advocating the following procedure:
1. Connect disk
2. Wait until it is synced into the array.
3. Shutdown the machine
4. *DISCONNECT* the disc from the machine and consider the completely offline disk as a backup.

This is a backup because the disk is physically disconnected from the machine.
It is much better than an rsync/cp, because it provides a *coherent* copy: all modifications to the data which happen during the copying process are also applied to the backup. With rsync/cp, files which are modified *after* they have already been copied are not up to date in the backup, and for applications which store data in multiple files (which *many* programs do), their data would be corrupt in such a case.

It is relevant to this bug tracker entry because it shows that using multiple kinds of disks in a RAID1, such as SATA+USB, is a common desire and not some exotic border case.
Or can you name any other kind of non-exotic, non-beta (unlike, say, btrfs) backup mechanism which can copy data while the system is in use without breaking its coherency? :)

Changed in linux:
status: Confirmed → Fix Released
Revision history for this message
Kevin Lyda (lyda) wrote :

This still isn't fixed as best I can follow.
