Comment 13 for bug 320638

Stephan Diestelhorst (syon) wrote : Re: Raid1 HDD and SD card -> data corruption (bio too big device md0 (248 > 200))

@Jim: Thanks for getting back to me on this one!

Your understanding of my purposes is correct. Let me address your points one by one:

> You save little, if any, power because an array restore requires a complete disk copy, not an update of some number of out-of-date blocks. ...

No. First of all, write-intent bitmaps reduce the amount of resynchronisation needed. Second, the SD card is in the RAID the whole time, marked write-mostly. I can therefore simply drop the HDD from the RAID, spin it down and save power that way. When I am back on a power supply, resyncing copies only the changed blocks to the HDD.
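
To make that cycle concrete, here is a minimal sketch in Python, assuming an array created roughly as 'mdadm --create /dev/md0 --level=1 --raid-devices=2 --bitmap=internal /dev/sda1 --write-mostly /dev/mmcblk0p1'. The device names are placeholders for illustration, not taken from my actual setup:

    #!/usr/bin/env python3
    """Hypothetical sketch of the power-saving cycle described above.

    /dev/sda1 (HDD member), /dev/sda (HDD disk) and /dev/md0 are
    placeholder names; adjust them for the real setup.
    """
    import subprocess

    HDD_MEMBER = "/dev/sda1"
    HDD_DISK = "/dev/sda"
    MD = "/dev/md0"

    def run(*cmd):
        print("#", " ".join(cmd))
        subprocess.run(cmd, check=True)

    def go_mobile():
        # Drop the HDD from the mirror; the write-intent bitmap keeps
        # track of the blocks that change while it is away.
        run("mdadm", MD, "--fail", HDD_MEMBER)
        run("mdadm", MD, "--remove", HDD_MEMBER)
        # Spin the disk down to save power.
        run("hdparm", "-Y", HDD_DISK)

    def go_docked():
        # Re-adding the same member makes md resync only the blocks the
        # bitmap marks as dirty, not the whole disk.
        run("mdadm", MD, "--re-add", HDD_MEMBER)

    if __name__ == "__main__":
        go_mobile()

The point is only that --fail/--remove followed later by --re-add with a bitmap is an incremental operation, not a full disk copy.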

> It can also leave your system vulnerable. I have heard of reiser filesystems fail w/ compromised raid arrays, which this is in your powersaving mode.

It must not. This follows from what everybody expects of the two things involved: RAID and the block device abstraction. Any sane RAID implementation guarantees the same semantics at the block layer interface (with degraded performance, sure), and the file system in turn must rely only on the semantics that interface provides. As those semantics remain the same, the FS has no way of telling the difference and malfunctioning because of it. Any such behaviour is clearly a bug in either the RAID implementation or the FS.

> There is power management in the newer kernels to cycle down the hdd to conserve power but this takes careful tuning.

This is improving steadily, but it cannot save as much power as not using the HDD at all.

> If you want backup with this mix of devices, use rsync to the usb stick.

I can do that copying, but I cannot remount the root FS on the fly onto the copy on the USB key / SD card.

> A usb stick is not the same as an SSD. ...

It is, as far as the correctness / semantics of the entire FS stack is concerned: a block device. The differences you mention are all well understood, but they affect only quantitative aspects of the system, such as performance, MTBF etc.

After skimming through the kernel source once more, I feel that the real problem lies in an unclear specification of how constant max_sectors is. MD assumes that it may adjust the value it reports when _polled_, according to the characteristics of the devices currently in the RAID. Some layer in the stack (LVM or LUKS, both built on the same block layer abstraction, or even the FS itself) apparently cannot cope with a variable max_sectors.
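
You can watch the mismatch directly in sysfs; here is a small, purely illustrative Python snippet. It assumes the array is md0, and note that sysfs reports the limit in KiB while the "248 > 200" in the error message counts 512-byte sectors:

    #!/usr/bin/env python3
    """Read the request size limits the block layer advertises.

    Illustrative only: assumes a Linux sysfs layout and an md array
    named md0; adjust MD_NAME for a different array.
    """
    import os

    MD_NAME = "md0"

    def max_sectors_kb(queue_dir):
        with open(os.path.join(queue_dir, "max_sectors_kb")) as f:
            return int(f.read())

    def queue_dir_for(block_dir):
        # Partitions have no queue/ of their own; fall back to the parent disk.
        for d in (block_dir, os.path.dirname(block_dir)):
            q = os.path.join(d, "queue")
            if os.path.isdir(q):
                return q
        raise RuntimeError("no queue/ found for %s" % block_dir)

    md_queue = "/sys/block/%s/queue" % MD_NAME
    print("%s advertises max_sectors_kb = %d" % (MD_NAME, max_sectors_kb(md_queue)))

    slaves_dir = "/sys/block/%s/slaves" % MD_NAME
    for slave in sorted(os.listdir(slaves_dir)):
        real = os.path.realpath(os.path.join(slaves_dir, slave))
        print("  member %-12s max_sectors_kb = %d"
              % (slave, max_sectors_kb(queue_dir_for(real))))

The value md0 advertises tracks its current members, whereas the layers above it appear to record that value only once, when the stack is assembled.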

In addition to the possible solutions mentioned in previous comments, I can think of several other ways to deal with the issue:
a) Provide a way for MD to _notify_ the upper layers of the changing characteristics of the block device. Each layer would then be responsible for chaining the notification on to its respective parent layers. This may be nasty, as it requires knowledge of the reverse connections in the stack.

b) Handle the issue more gracefully when a failed access is detected. Once the upper, requesting layer receives a specific error for an access, it can re-probe the actual max_sectors value of the underlying block device. The contract would then be that intermediate block layers do not cache this information but instead ask their lower layers for up-to-date values (see the toy model below).
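
To illustrate b), a toy model in Python (not kernel code: "Lower" stands in for md, "Upper" for LVM/LUKS/the FS, and the 248/200 figures are the ones from the error message):

    #!/usr/bin/env python3
    """Toy model of option b): re-probe the limit on failure and retry.

    This only illustrates the proposed contract that upper layers do not
    trust a cached max_sectors but re-query it when a request is
    rejected as too big.
    """

    class TooBig(Exception):
        pass

    class Lower:
        """Stands in for md: its limit can shrink when a member changes."""
        def __init__(self, max_sectors):
            self.max_sectors = max_sectors

        def submit(self, nr_sectors):
            if nr_sectors > self.max_sectors:
                raise TooBig(nr_sectors)
            print("lower: wrote %d sectors" % nr_sectors)

    class Upper:
        """Stands in for LVM/LUKS/FS: splits and retries instead of failing."""
        def __init__(self, lower):
            self.lower = lower
            self.cached_limit = lower.max_sectors   # goes stale after reconfiguration

        def write(self, nr_sectors):
            try:
                self.lower.submit(nr_sectors)
            except TooBig:
                # Re-probe the *current* limit and resubmit in smaller pieces.
                self.cached_limit = self.lower.max_sectors
                for off in range(0, nr_sectors, self.cached_limit):
                    self.lower.submit(min(self.cached_limit, nr_sectors - off))

    md = Lower(max_sectors=248)
    fs = Upper(md)
    md.max_sectors = 200        # e.g. a slower member limits the array
    fs.write(248)               # fails outright today; here it is split and retried

Today the oversized request simply fails; under the proposed contract the upper layer would re-probe and split it.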

Please note that this is a fundamental design issue, which has turned up because of an admittedly exotic setup, rather than a case of using things in a way they were never meant to be used.