Comment 14 for bug 320638

Jim Lieb (lieb) wrote : Re: Raid1 HDD and SD card -> data corruption (bio too big device md0 (248 > 200))

@Stephan,

The purpose of RAID is reliability, not power saving. It is true that write-intent bitmaps minimize the bulk of a re-sync, but that is an optimization and nothing more. The re-sync code schedules its work so that the impact on overall performance is minimal, and on a heavily used system, such as a server, a re-sync can take hours. It has been my experience that the disk subsystem gets pounded to death during this time.
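
To put some rough numbers on why the bitmap is only an optimization (the figures below are invented for illustration, not measurements): with a write-intent bitmap, only the chunks dirtied while the member was out of the array need to be copied back, but a long absence, or replacing a member outright, still means copying essentially the whole device.

    # Rough re-sync arithmetic (illustrative figures, not measurements).
    # A write-intent bitmap limits the re-sync to the chunks dirtied while a
    # member was out of the array; without it, or after replacing a member,
    # the whole device has to be copied.
    array_size_gb = 500        # hypothetical array size
    resync_rate_mb_s = 30      # hypothetical sustained re-sync rate under load
    dirty_fraction = 0.02      # fraction of bitmap chunks dirtied while absent

    full_resync_h = array_size_gb * 1024 / resync_rate_mb_s / 3600
    bitmap_resync_h = full_resync_h * dirty_fraction

    print(f"full re-sync:   ~{full_resync_h:.1f} h")    # ~4.7 h at these numbers
    print(f"bitmap re-sync: ~{bitmap_resync_h:.2f} h")  # minutes, not hours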

There are a number of issues wrt mixing devices in this manner. An HDD has access latencies in the msec range, but its read and write speeds are symmetric. An SSD has negligible access latency and read performance in the HDD range, but its write speed is not just asymmetrically slower than its reads, it is significantly slower. The manufacturers are not there *yet* to compete with HDDs; even in private, NDA discussions they tend to be vague about these important details. SDs, CFs, and USB sticks do not even fit this category. They are low cost, low power secondary storage. Mixing HDD and SSD storage is an interesting idea. However, the mix has problems.
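
Incidentally, the "bio too big device md0 (248 > 200)" message in the title is one symptom of exactly this kind of mismatch: the members advertise different maximum request sizes, and a request built against the larger limit gets rejected by the smaller device. Here is a minimal sketch, assuming a Linux sysfs layout (the md0 name is just taken from this report), of how to compare what the members export; note the kernel message is in 512-byte sectors while sysfs reports KiB.

    #!/usr/bin/env python3
    # Minimal sketch: compare the request-size limits advertised by the
    # members of an md array via sysfs. Assumes a Linux sysfs layout; the
    # array name is taken from this report and may differ on your system.
    import glob, os

    ARRAY = "md0"

    def max_sectors_kb(slave_link):
        """Return the member's max_sectors_kb; partitions inherit it from
        their parent disk, so look one directory up if needed."""
        path = os.path.realpath(slave_link)
        for candidate in (path, os.path.dirname(path)):
            limit_file = os.path.join(candidate, "queue", "max_sectors_kb")
            if os.path.exists(limit_file):
                with open(limit_file) as f:
                    return int(f.read().strip())
        return None

    # md exposes its members under /sys/block/<array>/slaves/
    limits = {os.path.basename(p): max_sectors_kb(p)
              for p in glob.glob(f"/sys/block/{ARRAY}/slaves/*")}
    print(limits)

    if len(set(limits.values())) > 1:
        print("members advertise different request size limits; a request "
              "sized for the larger device can be rejected by the smaller one")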

Your comments about vulnerability are true. What RAID *should* do and what it actually does are two different things; this is why Btrfs and ZFS (among others) address these very issues. However, you are treating a degraded RAID as the normal case. In practice it is neither normal nor wanted. Case in point: no one uses RAID0 for anything other than low value, "big bit bucket" storage. Yes, that is a candidate for idempotent, large dataset munching, but not for your only copy of the photo album (or the AMEX transaction database). RAID1 is an interesting idea, but most low end arrays are, in fact, RAID10 to get striping. As I mentioned above, the rebuild pounds the disks, and the sooner a drive can come back to the array and start rebuilding, the smaller the rebuild backlog will be. RAID10 also introduces striping to improve performance, which makes your mix-and-match problematic. RAID arrays *really* want identical units. You don't even want to mix sizes in the same speed range, because sectors/track are usually different; the mismatch results in asymmetric performance on the array, dragging the whole array down to the slowest unit. These are the big issues in a RAID subsystem design. It is all about redundancy and speed optimizations, given that every write transaction involves extra writes over the single disk case. Your use case is not on the list for I/O subsystem designers. See below for what they are looking at.
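
To make the "slowest unit" and "extra writes" points concrete, here is a back-of-the-envelope model (the throughput figures are invented for illustration, not a benchmark): a mirrored write completes only when every member has committed its copy, so the array is gated by the slowest member, and each logical write costs one physical write per member.

    # Back-of-the-envelope RAID1 write model (invented figures, not a benchmark).
    hdd_write_mb_s = 80   # hypothetical HDD sequential write speed
    sd_write_mb_s = 10    # hypothetical SD card write speed

    members = [hdd_write_mb_s, sd_write_mb_s]

    # A mirrored write is complete only when every member has it, so sustained
    # throughput is gated by the slowest member...
    array_write_mb_s = min(members)

    # ...and every logical write costs one physical write per member.
    writes_per_logical_write = len(members)

    print(f"array write throughput ~ {array_write_mb_s} MB/s "
          f"(vs {max(members)} MB/s for the fast member alone)")
    print(f"physical writes per logical write: {writes_per_logical_write}")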

I should also address your suggestions about propagating size changes up and down the stack. The first is having MD somehow notify the upper layers that sizes have changed. That works the wrong way around. The application's demands determine the various sizings, starting with the filesystem. Case in point: a database specifies its own "page" size, often some multiple of a predominant row size. That in turn determines the write size to the filesystem. This is where ext4 extents come in, and where the current Linux penchant for a one-size-fits-all 4k page gets in the way. This, in turn, combines with the array and underlying disk geometry to determine stripe size. Tuning it is somewhat of a black art. Change the use case to streaming media files and all the sizes change. In other words, the sizing comes down from above, not up from the bottom. Remember, once written to storage, the sizing of the row/db-page/extent/stripe is fixed until re-written.
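
A sketch of that top-down flow, with hypothetical numbers and the usual stride/stripe-width arithmetic used when laying ext4 over a striped array (nothing here is specific to this bug):

    # Sizing flows down the stack (hypothetical numbers).
    db_page_size   = 16 * 1024   # the application: a database using 16 KiB pages
    fs_block_size  = 4 * 1024    # the filesystem: ext4 block size
    raid_chunk_kib = 64          # the array: per-member chunk size in KiB
    data_disks     = 2           # members carrying data (e.g. a 4-drive RAID10)

    # The db page must be a whole number of filesystem blocks...
    assert db_page_size % fs_block_size == 0

    # ...and the array geometry is then expressed in filesystem blocks,
    # which is what the ext4 stride / stripe-width tuning figures are.
    stride = raid_chunk_kib * 1024 // fs_block_size   # fs blocks per chunk
    stripe_width = stride * data_disks                # fs blocks per full stripe

    print(f"stride={stride} stripe_width={stripe_width}")
    # Writes that are multiples of the full stripe stay aligned to the array;
    # change the workload (streaming media, say) and all of these numbers
    # change with it -- from the top down.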

Your second suggestion does two things: first, it effectively disables the caching, and second, it leaks private information across the layer boundary, and for what purpose? It is the upper layers that advise the configuration of the lower, not the other way around. And this is a tuning issue. The database will still work if I swap in, via LVM, a new array with a different geometry; it will just run crappy if I don't configure it to be at least a multiple/modulo equivalent of what it replaces/augments.

Both of your suggested solutions require new code and changes to the APIs of multiple layers, something that has little chance of making it into mainline. Since Ubuntu must track mainline, it would be acceptable for inclusion in Ubuntu only after it has been accepted upstream. The idea is an interesting one, but it is not a patch; it is new development.

Your final paragraph hits the point. This is a fundamental design issue raised by your corner case. I hope the above explains why your need is still a corner case, out of the scope of the design. No, it is not just exotic; it is genuinely outside what the design intended, for the reasons above. Please don't take offense, but one could easily say that this is using a vernier caliper as a pipe wrench. Might I suggest an alternative that the industry is moving toward?

Laptop power consumption is a real problem, and it is a real problem in the data center as well. Everything from the eye candy to the browser history file consumes power, and application programmers don't care; everything is virtual and free anyway ;) There are three approaches being pursued. The first is optimization of application resource usage, where both real savings and significant developer inertia abound. The second is a combination of kernel and hardware: drive vendors are aggressively using powerdown techniques and the kernel developers are taking advantage of them. Lastly, the development of SSDs, including hybrid SSD+HDD drives, as power saving replacements for HDDs feeds into this work. This puts the fix where it belongs, namely at the source(s) of the power drain. The apps and O/S can be "energy conscious" and smarter about batching work and deciding when to hit the disk, and the disk controller can be smart about when to spin motors etc., while still guaranteeing that your precious bits don't slide off the platter and onto the floor.

In the meantime, I know this does not help your immediate problem in the way you envisioned in your initial attempt/report, but data integrity and the general, long term solution must be our design goal. Sorry.

Might I suggest trying out an SSD? I can't (and shouldn't) recommend a specific drive, but there are a number of choices available now. This would be safer for your data and get you some reasonable power savings. BTW, we know there are tuning/performance issues with SSD storage, and your input with real use case data would be valuable to us.