
array with conflicting changes is assembled with data corruption/silent loss

Reported by Jamie Strandboge on 2010-04-07
This bug affects 9 people
Affects                   Importance  Assigned to
Release Notes for Ubuntu  Undecided   Unassigned
mdadm                     Undecided   Unassigned
mdadm (Ubuntu)            High        Unassigned

Bug Description

Re-attaching parts of an array that have been running degraded separately and contain
conflicting changes, with event counts that are the same or within the range of a
write-intent bitmap, results in the assembly of a corrupt array.

----
Using the latest beta-2 server ISO and following http://testcases.qa.ubuntu.com/Install/ServerRAID1

Booting an out-of-sync RAID1 array fails with ext3: it comes up as in sync, but is corrupted.

     (According to comment #18: ext3 vs ext4 seems to be mere happenstance.)

Steps to reproduce:

1. In a KVM virtual machine with 2 virtio qcow2 disks (1768M each), 768M RAM and 2 VCPUs, create the md devices in the installer:
/dev/md0: 1.5G, ext3, /
/dev/md1: ~350M, swap

Choose to boot in degraded mode. All other installer options are left at their defaults.

2. reboot into Lucid install and check /proc/mdstat: ok, both disks show up and are in sync

3. shutdown VM. remove 2nd disk, power on the VM and check /proc/mdstat: ok, boots degraded and mdstat shows the disk

4. shutdown VM. reconnect 2nd disk and remove 1st disk, power on the VM and check /proc/mdstat: ok, boots degraded and mdstat shows the disk

5. shutdown VM. reconnect 1st disk (so now both disks are connected, but out of sync), power on the VM

Expected results:
At this point it should boot degraded with /proc/mdstat showing it is syncing (recovering). This is how it works with ext4. Note that in the past one would have to 'sudo mdadm -a /dev/md0 /dev/MISSING-DEVICE' before syncing would occur. This no longer seems to be required.

Actual results:
Array comes up with both disks in the array and in sync.

Sometimes there are error messages reporting disk errors, and the boot continues to login, but root is mounted read-only and /proc/mdstat shows we are in sync.

Sometimes fsck notices this and complains a *lot*:
/dev/md0 contains a filesystem with errors
Duplicate or bad block in use
Multiply-claimed block(s) in inode...
...
/dev/md0: File /var/log/boot.log (inode #68710, mod time Wed Apr 7 11:35:59 2010) has multiply-claimed block(s), shared with 1 file(s):
 /dev/md0: /var/log/udev (inode #69925, mod time Wed Apr 7 11:35:59 2010)
/dev/md0:
/dev/md0: UNEXPECTED INCONSISTENCY; RUN fsck MANUALLY.

The boot loops infinitely on this: mountall reports that fsck terminated with status 4, then reports that '/' is a filesystem with errors, then tries again (and again, and again).

See:
http://iso.qa.ubuntu.com/qatracker/result/3918/286

I filed this against 'linux'; please adjust as necessary.

-----

From linux-raid list:
mdadm --incremental should only include both disks in the array if
1/ their event counts are the same, or +/- 1, or
2/ there is a write-intent bitmap and the older event count is within
   the range recorded in the write-intent bitmap.
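The rule from the list can be sketched as a small predicate. This is only an illustration of the stated rule; the function and parameter names are made up and this is not mdadm's actual code:

```python
def may_join(running_events, candidate_events, bitmap_range=None):
    """Return True if a newly seen member may join a running array.

    running_events:   event count of the already-active members
    candidate_events: event count from the candidate's superblock
    bitmap_range:     (oldest, newest) event counts covered by the
                      write-intent bitmap, or None if there is no bitmap
    """
    # Case 1: event counts are the same, or differ by at most one.
    if abs(running_events - candidate_events) <= 1:
        return True
    # Case 2: a write-intent bitmap exists and the older event count
    # still falls within the range it records.
    if bitmap_range is not None:
        oldest, newest = bitmap_range
        return oldest <= min(running_events, candidate_events) <= newest
    return False
```

Under this rule, a disk that merely missed a few bitmap-tracked writes would be accepted and resynced from the bitmap, while a disk that diverged outside the bitmap's range would be rejected.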

Fixing:

* When assembling, mdadm could check for conflicting "failed" states in the
  superblocks of members to detect conflicting changes. On conflicts, i.e. if an
  additional member claims an already running member has failed:
   + that member should not be added to the array
   + report (console and --monitor event) that an alternative
     version with conflicting changes has been detected: "mdadm: not
     re-adding /dev/<member> to /dev/<array> because it constitutes an
     alternative version containing conflicting changes"
   + require and support --force with --add for manual re-syncing of
     alternative versions (because unlike with re-syncing outdated
     devices/versions, in this case changes will get lost).
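The mutual-accusation check proposed above might look roughly like this. The superblock dicts and field names here are invented for illustration; mdadm's real metadata structures differ:

```python
def has_conflicting_changes(running_sb, candidate_sb):
    """Detect an 'alternative version': both halves of a mirror ran
    degraded separately, so each superblock claims the other member
    has failed."""
    running_claims_candidate_failed = candidate_sb["dev"] in running_sb["failed"]
    candidate_claims_running_failed = running_sb["dev"] in candidate_sb["failed"]
    return running_claims_candidate_failed and candidate_claims_running_failed

# Example: disks that each ran degraded accuse each other of failing,
# so auto-adding must be refused and --force with --add required.
disk1 = {"dev": "vda1", "failed": {"vdb1"}}
disk2 = {"dev": "vdb1", "failed": {"vda1"}}
```

A merely outdated disk (one whose superblock still shows both members active) would not trigger this check and could be resynced safely.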

Enhancement 1)
  To facilitate easy inspection of alternative versions (i.e. for safe and
  easy diffing, merging, etc.) --incremental could assemble array
  components that contain alternative versions into temporary
  auxiliary devices.
  (would require temporarily mangling the fs UUID to ensure there are no
  duplicates in the system)

Enhancement 2)
  Those who want to be able to disable hot-plugging of
  segments with conflicting changes/alternative versions (after an
  incident in which multiple versions were connected at the same time occurred)
  will need some additional enhancements:
   + A way to mark some raid members (segments) as containing
     known alternative versions, and to mark them as such when an
     incident occurs in which they come up after another
     segment of the array is already running degraded.
     (possibly a superblock marking itself as failed)
   + An option like
     "AUTO -SINGLE_SEGMENTS_WITH_KNOWN_ALTERNATIVE_VERSIONS"
     to disable hotplug support for alternative versions once they came
     up after some other version and got marked as containing an alternative version.
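In mdadm.conf terms, the proposed option from the enhancement above might be written as follows. Note this option does not exist; the line is a sketch of the proposal, reusing mdadm.conf's existing AUTO line syntax:

```
# Hypothetical mdadm.conf line -- the option below is proposed, not
# implemented. It would disable hotplug assembly of segments previously
# marked as containing a known alternative version.
AUTO +all -SINGLE_SEGMENTS_WITH_KNOWN_ALTERNATIVE_VERSIONS
```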

Changed in linux (Ubuntu Lucid):
importance: Undecided → High
description: updated
summary: - booting out of sync RAID1 array fails with ext3 (comes up as syncd)
+ booting out of sync RAID1 array fails with ext3 (comes up as already in
+ sync)
tags: added: iso-testing

Booted into a live cd, installed mdadm and then grabbed the superblocks with:
$ sudo mdadm -E /dev/vda1
$ sudo mdadm -E /dev/vdb1

description: updated
Jamie Strandboge (jdstrand) wrote :

The last was the contents of the superblocks after I connected both disks. Here is the content of the superblock for disk1 after booting degraded with disk2 removed and after shutting down (obtained via live cd).

Jamie Strandboge (jdstrand) wrote :

And here is the content of the superblock for disk2 after booting degraded with disk2 reconnected and disk1 removed and after shutting down (obtained via live cd (note it shows up as vda, not vdb since disk1 is removed)).

Jamie Strandboge (jdstrand) wrote :

Superblocks with both disks attached, but before activate.

Jamie Strandboge (jdstrand) wrote :

Superblocks and /proc/mdstat after both disks are attached and 'sudo mdadm --auto-detect'.

Jamie Strandboge (jdstrand) wrote :

From irc:
13:18 < psusi> jdstrand: when you boot with one disk, you get the warning about
               being degraded and are given 15 seconds to abort activating
               degraded or not, right?
13:19 < jdstrand> psusi: I don't see a warning cause of plymouth, but there is
                  a pause yes
13:23 < psusi> jdstrand: can you boot with nosplash and noquiet boot options to
               disable that? after plugging both disks back in, the udev
               script tries to do an incremental build when it detects each
               disk. That should fail for both disks, then eventually after a
               timeout, the fallback script should try to do the degraded
               activate... at that point only one disk should be activated and
               the other ignored
13:31 < jdstrand> psusi: I didn't get to grub in time, but after a long pause
                  it flashed a screen at me very clearly stating I am booting
                  in degraded mode (each time with disk1 and disk2 removed)
13:34 < psusi> jdstrand: did you still get that timeout and message about
               degraded when you reconnect the second disk? or does it just
               plod along happily like nothing is wrong at all?
13:34 < psusi> until the fsck fails of course
13:34 < jdstrand> I don't think I got the timeout, let me check
13:36 < jdstrand> psusi: no pause. straight to file system errors

description: updated
description: updated
Phillip Susi (psusi) wrote :

I have reproduced this on Karmic by manually assembling, stopping, and reassembling the array based on two lvm volumes. When mdadm --incremental is run on the first degraded leg of the mirror, it activates it, since it now has one out of one disks, with the second disk flagged as faulty, removed. You would think that the second disk would show the first as faulty, removed as well, but it only shows it as removed. When mdadm --incremental is run on the second disk, it happily starts using it without a resync. I believe this should fail and refuse to use the second disk until you manually re-add it to the array, causing a full resync. I have mailed the linux-raid mailing list about this.

Changed in linux (Ubuntu Lucid):
status: New → Confirmed
ceg (ceg) wrote :

As it's not caused by kernel raid autodetection I guess this probably belongs to package mdadm.

Have you tested if the 9.10-10.04 update works for raid systems this time?

You can see quite a few raid bugs filed, and also https://wiki.ubuntu.com/ReliableRaid

affects: linux (Ubuntu Lucid) → mdadm (Ubuntu Lucid)
ceg (ceg) wrote :

Note that initramfs actually also wrongly executes "mdadm --assemble --scan --run" if it finds any arrays degraded.

Bug #497186 initramfs' init-premount degrades *all* arrays (not just those required to boot)

ceg (ceg) wrote :

> I believe this should fail and refuse to use the second disk until you manually re-add it to the array, causing a full resync.

Yes, if there is a way for mdadm to determine whether members are out of sync, it should fail on conflicting updates that occurred on separated parts of the array (as is the case here).

It should not fail if a usable remaining part of an array has been updated and the removed disk is plugged in again unchanged (hotplug re-adding of a raid member that is used as a backup).

It should never sync depending on device order (since that is rather random in hotplug systems anyway).

Phillip Susi (psusi) wrote :

Activating the degraded array is done only if the root fs is not found, and only if the mdadm package was configured via debconf to do so. There is nothing wrong with this per se; the problem is that the second disk is automatically added back into the array by mdadm --incremental. Once the disk has been marked as removed from the array, it should require manual intervention to put it back.

ceg (ceg) wrote :

Looking under "bugs" where this bug has been filed (/ubuntu/lucid/) does not turn up a serious bug besides this one, but mdadm (not only in 10.04) actually has some: https://bugs.launchpad.net/ubuntu/+source/mdadm

ceg (ceg) wrote :

> Activating the degraded array is done only if the root fs is not found,

Right, it's only in a failure hook, and things like cryptsetup won't be run after that...
The initramfs boot mechanism is just not designed with the right event-driven approach yet. Bug #488317

> and only if the mdadm package was configured via debconf to do so.

Not quite right. That debconf question was a rather bogus, unhelpful and unnecessary implementation. Bug #539597

> There is nothing wrong with this per se,

It is wrong to use "mdadm --assemble --scan --run", because it will start *all* arrays that have not come up yet in the initramfs stage. (They get desynced and need to be resynced.) Bug #497186

ceg (ceg) wrote :

> the problem is that the second disk is automatically added back into the array by mdadm --incremental. Once the disk has been marked as removed from the array, it should require manual intervention to put it back.

In the case at hand mdadm should not only refuse addition due to it being "removed". Even if you add the disks manually, mdadm should not just sync the disk that was slower to appear onto the first one, because the parts are inconsistent!

I think a nice solution to detect this (counter+random) may have been posted to the linux-raid list.

The data corruption comes from the inconsistent parts (conflicting changes) that should require conscious user intervention or maybe configuration to decide about the sync direction.

Not auto-re-adding manually removed raid members is a usability decision that could probably be made configurable, but I see it as unrelated to the data corruption.

Changed in mdadm (Ubuntu Lucid):
assignee: nobody → Canonical Kernel Team (canonical-kernel-team)
milestone: none → ubuntu-10.04
status: Confirmed → Triaged
Changed in mdadm (Ubuntu Lucid):
assignee: Canonical Kernel Team (canonical-kernel-team) → nobody

Just wanted to post the linux-raid mailing list thread for reference here:

http://marc.info/?l=linux-raid&m=127067374402401&w=2

On 4/9/2010 9:58 AM, ceg wrote:
> In the case at hand mdadm should not only refuse addition due to it
> being "removed". Even if you add the disks manually, mdadm should not
> just sync the disk that was slower to appear onto the first one,
> because the parts are inconsistent!

This statement does not make sense. Of course they are inconsistent;
that is why you have to sync them, which will make them consistent.

> I think a nice solution to detect this (counter+random) may have been
> posted to the linux-raid list.

I believe that is overkill and adds a new feature. The bug as I see
it is that --incremental activates the disk instead of refusing to
because it is marked as removed. Fixing that would solve this problem.

> The data corruption comes from the inconsistent parts (conflicting
> changes) that should require conscious user intervention or maybe
> configuration to decide about the sync direction.

Which they could do after --incremental refuses to use the removed disk.
 The admin could look at the removed disk and salvage any data from it
he wishes to, then manually add it back to the array, causing a full resync.

> Not auto re-adding manually removed raid_members, is a usability
> decision, that could probably made configurable but I see unrelated to
> the data corruption.

It isn't an option that could be configured; it is the definition of
the word "removed". If I remove the disk from the array, then it is no
longer part of the array and should not automatically be sucked back
into it.

Do I understand it properly that this bug does _not_ affect RAID1 with ext4 filesystem in any case? I know that it's pretty obvious from original description, but I want to be sure before upgrade. Thanks in advance.

Phillip Susi (psusi) wrote :

That seems to be mere happenstance. Using ext3 vs ext4 likely just slightly alters the exact IO pattern, causing a different number of md events. As long as the md event counters are not the same, adding the second modified disk back in causes a resync, destroying the changes specific to the second disk and going with the changes on the first detected disk.

Pip (dirk2) wrote :

Does anybody know the reason why Ubuntu still uses mdadm 2.6.7, which is about 2 years old now?
Maybe this problem is solved in a newer mdadm release...

Pip

ceg (ceg) wrote :

>> mdadm should not just sync the disk that was slower to appear onto the first one, because the parts
>> are inconsistent!
>
>This statement does not make sense.

Oh, right, yes. To put it better: they should not be synced if they "contain conflicting changes".

If I read the case as originally reported, the drives actually weren't manually --removed, just disconnected during power-down. If mdadm starts to distinguish between missing/removed, the disk missing at boot time will probably still just be marked missing, same as if it had actually failed, I guess.

But more generally: by forcing users to manually re-add removed disks, while mdadm is not refusing to sync conflicting parts of an array, we only make re-adding the data-loss inducing action. Manually re-adding and syncing *might* (I am not so sure) assemble/resync a consistent array, but will discard one part of the conflicting changes in the array (data-loss).

(Though, from what was written, it does sound valid to me now not to auto-re-add "removed" disks, if "missing" disks are auto-re-added.)

To prevent data-corruption, I think mdadm is required to detect conflicting changes, no matter whether disks are re-added automatically (as in having an auto-synced backup in the docking station/external disk) or manually.

summary: - booting out of sync RAID1 array fails with ext3 (comes up as already in
- sync)
+ booting out of sync RAID1 array comes up as already in sync (data-
+ corruption)
ceg (ceg) on 2010-04-14
description: updated

On 4/14/2010 9:19 AM, ceg wrote:
> But more generally: by forcing users to manually re-add removed
> disks, while mdadm is not refusing to sync conflicting parts of an
> array, we only make re-adding the data-loss inducing action. Manually
> re-adding and syncing *might* (I am not so sure) assemble/resync a
> consistent array, but will discard one part of the conflicting
> changes in the array (data-loss).

Correct, if both disks have been changed then you can not combine them
without discarding one change or the other. The admin would have to
decide if there were important changes on the other disk and recover
them before adding it back to the array.

> To prevent data-corruption, I think mdadm is required to detect
> conflicting changes, no matter whether disks are re-added
> automatically (as in having an auto-synced backup in the docking
> station/external disk) or manually.

Why do you think that, and what exactly would that entail?

As long as the admin manually inserts the disk back into the array, he
KNOWS that any changes specific to that disk will be destroyed, so I
don't see a problem.

> Expected results: At this point it should boot degraded with
> /proc/mdstat showing it is syncing (recovering). This is how it works
> with ext4. Note that in the past one would have to 'sudo mdadm -a
> /dev/md0 /dev/MISSING-DEVICE' before syncing would occur. This no
> longer seems to be required.

This is not correct. You seem to be agreeing with me that automatically
adding the disk back and resyncing causes data loss, thus this should be
avoided. Instead you should have to manually add the disk back. You
say this is how it used to work? When? It doesn't seem to work that
way on Karmic. If it used to work that way, then the fact that it no
longer does is the regression that needs to be fixed.

ceg (ceg) on 2010-04-14
summary: - booting out of sync RAID1 array comes up as already in sync (data-
- corruption)
+ array with conflicting changes is assembled with data corruption/silent
+ loss
ceg (ceg) wrote :

Though I can sure understand it would be easier if we could just dismiss this to be taken care of by users, data-loss/corruption will always come back to haunt ubuntu/mdadm.

With Ubuntu systems in particular, we cannot assume there will always be an admin available. And if there is an admin, and he always has to re-add removed members manually, how does he notice if a user made conflicting changes?

I am not sure if we are considering the valid use case of auto re-adding members enough here, yet. (Even if auto-adding just "missing" and not "removed" members.) I.e. the case of docking-stations / external backup drives.

> You seem to be agreeing with me that automatically
>adding the disk back and resyncing causes data loss, thus this should be
>avoided.

We need to avoid and warn about data-loss, no matter whether manual or automatic.
Re-adding needs to be a safe operation. If concurrent changes were made, syncing has to be refused unless --force is used.

> you should have to manually add the disk back. You
>say this is how it used to work? When? It doesn't seem to work that
>way on Karmic. If it used to work that way, then the fact that it no
>longer does is the regression that needs to be fixed.

Creating a fully hot-pluggable system is a major feature of ubuntu.

On 4/14/2010 11:58 AM, ceg wrote:
> Though I can sure understand it would be easier if we could just
> dismiss this to be taken care of by users, data-loss/corruption will
> always come back to haunt ubuntu/mdadm.

Not necessarily. Data loss because of automatic hardware detection and
activation is a problem certainly, but data loss because the user ran rm
-rf / is not.

> With Ubuntu systems in particular, we cannot assume there will always
> be an admin available. And if there is an admin, and he always has
> to re-add removed members manually, how does he notice if a user
> made conflicting changes?

He will notice when he sees that the array is degraded and refusing to
use one of the disks.

> I am not sure if we are considering the valid use case of auto
> re-adding members enough here, yet. (Even if auto-adding just
> "missing" and not "removed" members.) I.e. the case of
> docking-stations / external backup drives.

I'm not quite sure what you mean here. A device that is removed should
never be automatically added when detected.

> We need to avoid and warn about data-loss, no matter whether manual
> or automatic. Re-adding needs to be a safe operation. If concurrent
> changes were made, syncing has to be refused unless --force is
> used.

I'm not sure why --force should be required. When you add a disk to the
array, you always destroy whatever data is on that disk. It goes
without saying.

>> you should have to manually add the disk back. You say this is how
>> it used to work? When? It doesn't seem to work that way on Karmic.
>> If it used to work that way, then the fact that it no longer does
>> is the regression that needs to be fixed.
>
> Creating a fully hot-pluggable system is a major feature of ubuntu.

Ok... how does that alter the fact that we should not be automatically
adding devices to arrays that have been explicitly removed?

ceg (ceg) wrote :

> Ok... how does that alter the fact that we should not be automatically
> adding devices to arrays that have been explicitly removed?

Not at all, we agree that explicitly --remove(ing) a device is a good way to tell mdadm --incremental (its hotplug control mechanism) not to re-add automatically.

Personally I could even agree that it might be OK for "mdadm --add" not to require --force, but you don't seem to agree that "mdadm --incremental" really needs to be able to auto-re-add (not manually removed, but missing) devices in a safe manner.

>> be an admin available. And if there is an admin, and he always has
>> to re-add removed members manually, how does he notice if a user
>> made conflicting changes?
>
> He will notice when he sees that the array is degraded and refusing to
> use one of the disks.

If I read your proposal correctly, running an array degraded would always also "remove" the missing disk.

This would imply:
* breaking the auto-re-add-later feature of mdadm --incremental (which also sports auto-read-only-until-write), even though it is perfectly safe in the majority of cases (no conflicts).
* forcing users/admins to *always* re-add manually after an array has been running degraded (this is not supporting hot-plugging, rather the contrary)
* making the perfectly safe re-addition of an outdated member device (i.e. an older backup) look indistinguishable from re-adding a member with conflicting changes (with data-loss!). The admin (*always* forced to --add manually) cannot notice when the operation will cause data loss.

>> I am not sure if we are considering the valid use case of auto
>> re-adding members enough here, yet. (Even if auto-adding just
>> "missing" and not "removed" members.) I.e. the case of
>> docking-stations / external backup drives.
>
> I'm not quite sure what you mean here. A device that is removed should
> never be automatically added when detected.

Please check https://wiki.ubuntu.com/HotplugRaid for example, and consider the need for a hot-plugging scheme that supports safe auto-re-adding.
If you manually --remove a member, it should not get auto-re-added. If a member is only missing for a while, yes, the array should keep running as well as be run degraded upon boot (as long as no conflicting changes were made).

ceg (ceg) wrote :

Always auto-removing as a means to drop auto-re-add features simply isn't an answer for conflict detection.

ceg (ceg) wrote :

Currently mdadm does not seem to distinguish the manual --removed status from the status a missing drive gets when the array is run degraded.

Especially with write-intent bitmaps regularly being used for faster syncing in hotplug setups, and mdadm only comparing whether the "event count is in range of the bitmap":

* Fixing this "data-loss on conflicting changes" bug will require better detection of conflicts.
* Support for tracking explicitly --removed disks in the superblocks, to prevent their auto-re-addition, is a valid but separate issue as far as I am concerned.

Phillip Susi (psusi) wrote :

On 4/14/2010 3:18 PM, ceg wrote:
> If I read your proposal correctly, running an array degraded would
> always also "remove" the missing disk.

That is exactly what happens. When you give the go ahead to degrade the
array, you fail and remove the missing disk.

> This would imply to * break all the auto-re-add later feature of
> mdadm --incremental (it also sports auto-read-only-until-write), even
> though it is perfectly safe in the majority of cases (no conflicts).
> * force users/admins to *always* re-add manually after an array is
> running degraded (this is not supporting hot-plugging, rather the
> contrary) * make the perfectly safe re-addition of an outdated member
> device (i.e. an older backup) look indistinguishable from re-adding a
> member with conflicting changes (with data-loss!). The admin
> (*always* forced to --add manually) cannot notice when the
> operation will cause data loss.

I suppose that you could avoid marking the missing disk as removed when
degrading the array, then --incremental could try to add it again later
automatically. If the disk has not been tampered with then it would be
resynced, hopefully quickly with the help of the write intent bitmap.
In this case where the other disk has also been modified, the conflict
can be easily detected because the first disk says the second disk is
failed, and the second disk says the first disk is failed. If the
second disk was not also degraded then it would still show both disks
are active and in sync.

ceg (ceg) wrote :

I see that we were stumbling over confusing wording in mdadm.

Upon disappearance, a real failure, mdadm --fail, or running an array degraded: mdadm -E shows *missing* disks marked as "removed". (What you probably referred to all the time.) Even though nobody actually issued "mdadm --remove" on them. (What I referred to.)

After a manual --fail (the disk is already marked "removed" now), however, you still need to explicitly --remove to unbind the disk from the md device, and one must --fail before --remove is possible ("md device busy").

All would be clearer if
* mdadm -E would report "missing" instead of "removed" (which sounds like it really got "mdadm --remove"d)
* "mdadm --remove"ing would not require a prior manual --fail, and only this would really mark disks as "removed" in the superblocks.

> I suppose that you could avoid marking the missing disk as removed when
> degrading the array, then --incremental could try to add it again later
> automatically.
> If the disk has not been tampered with then it would be
> resynced, hopefully quickly with the help of the write intent bitmap.

I think --incremental has supported auto re-adding for years already. And since auto re-adding is a reality and an important feature, relabeling the "removed" mark as "missing" should remove the confusion.
(Auto re-adding is broken in Ubuntu, though (outside of initramfs, for disks set up during initramfs), because the map file is not kept. Bug #550131)

> In this case where the other disk has also been modified, the conflict
> can be easily detected because the first disk says the second disk is
> failed, and the second disk says the first disk is failed. If the
> second disk was not also degraded then it would still show both disks
> are active and in sync.

That is a good point!
If conflicting changes can be detected by this, why does mdadm not use this conflicting information (when parts of an array claim each other to be failed) to just report "conflicting changes" and refuse to --add without --force? (You see I am back to asking to report and require --force, to make it clear to users/admins that it is not just some bug/hiccup in the hot-plug mechanism that made it fail; --add is a manual operation that implies real data-loss in this case, unlike in other cases where it will only sync an older copy instead of a diverged one.)

Thierry Carrez (ttx) on 2010-04-16
Changed in mdadm (Ubuntu Lucid):
assignee: nobody → Dustin Kirkland (kirkland)
milestone: ubuntu-10.04 → none
Phillip Susi (psusi) wrote :

On 04/15/2010 03:55 AM, ceg wrote:
> Upon disappearance, a real failure, mdadm --fail, or running an array
> degraded: mdadm -E shows *missing* disks marked as "removed". (What
> you probably referred to all the time.) Even though nobody actually
> issued "mdadm --remove" on them. (What I referred to.)

Exactly, when mounting an array in degraded mode with missing disks,
mdadm marks the missing disks as removed. It probably should only mark
them as faulty or something less severe than removed.

> All would be clearer if * mdadm -E would report "missing" instead of
> "removed" (which sounds like it really got "mdadm --remove"d)

There already exists a faulty state. It might be appropriate to use that.

> That is a good point! If conflicting changes can be detected by
> this, why does mdadm not use this conflicting information (when parts
> of an array are claiming each other to be failed) to just report
> "conflicting changes" and refuse to --add without --force? (You see I
> am back asking to report and require --force to make it clear to
> users/admins that it is not just some bug/hiccup in the hot-plug
> mechanism that made it fail, but --add is a manual operation that
> implies real data-loss in this case, not as in others when it will
> only sync an older copy instead of a diverged one.)

That seems to be the heart of the bug. If BOTH disks show the second
disk as removed, then mdadm will not use the second disk, but when the
metadata on the second disk says disk 2 is fine, and it's disk 1 that
has been removed, it happily adds the disk. It should not trust the
wrong metadata on the second disk, and should refuse to use it unless
it can safely coerce it into agreement with the active metadata in the
array taken from the first disk.

If the second disk says both disks are fine, then the array state of
disk 2 can be changed to active/needs sync, and the metadata on both
disks can be updated to match and the resync started.

If the second disk says that the first disk has been
removed/failed/missing, then you can not reconcile them since failing
the first disk would fail the array, and activating the second disk
could destroy data. In this case the second disk should be marked as
removed and its metadata updated. This will make sure that if you
reboot and the second disk is detected first, that it will not be
activated. In other words, as soon as you have a boot that does see
both disks after they have been independently degraded and modified, ONE
of them will be chosen as the victor, and used from then on, and the
other will be removed until the admin has a chance to investigate and
decide to manually add it back, thus destroying any changes on that disk
that were made during the boot with only that disk available.
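The decision procedure described above could be sketched like this. The superblock dict, field names, and state strings are illustrative only, not mdadm internals:

```python
def reconcile_second_disk(second_sb, first_dev):
    """Decide what to do with a later-appearing disk, following the
    rules above (illustrative sketch, not mdadm code).

    second_sb: superblock of the disk that appeared second
    first_dev: name of the already-active first disk
    """
    failed = second_sb["failed"]
    if not failed:
        # The second disk still shows both members active and in sync:
        # safe to mark it 'needs sync' and start a resync.
        return "resync"
    if first_dev in failed:
        # Mutual accusation: the second disk was modified while running
        # degraded. Mark it removed so later boots pick the same victor;
        # the admin must investigate and --add it back manually.
        return "mark-removed"
    # Any other failure claim: treat conservatively and keep it out.
    return "mark-removed"
```

Updating the losing disk's metadata is what prevents the flip-flopping between boots that the following comments discuss.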

ceg (ceg) wrote :

>> All would be clearer if * mdadm -E would report "missing" instead of
>> removed (which sounds like it really got "mdadm --removed")
>
> There already exists a faulty state. It might be appropriate to use that

This and the detection process sounds reasonable to me.

I am not sure how much sense auto-removing a conflicting part from an array makes from the user's side. As the order in which devices appear can be random, I would rather like mdadm to refrain from doing any metadata updates based on it, if that is not necessary.

With mdadm patched to detect and report conflicting changes, and not to sync them without --force, it should protect from unaware data-loss and corruption and always provide a coherent state, consisting of the first (or maybe intentionally even the only) part of an array that appears.

A coherent mdadm way of informing the user about the appearance of conflicting changes (after a run-degraded event) would probably be to emit a "conflicting changes" mdadm --monitor event.

If mdadm --incremental were to auto-remove a part with conflicting changes, it might not remove the part the user would actually --remove. But the situation would look very similar to an admin having actually --removed something, possibly wrongly suggesting to the admin that it's a simple and safe matter of re-adding, while he actually needs to manually reverse the auto-remove operation to prevent critical data loss.

ceg (ceg) wrote :

A name for this might be "safe segmentation". It prevents the data loss that could occur by syncing unreliable disks.

Phillip Susi (psusi) wrote :

Updating the metadata is needed to prevent further flip-flopping. Once
the situation is detected, it needs to be noted so that further reboots
will not decide to use the other disk. Pick one, and stick with it
until the admin sorts things out.

ceg (ceg) wrote :

> Once
> the situation is detected, it needs to be noted

Right, this is important especially in cases where segmentation has happened unintentionally. That is why I wanted mdadm to fail on conflicting changes without --force, not auto-sync, and to emit an event (email, beep, notification, whatever is configured).

> so that further reboots
> will not decide to use the other disk. Pick one, and stick with it
> until the admin sorts things out.

Bear in mind that mdadm --incremental is handling more than reboots, and segmenting the array can be intentional by the admin/user.

> Updating the metadata is needed to prevent further flip-flopping.

(Between reboots I haven't seen device ordering change too randomly anyway. Mostly the enumeration seems to stay the same if nothing is rewired. I'd consider the hot-plugging order much more arbitrary, and even less worth committing to the metadata.)

But would you see it as necessary for ensuring a consistent, uncorrupted array and closing this bug?
I think it is enough if mdadm only assembles the first part that is attached. It may be good, however, if mdadm assembled any conflicting parts as extra devices (normally md128 and up) so the parts are accessible for inspection, can be compared, manually merged, etc.

Updating the metadata would prevent working with and switching between concurrent versions in a hot-plugging manner.
Think of the use-case of segmenting a (non-root fs) data-array into two halves in order to do some major refactoring. (This is like keeping a snapshot by using only part of the mirror.)

Dustin Kirkland  (kirkland) wrote :

Solving this bug will require a non-trivial overhaul of mdadm's failure hooks in the initramfs, and potentially new code in mdadm itself.

In my opinion, this is not something that can be solved in Lucid before release. Also in my opinion, this is not a release-critical issue; rather, it should be addressed in the 10.04 release notes.

As such, I'm marking this bug won't-fix for Lucid, but leaving it triaged for the next development cycle (Maverick), and unassigning myself.

I can see where one of the bug's subscribers has written a spec on what they believe to be a better design for mdadm/initramfs failure handling:
 * https://wiki.ubuntu.com/ReliableRaid

Someone from the Ubuntu Foundations Team or the Ubuntu Community can propose this spec at UDS-Maverick in May, and perhaps implement the re-design in the next release. But the time has passed for this level of feature development in Lucid.

Cheers,
:-Dustin

Changed in mdadm (Ubuntu Lucid):
assignee: Dustin Kirkland (kirkland) → nobody
status: Triaged → Won't Fix
ceg (ceg) wrote :

Additional thoughts why updating metadata looks more limiting than beneficial to me:

Unintentional (intermittent) disk failures won't cause conflicting changes to appear, only auto re-sync events.

Segmenting an array into parts with conflicting changes requires repeated boots with separate parts attached, or manually --run'ing separate parts of an array degraded (later hot-plugged disks).

Use-case even with reboots: prior to doing a dist-upgrade, one boots with only part of the root-fs array attached, and can then switch back and forth between the versions by rebooting, until deciding which way to sync.

Dustin Kirkland  (kirkland) wrote :

As a followup, if this were fixed cleanly, and in a backportable manner, this could be a reasonable candidate for an SRU.

Jamie Strandboge (jdstrand) wrote :

Considering we are a little over a week away from release, Dustin's comment sounds reasonable. This bug existed in 9.10, and we should get it fixed, but rushing a fix before release could easily affect more RAID users than this bug would. Hopefully the solution will be contained enough to make it SRU-worthy so we can get it into Lucid after release.

Phillip Susi (psusi) wrote :

On 4/19/2010 1:10 PM, ceg wrote:
> (Between reboots I haven't seen too random/changing device
> reordering anyway. Mostly the enumeration seems to stay the same if
> nothing is rewired. I'd consider the hot-plugging order much more
> arbitrary, and even less worth of committing to the meta-data.)

You just made my point. The hot plugging case is the best example here.
 If I plug in one disk and make some changes, then unplug it, plug in
the other disk, and make some changes to it, in the future I don't want
which set of changes appears to depend on which disk I plug in first.
As soon as both disks are plugged in and the conflicting changes are
detected, you must record that in the metadata.

> But would you see it as necessary to ensure a consistent uncorrupted
> array and closing this bug? I think it is enough if mdadm will only

Very much so.

> assemble the first part attached regularly. It may be good however if
> mdadm would assemble any conflicting parts as extra devices (normally
> md128 and up) so the parts are accessible for inspection, can be
> compared, manually merged etc.

No need to do that automatically, this is where manual intervention
comes in. Once mdadm has rejected one of the disks and the admin
notices, he can easily ask mdadm to move it to another array by itself
to be mounted, inspected, merged, etc.

> Updating the metadata would prevent working with and switching
> between concurrent versions in a hot-plugging manner. Think of the
> use-case of segmenting a (non-root fs) data-array into two halves in
> order to do some major refactoring. (This is like keeping a snapshot
> by using only part of the mirror.)

If that is the intent, then the user needs to manually remove one disk
from the array and set it aside or add it to a separate array if they
wish. If we /accidentally/ fork the array, we need to set the
conflicting array aside and notify the user that they need to sort the
situation out manually. We avoid making the situation any worse than it
already is by updating the metadata.

Steve Langasek (vorlon) wrote :

As I do think we will want to fix this in SRU once a fix is available, un-wontfixing the lucid task. We definitely *don't* want to try to change this now before release, but we should fix it in Lucid.

Jamie, was this regression first introduced in 9.10, or did it exist in previous releases as well?

Dustin, in the future if you believe an issue should be documented in the release notes, please open a task on the 'ubuntu-release-notes' project. Thanks!

Changed in mdadm (Ubuntu Lucid):
status: Won't Fix → Triaged
Steve Langasek (vorlon) wrote :

Here's candidate text for the release notes, taken from the beta2 tech overview:

Activating a RAID 1 array in degraded mode may lead to RAID disks being reported as in sync when they are not, resulting in data loss. Since RAID 1 arrays will automatically be brought up in degraded mode when a member disk is unavailable, users with production software RAID 1 disks are advised not to upgrade to Ubuntu 9.10 or 10.04 LTS until this bug is resolved. (Bug:557429)

Phillip Susi (psusi) wrote :

I think that warning is a bit misleading/extreme. The damage only
occurs if you bring up one disk degraded, *and* then the other disk
degraded. In practice, this should never happen since usually someone
would notice the degraded event and take action to restore the missing
disk. The release notes should simply explain when the problem occurs,
and warn people to be aware of it and watch out for it. Maybe something
like this:

Activating a RAID1 array with only one disk, then activating the array
with only the other disk, then finally returning to normal operation
with both disks can cause the disks to be combined out of sync, leading
to severe data loss. You should take care to make sure that this
situation does not happen.

I agree with Philip's assessment.

While this is very easy to reproduce in a VM (by just removing/adding
backing disk files), in practice and on real hardware, I think this is
definitely less likely.

When a real hardware disk fails, it should be removed from the system,
and not come back until it's replaced with new hardware, in which case
this bug will not be triggered. As Philip explained, this would only
happen if an admin is adding and removing and booting with just one
disk, and then the other, and then both. Don't do that.

Jamie Strandboge (jdstrand) wrote :

I also agree with Philip's assessment. When it hits, it is devastating, but it takes a very specific series of events to hit, and asking people to not upgrade as a result is too extreme.

Philip, you mentioned to me that 9.10 was also affected-- what about earlier releases?

Dustin Kirkland  (kirkland) wrote :

Jamie-

As for earlier releases, I haven't tested this, but having written the
original logic in the mdadm's failure hooks in the initramfs, I can
tell you that the code handling is present in:
 * 8.04 (via a point release/SRU)
 * 8.10
 * 9.04
 * 9.10
 * 10.04

:-Dustin


I'm also fine with this being postponed until after release; segmenting a
raid into concurrent hot-pluggable parts is a use-case without correct
support right now.

> > hot-plugging order much more
> > arbitrary, and even less worth of committing to the meta-data.)
>
> If I plug in one disk and make some changes, then unplug it,
> plug in the other disk, and make some changes to it,

What would be your use-case?

> in the future I
> don't want which set of changes appears to depend on which disk I
> plug in first.

In most cases the next thing one would probably want
after conflicting changes are present in a system is to sync, in an
easy way. (Not to keep rebooting or reattaching much. Reattaching is
just a simple way to determine the order.)

Your case does not sound like a hot-plug use-case. Probably handle
it with --remove?

> As soon as both disks are plugged in and the
> conflicting changes are detected, you must record that in the
> metadata.

No, you must prevent data-corruption or loss. But don't do things like
--remove(ing) parts or fixing ordering in a hotplug environment
(and mdadm --incremental is just for that), because it would break
further management of the raid devices in a hot-plugging manner.

> > It may be good however
> > if mdadm would assemble any conflicting parts as extra devices
> > (normally md128 and up) so the parts are accessible for inspection,
> > can be compared, manually merged etc.
>
> No need to do that automatically, this is where manual intervention
> comes in.

Note that mdadm --incremental already does that for "unknown" arrays
(not defined or allowed by AUTO in mdadm.conf), it's not a new feature.

But your comments are a little irritating. We are actually talking
hot-plugging here, right? Plus ubuntu's no config, no intervention
necessary approach. Everything should just work.

> Once mdadm has rejected one of the disks and the admin
> notices, he can easily ask mdadm to move it to another array by itself
> to be mounted, inspected, merged, etc.

Are you actually aware of what that means? I am not saying it is not
possible to create a new array from parts of an existing array without
losing the data, but it sure isn't a trivial mdadm command. And then
you are really breaking up the array and won't be able to just sync the
other parts and still have the same (UUID) array.

>
> > Updating the metadata would prevent working with and switching
> > between concurrent versions in a hot-plugging manner. Think of the
> > use-case of segmenting a (non-root fs) data-array into two halves in
> > order to do some major refactoring. (This is like keeping a snapshot
> > by using only part of the mirror.)
>
> If that is the intent, then the user needs to manually remove one disk
> from the array and set it aside or add it to a separate array if they
> wish. If we /accidentally/ fork the array, we need to set the
> conflicting array aside and notify the user that they need to sort the
> situation out manually.

Yes, yes and yes again, this needs to be done in *any* case of
conflicting changes. If mdadm --incremental (the mdadm hotplug manager)
sets up the conflicting parts on separate md devices they will both
even appear on the desktop.

> I also agree with Philip's assessment. When it hits, it is
> devastating, but it takes a very specific series of events to hit,
> and asking people to not upgrade as a result is too extreme.

I agree:

Re-attaching parts of an array that have been running degraded
separately and contain conflicting changes of the same amount
results in the assembly of a corrupt array.

> Philip, you mentioned to me that 9.10 was also affected-- what about
> earlier releases?

It was probably present in the current form since the udev rules use
"mdadm --incremental". (9.04 if I remember correctly)

ceg (ceg) wrote :

Dustin, I don't think this has anything to do with the failure hooks in this case. :) (Here it's mdadm that does not pick up on the conflict.)

ceg (ceg) on 2010-04-20
description: updated

On 4/20/2010 3:21 PM, ceg wrote:
>> If I plug in one disk and make some changes, then unplug it,
>> plug in the other disk, and make some changes to it,
>
> What would be your use-case?

I don't understand this question. The use case is described in the text
you replied to.

> In most cases the next thing one would probably want
> after conflicting changes are present in a system is to sync, in an
> easy way. (Not to keep rebooting or reattaching much. Reattaching is
> just a simple way to determine the order.)
>
> As your case does not sound like a hot-plug use-case. Probably handle
> that with --remove?

Handle what?

> No, you must prevent data-corruption or loss. But don't do things like

Of course. The question is HOW?

> --remove(ing) parts or fixing ordering in a hotplug environment
> (and mdadm --incremental is just for that), because it would break
> further management of the raid devices in a hot-plugging manner.

This is the HOW part. The removing does not break anything. It
prevents you from continuing to flip flop which disk you are using after
they have been forked, and thus making things worse.

> But your comments are a little irritating. We are actually talking
> hot-plugging here, right? Plus ubuntu's no config, no intervention
> necessary approach. Everything should just work.

Until everything goes all pear-shaped, at which point "doing the right
thing" is not clear, so manual intervention is required. Once the array
has been forked, the best thing you can do is not make things any worse.
Fixing it has to be done by hand.

> Are you actually aware of what that means? I am not saying it is not
> possible to create a new array from parts of an existing array without
> losing the data, but it sure isn't a trivial mdadm command. And then
> you are really breaking up the array and won't be able to just sync the
> other parts and still have the same (UUID) array.

The array is already broken up. Resyncing will destroy data. If you
want to rescue that data you must move the other disk to its own array
so you can mount it. After you have rescued any data, then you can drop
it back into the original array and it will sync.

> Yes, yes and yes again, this needs to be done in *any* case of
> conflicting changes. If mdadm --incremental (the mdadm hotplug manager)
> sets up the conflicting parts on separate md devices they will both
> even appear on the desktop.

Sure, automatically splitting the array would be a nice feature, but the
minimum action required to fix the bug is to simply reject the second
disk, updating its metadata in the process.

> No, it really makes things worse! It prevents the user/admin from
> managing arrays (parts in this case) by simply plugging disks.

No it does not. What it does is prevent the damage from growing worse
without being noticed.

> And what would be the gain of auto-removing/writing metadata? If the
> disks are connected during boot the disks will almost always stay in
> the same order anyway, eliminating the gain to save that order
> to metadata. If you want a specific order from the start, you need
> to manually issue mdadm commands anyway. But now also if you need
> another order than what was written to metadata. And all the mdadm
> commands need to be issued in between an active hot-plugging
> system (interference/no map file updating), instead of just
> re-plugging your disks in order.

As I already said, the gain is to prevent continued flip-flopping back
and forth between the two divergent filesystems based only on which
disk is detected first. Almost always != always. You seem to be
suggesting that the user physically disconnect one disk if they wish to
access data on the other disk, rather than run mdadm.

ceg (ceg) wrote :

Phillip, first please explain where/why you think "flip-flopping would
occur continuously", in such a way that it is not enough to never
assemble a corrupt array and to notify someone to take care of
reconciling the conflicting changes if desired.
Because this seems to be the reason you want to break mdadm
--incremental support for hot-plugging segmented array parts
(by committing an arbitrary but not highly fluctuating/flip-flopping
state to metadata, i.e. promoting auto-remove instead of always leaving
that a manual action).

ceg (ceg) wrote :

> >> If I plug in one disk and make some changes, then unplug it,
> >> plug in the other disk, and make some changes to it,
> >
> > What would be your use-case?
>
> I don't understand this question. The use case is described in the
> text you replied to.

Please explain why someone would do that; a raid does not get segmented
and charged with conflicting changes by itself. Even if an intermittent
and alternating failure causes it, it should get reported independent
of whether metadata is altered.

> > In most cases the next thing one would probably want
> > after conflicting changes are present in a system is to sync, in an
> > easy way. (Not to keep rebooting or reattaching much. Reattaching is
> > just a simple way to determine the order.)
> >
> > As your case does not sound like a hot-plug use-case. Probably
> > handle that with --remove?
>
> Handle what?

Manually removing should ensure that always only one and the same part
gets assembled. It sounds like you want to hot-plug the parts in an
arbitrary order and not have the array assembly be determined by this.

> The [auto-]removing does not break anything.

Please stop ignoring that auto-remove breaks hot-plugging. By this,
mdadm --incremental would limit its own usefulness.

ceg (ceg) wrote :

> [auto-removing]
> prevents you from continuing to flip flop which disk you are using
> after they have been forked, and thus making things worse.

And this is a program thinking it knows better than the user; mdadm
--incremental should not do that. If you continue to do that after you
have been informed, you probably do it intentionally, and mdadm should
not interfere.

I have provided the dist-upgrade and refactoring use-cases of a non-root-filesystem array as use-cases that switch between versions. And Neil also told you about backup schemes.

> The array is already broken up.

For conflicting changes to occur, 1) arrays need to be running degraded,
which should only happen automatically with arrays required to boot when
parts are missing during boot. Then 2) the missing part has to reappear
and 3) be run degraded also while 4) the previously remaining part is
removed, and then 5) both parts have to be present again.

Aside from this, hot-plugging/connecting parts of an array to any
machine should never run it degraded.

> Resyncing will destroy data. If you
> want to rescue that data you must move the other disk to its own array
> so you can mount it.

With hot-pluggable devices you don't have to. You should just need to
plug them into your system and they should get mounted. So once an array
is run degraded, if you plug in just the part you want, it should get
mounted, done.

> After you have rescued any data, then you can
> drop it back into the original array and it will sync.

With hot-pluggable devices you should just need to plug both parts, the
one you want to keep first. Then "mdadm --stop <md-auxiliary>" and
"mdadm --add <md-to-keep> <members-with-conflicting-changes> --force",
done.

ceg (ceg) wrote :

> the minimum action required to fix the bug is to simply reject the
> second disk, updating its metadata in the process.
>
> > No, [updating metadata] really makes things worse! It prevents the
> > user/admin from managing arrays (parts in this case) by simply
> > plugging disks.
>
> No it does not.

Explain how it does not prevent switching between conflicting changes by
hot-plugging.

> What it does is prevent the damage from growing worse
> without being noticed.

The real damage is prevented by mdadm, if mdadm --incremental returns
"mdadm: not re-adding /dev/... because it contains conflicting changes"
instead of setting up a corrupt array. Notice is also given by the
mdadm --monitor daemon reporting a "conflicting changes" event to
users/admins.

Being able to hot-plug/switch between conflicting changes is a
feature not a bug.

>
> > And what would be the gain of auto-removing writing metadate? If the
> > disks are connected during boot the disks will almost always stay in
> > the same order anyway, eliminating the gain to save that order
> > to metadata. If you want a specific order from the start, you need
> > to manually issue mdadm commands anyway. But now also if you need
> > another order than what was written to metadata. And all that mdadm
> > commands need to be issued in between an active hot-plugging
> > system (interference/no map file updating), instead of just
> > re-plugging your disks in order.
>
> As I already said, the gain is to prevent continued flip-flopping back
> and forth

What continued flip-flopping back and forth? Read again!

> between the two divergent filesystems based only on which
> disk is detected first. Almost always != always.

Almost always != always, because there are use-cases
where the user explicitly wants to hot-plug "flip-flop" several times
between the parts.

> You seem to be suggesting that the
> user physically disconnect one disk if they wish to access data on
> the other disk, rather than run mdadm.

Plugging disks is a nice and easy alternative these days.

Explain why it is a bad idea to plug and unplug (e)SATA disks plugged
into your laptop, or in your docking station, running an udev/mdadm
--incremental system.

Phillip Susi (psusi) wrote :

I'm going to boil this down very simply to try and bring an end to this. If you wish to automatically split the second disk into a new array with a new uuid, you must update the metadata on that disk to indicate you have done so, and if you end up connecting that second disk in the future without the first, it must still show up with the new uuid. Which uuid it appears as must not depend on which one was plugged in first.

That however, would be a new feature of the plug and play system outside the scope of mdadm. If you want to automatically split the disk off into a new array after the desync has been detected, that would be nice, but fixing the bug in mdadm is as simple as having it detect the conflicting metadata on the second disk caused by the divergence, and fixing said metadata to agree with the metadata in the array, which says that disk is failed.

At that point whether some other component automatically invokes mdadm to move the second disk to a brand new array, or the admin has to by hand, I don't really care.

ceg (ceg) wrote :

Phillip, before suggesting something I try to think the issue through,
and I try the same with feedback.

But after several attempts to explain that changing metadata and
removing the "failed" status (of already running parts) in the
superblocks of the conflicting parts that are plugged in (but not to be
added to the running array) breaks hot-plugging, I sadly still can't
recognize any consideration of the bad effects your approach would have
for many users.

And if I think about it, your metadata updates may not have the overall
effect you expect. When the modified part is plugged in during future
boots, it can get run degraded again, the metadata is then back to what
it was before, and it can again be used normally. So the metadata update
just breaks hotplugging, and you could not explain a case where
continuous unintentional flip-flopping would occur and updating metadata
would help.

> If you want to automatically split the
> disk off into a new array after the desync has been detected,

Correct, that is unrelated to the metadata problem. I commented on it
because setting this up has its pitfalls (like UUID duplicates, and this
bug requiring --zero-superblock to prevent it from biting), and it would
much facilitate comparing, copying, etc. in a hot-plug environment.

> but fixing the bug in mdadm is as simple as having it
> detect the conflicting metadata on the second disk caused by the
> divergence, and fixing said metadata

It's even simpler once you can see that fixing metadata creates more
issues than are actually there, and more than updating metadata would
really be able to solve.

ceg (ceg) wrote :

> whether some other component automatically invokes mdadm
> to move the second disk to a brand new array, or the admin has to by
> hand, I don't really care.

You are probably not aware enough that all udev/hotplug magic for raid
is within mdadm --incremental. I.e. in the future it will even set
up blank disks inserted into DOMAINs defined in mdadm.conf as spares etc.

As a last overall note: maybe remember again that raid systems are
designed to keep your machine running as long as possible, up until
no redundancy is left.

When the redundancy is increased again, it can happily resync if
possible. When the system runs without redundancy on different array
segments one at a time, they cannot be synced until redundancy has been
restored.

In this case conflicting changes may occur; it's the nature of an "only
one at a time" failure that the changes will not always be available,
but raid can keep the system running until the cause is identified and
fixed, while no data is really lost.

If it happens that both segments become available with conflicting
changes, one needs to be chosen (the first one is already there). But if
you update the metadata on this occasion (disabling one segment), from
that moment on the raid system will not keep the system running as
designed, the way it did before both segments came up together once.
(You would change/break behavior.)


On 4/22/2010 5:08 AM, ceg wrote:
> Phillip, before suggesting something I try to think through the issue,
> and the same I try with feedback.
>
> But after several attempts to explain that changing metadata and
> removing the "failed" status (of allready running parts) in the
> superblocks of the conflicting parts that are plugged-in (but not to be
> added to the running array) breaks hot-plugging, I sadly still can't
> recognize any consideration of the bad effects your approach would have
> for many users.

That's because it DOESN'T break hot-plugging. I have explained why.

> And if I think about it, your metadata updates may not have the overall
> effect you may expect. When the modified part is plugged in
> during future boots, it can get run degraded again, the metadata
> is then back to what it was before, and it can again be used normally.
> So the metadata updates just breaks hotplugging and you could not
> explain a case where continous unintentional flip-flopping would occur
> and updating metadata would help.

No, the second disk will not be run degraded again; that is the whole
point of correcting the wrong metadata. If the second disk is the only
one there on the next boot, it will show that disk 2 is failed so it
can't be used, and mdadm can't find disk 1, so the array can not be started.

> Correct, that is unrelated to the metadata problem, I commented on it
> because setting this up has its pitfalls (like UUID dupes and this bug
> requiring --zero-superblock to prevent it from biting) and it would much
> facilitate comparing, copying etc. in a hot-plug environment.

As I said before, it does not require --zero-superblock. Once disk2 is
failed and removed from the array, you can create a new array using that
disk. mdadm will warn you that the disk appears to already be part of
an array, but you can tell it to continue and it will put disk2 in a new
array, with a new uuid, and you can mount it and inspect it. Once you
are done with it you can move it back to the original array and a full
resync will be done.
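A hedged shell sketch of that manual rescue. All device and mount-point names are made-up examples, and the helper prints the commands (DRY_RUN=1, the default) rather than running them, since the real thing needs root and real hardware:

```shell
# Sketch of the rescue procedure described above; review before running
# for real (DRY_RUN=0).
run() { if [ "${DRY_RUN:-1}" = "1" ]; then echo "$*"; else "$@"; fi; }

rescue_rejected_member() {
    keep_md=$1    # surviving array, e.g. /dev/md0
    reject=$2     # rejected member, e.g. /dev/sdb1
    aux_md=$3     # temporary array, e.g. /dev/md9
    # Re-create the rejected member as a one-disk mirror so it can be
    # mounted; mdadm will warn that the disk looks like part of an
    # existing array and ask for confirmation.
    run mdadm --create "$aux_md" --level=1 --raid-devices=2 missing "$reject"
    run mount -o ro "$aux_md" /mnt/rescue
    # ... inspect / copy out the diverged data here ...
    run umount /mnt/rescue
    run mdadm --stop "$aux_md"
    # Return the disk to the original array; a full resync follows.
    run mdadm "$keep_md" --add "$reject"
}

# Example (prints the planned commands):
#   DRY_RUN=1 rescue_rejected_member /dev/md0 /dev/sdb1 /dev/md9
```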

> It's even simpler once you can see that fixing metadata creates more
> issues than are actually there and updating metadata would really be
> able solve.

I have shown why this is wrong.

> If it happens that both segments get available with
> conflicting changes, one needs to be chosen (first one is already
> there). But if you update the metadata on this occasion (disabling one segment),
> from this moment on the raid system will not keep the
> system running as designed, and like it did before both segments came up
> together once. (You would change/break behavior.)

Yes, and this change is entirely intentional because if you don't do
this, then you can unintentionally continue to further diverge the two
disks without noticing, causing further damage. Imagine a server that
boots and decides it can't find disk2, so it goes degraded. It has a
cron job that fetches email from a pop server and deletes them once they
have been downloaded. The server reboots and this time can only find
disk1. Now the cron job again, fetches and deletes some mail. Now some
of your mail is on disk1, and some is on disk2, and you a...


ceg (ceg) wrote :

I'd suggest to consider the following option about whether to assemble
segments known to contain conflicting changes or not:

AUTO -SINGLE_SEGMENTS_WITH_KNOWN_ALTERNATIVE_VERSIONS

> That's because it DOESN'T break hot-plugging. I have explained why.

You have the right to think that; obviously we disagree on that point.
You may often think you explained something while I really miss the
explanation, especially for what I explicitly asked for.

It seems you consider segments being available one at a time, then at
some point eventually together, and then one at a time again later, to
indicate a not-mainly-theoretical failure type, and that it should be
handled by always auto-removing all but the first segment on that
occasion. (So they won't get auto-assembled anymore.)

Let's conclude this is OK for part of the users. (Mostly those that
want to be sure and manage their arrays by issuing commands by hand.)

But it does pose a problem if you want to support managing array
segments by just plugging disks and occasionally issuing simple sync
directives (eventually just by right-clicking on the segments showing
up on the desktop).

> > from this moment on the raid system will not keep the
> > system running as designed, and like it did before both segments
> > came up together once. (You would change/break behavior.)
>
> Yes, and this change is entirely intentional because if you don't do
> this, then you can unintentionally continue to further diverge the two
> disks without noticing,

You'd have to miss or ignore quite a few notifications not to notice.
Notice is given as soon as the first degradation occurs, so the admin
should know something is going on and usually take action well before
the incident happens. At the latest upon the second notification, when
the same array is run degraded again, he can know it has split into
segments with conflicting changes (even if the message may (currently)
not be explicit about it).

Note however that even if I think a failure showing this type of
behavior seems more fictional than users intentionally segmenting the
array before upgrades and such, I can very well relate to those not
wanting to configure mdadm.conf's AUTO option at all (i.e. on servers),
just to be sure nothing happens behind their backs, and instead setting
"AUTO -SINGLE_SEGMENTS_WITH_KNOWN_ALTERNATIVE_VERSIONS" in order to
disable hot-plugging for segments with alternative versions.

That's what settings are for and what can make all happy.

Phillip Susi (psusi) wrote :

On 4/23/2010 6:52 AM, ceg wrote:
> I'd suggest considering the following option for whether to assemble
> segments known to contain conflicting changes or not:
>
> AUTO -SINGLE_SEGMENTS_WITH_KNOWN_ALTERNATIVE_VERSIONS

As I have said, if you want another component to automatically notice
when one of the disks has been ejected from the array due to conflicting
changes and migrate it to a new array, that is quite fine. It would
then show up as a new mount on your desktop.

> It seems you consider that segments being available one at a time, then
> on some occasion together, and then one at a time again later would
> indicate a not-mainly-theoretical failure type, and that it should be
> handled by always auto-removing all but the first segment on that
> occasion. (So they won't get auto-assembled anymore.)
>
> Let's conclude this is OK, for part of the users. (Mostly those that
> want to be sure and manage their arrays issuing commands by hand.)
>
> But it does pose a problem if you want to support managing array
> segments by just plugging disks and occasionally issuing simple sync
> directives (possibly just by right-clicking on the segments showing up
> on the desktop).

Even if you intentionally caused the divergence you don't want both
disks to show up as the same volume when plugged in. One of them should
be renamed so it is clear that they are not the same anymore, and if you
do connect them both, then both should show up -- as separate volumes.

ceg (ceg) wrote :

> Even if you intentionally caused the divergence you don't want both
> disks to show up as the same volume when plugged in.

Right, they'd need to show up under an additionally enumerated (or mangled) "version name" if another segment (version) of the same array is already running. For hot-plug management of segments to work, however, all segments would need to show up under their real array ID if connected first or one at a time. Otherwise the system won't recognize the segment of the array as such and boot or open it correctly, and you won't be able to switch between versions by switching the disks that are connected.

> if you want another component to automatically notice
> when one of the disks has been ejected from the array due to conflicting
> changes and migrate it to a new array, that is quite fine. It would
> then show up as a new mount on your desktop.

However, that is a different thing. That's creating new and different arrays; it is not managing segments of one array.

Phillip Susi (psusi) wrote :

I suppose that the rename could be only temporary while both disks are
connected, if so configured.

After some further testing, it seems that the bug in mdadm is a bit more
general. In --incremental mode it goes ahead and adds removed disks to
the array, so even if you explicitly --fail and --remove one of the
disks from the array, a reboot or other event that causes mdadm
--incremental to be run will put the disk back in the array. The only
acceptable state a disk should be activated in by --incremental other
than in sync is failed. Once it has been removed it should be left alone.

The degraded case seems to just be a more specific way of encountering
this bug since it marks the disk as removed.
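The --incremental behavior described above can be sketched as a command sequence (device names /dev/md0 and /dev/sdb1 are placeholders; this is an illustration of the reported behavior, not something to run on a production array):

```shell
# Hypothetical reproduction sketch; device names are placeholders.
# Explicitly fail and remove a member from the array:
mdadm /dev/md0 --fail /dev/sdb1
mdadm /dev/md0 --remove /dev/sdb1

# On the next boot (or udev add/change event), the stock rule effectively runs:
mdadm --incremental /dev/sdb1

# Reported (buggy) behavior: the removed disk is re-added to /dev/md0.
# The comment above argues --incremental should leave a removed disk alone.
```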

ceg (ceg) wrote :

> In --incremental mode it goes ahead and adds removed disks to
> the array

Yes, it would be nice if the states were sorted out a little better. Running an array degraded during boot would, for example, only have to mark missing disks as failed, just as if they had failed while the array was running complete.

Andrea Grandi (andreagrandi) wrote :

Hi all,

comparing these two changelogs:

Ubuntu 10.04 beta 2: http://www.ubuntu.com/testing/lucid/beta2
Ubuntu 10.04 RC: http://www.ubuntu.com/getubuntu/releasenotes/1004overview

You have removed this bug from known issues:

Activating a RAID 1 array in degraded mode is reported to lead to RAID disks being reported as in sync when they are not, resulting in data loss. Since RAID 1 arrays will automatically be brought up in degraded mode when a member disk is unavailable, users with production software RAID 1 disks are advised not to upgrade to the 10.04 LTS Beta until this bug is resolved. (557429)

Looking at this bug page the bug is NOT FIXED yet! This could cause a data loss for users installing and using Ubuntu 10.04 RC.

Steve Langasek (vorlon) wrote :

Dustin,

On Tue, Apr 20, 2010 at 03:33:15PM -0000, Dustin Kirkland wrote:
> I agree with Philip's assessment.

> While this is very easy to reproduce in a VM (by just removing/adding
> backing disk files), in practice and on real hardware, I think this is
> definitely less likely.

> When a real hardware disk fails, it should be removed from the system,
> and not come back until it's replaced with new hardware, in which case
> this bug will not be triggered. As Philip explained, this would only
> happen if an admin is adding and removing and booting with just one
> disk, and then the other, and then both. Don't do that.

Have I misunderstood the nature of this bug, or couldn't it be triggered by
a flaky SATA cable causing intermittent connections to the drives? If one
port flakes on one boot, the other port flakes on the next, and both ports
are available on the third, wouldn't that trigger this same bogus
reassembly?

In fact, if the admin is trying to debug the problem, maybe the system comes
up two out of five times without seeing any drives at all, or they've
physically swapped which disk is on which port *because the cable is
unreliable*, and by the fifth time they've thought to replace the cable and
things are reliable again - and *then* the perfectly-good disks get
corrupted because of this bug.

So while it doesn't appear to be a recent regression, and not a
high-frequency occurrence, it does look like a data loss bug that can occur
through no fault of the admin, and I certainly think our users need to be
warned of this in the release notes.

--
Steve Langasek Give me a lever long enough and a Free OS
Debian Developer to set it on, and I can move the world.
Ubuntu Developer http://www.debian.org/
<email address hidden> <email address hidden>

ceg (ceg) on 2010-04-25
description: updated
description: updated
Phillip Susi (psusi) wrote :

On Sat, 2010-04-24 at 22:20 +0000, Steve Langasek wrote:
> Have I misunderstood the nature of this bug, or couldn't it be triggered by
> a flaky SATA cable causing intermittent connections to the drives? If one
> port flakes on one boot, the other port flakes on the next, and both ports
> are available on the third, wouldn't that trigger this same bogus
> reassembly?

Correct.

> So while it doesn't appear to be a recent regression, and not a
> high-frequency occurrence, it does look like a data loss bug that can occur
> through no fault of the admin, and I certainly think our users need to be
> warned of this in the release notes.

Indeed. I thought we had come to the conclusion that the language of
the release note was to be changed, not be completely removed.

ceg (ceg) on 2010-04-26
description: updated
ceg (ceg) wrote :

> If one
> port flakes on one boot, the other port flakes on the next, and both ports
> are available on the third, wouldn't that trigger this same bogus
> reassembly?

That depends.

On the linux-raid list it was said it would only happen if the event counts on both segments are equal +/-1. That would mostly only be the case if nothing but immediately shutting down is done upon booting of both segments (like in the testcase that triggered this). Different uptimes should be enough to cause a difference in the event count and prevent this bug from happening.

So there should already be some measure in place that prevents this, for common cases.
What makes this hard to detect and debug is Bug #535417 (mdadm monitoring has been broken in ubuntu).

The suggestion that mdadm should test for conflicts in the superblocks (marking each other as failed) should, however, be able to detect independently degraded segments of an array with certainty.
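The event-count heuristic mentioned above can be sketched as a small shell function (a deliberate simplification of md's actual logic; the function name and the threshold of 1 are illustrative only):

```shell
#!/bin/sh
# Illustrative sketch of md's event-count heuristic: two members are
# treated as containing the same data when their superblock event
# counts differ by at most 1.
decide_assemble() {
    old=$1
    new=$2
    diff=$((new - old))
    if [ "$diff" -lt 0 ]; then
        diff=$((-diff))
    fi
    if [ "$diff" -le 1 ]; then
        echo "assemble"   # counts match: members assumed identical (the risky case)
    else
        echo "reject"     # counts diverged: member left out of the array
    fi
}
```

This shows why equal uptimes are dangerous: two independently written members whose counts happen to match within 1 pass the check and get merged silently.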

Steve Langasek (vorlon) wrote :

Documented at <https://wiki.ubuntu.com/LucidLynx/ReleaseNotes#Use%20of%20degraded%20RAID%201%20array%20may%20cause%20data%20loss%20in%20exceptional%20cases>:

If each member of a RAID 1 array is separately brought up in degraded mode across subsequent runs of the array with no reassembly in between, there is a risk that the disks will be reported as in sync when they are not, resulting in data loss due to inconsistencies between the data that has been written to each member. This is an unlikely occurrence during normal operations, but admins of systems using RAID 1 arrays should take care during maintenance to avoid this situation. (557429)

Changed in ubuntu-release-notes:
status: New → Fix Released
Clint Byrum (clint-fewbar) wrote :

So, it's been a while since this issue resurfaced, but I feel it needs to be put to rest.

Are we really sure we should fix this?

http://marc.info/?l=linux-raid&m=127068416016382&w=2

"I don't think there is anything practical that could be changed in md or
mdadm to make it possible to catch this behaviour and refuse the assemble the
array... Maybe mdadm could check that the bitmap on the 'old' device is a
subset of the bitmap on the 'new' device - that might be enough.
But if the devices just happen to have the same event count then as far as md
is concerned, they do contain the same data." -- Neil Brown

I happen to agree with Neil that this isn't something mdadm or the md driver can reasonably be expected to handle. If nothing else, it is a feature request, and not a High issue. I've changed the ISO testing guide to advise booting with both disks between each boot with a disconnected disk. Other than that limited ISO testing scenario, when is this actually affecting users?

I do also like Billy Crook's random number addition idea here:

http://marc.info/?l=linux-raid&m=127073871318005&w=2

But that sounds a lot like a feature request.

So, what I'm suggesting is that this bug should actually be set as Wishlist, not High importance, because while it could lead to data corruption, so could putting my disks in the microwave. Booting a RAID1 with one disk, then immediately with the other, is just not something a normal user would do; it only came up because of the instructions in the test case itself.
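The bitmap-subset check Neil suggests in the quoted mail could be sketched as follows (treating each write-intent bitmap as an integer bitmask of dirty regions; the function name and representation are illustrative assumptions, not mdadm's actual format):

```shell
#!/bin/sh
# Sketch of the suggested bitmap-subset check: re-adding the 'old'
# member is only safe if every region it dirtied is also dirty on the
# 'new' member; otherwise both sides carry writes the other never saw.
bitmap_is_subset() {
    old=$1   # dirty-region bitmask of the older member
    new=$2   # dirty-region bitmask of the newer member
    [ $(( old & ~new )) -eq 0 ]
}

# Usage: regions 1 and 3 dirty on old, regions 1-4 dirty on new -> subset, safe.
if bitmap_is_subset 5 15; then
    echo "safe to re-add"
else
    echo "conflicting changes"
fi
```

The subset test is what distinguishes "this member simply missed some writes" (recoverable by resync) from "both members were written independently" (diverged, assembly should be refused).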

ceg (ceg) wrote :

> "I don't think there is anything practical that could be changed in md or
> mdadm to make it possible to catch this behaviour and refuse the assemble the
> array..."

The original topic of the linux-raid discussion http://comments.gmane.org/gmane.linux.raid/27822 suggested the idea of detecting diverged or segmented array parts by checking for superblocks claiming each other as "failed". (The naming convention of that state is actually a different topic: http://comments.gmane.org/gmane.linux.raid/27820.) But Neil did not respond yet.

It is only a feature request if one does not consider it a bug that a RAID system cannot tell for sure whether parts have been segmented/diverged (relying only on the probability of an event-count difference, and on much worse odds if a bitmap is used).

iMac (imac-netstatz) wrote :

I use RAID1 everywhere, and I have seen both loose SATA cables and BIOSes not set with enough delay for drives to spin up lead to degraded RAID1 scenarios, so I am worried about the overall impact of this bug. My current use case is not one of these, but it might be one used by anyone leveraging the flexibility of eSATA and RAID1 for replication across systems.

My current use case is that I have two laptops (one work, one personal) and I use RAID 1 to a disk attached by eSATA ports on each to keep a series of LVM volumes (home, virtual machines, etc.) synced between the devices. Typically my work laptop was the master, and whenever I plugged a newer external image into my personal laptop pre-boot, it would auto-rebuild on boot. My RAID1 was created with three devices (n=3), but I am not sure that actually affected the way it chooses to handle degraded disks, except that I suppose it is *always* degraded with only 2 of 3 disks ever active on one system.

I had to modify the original Intrepid udev when I first set this up, I believe to avoid some delay or prompt when starting degraded, and my changes are as follows:

#Original Intrepid (I believe) left commented in my custom 85-mdadm
#SUBSYSTEM=="block", ACTION=="add|change", ENV{ID_FS_TYPE}=="linux_raid*", \
# RUN+="watershed /sbin/mdadm --incremental --run /dev/%k"

# My current udev from current custom /etc/udev/rules.d/85-mdadm
SUBSYSTEM=="block", ACTION=="add|change", ENV{ID_FS_TYPE}=="linux_raid*", \
 RUN+="watershed /sbin/mdadm --assemble --scan"

Looking at the current /lib/udev rules, there appears to be little change that would have any effect
SUBSYSTEM=="block", ACTION=="add|change", ENV{ID_FS_TYPE}=="linux_raid*", \
 RUN+="/sbin/mdadm --incremental $env{DEVNAME}"

However, now on my home laptop, whenever I bring a new image from work, it starts up with both images active and I have a corrupted disk. Every time. Only since 10.10. So, I am now always logging in before attaching my eSATA disk, failing the local RAID1 disk and removing it, stopping the array, starting it degraded with the external one, and re-adding the internal one. It's not really something I can continue to do efficiently. I was considering upgrading my RAID superblock from v0.9 to v1.2, but from this bug report I am not sure that will help me. There is some sort of regression here from my perspective.
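The manual workaround described in the paragraph above could be written out roughly as follows (device names /dev/md0, /dev/sda1 for the internal member, and /dev/sdc1 for the eSATA member are placeholders):

```shell
# Rough sketch of the manual workaround described above;
# all device names are placeholders.

# Before attaching the eSATA disk: fail and remove the local member.
mdadm /dev/md0 --fail /dev/sda1
mdadm /dev/md0 --remove /dev/sda1

# Stop the array, then start it degraded from the external (newer) member.
mdadm --stop /dev/md0
mdadm --assemble --run /dev/md0 /dev/sdc1

# Re-add the internal member so it resyncs from the external one.
mdadm /dev/md0 --add /dev/sda1
```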

If I can change my udev rules to work again, great; skimming through the thread, it doesn't appear that I have an actual workaround. It just stopped working.

If I ever start my laptop up with an old eSATA image on my current RAID1 laptop image, I am screwed, and my home directory, which has survived many Debian and now Ubuntu distros and various hardware upgrades, might actually come to an end.

Clint Byrum (clint-fewbar) wrote :

Hi iMac. Thanks for sharing your use case.

I think this is a race condition that has only come to light recently because startup and volume management have kept the number of things happening consistent and small enough that the event count gets incremented equally on both systems, and so you get this corrupted, diverged-volumes scenario. It's just as likely that you'd accidentally torch the changes that you want by writing from the older disk to the newer one as it is that you'd merge the two and hit this silent data loss.

I don't think offline replication between two separate machines is really what RAID1 is for, even if it did work at one time. The focus is on replicating data onto two disks, on a single system.

Still, I think saving users from accidental data corruption is a useful feature, and it should be specified and added *as a new feature*. Since the current documentation and implementation do not define any behavior for this diverged RAID1 scenario, it needs to be specified clearly and precisely what the expected behavior would be, and then implemented as such.

ceg (ceg) wrote :

Sharing a hotplug RAID array to sync two machines is a very nice use case, iMac; thanks for sharing your experience. So far I have only intentionally segmented an array prior to performing updates.

A place where experience with this topic and your workarounds would fit in nicely is https://wiki.ubuntu.com/HotplugRaid

ceg (ceg) wrote :

> should be specified and added *as a new feature*. Since the current documentation and implementation do not define any behavior for this diverged RAID1 scenario

You could build upon this thread:
http://comments.gmane.org/gmane.linux.raid/27822
(but leave out the parts that were caused by naming confusion, which is better explained at http://comments.gmane.org/gmane.linux.raid/27820)

no longer affects: mdadm (Ubuntu Lucid)