Possible RAID-6 corruption

Bug #1364091 reported by RoyK
This bug affects 2 people
Affects Status Importance Assigned to Milestone
linux (Ubuntu)
Triaged
High
Seth Forshee
mdadm (Ubuntu)
Invalid
Undecided
Unassigned

Bug Description

It seems there's a bug in newer kernels that may lead to corruption on RAID-6. There's a fix, too:

http://lwn.net/Articles/608896/

Tags: patch
Revision history for this message
swmike (ubuntu-s-plass) wrote :

This problem affects all stable kernels from 2.6.32 onwards, and this patch needs to be applied to ALL kernels since then to avoid data corruption on doubly degraded RAID-6 volumes that are written to.

This is not an mdadm problem, it's a kernel problem.
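As background on why a doubly degraded array is the dangerous case (an illustrative toy model only, not the kernel's raid5.c implementation): RAID-6 keeps two syndromes per stripe, P (plain XOR) and Q (a Reed-Solomon syndrome over GF(2^8)), and together they allow any two missing data blocks to be recomputed.

```python
# Toy RAID-6 P/Q parity model over GF(2^8), reduction polynomial 0x11d.
# Illustrative only; block indices and values are made up for the demo.

def gf_mul(a, b):
    """Multiply two bytes in GF(2^8)."""
    p = 0
    for _ in range(8):
        if b & 1:
            p ^= a
        hi = a & 0x80
        a = (a << 1) & 0xFF
        if hi:
            a ^= 0x1D  # reduce modulo x^8 + x^4 + x^3 + x^2 + 1
        b >>= 1
    return p

def gf_pow(a, n):
    r = 1
    for _ in range(n):
        r = gf_mul(r, a)
    return r

def gf_inv(a):
    # The multiplicative group has order 255, so a^254 = a^-1.
    return gf_pow(a, 254)

def pq(data):
    """Compute the P (XOR) and Q (Reed-Solomon, generator g=2) syndromes."""
    P, Q = 0, 0
    for i, d in enumerate(data):
        P ^= d
        Q ^= gf_mul(gf_pow(2, i), d)
    return P, Q

def recover_two_data(data, x, y, P, Q):
    """Recover the data blocks at indices x and y from P, Q and the survivors."""
    Pxy, Qxy = P, Q
    for i, d in enumerate(data):
        if i in (x, y):
            continue  # these blocks are "failed"; use only the survivors
        Pxy ^= d
        Qxy ^= gf_mul(gf_pow(2, i), d)
    gx, gy = gf_pow(2, x), gf_pow(2, y)
    # Solve the 2x2 system: Dx ^ Dy = Pxy ; gx*Dx ^ gy*Dy = Qxy
    Dx = gf_mul(gf_inv(gx ^ gy), gf_mul(gy, Pxy) ^ Qxy)
    Dy = Pxy ^ Dx
    return Dx, Dy

data = [0x12, 0x34, 0x56, 0x78]
P, Q = pq(data)
print(recover_two_data(data, 1, 3, P, Q))  # recovers the two "failed" blocks
```

With two members gone, P and Q are the only sources of the lost data, so a write path that takes the wrong branch on a doubly degraded stripe (the `s.failed > 1` case the patch below guards against) can silently destroy the information needed for recovery.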

Revision history for this message
Launchpad Janitor (janitor) wrote :

Status changed to 'Confirmed' because the bug affects multiple users.

Changed in linux-kernel (Ubuntu):
status: New → Confirmed
Changed in mdadm (Ubuntu):
status: New → Confirmed
Revision history for this message
swmike (ubuntu-s-plass) wrote :

diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index 6b2d615d1094..183588b11fc1 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -3817,6 +3817,8 @@ static void handle_stripe(struct stripe_head *sh)
 		set_bit(R5_Wantwrite, &dev->flags);
 		if (prexor)
 			continue;
+		if (s.failed > 1)
+			continue;
 		if (!test_bit(R5_Insync, &dev->flags) ||
 		    ((i == sh->pd_idx || i == sh->qd_idx) &&
 		      s.failed == 0))

Revision history for this message
swmike (ubuntu-s-plass) wrote :
Revision history for this message
Ubuntu Foundations Team Bug Bot (crichton) wrote :

The attachment "Patch copy/pasted from Linux-raid mailing list" seems to be a patch. If it isn't, please remove the "patch" flag from the attachment, remove the "patch" tag, and if you are a member of the ~ubuntu-reviewers, unsubscribe the team.

[This is an automated message performed by a Launchpad user owned by ~brian-murray, for any issues please contact him.]

tags: added: patch
Revision history for this message
RoyK (roysk) wrote :

This fix is already in newer kernel versions - a two-line fix, anyone?

affects: linux-kernel (Ubuntu) → linux (Ubuntu)
Changed in linux (Ubuntu):
importance: Undecided → High
tags: added: kernel-key
Changed in linux (Ubuntu):
status: Confirmed → Triaged
Revision history for this message
Seth Forshee (sforshee) wrote :

The fix is 9c4bdf697c39805078392d5ddbbba5ae5680e0dd in Linus's tree, which is already out for review for the next 3.13.x.y release.

Revision history for this message
swmike (ubuntu-s-plass) wrote :

Great, but what about the kernels for older LTS releases still under support? Neil stated this affects all kernels back to and including 2.6.32. So I imagine 12.04 LTS and 10.04 LTS are also affected.

Revision history for this message
Seth Forshee (sforshee) wrote : Re: [Bug 1364091] Re: Possible RAID-6 corruption

On Wed, Oct 01, 2014 at 02:05:14PM -0000, swmike wrote:
> Great, but what about the kernels for older LTS releases still under
> support? Neil stated this affects all kernels back to and including
> 2.6.32. So I imagine 12.04 LTS and 10.04 LTS are also affected.

All supported stable kernels should be picking up the patch, which will
cover all supported Ubuntu kernels. But I'm planning to go ahead and get
the patch now to get it in faster.

Changed in linux (Ubuntu):
assignee: nobody → Seth Forshee (sforshee)
Revision history for this message
Seth Forshee (sforshee) wrote :

RoyK: Okay, so it turns out we already have this patch in the precise and utopic trees from upstream stable, and it's on its way to trusty via upstream stable.

That only leaves lucid, which should also receive this fix via upstream stable in the future. 2.6.32 requires a backport which I'm unequipped to test. Unless you're able to test the lucid backport for me I'm inclined to just wait for this to filter down from upstream stable, given that lucid has been around for 4+ years without anyone reporting problems (to my knowledge at least).

Changed in mdadm (Ubuntu):
status: Confirmed → Invalid
Revision history for this message
RoyK (roysk) wrote :

Even if no bugs have been filed for lucid, I guess this should be fixed there as well. A double disk failure in RAID-6 isn't very common, and corruption may not be easily detected. AFAICS the issue is also in the lucid kernel, but then, it's just another 6 months before lucid is EOL :P

Revision history for this message
Seth Forshee (sforshee) wrote :

On Thu, Oct 02, 2014 at 11:10:39AM -0000, RoyK wrote:
> Even if no bugs has been filed for lucid, I guess this should be fixed
> there as well. A double disk failure in RAID-6 isn't very common, and
> corruptions may not be easily detected. AFAICS the issue is also in the
> lucid kernel, but then, it's just another 6 months before lucid is EOL
> :P

I didn't say it wouldn't be fixed. I just don't plan to rush out a fix
that I can't get tested. The commit is marked for stable, so it should
be included in the next 2.6.32 stable release.

Revision history for this message
RoyK (roysk) wrote :

Seems to me this is rather a small fix, and also urgent.

When will the next 2.6.32 release appear?

Revision history for this message
Seth Forshee (sforshee) wrote :

On Thu, Oct 02, 2014 at 06:58:14PM -0000, RoyK wrote:
> seems to me this is rather a small fix and also urgent.
>
> when will the next 2.6.32 release appear?

I can't say; that's not under my control.

The reason I'm reluctant is that the code being patched is different in
2.6.32, split out into separate functions for raid5 and raid6. While the
backport looks straightforward I'd really prefer that someone was able
to test it, and I don't have any machines at my disposal with enough
disks to set up raid5/6.

Revision history for this message
RoyK (roysk) wrote :

I looked at the code in 2.6.32 and I can't find the related bits in handle_stripe6(). The prexor int isn't defined, and the checks after set_bit(R5_Wantwrite, &dev->flags); aren't issued. I don't know enough about this code to see what's going on. I'm also not sure what really triggers this bug. I tried setting up a raid6, failed two drives, wrote a bunch of deterministic data to the array, and started rebuilding it - no errors could be detected, in either metadata or data.

I have a VM set up with lucid for this purpose, and I can give access to developers wanting to test.

roy
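A loop-device recipe along the lines of what RoyK describes might look like the following (a rough sketch only: device names, sizes and paths are placeholders, it requires root, and it has not been run against the lucid kernel here):

```shell
# Create four 100 MB backing files and attach them as loop devices (requires root).
for i in 0 1 2 3; do
    dd if=/dev/zero of=/tmp/raid$i.img bs=1M count=100
    losetup /dev/loop$i /tmp/raid$i.img
done

# Assemble a 4-device RAID-6 array.
mdadm --create /dev/md0 --level=6 --raid-devices=4 /dev/loop{0,1,2,3}

# Fail and remove two members so the array is doubly degraded.
mdadm /dev/md0 --fail /dev/loop2 --remove /dev/loop2
mdadm /dev/md0 --fail /dev/loop3 --remove /dev/loop3

# Save a known pattern, write it while degraded, then re-add and let md rebuild.
dd if=/dev/urandom of=/tmp/pattern bs=1M count=50
dd if=/tmp/pattern of=/dev/md0 bs=1M
mdadm /dev/md0 --add /dev/loop2
mdadm /dev/md0 --add /dev/loop3

# After the resync completes (watch /proc/mdstat), compare what comes back.
cmp /tmp/pattern <(dd if=/dev/md0 bs=1M count=50)
```

As RoyK's attempt suggests, a clean run of such a script does not prove the kernel is unaffected; the corruption depends on which write path handle_stripe takes for the degraded stripes.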

Revision history for this message
Seth Forshee (sforshee) wrote :

Here's my backport for lucid. I don't have any advice for reproducing the problem; I'd have to look at the code in more detail and try to figure out what triggers that condition.

Revision history for this message
RoyK (roysk) wrote :

According to the article at http://lwn.net/Articles/608896/, this bug shouldn't need fixing in handle_stripe5(), only in handle_stripe6(), but then again, I don't know the code.

Revision history for this message
Seth Forshee (sforshee) wrote :

On Sun, Oct 05, 2014 at 03:00:18PM -0000, RoyK wrote:
> According to the article at http://lwn.net/Articles/608896/, this bug
> shouldn't need fixing in handle_stripe5(), only in handle_stripe6(), but
> then again, I don't know the code.

Hmm. I interpreted the comment to mean that backports to older kernels
should _also_ change handle_stripe6, but reading it again I suspect your
interpretation is probably the right one. But then I'm not sure, because
I don't really know the code either.

tags: removed: kernel-key