md raid5 set inaccessible after some time.

Bug #613872 reported by Rudi Daemen
This bug affects 1 person
Affects: linux (Ubuntu)
Status: Invalid
Importance: Undecided
Assigned to: Unassigned

Bug Description

Binary package hint: linux-image-2.6.32-24-generic-pae

I upgraded from Ubuntu 8.04 LTS to 10.04 LTS last weekend, and the system had been running fine until today. For some reason the web server/MySQL server stopped responding. I checked the console and noticed it was being flooded with input/output errors; dmesg was filled with the same messages. By then, every service that touched "/var/" (which is mounted on /dev/md2, a RAID5 set based on 4 disks) had stopped.

After typing reboot the system came back up normally and mdadm started a resync of the RAID5 set. No data loss on the RAID set was detected. I immediately checked all the logs under /var/log and noticed they all stopped at the point the input/output errors started flooding the console. The errors that flooded the console did not end up in the dmesg logs, probably because the /dev/md2 RAID set was completely unavailable at the moment this occurred. The problem seems similar to the one I reported here: https://bugzilla.kernel.org/show_bug.cgi?id=11328#c48

This issue was NOT present under Ubuntu 8.04 with the 2.6.24 kernel; it was present under Debian Lenny with the 2.6.26 kernel, and unfortunately it seems to be present in the 2.6.32 kernel as well. The system had been running for about a year on 8.04 LTS without any issues.

System information:
- VIA EPIA LT15000AG (VIA C7 "Esther" CPU, CX700 chipset), 1 GB RAM
- SiI 3124 PCI-X SATA controller
- 2x VIA Rhine II onboard NICs (VT6102)
- 4x Samsung SpinPoint 160 GB SATA hard drives connected to the SiI 3124 controller.

These drives are configured as follows (see the sketch after the list for how to inspect this layout):
- /dev/md0 is a 4-disk RAID1 set containing the '/' filesystem.
- /dev/md1 is a 4-disk RAID5 set used as swap.
- /dev/md2 is a 4-disk RAID5 set used as the '/var/' filesystem.
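
For reference, this layout can be inspected from a shell; a minimal sketch (array names as listed above, assuming mdadm is installed):

cat /proc/mdstat                # kernel view of all active md arrays and their sync state
mdadm --detail /dev/md0         # the RAID1 '/' array: level, member disks, state
mdadm --detail /dev/md2         # the RAID5 '/var' array: level, member disks, state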

The system is running Ubuntu 10.04 LTS, configured for only LTS packages and completely up to date (except for the new kernel image that became available for download this morning).

I will attempt to take a photograph of the console if the error occurs again. The only thing I can get out of the dmesg log is the known bug that has been in the kernel since 2002 for the VIA Rhine Ethernet adapters. I've added a USB drive to the system and set up two sessions, one to tail the kernel log and one to tail dmesg. Both dump their output to a 512 MB flash drive, which should be enough to capture the events preceding the issue.
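
Roughly how the capture sessions are set up (a sketch; /mnt/usb is an example mount point for the flash drive, not necessarily the real one):

tail -f /var/log/kern.log > /mnt/usb/kern.capture.log &
while true; do dmesg > /mnt/usb/dmesg.capture.log; sync; sleep 10; done   # poll dmesg and sync so the data survives a crash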

It is OK to consider this bug report incomplete until I manage to capture some logs. Attached are the lspci -vvv and dmesg output from the last boot.

Revision history for this message
Rudi Daemen (fludizz) wrote :

lspci -vvv output from the last boot.

Revision history for this message
Rudi Daemen (fludizz) wrote :

Dmesg output from last boot.

Revision history for this message
Rudi Daemen (fludizz) wrote :

OK, the problem occurred again... However, for some reason the errors were not captured on the USB drive. I did happen to have a console open at the time and managed to get the output of dmesg. It was swamped with the following text:

[358579.557869] metapage_write_end_io: I/O error
[358579.559914] metapage_write_end_io: I/O error
[358579.562155] metapage_write_end_io: I/O error
[358614.526341] metapage_write_end_io: I/O error

Again, no other errors were captured because the data is lost, and the system is now no longer able to boot; it just hangs after the filesystem check.

I'll start rebuilding the system *again* using the old Ubuntu 8.04 LTS version... the last known good version.

Revision history for this message
Rudi Daemen (fludizz) wrote :

Managed to get the system up and running again. I had to reassemble all mdadm RAID arrays using a live CD. After reassembling all arrays I rebooted back into the operating system, and it took three reboots to get it working again.
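
For anyone following along, the live-CD reassembly was along these lines (a minimal sketch; --scan reads the array definitions from the on-disk superblocks):

mdadm --assemble --scan    # assemble every array whose superblocks can be found
cat /proc/mdstat           # verify md0/md1/md2 are back and watch the resync progress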

On the first reboot, it reported noticing a size change from 0 to <humanly unreadable high number> for /dev/md0, ran a fsck and then hung. I rebooted the system using Ctrl+Alt+Del. Luckily this made the system reboot cleanly and unmount the "newly discovered" /dev/md0. On the next reboot, the same steps were repeated for the swap volume (/dev/md1), and the last reboot was for the /var/ volume (/dev/md2). The system is now back up and running and will probably go down again in a few days...

It seems mdadm lost its config in the original system and it was trying to boot the system using /dev/sda1 (as it did find the "/" filesystem there) instead of /dev/md0. After reconstruction using the live CD, the boot log also showed the RAID arrays and their printouts again. One other thing I did as a precaution was to trash the contents of the "/var/" folder on the /dev/md0 array; it had somehow created lock and run files in there (probably when it lost the RAID5 set containing /var/).
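
If mdadm really did lose its config, regenerating it (and the initramfs, so the arrays assemble at boot) would look roughly like this on Ubuntu; a sketch, run from the installed system or a chroot:

mdadm --detail --scan >> /etc/mdadm/mdadm.conf   # append ARRAY lines for the currently assembled arrays
update-initramfs -u                              # rebuild the initramfs with the updated config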

I think the bug is not necessarily in the kernel but could be in mdadm, since I had to manually reassemble the arrays using a live CD.

Revision history for this message
Rudi Daemen (fludizz) wrote :

It took less than 24 hours this time. However, I did manage to capture the dmesg output this time. See the attachment for the full log; here's the snippet showing how it goes wrong:

mdadm detects a drive timeout (these soft hangs of the hard drives have been present since day one and never caused any issues). It fails the drive and tries to keep the RAID array running on 3 disks. But this reset/hang of the disks seems to happen controller-wide, so it fails the next drive, and the next, until it has no drives left and fails the array with an uncorrectable error message. Then the I/O write errors occur and the system fails.
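
One knob that is sometimes suggested for this kind of cascade is the block layer's per-device command timeout; a sketch of checking and raising it (sdb is a placeholder for one of the four Samsung drives on the SiI 3124):

cat /sys/block/sdb/device/timeout          # current command timeout in seconds (default 30)
echo 120 > /sys/block/sdb/device/timeout   # give the drive/controller more time before the kernel gives up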

Is mdadm too strict in drive timing? Is the SATA controller driver too strict? Where does this go wrong?

Any ideas?

tags: added: kj-triage
Brad Figg (brad-figg)
tags: added: acpi-table-checksum
Revision history for this message
Rudi Daemen (fludizz) wrote :

Sorry for the bump, but here is maybe something to keep in mind with this issue. Some reading up on the internet has put me on a different track about the loss of the RAID arrays in my old setup:

It seems that starting with kernels newer than 2.6.24 there might be a change in the timers mdadm uses to declare a disk dead or alive. This is possibly related to ERC/CCTL/TLER settings: the time a drive is allowed to take to recover from a read error (e.g. a bad sector). Normal consumer disks have these ERC/CCTL/TLER settings disabled by default (or don't have them at all), causing them to take up to a minute to respond to the OS again.

With ERC/CCTL/TLER enabled, a disk will respond to the OS/RAID driver again after a specified amount of time (e.g. 7 seconds). The RAID driver then sees the drive is still alive and corrects the data using the data on other drives.
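
Whether a drive supports ERC/CCTL/TLER, and what it is set to, can be checked with smartctl from smartmontools; a sketch (/dev/sdb is a placeholder, and the timer values are in tenths of a second, so 70 means 7.0 seconds):

smartctl -l scterc /dev/sdb          # show the current SCT Error Recovery Control read/write timers
smartctl -l scterc,70,70 /dev/sdb    # limit read and write error recovery to 7 seconds (if the drive supports it)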

Could it be that mdadm's disk timers have been changed causing newer kernels to drop groups of disks from the array?

Brad Figg (brad-figg)
Changed in linux (Ubuntu):
status: New → Confirmed
Revision history for this message
penalvch (penalvch) wrote :

Rudi Daemen, this bug was reported a while ago and there hasn't been any activity in it recently. We were wondering if this is still an issue. Can you try with the latest development release of Ubuntu? ISO CD images are available from http://cdimage.ubuntu.com/releases/.

If it remains an issue, could you run the following command in the development release from a Terminal (Applications->Accessories->Terminal)? It will automatically gather and attach updated debug information to this report.

apport-collect -p linux <replace-with-bug-number>

Also, if you could test the latest upstream kernel available, that would be great. It will allow additional upstream developers to examine the issue. Refer to https://wiki.ubuntu.com/KernelMainlineBuilds. Please do not test the kernel in the daily folder, but the one all the way at the bottom. Once you've tested the upstream kernel, please remove the 'needs-upstream-testing' tag. This can be done by clicking on the yellow pencil icon next to the tag located at the bottom of the bug description and deleting the 'needs-upstream-testing' text. As well, please comment on which kernel version specifically you tested.

If this bug is fixed in the mainline kernel, please add the following tag 'kernel-fixed-upstream'.

If the mainline kernel does not fix this bug, please add the tag: 'kernel-bug-exists-upstream'.

If you are unable to test the mainline kernel, for example it will not boot, please add the tag: 'kernel-unable-to-test-upstream', and comment as to why specifically you were unable to test it.

Please let us know your results. Thanks in advance.

tags: added: lucid needs-upstream-testing regression-release
Changed in linux (Ubuntu):
status: Confirmed → Incomplete
Revision history for this message
Rudi Daemen (fludizz) wrote :

Sorry, I migrated away from md-raid in favor of hardware RAID around the time of this bug and have not rebuilt any system to use md-raid since. With hardware RAID and the same disks the problem does not occur.

Revision history for this message
penalvch (penalvch) wrote :

Rudi Daemen, this bug report is being closed due to your last comment regarding you no longer using the same setup. For future reference you can manage the status of your own bugs by clicking on the current status in the yellow line and then choosing a new status in the revealed drop down box. You can learn more about bug statuses at https://wiki.ubuntu.com/Bugs/Status. Thank you again for taking the time to report this bug and helping to make Ubuntu better. Please submit any future bugs you may find.

Changed in linux (Ubuntu):
status: Incomplete → Invalid