Booting live CD breaks Intel Matrix RAID

Bug #383001 reported by takbal
This bug affects 7 people
Affects           Status      Importance   Assigned to   Milestone
linux (Ubuntu)    Confirmed   Medium       Unassigned
linux (openSUSE)  Confirmed   Undecided    Unassigned

Bug Description

Ubuntu 9.04 64-bit live CD, kernel 2.6.28.11.15
Hardware: Intel i720, GA-EX58-UD5 motherboard (ICH10R), 6GB RAM
2x500GB HDDs in an Intel Matrix RAID dual configuration: 250GB as a RAID1 mirror, the rest as a RAID0 stripe. Windows XP64 is installed on a 150GB partition of the RAID1 volume; the goal is to install Ubuntu on the remaining 100GB. There are two more disks, but they are not in the RAID.

The setup works as expected in Windows. The BIOS ROM shows the drives as RAID(0,1) members.

Problem: booting the live CD breaks the RAID arrays permanently (even when nothing is installed). After a reboot, the BIOS RAID utility shows both drives as "Offline member". It can only be fixed by deleting the RAID metadata on one of the drives with the BIOS utility, re-adding that drive, after which the Matrix RAID Manager in Windows can mirror the RAID1 volume back and the RAID0 can be recovered with the "Recover Volume" option (all this takes about 1 hour on my config - please keep that in mind when asking for tests).

As the break happens sometime during booting, I can only report what the disks look like *afterwards*, by launching a terminal and installing/running dmraid. dmraid cannot pair the drives because they have different name strings (they should probably be the same). My guess is that either the hardware checks or fuse (?) tries to access the drives without knowing the fake RAID is there and spoils the metadata content. What is a bit strange, however, is that the first drive is listed as having 3 disks in the array...

root@ubuntu:~# dmraid -s -s -vvvv -dddd
WARN: locking /var/lock/dmraid/.lock
NOTICE: /dev/sdd: asr discovering
NOTICE: /dev/sdd: ddf1 discovering
NOTICE: /dev/sdd: hpt37x discovering
NOTICE: /dev/sdd: hpt45x discovering
NOTICE: /dev/sdd: isw discovering
NOTICE: /dev/sdd: jmicron discovering
NOTICE: /dev/sdd: lsi discovering
NOTICE: /dev/sdd: nvidia discovering
NOTICE: /dev/sdd: pdc discovering
NOTICE: /dev/sdd: sil discovering
NOTICE: /dev/sdd: via discovering
NOTICE: /dev/sdc: asr discovering
NOTICE: /dev/sdc: ddf1 discovering
NOTICE: /dev/sdc: hpt37x discovering
NOTICE: /dev/sdc: hpt45x discovering
NOTICE: /dev/sdc: isw discovering
NOTICE: /dev/sdc: jmicron discovering
NOTICE: /dev/sdc: lsi discovering
NOTICE: /dev/sdc: nvidia discovering
NOTICE: /dev/sdc: pdc discovering
NOTICE: /dev/sdc: sil discovering
NOTICE: /dev/sdc: via discovering
NOTICE: /dev/sdb: asr discovering
NOTICE: /dev/sdb: ddf1 discovering
NOTICE: /dev/sdb: hpt37x discovering
NOTICE: /dev/sdb: hpt45x discovering
NOTICE: /dev/sdb: isw discovering
NOTICE: /dev/sdb: isw metadata discovered
NOTICE: /dev/sdb: jmicron discovering
NOTICE: /dev/sdb: lsi discovering
NOTICE: /dev/sdb: nvidia discovering
NOTICE: /dev/sdb: pdc discovering
NOTICE: /dev/sdb: sil discovering
NOTICE: /dev/sdb: via discovering
NOTICE: /dev/sda: asr discovering
NOTICE: /dev/sda: ddf1 discovering
NOTICE: /dev/sda: hpt37x discovering
NOTICE: /dev/sda: hpt45x discovering
NOTICE: /dev/sda: isw discovering
NOTICE: /dev/sda: isw metadata discovered
NOTICE: /dev/sda: jmicron discovering
NOTICE: /dev/sda: lsi discovering
NOTICE: /dev/sda: nvidia discovering
NOTICE: /dev/sda: pdc discovering
NOTICE: /dev/sda: sil discovering
NOTICE: /dev/sda: via discovering
DEBUG: _find_set: searching isw_bghhhefdec
DEBUG: _find_set: not found isw_bghhhefdec
DEBUG: _find_set: searching isw_bghhhefdec_RAID1
DEBUG: _find_set: searching isw_bghhhefdec_RAID1
DEBUG: _find_set: not found isw_bghhhefdec_RAID1
DEBUG: _find_set: not found isw_bghhhefdec_RAID1
DEBUG: _find_set: searching isw_bghhhefdec_RAID0
DEBUG: _find_set: searching isw_bghhhefdec_RAID0
DEBUG: _find_set: searching isw_bghhhefdec_RAID0
DEBUG: _find_set: not found isw_bghhhefdec_RAID0
DEBUG: _find_set: not found isw_bghhhefdec_RAID0
DEBUG: _find_set: not found isw_bghhhefdec_RAID0
NOTICE: added /dev/sdb to RAID set "isw_bghhhefdec"
DEBUG: _find_set: searching isw_chdbicac
DEBUG: _find_set: not found isw_chdbicac
DEBUG: _find_set: searching isw_chdbicac_RAID1
DEBUG: _find_set: searching isw_chdbicac_RAID1
DEBUG: _find_set: searching isw_chdbicac_RAID1
DEBUG: _find_set: not found isw_chdbicac_RAID1
DEBUG: _find_set: searching isw_chdbicac_RAID1
DEBUG: _find_set: not found isw_chdbicac_RAID1
DEBUG: _find_set: not found isw_chdbicac_RAID1
DEBUG: _find_set: searching isw_chdbicac_RAID1
DEBUG: _find_set: not found isw_chdbicac_RAID1
DEBUG: _find_set: not found isw_chdbicac_RAID1
DEBUG: _find_set: searching isw_chdbicac_RAID0
DEBUG: _find_set: searching isw_chdbicac_RAID0
DEBUG: _find_set: searching isw_chdbicac_RAID0
DEBUG: _find_set: not found isw_chdbicac_RAID0
DEBUG: _find_set: searching isw_chdbicac_RAID0
DEBUG: _find_set: not found isw_chdbicac_RAID0
DEBUG: _find_set: not found isw_chdbicac_RAID0
DEBUG: _find_set: searching isw_chdbicac_RAID0
DEBUG: _find_set: searching isw_chdbicac_RAID0
DEBUG: _find_set: not found isw_chdbicac_RAID0
DEBUG: _find_set: not found isw_chdbicac_RAID0
DEBUG: _find_set: not found isw_chdbicac_RAID0
NOTICE: added /dev/sda to RAID set "isw_chdbicac"
DEBUG: checking isw device "/dev/sdb"
ERROR: isw device for volume "RAID0" broken on /dev/sdb in RAID set "isw_bghhhefdec_RAID0"
ERROR: isw: wrong # of devices in RAID set "isw_bghhhefdec_RAID0" [1/2] on /dev/sdb
DEBUG: set status of set "isw_bghhhefdec_RAID0" to 2
DEBUG: checking isw device "/dev/sdb"
ERROR: isw device for volume "RAID1" broken on /dev/sdb in RAID set "isw_bghhhefdec_RAID1"
ERROR: isw: wrong # of devices in RAID set "isw_bghhhefdec_RAID1" [1/2] on /dev/sdb
DEBUG: set status of set "isw_bghhhefdec_RAID1" to 2
DEBUG: checking isw device "/dev/sda"
ERROR: isw device for volume "RAID0" broken on /dev/sda in RAID set "isw_chdbicac_RAID0"
ERROR: isw: wrong # of devices in RAID set "isw_chdbicac_RAID0" [1/2] on /dev/sda
DEBUG: set status of set "isw_chdbicac_RAID0" to 2
DEBUG: checking isw device "/dev/sda"
ERROR: isw device for volume "RAID1" broken on /dev/sda in RAID set "isw_chdbicac_RAID1"
ERROR: isw: wrong # of devices in RAID set "isw_chdbicac_RAID1" [1/2] on /dev/sda
DEBUG: set status of set "isw_chdbicac_RAID1" to 2
*** Group superset isw_bghhhefdec
--> Subset
name : isw_bghhhefdec_RAID0
size : 452474112
stride : 256
type : stripe
status : broken
subsets: 0
devs : 1
spares : 0
--> Subset
name : isw_bghhhefdec_RAID1
size : 524288256
stride : 128
type : mirror
status : broken
subsets: 0
devs : 1
spares : 0
*** Group superset isw_chdbicac
--> Subset
name : isw_chdbicac_RAID0
size : 452474112
stride : 256
type : stripe
status : broken
subsets: 0
devs : 1
spares : 0
--> Subset
name : isw_chdbicac_RAID1
size : 524288256
stride : 128
type : mirror
status : broken
subsets: 0
devs : 1
spares : 0
WARN: unlocking /var/lock/dmraid/.lock
DEBUG: freeing devices of RAID set "isw_bghhhefdec_RAID0"
DEBUG: freeing device "isw_bghhhefdec_RAID0", path "/dev/sdb"
DEBUG: freeing devices of RAID set "isw_bghhhefdec_RAID1"
DEBUG: freeing device "isw_bghhhefdec_RAID1", path "/dev/sdb"
DEBUG: freeing devices of RAID set "isw_bghhhefdec"
DEBUG: freeing device "isw_bghhhefdec", path "/dev/sdb"
DEBUG: freeing devices of RAID set "isw_chdbicac_RAID0"
DEBUG: freeing device "isw_chdbicac_RAID0", path "/dev/sda"
DEBUG: freeing devices of RAID set "isw_chdbicac_RAID1"
DEBUG: freeing device "isw_chdbicac_RAID1", path "/dev/sda"
DEBUG: freeing devices of RAID set "isw_chdbicac"
DEBUG: freeing device "isw_chdbicac", path "/dev/sda"

/////////////////////////////////////////////////////////////////////////////////////////////////

root@ubuntu:~# dmraid -n
/dev/sdb (isw):
0x000 sig: " Intel Raid ISM Cfg Sig. 1.2.00"
0x020 check_sum: 4201763611
0x024 mpb_size: 648
0x028 family_num: 1677745342
0x02c generation_num: 180315
0x030 error_log_size: 4080
0x034 attributes: 2147483648
0x038 num_disks: 2
0x039 num_raid_devs: 2
0x03a error_log_pos: 2
0x03c cache_size: 0
0x040 orig_family_num: 3440023639
0x0d8 disk[0].serial: " WD-WMASZ0068106"
0x0e8 disk[0].totalBlocks: 976771055
0x0ec disk[0].scsiId: 0x0
0x0f0 disk[0].status: 0x13a
0x0f4 disk[0].owner_cfg_num: 0x0
0x108 disk[1].serial: " WD-WMAT00044411"
0x118 disk[1].totalBlocks: 976773168
0x11c disk[1].scsiId: 0x10000
0x120 disk[1].status: 0x13a
0x124 disk[1].owner_cfg_num: 0x0
0x138 isw_dev[0].volume: " RAID1"
0x14c isw_dev[0].SizeHigh: 0
0x148 isw_dev[0].SizeLow: 524288000
0x150 isw_dev[0].status: 0xc
0x154 isw_dev[0].reserved_blocks: 0
0x158 isw_dev[0].migr_priority: 0
0x159 isw_dev[0].num_sub_vol: 0
0x15a isw_dev[0].tid: 15
0x15b isw_dev[0].cng_master_disk: 0
0x15c isw_dev[0].cache_policy: 0
0x15e isw_dev[0].cng_state: 0
0x15f isw_dev[0].cng_sub_state: 0
0x188 isw_dev[0].vol.curr_migr_unit: 1024000
0x18c isw_dev[0].vol.check_point_id: 0
0x190 isw_dev[0].vol.migr_state: 0
0x191 isw_dev[0].vol.migr_type: 1
0x192 isw_dev[0].vol.dirty: 0
0x193 isw_dev[0].vol.fs_state: 255
0x194 isw_dev[0].vol.verify_errors: 1
0x196 isw_dev[0].vol.verify_bad_blocks: 0
0x1a8 isw_dev[0].vol.map[0].pba_of_lba0: 0
0x1ac isw_dev[0].vol.map[0].blocks_per_member: 524288264
0x1b0 isw_dev[0].vol.map[0].num_data_stripes: 2048000
0x1b4 isw_dev[0].vol.map[0].blocks_per_strip: 128
0x1b6 isw_dev[0].vol.map[0].map_state: 0
0x1b7 isw_dev[0].vol.map[0].raid_level: 1
0x1b8 isw_dev[0].vol.map[0].num_members: 2
0x1b9 isw_dev[0].vol.map[0].num_domains: 2
0x1ba isw_dev[0].vol.map[0].failed_disk_num: 255
0x1bb isw_dev[0].vol.map[0].ddf: 1
0x1d8 isw_dev[0].vol.map[0].disk_ord_tbl[0]: 0x0
0x1dc isw_dev[0].vol.map[0].disk_ord_tbl[1]: 0x1
0x1e0 isw_dev[1].volume: " RAID0"
0x1f4 isw_dev[1].SizeHigh: 0
0x1f0 isw_dev[1].SizeLow: 904947712
0x1f8 isw_dev[1].status: 0xc
0x1fc isw_dev[1].reserved_blocks: 0
0x200 isw_dev[1].migr_priority: 0
0x201 isw_dev[1].num_sub_vol: 0
0x202 isw_dev[1].tid: 1
0x203 isw_dev[1].cng_master_disk: 0
0x204 isw_dev[1].cache_policy: 0
0x206 isw_dev[1].cng_state: 0
0x207 isw_dev[1].cng_sub_state: 0
0x230 isw_dev[1].vol.curr_migr_unit: 0
0x234 isw_dev[1].vol.check_point_id: 0
0x238 isw_dev[1].vol.migr_state: 0
0x239 isw_dev[1].vol.migr_type: 4
0x23a isw_dev[1].vol.dirty: 0
0x23b isw_dev[1].vol.fs_state: 255
0x23c isw_dev[1].vol.verify_errors: 0
0x23e isw_dev[1].vol.verify_bad_blocks: 0
0x250 isw_dev[1].vol.map[0].pba_of_lba0: 524292360
0x254 isw_dev[1].vol.map[0].blocks_per_member: 452474120
0x258 isw_dev[1].vol.map[0].num_data_stripes: 1767476
0x25c isw_dev[1].vol.map[0].blocks_per_strip: 256
0x25e isw_dev[1].vol.map[0].map_state: 0
0x25f isw_dev[1].vol.map[0].raid_level: 0
0x260 isw_dev[1].vol.map[0].num_members: 2
0x261 isw_dev[1].vol.map[0].num_domains: 1
0x262 isw_dev[1].vol.map[0].failed_disk_num: 0
0x263 isw_dev[1].vol.map[0].ddf: 1
0x280 isw_dev[1].vol.map[0].disk_ord_tbl[0]: 0x1000000
0x284 isw_dev[1].vol.map[0].disk_ord_tbl[1]: 0x1

/dev/sda (isw):
0x000 sig: " Intel Raid ISM Cfg Sig. 1.2.00"
0x020 check_sum: 3599977089
0x024 mpb_size: 752
0x028 family_num: 27318202
0x02c generation_num: 158900
0x030 error_log_size: 4080
0x034 attributes: 2147483648
0x038 num_disks: 3
0x039 num_raid_devs: 2
0x03a error_log_pos: 2
0x03c cache_size: 0
0x040 orig_family_num: 3440023639
0x0d8 disk[0].serial: " WD-WMASZ0068106"
0x0e8 disk[0].totalBlocks: 976773168
0x0ec disk[0].scsiId: 0x0
0x0f0 disk[0].status: 0x13a
0x0f4 disk[0].owner_cfg_num: 0x0
0x108 disk[1].serial: " WD-WMAT00044411"
0x118 disk[1].totalBlocks: 976773168
0x11c disk[1].scsiId: 0x10000
0x120 disk[1].status: 0x13a
0x124 disk[1].owner_cfg_num: 0x0
0x138 disk[2].serial: "D-WMAT00044411:1"
0x148 disk[2].totalBlocks: 976773120
0x14c disk[2].scsiId: 0xffffffff
0x150 disk[2].status: 0x6
0x154 disk[2].owner_cfg_num: 0x0
0x168 isw_dev[0].volume: " RAID1"
0x17c isw_dev[0].SizeHigh: 0
0x178 isw_dev[0].SizeLow: 524288000
0x180 isw_dev[0].status: 0xc
0x184 isw_dev[0].reserved_blocks: 0
0x188 isw_dev[0].migr_priority: 0
0x189 isw_dev[0].num_sub_vol: 0
0x18a isw_dev[0].tid: 1
0x18b isw_dev[0].cng_master_disk: 0
0x18c isw_dev[0].cache_policy: 0
0x18e isw_dev[0].cng_state: 0
0x18f isw_dev[0].cng_sub_state: 0
0x1b8 isw_dev[0].vol.curr_migr_unit: 548336
0x1bc isw_dev[0].vol.check_point_id: 0
0x1c0 isw_dev[0].vol.migr_state: 1
0x1c1 isw_dev[0].vol.migr_type: 1
0x1c2 isw_dev[0].vol.dirty: 0
0x1c3 isw_dev[0].vol.fs_state: 255
0x1c4 isw_dev[0].vol.verify_errors: 0
0x1c6 isw_dev[0].vol.verify_bad_blocks: 0
0x1d8 isw_dev[0].vol.map[0].pba_of_lba0: 0
0x1dc isw_dev[0].vol.map[0].blocks_per_member: 524288264
0x1e0 isw_dev[0].vol.map[0].num_data_stripes: 2048000
0x1e4 isw_dev[0].vol.map[0].blocks_per_strip: 128
0x1e6 isw_dev[0].vol.map[0].map_state: 0
0x1e7 isw_dev[0].vol.map[0].raid_level: 1
0x1e8 isw_dev[0].vol.map[0].num_members: 2
0x1e9 isw_dev[0].vol.map[0].num_domains: 2
0x1ea isw_dev[0].vol.map[0].failed_disk_num: 1
0x1eb isw_dev[0].vol.map[0].ddf: 1
0x208 isw_dev[0].vol.map[0].disk_ord_tbl[0]: 0x0
0x20c isw_dev[0].vol.map[0].disk_ord_tbl[1]: 0x1
0x210 isw_dev[0].vol.map[1].pba_of_lba0: 0
0x214 isw_dev[0].vol.map[1].blocks_per_member: 524288264
0x218 isw_dev[0].vol.map[1].num_data_stripes: 2048000
0x21c isw_dev[0].vol.map[1].blocks_per_strip: 128
0x21e isw_dev[0].vol.map[1].map_state: 2
0x21f isw_dev[0].vol.map[1].raid_level: 1
0x220 isw_dev[0].vol.map[1].num_members: 2
0x221 isw_dev[0].vol.map[1].num_domains: 2
0x222 isw_dev[0].vol.map[1].failed_disk_num: 1
0x223 isw_dev[0].vol.map[1].ddf: 1
0x240 isw_dev[0].vol.map[1].disk_ord_tbl[0]: 0x0
0x244 isw_dev[0].vol.map[1].disk_ord_tbl[1]: 0x1000002
0x248 isw_dev[1].volume: " RAID0"
0x25c isw_dev[1].SizeHigh: 0
0x258 isw_dev[1].SizeLow: 904947712
0x260 isw_dev[1].status: 0x20c
0x264 isw_dev[1].reserved_blocks: 0
0x268 isw_dev[1].migr_priority: 0
0x269 isw_dev[1].num_sub_vol: 0
0x26a isw_dev[1].tid: 2
0x26b isw_dev[1].cng_master_disk: 0
0x26c isw_dev[1].cache_policy: 0
0x26e isw_dev[1].cng_state: 0
0x26f isw_dev[1].cng_sub_state: 0
0x298 isw_dev[1].vol.curr_migr_unit: 0
0x29c isw_dev[1].vol.check_point_id: 0
0x2a0 isw_dev[1].vol.migr_state: 0
0x2a1 isw_dev[1].vol.migr_type: 1
0x2a2 isw_dev[1].vol.dirty: 0
0x2a3 isw_dev[1].vol.fs_state: 255
0x2a4 isw_dev[1].vol.verify_errors: 0
0x2a6 isw_dev[1].vol.verify_bad_blocks: 0
0x2b8 isw_dev[1].vol.map[0].pba_of_lba0: 524292360
0x2bc isw_dev[1].vol.map[0].blocks_per_member: 452474120
0x2c0 isw_dev[1].vol.map[0].num_data_stripes: 1767476
0x2c4 isw_dev[1].vol.map[0].blocks_per_strip: 256
0x2c6 isw_dev[1].vol.map[0].map_state: 3
0x2c7 isw_dev[1].vol.map[0].raid_level: 0
0x2c8 isw_dev[1].vol.map[0].num_members: 2
0x2c9 isw_dev[1].vol.map[0].num_domains: 1
0x2ca isw_dev[1].vol.map[0].failed_disk_num: 0
0x2cb isw_dev[1].vol.map[0].ddf: 1
0x2e8 isw_dev[1].vol.map[0].disk_ord_tbl[0]: 0x1000000
0x2ec isw_dev[1].vol.map[0].disk_ord_tbl[1]: 0x1

/////////////////////////////////////////////////////////////////////////////////////////////////

Additional system logs (casper.log, dmesg.txt, lspci.txt, mount, df and the output of 'dmraid -r -D') are attached.

Revision history for this message
takbal (takbal) wrote :
tags: added: livecd
removed: cd live
Revision history for this message
Jeff Enns (cyberpenguinks) wrote :

Thank you for sending in your bug report.

Changed in linux (Ubuntu):
status: New → Confirmed
Revision history for this message
takbal (takbal) wrote :

Here is a more detailed casper.log (with debug=). It is not from the actual boot when the corruption happened, but from one after it.

Revision history for this message
takbal (takbal) wrote :

One more thing: I can confirm that the bug exists in SuSE 11.1 as well on this config, so it must be in something common. SuSE breaks my onboard LAN as well - at least Ubuntu does not do that ;)

Revision history for this message
Jeff Enns (cyberpenguinks) wrote :

Could you post the kernel version you are running on SuSE, please? I'm assuming you are referring to OpenSuSE and not SLES or SLED, right? Thanks!

Revision history for this message
takbal (takbal) wrote :

It is indeed the openSUSE 11.1 release. I tried both the installation DVD and the live CD - according to DistroWatch, it has kernel 2.6.27.7 (64-bit version).

Revision history for this message
Jeff Enns (cyberpenguinks) wrote :

Thank you for the update.

Changed in linux (openSUSE):
status: New → Confirmed
Revision history for this message
takbal (takbal) wrote :

It looks like I found a fix by messing around (though I do not understand why it works).

I noticed that the metadata had quite different content on the two drives. Earlier, when I got to the "offline member" screen in the BIOS, I marked the second drive as non-RAID; the first drive then became "live" but with a missing partner, to which I re-added the second drive. Today I tried what happens if I mark the *first* drive as non-RAID (removing its metadata) and add it to the *second* one. I fixed the mirror in Windows as usual and recovered the RAID0, then rebooted.

Things went differently from here: both disks were now marked as "offline" in the BIOS (with the previous approach, they had stayed green at this stage as long as the live CD was not used). I then booted the live CD, and for some reason the array was now shown as *valid* in dmraid (even though the BIOS said it was invalid)! I was able to mount the drive through /dev/mapper, added the ext4 partition to the RAID1 (although cfdisk complained about not being able to read back the partition table), and installed Ubuntu and GRUB according to the Ubuntu FakeRaidHowto. Now everything looks like it is working as expected: the drives come up live and I am able to dual-boot.

Hope it stays like this...

Revision history for this message
Jouke74 (hottenga) wrote :

Same problem here, but with an AMD 790X chipset. Just starting the live CD degraded my RAID 0 and killed all the data on the disks - that was the end of my Vista operating system (oooww).

Gigabyte-MX790X-UD4 mobo
AMD Phenom II 940 BE
4 x 2 GB Kingston ValueRAM
2 x WD 500GB Caviar Black drives in RAID 0
1 x WD 1TB Green Power drive as backup (not in a RAID config)

I think this is quite critical, given that a live CD is NOT supposed to change anything on my computer :-)

Revision history for this message
Jeff Enns (cyberpenguinks) wrote :

I concur that the live CD shouldn't change anything on the computer. Thank you for your confirmation.

Revision history for this message
takbal (takbal) wrote :

I was a bit premature about that fix.

Now what I have is that everything works perfectly until a *hard* power-off (a cold boot, i.e. when the computer is entirely turned off). Then the drives are again marked as "offline member".

It can be fixed by simply booting into the live CD, not touching anything and rebooting immediately; the drives then operate normally again until the next power-off. If I turn off the computer with a working RAID at *any* point - from the operating system (Windows or Linux), by pressing the power button on the GRUB screen, or even during the BIOS startup immediately after the beep - the drives again go offline after power-on. GRUB therefore cannot be the culprit.

The key is definitely going into power-off mode: doing this invalidates the RAID whatever the circumstances are. I have no idea why booting the live CD fixes it until the next power-off. After booting from the offline state, dmraid on the live CD says the drives are normal.

It looks like a Gigabyte-specific(?) Intel BIOS-related problem which is somehow intertwined with power management. BTW, the board cannot wake up from S3, which is kind of "normal" with these new Gigabyte boards (they are still patching the BIOS).

Ubuntu is still implicated in that the problems started after the live CD was inserted. Before that, everything was working normally and the drives were recognized correctly after a power-off from Windows. Linux changed something which made this particular Gigabyte-Intel board confused about its RAID arrays.

Revision history for this message
takbal (takbal) wrote :

Two related bugs:

https://bugs.launchpad.net/ubuntu/+bug/141435
https://bugzilla.novell.com/show_bug.cgi?id=328388

These turned my attention to HPA issues.
From the hdparm man page:

    [...] The difference between these two values indicates how many sectors of the
    disk are currently hidden from the operating system, in the form of a Host
    Protected Area (HPA). This area is often used by computer makers to hold
    diagnostic software, and/or a copy of the originally provided operating system
    for recovery purposes. To change the current max (VERY DANGEROUS, DATA LOSS IS
    EXTREMELY LIKELY), a new value should be provided (in base10) immediately
    following the -N flag. This value is specified as a count of sectors, rather
    than the "max sector address" of the drive. Drives have the concept of a
    temporary (volatile) setting which is lost on the next hardware reset, as well
    as a more permanent (non-volatile) value which survives resets and power
    cycles. By default, -N affects only the temporary (volatile) setting. To
    change the permanent (non-volatile) value, prepend a leading p character
    immediately before the first digit of the value. Drives are supposed to allow
    only a single permanent change per session. A hardware reset (or power cycle)
    is required before another permanent -N operation can succeed.

That may explain my problem: if the live CD sets up a wrong permanent HPA but then corrects the non-permanent (volatile) one to the good value, a power-off resets it to the wrong value, which invalidates the array. I will try dumping my drive's settings and check whether re-writing the correct number of sectors fixes it.

Revision history for this message
Jeff Enns (cyberpenguinks) wrote :

Thanks for the update. The more information we have the more the developers have to go on. Thanks.

Revision history for this message
takbal (takbal) wrote :

I can confirm that HPA was indeed the problem in my case. The dmesg log showed the "HPA locked" message in the libata section for the ata0 drive. This forced value made my arrays accessible until the next hard boot. Only /dev/sda was affected. Finally, I got the total sector count with the live CD using

# hdparm -N /dev/sda

/dev/sda:
 max sectors = 976773168/976773168, HPA is disabled

and then setting the permanent value with

# hdparm -N p976773168 /dev/sda

entirely fixed my problem. Now the system comes up without errors from a cold boot.

Actually, I am not sure this is the correct setting, as the HPA was most probably set up by the Intel RAID to write something there. Now the HPA is disabled, and the operating system can theoretically overwrite the protected area. As the partitioning was created earlier, those sectors are most probably unused, so I hope it will keep working as long as I do not tamper with the partition table (however, it would be nice if somebody could confirm that this is safe).

It looks like other people have noticed this HPA issue and have started adding ICH10R support:
https://bugs.launchpad.net/ubuntu/+source/dmraid/+bug/372170

However, I am not sure how the live CD problem is related to that, as I had no problems with dmraid itself. The live CD would probably not affect my system with the libata.ignore_hpa=0 setting suggested elsewhere, but THIS IS NOT THE DEFAULT. Probably the same libata setting destroyed the array in the first place; then, when I was fixing it through Windows, I forced the other drive to conform to the new geometry. Because the permanent setting was not changed properly, the problem came back after cold boots.
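
For anyone who wants to test with the HPA left untouched, something like the following should in theory do it (I have not verified this exact invocation myself, so treat it as a sketch): at the live CD boot menu, edit the kernel options (F6, "Other options") and append

  libata.ignore_hpa=0

to the kernel command line before booting, so libata honours the existing HPA instead of unlocking it.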

Anyway, I think it is unacceptable that the live CD touches the array in a way which screws it up. Why in hell does it change the HPA sector settings by default??? I now understand that it is quite possible that the first change was only temporary and a cold boot could have fixed the problem. Still, a warning could be given saying that if you have problems with the RAID after booting the live CD, try a cold boot.

I understand that libata.ignore_hpa=0 makes other people's HDDs inaccessible, but making something inaccessible sounds less serious to me than *trashing* arrays, which may potentially result in data loss. The lesser evil should definitely be the default. Even better, it should be an option at boot, or at least a textual warning should be given.

Revision history for this message
takbal (takbal) wrote :

Jouke74, you can probably fix it by installing Windows on a separate third drive, starting the Intel RAID tool and selecting "Recover Volume" on the failed HDD pair. Maybe you can also try dmsetup to re-create the arrays, but I have no idea whether that works.
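
As a first step from the live CD, something like this should at least show whether dmraid can still read the metadata (just a sketch, I have not tried it on a broken RAID 0 myself):

# dmraid -s
# dmraid -ay

The first command lists the discovered RAID sets and their status; the second attempts to activate them, so the volumes would appear under /dev/mapper if the metadata is still usable.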

Revision history for this message
Kraemer (djkraemer) wrote :

I had this exact same issue running an Asus Maximus II Formula with a P45 chipset and ICH10R RAID 0. The solution for me was a simple BIOS reset. A royal pain in my case, but it worked.

Revision history for this message
Giuseppe Pennisi (giupenni78) wrote :

I have similar problems with a GA-G33M-DS2R motherboard.

Does your motherboard have a JMicron IDE PATA/SATA controller?

I think the problem could be the JMicron controller (JMB368):
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/377633

giuseppe

tags: added: dmraid
removed: raid
Revision history for this message
Jeremy Foshee (jeremyfoshee) wrote :

takbal,
      I understand that this is a rather intense failure, but I wonder if it would be possible to test this against the latest Lucid Alpha LiveCD? If so, please also include the relevant logging and environment information.

Thanks!

-JFo

Changed in linux (Ubuntu):
status: Confirmed → Incomplete
importance: Undecided → Medium
Revision history for this message
takbal (takbal) wrote :

Because the HPA now spans my entire array, to reproduce the original problem I would have to back up 1TB of data from three operating systems to some other place (which I do not have), format everything, install Windows on the RAID, put in the live CD to test the bug, and then pray that restoring everything works. That is about 2-3 days in total.

Sorry, but no way. I think I already did the hard part by identifying the problem. Fixing it or just forwarding the bug to the relevant developers should be relatively easy.

Revision history for this message
Heinrich Janzing (heinrich-janzing) wrote :

I have a similar issue with my ICH10R (RAID 0): after running the Kubuntu live CD (Lucid Beta 1), both drives are shown as "Member offline" on every cold boot. I can temporarily fix this by booting from the live CD. If I allow it to fully boot, I can access the volume. I don't need to start GParted or anything: I can just do a Ctrl-Alt-Del during the live CD boot process, the drives are then marked as normal again, and I can boot from the RAID volume.

I never had issues before (running just Win7). I'm not sure if this is the exact same issue (I don't fully understand the HPA stuff...).

Revision history for this message
WiLLiTo (victor-gg83) wrote :

Same problem here, but using the Lucid final release GNOME x86_64 live CD. Booting with it still broke the RAID members. Two years waiting for a fix :-(
Canonical now supports the dmraid package - why don't they do anything? Are they informed about this issue? There are a lot of duplicate bugs about it; just search for "ich10r", "intel matrix storage", "raid 0" or "isw". Sorry about my English.

Revision history for this message
phdb (philippe-de-brouwer) wrote :

Same problem here...

Linux: Ubuntu 10.04 with kernel 2.6.32-24-generic
on the following hardware:
Intel Core2 Quad Q9550 2.83 GHz (S775/45nm) BOX
4x Hitachi DeskStar 7K1000 1 TB (SATA II, NCQ) in BIOS RAID 10 (fakeraid)
Gigabyte GA-X48-DS5

:-(

Very sad that there is no solution yet.
The hdparm fix does not work
(
the -N option reports the following:
# hdparm -N /dev/sde
/dev/sde:
 max sectors = 625142448/4385456(625142448?), HPA setting seems invalid (buggy kernel device driver?)

and for the RAID device, another message:
# hdparm -N /dev/mapper/isw_eaacehbjcf_Volume0

/dev/mapper/isw_eaacehbjcf_Volume0:
 HDIO_DRIVE_CMD(identify) failed: Inappropriate ioctl for device
)

Interestingly, though, I used to have this working with Ubuntu 8.10 (Mint flavour), in the sense that a cold boot would always reset the disks to "offline member"... but once Linux loaded, I could manually mount the RAID. Since I have had Ubuntu 10.04 this no longer works. After a cold boot (= power off) I have to start the system once and then do a warm boot (a reboot without power off), and then it works.

Strange and disappointing... especially that this only gets "Medium" priority and is not even assigned to someone :-(
At least a warning message should appear BEFORE the live CD boots in Ubuntu!
For the novice this will lead to serious disappointment and loss of data.

Revision history for this message
takbal (takbal) wrote :

phdb: I am quite sure that hdparm will not work on a RAID array directly, because it is not a physical drive. Your first try seems plausible if /dev/sde is a physical drive. Probably the HPA modification made by the live CD was invalid. I have never tried RAID 10, however.

On my machine a total "cold boot" requires disconnecting the power cable (or switching off the power at the PSU); a plain power-off is not enough.

Revision history for this message
phdb (philippe-de-brouwer) wrote :

Hi takbal,
Thanks for the kind and quick reply!

Sorry - I forgot to mention - but I failed to install Ubuntu on the RAID, so I popped in an old HD and / is on that drive (that is /dev/sde). The RAID is only for data.

Linux sees the RAID drives as /dev/sda, /dev/sdb, /dev/sdc, and /dev/sdd.

I confirm that on my PC, both powering off and disconnecting the power cable put the RAID into "offline member".

Revision history for this message
takbal (takbal) wrote :

In my case the HPA fix with hdparm was done on one of the physical drives of the array, like /dev/sda in your setup. However, I only had 2 drives in the RAID.

You can probably check the HPA settings on each of those drives and see whether they are inconsistent or have been messed up by the Linux kernel. Beware, however, as changing them can easily break everything.
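
Querying alone (giving no value to -N) is read-only and should be safe; for your four RAID drives, something along these lines would do it (adjust the device names to your setup, this is just a sketch):

# for d in /dev/sda /dev/sdb /dev/sdc /dev/sdd; do hdparm -N $d; done

If the reported current and native max sector counts differ on any drive, that would point to the same HPA problem I had.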

Revision history for this message
phdb (philippe-de-brouwer) wrote :

Something else that might help:

-1-
I tried Fedora 13... and it not only recognises the RAID, but also keeps the fakeraid working (every type of reboot keeps the disks as RAID members). A single boot into Ubuntu breaks the RAID, and the next boot reports the disks as "offline member"
(except after a shutdown -r)

-2-
Today it was impossible to make Ubuntu 10.04 recognise the fakeRAID. Even after a boot where the disks were not in "offline member", Ubuntu still could not see the RAID, but I finally found a sequence that brought it back:
step 1: boot into Fedora
step 2: boot into Ubuntu and run update-grub
step 3: reboot into Ubuntu

I hope that those findings help.

Revision history for this message
phdb (philippe-de-brouwer) wrote :

Oh yes, and also this:

-3-
Not only does the live CD break the fakeRAID (as could be understood from the title of this bug), but so does the hard-disk install.

Revision history for this message
Artur (scheinemann) wrote :

Yes, make that one more.

I tried this with my old board, a Gigabyte GA-EP45, and now with the newer Asus P5Q. Both have the Intel RAID, and both break after trying to boot Ubuntu.

Now 10.10 is out and it gets even better: it won't even boot to the end (I get a black screen, though it seems that Ubuntu is working in the background) and it still breaks my RAID. Only a complete power-off (cold boot) helps to bring back my RAID and, with it, the data.

As a consequence, I have to say that (since I have no way to back up my RAID 1 before fiddling around with it - actually, I installed the RAID 1 to be a backup of its own.....) I will move on and try phdb's advice. Maybe I'll get lucky with Red Hat or Fedora.

Revision history for this message
sibidiba (sibidiba) wrote :

I can confirm this on 13.04.

Booting an ASUS P8H77-M PRO completely broke a 1TB ONDEMAND-RAID1 array of two disks!

Even worse, the SATA was actually in AHCI mode on purpose in the BIOS!
Nevertheless it activated the RAID and, even worse, started a sync without asking!

I could barely get the partitions back with partition recovery. I had an extremely hard time recovering my TrueCrypt encrypted volume! (Thanks, EASEUS!)

This is severe! You can lose all your data!

Changed in linux (Ubuntu):
status: Incomplete → Confirmed