Lucid crash on heavy DB i/o (mvsas?)

Bug #554398 reported by Richard Weait
This bug affects 9 people
Affects: linux (Ubuntu)
Status: Won't Fix
Importance: Undecided
Assigned to: Manoj Iyer
Milestone: (none)

Bug Description

Lucid crashes repeatably when loading the OpenStreetMap database. This is an I/O-heavy operation that takes 36-48 hours. Six to twelve hours in, Lucid throws I/O errors, then crashes and fails to respond to the keyboard or ssh.

Linux tile 2.6.32-19-server #28-Ubuntu SMP Thu Apr 1 11:46:36 UTC 2010 x86_64 GNU/Linux

Ubuntu 2.6.32-19.28-server 2.6.32.10+drm33.1

This is all new hardware and drives.
I've not been able to confirm any hard drive flaws. SMART, badblocks and dd all operate normally and give no errors.

The initial setup included RAID; I've removed that, but the problem persists.
The initial setup included ext4; I've changed to ext3, but the problem persists.

dmesg shows a trace shortly after mvsas loads; could this be related to mvsas?

Any help with this is warmly welcomed.

See dmesg.log attached below

Some errors recorded here:
http://dpaste.com/176319/

More errors here from an earlier attempt with ext4:

end_request: I/O error, dev sdc, sector 8271
EXT4-fs error (device sdc1) ext4_discard_preallocations: Error in loading buddy information for 1
mvsas 0000:01:00.0 mvsas exec failed[-132]!
ata9: translated ATA stat/err 0x00/00 to SCSI SK/ASC/ASCQ 0xb/00/00
[ed: above two lines repeated 5 more times]

end_request: I/O error, dev sdc, sector 937180319
Read-error on swap-device (8:32:937180327)
Read-error on swap-device (8:32:937180335)
Read-error on swap-device (8:32:937180343)
Read-error on swap-device (8:32:937180351)
Read-error on swap-device (8:32:937180359)
Read-error on swap-device (8:32:937180367)
Read-error on swap-device (8:32:937180375)
Read-error on swap-device (8:32:937180383)
mvsas 0000:01:00.0 mvsas exec failed[-132]!
ata9: translated ATA stat/err 0x00/00 to SCSI SK/ASC/ASCQ 0xb/00/00
[ed: above two lines repeated 5 more times]

end_request: I/O error, dev sdc, sector 937193119
Read-error on swap-device (8:32:937193127)
Read-error on swap-device (8:32:937193135)
Read-error on swap-device (8:32:937193143)
Read-error on swap-device (8:32:937193151)
Kernel panic - not syncing: Attempted to kill init!
mvsas 0000:01:00.0 mvsas exec failed[-132]!
ata9: translated ATA stat/err 0x00/00 to SCSI SK/ASC/ASCQ 0xb/00/00
[ed: above two lines repeated 5 more times]

end_request: I/O error, dev sdc, sector 937168095
Read-error on swap-device (8:32:937168103)
Read-error on swap-device (8:32:937168111)
Read-error on swap-device (8:32:937168119)
Read-error on swap-device (8:32:937168127)
Read-error on swap-device (8:32:937168135)
Read-error on swap-device (8:32:937168143)
Read-error on swap-device (8:32:937168151)
Read-error on swap-device (8:32:937168159)
[drm:drm_fb_helper_panic] *ERROR* panic occurred, switching back to text console

Revision history for this message
Richard Weait (richard-weait) wrote :
Revision history for this message
Richard Weait (richard-weait) wrote :

This bug does not occur in CentOS 5.4, which appears to ship mvsas 0.5.4. Perhaps a regression in mvsas?

Revision history for this message
Manoj Iyer (manjo) wrote :

Is the SATA drive NCQ capable, and is NCQ enabled? What controller do you have? If you have an ICH8, are the BIOS settings for the Intel ICH8/JMicron363 controllers in AHCI mode?

You can run sudo hdparm -I /dev/sdXXX and sudo lspci -vvnn to get that information.
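
[ed: a hedged sketch of those checks; the device name and grep filters are illustrative, not from the comment:

sudo hdparm -I /dev/sda | grep -i ncq          # lists "Native Command Queueing (NCQ)" if the drive supports it
cat /sys/block/sda/device/queue_depth          # a value greater than 1 generally means NCQ is actually in use
sudo lspci -vvnn | grep -i -A3 'sas\|sata'     # controller model and [vendor:device] IDs
]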

Manoj Iyer (manjo)
Changed in linux (Ubuntu):
status: New → Incomplete
Revision history for this message
Richard Weait (richard-weait) wrote :

drive is NCQ capable and enabled
sudo hdparm -I /dev/sda | grep NCQ
    * Native Command Queueing (NCQ)

Controller:
Marvell Technology Group Ltd. MV64460/64461/64462 System Controller, Revision B [11ab:6485] (rev 01)

full lspci stanza here: http://dpaste.com/180210/

No ICH8, but ICH10.

BIOS Setting is AHCI

Thank you for looking at this. I appreciate it.

Revision history for this message
Manoj Iyer (manjo) wrote :

I did find a list of 7 patches that fix a similar problem: http://kerneltrap.org/mailarchive/linux-scsi/2009/12/3/6616463 I need to investigate whether these made it upstream into any tree and possibly cherry-pick them.

Changed in linux (Ubuntu):
assignee: nobody → Manoj Iyer (manjo)
Revision history for this message
Manoj Iyer (manjo) wrote :

@Richard, can you please try the kernel in http://people.ubuntu.com/~manjo/lp554398-lucid/ ? I applied 6 of the 7 patches I mentioned in my previous comment; the 7th seems to cause a build failure and does not appear relevant to this issue. I have seen reports of some success with these patches, but others claim the mvsas driver still misbehaves. Please let me know how well it works for you. The patches are not part of any git tree; I have emailed the author to ask whether they plan to get them upstream.
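
[ed: for reference, a hedged sketch of installing and verifying a test kernel like this; the .deb names below are placeholders, not the actual files published at that URL:

wget http://people.ubuntu.com/~manjo/lp554398-lucid/linux-headers-VERSION_amd64.deb
wget http://people.ubuntu.com/~manjo/lp554398-lucid/linux-image-VERSION_amd64.deb
sudo dpkg -i linux-headers-VERSION_amd64.deb linux-image-VERSION_amd64.deb
sudo reboot
uname -r    # should now report the test kernel version
]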

Revision history for this message
Richard Weait (richard-weait) wrote :

@Manoj

Thank you. Installed as follows:
sudo dpkg -i linux-header...
sudo dpkg -i linux-image...

Reboot

With luck, it will take me more than 36 hours to report back that the database load completed correctly. I'll be sure to reply either way.

Thank you for your assistance.

Revision history for this message
Richard Weait (richard-weait) wrote :

This error is from osm2pgsql (the application that fills the database from the compressed XML file):

processing way (193k)get_way failed: ERROR: could not read block 22269 of relation base/16386/557124: Input/output error

The ssh connection to the box in question has also dropped and will not reconnect.
More in the attached log file and the following comment.

Revision history for this message
Richard Weait (richard-weait) wrote :

The keyboard is still responding. Switching to a console with Alt-F3 works; logging in does not. See the errors in console-errors.log.

Revision history for this message
Manoj Iyer (manjo) wrote :

But the machine does not crash anymore? Can you please attach the dmesg output?

Revision history for this message
Richard Weait (richard-weait) wrote :

From the munin graphs at the point when munin stopped updating. My interpretations are inexpert, but, hey! Pretty graphs (rough shell equivalents of these checks are sketched after the list):

on sda
Disk latency spiked to 180 seconds
Disk utilization spiked to 100%

munin processing time spiked to 40 seconds

CPU shows increased iowait

Interrupt 16 looks very high at 338/second

Load average jumped to 7
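
[ed: rough shell equivalents of those checks (iostat and vmstat come from the sysstat and procps packages; intervals are arbitrary):

iostat -x 5                        # per-device %util and await; watch sda hit 100% with huge latency
vmstat 5                           # the 'wa' column shows the iowait climb
awk '$1=="16:"' /proc/interrupts   # interrupt 16 counters; diff two readings for a per-second rate
cat /proc/loadavg                  # load average
]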

Revision history for this message
Richard Weait (richard-weait) wrote :

Rebooted. Here is the dmesg output: dmesg2.log

Revision history for this message
Kristoffer Bergström (kabtoffe) wrote :

I have VERY similar issues. My new ZFS pool is degraded after I tried some torrenting. In the attached dmesg you can see I'm getting similar errors. I'm running the 2.6.32-21-generic kernel on amd64. I have two drives attached to the Marvell controller on my board (ASUS P6T Deluxe). I'm not 100% sure they are the same drives the dmesg shows, but it seems likely.

zpool status

  pool: data
 state: DEGRADED
status: One or more devices could not be used because the label is missing or
 invalid. Sufficient replicas exist for the pool to continue
 functioning in a degraded state.
action: Replace the device using 'zpool replace'.
   see: http://www.sun.com/msg/ZFS-8000-4J
 scrub: none requested
config:

 NAME STATE READ WRITE CKSUM
 data DEGRADED 0 0 0
   raidz2 DEGRADED 0 0 0
     sdb ONLINE 0 0 0
     sdc UNAVAIL 95 89 0 experienced I/O failures
     sdd UNAVAIL 96 73 0 experienced I/O failures
     sde ONLINE 0 0 0
     sdf ONLINE 0 0 0
     sdg ONLINE 0 0 0

errors: No known data errors
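
[ed: a hedged sketch of recovery once the drives are on a stable port; if the UNAVAIL state came from transient controller errors rather than bad disks, clearing and scrubbing may be enough, otherwise follow the pool's own 'zpool replace' action:

sudo zpool clear data sdc
sudo zpool clear data sdd
sudo zpool scrub data
sudo zpool status -v data    # READ/WRITE/CKSUM counters should stay at zero afterwards
]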

Revision history for this message
Richard Weait (richard-weait) wrote :

I've stopped using the motherboard drive ports.

I plugged in an Intel SASWT41 RAID controller and it "Just Worked".

Revision history for this message
Peter Funk (pf-artcom-gmbh) wrote :

We just installed a brand new SuperMicro Controller in a SuperMicro board running Lucid 10.04:
04:00.0 SCSI storage controller: Marvell Technology Group Ltd. MV64460/64461/64462 System Controller, Revision B (rev 01)
Linux version 2.6.32-22-server (buildd@yellow) (gcc version 4.4.3 (Ubuntu 4.4.3-4ubuntu5) ) #33-Ubuntu SMP Wed Apr 28 14:34:48 UTC 2010

After building a software RAID10 the controller fails to access the disks with syslog messages like this:

May 19 16:28:36 ac52020 mdadm[4824]: NewArray event detected on md device /dev/md7
May 19 16:28:36 ac52020 kernel: [ 3127.568699] md7: p1
May 19 16:29:08 ac52020 kernel: [ 3159.585665] /build/buildd/linux-2.6.32/drivers/scsi/mvsas/mv_sas.c 1669:mvs_abort_task:rc= 5
May 19 16:29:08 ac52020 kernel: [ 3159.585669] /build/buildd/linux-2.6.32/drivers/scsi/mvsas/mv_sas.c 1608:mvs_query_task:rc= 5
May 19 16:29:08 ac52020 kernel: [ 3159.585718] /build/buildd/linux-2.6.32/drivers/scsi/mvsas/mv_sas.c 1669:mvs_abort_task:rc= 5
May 19 16:29:08 ac52020 kernel: [ 3159.585721] /build/buildd/linux-2.6.32/drivers/scsi/mvsas/mv_sas.c 1608:mvs_query_task:rc= 5
May 19 16:29:08 ac52020 kernel: [ 3159.585760] /build/buildd/linux-2.6.32/drivers/scsi/mvsas/mv_sas.c 1669:mvs_abort_task:rc= 5
May 19 16:29:08 ac52020 kernel: [ 3159.585762] /build/buildd/linux-2.6.32/drivers/scsi/mvsas/mv_sas.c 1608:mvs_query_task:rc= 5
May 19 16:29:08 ac52020 kernel: [ 3159.585767] /build/buildd/linux-2.6.32/drivers/scsi/mvsas/mv_sas.c 1669:mvs_abort_task:rc= 5
May 19 16:29:08 ac52020 kernel: [ 3159.585770] /build/buildd/linux-2.6.32/drivers/scsi/mvsas/mv_sas.c 1608:mvs_query_task:rc= 5
May 19 16:29:08 ac52020 kernel: [ 3159.585809] /build/buildd/linux-2.6.32/drivers/scsi/mvsas/mv_sas.c 1669:mvs_abort_task:rc= 5
May 19 16:29:08 ac52020 kernel: [ 3159.585812] /build/buildd/linux-2.6.32/drivers/scsi/mvsas/mv_sas.c 1608:mvs_query_task:rc= 5
May 19 16:29:38 ac52020 kernel: [ 3189.919859] /build/buildd/linux-2.6.32/drivers/scsi/mvsas/mv_sas.c 1669:mvs_abort_task:rc= 5
May 19 16:29:38 ac52020 kernel: [ 3189.919863] /build/buildd/linux-2.6.32/drivers/scsi/mvsas/mv_sas.c 1608:mvs_query_task:rc= 5
May 19 16:29:38 ac52020 kernel: [ 3189.919905] /build/buildd/linux-2.6.32/drivers/scsi/mvsas/mv_sas.c 1669:mvs_abort_task:rc= 5
May 19 16:29:38 ac52020 kernel: [ 3189.919909] /build/buildd/linux-2.6.32/drivers/scsi/mvsas/mv_sas.c 1608:mvs_query_task:rc= 5
May 19 16:29:38 ac52020 kernel: [ 3189.919942] /build/buildd/linux-2.6.32/drivers/scsi/mvsas/mv_sas.c 1669:mvs_abort_task:rc= 5
May 19 16:29:38 ac52020 kernel: [ 3189.919945] /build/buildd/linux-2.6.32/drivers/scsi/mvsas/mv_sas.c 1608:mvs_query_task:rc= 5
May 19 16:29:38 ac52020 kernel: [ 3189.919949] /build/buildd/linux-2.6.32/drivers/scsi/mvsas/mv_sas.c 1669:mvs_abort_task:rc= 5
May 19 16:29:38 ac52020 kernel: [ 3189.919952] /build/buildd/linux-2.6.32/drivers/scsi/mvsas/mv_sas.c 1608:mvs_query_task:rc= 5
May 19 16:29:38 ac52020 kernel: [ 3189.919985] /build/buildd/linux-2.6.32/drivers/scsi/mvsas/mv_sas.c 1669:mvs_abort_task:rc= 5
May 19 16:29:38 ac52020 kernel: [ 3189.919988] /build/buildd/linux-2.6.32/drivers/scsi/mvsas/mv_sas.c 1608:mvs_query_task:rc= 5
May 19 16:30:09 ac52020 kernel: [ 3220.8...


Revision history for this message
Peter Funk (pf-artcom-gmbh) wrote :

Last week I wrote that a SuperMicro 8-port SATA controller doesn't work with Ubuntu 10.04 64-bit.
We got error messages from the mv_sas.c driver and the disks became inaccessible.
The same happens with OpenSUSE 11.2, so I assume it is a general driver problem.

Today we tested the same hardware using a CentOS 5.4 x86_64 LiveCD to check whether this might really be a regression, as Richard Weait already asked on 2010-04-05.

Since it is difficult to test with real applications using a live CD, I tested copying a bunch of huge TIF image files from one directory to another repeatedly. This ran without problems (RAID10 built out of 8 SATA disks, filesystem XFS).

So at the moment I have to assume there really is a regression in the mvsas driver module:
CentOS 5.4 contains mvsas version 0.5.4,
whereas both Ubuntu Lucid and OpenSUSE 11.2 contain mvsas version 0.8.2.
We also tested the 32-bit Desktop version: it doesn't work either.

Any suggestions on how I should proceed to narrow this down further?
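
[ed: one quick comparison that can be run on each system, as a hedged sketch:

modinfo mvsas | grep -E '^(filename|version):'   # driver version actually loaded (0.5.4 vs 0.8.2)
lspci -nn | grep -i marvell                      # confirm which controller the module is binding to
]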

Revision history for this message
josh.k.abbott (josh-k-abbott) wrote :

Hi

I'm also having problems with a Supermicro card with this module.

I'm trying to build a RAID array with 4 drives, but it fails after some I/O. I even set the maximum RAID rate to 3MB/s with dev.raid.speed_limit_max = 10000.

The trace in /var/log/messages is attached.
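
[ed: the throttle mentioned above is the md resync sysctl; its value is in KiB/s, so 10000 corresponds to roughly 10 MB/s rather than 3 MB/s. A sketch of setting and checking it:

sudo sysctl -w dev.raid.speed_limit_max=10000
cat /proc/sys/dev/raid/speed_limit_max    # confirm the current cap
]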

Revision history for this message
josh.k.abbott (josh-k-abbott) wrote :

Hi

I installed 10.10 and everything is working perfectly...

Revision history for this message
Freaky (freaky) wrote :

Issues with mvsas and SATA drives (at least with the brands I currently have attached) are unfortunately not solved.

I figured I'd upgrade the server from 10.04 LTS to 10.10 after seeing josh.k.abbott's message.

As stated/guessed by someone earlier, the issues are indeed NOT Ubuntu-specific. The mvsas driver has issues with SATA disks. It is stable with SAS disks AFAIK (I haven't used them myself).

See for example these:
http://kerneltrap.org/mailarchive/linux-scsi/2010/5/6/6884900/thread#mid-6884900
http://hardforum.com/showthread.php?t=1397855&page=26

Things do seem to have improved. In order to test, I have only attached 2 drives now (I had more in the past). When building RAID sets on those, mdadm would fail before it finished (completely). It seems to finish now, but it stalls frequently (hey, it is an improvement :)).

It might be nothing, but I only have issues with /dev/sde (only /dev/sde and /dev/sdf are attached to mvsas; because of the issues and the need to keep the other drives stable, they're attached to onboard SATA).

The 2 drives are different brands (on purpose; I've seen too many drives, from all brands, fail at more or less the same time when they're from the same batch).

I'm not sure if certain patches have been backported to this kernel (they're not in 2.6.36 vanilla anyway), like this one from the scsi list:

http://marc.info/?l=linux-scsi&m=128509662225051&w=2

I'd certainly like to know whether that patch is included, so I know if it's useful to post to the SCSI list as well.

One of the drives doesn't seem to fully support SMART (well, it does output data, but it needs -T permissive).
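
[ed: the "-T permissive" workaround mentioned above, as a hedged example; sdX stands for whichever drive refuses normal SMART queries:

sudo smartctl -a -T permissive /dev/sdX
]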

Drives are:

/dev/sde:

Device Model: WDC WD20EARS-00MVWB0

/dev/sdf

Model Family: Seagate Barracuda LP
Device Model: ST32000542AS

As stated, the mdadm sync DID complete (it never used to make it that far on drives this large, although the set previously consisted of at least 4 drives).

lspci:

00:00.0 Host bridge: Intel Corporation Core Processor DRAM Controller (rev 12)
00:01.0 PCI bridge: Intel Corporation Core Processor PCI Express x16 Root Port (rev 12)
00:06.0 PCI bridge: Intel Corporation Core Processor Secondary PCI Express Root Port (rev 12)
00:1a.0 USB Controller: Intel Corporation 5 Series/3400 Series Chipset USB2 Enhanced Host Controller (rev 05)
00:1c.0 PCI bridge: Intel Corporation 5 Series/3400 Series Chipset PCI Express Root Port 1 (rev 05)
00:1c.4 PCI bridge: Intel Corporation 5 Series/3400 Series Chipset PCI Express Root Port 5 (rev 05)
00:1c.5 PCI bridge: Intel Corporation 5 Series/3400 Series Chipset PCI Express Root Port 6 (rev 05)
00:1d.0 USB Controller: Intel Corporation 5 Series/3400 Series Chipset USB2 Enhanced Host Controller (rev 05)
00:1e.0 PCI bridge: Intel Corporation 82801 PCI Bridge (rev a5)
00:1f.0 ISA bridge: Intel Corporation 3400 Series Chipset LPC Interface Controller (rev 05)
00:1f.2 SATA controller: Intel Corporation 5 Series/3400 Series Chipset 6 port SATA AHCI Controller (rev 05)
00:1f.3 SMBus: Intel Corporation 5 Series/3400 Series Chipset SMBus Controller (rev 05)
01:00.0 SCSI storage controller: Marvell Technology Group Ltd. MV64460/64461/64462 System Controller, Revision B (rev 01)
02:00.0 SCSI storag...


Revision history for this message
Freaky (freaky) wrote :

It might be worth noting that the issues with mvsas / SATA disks are especially notable with md RAID sets. Also, from what I've seen in posts, an XFS filesystem on the drives (when they're not in RAID) seems to trigger the issue a lot.

Anyway, my question is basically two-fold:

I'd like to know which patches (if any) from the SCSI mailing list are included by Ubuntu in the current 10.10 kernel. If they're not all included yet, are they going to be?

I'm not afraid to build/patch kernels myself; I've just not done it with Ubuntu before and so don't know any specifics I might have to consider (for example, I never need an initrd because I build in all drivers required at boot, and I don't know how to tell Ubuntu properly about my kernel so it won't just overwrite GRUB with the next update; GRUB 2's config also seems to differ a lot from v1). I build kernels on Gentoo regularly, but it has the big advantage that if a package requires a certain kernel feature and doesn't detect it in /usr/src/linux/.config, it prints a note about the feature being required.
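
[ed: a minimal sketch of the usual Ubuntu route for a patched kernel, assuming deb-src entries are enabled; dpkg's postinst hooks handle the initramfs and GRUB entries:

sudo apt-get build-dep linux-image-$(uname -r)
apt-get source linux-image-$(uname -r)
cd linux-2.6.35*/
# apply the mvsas patch from the scsi list here, e.g. patch -p1 < ../mvsas-fix.patch
fakeroot debian/rules clean
fakeroot debian/rules binary-headers binary-generic
sudo dpkg -i ../linux-image-*.deb ../linux-headers-*.deb
]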

Revision history for this message
Freaky (freaky) wrote :

If this is the right GIT tree to look at for current kernel: http://kernel.ubuntu.com/git?p=ubuntu/ubuntu-maverick.git;a=blob;f=drivers/scsi/libsas/sas_ata.c;h=8c496b56556ce232cbbe2dc1832b8cafbf2d1fbc;hb=75b2c7ec81606406b0ac2cba8ea54d2d4c8117ee

The patch I linked to isn't included. There were several others for mvsas as well in that period.
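
[ed: a hedged way to check this against a local clone; the clone URL is assumed from the gitweb link above:

git clone git://kernel.ubuntu.com/ubuntu/ubuntu-maverick.git
cd ubuntu-maverick
git log --oneline -- drivers/scsi/mvsas drivers/scsi/libsas/sas_ata.c | head -n 40
]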

Revision history for this message
Freaky (freaky) wrote :

It stayed stable for a longer period this time, but it has borked out now.

As stated, the issue seems way more frequent (if not the only time it occurs) when using md RAID (and, from what I saw in posts, also when using XFS).

The array here has failed now with these messages:

[458706.767334] mvsas 0000:02:00.0: Phy5 : No sig fis
[458706.767344] /build/buildd/linux-2.6.35/drivers/scsi/mvsas/mv_sas.c 2022:phy5 Attached Device
[458706.767384] ata8: translated ATA stat/err 0x01/04 to SCSI SK/ASC/ASCQ 0xb/00/00
[458706.767438] ata8.00: device reported invalid CHS sector 0
[458706.767442] ata8: status=0x01 { Error }
[458706.767447] ata8: error=0x04 { DriveStatusError }
[458706.767456] /build/buildd/linux-2.6.35/drivers/scsi/mvsas/mv_sas.c 2081:port 5 ctrl sts=0x199800.
[458706.767462] /build/buildd/linux-2.6.35/drivers/scsi/mvsas/mv_sas.c 2083:Port 5 irq sts = 0x1081
[458706.767470] /build/buildd/linux-2.6.35/drivers/scsi/mvsas/mv_sas.c 2109:phy5 Unplug Notice
[458706.767525] sd 7:0:0:0: [sdf] Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
[458706.767534] sd 7:0:0:0: [sdf] Sense Key : Aborted Command [current] [descriptor]
[458706.767543] Descriptor sense data with sense descriptors (in hex):
[458706.767547]
[458706.767551] RAID1 conf printout:
[458706.767556] 72 0b 00 00 00 00 00 0c 00 0a 80 00 00 00 00 00
[458706.767573] 00 00 00 00
[458706.767581] sd 7:0:0:0: [sdf]
[458706.767585] --- wd:1 rd:2
[458706.767588] Add. Sense: No additional sense information
[458706.767594] sd 7:0:0:0: [sdf] CDB:
[458706.767598] disk 0, wo:0, o:1, dev:sde
[458706.767602] disk 1, wo:1, o:0, dev:sdf
[458706.767605] Read(10): 28 00 06 7d 80 00 00 02 00 00
[458706.767623] end_request: I/O error, dev sdf, sector 108888064
[458706.767660] md/raid1:md4: sdf: rescheduling sector 108888064
[458706.767694] md/raid1:md4: sdf: rescheduling sector 108888312
[458706.767722] md/raid1:md4: sdf: rescheduling sector 108888560
[458706.805557] md/raid1:md4: redirecting sector 108888064 to other mirror: sde
[458706.811953] md/raid1:md4: redirecting sector 108888312 to other mirror: sde
[458706.813419] md/raid1:md4: redirecting sector 108888560 to other mirror: sde
[458706.813465] RAID1 conf printout:
[458706.813470] --- wd:1 rd:2
[458706.813475] disk 0, wo:0, o:1, dev:sde
[458706.813481] disk 1, wo:1, o:0, dev:sdf
[458706.879176] RAID1 conf printout:
[458706.879182] --- wd:1 rd:2
[458706.879186] disk 0, wo:0, o:1, dev:sde
[458708.815481] mvsas 0000:02:00.0: Phy5 : No sig fis
[458708.815488] /build/buildd/linux-2.6.35/drivers/scsi/mvsas/mv_sas.c 2022:phy5 Attached Device
[458708.815523] /build/buildd/linux-2.6.35/drivers/scsi/mvsas/mv_sas.c 2081:port 5 ctrl sts=0x89800.
[458708.815526] /build/buildd/linux-2.6.35/drivers/scsi/mvsas/mv_sas.c 2083:Port 5 irq sts = 0x1001
[458708.815533] /build/buildd/linux-2.6.35/drivers/scsi/mvsas/mv_sas.c 2109:phy5 Unplug Notice
[458708.815561] /build/buildd/linux-2.6.35/drivers/scsi/mvsas/mv_sas.c 2081:port 5 ctrl sts=0x199800.
[458708.815564] /build/buildd/linux-2.6.35/drivers/scsi/mvsas/mv_sas.c 2083:Port 5 irq sts = 0x81
[458710.855410] mvsas 0000:02:00.0: Phy5 : No sig fis
[458710.855417] /build/buildd/linux-2.6.35/drivers/scsi/mvsas/...


Revision history for this message
Freaky (freaky) wrote :

Hi,

I've been in contact with Manoj through IRC. He advised me to try the latest 2.6.37 release candidate kernel.

So far it seems to be a huge improvement. I've run several disks on the machine with this kernel and had bonnie++ running on them for several days (in md RAID 1).

What is odd is that on the first day I frequently had messages that a bonnie++ process had not done any I/O for 120 seconds. I lowered that message's threshold to 60 seconds. Strangely, the 5 processes were *NEVER* listed at the same time, so it seems only 1 or 2 out of the 5 stalled on I/O at any given moment.
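
[ed: the 120-second message described here is the kernel's hung-task check; lowering the threshold to 60 seconds was presumably done via this sysctl:

sudo sysctl -w kernel.hung_task_timeout_secs=60
cat /proc/sys/kernel/hung_task_timeout_secs
]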

Anyway... the messages just disappeared after 1-1.5 days. I haven't seen them since. Maybe they'll come back (temporarily) if I reboot, but I can't test that now.

By now we have bought 8 Seagate ST32000542AS disks and have run badblocks on them (individually). There seems to be some performance difference between the disks (running a full badblocks pass of all 4 patterns on 2TB disks takes a LONG time). Most only varied by a couple of hours, but the last disk in the chain was a full round (write and read) behind when the first disk finished: it was at about 1% of writing pattern 0x00 at that point. This makes it about 25% slower than the rest of the disks.

As stated, all disks showed variation, but none of them as much as this one. I'm not sure if it's the disk or the array, though. Unfortunately the machine is quite a distance away, so I can't swap disks easily.
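
[ed: the per-disk test described above, as a hedged example: a destructive write-mode badblocks pass over all four patterns, with progress output and a log of any bad sectors; options are the usual ones, not quoted from the comment:

sudo badblocks -wsv -o badblocks-sdX.log /dev/sdX
]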

So far the 2.6.37 rc seems stable with the controller and SATA disks. Previously I couldn't even finish building md RAID sets, and it has been stable for quite some time now. It will be receiving real load in a couple of days; if I notice anything weird I'll get back.

Revision history for this message
Freaky (freaky) wrote :

FYI,

we've been running with the 2.6.37 release candidate kernel for some time now.

It seems to be stable (no issues so far with 8 x 2TB RAID-6 (mdadm)). It's slow as hell, though. We get about 60MB/s write (sequential) on average. It sometimes peaks to ~300MB/s for a short period. At the beginning of the disks, an 8-disk 2TB RAID-6 should *easily* sustain 400MB/s+.

It must be something with internal scheduling, as I can start writes to all 8 of them simultaneously (dd if=/dev/zero of=/dev/sdX bs=4M) and get around 105MB/s on each of them in the beginning (thus ~800MB/s in total); write speeds of course deteriorate as the radius on the platters gets smaller. The CPU isn't heavily loaded on the RAID-6, so I doubt it's anything to do with checksum calculation or similar.

Hopefully these issues will get resolved in a future release. Stability is more important, and that seems to be resolved now. I'm not sure what in the 2.6.37 kernel fixed it, though, so I can't really help indicate what should be backported, if anything.

Revision history for this message
Freaky (freaky) wrote :

FYI, today's daily build is faster. Now I just have to pray it's stable :)

root@datavault:~# dd if=/dev/zero of=/dev/md4 bs=4M count=1024
1024+0 records in
1024+0 records out
4294967296 bytes (4.3 GB) copied, 19.9163 s, 216 MB/s
root@datavault:~# dd if=/dev/zero of=/dev/md4 bs=4M count=10240
10240+0 records in
10240+0 records out
42949672960 bytes (43 GB) copied, 171.566 s, 250 MB/s

I'm not sure if the CPU is the bottleneck now. It's a Core i3, which should be fast enough, but it is running 4 BOINC (World Community Grid) instances. 250MB/s should easily saturate the 1Gb uplink, so that's fine for us.

Revision history for this message
jonaz__ (jonaz-86) wrote :

I had those exact same issues with 2.6.32-31-server.

After reading the comments here I've now added the kernel-ppa and installed 2.6.38-8-server through the linux-image-server-lts-backport-natty package.

Now my RAID is rebuilding (manually triggered) and it has gotten to 7% so far. Before, it stopped at 0.1% after losing contact with the disks connected to the mvsas card, resulting in a kernel panic if I tried to re-hotplug one of those drives.
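
[ed: a sketch of the route described here; the PPA name is an assumption (the kernel team PPA), the package name is as given in the comment:

sudo add-apt-repository ppa:kernel-ppa/ppa
sudo apt-get update
sudo apt-get install linux-image-server-lts-backport-natty
sudo reboot
uname -r    # should report 2.6.38-8-server afterwards
]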

Revision history for this message
Rik Faith (rikfaith) wrote :

I tried the stock 2.6.38-8-server (natty) with 5 drives on an mvsas card, running badblocks simultaneously. They worked for about 10 minutes, but then all 5 had errors similar to those reported above. I tried manjo's kernel (2.6.32-19-generic) and 4 out of 5 survived a full badblocks read. Those four also survived a write cycle, but when the badblocks processes switched to the read (verify) part, I/O hung with errors like:

mvs_abort_task() mvi=ffff880408dc0000 task=ffff88040941ea80 slot=ffff880408de44b8 slot_idx=x0

In the past, I have had great success with Andy Yan's patches, but when I applied them to 2.6.38-8-server they made the problems worse, likely because I did the merge badly. I will try re-doing the merge. In the meantime, manjo, could you post your diff so that I can see how you resolved some of the conflicts?

Revision history for this message
Rik Faith (rikfaith) wrote :

Please ignore my previous post. Those tests were done using WDC WD20EURS disks, which are apparently unreliable.

I have replaced the disks with Hitachi HDS72202 and have re-run the tests.

With the stock mvsas driver from 2.6.38-8-server, I did a badblocks readonly test on all 5 spindles simultaneously, pulling approximately 80-100MB/s from each spindle, without any errors. I then did a software raid5 reconstruction (4 spindles read, 1 spindle writes) and that also worked fine without any problems.

I noticed that the mvsas driver for 2.6.38-8-server contains some of Andy Yan's patches, but not all. Apparently the parts included are now sufficient for my hardware.

Revision history for this message
Tom van Leeuwen (tom-vleeuwen) wrote :

Hi guys,

I've just bought an AOC-SASLP-MV8 controller and put 6 Samsung Ecogreen 1.5 TB disks on it (/dev/sd[cdefgh]).
I've done a fresh Ubuntu 10.04.2 x64 server install on a RAID1 ext4 WD Raptor (36GB) which is connected to the mainboard SATA.
I don't know if it's relevant, but my AOC-SASLP-MV8 controller is not bootable because I've disabled the interrupt 13 thingy.

The first thing I did was write the Samsung disks full of zeros simultaneously (dd if=/dev/zero of=/dev/sd[cdefgh] bs=1M).
Then I read them back completely, again simultaneously (dd if=/dev/sd[cdefgh] bs=1M of=/dev/null).

That went fine, no lines showed up in dmesg and average speed was ~100MB/s per disk.

The following thing I did was create a RAID5 array:
mdadm --create /dev/md5 -v -f -l 5 -n 6 /dev/sdc1 /dev/sdd1 /dev/sde1 /dev/sdf1 /dev/sdg1 /dev/sdh1
This went fine for at least 20 minutes, then I went to do other stuff. The server was not used for anything else (except routing/firewalling) and /proc/mdstat showed it was rebuilding the array. The only thing I did while it was rebuilding was start the IPv6 tunnel.

When I got back, I noticed the following in /proc/mdstat:
Personalities : [linear] [multipath] [raid0] [raid1] [raid6] [raid5] [raid4] [raid10]
md5 : active raid5 sdh1[6](F) sdg1[7](F) sdf1[8](F) sde1[9](F) sdd1[10](F) sdc1[11](F)
      7325679680 blocks level 5, 64k chunk, algorithm 2 [6/0] [______]

Not good! It broke completely with the same messages other users here had.
I've attached my dmesg output.
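
[ed: a hedged sketch of recovering an array whose members were all marked (F) by transient controller errors; --force reassembles from the freshest superblocks:

sudo mdadm --stop /dev/md5
sudo mdadm --assemble --force /dev/md5 /dev/sd[cdefgh]1
cat /proc/mdstat
]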

Also some more info:
root@pollux:~# lsmod | grep mvsas
mvsas 49328 6
libsas 52890 1 mvsas
scsi_transport_sas 33021 2 mvsas,libsas
root@pollux:~# modinfo mvsas
filename: /lib/modules/2.6.32-37-server/kernel/drivers/scsi/mvsas/mvsas.ko
license: GPL
version: 0.8.2
description: Marvell 88SE6440 SAS/SATA controller driver
author: Jeff Garzik <email address hidden>
srcversion: EE82F304DFF3A7F06086B62
alias: pci:v00009005d00000450sv*sd*bc*sc*i*
alias: pci:v000017D3d00001320sv*sd*bc*sc*i*
alias: pci:v000017D3d00001300sv*sd*bc*sc*i*
alias: pci:v000011ABd00009180sv*sd*bc*sc*i*
alias: pci:v000011ABd00009480sv*sd*bc*sc*i*
alias: pci:v000011ABd00006485sv*sd*bc*sc*i*
alias: pci:v000011ABd00006440sv*sd*bc*sc*i*
alias: pci:v000011ABd00006440sv*sd00006480bc*sc*i*
alias: pci:v000011ABd00006340sv*sd*bc*sc*i*
alias: pci:v000011ABd00006320sv*sd*bc*sc*i*
depends: libsas,scsi_transport_sas
vermagic: 2.6.32-37-server SMP mod_unload modversions

Revision history for this message
Po-Hsu Lin (cypressyew) wrote :

Closing this bug as Won't Fix, as this kernel / release is no longer supported.
Please feel free to open a new bug report if you're still experiencing this on a newer release (Bionic 18.04.3 / Disco 19.04).
Thanks!

Changed in linux (Ubuntu):
status: Incomplete → Won't Fix