aacraid driver stalls on high-load SMP machines

Bug #249964 reported by Matthias Urlichs
This bug affects 1 person
Affects         Status     Importance  Assigned to  Milestone
Linux           Won't Fix  Medium
linux (Ubuntu)  Invalid    Medium      Unassigned

Bug Description

Under load, this happens rather often:

Jul 18 22:55:24 nun kernel: [86674.467410] aacraid: Host adapter abort request (0,0,2,0)
Jul 18 22:55:24 nun kernel: [86674.467487] aacraid: Host adapter abort request (0,0,3,0)
Jul 18 22:55:24 nun kernel: [86674.467617] aacraid: Host adapter reset request. SCSI hang ?
Jul 18 22:57:26 nun kernel: [86815.728423] aacraid: Host adapter abort request (0,0,0,0)
Jul 18 22:57:26 nun kernel: [86815.728500] aacraid: Host adapter abort request (0,0,3,0)
Jul 18 22:57:26 nun kernel: [86815.728573] aacraid: Host adapter abort request (0,0,2,0)
Jul 18 22:57:26 nun kernel: [86815.728640] aacraid: Host adapter abort request (0,0,1,0)
Jul 18 22:57:26 nun kernel: [86815.728772] aacraid: Host adapter reset request. SCSI hang ?

Access to the storage thus stalls for ten seconds or so.
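
To get a rough idea of how often these stalls occur, the messages above can be tallied from the kernel log. The following is a minimal sketch, not part of the original report: the function name and the sample list are illustrative, and reading the four numbers in parentheses as (host, channel, id, lun) is an assumption.

```python
import re
from collections import Counter

# Patterns for the two aacraid messages seen in the log excerpt above.
ABORT_RE = re.compile(
    r"aacraid: Host adapter abort request \((\d+),(\d+),(\d+),(\d+)\)"
)
RESET_RE = re.compile(r"aacraid: Host adapter reset request")


def tally_aacraid_events(lines):
    """Count abort requests per 4-number tuple and the total number of resets.

    The tuple is assumed to be (host, channel, id, lun).
    """
    aborts = Counter()
    resets = 0
    for line in lines:
        match = ABORT_RE.search(line)
        if match:
            aborts[tuple(int(g) for g in match.groups())] += 1
        elif RESET_RE.search(line):
            resets += 1
    return aborts, resets


# First burst from the log excerpt above, used as sample input.
SAMPLE = [
    "Jul 18 22:55:24 nun kernel: [86674.467410] aacraid: Host adapter abort request (0,0,2,0)",
    "Jul 18 22:55:24 nun kernel: [86674.467487] aacraid: Host adapter abort request (0,0,3,0)",
    "Jul 18 22:55:24 nun kernel: [86674.467617] aacraid: Host adapter reset request. SCSI hang ?",
]

aborts, resets = tally_aacraid_events(SAMPLE)
print("aborts per target:", dict(aborts))
print("resets:", resets)
```

In practice the same function could be fed the lines of a kernel log file instead of the sample list.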

I have successfully worked around the problem by using "schedtool -a 1 pid-of-basically-everything", so it seems to be an SMP-related problem.

However, one CPU is _somewhat_ slower than four, which is quite noticeable, so we'd like to get this handled somehow :-/

lspci:

05:06.0 SCSI storage controller: Adaptec RAID subsystem HBA (rev 01)
 Subsystem: Dell PowerEdge 2400,2500,2550,4400
 Flags: bus master, 66MHz, medium devsel, latency 32, IRQ 7
 BIST result: 00
 I/O ports at cc00 [size=256]
 Memory at fccff000 (64-bit, non-prefetchable) [size=4K]
 Expansion ROM at fcd00000 [disabled] [size=128K]
 Capabilities: [dc] Power Management version 2

05:06.1 SCSI storage controller: Adaptec RAID subsystem HBA (rev 01)
 Subsystem: Dell PowerEdge 2400,2500,2550,4400
 Flags: bus master, 66MHz, medium devsel, latency 32, IRQ 11
 BIST result: 00
 I/O ports at c800 [size=256]
 Memory at fccfe000 (64-bit, non-prefetchable) [size=4K]
 Expansion ROM at f8100000 [disabled] [size=128K]
 Capabilities: [dc] Power Management version 2

lspci -n:
05:06.0 0100: 9005:00c5 (rev 01)
 Subsystem: 1028:00c5
 Flags: bus master, 66MHz, medium devsel, latency 32, IRQ 7
 BIST result: 00
 I/O ports at cc00 [size=256]
 Memory at fccff000 (64-bit, non-prefetchable) [size=4K]
 Expansion ROM at fcd00000 [disabled] [size=128K]
 Capabilities: [dc] Power Management version 2

05:06.1 0100: 9005:00c5 (rev 01)
 Subsystem: 1028:00c5
 Flags: bus master, 66MHz, medium devsel, latency 32, IRQ 11
 BIST result: 00
 I/O ports at c800 [size=256]
 Memory at fccfe000 (64-bit, non-prefetchable) [size=4K]
 Expansion ROM at f8100000 [disabled] [size=128K]
 Capabilities: [dc] Power Management version 2

Tags: cft-2.6.27
Changed in linux:
status: Unknown → Confirmed
Matthias Urlichs (smurf) wrote :

Update: my uniprocessor band-aid, besides significantly decreasing performance, resulted in an eventual soft hang of all CPUs some hours later, so this workaround obviously doesn't work.

Leann Ogasawara (leannogasawara) wrote :

The Ubuntu Kernel Team is planning to move to the 2.6.27 kernel for the upcoming Intrepid Ibex 8.10 release. As a result, the kernel team would appreciate it if you could test this newer 2.6.27 Ubuntu kernel. There are two ways you should be able to test:

1) If you are comfortable installing packages on your own, the linux-image-2.6.27-* package is currently available for you to install and test.

--or--

2) The upcoming Alpha5 for Intrepid Ibex 8.10 will contain this newer 2.6.27 Ubuntu kernel. Alpha5 is set to be released Thursday Sept 4. Please watch http://www.ubuntu.com/testing for Alpha5 to be announced. You should then be able to test via a LiveCD.

Please let us know immediately if this newer 2.6.27 kernel resolves the bug reported here or if the issue remains. More importantly, please open a new bug report for each new bug/regression introduced by the 2.6.27 kernel and tag the bug report with 'linux-2.6.27'. Also, please specifically note if the issue does or does not appear in the 2.6.26 kernel. Thanks again, we really appreciate your help and feedback.

Matthias Urlichs (smurf) wrote :

The current 2.6.27-pre4 kernel does NOT fix the problem.

It has existed for some time; my conjecture is that due to their greater internal I/O efficiency, newer kernels are far better at sending a lot of requests to the disk controller, thereby overflowing its internal queue.

The Dapper kernel works.
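
As a side note to the queue-overflow conjecture above: one quantity that can be inspected is the per-device SCSI queue depth, which the kernel exposes in sysfs as `/sys/block/<dev>/device/queue_depth`. The sketch below only reads that attribute; the function name is illustrative, and this value is not the constant changed by the patch later in this report, so it neither confirms nor refutes the conjecture.

```python
from pathlib import Path


def read_queue_depths(sys_block="/sys/block"):
    """Return {device_name: queue_depth} for block devices exposing the attribute.

    Devices without a readable device/queue_depth file (e.g. loop devices)
    are skipped.
    """
    depths = {}
    root = Path(sys_block)
    if not root.is_dir():
        # Not a Linux system, or sysfs is not mounted.
        return depths
    for dev in sorted(root.iterdir()):
        attr = dev / "device" / "queue_depth"
        try:
            depths[dev.name] = int(attr.read_text().strip())
        except (OSError, ValueError):
            continue
    return depths


for name, depth in read_queue_depths().items():
    print(f"{name}: queue_depth={depth}")
```

Comparing the reported value between a working and a failing kernel on the same hardware would be one data point, nothing more.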

Matthias Urlichs (smurf) wrote :

NB: I will test the fix/workaround from http://bugzilla.kernel.org/show_bug.cgi?id=11120 today.

Changed in linux:
status: Confirmed → In Progress
Matthias Urlichs (smurf) wrote :

The fix works.

Please apply.

Reduce AACRAID hardware queue size (kernel bug#11120)

Signed-Off-By: Matthias Urlichs <email address hidden>

diff --git a/drivers/scsi/aacraid/aacraid.h b/drivers/scsi/aacraid/aacraid.h
index 73916ad..b1b10b3 100644
--- a/drivers/scsi/aacraid/aacraid.h
+++ b/drivers/scsi/aacraid/aacraid.h
@@ -24,7 +24,7 @@
 #define AAC_MAX_LUN (8)

 #define AAC_MAX_HOSTPHYSMEMPAGES (0xfffff)
-#define AAC_MAX_32BIT_SGBCOUNT ((unsigned short)256)
+#define AAC_MAX_32BIT_SGBCOUNT ((unsigned short)127)

 /*
  * These macros convert from physical channels to virtual channels

Changed in linux:
assignee: nobody → ubuntu-kernel-team
importance: Undecided → Medium
status: New → Triaged
Launchpad Janitor (janitor) wrote : Kernel team bugs

Per a decision made by the Ubuntu Kernel Team, bugs will no longer be assigned to the ubuntu-kernel-team in Launchpad as part of the bug triage process. The ubuntu-kernel-team is being unassigned from this bug report. Refer to https://wiki.ubuntu.com/KernelTeamBugPolicies for more information. Thanks.

Daniel Eckl (daniel-eckl) wrote :

Can anybody tell whether Hardy (being an LTS release) will get this fix, too?

Best,
Daniel

Andy Whitcroft (apw)
Changed in linux (Ubuntu):
assignee: nobody → Andy Whitcroft (apw)
status: Triaged → In Progress
Andy Whitcroft (apw) wrote :

According to the reporter, the patch on the upstream bug seems to fix the issue, and other testers have confirmed this. It seems that it still needs pushing upstream, so I put together an upstream submission from the tested patch and sent it to the driver maintainer. I also proposed it as an SRU for Hardy.

Hal (hal-foobox) wrote :

I am in the same boat. I am in the process of "upgrading" a Gentoo system to Ubuntu LTS, and this is a showstopper. Is there a way to install a patched kernel on a system that won't even boot (but has an 8.04 installation on it)? Is there any version of Ubuntu that will run on this setup? Is there anything that could be passed in the boot process to get into the system?

Hal (hal-foobox) wrote :

To answer my own question, the issue seems to be resolved with a BIOS update. Runs like a top now. At least, the system boots. I haven't run it long enough or under load to see if the issue is 100% gone or not.

Esa Häkkinen (esa+hakkinen) wrote :

Might be related to a firmware bug fixed in January 2009 for PERC 3/Di v2.8.1.7692 (build 7692)

http://bugzilla.kernel.org/show_bug.cgi?id=9133 workaround found

https://bugs.launchpad.net/ubuntu/+source/linux/+bug/149071 firmware update reported to fix

Changed in linux:
status: In Progress → Invalid
Andy Whitcroft (apw)
Changed in linux (Ubuntu):
assignee: Andy Whitcroft (apw) → nobody
Changed in linux (Ubuntu):
status: In Progress → Triaged
Changed in linux:
status: Invalid → Won't Fix
Changed in linux:
importance: Unknown → Medium
penalvch (penalvch) wrote :
Changed in linux (Ubuntu):
status: Triaged → Invalid