Comment 10 for bug 343749

In , Luc (luc-redhat-bugs) wrote :

Description of problem:
When doing "intensive" I/O, the mpt* drivers crash the filesystem on Fedora 12.

The problem is on an IBM x3580 M2 machine, using the integrated LSI SAS1078 C1 PCI-express Fusion-MPT SAS.

Steps to Reproduce:
1. Create a big allocated space (20GB for example)
2. dd if=/dev/vg/mybigspace of=/dev/null
3. After a few minutes, filesystem access becomes impossible. dmesg shows the following:

Calgary: DMA error on CalIOC2 PHB 0x3
Calgary: 0x80000000@CSR 0x00000000@PLSSR 0xb0008000@CSMR 0x00000000@MCK
Calgary: 0x00000000@0x810 0x00000000@0x820 0x00000000@0x830 0x00000000@0x840 0x00000000@0x850 0x00000000@0x860 0x00000000@0x870
Calgary: 0x40000000@0xcb0
irq 46: nobody cared (try booting with the "irqpoll" option)
Pid: 0, comm: swapper Not tainted 2.6.31.5-127.fc12.x86_64 #1
Call Trace:
 <IRQ> [<ffffffff8109aefc>] __report_bad_irq+0x3d/0x8c
 [<ffffffff8109b063>] note_interrupt+0x118/0x17d
 [<ffffffff8109b6f2>] handle_fasteoi_irq+0xa1/0xc6
 [<ffffffff8101463c>] handle_irq+0x8b/0x93
 [<ffffffff8141e9cc>] do_IRQ+0x5c/0xbc
 [<ffffffff810126d3>] ret_from_intr+0x0/0x11
 <EOI> [<ffffffff8101907f>] ? mwait_idle+0x91/0xae
 [<ffffffff8101907f>] ? mwait_idle+0x91/0xae
 [<ffffffff81019021>] ? mwait_idle+0x33/0xae
 [<ffffffff8141d079>] ? atomic_notifier_call_chain+0x13/0x15
 [<ffffffff81010bb8>] ? enter_idle+0x25/0x27
 [<ffffffff81010c60>] ? cpu_idle+0xa6/0xe9
 [<ffffffff814145be>] ? start_secondary+0x1f3/0x234
handlers:
[<ffffffffa00e3d7e>] (mpt_interrupt+0x0/0x8bb [mptbase])
Disabling IRQ #46
mptscsih: ioc0: attempting task abort! (sc=ffff880a0d8fa400)
sd 2:1:4:0: [sda] CDB: Write(10): 2a 00 00 fb 5b a7 00 00 60 00
mptscsih: ioc0: WARNING - TaskMgmt type=1: IOC Not operational (0xffffffff)!
mptscsih: ioc0: WARNING - Issuing HardReset from mptscsih_IssueTaskMgmt!!
mptbase: ioc0: Initiating recovery
mptbase: ioc0: WARNING - Unexpected doorbell active!
mptbase: ioc0: WARNING - NOT READY WARNING!
mptbase: WARNING - (-1) Cannot recover ioc0
mptscsih: ioc0: WARNING - TaskMgmt HardReset FAILED!!
mptscsih: ioc0: task abort: FAILED (sc=ffff880a0d8fa400)
mptscsih: ioc0: attempting task abort! (sc=ffff880a04bb3100)
sd 2:1:4:0: [sda] CDB: Write(10): 2a 00 00 28 f5 ff 00 00 08 00
mptscsih: ioc0: WARNING - TaskMgmt type=1: IOC Not operational (0xffffffff)!
mptscsih: ioc0: WARNING - Issuing HardReset from mptscsih_IssueTaskMgmt!!
mptbase: ioc0: Initiating recovery
mptbase: ioc0: WARNING - Unexpected doorbell active!
mptbase: ioc0: WARNING - NOT READY WARNING!
mptbase: WARNING - (-1) Cannot recover ioc0
mptscsih: ioc0: WARNING - TaskMgmt HardReset FAILED!!
mptscsih: ioc0: task abort: FAILED (sc=ffff880a04bb3100)
mptscsih: ioc0: attempting task abort! (sc=ffff880a04bb2600)
sd 2:1:4:0: [sda] CDB: Write(10): 2a 00 00 61 de 27 00 00 08 00
mptscsih: ioc0: WARNING - TaskMgmt type=1: IOC Not operational (0xffffffff)!
mptscsih: ioc0: WARNING - Issuing HardReset from mptscsih_IssueTaskMgmt!!
mptbase: ioc0: Initiating recovery
mptbase: ioc0: WARNING - Unexpected doorbell active!
[root@flanders ubuntu]#
Message from syslogd@mymachine at Nov 26 10:38:28 ...
 kernel:mpage_da_map_blocks block allocation failed for inode 11762 at logical offset 2 with max blocks 1 with error -30

Message from syslogd@mymachine at Nov 26 10:38:28 ...
 kernel:This should not happen.!! Data will be lost

The first error message ("Calgary: DMA error on CalIOC2 PHB 0x3") seems to be related to a bug in the Calgary code, as detailed in a thread in LKML:
"The calgary code can give drivers addresses above 4GB which is very bad for hardware that is only 32bit DMA addressable" (http://<email address hidden>/2008-06/05248/Re:_%5BPATCH_-mm%5D_x86_calgary:_fix_handling_of_devces_that_aren%27t_behind_the_Calgary ).
But that thread is from 2008, so I thought this would have been fixed by now...

After looking at another bug report (https://bugs.launchpad.net/ubuntu/+source/linux/+bug/343749 ), the temporary workaround seems to be to set iommu=soft at boot. But I guess this affects performance... According to that bug report, the bug seems to be fixed on RH 9 ?!
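
For anyone trying the iommu=soft workaround, a quick way to confirm that the option actually reached the running kernel is to look at /proc/cmdline. The snippet below is only a minimal sketch: the helper name has_iommu_soft is made up for illustration, and it assumes the option was appended to the kernel line of the boot loader configuration before rebooting.

```shell
#!/bin/sh
# Hypothetical helper (not part of any package): succeeds if the given
# kernel command line contains iommu=soft as a separate word.
has_iommu_soft() {
    case " $1 " in
        *" iommu=soft "*) return 0 ;;
        *) return 1 ;;
    esac
}

# On a running system the active kernel command line is in /proc/cmdline.
if [ -r /proc/cmdline ] && has_iommu_soft "$(cat /proc/cmdline)"; then
    echo "iommu=soft is active"
else
    echo "iommu=soft is not set"
fi
```

If it reports that the option is not set, the parameter was probably added to a different boot entry than the one that was booted.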

The bug exists in Fedora 12, and makes it unusable on an x3580 M2.