MSI-X support in qla2xxx causes reproducible hangs under moderate I/O load

Bug #268242 reported by John Morrissey
Affects: linux (Ubuntu)
Status: Won't Fix
Importance: Medium
Assigned to: Jim Lieb

Bug Description

MSI-X support in the qla2xxx driver causes reproducible hangs under moderate I/O load on our BladeCenter HS21 blades with QLogic ISP2422 fibre channel adapters.

    product: IBM eServer BladeCenter HS21 -[8853GLU]-
    08:01.0 Fibre Channel: QLogic Corp. ISP2422-based 4Gb Fibre Channel to PCI-X HBA (rev 02)

We first experienced this behavior while running mkfs(8) to create a large (5TB) ext3 filesystem on an otherwise unloaded system. The system apparently stops handling interrupts and locks up completely, generally before mkfs reaches 50% completion (the exact percentage varies, but it happens every time). Disabling MSI on the kernel command line with 'pci=nomsi' stops the lockups, even under much heavier I/O load.
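For anyone wanting to confirm that the workaround is actually in effect on a running machine, here is a minimal Python sketch that looks for 'nomsi' among the pci= options on the kernel command line. The function name is made up for illustration; the only thing it relies on is that /proc/cmdline holds the boot command line.

```python
from pathlib import Path


def msi_disabled_on_cmdline(cmdline: str) -> bool:
    """Return True if any pci= option list on the command line contains 'nomsi'."""
    for token in cmdline.split():
        if token.startswith("pci="):
            # pci= takes a comma-separated option list, e.g. pci=noacpi,nomsi
            options = token[len("pci="):].split(",")
            if "nomsi" in options:
                return True
    return False


if __name__ == "__main__":
    try:
        current = Path("/proc/cmdline").read_text()
    except OSError:
        current = ""  # not on Linux, or /proc is not mounted
    print("pci=nomsi active:", msi_disabled_on_cmdline(current))
```

This only tells you what was passed at boot; it does not prove which interrupt mode the qla2xxx driver ended up using.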

FWIW, Red Hat has disabled MSI-X in (at least) RHEL5's kernel. Unfortunately, the Red Hat bug cited in their kernel patch (252410) is not publicly accessible.

From kernel-2.6.18-92.1.10.el5.src.rpm's linux-2.6-scsi-qla2xxx-disable-msi-x-by-default.patch:

From: Marcus Barrow <email address hidden>
Subject: [Bug 252410][QLOGIC 5.1 bug] qla2xxx MSI-X hardware issues on some platforms
Date: Wed, 15 Aug 2007 20:22:09 -0400
Bugzilla: 252410
Message-Id: <email address hidden>
Changelog: [scsi] qla2xxx: disable MSI-X by default

Testing and upstream have found problems handling MSI-X by some
chipsets. These include major servers. Enabling MSI-X support
has caused a major regression on some machines.

This attached patch disables the MSI-X feature by default, but
allows enabling with a module parameter "ql2x_enable_msix".

It basically amounts to one line, plus 5 lines to declare
the module parameter.
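On a kernel carrying this patch, one way to see the current setting might be to read the parameter back through sysfs. The sketch below is a guess at how that could look in Python: the parameter name "ql2x_enable_msix" comes from the patch header above, but the path /sys/module/qla2xxx/parameters/ql2x_enable_msix is an assumption, and the file only exists if the driver declares the parameter with readable permissions. The helper name is hypothetical.

```python
from pathlib import Path
from typing import Optional


def read_module_param(module: str, param: str,
                      sysfs_root: str = "/sys/module") -> Optional[str]:
    """Return a module parameter's value from sysfs, or None if it is not exposed."""
    path = Path(sysfs_root) / module / "parameters" / param
    try:
        return path.read_text().strip()
    except OSError:
        # Module not loaded, parameter not exported, or not running on Linux.
        return None


if __name__ == "__main__":
    # Assumed module and parameter names; see the note above.
    value = read_module_param("qla2xxx", "ql2x_enable_msix")
    print("ql2x_enable_msix:", value if value is not None else "not available")
```

A result of None does not mean MSI-X is off, only that the value could not be read this way.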

Leann Ogasawara (leannogasawara) wrote :

Hi John,

I can't seem to see the patch you said should be attached. Care to try attaching it to this bug report again? Thanks.

Changed in linux:
status: New → Incomplete
John Morrissey (jwm) wrote :

Sorry Leann, I should have been clearer. I was just citing the header on the RHEL5 patch, which I pasted. Here's a copy of their patch, from the latest RHEL5 kernel source RPM.

John Morrissey (jwm) wrote :

Sigh, I should read the patch before I attach it; correct one coming in a minute...

John Morrissey (jwm) wrote :

Correct patch (linux-2.6-scsi-qla2xxx-disable-msi-x-by-default.patch).

Changed in linux:
assignee: nobody → ubuntu-kernel-team
importance: Undecided → Medium
status: Incomplete → Triaged
Jim Lieb (lieb) wrote :

What version of kernel and Ubuntu are you running? These RH patches are
for an older kernel and your system info will help us narrow down a more
appropriate fix. Anything appropriate from kern.log would help too.
Thanks.

John Morrissey (jwm) wrote :

This was with Ubuntu 8.04 and its stock -server kernel (2.6.24-19-server). As far as we remember, there wasn't anything syslogged. The problem appeared to present as the machines ceasing to service interrupts. Network activity ground to a halt, Ethernet and FC link dropped, all disk I/O blocked.

These machines have all of their local filesystems mounted via fibre channel, so there wasn't a chance at that point for them to log much of anything to their syslogs. As far as we remember, there wasn't anything useful in the kernel syslogs and the only log output on the console was the qla driver complaining about the fc link dropping and associated noise.

Let me know if there's anything more we can provide.

Jim Lieb (lieb)
Changed in linux:
assignee: ubuntu-kernel-team → lieb
status: Triaged → In Progress
Jim Lieb (lieb) wrote :

There have been a number of updates to the qla driver since the 2.6.24 kernel
from Hardy, some of them addressing both msi issues on 24xx parts and similar
issues to yours. I suggest that you try a 2.6.27 kernel that was built for Hardy. You
can find the packages at:

   http://ppa.launchpad.net/kernel-ppa/ubuntu

Note that this would *not* be a supported kernel. It is only to verify the issue.
Depending on the results of your tests, we would then proceed to working on a
supported solution.

From Documentation/kernel-parameters.txt, you might also try:

    pci=option[,option...]  [PCI] various PCI subsystem options:
        nomsi   [MSI] If the PCI_MSI kernel config parameter is
                enabled, this kernel boot option can be used to
                disable the use of MSI interrupts system-wide.

if you haven't already with your current kernel.
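To check which interrupt type the driver actually ended up with after booting with or without that option, /proc/interrupts can be inspected. The following Python sketch pulls out the lines that mention qla2xxx and strips the per-CPU counters, leaving the interrupt type and handler name. The function names are hypothetical, and the exact type strings (for example "PCI-MSI-edge" versus "IO-APIC-fasteoi") vary between kernel versions, so the "MSI" substring test is only a rough heuristic.

```python
from typing import List, Tuple


def driver_interrupt_lines(interrupts_text: str,
                           driver: str = "qla2xxx") -> List[Tuple[str, str]]:
    """Return (irq, description) pairs for /proc/interrupts lines naming the driver."""
    results = []
    for line in interrupts_text.splitlines():
        if driver not in line or ":" not in line:
            continue
        irq, rest = line.split(":", 1)
        fields = rest.split()
        # The leading fields are per-CPU counters, which are purely numeric.
        idx = 0
        while idx < len(fields) and fields[idx].isdigit():
            idx += 1
        results.append((irq.strip(), " ".join(fields[idx:])))
    return results


def looks_like_msi(description: str) -> bool:
    """Heuristic: MSI and MSI-X vectors usually carry 'MSI' in the type column."""
    return "MSI" in description


if __name__ == "__main__":
    try:
        with open("/proc/interrupts") as fh:
            text = fh.read()
    except OSError:
        text = ""  # not on Linux, or /proc is not mounted
    for irq, desc in driver_interrupt_lines(text):
        print(irq, desc, "(MSI)" if looks_like_msi(desc) else "(not MSI)")
```

With pci=nomsi in effect, the expectation is that no qla2xxx line is flagged as MSI.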

John Morrissey (jwm) wrote :

Right, we already happened upon pci=nomsi and have been using that since we first experienced the lockups.

Unfortunately, nearly all the nodes we experienced this with were put into production (with pci=nomsi) and we don't have 5TB of storage to spare any more. I was able to find one node with 3TB of storage and spent some time yesterday trying to reproduce this with the kernel we were running at the time (linux-image-2.6.24-19-server-2.6.24-19.36), but couldn't reproduce the lockups.

I'm not sure what to say; we definitely experienced the lockups and pci=nomsi definitely put a stop to them, so I'm puzzled that I wasn't able to reproduce them in a couple of hours of trying. Close this if you need to, but I just don't have the time now to pursue it further.

Jim Lieb (lieb) wrote :

Thanks for your feedback. We will close the bug, but if you encounter
further lockups, try the new kernel referenced in my previous comment
and file a new bug with that data.

Changed in linux:
status: In Progress → Won't Fix