Trusty isci module doesn't handle timeouts properly

Bug #1394032 reported by Jason Harley
This bug affects 2 people
Affects          Status         Importance  Assigned to  Milestone
linux (Ubuntu)   Fix Released   Medium      Unassigned
Trusty           Fix Released   Medium      Unassigned

Bug Description

I'm currently running linux 3.13.0-39 on trusty with disks plugged into an Intel C602 SATA/SAS controller. Occasionally, a timeout and/or SAS event (I'm not 100% sure which) isn't handled properly ('Unhandled error code') and the kernel gets a bit upset.

I have 12 different hosts with this controller and disk combination and all display the same behaviour (dmesg output):

[Tue Nov 18 16:56:10 2014] sd 7:0:0:0: [sdg] command ffff8808434fa600 timed out
[Tue Nov 18 16:56:10 2014] sd 7:0:0:0: [sdg] command ffff880843673d00 timed out
[Tue Nov 18 16:56:10 2014] sd 7:0:0:0: [sdg] command ffff88105081bc00 timed out
[Tue Nov 18 16:56:10 2014] sd 7:0:0:0: [sdg] command ffff88084378e100 timed out
[Tue Nov 18 16:56:10 2014] sas: Enter sas_scsi_recover_host busy: 4 failed: 4
[Tue Nov 18 16:56:10 2014] sas: ata7: end_device-7:0: cmd error handler
[Tue Nov 18 16:56:10 2014] sas: ata7: end_device-7:0: dev error handler
[Tue Nov 18 16:56:10 2014] sd 7:0:0:0: [sdg] Unhandled error code
[Tue Nov 18 16:56:10 2014] sd 7:0:0:0: [sdg]
[Tue Nov 18 16:56:10 2014] Result: hostbyte=DID_OK driverbyte=DRIVER_TIMEOUT
[Tue Nov 18 16:56:10 2014] sd 7:0:0:0: [sdg] CDB:
[Tue Nov 18 16:56:10 2014] Write(10): 2a 00 04 9e 77 60 00 00 08 00
[Tue Nov 18 16:56:10 2014] end_request: I/O error, dev sdg, sector 77494112
[Tue Nov 18 16:56:10 2014] EXT4-fs warning (device dm-2): ext4_end_bio:317: I/O error -5 writing to inode 261733 (offset 0 size 0 starting block 5061868)
[Tue Nov 18 16:56:10 2014] Buffer I/O error on device dm-2, logical block 5061868
[Tue Nov 18 16:56:10 2014] sd 7:0:0:0: [sdg] Unhandled error code
[Tue Nov 18 16:56:10 2014] sd 7:0:0:0: [sdg]
[Tue Nov 18 16:56:10 2014] Result: hostbyte=DID_OK driverbyte=DRIVER_TIMEOUT
[Tue Nov 18 16:56:10 2014] sd 7:0:0:0: [sdg] CDB:
[Tue Nov 18 16:56:10 2014] Write(10): 2a 00 04 0f d0 e0 00 00 08 00
[Tue Nov 18 16:56:10 2014] end_request: I/O error, dev sdg, sector 68145376
[Tue Nov 18 16:56:10 2014] EXT4-fs warning (device dm-2): ext4_end_bio:317: I/O error -5 writing to inode 261710 (offset 0 size 0 starting block 3893276)
[Tue Nov 18 16:56:10 2014] Buffer I/O error on device dm-2, logical block 3893276
[Tue Nov 18 16:56:10 2014] sd 7:0:0:0: [sdg] Unhandled error code
[Tue Nov 18 16:56:10 2014] sd 7:0:0:0: [sdg]
[Tue Nov 18 16:56:10 2014] Result: hostbyte=DID_OK driverbyte=DRIVER_TIMEOUT
[Tue Nov 18 16:56:10 2014] sd 7:0:0:0: [sdg] CDB:
[Tue Nov 18 16:56:10 2014] Write(10): 2a 00 02 b8 a1 f8 00 00 08 00
[Tue Nov 18 16:56:10 2014] end_request: I/O error, dev sdg, sector 45654520
[Tue Nov 18 16:56:10 2014] Buffer I/O error on device dm-2, logical block 1081919
[Tue Nov 18 16:56:10 2014] lost page write due to I/O error on dm-2
[Tue Nov 18 16:56:10 2014] sd 7:0:0:0: [sdg] Unhandled error code
[Tue Nov 18 16:56:10 2014] sd 7:0:0:0: [sdg]
[Tue Nov 18 16:56:10 2014] Result: hostbyte=DID_OK driverbyte=DRIVER_TIMEOUT
[Tue Nov 18 16:56:10 2014] sd 7:0:0:0: [sdg] CDB:
[Tue Nov 18 16:56:10 2014] Write(10): 2a 00 02 b8 a1 58 00 00 08 00
[Tue Nov 18 16:56:10 2014] end_request: I/O error, dev sdg, sector 45654360
[Tue Nov 18 16:56:10 2014] Buffer I/O error on device dm-2, logical block 1081899
[Tue Nov 18 16:56:10 2014] lost page write due to I/O error on dm-2
[Tue Nov 18 16:56:10 2014] sas: --- Exit sas_scsi_recover_host: busy: 0 failed: 0 tries: 1
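For anyone matching the CDB dumps above to the failing sectors: a WRITE(10) CDB carries the starting logical block address in bytes 2-5 and the transfer length in bytes 7-8, both big-endian. The following is a minimal Python sketch, not part of the original report; the function name `decode_write10` is hypothetical and the CDB strings are copied from the log above.

```python
def decode_write10(cdb_hex):
    """Decode a WRITE(10) CDB given as space-separated hex bytes.

    Returns (starting LBA, number of blocks).
    """
    cdb = bytes.fromhex(cdb_hex.replace(" ", ""))
    if len(cdb) != 10 or cdb[0] != 0x2A:  # 0x2a is the WRITE(10) opcode
        raise ValueError("not a WRITE(10) CDB")
    lba = int.from_bytes(cdb[2:6], "big")     # bytes 2-5: starting LBA
    blocks = int.from_bytes(cdb[7:9], "big")  # bytes 7-8: transfer length
    return lba, blocks


if __name__ == "__main__":
    # CDBs taken verbatim from the dmesg output in this report.
    for cdb in (
        "2a 00 04 9e 77 60 00 00 08 00",
        "2a 00 04 0f d0 e0 00 00 08 00",
        "2a 00 02 b8 a1 f8 00 00 08 00",
        "2a 00 02 b8 a1 58 00 00 08 00",
    ):
        print(cdb, "->", decode_write10(cdb))
```

The decoded LBAs (77494112, 68145376, 45654520, 45654360) line up with the sector numbers in the end_request lines, and each failed write covers 8 blocks.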
---
AlsaDevices:
 total 0
 crw-rw---- 1 root audio 116, 1 Nov 17 19:42 seq
 crw-rw---- 1 root audio 116, 33 Nov 17 19:42 timer
AplayDevices: Error: [Errno 2] No such file or directory
ApportVersion: 2.14.1-0ubuntu3.5
Architecture: amd64
ArecordDevices: Error: [Errno 2] No such file or directory
AudioDevicesInUse: Error: command ['fuser', '-v', '/dev/snd/seq', '/dev/snd/timer'] failed with exit code 1:
CRDA: Error: [Errno 2] No such file or directory
DistroRelease: Ubuntu 14.04
HibernationDevice: RESUME=/dev/mapper/root-swap
IwConfig: Error: [Errno 2] No such file or directory
Lsusb:
 Bus 002 Device 002: ID 8087:0024 Intel Corp. Integrated Rate Matching Hub
 Bus 002 Device 001: ID 1d6b:0002 Linux Foundation 2.0 root hub
 Bus 001 Device 003: ID 0557:2221 ATEN International Co., Ltd Winbond Hermon
 Bus 001 Device 002: ID 8087:0024 Intel Corp. Integrated Rate Matching Hub
 Bus 001 Device 001: ID 1d6b:0002 Linux Foundation 2.0 root hub
MachineType: Supermicro X9DRT-PT
Package: linux (not installed)
PciMultimedia:

ProcFB:

ProcKernelCmdLine: BOOT_IMAGE=/vmlinuz-3.13.0-39-generic root=/dev/mapper/root-root ro console=tty0 console=ttyS1,115200n8 swapaccount=1 quiet splash vt.handoff=7
ProcVersionSignature: Ubuntu 3.13.0-39.66-generic 3.13.11.8
RelatedPackageVersions:
 linux-restricted-modules-3.13.0-39-generic N/A
 linux-backports-modules-3.13.0-39-generic N/A
 linux-firmware 1.127.8
RfKill: Error: [Errno 2] No such file or directory
Tags: trusty
Uname: Linux 3.13.0-39-generic x86_64
UpgradeStatus: No upgrade log present (probably fresh install)
UserGroups:

WifiSyslog:

_MarkForUpload: True
dmi.bios.date: 05/06/2014
dmi.bios.vendor: American Megatrends Inc.
dmi.bios.version: 3.0b
dmi.board.asset.tag: To be filled by O.E.M.
dmi.board.name: X9DRT-PT
dmi.board.vendor: Supermicro
dmi.board.version: 1.01
dmi.chassis.asset.tag: To Be Filled By O.E.M.
dmi.chassis.type: 17
dmi.chassis.vendor: Supermicro
dmi.chassis.version: 0123456789
dmi.modalias: dmi:bvnAmericanMegatrendsInc.:bvr3.0b:bd05/06/2014:svnSupermicro:pnX9DRT-PT:pvr0123456789:rvnSupermicro:rnX9DRT-PT:rvr1.01:cvnSupermicro:ct17:cvr0123456789:
dmi.product.name: X9DRT-PT
dmi.product.version: 0123456789
dmi.sys.vendor: Supermicro

Revision history for this message
Jason Harley (redmind) wrote :

dmesg output of the driver initializing:

$ dmesg -T | grep isci
[Mon Nov 17 19:42:20 2014] isci: Intel(R) C600 SAS Controller Driver - version 1.1.0
[Mon Nov 17 19:42:20 2014] isci 0000:02:00.0: driver configured for rev: 6 silicon
[Mon Nov 17 19:42:20 2014] isci 0000:02:00.0: OEM parameter table found in OROM
[Mon Nov 17 19:42:20 2014] isci 0000:02:00.0: OEM SAS parameters (version: 1.0) loaded (platform)
[Mon Nov 17 19:42:20 2014] isci 0000:02:00.0: SCU controller 0: phy 3-0 cables: {short, short, short, short}
[Mon Nov 17 19:42:20 2014] scsi7 : isci
[Mon Nov 17 19:42:20 2014] isci 0000:02:00.0: irq 125 for MSI/MSI-X
[Mon Nov 17 19:42:20 2014] isci 0000:02:00.0: irq 126 for MSI/MSI-X

description: updated
Jason Harley (redmind) wrote :

I have seen mentions of isci and libsas getting timeout fixes, and I know that 3.18 is using newer libsas and isci code than we have in 3.13.

Jason Harley (redmind)
summary: - isci 1.1.0 doesn't handle timeouts properly
+ Trusty isci module doesn't handle timeouts properly
Brad Figg (brad-figg) wrote : Missing required logs.

This bug is missing log files that will aid in diagnosing the problem. From a terminal window please run:

apport-collect 1394032

and then change the status of the bug to 'Confirmed'.

If, due to the nature of the issue you have encountered, you are unable to run this command, please add a comment stating that fact and change the bug status to 'Confirmed'.

This change has been made by an automated script, maintained by the Ubuntu Kernel Team.

Changed in linux (Ubuntu):
status: New → Incomplete
tags: added: trusty
Jason Harley (redmind) wrote : BootDmesg.txt

apport information

tags: added: apport-collected
description: updated
Jason Harley (redmind) wrote : CurrentDmesg.txt

apport information

Jason Harley (redmind) wrote : Lspci.txt

apport information

Jason Harley (redmind) wrote : ProcCpuinfo.txt

apport information

Jason Harley (redmind) wrote : ProcEnviron.txt

apport information

Jason Harley (redmind) wrote : ProcInterrupts.txt

apport information

Jason Harley (redmind) wrote : ProcModules.txt

apport information

Jason Harley (redmind) wrote : UdevDb.txt

apport information

Jason Harley (redmind) wrote : UdevLog.txt

apport information

Jason Harley (redmind) wrote :

I had already attached apport output, but in the event that this was insufficient I ran "apport-collect" as asked above.

Changed in linux (Ubuntu):
status: Incomplete → Confirmed
Joseph Salisbury (jsalisbury) wrote :

Would it be possible for you to test the latest upstream kernel? Refer to https://wiki.ubuntu.com/KernelMainlineBuilds . Please test the latest v3.18 kernel[0].

If this bug is fixed in the mainline kernel, please add the following tag 'kernel-fixed-upstream'.

If the mainline kernel does not fix this bug, please add the tag: 'kernel-bug-exists-upstream'.

If you are unable to test the mainline kernel, for example it will not boot, please add the tag: 'kernel-unable-to-test-upstream'.
Once testing of the upstream kernel is complete, please mark this bug as "Confirmed".

Thanks in advance.

[0] http://kernel.ubuntu.com/~kernel-ppa/mainline/v3.18-rc5-vivid/

Changed in linux (Ubuntu):
importance: Undecided → Medium
status: Confirmed → Incomplete
Jason Harley (redmind) wrote :

I've installed the 3.18.0-031800rc5.201411162035 kernel packages and am running load on the box now to see if I can get it to hiccough. Will report back shortly.

tags: added: kernel-da-key
Jason Harley (redmind)
tags: added: kernel-fixed-upstream
Jason Harley (redmind) wrote :

After a night of stress testing with the 3.18-rc5 packages I have not been able to reproduce an unhandled SAS/SCSI event (the new kernel did intensely dislike NTP and AppArmor, but that's another issue altogether).

There were only two 'sas_scsi_recover_host' events early in the boot process: one while initializing the controller and one after what looks like a PCI bus scan (I've attached the full dmesg output).

So far it seems that the new libsas and isci driver/firmware is much better at handling the Intel C602 in my chassis. Is it possible to backport libsas and isci from mainline to trusty?
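When comparing kernels across several hosts, tallying the recovery events by script is easier than reading dmesg by eye. The sketch below is only an illustration under stated assumptions: the function name `summarize` is hypothetical, and the patterns are taken from the log lines quoted in the bug description ("Enter sas_scsi_recover_host busy: N failed: N" and "Unhandled error code"); other kernel versions may word these messages differently.

```python
import re

# Pattern copied from the libsas messages quoted in this report.
ENTER_RE = re.compile(
    r"sas: Enter sas_scsi_recover_host busy: (\d+) failed: (\d+)"
)


def summarize(lines):
    """Return ([(busy, failed), ...], unhandled_count) for dmesg lines."""
    recoveries = []
    unhandled = 0
    for line in lines:
        match = ENTER_RE.search(line)
        if match:
            recoveries.append((int(match.group(1)), int(match.group(2))))
        elif "Unhandled error code" in line:
            unhandled += 1
    return recoveries, unhandled


if __name__ == "__main__":
    # Sample lines copied from the bug description.
    sample = [
        "[Tue Nov 18 16:56:10 2014] sas: Enter sas_scsi_recover_host busy: 4 failed: 4",
        "[Tue Nov 18 16:56:10 2014] sd 7:0:0:0: [sdg] Unhandled error code",
        "[Tue Nov 18 16:56:10 2014] sas: --- Exit sas_scsi_recover_host: busy: 0 failed: 0 tries: 1",
    ]
    recoveries, unhandled = summarize(sample)
    print("recovery passes:", recoveries)
    print("unhandled errors:", unhandled)
```

A recovery pass with no accompanying "Unhandled error code" lines corresponds to the benign boot-time events described above; a non-zero unhandled count corresponds to the failure originally reported.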

Changed in linux (Ubuntu):
status: Incomplete → Confirmed
Joseph Salisbury (jsalisbury) wrote :

Would it be possible to also test the latest upstream 3.13 kernel to see if the fix already made it into stable updates? If it did not, we can perform a reverse bisect to see the exact commit that fixes this issue.

The 3.13 upstream kernel can be downloaded from:
http://kernel.ubuntu.com/~kernel-ppa/mainline/v3.13.11.11-trusty/

Changed in linux (Ubuntu Trusty):
status: New → Confirmed
importance: Undecided → Medium
Jason Harley (redmind) wrote :

I now have 3.13.11-03131111-generic running on four machines (all identical hardware) and am stress testing on all nodes. The boot behaviour is still similar to the 3.18-rc5 kernel, though the isci driver still identifies itself as version 1.1 (and not 1.2).

I will continue to keep pressure on these four machines to see if this is solved in 3.13.11, but things are looking promising for the moment.

Jason Harley (redmind) wrote :

It seems that this 3.13.11 build is affected by this bug: https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1346917

Windows guests running on this kernel are very unstable and lose network connectivity.

Joseph Salisbury (jsalisbury) wrote :

The 3.13.11.11 updates are now in the 3.13.0-41.70 kernel. Can you apply the latest updates and see if the 3.13.0-41.70 kernel resolves this bug?

Jason Harley (redmind) wrote :

I've installed 3.13.0-41.70 on four machines for testing. Will report back shortly.

# dpkg -l | grep 3.13.0-41
ii linux-headers-3.13.0-41 3.13.0-41.70 all Header files related to Linux kernel version 3.13.0
ii linux-headers-3.13.0-41-generic 3.13.0-41.70 amd64 Linux kernel headers for version 3.13.0 on 64 bit x86 SMP
ii linux-image-3.13.0-41-generic 3.13.0-41.70 amd64 Linux kernel image for version 3.13.0 on 64 bit x86 SMP
ii linux-image-extra-3.13.0-41-generic 3.13.0-41.70 amd64 Linux kernel extra modules for version 3.13.0 on 64 bit x86 SMP

Jason Harley (redmind) wrote :

I've confirmed that 3.13.0-41.70 isn't affected by bug #1346917 (KVM NUMA stability), so this is an improvement over 3.13.11.11.

I will continue to test 3.13.0-41.70 for isci stability.

Jason Harley (redmind) wrote :

I assume yesterday's released 3.13.0-43.72 build contains the mainline updates for the 3.13 series?

Joseph Salisbury (jsalisbury) wrote :

Yes, 3.13.0-43.72 now has the upstream updates up to kernel 3.13.11.11.

The latest upstream 3.13 kernel is now available at:
http://kernel.ubuntu.com/~kernel-ppa/mainline/v3.13.11-ckt12-trusty/

Does the 3.13.0-43.72 kernel still exhibit this bug?

Jason Harley (redmind) wrote :

3.13.0-43.72 does not exhibit this bug in testing, so I think we can close this issue.
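For anyone checking a fleet against this result, a version string such as the ProcVersionSignature value in the description ("Ubuntu 3.13.0-39.66-generic 3.13.11.8") can be compared against 3.13.0-43.72 with a short script. This is a rough sketch: the names `parse_abi`, `at_least`, and `TESTED_GOOD` are hypothetical, and treating 3.13.0-43.72 as the minimum good version is an assumption, since this thread only establishes that this build tested clean.

```python
import re

# Matches the Ubuntu kernel package version, e.g. "3.13.0-43.72".
VERSION_RE = re.compile(r"(\d+)\.(\d+)\.(\d+)-(\d+)\.(\d+)")

# Assumption: the build reported as not exhibiting the bug in this thread.
TESTED_GOOD = (3, 13, 0, 43, 72)


def parse_abi(text):
    """Extract (major, minor, patch, abi, upload) from a version string."""
    match = VERSION_RE.search(text)
    if match is None:
        raise ValueError("no Ubuntu kernel version found in %r" % text)
    return tuple(int(part) for part in match.groups())


def at_least(text, minimum=TESTED_GOOD):
    """True if the version found in text is >= minimum (tuple comparison)."""
    return parse_abi(text) >= minimum


if __name__ == "__main__":
    for version in (
        "Ubuntu 3.13.0-39.66-generic 3.13.11.8",  # from the bug description
        "3.13.0-41.70",
        "3.13.0-43.72",
    ):
        print(version, "->", at_least(version))
```

On a running Ubuntu system the same string can typically be read from /proc/version_signature, which is where apport takes ProcVersionSignature from.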

Changed in linux (Ubuntu Trusty):
status: Confirmed → Fix Released
Changed in linux (Ubuntu):
status: Confirmed → Fix Released