arcmsr times out with ARC1882 RAID card

Bug #1559609 reported by Martin Koniczek on 2016-03-20
This bug affects 1 person
Affects          Importance   Assigned to
linux (Ubuntu)   High         Tim Gardner
Xenial           High         Tim Gardner

Bug Description

I tested the latest Xenial ISO on a file server featuring an ARC-1882ix-24 RAID controller and got weird timeout issues, followed by complete loss of access to anything connected to the RAID controller. The timeouts occur after a random amount of uptime (sometimes minutes, sometimes days), for example:

kernel: [1665409.969229] arcmsr2: abort device command of scsi id = 0 lun = 1
kernel: [1665411.727535] arcmsr2: scsi id = 0 lun = 1 ccb = '0xffff884fe008e780' poll command abort successfully
kernel: [1665411.727885] arcmsr2: abort device command of scsi id = 0 lun = 1
kernel: [1665411.727898] arcmsr2: abort device command of scsi id = 0 lun = 1
kernel: [1665413.138235] arcmsr2: scsi id = 0 lun = 1 ccb = '0xffff884fe0012300' poll command abort successfully
...
kernel: [1665445.804546] arcmsr: executing bus reset eh.....num_resets = 2, num_aborts = 146
kernel: [1665455.851353] arcmsr2: pCCB ='0xffff884fe002a700' isr got aborted command
kernel: [1665455.851366] arcmsr2: pCCB ='0xffff884fe01c0a00' isr got aborted command
kernel: [1665455.851373] arcmsr2: isr get an illegal ccb command #011#011#011#011done acb = '0xffff884fe0b8c798'ccb = '0xffff884fe00e9680' ccbacb = '0xffff884fe0b8c798' startdone = 0x0 ccboutstandingcount = -1
kernel: [1665455.851378] arcmsr2: isr get an illegal ccb command #011#011#011#011done acb = '0xffff884fe0b8c798'ccb = '0xffff884fe0070280' ccbacb = '0xffff884fe0b8c798' startdone = 0x0 ccboutstandingcount = -1
...
kernel: [1665455.852655] sd 2:0:0:3: [sdd] Medium access timeout failure. Offlining disk!
kernel: [1665455.890032] sd 2:0:0:4: [sde] Medium access timeout failure. Offlining disk!
kernel: [1665455.926613] sd 2:0:0:1: [sdb] Medium access timeout failure. Offlining disk!
kernel: [1665455.963288] sd 2:0:0:2: [sdc] Medium access timeout failure. Offlining disk!
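
For anyone trying to spot the same failure pattern in their own logs, here is a minimal sketch that counts arcmsr abort requests and lists the disks that were offlined. The regular expressions are modelled only on the log lines quoted above, so other driver versions may word these messages differently; the function name `summarize` and the sample list are made up for illustration.

```python
import re
from collections import Counter

# Patterns modelled on the log excerpt above (assumption: the wording is
# stable for this driver version; it may differ in other versions).
ABORT_RE = re.compile(
    r"(arcmsr\d+): abort device command of scsi id = (\d+) lun = (\d+)"
)
OFFLINE_RE = re.compile(
    r"sd (\d+:\d+:\d+:\d+): \[(\w+)\] Medium access timeout failure\. Offlining disk!"
)


def summarize(lines):
    """Count abort requests per (host, scsi id, lun) and collect offlined disks."""
    aborts = Counter()
    offlined = []
    for line in lines:
        match = ABORT_RE.search(line)
        if match:
            key = (match.group(1), int(match.group(2)), int(match.group(3)))
            aborts[key] += 1
            continue
        match = OFFLINE_RE.search(line)
        if match:
            offlined.append(match.group(2))
    return aborts, offlined


# A few lines copied from the excerpt above.
sample = [
    "kernel: [1665409.969229] arcmsr2: abort device command of scsi id = 0 lun = 1",
    "kernel: [1665411.727885] arcmsr2: abort device command of scsi id = 0 lun = 1",
    "kernel: [1665455.852655] sd 2:0:0:3: [sdd] Medium access timeout failure. Offlining disk!",
    "kernel: [1665455.926613] sd 2:0:0:1: [sdb] Medium access timeout failure. Offlining disk!",
]

aborts, offlined = summarize(sample)
print(dict(aborts))
print(offlined)
```

In practice the lines would come from the remote syslog file rather than a hard-coded list.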

Some digging revealed that mainline 4.4 as well as Xenial's 4.4.0-14-generic still ship an old, buggy arcmsr driver, v1.30.00.04-20140919, which claims to support the 1882 but does not really.
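
A rough sketch for checking which arcmsr version a machine is running and whether it is at least the fixed one. It assumes the driver version string is exposed at `/sys/module/arcmsr/version` while the module is loaded, and that versions follow the `vA.B.C.D-YYYYMMDD` pattern seen in this report; the helper names are hypothetical.

```python
import re
from pathlib import Path

# Version pattern as it appears in this report, e.g. v1.30.00.22-20151126.
VERSION_RE = re.compile(r"v(\d+)\.(\d+)\.(\d+)\.(\d+)-(\d{8})")

# Version that went into mainline 4.5 according to this report.
FIXED_VERSION = "v1.30.00.22-20151126"


def parse_version(text):
    """Turn 'v1.30.00.04-20140919' into a comparable tuple of integers."""
    match = VERSION_RE.search(text)
    if match is None:
        raise ValueError(f"unrecognised arcmsr version string: {text!r}")
    return tuple(int(group) for group in match.groups())


def has_fix(text):
    """True if the given version is at least the fixed one."""
    return parse_version(text) >= parse_version(FIXED_VERSION)


def loaded_version(path="/sys/module/arcmsr/version"):
    """Read the loaded module's version, or None if the file is absent.

    Assumption: the path exists only while arcmsr is loaded.
    """
    version_file = Path(path)
    if not version_file.is_file():
        return None
    return version_file.read_text().strip()


if __name__ == "__main__":
    current = loaded_version()
    if current is None:
        print("arcmsr does not appear to be loaded")
    else:
        print(current, "has fix" if has_fix(current) else "is older than the fix")
```

A vendor string with a placeholder digit, such as the `v1.30.0X.21-20151016` mentioned below, does not match the pattern and raises `ValueError`.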

Areca seems to have managed to get a fixed driver into mainline 4.5 (version v1.30.00.22-20151126). It looks like a small patch to arcmsr.h and a large one to arcmsr_hba.c, and at first glance I didn't see anything 4.5-specific in the code:

https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/diff/drivers/scsi/arcmsr/arcmsr.h?id=v4.5&id2=v4.4
https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/diff/drivers/scsi/arcmsr/arcmsr_hba.c?id=v4.5&id2=v4.4

Note that we are using v1.30.0X.21-20151016 (as provided by Areca.com.tw) on production 14.04.4 LTS servers featuring ARC1882 controllers, so chances are good that version 22 (as included in 4.5 mainline) will work well.

This would not only allow ARC-188x controllers to work properly with Xenial out of the box, it should also add support for the (somewhat popular?) ARC-1203 series.

===
Kernel-Description: update arcmsr to version v1.30.00.22-20151126 to fix card timeouts

This bug is missing log files that will aid in diagnosing the problem. From a terminal window please run:

apport-collect 1559609

and then change the status of the bug to 'Confirmed'.

If, due to the nature of the issue you have encountered, you are unable to run this command, please add a comment stating that fact and change the bug status to 'Confirmed'.

This change has been made by an automated script, maintained by the Ubuntu Kernel Team.

Changed in linux (Ubuntu):
status: New → Incomplete

I can't run apport-collect on the affected system because the filesystem got overwritten. The log lines posted in the original bug report are from a remote rsyslog server that captured the event. The timeout bug itself is not new; it is a longstanding issue with that particular version of the Areca driver and the 1882 controller.

Changed in linux (Ubuntu):
status: Incomplete → Confirmed
Andy Whitcroft (apw) on 2016-03-20
Changed in linux (Ubuntu):
importance: Undecided → High
assignee: nobody → Andy Whitcroft (apw)
milestone: none → ubuntu-16.03
description: updated
Tim Gardner (timg-tpi) wrote :

arcmsr: change driver version to v1.30.00.22-20151126
arcmsr: Split dma resource allocation to a new function
arcmsr: more readability improvements
arcmsr: changes driver version number
arcmsr: adds code to support new Areca adapter ARC1203
arcmsr: make code more readable
arcmsr: fixes not release allocated resource
arcmsr: fixed getting wrong configuration data

Changed in linux (Ubuntu Xenial):
assignee: Andy Whitcroft (apw) → Tim Gardner (timg-tpi)
status: Confirmed → Fix Committed
Tim Gardner (timg-tpi) wrote :

These patches will be in Ubuntu-4.4.0-16.32
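
To check whether a given Xenial kernel already carries these patches, one could compare its package version against 4.4.0-16.32. A minimal sketch follows; it assumes the full version (including the upload number after the dot) is available, for example from `/proc/version_signature` on Ubuntu kernels, and that the comparison is only meaningful within the Xenial 4.4.0 series. The function names and the older example version are illustrative.

```python
import re

# First Xenial kernel package containing the arcmsr update, per the comment above.
FIXED_IN = (4, 4, 0, 16, 32)

# Matches the "4.4.0-16.32" part of strings such as "Ubuntu-4.4.0-16.32".
PACKAGE_RE = re.compile(r"(\d+)\.(\d+)\.(\d+)-(\d+)\.(\d+)")


def package_version(text):
    """Extract (major, minor, patch, abi, upload) from a version string."""
    match = PACKAGE_RE.search(text)
    if match is None:
        raise ValueError(f"no kernel package version found in {text!r}")
    return tuple(int(group) for group in match.groups())


def contains_fix(text):
    """True if the version is 4.4.0-16.32 or later (Xenial 4.4.0 series only)."""
    return package_version(text) >= FIXED_IN


if __name__ == "__main__":
    # The second string is an illustrative older version, not taken from this report.
    for candidate in ("Ubuntu-4.4.0-16.32", "4.4.0-14.30"):
        print(candidate, contains_fix(candidate))
```

Note that `uname -r` style strings such as `4.4.0-14-generic` lack the upload number and will not match this pattern.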

Launchpad Janitor (janitor) wrote :

This bug was fixed in the package linux - 4.4.0-16.32

---------------
linux (4.4.0-16.32) xenial; urgency=low

  [ Tim Gardner ]

  * Release Tracking Bug
    - LP: #1561727

  * fix thermal throttling due to commit "Thermal: initialize thermal zone
    device correctly" (LP: #1561676)
    - Thermal: Ignore invalid trip points

  * Thinkpad T460: Trackpoint mouse buttons instantly generate "release" event
    on press (LP: #1553811)
    - SAUCE: (noup) Input: synaptics - handle spurious release of trackstick
      buttons, again

  * reading /sys/kernel/security/apparmor/profiles requires CAP_MAC_ADMIN
    (LP: #1560583)
    - SAUCE: apparmor: Allow ns_root processes to open profiles file
    - SAUCE: apparmor: Consult sysctl when reading profiles in a user ns

  * linux: sync virtualbox drivers to 5.0.16-dfsg-2 (LP: #1561492)
    - ubuntu: vbox -- update to 5.0.16-dfsg-2

  * s390/kconfig: CONFIG_NUMA without CONFIG_NUMA_EMU does not make any sense on
    s390x (LP: #1557690)
    - [Config] CONFIG_NUMA_BALANCING_DEFAULT_ENABLED=n for s390x

  * spl/zfs fails to build on s390x (LP: #1519814)
    - [Config] s390x -- re-enable zfs
    - [Config] zfs -- disable powerpc until the test failures can be resolved

  * linux: sync to ZFS 0.6.5.6 stable release (LP: #1561483)
    - SAUCE: (noup) Update spl to 0.6.5.6-0ubuntu1, zfs to 0.6.5.6-0ubuntu1

  * zfs: enable zfs for 64bit powerpc kernels (LP: #1558871)
    - [Packaging] zfs -- handle rprovides via dpkg-gencontrol
    - [Config] powerpc -- convert zfs configuration to custom_override

  * Memory arena corruption with FUSE (was Memory allocation failure crashes
    kernel hard, presumably related to FUSE) (LP: #1505948)
    - SAUCE: (noup) fuse: do not use iocb after it may have been freed
    - SAUCE: (noup) fuse: Add reference counting for fuse_io_priv

  * cgroup namespaces: add a 'nsroot=' mountinfo field (LP: #1560489)
    - SAUCE: (noup) cgroup namespaces: add a 'nsroot=' mountinfo field

  * linux packaging: clear remaining redundant delta (LP: #1560445)
    - [Debian] Remove generated intermediate files on clean

  * arm64: guest hangs when ntpd is running (LP: #1549494)
    - Revert "hrtimer: Add support for CLOCK_MONOTONIC_RAW"
    - Revert "hrtimer: Catch illegal clockids"
    - Revert "KVM: arm/arm64: timer: Switch to CLOCK_MONOTONIC_RAW"

  * Need enough contiguous memory to support GICv3 ITS table (LP: #1558828)
    - [Config] CONFIG_FORCE_MAX_ZONEORDER=13 on arm64
    - SAUCE: (no-up) arm64: gicv3: its: Increase FORCE_MAX_ZONEORDER for Cavium
      ThunderX

  * update arcmsr to version v1.30.00.22-20151126 to fix card timeouts
    (LP: #1559609)
    - arcmsr: fixed getting wrong configuration data
    - arcmsr: fixes not release allocated resource
    - arcmsr: make code more readable
    - arcmsr: adds code to support new Areca adapter ARC1203
    - arcmsr: changes driver version number
    - arcmsr: more readability improvements
    - arcmsr: Split dma resource allocation to a new function
    - arcmsr: change driver version to v1.30.00.22-20151126

  * server image has no keyboard, desktop image works (LP: #1559692)
    - [Config] Rework input-modules (d-i) list

  * PMU sup...

Changed in linux (Ubuntu Xenial):
status: Fix Committed → Fix Released