Ubuntu Server 18.04 LTS aacraid error

Bug #1777586 reported by Patrick Storms on 2018-06-19
This bug affects 4 people
Affects              Importance  Assigned to
linux (Ubuntu)       High        Unassigned
Bionic               High        Unassigned

Bug Description

I upgraded from a previous version of Ubuntu (14.04 LTS) to 18.04 LTS and am now running into these RAID adapter driver errors. The server ran fine on the older version. My apologies, I lost the exact version, but it never had errors like these.

Now whenever I try to copy files to the RAID 5 drive, or untar a file, I get these errors after a few MB of written data.

Linux batboat 4.15.0-23-generic #25-Ubuntu SMP Wed May 23 18:02:16 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux

batboat:/var/log$ lsb_release -rd
Description: Ubuntu 18.04 LTS
Release: 18.04

I have tried the IRQ debugging tips to no avail. I also loaded Debian 9.4.0, which showed this error only once, briefly. It appears to be much more resilient and otherwise works fine.

Jun 19 00:02:21 batboat kernel: [ 498.770839] aacraid: Host adapter reset request. SCSI hang ?
Jun 19 00:02:37 batboat kernel: [ 514.139167] aacraid: Host adapter reset request. SCSI hang ?
Jun 19 00:02:37 batboat kernel: [ 514.795083] aacraid 0000:03:09.0: Adapter health - 199
Jun 19 00:02:37 batboat kernel: [ 514.800376] aacraid 0000:03:09.0: outstanding cmd: midlevel-0
Jun 19 00:02:37 batboat kernel: [ 514.800378] aacraid 0000:03:09.0: outstanding cmd: lowlevel-0
Jun 19 00:02:37 batboat kernel: [ 514.800381] aacraid 0000:03:09.0: outstanding cmd: error handler-0
Jun 19 00:02:37 batboat kernel: [ 514.800383] aacraid 0000:03:09.0: outstanding cmd: firmware-5
Jun 19 00:02:37 batboat kernel: [ 514.800385] aacraid 0000:03:09.0: outstanding cmd: kernel-0
Jun 19 00:02:37 batboat kernel: [ 514.800391] sd 4:0:0:0: Device offlined - not ready after error recovery
Jun 19 00:02:37 batboat kernel: [ 514.800394] sd 4:0:0:0: Device offlined - not ready after error recovery
Jun 19 00:02:37 batboat kernel: [ 514.800396] sd 4:0:0:0: Device offlined - not ready after error recovery
Jun 19 00:02:37 batboat kernel: [ 514.800399] sd 4:0:0:0: Device offlined - not ready after error recovery
Jun 19 00:02:37 batboat kernel: [ 514.800401] sd 4:0:0:0: Device offlined - not ready after error recovery
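
A minimal sketch for summarising log excerpts like the one above, assuming the kernel messages are available as plain text (for example from the syslog file). The function name `aacraid_summary` is hypothetical; it only counts the two message types quoted in this report.

```shell
# Count aacraid reset requests and "Device offlined" messages in a
# syslog stream read from stdin. Prints "resets=N offlined=M".
aacraid_summary() {
    awk '
        /aacraid: Host adapter reset request/ { resets++ }
        /Device offlined - not ready after error recovery/ { offlined++ }
        END { printf "resets=%d offlined=%d\n", resets, offlined }
    '
}
```

Possible usage, assuming the default syslog location: `aacraid_summary < /var/log/syslog`.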

Patrick Storms (pstorms) wrote :

This bug is missing log files that will aid in diagnosing the problem. While running an Ubuntu kernel (not a mainline or third-party kernel) please enter the following command in a terminal window:

apport-collect 1777586

and then change the status of the bug to 'Confirmed'.

If, due to the nature of the issue you have encountered, you are unable to run this command, please add a comment stating that fact and change the bug status to 'Confirmed'.

This change has been made by an automated script, maintained by the Ubuntu Kernel Team.

Changed in linux (Ubuntu):
status: New → Incomplete
tags: added: bionic
Joseph Salisbury (jsalisbury) wrote :

Would it be possible for you to test the latest upstream kernel? Refer to https://wiki.ubuntu.com/KernelMainlineBuilds . Please test the latest v4.17 kernel[0].

If this bug is fixed in the mainline kernel, please add the following tag 'kernel-fixed-upstream'.

If the mainline kernel does not fix this bug, please add the tag: 'kernel-bug-exists-upstream'.

Once testing of the upstream kernel is complete, please mark this bug as "Confirmed".

Thanks in advance.

[0] http://kernel.ubuntu.com/~kernel-ppa/mainline/v4.18-rc1

Changed in linux (Ubuntu):
importance: Undecided → High
Joseph Salisbury (jsalisbury) wrote :

Also, does this bug go away if you select the prior kernel version from the GRUB menu?

Changed in linux (Ubuntu Bionic):
importance: Undecided → High
status: New → Incomplete
tags: added: kernel-da-key
Patrick Storms (pstorms) wrote :

I have decided to try to eliminate some things. I ended up formatting my RAID drives and cleared that out. I also rebuilt the RAID Array in hopes of clearing up this issue. But these did not fix what we are seeing here.

As to your request regarding selecting versions from the GRUB menu, I tried the two versions of the kernel that I have installed on this server.

The two kernels are:
4.15.0-20-generic
4.15.0-23-generic

This is an interesting development. When I first booted the earlier version (4.15.0-20), I did not get any errors. I tested several large copies to a USB drive with no issues, and six attempts to reproduce the problem were unsuccessful. When I booted back into the newer version (4.15.0-23), it errored almost immediately. When I then rebooted into 4.15.0-20 again, it errored as well. So the issue is present in both releases.

I will try the updated kernel next.

Patrick Storms (pstorms) wrote :

So I downloaded the new kernel as you requested. I have 4.17.2-041702-generic installed and booted. After loading the new kernel, I ran the same tests that previously failed with the aacraid errors, and it exhibits the same behaviour.

I mounted the USB drive at Jun 20 00:46:55 in the attached syslog file. I then copied from the RAID drive to the USB drive and the errors showed up again.

Patrick Storms (pstorms) wrote :

Another note: if I let the system run as a web and email server, there are no issues and I don't see any errors in the log files. The issue only arises when there is heavy I/O to the hardware RAID controller. The software RAID works fine; my boot drives are mirrored SSDs. The data drive is a RAID 5 array of three 1 TB drives on the Adaptec controller. Thanks for the help with this.
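
For anyone trying to reproduce the heavy-I/O condition described above, a rough sketch follows. The function name `stress_write` and the temporary file name are hypothetical, and the amount of data needed to trigger the error is not known from this report, so treat the size argument as something to experiment with.

```shell
# Write MB megabytes of zeros to a temporary file in DIR, flush to disk,
# then delete the file. Returns non-zero if the write or sync failed.
stress_write() {
    dir=$1
    mb=$2
    f="$dir/aacraid-stress.$$"
    dd if=/dev/zero of="$f" bs=1048576 count="$mb" 2>/dev/null && sync
    status=$?
    rm -f "$f"
    return "$status"
}
```

Possible usage, assuming the RAID 5 volume is mounted at a path of your choosing: run `stress_write /path/to/raid 2048`, then check `dmesg | grep -i aacraid` for new messages.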

Patrick Storms (pstorms) wrote :

For giggles I downloaded the latest kernel available today, 4.18-rc1, to try. It too exhibits the same thing. No change.

Patrick Storms (pstorms) wrote :

For more information, I am now going back through older kernels. I tried 4.14.50 and it errors too.

Patrick Storms (pstorms) wrote :

I have now gone back to 4.13.16 as a test, and it too exhibits the same behaviour. I think I have chased enough kernels for now. If there is anything else you need, please let me know. Thank you for your help in this matter. It took approximately 283 seconds before the first "aacraid: Host adapter abort request" message appeared.

Patrick Storms (pstorms) wrote :

I am unable to run the command requested by the automated kernel bot (apport-collect 1777586). I am setting the bug status to Confirmed as requested.

Changed in linux (Ubuntu Bionic):
status: Incomplete → Confirmed
Changed in linux (Ubuntu):
status: Incomplete → Confirmed
Daniel Reinhardt (cryptodan) wrote :

This bug goes all the way back to CentOS 5 and kernel 2.6.

I have a stable machine on the following system:

cryptodan@capricorn:~$ inxi -Fxxxrpc0
System: Host: capricorn Kernel: 3.13.0-24-generic i686 (32 bit, gcc: 4.8.2) Console: tty 1 Distro: Ubuntu 14.04 trusty
Machine: System: Dell product: PowerEdge 4600 Chassis: type: 17
           Mobo: Dell model: 0H3009 version: A00 Bios: Dell version: A13 date: 10/21/2004
CPU(s): 2 Single core Intel Xeon CPUs (-HT-SMP-) cache: 1024 KB flags: (pae sse sse2) bmips: 11961.4
           Clock Speeds: 1: 2990.346 MHz 2: 2990.346 MHz 3: 2990.346 MHz 4: 2990.346 MHz
Graphics: Card: Advanced Micro Devices [AMD/ATI] Rage XL PCI bus-ID: 00:0e.0 chip-ID: 1002:4752
           X-Vendor: N/A driver: N/A tty size: 100x35 Advanced Data: N/A out of X
Network: Card-1: Intel 82557/8/9/0/1 Ethernet Pro 100
           driver: e100 ver: 3.5.24-k2-NAPI port: e8c0 bus-ID: 00:08.0 chip-ID: 8086:1229
           IF: eth2 state: down mac: 00:02:b3:4b:1b:d9
           Card-2: Intel 82546EB Gigabit Ethernet Controller (Copper)
           driver: e1000 ver: 7.3.21-k8-NAPI port: bcc0 bus-ID: 08:06.0 chip-ID: 8086:1010
           IF: eth0 state: down mac: 00:04:23:d0:b5:e2
           Card-3: Intel 82546EB Gigabit Ethernet Controller (Copper)
           driver: e1000 ver: 7.3.21-k8-NAPI port: bc80 bus-ID: 08:06.1 chip-ID: 8086:1010
           IF: eth1 state: up speed: 1000 Mbps duplex: full mac: 00:04:23:d0:b5:e3
Drives: HDD Total Size: 2099.6GB (0.1% used)
           1: id: /dev/sda model: system size: 300.0GB serial: 8EDB485F temp: 0C
           2: id: /dev/sdb model: homepart size: 1799.6GB serial: 326F485F temp: 0C
Partition: ID: / size: 92G used: 377M (1%) fs: ext4 ID: /boot size: 922M used: 35M (5%) fs: ext4
           ID: /usr size: 92G used: 745M (1%) fs: ext4 ID: /var size: 69G used: 527M (1%) fs: ext4
           ID: /home size: 1.7T used: 69M (1%) fs: ext4 ID: swap-1 size: 24.00GB used: 0.00GB (0%) fs: swap
RAID: System: supported: N/A
           No RAID devices detected - /proc/mdstat and md_mod kernel raid module present
           Unused Devices: none
Sensors: None detected - is lm-sensors installed and configured?
Repos: Active apt sources in file: /etc/apt/sources.list
           deb http://us.archive.ubuntu.com/ubuntu/ trusty main restricted
           deb-src http://us.archive.ubuntu.com/ubuntu/ trusty main restricted
           deb http://us.archive.ubuntu.com/ubuntu/ trusty-updates main restricted
           deb-src http://us.archive.ubuntu.com/ubuntu/ trusty-updates main restricted
           deb http://us.archive.ubuntu.com/ubuntu/ trusty universe
           deb-src http://us.archive.ubuntu.com/ubuntu/ trusty universe
           deb http://us.archive.ubuntu.com/ubuntu/ trusty-updates universe
           deb-src http://us.archive.ubuntu.com/ubuntu/ trusty-updates universe
           deb http://us.archive.ubuntu.com/ubuntu/ trusty multiverse
           deb-src http://us.archive.ubuntu.com/ubuntu/ trusty multiverse
           deb http://us.archive.ubuntu.com/ubuntu/ trusty-updates multiverse
           deb-src http://us.archive.ubuntu.com/ubuntu/ trusty-updates multiverse
           deb ...


telsch (telsch) wrote :

I ran into a similar issue after upgrading from 14.04 to 16.04, and again after upgrading to 18.04.

Previously, the solution was to upgrade the firmware:
    https://storage.microsemi.com/en-us/downloads/bios_fw/bios_fw_ver/productid=sas-6405&dn=adaptec+raid+6405.php

The changelog describes the added support:
     http://download.adaptec.com/pdfs/readme/microsemi_relnotes_arc_8_2017.pdf
- Added support for Ubuntu 16.04.
- Added support for Ubuntu 14.04.4.

A support call gave no hope, because this RAID controller is EOL.

Makere (makere) wrote :

I have this same issue after upgrading from 16.04 to 18.04. I have the Adaptec 5805 controller, and my computer freezes at fairly random intervals: it might be every 1-2 days or every 3-4 weeks.

After checking the syslog for the cause of the freezes, aacraid seems to be the culprit.

Trent (trentm2) wrote :

I also have this issue after upgrading to 18.04.2 LTS on a host with an Adaptec 5805 and linux-image-4.15.0-46-generic.

The machine hangs on boot and starts returning:
aacraid: Host Adapter abort request.
aacraid: Outstanding commands on (0,2,223,0):
aacraid: Host Adapter abort request.
aacraid: Outstanding commands on (0,2,224,0):
aacraid: Host Adapter abort request.
aacraid: Outstanding commands on (0,2,225,0):
etc

First I upgraded the firmware on the controller to 19204 (from 19176), with no improvement.

Rolling back to linux-image-4.15.0-45-generic made no difference.

I tried mainline kernels
linux-image-unsigned-5.0.0-050000-generic_5.0.0-050000.201903032031_amd64.deb
and
linux-image-unsigned-4.20.14-042014-generic_4.20.14-042014.201903051334_amd64.deb
and issue persisted

But if I roll back to 4.4.0-142-generic, there are no problems.

Trent (trentm2) wrote :

mainline linux-image-4.9.162-0409162-generic_4.9.162-0409162.201903051732_amd64.deb is working fine.

Kai-Heng Feng (kaihengfeng) wrote :

Trent,
Would it be possible to do a kernel bisection?

Trent (trentm2) wrote :

I can try.

Someone above found the issue in a 4.13 kernel. I can move forward from 4.9.x, and hopefully we can find where something changed.

Trent (trentm2) wrote :

linux-image-4.13.16-041316-generic_4.13.16-041316.201711240901_amd64 - no good
linux-image-4.13.1-041301-generic_4.13.1-041301.201709100232_amd64 - no good
linux-image-4.13.0-041300rc1-generic_4.13.0-041300rc1.201707151931_amd64 - no good
linux-image-4.12.14-041214-generic_4.12.14-041214.201709200843_amd64 - good
linux-image-4.11.12-041112-generic_4.11.12-041112.201707210350_amd64 - good

the issue appears to have begun with the jump from 4.12.x to 4.13.x
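
A small sketch of how a good/bad list like the one above could be reduced to its boundary. The function name `boundary` is hypothetical, and it assumes GNU `sort` (for `-V`) and that every good version sorts before every bad one; if results are mixed, the output is not meaningful.

```shell
# Input on stdin: one "VERSION good" or "VERSION bad" entry per line.
# Output: the highest version marked good and the lowest marked bad.
boundary() {
    sort -V | awk '
        $2 == "good" { last_good = $1 }
        $2 == "bad" && first_bad == "" { first_bad = $1 }
        END {
            print "last good: " last_good
            print "first bad: " first_bad
        }
    '
}
```

Fed the five results above (written as `4.13.16 bad`, `4.13.1 bad`, `4.13.0-rc1 bad`, `4.12.14 good`, `4.11.12 good`), it reports 4.12.14 as the last good version and 4.13.0-rc1 as the first bad one.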

Kai-Heng Feng (kaihengfeng) wrote :

Would it be possible for you to do a kernel bisection?

First, find the last good -rc kernel and the first bad -rc kernel from http://kernel.ubuntu.com/~kernel-ppa/mainline/

Then,
$ sudo apt build-dep linux
$ git clone git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git
$ cd linux
$ git bisect start
$ git bisect good GOOD_VERSION   # replace GOOD_VERSION with the good version you found
$ git bisect bad BAD_VERSION     # replace BAD_VERSION with the bad version you found
$ make localmodconfig
$ make -j`nproc` deb-pkg
Install the newly built kernel, then reboot with it.
If the issue still happens,
$ git bisect bad
Otherwise,
$ git bisect good
Repeat from "make -j`nproc` deb-pkg" until you find the commit that causes the regression.
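
The good/bad decision in each round of the steps above could be derived from the kernel log, as in this sketch. The function name `bisect_verdict` is hypothetical. It assumes the heavy-I/O test was actually run on the freshly booted kernel before the log is checked; a clean log without that test proves nothing.

```shell
# Print "bad" if the kernel log on stdin contains an aacraid abort or
# reset request (matched case-insensitively), otherwise print "good".
bisect_verdict() {
    if grep -Eiq 'aacraid: Host adapter (abort|reset) request'; then
        echo bad
    else
        echo good
    fi
}
```

Possible usage from inside the kernel tree after the test run: `git bisect "$(dmesg | bisect_verdict)"`. Since every round involves a reboot, `git bisect log` can be used to review the verdicts recorded so far.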

Kai-Heng Feng (kaihengfeng) wrote :

There aren't many aacraid commits, you can directly test them out:

$ git log v4.12..v4.13-rc1 --pretty=oneline | grep -i aacraid
342ffc26693b528648bdc9377e51e4f2450b4860 scsi: aacraid: Don't copy uninitialized stack memory to userspace
5cc973f09e21b5a2f746307641879bc9f1da623b scsi: aacraid: fix leak of data from stack back to userspace
216e80ff78cd2fe1a92c9e4565d578e540e35cc8 scsi: aacraid: Update driver version to 50834
395e5df79a9588abf1099ea746f11872c9086252 scsi: aacraid: Remove reference to Series-9
4a76be0dc53a2d725ee126a806e5988135952a05 scsi: aacraid: Add reset debugging statements
786e898c86ee532b76ca5e2ec0b8c2d464d553db scsi: aacraid: Enable ctrl reset for both hba and arc
8c41b9b7987e404b3c922cf9d8ff941112051837 scsi: aacraid: Make sure ioctl returns on controller reset
9473ddb2b037161b0bf16b60b37694f961fd6d48 scsi: aacraid: Use correct function to get ctrl health
5aa60732520dd0476ed9e20047b837780bbb7799 scsi: aacraid: Rework aac_src_restart
77cb6d5ea6033e5d477947aa682728959d6c3f8f scsi: aacraid: Rework SOFT reset code
0e9973ed3382652b324971753745cfe08488bb9f scsi: aacraid: Add periodic checks to see IOP reset status
80c7d8a5cffa7187c3b3b78eb67705dae91e9a1a scsi: aacraid: Rework IOP reset
6b24d425881792b16ccf2189b43d57b4aff2a4e6 scsi: aacraid: Using single reset mask for IOP reset
144ecd41f0f43600f0c103cb6d0d2f1619d70e96 scsi: aacraid: Print ctrl status before eh reset
895dc759cf3996a56ca64e3e09cbea64e2a7ff62 scsi: aacraid: Log count info of scsi cmds before reset
2a4a62c03fd0b5f2e361fbda85e043b2c1ff197d scsi: aacraid: Change wait time for fib completion
fed820073f647f4ecb4f4ae310d698520a802891 scsi: aacraid: Remove reset support from check_health
58eaffe54bca77fbf1a1bad7703265950a758cd1 scsi: aacraid: Set correct Queue Depth for HBA1000 RAW disks
d58129c96b3cde4821763b08b5f1bba17f031138 scsi: aacraid: Added 32 and 64 queue depth for arc natives
8105d39d0e7600ebbcce5827c11f15bf77c73af5 scsi: aacraid: Fix DMAR issues with iommu=pt
c831a4a08636d5462a0f9eb479771e2f65ad0378 scsi: aacraid: Remove __GFP_DMA for raw srb memory
