System won't boot after upgrade to 16.04 with 4.4.0 kernel

Bug #1653162 reported by Jon Schewe
This bug affects 4 people
Affects          Status      Importance   Assigned to   Milestone
linux (Ubuntu)   Invalid     Critical     Unassigned
Xenial           Invalid     Critical     Unassigned
Yakkety          Won't Fix   Critical     Unassigned

Bug Description

I have a Dell PowerEdge 1800 with a Dell CERC 1.5/6ch RAID controller in it. Everything worked fine under Ubuntu 14.04.1. After I upgraded to 16.04.1 the system won't boot: it can't find the root filesystem, and "host adapter dead" errors appear on the screen many times. This happens with kernel 4.4.0-57 and with 4.8.0-32. When I switch back to kernel 3.13.0-105 the system boots just fine.

I have mptbios 5.06.04.
The CERC card says bios version 4.1-0 [Build 7403]

When booting into the recovery kernel I'm getting error messages about host adapter dead.
Eventually I get a message:
"aacraid: aac_fib_send: first asynchronous command timed out.
Usually a result of a PCI interrupt routing problem
update mother board BIOS or consider utilizing one of
the SAFE mode kernel options (acpi, apic etc)"

Using kernel parameter "intel_iommu=on" doesn't help. This was suggested on a post that I saw about these errors.

Booting with "noapic" didn't help.
Booting with "noapic noacpi" didn't help.

I've also tried the dkms modules from Adaptec and that doesn't seem to help either.

I can't figure out how to upgrade the firmware or BIOS on the system from Ubuntu either.

This may be related to this bug: https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1552551?comments=all
---
ApportVersion: 2.20.1-0ubuntu2.4
Architecture: amd64
AudioDevicesInUse: Error: command ['fuser', '-v', '/dev/snd/seq', '/dev/snd/timer'] failed with exit code 1:
DistroRelease: Ubuntu 16.04
HibernationDevice: RESUME=UUID=beec17b5-eac8-4e94-ac99-65b63c428b77
InstallationDate: Installed on 2014-02-24 (1039 days ago)
InstallationMedia: Ubuntu 12.04.3 LTS "Precise Pangolin" - Release amd64 (20130820.1)
IwConfig:
 eth1 no wireless extensions.

 lo no wireless extensions.
Lsusb:
 Bus 001 Device 001: ID 1d6b:0002 Linux Foundation 2.0 root hub
 Bus 004 Device 001: ID 1d6b:0001 Linux Foundation 1.1 root hub
 Bus 003 Device 001: ID 1d6b:0001 Linux Foundation 1.1 root hub
 Bus 002 Device 001: ID 1d6b:0001 Linux Foundation 1.1 root hub
MachineType: Dell Computer Corporation PowerEdge 1800
Package: linux (not installed)
ProcEnviron:
 TERM=screen
 PATH=(custom, no user)
 XDG_RUNTIME_DIR=<set>
 LANG=en_US.UTF-8
 SHELL=/bin/bash
ProcFB: 0 radeondrmfb
ProcKernelCmdLine: BOOT_IMAGE=/boot/vmlinuz-3.13.0-105-generic root=UUID=83e1b87e-adb9-4798-a6e0-18d401c91a4a ro nosplash noplymouth
ProcVersionSignature: Ubuntu 3.13.0-105.152-generic 3.13.11-ckt39
PulseList: Error: command ['pacmd', 'list'] failed with exit code 1: No PulseAudio daemon running, or not running as session daemon.
RelatedPackageVersions:
 linux-restricted-modules-3.13.0-105-generic N/A
 linux-backports-modules-3.13.0-105-generic N/A
 linux-firmware 1.157.6
RfKill:

Tags: xenial
Uname: Linux 3.13.0-105-generic x86_64
UpgradeStatus: Upgraded to xenial on 2016-12-25 (3 days ago)
UserGroups: adm cdrom dip lpadmin plugdev sambashare sudo
_MarkForUpload: True
dmi.bios.date: 09/01/2004
dmi.bios.vendor: Dell Computer Corporation
dmi.bios.version: A00
dmi.board.name: 0X7500
dmi.board.vendor: Dell Computer Corporation
dmi.board.version: A00
dmi.chassis.type: 17
dmi.chassis.vendor: Dell Computer Corporation
dmi.modalias: dmi:bvnDellComputerCorporation:bvrA00:bd09/01/2004:svnDellComputerCorporation:pnPowerEdge1800:pvr:rvnDellComputerCorporation:rn0X7500:rvrA00:cvnDellComputerCorporation:ct17:cvr:
dmi.product.name: PowerEdge 1800
dmi.sys.vendor: Dell Computer Corporation

Jon Schewe (jpschewe) wrote :

I tried some other kernel parameters:

scsi_scan=sync -> didn't help
pci=nomsi,nommconf -> didn't help

For now I'm just booting the 3.13.0 kernel since I need the server up and don't have good physical access to it.

Brad Figg (brad-figg) wrote : Missing required logs.

This bug is missing log files that will aid in diagnosing the problem. From a terminal window please run:

apport-collect 1653162

and then change the status of the bug to 'Confirmed'.

If, due to the nature of the issue you have encountered, you are unable to run this command, please add a comment stating that fact and change the bug status to 'Confirmed'.

This change has been made by an automated script, maintained by the Ubuntu Kernel Team.

Changed in linux (Ubuntu):
status: New → Incomplete
Jon Schewe (jpschewe) wrote : apport information

Attached: AlsaInfo.txt, CRDA.txt, CurrentDmesg.txt, JournalErrors.txt, Lspci.txt, ProcCpuinfo.txt, ProcInterrupts.txt, ProcModules.txt, UdevDb.txt, WifiSyslog.txt

tags: added: apport-collected xenial
description: updated

Changed in linux (Ubuntu):
status: Incomplete → Confirmed
penalvch (penalvch)
tags: added: bios-outdated-a07
Changed in linux (Ubuntu):
importance: Undecided → Low
status: Confirmed → Incomplete
Jon Schewe (jpschewe) wrote :

I updated the bios to A07 as this is the latest listed for my service tag 1VMBS61.
I updated the BMC bios and the SAS controller BIOS.
Still won't boot with kernel 4.4.0-57.

>sudo dmidecode -s bios-version && sudo dmidecode -s bios-release-date
A07
09/29/2006

Changed in linux (Ubuntu):
status: Incomplete → Confirmed
penalvch (penalvch) wrote :

Jon Schewe, in order to allow additional upstream developers to examine the issue, at your earliest convenience, could you please test the latest upstream kernel available from http://kernel.ubuntu.com/~kernel-ppa/mainline/?C=N;O=D ? Please keep in mind the following:
1) The one to test is the entry at the very top of the page (not the daily folder).
2) The release names are irrelevant.
3) The folder time stamps aren't indicative of when the kernel actually was released upstream.
4) Install instructions are available at https://wiki.ubuntu.com/Kernel/MainlineBuilds .

If testing on your main install would be inconvenient, one may:
1) Install Ubuntu to a different partition and then test this there.
2) Backup, or clone the primary install.

If the latest kernel did not allow you to test the issue (e.g. you couldn't boot into the OS), please comment in your report about this, and continue with the next most recent kernel version until you can test the issue. Once you've tested the upstream kernel, please comment on which kernel version specifically you tested. If this issue is fixed in the mainline kernel, please add the following tags by clicking on the yellow circle with a black pencil icon, next to the word Tags, located at the bottom of the report description:
kernel-fixed-upstream
kernel-fixed-upstream-X.Y-rcZ

Where X, and Y are the first two numbers of the kernel version, and Z is the release candidate number if it exists.

If the mainline kernel does not fix the issue, please add the following tags:
kernel-bug-exists-upstream
kernel-bug-exists-upstream-X.Y-rcZ
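
The tag naming rule above can be sketched in Python. This is only an illustration: the function name `upstream_tags` and its arguments are hypothetical, and it does nothing more than format the two tags from a mainline version string such as `4.10-rc4`.

```python
import re
from typing import List

def upstream_tags(version: str, fixed: bool) -> List[str]:
    """Build the pair of tags for a tested mainline kernel.

    `version` is a mainline version such as "4.10-rc4" or "v4.13".
    Only the first two numbers and the optional -rcZ suffix are used.
    """
    match = re.match(r"^v?(\d+)\.(\d+)(?:\.\d+)?(?:-rc(\d+))?$", version)
    if match is None:
        raise ValueError(f"unrecognised kernel version: {version!r}")
    major, minor, rc = match.groups()
    # Pick the tag family depending on whether the bug is still present.
    prefix = "kernel-fixed-upstream" if fixed else "kernel-bug-exists-upstream"
    suffix = f"{major}.{minor}" + (f"-rc{rc}" if rc else "")
    return [prefix, f"{prefix}-{suffix}"]

print(upstream_tags("4.10-rc4", fixed=False))
```

The release-candidate part is simply omitted when the version has no `-rcZ` suffix.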

Please note, a failure to install the kernel does not meet the criteria for kernel-bug-exists-upstream.

Also, you don't need to apport-collect further unless specifically requested to do so.

Once testing of the latest upstream kernel is complete, it is most helpful if you mark this report's Status as Confirmed.

Lastly, to keep this issue relevant to upstream, please continue to test the latest mainline kernel as it becomes available.

Thank you for your help.

tags: added: latest-bios-a07 latest-firmware
removed: bios-outdated-a07
Changed in linux (Ubuntu):
importance: Low → Medium
status: Confirmed → Incomplete
tags: added: kernel-da-key
Changed in linux (Ubuntu):
importance: Medium → High
Jon Schewe (jpschewe) wrote :

I just tried the kernel in the directory v4.10-rc4/.

tags: added: ernel-bug-exists-upstream-4.10-rc4 kernel-bug-exists-upstream
Joseph Salisbury (jsalisbury) wrote :

I'd like to perform a bisect to figure out what commit caused this regression. We need to identify the earliest kernel where the issue started happening as well as the latest kernel that did not have this issue.

Can you test the following kernels and report back? We are looking for the first kernel version that exhibits this bug:

3.13 final: http://kernel.ubuntu.com/~kernel-ppa/mainline/v3.13-trusty/
3.16 final: http://kernel.ubuntu.com/~kernel-ppa/mainline/v3.16-utopic/
4.0 final: http://kernel.ubuntu.com/~kernel-ppa/mainline/v4.0-vivid/
4.2 final: http://kernel.ubuntu.com/~kernel-ppa/mainline/v4.2-wily/
4.4 final: http://kernel.ubuntu.com/~kernel-ppa/mainline/v4.4-wily/

You don't have to test every kernel, just up until the kernel that first has this bug.

Thanks in advance!
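
The search described here amounts to finding the first entry in an ordered list of kernels that shows the bug. A minimal sketch, assuming the test results have already been gathered into a dictionary; the `first_bad` helper and the `results` mapping are hypothetical, and the sample outcomes are placeholders for illustration, not results from this report.

```python
from typing import Dict, List, Optional, Tuple

def first_bad(order: List[str],
              results: Dict[str, bool]) -> Tuple[Optional[str], Optional[str]]:
    """Return (last_good, first_bad) for kernels listed oldest to newest.

    `results` maps a version to True if it booted cleanly and False if it
    showed the bug. Versions missing from `results` are treated as untested.
    """
    last_good = None
    for version in order:
        if version not in results:
            continue  # not tested yet, skip it
        if results[version]:
            last_good = version
        else:
            return last_good, version
    return last_good, None

order = ["3.13", "3.16", "4.0", "4.2", "4.4"]
# Placeholder outcomes, for illustration only.
results = {"3.13": True, "3.16": True, "4.0": True, "4.2": False}
print(first_bad(order, results))
```

The pair returned is the range that a commit-level `git bisect` would then narrow down.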

Changed in linux (Ubuntu):
importance: High → Critical
Changed in linux (Ubuntu Xenial):
status: New → Incomplete
importance: Undecided → Critical
Changed in linux (Ubuntu Yakkety):
status: New → Triaged
Changed in linux (Ubuntu Xenial):
status: Incomplete → Triaged
Changed in linux (Ubuntu):
status: Incomplete → Triaged
Changed in linux (Ubuntu Yakkety):
importance: Undecided → Critical
Changed in linux (Ubuntu):
assignee: nobody → Joseph Salisbury (jsalisbury)
Changed in linux (Ubuntu Xenial):
assignee: nobody → Joseph Salisbury (jsalisbury)
Changed in linux (Ubuntu Yakkety):
assignee: nobody → Joseph Salisbury (jsalisbury)
tags: added: kernel-key
removed: kernel-da-key
Joseph Salisbury (jsalisbury) wrote :

@Jon Schewe, were you able to test any of the kernels posted in comment #19?

Jon Schewe (jpschewe) wrote :

I installed all of the kernels and then rebooted multiple times starting with my working kernel first and then going up in version numbers.

3.13.0-105 -> boots

3.13 -> kernel panic, can't execute /bin/sh
3.16 -> boots
4.0 -> boots
4.2 -> boots, after some time got some messages about host adapter reset with possible SCSI hang, but it appears OK
4.4 -> boots

Joseph Salisbury (jsalisbury) wrote :

So the 3.13 final kernel hits the panic, but the 4.4 final kernel does not, using the links posted in comment #19?

Can you also test the Xenial -proposed kernel? It is available from:
https://launchpad.net/~canonical-kernel-team/+archive/ubuntu/ppa/+build/11877924

For that kernel, you need to install both the linux-image and linux-image-extra .deb packages.

tags: added: kernel-da-key
removed: kernel-key
Changed in linux (Ubuntu):
status: Triaged → Incomplete
Changed in linux (Ubuntu Xenial):
status: Triaged → Incomplete
Changed in linux (Ubuntu Yakkety):
status: Triaged → Incomplete
Jon Schewe (jpschewe) wrote :

I finally got time to get access to the computer again. I tried the 4.4.0-62 kernel from xenial-proposed and received "host adapter dead".

Since I last tested I see that 4.4.0-75 is available, so I tried that. The system booted. It gave me some messages about a SCSI adapter reset and a message about the SCSI bus being hung, but after that the system appears to have booted OK and I'm able to access the disk.

Jon Schewe (jpschewe) wrote :

After running for about 24 hours I got a large number of "aacraid: Host adapter abort request (2,0,0,0)" errors and then the following:

AAC: Host adapter BLINK LED 0x7
AAC0: adapter kernel panic'd 7.
sd 2:0:0:0: Device offlined - not ready after error recovery
sd 2:0:0:0: Device offlined - not ready after error recovery
sd 2:0:0:0: Device offlined - not ready after error recovery
sd 2:0:0:0: Device offlined - not ready after error recovery
sd 2:0:0:0: Device offlined - not ready after error recovery
sd 2:0:0:0: Device offlined - not ready after error recovery

At this point my root filesystem is read-only. It won't reboot remotely because I've got logging turned on for sudo, so I'll need to physically go to the system and do a hard reset.

inch eye (incheye) wrote :

I'm getting the same issue trying to install ubuntu-16.04.2-desktop-amd64.

TBH I'm a bit of a newb when it comes to Linux; can anyone tell me how to work around this? The only difference here is that I'm trying to dual boot Windows 7 and Ubuntu.

Any help would be really appreciated.

Andy Whitcroft (apw) wrote : Closing unsupported series nomination.

This bug was nominated against a series that is no longer supported, ie yakkety. The bug task representing the yakkety nomination is being closed as Won't Fix.

This change has been made by an automated script, maintained by the Ubuntu Kernel Team.

Changed in linux (Ubuntu Yakkety):
status: Incomplete → Won't Fix
Daniel A. Gauthier (fractal2010) wrote :

As far as I know, this is still a current issue. It has nothing to do with Ubuntu; it's a mainline kernel problem. I was working with someone earlier this year and after tracing it to a specific commit (I believe the same one that added "4k" sector support), I was sent somewhere else and kinda dropped it there. I was working on a customer's PC that was finished and needed to be returned. I have since bought a CERC card that is an exact match and may be able to help more if needed. I looked over the code a bit and there were quite a number of changes made to the module at the time to use a whole new set of functions :(. It acts to me like an interrupt timing/sync issue, but the last time I dealt with that was an ISA interrupt sharing methodology in 1990.

Daniel A. Gauthier (fractal2010) wrote :

Forgot to add I think it was either 4.0.1 or 4.1.0 that broke it, if I remember right. This bug is linked to my earlier one somehow, but I don't know how to point you there. I got the system running by removing a package like "linux-kernel-generic" which depends on the current version and then installing 4.0.1 (or 4.0.9 or 4.0.0 or whatever) as a permanent kernel version. It's probably also possible to "pin" a certain version of the generic-kernel package, but I was never able to successfully do that.

Jon Schewe (jpschewe) wrote :

The last Ubuntu kernel that worked for me was 3.13.0-105. The 4.0 vivid testing kernel also worked for me. However, I've gone back to 3.13.0-105 as it's part of the Ubuntu repository. I'm hoping that this gets fixed so that I can move to a recent kernel.

Kai-Heng Feng (kaihengfeng) wrote :

There are lots of fixes in mainline, it's probably a good idea to try v4.13-rc3.

Jon Schewe (jpschewe) wrote :

I just tried the latest stable kernel in the repository, 4.4.0-91 and it boots, but after a bit I get aacraid host adapter abort request messages and a scsi hang message. It seems to happen most often under load.

4.13.0-041300rc3-generic_4.13.0-041300rc3.201707301631 ->
[ 138.983840] aacraid: Host adapter abort request.
               aacraid: Outstanding commands on (2,0,0,0):
[ 139.815831] aacraid: Host adapter abort request.
               aacraid: Outstanding commands on (2,0,0,0):
[ 139.828527] aacraid: Host adapter reset request. SCSI hang ?
[ 139.834955] aacraid 0000:03:09.0: outstanding cmd: midlevel-0
[ 139.834957] aacraid 0000:03:09.0: outstanding cmd: lowlevel-0
[ 139.834959] aacraid 0000:03:09.0: outstanding cmd: error handler-1
[ 139.834961] aacraid 0000:03:09.0: outstanding cmd: firmware-1
[ 139.834963] aacraid 0000:03:09.0: outstanding cmd: kernel-0

4.13.0-041300rc4-lowlatency_4.13.0-041300rc4.201708062231 ->
[ 127.974997] aacraid: Host adapter abort request.
               aacraid: Outstanding commands on (2,0,0,0):
[ 127.988132] aacraid: Host adapter abort request.
               aacraid: Outstanding commands on (2,0,0,0):
[ 128.000780] aacraid: Host adapter reset request. SCSI hang ?
[ 128.006827] aacraid 0000:03:09.0: outstanding cmd: midlevel-0
[ 128.006830] aacraid 0000:03:09.0: outstanding cmd: lowlevel-0
[ 128.006833] aacraid 0000:03:09.0: outstanding cmd: error handler-1
[ 128.006835] aacraid 0000:03:09.0: outstanding cmd: firmware-1
[ 128.006837] aacraid 0000:03:09.0: outstanding cmd: kernel-0
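
To compare kernels, the abort and reset messages in logs like these can be tallied. A rough sketch, assuming the log has been saved as text; the function name `count_aacraid_events` is hypothetical, and the patterns match only the message wording quoted in this report.

```python
import re
from collections import Counter

# Patterns taken from the message wording quoted in this report.
PATTERNS = {
    "abort": re.compile(r"aacraid: Host adapter abort request"),
    "reset": re.compile(r"aacraid: Host adapter reset request"),
}

def count_aacraid_events(log_text: str) -> Counter:
    """Count abort and reset request lines in a saved kernel log."""
    counts = Counter()
    for line in log_text.splitlines():
        for name, pattern in PATTERNS.items():
            if pattern.search(line):
                counts[name] += 1
    return counts

sample = """\
[ 138.983840] aacraid: Host adapter abort request.
               aacraid: Outstanding commands on (2,0,0,0):
[ 139.815831] aacraid: Host adapter abort request.
               aacraid: Outstanding commands on (2,0,0,0):
[ 139.828527] aacraid: Host adapter reset request. SCSI hang ?
"""
print(count_aacraid_events(sample))
```

Running this over the log from each kernel gives a quick side-by-side of how often the adapter was aborted or reset.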

Daniel Reinhardt (cryptodan) wrote :

Joseph,

This bug goes all the way back to CentOS 5 and kernel 2.6.

I have a stable machine on the following system:

cryptodan@capricorn:~$ inxi -Fxxxrpc0
System: Host: capricorn Kernel: 3.13.0-24-generic i686 (32 bit, gcc: 4.8.2) Console: tty 1 Distro: Ubuntu 14.04 trusty
Machine: System: Dell product: PowerEdge 4600 Chassis: type: 17
           Mobo: Dell model: 0H3009 version: A00 Bios: Dell version: A13 date: 10/21/2004
CPU(s): 2 Single core Intel Xeon CPUs (-HT-SMP-) cache: 1024 KB flags: (pae sse sse2) bmips: 11961.4
           Clock Speeds: 1: 2990.346 MHz 2: 2990.346 MHz 3: 2990.346 MHz 4: 2990.346 MHz
Graphics: Card: Advanced Micro Devices [AMD/ATI] Rage XL PCI bus-ID: 00:0e.0 chip-ID: 1002:4752
           X-Vendor: N/A driver: N/A tty size: 100x35 Advanced Data: N/A out of X
Network: Card-1: Intel 82557/8/9/0/1 Ethernet Pro 100
           driver: e100 ver: 3.5.24-k2-NAPI port: e8c0 bus-ID: 00:08.0 chip-ID: 8086:1229
           IF: eth2 state: down mac: 00:02:b3:4b:1b:d9
           Card-2: Intel 82546EB Gigabit Ethernet Controller (Copper)
           driver: e1000 ver: 7.3.21-k8-NAPI port: bcc0 bus-ID: 08:06.0 chip-ID: 8086:1010
           IF: eth0 state: down mac: 00:04:23:d0:b5:e2
           Card-3: Intel 82546EB Gigabit Ethernet Controller (Copper)
           driver: e1000 ver: 7.3.21-k8-NAPI port: bc80 bus-ID: 08:06.1 chip-ID: 8086:1010
           IF: eth1 state: up speed: 1000 Mbps duplex: full mac: 00:04:23:d0:b5:e3
Drives: HDD Total Size: 2099.6GB (0.1% used)
           1: id: /dev/sda model: system size: 300.0GB serial: 8EDB485F temp: 0C
           2: id: /dev/sdb model: homepart size: 1799.6GB serial: 326F485F temp: 0C
Partition: ID: / size: 92G used: 377M (1%) fs: ext4 ID: /boot size: 922M used: 35M (5%) fs: ext4
           ID: /usr size: 92G used: 745M (1%) fs: ext4 ID: /var size: 69G used: 527M (1%) fs: ext4
           ID: /home size: 1.7T used: 69M (1%) fs: ext4 ID: swap-1 size: 24.00GB used: 0.00GB (0%) fs: swap
RAID: System: supported: N/A
           No RAID devices detected - /proc/mdstat and md_mod kernel raid module present
           Unused Devices: none
Sensors: None detected - is lm-sensors installed and configured?
Repos: Active apt sources in file: /etc/apt/sources.list
           deb http://us.archive.ubuntu.com/ubuntu/ trusty main restricted
           deb-src http://us.archive.ubuntu.com/ubuntu/ trusty main restricted
           deb http://us.archive.ubuntu.com/ubuntu/ trusty-updates main restricted
           deb-src http://us.archive.ubuntu.com/ubuntu/ trusty-updates main restricted
           deb http://us.archive.ubuntu.com/ubuntu/ trusty universe
           deb-src http://us.archive.ubuntu.com/ubuntu/ trusty universe
           deb http://us.archive.ubuntu.com/ubuntu/ trusty-updates universe
           deb-src http://us.archive.ubuntu.com/ubuntu/ trusty-updates universe
           deb http://us.archive.ubuntu.com/ubuntu/ trusty multiverse
           deb-src http://us.archive.ubuntu.com/ubuntu/ trusty multiverse
           deb http://us.archive.ubuntu.com/ubuntu/ trusty-updates multiverse
           deb-src http://us.archive.ubuntu.com/ubuntu/ trusty-updates multiverse
           deb http://us.archive....


Changed in linux (Ubuntu):
assignee: Joseph Salisbury (jsalisbury) → nobody
Changed in linux (Ubuntu Xenial):
assignee: Joseph Salisbury (jsalisbury) → nobody
Changed in linux (Ubuntu Yakkety):
assignee: Joseph Salisbury (jsalisbury) → nobody
Changed in linux (Ubuntu Xenial):
status: Incomplete → Invalid
Changed in linux (Ubuntu):
status: Incomplete → Invalid
James Hanna (james-isdept) wrote :

Please forgive if this comment is outside etiquette.

I am trying to install Ubuntu or Debian on a Dell Poweredge 750, and get this error after a few minutes of uptime:

aacraid: Host adapter abort request (0,0,0,0) [repeat >15x]
AAC: Host adapter BLINK LED 0xc7
AAC0: adapter kernel panic'd c7.
[last two lines repeat again twice]

Am installing in a workplace at request of IS Dept manager, to determine feasibility of using Linux on a server to run SQL server. If it fails, management probably won't consider using Linux, period.

The server has been running Windows 7 to support its previous role at the company, and was not experiencing any issues of any kind.

Question: is there currently a workaround for this issue, so I can use the Dell PowerEdge 750 hardware with a RAID-1 setup?

Considering moving the drives to non-RAID controller to avoid driver issue. Not sure if this will work, but would like to avoid project failure.

Thanks for your time and efforts here!

James Hanna (james-isdept) wrote :

Follow up of my previous comment:

Compiled and installed RHEL 8 kernel. RHEL 8 aacraid driver does not seem to have the bug.

System Rescue CD v5.2.2 (http://www.system-rescue-cd.org) aacraid module does not encounter the bug.

However Debian did not boot "out of the box" if I installed either of these kernels and generated initrd. It would always drop to a shell after failing to mount the root volume, which was an ext4 MBR partition. The partition mounts fine after booting System Rescue CD, and does not suffer from the aacraid module bug.

Speculation: failure to mount at boot with RHEL 8 or System Rescue CD is a Debian-specific init issue. Did not perform further troubleshooting using the RAID card.

Pulled the CERC 1.5/6ch RAID card and attached the drives directly to the SATA ports on the motherboard, to continue the project.

Brad Figg (brad-figg)
tags: added: cscc