System won't boot after upgrade to 16.04 with 4.4.0 kernel

Bug #1653162 reported by Jon Schewe on 2016-12-29
This bug affects 4 people
Affects          Status      Importance  Assigned to  Milestone
linux (Ubuntu)   Invalid     Critical    Unassigned
  Xenial         Invalid     Critical    Unassigned
  Yakkety        Won't Fix   Critical    Unassigned

Bug Description

I have a Dell PowerEdge 1800 with a Dell CERC 1.5/6ch RAID controller in it. Everything worked fine under Ubuntu 14.04.1. After I upgraded to 16.04.1, the system won't boot: it can't find the root filesystem, and "host adapter dead" errors appear on the screen many times. This happens with kernel 4.4.0-57 and with 4.8.0-32. When I switch back to kernel 3.13.0-105 the system boots just fine.

I have mptbios 5.06.04.
The CERC card says bios version 4.1-0 [Build 7403]

When booting into the recovery kernel I get error messages about host adapter dead.
Eventually I get a message:
"aacraid: aac_fib_send: first asynchronous command timed out.
Usually a result of a PCI interrupt routing problem
update mother board BIOS or consider utilizing one of
the SAFE mode kernel options (acpi, apic etc)"

Using kernel parameter "intel_iommu=on" doesn't help. This was suggested on a post that I saw about these errors.

Booting with "noapic" didn't help.
Booting with "noapic noacpi" didn't help.

I've also tried the dkms modules from Adaptec and that doesn't seem to help either.

I can't figure out how to upgrade the firmware or BIOS on the system from Ubuntu either.

This may be related to this bug: https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1552551?comments=all
---
ApportVersion: 2.20.1-0ubuntu2.4
Architecture: amd64
AudioDevicesInUse: Error: command ['fuser', '-v', '/dev/snd/seq', '/dev/snd/timer'] failed with exit code 1:
DistroRelease: Ubuntu 16.04
HibernationDevice: RESUME=UUID=beec17b5-eac8-4e94-ac99-65b63c428b77
InstallationDate: Installed on 2014-02-24 (1039 days ago)
InstallationMedia: Ubuntu 12.04.3 LTS "Precise Pangolin" - Release amd64 (20130820.1)
IwConfig:
 eth1 no wireless extensions.

 lo no wireless extensions.
Lsusb:
 Bus 001 Device 001: ID 1d6b:0002 Linux Foundation 2.0 root hub
 Bus 004 Device 001: ID 1d6b:0001 Linux Foundation 1.1 root hub
 Bus 003 Device 001: ID 1d6b:0001 Linux Foundation 1.1 root hub
 Bus 002 Device 001: ID 1d6b:0001 Linux Foundation 1.1 root hub
MachineType: Dell Computer Corporation PowerEdge 1800
Package: linux (not installed)
ProcEnviron:
 TERM=screen
 PATH=(custom, no user)
 XDG_RUNTIME_DIR=<set>
 LANG=en_US.UTF-8
 SHELL=/bin/bash
ProcFB: 0 radeondrmfb
ProcKernelCmdLine: BOOT_IMAGE=/boot/vmlinuz-3.13.0-105-generic root=UUID=83e1b87e-adb9-4798-a6e0-18d401c91a4a ro nosplash noplymouth
ProcVersionSignature: Ubuntu 3.13.0-105.152-generic 3.13.11-ckt39
PulseList: Error: command ['pacmd', 'list'] failed with exit code 1: No PulseAudio daemon running, or not running as session daemon.
RelatedPackageVersions:
 linux-restricted-modules-3.13.0-105-generic N/A
 linux-backports-modules-3.13.0-105-generic N/A
 linux-firmware 1.157.6
RfKill:

Tags: xenial
Uname: Linux 3.13.0-105-generic x86_64
UpgradeStatus: Upgraded to xenial on 2016-12-25 (3 days ago)
UserGroups: adm cdrom dip lpadmin plugdev sambashare sudo
_MarkForUpload: True
dmi.bios.date: 09/01/2004
dmi.bios.vendor: Dell Computer Corporation
dmi.bios.version: A00
dmi.board.name: 0X7500
dmi.board.vendor: Dell Computer Corporation
dmi.board.version: A00
dmi.chassis.type: 17
dmi.chassis.vendor: Dell Computer Corporation
dmi.modalias: dmi:bvnDellComputerCorporation:bvrA00:bd09/01/2004:svnDellComputerCorporation:pnPowerEdge1800:pvr:rvnDellComputerCorporation:rn0X7500:rvrA00:cvnDellComputerCorporation:ct17:cvr:
dmi.product.name: PowerEdge 1800
dmi.sys.vendor: Dell Computer Corporation

Jon Schewe (jpschewe) wrote :

I tried some other kernel parameters:

scsi_scan=sync -> didn't help
pci=nomsi,nommconf -> didn't help

For now I'm just booting the 3.13.0 kernel since I need the server up and don't have good physical access to it.

This bug is missing log files that will aid in diagnosing the problem. From a terminal window please run:

apport-collect 1653162

and then change the status of the bug to 'Confirmed'.

If, due to the nature of the issue you have encountered, you are unable to run this command, please add a comment stating that fact and change the bug status to 'Confirmed'.

This change has been made by an automated script, maintained by the Ubuntu Kernel Team.

Changed in linux (Ubuntu):
status: New → Incomplete

apport information

tags: added: apport-collected xenial
description: updated

Changed in linux (Ubuntu):
status: Incomplete → Confirmed
tags: added: bios-outdated-a07
Changed in linux (Ubuntu):
importance: Undecided → Low
status: Confirmed → Incomplete
Jon Schewe (jpschewe) wrote :

I updated the BIOS to A07, as this is the latest listed for my service tag 1VMBS61.
I also updated the BMC BIOS and the SAS controller BIOS.
It still won't boot with kernel 4.4.0-57.

>sudo dmidecode -s bios-version && sudo dmidecode -s bios-release-date
A07
09/29/2006

Changed in linux (Ubuntu):
status: Incomplete → Confirmed

Jon Schewe, in order to allow additional upstream developers to examine the issue, at your earliest convenience, could you please test the latest upstream kernel available from http://kernel.ubuntu.com/~kernel-ppa/mainline/?C=N;O=D ? Please keep in mind the following:
1) The one to test is on the very top line of the page (not the daily folder).
2) The release names are irrelevant.
3) The folder time stamps aren't indicative of when the kernel actually was released upstream.
4) Install instructions are available at https://wiki.ubuntu.com/Kernel/MainlineBuilds .

If testing on your main install would be inconvenient, one may:
1) Install Ubuntu to a different partition and then test this there.
2) Backup, or clone the primary install.

If the latest kernel does not allow you to test for the issue (e.g. you can't boot into the OS), please comment on that in your report and continue with the next most recent kernel version until you can test for the issue. Once you've tested the upstream kernel, please comment on which kernel version specifically you tested. If this issue is fixed in the mainline kernel, please add the following tags by clicking on the yellow circle with a black pencil icon, next to the word Tags, located at the bottom of the report description:
kernel-fixed-upstream
kernel-fixed-upstream-X.Y-rcZ

Where X, and Y are the first two numbers of the kernel version, and Z is the release candidate number if it exists.

If the mainline kernel does not fix the issue, please add the following tags:
kernel-bug-exists-upstream
kernel-bug-exists-upstream-X.Y-rcZ

Please note, a failure to install the kernel does not fit the criteria of kernel-bug-exists-upstream.
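
The tag naming scheme above can be sketched as a small helper. This is only an illustration, assuming the version string looks like the mainline directory names (for example "v4.10-rc4"); the function name upstream_tags is hypothetical and not part of any Launchpad tooling.

```python
import re


def upstream_tags(kernel_version, fixed):
    """Return the two tags described above for a mainline version such as 'v4.10-rc4'.

    Hypothetical helper: X and Y are the first two version numbers,
    Z is the release candidate number if there is one.
    """
    m = re.fullmatch(r"v?(\d+)\.(\d+)(?:\.\d+)?(?:-rc(\d+))?", kernel_version)
    if m is None:
        raise ValueError("unrecognised kernel version: %r" % kernel_version)
    major, minor, rc = m.groups()
    prefix = "kernel-fixed-upstream" if fixed else "kernel-bug-exists-upstream"
    suffix = "%s.%s" % (major, minor)
    if rc is not None:
        suffix += "-rc%s" % rc
    return [prefix, "%s-%s" % (prefix, suffix)]


# Example: the bug still reproduces on the v4.10-rc4 mainline build.
print(upstream_tags("v4.10-rc4", fixed=False))
```

For a final release without a release candidate number, the "-rcZ" part is simply left off.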

Also, you don't need to apport-collect further unless specifically requested to do so.

Once testing of the latest upstream kernel is complete, it is most helpful if you mark this report's Status as Confirmed.

Lastly, to keep this issue relevant to upstream, please continue to test the latest mainline kernel as it becomes available.

Thank you for your help.

tags: added: latest-bios-a07 latest-firmware
removed: bios-outdated-a07
Changed in linux (Ubuntu):
importance: Low → Medium
status: Confirmed → Incomplete
tags: added: kernel-da-key
Changed in linux (Ubuntu):
importance: Medium → High
Jon Schewe (jpschewe) wrote :

I just tried the kernel in the directory v4.10-rc4/.

tags: added: ernel-bug-exists-upstream-4.10-rc4 kernel-bug-exists-upstream
Joseph Salisbury (jsalisbury) wrote :

I'd like to perform a bisect to figure out what commit caused this regression. We need to identify the earliest kernel where the issue started happening as well as the latest kernel that did not have this issue.

Can you test the following kernels and report back? We are looking for the first kernel version that exhibits this bug:

3.13 final: http://kernel.ubuntu.com/~kernel-ppa/mainline/v3.13-trusty/
3.16 final: http://kernel.ubuntu.com/~kernel-ppa/mainline/v3.16-utopic/
4.0 final: http://kernel.ubuntu.com/~kernel-ppa/mainline/v4.0-vivid/
4.2 final: http://kernel.ubuntu.com/~kernel-ppa/mainline/v4.2-wily/
4.4 final: http://kernel.ubuntu.com/~kernel-ppa/mainline/v4.4-wily/

You don't have to test every kernel, just up until the kernel that first has this bug.

Thanks in advance!

Changed in linux (Ubuntu):
importance: High → Critical
Changed in linux (Ubuntu Xenial):
status: New → Incomplete
importance: Undecided → Critical
Changed in linux (Ubuntu Yakkety):
status: New → Triaged
Changed in linux (Ubuntu Xenial):
status: Incomplete → Triaged
Changed in linux (Ubuntu):
status: Incomplete → Triaged
Changed in linux (Ubuntu Yakkety):
importance: Undecided → Critical
Changed in linux (Ubuntu):
assignee: nobody → Joseph Salisbury (jsalisbury)
Changed in linux (Ubuntu Xenial):
assignee: nobody → Joseph Salisbury (jsalisbury)
Changed in linux (Ubuntu Yakkety):
assignee: nobody → Joseph Salisbury (jsalisbury)
tags: added: kernel-key
removed: kernel-da-key
Joseph Salisbury (jsalisbury) wrote :

@Jon Schewe, were you able to test any of the kernels posted in comment #19?

Jon Schewe (jpschewe) wrote :

I installed all of the kernels and then rebooted multiple times starting with my working kernel first and then going up in version numbers.

3.13.0-105 -> boots

3.13 -> kernel panic, can't execute /bin/sh
3.16 -> boots
4.0 -> boots
4.2 -> boots, after some time got some messages about host adapter reset with possible SCSI hang, but it appears OK
4.4 -> boots
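
For anyone following the bisect, here is a rough sketch of how these results bound the regression. The outcome labels ("boots", "panic", "host adapter dead") and the function name bisect_bounds are made up for this illustration; the 3.13 panic is treated as a separate failure and not as this bug.

```python
# Mainline results as reported above: (kernel version, outcome label).
results = [
    ("3.13", "panic"),  # kernel panic, can't execute /bin/sh (a different failure)
    ("3.16", "boots"),
    ("4.0", "boots"),
    ("4.2", "boots"),   # booted, with later host adapter reset messages
    ("4.4", "boots"),
]


def version_key(version):
    """Turn '4.10' into (4, 10) so versions sort numerically."""
    return tuple(int(part) for part in version.split("."))


def bisect_bounds(results, bad_outcome="host adapter dead"):
    """Return (last_good, first_bad); either may be None if not yet found."""
    last_good = None
    first_bad = None
    for version, outcome in sorted(results, key=lambda r: version_key(r[0])):
        if outcome == bad_outcome:
            first_bad = version
            break
        if outcome == "boots":
            last_good = version
    return last_good, first_bad


print(bisect_bounds(results))  # ('4.4', None)
```

With only these results there is no first bad mainline kernel yet, so the range to bisect is still open at the upper end.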

Joseph Salisbury (jsalisbury) wrote :

So the 3.13 final kernel hits the panic, but the 4.4 final kernel does not, using the links posted in comment #19?

Can you also test the Xenial -proposed kernel? It is available from:
https://launchpad.net/~canonical-kernel-team/+archive/ubuntu/ppa/+build/11877924

For that kernel, you need to install both the linux-image and linux-image-extra .deb packages.

tags: added: kernel-da-key
removed: kernel-key
Changed in linux (Ubuntu):
status: Triaged → Incomplete
Changed in linux (Ubuntu Xenial):
status: Triaged → Incomplete
Changed in linux (Ubuntu Yakkety):
status: Triaged → Incomplete
Jon Schewe (jpschewe) wrote :

I finally got time to get access to the computer again. I tried the 4.4.0-62 kernel from xenial-proposed and received "host adapter dead".

Since I last tested I see that 4.4.0-75 is available, so I tried that. The system booted. It gave me some messages about SCSI adapter reset and a message about the SCSI bus being hung, but after that the system appears to have booted OK and I'm able to access the disk.

Jon Schewe (jpschewe) wrote :

After running for about 24 hours I got a large number of "aacraid: Host adapter abort request (2,0,0,0)" errors and then the following:

AAC: Host adapter BLINK LED 0x7
AAC0: adapter kernel panic'd 7.
sd 2:0:0:0: Device offlined - not ready after error recovery
sd 2:0:0:0: Device offlined - not ready after error recovery
sd 2:0:0:0: Device offlined - not ready after error recovery
sd 2:0:0:0: Device offlined - not ready after error recovery
sd 2:0:0:0: Device offlined - not ready after error recovery
sd 2:0:0:0: Device offlined - not ready after error recovery

At this point my root filesystem is read-only. It won't reboot remotely because I've got logging turned on for sudo. So I'll need to physically go to the system and do a hard reset.

inch eye (incheye) wrote :

I'm getting the same issue trying to install ubuntu-16.04.2-desktop-amd64

TBH I'm a bit of a newb when it comes to Linux. Can anyone tell me how to work around this? The only difference here is that I'm trying to dual boot Windows 7 and Ubuntu.

any help would be really appreciated

This bug was nominated against a series that is no longer supported, ie yakkety. The bug task representing the yakkety nomination is being closed as Won't Fix.

This change has been made by an automated script, maintained by the Ubuntu Kernel Team.

Changed in linux (Ubuntu Yakkety):
status: Incomplete → Won't Fix

As far as I know, this is still a current issue. It has nothing to do with Ubuntu; it's a mainline kernel problem. I was working with someone earlier this year and after tracing it to a specific commit (I believe the same one that added "4k" sector support), I was sent somewhere else and kinda dropped it there. I was working on a customer's PC that was finished and needed to be returned. I have since bought a CERC card that is an exact match and may be able to help more if needed. I looked over the code a bit and there were quite a number of changes made to the module at the time to use a whole new set of functions :(. It acts to me like an interrupt timing/sync issue, but the last time I dealt with that was an ISA interrupt sharing methodology in 1990.

Forgot to add: I think it was either 4.0.1 or 4.1.0 that broke it, if I remember right. This bug is linked to my earlier one somehow, but I don't know how to point you there. I got the system running by removing a package like "linux-kernel-generic", which depends on the current version, and then installing 4.0.1 (or 4.0.9 or 4.0.0 or whatever) as a permanent kernel version. It's probably also possible to "pin" a certain version of the generic-kernel package, but I was never able to successfully do that.

Jon Schewe (jpschewe) wrote :

The last Ubuntu kernel that worked for me was 3.13.0-105. The 4.0 vivid testing kernel also worked for me. However, I've gone back to 3.13.0-105 as it's part of the Ubuntu repository. I'm hoping that this gets fixed so that I can move to a recent kernel.

Kai-Heng Feng (kaihengfeng) wrote :

There are lots of fixes in mainline, so it's probably a good idea to try v4.13-rc3.

Jon Schewe (jpschewe) wrote :

I just tried the latest stable kernel in the repository, 4.4.0-91, and it boots, but after a while I get aacraid host adapter abort request messages and a SCSI hang message. It seems to happen most often under load.

4.13.0-041300rc3-generic_4.13.0-041300rc3.201707301631 ->
[ 138.983840] aacraid: Host adapter abort request.
               aacraid: Outstanding commands on (2,0,0,0):
[ 139.815831] aacraid: Host adapter abort request.
               aacraid: Outstanding commands on (2,0,0,0):
[ 139.828527] aacraid: Host adapter reset request. SCSI hang ?
[ 139.834955] aacraid 0000:03:09.0: outstanding cmd: midlevel-0
[ 139.834957] aacraid 0000:03:09.0: outstanding cmd: lowlevel-0
[ 139.834959] aacraid 0000:03:09.0: outstanding cmd: error handler-1
[ 139.834961] aacraid 0000:03:09.0: outstanding cmd: firmware-1
[ 139.834963] aacraid 0000:03:09.0: outstanding cmd: kernel-0

4.13.0-041300rc4-lowlatency_4.13.0-041300rc4.201708062231 ->
[ 127.974997] aacraid: Host adapter abort request.
               aacraid: Outstanding commands on (2,0,0,0):
[ 127.988132] aacraid: Host adapter abort request.
               aacraid: Outstanding commands on (2,0,0,0):
[ 128.000780] aacraid: Host adapter reset request. SCSI hang ?
[ 128.006827] aacraid 0000:03:09.0: outstanding cmd: midlevel-0
[ 128.006830] aacraid 0000:03:09.0: outstanding cmd: lowlevel-0
[ 128.006833] aacraid 0000:03:09.0: outstanding cmd: error handler-1
[ 128.006835] aacraid 0000:03:09.0: outstanding cmd: firmware-1
[ 128.006837] aacraid 0000:03:09.0: outstanding cmd: kernel-0
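
To compare runs like the two above, the abort and reset events can be pulled out of a saved dmesg capture. This is a minimal sketch that assumes the log lines keep the exact "[ seconds] aacraid: Host adapter ... request" shape shown here; the function name summarize and the sample string are just for illustration.

```python
import re

# Matches timestamped aacraid abort/reset lines in the format pasted above.
EVENT = re.compile(
    r"^\[\s*(\d+\.\d+)\]\s+aacraid: Host adapter (abort|reset) request"
)


def summarize(log_text):
    """Return a list of (timestamp_seconds, kind) for abort/reset events."""
    events = []
    for line in log_text.splitlines():
        m = EVENT.match(line.strip())
        if m:
            events.append((float(m.group(1)), m.group(2)))
    return events


sample = """\
[ 138.983840] aacraid: Host adapter abort request.
               aacraid: Outstanding commands on (2,0,0,0):
[ 139.815831] aacraid: Host adapter abort request.
               aacraid: Outstanding commands on (2,0,0,0):
[ 139.828527] aacraid: Host adapter reset request. SCSI hang ?
"""

events = summarize(sample)
aborts = [t for t, kind in events if kind == "abort"]
resets = [t for t, kind in events if kind == "reset"]
print("aborts:", len(aborts), "resets:", len(resets))
if aborts and resets:
    # Time between the first abort request and the first reset request.
    print("abort -> reset delay (s):", round(resets[0] - aborts[0], 3))
```

Continuation lines without a timestamp (the "Outstanding commands" lines) are skipped on purpose.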

Daniel Reinhardt (cryptodan) wrote :

Joseph,

This bug goes all the way back to CentOS 5 and kernel 2.6.

I have a stable machine on the following system:

cryptodan@capricorn:~$ inxi -Fxxxrpc0
System: Host: capricorn Kernel: 3.13.0-24-generic i686 (32 bit, gcc: 4.8.2) Console: tty 1 Distro: Ubuntu 14.04 trusty
Machine: System: Dell product: PowerEdge 4600 Chassis: type: 17
           Mobo: Dell model: 0H3009 version: A00 Bios: Dell version: A13 date: 10/21/2004
CPU(s): 2 Single core Intel Xeon CPUs (-HT-SMP-) cache: 1024 KB flags: (pae sse sse2) bmips: 11961.4
           Clock Speeds: 1: 2990.346 MHz 2: 2990.346 MHz 3: 2990.346 MHz 4: 2990.346 MHz
Graphics: Card: Advanced Micro Devices [AMD/ATI] Rage XL PCI bus-ID: 00:0e.0 chip-ID: 1002:4752
           X-Vendor: N/A driver: N/A tty size: 100x35 Advanced Data: N/A out of X
Network: Card-1: Intel 82557/8/9/0/1 Ethernet Pro 100
           driver: e100 ver: 3.5.24-k2-NAPI port: e8c0 bus-ID: 00:08.0 chip-ID: 8086:1229
           IF: eth2 state: down mac: 00:02:b3:4b:1b:d9
           Card-2: Intel 82546EB Gigabit Ethernet Controller (Copper)
           driver: e1000 ver: 7.3.21-k8-NAPI port: bcc0 bus-ID: 08:06.0 chip-ID: 8086:1010
           IF: eth0 state: down mac: 00:04:23:d0:b5:e2
           Card-3: Intel 82546EB Gigabit Ethernet Controller (Copper)
           driver: e1000 ver: 7.3.21-k8-NAPI port: bc80 bus-ID: 08:06.1 chip-ID: 8086:1010
           IF: eth1 state: up speed: 1000 Mbps duplex: full mac: 00:04:23:d0:b5:e3
Drives: HDD Total Size: 2099.6GB (0.1% used)
           1: id: /dev/sda model: system size: 300.0GB serial: 8EDB485F temp: 0C
           2: id: /dev/sdb model: homepart size: 1799.6GB serial: 326F485F temp: 0C
Partition: ID: / size: 92G used: 377M (1%) fs: ext4 ID: /boot size: 922M used: 35M (5%) fs: ext4
           ID: /usr size: 92G used: 745M (1%) fs: ext4 ID: /var size: 69G used: 527M (1%) fs: ext4
           ID: /home size: 1.7T used: 69M (1%) fs: ext4 ID: swap-1 size: 24.00GB used: 0.00GB (0%) fs: swap
RAID: System: supported: N/A
           No RAID devices detected - /proc/mdstat and md_mod kernel raid module present
           Unused Devices: none
Sensors: None detected - is lm-sensors installed and configured?
Repos: Active apt sources in file: /etc/apt/sources.list
           deb http://us.archive.ubuntu.com/ubuntu/ trusty main restricted
           deb-src http://us.archive.ubuntu.com/ubuntu/ trusty main restricted
           deb http://us.archive.ubuntu.com/ubuntu/ trusty-updates main restricted
           deb-src http://us.archive.ubuntu.com/ubuntu/ trusty-updates main restricted
           deb http://us.archive.ubuntu.com/ubuntu/ trusty universe
           deb-src http://us.archive.ubuntu.com/ubuntu/ trusty universe
           deb http://us.archive.ubuntu.com/ubuntu/ trusty-updates universe
           deb-src http://us.archive.ubuntu.com/ubuntu/ trusty-updates universe
           deb http://us.archive.ubuntu.com/ubuntu/ trusty multiverse
           deb-src http://us.archive.ubuntu.com/ubuntu/ trusty multiverse
           deb http://us.archive.ubuntu.com/ubuntu/ trusty-updates multiverse
           deb-src http://us.archive.ubuntu.com/ubuntu/ trusty-updates multiverse
           deb http://us.archive....

Changed in linux (Ubuntu):
assignee: Joseph Salisbury (jsalisbury) → nobody
Changed in linux (Ubuntu Xenial):
assignee: Joseph Salisbury (jsalisbury) → nobody
Changed in linux (Ubuntu Yakkety):
assignee: Joseph Salisbury (jsalisbury) → nobody
Changed in linux (Ubuntu Xenial):
status: Incomplete → Invalid
Changed in linux (Ubuntu):
status: Incomplete → Invalid