Zesty (4.10.0-14) won't boot on HP ProLiant DL360 Gen9 with intel_iommu=on
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
linux (Ubuntu) | Incomplete | High | Unassigned |
Bug Description
TL;DR
- one of our HP ProLiant DL360 Gen9 systems fails to boot with intel_iommu=on
- the Disk controller fails
- Xenial seems to work for a while but then fails
- Zesty 100% crashes on boot
- An identical system seems to work, so I need a HW replacement to finally confirm
After the reboot the HW reports this on boot:
Embedded RAID : Smart HBA H240ar Controller - Operation Failed
- 1719-Slot 0 Drive Array - A controller failure event occurred prior
to this power-up. (Previous lock up code = 0x13)
I tried several things (always redeploying Zesty with MAAS in between).
I think my debugging might be helpful, but I wanted to keep the documentation in the bug in case you'd go another route or others find useful information in here.
0. I retried what I did twice, fully reproducible
That is:
0.1 install zesty
0.2 change grub default cmdline in /etc/default/
0.3 sudo update-grub
0.4 reboot
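To confirm after such a reboot that the parameter really reached the kernel, something like the following could be used. This is only a minimal sketch: `has_kernel_arg` is a hypothetical helper name and not part of the original steps, and it assumes the parameters are separated by single spaces.

```shell
# Hypothetical helper (not from the original steps): succeed if the kernel
# parameter $1 appears as a whole, space-separated word in the string $2.
has_kernel_arg() {
    case " $2 " in
        *" $1 "*) return 0 ;;
        *) return 1 ;;
    esac
}

# Example use on a running system (left commented out, it only reads /proc/cmdline):
# has_kernel_arg intel_iommu=on "$(cat /proc/cmdline)" && echo "intel_iommu=on is active"
```

Matching whole words avoids treating intel_iommu=off or a longer parameter with the same prefix as a hit.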
1. I tried a Recovery boot from the boot options in grub.
=> Failed as well
2. Rebooted via iLO with "request reboot" and also with "full system reset"
=> both Failed
3. Reboot the system as deployed by MAAS
# /proc/cmdline before that
BOOT_
The orig grub.cfg is like http://
It reboots as-is.
=> Reboot worked
4. without a change to anything in /etc run update-grub
$ sudo update-grub
Generating grub configuration file ...
Warning: Setting GRUB_TIMEOUT to a non-zero value when GRUB_HIDDEN_TIMEOUT is set is no longer supported.
Found linux image: /boot/vmlinuz-
Found initrd image: /boot/initrd.
Adding boot menu entry for EFI firmware configuration
done
There was no diff between the new grub.cfg and the one I saved.
=> Reboot worked
5. add the intel_iommu=on arg
$ sudo sed -i 's/GRUB_
$ sudo update-grub
# Diff in grub.cfg really only is the iommu setting
=> Reboot Failed
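The pre/post comparison of grub.cfg can be narrowed to the kernel lines, roughly as sketched below. The function names are hypothetical, and the sketch assumes the entries use plain `linux ...` lines (entries using a different keyword such as `linuxefi` would not be matched).

```shell
# Print the kernel ("linux ...") lines of a grub.cfg given as $1,
# one per line, with leading whitespace removed.
grub_linux_lines() {
    sed -n 's/^[[:space:]]*\(linux[[:space:]].*\)$/\1/p' "$1"
}

# Show how the kernel lines differ between two grub.cfg files ($1 and $2).
# Returns the exit status of diff: 0 = identical, 1 = different.
diff_grub_linux_lines() {
    old=$(mktemp) && new=$(mktemp) || return 2
    grub_linux_lines "$1" > "$old"
    grub_linux_lines "$2" > "$new"
    diff "$old" "$new"
    status=$?
    rm -f "$old" "$new"
    return $status
}

# Example (paths are placeholders for a saved copy and the live file):
# diff_grub_linux_lines grub.cfg.saved /boot/grub/grub.cfg
```

If the only change is the added parameter, the diff should show the same kernel lines with intel_iommu=on appended.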
So this doesn't seem so much of a cloud-init/
- Check grub.cfg pre/post: no change other than the expected one?
6. Install Xenial and do the same
=> Reboot working
7. Upgrade to Z
Since the Xenial system just worked, and one can assume that almost only the kernel is running this early in the boot process, I upgraded the working system with intel_iommu=on to Zesty.
That would be 4.4.0-71-generic to 4.10.0-1
On this upgrade I finally saw my I/O errors again :-/
Note: these issues are hard to miss as they leave root mounted read-only.
I wonder if they only ever appear with intel_iommu=on, as this is the only combination in which I ever saw them.
8. Redeploy and upgrade to Z without intel_iommu=on enabled
Then enable intel_iommu=on and reboot
=> Reboot Fail
From here I rebooted into the Xenial kernel (which was still there, since this was an upgrade)
Here I saw:
Loading Linux 4.4.0-71-generic ...
Loading initial ramdisk ...
error: invalid video mode specification `text'.
Booting in blind mode
Hrm, as outlined above the "blind mode" might be a red herring, but since this kernel worked before it might still be a red herring that swims in the initrd that got regenerated on the upgrade.
=> Xenial Kernel Reboot - works !!
So "blind mode" is a red herring of some sort.
But this might allow me to find some logs
=> No
It appears the failing boot never got far enough to actually write anything.
I see:
1. the original xenial
2. the upgraded zesty
3. NOT THE zesty+iommu
4. the xenial+iommu
$ egrep 'kernel:.*(Linux version|Command line)' /var/log/syslog
Apr 3 12:15:20 node-horsea kernel: [ 0.000000] Linux version 4.4.0-71-generic (buildd@lcy01-05) (gcc version 5.4.0 20160609 (Ubuntu 5.4.0-6ubuntu1~
Apr 3 12:15:20 node-horsea kernel: [ 0.000000] Command line: BOOT_IMAGE=
Apr 3 12:47:45 node-horsea kernel: [ 0.000000] Linux version 4.10.0-14-generic (buildd@lcy01-01) (gcc version 6.3.0 20170221 (Ubuntu 6.3.0-8ubuntu1) ) #16-Ubuntu SMP Fri Mar 17 15:19:26 UTC 2017 (Ubuntu 4.10.0-
Apr 3 12:47:45 node-horsea kernel: [ 0.000000] Command line: BOOT_IMAGE=
Apr 3 13:15:49 node-horsea kernel: [ 0.000000] Linux version 4.4.0-71-generic (buildd@lcy01-05) (gcc version 5.4.0 20160609 (Ubuntu 5.4.0-6ubuntu1~
Apr 3 13:15:49 node-horsea kernel: [ 0.000000] Command line: BOOT_IMAGE=
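The same information can be condensed to one line per boot, which makes a missing boot (here zesty+iommu) easier to spot. A minimal sketch, assuming the syslog format shown above; `list_boots` is a hypothetical name:

```shell
# Hypothetical helper: read syslog-style text on stdin and print, for each
# logged boot, the kernel version followed by "intel_iommu=on" if that
# parameter was on the command line, or "-" if it was not.
list_boots() {
    awk '
        /kernel:.*Linux version/ {
            # The version is the word right after "Linux version".
            for (i = 1; i < NF; i++)
                if ($i == "Linux" && $(i + 1) == "version") { ver = $(i + 2); break }
        }
        /kernel:.*Command line:/ {
            flag = ($0 ~ /(^|[ \t])intel_iommu=on([ \t]|$)/) ? "intel_iommu=on" : "-"
            print ver, flag
        }
    '
}

# Example: list_boots < /var/log/syslog
```

A boot that never got far enough to write to the log simply has no line in the output.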
9. Trying to avoid HW replacement if not needed
I was afraid I might need the HW to be replaced to be 100% sure, but this very much smells broken in SW to me already.
To avoid an RT ticket for a replacement without real need, I asked to free up another system.
So I finally could free up an identical machine.
I especially checked the failing HP smart array, it has the same Product Version and FW revision.
There things seem to work, so I might be down to replacing the HW :-/
10. Get some messages from the failure:
With the following grub cmdline I got to see the fail:
GRUB_CMDLINE_
It looks just like the failure I found on the running system with the Xenial kernel when intel_iommu=on is set; there it happens later (sometimes minutes, sometimes days, but never without intel_iommu).
But on Zesty it seems to trigger on every boot, so the system never even comes up.
I'll attach a few logs of the crashes, but the first lines are
[ 33.426069] hpsa 0000:03:00.0: Acknowledging event: 0x80000000 (HP SSD Smart Path configuration change)
[ 618.567636] DMAR: DRHD: handling fault status reg 2
[ 618.567922] DMAR: DMAR:[DMA Read] Request device [03:00.0] fault addr ffafc000
Or
[ 159.779566] hpsa 0000:03:00.0: Command timed out.
[ 159.801113] hpsa 0000:03:00.0: hpsa_send_
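To pull just the faulting device and address out of a longer log, a filter along these lines might help. It is only a sketch matched against the DMAR line format quoted above (other kernels may print it differently), and `dmar_faults` is a hypothetical name:

```shell
# Hypothetical helper: read kernel log text on stdin and print
# "<device> <fault type> <fault address>" for each DMAR fault line of the
# form "DMAR:[DMA Read] Request device [03:00.0] fault addr ffafc000".
dmar_faults() {
    sed -n 's/.*DMAR:\[\(DMA [A-Za-z]*\)] Request device \[\([0-9a-f:.]*\)] fault addr \([0-9a-f]*\).*/\2 \1 \3/p'
}

# Example: dmesg | dmar_faults
```

On the lines above this prints the device 03:00.0, which is the same PCI address the hpsa messages refer to.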
While it might be a HW issue, I am still filing this so it is "findable" for anyone else in case it eventually turns out not to be HW.
But I assign myself for now to close/confirm once I have replaced the HW.
Changed in linux (Ubuntu): | |
importance: | Undecided → High |
status: | Incomplete → Triaged |
tags: | added: kernel-da-key |
tags: | added: kernel-key removed: kernel-da-key |
tags: | added: kernel-da-key removed: kernel-key |
Please note that on the "good" system nobody ever used iommu device assignment; I'll do so in the next few days. That way we should also learn whether that can bring a good system into the failing mode.