Zesty (4.10.0-14) won't boot on HP ProLiant DL360 Gen9 with intel_iommu=on

Bug #1679208 reported by Christian Ehrhardt 
This bug affects 2 people
Affects: linux (Ubuntu)    Status: Incomplete    Importance: High    Assigned to: Unassigned

Bug Description

TL;DR
- one of our HP ProLiant DL360 Gen9 systems fails to boot with intel_iommu=on
- the disk controller fails
- Xenial seems to work for a while but then fails
- Zesty crashes 100% of the time on boot
- an identical system seems to work, so a HW replacement may be needed to finally confirm

After the reboot the HW tells me this on boot:
Embedded RAID : Smart HBA H240ar Controller - Operation Failed
 - 1719-Slot 0 Drive Array - A controller failure event occurred prior
   to this power-up. (Previous lock up code = 0x13)

I tried several things (in between I always redeployed Zesty with MAAS).
I think my debugging might be helpful, but I wanted to keep the documentation in the bug in case you go another route, or in case others find useful information in here.

0. I retried the whole procedure twice, it is fully reproducible
   That is:
   0.1 install Zesty
   0.2 change the grub default cmdline in /etc/default/grub.d/50-curtin-settings.cfg to add intel_iommu=on (see the sketch below)
   0.3 sudo update-grub
   0.4 reboot
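
   A minimal sketch of what step 0.2 amounts to (the drop-in file name is taken from step 5 below; the file's other curtin-managed contents are omitted here for illustration):
   # /etc/default/grub.d/50-curtin-settings.cfg (relevant line only)
   GRUB_CMDLINE_LINUX_DEFAULT="intel_iommu=on"
   $ sudo update-grub
   $ sudo reboot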

1. I tried a Recovery boot from the boot options in grub.
   => Failed as well

2. Rebooted via iLO, both via "request reboot" and via "full system reset"
   => both Failed

3. Reboot the system as deployed by MAAS
   # /proc/cmdline before that
   BOOT_IMAGE=/boot/vmlinuz-4.10.0-14-generic root=UUID=2137c19a-d441-43fa-82e2-f2b7e3b2727b ro
   The original grub.cfg looks like http://paste.ubuntu.com/24305945/
   It reboots as-is.
   => Reboot worked

4. Without changing anything in /etc, run update-grub
   $ sudo update-grub
   Generating grub configuration file ...
   Warning: Setting GRUB_TIMEOUT to a non-zero value when GRUB_HIDDEN_TIMEOUT is set is no longer supported.
   Found linux image: /boot/vmlinuz-4.10.0-14-generic
   Found initrd image: /boot/initrd.img-4.10.0-14-generic
   Adding boot menu entry for EFI firmware configuration
   done

   There was no diff between the new grub.cfg and the one I saved.
   => Reboot worked

5. add the intel_iommu=on arg
  $ sudo sed -i 's/GRUB_CMDLINE_LINUX_DEFAULT=""/GRUB_CMDLINE_LINUX_DEFAULT="intel_iommu=on"/' /etc/default/grub.d/50-curtin-settings.cfg
  $ sudo update-grub
  # The diff in grub.cfg really is only the iommu setting
  => Reboot Failed
  So this no longer looks like a cloud-init/curtin/MAAS bug to me - maybe intel_iommu behaves differently?
  - Checked grub.cfg pre/post: no change other than the expected one (comparison sketched below)
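
  A sketch of how that pre/post comparison can be done (file names are illustrative):
  $ sudo cp /boot/grub/grub.cfg /root/grub.cfg.before
  $ sudo update-grub
  $ diff -u /root/grub.cfg.before /boot/grub/grub.cfg
  # only the added intel_iommu=on on the "linux ..." lines should show up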

6. Install Xenial and do the same
   => Reboot working

7. Upgrade to Zesty
   Since the Xenial system just worked, and one can assume that almost only the kernel matters so early in the boot process, I upgraded the working system with intel_iommu=on to Zesty.
   That would be 4.4.0-71-generic to 4.10.0-14-generic.
   On this upgrade I finally saw my I/O errors again :-/
   Note: these issues are hard to miss as they remount root read-only (a quick check is sketched below).
   I wonder if they only ever appear with intel_iommu=on, as that is the only combination in which I have ever seen them.
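
   A quick way to spot that read-only remount, using standard tools (shown for illustration):
   $ findmnt -no OPTIONS /          # shows "ro,..." once the errors have hit
   $ dmesg | grep -iE 'remount|i/o error'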

8. Redeploy and upgrade to Z without intel_iommu=on enabled
   Then enable intel_iommu=on and reboot
   => Reboot Fail
   From here I rebooted into the Xenial kernel (which, since this was an upgrade, was still there)
   Here I saw:
    Loading Linux 4.4.0-71-generic ...
    Loading initial ramdisk ...
    error: invalid video mode specification `text'.
    Booting in blind mode
   Hrm, as outlined above the "blind mode" message might be a red herring, but since this kernel worked before, the real problem might also hide in the initrd that got regenerated on the upgrade.
   => Xenial Kernel Reboot - works !!
   So "blind mode" is a red herring of some sort.

   But this might allow me to find some logs
   => No
   It appears as if the failing boot never made it to the point of actually writing anything.
   I see:
    1. the original xenial
    2. the upgraded zesty
    3. NOT THE zesty+iommu
    4. the xenial+iommu

$ egrep 'kernel:.*(Linux version|Command line)' /var/log/syslog
Apr 3 12:15:20 node-horsea kernel: [ 0.000000] Linux version 4.4.0-71-generic (buildd@lcy01-05) (gcc version 5.4.0 20160609 (Ubuntu 5.4.0-6ubuntu1~16.04.4) ) #92-Ubuntu SMP Fri Mar 24 12:59:01 UTC 2017 (Ubuntu 4.4.0-71.92-generic 4.4.49)
Apr 3 12:15:20 node-horsea kernel: [ 0.000000] Command line: BOOT_IMAGE=/boot/vmlinuz-4.4.0-71-generic root=UUID=2137c19a-d441-43fa-82e2-f2b7e3b2727b ro
Apr 3 12:47:45 node-horsea kernel: [ 0.000000] Linux version 4.10.0-14-generic (buildd@lcy01-01) (gcc version 6.3.0 20170221 (Ubuntu 6.3.0-8ubuntu1) ) #16-Ubuntu SMP Fri Mar 17 15:19:26 UTC 2017 (Ubuntu 4.10.0-14.16-generic 4.10.3)
Apr 3 12:47:45 node-horsea kernel: [ 0.000000] Command line: BOOT_IMAGE=/boot/vmlinuz-4.10.0-14-generic root=UUID=2137c19a-d441-43fa-82e2-f2b7e3b2727b ro
Apr 3 13:15:49 node-horsea kernel: [ 0.000000] Linux version 4.4.0-71-generic (buildd@lcy01-05) (gcc version 5.4.0 20160609 (Ubuntu 5.4.0-6ubuntu1~16.04.4) ) #92-Ubuntu SMP Fri Mar 24 12:59:01 UTC 2017 (Ubuntu 4.4.0-71.92-generic 4.4.49)
Apr 3 13:15:49 node-horsea kernel: [ 0.000000] Command line: BOOT_IMAGE=/boot/vmlinuz-4.4.0-71-generic root=UUID=2137c19a-d441-43fa-82e2-f2b7e3b2727b ro intel_iommu=on

9. Trying to avoid HW replacement if not needed
I was afraid I might need the HW to be replaced to be 100% sure, but this very much smells like it is broken in SW to me already.
To avoid an RT ticket for a replacement without real need, I asked for another system to be freed up.

So I finally could free up an identical machine.
I especially checked the failing HP Smart Array controller; it has the same product version and FW revision.

On that machine things seem to work, so I might be down to replacing the HW :-/

10. Get some messages of the failure:
With the following grub cmdline I got to see the failure (full serial console setup sketched below):
GRUB_CMDLINE_LINUX_DEFAULT="intel_iommu=on --- console=ttyS1,115200"

It looks just like the failure I saw on the running system when intel_iommu=on is set on the Xenial kernel, only happening later there (sometimes after minutes, sometimes after days, but never without intel_iommu).
But on Zesty it seems to trigger 100% of the time on boot, so the system does not even come up.
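
For completeness, a sketch of the serial console setup assumed here (ttyS1 at 115200 matching the cmdline above; the GRUB_TERMINAL/GRUB_SERIAL_COMMAND lines in /etc/default/grub are illustrative additions):
GRUB_CMDLINE_LINUX_DEFAULT="intel_iommu=on --- console=ttyS1,115200"
GRUB_TERMINAL="console serial"
GRUB_SERIAL_COMMAND="serial --unit=1 --speed=115200"
$ sudo update-grub && sudo reboot
# the output can then be captured via the iLO virtual serial port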

I'll attach a few logs of the crashes; the relevant heads are (how to pull these from the kernel log is sketched after the excerpts):
[ 33.426069] hpsa 0000:03:00.0: Acknowledging event: 0x80000000 (HP SSD Smart Path configuration change)
[ 618.567636] DMAR: DRHD: handling fault status reg 2
[ 618.567922] DMAR: DMAR:[DMA Read] Request device [03:00.0] fault addr ffafc000
               DMAR:[fault reason 06] PTE Read access is not set

Or
[ 159.779566] hpsa 0000:03:00.0: Command timed out.
[ 159.801113] hpsa 0000:03:00.0: hpsa_send_abort_ioaccel2: Tag:0x00000000:000000d0: unknown abort service response 0x00
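
Such lines can be pulled from the kernel log once the system (or the serial console) is reachable, e.g. (illustrative):
$ dmesg | grep -E 'DMAR|hpsa|blk_update_request'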

While it might be a HW issue, I file this anyway so it is findable for anyone else in case it eventually turns out not to be HW.
But I assign myself for now, to close/confirm once I have replaced the HW.

Revision history for this message
Christian Ehrhardt  (paelzer) wrote :
Changed in linux (Ubuntu):
assignee: nobody → ChristianEhrhardt (paelzer)
Revision history for this message
Christian Ehrhardt  (paelzer) wrote :

Please note that on the "good" system nobody has ever used IOMMU device assignment; I'll do so over the next days. That way we should also learn whether that can bring a good system into the failing mode (a quick check of the IOMMU groups is sketched below).
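
A quick way to confirm the IOMMU is active and to look at the groups before trying device assignment (standard sysfs paths, shown for illustration):
$ dmesg | grep -e DMAR -e IOMMU
$ find /sys/kernel/iommu_groups/ -type l | sort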

Revision history for this message
Brad Figg (brad-figg) wrote : Missing required logs.

This bug is missing log files that will aid in diagnosing the problem. From a terminal window please run:

apport-collect 1679208

and then change the status of the bug to 'Confirmed'.

If, due to the nature of the issue you have encountered, you are unable to run this command, please add a comment stating that fact and change the bug status to 'Confirmed'.

This change has been made by an automated script, maintained by the Ubuntu Kernel Team.

Changed in linux (Ubuntu):
status: New → Incomplete
Revision history for this message
Christian Ehrhardt  (paelzer) wrote :

In what appeared to be a similar case, https://bugzilla.redhat.com/show_bug.cgi?id=649766,
it was recommended to set iommu=pt, but in our case that does not help (sketched after the log below).

It changes the messages but still fails on boot:

[ 75.256554] DMAR: [DMA Read] Request device [03:00.0] fault addr fec0e000 [fault reason 06] PTE Read access is not set
[ 199.315689] blk_update_request: I/O error, dev sda, sector 1116802096
[ 199.345283] EXT4-fs error (device sda2): ext4_find_entry:1463: inode #34865359: comm ureadahead: reading directory lblock 0
[ 199.345284] blk_update_request: I/O error, dev sda, sector 399530240
[ 199.345294] blk_update_request: I/O error, dev sda, sector 399532288
[ 199.353290] sd 0:0:1:0: rejecting I/O to offline device
[ 199.353314] sd 0:0:1:0: rejecting I/O to offline device
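
For reference, the passthrough attempt is just one more cmdline flag next to intel_iommu=on (same curtin drop-in file as in step 5; shown for illustration):
GRUB_CMDLINE_LINUX_DEFAULT="intel_iommu=on iommu=pt"
$ sudo update-grub && sudo reboot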

Changed in linux (Ubuntu):
importance: Undecided → High
status: Incomplete → Triaged
tags: added: kernel-da-key
Revision history for this message
Christian Ehrhardt  (paelzer) wrote :

We just found that the second box we tested just needed some time (or I/O) to run into the same issue.

[ 8710.266192] DMAR: DRHD: handling fault status reg 2
[ 8710.289318] DMAR: [DMA Read] Request device [03:00.0] fault addr f8bf5000 [fault reason 06] PTE Read access is not set
[ 8865.745527] blk_update_request: I/O error, dev sda, sector 349218832
[ 8865.775217] Buffer I/O error on device bcache0, logical block 19238912
[ 8865.804664] Buffer I/O error on device bcache0, logical block 19238913
[ 8865.834530] Buffer I/O error on device bcache0, logical block 19238914
[ 8865.864004] Buffer I/O error on device bcache0, logical block 19238915
[ 8865.893787] Buffer I/O error on device bcache0, logical block 19238916
[ 8865.923772] Buffer I/O error on device bcache0, logical block 19238917
[ 8865.953105] Buffer I/O error on device bcache0, logical block 19238918
[ 8865.982733] Buffer I/O error on device bcache0, logical block 19238919
[ 8866.012426] Buffer I/O error on device bcache0, logical block 19238920
[ 8866.041939] Buffer I/O error on device bcache0, logical block 19238921
[ 8866.071403] sd 0:0:1:0: rejecting I/O to offline device
[ 8866.095709] sd 0:0:1:0: rejecting I/O to offline device

Note: in those states the system is still alive but remounted r/o.

Revision history for this message
Christian Ehrhardt  (paelzer) wrote :

Unassigning myself, as "broken HW" no longer seems a plausible explanation; leaving this for the kernel team to re-assign.

Please let me know what next steps you need.

Revision history for this message
Christian Ehrhardt  (paelzer) wrote :

This was brought to my attention:
http://h20564.www2.hpe.com/hpsc/doc/public/display?docId=emr_na-c04805565

While it has no obvious relation to why this would be triggered by the IOMMU (it should isolate accesses, not link them together, right?), it might be worth the FW upgrade to verify whether it fixes the issue.

I'll report back once I have been able to do so.

Revision history for this message
Christian Ehrhardt  (paelzer) wrote :

Updated the system to the latest FW for the storage controller (4.52) and iLO (2.50).
With that I updated to Zesty again and everything is still working fine.

From here I enabled intel_iommu=on and it booted, which already is an improvement.

Revision history for this message
Christian Ehrhardt  (paelzer) wrote :

From there I ran some I/O, as on the other system we had the impression that further I/O triggers it.
But the system is still fine - so this bug might serve as good documentation for the next person hitting it, but the TL;DR is: "FW bug - do the FW update".

Per this conclusion I'm setting the kernel task to "invalid".

Changed in linux (Ubuntu):
status: Triaged → Invalid
assignee: ChristianEhrhardt (paelzer) → nobody
Revision history for this message
Christian Ehrhardt  (paelzer) wrote :

It seems I was too quick to be happy about the FW being the fix; it turns out the issue still pops up.
Just not on boot.
After about 6 hours of work I ran into it again.

... attaching the latest dmesg messages

Changed in linux (Ubuntu):
status: Invalid → New
Revision history for this message
Brad Figg (brad-figg) wrote :

This bug is missing log files that will aid in diagnosing the problem. From a terminal window please run:

apport-collect 1679208

and then change the status of the bug to 'Confirmed'.

If, due to the nature of the issue you have encountered, you are unable to run this command, please add a comment stating that fact and change the bug status to 'Confirmed'.

This change has been made by an automated script, maintained by the Ubuntu Kernel Team.

Changed in linux (Ubuntu):
status: New → Incomplete
Revision history for this message
Christian Ehrhardt  (paelzer) wrote :

After the reboot it only took about 10 minutes this time to hit me again :-/
There seems to be no reliable way to be sure anymore.

tags: added: kernel-key
removed: kernel-da-key
Revision history for this message
Joseph Salisbury (jsalisbury) wrote :

Would it be possible for you to test the latest upstream kernel? Refer to https://wiki.ubuntu.com/KernelMainlineBuilds . Please test the latest v4.11 kernel[0].

If this bug is fixed in the mainline kernel, please add the following tag 'kernel-fixed-upstream'.

If the mainline kernel does not fix this bug, please add the tag: 'kernel-bug-exists-upstream'.

Once testing of the upstream kernel is complete, please mark this bug as "Confirmed".

Thanks in advance.

[0] http://kernel.ubuntu.com/~kernel-ppa/mainline/v4.11-rc5

Revision history for this message
Christian Ehrhardt  (paelzer) wrote :

Hard to install when the system is so broken :-/
I installed the latest mainline kernel, which is 4.11.0-041100rc5.201704022131, and enabled intel_iommu=on on it (roughly as sketched below).
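
Roughly how such a mainline build gets installed for testing (a sketch; the exact .deb file names differ per build and architecture, check the index page in [0] above and substitute the placeholders):
$ wget http://kernel.ubuntu.com/~kernel-ppa/mainline/v4.11-rc5/<linux-headers-..._all>.deb \
       http://kernel.ubuntu.com/~kernel-ppa/mainline/v4.11-rc5/<linux-headers-...-generic_amd64>.deb \
       http://kernel.ubuntu.com/~kernel-ppa/mainline/v4.11-rc5/<linux-image-...-generic_amd64>.deb
$ sudo dpkg -i linux-*.deb
$ sudo reboot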

- Reboot: ok
- Try to trigger with I/O
  - fio: still ok (an illustrative job is sketched after the log below)
  - apt: working
  - random work on the system: crash

That said, it is verified to fail on 4.11.0-041100rc5.201704022131 as well.
The error messages on that kernel are similar:

[ 5624.375286] DMAR: DRHD: handling fault status reg 2
[ 5624.397959] DMAR: [DMA Read] Request device [03:00.0] fault addr fbd85000 [fault reason 06] PTE Read access is not set
[ 5686.804464] blk_update_request: I/O error, dev sda, sector 824203256
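
For reference, the kind of fio run used when trying to force the fault; the exact job parameters are not recorded in this bug, so the values here are illustrative only:
$ sudo fio --name=iommu-stress --directory=/var/tmp --size=4G --ioengine=libaio \
           --direct=1 --rw=randrw --bs=4k --iodepth=32 --numjobs=4 \
           --runtime=600 --time_based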

Revision history for this message
Joseph Salisbury (jsalisbury) wrote :

Is this bug a regression? Did this issue start happening after an update/upgrade? Was there a prior kernel version where you were not having this particular problem?

It might be worth testing the Trusty or Precise kernel.

If it is a regression, we can perform a kernel bisect to identify the commit that introduced this.

Revision history for this message
Christian Ehrhardt  (paelzer) wrote :

Hi Joseph,
no regression IMHO.
Only the frequency and signature of the issue changed with kernel upgrades.
I have not yet gone back further than Xenial, but that is worth a try as soon as I find time for it again.

I'd still almost think it is a FW issue, but then there is no newer FW available.
Do we have a way to mirror issues to HP as the HW manufacturer?

Revision history for this message
Christian Ehrhardt  (paelzer) wrote :

Subscribing Narinder to map that to HPE if possible.

tags: added: kernel-da-key
removed: kernel-key
Revision history for this message
Narinder Gupta (narindergupta) wrote :

I have subscribed Eddie and Ganesh from HPE to this bug, to check whether they have tested this with an older kernel or whether some parameter needs to be added.

Revision history for this message
Eddie Campbell (eddie-campbell) wrote :

Christian, was this ever resolved?
