AACRAID for power9 platform

Bug #1689980 reported by Narinder Gupta on 2017-05-11
This bug affects 2 people
Affects | Importance | Assigned to
The Ubuntu-power-systems project | Undecided | Unassigned
linux (Ubuntu) | Medium | Seth Forshee
linux (Ubuntu Zesty) | Medium | Seth Forshee

Bug Description

SRU Justification

Impact: Various issues found during integration of Microsemi adapters on Power 8 and Power 9 systems.

Fix: Update the aacraid driver to version 50834 by cherry-picking patches from upstream and from linux-next.

Test case: Updated driver has been tested by IBM on Power systems and regression tested by Microsemi on other platforms.

Regression potential: Changes (aside from trivial typo fixes) are limited to the aacraid driver, so there is no regression potential outside of this driver. Regression testing has been done by Microsemi on relevant platforms.

---

Hello team,
It seems the Microsemi team has requested a backport of a few AACRAID driver patches for the Power 9 platform to the 17.04 and 16.04.3 kernels.

Hi Narinder,

We have submitted a handful of critical fixes for the AACRAID driver to kernel.org, targeting 4.11. These patches are bug fixes, and our customer IBM is expecting them to go into the 16.04.3 release for use with the Power9 platforms.

Can you please let me know if these patches will be backported by your kernel team into the upcoming 16.04.3 release? Do you also have a schedule in place for the 16.04.3 release?

Thanks,
Gana

Below are the patch details.

Subject: [PATCH V2 00/19] aacraid: Patchset with reset rework and misc fixes

This patchset primarily focuses on tweaking and hardening the controller reset support for both ARC and HBA1000 devices. The driver now resets the controller only through eh reset. It also includes an SRB memory fix and a PCI DMA allocation fix.

Changes in V2:
 - Corrected the heading and description for the SRB memory patch and removed
   a stray comment.
 - Removed an incorrect up() call and cleared the fib wait flag after the call
   to down_interruptible() in the "ioctl returns on ctrl reset" patch.
 - Added review acknowledgements by David Carroll; thank you, Dave, for
   finding the issues in the two patches above.

Raghava Aditya Renukunta (19):
[SCSI] aacraid: Remove __GFP_DMA for raw srb memory
[SCSI] aacraid: Fix DMAR issues with iommu=pt
[SCSI] aacraid: Added 32 and 64 queue depth for arc natives
[SCSI] aacraid: Set correct Queue Depth for HBA1000 RAW disks
[SCSI] aacraid: Remove reset support from check_health
[SCSI] aacraid: Change wait time for fib completion
[SCSI] aacraid: Log count info of scsi cmds before reset
[SCSI] aacraid: Print ctrl status before eh reset
[SCSI] aacraid: Using single reset mask for IOP reset
[SCSI] aacraid: Rework IOP reset
[SCSI] aacraid: Add periodic checks to see IOP reset status
[SCSI] aacraid: Rework SOFT reset code
[SCSI] aacraid: Rework aac_src_restart
[SCSI] aacraid: Use correct function to get ctrl health
[SCSI] aacraid: Make sure ioctl returns on controller reset
[SCSI] aacraid: Enable ctrl reset for both hba and arc
[SCSI] aacraid: Add reset debugging statements
[SCSI] aacraid: Remove reference to Series-9
[SCSI] aacraid: Update driver version to 50834

drivers/scsi/aacraid/aachba.c | 17 ++-
drivers/scsi/aacraid/aacraid.h | 22 +++-
drivers/scsi/aacraid/commctrl.c | 15 ++-
drivers/scsi/aacraid/comminit.c | 18 +---
drivers/scsi/aacraid/commsup.c | 78 +++++++-------
drivers/scsi/aacraid/linit.c | 232 ++++++++++++++++++++++++----------------
drivers/scsi/aacraid/src.c | 136 +++++++++++++----------

 7 files changed, 298 insertions(+), 220 deletions(-)

This bug is missing log files that will aid in diagnosing the problem. From a terminal window please run:

apport-collect 1689980

and then change the status of the bug to 'Confirmed'.

If, due to the nature of the issue you have encountered, you are unable to run this command, please add a comment stating that fact and change the bug status to 'Confirmed'.

This change has been made by an automated script, maintained by the Ubuntu Kernel Team.

Changed in linux (Ubuntu):
status: New → Incomplete
Narinder Gupta (narindergupta) wrote :

This is driver enablement for Power 9 so apport-collect does not make sense.

Changed in linux (Ubuntu):
status: Incomplete → Confirmed
Changed in linux (Ubuntu):
importance: Undecided → Medium
status: Confirmed → Triaged
Changed in linux (Ubuntu Zesty):
status: New → Triaged
importance: Undecided → Medium
tags: added: kernel-da-key zesty
Manoj Iyer (manjo) wrote :

Patches are in linux-next:

5d3a0880787b scsi: aacraid: Remove __GFP_DMA for raw srb memory
96a580fcd745 scsi: aacraid: Fix DMAR issues with iommu=pt
7ad76ab57203 scsi: aacraid: Added 32 and 64 queue depth for arc natives
db62b8248626 scsi: aacraid: Set correct Queue Depth for HBA1000 RAW disks
be70a99e5386 scsi: aacraid: Remove reset support from check_health
30ad417155a8 scsi: aacraid: Change wait time for fib completion
dce4bd517aa2 scsi: aacraid: Log count info of scsi cmds before reset
b791530208ae scsi: aacraid: Print ctrl status before eh reset
52fa7b51eedd scsi: aacraid: Using single reset mask for IOP reset
f2ff3fdf82bc scsi: aacraid: Rework IOP reset
dec430cefee5 scsi: aacraid: Add periodic checks to see IOP reset status
92ea09df1c11 scsi: aacraid: Rework SOFT reset code
8022c90ec2b9 scsi: aacraid: Rework aac_src_restart
b5a7478a18eb scsi: aacraid: Use correct function to get ctrl health
69c727ba1840 scsi: aacraid: Make sure ioctl returns on controller reset
9c35add987b2 scsi: aacraid: Enable ctrl reset for both hba and arc
2d5504c53a18 scsi: aacraid: Add reset debugging statements
5d9cb9c96198 scsi: aacraid: Remove reference to Series-9
0e98ceae7bfe scsi: aacraid: Update driver version to 50834

------- Comment From <email address hidden> 2017-05-26 16:42 EDT-------
Reverse mirror for LP1689980: AACRAID for power9 platform

tags: added: architecture-ppc64le bugnameltc-155096 severity-high targetmilestone-inin---
bugproxy (bugproxy) on 2017-05-26
tags: added: targetmilestone-inin16043
removed: targetmilestone-inin---
Seth Forshee (sforshee) wrote :

Based on just the patch descriptions I'm a bit skeptical that all of these qualify as "critical fixes."

This is a shared driver, so we're going to be more conservative about what we would take than if it were architecture-specific code. The best case is if the actual fixes can be isolated and backported. Regression testing on a variety of hardware which uses that driver would be beneficial, and likely required if this is a large delta.

Brian King (brking) wrote :

The aacraid driver is critical for support of Power 9 for IBM, since it is used for the onboard SATA controller on one of our systems. Given the patches have been tested as a group by Microsemi, my preference would be to take the entire series and remain more in sync with upstream. If Canonical is not comfortable with that, we can review the patch series in more detail, look for non-critical patches, and then test with the remaining patches.

Manoj Iyer (manjo) wrote :

Assigning this to the Canonical kernel team, as this could potentially impact certification of Power 9, since IBM says this driver is also required for the onboard SATA.

Changed in linux (Ubuntu):
assignee: nobody → Canonical Kernel Team (canonical-kernel-team)
Manoj Iyer (manjo) on 2017-06-01
tags: added: ubuntu-17.04
Seth Forshee (sforshee) wrote :

Diffstat from the requested backports:

 drivers/scsi/aacraid/aachba.c | 17 +++---
 drivers/scsi/aacraid/aacraid.h | 22 ++++++--
 drivers/scsi/aacraid/commctrl.c | 15 +++---
 drivers/scsi/aacraid/comminit.c | 18 ++-----
 drivers/scsi/aacraid/commsup.c | 78 +++++++++++++--------------
 drivers/scsi/aacraid/linit.c | 232 ++++++++++++++++++++++++++++++++++++++++++++++++-------------------------------
 drivers/scsi/aacraid/src.c | 136 +++++++++++++++++++++++++++-------------------
 7 files changed, 298 insertions(+), 220 deletions(-)

So that's fairly substantial. I skimmed the patches and most of them look pretty safe, just a handful that are somewhat concerning.
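In a `git diff --stat` summary, the number after each file name counts that file's inserted plus deleted lines, so the per-file figures should add up to the insertions and deletions on the summary line. A quick Python check of the diffstat above; the regular expression is an assumption about the diffstat layout, not a git API:

```python
import re

# Per-file lines copied from the diffstat above.
DIFFSTAT = """\
 drivers/scsi/aacraid/aachba.c | 17 +++---
 drivers/scsi/aacraid/aacraid.h | 22 ++++++--
 drivers/scsi/aacraid/commctrl.c | 15 +++---
 drivers/scsi/aacraid/comminit.c | 18 ++-----
 drivers/scsi/aacraid/commsup.c | 78 +++++++++++++--------------
 drivers/scsi/aacraid/linit.c | 232 ++++++++++++++++++++++++++++++++++++++++++++++++-------------------------------
 drivers/scsi/aacraid/src.c | 136 +++++++++++++++++++++++++++-------------------
"""

# Matches "<path> | <count>"; the "N files changed, ..." summary line does not match.
LINE_RE = re.compile(r"^\s*(\S+)\s*\|\s*(\d+)")


def total_changed(diffstat: str) -> int:
    """Sum the per-file changed-line counts (insertions + deletions)."""
    total = 0
    for line in diffstat.splitlines():
        m = LINE_RE.match(line)
        if m:
            total += int(m.group(2))
    return total
```

The seven files account for 518 changed lines, which equals the 298 insertions plus 220 deletions reported on the summary line.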

It's likely we can take these if/when the backport has been thoroughly tested. Is this hardware commonly used on platforms other than Power 9? If so it would be good to see some testing on those platforms, if possible.

Brian King (brking) wrote :

This is hardware that is new for Power 9, was not on Power 8. As to availability on other architectures, Microsemi would need to comment.

Hello Seth,
Yes, the hardware has been tested thoroughly on platforms other than Power 9 (e.g. x86). Could you please point us to the patches that seem concerning?

Also, could you please point me to the kernel that these patches will be added to? I want to make sure that we did not miss any important patches.

Regards,
Raghava Aditya

Seth Forshee (sforshee) wrote :

It was mainly the reset changes that raised concerns in my mind, since it does make changes to interaction with the hardware for performing a reset. If there's other hardware out there then I'd like to have some regression testing with that hardware, because sometimes even changes which should be correct can expose firmware bugs. But it sounds like regression testing has been done.

The kernel for zesty (which will eventually be used for the 16.04.3 kernel) is hosted here:

https://git.launchpad.net/~ubuntu-kernel/ubuntu/+source/linux/+git/zesty

The master-next branch is kind of the staging area for the next kernel, so you may want to look there as it will have all the most recent commits (be aware that it may be force pushed from time to time). The master branch always tracks the kernel currently in -updates.

Thank you Seth for the link. It looks like other than this set of patches, the others made it in.
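One way to make this comparison is to look for the series' subject lines in the tree's history. A minimal sketch, assuming a local clone of the zesty tree with the Launchpad remote named `origin`; the helper names are hypothetical. It matches commit subjects rather than SHA-1s because cherry-picked commits get new IDs in the Ubuntu tree, so a backport whose subject was reworded would be reported as missing.

```python
import subprocess

# Subjects of the 19-patch series, as listed in the linux-next commits in this bug.
REQUIRED_SUBJECTS = [
    "scsi: aacraid: Remove __GFP_DMA for raw srb memory",
    "scsi: aacraid: Fix DMAR issues with iommu=pt",
    "scsi: aacraid: Added 32 and 64 queue depth for arc natives",
    "scsi: aacraid: Set correct Queue Depth for HBA1000 RAW disks",
    "scsi: aacraid: Remove reset support from check_health",
    "scsi: aacraid: Change wait time for fib completion",
    "scsi: aacraid: Log count info of scsi cmds before reset",
    "scsi: aacraid: Print ctrl status before eh reset",
    "scsi: aacraid: Using single reset mask for IOP reset",
    "scsi: aacraid: Rework IOP reset",
    "scsi: aacraid: Add periodic checks to see IOP reset status",
    "scsi: aacraid: Rework SOFT reset code",
    "scsi: aacraid: Rework aac_src_restart",
    "scsi: aacraid: Use correct function to get ctrl health",
    "scsi: aacraid: Make sure ioctl returns on controller reset",
    "scsi: aacraid: Enable ctrl reset for both hba and arc",
    "scsi: aacraid: Add reset debugging statements",
    "scsi: aacraid: Remove reference to Series-9",
    "scsi: aacraid: Update driver version to 50834",
]


def missing_subjects(required, present):
    """Return the required subjects absent from `present`, keeping their order."""
    have = {s.strip() for s in present}
    return [s for s in required if s not in have]


def subjects_in_tree(repo, rev="origin/master-next"):
    """List subjects of commits on `rev` that touch drivers/scsi/aacraid."""
    out = subprocess.run(
        ["git", "-C", repo, "log", "--format=%s", rev, "--", "drivers/scsi/aacraid"],
        check=True, capture_output=True, text=True,
    ).stdout
    return out.splitlines()
```

For example, after `git clone https://git.launchpad.net/~ubuntu-kernel/ubuntu/+source/linux/+git/zesty`, calling `missing_subjects(REQUIRED_SUBJECTS, subjects_in_tree("zesty"))` lists any patch of the series not yet on master-next. Since that branch may be force-pushed, the result only reflects the branch at fetch time.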

bugproxy (bugproxy) wrote :

------- Comment From <email address hidden> 2017-06-13 13:30 EDT-------
I have reviewed the upstream patches for aacraid that have been accepted since commit e498520edec6655e93ac5e768b04f4fd2299fe4d (the upstream level that matches the current Ubuntu Zesty aacraid level). With the exception of the spelling correction in comments, all of the patches are important. Most of them address problem areas we found during integration of Microsemi adapters on Power 8 and Power 9 systems, due to differences in how PCI errors and recovery are managed on Power. None are trivial enough to justify diverging from upstream or incurring the risk of a missing dependency or merge conflicts.

Here are the patches needed to bring the driver up-to-date:

0e98ceae7bfe76fda99d95421161ab6b8f60c0c5 Wed May 10 09:39:53 2017 -0700 scsi: aacraid: Update driver version to 50834
5d9cb9c961981b0f7dcd588c1b1d7a566d5dcae8 Wed May 10 09:39:52 2017 -0700 scsi: aacraid: Remove reference to Series-9
2d5504c53a1888da1c70086f5a999940c9432cf4 Wed May 10 09:39:51 2017 -0700 scsi: aacraid: Add reset debugging statements
9c35add987b238a8cb9fa12b0beba7bc06d918bf Wed May 10 09:39:50 2017 -0700 scsi: aacraid: Enable ctrl reset for both hba and arc
69c727ba1840a22aa19c5d4bc7aa14a28fd72773 Wed May 10 09:39:49 2017 -0700 scsi: aacraid: Make sure ioctl returns on controller reset
b5a7478a18ebbe45e70e8cccf71dd877bfdc8281 Wed May 10 09:39:48 2017 -0700 scsi: aacraid: Use correct function to get ctrl health
8022c90ec2b9a2bcddc2df34418eadda8f935800 Wed May 10 09:39:47 2017 -0700 scsi: aacraid: Rework aac_src_restart
92ea09df1c113829e2fd479fc952a855c36d7e53 Wed May 10 09:39:46 2017 -0700 scsi: aacraid: Rework SOFT reset code
dec430cefee5941ae1a7132057e11a1ac7395a17 Wed May 10 09:39:45 2017 -0700 scsi: aacraid: Add periodic checks to see IOP reset status
f2ff3fdf82bcf61f4d3c52175cee22bbecc90cc9 Wed May 10 09:39:44 2017 -0700 scsi: aacraid: Rework IOP reset
52fa7b51eedd6fce654bd2fc43ba607636e1a60b Wed May 10 09:39:43 2017 -0700 scsi: aacraid: Using single reset mask for IOP reset
b791530208ae289db2a80d4a011e8bbeefbb1d09 Wed May 10 09:39:42 2017 -0700 scsi: aacraid: Print ctrl status before eh reset
dce4bd517aa2a2277ba92458679f962b6af3e239 Wed May 10 09:39:41 2017 -0700 scsi: aacraid: Log count info of scsi cmds before reset
30ad417155a8026c12cfefa9b2ca7f448d6570ba Wed May 10 09:39:40 2017 -0700 scsi: aacraid: Change wait time for fib completion
be70a99e53862ce0abf579a8a6116cefcb9155d7 Wed May 10 09:39:39 2017 -0700 scsi: aacraid: Remove reset support from check_health
db62b82486269148d2f90f3bd750c4abe2af2840 Wed May 10 09:39:38 2017 -0700 scsi: aacraid: Set correct Queue Depth for HBA1000 RAW disks
7ad76ab572037fae99c244dbd97cc5db763a31db Wed May 10 09:39:37 2017 -0700 scsi: aacraid: Added 32 and 64 queue depth for arc natives
96a580fcd7452dc4c136a8159501d4b60399f80d Wed May 10 09:39:36 2017 -0700 scsi: aacraid: Fix DMAR issues with iommu=pt
5d3a0880787bf64b2749e1073b810b02f4deb03b Wed May 10 09:39:35 2017 -0700 scsi: aacraid: Remove __GFP_DMA for raw srb memory
f481973d5efdb63b7c6ca6b0ecd2b8462556a461 Wed Apr 5 16:14:16 2017 +0530 scsi: aacraid: pci_alloc_consistent() failures on AR...


Hello Seth,
Is there any update on the status of this request? Could you please let us know if there is a delay, or if Canonical needs anything from Microsemi?

Regards,
Raghava Aditya

Seth Forshee (sforshee) wrote :

Sorry for the delay - I think we have what we need, I've just been a little buried with work and haven't gotten around to following through yet. It's on my todo list, I should get to it soon.

Changed in linux (Ubuntu):
assignee: Canonical Kernel Team (canonical-kernel-team) → Seth Forshee (sforshee)

Thank you, Seth, for the confirmation. Could you let me know a rough estimate?

Seth Forshee (sforshee) wrote :

Pull request sent to the mailing list:

https://lists.ubuntu.com/archives/kernel-team/2017-June/084977.html

All patches have been applied to artful/master-next and unstable/master.

description: updated
Changed in linux (Ubuntu Zesty):
assignee: nobody → Seth Forshee (sforshee)
status: Triaged → In Progress
Changed in linux (Ubuntu):
status: Triaged → Fix Committed
bugproxy (bugproxy) wrote :

------- Comment From <email address hidden> 2017-06-21 09:09 EDT-------
> All patches have been applied to artful/master-next and unstable/master.

Thanks. I understand we also need them in Zesty. We would like to have these patches in the 16.04.3 installer kernel, since this driver is used in the install process and having the fixes in the installer is critical.

On Wed, Jun 21, 2017 at 01:19:47PM -0000, bugproxy wrote:
> ------- Comment From <email address hidden> 2017-06-21 09:09 EDT-------
> > All patches have been applied to artful/master-next and unstable/master.
>
> Thanks. I understand we also need them in Zesty. We would like to have
> these patches in the 16.04.3 installer kernel, since this driver is
> used in the install process and having the fixes in the installer
> is critical.

The link I supplied in comment #17 is the pull request for zesty. I can
apply things directly for the development kernel, but for stable kernels
the patches must go through the SRU process.

Stefan Bader (smb) on 2017-06-22
Changed in linux (Ubuntu Zesty):
status: In Progress → Fix Committed

This bug is awaiting verification that the kernel in -proposed solves the problem. Please test the kernel and update this bug with the results. If the problem is solved, change the tag 'verification-needed-zesty' to 'verification-done-zesty'. If the problem still exists, change the tag 'verification-needed-zesty' to 'verification-failed-zesty'.

If verification is not done by 5 working days from today, this fix will be dropped from the source code, and this bug will be closed.

See https://wiki.ubuntu.com/Testing/EnableProposed for documentation on how to enable and use -proposed. Thank you!

tags: added: verification-needed-zesty
bugproxy (bugproxy) on 2017-07-10
tags: added: verification-done-zesty
removed: verification-needed-zesty
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package linux - 4.11.0-10.15

---------------
linux (4.11.0-10.15) artful; urgency=low

  * linux: 4.11.0-10.15 -proposed tracker (LP: #1701271)

  * Artful update to v4.11.8 stable release (LP: #1701269)
    - clk: sunxi-ng: a31: Correct lcd1-ch1 clock register offset
    - clk: sunxi-ng: v3s: Fix usb otg device reset bit
    - clk: sunxi-ng: sun5i: Fix ahb_bist_clk definition
    - xen/blkback: fix disconnect while I/Os in flight
    - xen-blkback: don't leak stack data via response ring
    - ALSA: firewire-lib: Fix stall of process context at packet error
    - ALSA: pcm: Don't treat NULL chmap as a fatal error
    - ALSA: hda - Add Coffelake PCI ID
    - ALSA: hda - Apply quirks to Broxton-T, too
    - fs/exec.c: account for argv/envp pointers
    - powerpc/perf: Fix oops when kthread execs user process
    - autofs: sanity check status reported with AUTOFS_DEV_IOCTL_FAIL
    - fs/dax.c: fix inefficiency in dax_writeback_mapping_range()
    - lib/cmdline.c: fix get_options() overflow while parsing ranges
    - perf/x86/intel: Add 1G DTLB load/store miss support for SKL
    - perf probe: Fix probe definition for inlined functions
    - KVM: x86: fix singlestepping over syscall
    - KVM: MIPS: Fix maybe-uninitialized build failure
    - KVM: s390: gaccess: fix real-space designation asce handling for gmap
      shadows
    - KVM: PPC: Book3S HV: Cope with host using large decrementer mode
    - KVM: PPC: Book3S HV: Preserve userspace HTM state properly
    - KVM: PPC: Book3S HV: Ignore timebase offset on POWER9 DD1
    - KVM: PPC: Book3S HV: Context-switch EBB registers properly
    - KVM: PPC: Book3S HV: Restore critical SPRs to host values on guest exit
    - KVM: PPC: Book3S HV: Save/restore host values of debug registers
    - CIFS: Improve readdir verbosity
    - CIFS: Fix some return values in case of error in 'crypt_message'
    - cxgb4: notify uP to route ctrlq compl to rdma rspq
    - HID: Add quirk for Dell PIXART OEM mouse
    - random: silence compiler warnings and fix race
    - signal: Only reschedule timers on signals timers have sent
    - powerpc/kprobes: Pause function_graph tracing during jprobes handling
    - powerpc/64s: Handle data breakpoints in Radix mode
    - Input: i8042 - add Fujitsu Lifebook AH544 to notimeout list
    - brcmfmac: add parameter to pass error code in firmware callback
    - brcmfmac: use firmware callback upon failure to load
    - brcmfmac: unbind all devices upon failure in firmware callback
    - time: Fix clock->read(clock) race around clocksource changes
    - time: Fix CLOCK_MONOTONIC_RAW sub-nanosecond accounting
    - arm64/vdso: Fix nsec handling for CLOCK_MONOTONIC_RAW
    - target: Fix kref->refcount underflow in transport_cmd_finish_abort
    - iscsi-target: Fix delayed logout processing greater than
      SECONDS_FOR_LOGOUT_COMP
    - iscsi-target: Reject immediate data underflow larger than SCSI transfer
      length
    - drm/radeon: add a PX quirk for another K53TK variant
    - drm/radeon: add a quirk for Toshiba Satellite L20-183
    - drm/amdgpu/atom: fix ps allocation size for EnableDispPowerGating
    - drm/amdgpu: adjust default display clock
   ...


Changed in linux (Ubuntu):
status: Fix Committed → Fix Released
bugproxy (bugproxy) on 2017-07-12
tags: removed: bugnameltc-155096 kernel-da-key severity-high ubuntu-17.04 verification-done-zesty zesty
tags: added: verification-done-zesty
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package linux - 4.10.0-28.32

---------------
linux (4.10.0-28.32) zesty; urgency=low

  * linux: 4.10.0-28.32 -proposed tracker (LP: #1701013)

  * KILLER1435-S[0489:e0a2] BT cannot search BT 4.0 device (LP: #1699651)
    - Bluetooth: btusb: Add support for 0489:e0a2 QCA_ROME device

  * aacraid driver may return uninitialized stack data to userspace
    (LP: #1700077)
    - SAUCE: scsi: aacraid: Don't copy uninitialized stack memory to userspace

  * CVE-2017-9605
    - drm/vmwgfx: Make sure backup_handle is always valid

  * CVE-2017-1000380
    - ALSA: timer: Fix race between read and ioctl
    - ALSA: timer: Fix missing queue indices reset at SNDRV_TIMER_IOCTL_SELECT

  * XDP eBPF programs fail to verify on Zesty ppc64el (LP: #1699627)
    - [Config] ppc64el: build for Power8 not Power7

  * AACRAID for power9 platform (LP: #1689980)
    - scripts/spelling.txt: add "therfore" pattern and fix typo instances
    - scsi: aacraid: fix PCI error recovery path
    - scsi: aacraid: pci_alloc_consistent() failures on ARM64
    - scsi: aacraid: Remove __GFP_DMA for raw srb memory
    - scsi: aacraid: Fix DMAR issues with iommu=pt
    - scsi: aacraid: Added 32 and 64 queue depth for arc natives
    - scsi: aacraid: Set correct Queue Depth for HBA1000 RAW disks
    - scsi: aacraid: Remove reset support from check_health
    - scsi: aacraid: Change wait time for fib completion
    - scsi: aacraid: Log count info of scsi cmds before reset
    - scsi: aacraid: Print ctrl status before eh reset
    - scsi: aacraid: Using single reset mask for IOP reset
    - scsi: aacraid: Rework IOP reset
    - scsi: aacraid: Add periodic checks to see IOP reset status
    - scsi: aacraid: Rework SOFT reset code
    - scsi: aacraid: Rework aac_src_restart
    - scsi: aacraid: Use correct function to get ctrl health
    - scsi: aacraid: Make sure ioctl returns on controller reset
    - scsi: aacraid: Enable ctrl reset for both hba and arc
    - scsi: aacraid: Add reset debugging statements
    - scsi: aacraid: Remove reference to Series-9
    - scsi: aacraid: Update driver version to 50834

  * arm64 kernel crashdump support (LP: #1694859)
    - memblock: add memblock_clear_nomap()
    - memblock: add memblock_cap_memory_range()
    - arm64: limit memory regions based on DT property, usable-memory-range
    - arm64: kdump: reserve memory for crash dump kernel
    - arm64: mm: add set_memory_valid()
    - arm64: mm: use phys_addr_t instead of unsigned long in __map_memblock
    - arm64: kdump: protect crash dump kernel memory
    - arm64: hibernate: preserve kdump image around hibernation
    - arm64: kdump: implement machine_crash_shutdown()
    - arm64: kdump: add VMCOREINFO's for user-space tools
    - [Config] CONFIG_CRASH_DUMP=y on arm64
    - arm64: kdump: provide /proc/vmcore file
    - Documentation: kdump: describe arm64 port
    - Documentation: dt: chosen properties for arm64 kdump
    - efi/libstub/arm*: Set default address and size cells values for an empty dtb

  * hibmc driver does not include "pci:" prefix in bus ID (LP: #1698700)
    - SAUCE: drm: hibmc: Use set_busid function from drm core

  * Processes in "D" state due to za...


Changed in linux (Ubuntu Zesty):
status: Fix Committed → Fix Released
Manoj Iyer (manjo) on 2017-07-19
Changed in ubuntu-power-systems:
status: New → Fix Released
Manoj Iyer (manjo) on 2017-07-24
tags: added: triage-g