crash with "Data Access Out of Range" when using nx-842 zswap on POWER9

Bug #1831536 reported by Stewart Smith
This bug affects 1 person
Affects                           Status      Importance  Assigned to  Milestone
The Ubuntu-power-systems project  New         Undecided   Unassigned
linux (Ubuntu)                    Incomplete  Undecided   Unassigned

Bug Description

On my two-socket POWER9 system (powernv) with 842 zswap set up, I
recently got a crash with the Ubuntu kernel. I haven't tried with
upstream, and this is the first time the system has died like this, so
I'm not sure how repeatable it is.

[ 2.891463] zswap: loaded using pool 842-nx/zbud
...
[15626.124646] nx_compress_powernv: ERROR: CSB still not valid after 5000000 us, giving up : 00 00 00 00 00000000
[16868.932913] Unable to handle kernel paging request for data at address 0x6655f67da816cdb8
[16868.933726] Faulting instruction address: 0xc000000000391600

cpu 0x68: Vector: 380 (Data Access Out of Range) at [c000001c9d98b9a0]
    pc: c000000000391600: kmem_cache_alloc+0x2e0/0x340
    lr: c0000000003915ec: kmem_cache_alloc+0x2cc/0x340
    sp: c000001c9d98bc20
   msr: 900000000280b033
   dar: 6655f67da816cdb8
  current = 0xc000001ad43cb400
  paca = 0xc00000000fac7800 softe: 0 irq_happened: 0x01
    pid = 8319, comm = make
Linux version 4.15.0-50-generic (buildd@bos02-ppc64el-006) (gcc version 7.3.0 (Ubuntu 7.3.0-16ubuntu3)) #54-Ubuntu SMP Mon May 6 18:55:18 UTC 2019 (Ubuntu 4.15.0-50.54-generic 4.15.18)

68:mon> t
[c000001c9d98bc20] c0000000003914d4 kmem_cache_alloc+0x1b4/0x340 (unreliable)
[c000001c9d98bc80] c0000000003b1e14 __khugepaged_enter+0x54/0x220
[c000001c9d98bcc0] c00000000010f0ec copy_process.isra.5.part.6+0xebc/0x1a10
[c000001c9d98bda0] c00000000010fe4c _do_fork+0xec/0x510
[c000001c9d98be30] c00000000000b584 ppc_clone+0x8/0xc
--- Exception: c00 (System Call) at 00007afe9daf87f4
SP (7fffca606880) is in userspace
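
To put the two timestamps in the log above into perspective, here is a small Python sketch (illustrative only; the helper name `first_timestamp` is made up for this example) that pulls the kernel timestamps out of the quoted lines and computes how long the system kept running between the CSB timeout and the oops.

```python
import re

# Log lines copied from the report above.
LOG = """\
[ 2.891463] zswap: loaded using pool 842-nx/zbud
[15626.124646] nx_compress_powernv: ERROR: CSB still not valid after 5000000 us, giving up : 00 00 00 00 00000000
[16868.932913] Unable to handle kernel paging request for data at address 0x6655f67da816cdb8
"""

# "[ seconds.micros] message" as printed by the kernel log.
STAMP = re.compile(r"^\[\s*(\d+\.\d+)\]\s*(.*)$")


def first_timestamp(log, needle):
    """Return the timestamp (seconds since boot) of the first line containing needle."""
    for line in log.splitlines():
        m = STAMP.match(line)
        if m and needle in m.group(2):
            return float(m.group(1))
    return None


csb_timeout = first_timestamp(LOG, "CSB still not valid")
oops = first_timestamp(LOG, "Unable to handle kernel paging request")
gap_seconds = oops - csb_timeout
print(f"{gap_seconds:.1f} s between the CSB timeout and the oops")
```

The gap works out to roughly 1243 seconds (about 20 minutes). That delay on its own does not prove the two events are related; it only shows the crash was not immediate.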

So, it looks like there could be a problem in the error path, plausibly
fixed by this patch:

commit 656ecc16e8fc2ab44b3d70e3fcc197a7020d0ca5
Author: Haren Myneni <email address hidden>
Date: Wed Jun 13 00:32:40 2018 -0700

    crypto/nx: Initialize 842 high and normal RxFIFO control registers

    NX increments readOffset by FIFO size in receive FIFO control register
    when CRB is read. But the index in RxFIFO has to match with the
    corresponding entry in FIFO maintained by VAS in kernel. Otherwise NX
    may be processing incorrect CRBs and can cause CRB timeout.

    VAS FIFO offset is 0 when the receive window is opened during
    initialization. When the module is reloaded or in kexec boot, readOffset
    in FIFO control register may not match with VAS entry. This patch adds
    nx_coproc_init OPAL call to reset readOffset and queued entries in FIFO
    control register for both high and normal FIFOs.

    Signed-off-by: Haren Myneni <email address hidden>
    [mpe: Fixup uninitialized variable warning]
    Signed-off-by: Michael Ellerman <email address hidden>

$ git describe --contains 656ecc16e8fc2ab44b3d70e3fcc197a7020d0ca5
v4.19-rc1~24^2~50

This was never backported to any stable release, so it probably needs
to be backported to v4.14 through v4.18. Notably, Ubuntu is on v4.15
and doesn't seem to have picked up the patch.
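
The version reasoning above can be sketched in Python as a first-pass check. This is only an illustration: the function names are hypothetical, and it compares upstream base versions only, so it cannot tell whether a distro kernel carries the fix as a backport.

```python
import re
from typing import Tuple


def first_containing_release(describe: str) -> Tuple[int, int]:
    """Extract (major, minor) from `git describe --contains` output,
    e.g. 'v4.19-rc1~24^2~50' -> (4, 19)."""
    m = re.match(r"v(\d+)\.(\d+)", describe)
    if m is None:
        raise ValueError(f"unexpected describe output: {describe!r}")
    return int(m.group(1)), int(m.group(2))


def base_predates_fix(kernel_release: str, describe: str) -> bool:
    """True if the kernel's upstream base version is older than the first
    release containing the commit. Assumption: kernel_release starts with
    'major.minor', as in `uname -r` output such as '4.15.0-50-generic'."""
    m = re.match(r"(\d+)\.(\d+)", kernel_release)
    if m is None:
        raise ValueError(f"unexpected kernel release: {kernel_release!r}")
    base = (int(m.group(1)), int(m.group(2)))
    return base < first_containing_release(describe)


# Values taken from this report.
print(base_predates_fix("4.15.0-50-generic", "v4.19-rc1~24^2~50"))
```

A True result only says the base version predates the fix. Confirming that the patch is really absent still means looking for the commit (or an equivalent backport) in the distro kernel tree.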

Reported upstream (there may be further discussion there) at https://lists.ozlabs.org/pipermail/linuxppc-dev/2019-June/191438.html

Tags: bionic
Ubuntu Kernel Bot (ubuntu-kernel-bot) wrote : Missing required logs.

This bug is missing log files that will aid in diagnosing the problem. While running an Ubuntu kernel (not a mainline or third-party kernel) please enter the following command in a terminal window:

apport-collect 1831536

and then change the status of the bug to 'Confirmed'.

If, due to the nature of the issue you have encountered, you are unable to run this command, please add a comment stating that fact and change the bug status to 'Confirmed'.

This change has been made by an automated script, maintained by the Ubuntu Kernel Team.

Changed in linux (Ubuntu):
status: New → Incomplete
tags: added: bionic