Ubuntu 17.10 crashes on vmalloc.c

Bug #1739498 reported by bugproxy on 2017-12-20
This bug affects 1 person
Affects | Importance | Assigned to
The Ubuntu-power-systems project | Critical | Canonical Kernel Team
linux (Ubuntu) | Critical | Joseph Salisbury
Artful | Critical | Joseph Salisbury
Bionic | Critical | Joseph Salisbury

Bug Description

== SRU Justification ==
IBM is seeing a crash on Power9 when running on Artful.
This issue happens only when disabling RPT ("disable_radix"), which is not the default in Artful.

These fixes are already in Bionic master-next, so they are only being requested
in Artful.
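
Since the crash depends on the "disable_radix" boot option, it can help to confirm whether an affected machine was actually booted with it. The following is a minimal sketch, not part of the original report: the helper names are hypothetical, and it only looks for the bare option as a whitespace-separated token on the kernel command line.

```python
from pathlib import Path


def has_disable_radix(cmdline: str) -> bool:
    """Return True if the bare `disable_radix` option is on the command line."""
    # Kernel command-line options are whitespace-separated tokens.
    return "disable_radix" in cmdline.split()


def running_with_radix_disabled(path: str = "/proc/cmdline") -> bool:
    """Check the command line the running kernel was booted with."""
    # /proc/cmdline is the standard procfs file exposing the boot command line.
    return has_disable_radix(Path(path).read_text())
```

On a test system, calling running_with_radix_disabled() before trying to reproduce the warning avoids a false "cannot reproduce" on a machine that booted with the default (radix) MMU mode.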

== Fixes ==
63ee9b2ff9d3 ("powerpc/mm/book3s64: Make KERN_IO_START a variable")
b5048de04b32 ("powerpc/mm/slb: Move comment next to the code it's referring to")
21a0e8c14bf6 ("powerpc/mm/hash64: Make vmalloc 56T on hash")

== Regression Potential ==
These commits are specific to powerpc and fix a crash.

== Test Case ==
A test kernel was built with these patches and tested by the original bug reporter.
The bug reporter states the test kernel resolved the bug.

== Comment: #0 - Breno Leitao - 2017-12-19 09:48:10 ==
When running Ubuntu 17.10 on POWER9, I got the following error:

[409038.118908] WARNING: CPU: 47 PID: 294 at /build/linux-LIHoWc/linux-4.13.0/mm/vmalloc.c:2527 pcpu_get_vm_areas+0x62c/0x660
[409038.118909] Modules linked in: xt_CHECKSUM iptable_mangle ipt_MASQUERADE nf_nat_masquerade_ipv4 iptable_nat nf_nat_ipv4 nf_nat nf_conntrack_ipv4 nf_defrag_ipv4 xt_conntrack nf_conntrack libcrc32c ipt_REJECT nf_reject_ipv4 xt_tcpudp bridge stp llc ebtable_filter ebtables ip6table_filter ip6_tables iptable_filter kvm_hv kvm at24 ofpart ipmi_powernv ipmi_devintf ipmi_msghandler cmdlinepart powernv_flash uio_pdrv_genirq uio mtd vmx_crypto ibmpowernv crct10dif_vpmsum opal_prd binfmt_misc ip_tables x_tables autofs4 crc32c_vpmsum ast i2c_algo_bit drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops ttm drm tg3 ahci libahci
[409038.118933] CPU: 47 PID: 294 Comm: kworker/47:0 Tainted: G W 4.13.0-12-generic #13-Ubuntu
[409038.118934] Workqueue: events pcpu_balance_workfn
[409038.118936] task: c000003fe3cdcc00 task.stack: c000003fe3be0000
[409038.118937] NIP: c00000000032c1fc LR: c0000000002f5fd4 CTR: 0000000000000000
[409038.118937] REGS: c000003fe3be3810 TRAP: 0700 Tainted: G W (4.13.0-12-generic)
[409038.118938] MSR: 900000000282b033 <SF,HV,VEC,VSX,EE,FP,ME,IR,DR,RI,LE>
[409038.118944] CR: 24024828 XER: 20040000
[409038.118944] CFAR: c00000000032bdb8 SOFTE: 1
                GPR00: 000020000df00000 c000003fe3be3a90 c0000000015e3000 c000203fff6b6880
                GPR04: c000203fff223608 0000000000000008 c000203fff6b6888 0000000000000000
                GPR08: 000020000df00000 0000080000000000 0000000001600000 c000203fff6b6888
                GPR12: 0000000000000002 c00000000faded80 c000000000f6c050 c000003fe3be3bc0
                GPR16: 0000000000100000 0000000000000000 c00000000189daf8 c000203fff223608
                GPR20: 000020000f500000 c000203fff2235f8 c000203fff223600 0000000000000000
                GPR24: 0000000000000000 c000203fff6b6888 0000000000000001 000020000f500000
                GPR28: 0000000000000002 00000000000fffff d000080000000000 d000000000000000
[409038.118963] NIP [c00000000032c1fc] pcpu_get_vm_areas+0x62c/0x660
[409038.118964] LR [c0000000002f5fd4] pcpu_create_chunk+0xb4/0x1b0
[409038.118965] Call Trace:
[409038.118966] [c000003fe3be3a90] [c000003fe3be3ad0] 0xc000003fe3be3ad0 (unreliable)
[409038.118968] [c000003fe3be3b60] [c0000000002f5fd4] pcpu_create_chunk+0xb4/0x1b0
[409038.118970] [c000003fe3be3ba0] [c0000000002f7890] pcpu_balance_workfn+0x600/0x960
[409038.118972] [c000003fe3be3ca0] [c0000000001205d8] process_one_work+0x298/0x5a0
[409038.118975] [c000003fe3be3d30] [c000000000120968] worker_thread+0x88/0x620
[409038.118977] [c000003fe3be3dc0] [c00000000012980c] kthread+0x1ac/0x1c0
[409038.118979] [c000003fe3be3e30] [c00000000000b4e8] ret_from_kernel_thread+0x5c/0x74
[409038.118980] Instruction dump:
[409038.118981] eae30000 4191fad0 7ed3b378 e9210030 7efbbb78 7c791b78 3b400000 e9530000

---uname output---
4.13.0-12-generic
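
When comparing traces from different builds, it may be useful to pull the source location and function out of the WARNING line automatically. The sketch below is illustrative only: the regular expression and the parse_warning name are assumptions based on the format of the line quoted above, not a general kernel log parser.

```python
import re

# Pattern modeled on the WARNING line quoted in this report.
WARNING_RE = re.compile(
    r"WARNING: CPU: (?P<cpu>\d+) PID: (?P<pid>\d+) at "
    r"(?P<file>\S+):(?P<line>\d+) (?P<func>\w+)\+"
)


def parse_warning(log_line: str):
    """Extract CPU, PID, source file, line, and function from a WARNING line."""
    match = WARNING_RE.search(log_line)
    if match is None:
        return None  # Not a WARNING line in the expected format.
    return {
        "cpu": int(match["cpu"]),
        "pid": int(match["pid"]),
        "file": match["file"],
        "line": int(match["line"]),
        "function": match["func"],
    }
```

Applied to the first line of the trace above, this yields the file path ending in mm/vmalloc.c, line 2527, and the function pcpu_get_vm_areas.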

== Comment: #3 - ANEESH K. K V - 2017-12-20 05:59:13 ==
https://<email address hidden>

The above may be related?

Related discussions https://<email address hidden>

-aneesh

== Comment: #4 - Breno Leitao - 2017-12-20 09:48:07 ==
I just tested with kernel 4.15.0-041500rc4 and haven't seen the problem so far.

Default Comment by Bridge

tags: added: architecture-ppc64le bugnameltc-162737 severity-critical targetmilestone-inin---
Changed in ubuntu:
assignee: nobody → Ubuntu on IBM Power Systems Bug Triage (ubuntu-power-triage)
affects: ubuntu → linux (Ubuntu)
Changed in ubuntu-power-systems:
status: New → Triaged
importance: Undecided → Critical
assignee: nobody → Canonical Kernel Team (canonical-kernel-team)
tags: added: triage-g

------- Comment From <email address hidden> 2017-12-21 06:42 EDT-------
Hi,
FYI, I patched the 4.13 kernel from 17.10 with the patch that Aneesh mentioned, but the error still happens at boot time.

F.

Joseph Salisbury (jsalisbury) wrote :

If we don't yet know what patch in v4.15-rc4 fixes the bug, we can perform a "Reverse" bisect to identify the fix.

Can you test v4.14-rc1, so we can narrow down the last bad kernel version and the first good one? It can be downloaded from:

http://kernel.ubuntu.com/~kernel-ppa/mainline/v4.14-rc1/
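
For illustration, the idea behind a "reverse" bisect can be sketched as a binary search for the first kernel that no longer shows the bug. This is a minimal sketch under stated assumptions: the function name and the is_fixed callback are hypothetical, the version list must be in release order with a known-bad first entry and a known-good last entry, and the fix is assumed to stay in once it lands.

```python
from typing import Callable, Sequence


def first_fixed(versions: Sequence[str], is_fixed: Callable[[str], bool]) -> str:
    """Return the earliest version for which is_fixed() reports True.

    Assumes versions[0] is known bad, versions[-1] is known good, and every
    version after the first fixed one is also fixed.
    """
    lo, hi = 0, len(versions) - 1  # lo: last known bad, hi: first known good
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if is_fixed(versions[mid]):
            hi = mid  # The fix is at mid or earlier.
        else:
            lo = mid  # The fix landed after mid.
    return versions[hi]
```

Here is_fixed would stand for booting the given kernel and checking whether the warning appears; each test halves the remaining range, which is why testing a midpoint such as v4.14-rc1 is the natural first step.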

Changed in linux (Ubuntu):
status: New → Triaged
importance: Undecided → Critical
assignee: Ubuntu on IBM Power Systems Bug Triage (ubuntu-power-triage) → Joseph Salisbury (jsalisbury)
bugproxy (bugproxy) wrote :

------- Comment From <email address hidden> 2018-01-10 11:24 EDT-------
Joseph,

Frederic is working to bisect this problem. He will update you once we find something.

bugproxy (bugproxy) wrote :

------- Comment From <email address hidden> 2018-01-10 11:40 EDT-------
Hi. Just a heads-up: build -25 still reproduces the issue:

gromero@ltc-wspoon3:~$ uname -a
Linux ltc-wspoon3 4.13.0-25-generic #29-Ubuntu SMP Mon Jan 8 21:15:55 UTC 2018 ppc64le ppc64le ppc64le GNU/Linux

[ 118.751494] ------------[ cut here ]------------
[ 118.751497] WARNING: CPU: 63 PID: 1144 at /build/linux-268tGw/linux-4.13.0/mm/vmalloc.c:2521 pcpu_get_vm_areas+0x62c/0x660
[ 118.751498] Modules linked in: vhost_net vhost tap xt_CHECKSUM iptable_mangle ipt_MASQUERADE nf_nat_masquerade_ipv4 iptable_nat nf_nat_ipv4 nf_nat nf_conntrack_ipv4 nf_defrag_ipv4 xt_conntrack nf_conntrack libcrc32c ipt_REJECT nf_reject_ipv4 xt_tcpudp bridge stp llc ebtable_filter ebtables ip6table_filter ip6_tables iptable_filter idt_89hpesx ofpart cmdlinepart powernv_flash mtd at24 vmx_crypto crct10dif_vpmsum opal_prd ipmi_powernv uio_pdrv_genirq uio binfmt_misc ibmpowernv ipmi_devintf ipmi_msghandler kvm_hv kvm ip_tables x_tables autofs4 crc32c_vpmsum ast i2c_algo_bit ttm drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops tg3 drm ahci libahci
[ 118.751529] CPU: 63 PID: 1144 Comm: kworker/63:1 Tainted: G W 4.13.0-25-generic #29-Ubuntu
[ 118.751532] Workqueue: events pcpu_balance_workfn
[ 118.751534] task: c0000000fc358700 task.stack: c0000000fc2b4000
[ 118.751536] NIP: c00000000032fedc LR: c0000000002f9c54 CTR: 0000000000000000
[ 118.751537] REGS: c0000000fc2b7800 TRAP: 0700 Tainted: G W (4.13.0-25-generic)
[ 118.751538] MSR: 900000000282b033 <SF,HV,VEC,VSX,EE,FP,ME,IR,DR,RI,LE>
[ 118.751545] CR: 24024828 XER: 20040000
[ 118.751546] CFAR: c00000000032fa98 SOFTE: 1
GPR00: 000020000df00000 c0000000fc2b7a80 c0000000015f6200 c000203fff21eb00
GPR04: c000203fff21ea88 0000000000000008 c000203fff21eb08 0000000000000000
GPR08: 000020000df00000 0000080000000000 0000000001600000 c000203fff21eb08
GPR12: 0000000000000002 c00000000fae9580 c000000000f7dbe0 c0000000fc2b7bb0
GPR16: 0000000000100000 0000000000000000 c0000000018adff8 c000203fff21ea88
GPR20: 000020000f500000 c000203fff21ea78 c000203fff21ea80 0000000000000000
GPR24: 0000000000000000 c000203fff21eb08 0000000000000001 000020000f500000
GPR28: 0000000000000002 00000000000fffff d000080000000000 d000000000000000
[ 118.751571] NIP [c00000000032fedc] pcpu_get_vm_areas+0x62c/0x660
[ 118.751573] LR [c0000000002f9c54] pcpu_create_chunk+0xb4/0x1b0
[ 118.751574] Call Trace:
[ 118.751575] [c0000000fc2b7a80] [c0000000fc2b7ac0] 0xc0000000fc2b7ac0 (unreliable)
[ 118.751578] [c0000000fc2b7b50] [c0000000002f9c54] pcpu_create_chunk+0xb4/0x1b0
[ 118.751581] [c0000000fc2b7b90] [c0000000002fb510] pcpu_balance_workfn+0x600/0x960
[ 118.751584] [c0000000fc2b7c90] [c000000000124018] process_one_work+0x298/0x5a0
[ 118.751587] [c0000000fc2b7d20] [c0000000001243a8] worker_thread+0x88/0x620
[ 118.751589] [c0000000fc2b7dc0] [c00000000012d21c] kthread+0x1ac/0x1c0
[ 118.751593] [c0000000fc2b7e30] [c00000000000b4e8] ret_from_kernel_thread+0x5c/0x74
[ 118.751594] Instruction dump:
[ 118.751595] eae30000 4191fad0 7ed3b378 e9210030 7efbbb78 7c791b78 3b400000 e9530000
[ 118.751600] 7d3f4850 7f7b5214 7fa9d840 409c...


bugproxy (bugproxy) wrote :

------- Comment From <email address hidden> 2018-01-12 08:59 EDT-------
Hi,

Aneesh, right, I missed the two other ones. I rebuilt a kernel with the three patches, and that fixes the issue. Thanks for thinking of this. Some details:
This issue happens only when disabling RPT ("disable_radix"), which is not the default in Artful.
Also, I could reproduce it on Witherspoon systems but not on Boston (which has another issue that may hide this one. FYI, the kernel loops infinitely, displaying:
[ 0.000000] Allocated bitmap for 2040 MSIs (base IRQ 0x1fd000)
[ 0.000000] Initializing IODA2 PHB (/pciex@620c3c0300000)
[ 0.000000] PCI host bridge /pciex@620c3c0300000 ranges:
[ 0.0000[ 89.349426896,3] opalmsg: No available node in the free list, allocating
[ 89.352371632,3] opalmsg: No available node in the free list, allocating
00] M[ 89.355953552,3] opalmsg: No available node in the free list, allocating
[ 89.359518752,3] opalmsg: No available node in the free list, allocating
[ 89.363009568,3] opalmsg: No available node in the free list, allocating
[ 89.366608224,3] opalmsg: No available node in the free list, allocating
[ 89.370155632,3] opalmsg: No available node in the free list, allocating
[ 89.373689632,3] opalmsg: No available node in the free list, allocating
[ 89.376579088,3] opalmsg: No available node in the free list, allocating
[ 89.380134464,3] opalmsg: No available node in the free list, allocating
[ 89.383665744,3] opalmsg: No available node in the free list, allocating
[ 89.387155088,3] opalmsg: No available node in the free list, allocating
[ 89.390741040,3] opalmsg: No available node in the free list, allocating
[ 89.393600592,3] opalmsg: No available node in the free list, allocating
[ 89.397146720,3] opalmsg: No available node in the free list, allocating
[ 89.400687600,3] opalmsg: No available node in the free list, allocating
[ 89.404226032,3] opalmsg: No available node in the free list, allocating
[ 89.407772816,3] opalmsg: No available node in the free list, allocating
[ 89.410644096,3] opalmsg: No available node in the free list, allocating
....
).

So it seems these commits help:
-----------
commit 21a0e8c14bf61472723d2acc83f98ab35ff321b4
Author: Michael Ellerman <email address hidden>
Date: Tue Aug 1 20:29:24 2017 +1000

powerpc/mm/hash64: Make vmalloc 56T on hash

commit b5048de04b32104140e5b251005404c3e0d03ccd
Author: Michael Ellerman <email address hidden>
Date: Tue Aug 1 20:29:23 2017 +1000

powerpc/mm/slb: Move comment next to the code it's referring to

commit 63ee9b2ff9d306efaa61b04b8710fafe339ae441
Author: Michael Ellerman <email address hidden>
Date: Tue Aug 1 20:29:22 2017 +1000

powerpc/mm/book3s64: Make KERN_IO_START a variable

-----------

F.

Manoj Iyer (manjo) wrote :

Are you still investigating this bug to figure out which patches need to be backported to fix this issue? Could you please post the list of patches that fix it? Please note that the SRU process is currently delayed due to security patches for Spectre and Meltdown.

Frédéric Bonnard (frediz) wrote :

Hi Manoj, the patches that fixed this issue are the three listed above in my comment: https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1739498/comments/6

F.

Joseph Salisbury (jsalisbury) wrote :

I'll build a test kernel for those three patches for testing. Once confirmed they fix the bug, I can submit an SRU request.

Changed in linux (Ubuntu):
status: Triaged → In Progress
Changed in linux (Ubuntu Artful):
status: New → In Progress
importance: Undecided → Critical
assignee: nobody → Joseph Salisbury (jsalisbury)
Joseph Salisbury (jsalisbury) wrote :

I built an Artful test kernel with the three patches listed in comment #6.

The test kernel can be downloaded from:
http://kernel.ubuntu.com/~jsalisbury/lp1739498

Can you test this kernel and see if it resolves this bug?

Note, to test this kernel, you need to install both the linux-image and linux-image-extra .deb packages.

Thanks in advance!

bugproxy (bugproxy) wrote :

------- Comment From <email address hidden> 2018-02-07 11:21 EDT-------
Hi Joseph,
I tested the kernel you provided and it works perfectly.
Thanks,

F.

Manoj Iyer (manjo) on 2018-02-12
Changed in ubuntu-power-systems:
status: Triaged → In Progress
Joseph Salisbury (jsalisbury) wrote :
Changed in linux (Ubuntu Bionic):
status: In Progress → Fix Committed
description: updated
Changed in linux (Ubuntu Artful):
status: In Progress → Fix Committed
Manoj Iyer (manjo) on 2018-03-05
Changed in ubuntu-power-systems:
status: In Progress → Fix Committed
Stefan Bader (smb) wrote :

This bug is awaiting verification that the kernel in -proposed solves the problem. Please test the kernel and update this bug with the results. If the problem is solved, change the tag 'verification-needed-artful' to 'verification-done-artful'. If the problem still exists, change the tag 'verification-needed-artful' to 'verification-failed-artful'.

If verification is not done by 5 working days from today, this fix will be dropped from the source code, and this bug will be closed.

See https://wiki.ubuntu.com/Testing/EnableProposed for documentation on how to enable and use -proposed. Thank you!

tags: added: verification-needed-artful
bugproxy (bugproxy) wrote :

------- Comment From <email address hidden> 2018-03-22 13:20 EDT-------
Hi smb/Canonical,
could we extend the validation period? I couldn't find a Witherspoon system so far to test this.
Let me know, thanks

F.

bugproxy (bugproxy) wrote :

------- Comment From <email address hidden> 2018-03-23 12:29 EDT-------
Hi,
I got my hands on a WS system and could boot kernel 4.13.0-38.43 with disable_radix kernel option
and I confirm that this issue does not appear anymore.
(I checked before that I actually got it with 4.13.0-37.42)

Regards,

F.
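
The verification above can be turned into a simple check of whether an installed Artful kernel package is at or beyond the first version confirmed good here. This is a rough sketch with hypothetical helper names: it only handles purely numeric package versions of the form seen in this report (such as 4.13.0-38.43, not a uname string like 4.13.0-25-generic), and it is not a full Debian version comparison, for which `dpkg --compare-versions` is the proper tool.

```python
import re


def version_tuple(version: str):
    """Split a package version such as '4.13.0-38.43' into integers."""
    return tuple(int(part) for part in re.split(r"[.-]", version))


# First Artful package version verified in this report as not showing the issue.
FIRST_VERIFIED_GOOD = version_tuple("4.13.0-38.43")


def has_fix(version: str) -> bool:
    """Return True if the version is at or after the first verified-good one."""
    return version_tuple(version) >= FIRST_VERIFIED_GOOD
```

With the two versions tested above, has_fix("4.13.0-37.42") is False and has_fix("4.13.0-38.43") is True, matching the reported results.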

tags: added: verification-done-artful
removed: verification-needed-artful
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package linux - 4.13.0-38.43

---------------
linux (4.13.0-38.43) artful; urgency=medium

  * linux: 4.13.0-38.43 -proposed tracker (LP: #1755762)

  * Servers going OOM after updating kernel from 4.10 to 4.13 (LP: #1748408)
    - i40e: Fix memory leak related filter programming status
    - i40e: Add programming descriptors to cleaned_count

  * [SRU] Lenovo E41 Mic mute hotkey is not responding (LP: #1753347)
    - platform/x86: ideapad-laptop: Increase timeout to wait for EC answer

  * fails to dump with latest kpti fixes (LP: #1750021)
    - kdump: write correct address of mem_section into vmcoreinfo

  * headset mic can't be detected on two Dell machines (LP: #1748807)
    - ALSA: hda/realtek - Support headset mode for ALC215/ALC285/ALC289
    - ALSA: hda - Fix headset mic detection problem for two Dell machines
    - ALSA: hda - Fix a wrong FIXUP for alc289 on Dell machines

  * CIFS SMB2/SMB3 does not work for domain based DFS (LP: #1747572)
    - CIFS: make IPC a regular tcon
    - CIFS: use tcon_ipc instead of use_ipc parameter of SMB2_ioctl
    - CIFS: dump IPC tcon in debug proc file

  * i2c-thunderx: erroneous error message "unhandled state: 0" (LP: #1754076)
    - i2c: octeon: Prevent error message on bus error

  * hisi_sas: Add disk LED support (LP: #1752695)
    - scsi: hisi_sas: directly attached disk LED feature for v2 hw

  * EDAC, sb_edac: Backport 1 patch to Ubuntu 17.10 (Fix missing DIMM sysfs
    entries with KNL SNC2/SNC4 mode) (LP: #1743856)
    - EDAC, sb_edac: Fix missing DIMM sysfs entries with KNL SNC2/SNC4 mode

  * [regression] Colour banding and artefacts appear system-wide on an Asus
    Zenbook UX303LA with Intel HD 4400 graphics (LP: #1749420)
    - drm/edid: Add 6 bpc quirk for CPT panel in Asus UX303LA

  * DVB Card with SAA7146 chipset not working (LP: #1742316)
    - vmalloc: fix __GFP_HIGHMEM usage for vmalloc_32 on 32b systems

  * [Asus UX360UA] battery status in unity-panel is not changing when battery is
    being charged (LP: #1661876) // AC adapter status not detected on Asus
    ZenBook UX410UAK (LP: #1745032)
    - ACPI / battery: Add quirk for Asus UX360UA and UX410UAK

  * ASUS UX305LA - Battery state not detected correctly (LP: #1482390)
    - ACPI / battery: Add quirk for Asus GL502VSK and UX305LA

  * support thunderx2 vendor pmu events (LP: #1747523)
    - perf pmu: Extract function to get JSON alias map
    - perf pmu: Pass pmu as a parameter to get_cpuid_str()
    - perf tools arm64: Add support for get_cpuid_str function.
    - perf pmu: Add helper function is_pmu_core to detect PMU CORE devices
    - perf vendor events arm64: Add ThunderX2 implementation defined pmu core
      events
    - perf pmu: Add check for valid cpuid in perf_pmu__find_map()

  * lpfc.ko module doesn't work (LP: #1746970)
    - scsi: lpfc: Fix loop mode target discovery

  * Ubuntu 17.10 crashes on vmalloc.c (LP: #1739498)
    - powerpc/mm/book3s64: Make KERN_IO_START a variable
    - powerpc/mm/slb: Move comment next to the code it's referring to
    - powerpc/mm/hash64: Make vmalloc 56T on hash

  * ethtool -p fails to light NIC LED on HiSilicon D05 systems (LP: #1748567)
    - net...

Changed in linux (Ubuntu Artful):
status: Fix Committed → Fix Released
Manoj Iyer (manjo) wrote :

Marking bionic as fix-released since patches identified here are in master
63ee9b2ff9d3 powerpc/mm/book3s64: Make KERN_IO_START a variable
b5048de04b32 powerpc/mm/slb: Move comment next to the code it's referring to
21a0e8c14bf6 powerpc/mm/hash64: Make vmalloc 56T on hash

Changed in linux (Ubuntu Bionic):
status: Fix Committed → Fix Released
Changed in ubuntu-power-systems:
status: Fix Committed → Fix Released
bugproxy (bugproxy) on 2018-04-16
tags: added: targetmilestone-inin1804
removed: targetmilestone-inin---