System crashes on hot adding a core with drmgr command (4.15.0-48-generic)

Bug #1833716 reported by bugproxy on 2019-06-21
This bug affects 1 person
Affects                            Status         Importance   Assigned to   Milestone
The Ubuntu-power-systems project   Fix Released   Critical     Manoj Iyer
linux (Ubuntu)                     Fix Released   Critical     Manoj Iyer
  Bionic                           Fix Released   High         Manoj Iyer

Bug Description

[Impact]
On the Bionic GA kernel (4.15.0), hot-adding a CPU with drmgr crashes the kernel. The patches identified to fix this issue disable changing the NUMA associations for CPUs and memory at runtime by default.

[Test]
# drmgr -c cpu -r -q 1
# drmgr -c cpu -a -q 1
Test kernel available in ppa:ubuntu-power-triage/lp1833716
Please see comment #2 for before and after results with the patches applied.

[Fix]
558f86493df0 powerpc/numa: document topology_updates_enabled, disable by default
2d4d9b308f8f powerpc/numa: improve control of topology updates

[Regression Potential]
The two patches are confined to powerpc/numa and do not affect other architectures or platform code. Regression potential is low.

[Other Information]
== Comment: #0 - Hari Krishna Bathini <email address hidden> - 2019-05-07 13:18:35 ==
---Problem Description---
On the 4.15.0-48-generic kernel, hot-adding a CPU with drmgr crashes the kernel
with the traces below:

---
root@ubuntu:~# drmgr -c cpu -r -q 1
Validating CPU DLPAR capability...yes.
CPU 9
root@ubuntu:~#
root@ubuntu:~#
root@ubuntu:~#
root@ubuntu:~#
root@ubuntu:~#
root@ubuntu:~# drmgr -c cpu -a -q 1
Validating CPU DLPAR capability...yes.
[ 218.555493] BUG: arch topology borken
[ 218.555503] the DIE domain not a subset of the NODE domain
[ 218.555512] BUG: arch topology borken
[ 218.555516] the DIE domain not a subset of the NODE domain
[ 218.555523] BUG: arch topology borken
[ 218.555528] the DIE domain not a subset of the NODE domain
[ 218.555535] BUG: arch topology borken
[ 218.555539] the DIE domain not a subset of the NODE domain
[ 218.555545] BUG: arch topology borken
[ 218.555550] the DIE domain not a subset of the NODE domain
[ 218.555556] BUG: arch topology borken
[ 218.555560] the DIE domain not a subset of the NODE domain
[ 218.555567] BUG: arch topology borken
[ 218.555571] the DIE domain not a subset of the NODE domain
[ 218.555577] BUG: arch topology borken
[ 218.555581] the DIE domain not a subset of the NODE domain
[ 218.555672] Unable to handle kernel paging request for data at address 0x9332ae80f961139f
[ 218.555679] Faulting instruction address: 0xc0000000001768cc
[ 218.555686] Oops: Kernel access of bad area, sig: 11 [#1]
[ 218.555691] LE SMP NR_CPUS=2048 NUMA pSeries
[ 218.555699] Modules linked in: vmx_crypto crct10dif_vpmsum sch_fq_codel ib_iser rdma_cm iw_cm ib_cm ib_core iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi ip_tables x_tables autofs4 btrfs zstd_compress raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq libcrc32c raid1 raid0 multipath linear ibmvscsi ibmveth crc32c_vpmsum
[ 218.555745] CPU: 8 PID: 276 Comm: kworker/8:1 Not tainted 4.15.0-48-generic #51-Ubuntu
[ 218.555757] Workqueue: events cpuset_hotplug_workfn
[ 218.555763] NIP: c0000000001768cc LR: c0000000001769a8 CTR: 0000000000000000
[ 218.555770] REGS: c0000001f5f1f530 TRAP: 0380 Not tainted (4.15.0-48-generic)
[ 218.555776] MSR: 8000000000009033 <SF,EE,ME,IR,DR,RI,LE> CR: 22824228 XER: 00000004
[ 218.555789] CFAR: c000000000176920 SOFTE: 1
[ 218.555789] GPR00: c0000000001769a8 c0000001f5f1f7b0 c0000000016eb400 c0000001f7bfd200
[ 218.555789] GPR04: 0000000000000001 0000000000000000 0000000000000008 0000000000000010
[ 218.555789] GPR08: 0000000000000018 ffffffffffffffff c0000001f7bfd408 0000000000000000
[ 218.555789] GPR12: 0000000000008000 c000000007a35800 0000000000000007 c0000001f549d900
[ 218.555789] GPR16: 0000000000000040 c000000001722494 c0000001f0f29400 0000000000000001
[ 218.555789] GPR20: c0000001ffb68580 0000000000000008 c0000000011d8580 c00000000171dd78
[ 218.555789] GPR24: 0000000000000000 ffffffffffffe830 ffffffffffffec30 00000000000012af
[ 218.555789] GPR28: 000000000000102f c0000001f7bfd200 9332ae80f961139f 9332ae80f961139f
[ 218.555859] NIP [c0000000001768cc] free_sched_groups.part.2+0x4c/0xf0
[ 218.555866] LR [c0000000001769a8] destroy_sched_domain+0x38/0xc0
[ 218.555871] Call Trace:
[ 218.555875] [c0000001f5f1f7b0] [ffffffffffffec30] 0xffffffffffffec30 (unreliable)
[ 218.555884] [c0000001f5f1f7f0] [c0000000001769a8] destroy_sched_domain+0x38/0xc0
[ 218.555892] [c0000001f5f1f820] [c000000000176eb0] cpu_attach_domain+0xf0/0x870
[ 218.555900] [c0000001f5f1f960] [c000000000178884] build_sched_domains+0x1254/0x12f0
[ 218.555908] [c0000001f5f1fa90] [c000000000179a70] partition_sched_domains+0x2d0/0x410
[ 218.555916] [c0000001f5f1fb20] [c0000000001ffb60] rebuild_sched_domains_locked+0x60/0x80
[ 218.555924] [c0000001f5f1fb50] [c000000000202e68] rebuild_sched_domains+0x38/0x60
[ 218.555932] [c0000001f5f1fb80] [c000000000202fc8] cpuset_hotplug_workfn+0x138/0xb60
[ 218.555941] [c0000001f5f1fc90] [c000000000135858] process_one_work+0x298/0x5a0
[ 218.555949] [c0000001f5f1fd20] [c000000000135bf8] worker_thread+0x98/0x630
[ 218.555956] [c0000001f5f1fdc0] [c00000000013e7e8] kthread+0x1a8/0x1b0
[ 218.555964] [c0000001f5f1fe30] [c00000000000b658] ret_from_kernel_thread+0x5c/0x84
[ 218.555971] Instruction dump:
[ 218.555975] 7d908026 fbe1fff8 91810008 f8010010 f821ffc1 7c7d1b78 2e240000 7c7f1b78
[ 218.555985] 48000010 7fbee840 7fdff378 419e0074 <ebdf0000> 4192002c 7c0004ac e95f0010
[ 218.555997] ---[ end trace 1d7b9b38e50835a4 ]---
---

---uname output---
Linux ubuntu 4.15.0-48-generic #51-Ubuntu SMP Wed Apr 3 08:26:19 UTC 2019 ppc64le ppc64le ppc64le GNU/Linux

Machine Type = na

---Debugger---
A debugger is not configured

---Steps to Reproduce---
1. Install a 4.15 kernel (4.15.0-48-generic)
2. Hot remove a core: drmgr -c cpu -r -q 1
3. Hot add a core: drmgr -c cpu -a -q 1

Actual Result:
The system crashes after the "drmgr -c cpu -a -q 1" command is issued

Expected result:
Hot add succeeds without any crash

== Comment: #20 - SEETEENA THOUFEEK <email address hidden> - 2019-06-20 07:00:39 ==
Please integrate these two patches

1.
https://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux.git/commit/?h=next&id=2d4d9b308f8f8dec68f6dbbff18c68ec7c6bd26f

powerpc/numa: improve control of topology updates

When booted with "topology_updates=no", or when "off" is written to
/proc/powerpc/topology_updates, NUMA reassignments are inhibited for
PRRN and VPHN events. However, migration and suspend unconditionally
re-enable reassignments via start_topology_update(). This is
incoherent.

Check the topology_updates_enabled flag in
start/stop_topology_update() so that callers of those APIs need not be
aware of whether reassignments are enabled. This allows the
administrative decision on reassignments to remain in force across
migrations and suspensions.

Signed-off-by: Nathan Lynch <email address hidden>
Signed-off-by: Michael Ellerman <email address hidden>

2.
https://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux.git/commit/?h=next&id=558f86493df09f68f79fe056d9028d317a3ce8ab

powerpc/numa: document topology_updates_enabled, disable by default

Changing the NUMA associations for CPUs and memory at runtime is
basically unsupported by the core mm, scheduler etc. We see all manner
of crashes, warnings and instability when the pseries code tries to do
this. Disable this behavior by default, and document the switch a bit.

Signed-off-by: Nathan Lynch <email address hidden>
Signed-off-by: Michael Ellerman <email address hidden>

Thanks in advance for your support.
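
To make the combined effect of the two commit messages easier to follow, here is a toy Python model of the gating they describe. It is not the kernel code. The class, the return values and the resume_after_migration caller are assumptions made for illustration; only the names topology_updates_enabled, start_topology_update and stop_topology_update come from the commit text.

```python
class TopologyUpdateModel:
    """Toy model of the flag gating described above; not the kernel code."""

    def __init__(self, enabled=False):
        # Second patch: runtime NUMA reassignment is disabled by default.
        self.topology_updates_enabled = enabled
        self.running = False

    def start_topology_update(self):
        # First patch: the flag is checked inside the API, so callers
        # do not need to know whether reassignments are enabled.
        if not self.topology_updates_enabled:
            return False
        self.running = True
        return True

    def stop_topology_update(self):
        # The same check applies on the stop path.
        if not self.topology_updates_enabled:
            return False
        self.running = False
        return True


def resume_after_migration(model):
    """Hypothetical caller standing in for the migration/suspend paths."""
    return model.start_topology_update()


default_model = TopologyUpdateModel()
print(resume_after_migration(default_model), default_model.running)
```

With the default setting the caller's request is ignored and no reassignment machinery starts, which is the behaviour the administrator's choice is meant to preserve across migrations and suspensions.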

Default Comment by Bridge

tags: added: architecture-ppc64le bugnameltc-177451 severity-high targetmilestone-inin---
Changed in ubuntu:
assignee: nobody → Ubuntu on IBM Power Systems Bug Triage (ubuntu-power-triage)
affects: ubuntu → kernel-package (Ubuntu)
affects: kernel-package (Ubuntu) → linux (Ubuntu)
tags: added: powervm
Changed in ubuntu-power-systems:
importance: Undecided → High
Manoj Iyer (manjo) on 2019-06-21
Changed in linux (Ubuntu):
assignee: Ubuntu on IBM Power Systems Bug Triage (ubuntu-power-triage) → Canonical Kernel Team (canonical-kernel-team)
Changed in ubuntu-power-systems:
assignee: nobody → Canonical Kernel Team (canonical-kernel-team)
Changed in linux (Ubuntu):
assignee: Canonical Kernel Team (canonical-kernel-team) → Manoj Iyer (manjo)
importance: Undecided → High
Changed in ubuntu-power-systems:
assignee: Canonical Kernel Team (canonical-kernel-team) → Manoj Iyer (manjo)
Changed in ubuntu-power-systems:
status: New → In Progress
Changed in linux (Ubuntu):
status: New → In Progress
Manoj Iyer (manjo) on 2019-06-24
description: updated
description: updated
description: updated
Manoj Iyer (manjo) wrote :

== Bionic GA 4.15.0 kernel ==

ubuntu@P8lpar4:~$ sudo su
root@P8lpar4:/home/ubuntu# drmgr -c cpu -r -q 1
Validating CPU DLPAR capability...yes.
CPU 121
root@P8lpar4:/home/ubuntu# drmgr -c cpu -a -q 1

[262984.440091] BUG: arch topology borken
[262984.440110] the DIE domain not a subset of the NODE domain
[262984.440114] BUG: arch topology borken
[262984.440116] the DIE domain not a subset of the NODE domain
[262984.440120] BUG: arch topology borken
[262984.440122] the DIE domain not a subset of the NODE domain
[262984.443241] Unable to handle kernel paging request for data at address 0x50dbee4a
[262984.443261] Faulting instruction address: 0xc00000000017982c
[262984.443281] LE SMP NR_CPUS=2048 NUMA pSeries
m iw_cm ib_cm ib_core iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi ip_tables
ync_pq async_xor async_tx xor raid6_pq libcrc32c raid1 raid0 multipath linear ibmvscs
[262984.443323] CPU: 120 PID: 1467 Comm: kworker/120:2 Not tainted 4.15.0-52-generic #56-Ubuntu
[262984.443331] Workqueue: events cpuset_hotplug_workfn
[262984.443334] NIP: c00000000017982c LR: c000000000179908 CTR: 0000000000000000
[262984.443341] MSR: 8000000000009033 <SF,EE,ME,IR,DR,RI,LE> CR: 22824228 XER: 00000005
[262984.443349] GPR00: c000000000179908 c000000fcec0b7b0 c0000000016eb800 c000000f7c4
[262984.443349] GPR04: 0000000000000001 0000000000000000 0000000000000008 00000000000
[262984.443349] GPR08: 0000000000000018 ffffffffffffffff c000000fe22f5808 000000000001c600
[262984.443349] GPR16: 00000000000003c0 c000000001722494 c000000fd249d800 0000000000000001
[262984.443349] GPR20: c000000ff7068580 0000000000000008 c0000000011d8580 c0000000017
[262984.443349] GPR24: 0000000000000000 ffffffffffffe830 ffffffffffffec30 000000000000102f
65d75
[262984.443389] LR [c000000000179908] destroy_sched_domain+0x38/0xc0
[262984.443392] Call Trace:
[262984.443395] [c000000fcec0b7b0] [ffffffffffffec30] 0xffffffffffffec30 (unreliable)
[262984.443400] [c000000fcec0b7f0] [c000000000179908] destroy_sched_domain+0x38/0xc0
[262984.443404] [c000000fcec0b820] [c000000000179e10] cpu_attach_domain+0xf0/0x870
[262984.443408] [c000000fcec0b960] [c00000000017b7e4] build_sched_domains+0x1254/0x12f0
[262984.443412] [c000000fcec0ba90] [c00000000017c9d0] partition_sched_domains+0x2d0/0x410
[262984.443416] [c000000fcec0bb20] [c000000000202b20] rebuild_sched_domains_locked+0x60/0x80
[26...


description: updated
Manoj Iyer (manjo) on 2019-06-24
Changed in ubuntu-power-systems:
importance: High → Critical
Changed in linux (Ubuntu):
importance: High → Critical
Manoj Iyer (manjo) wrote :

The issue does not reproduce on Disco.
root@P8lpar4:/home/ubuntu# uname -a
Linux P8lpar4 5.0.0-19-generic #20-Ubuntu SMP Wed Jun 19 21:50:53 UTC 2019 ppc64le ppc64le ppc64le GNU/Linux
root@P8lpar4:/home/ubuntu#

root@P8lpar4:/home/ubuntu# drmgr -c cpu -r -q 1
Validating CPU DLPAR capability...yes.
CPU 121
root@P8lpar4:/home/ubuntu# drmgr -c cpu -a -q 1
Validating CPU DLPAR capability...yes.
CPU 121
root@P8lpar4:/home/ubuntu#

------- Comment From <email address hidden> 2019-06-25 13:47 EDT-------
The issue is only seen with the 4.15 kernel.

Stefan Bader (smb) on 2019-06-28
Changed in linux (Ubuntu Bionic):
importance: Undecided → High
Changed in linux (Ubuntu Bionic):
status: New → Fix Committed

This bug is awaiting verification that the kernel in -proposed solves the problem. Please test the kernel and update this bug with the results. If the problem is solved, change the tag 'verification-needed-bionic' to 'verification-done-bionic'. If the problem still exists, change the tag 'verification-needed-bionic' to 'verification-failed-bionic'.

If verification is not done by 5 working days from today, this fix will be dropped from the source code, and this bug will be closed.

See https://wiki.ubuntu.com/Testing/EnableProposed for documentation on how to enable and use -proposed. Thank you!

tags: added: verification-needed-bionic
Manoj Iyer (manjo) on 2019-07-08
Changed in ubuntu-power-systems:
status: In Progress → Fix Committed
Changed in linux (Ubuntu Bionic):
assignee: nobody → Manoj Iyer (manjo)
Manoj Iyer (manjo) wrote :

ubuntu@P8lpar4:~$ uname -a
Linux P8lpar4 4.15.0-55-generic #60-Ubuntu SMP Tue Jul 2 18:21:40 UTC 2019 ppc64le ppc64le ppc64le GNU/Linux
ubuntu@P8lpar4:~$
ubuntu@P8lpar4:~$ apt policy linux-image-generic
linux-image-generic:
  Installed: 4.15.0.55.57
  Candidate: 4.15.0.55.57
  Version table:
 *** 4.15.0.55.57 500
        500 http://ports.ubuntu.com/ubuntu-ports bionic-proposed/main ppc64el Packages
        100 /var/lib/dpkg/status
     4.15.0.54.56 500
        500 http://ports.ubuntu.com/ubuntu-ports bionic-updates/main ppc64el Packages
        500 http://ports.ubuntu.com/ubuntu-ports bionic-security/main ppc64el Packages
     4.15.0.20.23 500
        500 http://ports.ubuntu.com/ubuntu-ports bionic/main ppc64el Packages
ubuntu@P8lpar4:~$
ubuntu@P8lpar4:~$ sudo su
root@P8lpar4:/home/ubuntu# drmgr -c cpu -r -q 1
Validating CPU DLPAR capability...yes.
CPU 121
root@P8lpar4:/home/ubuntu# drmgr -c cpu -a -q 1
Validating CPU DLPAR capability...yes.
CPU 121
root@P8lpar4:/home/ubuntu#

-- dmesg --
[ 476.574556] cpu 120 (hwid 120) Ready to die...
[ 476.647003] cpu 121 (hwid 121) Ready to die...
[ 476.710155] cpu 122 (hwid 122) Ready to die...
[ 476.766678] cpu 123 (hwid 123) Ready to die...
[ 476.829791] cpu 124 (hwid 124) Ready to die...
[ 476.883594] cpu 125 (hwid 125) Ready to die...
[ 476.933738] cpu 126 (hwid 126) Ready to die...
[ 476.986045] cpu 127 (hwid 127) Ready to die...
root@P8lpar4:/home/ubuntu#
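
As a rough aid for checking other machines, the sketch below compares a kernel release string against 4.15.0-55, the version this report records as carrying the fix for Bionic. The helper name is hypothetical. It assumes later 4.15.0 ABI numbers keep the patches, and it deliberately gives no verdict for non-4.15.0 kernels.

```python
import re

# ABI number of the Bionic kernel this report records as fixed
# (package linux 4.15.0-55.60).
FIXED_ABI = 55


def bionic_ga_kernel_has_fix(release):
    """Return True/False for a 4.15.0-N release string, None for anything else."""
    match = re.match(r"^4\.15\.0-(\d+)-", release)
    if match is None:
        # Not a Bionic GA (4.15.0) kernel; this sketch makes no claim about it.
        return None
    return int(match.group(1)) >= FIXED_ABI


# Release strings taken from the uname output in this report.
for release in ("4.15.0-48-generic", "4.15.0-55-generic", "5.0.0-19-generic"):
    print(release, bionic_ga_kernel_has_fix(release))
```

On a live system the release string could be taken from platform.release(), which reports the same value as "uname -r".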

tags: added: verification-done-bionic
removed: verification-needed-bionic
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package linux - 4.15.0-55.60

---------------
linux (4.15.0-55.60) bionic; urgency=medium

  * linux: 4.15.0-55.60 -proposed tracker (LP: #1834954)

  * Request backport of ceph commits into bionic (LP: #1834235)
    - ceph: use atomic_t for ceph_inode_info::i_shared_gen
    - ceph: define argument structure for handle_cap_grant
    - ceph: flush pending works before shutdown super
    - ceph: send cap releases more aggressively
    - ceph: single workqueue for inode related works
    - ceph: avoid dereferencing invalid pointer during cached readdir
    - ceph: quota: add initial infrastructure to support cephfs quotas
    - ceph: quota: support for ceph.quota.max_files
    - ceph: quota: don't allow cross-quota renames
    - ceph: fix root quota realm check
    - ceph: quota: support for ceph.quota.max_bytes
    - ceph: quota: update MDS when max_bytes is approaching
    - ceph: quota: add counter for snaprealms with quota
    - ceph: avoid iput_final() while holding mutex or in dispatch thread

  * QCA9377 isn't being recognized sometimes (LP: #1757218)
    - SAUCE: USB: Disable USB2 LPM at shutdown

  * hns: fix ICMP6 neighbor solicitation messages discard problem (LP: #1833140)
    - net: hns: fix ICMP6 neighbor solicitation messages discard problem
    - net: hns: fix unsigned comparison to less than zero

  * Fix occasional boot time crash in hns driver (LP: #1833138)
    - net: hns: Fix probabilistic memory overwrite when HNS driver initialized

  * use-after-free in hns_nic_net_xmit_hw (LP: #1833136)
    - net: hns: fix KASAN: use-after-free in hns_nic_net_xmit_hw()

  * hns: attempt to restart autoneg when disabled should report error
    (LP: #1833147)
    - net: hns: Restart autoneg need return failed when autoneg off

  * systemd 237-3ubuntu10.14 ADT test failure on Bionic ppc64el (test-seccomp)
    (LP: #1821625)
    - powerpc: sys_pkey_alloc() and sys_pkey_free() system calls
    - powerpc: sys_pkey_mprotect() system call

  * [UBUNTU] pkey: Indicate old mkvp only if old and curr. mkvp are different
    (LP: #1832625)
    - pkey: Indicate old mkvp only if old and current mkvp are different

  * [UBUNTU] kernel: Fix gcm-aes-s390 wrong scatter-gather list processing
    (LP: #1832623)
    - s390/crypto: fix gcm-aes-s390 selftest failures

  * System crashes on hot adding a core with drmgr command (4.15.0-48-generic)
    (LP: #1833716)
    - powerpc/numa: improve control of topology updates
    - powerpc/numa: document topology_updates_enabled, disable by default

  * Kernel modules generated incorrectly when system is localized to a non-
    English language (LP: #1828084)
    - scripts: override locale from environment when running recordmcount.pl

  * [UBUNTU] kernel: Fix wrong dispatching for control domain CPRBs
    (LP: #1832624)
    - s390/zcrypt: Fix wrong dispatching for control domain CPRBs

  * CVE-2019-11815
    - net: rds: force to destroy connection if t_sock is NULL in
      rds_tcp_kill_sock().

  * Sound device not detected after resume from hibernate (LP: #1826868)
    - drm/i915: Force 2*96 MHz cdclk on glk/cnl when audio power is enabled
    - drm/i915: Save the old CDCLK atomic state
...

Changed in linux (Ubuntu Bionic):
status: Fix Committed → Fix Released
Manoj Iyer (manjo) on 2019-07-22
Changed in linux (Ubuntu):
status: In Progress → Fix Released
Brad Figg (brad-figg) on 2019-07-24
tags: added: cscc
Changed in ubuntu-power-systems:
status: Fix Committed → Fix Released
bugproxy (bugproxy) on 2019-08-06
tags: added: targetmilestone-inin18042
removed: targetmilestone-inin---