Activity log for bug #1833716

Date Who What changed Old value New value Message
2019-06-21 12:53:14 bugproxy bug added bug
2019-06-21 12:53:18 bugproxy tags architecture-ppc64le bugnameltc-177451 severity-high targetmilestone-inin---
2019-06-21 12:53:20 bugproxy attachment added vmcore.log https://bugs.launchpad.net/bugs/1833716/+attachment/5272093/+files/vmcore.log
2019-06-21 12:53:21 bugproxy ubuntu: assignee Ubuntu on IBM Power Systems Bug Triage (ubuntu-power-triage)
2019-06-21 12:53:23 bugproxy affects ubuntu kernel-package (Ubuntu)
2019-06-21 13:45:49 Frank Heimes affects kernel-package (Ubuntu) linux (Ubuntu)
2019-06-21 13:46:09 Frank Heimes bug task added ubuntu-power-systems
2019-06-21 13:46:57 Frank Heimes tags architecture-ppc64le bugnameltc-177451 severity-high targetmilestone-inin--- architecture-ppc64le bugnameltc-177451 powervm severity-high targetmilestone-inin---
2019-06-21 13:47:27 Frank Heimes ubuntu-power-systems: importance Undecided High
2019-06-21 14:44:45 Manoj Iyer linux (Ubuntu): assignee Ubuntu on IBM Power Systems Bug Triage (ubuntu-power-triage) Canonical Kernel Team (canonical-kernel-team)
2019-06-21 14:44:56 Manoj Iyer ubuntu-power-systems: assignee Canonical Kernel Team (canonical-kernel-team)
2019-06-21 14:45:54 Manoj Iyer linux (Ubuntu): assignee Canonical Kernel Team (canonical-kernel-team) Manoj Iyer (manjo)
2019-06-21 14:45:57 Manoj Iyer linux (Ubuntu): importance Undecided High
2019-06-21 14:46:03 Manoj Iyer ubuntu-power-systems: assignee Canonical Kernel Team (canonical-kernel-team) Manoj Iyer (manjo)
2019-06-24 14:02:52 Andrew Cloke ubuntu-power-systems: status New In Progress
2019-06-24 14:02:55 Andrew Cloke linux (Ubuntu): status New In Progress
2019-06-24 16:45:40 Manoj Iyer description == Comment: #0 - Hari Krishna Bathini <hbathini@in.ibm.com> - 2019-05-07 13:18:35 == ---Problem Description--- On 4.15.0-48-generic kernel, hot adding a cpu with drmgr is crashing the kernel with below traces: --- root@ubuntu:~# drmgr -c cpu -r -q 1 Validating CPU DLPAR capability...yes. CPU 9 root@ubuntu:~# root@ubuntu:~# root@ubuntu:~# root@ubuntu:~# root@ubuntu:~# root@ubuntu:~# drmgr -c cpu -a -q 1 Validating CPU DLPAR capability...yes. [ 218.555493] BUG: arch topology borken [ 218.555503] the DIE domain not a subset of the NODE domain [ 218.555512] BUG: arch topology borken [ 218.555516] the DIE domain not a subset of the NODE domain [ 218.555523] BUG: arch topology borken [ 218.555528] the DIE domain not a subset of the NODE domain [ 218.555535] BUG: arch topology borken [ 218.555539] the DIE domain not a subset of the NODE domain [ 218.555545] BUG: arch topology borken [ 218.555550] the DIE domain not a subset of the NODE domain [ 218.555556] BUG: arch topology borken [ 218.555560] the DIE domain not a subset of the NODE domain [ 218.555567] BUG: arch topology borken [ 218.555571] the DIE domain not a subset of the NODE domain [ 218.555577] BUG: arch topology borken [ 218.555581] the DIE domain not a subset of the NODE domain [ 218.555672] Unable to handle kernel paging request for data at address 0x9332ae80f961139f [ 218.555679] Faulting instruction address: 0xc0000000001768cc [ 218.555686] Oops: Kernel access of bad area, sig: 11 [#1] [ 218.555691] LE SMP NR_CPUS=2048 NUMA pSeries [ 218.555699] Modules linked in: vmx_crypto crct10dif_vpmsum sch_fq_codel ib_iser rdma_cm iw_cm ib_cm ib_core iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi ip_tables x_tables autofs4 btrfs zstd_compress raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq libcrc32c raid1 raid0 multipath linear ibmvscsi ibmveth crc32c_vpmsum [ 218.555745] CPU: 8 PID: 276 Comm: kworker/8:1 Not tainted 
4.15.0-48-generic #51-Ubuntu [ 218.555757] Workqueue: events cpuset_hotplug_workfn [ 218.555763] NIP: c0000000001768cc LR: c0000000001769a8 CTR: 0000000000000000 [ 218.555770] REGS: c0000001f5f1f530 TRAP: 0380 Not tainted (4.15.0-48-generic) [ 218.555776] MSR: 8000000000009033 <SF,EE,ME,IR,DR,RI,LE> CR: 22824228 XER: 00000004 [ 218.555789] CFAR: c000000000176920 SOFTE: 1 [ 218.555789] GPR00: c0000000001769a8 c0000001f5f1f7b0 c0000000016eb400 c0000001f7bfd200 [ 218.555789] GPR04: 0000000000000001 0000000000000000 0000000000000008 0000000000000010 [ 218.555789] GPR08: 0000000000000018 ffffffffffffffff c0000001f7bfd408 0000000000000000 [ 218.555789] GPR12: 0000000000008000 c000000007a35800 0000000000000007 c0000001f549d900 [ 218.555789] GPR16: 0000000000000040 c000000001722494 c0000001f0f29400 0000000000000001 [ 218.555789] GPR20: c0000001ffb68580 0000000000000008 c0000000011d8580 c00000000171dd78 [ 218.555789] GPR24: 0000000000000000 ffffffffffffe830 ffffffffffffec30 00000000000012af [ 218.555789] GPR28: 000000000000102f c0000001f7bfd200 9332ae80f961139f 9332ae80f961139f [ 218.555859] NIP [c0000000001768cc] free_sched_groups.part.2+0x4c/0xf0 [ 218.555866] LR [c0000000001769a8] destroy_sched_domain+0x38/0xc0 [ 218.555871] Call Trace: [ 218.555875] [c0000001f5f1f7b0] [ffffffffffffec30] 0xffffffffffffec30 (unreliable) [ 218.555884] [c0000001f5f1f7f0] [c0000000001769a8] destroy_sched_domain+0x38/0xc0 [ 218.555892] [c0000001f5f1f820] [c000000000176eb0] cpu_attach_domain+0xf0/0x870 [ 218.555900] [c0000001f5f1f960] [c000000000178884] build_sched_domains+0x1254/0x12f0 [ 218.555908] [c0000001f5f1fa90] [c000000000179a70] partition_sched_domains+0x2d0/0x410 [ 218.555916] [c0000001f5f1fb20] [c0000000001ffb60] rebuild_sched_domains_locked+0x60/0x80 [ 218.555924] [c0000001f5f1fb50] [c000000000202e68] rebuild_sched_domains+0x38/0x60 [ 218.555932] [c0000001f5f1fb80] [c000000000202fc8] cpuset_hotplug_workfn+0x138/0xb60 [ 218.555941] [c0000001f5f1fc90] [c000000000135858] 
process_one_work+0x298/0x5a0 [ 218.555949] [c0000001f5f1fd20] [c000000000135bf8] worker_thread+0x98/0x630 [ 218.555956] [c0000001f5f1fdc0] [c00000000013e7e8] kthread+0x1a8/0x1b0 [ 218.555964] [c0000001f5f1fe30] [c00000000000b658] ret_from_kernel_thread+0x5c/0x84 [ 218.555971] Instruction dump: [ 218.555975] 7d908026 fbe1fff8 91810008 f8010010 f821ffc1 7c7d1b78 2e240000 7c7f1b78 [ 218.555985] 48000010 7fbee840 7fdff378 419e0074 <ebdf0000> 4192002c 7c0004ac e95f0010 [ 218.555997] ---[ end trace 1d7b9b38e50835a4 ]--- --- ---uname output--- Linux ubuntu 4.15.0-48-generic #51-Ubuntu SMP Wed Apr 3 08:26:19 UTC 2019 ppc64le ppc64le ppc64le GNU/Linux Machine Type = na ---Debugger--- A debugger is not configured ---Steps to Reproduce--- 1. Install a 4.15 kernel (4.15.0-48-generic) 2. Hot remove a core: drmgr -c cpu -r -q 1 3. Hot add a core: drmgr -c cpu -a -q 1 Actual Result: System crashes after "drmgr -c cpu -a -q 1" command is issued Expected result: Hot add succeeds without any crash == Comment: #20 - SEETEENA THOUFEEK <sthoufee@in.ibm.com> - 2019-06-20 07:00:39 == Please integrate these two patches 1. https://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux.git/commit/?h=next&id=2d4d9b308f8f8dec68f6dbbff18c68ec7c6bd26f powerpc/numa: improve control of topology updates When booted with "topology_updates=no", or when "off" is written to /proc/powerpc/topology_updates, NUMA reassignments are inhibited for PRRN and VPHN events. However, migration and suspend unconditionally re-enable reassignments via start_topology_update(). This is incoherent. Check the topology_updates_enabled flag in start/stop_topology_update() so that callers of those APIs need not be aware of whether reassignments are enabled. This allows the administrative decision on reassignments to remain in force across migrations and suspensions. Signed-off-by: Nathan Lynch <nathanl@linux.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> 2. 
https://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux.git/commit/?h=next&id=558f86493df09f68f79fe056d9028d317a3ce8ab powerpc/numa: document topology_updates_enabled, disable by default Changing the NUMA associations for CPUs and memory at runtime is basically unsupported by the core mm, scheduler etc. We see all manner of crashes, warnings and instability when the pseries code tries to do this. Disable this behavior by default, and document the switch a bit. Signed-off-by: Nathan Lynch <nathanl@linux.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Thanks in advance for your support. [Impact] On Bionic GA kernel (4.15.0), hot add of cpu with drmgr causes the kernel to crash. [Test] # drmgr -c cpu -r -q 1 # drmgr -c cpu -a -q 1 [Fix] 558f86493df0 powerpc/numa: document topology_updates_enabled, disable by default 2d4d9b308f8f powerpc/numa: improve control of topology updates [Regression Potential] The two patches == Comment: #0 - Hari Krishna Bathini <hbathini@in.ibm.com> - 2019-05-07 13:18:35 == ---Problem Description--- On 4.15.0-48-generic kernel, hot adding a cpu with drmgr is crashing the kernel with below traces: --- root@ubuntu:~# drmgr -c cpu -r -q 1 Validating CPU DLPAR capability...yes. CPU 9 root@ubuntu:~# root@ubuntu:~# root@ubuntu:~# root@ubuntu:~# root@ubuntu:~# root@ubuntu:~# drmgr -c cpu -a -q 1 Validating CPU DLPAR capability...yes. 
[ 218.555493] BUG: arch topology borken [ 218.555503] the DIE domain not a subset of the NODE domain [ 218.555512] BUG: arch topology borken [ 218.555516] the DIE domain not a subset of the NODE domain [ 218.555523] BUG: arch topology borken [ 218.555528] the DIE domain not a subset of the NODE domain [ 218.555535] BUG: arch topology borken [ 218.555539] the DIE domain not a subset of the NODE domain [ 218.555545] BUG: arch topology borken [ 218.555550] the DIE domain not a subset of the NODE domain [ 218.555556] BUG: arch topology borken [ 218.555560] the DIE domain not a subset of the NODE domain [ 218.555567] BUG: arch topology borken [ 218.555571] the DIE domain not a subset of the NODE domain [ 218.555577] BUG: arch topology borken [ 218.555581] the DIE domain not a subset of the NODE domain [ 218.555672] Unable to handle kernel paging request for data at address 0x9332ae80f961139f [ 218.555679] Faulting instruction address: 0xc0000000001768cc [ 218.555686] Oops: Kernel access of bad area, sig: 11 [#1] [ 218.555691] LE SMP NR_CPUS=2048 NUMA pSeries [ 218.555699] Modules linked in: vmx_crypto crct10dif_vpmsum sch_fq_codel ib_iser rdma_cm iw_cm ib_cm ib_core iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi ip_tables x_tables autofs4 btrfs zstd_compress raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq libcrc32c raid1 raid0 multipath linear ibmvscsi ibmveth crc32c_vpmsum [ 218.555745] CPU: 8 PID: 276 Comm: kworker/8:1 Not tainted 4.15.0-48-generic #51-Ubuntu [ 218.555757] Workqueue: events cpuset_hotplug_workfn [ 218.555763] NIP: c0000000001768cc LR: c0000000001769a8 CTR: 0000000000000000 [ 218.555770] REGS: c0000001f5f1f530 TRAP: 0380 Not tainted (4.15.0-48-generic) [ 218.555776] MSR: 8000000000009033 <SF,EE,ME,IR,DR,RI,LE> CR: 22824228 XER: 00000004 [ 218.555789] CFAR: c000000000176920 SOFTE: 1 [ 218.555789] GPR00: c0000000001769a8 c0000001f5f1f7b0 c0000000016eb400 c0000001f7bfd200 [ 218.555789] GPR04: 0000000000000001 
0000000000000000 0000000000000008 0000000000000010 [ 218.555789] GPR08: 0000000000000018 ffffffffffffffff c0000001f7bfd408 0000000000000000 [ 218.555789] GPR12: 0000000000008000 c000000007a35800 0000000000000007 c0000001f549d900 [ 218.555789] GPR16: 0000000000000040 c000000001722494 c0000001f0f29400 0000000000000001 [ 218.555789] GPR20: c0000001ffb68580 0000000000000008 c0000000011d8580 c00000000171dd78 [ 218.555789] GPR24: 0000000000000000 ffffffffffffe830 ffffffffffffec30 00000000000012af [ 218.555789] GPR28: 000000000000102f c0000001f7bfd200 9332ae80f961139f 9332ae80f961139f [ 218.555859] NIP [c0000000001768cc] free_sched_groups.part.2+0x4c/0xf0 [ 218.555866] LR [c0000000001769a8] destroy_sched_domain+0x38/0xc0 [ 218.555871] Call Trace: [ 218.555875] [c0000001f5f1f7b0] [ffffffffffffec30] 0xffffffffffffec30 (unreliable) [ 218.555884] [c0000001f5f1f7f0] [c0000000001769a8] destroy_sched_domain+0x38/0xc0 [ 218.555892] [c0000001f5f1f820] [c000000000176eb0] cpu_attach_domain+0xf0/0x870 [ 218.555900] [c0000001f5f1f960] [c000000000178884] build_sched_domains+0x1254/0x12f0 [ 218.555908] [c0000001f5f1fa90] [c000000000179a70] partition_sched_domains+0x2d0/0x410 [ 218.555916] [c0000001f5f1fb20] [c0000000001ffb60] rebuild_sched_domains_locked+0x60/0x80 [ 218.555924] [c0000001f5f1fb50] [c000000000202e68] rebuild_sched_domains+0x38/0x60 [ 218.555932] [c0000001f5f1fb80] [c000000000202fc8] cpuset_hotplug_workfn+0x138/0xb60 [ 218.555941] [c0000001f5f1fc90] [c000000000135858] process_one_work+0x298/0x5a0 [ 218.555949] [c0000001f5f1fd20] [c000000000135bf8] worker_thread+0x98/0x630 [ 218.555956] [c0000001f5f1fdc0] [c00000000013e7e8] kthread+0x1a8/0x1b0 [ 218.555964] [c0000001f5f1fe30] [c00000000000b658] ret_from_kernel_thread+0x5c/0x84 [ 218.555971] Instruction dump: [ 218.555975] 7d908026 fbe1fff8 91810008 f8010010 f821ffc1 7c7d1b78 2e240000 7c7f1b78 [ 218.555985] 48000010 7fbee840 7fdff378 419e0074 <ebdf0000> 4192002c 7c0004ac e95f0010 [ 218.555997] ---[ end trace 1d7b9b38e50835a4 
]--- --- ---uname output--- Linux ubuntu 4.15.0-48-generic #51-Ubuntu SMP Wed Apr 3 08:26:19 UTC 2019 ppc64le ppc64le ppc64le GNU/Linux Machine Type = na ---Debugger--- A debugger is not configured ---Steps to Reproduce---  1. Install a 4.15 kernel (4.15.0-48-generic) 2. Hot remove a core: drmgr -c cpu -r -q 1 3. Hot add a core: drmgr -c cpu -a -q 1 Actual Result: System crashes after "drmgr -c cpu -a -q 1" command is issued Expected result: Hot add succeeds without any crash == Comment: #20 - SEETEENA THOUFEEK <sthoufee@in.ibm.com> - 2019-06-20 07:00:39 == Please integrate these two patches 1. https://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux.git/commit/?h=next&id=2d4d9b308f8f8dec68f6dbbff18c68ec7c6bd26f powerpc/numa: improve control of topology updates When booted with "topology_updates=no", or when "off" is written to /proc/powerpc/topology_updates, NUMA reassignments are inhibited for PRRN and VPHN events. However, migration and suspend unconditionally re-enable reassignments via start_topology_update(). This is incoherent. Check the topology_updates_enabled flag in start/stop_topology_update() so that callers of those APIs need not be aware of whether reassignments are enabled. This allows the administrative decision on reassignments to remain in force across migrations and suspensions. Signed-off-by: Nathan Lynch <nathanl@linux.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> 2. https://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux.git/commit/?h=next&id=558f86493df09f68f79fe056d9028d317a3ce8ab powerpc/numa: document topology_updates_enabled, disable by default Changing the NUMA associations for CPUs and memory at runtime is basically unsupported by the core mm, scheduler etc. We see all manner of crashes, warnings and instability when the pseries code tries to do this. Disable this behavior by default, and document the switch a bit. 
Signed-off-by: Nathan Lynch <nathanl@linux.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Thanks in advance for your support.
2019-06-24 16:49:21 Manoj Iyer description [Impact] On Bionic GA kernel (4.15.0), hot add of cpu with drmgr causes the kernel to crash. [Test] # drmgr -c cpu -r -q 1 # drmgr -c cpu -a -q 1 [Fix] 558f86493df0 powerpc/numa: document topology_updates_enabled, disable by default 2d4d9b308f8f powerpc/numa: improve control of topology updates [Regression Potential] The two patches == Comment: #0 - Hari Krishna Bathini <hbathini@in.ibm.com> - 2019-05-07 13:18:35 == ---Problem Description--- On 4.15.0-48-generic kernel, hot adding a cpu with drmgr is crashing the kernel with below traces: --- root@ubuntu:~# drmgr -c cpu -r -q 1 Validating CPU DLPAR capability...yes. CPU 9 root@ubuntu:~# root@ubuntu:~# root@ubuntu:~# root@ubuntu:~# root@ubuntu:~# root@ubuntu:~# drmgr -c cpu -a -q 1 Validating CPU DLPAR capability...yes. [ 218.555493] BUG: arch topology borken [ 218.555503] the DIE domain not a subset of the NODE domain [ 218.555512] BUG: arch topology borken [ 218.555516] the DIE domain not a subset of the NODE domain [ 218.555523] BUG: arch topology borken [ 218.555528] the DIE domain not a subset of the NODE domain [ 218.555535] BUG: arch topology borken [ 218.555539] the DIE domain not a subset of the NODE domain [ 218.555545] BUG: arch topology borken [ 218.555550] the DIE domain not a subset of the NODE domain [ 218.555556] BUG: arch topology borken [ 218.555560] the DIE domain not a subset of the NODE domain [ 218.555567] BUG: arch topology borken [ 218.555571] the DIE domain not a subset of the NODE domain [ 218.555577] BUG: arch topology borken [ 218.555581] the DIE domain not a subset of the NODE domain [ 218.555672] Unable to handle kernel paging request for data at address 0x9332ae80f961139f [ 218.555679] Faulting instruction address: 0xc0000000001768cc [ 218.555686] Oops: Kernel access of bad area, sig: 11 [#1] [ 218.555691] LE SMP NR_CPUS=2048 NUMA pSeries [ 218.555699] Modules linked in: vmx_crypto crct10dif_vpmsum sch_fq_codel ib_iser rdma_cm iw_cm ib_cm 
ib_core iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi ip_tables x_tables autofs4 btrfs zstd_compress raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq libcrc32c raid1 raid0 multipath linear ibmvscsi ibmveth crc32c_vpmsum [ 218.555745] CPU: 8 PID: 276 Comm: kworker/8:1 Not tainted 4.15.0-48-generic #51-Ubuntu [ 218.555757] Workqueue: events cpuset_hotplug_workfn [ 218.555763] NIP: c0000000001768cc LR: c0000000001769a8 CTR: 0000000000000000 [ 218.555770] REGS: c0000001f5f1f530 TRAP: 0380 Not tainted (4.15.0-48-generic) [ 218.555776] MSR: 8000000000009033 <SF,EE,ME,IR,DR,RI,LE> CR: 22824228 XER: 00000004 [ 218.555789] CFAR: c000000000176920 SOFTE: 1 [ 218.555789] GPR00: c0000000001769a8 c0000001f5f1f7b0 c0000000016eb400 c0000001f7bfd200 [ 218.555789] GPR04: 0000000000000001 0000000000000000 0000000000000008 0000000000000010 [ 218.555789] GPR08: 0000000000000018 ffffffffffffffff c0000001f7bfd408 0000000000000000 [ 218.555789] GPR12: 0000000000008000 c000000007a35800 0000000000000007 c0000001f549d900 [ 218.555789] GPR16: 0000000000000040 c000000001722494 c0000001f0f29400 0000000000000001 [ 218.555789] GPR20: c0000001ffb68580 0000000000000008 c0000000011d8580 c00000000171dd78 [ 218.555789] GPR24: 0000000000000000 ffffffffffffe830 ffffffffffffec30 00000000000012af [ 218.555789] GPR28: 000000000000102f c0000001f7bfd200 9332ae80f961139f 9332ae80f961139f [ 218.555859] NIP [c0000000001768cc] free_sched_groups.part.2+0x4c/0xf0 [ 218.555866] LR [c0000000001769a8] destroy_sched_domain+0x38/0xc0 [ 218.555871] Call Trace: [ 218.555875] [c0000001f5f1f7b0] [ffffffffffffec30] 0xffffffffffffec30 (unreliable) [ 218.555884] [c0000001f5f1f7f0] [c0000000001769a8] destroy_sched_domain+0x38/0xc0 [ 218.555892] [c0000001f5f1f820] [c000000000176eb0] cpu_attach_domain+0xf0/0x870 [ 218.555900] [c0000001f5f1f960] [c000000000178884] build_sched_domains+0x1254/0x12f0 [ 218.555908] [c0000001f5f1fa90] [c000000000179a70] partition_sched_domains+0x2d0/0x410 [ 
218.555916] [c0000001f5f1fb20] [c0000000001ffb60] rebuild_sched_domains_locked+0x60/0x80 [ 218.555924] [c0000001f5f1fb50] [c000000000202e68] rebuild_sched_domains+0x38/0x60 [ 218.555932] [c0000001f5f1fb80] [c000000000202fc8] cpuset_hotplug_workfn+0x138/0xb60 [ 218.555941] [c0000001f5f1fc90] [c000000000135858] process_one_work+0x298/0x5a0 [ 218.555949] [c0000001f5f1fd20] [c000000000135bf8] worker_thread+0x98/0x630 [ 218.555956] [c0000001f5f1fdc0] [c00000000013e7e8] kthread+0x1a8/0x1b0 [ 218.555964] [c0000001f5f1fe30] [c00000000000b658] ret_from_kernel_thread+0x5c/0x84 [ 218.555971] Instruction dump: [ 218.555975] 7d908026 fbe1fff8 91810008 f8010010 f821ffc1 7c7d1b78 2e240000 7c7f1b78 [ 218.555985] 48000010 7fbee840 7fdff378 419e0074 <ebdf0000> 4192002c 7c0004ac e95f0010 [ 218.555997] ---[ end trace 1d7b9b38e50835a4 ]--- --- ---uname output--- Linux ubuntu 4.15.0-48-generic #51-Ubuntu SMP Wed Apr 3 08:26:19 UTC 2019 ppc64le ppc64le ppc64le GNU/Linux Machine Type = na ---Debugger--- A debugger is not configured ---Steps to Reproduce---  1. Install a 4.15 kernel (4.15.0-48-generic) 2. Hot remove a core: drmgr -c cpu -r -q 1 3. Hot add a core: drmgr -c cpu -a -q 1 Actual Result: System crashes after "drmgr -c cpu -a -q 1" command is issued Expected result: Hot add succeeds without any crash == Comment: #20 - SEETEENA THOUFEEK <sthoufee@in.ibm.com> - 2019-06-20 07:00:39 == Please integrate these two patches 1. https://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux.git/commit/?h=next&id=2d4d9b308f8f8dec68f6dbbff18c68ec7c6bd26f powerpc/numa: improve control of topology updates When booted with "topology_updates=no", or when "off" is written to /proc/powerpc/topology_updates, NUMA reassignments are inhibited for PRRN and VPHN events. However, migration and suspend unconditionally re-enable reassignments via start_topology_update(). This is incoherent. 
Check the topology_updates_enabled flag in start/stop_topology_update() so that callers of those APIs need not be aware of whether reassignments are enabled. This allows the administrative decision on reassignments to remain in force across migrations and suspensions. Signed-off-by: Nathan Lynch <nathanl@linux.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> 2. https://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux.git/commit/?h=next&id=558f86493df09f68f79fe056d9028d317a3ce8ab powerpc/numa: document topology_updates_enabled, disable by default Changing the NUMA associations for CPUs and memory at runtime is basically unsupported by the core mm, scheduler etc. We see all manner of crashes, warnings and instability when the pseries code tries to do this. Disable this behavior by default, and document the switch a bit. Signed-off-by: Nathan Lynch <nathanl@linux.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Thanks in advance for your support. [Impact] On Bionic GA kernel (4.15.0), hot add of cpu with drmgr causes the kernel to crash. The patches identified to fix these issues disables changing the NUMA associations for CPUs and Memory at runtime by default. [Test] # drmgr -c cpu -r -q 1 # drmgr -c cpu -a -q 1 [Fix] 558f86493df0 powerpc/numa: document topology_updates_enabled, disable by default 2d4d9b308f8f powerpc/numa: improve control of topology updates [Regression Potential] The two patches relate to powerpc/numa and does not impact other architectures or platform code. Regression potential is low. == Comment: #0 - Hari Krishna Bathini <hbathini@in.ibm.com> - 2019-05-07 13:18:35 == ---Problem Description--- On 4.15.0-48-generic kernel, hot adding a cpu with drmgr is crashing the kernel with below traces: --- root@ubuntu:~# drmgr -c cpu -r -q 1 Validating CPU DLPAR capability...yes. 
CPU 9 root@ubuntu:~# root@ubuntu:~# root@ubuntu:~# root@ubuntu:~# root@ubuntu:~# root@ubuntu:~# drmgr -c cpu -a -q 1 Validating CPU DLPAR capability...yes. [ 218.555493] BUG: arch topology borken [ 218.555503] the DIE domain not a subset of the NODE domain [ 218.555512] BUG: arch topology borken [ 218.555516] the DIE domain not a subset of the NODE domain [ 218.555523] BUG: arch topology borken [ 218.555528] the DIE domain not a subset of the NODE domain [ 218.555535] BUG: arch topology borken [ 218.555539] the DIE domain not a subset of the NODE domain [ 218.555545] BUG: arch topology borken [ 218.555550] the DIE domain not a subset of the NODE domain [ 218.555556] BUG: arch topology borken [ 218.555560] the DIE domain not a subset of the NODE domain [ 218.555567] BUG: arch topology borken [ 218.555571] the DIE domain not a subset of the NODE domain [ 218.555577] BUG: arch topology borken [ 218.555581] the DIE domain not a subset of the NODE domain [ 218.555672] Unable to handle kernel paging request for data at address 0x9332ae80f961139f [ 218.555679] Faulting instruction address: 0xc0000000001768cc [ 218.555686] Oops: Kernel access of bad area, sig: 11 [#1] [ 218.555691] LE SMP NR_CPUS=2048 NUMA pSeries [ 218.555699] Modules linked in: vmx_crypto crct10dif_vpmsum sch_fq_codel ib_iser rdma_cm iw_cm ib_cm ib_core iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi ip_tables x_tables autofs4 btrfs zstd_compress raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq libcrc32c raid1 raid0 multipath linear ibmvscsi ibmveth crc32c_vpmsum [ 218.555745] CPU: 8 PID: 276 Comm: kworker/8:1 Not tainted 4.15.0-48-generic #51-Ubuntu [ 218.555757] Workqueue: events cpuset_hotplug_workfn [ 218.555763] NIP: c0000000001768cc LR: c0000000001769a8 CTR: 0000000000000000 [ 218.555770] REGS: c0000001f5f1f530 TRAP: 0380 Not tainted (4.15.0-48-generic) [ 218.555776] MSR: 8000000000009033 <SF,EE,ME,IR,DR,RI,LE> CR: 22824228 XER: 00000004 [ 218.555789] 
CFAR: c000000000176920 SOFTE: 1 [ 218.555789] GPR00: c0000000001769a8 c0000001f5f1f7b0 c0000000016eb400 c0000001f7bfd200 [ 218.555789] GPR04: 0000000000000001 0000000000000000 0000000000000008 0000000000000010 [ 218.555789] GPR08: 0000000000000018 ffffffffffffffff c0000001f7bfd408 0000000000000000 [ 218.555789] GPR12: 0000000000008000 c000000007a35800 0000000000000007 c0000001f549d900 [ 218.555789] GPR16: 0000000000000040 c000000001722494 c0000001f0f29400 0000000000000001 [ 218.555789] GPR20: c0000001ffb68580 0000000000000008 c0000000011d8580 c00000000171dd78 [ 218.555789] GPR24: 0000000000000000 ffffffffffffe830 ffffffffffffec30 00000000000012af [ 218.555789] GPR28: 000000000000102f c0000001f7bfd200 9332ae80f961139f 9332ae80f961139f [ 218.555859] NIP [c0000000001768cc] free_sched_groups.part.2+0x4c/0xf0 [ 218.555866] LR [c0000000001769a8] destroy_sched_domain+0x38/0xc0 [ 218.555871] Call Trace: [ 218.555875] [c0000001f5f1f7b0] [ffffffffffffec30] 0xffffffffffffec30 (unreliable) [ 218.555884] [c0000001f5f1f7f0] [c0000000001769a8] destroy_sched_domain+0x38/0xc0 [ 218.555892] [c0000001f5f1f820] [c000000000176eb0] cpu_attach_domain+0xf0/0x870 [ 218.555900] [c0000001f5f1f960] [c000000000178884] build_sched_domains+0x1254/0x12f0 [ 218.555908] [c0000001f5f1fa90] [c000000000179a70] partition_sched_domains+0x2d0/0x410 [ 218.555916] [c0000001f5f1fb20] [c0000000001ffb60] rebuild_sched_domains_locked+0x60/0x80 [ 218.555924] [c0000001f5f1fb50] [c000000000202e68] rebuild_sched_domains+0x38/0x60 [ 218.555932] [c0000001f5f1fb80] [c000000000202fc8] cpuset_hotplug_workfn+0x138/0xb60 [ 218.555941] [c0000001f5f1fc90] [c000000000135858] process_one_work+0x298/0x5a0 [ 218.555949] [c0000001f5f1fd20] [c000000000135bf8] worker_thread+0x98/0x630 [ 218.555956] [c0000001f5f1fdc0] [c00000000013e7e8] kthread+0x1a8/0x1b0 [ 218.555964] [c0000001f5f1fe30] [c00000000000b658] ret_from_kernel_thread+0x5c/0x84 [ 218.555971] Instruction dump: [ 218.555975] 7d908026 fbe1fff8 91810008 f8010010 f821ffc1 
7c7d1b78 2e240000 7c7f1b78 [ 218.555985] 48000010 7fbee840 7fdff378 419e0074 <ebdf0000> 4192002c 7c0004ac e95f0010 [ 218.555997] ---[ end trace 1d7b9b38e50835a4 ]--- --- ---uname output--- Linux ubuntu 4.15.0-48-generic #51-Ubuntu SMP Wed Apr 3 08:26:19 UTC 2019 ppc64le ppc64le ppc64le GNU/Linux Machine Type = na ---Debugger--- A debugger is not configured ---Steps to Reproduce---  1. Install a 4.15 kernel (4.15.0-48-generic) 2. Hot remove a core: drmgr -c cpu -r -q 1 3. Hot add a core: drmgr -c cpu -a -q 1 Actual Result: System crashes after "drmgr -c cpu -a -q 1" command is issued Expected result: Hot add succeeds without any crash == Comment: #20 - SEETEENA THOUFEEK <sthoufee@in.ibm.com> - 2019-06-20 07:00:39 == Please integrate these two patches 1. https://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux.git/commit/?h=next&id=2d4d9b308f8f8dec68f6dbbff18c68ec7c6bd26f powerpc/numa: improve control of topology updates When booted with "topology_updates=no", or when "off" is written to /proc/powerpc/topology_updates, NUMA reassignments are inhibited for PRRN and VPHN events. However, migration and suspend unconditionally re-enable reassignments via start_topology_update(). This is incoherent. Check the topology_updates_enabled flag in start/stop_topology_update() so that callers of those APIs need not be aware of whether reassignments are enabled. This allows the administrative decision on reassignments to remain in force across migrations and suspensions. Signed-off-by: Nathan Lynch <nathanl@linux.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> 2. https://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux.git/commit/?h=next&id=558f86493df09f68f79fe056d9028d317a3ce8ab powerpc/numa: document topology_updates_enabled, disable by default Changing the NUMA associations for CPUs and memory at runtime is basically unsupported by the core mm, scheduler etc. We see all manner of crashes, warnings and instability when the pseries code tries to do this. 
Disable this behavior by default, and document the switch a bit. Signed-off-by: Nathan Lynch <nathanl@linux.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Thanks in advance for your support.
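Note on the two patches quoted in the description above: the idea is that start/stop_topology_update() check the topology_updates_enabled flag themselves, so callers such as the migration and suspend paths can call them unconditionally without overriding the administrator's choice. The following is a minimal user-space sketch of that idea only, not the kernel code. The class name, method names, return values and simulate_migration() helper are hypothetical; the disabled-by-default setting mirrors the title of the second patch.

```python
class TopologyUpdates:
    """Toy model of flag-gated topology updates (hypothetical, not kernel code)."""

    def __init__(self, enabled: bool = False) -> None:
        # Assumption: disabled by default, as the second patch title describes.
        self.enabled = enabled   # administrative switch
        self.running = False     # whether NUMA reassignments are currently active

    def start(self) -> bool:
        # The check lives here, so callers need not know the switch state.
        if not self.enabled:
            return False
        self.running = True
        return True

    def stop(self) -> bool:
        # Same gating on the stop side; a disabled switch makes this a no-op.
        if not self.enabled:
            return False
        self.running = False
        return True


def simulate_migration(topo: TopologyUpdates) -> bool:
    """Stand-in for a migration/suspend path that stops and restarts updates."""
    topo.stop()
    # ... the migration itself would happen here ...
    topo.start()
    return topo.running


if __name__ == "__main__":
    print(simulate_migration(TopologyUpdates(enabled=False)))
    print(simulate_migration(TopologyUpdates(enabled=True)))
```

In this model, a migration with the switch off leaves reassignments off, which is the behaviour the first patch description asks for; before that change, the restart step would have re-enabled them regardless of the switch.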
2019-06-24 16:49:40 Manoj Iyer description [Impact] On Bionic GA kernel (4.15.0), hot add of cpu with drmgr causes the kernel to crash. The patches identified to fix these issues disables changing the NUMA associations for CPUs and Memory at runtime by default. [Test] # drmgr -c cpu -r -q 1 # drmgr -c cpu -a -q 1 [Fix] 558f86493df0 powerpc/numa: document topology_updates_enabled, disable by default 2d4d9b308f8f powerpc/numa: improve control of topology updates [Regression Potential] The two patches relate to powerpc/numa and does not impact other architectures or platform code. Regression potential is low. == Comment: #0 - Hari Krishna Bathini <hbathini@in.ibm.com> - 2019-05-07 13:18:35 == ---Problem Description--- On 4.15.0-48-generic kernel, hot adding a cpu with drmgr is crashing the kernel with below traces: --- root@ubuntu:~# drmgr -c cpu -r -q 1 Validating CPU DLPAR capability...yes. CPU 9 root@ubuntu:~# root@ubuntu:~# root@ubuntu:~# root@ubuntu:~# root@ubuntu:~# root@ubuntu:~# drmgr -c cpu -a -q 1 Validating CPU DLPAR capability...yes. 
[ 218.555493] BUG: arch topology borken [ 218.555503] the DIE domain not a subset of the NODE domain [ 218.555512] BUG: arch topology borken [ 218.555516] the DIE domain not a subset of the NODE domain [ 218.555523] BUG: arch topology borken [ 218.555528] the DIE domain not a subset of the NODE domain [ 218.555535] BUG: arch topology borken [ 218.555539] the DIE domain not a subset of the NODE domain [ 218.555545] BUG: arch topology borken [ 218.555550] the DIE domain not a subset of the NODE domain [ 218.555556] BUG: arch topology borken [ 218.555560] the DIE domain not a subset of the NODE domain [ 218.555567] BUG: arch topology borken [ 218.555571] the DIE domain not a subset of the NODE domain [ 218.555577] BUG: arch topology borken [ 218.555581] the DIE domain not a subset of the NODE domain [ 218.555672] Unable to handle kernel paging request for data at address 0x9332ae80f961139f [ 218.555679] Faulting instruction address: 0xc0000000001768cc [ 218.555686] Oops: Kernel access of bad area, sig: 11 [#1] [ 218.555691] LE SMP NR_CPUS=2048 NUMA pSeries [ 218.555699] Modules linked in: vmx_crypto crct10dif_vpmsum sch_fq_codel ib_iser rdma_cm iw_cm ib_cm ib_core iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi ip_tables x_tables autofs4 btrfs zstd_compress raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq libcrc32c raid1 raid0 multipath linear ibmvscsi ibmveth crc32c_vpmsum [ 218.555745] CPU: 8 PID: 276 Comm: kworker/8:1 Not tainted 4.15.0-48-generic #51-Ubuntu [ 218.555757] Workqueue: events cpuset_hotplug_workfn [ 218.555763] NIP: c0000000001768cc LR: c0000000001769a8 CTR: 0000000000000000 [ 218.555770] REGS: c0000001f5f1f530 TRAP: 0380 Not tainted (4.15.0-48-generic) [ 218.555776] MSR: 8000000000009033 <SF,EE,ME,IR,DR,RI,LE> CR: 22824228 XER: 00000004 [ 218.555789] CFAR: c000000000176920 SOFTE: 1 [ 218.555789] GPR00: c0000000001769a8 c0000001f5f1f7b0 c0000000016eb400 c0000001f7bfd200 [ 218.555789] GPR04: 0000000000000001 
0000000000000000 0000000000000008 0000000000000010 [ 218.555789] GPR08: 0000000000000018 ffffffffffffffff c0000001f7bfd408 0000000000000000 [ 218.555789] GPR12: 0000000000008000 c000000007a35800 0000000000000007 c0000001f549d900 [ 218.555789] GPR16: 0000000000000040 c000000001722494 c0000001f0f29400 0000000000000001 [ 218.555789] GPR20: c0000001ffb68580 0000000000000008 c0000000011d8580 c00000000171dd78 [ 218.555789] GPR24: 0000000000000000 ffffffffffffe830 ffffffffffffec30 00000000000012af [ 218.555789] GPR28: 000000000000102f c0000001f7bfd200 9332ae80f961139f 9332ae80f961139f [ 218.555859] NIP [c0000000001768cc] free_sched_groups.part.2+0x4c/0xf0 [ 218.555866] LR [c0000000001769a8] destroy_sched_domain+0x38/0xc0 [ 218.555871] Call Trace: [ 218.555875] [c0000001f5f1f7b0] [ffffffffffffec30] 0xffffffffffffec30 (unreliable) [ 218.555884] [c0000001f5f1f7f0] [c0000000001769a8] destroy_sched_domain+0x38/0xc0 [ 218.555892] [c0000001f5f1f820] [c000000000176eb0] cpu_attach_domain+0xf0/0x870 [ 218.555900] [c0000001f5f1f960] [c000000000178884] build_sched_domains+0x1254/0x12f0 [ 218.555908] [c0000001f5f1fa90] [c000000000179a70] partition_sched_domains+0x2d0/0x410 [ 218.555916] [c0000001f5f1fb20] [c0000000001ffb60] rebuild_sched_domains_locked+0x60/0x80 [ 218.555924] [c0000001f5f1fb50] [c000000000202e68] rebuild_sched_domains+0x38/0x60 [ 218.555932] [c0000001f5f1fb80] [c000000000202fc8] cpuset_hotplug_workfn+0x138/0xb60 [ 218.555941] [c0000001f5f1fc90] [c000000000135858] process_one_work+0x298/0x5a0 [ 218.555949] [c0000001f5f1fd20] [c000000000135bf8] worker_thread+0x98/0x630 [ 218.555956] [c0000001f5f1fdc0] [c00000000013e7e8] kthread+0x1a8/0x1b0 [ 218.555964] [c0000001f5f1fe30] [c00000000000b658] ret_from_kernel_thread+0x5c/0x84 [ 218.555971] Instruction dump: [ 218.555975] 7d908026 fbe1fff8 91810008 f8010010 f821ffc1 7c7d1b78 2e240000 7c7f1b78 [ 218.555985] 48000010 7fbee840 7fdff378 419e0074 <ebdf0000> 4192002c 7c0004ac e95f0010 [ 218.555997] ---[ end trace 1d7b9b38e50835a4 
]--- --- ---uname output--- Linux ubuntu 4.15.0-48-generic #51-Ubuntu SMP Wed Apr 3 08:26:19 UTC 2019 ppc64le ppc64le ppc64le GNU/Linux Machine Type = na ---Debugger--- A debugger is not configured ---Steps to Reproduce---  1. Install a 4.15 kernel (4.15.0-48-generic) 2. Hot remove a core: drmgr -c cpu -r -q 1 3. Hot add a core: drmgr -c cpu -a -q 1 Actual Result: System crashes after "drmgr -c cpu -a -q 1" command is issued Expected result: Hot add succeeds without any crash == Comment: #20 - SEETEENA THOUFEEK <sthoufee@in.ibm.com> - 2019-06-20 07:00:39 == Please integrate these two patches 1. https://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux.git/commit/?h=next&id=2d4d9b308f8f8dec68f6dbbff18c68ec7c6bd26f powerpc/numa: improve control of topology updates When booted with "topology_updates=no", or when "off" is written to /proc/powerpc/topology_updates, NUMA reassignments are inhibited for PRRN and VPHN events. However, migration and suspend unconditionally re-enable reassignments via start_topology_update(). This is incoherent. Check the topology_updates_enabled flag in start/stop_topology_update() so that callers of those APIs need not be aware of whether reassignments are enabled. This allows the administrative decision on reassignments to remain in force across migrations and suspensions. Signed-off-by: Nathan Lynch <nathanl@linux.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> 2. https://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux.git/commit/?h=next&id=558f86493df09f68f79fe056d9028d317a3ce8ab powerpc/numa: document topology_updates_enabled, disable by default Changing the NUMA associations for CPUs and memory at runtime is basically unsupported by the core mm, scheduler etc. We see all manner of crashes, warnings and instability when the pseries code tries to do this. Disable this behavior by default, and document the switch a bit. 
Signed-off-by: Nathan Lynch <nathanl@linux.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Thanks in advance for your support. [Impact] On the Bionic GA kernel (4.15.0), hot add of a CPU with drmgr causes the kernel to crash. The patches identified to fix this issue disable changing the NUMA associations for CPUs and memory at runtime by default. [Test] # drmgr -c cpu -r -q 1 # drmgr -c cpu -a -q 1 [Fix] 558f86493df0 powerpc/numa: document topology_updates_enabled, disable by default 2d4d9b308f8f powerpc/numa: improve control of topology updates [Regression Potential] The two patches relate to powerpc/numa and do not impact other architectures or platform code. Regression potential is low. [Other Information] == Comment: #0 - Hari Krishna Bathini <hbathini@in.ibm.com> - 2019-05-07 13:18:35 == ---Problem Description--- On the 4.15.0-48-generic kernel, hot adding a CPU with drmgr crashes the kernel with the traces below: --- root@ubuntu:~# drmgr -c cpu -r -q 1 Validating CPU DLPAR capability...yes. CPU 9 root@ubuntu:~# root@ubuntu:~# root@ubuntu:~# root@ubuntu:~# root@ubuntu:~# root@ubuntu:~# drmgr -c cpu -a -q 1 Validating CPU DLPAR capability...yes.
[ 218.555493] BUG: arch topology borken [ 218.555503] the DIE domain not a subset of the NODE domain [ 218.555512] BUG: arch topology borken [ 218.555516] the DIE domain not a subset of the NODE domain [ 218.555523] BUG: arch topology borken [ 218.555528] the DIE domain not a subset of the NODE domain [ 218.555535] BUG: arch topology borken [ 218.555539] the DIE domain not a subset of the NODE domain [ 218.555545] BUG: arch topology borken [ 218.555550] the DIE domain not a subset of the NODE domain [ 218.555556] BUG: arch topology borken [ 218.555560] the DIE domain not a subset of the NODE domain [ 218.555567] BUG: arch topology borken [ 218.555571] the DIE domain not a subset of the NODE domain [ 218.555577] BUG: arch topology borken [ 218.555581] the DIE domain not a subset of the NODE domain [ 218.555672] Unable to handle kernel paging request for data at address 0x9332ae80f961139f [ 218.555679] Faulting instruction address: 0xc0000000001768cc [ 218.555686] Oops: Kernel access of bad area, sig: 11 [#1] [ 218.555691] LE SMP NR_CPUS=2048 NUMA pSeries [ 218.555699] Modules linked in: vmx_crypto crct10dif_vpmsum sch_fq_codel ib_iser rdma_cm iw_cm ib_cm ib_core iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi ip_tables x_tables autofs4 btrfs zstd_compress raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq libcrc32c raid1 raid0 multipath linear ibmvscsi ibmveth crc32c_vpmsum [ 218.555745] CPU: 8 PID: 276 Comm: kworker/8:1 Not tainted 4.15.0-48-generic #51-Ubuntu [ 218.555757] Workqueue: events cpuset_hotplug_workfn [ 218.555763] NIP: c0000000001768cc LR: c0000000001769a8 CTR: 0000000000000000 [ 218.555770] REGS: c0000001f5f1f530 TRAP: 0380 Not tainted (4.15.0-48-generic) [ 218.555776] MSR: 8000000000009033 <SF,EE,ME,IR,DR,RI,LE> CR: 22824228 XER: 00000004 [ 218.555789] CFAR: c000000000176920 SOFTE: 1 [ 218.555789] GPR00: c0000000001769a8 c0000001f5f1f7b0 c0000000016eb400 c0000001f7bfd200 [ 218.555789] GPR04: 0000000000000001 
0000000000000000 0000000000000008 0000000000000010 [ 218.555789] GPR08: 0000000000000018 ffffffffffffffff c0000001f7bfd408 0000000000000000 [ 218.555789] GPR12: 0000000000008000 c000000007a35800 0000000000000007 c0000001f549d900 [ 218.555789] GPR16: 0000000000000040 c000000001722494 c0000001f0f29400 0000000000000001 [ 218.555789] GPR20: c0000001ffb68580 0000000000000008 c0000000011d8580 c00000000171dd78 [ 218.555789] GPR24: 0000000000000000 ffffffffffffe830 ffffffffffffec30 00000000000012af [ 218.555789] GPR28: 000000000000102f c0000001f7bfd200 9332ae80f961139f 9332ae80f961139f [ 218.555859] NIP [c0000000001768cc] free_sched_groups.part.2+0x4c/0xf0 [ 218.555866] LR [c0000000001769a8] destroy_sched_domain+0x38/0xc0 [ 218.555871] Call Trace: [ 218.555875] [c0000001f5f1f7b0] [ffffffffffffec30] 0xffffffffffffec30 (unreliable) [ 218.555884] [c0000001f5f1f7f0] [c0000000001769a8] destroy_sched_domain+0x38/0xc0 [ 218.555892] [c0000001f5f1f820] [c000000000176eb0] cpu_attach_domain+0xf0/0x870 [ 218.555900] [c0000001f5f1f960] [c000000000178884] build_sched_domains+0x1254/0x12f0 [ 218.555908] [c0000001f5f1fa90] [c000000000179a70] partition_sched_domains+0x2d0/0x410 [ 218.555916] [c0000001f5f1fb20] [c0000000001ffb60] rebuild_sched_domains_locked+0x60/0x80 [ 218.555924] [c0000001f5f1fb50] [c000000000202e68] rebuild_sched_domains+0x38/0x60 [ 218.555932] [c0000001f5f1fb80] [c000000000202fc8] cpuset_hotplug_workfn+0x138/0xb60 [ 218.555941] [c0000001f5f1fc90] [c000000000135858] process_one_work+0x298/0x5a0 [ 218.555949] [c0000001f5f1fd20] [c000000000135bf8] worker_thread+0x98/0x630 [ 218.555956] [c0000001f5f1fdc0] [c00000000013e7e8] kthread+0x1a8/0x1b0 [ 218.555964] [c0000001f5f1fe30] [c00000000000b658] ret_from_kernel_thread+0x5c/0x84 [ 218.555971] Instruction dump: [ 218.555975] 7d908026 fbe1fff8 91810008 f8010010 f821ffc1 7c7d1b78 2e240000 7c7f1b78 [ 218.555985] 48000010 7fbee840 7fdff378 419e0074 <ebdf0000> 4192002c 7c0004ac e95f0010 [ 218.555997] ---[ end trace 1d7b9b38e50835a4 
]--- --- ---uname output--- Linux ubuntu 4.15.0-48-generic #51-Ubuntu SMP Wed Apr 3 08:26:19 UTC 2019 ppc64le ppc64le ppc64le GNU/Linux Machine Type = na ---Debugger--- A debugger is not configured ---Steps to Reproduce---  1. Install a 4.15 kernel (4.15.0-48-generic) 2. Hot remove a core: drmgr -c cpu -r -q 1 3. Hot add a core: drmgr -c cpu -a -q 1 Actual Result: System crashes after "drmgr -c cpu -a -q 1" command is issued Expected result: Hot add succeeds without any crash == Comment: #20 - SEETEENA THOUFEEK <sthoufee@in.ibm.com> - 2019-06-20 07:00:39 == Please integrate these two patches 1. https://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux.git/commit/?h=next&id=2d4d9b308f8f8dec68f6dbbff18c68ec7c6bd26f powerpc/numa: improve control of topology updates When booted with "topology_updates=no", or when "off" is written to /proc/powerpc/topology_updates, NUMA reassignments are inhibited for PRRN and VPHN events. However, migration and suspend unconditionally re-enable reassignments via start_topology_update(). This is incoherent. Check the topology_updates_enabled flag in start/stop_topology_update() so that callers of those APIs need not be aware of whether reassignments are enabled. This allows the administrative decision on reassignments to remain in force across migrations and suspensions. Signed-off-by: Nathan Lynch <nathanl@linux.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> 2. https://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux.git/commit/?h=next&id=558f86493df09f68f79fe056d9028d317a3ce8ab powerpc/numa: document topology_updates_enabled, disable by default Changing the NUMA associations for CPUs and memory at runtime is basically unsupported by the core mm, scheduler etc. We see all manner of crashes, warnings and instability when the pseries code tries to do this. Disable this behavior by default, and document the switch a bit. 
Signed-off-by: Nathan Lynch <nathanl@linux.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Thanks in advance for your support.
2019-06-24 16:56:38 Manoj Iyer description [Impact] On the Bionic GA kernel (4.15.0), hot add of a CPU with drmgr causes the kernel to crash. The patches identified to fix this issue disable changing the NUMA associations for CPUs and memory at runtime by default. [Test] # drmgr -c cpu -r -q 1 # drmgr -c cpu -a -q 1 [Fix] 558f86493df0 powerpc/numa: document topology_updates_enabled, disable by default 2d4d9b308f8f powerpc/numa: improve control of topology updates [Regression Potential] The two patches relate to powerpc/numa and do not impact other architectures or platform code. Regression potential is low. [Other Information] == Comment: #0 - Hari Krishna Bathini <hbathini@in.ibm.com> - 2019-05-07 13:18:35 == ---Problem Description--- On the 4.15.0-48-generic kernel, hot adding a CPU with drmgr crashes the kernel with the traces below: --- root@ubuntu:~# drmgr -c cpu -r -q 1 Validating CPU DLPAR capability...yes. CPU 9 root@ubuntu:~# root@ubuntu:~# root@ubuntu:~# root@ubuntu:~# root@ubuntu:~# root@ubuntu:~# drmgr -c cpu -a -q 1 Validating CPU DLPAR capability...yes.
[ 218.555493] BUG: arch topology borken [ 218.555503] the DIE domain not a subset of the NODE domain [ 218.555512] BUG: arch topology borken [ 218.555516] the DIE domain not a subset of the NODE domain [ 218.555523] BUG: arch topology borken [ 218.555528] the DIE domain not a subset of the NODE domain [ 218.555535] BUG: arch topology borken [ 218.555539] the DIE domain not a subset of the NODE domain [ 218.555545] BUG: arch topology borken [ 218.555550] the DIE domain not a subset of the NODE domain [ 218.555556] BUG: arch topology borken [ 218.555560] the DIE domain not a subset of the NODE domain [ 218.555567] BUG: arch topology borken [ 218.555571] the DIE domain not a subset of the NODE domain [ 218.555577] BUG: arch topology borken [ 218.555581] the DIE domain not a subset of the NODE domain [ 218.555672] Unable to handle kernel paging request for data at address 0x9332ae80f961139f [ 218.555679] Faulting instruction address: 0xc0000000001768cc [ 218.555686] Oops: Kernel access of bad area, sig: 11 [#1] [ 218.555691] LE SMP NR_CPUS=2048 NUMA pSeries [ 218.555699] Modules linked in: vmx_crypto crct10dif_vpmsum sch_fq_codel ib_iser rdma_cm iw_cm ib_cm ib_core iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi ip_tables x_tables autofs4 btrfs zstd_compress raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq libcrc32c raid1 raid0 multipath linear ibmvscsi ibmveth crc32c_vpmsum [ 218.555745] CPU: 8 PID: 276 Comm: kworker/8:1 Not tainted 4.15.0-48-generic #51-Ubuntu [ 218.555757] Workqueue: events cpuset_hotplug_workfn [ 218.555763] NIP: c0000000001768cc LR: c0000000001769a8 CTR: 0000000000000000 [ 218.555770] REGS: c0000001f5f1f530 TRAP: 0380 Not tainted (4.15.0-48-generic) [ 218.555776] MSR: 8000000000009033 <SF,EE,ME,IR,DR,RI,LE> CR: 22824228 XER: 00000004 [ 218.555789] CFAR: c000000000176920 SOFTE: 1 [ 218.555789] GPR00: c0000000001769a8 c0000001f5f1f7b0 c0000000016eb400 c0000001f7bfd200 [ 218.555789] GPR04: 0000000000000001 
0000000000000000 0000000000000008 0000000000000010 [ 218.555789] GPR08: 0000000000000018 ffffffffffffffff c0000001f7bfd408 0000000000000000 [ 218.555789] GPR12: 0000000000008000 c000000007a35800 0000000000000007 c0000001f549d900 [ 218.555789] GPR16: 0000000000000040 c000000001722494 c0000001f0f29400 0000000000000001 [ 218.555789] GPR20: c0000001ffb68580 0000000000000008 c0000000011d8580 c00000000171dd78 [ 218.555789] GPR24: 0000000000000000 ffffffffffffe830 ffffffffffffec30 00000000000012af [ 218.555789] GPR28: 000000000000102f c0000001f7bfd200 9332ae80f961139f 9332ae80f961139f [ 218.555859] NIP [c0000000001768cc] free_sched_groups.part.2+0x4c/0xf0 [ 218.555866] LR [c0000000001769a8] destroy_sched_domain+0x38/0xc0 [ 218.555871] Call Trace: [ 218.555875] [c0000001f5f1f7b0] [ffffffffffffec30] 0xffffffffffffec30 (unreliable) [ 218.555884] [c0000001f5f1f7f0] [c0000000001769a8] destroy_sched_domain+0x38/0xc0 [ 218.555892] [c0000001f5f1f820] [c000000000176eb0] cpu_attach_domain+0xf0/0x870 [ 218.555900] [c0000001f5f1f960] [c000000000178884] build_sched_domains+0x1254/0x12f0 [ 218.555908] [c0000001f5f1fa90] [c000000000179a70] partition_sched_domains+0x2d0/0x410 [ 218.555916] [c0000001f5f1fb20] [c0000000001ffb60] rebuild_sched_domains_locked+0x60/0x80 [ 218.555924] [c0000001f5f1fb50] [c000000000202e68] rebuild_sched_domains+0x38/0x60 [ 218.555932] [c0000001f5f1fb80] [c000000000202fc8] cpuset_hotplug_workfn+0x138/0xb60 [ 218.555941] [c0000001f5f1fc90] [c000000000135858] process_one_work+0x298/0x5a0 [ 218.555949] [c0000001f5f1fd20] [c000000000135bf8] worker_thread+0x98/0x630 [ 218.555956] [c0000001f5f1fdc0] [c00000000013e7e8] kthread+0x1a8/0x1b0 [ 218.555964] [c0000001f5f1fe30] [c00000000000b658] ret_from_kernel_thread+0x5c/0x84 [ 218.555971] Instruction dump: [ 218.555975] 7d908026 fbe1fff8 91810008 f8010010 f821ffc1 7c7d1b78 2e240000 7c7f1b78 [ 218.555985] 48000010 7fbee840 7fdff378 419e0074 <ebdf0000> 4192002c 7c0004ac e95f0010 [ 218.555997] ---[ end trace 1d7b9b38e50835a4 
]--- --- ---uname output--- Linux ubuntu 4.15.0-48-generic #51-Ubuntu SMP Wed Apr 3 08:26:19 UTC 2019 ppc64le ppc64le ppc64le GNU/Linux Machine Type = na ---Debugger--- A debugger is not configured ---Steps to Reproduce---  1. Install a 4.15 kernel (4.15.0-48-generic) 2. Hot remove a core: drmgr -c cpu -r -q 1 3. Hot add a core: drmgr -c cpu -a -q 1 Actual Result: System crashes after "drmgr -c cpu -a -q 1" command is issued Expected result: Hot add succeeds without any crash == Comment: #20 - SEETEENA THOUFEEK <sthoufee@in.ibm.com> - 2019-06-20 07:00:39 == Please integrate these two patches 1. https://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux.git/commit/?h=next&id=2d4d9b308f8f8dec68f6dbbff18c68ec7c6bd26f powerpc/numa: improve control of topology updates When booted with "topology_updates=no", or when "off" is written to /proc/powerpc/topology_updates, NUMA reassignments are inhibited for PRRN and VPHN events. However, migration and suspend unconditionally re-enable reassignments via start_topology_update(). This is incoherent. Check the topology_updates_enabled flag in start/stop_topology_update() so that callers of those APIs need not be aware of whether reassignments are enabled. This allows the administrative decision on reassignments to remain in force across migrations and suspensions. Signed-off-by: Nathan Lynch <nathanl@linux.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> 2. https://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux.git/commit/?h=next&id=558f86493df09f68f79fe056d9028d317a3ce8ab powerpc/numa: document topology_updates_enabled, disable by default Changing the NUMA associations for CPUs and memory at runtime is basically unsupported by the core mm, scheduler etc. We see all manner of crashes, warnings and instability when the pseries code tries to do this. Disable this behavior by default, and document the switch a bit. 
Signed-off-by: Nathan Lynch <nathanl@linux.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Thanks in advance for your support. [Impact] On the Bionic GA kernel (4.15.0), hot add of a CPU with drmgr causes the kernel to crash. The patches identified to fix this issue disable changing the NUMA associations for CPUs and memory at runtime by default. [Test] # drmgr -c cpu -r -q 1 # drmgr -c cpu -a -q 1 Test kernel available in ppa:ubuntu-power-triage/lp1833716. Please see comment #2 for before and after results with the patches applied. [Fix] 558f86493df0 powerpc/numa: document topology_updates_enabled, disable by default 2d4d9b308f8f powerpc/numa: improve control of topology updates [Regression Potential] The two patches relate to powerpc/numa and do not impact other architectures or platform code. Regression potential is low. [Other Information] == Comment: #0 - Hari Krishna Bathini <hbathini@in.ibm.com> - 2019-05-07 13:18:35 == ---Problem Description--- On the 4.15.0-48-generic kernel, hot adding a CPU with drmgr crashes the kernel with the traces below: --- root@ubuntu:~# drmgr -c cpu -r -q 1 Validating CPU DLPAR capability...yes. CPU 9 root@ubuntu:~# root@ubuntu:~# root@ubuntu:~# root@ubuntu:~# root@ubuntu:~# root@ubuntu:~# drmgr -c cpu -a -q 1 Validating CPU DLPAR capability...yes.
[ 218.555493] BUG: arch topology borken [ 218.555503] the DIE domain not a subset of the NODE domain [ 218.555512] BUG: arch topology borken [ 218.555516] the DIE domain not a subset of the NODE domain [ 218.555523] BUG: arch topology borken [ 218.555528] the DIE domain not a subset of the NODE domain [ 218.555535] BUG: arch topology borken [ 218.555539] the DIE domain not a subset of the NODE domain [ 218.555545] BUG: arch topology borken [ 218.555550] the DIE domain not a subset of the NODE domain [ 218.555556] BUG: arch topology borken [ 218.555560] the DIE domain not a subset of the NODE domain [ 218.555567] BUG: arch topology borken [ 218.555571] the DIE domain not a subset of the NODE domain [ 218.555577] BUG: arch topology borken [ 218.555581] the DIE domain not a subset of the NODE domain [ 218.555672] Unable to handle kernel paging request for data at address 0x9332ae80f961139f [ 218.555679] Faulting instruction address: 0xc0000000001768cc [ 218.555686] Oops: Kernel access of bad area, sig: 11 [#1] [ 218.555691] LE SMP NR_CPUS=2048 NUMA pSeries [ 218.555699] Modules linked in: vmx_crypto crct10dif_vpmsum sch_fq_codel ib_iser rdma_cm iw_cm ib_cm ib_core iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi ip_tables x_tables autofs4 btrfs zstd_compress raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq libcrc32c raid1 raid0 multipath linear ibmvscsi ibmveth crc32c_vpmsum [ 218.555745] CPU: 8 PID: 276 Comm: kworker/8:1 Not tainted 4.15.0-48-generic #51-Ubuntu [ 218.555757] Workqueue: events cpuset_hotplug_workfn [ 218.555763] NIP: c0000000001768cc LR: c0000000001769a8 CTR: 0000000000000000 [ 218.555770] REGS: c0000001f5f1f530 TRAP: 0380 Not tainted (4.15.0-48-generic) [ 218.555776] MSR: 8000000000009033 <SF,EE,ME,IR,DR,RI,LE> CR: 22824228 XER: 00000004 [ 218.555789] CFAR: c000000000176920 SOFTE: 1 [ 218.555789] GPR00: c0000000001769a8 c0000001f5f1f7b0 c0000000016eb400 c0000001f7bfd200 [ 218.555789] GPR04: 0000000000000001 
0000000000000000 0000000000000008 0000000000000010 [ 218.555789] GPR08: 0000000000000018 ffffffffffffffff c0000001f7bfd408 0000000000000000 [ 218.555789] GPR12: 0000000000008000 c000000007a35800 0000000000000007 c0000001f549d900 [ 218.555789] GPR16: 0000000000000040 c000000001722494 c0000001f0f29400 0000000000000001 [ 218.555789] GPR20: c0000001ffb68580 0000000000000008 c0000000011d8580 c00000000171dd78 [ 218.555789] GPR24: 0000000000000000 ffffffffffffe830 ffffffffffffec30 00000000000012af [ 218.555789] GPR28: 000000000000102f c0000001f7bfd200 9332ae80f961139f 9332ae80f961139f [ 218.555859] NIP [c0000000001768cc] free_sched_groups.part.2+0x4c/0xf0 [ 218.555866] LR [c0000000001769a8] destroy_sched_domain+0x38/0xc0 [ 218.555871] Call Trace: [ 218.555875] [c0000001f5f1f7b0] [ffffffffffffec30] 0xffffffffffffec30 (unreliable) [ 218.555884] [c0000001f5f1f7f0] [c0000000001769a8] destroy_sched_domain+0x38/0xc0 [ 218.555892] [c0000001f5f1f820] [c000000000176eb0] cpu_attach_domain+0xf0/0x870 [ 218.555900] [c0000001f5f1f960] [c000000000178884] build_sched_domains+0x1254/0x12f0 [ 218.555908] [c0000001f5f1fa90] [c000000000179a70] partition_sched_domains+0x2d0/0x410 [ 218.555916] [c0000001f5f1fb20] [c0000000001ffb60] rebuild_sched_domains_locked+0x60/0x80 [ 218.555924] [c0000001f5f1fb50] [c000000000202e68] rebuild_sched_domains+0x38/0x60 [ 218.555932] [c0000001f5f1fb80] [c000000000202fc8] cpuset_hotplug_workfn+0x138/0xb60 [ 218.555941] [c0000001f5f1fc90] [c000000000135858] process_one_work+0x298/0x5a0 [ 218.555949] [c0000001f5f1fd20] [c000000000135bf8] worker_thread+0x98/0x630 [ 218.555956] [c0000001f5f1fdc0] [c00000000013e7e8] kthread+0x1a8/0x1b0 [ 218.555964] [c0000001f5f1fe30] [c00000000000b658] ret_from_kernel_thread+0x5c/0x84 [ 218.555971] Instruction dump: [ 218.555975] 7d908026 fbe1fff8 91810008 f8010010 f821ffc1 7c7d1b78 2e240000 7c7f1b78 [ 218.555985] 48000010 7fbee840 7fdff378 419e0074 <ebdf0000> 4192002c 7c0004ac e95f0010 [ 218.555997] ---[ end trace 1d7b9b38e50835a4 
]--- --- ---uname output--- Linux ubuntu 4.15.0-48-generic #51-Ubuntu SMP Wed Apr 3 08:26:19 UTC 2019 ppc64le ppc64le ppc64le GNU/Linux Machine Type = na ---Debugger--- A debugger is not configured ---Steps to Reproduce---  1. Install a 4.15 kernel (4.15.0-48-generic) 2. Hot remove a core: drmgr -c cpu -r -q 1 3. Hot add a core: drmgr -c cpu -a -q 1 Actual Result: System crashes after "drmgr -c cpu -a -q 1" command is issued Expected result: Hot add succeeds without any crash == Comment: #20 - SEETEENA THOUFEEK <sthoufee@in.ibm.com> - 2019-06-20 07:00:39 == Please integrate these two patches 1. https://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux.git/commit/?h=next&id=2d4d9b308f8f8dec68f6dbbff18c68ec7c6bd26f powerpc/numa: improve control of topology updates When booted with "topology_updates=no", or when "off" is written to /proc/powerpc/topology_updates, NUMA reassignments are inhibited for PRRN and VPHN events. However, migration and suspend unconditionally re-enable reassignments via start_topology_update(). This is incoherent. Check the topology_updates_enabled flag in start/stop_topology_update() so that callers of those APIs need not be aware of whether reassignments are enabled. This allows the administrative decision on reassignments to remain in force across migrations and suspensions. Signed-off-by: Nathan Lynch <nathanl@linux.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> 2. https://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux.git/commit/?h=next&id=558f86493df09f68f79fe056d9028d317a3ce8ab powerpc/numa: document topology_updates_enabled, disable by default Changing the NUMA associations for CPUs and memory at runtime is basically unsupported by the core mm, scheduler etc. We see all manner of crashes, warnings and instability when the pseries code tries to do this. Disable this behavior by default, and document the switch a bit. 
Signed-off-by: Nathan Lynch <nathanl@linux.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Thanks in advance for your support.
2019-06-24 19:18:30 Manoj Iyer ubuntu-power-systems: importance High Critical
2019-06-24 19:18:32 Manoj Iyer linux (Ubuntu): importance High Critical
2019-06-28 14:24:15 Stefan Bader nominated for series Ubuntu Bionic
2019-06-28 14:24:15 Stefan Bader bug task added linux (Ubuntu Bionic)
2019-06-28 14:24:53 Stefan Bader linux (Ubuntu Bionic): importance Undecided High
2019-07-01 15:53:11 Kleber Sacilotto de Souza linux (Ubuntu Bionic): status New Fix Committed
2019-07-03 13:07:22 Ubuntu Kernel Bot tags architecture-ppc64le bugnameltc-177451 powervm severity-high targetmilestone-inin--- architecture-ppc64le bugnameltc-177451 powervm severity-high targetmilestone-inin--- verification-needed-bionic
2019-07-08 13:37:20 Manoj Iyer ubuntu-power-systems: status In Progress Fix Committed
2019-07-08 13:39:57 Manoj Iyer linux (Ubuntu Bionic): assignee Manoj Iyer (manjo)
2019-07-08 15:20:31 Manoj Iyer tags architecture-ppc64le bugnameltc-177451 powervm severity-high targetmilestone-inin--- verification-needed-bionic architecture-ppc64le bugnameltc-177451 powervm severity-high targetmilestone-inin--- verification-done-bionic
2019-07-22 10:53:34 Launchpad Janitor linux (Ubuntu Bionic): status Fix Committed Fix Released
2019-07-22 10:53:34 Launchpad Janitor cve linked 2018-12126
2019-07-22 10:53:34 Launchpad Janitor cve linked 2018-12127
2019-07-22 10:53:34 Launchpad Janitor cve linked 2018-12130
2019-07-22 10:53:34 Launchpad Janitor cve linked 2019-11085
2019-07-22 10:53:34 Launchpad Janitor cve linked 2019-11091
2019-07-22 10:53:34 Launchpad Janitor cve linked 2019-11815
2019-07-22 10:53:34 Launchpad Janitor cve linked 2019-11833
2019-07-22 10:53:34 Launchpad Janitor cve linked 2019-11884
2019-07-22 13:52:44 Manoj Iyer linux (Ubuntu): status In Progress Fix Released
2019-07-24 20:54:43 Brad Figg tags architecture-ppc64le bugnameltc-177451 powervm severity-high targetmilestone-inin--- verification-done-bionic architecture-ppc64le bugnameltc-177451 cscc powervm severity-high targetmilestone-inin--- verification-done-bionic
2019-08-05 13:41:03 Frank Heimes ubuntu-power-systems: status Fix Committed Fix Released
2019-08-06 05:09:34 bugproxy tags architecture-ppc64le bugnameltc-177451 cscc powervm severity-high targetmilestone-inin--- verification-done-bionic architecture-ppc64le bugnameltc-177451 cscc powervm severity-high targetmilestone-inin18042 verification-done-bionic
2019-08-22 16:18:02 Ubuntu Kernel Bot tags architecture-ppc64le bugnameltc-177451 cscc powervm severity-high targetmilestone-inin18042 verification-done-bionic architecture-ppc64le bugnameltc-177451 cscc powervm severity-high targetmilestone-inin18042 verification-done-bionic verification-needed-xenial
2019-08-23 11:49:42 bugproxy tags architecture-ppc64le bugnameltc-177451 cscc powervm severity-high targetmilestone-inin18042 verification-done-bionic verification-needed-xenial architecture-ppc64le bugnameltc-177451 cscc powervm severity-high targetmilestone-inin18042 verification-done-bionic verification-done-xenial