[Ubuntu 24.04] MultiVM - L2 guest(s) running stress-ng getting stuck at booting after triggering crash

Bug #2077722 reported by bugproxy
Affects: kernel-package (Ubuntu)
Status: New
Importance: Undecided
Assigned to: Ubuntu on IBM Power Systems Bug Triage
Milestone: (none)

Bug Description

Problem:
While bringing up two Ubuntu 24.04 L2 guests, running stress-ng (90% load) on both, and triggering a crash on both simultaneously, the first guest gets stuck and does not boot back up. In one of the attempts, both guests got stuck during boot with the console hung.

Attempts:
Reproducible 3/3 consecutive times
Run 1: L2-1 guest got stuck
Run 2: L2-1 guest got stuck
Run 3: L2-1 and L2-2 guests got stuck

=================================================================
L1 Host:
1. PowerVM
2. OS: Ubuntu 24.04
3. Kernel: 6.8.0-31-generic
4. Mem (free -mh): 47Gi
5. cpus: 40

Guest L2-1:
1. OS: Ubuntu 24.04
2. Kernel: 6.8.0-31-generic
3. Mem (free -mh): 9.5Gi
4. cpus: 8
5. Stress: stress-ng - 90% load
6. XML configuration:
   <vcpu placement='static' current='8'>16</vcpu>
   <memory unit='KiB'>10971520</memory>
   <topology sockets='8' dies='1' cores='1' threads='2'/>

Guest L2-2:
1. OS: Ubuntu 24.04
2. Kernel: 6.8.0-31-generic
3. Mem (free -mh): 9.5Gi
4. cpus: 8
5. Stress: stress-ng - 90% load
6. XML configuration:
   <vcpu placement='static' current='8'>16</vcpu>
   <memory unit='KiB'>10971520</memory>
   <topology sockets='2' dies='1' cores='1' threads='8'/>
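
As a cross-check, the topology each guest actually sees can be compared against the XML above. A minimal sketch using standard virsh/util-linux commands (the guest name is a placeholder):

# On the L1 host: current vs. maximum vcpus of a guest
virsh vcpucount <L2-guest-name>

# Inside the L2 guest: sockets/cores/threads as seen by the kernel
lscpu | grep -iE 'socket|core|thread|^CPU\(s\)'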

=================================================================
Steps to reproduce:
1. Bring up two Ubuntu 24.04 L2 guests with the configurations listed above
2. Run the attached stress-ng.sh script on both L2 guests
3. Trigger a crash with echo c > /proc/sysrq-trigger on both L2 guests at the same time

After the crash is triggered, one or both guest consoles get stuck. The guest can then neither be entered nor shut down; in order to boot into the guest again, a virsh destroy of the guest is required.
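
The attached stress-ng.sh is not reproduced here; assuming it simply drives all vcpus at 90% load (an assumption on my part), it would be roughly equivalent to:

# One CPU stressor per online CPU, each at 90% load, running until stopped
stress-ng --cpu 0 --cpu-load 90 --timeout 0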

=================================================================
Run 1: console.log of L2-1 (boot output up to the point it got stuck)
  Booting `Ubuntu'

Loading Linux 6.8.0-31-generic ...
Loading initial ramdisk ...
OF stdout device is: /vdevice/vty@30000000
Preparing to boot Linux version 6.8.0-31-generic (buildd@bos02-ppc64el-018) (powerpc64le-linux-gnu-gcc-13 (Ubuntu 13.2.0-23ubuntu4) 13.2.0, GNU ld (GNU Binutils for Ubuntu) 2.42) #31-Ubuntu SMP Sat Apr 20 00:05:55 UTC 2024 (Ubuntu 6.8.0-31.31-generic 6.8.1)
Detected machine type: 0000000000000101
command line: BOOT_IMAGE=/vmlinux-6.8.0-31-generic root=/dev/mapper/ubuntu--vg-ubuntu--lv ro crashkernel=2G-4G:320M,4G-32G:512M,32G-64G:1024M,64G-128G:2048M,128G-:4096M
Max number of cores passed to firmware: 1024 (NR_CPUS = 2048)
Calling ibm,client-architecture-support... done
memory layout at init:
  memory_limit : 0000000000000000 (16 MB aligned)
  alloc_bottom : 0000000009d70000
  alloc_top : 0000000030000000
  alloc_top_hi : 00000002a0000000
  rmo_top : 0000000030000000
  ram_top : 00000002a0000000
instantiating rtas at 0x000000002fff0000... done
prom_hold_cpus: skipped
copying OF device tree...
Building dt strings...
Building dt structure...
Device tree strings 0x0000000009d80000 -> 0x0000000009d80bc6
Device tree struct 0x0000000009d90000 -> 0x0000000009da0000
Quiescing Open Firmware ...
Booting Linux via __start() @ 0x0000000000230000 ...
[ 0.000000] random: crng init done
[ 0.000000] Reserving 512MB of memory at 512MB for crashkernel (System RAM: 10752MB)
[ 0.000000] radix-mmu: Page sizes from device-tree:
[ 0.000000] radix-mmu: Page size shift = 12 AP=0x0
[ 0.000000] radix-mmu: Page size shift = 16 AP=0x5
[ 0.000000] radix-mmu: Page size shift = 21 AP=0x1
[ 0.000000] radix-mmu: Page size shift = 30 AP=0x2
[ 0.000000] Activating Kernel Userspace Access Prevention
[ 0.000000] Activating Kernel Userspace Execution Prevention
[ 0.000000] radix-mmu: Mapped 0x0000000000000000-0x00000000038a0000 with 64.0 KiB pages (exec)
[ 0.000000] radix-mmu: Mapped 0x00000000038a0000-0x00000002a0000000 with 64.0 KiB pages
[ 0.000000] lpar: Using radix MMU under hypervisor
[ 0.000000] Linux version 6.8.0-31-generic (buildd@bos02-ppc64el-018) (powerpc64le-linux-gnu-gcc-13 (Ubuntu 13.2.0-23ubuntu4) 13.2.0, GNU ld (GNU Binutils for Ubuntu) 2.42) #31-Ubuntu SMP Sat Apr 20 00:05:55 UTC 2024 (Ubuntu 6.8.0-31.31-generic 6.8.1)
[ 0.000000] Secure boot mode disabled
[ 0.000000] Found initrd at 0xc000000006200000:0xc000000009d6da29
[ 0.000000] Hardware name: IBM pSeries (emulated by qemu) POWER10 (raw) 0x800200 0xf000006 of:SLOF,HEAD hv:linux,kvm pSeries
[ 0.000000] printk: legacy bootconsole [udbg0] enabled
[ 0.000000] Partition configured for 16 cpus.
[ 0.000000] CPU maps initialized for 2 threads per core
[ 0.000000] numa: Partition configured for 1 NUMA nodes.
[ 0.000000] -----------------------------------------------------
[ 0.000000] phys_mem_size = 0x2a0000000
[ 0.000000] dcache_bsize = 0x80
[ 0.000000] icache_bsize = 0x80
[ 0.000000] cpu_features = 0x001400eb8f5f9187
[ 0.000000] possible = 0x001ffbfbcf5fb187
[ 0.000000] always = 0x0000000380008181
[ 0.000000] cpu_user_features = 0xdc0065c2 0xaef60000
[ 0.000000] mmu_features = 0x3c007641
[ 0.000000] firmware_features = 0x00000a85455a445f
[ 0.000000] vmalloc start = 0xc008000000000000
[ 0.000000] IO start = 0xc00a000000000000
[ 0.000000] vmemmap start = 0xc00c000000000000
[ 0.000000] -----------------------------------------------------
[ 0.000000] numa: NODE_DATA [mem 0x28ae09c00-0x28ae1197f]
[ 0.000000] rfi-flush: fallback displacement flush available
[ 0.000000] rfi-flush: ori type flush available
[ 0.000000] rfi-flush: mttrig type flush available
[ 0.000000] count-cache-flush: hardware flush enabled.
[ 0.000000] link-stack-flush: software flush enabled.
[ 0.000000] stf-barrier: eieio barrier available
[ 0.000000] PPC64 nvram contains 65536 bytes
[ 0.000000] barrier-nospec: using ORI speculation barrier
[ 0.000000] Zone ranges:
[ 0.000000] Normal [mem 0x0000000000000000-0x000000029fffffff]
[ 0.000000] Device empty
[ 0.000000] Movable zone start for each node
[ 0.000000] Early memory node ranges
[ 0.000000] node 0: [mem 0x0000000000000000-0x000000029fffffff]
[ 0.000000] Initmem setup node 0 [mem 0x0000000000000000-0x000000029fffffff]
[ 0.000000] percpu: Embedded 12 pages/cpu s609960 r0 d176472 u786432
[ 0.000000] Kernel command line: BOOT_IMAGE=/vmlinux-6.8.0-31-generic root=/dev/mapper/ubuntu--vg-ubuntu--lv ro crashkernel=2G-4G:320M,4G-32G:512M,32G-64G:1024M,64G-128G:2048M,128G-:4096M
[ 0.000000] Unknown kernel command line parameters "BOOT_IMAGE=/vmlinux-6.8.0-31-generic", will be passed to user space.
[ 0.000000] Dentry cache hash table entries: 2097152 (order: 8, 16777216 bytes, linear)
[ 0.000000] Inode-cache hash table entries: 1048576 (order: 7, 8388608 bytes, linear)
[ 0.000000] Fallback order for Node 0: 0
[ 0.000000] Built 1 zonelists, mobility grouping on. Total pages: 171864
[ 0.000000] Policy zone: Normal
[ 0.000000] mem auto-init: stack:all(zero), heap alloc:on, heap free:off
[ 0.000000] Memory: 9947840K/11010048K available (23680K kernel code, 4096K rwdata, 25472K rodata, 8832K init, 1901K bss, 1062208K reserved, 0K cma-reserved)
[ 0.000000] SLUB: HWalign=128, Order=0-3, MinObjects=0, CPUs=16, Nodes=1
[ 0.000000] ftrace: allocating 51717 entries in 19 pages
[ 0.000000] ftrace: allocated 19 pages with 3 groups
[ 0.000000] trace event string verifier disabled
[ 0.000000] rcu: Hierarchical RCU implementation.
[ 0.000000] rcu: RCU restricting CPUs from NR_CPUS=2048 to nr_cpu_ids=16.
[ 0.000000] Rude variant of Tasks RCU enabled.
[ 0.000000] Tracing variant of Tasks RCU enabled.
[ 0.000000] rcu: RCU calculated value of scheduler-enlistment delay is 25 jiffies.
[ 0.000000] rcu: Adjusting geometry for rcu_fanout_leaf=16, nr_cpu_ids=16
[ 0.000000] NR_IRQS: 512, nr_irqs: 512, preallocated irqs: 16
[ 0.000000] xive: Using IRQ range [0-f]
[ 0.000000] xive: Interrupt handling initialized with spapr backend
[ 0.000000] xive: Using priority 6 for all interrupts
[ 0.000000] xive: Using 64kB queues
[ 0.000000] rcu: srcu_init: Setting srcu_struct sizes based on contention.
[ 0.000000] time_init: 56 bit decrementer (max: 7fffffffffffff)
[ 0.001027] clocksource: timebase: mask: 0xffffffffffffffff max_cycles: 0x761537d007, max_idle_ns: 440795202126 ns
[ 0.002881] clocksource: timebase mult[1f40000] shift[24] registered

=================================================================
Host side:
When the L2-1 guest console got stuck on the first attempt (Run 1):

# top | cat

top - 08:53:11 up 2 days, 14:15, 6 users, load average: 9.00, 10.53, 12.53
Tasks: 496 total, 1 running, 495 sleeping, 0 stopped, 0 zombie
%Cpu(s): 1.2 us, 2.2 sy, 0.0 ni, 76.7 id, 0.0 wa, 20.0 hi, 0.0 si, 0.0 st
MiB Mem : 48414.8 total, 303.5 free, 24681.1 used, 23777.0 buff/cache
MiB Swap: 8191.9 total, 7910.1 free, 281.8 used. 23733.7 avail Mem

USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
root 20 0 15.6g 10.5g 15360 S 800.0 22.2 146:46.26 qemu-system-ppc
root 20 0 15.5g 10.5g 15360 S 100.0 22.1 88:03.53 qemu-system-ppc

# free -mh
               total used free shared buff/cache available
Mem: 47Gi 24Gi 230Mi 2.2Mi 23Gi 23Gi

=================================================================
Debugging logs/dumps:
1. console.logs of both L2 guest consoles (All 3 attempts)
2. virsh dump of both guests (All 3 attempts)

The above logs/dumps are being copied to the june server machines under /dump/dumps/<bug-number>.
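
For reference, the guest memory dumps can be taken from the L1 host while a guest is stuck; a minimal sketch (the domain name and output path are placeholders):

# Memory-only dump in ELF format, usable directly with the crash utility
virsh dump --memory-only --format elf <domain> /dump/dumps/<bug-number>/vmcore-<domain>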

=================================================================
Attachments:
1. Run-1 console.log of L2-1 guest getting stuck
2. Run-3 console.log of L2-1 guest getting stuck
3. Run-3 console.log of L2-2 guest getting stuck
4. Stress-ng script to run 90% load: stress-ng.sh

$ pwd
/home/dump/dumps/206735
$ ls
bug-206735-guest-console-logs bug-206735-guest-virsh-dumps

Thanks.

~/bug-206735-dumps# crash /root/.cache/debuginfod_client/475c3a23ac990f64c5a03cf1fe8b229fde9a7692/debuginfo ./vmcore-ubuntu_vm1-1

crash 8.0.4
Copyright (C) 2002-2022 Red Hat, Inc.
Copyright (C) 2004, 2005, 2006, 2010 IBM Corporation
Copyright (C) 1999-2006 Hewlett-Packard Co
Copyright (C) 2005, 2006, 2011, 2012 Fujitsu Limited
Copyright (C) 2006, 2007 VA Linux Systems Japan K.K.
Copyright (C) 2005, 2011, 2020-2022 NEC Corporation
Copyright (C) 1999, 2002, 2007 Silicon Graphics, Inc.
Copyright (C) 1999, 2000, 2001, 2002 Mission Critical Linux, Inc.
Copyright (C) 2015, 2021 VMware, Inc.
This program is free software, covered by the GNU General Public License,
and you are welcome to change it and/or distribute copies of it under
certain conditions. Enter "help copying" to see the conditions.
This program has absolutely no warranty. Enter "help warranty" for details.

GNU gdb (GDB) 10.2
Copyright (C) 2021 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.
Type "show copying" and "show warranty" for details.
This GDB was configured as "powerpc64le-unknown-linux-gnu".
Type "show configuration" for configuration details.
Find the GDB manual and other documentation resources online at:
    <http://www.gnu.org/software/gdb/documentation/>.

For help, type "help".
Type "apropos word" to search for commands related to "word"...

      KERNEL: /root/.cache/debuginfod_client/475c3a23ac990f64c5a03cf1fe8b229fde9a7692/debuginfo
    DUMPFILE: ./vmcore-ubuntu_vm1-1
        CPUS: 1
        DATE: Fri May 24 08:44:35 UTC 2024
      UPTIME: 00:00:00
LOAD AVERAGE: 0.00, 0.00, 0.00
       TASKS: 1
    NODENAME: (none)
     RELEASE: 6.8.0-31-generic
     VERSION: #31-Ubuntu SMP Sat Apr 20 00:05:55 UTC 2024
     MACHINE: ppc64le (3450 Mhz)
      MEMORY: 10.5 GB
       PANIC: ""
         PID: 0
     COMMAND: "swapper/0"
        TASK: c000000003bf8900 [THREAD_INFO: c000000003bf8900]
         CPU: 0
       STATE: TASK_RUNNING (ACTIVE)
     WARNING: panic task not found

crash> bt
PID: 0 TASK: c000000003bf8900 CPU: 0 COMMAND: "swapper/0"
 R0: c0000000000de4f4 R1: c00000028af13f80 R2: c000000002254800
 R3: c0000000048de000 R4: c000000003c37bc0 R5: 0000000000000000
 R6: 0000000000000000 R7: 0000000000000000 R8: c000000001724d18
 R9: 000000000000ff00 R10: 0000000286f80000 R11: 0000000053474552
 R12: c0000000000e4184 R13: c000000003e80000 R14: 0000000000000000
 R15: 0000000000000000 R16: 0000000000000000 R17: 0000000000000000
 R18: 0000000000000000 R19: 0000000000000000 R20: 0000000000000000
 R21: 0000000000000000 R22: 0000000000000000 R23: 0000000000000000
 R24: 0000000000000000 R25: c000000003c37bc0 R26: c00000028af13fe0
 R27: c000000003c34000 R28: c000000003c69e88 R29: c000000003c37c80
 R30: c000000003262bd8 R31: c0000000048de000
 NIP: c0000000000e41a4 MSR: 8000000000000033 OR3: 0000000000000000
 CTR: c0000000000e4184 LR: c0000000000de4f4 XER: 0000000000000074
 CCR: 0000000082042840 MQ: 0000000000000000 DAR: 0000000000000000
 DSISR: 0000000000000000 Syscall Result: 0000000000000000
 [NIP : xive_spapr_update_pending+32]
 [LR : xive_get_irq+76]
 #0 [c00000028af13f80] (null) at 0 (unreliable)
 #1 [c00000028af13fb0] __do_irq at c000000000017a78
 #2 [c00000028af13fe0] __do_IRQ at c000000000018cd8
 #3 [c000000003c37bc0] (null) at 0 (unreliable)
 #4 [c000000003c37c20] do_IRQ at c000000000018e30
 #5 [c000000003c37c50] hardware_interrupt_common_virt at c00000000000953c
 #6 [c000000003c37f20] (null) at 9d6da29 (unreliable)
 #7 [c000000003c37f50] start_kernel at c00000000300fed0
 #8 [c000000003c37fe0] start_here_common at c00000000000e998
crash> dis xive_spapr_update_pending+32
0xc0000000000e41a4 <xive_spapr_update_pending+32>: hwsync
crash> dis -s xive_spapr_update_pending+32
FILE: /build/linux-NbDBKx/linux-6.8.0/arch/powerpc/sysdev/xive/spapr.c
LINE: 618

dis: xive_spapr_update_pending+32: source code is not available

crash>
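
Since there is no panic task in this dump, it is worth looking beyond the active CPU. Two standard crash commands that help here, shown as a sketch:

crash> bt -a                              # backtraces for all CPUs, not just the one above
crash> dis -l xive_spapr_update_pending   # disassembly annotated with source file/line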

I debugged the 6.9.4-200.fc40 kernel of the evelp2g2 L2 VM, as given to me by Lekshmi, and I find that stress-ng has nothing to do with hitting this hang in the same __do_IRQ -> xive_get_irq -> xive_spapr_update_pending call stack.

This call stack is hit randomly on one of the L2 vcpus whenever we do an "echo c > /proc/sysrq-trigger".

The NIP points to the mb() macro in the C code, which appears as hwsync in the GDB disassembly (on powerpc, mb() compiles to the sync instruction, for which hwsync is the extended mnemonic), but this isn't really a problem with hwsync itself.

I could reproduce this exact same problem with kernel 6.10.0-rc6+ on FC40 after I:
i) Reset the crashkernel cmdline with the following command:
        grubby --update-kernel ALL --args "crashkernel=2G-4G:320M,4G-32G:512M,32G-64G:1024M,64G-128G:2048M,128G-:4096M"
ii) Enabled kdump via the "systemctl enable kdump" and "systemctl start kdump" commands.
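
Before triggering the crash it is worth sanity-checking that the reservation and the kdump service actually took effect; a minimal sketch using standard interfaces:

# Confirm the crashkernel reservation made at boot
dmesg | grep -i crashkernel
cat /sys/kernel/kexec_crash_size

# Confirm a crash kernel is loaded (1 = loaded) and the service is up
cat /sys/kernel/kexec_crash_loaded
systemctl status kdump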

On debugging the vcpu thread in the QEMU instance running in L1, I find that the vcpu thread doesn't exit from KVM_RUN with any error code.

I find that the xive_get_irq() -> xive_spapr_update_pending() functions are being called repeatedly by do_IRQ() -> __do_IRQ() -> __do_irq() in the arch/powerpc code. This means that we are constantly getting interrupts on this CPU. When I debugged which IRQ number we are constantly getting, I saw that it is 0x0, which is not informative as of now (to me at least).

I think there is some problem in the startup sequence of the secondary CPUs, as I never hit this problem on the boot CPU in any of my attempts today.

I request the CPU team to investigate the startup sequence of the secondary SMP CPUs, as they would have a better idea of this area on powerpc.

The procedure to be followed is simple:
i) Put logs in the startup code of the secondary and primary CPU(s).
ii) Investigate the point at which the primary CPU waits for the secondary CPUs to come up, and understand what is not happening on the secondary CPUs such that the primary CPU never gets past the "smp: Bringing up secondary CPUs" log.

(In reply to comment #21)
> I think that there is some problem in the startup sequence of the secondary
> CPUs as I never faced this problem on the boot CPU as long as I tried today.
>
> I request the CPU team to investigate the startup sequence of the secondary
> SMP CPUs as they would be having a better idea of this for powerpc.
>
> The procedure to be followed is simple:
> i) Put logs in the startup code of the secondary and primary CPU(s).
> ii) Investigate the point at which the primary CPU waits for the secondary
> CPUs to come up and understand what isn't happening at the secondary CPUs
> such that the primary CPU doesn't go past the "smp: Bringing up
> secondary CPUs" log.

We have done exactly that, and we see that one of the secondary threads gets stuck in arch_local_irq_restore() during bring-up. We don't know why it gets stuck there, and that is a function that can't be instrumented, since doing so leads to other side effects even before we reach "Bringing up secondary CPUs".

Even the addr2line output for the address (NIP) shown in the RCU stall report points to the same arch_local_irq_restore().
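
For completeness, that mapping can be reproduced along these lines (the vmlinux path and NIP value are placeholders; a vmlinux with debug info is required):

addr2line -f -i -e vmlinux <NIP-from-RCU-stall>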

Just to level set:

Gautham's patch helps if we are going to disable xive in L2.
Nick's patch stops the RCU stalls from being thrown, but L2 will still hang.

Current interpretation of the investigation:
With xive enabled in L2, CPU bring-up gets stuck at the mb() in xive_spapr_update_pending(). However, this is not reproducible with xive disabled.

It's probably a bit early to conclude that xive is the problem.
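
For the xive-disabled runs referred to above: xive can be turned off in the L2 guest from the kernel command line (xive=off is a documented powerpc boot parameter; the grubby invocation mirrors the one used earlier):

grubby --update-kernel ALL --args "xive=off"
reboot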

-----------------------------------------------------------------------------------

After reverting the following commits, pointed out by Gautam Menghani, the hang is not seen and kdump in L2 works as expected.

df938a5576f3 KVM: PPC: Book3S HV nestedv2: Do not inject certain interrupts
ec0f6639fa88 KVM: PPC: Book3S HV nestedv2: Ensure LPCR_MER bit is passed to the L0
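
For anyone retesting this, the reverts can be applied to a kernel tree along these lines (standard git; reverting the newer commit first to avoid conflicts is an assumption about their order):

git revert ec0f6639fa88   # Ensure LPCR_MER bit is passed to the L0
git revert df938a5576f3   # Do not inject certain interrupts
# rebuild, install, reboot, then retest kdump in L2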

Revision history for this message
bugproxy (bugproxy) wrote : Run 1: L2-1 guest full console.log

tags: added: architecture-ppc64le bugnameltc-206735 severity-high targetmilestone-inin---
Revision history for this message
bugproxy (bugproxy) wrote : Run 3: L2-1 guest full console.log

Revision history for this message
bugproxy (bugproxy) wrote : Run 3: L2-2 guest full console.log

Revision history for this message
bugproxy (bugproxy) wrote : stress-ng script for 90% load

Changed in ubuntu:
assignee: nobody → Ubuntu on IBM Power Systems Bug Triage (ubuntu-power-triage)
affects: ubuntu → kernel-package (Ubuntu)