Activity log for bug #2062380

Date Who What changed Old value New value Message
2024-04-18 14:56:39 Ian May bug added bug
2024-04-18 16:12:58 Ian May summary Using a 6.8 kernel modprobe nvidia hangs on Grace Hopper Using a 6.8 kernel 'modprobe nvidia' hangs on Quanta Grace Hopper
2024-04-18 16:15:27 Ian May bug task added nvidia-graphics-drivers-535-server (Ubuntu)
2024-04-18 16:16:14 Ian May nvidia-graphics-drivers-535-server (Ubuntu): status New Confirmed
2024-04-18 16:16:17 Ian May nvidia-graphics-drivers-550-server (Ubuntu): status New Confirmed
2024-04-18 16:19:03 Ian May description Using both -generic and -nvidia 6.8 kernels I'm seeing a hang when I load the nvidia driver. [ 382.938326] rcu: INFO: rcu_preempt detected stalls on CPUs/tasks: [ 382.946075] rcu: 53-...0: (4 ticks this GP) idle=1c2c/1/0x4000000000000000 softirq=4866/4868 fqs=14124 [ 382.955683] rcu: hardirqs softirqs csw/system [ 382.961378] rcu: number: 0 0 0 [ 382.967071] rcu: cputime: 0 0 0 ==> 30026(ms) [ 382.974189] rcu: (detected by 52, t=60034 jiffies, g=24469, q=1199 ncpus=72) [ 392.982095] rcu: rcu_preempt kthread starved for 9994 jiffies! g24469 f0x0 RCU_GP_DOING_FQS(6) ->state=0x0 ->cpu=31 [ 392.992769] rcu: Unless rcu_preempt kthread gets sufficient CPU time, OOM is now expected behavior After seeing this, I Enabled kdump and set kernel.panic_on_rcu_stall = 1 KDUMP INFO WARNING: cpu 54: cannot find NT_PRSTATUS note KERNEL: /usr/lib/debug/boot/vmlinux-6.8.0-1004-nvidia-64k [TAINTED] DUMPFILE: /var/crash/202404172139/dump.202404172139 [PARTIAL DUMP] CPUS: 72 DATE: Wed Apr 17 21:39:13 UTC 2024 UPTIME: 00:06:10 LOAD AVERAGE: 0.68, 0.63, 0.28 TASKS: 854 NODENAME: hinyari RELEASE: 6.8.0-1005-nvidia-64k VERSION: #5-Ubuntu SMP PREEMPT_DYNAMIC Wed Apr 17 11:26:46 UTC 2024 MACHINE: aarch64 (unknown Mhz) MEMORY: 479.7 GB PANIC: "Kernel panic - not syncing: RCU Stall" PID: 0 COMMAND: "swapper/21" TASK: ffff000082026880 (1 of 72) [THREAD_INFO: ffff000082026880] CPU: 21 STATE: TASK_RUNNING (PANIC) [ 300.313144] nvidia: loading out-of-tree module taints kernel. 
[ 300.313153] nvidia: module verification failed: signature and/or required key missing - tainting kernel [ 300.316694] nvidia-nvlink: Nvlink Core is being initialized, major device number 506 [ 300.316699] [ 360.323454] rcu: INFO: rcu_preempt detected stalls on CPUs/tasks: [ 360.331206] rcu: 54-...0: (24 ticks this GP) idle=742c/1/0x4000000000000000 softirq=4931/4933 fqs=13148 [ 360.340903] rcu: hardirqs softirqs csw/system [ 360.346597] rcu: number: 0 0 0 [ 360.352291] rcu: cputime: 0 0 0 ==> 30031(ms) [ 360.359408] rcu: (detected by 21, t=60038 jiffies, g=25009, q=1123 ncpus=72) [ 360.366704] Sending NMI from CPU 21 to CPUs 54: [ 370.367310] rcu: rcu_preempt kthread starved for 9993 jiffies! g25009 f0x0 RCU_GP_DOING_FQS(6) ->state=0x0 ->cpu=31 [ 370.377983] rcu: Unless rcu_preempt kthread gets sufficient CPU time, OOM is now expected behavior. [ 370.387322] rcu: RCU grace-period kthread stack dump: [ 370.392482] task:rcu_preempt state:I stack:0 pid:17 tgid:17 ppid:2 flags:0x00000008 [ 370.392488] Call trace: [ 370.392489] __switch_to+0xd0/0x118 [ 370.392499] __schedule+0x2a8/0x7b0 [ 370.392501] schedule+0x40/0x168 [ 370.392502] schedule_timeout+0xac/0x1e0 [ 370.392505] rcu_gp_fqs_loop+0x128/0x508 [ 370.392512] rcu_gp_kthread+0x150/0x188 [ 370.392514] kthread+0xf8/0x110 [ 370.392519] ret_from_fork+0x10/0x20 [ 370.392524] rcu: Stack dump where RCU GP kthread last ran: [ 370.398128] Sending NMI from CPU 21 to CPUs 31: [ 370.398131] NMI backtrace for cpu 31 [ 370.398136] CPU: 31 PID: 0 Comm: swapper/31 Kdump: loaded Tainted: G OE 6.8.0-1005-nvidia-64k #5-Ubuntu [ 370.398139] Hardware name: /P3880, BIOS 01.02.01 20240207 [ 370.398140] pstate: 63400009 (nZCv daif +PAN -UAO +TCO +DIT -SSBS BTYPE=--) [ 370.398142] pc : cpuidle_enter_state+0xd8/0x790 [ 370.398150] lr : cpuidle_enter_state+0xcc/0x790 [ 370.398153] sp : ffff800081eefd70 [ 370.398154] x29: ffff800081eefd70 x28: 0000000000000000 x27: 0000000000000000 [ 370.398157] x26: 0000000000000000 x25: 000000563d67e4e0 
x24: 0000000000000000 [ 370.398160] x23: ffffa0a1445699f8 x22: 0000000000000000 x21: 000000563d72ece0 [ 370.398162] x20: ffffa0a144569a10 x19: ffff00008fa4a800 x18: ffff800081f00030 [ 370.398165] x17: 0000000000000000 x16: 0000000000000000 x15: 0000ac8c73b08db0 [ 370.398168] x14: 0000000000000000 x13: 0000000000000000 x12: 0000000000000000 [ 370.398170] x11: 0000000000000000 x10: 2da0fbe3d5e8c649 x9 : ffffa0a1424fd244 [ 370.398173] x8 : ffff0000820559b8 x7 : 0000000000000000 x6 : 0000000000000000 [ 370.398175] x5 : 0000000000000000 x4 : 0000000000000000 x3 : 0000000000000000 [ 370.398178] x2 : 0000000000000000 x1 : 0000000000000000 x0 : 0000000000000000 [ 370.398181] Call trace: [ 370.398183] cpuidle_enter_state+0xd8/0x790 [ 370.398185] cpuidle_enter+0x44/0x78 [ 370.398195] cpuidle_idle_call+0x15c/0x210 [ 370.398202] do_idle+0xb0/0x130 [ 370.398204] cpu_startup_entry+0x40/0x50 [ 370.398206] secondary_start_kernel+0xec/0x130 [ 370.398211] __secondary_switched+0xc0/0xc8 [ 370.399132] Kernel panic - not syncing: RCU Stall [ 370.403938] CPU: 21 PID: 0 Comm: swapper/21 Kdump: loaded Tainted: G OE 6.8.0-1005-nvidia-64k #5-Ubuntu [ 370.414876] Hardware name: /P3880, BIOS 01.02.01 20240207 [ 370.421192] Call trace: [ 370.423686] dump_backtrace+0xa4/0x150 [ 370.427514] show_stack+0x24/0x50 [ 370.430896] dump_stack_lvl+0x78/0xf8 [ 370.434640] dump_stack+0x1c/0x38 [ 370.438023] panic+0x3a4/0x440 [ 370.441141] print_other_cpu_stall+0x578/0x610 [ 370.445681] check_cpu_stall+0x240/0x300 [ 370.449686] rcu_pending+0x44/0x220 [ 370.453248] rcu_sched_clock_irq+0x7c/0x2c8 [ 370.457519] update_process_times+0x7c/0xf8 [ 370.461794] tick_sched_handle+0x3c/0x98 [ 370.465803] tick_nohz_highres_handler+0x5c/0xe8 [ 370.470520] __hrtimer_run_queues+0x164/0x398 [ 370.474969] hrtimer_interrupt+0xf4/0x278 [ 370.479063] arch_timer_handler_phys+0x38/0x80 [ 370.483607] handle_percpu_devid_irq+0x94/0x2b8 [ 370.488238] generic_handle_domain_irq+0x38/0x70 [ 370.492954] 
__gic_handle_irq_from_irqson.isra.0+0x180/0x310 [ 370.498743] gic_handle_irq+0x2c/0xa0 [ 370.502481] call_on_irq_stack+0x3c/0x50 [ 370.506486] do_interrupt_handler+0xb0/0xc8 [ 370.510759] el1_interrupt+0x48/0xf0 [ 370.514409] el1h_64_irq_handler+0x1c/0x40 [ 370.518592] el1h_64_irq+0x7c/0x80 [ 370.522063] cpuidle_enter_state+0xd8/0x790 [ 370.526336] cpuidle_enter+0x44/0x78 [ 370.529986] cpuidle_idle_call+0x15c/0x210 [ 370.534169] do_idle+0xb0/0x130 [ 370.537375] cpu_startup_entry+0x44/0x50 [ 370.541380] secondary_start_kernel+0xec/0x130 [ 370.545919] __secondary_switched+0xc0/0xc8 [ 370.550197] SMP: stopping secondary CPUs [ 371.601076] SMP: failed to stop secondary CPUs 0-20,22-71 [ 371.607097] Starting crashdump kernel... [ 371.611103] ------------[ cut here ]------------ [ 371.615820] Some CPUs may be stale, kdump will be unreliable. [ 371.621695] WARNING: CPU: 21 PID: 0 at arch/arm64/kernel/machine_kexec.c:174 machine_kexec+0x48/0x1f0 [ 371.631124] Modules linked in: nvidia(OE+) ecc qrtr cfg80211 binfmt_misc dax_hmem cxl_acpi cxl_core nvidia_cspmu acpi_ipmi ast cdc_ether cdc_subset arm_smmuv3_pmu arm_cspmu_module coresight_trbe usbnet arm_spe_pmu ipmi_ssif i2c_algo_bit uio_pdrv_genirq uio spi_nor ipmi_devintf ipmi_msghandler nls_iso8859_1 stm_p_basic coresight_stm stm_core cppc_cpufreq coresight_etm4x coresight_tmc coresight_funnel acpi_power_meter coresight dm_multipath efi_pstore nfnetlink dmi_sysfs ip_tables x_tables autofs4 btrfs blake2b_generic raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor xor_neon raid6_pq libcrc32c raid1 raid0 mlx5_ib ib_uverbs macsec ib_core mlx5_dpll crct10dif_ce mlx5_core polyval_ce polyval_generic ghash_ce sm4_ce_gcm sm4_ce_ccm sm4_ce sm4_ce_cipher sm4 sm3_ce mlxfw sm3 nvme psample sha3_ce i2c_smbus sha2_ce nvme_core tls sha256_arm64 xhci_pci sha1_ce xhci_pci_renesas pci_hyperv_intf nvme_auth i2c_tegra aes_neon_bs aes_neon_blk aes_ce_blk aes_ce_cipher [ 371.719810] CPU: 21 PID: 0 Comm: swapper/21 Kdump: 
loaded Tainted: G OE 6.8.0-1005-nvidia-64k #5-Ubuntu [ 371.730748] Hardware name: /P3880, BIOS 01.02.01 20240207 [ 371.737064] pstate: 634000c9 (nZCv daIF +PAN -UAO +TCO +DIT -SSBS BTYPE=--) [ 371.744180] pc : machine_kexec+0x48/0x1f0 [ 371.748275] lr : machine_kexec+0x48/0x1f0 [ 371.752369] sp : ffff8000802afa10 [ 371.755751] x29: ffff8000802afa10 x28: 0000000000000463 x27: 000000000000003c [ 371.763047] x26: 00000000000000c0 x25: 0000000000000280 x24: ffffa0a144268cb4 [ 371.770341] x23: ffffa0a14439f540 x22: ffffa0a1447cf4c0 x21: ffffa0a14481a000 [ 371.777636] x20: ffff0000d987e000 x19: ffff0000d987e000 x18: ffff800080ba0088 [ 371.784930] x17: 0000000000000000 x16: 0000000000000000 x15: 0000000000000463 [ 371.792225] x14: 0000000000000000 x13: 2e656c6261696c65 x12: 726e75206562206c [ 371.799519] x11: 6c697720706d7564 x10: 0000000000000000 x9 : 0000000000000000 [ 371.806814] x8 : 0000000000000000 x7 : 0000000000000000 x6 : 0000000000000000 [ 371.814108] x5 : 0000000000000000 x4 : 0000000000000000 x3 : 0000000000000000 [ 371.821402] x2 : 0000000000000000 x1 : 0000000000000000 x0 : 0000000000000000 [ 371.828696] Call trace: [ 371.831189] machine_kexec+0x48/0x1f0 [ 371.834928] __crash_kexec+0x94/0x128 [ 371.838668] panic+0x380/0x440 [ 371.841784] print_other_cpu_stall+0x578/0x610 [ 371.846325] check_cpu_stall+0x240/0x300 [ 371.850331] rcu_pending+0x44/0x220 [ 371.853892] rcu_sched_clock_irq+0x7c/0x2c8 [ 371.858163] update_process_times+0x7c/0xf8 [ 371.862434] tick_sched_handle+0x3c/0x98 [ 371.866440] tick_nohz_highres_handler+0x5c/0xe8 [ 371.871156] __hrtimer_run_queues+0x164/0x398 [ 371.875605] hrtimer_interrupt+0xf4/0x278 [ 371.879700] arch_timer_handler_phys+0x38/0x80 [ 371.884240] handle_percpu_devid_irq+0x94/0x2b8 [ 371.888869] generic_handle_domain_irq+0x38/0x70 [ 371.893585] __gic_handle_irq_from_irqson.isra.0+0x180/0x310 [ 371.899368] gic_handle_irq+0x2c/0xa0 [ 371.903105] call_on_irq_stack+0x3c/0x50 [ 371.907110] do_interrupt_handler+0xb0/0xc8 [ 371.911382] 
el1_interrupt+0x48/0xf0 [ 371.915032] el1h_64_irq_handler+0x1c/0x40 [ 371.919215] el1h_64_irq+0x7c/0x80 [ 371.922686] cpuidle_enter_state+0xd8/0x790 [ 371.926958] cpuidle_enter+0x44/0x78 [ 371.930609] cpuidle_idle_call+0x15c/0x210 [ 371.934793] do_idle+0xb0/0x130 [ 371.937998] cpu_startup_entry+0x44/0x50 [ 371.942003] secondary_start_kernel+0xec/0x130 [ 371.946542] __secondary_switched+0xc0/0xc8 [ 371.950815] ---[ end trace 0000000000000000 ]--- In an attempt to get more debug info, I tried the open driver from GitHub, using https://github.com/NVIDIA/open-gpu-kernel-modules. Version 550.76 - loads successfully. Version 550.67 - loads successfully. Version 550.54.15 - crashes; this is the same version as the 550 package that hangs. Below is the crash info. Interestingly, when I changed the optimization level in utils.mk from -O2 to -O0 in an attempt to capture more debug info, the crash went away. It also doesn't happen with -O1. CRASH INFO [ 8648.399518] nvidia-nvlink: Nvlink Core is being initialized, major device number 506 [ 8648.399560] [ 8648.399718] Internal error: Oops - FPAC: 0000000072000000 [#1] SMP [ 8648.407556] Modules linked in: nvidia(OE+) ecdh_generic ecc qrtr cfg80211 binfmt_misc dax_hmem cxl_acpi cxl_core nvidia_cspmu arm_smmuv3_pmu arm_cspmu_module coresight_trbe arm_spe_pmu acpi_ipmi ast cdc_ether cdc_subset ipmi_ssif usbnet i2c_algo_bit uio_pdrv_genirq uio spi_nor ipmi_devintf ipmi_msghandler nls_iso8859_1 stm_p_basic coresight_stm stm_core cppc_cpufreq coresight_etm4x coresight_tmc coresight_funnel acpi_power_meter coresight dm_multipath efi_pstore nfnetlink dmi_sysfs ip_tables x_tables autofs4 btrfs blake2b_generic raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor xor_neon raid6_pq libcrc32c raid1 raid0 mlx5_ib ib_uverbs macsec ib_core mlx5_dpll mlx5_core crct10dif_ce polyval_ce polyval_generic ghash_ce sm4_ce_gcm sm4_ce_ccm sm4_ce sm4_ce_cipher sm4 sm3_ce sm3 mlxfw i2c_smbus nvme psample sha3_ce sha2_ce nvme_core tls 
sha256_arm64 xhci_pci sha1_ce xhci_pci_renesas pci_hyperv_intf nvme_auth i2c_tegra aes_neon_bs aes_neon_blk aes_ce_blk aes_ce_cipher [last unloaded: nvidia(OE)] [ 8648.407608] [ 8648.501397] CPU: 5 PID: 48130 Comm: insmod Kdump: loaded Tainted: G OE 6.8.0-1004-nvidia-64k #4 [ 8648.511625] Hardware name: /P3880, BIOS 01.02.01 20240207 [ 8648.517941] pstate: 63400009 (nZCv daif +PAN -UAO +TCO +DIT -SSBS BTYPE=--) [ 8648.525058] pc : __kmalloc+0x1e0/0x490 [ 8648.528892] lr : 0xffffa00000000000 [ 8648.532482] sp : ffff8000d132f5f0 [ 8648.535864] x29: ffff8000d132f5f0 x28: 0000000000000000 x27: ffffa00084d50484 [ 8648.543159] x26: 00000000000001f8 x25: 0000000000aa1d70 x24: ffff0000c2aba828 [ 8648.550454] x23: ffffa00085026380 x22: ffff80009d3e0020 x21: ffff8000d132f7c8 [ 8648.557749] x20: 0000000000000038 x19: ffff8000d132f628 x18: ffff8000d132f5e4 [ 8648.565043] x17: 0000000000000000 x16: 0000000000000000 x15: 0000000000000004 [ 8648.572337] x14: 0000000000000000 x13: 0000000000000000 x12: 0000000000000000 [ 8648.579632] x11: 0000000000000000 x10: ffff8000d132f670 x9 : ffffa000806f73ec [ 8648.586926] x8 : ffff0000c2a98240 x7 : 0000000000000000 x6 : 0000000000000000 [ 8648.594221] x5 : 0000000000000000 x4 : 0000000000000000 x3 : 0000000000000000 [ 8648.601516] x2 : 0000000000000000 x1 : ffff000100084480 x0 : ffff0000c2a98200 [ 8648.608810] Call trace: [ 8648.611305] __kmalloc+0x1e0/0x490 [ 8648.614778] 0x8000604466e4a000 [ 8648.617986] Code: a9435bf5 a94463f7 910183ff f85f8e5e (d50323bf) [ 8648.624219] SMP: stopping secondary CPUs Using both -generic and -nvidia 6.8 kernels I'm seeing a hang when I load the nvidia driver. $ sudo dmidecode -t 0 # dmidecode 3.5 Getting SMBIOS data from sysfs. SMBIOS 3.6.0 present. # SMBIOS implementations newer than version 3.5.0 are not # fully supported by this version of dmidecode. 
Handle 0x0001, DMI type 0, 26 bytes BIOS Information Vendor: NVIDIA Version: 01.02.01 Release Date: 20240207 ROM Size: 64 MB Characteristics: PCI is supported PNP is supported BIOS is upgradeable BIOS shadowing is allowed Boot from CD is supported Selectable boot is supported Serial services are supported (int 14h) ACPI is supported Targeted content distribution is supported UEFI is supported Firmware Revision: 0.0 [ 382.938326] rcu: INFO: rcu_preempt detected stalls on CPUs/tasks: [ 382.946075] rcu: 53-...0: (4 ticks this GP) idle=1c2c/1/0x4000000000000000 softirq=4866/4868 fqs=14124 [ 382.955683] rcu: hardirqs softirqs csw/system [ 382.961378] rcu: number: 0 0 0 [ 382.967071] rcu: cputime: 0 0 0 ==> 30026(ms) [ 382.974189] rcu: (detected by 52, t=60034 jiffies, g=24469, q=1199 ncpus=72) [ 392.982095] rcu: rcu_preempt kthread starved for 9994 jiffies! g24469 f0x0 RCU_GP_DOING_FQS(6) ->state=0x0 ->cpu=31 [ 392.992769] rcu: Unless rcu_preempt kthread gets sufficient CPU time, OOM is now expected behavior After seeing this, I Enabled kdump and set kernel.panic_on_rcu_stall = 1 KDUMP INFO WARNING: cpu 54: cannot find NT_PRSTATUS note       KERNEL: /usr/lib/debug/boot/vmlinux-6.8.0-1004-nvidia-64k [TAINTED]     DUMPFILE: /var/crash/202404172139/dump.202404172139 [PARTIAL DUMP]         CPUS: 72         DATE: Wed Apr 17 21:39:13 UTC 2024       UPTIME: 00:06:10 LOAD AVERAGE: 0.68, 0.63, 0.28        TASKS: 854     NODENAME: hinyari      RELEASE: 6.8.0-1005-nvidia-64k      VERSION: #5-Ubuntu SMP PREEMPT_DYNAMIC Wed Apr 17 11:26:46 UTC 2024      MACHINE: aarch64 (unknown Mhz)       MEMORY: 479.7 GB        PANIC: "Kernel panic - not syncing: RCU Stall"          PID: 0      COMMAND: "swapper/21"         TASK: ffff000082026880 (1 of 72) [THREAD_INFO: ffff000082026880]          CPU: 21        STATE: TASK_RUNNING (PANIC) [ 300.313144] nvidia: loading out-of-tree module taints kernel. 
[ 300.313153] nvidia: module verification failed: signature and/or required key missing - tainting kernel [ 300.316694] nvidia-nvlink: Nvlink Core is being initialized, major device number 506 [ 300.316699] [ 360.323454] rcu: INFO: rcu_preempt detected stalls on CPUs/tasks: [ 360.331206] rcu: 54-...0: (24 ticks this GP) idle=742c/1/0x4000000000000000 softirq=4931/4933 fqs=13148 [ 360.340903] rcu: hardirqs softirqs csw/system [ 360.346597] rcu: number: 0 0 0 [ 360.352291] rcu: cputime: 0 0 0 ==> 30031(ms) [ 360.359408] rcu: (detected by 21, t=60038 jiffies, g=25009, q=1123 ncpus=72) [ 360.366704] Sending NMI from CPU 21 to CPUs 54: [ 370.367310] rcu: rcu_preempt kthread starved for 9993 jiffies! g25009 f0x0 RCU_GP_DOING_FQS(6) ->state=0x0 ->cpu=31 [ 370.377983] rcu: Unless rcu_preempt kthread gets sufficient CPU time, OOM is now expected behavior. [ 370.387322] rcu: RCU grace-period kthread stack dump: [ 370.392482] task:rcu_preempt state:I stack:0 pid:17 tgid:17 ppid:2 flags:0x00000008 [ 370.392488] Call trace: [ 370.392489] __switch_to+0xd0/0x118 [ 370.392499] __schedule+0x2a8/0x7b0 [ 370.392501] schedule+0x40/0x168 [ 370.392502] schedule_timeout+0xac/0x1e0 [ 370.392505] rcu_gp_fqs_loop+0x128/0x508 [ 370.392512] rcu_gp_kthread+0x150/0x188 [ 370.392514] kthread+0xf8/0x110 [ 370.392519] ret_from_fork+0x10/0x20 [ 370.392524] rcu: Stack dump where RCU GP kthread last ran: [ 370.398128] Sending NMI from CPU 21 to CPUs 31: [ 370.398131] NMI backtrace for cpu 31 [ 370.398136] CPU: 31 PID: 0 Comm: swapper/31 Kdump: loaded Tainted: G OE 6.8.0-1005-nvidia-64k #5-Ubuntu [ 370.398139] Hardware name: /P3880, BIOS 01.02.01 20240207 [ 370.398140] pstate: 63400009 (nZCv daif +PAN -UAO +TCO +DIT -SSBS BTYPE=--) [ 370.398142] pc : cpuidle_enter_state+0xd8/0x790 [ 370.398150] lr : cpuidle_enter_state+0xcc/0x790 [ 370.398153] sp : ffff800081eefd70 [ 370.398154] x29: ffff800081eefd70 x28: 0000000000000000 x27: 0000000000000000 [ 370.398157] x26: 0000000000000000 x25: 000000563d67e4e0 
x24: 0000000000000000 [ 370.398160] x23: ffffa0a1445699f8 x22: 0000000000000000 x21: 000000563d72ece0 [ 370.398162] x20: ffffa0a144569a10 x19: ffff00008fa4a800 x18: ffff800081f00030 [ 370.398165] x17: 0000000000000000 x16: 0000000000000000 x15: 0000ac8c73b08db0 [ 370.398168] x14: 0000000000000000 x13: 0000000000000000 x12: 0000000000000000 [ 370.398170] x11: 0000000000000000 x10: 2da0fbe3d5e8c649 x9 : ffffa0a1424fd244 [ 370.398173] x8 : ffff0000820559b8 x7 : 0000000000000000 x6 : 0000000000000000 [ 370.398175] x5 : 0000000000000000 x4 : 0000000000000000 x3 : 0000000000000000 [ 370.398178] x2 : 0000000000000000 x1 : 0000000000000000 x0 : 0000000000000000 [ 370.398181] Call trace: [ 370.398183] cpuidle_enter_state+0xd8/0x790 [ 370.398185] cpuidle_enter+0x44/0x78 [ 370.398195] cpuidle_idle_call+0x15c/0x210 [ 370.398202] do_idle+0xb0/0x130 [ 370.398204] cpu_startup_entry+0x40/0x50 [ 370.398206] secondary_start_kernel+0xec/0x130 [ 370.398211] __secondary_switched+0xc0/0xc8 [ 370.399132] Kernel panic - not syncing: RCU Stall [ 370.403938] CPU: 21 PID: 0 Comm: swapper/21 Kdump: loaded Tainted: G OE 6.8.0-1005-nvidia-64k #5-Ubuntu [ 370.414876] Hardware name: /P3880, BIOS 01.02.01 20240207 [ 370.421192] Call trace: [ 370.423686] dump_backtrace+0xa4/0x150 [ 370.427514] show_stack+0x24/0x50 [ 370.430896] dump_stack_lvl+0x78/0xf8 [ 370.434640] dump_stack+0x1c/0x38 [ 370.438023] panic+0x3a4/0x440 [ 370.441141] print_other_cpu_stall+0x578/0x610 [ 370.445681] check_cpu_stall+0x240/0x300 [ 370.449686] rcu_pending+0x44/0x220 [ 370.453248] rcu_sched_clock_irq+0x7c/0x2c8 [ 370.457519] update_process_times+0x7c/0xf8 [ 370.461794] tick_sched_handle+0x3c/0x98 [ 370.465803] tick_nohz_highres_handler+0x5c/0xe8 [ 370.470520] __hrtimer_run_queues+0x164/0x398 [ 370.474969] hrtimer_interrupt+0xf4/0x278 [ 370.479063] arch_timer_handler_phys+0x38/0x80 [ 370.483607] handle_percpu_devid_irq+0x94/0x2b8 [ 370.488238] generic_handle_domain_irq+0x38/0x70 [ 370.492954] 
__gic_handle_irq_from_irqson.isra.0+0x180/0x310 [ 370.498743] gic_handle_irq+0x2c/0xa0 [ 370.502481] call_on_irq_stack+0x3c/0x50 [ 370.506486] do_interrupt_handler+0xb0/0xc8 [ 370.510759] el1_interrupt+0x48/0xf0 [ 370.514409] el1h_64_irq_handler+0x1c/0x40 [ 370.518592] el1h_64_irq+0x7c/0x80 [ 370.522063] cpuidle_enter_state+0xd8/0x790 [ 370.526336] cpuidle_enter+0x44/0x78 [ 370.529986] cpuidle_idle_call+0x15c/0x210 [ 370.534169] do_idle+0xb0/0x130 [ 370.537375] cpu_startup_entry+0x44/0x50 [ 370.541380] secondary_start_kernel+0xec/0x130 [ 370.545919] __secondary_switched+0xc0/0xc8 [ 370.550197] SMP: stopping secondary CPUs [ 371.601076] SMP: failed to stop secondary CPUs 0-20,22-71 [ 371.607097] Starting crashdump kernel... [ 371.611103] ------------[ cut here ]------------ [ 371.615820] Some CPUs may be stale, kdump will be unreliable. [ 371.621695] WARNING: CPU: 21 PID: 0 at arch/arm64/kernel/machine_kexec.c:174 machine_kexec+0x48/0x1f0 [ 371.631124] Modules linked in: nvidia(OE+) ecc qrtr cfg80211 binfmt_misc dax_hmem cxl_acpi cxl_core nvidia_cspmu acpi_ipmi ast cdc_ether cdc_subset arm_smmuv3_pmu arm_cspmu_module coresight_trbe usbnet arm_spe_pmu ipmi_ssif i2c_algo_bit uio_pdrv_genirq uio spi_nor ipmi_devintf ipmi_msghandler nls_iso8859_1 stm_p_basic coresight_stm stm_core cppc_cpufreq coresight_etm4x coresight_tmc coresight_funnel acpi_power_meter coresight dm_multipath efi_pstore nfnetlink dmi_sysfs ip_tables x_tables autofs4 btrfs blake2b_generic raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor xor_neon raid6_pq libcrc32c raid1 raid0 mlx5_ib ib_uverbs macsec ib_core mlx5_dpll crct10dif_ce mlx5_core polyval_ce polyval_generic ghash_ce sm4_ce_gcm sm4_ce_ccm sm4_ce sm4_ce_cipher sm4 sm3_ce mlxfw sm3 nvme psample sha3_ce i2c_smbus sha2_ce nvme_core tls sha256_arm64 xhci_pci sha1_ce xhci_pci_renesas pci_hyperv_intf nvme_auth i2c_tegra aes_neon_bs aes_neon_blk aes_ce_blk aes_ce_cipher [ 371.719810] CPU: 21 PID: 0 Comm: swapper/21 Kdump: 
loaded Tainted: G OE 6.8.0-1005-nvidia-64k #5-Ubuntu [ 371.730748] Hardware name: /P3880, BIOS 01.02.01 20240207 [ 371.737064] pstate: 634000c9 (nZCv daIF +PAN -UAO +TCO +DIT -SSBS BTYPE=--) [ 371.744180] pc : machine_kexec+0x48/0x1f0 [ 371.748275] lr : machine_kexec+0x48/0x1f0 [ 371.752369] sp : ffff8000802afa10 [ 371.755751] x29: ffff8000802afa10 x28: 0000000000000463 x27: 000000000000003c [ 371.763047] x26: 00000000000000c0 x25: 0000000000000280 x24: ffffa0a144268cb4 [ 371.770341] x23: ffffa0a14439f540 x22: ffffa0a1447cf4c0 x21: ffffa0a14481a000 [ 371.777636] x20: ffff0000d987e000 x19: ffff0000d987e000 x18: ffff800080ba0088 [ 371.784930] x17: 0000000000000000 x16: 0000000000000000 x15: 0000000000000463 [ 371.792225] x14: 0000000000000000 x13: 2e656c6261696c65 x12: 726e75206562206c [ 371.799519] x11: 6c697720706d7564 x10: 0000000000000000 x9 : 0000000000000000 [ 371.806814] x8 : 0000000000000000 x7 : 0000000000000000 x6 : 0000000000000000 [ 371.814108] x5 : 0000000000000000 x4 : 0000000000000000 x3 : 0000000000000000 [ 371.821402] x2 : 0000000000000000 x1 : 0000000000000000 x0 : 0000000000000000 [ 371.828696] Call trace: [ 371.831189] machine_kexec+0x48/0x1f0 [ 371.834928] __crash_kexec+0x94/0x128 [ 371.838668] panic+0x380/0x440 [ 371.841784] print_other_cpu_stall+0x578/0x610 [ 371.846325] check_cpu_stall+0x240/0x300 [ 371.850331] rcu_pending+0x44/0x220 [ 371.853892] rcu_sched_clock_irq+0x7c/0x2c8 [ 371.858163] update_process_times+0x7c/0xf8 [ 371.862434] tick_sched_handle+0x3c/0x98 [ 371.866440] tick_nohz_highres_handler+0x5c/0xe8 [ 371.871156] __hrtimer_run_queues+0x164/0x398 [ 371.875605] hrtimer_interrupt+0xf4/0x278 [ 371.879700] arch_timer_handler_phys+0x38/0x80 [ 371.884240] handle_percpu_devid_irq+0x94/0x2b8 [ 371.888869] generic_handle_domain_irq+0x38/0x70 [ 371.893585] __gic_handle_irq_from_irqson.isra.0+0x180/0x310 [ 371.899368] gic_handle_irq+0x2c/0xa0 [ 371.903105] call_on_irq_stack+0x3c/0x50 [ 371.907110] do_interrupt_handler+0xb0/0xc8 [ 371.911382] 
el1_interrupt+0x48/0xf0 [ 371.915032] el1h_64_irq_handler+0x1c/0x40 [ 371.919215] el1h_64_irq+0x7c/0x80 [ 371.922686] cpuidle_enter_state+0xd8/0x790 [ 371.926958] cpuidle_enter+0x44/0x78 [ 371.930609] cpuidle_idle_call+0x15c/0x210 [ 371.934793] do_idle+0xb0/0x130 [ 371.937998] cpu_startup_entry+0x44/0x50 [ 371.942003] secondary_start_kernel+0xec/0x130 [ 371.946542] __secondary_switched+0xc0/0xc8 [ 371.950815] ---[ end trace 0000000000000000 ]--- In an attempt to get more debug info, I tried the open driver from GitHub, using https://github.com/NVIDIA/open-gpu-kernel-modules. Version 550.76 - loads successfully. Version 550.67 - loads successfully. Version 550.54.15 - crashes; this is the same version as the 550 package that hangs. Below is the crash info. Interestingly, when I changed the optimization level in utils.mk from -O2 to -O0 in an attempt to capture more debug info, the crash went away. It also doesn't happen with -O1. CRASH INFO [ 8648.399518] nvidia-nvlink: Nvlink Core is being initialized, major device number 506 [ 8648.399560] [ 8648.399718] Internal error: Oops - FPAC: 0000000072000000 [#1] SMP [ 8648.407556] Modules linked in: nvidia(OE+) ecdh_generic ecc qrtr cfg80211 binfmt_misc dax_hmem cxl_acpi cxl_core nvidia_cspmu arm_smmuv3_pmu arm_cspmu_module coresight_trbe arm_spe_pmu acpi_ipmi ast cdc_ether cdc_subset ipmi_ssif usbnet i2c_algo_bit uio_pdrv_genirq uio spi_nor ipmi_devintf ipmi_msghandler nls_iso8859_1 stm_p_basic coresight_stm stm_core cppc_cpufreq coresight_etm4x coresight_tmc coresight_funnel acpi_power_meter coresight dm_multipath efi_pstore nfnetlink dmi_sysfs ip_tables x_tables autofs4 btrfs blake2b_generic raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor xor_neon raid6_pq libcrc32c raid1 raid0 mlx5_ib ib_uverbs macsec ib_core mlx5_dpll mlx5_core crct10dif_ce polyval_ce polyval_generic ghash_ce sm4_ce_gcm sm4_ce_ccm sm4_ce sm4_ce_cipher sm4 sm3_ce sm3 mlxfw i2c_smbus nvme psample sha3_ce sha2_ce nvme_core tls 
sha256_arm64 xhci_pci sha1_ce xhci_pci_renesas pci_hyperv_intf nvme_auth i2c_tegra aes_neon_bs aes_neon_blk aes_ce_blk aes_ce_cipher [last unloaded: nvidia(OE)] [ 8648.407608] [ 8648.501397] CPU: 5 PID: 48130 Comm: insmod Kdump: loaded Tainted: G OE 6.8.0-1004-nvidia-64k #4 [ 8648.511625] Hardware name: /P3880, BIOS 01.02.01 20240207 [ 8648.517941] pstate: 63400009 (nZCv daif +PAN -UAO +TCO +DIT -SSBS BTYPE=--) [ 8648.525058] pc : __kmalloc+0x1e0/0x490 [ 8648.528892] lr : 0xffffa00000000000 [ 8648.532482] sp : ffff8000d132f5f0 [ 8648.535864] x29: ffff8000d132f5f0 x28: 0000000000000000 x27: ffffa00084d50484 [ 8648.543159] x26: 00000000000001f8 x25: 0000000000aa1d70 x24: ffff0000c2aba828 [ 8648.550454] x23: ffffa00085026380 x22: ffff80009d3e0020 x21: ffff8000d132f7c8 [ 8648.557749] x20: 0000000000000038 x19: ffff8000d132f628 x18: ffff8000d132f5e4 [ 8648.565043] x17: 0000000000000000 x16: 0000000000000000 x15: 0000000000000004 [ 8648.572337] x14: 0000000000000000 x13: 0000000000000000 x12: 0000000000000000 [ 8648.579632] x11: 0000000000000000 x10: ffff8000d132f670 x9 : ffffa000806f73ec [ 8648.586926] x8 : ffff0000c2a98240 x7 : 0000000000000000 x6 : 0000000000000000 [ 8648.594221] x5 : 0000000000000000 x4 : 0000000000000000 x3 : 0000000000000000 [ 8648.601516] x2 : 0000000000000000 x1 : ffff000100084480 x0 : ffff0000c2a98200 [ 8648.608810] Call trace: [ 8648.611305] __kmalloc+0x1e0/0x490 [ 8648.614778] 0x8000604466e4a000 [ 8648.617986] Code: a9435bf5 a94463f7 910183ff f85f8e5e (d50323bf) [ 8648.624219] SMP: stopping secondary CPUs
2024-04-18 16:19:52 Ian May description Using both -generic and -nvidia 6.8 kernels I'm seeing a hang when I load the nvidia driver. $ sudo dmidecode -t 0 # dmidecode 3.5 Getting SMBIOS data from sysfs. SMBIOS 3.6.0 present. # SMBIOS implementations newer than version 3.5.0 are not # fully supported by this version of dmidecode. Handle 0x0001, DMI type 0, 26 bytes BIOS Information Vendor: NVIDIA Version: 01.02.01 Release Date: 20240207 ROM Size: 64 MB Characteristics: PCI is supported PNP is supported BIOS is upgradeable BIOS shadowing is allowed Boot from CD is supported Selectable boot is supported Serial services are supported (int 14h) ACPI is supported Targeted content distribution is supported UEFI is supported Firmware Revision: 0.0 [ 382.938326] rcu: INFO: rcu_preempt detected stalls on CPUs/tasks: [ 382.946075] rcu: 53-...0: (4 ticks this GP) idle=1c2c/1/0x4000000000000000 softirq=4866/4868 fqs=14124 [ 382.955683] rcu: hardirqs softirqs csw/system [ 382.961378] rcu: number: 0 0 0 [ 382.967071] rcu: cputime: 0 0 0 ==> 30026(ms) [ 382.974189] rcu: (detected by 52, t=60034 jiffies, g=24469, q=1199 ncpus=72) [ 392.982095] rcu: rcu_preempt kthread starved for 9994 jiffies! 
g24469 f0x0 RCU_GP_DOING_FQS(6) ->state=0x0 ->cpu=31 [ 392.992769] rcu: Unless rcu_preempt kthread gets sufficient CPU time, OOM is now expected behavior After seeing this, I Enabled kdump and set kernel.panic_on_rcu_stall = 1 KDUMP INFO WARNING: cpu 54: cannot find NT_PRSTATUS note       KERNEL: /usr/lib/debug/boot/vmlinux-6.8.0-1004-nvidia-64k [TAINTED]     DUMPFILE: /var/crash/202404172139/dump.202404172139 [PARTIAL DUMP]         CPUS: 72         DATE: Wed Apr 17 21:39:13 UTC 2024       UPTIME: 00:06:10 LOAD AVERAGE: 0.68, 0.63, 0.28        TASKS: 854     NODENAME: hinyari      RELEASE: 6.8.0-1005-nvidia-64k      VERSION: #5-Ubuntu SMP PREEMPT_DYNAMIC Wed Apr 17 11:26:46 UTC 2024      MACHINE: aarch64 (unknown Mhz)       MEMORY: 479.7 GB        PANIC: "Kernel panic - not syncing: RCU Stall"          PID: 0      COMMAND: "swapper/21"         TASK: ffff000082026880 (1 of 72) [THREAD_INFO: ffff000082026880]          CPU: 21        STATE: TASK_RUNNING (PANIC) [ 300.313144] nvidia: loading out-of-tree module taints kernel. [ 300.313153] nvidia: module verification failed: signature and/or required key missing - tainting kernel [ 300.316694] nvidia-nvlink: Nvlink Core is being initialized, major device number 506 [ 300.316699] [ 360.323454] rcu: INFO: rcu_preempt detected stalls on CPUs/tasks: [ 360.331206] rcu: 54-...0: (24 ticks this GP) idle=742c/1/0x4000000000000000 softirq=4931/4933 fqs=13148 [ 360.340903] rcu: hardirqs softirqs csw/system [ 360.346597] rcu: number: 0 0 0 [ 360.352291] rcu: cputime: 0 0 0 ==> 30031(ms) [ 360.359408] rcu: (detected by 21, t=60038 jiffies, g=25009, q=1123 ncpus=72) [ 360.366704] Sending NMI from CPU 21 to CPUs 54: [ 370.367310] rcu: rcu_preempt kthread starved for 9993 jiffies! g25009 f0x0 RCU_GP_DOING_FQS(6) ->state=0x0 ->cpu=31 [ 370.377983] rcu: Unless rcu_preempt kthread gets sufficient CPU time, OOM is now expected behavior. 
[ 370.387322] rcu: RCU grace-period kthread stack dump: [ 370.392482] task:rcu_preempt state:I stack:0 pid:17 tgid:17 ppid:2 flags:0x00000008 [ 370.392488] Call trace: [ 370.392489] __switch_to+0xd0/0x118 [ 370.392499] __schedule+0x2a8/0x7b0 [ 370.392501] schedule+0x40/0x168 [ 370.392502] schedule_timeout+0xac/0x1e0 [ 370.392505] rcu_gp_fqs_loop+0x128/0x508 [ 370.392512] rcu_gp_kthread+0x150/0x188 [ 370.392514] kthread+0xf8/0x110 [ 370.392519] ret_from_fork+0x10/0x20 [ 370.392524] rcu: Stack dump where RCU GP kthread last ran: [ 370.398128] Sending NMI from CPU 21 to CPUs 31: [ 370.398131] NMI backtrace for cpu 31 [ 370.398136] CPU: 31 PID: 0 Comm: swapper/31 Kdump: loaded Tainted: G OE 6.8.0-1005-nvidia-64k #5-Ubuntu [ 370.398139] Hardware name: /P3880, BIOS 01.02.01 20240207 [ 370.398140] pstate: 63400009 (nZCv daif +PAN -UAO +TCO +DIT -SSBS BTYPE=--) [ 370.398142] pc : cpuidle_enter_state+0xd8/0x790 [ 370.398150] lr : cpuidle_enter_state+0xcc/0x790 [ 370.398153] sp : ffff800081eefd70 [ 370.398154] x29: ffff800081eefd70 x28: 0000000000000000 x27: 0000000000000000 [ 370.398157] x26: 0000000000000000 x25: 000000563d67e4e0 x24: 0000000000000000 [ 370.398160] x23: ffffa0a1445699f8 x22: 0000000000000000 x21: 000000563d72ece0 [ 370.398162] x20: ffffa0a144569a10 x19: ffff00008fa4a800 x18: ffff800081f00030 [ 370.398165] x17: 0000000000000000 x16: 0000000000000000 x15: 0000ac8c73b08db0 [ 370.398168] x14: 0000000000000000 x13: 0000000000000000 x12: 0000000000000000 [ 370.398170] x11: 0000000000000000 x10: 2da0fbe3d5e8c649 x9 : ffffa0a1424fd244 [ 370.398173] x8 : ffff0000820559b8 x7 : 0000000000000000 x6 : 0000000000000000 [ 370.398175] x5 : 0000000000000000 x4 : 0000000000000000 x3 : 0000000000000000 [ 370.398178] x2 : 0000000000000000 x1 : 0000000000000000 x0 : 0000000000000000 [ 370.398181] Call trace: [ 370.398183] cpuidle_enter_state+0xd8/0x790 [ 370.398185] cpuidle_enter+0x44/0x78 [ 370.398195] cpuidle_idle_call+0x15c/0x210 [ 370.398202] do_idle+0xb0/0x130 [ 
370.398204] cpu_startup_entry+0x40/0x50 [ 370.398206] secondary_start_kernel+0xec/0x130 [ 370.398211] __secondary_switched+0xc0/0xc8 [ 370.399132] Kernel panic - not syncing: RCU Stall [ 370.403938] CPU: 21 PID: 0 Comm: swapper/21 Kdump: loaded Tainted: G OE 6.8.0-1005-nvidia-64k #5-Ubuntu [ 370.414876] Hardware name: /P3880, BIOS 01.02.01 20240207 [ 370.421192] Call trace: [ 370.423686] dump_backtrace+0xa4/0x150 [ 370.427514] show_stack+0x24/0x50 [ 370.430896] dump_stack_lvl+0x78/0xf8 [ 370.434640] dump_stack+0x1c/0x38 [ 370.438023] panic+0x3a4/0x440 [ 370.441141] print_other_cpu_stall+0x578/0x610 [ 370.445681] check_cpu_stall+0x240/0x300 [ 370.449686] rcu_pending+0x44/0x220 [ 370.453248] rcu_sched_clock_irq+0x7c/0x2c8 [ 370.457519] update_process_times+0x7c/0xf8 [ 370.461794] tick_sched_handle+0x3c/0x98 [ 370.465803] tick_nohz_highres_handler+0x5c/0xe8 [ 370.470520] __hrtimer_run_queues+0x164/0x398 [ 370.474969] hrtimer_interrupt+0xf4/0x278 [ 370.479063] arch_timer_handler_phys+0x38/0x80 [ 370.483607] handle_percpu_devid_irq+0x94/0x2b8 [ 370.488238] generic_handle_domain_irq+0x38/0x70 [ 370.492954] __gic_handle_irq_from_irqson.isra.0+0x180/0x310 [ 370.498743] gic_handle_irq+0x2c/0xa0 [ 370.502481] call_on_irq_stack+0x3c/0x50 [ 370.506486] do_interrupt_handler+0xb0/0xc8 [ 370.510759] el1_interrupt+0x48/0xf0 [ 370.514409] el1h_64_irq_handler+0x1c/0x40 [ 370.518592] el1h_64_irq+0x7c/0x80 [ 370.522063] cpuidle_enter_state+0xd8/0x790 [ 370.526336] cpuidle_enter+0x44/0x78 [ 370.529986] cpuidle_idle_call+0x15c/0x210 [ 370.534169] do_idle+0xb0/0x130 [ 370.537375] cpu_startup_entry+0x44/0x50 [ 370.541380] secondary_start_kernel+0xec/0x130 [ 370.545919] __secondary_switched+0xc0/0xc8 [ 370.550197] SMP: stopping secondary CPUs [ 371.601076] SMP: failed to stop secondary CPUs 0-20,22-71 [ 371.607097] Starting crashdump kernel... [ 371.611103] ------------[ cut here ]------------ [ 371.615820] Some CPUs may be stale, kdump will be unreliable. 
[ 371.621695] WARNING: CPU: 21 PID: 0 at arch/arm64/kernel/machine_kexec.c:174 machine_kexec+0x48/0x1f0 [ 371.631124] Modules linked in: nvidia(OE+) ecc qrtr cfg80211 binfmt_misc dax_hmem cxl_acpi cxl_core nvidia_cspmu acpi_ipmi ast cdc_ether cdc_subset arm_smmuv3_pmu arm_cspmu_module coresight_trbe usbnet arm_spe_pmu ipmi_ssif i2c_algo_bit uio_pdrv_genirq uio spi_nor ipmi_devintf ipmi_msghandler nls_iso8859_1 stm_p_basic coresight_stm stm_core cppc_cpufreq coresight_etm4x coresight_tmc coresight_funnel acpi_power_meter coresight dm_multipath efi_pstore nfnetlink dmi_sysfs ip_tables x_tables autofs4 btrfs blake2b_generic raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor xor_neon raid6_pq libcrc32c raid1 raid0 mlx5_ib ib_uverbs macsec ib_core mlx5_dpll crct10dif_ce mlx5_core polyval_ce polyval_generic ghash_ce sm4_ce_gcm sm4_ce_ccm sm4_ce sm4_ce_cipher sm4 sm3_ce mlxfw sm3 nvme psample sha3_ce i2c_smbus sha2_ce nvme_core tls sha256_arm64 xhci_pci sha1_ce xhci_pci_renesas pci_hyperv_intf nvme_auth i2c_tegra aes_neon_bs aes_neon_blk aes_ce_blk aes_ce_cipher [ 371.719810] CPU: 21 PID: 0 Comm: swapper/21 Kdump: loaded Tainted: G OE 6.8.0-1005-nvidia-64k #5-Ubuntu [ 371.730748] Hardware name: /P3880, BIOS 01.02.01 20240207 [ 371.737064] pstate: 634000c9 (nZCv daIF +PAN -UAO +TCO +DIT -SSBS BTYPE=--) [ 371.744180] pc : machine_kexec+0x48/0x1f0 [ 371.748275] lr : machine_kexec+0x48/0x1f0 [ 371.752369] sp : ffff8000802afa10 [ 371.755751] x29: ffff8000802afa10 x28: 0000000000000463 x27: 000000000000003c [ 371.763047] x26: 00000000000000c0 x25: 0000000000000280 x24: ffffa0a144268cb4 [ 371.770341] x23: ffffa0a14439f540 x22: ffffa0a1447cf4c0 x21: ffffa0a14481a000 [ 371.777636] x20: ffff0000d987e000 x19: ffff0000d987e000 x18: ffff800080ba0088 [ 371.784930] x17: 0000000000000000 x16: 0000000000000000 x15: 0000000000000463 [ 371.792225] x14: 0000000000000000 x13: 2e656c6261696c65 x12: 726e75206562206c [ 371.799519] x11: 6c697720706d7564 x10: 
0000000000000000 x9 : 0000000000000000 [ 371.806814] x8 : 0000000000000000 x7 : 0000000000000000 x6 : 0000000000000000 [ 371.814108] x5 : 0000000000000000 x4 : 0000000000000000 x3 : 0000000000000000 [ 371.821402] x2 : 0000000000000000 x1 : 0000000000000000 x0 : 0000000000000000 [ 371.828696] Call trace: [ 371.831189] machine_kexec+0x48/0x1f0 [ 371.834928] __crash_kexec+0x94/0x128 [ 371.838668] panic+0x380/0x440 [ 371.841784] print_other_cpu_stall+0x578/0x610 [ 371.846325] check_cpu_stall+0x240/0x300 [ 371.850331] rcu_pending+0x44/0x220 [ 371.853892] rcu_sched_clock_irq+0x7c/0x2c8 [ 371.858163] update_process_times+0x7c/0xf8 [ 371.862434] tick_sched_handle+0x3c/0x98 [ 371.866440] tick_nohz_highres_handler+0x5c/0xe8 [ 371.871156] __hrtimer_run_queues+0x164/0x398 [ 371.875605] hrtimer_interrupt+0xf4/0x278 [ 371.879700] arch_timer_handler_phys+0x38/0x80 [ 371.884240] handle_percpu_devid_irq+0x94/0x2b8 [ 371.888869] generic_handle_domain_irq+0x38/0x70 [ 371.893585] __gic_handle_irq_from_irqson.isra.0+0x180/0x310 [ 371.899368] gic_handle_irq+0x2c/0xa0 [ 371.903105] call_on_irq_stack+0x3c/0x50 [ 371.907110] do_interrupt_handler+0xb0/0xc8 [ 371.911382] el1_interrupt+0x48/0xf0 [ 371.915032] el1h_64_irq_handler+0x1c/0x40 [ 371.919215] el1h_64_irq+0x7c/0x80 [ 371.922686] cpuidle_enter_state+0xd8/0x790 [ 371.926958] cpuidle_enter+0x44/0x78 [ 371.930609] cpuidle_idle_call+0x15c/0x210 [ 371.934793] do_idle+0xb0/0x130 [ 371.937998] cpu_startup_entry+0x44/0x50 [ 371.942003] secondary_start_kernel+0xec/0x130 [ 371.946542] __secondary_switched+0xc0/0xc8 [ 371.950815] ---[ end trace 0000000000000000 ]--- In an attempt to get more debug info, I tried the open driver on GitHub, using https://github.com/NVIDIA/open-gpu-kernel-modules. Version 550.76 - loads successfully. Version 550.67 - loads successfully. Version 550.54.15 - crashes; this is the same version as the 550 package that hangs. Below is the crash info.
What is interesting is that, in an attempt to capture more debug info, I changed the optimization level in utils.mk from -O2 to -O0 and the crash went away. It also doesn't happen with -O1. CRASH INFO [ 8648.399518] nvidia-nvlink: Nvlink Core is being initialized, major device number 506 [ 8648.399560] [ 8648.399718] Internal error: Oops - FPAC: 0000000072000000 [#1] SMP [ 8648.407556] Modules linked in: nvidia(OE+) ecdh_generic ecc qrtr cfg80211 binfmt_misc dax_hmem cxl_acpi cxl_core nvidia_cspmu arm_smmuv3_pmu arm_cspmu_module coresight_trbe arm_spe_pmu acpi_ipmi ast cdc_ether cdc_subset ipmi_ssif usbnet i2c_algo_bit uio_pdrv_genirq uio spi_nor ipmi_devintf ipmi_msghandler nls_iso8859_1 stm_p_basic coresight_stm stm_core cppc_cpufreq coresight_etm4x coresight_tmc coresight_funnel acpi_power_meter coresight dm_multipath efi_pstore nfnetlink dmi_sysfs ip_tables x_tables autofs4 btrfs blake2b_generic raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor xor_neon raid6_pq libcrc32c raid1 raid0 mlx5_ib ib_uverbs macsec ib_core mlx5_dpll mlx5_core crct10dif_ce polyval_ce polyval_generic ghash_ce sm4_ce_gcm sm4_ce_ccm sm4_ce sm4_ce_cipher sm4 sm3_ce sm3 mlxfw i2c_smbus nvme psample sha3_ce sha2_ce nvme_core tls sha256_arm64 xhci_pci sha1_ce xhci_pci_renesas pci_hyperv_intf nvme_auth i2c_tegra aes_neon_bs aes_neon_blk aes_ce_blk aes_ce_cipher [last unloaded: nvidia(OE)] [ 8648.407608] [ 8648.501397] CPU: 5 PID: 48130 Comm: insmod Kdump: loaded Tainted: G OE 6.8.0-1004-nvidia-64k #4 [ 8648.511625] Hardware name: /P3880, BIOS 01.02.01 20240207 [ 8648.517941] pstate: 63400009 (nZCv daif +PAN -UAO +TCO +DIT -SSBS BTYPE=--) [ 8648.525058] pc : __kmalloc+0x1e0/0x490 [ 8648.528892] lr : 0xffffa00000000000 [ 8648.532482] sp : ffff8000d132f5f0 [ 8648.535864] x29: ffff8000d132f5f0 x28: 0000000000000000 x27: ffffa00084d50484 [ 8648.543159] x26: 00000000000001f8 x25: 0000000000aa1d70 x24: ffff0000c2aba828 [ 8648.550454] x23: ffffa00085026380 x22: ffff80009d3e0020 x21: 
ffff8000d132f7c8 [ 8648.557749] x20: 0000000000000038 x19: ffff8000d132f628 x18: ffff8000d132f5e4 [ 8648.565043] x17: 0000000000000000 x16: 0000000000000000 x15: 0000000000000004 [ 8648.572337] x14: 0000000000000000 x13: 0000000000000000 x12: 0000000000000000 [ 8648.579632] x11: 0000000000000000 x10: ffff8000d132f670 x9 : ffffa000806f73ec [ 8648.586926] x8 : ffff0000c2a98240 x7 : 0000000000000000 x6 : 0000000000000000 [ 8648.594221] x5 : 0000000000000000 x4 : 0000000000000000 x3 : 0000000000000000 [ 8648.601516] x2 : 0000000000000000 x1 : ffff000100084480 x0 : ffff0000c2a98200 [ 8648.608810] Call trace: [ 8648.611305] __kmalloc+0x1e0/0x490 [ 8648.614778] 0x8000604466e4a000 [ 8648.617986] Code: a9435bf5 a94463f7 910183ff f85f8e5e (d50323bf) [ 8648.624219] SMP: stopping secondary CPUs Using both -generic and -nvidia 6.8 kernels I'm seeing a hang when I load the nvidia driver. $ sudo dmidecode -t 0 # dmidecode 3.5 Getting SMBIOS data from sysfs. SMBIOS 3.6.0 present. # SMBIOS implementations newer than version 3.5.0 are not # fully supported by this version of dmidecode. 
Handle 0x0001, DMI type 0, 26 bytes BIOS Information  Vendor: NVIDIA  Version: 01.02.01  Release Date: 20240207  ROM Size: 64 MB  Characteristics:   PCI is supported   PNP is supported   BIOS is upgradeable   BIOS shadowing is allowed   Boot from CD is supported   Selectable boot is supported   Serial services are supported (int 14h)   ACPI is supported   Targeted content distribution is supported   UEFI is supported  Firmware Revision: 0.0 CONSOLE RCU STALL MESSAGE: [ 382.938326] rcu: INFO: rcu_preempt detected stalls on CPUs/tasks: [ 382.946075] rcu: 53-...0: (4 ticks this GP) idle=1c2c/1/0x4000000000000000 softirq=4866/4868 fqs=14124 [ 382.955683] rcu: hardirqs softirqs csw/system [ 382.961378] rcu: number: 0 0 0 [ 382.967071] rcu: cputime: 0 0 0 ==> 30026(ms) [ 382.974189] rcu: (detected by 52, t=60034 jiffies, g=24469, q=1199 ncpus=72) [ 392.982095] rcu: rcu_preempt kthread starved for 9994 jiffies! g24469 f0x0 RCU_GP_DOING_FQS(6) ->state=0x0 ->cpu=31 [ 392.992769] rcu: Unless rcu_preempt kthread gets sufficient CPU time, OOM is now expected behavior After seeing this, I Enabled kdump and set kernel.panic_on_rcu_stall = 1 KDUMP INFO: WARNING: cpu 54: cannot find NT_PRSTATUS note       KERNEL: /usr/lib/debug/boot/vmlinux-6.8.0-1004-nvidia-64k [TAINTED]     DUMPFILE: /var/crash/202404172139/dump.202404172139 [PARTIAL DUMP]         CPUS: 72         DATE: Wed Apr 17 21:39:13 UTC 2024       UPTIME: 00:06:10 LOAD AVERAGE: 0.68, 0.63, 0.28        TASKS: 854     NODENAME: hinyari      RELEASE: 6.8.0-1005-nvidia-64k      VERSION: #5-Ubuntu SMP PREEMPT_DYNAMIC Wed Apr 17 11:26:46 UTC 2024      MACHINE: aarch64 (unknown Mhz)       MEMORY: 479.7 GB        PANIC: "Kernel panic - not syncing: RCU Stall"          PID: 0      COMMAND: "swapper/21"         TASK: ffff000082026880 (1 of 72) [THREAD_INFO: ffff000082026880]          CPU: 21        STATE: TASK_RUNNING (PANIC) [ 300.313144] nvidia: loading out-of-tree module taints kernel. 
[ 300.313153] nvidia: module verification failed: signature and/or required key missing - tainting kernel [ 300.316694] nvidia-nvlink: Nvlink Core is being initialized, major device number 506 [ 300.316699] [ 360.323454] rcu: INFO: rcu_preempt detected stalls on CPUs/tasks: [ 360.331206] rcu: 54-...0: (24 ticks this GP) idle=742c/1/0x4000000000000000 softirq=4931/4933 fqs=13148 [ 360.340903] rcu: hardirqs softirqs csw/system [ 360.346597] rcu: number: 0 0 0 [ 360.352291] rcu: cputime: 0 0 0 ==> 30031(ms) [ 360.359408] rcu: (detected by 21, t=60038 jiffies, g=25009, q=1123 ncpus=72) [ 360.366704] Sending NMI from CPU 21 to CPUs 54: [ 370.367310] rcu: rcu_preempt kthread starved for 9993 jiffies! g25009 f0x0 RCU_GP_DOING_FQS(6) ->state=0x0 ->cpu=31 [ 370.377983] rcu: Unless rcu_preempt kthread gets sufficient CPU time, OOM is now expected behavior. [ 370.387322] rcu: RCU grace-period kthread stack dump: [ 370.392482] task:rcu_preempt state:I stack:0 pid:17 tgid:17 ppid:2 flags:0x00000008 [ 370.392488] Call trace: [ 370.392489] __switch_to+0xd0/0x118 [ 370.392499] __schedule+0x2a8/0x7b0 [ 370.392501] schedule+0x40/0x168 [ 370.392502] schedule_timeout+0xac/0x1e0 [ 370.392505] rcu_gp_fqs_loop+0x128/0x508 [ 370.392512] rcu_gp_kthread+0x150/0x188 [ 370.392514] kthread+0xf8/0x110 [ 370.392519] ret_from_fork+0x10/0x20 [ 370.392524] rcu: Stack dump where RCU GP kthread last ran: [ 370.398128] Sending NMI from CPU 21 to CPUs 31: [ 370.398131] NMI backtrace for cpu 31 [ 370.398136] CPU: 31 PID: 0 Comm: swapper/31 Kdump: loaded Tainted: G OE 6.8.0-1005-nvidia-64k #5-Ubuntu [ 370.398139] Hardware name: /P3880, BIOS 01.02.01 20240207 [ 370.398140] pstate: 63400009 (nZCv daif +PAN -UAO +TCO +DIT -SSBS BTYPE=--) [ 370.398142] pc : cpuidle_enter_state+0xd8/0x790 [ 370.398150] lr : cpuidle_enter_state+0xcc/0x790 [ 370.398153] sp : ffff800081eefd70 [ 370.398154] x29: ffff800081eefd70 x28: 0000000000000000 x27: 0000000000000000 [ 370.398157] x26: 0000000000000000 x25: 000000563d67e4e0 
x24: 0000000000000000 [ 370.398160] x23: ffffa0a1445699f8 x22: 0000000000000000 x21: 000000563d72ece0 [ 370.398162] x20: ffffa0a144569a10 x19: ffff00008fa4a800 x18: ffff800081f00030 [ 370.398165] x17: 0000000000000000 x16: 0000000000000000 x15: 0000ac8c73b08db0 [ 370.398168] x14: 0000000000000000 x13: 0000000000000000 x12: 0000000000000000 [ 370.398170] x11: 0000000000000000 x10: 2da0fbe3d5e8c649 x9 : ffffa0a1424fd244 [ 370.398173] x8 : ffff0000820559b8 x7 : 0000000000000000 x6 : 0000000000000000 [ 370.398175] x5 : 0000000000000000 x4 : 0000000000000000 x3 : 0000000000000000 [ 370.398178] x2 : 0000000000000000 x1 : 0000000000000000 x0 : 0000000000000000 [ 370.398181] Call trace: [ 370.398183] cpuidle_enter_state+0xd8/0x790 [ 370.398185] cpuidle_enter+0x44/0x78 [ 370.398195] cpuidle_idle_call+0x15c/0x210 [ 370.398202] do_idle+0xb0/0x130 [ 370.398204] cpu_startup_entry+0x40/0x50 [ 370.398206] secondary_start_kernel+0xec/0x130 [ 370.398211] __secondary_switched+0xc0/0xc8 [ 370.399132] Kernel panic - not syncing: RCU Stall [ 370.403938] CPU: 21 PID: 0 Comm: swapper/21 Kdump: loaded Tainted: G OE 6.8.0-1005-nvidia-64k #5-Ubuntu [ 370.414876] Hardware name: /P3880, BIOS 01.02.01 20240207 [ 370.421192] Call trace: [ 370.423686] dump_backtrace+0xa4/0x150 [ 370.427514] show_stack+0x24/0x50 [ 370.430896] dump_stack_lvl+0x78/0xf8 [ 370.434640] dump_stack+0x1c/0x38 [ 370.438023] panic+0x3a4/0x440 [ 370.441141] print_other_cpu_stall+0x578/0x610 [ 370.445681] check_cpu_stall+0x240/0x300 [ 370.449686] rcu_pending+0x44/0x220 [ 370.453248] rcu_sched_clock_irq+0x7c/0x2c8 [ 370.457519] update_process_times+0x7c/0xf8 [ 370.461794] tick_sched_handle+0x3c/0x98 [ 370.465803] tick_nohz_highres_handler+0x5c/0xe8 [ 370.470520] __hrtimer_run_queues+0x164/0x398 [ 370.474969] hrtimer_interrupt+0xf4/0x278 [ 370.479063] arch_timer_handler_phys+0x38/0x80 [ 370.483607] handle_percpu_devid_irq+0x94/0x2b8 [ 370.488238] generic_handle_domain_irq+0x38/0x70 [ 370.492954] 
__gic_handle_irq_from_irqson.isra.0+0x180/0x310 [ 370.498743] gic_handle_irq+0x2c/0xa0 [ 370.502481] call_on_irq_stack+0x3c/0x50 [ 370.506486] do_interrupt_handler+0xb0/0xc8 [ 370.510759] el1_interrupt+0x48/0xf0 [ 370.514409] el1h_64_irq_handler+0x1c/0x40 [ 370.518592] el1h_64_irq+0x7c/0x80 [ 370.522063] cpuidle_enter_state+0xd8/0x790 [ 370.526336] cpuidle_enter+0x44/0x78 [ 370.529986] cpuidle_idle_call+0x15c/0x210 [ 370.534169] do_idle+0xb0/0x130 [ 370.537375] cpu_startup_entry+0x44/0x50 [ 370.541380] secondary_start_kernel+0xec/0x130 [ 370.545919] __secondary_switched+0xc0/0xc8 [ 370.550197] SMP: stopping secondary CPUs [ 371.601076] SMP: failed to stop secondary CPUs 0-20,22-71 [ 371.607097] Starting crashdump kernel... [ 371.611103] ------------[ cut here ]------------ [ 371.615820] Some CPUs may be stale, kdump will be unreliable. [ 371.621695] WARNING: CPU: 21 PID: 0 at arch/arm64/kernel/machine_kexec.c:174 machine_kexec+0x48/0x1f0 [ 371.631124] Modules linked in: nvidia(OE+) ecc qrtr cfg80211 binfmt_misc dax_hmem cxl_acpi cxl_core nvidia_cspmu acpi_ipmi ast cdc_ether cdc_subset arm_smmuv3_pmu arm_cspmu_module coresight_trbe usbnet arm_spe_pmu ipmi_ssif i2c_algo_bit uio_pdrv_genirq uio spi_nor ipmi_devintf ipmi_msghandler nls_iso8859_1 stm_p_basic coresight_stm stm_core cppc_cpufreq coresight_etm4x coresight_tmc coresight_funnel acpi_power_meter coresight dm_multipath efi_pstore nfnetlink dmi_sysfs ip_tables x_tables autofs4 btrfs blake2b_generic raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor xor_neon raid6_pq libcrc32c raid1 raid0 mlx5_ib ib_uverbs macsec ib_core mlx5_dpll crct10dif_ce mlx5_core polyval_ce polyval_generic ghash_ce sm4_ce_gcm sm4_ce_ccm sm4_ce sm4_ce_cipher sm4 sm3_ce mlxfw sm3 nvme psample sha3_ce i2c_smbus sha2_ce nvme_core tls sha256_arm64 xhci_pci sha1_ce xhci_pci_renesas pci_hyperv_intf nvme_auth i2c_tegra aes_neon_bs aes_neon_blk aes_ce_blk aes_ce_cipher [ 371.719810] CPU: 21 PID: 0 Comm: swapper/21 Kdump: 
loaded Tainted: G OE 6.8.0-1005-nvidia-64k #5-Ubuntu [ 371.730748] Hardware name: /P3880, BIOS 01.02.01 20240207 [ 371.737064] pstate: 634000c9 (nZCv daIF +PAN -UAO +TCO +DIT -SSBS BTYPE=--) [ 371.744180] pc : machine_kexec+0x48/0x1f0 [ 371.748275] lr : machine_kexec+0x48/0x1f0 [ 371.752369] sp : ffff8000802afa10 [ 371.755751] x29: ffff8000802afa10 x28: 0000000000000463 x27: 000000000000003c [ 371.763047] x26: 00000000000000c0 x25: 0000000000000280 x24: ffffa0a144268cb4 [ 371.770341] x23: ffffa0a14439f540 x22: ffffa0a1447cf4c0 x21: ffffa0a14481a000 [ 371.777636] x20: ffff0000d987e000 x19: ffff0000d987e000 x18: ffff800080ba0088 [ 371.784930] x17: 0000000000000000 x16: 0000000000000000 x15: 0000000000000463 [ 371.792225] x14: 0000000000000000 x13: 2e656c6261696c65 x12: 726e75206562206c [ 371.799519] x11: 6c697720706d7564 x10: 0000000000000000 x9 : 0000000000000000 [ 371.806814] x8 : 0000000000000000 x7 : 0000000000000000 x6 : 0000000000000000 [ 371.814108] x5 : 0000000000000000 x4 : 0000000000000000 x3 : 0000000000000000 [ 371.821402] x2 : 0000000000000000 x1 : 0000000000000000 x0 : 0000000000000000 [ 371.828696] Call trace: [ 371.831189] machine_kexec+0x48/0x1f0 [ 371.834928] __crash_kexec+0x94/0x128 [ 371.838668] panic+0x380/0x440 [ 371.841784] print_other_cpu_stall+0x578/0x610 [ 371.846325] check_cpu_stall+0x240/0x300 [ 371.850331] rcu_pending+0x44/0x220 [ 371.853892] rcu_sched_clock_irq+0x7c/0x2c8 [ 371.858163] update_process_times+0x7c/0xf8 [ 371.862434] tick_sched_handle+0x3c/0x98 [ 371.866440] tick_nohz_highres_handler+0x5c/0xe8 [ 371.871156] __hrtimer_run_queues+0x164/0x398 [ 371.875605] hrtimer_interrupt+0xf4/0x278 [ 371.879700] arch_timer_handler_phys+0x38/0x80 [ 371.884240] handle_percpu_devid_irq+0x94/0x2b8 [ 371.888869] generic_handle_domain_irq+0x38/0x70 [ 371.893585] __gic_handle_irq_from_irqson.isra.0+0x180/0x310 [ 371.899368] gic_handle_irq+0x2c/0xa0 [ 371.903105] call_on_irq_stack+0x3c/0x50 [ 371.907110] do_interrupt_handler+0xb0/0xc8 [ 371.911382] 
el1_interrupt+0x48/0xf0 [ 371.915032] el1h_64_irq_handler+0x1c/0x40 [ 371.919215] el1h_64_irq+0x7c/0x80 [ 371.922686] cpuidle_enter_state+0xd8/0x790 [ 371.926958] cpuidle_enter+0x44/0x78 [ 371.930609] cpuidle_idle_call+0x15c/0x210 [ 371.934793] do_idle+0xb0/0x130 [ 371.937998] cpu_startup_entry+0x44/0x50 [ 371.942003] secondary_start_kernel+0xec/0x130 [ 371.946542] __secondary_switched+0xc0/0xc8 [ 371.950815] ---[ end trace 0000000000000000 ]--- In an attempt to get more debug info, I tried the open driver on GitHub, using https://github.com/NVIDIA/open-gpu-kernel-modules. Version 550.76 - loads successfully. Version 550.67 - loads successfully. Version 550.54.15 - crashes; this is the same version as the 550 package that hangs. Below is the crash info. What is interesting is that, in an attempt to capture more debug info, I changed the optimization level in utils.mk from -O2 to -O0 and the crash went away. It also doesn't happen with -O1. CRASH INFO [ 8648.399518] nvidia-nvlink: Nvlink Core is being initialized, major device number 506 [ 8648.399560] [ 8648.399718] Internal error: Oops - FPAC: 0000000072000000 [#1] SMP [ 8648.407556] Modules linked in: nvidia(OE+) ecdh_generic ecc qrtr cfg80211 binfmt_misc dax_hmem cxl_acpi cxl_core nvidia_cspmu arm_smmuv3_pmu arm_cspmu_module coresight_trbe arm_spe_pmu acpi_ipmi ast cdc_ether cdc_subset ipmi_ssif usbnet i2c_algo_bit uio_pdrv_genirq uio spi_nor ipmi_devintf ipmi_msghandler nls_iso8859_1 stm_p_basic coresight_stm stm_core cppc_cpufreq coresight_etm4x coresight_tmc coresight_funnel acpi_power_meter coresight dm_multipath efi_pstore nfnetlink dmi_sysfs ip_tables x_tables autofs4 btrfs blake2b_generic raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor xor_neon raid6_pq libcrc32c raid1 raid0 mlx5_ib ib_uverbs macsec ib_core mlx5_dpll mlx5_core crct10dif_ce polyval_ce polyval_generic ghash_ce sm4_ce_gcm sm4_ce_ccm sm4_ce sm4_ce_cipher sm4 sm3_ce sm3 mlxfw i2c_smbus nvme psample sha3_ce sha2_ce nvme_core tls 
sha256_arm64 xhci_pci sha1_ce xhci_pci_renesas pci_hyperv_intf nvme_auth i2c_tegra aes_neon_bs aes_neon_blk aes_ce_blk aes_ce_cipher [last unloaded: nvidia(OE)] [ 8648.407608] [ 8648.501397] CPU: 5 PID: 48130 Comm: insmod Kdump: loaded Tainted: G OE 6.8.0-1004-nvidia-64k #4 [ 8648.511625] Hardware name: /P3880, BIOS 01.02.01 20240207 [ 8648.517941] pstate: 63400009 (nZCv daif +PAN -UAO +TCO +DIT -SSBS BTYPE=--) [ 8648.525058] pc : __kmalloc+0x1e0/0x490 [ 8648.528892] lr : 0xffffa00000000000 [ 8648.532482] sp : ffff8000d132f5f0 [ 8648.535864] x29: ffff8000d132f5f0 x28: 0000000000000000 x27: ffffa00084d50484 [ 8648.543159] x26: 00000000000001f8 x25: 0000000000aa1d70 x24: ffff0000c2aba828 [ 8648.550454] x23: ffffa00085026380 x22: ffff80009d3e0020 x21: ffff8000d132f7c8 [ 8648.557749] x20: 0000000000000038 x19: ffff8000d132f628 x18: ffff8000d132f5e4 [ 8648.565043] x17: 0000000000000000 x16: 0000000000000000 x15: 0000000000000004 [ 8648.572337] x14: 0000000000000000 x13: 0000000000000000 x12: 0000000000000000 [ 8648.579632] x11: 0000000000000000 x10: ffff8000d132f670 x9 : ffffa000806f73ec [ 8648.586926] x8 : ffff0000c2a98240 x7 : 0000000000000000 x6 : 0000000000000000 [ 8648.594221] x5 : 0000000000000000 x4 : 0000000000000000 x3 : 0000000000000000 [ 8648.601516] x2 : 0000000000000000 x1 : ffff000100084480 x0 : ffff0000c2a98200 [ 8648.608810] Call trace: [ 8648.611305] __kmalloc+0x1e0/0x490 [ 8648.614778] 0x8000604466e4a000 [ 8648.617986] Code: a9435bf5 a94463f7 910183ff f85f8e5e (d50323bf) [ 8648.624219] SMP: stopping secondary CPUs
2024-04-18 16:20:46 Ian May description Using both -generic and -nvidia 6.8 kernels I'm seeing a hang when I load the nvidia driver. $ sudo dmidecode -t 0 # dmidecode 3.5 Getting SMBIOS data from sysfs. SMBIOS 3.6.0 present. # SMBIOS implementations newer than version 3.5.0 are not # fully supported by this version of dmidecode. Handle 0x0001, DMI type 0, 26 bytes BIOS Information  Vendor: NVIDIA  Version: 01.02.01  Release Date: 20240207  ROM Size: 64 MB  Characteristics:   PCI is supported   PNP is supported   BIOS is upgradeable   BIOS shadowing is allowed   Boot from CD is supported   Selectable boot is supported   Serial services are supported (int 14h)   ACPI is supported   Targeted content distribution is supported   UEFI is supported  Firmware Revision: 0.0 CONSOLE RCU STALL MESSAGE: [ 382.938326] rcu: INFO: rcu_preempt detected stalls on CPUs/tasks: [ 382.946075] rcu: 53-...0: (4 ticks this GP) idle=1c2c/1/0x4000000000000000 softirq=4866/4868 fqs=14124 [ 382.955683] rcu: hardirqs softirqs csw/system [ 382.961378] rcu: number: 0 0 0 [ 382.967071] rcu: cputime: 0 0 0 ==> 30026(ms) [ 382.974189] rcu: (detected by 52, t=60034 jiffies, g=24469, q=1199 ncpus=72) [ 392.982095] rcu: rcu_preempt kthread starved for 9994 jiffies! 
g24469 f0x0 RCU_GP_DOING_FQS(6) ->state=0x0 ->cpu=31 [ 392.992769] rcu: Unless rcu_preempt kthread gets sufficient CPU time, OOM is now expected behavior After seeing this, I Enabled kdump and set kernel.panic_on_rcu_stall = 1 KDUMP INFO: WARNING: cpu 54: cannot find NT_PRSTATUS note       KERNEL: /usr/lib/debug/boot/vmlinux-6.8.0-1004-nvidia-64k [TAINTED]     DUMPFILE: /var/crash/202404172139/dump.202404172139 [PARTIAL DUMP]         CPUS: 72         DATE: Wed Apr 17 21:39:13 UTC 2024       UPTIME: 00:06:10 LOAD AVERAGE: 0.68, 0.63, 0.28        TASKS: 854     NODENAME: hinyari      RELEASE: 6.8.0-1005-nvidia-64k      VERSION: #5-Ubuntu SMP PREEMPT_DYNAMIC Wed Apr 17 11:26:46 UTC 2024      MACHINE: aarch64 (unknown Mhz)       MEMORY: 479.7 GB        PANIC: "Kernel panic - not syncing: RCU Stall"          PID: 0      COMMAND: "swapper/21"         TASK: ffff000082026880 (1 of 72) [THREAD_INFO: ffff000082026880]          CPU: 21        STATE: TASK_RUNNING (PANIC) [ 300.313144] nvidia: loading out-of-tree module taints kernel. [ 300.313153] nvidia: module verification failed: signature and/or required key missing - tainting kernel [ 300.316694] nvidia-nvlink: Nvlink Core is being initialized, major device number 506 [ 300.316699] [ 360.323454] rcu: INFO: rcu_preempt detected stalls on CPUs/tasks: [ 360.331206] rcu: 54-...0: (24 ticks this GP) idle=742c/1/0x4000000000000000 softirq=4931/4933 fqs=13148 [ 360.340903] rcu: hardirqs softirqs csw/system [ 360.346597] rcu: number: 0 0 0 [ 360.352291] rcu: cputime: 0 0 0 ==> 30031(ms) [ 360.359408] rcu: (detected by 21, t=60038 jiffies, g=25009, q=1123 ncpus=72) [ 360.366704] Sending NMI from CPU 21 to CPUs 54: [ 370.367310] rcu: rcu_preempt kthread starved for 9993 jiffies! g25009 f0x0 RCU_GP_DOING_FQS(6) ->state=0x0 ->cpu=31 [ 370.377983] rcu: Unless rcu_preempt kthread gets sufficient CPU time, OOM is now expected behavior. 
[ 370.387322] rcu: RCU grace-period kthread stack dump: [ 370.392482] task:rcu_preempt state:I stack:0 pid:17 tgid:17 ppid:2 flags:0x00000008 [ 370.392488] Call trace: [ 370.392489] __switch_to+0xd0/0x118 [ 370.392499] __schedule+0x2a8/0x7b0 [ 370.392501] schedule+0x40/0x168 [ 370.392502] schedule_timeout+0xac/0x1e0 [ 370.392505] rcu_gp_fqs_loop+0x128/0x508 [ 370.392512] rcu_gp_kthread+0x150/0x188 [ 370.392514] kthread+0xf8/0x110 [ 370.392519] ret_from_fork+0x10/0x20 [ 370.392524] rcu: Stack dump where RCU GP kthread last ran: [ 370.398128] Sending NMI from CPU 21 to CPUs 31: [ 370.398131] NMI backtrace for cpu 31 [ 370.398136] CPU: 31 PID: 0 Comm: swapper/31 Kdump: loaded Tainted: G OE 6.8.0-1005-nvidia-64k #5-Ubuntu [ 370.398139] Hardware name: /P3880, BIOS 01.02.01 20240207 [ 370.398140] pstate: 63400009 (nZCv daif +PAN -UAO +TCO +DIT -SSBS BTYPE=--) [ 370.398142] pc : cpuidle_enter_state+0xd8/0x790 [ 370.398150] lr : cpuidle_enter_state+0xcc/0x790 [ 370.398153] sp : ffff800081eefd70 [ 370.398154] x29: ffff800081eefd70 x28: 0000000000000000 x27: 0000000000000000 [ 370.398157] x26: 0000000000000000 x25: 000000563d67e4e0 x24: 0000000000000000 [ 370.398160] x23: ffffa0a1445699f8 x22: 0000000000000000 x21: 000000563d72ece0 [ 370.398162] x20: ffffa0a144569a10 x19: ffff00008fa4a800 x18: ffff800081f00030 [ 370.398165] x17: 0000000000000000 x16: 0000000000000000 x15: 0000ac8c73b08db0 [ 370.398168] x14: 0000000000000000 x13: 0000000000000000 x12: 0000000000000000 [ 370.398170] x11: 0000000000000000 x10: 2da0fbe3d5e8c649 x9 : ffffa0a1424fd244 [ 370.398173] x8 : ffff0000820559b8 x7 : 0000000000000000 x6 : 0000000000000000 [ 370.398175] x5 : 0000000000000000 x4 : 0000000000000000 x3 : 0000000000000000 [ 370.398178] x2 : 0000000000000000 x1 : 0000000000000000 x0 : 0000000000000000 [ 370.398181] Call trace: [ 370.398183] cpuidle_enter_state+0xd8/0x790 [ 370.398185] cpuidle_enter+0x44/0x78 [ 370.398195] cpuidle_idle_call+0x15c/0x210 [ 370.398202] do_idle+0xb0/0x130 [ 
370.398204] cpu_startup_entry+0x40/0x50 [ 370.398206] secondary_start_kernel+0xec/0x130 [ 370.398211] __secondary_switched+0xc0/0xc8 [ 370.399132] Kernel panic - not syncing: RCU Stall [ 370.403938] CPU: 21 PID: 0 Comm: swapper/21 Kdump: loaded Tainted: G OE 6.8.0-1005-nvidia-64k #5-Ubuntu [ 370.414876] Hardware name: /P3880, BIOS 01.02.01 20240207 [ 370.421192] Call trace: [ 370.423686] dump_backtrace+0xa4/0x150 [ 370.427514] show_stack+0x24/0x50 [ 370.430896] dump_stack_lvl+0x78/0xf8 [ 370.434640] dump_stack+0x1c/0x38 [ 370.438023] panic+0x3a4/0x440 [ 370.441141] print_other_cpu_stall+0x578/0x610 [ 370.445681] check_cpu_stall+0x240/0x300 [ 370.449686] rcu_pending+0x44/0x220 [ 370.453248] rcu_sched_clock_irq+0x7c/0x2c8 [ 370.457519] update_process_times+0x7c/0xf8 [ 370.461794] tick_sched_handle+0x3c/0x98 [ 370.465803] tick_nohz_highres_handler+0x5c/0xe8 [ 370.470520] __hrtimer_run_queues+0x164/0x398 [ 370.474969] hrtimer_interrupt+0xf4/0x278 [ 370.479063] arch_timer_handler_phys+0x38/0x80 [ 370.483607] handle_percpu_devid_irq+0x94/0x2b8 [ 370.488238] generic_handle_domain_irq+0x38/0x70 [ 370.492954] __gic_handle_irq_from_irqson.isra.0+0x180/0x310 [ 370.498743] gic_handle_irq+0x2c/0xa0 [ 370.502481] call_on_irq_stack+0x3c/0x50 [ 370.506486] do_interrupt_handler+0xb0/0xc8 [ 370.510759] el1_interrupt+0x48/0xf0 [ 370.514409] el1h_64_irq_handler+0x1c/0x40 [ 370.518592] el1h_64_irq+0x7c/0x80 [ 370.522063] cpuidle_enter_state+0xd8/0x790 [ 370.526336] cpuidle_enter+0x44/0x78 [ 370.529986] cpuidle_idle_call+0x15c/0x210 [ 370.534169] do_idle+0xb0/0x130 [ 370.537375] cpu_startup_entry+0x44/0x50 [ 370.541380] secondary_start_kernel+0xec/0x130 [ 370.545919] __secondary_switched+0xc0/0xc8 [ 370.550197] SMP: stopping secondary CPUs [ 371.601076] SMP: failed to stop secondary CPUs 0-20,22-71 [ 371.607097] Starting crashdump kernel... [ 371.611103] ------------[ cut here ]------------ [ 371.615820] Some CPUs may be stale, kdump will be unreliable. 
[ 371.621695] WARNING: CPU: 21 PID: 0 at arch/arm64/kernel/machine_kexec.c:174 machine_kexec+0x48/0x1f0 [ 371.631124] Modules linked in: nvidia(OE+) ecc qrtr cfg80211 binfmt_misc dax_hmem cxl_acpi cxl_core nvidia_cspmu acpi_ipmi ast cdc_ether cdc_subset arm_smmuv3_pmu arm_cspmu_module coresight_trbe usbnet arm_spe_pmu ipmi_ssif i2c_algo_bit uio_pdrv_genirq uio spi_nor ipmi_devintf ipmi_msghandler nls_iso8859_1 stm_p_basic coresight_stm stm_core cppc_cpufreq coresight_etm4x coresight_tmc coresight_funnel acpi_power_meter coresight dm_multipath efi_pstore nfnetlink dmi_sysfs ip_tables x_tables autofs4 btrfs blake2b_generic raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor xor_neon raid6_pq libcrc32c raid1 raid0 mlx5_ib ib_uverbs macsec ib_core mlx5_dpll crct10dif_ce mlx5_core polyval_ce polyval_generic ghash_ce sm4_ce_gcm sm4_ce_ccm sm4_ce sm4_ce_cipher sm4 sm3_ce mlxfw sm3 nvme psample sha3_ce i2c_smbus sha2_ce nvme_core tls sha256_arm64 xhci_pci sha1_ce xhci_pci_renesas pci_hyperv_intf nvme_auth i2c_tegra aes_neon_bs aes_neon_blk aes_ce_blk aes_ce_cipher [ 371.719810] CPU: 21 PID: 0 Comm: swapper/21 Kdump: loaded Tainted: G OE 6.8.0-1005-nvidia-64k #5-Ubuntu [ 371.730748] Hardware name: /P3880, BIOS 01.02.01 20240207 [ 371.737064] pstate: 634000c9 (nZCv daIF +PAN -UAO +TCO +DIT -SSBS BTYPE=--) [ 371.744180] pc : machine_kexec+0x48/0x1f0 [ 371.748275] lr : machine_kexec+0x48/0x1f0 [ 371.752369] sp : ffff8000802afa10 [ 371.755751] x29: ffff8000802afa10 x28: 0000000000000463 x27: 000000000000003c [ 371.763047] x26: 00000000000000c0 x25: 0000000000000280 x24: ffffa0a144268cb4 [ 371.770341] x23: ffffa0a14439f540 x22: ffffa0a1447cf4c0 x21: ffffa0a14481a000 [ 371.777636] x20: ffff0000d987e000 x19: ffff0000d987e000 x18: ffff800080ba0088 [ 371.784930] x17: 0000000000000000 x16: 0000000000000000 x15: 0000000000000463 [ 371.792225] x14: 0000000000000000 x13: 2e656c6261696c65 x12: 726e75206562206c [ 371.799519] x11: 6c697720706d7564 x10: 
0000000000000000 x9 : 0000000000000000 [ 371.806814] x8 : 0000000000000000 x7 : 0000000000000000 x6 : 0000000000000000 [ 371.814108] x5 : 0000000000000000 x4 : 0000000000000000 x3 : 0000000000000000 [ 371.821402] x2 : 0000000000000000 x1 : 0000000000000000 x0 : 0000000000000000 [ 371.828696] Call trace: [ 371.831189] machine_kexec+0x48/0x1f0 [ 371.834928] __crash_kexec+0x94/0x128 [ 371.838668] panic+0x380/0x440 [ 371.841784] print_other_cpu_stall+0x578/0x610 [ 371.846325] check_cpu_stall+0x240/0x300 [ 371.850331] rcu_pending+0x44/0x220 [ 371.853892] rcu_sched_clock_irq+0x7c/0x2c8 [ 371.858163] update_process_times+0x7c/0xf8 [ 371.862434] tick_sched_handle+0x3c/0x98 [ 371.866440] tick_nohz_highres_handler+0x5c/0xe8 [ 371.871156] __hrtimer_run_queues+0x164/0x398 [ 371.875605] hrtimer_interrupt+0xf4/0x278 [ 371.879700] arch_timer_handler_phys+0x38/0x80 [ 371.884240] handle_percpu_devid_irq+0x94/0x2b8 [ 371.888869] generic_handle_domain_irq+0x38/0x70 [ 371.893585] __gic_handle_irq_from_irqson.isra.0+0x180/0x310 [ 371.899368] gic_handle_irq+0x2c/0xa0 [ 371.903105] call_on_irq_stack+0x3c/0x50 [ 371.907110] do_interrupt_handler+0xb0/0xc8 [ 371.911382] el1_interrupt+0x48/0xf0 [ 371.915032] el1h_64_irq_handler+0x1c/0x40 [ 371.919215] el1h_64_irq+0x7c/0x80 [ 371.922686] cpuidle_enter_state+0xd8/0x790 [ 371.926958] cpuidle_enter+0x44/0x78 [ 371.930609] cpuidle_idle_call+0x15c/0x210 [ 371.934793] do_idle+0xb0/0x130 [ 371.937998] cpu_startup_entry+0x44/0x50 [ 371.942003] secondary_start_kernel+0xec/0x130 [ 371.946542] __secondary_switched+0xc0/0xc8 [ 371.950815] ---[ end trace 0000000000000000 ]--- In an attempt to get more debug info, I tried the open driver on GitHub, using https://github.com/NVIDIA/open-gpu-kernel-modules. Version 550.76 - loads successfully. Version 550.67 - loads successfully. Version 550.54.15 - crashes; this is the same version as the 550 package that hangs. Below is the crash info.
What is interesting is that in an attempt to capture more debug into I changed optimization in utils.mk from -O2 to -O0 and the crash went away. It also doesn't happen with -O1. CRASH INFO [ 8648.399518] nvidia-nvlink: Nvlink Core is being initialized, major device number 506 [ 8648.399560] [ 8648.399718] Internal error: Oops - FPAC: 0000000072000000 [#1] SMP [ 8648.407556] Modules linked in: nvidia(OE+) ecdh_generic ecc qrtr cfg80211 binfmt_misc dax_hmem cxl_acpi cxl_core nvidia_cspmu arm_smmuv3_pmu arm_cspmu_module coresight_trbe arm_spe_pmu acpi_ipmi ast cdc_ether cdc_subset ipmi_ssif usbnet i2c_algo_bit uio_pdrv_genirq uio spi_nor ipmi_devintf ipmi_msghandler nls_iso8859_1 stm_p_basic coresight_stm stm_core cppc_cpufreq coresight_etm4x coresight_tmc coresight_funnel acpi_power_meter coresight dm_multipath efi_pstore nfnetlink dmi_sysfs ip_tables x_tables autofs4 btrfs blake2b_generic raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor xor_neon raid6_pq libcrc32c raid1 raid0 mlx5_ib ib_uverbs macsec ib_core mlx5_dpll mlx5_core crct10dif_ce polyval_ce polyval_generic ghash_ce sm4_ce_gcm sm4_ce_ccm sm4_ce sm4_ce_cipher sm4 sm3_ce sm3 mlxfw i2c_smbus nvme psample sha3_ce sha2_ce nvme_core tls sha256_arm64 xhci_pci sha1_ce xhci_pci_renesas pci_hyperv_intf nvme_auth i2c_tegra aes_neon_bs aes_neon_blk aes_ce_blk aes_ce_cipher [last unloaded: nvidia(OE)] [ 8648.407608] [ 8648.501397] CPU: 5 PID: 48130 Comm: insmod Kdump: loaded Tainted: G OE 6.8.0-1004-nvidia-64k #4 [ 8648.511625] Hardware name: /P3880, BIOS 01.02.01 20240207 [ 8648.517941] pstate: 63400009 (nZCv daif +PAN -UAO +TCO +DIT -SSBS BTYPE=--) [ 8648.525058] pc : __kmalloc+0x1e0/0x490 [ 8648.528892] lr : 0xffffa00000000000 [ 8648.532482] sp : ffff8000d132f5f0 [ 8648.535864] x29: ffff8000d132f5f0 x28: 0000000000000000 x27: ffffa00084d50484 [ 8648.543159] x26: 00000000000001f8 x25: 0000000000aa1d70 x24: ffff0000c2aba828 [ 8648.550454] x23: ffffa00085026380 x22: ffff80009d3e0020 x21: 
ffff8000d132f7c8 [ 8648.557749] x20: 0000000000000038 x19: ffff8000d132f628 x18: ffff8000d132f5e4 [ 8648.565043] x17: 0000000000000000 x16: 0000000000000000 x15: 0000000000000004 [ 8648.572337] x14: 0000000000000000 x13: 0000000000000000 x12: 0000000000000000 [ 8648.579632] x11: 0000000000000000 x10: ffff8000d132f670 x9 : ffffa000806f73ec [ 8648.586926] x8 : ffff0000c2a98240 x7 : 0000000000000000 x6 : 0000000000000000 [ 8648.594221] x5 : 0000000000000000 x4 : 0000000000000000 x3 : 0000000000000000 [ 8648.601516] x2 : 0000000000000000 x1 : ffff000100084480 x0 : ffff0000c2a98200 [ 8648.608810] Call trace: [ 8648.611305] __kmalloc+0x1e0/0x490 [ 8648.614778] 0x8000604466e4a000 [ 8648.617986] Code: a9435bf5 a94463f7 910183ff f85f8e5e (d50323bf) [ 8648.624219] SMP: stopping secondary CPUs Using both -generic and -nvidia 6.8 kernels I'm seeing a hang when I load the nvidia driver. $ sudo dmidecode -t 0 # dmidecode 3.5 Getting SMBIOS data from sysfs. SMBIOS 3.6.0 present. # SMBIOS implementations newer than version 3.5.0 are not # fully supported by this version of dmidecode. 
Handle 0x0001, DMI type 0, 26 bytes BIOS Information  Vendor: NVIDIA  Version: 01.02.01  Release Date: 20240207  ROM Size: 64 MB  Characteristics:   PCI is supported   PNP is supported   BIOS is upgradeable   BIOS shadowing is allowed   Boot from CD is supported   Selectable boot is supported   Serial services are supported (int 14h)   ACPI is supported   Targeted content distribution is supported   UEFI is supported  Firmware Revision: 0.0 CONSOLE RCU STALL MESSAGE: [ 382.938326] rcu: INFO: rcu_preempt detected stalls on CPUs/tasks: [ 382.946075] rcu: 53-...0: (4 ticks this GP) idle=1c2c/1/0x4000000000000000 softirq=4866/4868 fqs=14124 [ 382.955683] rcu: hardirqs softirqs csw/system [ 382.961378] rcu: number: 0 0 0 [ 382.967071] rcu: cputime: 0 0 0 ==> 30026(ms) [ 382.974189] rcu: (detected by 52, t=60034 jiffies, g=24469, q=1199 ncpus=72) [ 392.982095] rcu: rcu_preempt kthread starved for 9994 jiffies! g24469 f0x0 RCU_GP_DOING_FQS(6) ->state=0x0 ->cpu=31 [ 392.992769] rcu: Unless rcu_preempt kthread gets sufficient CPU time, OOM is now expected behavior After seeing this, I Enabled kdump and set kernel.panic_on_rcu_stall = 1 KDUMP INFO: WARNING: cpu 54: cannot find NT_PRSTATUS note       KERNEL: /usr/lib/debug/boot/vmlinux-6.8.0-1004-nvidia-64k [TAINTED]     DUMPFILE: /var/crash/202404172139/dump.202404172139 [PARTIAL DUMP]         CPUS: 72         DATE: Wed Apr 17 21:39:13 UTC 2024       UPTIME: 00:06:10 LOAD AVERAGE: 0.68, 0.63, 0.28        TASKS: 854     NODENAME: hinyari      RELEASE: 6.8.0-1005-nvidia-64k      VERSION: #5-Ubuntu SMP PREEMPT_DYNAMIC Wed Apr 17 11:26:46 UTC 2024      MACHINE: aarch64 (unknown Mhz)       MEMORY: 479.7 GB        PANIC: "Kernel panic - not syncing: RCU Stall"          PID: 0      COMMAND: "swapper/21"         TASK: ffff000082026880 (1 of 72) [THREAD_INFO: ffff000082026880]          CPU: 21        STATE: TASK_RUNNING (PANIC) [ 300.313144] nvidia: loading out-of-tree module taints kernel. 
[ 300.313153] nvidia: module verification failed: signature and/or required key missing - tainting kernel [ 300.316694] nvidia-nvlink: Nvlink Core is being initialized, major device number 506 [ 300.316699] [ 360.323454] rcu: INFO: rcu_preempt detected stalls on CPUs/tasks: [ 360.331206] rcu: 54-...0: (24 ticks this GP) idle=742c/1/0x4000000000000000 softirq=4931/4933 fqs=13148 [ 360.340903] rcu: hardirqs softirqs csw/system [ 360.346597] rcu: number: 0 0 0 [ 360.352291] rcu: cputime: 0 0 0 ==> 30031(ms) [ 360.359408] rcu: (detected by 21, t=60038 jiffies, g=25009, q=1123 ncpus=72) [ 360.366704] Sending NMI from CPU 21 to CPUs 54: [ 370.367310] rcu: rcu_preempt kthread starved for 9993 jiffies! g25009 f0x0 RCU_GP_DOING_FQS(6) ->state=0x0 ->cpu=31 [ 370.377983] rcu: Unless rcu_preempt kthread gets sufficient CPU time, OOM is now expected behavior. [ 370.387322] rcu: RCU grace-period kthread stack dump: [ 370.392482] task:rcu_preempt state:I stack:0 pid:17 tgid:17 ppid:2 flags:0x00000008 [ 370.392488] Call trace: [ 370.392489] __switch_to+0xd0/0x118 [ 370.392499] __schedule+0x2a8/0x7b0 [ 370.392501] schedule+0x40/0x168 [ 370.392502] schedule_timeout+0xac/0x1e0 [ 370.392505] rcu_gp_fqs_loop+0x128/0x508 [ 370.392512] rcu_gp_kthread+0x150/0x188 [ 370.392514] kthread+0xf8/0x110 [ 370.392519] ret_from_fork+0x10/0x20 [ 370.392524] rcu: Stack dump where RCU GP kthread last ran: [ 370.398128] Sending NMI from CPU 21 to CPUs 31: [ 370.398131] NMI backtrace for cpu 31 [ 370.398136] CPU: 31 PID: 0 Comm: swapper/31 Kdump: loaded Tainted: G OE 6.8.0-1005-nvidia-64k #5-Ubuntu [ 370.398139] Hardware name: /P3880, BIOS 01.02.01 20240207 [ 370.398140] pstate: 63400009 (nZCv daif +PAN -UAO +TCO +DIT -SSBS BTYPE=--) [ 370.398142] pc : cpuidle_enter_state+0xd8/0x790 [ 370.398150] lr : cpuidle_enter_state+0xcc/0x790 [ 370.398153] sp : ffff800081eefd70 [ 370.398154] x29: ffff800081eefd70 x28: 0000000000000000 x27: 0000000000000000 [ 370.398157] x26: 0000000000000000 x25: 000000563d67e4e0 
x24: 0000000000000000 [ 370.398160] x23: ffffa0a1445699f8 x22: 0000000000000000 x21: 000000563d72ece0 [ 370.398162] x20: ffffa0a144569a10 x19: ffff00008fa4a800 x18: ffff800081f00030 [ 370.398165] x17: 0000000000000000 x16: 0000000000000000 x15: 0000ac8c73b08db0 [ 370.398168] x14: 0000000000000000 x13: 0000000000000000 x12: 0000000000000000 [ 370.398170] x11: 0000000000000000 x10: 2da0fbe3d5e8c649 x9 : ffffa0a1424fd244 [ 370.398173] x8 : ffff0000820559b8 x7 : 0000000000000000 x6 : 0000000000000000 [ 370.398175] x5 : 0000000000000000 x4 : 0000000000000000 x3 : 0000000000000000 [ 370.398178] x2 : 0000000000000000 x1 : 0000000000000000 x0 : 0000000000000000 [ 370.398181] Call trace: [ 370.398183] cpuidle_enter_state+0xd8/0x790 [ 370.398185] cpuidle_enter+0x44/0x78 [ 370.398195] cpuidle_idle_call+0x15c/0x210 [ 370.398202] do_idle+0xb0/0x130 [ 370.398204] cpu_startup_entry+0x40/0x50 [ 370.398206] secondary_start_kernel+0xec/0x130 [ 370.398211] __secondary_switched+0xc0/0xc8 [ 370.399132] Kernel panic - not syncing: RCU Stall [ 370.403938] CPU: 21 PID: 0 Comm: swapper/21 Kdump: loaded Tainted: G OE 6.8.0-1005-nvidia-64k #5-Ubuntu [ 370.414876] Hardware name: /P3880, BIOS 01.02.01 20240207 [ 370.421192] Call trace: [ 370.423686] dump_backtrace+0xa4/0x150 [ 370.427514] show_stack+0x24/0x50 [ 370.430896] dump_stack_lvl+0x78/0xf8 [ 370.434640] dump_stack+0x1c/0x38 [ 370.438023] panic+0x3a4/0x440 [ 370.441141] print_other_cpu_stall+0x578/0x610 [ 370.445681] check_cpu_stall+0x240/0x300 [ 370.449686] rcu_pending+0x44/0x220 [ 370.453248] rcu_sched_clock_irq+0x7c/0x2c8 [ 370.457519] update_process_times+0x7c/0xf8 [ 370.461794] tick_sched_handle+0x3c/0x98 [ 370.465803] tick_nohz_highres_handler+0x5c/0xe8 [ 370.470520] __hrtimer_run_queues+0x164/0x398 [ 370.474969] hrtimer_interrupt+0xf4/0x278 [ 370.479063] arch_timer_handler_phys+0x38/0x80 [ 370.483607] handle_percpu_devid_irq+0x94/0x2b8 [ 370.488238] generic_handle_domain_irq+0x38/0x70 [ 370.492954] 
__gic_handle_irq_from_irqson.isra.0+0x180/0x310 [ 370.498743] gic_handle_irq+0x2c/0xa0 [ 370.502481] call_on_irq_stack+0x3c/0x50 [ 370.506486] do_interrupt_handler+0xb0/0xc8 [ 370.510759] el1_interrupt+0x48/0xf0 [ 370.514409] el1h_64_irq_handler+0x1c/0x40 [ 370.518592] el1h_64_irq+0x7c/0x80 [ 370.522063] cpuidle_enter_state+0xd8/0x790 [ 370.526336] cpuidle_enter+0x44/0x78 [ 370.529986] cpuidle_idle_call+0x15c/0x210 [ 370.534169] do_idle+0xb0/0x130 [ 370.537375] cpu_startup_entry+0x44/0x50 [ 370.541380] secondary_start_kernel+0xec/0x130 [ 370.545919] __secondary_switched+0xc0/0xc8 [ 370.550197] SMP: stopping secondary CPUs [ 371.601076] SMP: failed to stop secondary CPUs 0-20,22-71 [ 371.607097] Starting crashdump kernel... [ 371.611103] ------------[ cut here ]------------ [ 371.615820] Some CPUs may be stale, kdump will be unreliable. [ 371.621695] WARNING: CPU: 21 PID: 0 at arch/arm64/kernel/machine_kexec.c:174 machine_kexec+0x48/0x1f0 [ 371.631124] Modules linked in: nvidia(OE+) ecc qrtr cfg80211 binfmt_misc dax_hmem cxl_acpi cxl_core nvidia_cspmu acpi_ipmi ast cdc_ether cdc_subset arm_smmuv3_pmu arm_cspmu_module coresight_trbe usbnet arm_spe_pmu ipmi_ssif i2c_algo_bit uio_pdrv_genirq uio spi_nor ipmi_devintf ipmi_msghandler nls_iso8859_1 stm_p_basic coresight_stm stm_core cppc_cpufreq coresight_etm4x coresight_tmc coresight_funnel acpi_power_meter coresight dm_multipath efi_pstore nfnetlink dmi_sysfs ip_tables x_tables autofs4 btrfs blake2b_generic raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor xor_neon raid6_pq libcrc32c raid1 raid0 mlx5_ib ib_uverbs macsec ib_core mlx5_dpll crct10dif_ce mlx5_core polyval_ce polyval_generic ghash_ce sm4_ce_gcm sm4_ce_ccm sm4_ce sm4_ce_cipher sm4 sm3_ce mlxfw sm3 nvme psample sha3_ce i2c_smbus sha2_ce nvme_core tls sha256_arm64 xhci_pci sha1_ce xhci_pci_renesas pci_hyperv_intf nvme_auth i2c_tegra aes_neon_bs aes_neon_blk aes_ce_blk aes_ce_cipher [ 371.719810] CPU: 21 PID: 0 Comm: swapper/21 Kdump: 
loaded Tainted: G OE 6.8.0-1005-nvidia-64k #5-Ubuntu [ 371.730748] Hardware name: /P3880, BIOS 01.02.01 20240207 [ 371.737064] pstate: 634000c9 (nZCv daIF +PAN -UAO +TCO +DIT -SSBS BTYPE=--) [ 371.744180] pc : machine_kexec+0x48/0x1f0 [ 371.748275] lr : machine_kexec+0x48/0x1f0 [ 371.752369] sp : ffff8000802afa10 [ 371.755751] x29: ffff8000802afa10 x28: 0000000000000463 x27: 000000000000003c [ 371.763047] x26: 00000000000000c0 x25: 0000000000000280 x24: ffffa0a144268cb4 [ 371.770341] x23: ffffa0a14439f540 x22: ffffa0a1447cf4c0 x21: ffffa0a14481a000 [ 371.777636] x20: ffff0000d987e000 x19: ffff0000d987e000 x18: ffff800080ba0088 [ 371.784930] x17: 0000000000000000 x16: 0000000000000000 x15: 0000000000000463 [ 371.792225] x14: 0000000000000000 x13: 2e656c6261696c65 x12: 726e75206562206c [ 371.799519] x11: 6c697720706d7564 x10: 0000000000000000 x9 : 0000000000000000 [ 371.806814] x8 : 0000000000000000 x7 : 0000000000000000 x6 : 0000000000000000 [ 371.814108] x5 : 0000000000000000 x4 : 0000000000000000 x3 : 0000000000000000 [ 371.821402] x2 : 0000000000000000 x1 : 0000000000000000 x0 : 0000000000000000 [ 371.828696] Call trace: [ 371.831189] machine_kexec+0x48/0x1f0 [ 371.834928] __crash_kexec+0x94/0x128 [ 371.838668] panic+0x380/0x440 [ 371.841784] print_other_cpu_stall+0x578/0x610 [ 371.846325] check_cpu_stall+0x240/0x300 [ 371.850331] rcu_pending+0x44/0x220 [ 371.853892] rcu_sched_clock_irq+0x7c/0x2c8 [ 371.858163] update_process_times+0x7c/0xf8 [ 371.862434] tick_sched_handle+0x3c/0x98 [ 371.866440] tick_nohz_highres_handler+0x5c/0xe8 [ 371.871156] __hrtimer_run_queues+0x164/0x398 [ 371.875605] hrtimer_interrupt+0xf4/0x278 [ 371.879700] arch_timer_handler_phys+0x38/0x80 [ 371.884240] handle_percpu_devid_irq+0x94/0x2b8 [ 371.888869] generic_handle_domain_irq+0x38/0x70 [ 371.893585] __gic_handle_irq_from_irqson.isra.0+0x180/0x310 [ 371.899368] gic_handle_irq+0x2c/0xa0 [ 371.903105] call_on_irq_stack+0x3c/0x50 [ 371.907110] do_interrupt_handler+0xb0/0xc8 [ 371.911382] 
el1_interrupt+0x48/0xf0 [ 371.915032] el1h_64_irq_handler+0x1c/0x40 [ 371.919215] el1h_64_irq+0x7c/0x80 [ 371.922686] cpuidle_enter_state+0xd8/0x790 [ 371.926958] cpuidle_enter+0x44/0x78 [ 371.930609] cpuidle_idle_call+0x15c/0x210 [ 371.934793] do_idle+0xb0/0x130 [ 371.937998] cpu_startup_entry+0x44/0x50 [ 371.942003] secondary_start_kernel+0xec/0x130 [ 371.946542] __secondary_switched+0xc0/0xc8 [ 371.950815] ---[ end trace 0000000000000000 ]--- In an attempt to get more debug info, I tried the open driver in github Using https://github.com/NVIDIA/open-gpu-kernel-modules Version 550.76- loads successfully Version 550.67- loads successfully Version 550.54.15 - crashes - which is the same version as the 550 package that hangs. Below is the crash info. What is interesting is that in an attempt to capture more debug info, I changed the optimization flag in utils.mk from -O2 to -O0 and the crash went away. It also doesn't happen with -O1. CRASH INFO [ 8648.399518] nvidia-nvlink: Nvlink Core is being initialized, major device number 506 [ 8648.399560] [ 8648.399718] Internal error: Oops - FPAC: 0000000072000000 [#1] SMP [ 8648.407556] Modules linked in: nvidia(OE+) ecdh_generic ecc qrtr cfg80211 binfmt_misc dax_hmem cxl_acpi cxl_core nvidia_cspmu arm_smmuv3_pmu arm_cspmu_module coresight_trbe arm_spe_pmu acpi_ipmi ast cdc_ether cdc_subset ipmi_ssif usbnet i2c_algo_bit uio_pdrv_genirq uio spi_nor ipmi_devintf ipmi_msghandler nls_iso8859_1 stm_p_basic coresight_stm stm_core cppc_cpufreq coresight_etm4x coresight_tmc coresight_funnel acpi_power_meter coresight dm_multipath efi_pstore nfnetlink dmi_sysfs ip_tables x_tables autofs4 btrfs blake2b_generic raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor xor_neon raid6_pq libcrc32c raid1 raid0 mlx5_ib ib_uverbs macsec ib_core mlx5_dpll mlx5_core crct10dif_ce polyval_ce polyval_generic ghash_ce sm4_ce_gcm sm4_ce_ccm sm4_ce sm4_ce_cipher sm4 sm3_ce sm3 mlxfw i2c_smbus nvme psample sha3_ce sha2_ce 
nvme_core tls sha256_arm64 xhci_pci sha1_ce xhci_pci_renesas pci_hyperv_intf nvme_auth i2c_tegra aes_neon_bs aes_neon_blk aes_ce_blk aes_ce_cipher [last unloaded: nvidia(OE)] [ 8648.407608] [ 8648.501397] CPU: 5 PID: 48130 Comm: insmod Kdump: loaded Tainted: G OE 6.8.0-1004-nvidia-64k #4 [ 8648.511625] Hardware name: /P3880, BIOS 01.02.01 20240207 [ 8648.517941] pstate: 63400009 (nZCv daif +PAN -UAO +TCO +DIT -SSBS BTYPE=--) [ 8648.525058] pc : __kmalloc+0x1e0/0x490 [ 8648.528892] lr : 0xffffa00000000000 [ 8648.532482] sp : ffff8000d132f5f0 [ 8648.535864] x29: ffff8000d132f5f0 x28: 0000000000000000 x27: ffffa00084d50484 [ 8648.543159] x26: 00000000000001f8 x25: 0000000000aa1d70 x24: ffff0000c2aba828 [ 8648.550454] x23: ffffa00085026380 x22: ffff80009d3e0020 x21: ffff8000d132f7c8 [ 8648.557749] x20: 0000000000000038 x19: ffff8000d132f628 x18: ffff8000d132f5e4 [ 8648.565043] x17: 0000000000000000 x16: 0000000000000000 x15: 0000000000000004 [ 8648.572337] x14: 0000000000000000 x13: 0000000000000000 x12: 0000000000000000 [ 8648.579632] x11: 0000000000000000 x10: ffff8000d132f670 x9 : ffffa000806f73ec [ 8648.586926] x8 : ffff0000c2a98240 x7 : 0000000000000000 x6 : 0000000000000000 [ 8648.594221] x5 : 0000000000000000 x4 : 0000000000000000 x3 : 0000000000000000 [ 8648.601516] x2 : 0000000000000000 x1 : ffff000100084480 x0 : ffff0000c2a98200 [ 8648.608810] Call trace: [ 8648.611305] __kmalloc+0x1e0/0x490 [ 8648.614778] 0x8000604466e4a000 [ 8648.617986] Code: a9435bf5 a94463f7 910183ff f85f8e5e (d50323bf) [ 8648.624219] SMP: stopping secondary CPUs
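Note on reading the stall report in the description above: the "rcu:" console lines name both the CPU that stopped responding and the CPU that noticed. A minimal sketch of pulling those fields out of a saved console log follows. The function name `summarize_stall` and the two regular expressions are hypothetical helpers written for this note, not part of any kernel or crash tooling, and the sample lines are copied from the description.

```python
import re

# Sample lines copied verbatim from the console output in the description.
LOG = """\
[ 300.316694] nvidia-nvlink: Nvlink Core is being initialized, major device number 506
[ 360.323454] rcu: INFO: rcu_preempt detected stalls on CPUs/tasks:
[ 360.331206] rcu: 54-...0: (24 ticks this GP) idle=742c/1/0x4000000000000000 softirq=4931/4933 fqs=13148
[ 360.359408] rcu: (detected by 21, t=60038 jiffies, g=25009, q=1123 ncpus=72)
"""

# Hypothetical patterns, written against the line shapes seen in this log only.
STALLED = re.compile(r"rcu:\s+(\d+)-\S+:\s+\((\d+) ticks this GP\)")
DETECTED = re.compile(r"rcu:\s+\(detected by (\d+), t=(\d+) jiffies")


def summarize_stall(log):
    """Collect the stalled CPUs, the detecting CPU, and the stall age in jiffies."""
    summary = {"stalled_cpus": [], "detected_by": None, "jiffies": None}
    for line in log.splitlines():
        match = STALLED.search(line)
        if match:
            # First group is the CPU number that failed to report a quiescent state.
            summary["stalled_cpus"].append(int(match.group(1)))
            continue
        match = DETECTED.search(line)
        if match:
            summary["detected_by"] = int(match.group(1))
            summary["jiffies"] = int(match.group(2))
    return summary


print(summarize_stall(LOG))
```

For the sample above this reports CPU 54 as stalled and CPU 21 as the detector, which matches the "Sending NMI from CPU 21 to CPUs 54" line in the description. Other kernel versions may format these lines differently, so the patterns should be treated as a starting point only.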
2024-04-23 16:00:29 Taihsiang Ho bug added subscriber Taihsiang Ho
2024-04-24 16:10:13 Mitchell Augustin nvidia-graphics-drivers-535-server (Ubuntu): assignee Mitchell Augustin (mitchellaugustin)
2024-04-24 16:10:18 Mitchell Augustin nvidia-graphics-drivers-550-server (Ubuntu): assignee Mitchell Augustin (mitchellaugustin)
2024-04-24 23:29:20 Mitchell Augustin bug added subscriber Mitchell Augustin
2024-04-25 13:28:49 Fabio Augusto Miranda Martins bug added subscriber Fabio Augusto Miranda Martins
2024-04-25 21:57:07 kalvdans bug added subscriber kalvdans
2024-05-03 07:36:19 Simon Funk bug added subscriber Simon Funk