With both the -generic and -nvidia 6.8 kernels, I'm seeing a hang when I load the nvidia driver.

[ 382.938326] rcu: INFO: rcu_preempt detected stalls on CPUs/tasks:
[ 382.946075] rcu: 53-...0: (4 ticks this GP) idle=1c2c/1/0x4000000000000000 softirq=4866/4868 fqs=14124
[ 382.955683] rcu: hardirqs softirqs csw/system
[ 382.961378] rcu: number: 0 0 0
[ 382.967071] rcu: cputime: 0 0 0 ==> 30026(ms)
[ 382.974189] rcu: (detected by 52, t=60034 jiffies, g=24469, q=1199 ncpus=72)
[ 392.982095] rcu: rcu_preempt kthread starved for 9994 jiffies! g24469 f0x0 RCU_GP_DOING_FQS(6) ->state=0x0 ->cpu=31
[ 392.992769] rcu: Unless rcu_preempt kthread gets sufficient CPU time, OOM is now expected behavior

After seeing this, I enabled kdump and set kernel.panic_on_rcu_stall = 1.

KDUMP INFO

WARNING: cpu 54: cannot find NT_PRSTATUS note
KERNEL: /usr/lib/debug/boot/vmlinux-6.8.0-1004-nvidia-64k [TAINTED]
DUMPFILE: /var/crash/202404172139/dump.202404172139 [PARTIAL DUMP]
CPUS: 72
DATE: Wed Apr 17 21:39:13 UTC 2024
UPTIME: 00:06:10
LOAD AVERAGE: 0.68, 0.63, 0.28
TASKS: 854
NODENAME: hinyari
RELEASE: 6.8.0-1005-nvidia-64k
VERSION: #5-Ubuntu SMP PREEMPT_DYNAMIC Wed Apr 17 11:26:46 UTC 2024
MACHINE: aarch64 (unknown Mhz)
MEMORY: 479.7 GB
PANIC: "Kernel panic - not syncing: RCU Stall"
PID: 0
COMMAND: "swapper/21"
TASK: ffff000082026880 (1 of 72) [THREAD_INFO: ffff000082026880]
CPU: 21
STATE: TASK_RUNNING (PANIC)

[ 300.313144] nvidia: loading out-of-tree module taints kernel.
[ 300.313153] nvidia: module verification failed: signature and/or required key missing - tainting kernel
[ 300.316694] nvidia-nvlink: Nvlink Core is being initialized, major device number 506
[ 300.316699]
[ 360.323454] rcu: INFO: rcu_preempt detected stalls on CPUs/tasks:
[ 360.331206] rcu: 54-...0: (24 ticks this GP) idle=742c/1/0x4000000000000000 softirq=4931/4933 fqs=13148
[ 360.340903] rcu: hardirqs softirqs csw/system
[ 360.346597] rcu: number: 0 0 0
[ 360.352291] rcu: cputime: 0 0 0 ==> 30031(ms)
[ 360.359408] rcu: (detected by 21, t=60038 jiffies, g=25009, q=1123 ncpus=72)
[ 360.366704] Sending NMI from CPU 21 to CPUs 54:
[ 370.367310] rcu: rcu_preempt kthread starved for 9993 jiffies! g25009 f0x0 RCU_GP_DOING_FQS(6) ->state=0x0 ->cpu=31
[ 370.377983] rcu: Unless rcu_preempt kthread gets sufficient CPU time, OOM is now expected behavior.
[ 370.387322] rcu: RCU grace-period kthread stack dump:
[ 370.392482] task:rcu_preempt state:I stack:0 pid:17 tgid:17 ppid:2 flags:0x00000008
[ 370.392488] Call trace:
[ 370.392489] __switch_to+0xd0/0x118
[ 370.392499] __schedule+0x2a8/0x7b0
[ 370.392501] schedule+0x40/0x168
[ 370.392502] schedule_timeout+0xac/0x1e0
[ 370.392505] rcu_gp_fqs_loop+0x128/0x508
[ 370.392512] rcu_gp_kthread+0x150/0x188
[ 370.392514] kthread+0xf8/0x110
[ 370.392519] ret_from_fork+0x10/0x20
[ 370.392524] rcu: Stack dump where RCU GP kthread last ran:
[ 370.398128] Sending NMI from CPU 21 to CPUs 31:
[ 370.398131] NMI backtrace for cpu 31
[ 370.398136] CPU: 31 PID: 0 Comm: swapper/31 Kdump: loaded Tainted: G OE 6.8.0-1005-nvidia-64k #5-Ubuntu
[ 370.398139] Hardware name: /P3880, BIOS 01.02.01 20240207
[ 370.398140] pstate: 63400009 (nZCv daif +PAN -UAO +TCO +DIT -SSBS BTYPE=--)
[ 370.398142] pc : cpuidle_enter_state+0xd8/0x790
[ 370.398150] lr : cpuidle_enter_state+0xcc/0x790
[ 370.398153] sp : ffff800081eefd70
[ 370.398154] x29: ffff800081eefd70 x28: 0000000000000000 x27: 0000000000000000
[ 370.398157] x26: 0000000000000000 x25: 000000563d67e4e0 x24: 0000000000000000
[ 370.398160] x23: ffffa0a1445699f8 x22: 0000000000000000 x21: 000000563d72ece0
[ 370.398162] x20: ffffa0a144569a10 x19: ffff00008fa4a800 x18: ffff800081f00030
[ 370.398165] x17: 0000000000000000 x16: 0000000000000000 x15: 0000ac8c73b08db0
[ 370.398168] x14: 0000000000000000 x13: 0000000000000000 x12: 0000000000000000
[ 370.398170] x11: 0000000000000000 x10: 2da0fbe3d5e8c649 x9 : ffffa0a1424fd244
[ 370.398173] x8 : ffff0000820559b8 x7 : 0000000000000000 x6 : 0000000000000000
[ 370.398175] x5 : 0000000000000000 x4 : 0000000000000000 x3 : 0000000000000000
[ 370.398178] x2 : 0000000000000000 x1 : 0000000000000000 x0 : 0000000000000000
[ 370.398181] Call trace:
[ 370.398183] cpuidle_enter_state+0xd8/0x790
[ 370.398185] cpuidle_enter+0x44/0x78
[ 370.398195] cpuidle_idle_call+0x15c/0x210
[ 370.398202] do_idle+0xb0/0x130
[ 370.398204] cpu_startup_entry+0x40/0x50
[ 370.398206] secondary_start_kernel+0xec/0x130
[ 370.398211] __secondary_switched+0xc0/0xc8
[ 370.399132] Kernel panic - not syncing: RCU Stall
[ 370.403938] CPU: 21 PID: 0 Comm: swapper/21 Kdump: loaded Tainted: G OE 6.8.0-1005-nvidia-64k #5-Ubuntu
[ 370.414876] Hardware name: /P3880, BIOS 01.02.01 20240207
[ 370.421192] Call trace:
[ 370.423686] dump_backtrace+0xa4/0x150
[ 370.427514] show_stack+0x24/0x50
[ 370.430896] dump_stack_lvl+0x78/0xf8
[ 370.434640] dump_stack+0x1c/0x38
[ 370.438023] panic+0x3a4/0x440
[ 370.441141] print_other_cpu_stall+0x578/0x610
[ 370.445681] check_cpu_stall+0x240/0x300
[ 370.449686] rcu_pending+0x44/0x220
[ 370.453248] rcu_sched_clock_irq+0x7c/0x2c8
[ 370.457519] update_process_times+0x7c/0xf8
[ 370.461794] tick_sched_handle+0x3c/0x98
[ 370.465803] tick_nohz_highres_handler+0x5c/0xe8
[ 370.470520] __hrtimer_run_queues+0x164/0x398
[ 370.474969] hrtimer_interrupt+0xf4/0x278
[ 370.479063] arch_timer_handler_phys+0x38/0x80
[ 370.483607] handle_percpu_devid_irq+0x94/0x2b8
[ 370.488238] generic_handle_domain_irq+0x38/0x70
[ 370.492954] __gic_handle_irq_from_irqson.isra.0+0x180/0x310
[ 370.498743] gic_handle_irq+0x2c/0xa0
[ 370.502481] call_on_irq_stack+0x3c/0x50
[ 370.506486] do_interrupt_handler+0xb0/0xc8
[ 370.510759] el1_interrupt+0x48/0xf0
[ 370.514409] el1h_64_irq_handler+0x1c/0x40
[ 370.518592] el1h_64_irq+0x7c/0x80
[ 370.522063] cpuidle_enter_state+0xd8/0x790
[ 370.526336] cpuidle_enter+0x44/0x78
[ 370.529986] cpuidle_idle_call+0x15c/0x210
[ 370.534169] do_idle+0xb0/0x130
[ 370.537375] cpu_startup_entry+0x44/0x50
[ 370.541380] secondary_start_kernel+0xec/0x130
[ 370.545919] __secondary_switched+0xc0/0xc8
[ 370.550197] SMP: stopping secondary CPUs
[ 371.601076] SMP: failed to stop secondary CPUs 0-20,22-71
[ 371.607097] Starting crashdump kernel...
[ 371.611103] ------------[ cut here ]------------
[ 371.615820] Some CPUs may be stale, kdump will be unreliable.
[ 371.621695] WARNING: CPU: 21 PID: 0 at arch/arm64/kernel/machine_kexec.c:174 machine_kexec+0x48/0x1f0
[ 371.631124] Modules linked in: nvidia(OE+) ecc qrtr cfg80211 binfmt_misc dax_hmem cxl_acpi cxl_core nvidia_cspmu acpi_ipmi ast cdc_ether cdc_subset arm_smmuv3_pmu arm_cspmu_module coresight_trbe usbnet arm_spe_pmu ipmi_ssif i2c_algo_bit uio_pdrv_genirq uio spi_nor ipmi_devintf ipmi_msghandler nls_iso8859_1 stm_p_basic coresight_stm stm_core cppc_cpufreq coresight_etm4x coresight_tmc coresight_funnel acpi_power_meter coresight dm_multipath efi_pstore nfnetlink dmi_sysfs ip_tables x_tables autofs4 btrfs blake2b_generic raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor xor_neon raid6_pq libcrc32c raid1 raid0 mlx5_ib ib_uverbs macsec ib_core mlx5_dpll crct10dif_ce mlx5_core polyval_ce polyval_generic ghash_ce sm4_ce_gcm sm4_ce_ccm sm4_ce sm4_ce_cipher sm4 sm3_ce mlxfw sm3 nvme psample sha3_ce i2c_smbus sha2_ce nvme_core tls sha256_arm64 xhci_pci sha1_ce xhci_pci_renesas pci_hyperv_intf nvme_auth i2c_tegra aes_neon_bs aes_neon_blk aes_ce_blk aes_ce_cipher
[ 371.719810] CPU: 21 PID: 0 Comm: swapper/21 Kdump: loaded Tainted: G OE 6.8.0-1005-nvidia-64k #5-Ubuntu
[ 371.730748] Hardware name: /P3880, BIOS 01.02.01 20240207
[ 371.737064] pstate: 634000c9 (nZCv daIF +PAN -UAO +TCO +DIT -SSBS BTYPE=--)
[ 371.744180] pc : machine_kexec+0x48/0x1f0
[ 371.748275] lr : machine_kexec+0x48/0x1f0
[ 371.752369] sp : ffff8000802afa10
[ 371.755751] x29: ffff8000802afa10 x28: 0000000000000463 x27: 000000000000003c
[ 371.763047] x26: 00000000000000c0 x25: 0000000000000280 x24: ffffa0a144268cb4
[ 371.770341] x23: ffffa0a14439f540 x22: ffffa0a1447cf4c0 x21: ffffa0a14481a000
[ 371.777636] x20: ffff0000d987e000 x19: ffff0000d987e000 x18: ffff800080ba0088
[ 371.784930] x17: 0000000000000000 x16: 0000000000000000 x15: 0000000000000463
[ 371.792225] x14: 0000000000000000 x13: 2e656c6261696c65 x12: 726e75206562206c
[ 371.799519] x11: 6c697720706d7564 x10: 0000000000000000 x9 : 0000000000000000
[ 371.806814] x8 : 0000000000000000 x7 : 0000000000000000 x6 : 0000000000000000
[ 371.814108] x5 : 0000000000000000 x4 : 0000000000000000 x3 : 0000000000000000
[ 371.821402] x2 : 0000000000000000 x1 : 0000000000000000 x0 : 0000000000000000
[ 371.828696] Call trace:
[ 371.831189] machine_kexec+0x48/0x1f0
[ 371.834928] __crash_kexec+0x94/0x128
[ 371.838668] panic+0x380/0x440
[ 371.841784] print_other_cpu_stall+0x578/0x610
[ 371.846325] check_cpu_stall+0x240/0x300
[ 371.850331] rcu_pending+0x44/0x220
[ 371.853892] rcu_sched_clock_irq+0x7c/0x2c8
[ 371.858163] update_process_times+0x7c/0xf8
[ 371.862434] tick_sched_handle+0x3c/0x98
[ 371.866440] tick_nohz_highres_handler+0x5c/0xe8
[ 371.871156] __hrtimer_run_queues+0x164/0x398
[ 371.875605] hrtimer_interrupt+0xf4/0x278
[ 371.879700] arch_timer_handler_phys+0x38/0x80
[ 371.884240] handle_percpu_devid_irq+0x94/0x2b8
[ 371.888869] generic_handle_domain_irq+0x38/0x70
[ 371.893585] __gic_handle_irq_from_irqson.isra.0+0x180/0x310
[ 371.899368] gic_handle_irq+0x2c/0xa0
[ 371.903105] call_on_irq_stack+0x3c/0x50
[ 371.907110] do_interrupt_handler+0xb0/0xc8
[ 371.911382] el1_interrupt+0x48/0xf0
[ 371.915032] el1h_64_irq_handler+0x1c/0x40
[ 371.919215] el1h_64_irq+0x7c/0x80
[ 371.922686] cpuidle_enter_state+0xd8/0x790
[ 371.926958] cpuidle_enter+0x44/0x78
[ 371.930609] cpuidle_idle_call+0x15c/0x210
[ 371.934793] do_idle+0xb0/0x130
[ 371.937998] cpu_startup_entry+0x44/0x50
[ 371.942003] secondary_start_kernel+0xec/0x130
[ 371.946542] __secondary_switched+0xc0/0xc8
[ 371.950815] ---[ end trace 0000000000000000 ]---

In an attempt to get more debug info, I tried the open driver from GitHub, using https://github.com/NVIDIA/open-gpu-kernel-modules:

Version 550.76 - loads successfully
Version 550.67 - loads successfully
Version 550.54.15 - crashes - which is the same version as the 550 package that hangs.

Below is the crash info. What is interesting is that, in an attempt to capture more debug info, I changed the optimization level in utils.mk from -O2 to -O0 and the crash went away. It also doesn't happen with -O1.

CRASH INFO

[ 8648.399518] nvidia-nvlink: Nvlink Core is being initialized, major device number 506
[ 8648.399560]
[ 8648.399718] Internal error: Oops - FPAC: 0000000072000000 [#1] SMP
[ 8648.407556] Modules linked in: nvidia(OE+) ecdh_generic ecc qrtr cfg80211 binfmt_misc dax_hmem cxl_acpi cxl_core nvidia_cspmu arm_smmuv3_pmu arm_cspmu_module coresight_trbe arm_spe_pmu acpi_ipmi ast cdc_ether cdc_subset ipmi_ssif usbnet i2c_algo_bit uio_pdrv_genirq uio spi_nor ipmi_devintf ipmi_msghandler nls_iso8859_1 stm_p_basic coresight_stm stm_core cppc_cpufreq coresight_etm4x coresight_tmc coresight_funnel acpi_power_meter coresight dm_multipath efi_pstore nfnetlink dmi_sysfs ip_tables x_tables autofs4 btrfs blake2b_generic raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor xor_neon raid6_pq libcrc32c raid1 raid0 mlx5_ib ib_uverbs macsec ib_core mlx5_dpll mlx5_core crct10dif_ce polyval_ce polyval_generic ghash_ce sm4_ce_gcm sm4_ce_ccm sm4_ce sm4_ce_cipher sm4 sm3_ce sm3 mlxfw i2c_smbus nvme psample sha3_ce sha2_ce nvme_core tls sha256_arm64 xhci_pci sha1_ce xhci_pci_renesas pci_hyperv_intf nvme_auth i2c_tegra aes_neon_bs aes_neon_blk aes_ce_blk aes_ce_cipher [last unloaded: nvidia(OE)]
[ 8648.407608]
[ 8648.501397] CPU: 5 PID: 48130 Comm: insmod Kdump: loaded Tainted: G OE 6.8.0-1004-nvidia-64k #4
[ 8648.511625] Hardware name: /P3880, BIOS 01.02.01 20240207
[ 8648.517941] pstate: 63400009 (nZCv daif +PAN -UAO +TCO +DIT -SSBS BTYPE=--)
[ 8648.525058] pc : __kmalloc+0x1e0/0x490
[ 8648.528892] lr : 0xffffa00000000000
[ 8648.532482] sp : ffff8000d132f5f0
[ 8648.535864] x29: ffff8000d132f5f0 x28: 0000000000000000 x27: ffffa00084d50484
[ 8648.543159] x26: 00000000000001f8 x25: 0000000000aa1d70 x24: ffff0000c2aba828
[ 8648.550454] x23: ffffa00085026380 x22: ffff80009d3e0020 x21: ffff8000d132f7c8
[ 8648.557749] x20: 0000000000000038 x19: ffff8000d132f628 x18: ffff8000d132f5e4
[ 8648.565043] x17: 0000000000000000 x16: 0000000000000000 x15: 0000000000000004
[ 8648.572337] x14: 0000000000000000 x13: 0000000000000000 x12: 0000000000000000
[ 8648.579632] x11: 0000000000000000 x10: ffff8000d132f670 x9 : ffffa000806f73ec
[ 8648.586926] x8 : ffff0000c2a98240 x7 : 0000000000000000 x6 : 0000000000000000
[ 8648.594221] x5 : 0000000000000000 x4 : 0000000000000000 x3 : 0000000000000000
[ 8648.601516] x2 : 0000000000000000 x1 : ffff000100084480 x0 : ffff0000c2a98200
[ 8648.608810] Call trace:
[ 8648.611305] __kmalloc+0x1e0/0x490
[ 8648.614778] 0x8000604466e4a000
[ 8648.617986] Code: a9435bf5 a94463f7 910183ff f85f8e5e (d50323bf)
[ 8648.624219] SMP: stopping secondary CPUs
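
A note on reading the "Oops - FPAC: 0000000072000000" line in the crash info: the hex value appears to be the exception syndrome (ESR) for the fault. The sketch below splits such a value into its fields, assuming the standard arm64 ESR layout (EC in bits 31:26, IL in bit 25, ISS in bits 24:0). The function name `decode_esr` and the small exception-class table are illustrative only and not part of any tool used in this report; the table is deliberately incomplete.

```python
# Minimal sketch: split an arm64 ESR value into EC / IL / ISS fields.
# Assumption: the number printed after "Oops - FPAC:" is the ESR value.

# A few exception classes, using the values from the kernel's
# arch/arm64/include/asm/esr.h. Incomplete on purpose.
EC_NAMES = {
    0x1C: "FPAC (pointer authentication failure)",
    0x21: "instruction abort, current EL",
    0x25: "data abort, current EL",
    0x3C: "BRK instruction (AArch64)",
}


def decode_esr(esr: int) -> dict:
    """Return the EC, IL and ISS fields of a 32-bit ESR value."""
    ec = (esr >> 26) & 0x3F    # exception class, bits 31:26
    il = (esr >> 25) & 0x1     # instruction length bit, bit 25
    iss = esr & 0x1FFFFFF      # instruction specific syndrome, bits 24:0
    return {
        "ec": ec,
        "ec_name": EC_NAMES.get(ec, "not in this table"),
        "il": il,
        "iss": iss,
    }


if __name__ == "__main__":
    fields = decode_esr(0x72000000)  # value taken from the Oops line
    print(f"EC=0x{fields['ec']:02x} ({fields['ec_name']}) "
          f"IL={fields['il']} ISS=0x{fields['iss']:x}")
```

For the value in this report, that gives exception class 0x1c, the class used for a pointer authentication failure, with an empty ISS field. This matches the "FPAC" label the kernel printed.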