Using a 6.8 kernel 'modprobe nvidia' hangs on Quanta Grace Hopper

Bug #2062380 reported by Ian May
This bug affects 3 people
Affects                                      Status     Importance  Assigned to        Milestone
nvidia-graphics-drivers-535-server (Ubuntu)  Confirmed  Undecided   Mitchell Augustin
nvidia-graphics-drivers-550-server (Ubuntu)  Confirmed  Undecided   Mitchell Augustin

Bug Description

With both the -generic and -nvidia 6.8 kernels, I'm seeing a hang when I load the nvidia driver.

$ sudo dmidecode -t 0
# dmidecode 3.5
Getting SMBIOS data from sysfs.
SMBIOS 3.6.0 present.
# SMBIOS implementations newer than version 3.5.0 are not
# fully supported by this version of dmidecode.

Handle 0x0001, DMI type 0, 26 bytes
BIOS Information
 Vendor: NVIDIA
 Version: 01.02.01
 Release Date: 20240207
 ROM Size: 64 MB
 Characteristics:
  PCI is supported
  PNP is supported
  BIOS is upgradeable
  BIOS shadowing is allowed
  Boot from CD is supported
  Selectable boot is supported
  Serial services are supported (int 14h)
  ACPI is supported
  Targeted content distribution is supported
  UEFI is supported
 Firmware Revision: 0.0

CONSOLE RCU STALL MESSAGE:
[ 382.938326] rcu: INFO: rcu_preempt detected stalls on CPUs/tasks:
[ 382.946075] rcu: 53-...0: (4 ticks this GP) idle=1c2c/1/0x4000000000000000 softirq=4866/4868 fqs=14124
[ 382.955683] rcu: hardirqs softirqs csw/system
[ 382.961378] rcu: number: 0 0 0
[ 382.967071] rcu: cputime: 0 0 0 ==> 30026(ms)
[ 382.974189] rcu: (detected by 52, t=60034 jiffies, g=24469, q=1199 ncpus=72)
[ 392.982095] rcu: rcu_preempt kthread starved for 9994 jiffies! g24469 f0x0 RCU_GP_DOING_FQS(6) ->state=0x0 ->cpu=31
[ 392.992769] rcu: Unless rcu_preempt kthread gets sufficient CPU time, OOM is now expected behavior

After seeing this, I enabled kdump and set kernel.panic_on_rcu_stall = 1.
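
The sysctl step can be sketched in Python as follows. This is only a minimal sketch, not the procedure actually used in this report: the helper names are hypothetical, it assumes the usual mapping of a dotted sysctl name onto a path under /proc/sys, and writing to the real /proc/sys requires root.

```python
from pathlib import Path


def sysctl_path(name: str, root: str = "/proc/sys") -> Path:
    """Map a dotted sysctl name (e.g. kernel.panic_on_rcu_stall) to its procfs path."""
    return Path(root).joinpath(*name.split("."))


def read_sysctl(name: str, root: str = "/proc/sys") -> str:
    """Return the current value of a sysctl as a stripped string."""
    return sysctl_path(name, root).read_text().strip()


def write_sysctl(name: str, value, root: str = "/proc/sys") -> None:
    """Write a new value to a sysctl (needs root when root is /proc/sys)."""
    sysctl_path(name, root).write_text(f"{value}\n")
```

As root, write_sysctl("kernel.panic_on_rcu_stall", 1) would make the next RCU stall panic the machine so that kdump can capture a dump. The setting does not persist across reboots.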

KDUMP INFO:
WARNING: cpu 54: cannot find NT_PRSTATUS note
      KERNEL: /usr/lib/debug/boot/vmlinux-6.8.0-1004-nvidia-64k [TAINTED]
    DUMPFILE: /var/crash/202404172139/dump.202404172139 [PARTIAL DUMP]
        CPUS: 72
        DATE: Wed Apr 17 21:39:13 UTC 2024
      UPTIME: 00:06:10
LOAD AVERAGE: 0.68, 0.63, 0.28
       TASKS: 854
    NODENAME: hinyari
     RELEASE: 6.8.0-1005-nvidia-64k
     VERSION: #5-Ubuntu SMP PREEMPT_DYNAMIC Wed Apr 17 11:26:46 UTC 2024
     MACHINE: aarch64 (unknown Mhz)
      MEMORY: 479.7 GB
       PANIC: "Kernel panic - not syncing: RCU Stall"
         PID: 0
     COMMAND: "swapper/21"
        TASK: ffff000082026880 (1 of 72) [THREAD_INFO: ffff000082026880]
         CPU: 21
       STATE: TASK_RUNNING (PANIC)

[ 300.313144] nvidia: loading out-of-tree module taints kernel.
[ 300.313153] nvidia: module verification failed: signature and/or required key missing - tainting kernel
[ 300.316694] nvidia-nvlink: Nvlink Core is being initialized, major device number 506
[ 300.316699]
[ 360.323454] rcu: INFO: rcu_preempt detected stalls on CPUs/tasks:
[ 360.331206] rcu: 54-...0: (24 ticks this GP) idle=742c/1/0x4000000000000000 softirq=4931/4933 fqs=13148
[ 360.340903] rcu: hardirqs softirqs csw/system
[ 360.346597] rcu: number: 0 0 0
[ 360.352291] rcu: cputime: 0 0 0 ==> 30031(ms)
[ 360.359408] rcu: (detected by 21, t=60038 jiffies, g=25009, q=1123 ncpus=72)
[ 360.366704] Sending NMI from CPU 21 to CPUs 54:
[ 370.367310] rcu: rcu_preempt kthread starved for 9993 jiffies! g25009 f0x0 RCU_GP_DOING_FQS(6) ->state=0x0 ->cpu=31
[ 370.377983] rcu: Unless rcu_preempt kthread gets sufficient CPU time, OOM is now expected behavior.
[ 370.387322] rcu: RCU grace-period kthread stack dump:
[ 370.392482] task:rcu_preempt state:I stack:0 pid:17 tgid:17 ppid:2 flags:0x00000008
[ 370.392488] Call trace:
[ 370.392489] __switch_to+0xd0/0x118
[ 370.392499] __schedule+0x2a8/0x7b0
[ 370.392501] schedule+0x40/0x168
[ 370.392502] schedule_timeout+0xac/0x1e0
[ 370.392505] rcu_gp_fqs_loop+0x128/0x508
[ 370.392512] rcu_gp_kthread+0x150/0x188
[ 370.392514] kthread+0xf8/0x110
[ 370.392519] ret_from_fork+0x10/0x20
[ 370.392524] rcu: Stack dump where RCU GP kthread last ran:
[ 370.398128] Sending NMI from CPU 21 to CPUs 31:
[ 370.398131] NMI backtrace for cpu 31
[ 370.398136] CPU: 31 PID: 0 Comm: swapper/31 Kdump: loaded Tainted: G OE 6.8.0-1005-nvidia-64k #5-Ubuntu
[ 370.398139] Hardware name: /P3880, BIOS 01.02.01 20240207
[ 370.398140] pstate: 63400009 (nZCv daif +PAN -UAO +TCO +DIT -SSBS BTYPE=--)
[ 370.398142] pc : cpuidle_enter_state+0xd8/0x790
[ 370.398150] lr : cpuidle_enter_state+0xcc/0x790
[ 370.398153] sp : ffff800081eefd70
[ 370.398154] x29: ffff800081eefd70 x28: 0000000000000000 x27: 0000000000000000
[ 370.398157] x26: 0000000000000000 x25: 000000563d67e4e0 x24: 0000000000000000
[ 370.398160] x23: ffffa0a1445699f8 x22: 0000000000000000 x21: 000000563d72ece0
[ 370.398162] x20: ffffa0a144569a10 x19: ffff00008fa4a800 x18: ffff800081f00030
[ 370.398165] x17: 0000000000000000 x16: 0000000000000000 x15: 0000ac8c73b08db0
[ 370.398168] x14: 0000000000000000 x13: 0000000000000000 x12: 0000000000000000
[ 370.398170] x11: 0000000000000000 x10: 2da0fbe3d5e8c649 x9 : ffffa0a1424fd244
[ 370.398173] x8 : ffff0000820559b8 x7 : 0000000000000000 x6 : 0000000000000000
[ 370.398175] x5 : 0000000000000000 x4 : 0000000000000000 x3 : 0000000000000000
[ 370.398178] x2 : 0000000000000000 x1 : 0000000000000000 x0 : 0000000000000000
[ 370.398181] Call trace:
[ 370.398183] cpuidle_enter_state+0xd8/0x790
[ 370.398185] cpuidle_enter+0x44/0x78
[ 370.398195] cpuidle_idle_call+0x15c/0x210
[ 370.398202] do_idle+0xb0/0x130
[ 370.398204] cpu_startup_entry+0x40/0x50
[ 370.398206] secondary_start_kernel+0xec/0x130
[ 370.398211] __secondary_switched+0xc0/0xc8
[ 370.399132] Kernel panic - not syncing: RCU Stall
[ 370.403938] CPU: 21 PID: 0 Comm: swapper/21 Kdump: loaded Tainted: G OE 6.8.0-1005-nvidia-64k #5-Ubuntu
[ 370.414876] Hardware name: /P3880, BIOS 01.02.01 20240207
[ 370.421192] Call trace:
[ 370.423686] dump_backtrace+0xa4/0x150
[ 370.427514] show_stack+0x24/0x50
[ 370.430896] dump_stack_lvl+0x78/0xf8
[ 370.434640] dump_stack+0x1c/0x38
[ 370.438023] panic+0x3a4/0x440
[ 370.441141] print_other_cpu_stall+0x578/0x610
[ 370.445681] check_cpu_stall+0x240/0x300
[ 370.449686] rcu_pending+0x44/0x220
[ 370.453248] rcu_sched_clock_irq+0x7c/0x2c8
[ 370.457519] update_process_times+0x7c/0xf8
[ 370.461794] tick_sched_handle+0x3c/0x98
[ 370.465803] tick_nohz_highres_handler+0x5c/0xe8
[ 370.470520] __hrtimer_run_queues+0x164/0x398
[ 370.474969] hrtimer_interrupt+0xf4/0x278
[ 370.479063] arch_timer_handler_phys+0x38/0x80
[ 370.483607] handle_percpu_devid_irq+0x94/0x2b8
[ 370.488238] generic_handle_domain_irq+0x38/0x70
[ 370.492954] __gic_handle_irq_from_irqson.isra.0+0x180/0x310
[ 370.498743] gic_handle_irq+0x2c/0xa0
[ 370.502481] call_on_irq_stack+0x3c/0x50
[ 370.506486] do_interrupt_handler+0xb0/0xc8
[ 370.510759] el1_interrupt+0x48/0xf0
[ 370.514409] el1h_64_irq_handler+0x1c/0x40
[ 370.518592] el1h_64_irq+0x7c/0x80
[ 370.522063] cpuidle_enter_state+0xd8/0x790
[ 370.526336] cpuidle_enter+0x44/0x78
[ 370.529986] cpuidle_idle_call+0x15c/0x210
[ 370.534169] do_idle+0xb0/0x130
[ 370.537375] cpu_startup_entry+0x44/0x50
[ 370.541380] secondary_start_kernel+0xec/0x130
[ 370.545919] __secondary_switched+0xc0/0xc8
[ 370.550197] SMP: stopping secondary CPUs
[ 371.601076] SMP: failed to stop secondary CPUs 0-20,22-71
[ 371.607097] Starting crashdump kernel...
[ 371.611103] ------------[ cut here ]------------
[ 371.615820] Some CPUs may be stale, kdump will be unreliable.
[ 371.621695] WARNING: CPU: 21 PID: 0 at arch/arm64/kernel/machine_kexec.c:174 machine_kexec+0x48/0x1f0
[ 371.631124] Modules linked in: nvidia(OE+) ecc qrtr cfg80211 binfmt_misc dax_hmem cxl_acpi cxl_core nvidia_cspmu acpi_ipmi ast cdc_ether cdc_subset arm_smmuv3_pmu arm_cspmu_module coresight_trbe usbnet arm_spe_pmu ipmi_ssif i2c_algo_bit uio_pdrv_genirq uio spi_nor ipmi_devintf ipmi_msghandler nls_iso8859_1 stm_p_basic coresight_stm stm_core cppc_cpufreq coresight_etm4x coresight_tmc coresight_funnel acpi_power_meter coresight dm_multipath efi_pstore nfnetlink dmi_sysfs ip_tables x_tables autofs4 btrfs blake2b_generic raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor xor_neon raid6_pq libcrc32c raid1 raid0 mlx5_ib ib_uverbs macsec ib_core mlx5_dpll crct10dif_ce mlx5_core polyval_ce polyval_generic ghash_ce sm4_ce_gcm sm4_ce_ccm sm4_ce sm4_ce_cipher sm4 sm3_ce mlxfw sm3 nvme psample sha3_ce i2c_smbus sha2_ce nvme_core tls sha256_arm64 xhci_pci sha1_ce xhci_pci_renesas pci_hyperv_intf nvme_auth i2c_tegra aes_neon_bs aes_neon_blk aes_ce_blk aes_ce_cipher
[ 371.719810] CPU: 21 PID: 0 Comm: swapper/21 Kdump: loaded Tainted: G OE 6.8.0-1005-nvidia-64k #5-Ubuntu
[ 371.730748] Hardware name: /P3880, BIOS 01.02.01 20240207
[ 371.737064] pstate: 634000c9 (nZCv daIF +PAN -UAO +TCO +DIT -SSBS BTYPE=--)
[ 371.744180] pc : machine_kexec+0x48/0x1f0
[ 371.748275] lr : machine_kexec+0x48/0x1f0
[ 371.752369] sp : ffff8000802afa10
[ 371.755751] x29: ffff8000802afa10 x28: 0000000000000463 x27: 000000000000003c
[ 371.763047] x26: 00000000000000c0 x25: 0000000000000280 x24: ffffa0a144268cb4
[ 371.770341] x23: ffffa0a14439f540 x22: ffffa0a1447cf4c0 x21: ffffa0a14481a000
[ 371.777636] x20: ffff0000d987e000 x19: ffff0000d987e000 x18: ffff800080ba0088
[ 371.784930] x17: 0000000000000000 x16: 0000000000000000 x15: 0000000000000463
[ 371.792225] x14: 0000000000000000 x13: 2e656c6261696c65 x12: 726e75206562206c
[ 371.799519] x11: 6c697720706d7564 x10: 0000000000000000 x9 : 0000000000000000
[ 371.806814] x8 : 0000000000000000 x7 : 0000000000000000 x6 : 0000000000000000
[ 371.814108] x5 : 0000000000000000 x4 : 0000000000000000 x3 : 0000000000000000
[ 371.821402] x2 : 0000000000000000 x1 : 0000000000000000 x0 : 0000000000000000
[ 371.828696] Call trace:
[ 371.831189] machine_kexec+0x48/0x1f0
[ 371.834928] __crash_kexec+0x94/0x128
[ 371.838668] panic+0x380/0x440
[ 371.841784] print_other_cpu_stall+0x578/0x610
[ 371.846325] check_cpu_stall+0x240/0x300
[ 371.850331] rcu_pending+0x44/0x220
[ 371.853892] rcu_sched_clock_irq+0x7c/0x2c8
[ 371.858163] update_process_times+0x7c/0xf8
[ 371.862434] tick_sched_handle+0x3c/0x98
[ 371.866440] tick_nohz_highres_handler+0x5c/0xe8
[ 371.871156] __hrtimer_run_queues+0x164/0x398
[ 371.875605] hrtimer_interrupt+0xf4/0x278
[ 371.879700] arch_timer_handler_phys+0x38/0x80
[ 371.884240] handle_percpu_devid_irq+0x94/0x2b8
[ 371.888869] generic_handle_domain_irq+0x38/0x70
[ 371.893585] __gic_handle_irq_from_irqson.isra.0+0x180/0x310
[ 371.899368] gic_handle_irq+0x2c/0xa0
[ 371.903105] call_on_irq_stack+0x3c/0x50
[ 371.907110] do_interrupt_handler+0xb0/0xc8
[ 371.911382] el1_interrupt+0x48/0xf0
[ 371.915032] el1h_64_irq_handler+0x1c/0x40
[ 371.919215] el1h_64_irq+0x7c/0x80
[ 371.922686] cpuidle_enter_state+0xd8/0x790
[ 371.926958] cpuidle_enter+0x44/0x78
[ 371.930609] cpuidle_idle_call+0x15c/0x210
[ 371.934793] do_idle+0xb0/0x130
[ 371.937998] cpu_startup_entry+0x44/0x50
[ 371.942003] secondary_start_kernel+0xec/0x130
[ 371.946542] __secondary_switched+0xc0/0xc8
[ 371.950815] ---[ end trace 0000000000000000 ]---

To get more debug info, I tried the open driver from GitHub (https://github.com/NVIDIA/open-gpu-kernel-modules):
Version 550.76 - loads successfully
Version 550.67 - loads successfully
Version 550.54.15 - crashes; this is the same version as the 550 package that hangs. The crash info is below. Interestingly, when I changed the optimization flag in utils.mk from -O2 to -O0 to capture more debug info, the crash went away. It also doesn't happen with -O1.

CRASH INFO
[ 8648.399518] nvidia-nvlink: Nvlink Core is being initialized, major device number 506
[ 8648.399560]
[ 8648.399718] Internal error: Oops - FPAC: 0000000072000000 [#1] SMP
[ 8648.407556] Modules linked in: nvidia(OE+) ecdh_generic ecc qrtr cfg80211 binfmt_misc dax_hmem cxl_acpi cxl_core nvidia_cspmu arm_smmuv3_pmu arm_cspmu_module coresight_trbe arm_spe_pmu acpi_ipmi ast cdc_ether cdc_subset ipmi_ssif usbnet i2c_algo_bit uio_pdrv_genirq uio spi_nor ipmi_devintf ipmi_msghandler nls_iso8859_1 stm_p_basic coresight_stm stm_core cppc_cpufreq coresight_etm4x coresight_tmc coresight_funnel acpi_power_meter coresight dm_multipath efi_pstore nfnetlink dmi_sysfs ip_tables x_tables autofs4 btrfs blake2b_generic raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor xor_neon raid6_pq libcrc32c raid1 raid0 mlx5_ib ib_uverbs macsec ib_core mlx5_dpll mlx5_core crct10dif_ce polyval_ce polyval_generic ghash_ce sm4_ce_gcm sm4_ce_ccm sm4_ce sm4_ce_cipher sm4 sm3_ce sm3 mlxfw i2c_smbus nvme psample sha3_ce sha2_ce nvme_core tls sha256_arm64 xhci_pci sha1_ce xhci_pci_renesas pci_hyperv_intf nvme_auth i2c_tegra aes_neon_bs aes_neon_blk aes_ce_blk aes_ce_cipher [last unloaded: nvidia(OE)]
[ 8648.407608]
[ 8648.501397] CPU: 5 PID: 48130 Comm: insmod Kdump: loaded Tainted: G OE 6.8.0-1004-nvidia-64k #4
[ 8648.511625] Hardware name: /P3880, BIOS 01.02.01 20240207
[ 8648.517941] pstate: 63400009 (nZCv daif +PAN -UAO +TCO +DIT -SSBS BTYPE=--)
[ 8648.525058] pc : __kmalloc+0x1e0/0x490
[ 8648.528892] lr : 0xffffa00000000000
[ 8648.532482] sp : ffff8000d132f5f0
[ 8648.535864] x29: ffff8000d132f5f0 x28: 0000000000000000 x27: ffffa00084d50484
[ 8648.543159] x26: 00000000000001f8 x25: 0000000000aa1d70 x24: ffff0000c2aba828
[ 8648.550454] x23: ffffa00085026380 x22: ffff80009d3e0020 x21: ffff8000d132f7c8
[ 8648.557749] x20: 0000000000000038 x19: ffff8000d132f628 x18: ffff8000d132f5e4
[ 8648.565043] x17: 0000000000000000 x16: 0000000000000000 x15: 0000000000000004
[ 8648.572337] x14: 0000000000000000 x13: 0000000000000000 x12: 0000000000000000
[ 8648.579632] x11: 0000000000000000 x10: ffff8000d132f670 x9 : ffffa000806f73ec
[ 8648.586926] x8 : ffff0000c2a98240 x7 : 0000000000000000 x6 : 0000000000000000
[ 8648.594221] x5 : 0000000000000000 x4 : 0000000000000000 x3 : 0000000000000000
[ 8648.601516] x2 : 0000000000000000 x1 : ffff000100084480 x0 : ffff0000c2a98200
[ 8648.608810] Call trace:
[ 8648.611305] __kmalloc+0x1e0/0x490
[ 8648.614778] 0x8000604466e4a000
[ 8648.617986] Code: a9435bf5 a94463f7 910183ff f85f8e5e (d50323bf)
[ 8648.624219] SMP: stopping secondary CPUs

Ian May (ian-may)
summary: - Using a 6.8 kernel modprobe nvidia hangs on Grace Hopper
+ Using a 6.8 kernel 'modprobe nvidia' hangs on Quanta Grace Hopper
Changed in nvidia-graphics-drivers-535-server (Ubuntu):
status: New → Confirmed
Changed in nvidia-graphics-drivers-550-server (Ubuntu):
status: New → Confirmed
Ian May (ian-may)
description: updated
Ian May (ian-may) wrote :

This issue looks to be related to kernel configuration. I tested upstream stable 6.8.1, which is what the noble kernel currently being tested is rebased on. Built with 'make defconfig', the kernel loads the nvidia module successfully. Built from the same source with the noble config, it hits the same hang as the noble kernel.

I'm currently working through config comparison and testing changes.
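
A config comparison of this kind can be sketched as below. This is a hedged illustration rather than the tooling used in this report: the function names are hypothetical, and it assumes both inputs are ordinary kernel .config files in which disabled options appear as "# CONFIG_FOO is not set". The kernel tree also ships scripts/diffconfig for the same purpose.

```python
import re

# "CONFIG_FOO=value" lines and "# CONFIG_FOO is not set" lines
_SET = re.compile(r"^(CONFIG_\w+)=(.*)$")
_UNSET = re.compile(r"^# (CONFIG_\w+) is not set$")


def parse_config(text: str) -> dict:
    """Parse .config text into {option: value}; unset options map to 'n'."""
    opts = {}
    for raw in text.splitlines():
        line = raw.strip()
        m = _SET.match(line)
        if m:
            opts[m.group(1)] = m.group(2)
            continue
        m = _UNSET.match(line)
        if m:
            opts[m.group(1)] = "n"
    return opts


def diff_configs(a: dict, b: dict) -> dict:
    """Return {option: (value_in_a, value_in_b)} for every option that differs.

    An option missing from one side is treated as 'n'.
    """
    changed = {}
    for key in sorted(set(a) | set(b)):
        va, vb = a.get(key, "n"), b.get(key, "n")
        if va != vb:
            changed[key] = (va, vb)
    return changed
```

Reading the defconfig-generated .config and the noble config into parse_config() and passing both results to diff_configs() gives the list of candidate options to bisect.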

Changed in nvidia-graphics-drivers-535-server (Ubuntu):
assignee: nobody → Mitchell Augustin (mitchellaugustin)
Changed in nvidia-graphics-drivers-550-server (Ubuntu):
assignee: nobody → Mitchell Augustin (mitchellaugustin)
Mitchell Augustin (mitchellaugustin) wrote :

It looks like this is the relevant difference. The upstream stable 6.8.1 defconfig kernel, which can load the Nvidia driver, has the following setting, while the 6.8.0-31-generic config enables the option:

CONFIG_SHADOW_CALL_STACK=n

I suspect the kernel team will not want to disable kernel support for the GCC shadow call stack to fix this bug, so we'll probably need to explore other potential fixes.

Mitchell Augustin (mitchellaugustin) wrote :

To determine whether core count had any effect on this bug, I set maxcpus to 4 and tried loading the driver on the kernel with the shadow stack enabled (i.e. the standard -generic config). The same root issue appears to have occurred, but this time I got a panic with a trace that corroborates the claim that this is related to the shadow stack:

[ 391.736417] Internal error: Oops - FPAC: 0000000072000000 [#1] SMP
[ 391.744257] Modules linked in: nvidia(OE+) ecdh_generic ecc qrtr cdc_ether cdc_subset usbnet cfg80211 binfmt_misc dax_hmem cxl_acpi cxl_core ast i2c_algo_bit nvidia_cspmu arm_spe_pmu arm_smmuv3_pmu arm_cspmu_module uio_pdrv_genirq uio spi_nor acpi_ipmi mtd nls_iso8859_1 ipmi_ssif ipmi_devintf cppc_cpufreq ipmi_msghandler acpi_power_meter dm_multipath efi_pstore nfnetlink dmi_sysfs ip_tables x_tables autofs4 btrfs blake2b_generic raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor xor_neon raid6_pq libcrc32c raid1 raid0 mlx5_ib ib_uverbs macsec ib_core mlx5_dpll i2c_smbus crct10dif_ce polyval_ce polyval_generic ghash_ce sm4_ce_gcm sm4_ce_ccm sm4_ce sm4_ce_cipher sm4 sm3_ce sm3 nvme sha3_ce sha2_ce sha256_arm64 sha1_ce mlx5_core nvme_core mlxfw nvme_auth psample xhci_pci tls xhci_pci_renesas pci_hyperv_intf spi_tegra210_quad i2c_tegra aes_neon_bs aes_neon_blk aes_ce_blk aes_ce_cipher
[ 391.826552] CPU: 0 PID: 14412 Comm: insmod Tainted: G OE 6.8.1+ #2
[ 391.834202] Hardware name: /, BIOS 01.02.01 20240207
[ 391.840074] pstate: 63400009 (nZCv daif +PAN -UAO +TCO +DIT -SSBS BTYPE=--)
[ 391.847190] pc : __kmalloc+0x1e4/0x498
[ 391.851025] lr : 0xffffc04000000000
[ 391.854605] sp : ffff8000a3ab3620
[ 391.857987] x29: ffff8000a3ab3620 x28: 0000000000000001 x27: 0000000000000001
[ 391.865282] x26: 00000000000001f8 x25: 0000000000aa1d70 x24: ffff00008feac028
[ 391.872577] x23: ffffc040aab743f0 x22: ffff80008d4c5020 x21: ffff8000a3ab37f8
[ 391.879871] x20: 0000000000000038 x19: ffff8000a3ab3658 x18: ffff8000a3ab3614
[ 391.887165] x17: 0000000000000000 x16: 0000000000000000 x15: 0000000000000004
[ 391.894459] x14: 0000000000000000 x13: 0000000000000000 x12: 0000000000000000
[ 391.901753] x11: 0000000000000000 x10: ffff8000a3ab36a0 x9 : ffffc040c0af8d48
[ 391.909049] x8 : ffff00008edc3c40 x7 : 0000000000000000 x6 : 0000000000000000
[ 391.916343] x5 : 0000000000000000 x4 : 0000000000000000 x3 : 0000000000000000
[ 391.923637] x2 : 0000000000000000 x1 : ffff00008e87c480 x0 : ffff00008edc3c00
[ 391.930931] Call trace:
[ 391.933427] __kmalloc+0x1e4/0x498
[ 391.936899] 0xc0007304e5f6c040
[ 391.940107] Code: a9435bf5 a94463f7 910183ff f85f8e5e (d50323bf)
[ 391.946336] ---[ end trace 0000000000000000 ]---
[ 391.977579] Kernel panic - not syncing: corrupted shadow stack detected inside scheduler
[ 391.980605] kauditd_printk_skb: 98 callbacks suppressed
[ 391.980607] audit: type=1400 audit(1713999301.128:108): apparmor="DENIED" operation="open" class="file" profile="rsyslogd" name="/run/systemd/sessions/" pid=801 comm=72733A6D61696E20513A526567 requested_mask="r" denied_mask="r" fsuid=103 ouid=0
[ 391.980674] audit: type=1400 audit(1713999301.128:109): apparmor="DENIED" op...
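
For anyone reading the oops line in this trace: the value printed after "FPAC:" appears to be the arm64 exception syndrome (ESR) value. The sketch below splits it into its standard fields. The field layout (EC in bits 31:26, IL in bit 25, ISS in bits 24:0) and EC 0x1C for a pointer authentication failure follow the Arm architecture. The function name and the one-entry lookup table are illustrative assumptions.

```python
ESR_EC_SHIFT = 26            # exception class occupies bits 31:26
ESR_EC_MASK = 0x3F
ESR_IL_BIT = 1 << 25         # set for a 32-bit trapped instruction
ESR_ISS_MASK = (1 << 25) - 1 # instruction specific syndrome, bits 24:0

# Only the class relevant to this report; other classes show as "unknown".
EC_NAMES = {0x1C: "FPAC (pointer authentication failure)"}


def decode_esr(esr: int) -> dict:
    """Split the low 32 bits of an ESR value into EC, IL and ISS."""
    ec = (esr >> ESR_EC_SHIFT) & ESR_EC_MASK
    return {
        "ec": ec,
        "ec_name": EC_NAMES.get(ec, "unknown"),
        "il": bool(esr & ESR_IL_BIT),
        "iss": esr & ESR_ISS_MASK,
    }
```

decode_esr(0x72000000) reports exception class 0x1C with a zero ISS. The last two words of the "Code:" line appear to fit the same picture: f85f8e5e decodes as ldr x30, [x18, #-8]! (a shadow call stack pop through x18) and d50323bf as autiasp, so the fault is consistent with a return address reloaded via a corrupted x18 failing authentication.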

Mitchell Augustin (mitchellaugustin) wrote :

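Compiling the affected versions of the Nvidia driver with -ffixed-x18 is also sufficient to prevent this hang/panic. The flag stops the compiler from allocating x18, the register that an arm64 kernel built with the shadow call stack reserves for the shadow stack pointer. Patch against: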

https://github.com/NVIDIA/open-gpu-kernel-modules

diff --git a/src/nvidia-modeset/Makefile b/src/nvidia-modeset/Makefile
index 66edbf4e..d49a3bfb 100644
--- a/src/nvidia-modeset/Makefile
+++ b/src/nvidia-modeset/Makefile
@@ -95,6 +95,7 @@ endif
 ifeq ($(TARGET_ARCH),aarch64)
   CFLAGS += -mgeneral-regs-only
   CFLAGS += -march=armv8-a
+  CFLAGS += -ffixed-x18
   CONDITIONAL_CFLAGS += $(call TEST_CC_ARG, -mno-outline-atomics)
 endif

diff --git a/src/nvidia/Makefile b/src/nvidia/Makefile
index e2f1c672..0f70514b 100644
--- a/src/nvidia/Makefile
+++ b/src/nvidia/Makefile
@@ -90,6 +90,7 @@ ifeq ($(TARGET_ARCH),aarch64)
   CFLAGS += -mgeneral-regs-only
   CFLAGS += -march=armv8-a
   CFLAGS += -mstrict-align
+  CFLAGS += -ffixed-x18
   CONDITIONAL_CFLAGS += $(call TEST_CC_ARG, -mno-outline-atomics)
 endif
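
One way to check whether a given build still touches x18 is to scan its disassembly. The sketch below is a hedged illustration, not part of the patch: the function name is hypothetical, and it assumes tab-separated text as produced by an aarch64 `objdump -d` (address, encoding, mnemonic, operands).

```python
import re

# x18 or w18 as a whole register token (does not match x180 or foo_x18)
_X18 = re.compile(r"\b[xw]18\b")


def find_x18_uses(disassembly: str) -> list:
    """Return (line_number, instruction) pairs whose instruction text names x18/w18."""
    hits = []
    for lineno, line in enumerate(disassembly.splitlines(), 1):
        # Instruction lines look like "  addr:\tencoding \tmnemonic\toperands"
        parts = line.split("\t")
        if len(parts) < 3:
            continue  # section headers, labels, blank lines
        insn = " ".join(p.strip() for p in parts[2:])
        if _X18.search(insn):
            hits.append((lineno, insn))
    return hits
```

Hits need interpretation. Code compiled by the kernel build with the shadow call stack enabled legitimately pushes and pops return addresses through x18, so the interesting hits are those where x18 is used as an ordinary scratch register in the driver's own objects.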
