arm64 AWS host hangs during modprobe nvidia on lunar and mantic

Bug #2029934 reported by Francis Ginther
This bug affects 2 people
Affects                                      Status      Importance  Assigned to  Milestone
linux-aws (Ubuntu)                           Incomplete  Undecided   Unassigned
linux-hwe-6.5 (Ubuntu)                       New         Undecided   Unassigned
nvidia-graphics-drivers-525 (Ubuntu)         Incomplete  Undecided   Unassigned
nvidia-graphics-drivers-525-server (Ubuntu)  Incomplete  Undecided   Unassigned
nvidia-graphics-drivers-535 (Ubuntu)         Confirmed   Undecided   Unassigned
nvidia-graphics-drivers-535-server (Ubuntu)  Confirmed   Undecided   Unassigned

Bug Description

Loading the nvidia driver DKMS modules with "modprobe nvidia" results in the host hanging and becoming completely unusable. This was reproduced with both the linux generic and linux-aws kernels on lunar and mantic using an AWS g5g.xlarge instance.

To reproduce using the generic kernel:
# Deploy an arm64 host with an Nvidia GPU, such as an AWS g5g.xlarge.

# Install the linux generic kernel from lunar-updates:
$ sudo DEBIAN_FRONTEND=noninteractive apt-get install -y -o DPkg::Options::=--force-confold linux-generic

# Boot into the linux-generic kernel (this can be accomplished by removing the existing kernel; in this case it was the linux-aws 6.2.0-1008-aws kernel)
$ sudo DEBIAN_FRONTEND=noninteractive apt-get purge -y -o DPkg::Options::=--force-confold linux-aws linux-aws-headers-6.2.0-1008 linux-headers-6.2.0-1008-aws linux-headers-aws linux-image-6.2.0-1008-aws linux-image-aws linux-modules-6.2.0-1008-aws
$ reboot
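
# After the reboot, a hedged sanity check (not part of the original report) to confirm the generic kernel is the one running:
$ uname -r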

# Install the Nvidia 535-server driver DKMS package:
$ sudo DEBIAN_FRONTEND=noninteractive apt-get install -y nvidia-driver-535-server
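
# A hedged check (not in the original steps): confirm the DKMS build completed for the running kernel before loading anything:
$ dkms status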

# Load the driver module
$ sudo modprobe nvidia

# At this point the system hangs and never returns.
# Rebooting instead of running modprobe results in a system that never boots up all the way. I was able to recover the console logs from such a system and found (the full captured log is attached):

[ 1.964942] nvidia: loading out-of-tree module taints kernel.
[ 1.965475] nvidia: module license 'NVIDIA' taints kernel.
[ 1.965905] Disabling lock debugging due to kernel taint
[ 1.980905] nvidia: module verification failed: signature and/or required key missing - tainting kernel
[ 2.012067] nvidia-nvlink: Nvlink Core is being initialized, major device number 510
[ 2.012715]
[ 62.025143] rcu: INFO: rcu_preempt detected stalls on CPUs/tasks:
[ 62.025807] rcu: 3-...0: (14 ticks this GP) idle=c04c/1/0x4000000000000000 softirq=653/654 fqs=3301
[ 62.026516] (detected by 0, t=15003 jiffies, g=-699, q=216 ncpus=4)
[ 62.027018] Task dump for CPU 3:
[ 62.027290] task:systemd-udevd state:R running task stack:0 pid:164 ppid:144 flags:0x0000000e
[ 62.028066] Call trace:
[ 62.028273] __switch_to+0xbc/0x100
[ 62.028567] 0x228
Timed out for waiting the udev queue being empty.
Timed out for waiting the udev queue being empty.
[ 242.045143] rcu: INFO: rcu_preempt detected stalls on CPUs/tasks:
[ 242.045655] rcu: 3-...0: (14 ticks this GP) idle=c04c/1/0x4000000000000000 softirq=653/654 fqs=12303
[ 242.046373] (detected by 1, t=60008 jiffies, g=-699, q=937 ncpus=4)
[ 242.046874] Task dump for CPU 3:
[ 242.047146] task:systemd-udevd state:R running task stack:0 pid:164 ppid:144 flags:0x0000000f
[ 242.047922] Call trace:
[ 242.048128] __switch_to+0xbc/0x100
[ 242.048417] 0x228
Timed out for waiting the udev queue being empty.
Begin: Loading essential drivers ... [ 384.001142] watchdog: BUG: soft lockup - CPU#2 stuck for 22s! [modprobe:215]
[ 384.001738] Modules linked in: nvidia(POE+) crct10dif_ce video polyval_ce polyval_generic drm_kms_helper ghash_ce syscopyarea sm4 sysfillrect sha2_ce sysimgblt sha256_arm64 sha1_ce drm nvme nvme_core ena nvme_common aes_neon_bs aes_neon_blk aes_ce_blk aes_ce_cipher
[ 384.003513] CPU: 2 PID: 215 Comm: modprobe Tainted: P OE 6.2.0-26-generic #26-Ubuntu
[ 384.004210] Hardware name: Amazon EC2 g5g.xlarge/, BIOS 1.0 11/1/2018
[ 384.004715] pstate: 80400005 (Nzcv daif +PAN -UAO -TCO -DIT -SSBS BTYPE=--)
[ 384.005259] pc : smp_call_function_many_cond+0x1b4/0x4b4
[ 384.005683] lr : smp_call_function_many_cond+0x1d0/0x4b4
[ 384.006108] sp : ffff8000089a3a70
[ 384.006381] x29: ffff8000089a3a70 x28: 0000000000000003 x27: ffff00056d1fafa0
[ 384.006954] x26: ffff00056d1d76c8 x25: ffffc87cf18bdd10 x24: 0000000000000003
[ 384.007527] x23: 0000000000000001 x22: ffff00056d1d76c8 x21: ffffc87cf18c2690
[ 384.008086] x20: ffff00056d1fafa0 x19: ffff00056d1d76c0 x18: ffff80000896d058
[ 384.008645] x17: 0000000000000000 x16: 0000000000000000 x15: 617362755f5f0073
[ 384.009209] x14: 0000000000000001 x13: 0000000000000006 x12: 4630354535323145
[ 384.009779] x11: 0101010101010101 x10: ffffb78318e9c0e0 x9 : ffffc87ceeac7da4
[ 384.010339] x8 : ffff00056d1d76f0 x7 : 0000000000000000 x6 : 0000000000000000
[ 384.010894] x5 : 0000000000000004 x4 : 0000000000000000 x3 : ffff00056d1fafa8
[ 384.011464] x2 : 0000000000000003 x1 : 0000000000000011 x0 : 0000000000000000
[ 384.012030] Call trace:
[ 384.012241] smp_call_function_many_cond+0x1b4/0x4b4
[ 384.012635] kick_all_cpus_sync+0x50/0xa0
[ 384.012961] flush_module_icache+0x64/0xd0
[ 384.013294] load_module+0x4ec/0xb54
[ 384.013588] __do_sys_finit_module+0xb0/0x150
[ 384.013944] __arm64_sys_finit_module+0x2c/0x50
[ 384.014306] invoke_syscall+0x7c/0x124
[ 384.014613] el0_svc_common.constprop.0+0x5c/0x1cc
[ 384.015000] do_el0_svc+0x38/0x60
[ 384.015280] el0_svc+0x30/0xe0
[ 384.015540] el0t_64_sync_handler+0x11c/0x150
[ 384.015896] el0t_64_sync+0x1a8/0x1ac

This same procedure impacts the 525, 525-server, 535 and 535-server drivers. It does *not* hang a similarly configured host running focal or jammy.
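
As an aside on recovering logs from a hung instance: the console output can usually be fetched out-of-band. A hedged example using the standard AWS CLI (the instance ID is a placeholder):

$ aws ec2 get-console-output --instance-id <instance-id> --latest --output text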

Revision history for this message
Daniel van Vugt (vanvugt) wrote :

Although nvidia seems to be the trigger here, the crashing code appears to be pure generic linux: arch/arm64/kernel/syscall.c

tags: added: arm64 nvidia
tags: added: lunar mantic
summary: - Host hangs during modprobe nvidia on lunar and mantic
+ arm64 AWS host hangs during modprobe nvidia on lunar and mantic
Revision history for this message
Dimitri John Ledkov (xnox) wrote :

Since then, we have had multiple glibc SRUs, kernel SRUs, and most recently a new release of 535-server.

Can I request that this be retested?

Revision history for this message
Simon Fels (morphis) wrote (last edit):

I can reproduce the same hang with the latest 535.154.05-0ubuntu0.22.04.1 on jammy with the 6.5 HWE kernel on an arm64 machine. The same happens with the -server driver, 535.154.05-0ubuntu0.22.04.1.

Reproducing is pretty simple:

1. Boot plain Ubuntu 22.04 with the HWE kernel, either preinstalled or installed manually to switch to it from the GA kernel
2. Install the NVIDIA driver via

$ sudo apt install -y nvidia-headless-535-server

or

$ sudo apt install -y nvidia-headless-535

Running either nvidia-smi (which triggers the modprobe of the nvidia kernel modules) or a `modprobe nvidia` makes the system hang entirely.

The same works fine on the 5.15 GA kernel.
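
As a recovery note (my assumption, using a standard kernel parameter rather than anything from this report): a host stuck in this state across reboots can usually be brought up far enough to remove the driver packages by appending the following to the kernel command line from the GRUB menu:

modprobe.blacklist=nvidia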

Revision history for this message
Simon Fels (morphis) wrote :

Verified that the issue does not exist with 535.154.05-0ubuntu0.22.04.1 of nvidia-utils-535-server on 6.2.0-1017-aws or 6.2.0-1018-aws of linux-aws.

Revision history for this message
Simon Fels (morphis) wrote :

Trying the same with the linux-nvidia-hwe-22.04-edge kernel from proposed (linux-image-6.5.0-1011-nvidia) and the same NVIDIA driver (535.154.05-0ubuntu0.22.04.1 of nvidia-utils-535-server), loading the kernel driver and running nvidia-smi work fine without problems.

Revision history for this message
Dimitri John Ledkov (xnox) wrote :

I am surprised that `ubuntu-drivers list` doesn't provide any drivers to install, when it really should.

To install pre-built drivers I use

$ sudo apt install linux-modules-nvidia-535-server-aws nvidia-headless-535-server

That way, the signed nvidia modules provided by Canonical are installed.

Similarly, to upgrade to the edge variant I did:

$ sudo apt install linux-aws-edge linux-modules-nvidia-535-server-aws-edge
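
A hedged way to check which module is actually in place (assuming standard kmod tooling): the pre-built module should carry Canonical's signature, whereas a DKMS build typically shows no signer (matching the "module verification failed" line in the log above).

$ modinfo nvidia | grep -E '^(filename|signer|vermagic)'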

Revision history for this message
Dimitri John Ledkov (xnox) wrote :

and everything seems to work fine.

Revision history for this message
Dimitri John Ledkov (xnox) wrote :

I wonder if the bug lies in installing self-built DKMS modules instead of pre-built ones, and why ubuntu-drivers is not offering the pre-built ones...

Revision history for this message
Launchpad Janitor (janitor) wrote :

Status changed to 'Confirmed' because the bug affects multiple users.

Changed in linux-aws (Ubuntu):
status: New → Confirmed
Changed in nvidia-graphics-drivers-525 (Ubuntu):
status: New → Confirmed
Changed in nvidia-graphics-drivers-525-server (Ubuntu):
status: New → Confirmed
Changed in nvidia-graphics-drivers-535 (Ubuntu):
status: New → Confirmed
Changed in nvidia-graphics-drivers-535-server (Ubuntu):
status: New → Confirmed
Revision history for this message
Francis Ginther (fginther) wrote :

I can reproduce the failure on mantic with both the DKMS and LRM drivers. Specifically, what I'm doing to install these is:

for DKMS:
sudo DEBIAN_FRONTEND=noninteractive apt-get install -y nvidia-driver-535-server

for LRM:
sudo DEBIAN_FRONTEND=noninteractive apt-get install -y nvidia-headless-no-dkms-535-server linux-modules-nvidia-535-server-generic nvidia-utils-535-server

I'm intentionally not using `ubuntu-drivers` to isolate this testing to just the installation and functioning of the drivers.
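
For completeness, a hedged way to tell which flavor of module is on disk (assuming standard dpkg/kmod tools): if dpkg reports a linux-modules-nvidia-* package, the LRM build is in place; if it reports no owning package, the module came from a DKMS build.

$ dpkg -S "$(modinfo -n nvidia)"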

Revision history for this message
Simon Fels (morphis) wrote :

Verified that with linux-aws-edge 6.5.0.1012.12~22.04.1 the DKMS installation via

$ sudo apt install -y nvidia-driver-535-server

on an AWS g5g.xlarge goes through and the driver comes up fine.

Trying the same with linux-generic-hwe-22.04-edge 6.5.0-17-generic #17~22.04.1 on an Ampere Altra with 2x NVIDIA L4 still runs into the same hang with nvidia-headless-535-server (535.154.05-0ubuntu0.22.04.1).

Changed in nvidia-graphics-drivers-525 (Ubuntu):
status: Confirmed → Incomplete
Changed in nvidia-graphics-drivers-525-server (Ubuntu):
status: Confirmed → Incomplete
Changed in linux-aws (Ubuntu):
status: Confirmed → Incomplete
Revision history for this message
Simon Fels (morphis) wrote :

I gave this another spin today with 6.5.0-17-generic #17~22.04.1 and the LRM modules of the 535 driver (6.5.0-17.17~22.04.1+1 of linux-modules-nvidia-535-server-generic-hwe-22.04) on our Altra system with 2x L4 GPUs, and the same problem occurs as with the DKMS modules:

[ 39.437849] watchdog: BUG: soft lockup - CPU#62 stuck for 26s! [systemd-udevd:850]
[ 39.445411] Modules linked in: nvidia(POE+) crct10dif_ce polyval_ce polyval_generic ghash_ce ast mlx5_core video drm_shmem_helper sm4 mlxfw sha2_ce drm_kms_helper nvme psample sha256_arm64 sha1_ce nvme_core igb drm tls xhci_pci nvme_common pci_hyperv_intf xhci_pci_renesas i2c_algo_bit aes_neon_bs aes_neon_blk aes_ce_blk aes_ce_cipher
[ 39.474949] CPU: 62 PID: 850 Comm: systemd-udevd Tainted: P OE 6.5.0-17-generic #17~22.04.1-Ubuntu
[ 39.485196] Hardware name: GIGABYTE G242-P30-JG/MP32-AR0-JG, BIOS F07 03/22/2021
[ 39.492578] pstate: 80400009 (Nzcv daif +PAN -UAO -TCO -DIT -SSBS BTYPE=--)
[ 39.499526] pc : smp_call_function_many_cond+0x19c/0x720
[ 39.504830] lr : smp_call_function_many_cond+0x1b8/0x720
[ 39.510130] sp : ffff80008934b920
[ 39.513431] x29: ffff80008934b920 x28: ffffaef99146dd10 x27: 0000000000000000
[ 39.520554] x26: 000000000000004f x25: ffff085dcfffbb80 x24: 0000000000000026
[ 39.527677] x23: 0000000000000001 x22: ffff085dcfdd6708 x21: ffffaef9914726e0
[ 39.534799] x20: ffff085dcfadbb80 x19: ffff085dcfdd6700 x18: ffff800089341060
[ 39.541921] x17: 0000000000000000 x16: 0000000000000000 x15: 43535f5f00656c75
[ 39.549044] x14: 0c030b111b111303 x13: 0000000000000006 x12: 3931413337353339
[ 39.556166] x11: 0101010101010101 x10: 000000000000004f x9 : ffffaef98ee015b8
[ 39.563289] x8 : 0000000000000000 x7 : 0000000000000000 x6 : 000000000000003e
[ 39.570411] x5 : ffffaef99146d000 x4 : 0000000000000000 x3 : ffff085dcfadbb88
[ 39.577533] x2 : 0000000000000026 x1 : 0000000000000011 x0 : 0000000000000000
[ 39.584656] Call trace:
[ 39.587090] smp_call_function_many_cond+0x19c/0x720
[ 39.592043] kick_all_cpus_sync+0x50/0xa8
[ 39.596040] flush_module_icache+0x94/0xf8
[ 39.600125] load_module+0x448/0x8e0
[ 39.603688] init_module_from_file+0x94/0x110
[ 39.608033] idempotent_init_module+0x194/0x2b0
[ 39.612551] __arm64_sys_finit_module+0x74/0x100
[ 39.617155] invoke_syscall+0x7c/0x130
[ 39.620892] el0_svc_common.constprop.0+0x5c/0x170
[ 39.625670] do_el0_svc+0x38/0x68
[ 39.628972] el0_svc+0x30/0xe0
[ 39.632016] el0t_64_sync_handler+0x128/0x158
[ 39.636360] el0t_64_sync+0x1b0/0x1b8

Revision history for this message
Mitchell Augustin (mitchellaugustin) wrote :

I identified a similar, possibly related bug today when installing nvidia-fabricmanager-535 on a noble dev build for arm64:

https://bugs.launchpad.net/ubuntu/+source/fabric-manager-535/+bug/2052663

Revision history for this message
Abhishek Chauhan (abchauhan) wrote :

Hi all,
This should be fixed in the latest driver, 550.67: https://www.nvidia.com/Download/driverResults.aspx/223429/en-us/
Please help verify if this is resolved on your systems. Thanks!

Revision history for this message
Abhishek Chauhan (abchauhan) wrote :

The fix is also available in 535.171.04, available here: https://www.nvidia.com/Download/driverResults.aspx/223761/en-us/
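
For anyone retesting, a hedged way to confirm which driver build is actually loaded after upgrading (assuming the module now loads cleanly):

$ cat /proc/driver/nvidia/version
$ nvidia-smi --query-gpu=driver_version --format=csv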
