Milan Delta A100 GPU fails to detect on Ubuntu 18.04 and 20.04

Bug #1915413 reported by acd
Affects: Ubuntu
Status: New
Importance: Undecided
Assigned to: Unassigned

Bug Description

An AMD Milan Delta system with an HGX A100 8-GPU board fails to detect all 8 GPUs on both Ubuntu 18.04 and 20.04, because the fabric manager cannot be enabled. With other Linux variants, such as CentOS and RHEL, all 8 GPUs are detected without problems.

From a clean Ubuntu 18.04 install on an A100 Delta board:

Output from systemctl status nvidia-fabricmanager: the process terminated due to an NVSwitch driver failure
------------
Feb 06 04:44:11 milan-delta systemd[1]: Starting NVIDIA fabric manager service...
Feb 06 04:44:12 milan-delta nv-fabricmanager[64822]: request to query NVSwitch device information from NVSw>
Feb 06 04:44:12 milan-delta systemd[1]: nvidia-fabricmanager.service: Control process exited, code=exited, >
Feb 06 04:44:12 milan-delta systemd[1]: nvidia-fabricmanager.service: Failed with result 'exit-code'.
Feb 06 04:44:12 milan-delta systemd[1]: Failed to start NVIDIA fabric manager service.
------------

Syslog output
-----------
Feb 6 04:44:14 milan-delta kernel: [ 1185.231538] NVRM: GPU 0000:85:00.0: RmInitAdapter failed! (0x23:0xffff:624)
Feb 6 04:44:14 milan-delta kernel: [ 1185.231895] NVRM: GPU 0000:85:00.0: rm_init_adapter failed, device minor number 2
Feb 6 04:44:14 milan-delta nvidia-persistenced: device 0000:85:00.0 - failed to open.
Feb 6 04:44:14 milan-delta nvidia-persistenced: device 0000:8b:00.0 - registered
-----------

dmesg output
-----------
[ 1170.435712] NVRM: This PCI I/O region assigned to your NVIDIA device is invalid:
NVRM: BAR0 is 0M @ 0x0 (PCI:0000:45:00.0)
[ 1170.435714] NVRM: The system BIOS may have misconfigured your GPU.
[ 1170.435725] nvidia: probe of 0000:45:00.0 failed with error -1

[ 1182.379923] nvidia: loading out-of-tree module taints kernel.
[ 1182.379936] nvidia: module license 'NVIDIA' taints kernel.
[ 1182.379937] Disabling lock debugging due to kernel taint
[ 1182.389651] nvidia: module verification failed: signature and/or required key missing - tainting kernel
[ 1182.406795] nvidia-nvlink: Nvlink Core is being initialized, major device number 235
[ 1182.406939] nvidia-nvswitch: Probing device 0000:d4:00.0, Vendor Id = 0x10de, Device Id = 0x1af1, Class = 0x68000
[ 1182.407252] nvidia-nvswitch0: Failed to map BAR0 region : -12
-----------
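The "BAR0 is 0M @ 0x0" message above means firmware never assigned the device's first PCI base address register, so the driver has nothing to map. One way to inspect this directly is via sysfs, where the first line of a device's resource file gives BAR0's start, end, and flags. A minimal sketch (bar0_info is a hypothetical helper, not part of any NVIDIA tooling):

```shell
# Hypothetical helper: report a device's BAR0 window by reading the first
# line (start, end, flags) of its sysfs "resource" file.
bar0_info() {
    # $1 = resource file, e.g. /sys/bus/pci/devices/0000:45:00.0/resource
    read -r start end _ < "$1"
    if [ "$start" = "0x0000000000000000" ] && [ "$end" = "0x0000000000000000" ]; then
        echo "BAR0 unassigned"   # matches the "BAR0 is 0M @ 0x0" symptom
    else
        echo "BAR0 $start-$end"
    fi
}
# Usage on a live system:
#   bar0_info /sys/bus/pci/devices/0000:45:00.0/resource
```

An unassigned BAR0 here points at BIOS/firmware PCI resource allocation rather than the NVIDIA driver itself.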

Revision history for this message
acd (alecd-smc) wrote :
Colin Watson (cjwatson)
affects: launchpad → ubuntu
Revision history for this message
acd (alecd-smc) wrote :

Additional information:

With a Rome-based Delta A100 system, the SR-IOV feature must be enabled in the BIOS for the NVIDIA drivers and fabric manager to install successfully. If it is disabled, the system behaves the same as the Milan-based Delta A100 system. This is evident on both 18.04 and 20.04 installs.

However, with an RHEL 8 install, the fabric manager service works fine whether SR-IOV is enabled or disabled in the BIOS, and nvidia-smi displays all 8 GPUs as expected.
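As a quick sanity check after a driver, firmware, or BIOS change, the GPU count and fabric manager state can be verified from the shell. A minimal sketch (count_gpus is a hypothetical helper that just counts "GPU N: ..." lines from nvidia-smi -L style output; the expected count of 8 is specific to this HGX A100 board):

```shell
# Hypothetical helper: count GPUs in `nvidia-smi -L` style output
# (one "GPU N: ..." line per detected device), read from stdin.
count_gpus() {
    grep -c '^GPU '
}
# Usage on a live system:
#   nvidia-smi -L | count_gpus                   # expect 8 on this board
#   systemctl is-active nvidia-fabricmanager     # expect "active"
```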

Jeff Lane  (bladernr)
tags: added: blocks-hwcert-server
Revision history for this message
dann frazier (dannf) wrote :

Just a hunch, but does either kernel parameter pci=realloc or pci=nocrs work around it?

FYI, it'd be useful to see and compare a full dmesg from RHEL and the corresponding one from Ubuntu (not just the truncated one in Comment #1).
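For anyone wanting to try those parameters, here is a minimal sketch of how a kernel parameter is typically added on Ubuntu via GRUB. The add_kernel_param helper and the GRUB_FILE variable are illustrative only; on a real system the file is /etc/default/grub and must be edited as root.

```shell
# Illustrative sketch: append a kernel parameter (e.g. pci=realloc) to
# GRUB_CMDLINE_LINUX_DEFAULT. GRUB_FILE is parameterized here for
# illustration; on a real system it is /etc/default/grub.
GRUB_FILE=${GRUB_FILE:-/etc/default/grub}
add_kernel_param() {
    # $1 = parameter to append, e.g. pci=realloc or pci=nocrs
    sed -i "s/^GRUB_CMDLINE_LINUX_DEFAULT=\"\([^\"]*\)\"/GRUB_CMDLINE_LINUX_DEFAULT=\"\1 $1\"/" "$GRUB_FILE"
}
# After editing: sudo update-grub, reboot, then confirm with:
#   cat /proc/cmdline
```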

Revision history for this message
acd (alecd-smc) wrote :

The issue was fixed after the firmware was upgraded and SR-IOV was enabled in the BIOS.

Revision history for this message
Manoj (manojamd2000) wrote (last edit ):

Yeah, I am also facing the same issue on an NVIDIA H100 SXM system, and after enabling SR-IOV in the BIOS the issue got resolved.
Try resetting the BIOS settings and check that SR-IOV is enabled.
