Milan Delta A100 GPU fails to detect on Ubuntu 18.04 and 20.04

Bug #1915413 reported by acd
Affects: Ubuntu
Status: New
Importance: Undecided
Assigned to: Unassigned

Bug Description

An AMD Milan Delta system with an HGX A100 8-GPU board fails to detect all 8 GPUs on both Ubuntu 18.04 and 20.04, because the fabric manager cannot be enabled. With other Linux variants, such as CentOS and RHEL, all 8 GPUs are detected without problems.

From a clean Ubuntu 18.04 install on an A100 Delta board:

Output from systemctl status nvidia-fabricmanager: the process terminated due to an NVSwitch driver failure
------------
Feb 06 04:44:11 milan-delta systemd[1]: Starting NVIDIA fabric manager service...
Feb 06 04:44:12 milan-delta nv-fabricmanager[64822]: request to query NVSwitch device information from NVSw>
Feb 06 04:44:12 milan-delta systemd[1]: nvidia-fabricmanager.service: Control process exited, code=exited, >
Feb 06 04:44:12 milan-delta systemd[1]: nvidia-fabricmanager.service: Failed with result 'exit-code'.
Feb 06 04:44:12 milan-delta systemd[1]: Failed to start NVIDIA fabric manager service.
------------

Syslog output
-----------
Feb 6 04:44:14 milan-delta kernel: [ 1185.231538] NVRM: GPU 0000:85:00.0: RmInitAdapter failed! (0x23:0xffff:624)
Feb 6 04:44:14 milan-delta kernel: [ 1185.231895] NVRM: GPU 0000:85:00.0: rm_init_adapter failed, device minor number 2
Feb 6 04:44:14 milan-delta nvidia-persistenced: device 0000:85:00.0 - failed to open.
Feb 6 04:44:14 milan-delta nvidia-persistenced: device 0000:8b:00.0 - registered
-----------

dmesg output
-----------
[ 1170.435712] NVRM: This PCI I/O region assigned to your NVIDIA device is invalid:
NVRM: BAR0 is 0M @ 0x0 (PCI:0000:45:00.0)
[ 1170.435714] NVRM: The system BIOS may have misconfigured your GPU.
[ 1170.435725] nvidia: probe of 0000:45:00.0 failed with error -1

[ 1182.379923] nvidia: loading out-of-tree module taints kernel.
[ 1182.379936] nvidia: module license 'NVIDIA' taints kernel.
[ 1182.379937] Disabling lock debugging due to kernel taint
[ 1182.389651] nvidia: module verification failed: signature and/or required key missing - tainting kernel
[ 1182.406795] nvidia-nvlink: Nvlink Core is being initialized, major device number 235
[ 1182.406939] nvidia-nvswitch: Probing device 0000:d4:00.0, Vendor Id = 0x10de, Device Id = 0x1af1, Class = 0x68000
[ 1182.407252] nvidia-nvswitch0: Failed to map BAR0 region : -12
-----------
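The "BAR0 is 0M @ 0x0" message above means firmware never assigned the device's first PCI base address register, so the driver has nothing to map. One way to inspect this directly is via sysfs, where the first line of a device's resource file gives BAR0's start, end, and flags. A minimal sketch (bar0_info is a hypothetical helper, not part of any NVIDIA tooling):

```shell
# Hypothetical helper: report a device's BAR0 window by reading the first
# line (start, end, flags) of its sysfs "resource" file.
bar0_info() {
    # $1 = resource file, e.g. /sys/bus/pci/devices/0000:45:00.0/resource
    read -r start end _ < "$1"
    if [ "$start" = "0x0000000000000000" ] && [ "$end" = "0x0000000000000000" ]; then
        echo "BAR0 unassigned"   # matches the "BAR0 is 0M @ 0x0" symptom
    else
        echo "BAR0 $start-$end"
    fi
}
# Usage on a live system:
#   bar0_info /sys/bus/pci/devices/0000:45:00.0/resource
```

An unassigned BAR0 here points at BIOS/firmware PCI resource allocation rather than the NVIDIA driver itself.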

Revision history for this message
acd (alecd-smc) wrote :
Colin Watson (cjwatson)
affects: launchpad → ubuntu
Revision history for this message
acd (alecd-smc) wrote :

Additional information:

With a Rome-based Delta A100 system, the SR-IOV feature must be enabled in the BIOS for the NVIDIA drivers and fabric manager to install successfully. If it is disabled, the system behaves the same as the Milan-based Delta A100 system. This is evident on both 18.04 and 20.04 installs.

However, with an RHEL 8 install, the fabric manager service works fine whether SR-IOV is enabled or disabled in the BIOS, and nvidia-smi displays all 8 GPUs as expected.
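As a quick sanity check after a driver, firmware, or BIOS change, the GPU count and fabric manager state can be verified from the shell. A minimal sketch (count_gpus is a hypothetical helper that just counts "GPU N: ..." lines from nvidia-smi -L style output; the expected count of 8 is specific to this HGX A100 board):

```shell
# Hypothetical helper: count GPUs in `nvidia-smi -L` style output
# (one "GPU N: ..." line per detected device), read from stdin.
count_gpus() {
    grep -c '^GPU '
}
# Usage on a live system:
#   nvidia-smi -L | count_gpus                   # expect 8 on this board
#   systemctl is-active nvidia-fabricmanager     # expect "active"
```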

Jeff Lane  (bladernr)
tags: added: blocks-hwcert-server
Revision history for this message
dann frazier (dannf) wrote :

Just a hunch, but does either kernel parameter pci=realloc or pci=nocrs work around it?

FYI, it'd be useful to see and compare a full dmesg from RHEL and the corresponding one from Ubuntu (not just the truncated one in Comment #1).
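For anyone wanting to try those parameters, here is a minimal sketch of how a kernel parameter is typically added on Ubuntu via GRUB. The add_kernel_param helper and the GRUB_FILE variable are illustrative only; on a real system the file is /etc/default/grub and must be edited as root.

```shell
# Illustrative sketch: append a kernel parameter (e.g. pci=realloc) to
# GRUB_CMDLINE_LINUX_DEFAULT. GRUB_FILE is parameterized here for
# illustration; on a real system it is /etc/default/grub.
GRUB_FILE=${GRUB_FILE:-/etc/default/grub}
add_kernel_param() {
    # $1 = parameter to append, e.g. pci=realloc or pci=nocrs
    sed -i "s/^GRUB_CMDLINE_LINUX_DEFAULT=\"\([^\"]*\)\"/GRUB_CMDLINE_LINUX_DEFAULT=\"\1 $1\"/" "$GRUB_FILE"
}
# After editing: sudo update-grub, reboot, then confirm with:
#   cat /proc/cmdline
```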

Revision history for this message
acd (alecd-smc) wrote :

The issue was fixed after the firmware was upgraded and SR-IOV was enabled in the BIOS.

Revision history for this message
Manoj (manojamd2000) wrote (last edit ):

Yeah, I am also facing the same issue on an NVIDIA H100 SXM system, and after enabling SR-IOV in the BIOS the issue got resolved.
Try resetting the BIOS settings and check that SR-IOV is enabled.
