Milan Delta A100 GPU fails to detect on Ubuntu 18.04 and 20.04
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
Ubuntu | New | Undecided | Unassigned |
Bug Description
An AMD Milan Delta system with an HGX A100 8-GPU board fails to detect all 8 GPUs on both Ubuntu 18.04 and 20.04 due to a problem enabling the fabric manager. Other Linux distributions, such as CentOS and RHEL, detect all 8 GPUs without any problem.
From clean Ubuntu 18.04 install
A100 Delta board
Output from systemctl status nvidia-
------------
Feb 06 04:44:11 milan-delta systemd[1]: Starting NVIDIA fabric manager service...
Feb 06 04:44:12 milan-delta nv-fabricmanage
Feb 06 04:44:12 milan-delta systemd[1]: nvidia-
Feb 06 04:44:12 milan-delta systemd[1]: nvidia-
Feb 06 04:44:12 milan-delta systemd[1]: Failed to start NVIDIA fabric manager service.
------------
Syslog output
-----------
Feb 6 04:44:14 milan-delta kernel: [ 1185.231538] NVRM: GPU 0000:85:00.0: RmInitAdapter failed! (0x23:0xffff:624)
Feb 6 04:44:14 milan-delta kernel: [ 1185.231895] NVRM: GPU 0000:85:00.0: rm_init_adapter failed, device minor number 2
Feb 6 04:44:14 milan-delta nvidia-
Feb 6 04:44:14 milan-delta nvidia-
-----------
dmesg output
-----------
[ 1170.435712] NVRM: This PCI I/O region assigned to your NVIDIA device is invalid:
NVRM: BAR0 is 0M @ 0x0 (PCI:0000:45:00.0)
[ 1170.435714] NVRM: The system BIOS may have misconfigured your GPU.
[ 1170.435725] nvidia: probe of 0000:45:00.0 failed with error -1
[ 1182.379923] nvidia: loading out-of-tree module taints kernel.
[ 1182.379936] nvidia: module license 'NVIDIA' taints kernel.
[ 1182.379937] Disabling lock debugging due to kernel taint
[ 1182.389651] nvidia: module verification failed: signature and/or required key missing - tainting kernel
[ 1182.406795] nvidia-nvlink: Nvlink Core is being initialized, major device number 235
[ 1182.406939] nvidia-nvswitch: Probing device 0000:d4:00.0, Vendor Id = 0x10de, Device Id = 0x1af1, Class = 0x68000
[ 1182.407252] nvidia-nvswitch0: Failed to map BAR0 region : -12
-----------
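For anyone trying to reproduce or triage this, the BAR0 failures in the dmesg excerpt above can be pulled out with a small script. This is only a rough sketch, not an NVIDIA or Ubuntu tool: the function name `summarize_bar_failures` and the sample constant are hypothetical, the regular expressions are written against the exact message wording quoted in this report, and pairing the "Failed to map BAR0 region" line with the most recently probed NVSwitch address is an assumption based on the order of the lines shown here.

```python
import re

# Patterns written against the message wording quoted in this report.
# Other driver versions may word these messages differently (assumption).
BAR0_INVALID = re.compile(r"NVRM: BAR0 is 0M @ 0x0 \(PCI:([0-9a-fA-F:.]+)\)")
PROBE_FAILED = re.compile(r"nvidia: probe of ([0-9a-fA-F:.]+) failed with error (-?\d+)")
NVSWITCH_PROBE = re.compile(r"nvidia-nvswitch: Probing device ([0-9a-fA-F:.]+),")
NVSWITCH_BAR0 = re.compile(r"nvidia-nvswitch\d+: Failed to map BAR0 region : (-?\d+)")

# Lines copied from the dmesg excerpt in this report.
SAMPLE_DMESG = """\
[ 1170.435712] NVRM: This PCI I/O region assigned to your NVIDIA device is invalid:
NVRM: BAR0 is 0M @ 0x0 (PCI:0000:45:00.0)
[ 1170.435714] NVRM: The system BIOS may have misconfigured your GPU.
[ 1170.435725] nvidia: probe of 0000:45:00.0 failed with error -1
[ 1182.406939] nvidia-nvswitch: Probing device 0000:d4:00.0, Vendor Id = 0x10de, Device Id = 0x1af1, Class = 0x68000
[ 1182.407252] nvidia-nvswitch0: Failed to map BAR0 region : -12
"""


def summarize_bar_failures(dmesg_text):
    """Collect PCI addresses of devices whose BAR0 setup failed."""
    summary = {
        "gpu_bar0_invalid": [],      # addresses reported with a 0M BAR0
        "gpu_probe_failed": [],      # (address, error code) pairs
        "nvswitch_bar0_failed": [],  # (address, error code) pairs
    }
    last_switch = None  # most recently probed NVSwitch address
    for line in dmesg_text.splitlines():
        m = BAR0_INVALID.search(line)
        if m:
            summary["gpu_bar0_invalid"].append(m.group(1))
        m = PROBE_FAILED.search(line)
        if m:
            summary["gpu_probe_failed"].append((m.group(1), int(m.group(2))))
        m = NVSWITCH_PROBE.search(line)
        if m:
            last_switch = m.group(1)
        m = NVSWITCH_BAR0.search(line)
        if m:
            # The failure line carries no PCI address, so attribute it to
            # the last probed switch (assumption, see note above).
            summary["nvswitch_bar0_failed"].append((last_switch, int(m.group(1))))
    return summary


if __name__ == "__main__":
    for key, values in summarize_bar_failures(SAMPLE_DMESG).items():
        print(key, values)
```

Feeding it the full `dmesg` text from an affected machine instead of the sample should list every GPU and NVSwitch that did not get a usable BAR0, which may help when comparing SR-IOV enabled and disabled boots.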
Additional information:
On a Rome-based Delta A100 system, the SR-IOV feature must be enabled in the BIOS for the nvidia drivers and fabric manager to be installed successfully. If it is disabled, the system behaves like the Milan-based Delta A100 system. This is evident on both 18.04 and 20.04 installs.
With a RHEL 8 install, however, the fabric manager service works fine whether SR-IOV is enabled or disabled in the BIOS, and nvidia-smi displays all 8 GPUs as expected.