NVIDIA A100-80GB GPU fails to initialize in R750xa, BAR Address setup failed if SR-IOV is disabled in BIOS

Bug #1934620 reported by Sujith Pandel
This bug affects 1 person
Affects         Status      Importance  Assigned to  Milestone
dellserver      Incomplete  Undecided   Unassigned
linux (Ubuntu)  Won't Fix   Undecided   Unassigned

Bug Description

With NVIDIA A100-80GB GPUs installed in an R750xa, the kernel fails to initialize the BAR memory space for the GPUs.
Multiple error messages appear in the kernel.log file.

Summary:
* Dell prefers to ship with SR-IOV globally disabled in the BIOS
* This bug does not occur IF:
    SR-IOV is enabled in the BIOS OR
    "pci=realloc=off" is passed to the kernel at boot time
* This bug only affects systems with NVIDIA A100-80GB GPUs; it does not
  affect systems with NVIDIA A100-40GB GPUs
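
To confirm that the second workaround is actually in effect on a booted system, the kernel command line can be inspected. The following is only a minimal sketch: the function name `realloc_disabled` is hypothetical, and it assumes the command line text has been read from `/proc/cmdline`.

```python
def realloc_disabled(cmdline: str) -> bool:
    """Return True if the kernel command line carries pci=realloc=off."""
    for token in cmdline.split():
        if not token.startswith("pci="):
            continue
        # pci= accepts a comma-separated option list, e.g. pci=nomsi,realloc=off
        if "realloc=off" in token[len("pci="):].split(","):
            return True
    return False


# Example with a literal string; on a live system pass the contents
# of /proc/cmdline instead.
print(realloc_disabled("root=/dev/sda1 ro pci=realloc=off"))
```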

information type: Public → Private
Sujith Pandel (sujithpandel) wrote :

If "Global SRIOV" is enabled in BIOS, we do not observe probe failure.
By default "Global SRIOV" is set to disabled in Dell BIOS.

Jeff Lane  (bladernr) wrote :

Can you likewise confirm that this happens with the latest 21.10 (Impish) ISO?
Also, how are you installing (MAAS, ISO, etc)?

And why can you not just note users need to enable SR-IOV when using these GPUs?

And does this affect the A100-40GB GPUs? OR any other Ampere GPUs?
What about Tesla or older GPUs?

Sujith Pandel (sujithpandel) wrote :

Currently we are relying on the NVIDIA team and their setup to experiment; we don't have access to a repro setup.
We are waiting for a setup to continue our analysis.

It looks like Ubuntu 20.04 enables pci=realloc automatically during boot (CONFIG_PCI_REALLOC_ENABLE_AUTO is set), and this is causing the problem.
Using "pci=realloc=off" seems to work and prevents driver probe failures.
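
For context, the upstream Kconfig help for CONFIG_PCI_REALLOC_ENABLE_AUTO says the PCI core will automatically re-allocate resources when SR-IOV BARs have not been allocated by the BIOS, which matches the SR-IOV-disabled case here. A minimal sketch for checking how a given kernel was built follows; the function name `realloc_auto_enabled` is hypothetical, and it assumes the build config is available as text (on Ubuntu, typically `/boot/config-<kernel release>`).

```python
def realloc_auto_enabled(config_text: str) -> bool:
    """Return True if the kernel config sets CONFIG_PCI_REALLOC_ENABLE_AUTO=y."""
    for line in config_text.splitlines():
        # Disabled options appear as "# CONFIG_... is not set" and do not match.
        if line.strip() == "CONFIG_PCI_REALLOC_ENABLE_AUTO=y":
            return True
    return False


# Example with literal config fragments rather than a real file.
print(realloc_auto_enabled("CONFIG_PCI_IOV=y\nCONFIG_PCI_REALLOC_ENABLE_AUTO=y\n"))
print(realloc_auto_enabled("# CONFIG_PCI_REALLOC_ENABLE_AUTO is not set\n"))
```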

Michael Reed (mreed8855) wrote :

Hi Sujith,

Just for clarity, are you proposing that we disable pci=realloc by default?

Sujith Pandel (sujithpandel) wrote :

Hi Michael,
Currently it is only being considered as a workaround. We don't know the root cause yet.
Also, it looks like other enterprise Linux distros keep CONFIG_PCI_REALLOC_ENABLE_AUTO off; we are not sure why Ubuntu keeps it on.

Michael Reed (mreed8855) wrote :

I have verified this bug.

SRIOV disabled - GPGPU certification fails and the system becomes unusable.
SRIOV disabled, pci=realloc=off - GPGPU certification passes
SRIOV enabled - GPGPU certification passes

Dell prefers to ship their systems with SRIOV disabled and wants to know if we would either address this issue in the kernel or consider setting pci=realloc to off by default.

Michael Reed (mreed8855) wrote :

Error messages on the console.

Michael Reed (mreed8855) wrote :

Additional error messages

Michael Reed (mreed8855) wrote :

Apparently this does not affect A100-40GB GPUs

Narendra K (knarendra) wrote :

Hi Michael,

Any update on disabling CONFIG_PCI_REALLOC_ENABLE_AUTO in Ubuntu 22.04?

information type: Private → Public
Ubuntu Kernel Bot (ubuntu-kernel-bot) wrote : Missing required logs.

This bug is missing log files that will aid in diagnosing the problem. While running an Ubuntu kernel (not a mainline or third-party kernel) please enter the following command in a terminal window:

apport-collect 1934620

and then change the status of the bug to 'Confirmed'.

If, due to the nature of the issue you have encountered, you are unable to run this command, please add a comment stating that fact and change the bug status to 'Confirmed'.

This change has been made by an automated script, maintained by the Ubuntu Kernel Team.

Changed in linux (Ubuntu):
status: New → Incomplete
Jeff Lane  (bladernr)
description: updated
Jeff Lane  (bladernr)
Changed in linux (Ubuntu):
status: Incomplete → Won't Fix
Changed in dellserver:
status: New → Won't Fix
Jeff Lane  (bladernr)
Changed in linux (Ubuntu):
status: Won't Fix → Confirmed
Jeff Lane  (bladernr) wrote :

Hi Sujith, can you tell us what server this is, and how much RAM is in it? Just wondering.

Sujith Pandel (sujithpandel) wrote :

Hi Jeff,
Original repro machine had ~1TB memory.

Previously attached log snip:
Memory: 1055972180K/1073051196K available (14339K kernel code, 2400K rwdata, 5008K rodata, 2736K init, 4964K bss, 17079016K reserved, 0K cma-reserved)
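
As a quick sanity check of the "~1TB" figure, the totals in that boot line can be converted from KiB to GiB. This is only an illustrative sketch; the regular expression assumes the `Memory: <available>K/<total>K available` layout shown in the log line above.

```python
import re

line = ("Memory: 1055972180K/1073051196K available (14339K kernel code, "
        "2400K rwdata, 5008K rodata, 2736K init, 4964K bss, "
        "17079016K reserved, 0K cma-reserved)")

# Capture the available and total figures, both reported in KiB.
match = re.search(r"Memory: (\d+)K/(\d+)K available", line)
available_kib, total_kib = (int(value) for value in match.groups())

# 1 GiB = 1024 * 1024 KiB
total_gib = total_kib / 1024**2
available_gib = available_kib / 1024**2
print(f"total: {total_gib:.1f} GiB, available: {available_gib:.1f} GiB")
```

The total comes out just under 1 TiB (about 1023 GiB), consistent with the description.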

Jeff Lane  (bladernr) wrote :

Sujith,

Can you get an nvidia-bug-report.log?

Jeff Lane  (bladernr)
Changed in linux (Ubuntu):
status: Confirmed → Incomplete
Jeff Lane  (bladernr) wrote :

Hi Sujith, have you been able to get the nvidia bug report log for them to help debug this issue?

Jeff Lane  (bladernr) wrote :

Hi Sujith - Can you update this bug?

Sujith Pandel (sujithpandel) wrote :

Team is working on getting the required GPU setup.

Sujith Pandel (sujithpandel) wrote :

Uploaded the requested logs.

Jeff Lane  (bladernr) wrote :

@Sujith - I'm not seeing any updated logs... where were they uploaded?

Sujith Pandel (sujithpandel) wrote :

@Jeff - I have sent them via email to you and Michael. Please check.

Jeff Lane  (bladernr) wrote :

Thank you Sujith... FYI the log was passed to NVIDIA who are looking it over to see if there's anything to do on their end.

Jeff Lane  (bladernr) wrote :

Closing the kernel task as there's really nothing for us to do here, unfortunately. Still waiting on NVIDIA to respond about the logs and whether they have any advice (I asked again on Monday, and they're going to check again with the internal team they sent the logs to).

Changed in linux (Ubuntu):
status: Incomplete → Won't Fix
Changed in dellserver:
status: Won't Fix → Opinion
status: Opinion → Incomplete
Jeff Lane  (bladernr) wrote :

Hey Sujith, so far no real update from NVIDIA other than the fact that not just the GPUs were having issues:

Jun 21 06:14:47 R750XAS kernel: [ 59.183873] pnp 00:00: Plug and Play ACPI device, IDs PNP0b00 (active)
Jun 21 06:14:47 R750XAS kernel: [ 59.183896] pnp 00:01: disabling [mem 0xff000000-0xffffffff] because it overlaps 0000:17:00.0 BAR 8 [mem 0x00000000-0x13ffffffff 64bit pref]
Jun 21 06:14:47 R750XAS kernel: [ 59.334512] pnp 00:01: disabling [mem 0xff000000-0xffffffff disabled] because it overlaps 0000:65:00.0 BAR 8 [mem 0x00000000-0x13ffffffff 64bit pref]
Jun 21 06:14:47 R750XAS kernel: [ 59.494432] pnp 00:01: disabling [mem 0xff000000-0xffffffff disabled] because it overlaps 0000:ca:00.0 BAR 8 [mem 0x00000000-0x13ffffffff 64bit pref]
Jun 21 06:14:47 R750XAS kernel: [ 59.654396] pnp 00:01: disabling [mem 0xff000000-0xffffffff disabled] because it overlaps 0000:e3:00.0 BAR 8 [mem 0x00000000-0x13ffffffff 64bit pref]

Could something firmware-related be going on?
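
To make the log lines above easier to read, a short sketch can extract the device, BAR index, and window size from each "overlaps" message. The function name `parse_overlap` and the regular expression are assumptions based only on the message format quoted above, not on any kernel tooling.

```python
import re

# Matches the tail of the pnp message: device address, BAR index, memory range.
OVERLAP_RE = re.compile(
    r"because it overlaps "
    r"(?P<dev>[0-9a-f]{4}:[0-9a-f]{2}:[0-9a-f]{2}\.[0-7]) "
    r"BAR (?P<bar>\d+) \[mem 0x(?P<start>[0-9a-f]+)-0x(?P<end>[0-9a-f]+)"
)


def parse_overlap(line: str):
    """Return (device, bar, start, size_in_bytes), or None if no match."""
    match = OVERLAP_RE.search(line)
    if match is None:
        return None
    start = int(match["start"], 16)
    end = int(match["end"], 16)
    # The range is inclusive, so the size is end - start + 1.
    return match["dev"], int(match["bar"]), start, end - start + 1


sample = ("pnp 00:01: disabling [mem 0xff000000-0xffffffff] because it overlaps "
          "0000:17:00.0 BAR 8 [mem 0x00000000-0x13ffffffff 64bit pref]")
device, bar, start, size = parse_overlap(sample)
print(device, bar, hex(start), size // 2**30, "GiB")
```

For each of the four devices the range 0x00000000-0x13ffffffff is an 80 GiB prefetchable window that starts at address 0, which is why it overlaps the pnp region at 0xff000000-0xffffffff.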

Jeff Lane  (bladernr) wrote :

Also, could there be a bios parameter that could be changed to help here?
