Ubuntu desktop does not appear on DGX Station A100 discrete GPU by default

Bug #2033418 reported by dann frazier
This bug affects 1 person
Affects                                       Status   Importance  Assigned to  Milestone
linux-nvidia (Ubuntu)                         Expired  Undecided   Unassigned
nvidia-graphics-drivers-535 (Ubuntu)          Expired  Undecided   Unassigned
nvidia-graphics-drivers-535-server (Ubuntu)   Expired  Undecided   Unassigned

Bug Description

The DGX Station A100 has various display options. The BMC provides an emulated device (Aspeed controller), a physical BMC controller (NVIDIA Quadro T1000 Mobile), and a discrete GPU card with 4 DisplayPort outputs (A100 SXM4 80GB). See the user guide[*] for a diagram.

In a default Ubuntu Desktop 22.04 ('jammy') install, nothing appears on the A100 DisplayPort output.

NVIDIA DGX OS - a derivative of Ubuntu - works around this by installing a service that generates an xorg config file dynamically based on the "OnBrd/Ext VGA Select" setting (aka "dmidecode -t 11"). I'll attach samples of those files here. Replacing the default Ubuntu xorg.conf file with the sample xorg-nvidia.conf file on a default Ubuntu Desktop install is confirmed to enable output.

[*] https://docs.nvidia.com/dgx/dgx-station-a100-user-guide/getting-started-station-a100.html

(Note: split from bug 2031367)

dann frazier (dannf) wrote :
description: updated
description: updated
summary: - Ubuntu desktop does not appear on DGX Station A100 discrete GPU
+ Ubuntu desktop does not appear on DGX Station A100 discrete GPU by default
description: updated
Ubuntu Foundations Team Bug Bot (crichton) wrote :

Thank you for taking the time to report this bug and helping to make Ubuntu better. It seems that your bug report is not filed about a specific source package though, rather it is just filed against Ubuntu in general. It is important that bug reports be filed about source packages so that people interested in the package can find the bugs about it. You can find some hints about determining what package your bug might be about at https://wiki.ubuntu.com/Bugs/FindRightPackage. You might also ask for help in the #ubuntu-bugs irc channel on Libera.chat.

To change the source package that this bug is filed about visit https://bugs.launchpad.net/ubuntu/+bug/2033418/+editstatus and add the package name in the text box next to the word Package.

[This is an automated message. I apologize if it reached you inappropriately; please just reply to this message indicating so.]

tags: added: bot-comment
affects: ubuntu → xorg-server (Ubuntu)
tags: added: dgx hybrid jammy multigpu nvidia
Daniel van Vugt (vanvugt) wrote :

Does booting with the 'nomodeset' kernel parameter provide a better default experience without needing a custom Xorg config?

Joao Andre Simioni (jasimioni) wrote :

@vanvugt,

I have tried:

GRUB_CMDLINE_LINUX="vga=normal nofb nomodeset"

But the output is still only seen in the BMC Display.

Daniel van Vugt (vanvugt) wrote :

Please don't use nofb. Just nomodeset by itself.

Daniel van Vugt (vanvugt) wrote :

Also try fully disabling Aspeed by running this as root:

  echo "blacklist ast" > /etc/modprobe.d/blacklist-ast.conf

Joao Andre Simioni (jasimioni) wrote :

Hi Daniel,

I tried removing xorg-nvidia.conf and testing the following setups:

- nomodeset:
  hdmi: no display
  bmc: just a fsck message (blinked sometimes) - no splash / no X
  sosreport: sosreport-u-DGX-Station-A100-920-23487-2531-000-nomodeset-2023-09-04-fvncavf.tar.xz

- nomodeset + blacklist ast
  hdmi: no display
  bmc: just a fsck message (blinked sometimes) - no splash / no X
  sosreport: sosreport-u-DGX-Station-A100-920-23487-2531-000-nomodeset-blacklist-ast-2023-09-04-sasnfpi.tar.xz

- blacklist ast
  hdmi: no display
  bmc: just a fsck message (blinked sometimes) - no splash / no X
  sosreport: sosreport-u-DGX-Station-A100-920-23487-2531-000-blacklist-ast-2023-09-04-hcfupsp.tar.xz

The sos reports are in a shared Google Drive:

https://drive.google.com/drive/folders/1FVHt2IMGsOyAsQ5U-z0kiqEFMD8NaWzh

Without the ast blacklist or the nomodeset, the behavior goes back to:

- no /etc/X11/xorg.conf.d/ file: Splash and X on BMC Display
- /etc/X11/xorg.conf.d/xorg-nvidia.conf: no splash, X on HDMI

Daniel van Vugt (vanvugt) wrote :

It sounds like the main problem therefore is the Nvidia X driver. It's refusing to light up HDMI without a custom config explicitly telling it to use "PCI:193:0:0". But this might be Nvidia's intended behaviour so I recommend checking the docs for solutions first:

https://us.download.nvidia.com/XFree86/Linux-x86_64/535.86.05/README/index.html

And contacting Nvidia for help second.

P.S. The Nvidia website recommends driver 460 for the A100. Also the installed driver version sounds like it's from nvidia-graphics-drivers-535 (desktop), but also there are some nvidia-graphics-drivers-535-server packages installed? So I wonder if this is just a bad choice of drivers for the hardware.

affects: xorg-server (Ubuntu) → nvidia-graphics-drivers-535 (Ubuntu)
Daniel van Vugt (vanvugt) wrote :

Wait, is the problem just that the auto-generated config is telling the Nvidia driver to ignore Nvidia GPUs by default?

Section "ServerFlags"
    Option "AutoAddGPU" "off"
EndSection

Joao Andre Simioni (jasimioni) wrote :

I tried trimming the xorg.conf down, and the minimal config that does the trick is:

Section "Device"
    Identifier "Device0"
    Driver "nvidia"
    VendorName "NVIDIA Corporation"
    BusID "PCI:193:0:0"
EndSection

Removing the ServerFlags section did not change anything (at least on this minimal test).

Is there any way to apply a similar configuration to the boot process, so the boot splash screen / systemd messages are also displayed on the HDMI output?

Also, is there a way to make Ubuntu aware of these settings?

Thanks

Daniel van Vugt (vanvugt) wrote :

> Is there any way to set a similar configuration to the boot process, so the boot splash screen / systemd messages are also displayed in the HDMI output?

I think you would need to somehow disable the BMC/Aspeed in the kernel (or BIOS) AND convince Plymouth to display earlier. That last part is currently being tracked in bug 1869655 but was perhaps more relevantly described in bug 1868240 (which isn't actually fixed yet if you disable modeset). And even then, fixing Plymouth only gives you the splash screen, it doesn't give you kernel boot messages if those are also missing on HDMI.

Daniel van Vugt (vanvugt) wrote :

To check whether a Plymouth fix would be useful at all, you can try the 'nosplash' kernel parameter. That will disable the splash screen. If that's not enough to give you the boot messages you want, then this remains an Nvidia driver or kernel bug.

Joao Andre Simioni (jasimioni) wrote :

Hi Daniel,

Thanks. I'll try that, but I believe it won't work. To make the GRUB menu visible on both screens, I set:

GRUB_TERMINAL=console

in the GRUB options.

Thinking about the X issue, where the BusID "PCI:193:0:0" fixes the problem - is there any suggested way to make Ubuntu detect and apply that to xorg.conf? As Dann mentioned, Nvidia is using some additional .deb packages that do the dmidecode check during boot, and define the xorg.conf correctly. How could we implement that in Ubuntu?

Daniel van Vugt (vanvugt) wrote :

The data in comment #1 shows only two "VGA" controllers that could be used:

46:00.0 VGA compatible controller [0300]: ASPEED Technology, Inc. ASPEED Graphics Family [1a03:2000] (rev 41) (prog-if 00 [VGA controller])
c1:00.0 VGA compatible controller [0300]: NVIDIA Corporation TU117GLM [Quadro T1000 Mobile] [10de:1fb0] (rev a1) (prog-if 00 [VGA controller]) (this is "PCI:193:0:0")

So we want to tell the kernel to boot without PCI device "46:00.0" or else it will likely only use ASPEED as the boot VGA. I can't find a quick answer right now but maybe by assigning this bug to the kernel team someone will...
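
For readers wondering how "c1:00.0" becomes "PCI:193:0:0": lspci prints the bus, device and function numbers in hexadecimal, while the Xorg BusID option takes them in decimal. Below is a minimal Python sketch of that conversion, using the two lspci lines quoted above as sample input. The helper names (xorg_bus_id, pick_nvidia_vga) are made up for illustration and are not part of any Ubuntu or NVIDIA tool, and the sketch assumes lspci output without the PCI domain prefix.

```python
import re

# Matches the start of an `lspci -nn` line: "bus:device.function", all hexadecimal.
PCI_ADDR = re.compile(r"^([0-9a-fA-F]{2}):([0-9a-fA-F]{2})\.([0-7])\s")


def xorg_bus_id(lspci_line):
    """Convert the hex address at the start of an lspci line to Xorg's decimal BusID."""
    match = PCI_ADDR.match(lspci_line)
    if match is None:
        raise ValueError(f"no PCI address found in: {lspci_line!r}")
    bus, device, function = (int(part, 16) for part in match.groups())
    return f"PCI:{bus}:{device}:{function}"


def pick_nvidia_vga(lspci_output):
    """Return the first VGA controller line with NVIDIA's vendor ID (10de), or None."""
    for line in lspci_output.splitlines():
        if "VGA compatible controller" in line and "[10de:" in line:
            return line
    return None


# The two VGA controllers quoted in this comment.
SAMPLE = """\
46:00.0 VGA compatible controller [0300]: ASPEED Technology, Inc. ASPEED Graphics Family [1a03:2000] (rev 41) (prog-if 00 [VGA controller])
c1:00.0 VGA compatible controller [0300]: NVIDIA Corporation TU117GLM [Quadro T1000 Mobile] [10de:1fb0] (rev a1) (prog-if 00 [VGA controller])
"""

nvidia_line = pick_nvidia_vga(SAMPLE)
print(xorg_bus_id(nvidia_line))  # PCI:193:0:0
```

This only covers the address conversion. The DGX OS service mentioned in the bug description reportedly keys off dmidecode instead, and its logic is not reproduced here.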

Joao Andre Simioni (jasimioni) wrote :

Discussing this internally, it was suggested to use:

fbcon=map:N

With N = 0, I see the same behavior (boot process on the BMC, X on whichever output is configured).

Other values of N just blank the boot process on both outputs (BMC and HDMI).

Changed in nvidia-graphics-drivers-535-server (Ubuntu):
status: New → Incomplete
Changed in nvidia-graphics-drivers-535 (Ubuntu):
status: New → Incomplete
Changed in linux-nvidia (Ubuntu):
status: New → Incomplete
Dimitri John Ledkov (xnox) wrote :

Newer drivers are available now, and I'm not sure what is expected here.

dann frazier (dannf) wrote :

We're checking to see if the system behaves the same with NVIDIA's BaseOS. If it does, we'll probably move on and this can move to Won't Fix.

Daniel van Vugt (vanvugt) wrote :

I think comment #17 was on the right track. fbcon=map:1 should work in theory if the Nvidia kernel driver has framebuffer support. Make sure you're not using nofb or nomodeset which would break it.

https://www.kernel.org/doc/html/latest/fb/fbcon.html
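
As far as the fbcon documentation linked above describes it, the digits given to fbcon=map: are repeated across the 64 virtual consoles, and /proc/fb lists which framebuffer numbers actually exist. A map value that points at a framebuffer no driver has registered leaves fbcon without a console to take over, which would be consistent with the blank screens reported earlier. Below is a small Python sketch of that check. The function names and the sample /proc/fb content are hypothetical and were not captured from the DGX Station.

```python
def expand_fbcon_map(pattern, consoles=64):
    """Repeat the map pattern across all consoles, as the fbcon documentation describes."""
    if not pattern or not (pattern.isascii() and pattern.isdigit()):
        raise ValueError("pattern must be a non-empty string of ASCII digits")
    return [int(pattern[i % len(pattern)]) for i in range(consoles)]


def parse_proc_fb(text):
    """Parse /proc/fb content ("<index> <driver name>" per line) into a dict."""
    devices = {}
    for line in text.splitlines():
        index, _, name = line.strip().partition(" ")
        if index.isdigit():
            devices[int(index)] = name
    return devices


def unmapped_targets(pattern, proc_fb_text):
    """Return framebuffer indices named by the pattern that no driver has registered."""
    registered = parse_proc_fb(proc_fb_text)
    return sorted(set(expand_fbcon_map(pattern)) - set(registered))


# Hypothetical /proc/fb content for a boot where only the Aspeed framebuffer registered.
ONLY_AST = "0 astdrmfb\n"

print(unmapped_targets("0", ONLY_AST))  # []
print(unmapped_targets("1", ONLY_AST))  # [1]
```

On the real machine the input would come from reading /proc/fb after boot. If the Nvidia driver does not appear there, fbcon=map:1 has nothing to map to.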

Launchpad Janitor (janitor) wrote :

[Expired for nvidia-graphics-drivers-535 (Ubuntu) because there has been no activity for 60 days.]

Changed in nvidia-graphics-drivers-535 (Ubuntu):
status: Incomplete → Expired
Launchpad Janitor (janitor) wrote :

[Expired for linux-nvidia (Ubuntu) because there has been no activity for 60 days.]

Changed in linux-nvidia (Ubuntu):
status: Incomplete → Expired
Launchpad Janitor (janitor) wrote :

[Expired for nvidia-graphics-drivers-535-server (Ubuntu) because there has been no activity for 60 days.]

Changed in nvidia-graphics-drivers-535-server (Ubuntu):
status: Incomplete → Expired