GPGPU devices not fully named in the PCI Devices tab

Bug #1963284 reported by Jeff Lane 
Affects        Status    Importance   Assigned to   Milestone
MAAS           Expired   Low          Unassigned
lxd (Ubuntu)   Invalid   Undecided    Unassigned

Bug Description

I have a machine with two nVidia A100 GPUs installed.

Looking at the GPU section of PCI Devices, I can see both cards; however, they are not named (see screenshot).

lshw data gathered during commissioning has the necessary pciid data for discovering the name:
<hints>
  <hint name="icon" value="display" />
  <hint name="pci.class" value="0x302" />
  <hint name="pci.device" value="0x20F1" />
  <hint name="pci.subdevice" value="0x145F" />
  <hint name="pci.subvendor" value="0x10DE" />
  <hint name="pci.vendor" value="0x10DE" />
</hints>

and 10de:20f1 resolves to the nVidia Corporation A100 GPU.
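
For reference, the lookup from a vendor:device pair to a name can be sketched roughly as below. This is only an illustration of how the pci.ids text format is read (vendor lines at column 0, device lines indented by one tab, ID and name separated by two spaces); it is not the code MAAS or LXD actually runs. The function name `lookup` and the embedded sample excerpt are made up for this example.

```python
from typing import Optional, Tuple

# Illustrative excerpt in pci.ids format (not the full database).
SAMPLE_PCI_IDS = (
    "# comment lines start with '#'\n"
    "10de  NVIDIA Corporation\n"
    "\t20f1  GA100 [A100 PCIe 40GB]\n"
    "8086  Intel Corporation\n"
    "\t1533  I210 Gigabit Network Connection\n"
)


def _normalize(hex_id: str) -> str:
    """Lower-case an ID and drop an optional 0x prefix (lshw prints 0x20F1)."""
    hex_id = hex_id.strip().lower()
    return hex_id[2:] if hex_id.startswith("0x") else hex_id


def lookup(pci_ids_text: str, vendor_id: str, device_id: str) -> Tuple[Optional[str], Optional[str]]:
    """Return (vendor name, device name); either is None when not found."""
    vendor_id, device_id = _normalize(vendor_id), _normalize(device_id)
    vendor_name = None
    in_vendor = False
    for line in pci_ids_text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue  # blank line or comment
        if line.startswith("C "):
            break  # the device-class section follows the vendor list
        if line.startswith("\t\t"):
            continue  # subsystem entries are ignored in this sketch
        if line.startswith("\t"):
            if in_vendor:
                dev, _, name = line.strip().partition("  ")
                if dev.lower() == device_id:
                    return vendor_name, name.strip()
            continue
        if in_vendor:
            break  # reached the next vendor without finding the device
        vid, _, name = line.partition("  ")
        if vid.lower() == vendor_id:
            vendor_name = name.strip()
            in_vendor = True
    return vendor_name, None
```

Running `lookup()` against the text of /usr/share/misc/pci.ids on Focal and on Jammy with the IDs from the hints above would show whether the device entry is simply missing from the older file.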

I think the issue is described here:
https://bugs.launchpad.net/ubuntu/+source/pci.ids/+bug/1963283

Focal's pci.ids file is two years out of date, so I think this may be why commissioning can't identify the A100 GPUs.

The version of PCI IDs in Jammy does have the strings for identifying the A100 (and any other hardware added to the database since March 2020).

So while fixing the above bug will resolve this long term, MAAS should have a way to keep an up-to-date pci.ids database, to avoid lapses like this without relying on SRUs to previous LTSs.

Looking forward, assuming my suspicions are correct and the outdated pci.ids file is the culprit here, MAAS will continue failing to identify some hardware until after 22.04 (or perhaps 22.04.1?), when we can finally use Jammy to commission hardware.

Jeff Lane  (bladernr) wrote :

Screenshot showing unidentified NVIDIA GPUs

Alberto Donato (ack) wrote :

MAAS doesn't use lshw for reporting hardware; all the information is gathered by the resource binary (which is based on the LXD codebase) from the contents of /sys files.
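
As a rough illustration of what "from /sys files" means: each PCI device has a directory under /sys/bus/pci/devices containing small text files such as `vendor`, `device`, and `class`, which hold only numeric IDs. The sketch below reads those files; it is not the resource binary's code, and the function name `read_pci_devices` is made up for this example.

```python
from pathlib import Path
from typing import Dict, List


def read_pci_devices(sysfs_root: str = "/sys/bus/pci/devices") -> List[Dict[str, str]]:
    """Collect the raw vendor/device/class IDs that sysfs exposes per PCI device."""
    root = Path(sysfs_root)
    if not root.is_dir():
        return []  # e.g. not running on Linux, or sysfs is not mounted
    devices = []
    for dev_dir in sorted(root.iterdir()):
        entry = {"address": dev_dir.name}
        for attr in ("vendor", "device", "class"):
            try:
                entry[attr] = (dev_dir / attr).read_text().strip()
            except OSError:
                entry[attr] = ""  # attribute missing or unreadable
        devices.append(entry)
    return devices
```

Since sysfs provides only the IDs, the human-readable names have to come from a separate lookup against a PCI ID database.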

Alberto Donato (ack) wrote :

To clarify, MAAS does gather the lshw output from commissioning, but that's only used for matching XPath expressions for automatic tags.

Hardware and network information for the machine, though, is processed by parsing the JSON output from the binary mentioned above.

Jeff Lane  (bladernr) wrote :

Hmm... so what then translates PCI IDs to recognizable names? Is that an LXD problem, then (in commissioning)? Perhaps I was mistaken in thinking something in MAAS itself translated this. Perhaps this needs to be an LXD bug in that case.

Alberto Donato (ack) wrote :

Looking at the LXD code that the resource binary uses (https://github.com/lxc/lxd/blob/master/lxd/resources/pci.go), it relies on this library: https://github.com/jaypipes/pcidb, which can use the local database or fetch it from the net.
So I suspect that the db is found locally but is not up to date.

It would be good to open a LXD bug referencing this one, as any change needs to be in LXD first, then we can update dependencies in MAAS to bring in those changes.

Out of curiosity, if you run `lxc query /1.0/resources` on the machine with that hardware, do you get proper names for those devices in the returned data?
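
If it helps, the returned JSON can be checked quickly with something like the sketch below. The field names (`gpu`, `cards`, `pci_address`, `vendor_id`, `product_id`, `product`) are my recollection of the LXD resources API and should be treated as assumptions; the sample data and the function name `unnamed_gpus` are made up for illustration.

```python
from typing import List, Tuple

# Hypothetical, trimmed-down shape of the `lxc query /1.0/resources` output.
SAMPLE_RESOURCES = {
    "gpu": {
        "cards": [
            {"pci_address": "0000:3b:00.0", "vendor_id": "10de",
             "product_id": "20f1", "vendor": "NVIDIA Corporation", "product": ""},
            {"pci_address": "0000:03:00.0", "vendor_id": "1a03",
             "product_id": "2000", "vendor": "ASPEED Technology, Inc.",
             "product": "ASPEED Graphics Family"},
        ],
        "total": 2,
    }
}


def unnamed_gpus(resources: dict) -> List[Tuple[str, str]]:
    """Return (pci_address, 'vendor_id:product_id') for cards lacking a product name."""
    cards = (resources.get("gpu") or {}).get("cards") or []
    return [
        (card.get("pci_address", ""),
         "{}:{}".format(card.get("vendor_id", ""), card.get("product_id", "")))
        for card in cards
        if not card.get("product")
    ]
```

To use it on real data, save the command output to a file, load it with `json.load()`, and pass the result to `unnamed_gpus()`.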

Changed in maas:
status: New → Incomplete
Launchpad Janitor (janitor) wrote :

[Expired for MAAS because there has been no activity for 60 days.]

Changed in maas:
status: Incomplete → Expired
Stéphane Graber (stgraber) wrote :

The Go pcidb package will load the database from /usr/share/misc/ when available.
That's what happens in the LXD snap where we pass in the Ubuntu 20.04 version of the database (until we transition to core22).

But that's for the LXD snap. MAAS does not use our snap and instead runs a standalone binary, which puts the responsibility on MAAS to ensure that a sufficiently up-to-date pci.ids database is available in /usr/share/misc.
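
One way to see how old the local database is: the pci.ids file normally begins with comment lines carrying a version and a date. The sketch below extracts them, assuming the usual `#	Version:` and `#	Date:` header layout; the sample header values and the function name `pci_ids_header` are placeholders for illustration only.

```python
import re
from typing import Dict, Optional

# Placeholder header in the usual pci.ids layout (values are illustrative).
SAMPLE_HEADER = (
    "#\n"
    "#\tList of PCI ID's\n"
    "#\n"
    "#\tVersion: 2020.03.20\n"
    "#\tDate:    2020-03-20 03:15:02\n"
    "#\n"
)


def pci_ids_header(text: str) -> Dict[str, Optional[str]]:
    """Return the version and date strings from a pci.ids header, if present."""
    version = re.search(r"^#\s*Version:\s*(\S+)", text, re.MULTILINE)
    date = re.search(r"^#\s*Date:\s*(.+)$", text, re.MULTILINE)
    return {
        "version": version.group(1) if version else None,
        "date": date.group(1).strip() if date else None,
    }
```

Passing the contents of /usr/share/misc/pci.ids from the commissioning image to `pci_ids_header()` would show which snapshot of the database the standalone binary ends up reading.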

Changed in lxd (Ubuntu):
status: New → Invalid
Jeff Lane  (bladernr) wrote :

Thanks, Stephane.

Changed in maas:
status: Expired → New
Christian Grabowski (cgrabowski) wrote :

So in the context of MAAS, the previously mentioned binary will read the local database from the commissioning OS image specifically. A possible workaround for this issue is to add a custom commissioning script to update the database, and name said script such that it has higher priority than machine resources (i.e., prefix it with a higher number than the machine-resources script).
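
A minimal sketch of what such a script could do is below. It makes several assumptions: that the upstream database is downloadable from the URL shown, that the commissioning environment has network access to it, and that /usr/share/misc is writable at that point. The MAAS script metadata header and the exact file name needed for ordering are left out, and the function names are made up for this example.

```python
import os
import re
import tempfile
import urllib.request
from typing import Callable

PCI_IDS_URL = "https://pci-ids.ucw.cz/v2.2/pci.ids"  # assumed upstream location
PCI_IDS_DEST = "/usr/share/misc/pci.ids"             # path read by the pcidb library


def fetch_url(url: str) -> bytes:
    """Download the raw database contents."""
    with urllib.request.urlopen(url, timeout=30) as response:
        return response.read()


def update_pci_ids(dest: str = PCI_IDS_DEST,
                   fetch: Callable[[str], bytes] = fetch_url,
                   url: str = PCI_IDS_URL) -> bool:
    """Replace dest with a fresh download; return False if the data looks wrong."""
    data = fetch(url)
    # Sanity check: expect at least one vendor line such as "10de  NVIDIA ...".
    if not re.search(rb"^[0-9a-f]{4}  \S", data, re.MULTILINE):
        return False
    directory = os.path.dirname(dest) or "."
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    try:
        with os.fdopen(fd, "wb") as tmp_file:
            tmp_file.write(data)
        os.chmod(tmp_path, 0o644)
        os.replace(tmp_path, dest)  # atomic swap within the same directory
    except BaseException:
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise
    return True
```

A commissioning script built on this would call `update_pci_ids()` and exit non-zero when it returns False, so a failed download is visible in the commissioning results and never overwrites the existing file.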

Marking as triaged as updating the pci.ids is something MAAS could do as part of commissioning, though the workaround should suffice.

Changed in maas:
status: New → Triaged
importance: Undecided → Medium
Jerzy Husakowski (jhusakowski) wrote :

Is this reproducible on recent MAAS (3.3+)? Updated ephemeral images and changes to lxd hardware identification should cover a broader range of hardware now.

Changed in maas:
importance: Medium → Low
status: Triaged → Incomplete
Launchpad Janitor (janitor) wrote :

[Expired for MAAS because there has been no activity for 60 days.]

Changed in maas:
status: Incomplete → Expired