MAAS server startup fails: machine-resources/amd64 returns error "failed to retrieve GPU information"

Bug #1970435 reported by Serdar Vural
This bug affects 1 person
Affects: MAAS
Status: Fix Released
Importance: High
Assigned to: Alexsander de Souza
Milestone: 3.2.0-beta5

Bug Description

I installed MAAS 3.1 from the snap on a local machine (Ubuntu 20.04), following the steps listed on the MAAS website:

sudo snap install maas
sudo snap install maas-test-db

snap list maas
Name  Version                  Rev    Tracking    Publisher   Notes
maas  3.1.0-10901-g.f1f8f1505  19835  3.1/stable  canonical✓  -

and then initialised it:

sudo maas init region+rack --database-uri maas-test-db:///

The IP address of the machine on the local network is 192.168.1.136, so the web interface should be at http://192.168.1.136:5240/MAAS
Opening that URL returns a "This site can’t be reached" error.

I checked the region controller log at /var/snap/maas/common/log/regiond.log

It records the following error every 4 seconds:

2022-04-26 13:36:09 maasserver.start_up: [error] Error during start-up.
Traceback (most recent call last):
  File "/snap/maas/19835/lib/python3.8/site-packages/maasserver/start_up.py", line 65, in start_up
    yield deferToDatabase(inner_start_up, master=master)
  File "/snap/maas/19835/usr/lib/python3/dist-packages/twisted/python/threadpool.py", line 250, in inContext
    result = inContext.theWork()
  File "/snap/maas/19835/usr/lib/python3/dist-packages/twisted/python/threadpool.py", line 266, in <lambda>
    inContext.theWork = lambda: context.call(ctx, func, *args, **kw)
  File "/snap/maas/19835/usr/lib/python3/dist-packages/twisted/python/context.py", line 122, in callWithContext
    return self.currentContext().callWithContext(ctx, func, *args, **kw)
  File "/snap/maas/19835/usr/lib/python3/dist-packages/twisted/python/context.py", line 85, in callWithContext
    return func(*args,**kw)
  File "/snap/maas/19835/lib/python3.8/site-packages/provisioningserver/utils/twisted.py", line 870, in callInContext
    return func(*args, **kwargs)
  File "/snap/maas/19835/lib/python3.8/site-packages/provisioningserver/utils/twisted.py", line 202, in wrapper
    result = func(*args, **kwargs)
  File "/snap/maas/19835/lib/python3.8/site-packages/maasserver/utils/orm.py", line 711, in call_with_connection
    return func(*args, **kwargs)
  File "/snap/maas/19835/lib/python3.8/site-packages/maasserver/utils/__init__.py", line 194, in call_with_lock
    return func(*args, **kwargs)
  File "/snap/maas/19835/lib/python3.8/site-packages/maasserver/utils/orm.py", line 756, in call_within_transaction
    return func_outside_txn(*args, **kwargs)
  File "/snap/maas/19835/lib/python3.8/site-packages/maasserver/utils/orm.py", line 559, in retrier
    return func(*args, **kwargs)
  File "/usr/lib/python3.8/contextlib.py", line 75, in inner
    return func(*args, **kwds)
  File "/snap/maas/19835/lib/python3.8/site-packages/maasserver/start_up.py", line 118, in inner_start_up
    node = RegionController.objects.get_or_create_running_controller()
  File "/snap/maas/19835/lib/python3.8/site-packages/maasserver/models/node.py", line 742, in get_or_create_running_controller
    node = self._find_or_create_running_controller()
  File "/snap/maas/19835/lib/python3.8/site-packages/maasserver/models/node.py", line 779, in _find_or_create_running_controller
    node = self._find_running_node()
  File "/snap/maas/19835/lib/python3.8/site-packages/maasserver/models/node.py", line 795, in _find_running_node
    filter_macs = Q(interface__mac_address__in=get_mac_addresses())
  File "/snap/maas/19835/lib/python3.8/site-packages/provisioningserver/utils/ipaddr.py", line 41, in get_mac_addresses
    ip_addr = get_ip_addr()
  File "/snap/maas/19835/lib/python3.8/site-packages/provisioningserver/utils/ipaddr.py", line 28, in get_ip_addr
    output = call_and_check(command)
  File "/snap/maas/19835/lib/python3.8/site-packages/provisioningserver/utils/shell.py", line 106, in call_and_check
    raise ExternalProcessError(process.returncode, command, output=stderr)
provisioningserver.utils.shell.ExternalProcessError: Command `/snap/maas/19835/usr/share/maas/machine-resources/amd64` returned non-zero exit status 1:
ERROR: Failed to retrieve GPU information: Failed to add device information for "/sys/class/drm/card0/device": Failed to read "/sys/class/drm/card0/device/device": read /sys/class/drm/card0/device/device: is a directory

It seems that the command /snap/maas/19835/usr/share/maas/machine-resources/amd64 expects /sys/class/drm/card0/device/device to be a regular file, but on this machine it is a directory. The contents of /sys/class/drm/card0/device/ are as follows:

ls -la /sys/class/drm/card0/device/
total 0
drwxr-xr-x 5 root root 0 Apr 26 13:33 .
drwxr-xr-x 31 root root 0 Apr 26 13:32 ..
lrwxrwxrwx 1 root root 0 Apr 26 13:33 device -> ../../pci0000:00/0000:00:01.2/0000:02:00.0/usb1/1-10/1-10.3
lrwxrwxrwx 1 root root 0 Apr 26 13:33 driver -> ../../../bus/platform/drivers/evdi
-rw-r--r-- 1 root root 4096 Apr 26 14:06 driver_override
drwxr-xr-x 3 root root 0 Apr 26 13:33 drm
drwxr-xr-x 4 root root 0 Apr 26 13:33 i2c-6
-r--r--r-- 1 root root 4096 Apr 26 14:06 modalias
drwxr-xr-x 2 root root 0 Apr 26 14:06 power
lrwxrwxrwx 1 root root 0 Apr 26 13:33 subsystem -> ../../../bus/platform
-rw-r--r-- 1 root root 4096 Apr 26 13:33 uevent
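
To check other machines for the same layout, a small script can report what each card's device/device entry actually is. This is only a diagnostic sketch: the function names are made up for illustration, and it assumes the usual /sys/class/drm/cardN naming. It is not part of MAAS or the machine-resources binary.

```python
import os


def classify_drm_device_entry(card_path):
    """Report whether <card_path>/device/device is a file, a directory, or missing."""
    entry = os.path.join(card_path, "device", "device")
    # os.path.isdir and os.path.isfile both follow symlinks, so a symlink
    # pointing at a directory (as in this report) is reported as "directory".
    if os.path.isdir(entry):
        return "directory"
    if os.path.isfile(entry):
        return "file"
    return "missing"


def scan_drm(root="/sys/class/drm"):
    """Classify device/device for every top-level card under root."""
    results = {}
    if not os.path.isdir(root):
        return results
    for name in sorted(os.listdir(root)):
        # Keep card0, card1, ... and skip connector entries such as card0-HDMI-A-1.
        if name.startswith("card") and name[4:].isdigit():
            results[name] = classify_drm_device_entry(os.path.join(root, name))
    return results


if __name__ == "__main__":
    for card, kind in scan_drm().items():
        print(f"{card}: device/device is {kind}")
```

On the machine described here, card0 would be expected to show up as a directory, which matches the "is a directory" read error in the log.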

Since the error message says "failed to retrieve GPU information", here is system info on the GPU card (I believe):

        *-pci:2
             description: PCI bridge
             product: Starship/Matisse GPP Bridge
             vendor: Advanced Micro Devices, Inc. [AMD]
             physical id: 3.1
             bus info: pci@0000:00:03.1
             version: 00
             width: 32 bits
             clock: 33MHz
             capabilities: pci pm pciexpress msi ht normal_decode bus_master cap_list
             configuration: driver=pcieport
             resources: irq:28 ioport:f000(size=4096) memory:fb000000-fc0fffff ioport:e0000000(size=167772160)
           *-display
                description: VGA compatible controller
                product: GK208B [GeForce GT 730]
                vendor: NVIDIA Corporation
                physical id: 0
                bus info: pci@0000:0a:00.0
                version: a1
                width: 64 bits
                clock: 33MHz
                capabilities: pm msi pciexpress vga_controller bus_master cap_list rom
                configuration: driver=nvidia latency=0
                resources: irq:88 memory:fb000000-fbffffff memory:e0000000-e7ffffff memory:e8000000-e9ffffff ioport:f000(size=128) memory:fc000000-fc07ffff

SIDE NOTE: Could this be an issue with the EVDI module? (Just an idea.) In case it is related, I have the DisplayLink driver installed on the system to support two monitors. The driver is available at:
https://www.synaptics.com/products/displaylink-graphics/downloads/ubuntu

The latest installation instructions for this driver are at https://support.displaylink.com/knowledgebase/articles/684649

Serdar Vural (serdarvural80) wrote :
Bill Wear (billwear)
Changed in maas:
status: New → Triaged
importance: Undecided → Medium
importance: Medium → Undecided
Björn Tillenius (bjornt) wrote :

Can you please run /snap/maas/19835/usr/share/maas/machine-resources/amd64 as root and see if that gives the same problem?

Changed in maas:
status: Triaged → Incomplete
Serdar Vural (serdarvural80) wrote :

Yes, it does. This seems to be the command that returns the error. Please see the attachment as well.

Stéphane Graber (stgraber) wrote :

Can you show:
 - ls -lh /sys/class/drm/
 - ls -lh /sys/class/drm/card0/
 - ls -lh /sys/class/drm/card0/device/
 - ls -lh /sys/class/drm/card0/device/device/

Normally the device entry is a symlink to the underlying device, so the device entry inside it is a file containing the device ID. That doesn't seem to be the case here, so we'll need to account for it.

Serdar Vural (serdarvural80) wrote :

I ran the "ls" commands as requested, and also some "tree" commands, just in case it gives more insight into the issue :) Please see the attached file for the outputs.

/sys/class/drm/card0 looks like a symlink
/sys/class/drm/card0/device looks like a symlink to evdi.0
/sys/class/drm/card0/device/device is also a symlink (the output mentions usb1)

The only usb device attached to the machine is the DisplayLink device (which itself has HDMI and USB ports). This device is where I plug my keyboard, mouse, the speaker, and one of the monitors.
I had to install a driver for this device to work, as mentioned in the original bug report.

Stéphane Graber (stgraber) wrote :

Can you download and run https://dl.stgraber.org/lxd/lp-1970435

It's a static Go binary which prints the output of the GPU data.
I've added a fix which should handle such nested device directories.
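
For illustration only, one way a reader could tolerate such nested device directories is to skip any card whose device/device entry is not a regular file, instead of failing the whole scan. The sketch below is written under that assumption; it is not the actual change made in LXD, and the function names are hypothetical.

```python
import os


def read_device_id(card_path):
    """Return the contents of <card_path>/device/device, or None if it is not a regular file."""
    entry = os.path.join(card_path, "device", "device")
    if not os.path.isfile(entry):
        # Covers both a missing entry and the nested-directory case seen with
        # the evdi driver, where device/device resolves to a USB device directory.
        return None
    with open(entry) as handle:
        return handle.read().strip()


def collect_device_ids(root="/sys/class/drm"):
    """Return (found, skipped): device IDs by card name, plus the cards that were skipped."""
    found, skipped = {}, []
    if not os.path.isdir(root):
        return found, skipped
    for name in sorted(os.listdir(root)):
        # Only look at top-level cards (card0, card1, ...), not connectors.
        if not (name.startswith("card") and name[4:].isdigit()):
            continue
        device_id = read_device_id(os.path.join(root, name))
        if device_id is None:
            skipped.append(name)
        else:
            found[name] = device_id
    return found, skipped
```

The point of returning the skipped cards separately is that one unusual card no longer turns into a fatal error for the whole scan, while still leaving a trace of what was ignored.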

Serdar Vural (serdarvural80) wrote :

I've attached the output. many thanks.

Stéphane Graber (stgraber) wrote :

Good, will send the fix then as this clearly isn't crashing anymore.

Stéphane Graber (stgraber) wrote :
Alexsander de Souza (alexsander-souza) wrote :

This is fixed in LXD 5.1, but we cannot update this dep before moving MAAS to Jammy as this requires Go 1.18.

Changed in maas:
status: Incomplete → Triaged
importance: Undecided → High
milestone: none → next
Changed in maas:
milestone: next → 3.2.0
assignee: nobody → Alexsander de Souza (alexsander-souza)
status: Triaged → In Progress
Changed in maas:
milestone: 3.2.0 → next
Changed in maas:
status: In Progress → Fix Committed
Changed in maas:
milestone: next → 3.2.0
Changed in maas:
milestone: 3.2.0 → 3.2.0-beta5
status: Fix Committed → Fix Released