MAAS raises exception when processing NUMA node with 0 memory

Bug #1878923 reported by Johnny Jamaica
This bug affects 1 person
Affects  Status        Importance  Assigned to  Milestone
MAAS     Fix Released  High        Lee Trager
2.7      Fix Released  High        Lee Trager

Bug Description

Hi,

I'm trying to commission/test a new node (board: Supermicro H8DGU). I can install Ubuntu 18.04.x from a USB stick and it works fine. When I try to commission/test the node, I get the following error in "smartctl-validate". I've tried multiple drives with the same result. My current version is MAAS 2.7.0 (8232-g.6e1dba4ab-0ubuntu1~18.04.1).

Unable to run 'smartctl-validate': 'MAAS did not detect any storage devices during commissioning!'
Given parameters:
{'storage': {'argument_format': '{path}', 'type': 'storage', 'value': 'all'}}
Discovered storage devices:
[{'NAME': 'sda', 'MODEL': 'ST1000LM035-1RK1', 'SERIAL': 'WL1P4E9C'}]

Revision history for this message
Lee Trager (ltrager) wrote :

Could you verify if storage devices are shown on the storage tab?

Could you also upload the following

* /var/log/maas/maas.log
* /var/log/maas/regiond.log
* The output of 50-maas-01-commissioning from the failing host

Changed in maas:
status: New → Incomplete
Johnny Jamaica (johnnyjamaica) wrote :

This time I tested with a 20.04.x image, but the results are the same.

maas.log

2020-05-15T17:10:08.184467+00:00 mgmt01 maas.node: [info] compute001: Status transition from FAILED_TESTING to TESTING
2020-05-15T17:10:08.416273+00:00 mgmt01 maas.node: [warn] compute001: Could not start testing the node; it must be started manually
2020-05-15T17:12:14.275692+00:00 mgmt01 maas.node: [info] compute001: Status transition from TESTING to FAILED_TESTING

regiond.log

2020-05-15 17:11:02 provisioningserver.rackdservices.tftp: [info] lpxelinux.0 requested by xxx.xxx.xxx.163
2020-05-15 17:11:03 provisioningserver.rackdservices.tftp: [info] lpxelinux.0 requested by xxx.xxx.xxx.163
2020-05-15 17:11:03 provisioningserver.rackdservices.http: [info] ldlinux.c32 requested by xxx.xxx.xxx.163
2020-05-15 17:11:03 provisioningserver.rackdservices.http: [info] pxelinux.cfg/534d4349-0002-2990-2500-29902500458e requested by xxx.xxx.xxx.163
2020-05-15 17:11:03 provisioningserver.rackdservices.http: [info] /images/ubuntu/amd64/generic/focal/daily/boot-kernel requested by xxx.xxx.xxx.163
2020-05-15 17:11:03 provisioningserver.rackdservices.http: [info] /images/ubuntu/amd64/generic/focal/daily/boot-initrd requested by xxx.xxx.xxx.163
2020-05-15 17:11:16 provisioningserver.rackdservices.http: [info] /images/ubuntu/amd64/generic/focal/daily/squashfs requested by xxx.xxx.xxx.163

Feel free to contact me with any additional questions.

Alberto Donato (ack)
summary: - AAS did not detect any storage devices during commissioning although it
+ MAAS did not detect any storage devices during commissioning although it
detected storage
Johnny Jamaica (johnnyjamaica) wrote : Re: MAAS did not detect any storage devices during commissioning although it detected storage

Regarding the storage tab: no, there is no storage displayed. Screenshot attached.

Lee Trager (ltrager) wrote :

Is it possible to get the full regiond.log? I suspect there is a failure somewhere in it.

Johnny Jamaica (johnnyjamaica) wrote :

maas.log

Johnny Jamaica (johnnyjamaica) wrote :

The error in regiond.log is the following:

2020-05-17 12:16:57 metadataserver.api: [critical] compute01.infra.lan(mq77wq): commissioning script '50-maas-01-commissioning' failed during post-processing.
 Traceback (most recent call last):
   File "/usr/lib/python3/dist-packages/metadataserver/api.py", line 802, in signal
     target_status = process(node, request, status)
   File "/usr/lib/python3/dist-packages/metadataserver/api.py", line 622, in _process_commissioning
     node, node.current_commissioning_script_set, request, status
   File "/usr/lib/python3/dist-packages/metadataserver/api.py", line 515, in _store_results
     **args, timedout=(status == SIGNAL_STATUS.TIMEDOUT)
   File "/usr/lib/python3/dist-packages/metadataserver/models/scriptresult.py", line 388, in store_result
     exit_status=self.exit_status,
 --- <exception caught here> ---
   File "/usr/lib/python3/dist-packages/metadataserver/api.py", line 441, in try_or_log_event
     func(*args, **kwargs)
   File "/usr/lib/python3/dist-packages/metadataserver/builtin_scripts/hooks.py", line 453, in process_lxd_results
     node.memory, numa_nodes = _parse_memory(data, numa_nodes)
   File "/usr/lib/python3/dist-packages/metadataserver/builtin_scripts/hooks.py", line 548, in _parse_memory
     memory_node["total"] / 1024 / 1024
 builtins.KeyError: 3

Johnny Jamaica (johnnyjamaica) wrote :

Today I started the machine in rescue mode and the disks were recognized successfully:

Disk /dev/sda: 29.51 GiB, 31675383808 bytes, 61865984 sectors
Disk model: KingFast
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disklabel type: gpt
Disk identifier: 2E3CC805-5993-4A8A-ABE9-BDFE531B3D64

Device     Start      End  Sectors  Size Type
/dev/sda1   2048     4095     2048    1M BIOS boot
/dev/sda2   4096 61863935 61859840 29.5G Linux filesystem

Disk /dev/sdb: 931.53 GiB, 1000204886016 bytes, 1953525168 sectors
Disk model: ST1000LM035-1RK1
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 4096 bytes
I/O size (minimum/optimal): 4096 bytes / 4096 bytes
Disklabel type: dos
Disk identifier: 0xf0b78a5d

Device     Boot Start        End    Sectors   Size Id Type
/dev/sdb1        2048 1953519615 1953517568 931.5G  7 HPFS/NTFS/exFAT

Let me know if any additional information is required. I'm also happy to test an updated version of the test script.

Lee Trager (ltrager) wrote :

The issue actually isn't with storage; it's with NUMA detection. As of MAAS 2.7, almost all commissioning data comes from 50-maas-01-commissioning. If processing of that script's output fails, none of its data is modeled, which is why no storage shows up. You have 4 NUMA nodes set up, but only 2 have any memory assigned to them. MAAS fails when it tries to convert 0 bytes to 0 megabytes. This is clearly a bug in MAAS that needs to be fixed, but I'm wondering whether this configuration was intentional. If so, what is the use case?

    "memory": {
        "nodes": [
            {
                "numa_node": 0,
                "hugepages_used": 0,
                "hugepages_total": 0,
                "used": 1314131968,
                "total": 16814940160
            },
            {
                "numa_node": 1,
                "hugepages_used": 0,
                "hugepages_total": 0,
                "used": 0,
                "total": 0
            },
            {
                "numa_node": 2,
                "hugepages_used": 0,
                "hugepages_total": 0,
                "used": 753184768,
                "total": 16905523200
            },
            {
                "numa_node": 3,
                "hugepages_used": 0,
                "hugepages_total": 0,
                "used": 0,
                "total": 0
            }
        ],
        "hugepages_total": 0,
        "hugepages_used": 0,
        "hugepages_size": 2097152,
        "used": 602902528,
        "total": 33720463360
    },
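
For readers following along, here is a minimal Python sketch of per-node memory parsing that tolerates NUMA nodes with 0 bytes of memory. It is not the MAAS code and not the actual fix: the function name `parse_numa_memory`, the shape of the `numa_nodes` mapping, and the choice of `setdefault` are all assumptions made for illustration. The sample input is a trimmed copy of the "memory" section above.

```python
def parse_numa_memory(memory, numa_nodes=None):
    """Return (total_mib, numa_nodes) from a "memory" dict shaped like the one above.

    Hypothetical helper for illustration only; not part of MAAS.
    """
    mib = 1024 * 1024
    numa_nodes = {} if numa_nodes is None else numa_nodes
    total_bytes = memory.get("total", 0)
    # If no per-node list is present, treat all memory as belonging to node 0.
    default_nodes = [{"numa_node": 0, "total": total_bytes}]
    for entry in memory.get("nodes") or default_nodes:
        # setdefault creates the entry if this node index was not registered
        # earlier, so an unexpected index cannot raise KeyError.
        node = numa_nodes.setdefault(entry["numa_node"], {})
        # Integer division: 0 bytes simply becomes 0 MiB.
        node["memory"] = int(entry.get("total", 0)) // mib
    return total_bytes // mib, numa_nodes


# Trimmed copy of the data reported in this bug.
sample = {
    "nodes": [
        {"numa_node": 0, "total": 16814940160},
        {"numa_node": 1, "total": 0},
        {"numa_node": 2, "total": 16905523200},
        {"numa_node": 3, "total": 0},
    ],
    "total": 33720463360,
}

# Assume only nodes 0 and 2 were registered beforehand (an assumption,
# chosen to exercise the missing-key path).
total_mib, nodes = parse_numa_memory(sample, {0: {}, 2: {}})
for index in sorted(nodes):
    print(index, nodes[index]["memory"], "MiB")
```

With this input, nodes 1 and 3 end up with 0 MiB instead of aborting the whole post-processing step, so the remaining commissioning data (storage included) could still be modeled.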

Changed in maas:
status: Incomplete → Confirmed
assignee: nobody → Lee Trager (ltrager)
importance: Undecided → High
milestone: none → 2.8.0rc1
Johnny Jamaica (johnnyjamaica) wrote :

Hi,

I assume this might be due to there being 2 NUMA nodes per physical CPU (but I would need to dig deeper to confirm). Memory was installed as described in the user manual. I can try changing the recommended order of the DIMMs and check again.

Johnny Jamaica (johnnyjamaica) wrote :

It appears that memory is always added to the odd-numbered NUMA nodes.

Lee Trager (ltrager)
summary: - MAAS did not detect any storage devices during commissioning although it
- detected storage
+ MAAS raises exception when processing NUMA node with 0 memory
Lee Trager (ltrager)
Changed in maas:
status: Confirmed → In Progress
Changed in maas:
status: In Progress → Fix Committed
Alberto Donato (ack)
Changed in maas:
status: Fix Committed → Fix Released
Changed in maas:
status: Fix Released → Fix Committed
Changed in maas:
status: Fix Committed → Fix Released