50-maas-01-commissioning failed with "Node with this Hardware uuid already exists"

Bug #1903544 reported by Bart Vrancken
This bug affects 1 person
Affects: MAAS
Status: New
Importance: Undecided
Assigned to: Unassigned

Bug Description

We have special-purpose PCBs with an Intel J4105 CPU, storage, etc. Running commissioning, it fails at the step:

Script result lookup or storage error - fresh-collie.nl.pha.farm(w7bb3y): commissioning script '50-maas-01-commissioning' failed during post-processing

The problem here is that it does not log why. There are no errors in the log on the controller ... only the returned YAML data from the tool collecting the hardware info, and stderr only logs the download itself. I have compared this with other systems and the format looks identical.

When I boot the unit into rescue mode it works fine; running the script manually did not produce any errors and output the same YAML data. The YAML data, both on the controller and as output locally, was verified to be valid YAML.
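
For reference, a check along these lines is enough to confirm the file parses as YAML (a minimal sketch; the filename is only an example and PyYAML is assumed to be installed):

    # check_yaml.py - minimal sketch: confirm the commissioning output parses as YAML.
    import sys
    import yaml

    # The filename here is only an example; pass the saved script output as an argument.
    path = sys.argv[1] if len(sys.argv) > 1 else "50-maas-01-commissioning.yaml"
    with open(path) as f:
        data = yaml.safe_load(f)

    # The output should be a mapping; print the top-level keys as a sanity check.
    print(type(data).__name__, sorted(data.keys()) if isinstance(data, dict) else data)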

Looking at the region controller dashboard for the node's inventory, I do notice that the number of cores and the speed are listed as 'unknown', while on the next line it reports the exact CPU type. Memory is also listed as 'unknown'. The YAML data does contain this information, though.

Since the error says post-processing, I would think this happens on the region controller; however, I would expect such logs to be included ... especially if there are errors.

The commissioning YAML output can be found at https://8n1.org/18077/43f3

Revision history for this message
Bart Vrancken (bartvrancken) wrote :

To clarify, the system is running Ubuntu 20.04 with a snap install of MAAS 2.8.

description: updated
Revision history for this message
Lee Trager (ltrager) wrote :

The commissioning script logs you see associated with the node only show the output of the scripts that ran. Logs of commissioning scripts which are processed by the region are stored in /var/snap/maas/common/log/regiond.log and /var/snap/maas/common/log/maas.log. Please upload those so we can see what is failing.
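
For anyone looking for the relevant entries, a rough sketch of the kind of filtering that pulls the post-processing failures (and the tracebacks that follow them) out of regiond.log; the path is the snap default mentioned above:

    # grep_regiond.py - sketch: print post-processing failures and their tracebacks
    # from the region controller log of a snap install.
    LOG = "/var/snap/maas/common/log/regiond.log"

    with open(LOG, errors="replace") as f:
        lines = f.readlines()

    for i, line in enumerate(lines):
        if "failed during post-processing" in line:
            # Print the log message plus the indented traceback lines that follow it.
            print(line, end="")
            for follow in lines[i + 1:]:
                if follow.startswith((" ", "\t")):
                    print(follow, end="")
                else:
                    break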

Changed in maas:
status: New → Incomplete
Revision history for this message
Bart Vrancken (bartvrancken) wrote :

After testing, it seems to be triggered when a node is running the latest BIOS and has Intel SGX enabled. Two of the beta units (ASRock J4105M) ran fine with BIOS 1.3 (2018) but were upgraded to 1.5B, which enables SGX and some other features. Disabling SGX and other BIOS features did not resolve it, so until I can actually see why the post-processing is failing I cannot resolve this.

Revision history for this message
Bart Vrancken (bartvrancken) wrote :

I have two identical machines; one fails ... the other does not. I checked the YAML output of 50-maas-01-commissioning for both and compared them:
https://www.textcompare.org/index.html?id=5fa9b203cec8c500178eb63f

The only differences are:
- Hostname (logical)
- Serial number (logical)
- CPU frequency (logical, due to the governor)
- Memory used (logical)
- Memory node total (unsure why)

Resetting the board to BIOS defaults did not solve anything. I triple-checked all BIOS settings; they are all identical.

A third machine with a similar layout, but a custom PCB, is failing for the same reason(s) ... but due to the many differences in the YAML it is hard to see where the mismatch is.
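
For what it's worth, the comparison can also be done programmatically; a rough sketch (the filenames are only examples, PyYAML assumed):

    # yaml_diff.py - rough sketch: show where two commissioning YAML files differ.
    import yaml

    def diff(a, b, path=""):
        # Walk both structures in parallel and print any leaf values that differ.
        if isinstance(a, dict) and isinstance(b, dict):
            for key in sorted(set(a) | set(b), key=str):
                diff(a.get(key), b.get(key), f"{path}/{key}")
        elif isinstance(a, list) and isinstance(b, list) and len(a) == len(b):
            for i, (x, y) in enumerate(zip(a, b)):
                diff(x, y, f"{path}[{i}]")
        elif a != b:
            print(f"{path or '/'}: {a!r} != {b!r}")

    # Example filenames; use the saved output from the working and failing nodes.
    with open("working-node.yaml") as f1, open("failing-node.yaml") as f2:
        diff(yaml.safe_load(f1), yaml.safe_load(f2))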

Revision history for this message
Bart Vrancken (bartvrancken) wrote :

Hi,

The error I am now seeing in the log is:

2020-11-09 22:15:48 metadataserver.api: [critical] asrock-rechts.nl.pha.farm(sacra3): commissioning script '50-maas-01-commissioning' failed during post-processing.
        Traceback (most recent call last):
          File "/snap/maas/8980/lib/python3.6/site-packages/metadataserver/api.py", line 800, in signal
            target_status = process(node, request, status)
          File "/snap/maas/8980/lib/python3.6/site-packages/metadataserver/api.py", line 622, in _process_commissioning
            node, node.current_commissioning_script_set, request, status
          File "/snap/maas/8980/lib/python3.6/site-packages/metadataserver/api.py", line 515, in _store_results
            **args, timedout=(status == SIGNAL_STATUS.TIMEDOUT)
          File "/snap/maas/8980/lib/python3.6/site-packages/metadataserver/models/scriptresult.py", line 391, in store_result
            exit_status=self.exit_status,
        --- <exception caught here> ---
          File "/snap/maas/8980/lib/python3.6/site-packages/metadataserver/api.py", line 441, in try_or_log_event
            func(*args, **kwargs)
          File "/snap/maas/8980/lib/python3.6/site-packages/metadataserver/builtin_scripts/hooks.py", line 728, in process_lxd_results
            node.save()
          File "/snap/maas/8980/lib/python3.6/site-packages/maasserver/models/node.py", line 1954, in save
            super(Node, self).save(*args, **kwargs)
          File "/snap/maas/8980/lib/python3.6/site-packages/maasserver/models/cleansave.py", line 216, in save
            self.validate_unique(exclude=exclude_unique_fields)
          File "/snap/maas/8980/usr/lib/python3/dist-packages/django/db/models/base.py", line 1041, in validate_unique
            raise ValidationError(errors)
        django.core.exceptions.ValidationError: {'hardware_uuid': ['Node with this Hardware uuid already exists.']}

Revision history for this message
Bart Vrancken (bartvrancken) wrote :

So this bug is related to a known UUID issue, but I would have expected that to be caught when the lshw task fails, so this was not obvious. I am betting that the third board has the same issue.

Maybe I should open a new ticket, as it might be a better idea to use UUID + serial, since the serial is unique in this case. It would have been great if the system had just logged this error in the GUI.
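
In case someone else hits this: whether two boards really report the same DMI UUID and serials can be checked directly on the nodes; a quick sketch (reads the kernel's DMI sysfs entries; product_uuid and the serials usually need root):

    # dmi_check.py - sketch: print the DMI UUID and serials this node reports,
    # so two boards can be compared by hand.
    from pathlib import Path

    DMI = Path("/sys/class/dmi/id")
    for name in ("product_uuid", "product_serial", "board_serial"):
        try:
            print(f"{name}: {(DMI / name).read_text().strip()}")
        except OSError as exc:
            print(f"{name}: <unreadable: {exc}>")

Run it on both boards and compare the output; identical product_uuid values would explain the duplicate hardware UUID error above.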

Revision history for this message
Nobuto Murata (nobuto) wrote :

I'm seeing the "Node with this Hardware uuid already exists" error in my env too.

summary: - Commissioning failed without error
+ 50-maas-01-commissioning failed with "Node with this Hardware uuid
+ already exists"
Changed in maas:
status: Incomplete → New
Revision history for this message
Bart Vrancken (bartvrancken) wrote :

This issue has been addressed in 2.9 RC1; I tested it and it works.

This issue can be closed.
