Recommissioning fails with 'Interface with this Node and Name already exists'

Bug #1929735 reported by Alexander Litvinov
This bug affects 3 people
Affects: MAAS
Status: Expired
Importance: High
Assigned to: Unassigned

Bug Description

MAAS version: 2.9.2 (9165-g.c3e7848d1)

Commissioning of the servers (Huawei CH121 V5) succeeds on the first try.
However, when I try to re-commission a server, the process fails and reports that the 50-maas-commissioning script failed, even though the script itself ran fine and I can see its output in stdout.
The logs show the error below.
If I retry a few times, all the machines eventually re-commission successfully.
I checked the database and found no duplicate machines.
So the conflict is not with another machine that has the same interface/MAC; MAAS is complaining about the very same machine.

full regiond: https://private-fileshare.canonical.com/~alitvinov/random/fg-regiond.txt

regiond.log:
2021-05-26 12:35:53 metadataserver.api: [critical] u0400s2enthc03.maas(yr6bfp): commissioning script '50-maas-01-commissioning' failed during post-processing.
        Traceback (most recent call last):
          File "/snap/maas/12555/lib/python3.8/site-packages/metadataserver/api.py", line 820, in signal
            target_status = process(node, request, status)
          File "/snap/maas/12555/lib/python3.8/site-packages/metadataserver/api.py", line 641, in _process_commissioning
            self._store_results(
          File "/snap/maas/12555/lib/python3.8/site-packages/metadataserver/api.py", line 529, in _store_results
            script_result.store_result(
          File "/snap/maas/12555/lib/python3.8/site-packages/metadataserver/models/scriptresult.py", line 384, in store_result
            signal_status = try_or_log_event(
        --- <exception caught here> ---
          File "/snap/maas/12555/lib/python3.8/site-packages/metadataserver/api.py", line 447, in try_or_log_event
            func(*args, **kwargs)
          File "/snap/maas/12555/lib/python3.8/site-packages/metadataserver/builtin_scripts/hooks.py", line 767, in process_lxd_results
            _process_lxd_resources(node, data["resources"])
          File "/snap/maas/12555/lib/python3.8/site-packages/metadataserver/builtin_scripts/hooks.py", line 515, in _process_lxd_resources
            update_node_network_information(node, data, numa_nodes)
          File "/snap/maas/12555/lib/python3.8/site-packages/metadataserver/builtin_scripts/hooks.py", line 316, in update_node_network_information
            update_interface_details(interface, interfaces_info)
          File "/snap/maas/12555/lib/python3.8/site-packages/metadataserver/builtin_scripts/hooks.py", line 231, in update_interface_details
            interface.save(update_fields=["updated", *update_fields])
          File "/snap/maas/12555/lib/python3.8/site-packages/maasserver/models/interface.py", line 1640, in save
            return super().save(*args, **kwargs)
          File "/snap/maas/12555/lib/python3.8/site-packages/maasserver/models/cleansave.py", line 186, in save
            self.validate_unique(exclude=[self._meta.pk.name])
          File "/snap/maas/12555/usr/lib/python3/dist-packages/django/db/models/base.py", line 987, in validate_unique
            raise ValidationError(errors)
        django.core.exceptions.ValidationError: {'__all__': ['Interface with this Node and Name already exists.']}

Alexander Litvinov (alitvinov) wrote :

The workaround is to delete and re-add the machine; it then commissions on the first try.

description: updated
Lee Trager (ltrager) wrote :

Could you upload the output of 50-maas-commissioning?

Changed in maas:
status: New → Incomplete
Alexander Litvinov (alitvinov) wrote :
Changed in maas:
status: Incomplete → New
Björn Tillenius (bjornt) wrote :

My guess would be that some of the interfaces change names between the two commissioning runs, and we don't handle that well.

Could you upload the output of 'maas $profile read $system_id' for the machine?
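
A minimal sketch of how such a rename could trip a uniqueness check, assuming (this is not confirmed in this report) that stored interfaces are matched by MAC address and renamed one at a time. The rename_in_place helper and the sample MACs are hypothetical and are not MAAS code.

```python
def rename_in_place(stored, observed):
    """Rename stored interfaces one at a time to match observed names.

    stored:   dict of MAC address -> name currently saved for the node
    observed: dict of MAC address -> name reported by the new run
    Raises ValueError when the wanted name is still held by another
    interface, mimicking a unique (node, name) constraint.
    """
    current = dict(stored)
    for mac, new_name in observed.items():
        if mac not in current or current[mac] == new_name:
            continue  # unknown MAC or name unchanged: nothing to rename
        holders = [m for m, n in current.items() if n == new_name and m != mac]
        if holders:
            raise ValueError(
                f"name {new_name!r} wanted by {mac} is still held by {holders[0]}"
            )
        current[mac] = new_name
    return current


# Hypothetical data: two NICs swap names between commissioning runs.
stored = {"aa:aa:aa:aa:aa:01": "eth0", "aa:aa:aa:aa:aa:02": "eth1"}
observed = {"aa:aa:aa:aa:aa:01": "eth1", "aa:aa:aa:aa:aa:02": "eth0"}

try:
    rename_in_place(stored, observed)
    collided = False
except ValueError as exc:
    collided = True
    print(exc)
```

In this toy model the first rename already fails, because the target name is still assigned to the other interface on the same node.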

Changed in maas:
status: New → Incomplete
Alexander Litvinov (alitvinov) wrote :

Fresh regiond log, with machine cy637t re-commissioned last:
https://private-fileshare.canonical.com/~alitvinov/random/fg-regiond2.txt

Output of 'maas profile machine read cy637t':
https://pastebin.canonical.com/p/PNhDY2W6jm/

Changed in maas:
status: Incomplete → New
Alberto Donato (ack) wrote :

Comparing the machine info with the machine-resources output confirms Björn's theory: some interfaces switch names between the two runs.

$ cat 50-maas-01-commissioning.json | jq '.resources.network.cards [].ports [] | [.id, .address]'
[
  "eno1",
  "48:dc:2d:0b:ec:bc"
]
[
  "eno2",
  "48:dc:2d:0b:ec:bd"
]
[
  "ens1",
  "f4:a4:d6:f3:6c:41"
]
[
  "eth0",
  "f4:a4:d6:f3:6c:3e"
]
[
  "eth1",
  "f4:a4:d6:f3:6c:3f"
]
[
  "eth2",
  "f4:a4:d6:f3:6c:40"
]

$ cat machine.json | jq '.interface_set [] | [.name, .mac_address]'
[
  "eno1",
  "48:dc:2d:0b:ec:bc"
]
[
  "eno2",
  "48:dc:2d:0b:ec:bd"
]
[
  "ens1",
  "f4:a4:d6:f3:6c:3f"
]
[
  "eth0",
  "f4:a4:d6:f3:6c:3e"
]
[
  "eth2",
  "f4:a4:d6:f3:6c:40"
]
[
  "eth3",
  "f4:a4:d6:f3:6c:41"
]
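
The comparison above can also be done mechanically. The following is a small sketch, not part of MAAS, that pairs the two listings by MAC address and reports every MAC whose name differs; the (name, MAC) pairs are copied from the two jq outputs above, and the renamed_by_mac helper is a hypothetical name.

```python
# (name, MAC) pairs from 50-maas-01-commissioning.json (new run).
commissioning = [
    ("eno1", "48:dc:2d:0b:ec:bc"),
    ("eno2", "48:dc:2d:0b:ec:bd"),
    ("ens1", "f4:a4:d6:f3:6c:41"),
    ("eth0", "f4:a4:d6:f3:6c:3e"),
    ("eth1", "f4:a4:d6:f3:6c:3f"),
    ("eth2", "f4:a4:d6:f3:6c:40"),
]
# (name, MAC) pairs from machine.json (what MAAS has stored).
machine = [
    ("eno1", "48:dc:2d:0b:ec:bc"),
    ("eno2", "48:dc:2d:0b:ec:bd"),
    ("ens1", "f4:a4:d6:f3:6c:3f"),
    ("eth0", "f4:a4:d6:f3:6c:3e"),
    ("eth2", "f4:a4:d6:f3:6c:40"),
    ("eth3", "f4:a4:d6:f3:6c:41"),
]


def renamed_by_mac(old, new):
    """Return {mac: (old_name, new_name)} for MACs whose name differs."""
    old_names = {mac: name for name, mac in old}
    return {
        mac: (old_names[mac], name)
        for name, mac in new
        if mac in old_names and old_names[mac] != name
    }


changes = renamed_by_mac(machine, commissioning)
for mac, (old_name, new_name) in sorted(changes.items()):
    print(f"{mac}: {old_name} -> {new_name}")
```

For this data it reports two renames: f4:a4:d6:f3:6c:3f goes from ens1 to eth1, and f4:a4:d6:f3:6c:41 goes from eth3 to ens1. The second one targets the name ens1, which the stored record still assigns to a different MAC on the same machine.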

Changed in maas:
status: New → Triaged
importance: Undecided → High
milestone: none → next
dann frazier (dannf) wrote :

I hit this issue as well. I believe what happened is that I upgraded my server's firmware, and the update fixed an issue with the SMBIOS records that map PCIe slot labels to devices. That change caused systemd to start using slot-based names instead of path-based names for my NICs.

Bug 1940860 describes another scenario where NIC interfaces change between 5.4 and 5.8. I assume this will cause occurrences of this issue to increase if not addressed ahead of 22.04 LTS.

dann frazier (dannf) wrote :
Jerzy Husakowski (jhusakowski) wrote :

Does this issue still happen on MAAS 3.1?
The workaround is to remove the network interfaces in MAAS and re-commission.

Changed in maas:
milestone: next → none
status: Triaged → Incomplete
Launchpad Janitor (janitor) wrote :

[Expired for MAAS because there has been no activity for 60 days.]

Changed in maas:
status: Incomplete → Expired