Unable to Deploy Machines; get() returned more than one Neighbour -- it returned 2!

Bug #2024242 reported by Sean Feole
20
This bug affects 4 people
Affects Status Importance Assigned to Milestone
MAAS
Fix Released
High
Jacopo Rota
3.4
Fix Released
High
Unassigned

Bug Description

Maas Snap Version: maas 3.3.3-13184-g.3e9972c19 27405 3.3/stable canonical✓ -

This issue has cropped up in the past, I know it's related to the networking. However, it would be beneficial to better understand why this occurs. The MAAS Server is relatively new.

I believe this occurs when a Subnet IP Addresses is previously observed on the same or different machine.

1.) Create a LXD Virtual POD (virsh) and deploy.
2.) Compose a new machine & VM will commission successfully.
3.) Attempt to Deploy the Virtual Machine.

Error: get() returned more than one Neighbour -- it returned 2!

This appears to affect every VM that is composed.

Here are some samples of the regiond.log and maas.log, these messages occur when attempting to deploy.

==> regiond.log <==
2023-06-16 15:51:57 maasserver.preseed: [warn] WARNING: '/snap/maas/27405/etc/maas/preseeds/curtin_userdata' contains deprecated preseed variables. Please remove: main_archive_directory, ports_archive_directory

==> maas.log <==
2023-06-16T15:51:57.980811+00:00 common maas.node: [info] sf2: Status transition from ALLOCATED to DEPLOYING
2023-06-16T15:51:58.165850+00:00 common maas.subnet: [warn] Next IP address to allocate from '192.168.102.0/24' has been observed previously: 192.168.102.6 was last claimed by 7c:e1:ff:02:4d:8d via enp6s0f0 (physical) on common at 2023-06-06 07:56:26.825224.
2023-06-16T15:51:58.342861+00:00 common maas.node: [info] sf2: Aborting DEPLOYING and reverted to ALLOCATED. Unable to power control the node. Please check power credentials.
2023-06-16T15:51:58.350278+00:00 common maas.node: [info] sf2: Status transition from DEPLOYING to ALLOCATED

==> regiond.log <==
2023-06-16 15:51:58 maasserver.websockets.protocol: [critical] Error on request (436) machine.action: get() returned more than one Neighbour -- it returned 2!
 Traceback (most recent call last):
   File "/usr/lib/python3.10/threading.py", line 953, in run
     self._target(*self._args, **self._kwargs)
   File "/snap/maas/27405/lib/python3.10/site-packages/provisioningserver/utils/twisted.py", line 822, in worker
     return target()
   File "/snap/maas/27405/usr/lib/python3/dist-packages/twisted/_threads/_threadworker.py", line 47, in work
     task()
   File "/snap/maas/27405/usr/lib/python3/dist-packages/twisted/_threads/_team.py", line 182, in doWork
     task()
 --- <exception caught here> ---
   File "/snap/maas/27405/usr/lib/python3/dist-packages/twisted/python/threadpool.py", line 244, in inContext
     result = inContext.theWork() # type: ignore[attr-defined]
   File "/snap/maas/27405/usr/lib/python3/dist-packages/twisted/python/threadpool.py", line 260, in <lambda>
     inContext.theWork = lambda: context.call( # type: ignore[attr-defined]
   File "/snap/maas/27405/usr/lib/python3/dist-packages/twisted/python/context.py", line 117, in callWithContext
     return self.currentContext().callWithContext(ctx, func, *args, **kw)
   File "/snap/maas/27405/usr/lib/python3/dist-packages/twisted/python/context.py", line 82, in callWithContext
     return func(*args, **kw)
   File "/snap/maas/27405/lib/python3.10/site-packages/provisioningserver/utils/twisted.py", line 857, in callInContext
     return func(*args, **kwargs)
   File "/snap/maas/27405/lib/python3.10/site-packages/provisioningserver/utils/twisted.py", line 203, in wrapper
     result = func(*args, **kwargs)
   File "/snap/maas/27405/lib/python3.10/site-packages/maasserver/utils/orm.py", line 770, in call_within_transaction
     with connected(), post_commit_hooks:
   File "/snap/maas/27405/lib/python3.10/site-packages/maasserver/utils/orm.py", line 636, in __exit__
     self.fire()
   File "/snap/maas/27405/lib/python3.10/site-packages/provisioningserver/utils/twisted.py", line 203, in wrapper
     result = func(*args, **kwargs)
   File "/snap/maas/27405/lib/python3.10/site-packages/maasserver/utils/asynchronous.py", line 207, in fire
     self._fire_in_reactor(hook).wait(LONGTIME)
   File "/snap/maas/27405/usr/lib/python3/dist-packages/crochet/_eventloop.py", line 198, in wait
     result.raiseException()
   File "/snap/maas/27405/usr/lib/python3/dist-packages/twisted/python/failure.py", line 475, in raiseException
     raise self.value.with_traceback(self.tb)
   File "/snap/maas/27405/usr/lib/python3/dist-packages/twisted/internet/defer.py", line 1656, in _inlineCallbacks
     result = current_context.run(
   File "/snap/maas/27405/usr/lib/python3/dist-packages/twisted/python/failure.py", line 489, in throwExceptionIntoGenerator
     return g.throw(self.type, self.value, self.tb)
   File "/snap/maas/27405/lib/python3.10/site-packages/maasserver/models/node.py", line 5772, in claim_auto_ips
     yield self._claim_auto_ips()
   File "/snap/maas/27405/usr/lib/python3/dist-packages/twisted/internet/defer.py", line 1656, in _inlineCallbacks
     result = current_context.run(
   File "/snap/maas/27405/usr/lib/python3/dist-packages/twisted/python/failure.py", line 489, in throwExceptionIntoGenerator
     return g.throw(self.type, self.value, self.tb)
   File "/snap/maas/27405/lib/python3.10/site-packages/maasserver/models/node.py", line 4699, in _claim_auto_ips
     check_failed, ips_in_use = yield deferToDatabase(
   File "/snap/maas/27405/usr/lib/python3/dist-packages/twisted/python/threadpool.py", line 244, in inContext
     result = inContext.theWork() # type: ignore[attr-defined]
   File "/snap/maas/27405/usr/lib/python3/dist-packages/twisted/python/threadpool.py", line 260, in <lambda>
     inContext.theWork = lambda: context.call( # type: ignore[attr-defined]
   File "/snap/maas/27405/usr/lib/python3/dist-packages/twisted/python/context.py", line 117, in callWithContext
     return self.currentContext().callWithContext(ctx, func, *args, **kw)
   File "/snap/maas/27405/usr/lib/python3/dist-packages/twisted/python/context.py", line 82, in callWithContext
     return func(*args, **kw)
   File "/snap/maas/27405/lib/python3.10/site-packages/provisioningserver/utils/twisted.py", line 857, in callInContext
     return func(*args, **kwargs)
   File "/snap/maas/27405/lib/python3.10/site-packages/provisioningserver/utils/twisted.py", line 203, in wrapper
     result = func(*args, **kwargs)
   File "/snap/maas/27405/lib/python3.10/site-packages/maasserver/utils/orm.py", line 771, in call_within_transaction
     return func_outside_txn(*args, **kwargs)
   File "/snap/maas/27405/lib/python3.10/site-packages/maasserver/utils/orm.py", line 574, in retrier
     return func(*args, **kwargs)
   File "/usr/lib/python3.10/contextlib.py", line 79, in inner
     return func(*args, **kwds)
   File "/snap/maas/27405/lib/python3.10/site-packages/maasserver/models/node.py", line 4654, in process_results
     rack_interface.update_neighbour(
   File "/snap/maas/27405/lib/python3.10/site-packages/maasserver/models/interface.py", line 1557, in update_neighbour
     neighbour, created = Neighbour.objects.get_or_create(
   File "/snap/maas/27405/usr/lib/python3/dist-packages/django/db/models/manager.py", line 85, in manager_method
     return getattr(self.get_queryset(), name)(*args, **kwargs)
   File "/snap/maas/27405/usr/lib/python3/dist-packages/django/db/models/query.py", line 581, in get_or_create
     return self.get(**kwargs), False
   File "/snap/maas/27405/usr/lib/python3/dist-packages/django/db/models/query.py", line 439, in get
     raise self.model.MultipleObjectsReturned(
 maasserver.models.neighbour.MultipleObjectsReturned: get() returned more than one Neighbour -- it returned 2!

2023-06-16 15:51:58 maasserver.dhcp: [info] Successfully configured DHCPv4 on rack controller 'common (m6d7bn)'.
2023-06-16 15:51:58 maasserver.dhcp: [info] Successfully configured DHCPv6 on rack controller 'common (m6d7bn)'.
2023-06-16 15:51:59 maasserver.dhcp: [info] Successfully configured DHCPv4 on rack controller 'common (m6d7bn)'.
2023-06-16 15:51:59 maasserver.dhcp: [info] Successfully configured DHCPv6 on rack controller 'common (m6d7bn)'.

Known Workaround: Edit the machine networking -> Device properties -> Change the IP Allocation from "Auto Assign" to "DHCP"

I still don't have steps to reproduce this issue, I did add an extra ipv6 subnet last week, but have deleted it since, as it was not going to be used.

Related branches

Revision history for this message
Jacopo Rota (r00ta) wrote (last edit ):

Hey Sean! Even if you don't have the steps to reproduce this issue, do you have a database dump?

I think it would be a possible starting point for us to triage this issue

Revision history for this message
Sean Feole (sfeole) wrote :

Hey Jacopo, I will get you a database dump as you requested. You you have any preferred location I place this in the event its too large for launchpad?

Revision history for this message
Sean Feole (sfeole) wrote :

I have reached out to Jacopo and gave him direct access to my MAAS server and supplied him with the database dump.

Jacopo Rota (r00ta)
Changed in maas:
assignee: nobody → Jacopo Rota (r00ta)
status: New → In Progress
importance: Undecided → Medium
Revision history for this message
Jacopo Rota (r00ta) wrote :

In the database I see that there are two records for the same (ip, mac_address)

maas=# select * from maasserver_neighbour where mac_address='7c:e1:ff:02:4d:8d';
 id | created | updated | ip | time | vid | count | mac_address | interface_id
----+-------------------------------+-------------------------------+---------------+------------+-----+-------+-------------------+--------------
 47 | 2023-06-06 07:56:26.819152+00 | 2023-06-06 07:56:26.819152+00 | 192.168.102.6 | 1686038186 | | 1 | 7c:e1:ff:02:4d:8d | 1
 48 | 2023-06-06 07:56:26.825224+00 | 2023-06-06 07:56:26.825224+00 | 192.168.102.6 | 1686038186 | | 1 | 7c:e1:ff:02:4d:8d | 1

I'm investigating where is the racing condition that created those records

Revision history for this message
Jacopo Rota (r00ta) wrote (last edit ):
Download full text (3.1 KiB)

I finally found where the race condition is and I managed to reproduce it locally with a bit of luck.

There are 3 ways to add a neighbour:
1) The region monitors ARP packets on the interface - when a new packet is found, the update_neighbour function is called
2) The rack(s) monitors ARP packets on the interface - when a new packet is found, it sends a new request to the region with rpc to register the neighbour -> the update_neighbour function is called
3) When a new deployment is requested and the machine is configured to use AUTO IP mode, it checks if the first address is being used on the network. If so, the update_neighbour function is called to add a new neighbour so that the address is not picked up at the next retry.

I managed to get (1) and (3) running at the same time, ending up with two records in the database for the same (ip, mac_address) pair and breaking everything.

The logs I've extracted look like the following:

2023-06-21 21:03:49 maasserver.models.node: [info] DEPLOY ACTION START
2023-06-21 21:03:49 maasserver.models.node: [info] LOOK FOR FIRST IP ADDRESS: ['10.0.2.2']
2023-06-21 21:03:49 maasserver.models.node: [info] CHECK IF '10.0.2.2' IS USED
2023-06-21 21:03:49 maasserver.regiondservices.networks_monitoring: [info] NEW REGIOND ATTEMPT TO REGISTER ['10.0.2.2'] AS NEIGHBOUR
2023-06-21 21:03:49 maasserver.models.node: [info] IP IS USED, REGISTER NEW NEIGHBOUR

References:
- update_neighbour function: https://git.launchpad.net/maas/tree/src/maasserver/models/interface.py#n1537
- deploy logic https://git.launchpad.net/maas/tree/src/maasserver/models/node.py#n4611
- regiond daemon for neighbour discovery https://git.launchpad.net/maas/tree/src/maasserver/regiondservices/networks_monitoring.py#n35

High level steps to reproduce (if you are lucky enough to hit the race condition - otherwise you can patch the codebase and add a couple of sleeps here and there to force the reproducer):
1) install maas
2) create an empty VM on LXD with a bridged interface that is being used by maas as well
3) configure DHCP on the interface that the VM uses
4) commission the machine
5) add a new LXD container on the same bridged interface and assign the first IP address of the subnet (let's say 10.0.2.2)
6) Change the reserved IP range on MAAS and leave only 10.0.2.2 and 10.0.2.3 available.
7) deploy an OS to the VM

-> The deploy logic will check if 10.0.2.2 is being used but it also triggers network traffic that will be intercepted by the network monitoring service, which will try to add 10.0.2.2 as neighbour as well.

Tomorrow I'll discuss with the team what are the possible fixes. Enforcing a UNIQUE constraint at database level on the pair (ip, mac_address) might be the cleanest/easiest solution.

---

As a matter of facts, when the race condition does not create two separate records it increases the `count` in the database
```
42 | 2023-06-22 10:56:18.289564+00 | 2023-06-22 10:56:23.381946+00 | 10.0.2.2 | 1687431383 | | 2 | 00:16:3e:bc:ae:ee | 2
```
watching the database you could see that the `count` gets incremented immediately after the record is created (because the deploy routine and the discovery servi...

Read more...

Jacopo Rota (r00ta)
Changed in maas:
status: In Progress → Triaged
Revision history for this message
Jacopo Rota (r00ta) wrote :

Apparently there is already an UNIQUE constraint on the table

```
ALTER TABLE ONLY public.maasserver_neighbour
    ADD CONSTRAINT maasserver_neighbour_interface_id_vid_mac_add_a35f7098_uniq UNIQUE (interface_id, vid, mac_address, ip);
```

but given that the `vid` might be `null` (like in your case), the constraint is not violated.

Revision history for this message
Jacopo Rota (r00ta) wrote :

@sean the quickest fix for you should be to go to Settings -> Network discovery on the UI and click `Clear all discoveries`

Jacopo Rota (r00ta)
Changed in maas:
importance: Medium → High
Changed in maas:
milestone: none → 3.5.0
status: Triaged → Fix Committed
Revision history for this message
Alan Baghumian (alanbach) wrote :

I have seen this issue happening in different MAAS versions even 2.9! It would be great if we can backport the fix to other supported versions of MAAS as well.

Changed in maas:
milestone: 3.5.0 → 3.5.0-beta1
status: Fix Committed → Fix Released
Revision history for this message
Bartosz Woronicz (mastier1) wrote :

Could we have it backported to 3.4 ? Do you have an estimate when will it land ?

Revision history for this message
Jacopo Rota (r00ta) wrote :

it's already backported. We are waiting for the solqa env to be able to start the new releases but I don't have an ETA

Revision history for this message
Bartosz Woronicz (mastier1) wrote :

Hi, Jacopo, thank you.

I asked the SolQA team and as per the latest update from them they are waiting for you to request it actually.

I would appreciate if could help me follow up on that.

Revision history for this message
Bartosz Woronicz (mastier1) wrote :

Thank you very much for resolving that issue. That was long standing one.

This bug is the predecessor
https://bugs.launchpad.net/bugs/2005243

To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.