rack-controller 50-maas-01-commissioning stuck in "Pending" when using numerous vlans

Bug #2114255 reported by Gaetan Gouzi
Affects: MAAS
Status: Triaged
Importance: High
Assigned to: Unassigned

Bug Description

Describe the bug:

We are experiencing an issue deploying MAAS with a large number of VLANs (30). Rack-controller commissioning stays in "Pending" for a long time: it takes around 90 minutes to commission each infra node. Deploying with fewer VLANs (5) is fast (less than 50 seconds).

Steps to reproduce:

Deploy MAAS in HA using many VLANs, with a bridge for each (30 in my case: 15 routable VLANs with a gateway and 15 non-routable L2 VLANs).

Expected behavior (what should have happened?):

- Rack-controller commissioning is "Pending" for a few seconds and quickly changes to "Passed"

Actual behavior (what actually happened?):

- Rack-controller commissioning stays "Pending" for a long time. It takes several hours to change to "Passed"

MAAS version and installation type (deb, snap):

- MAAS snap 3.5.6

- 22.04

- 3 nodes in HA

- airgapped environment

Additional context:

- Reproduced on a lab at a lower scale. Similar logs and symptoms are observed

Jacopo Rota (r00ta) wrote:

TLDR

- the good news: we know what's happening
- the bad news: it's due to bad design and performance problems in MAAS. Not something we can fix in a bugfix patch.

Long answer trying to explain what happens:

- we process the machine resources of the region controllers every 30 seconds (but we wait for the previous execution to complete before starting the new one)
- each run puts heavy load on the database with many thousands of queries, especially when there are many bridge and VLAN interfaces. The more interfaces you have, the more queries we execute.
- given the point above, the processing is very slow and can take several minutes
- when we start processing the machine resources we open a transaction that is kept open for several minutes
- when processing interfaces, we initiate a nested transaction. However, the node table is quite large and gets updated in multiple parts of the system, for instance whenever an RPC connection is established (which happens frequently) and while the machine resources are being processed. As a result, by the time we attempt to create a savepoint at the end of interface processing, it's likely to fail due to concurrent modifications to the node table.
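
To illustrate the first points, here is a minimal sketch of how the 30-second cadence degrades once a single run takes minutes. It assumes a simplified model (a run is due 30 seconds after the previous one started, but never begins before the previous one has finished); the function name and the durations are illustrative, not taken from the MAAS code.

```python
def run_start_times(durations, interval=30.0):
    """Return the start time (in seconds) of each successive run.

    Assumed model, not MAAS internals: a run is due `interval` seconds
    after the previous run started, but it never starts before the
    previous run has completed.
    """
    starts = []
    start = 0.0
    for duration in durations:
        starts.append(start)
        # The next run waits for whichever comes later: the timer or completion.
        start = max(start + interval, start + duration)
    return starts


# Fast runs keep the 30-second cadence.
print(run_start_times([5, 5, 5]))        # [0.0, 30.0, 60.0]
# Runs that take 5 minutes each stretch the period to the run duration.
print(run_start_times([300, 300, 300]))  # [0.0, 300.0, 600.0]
```

Under this model the effective period equals the run duration as soon as a run exceeds 30 seconds, so the transaction opened by each run is open almost continuously.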

The proper fix requires rethinking many aspects in depth and allocating time for it in the roadmap.

The best workaround I can come up with is:

1) Install MAAS on the nodes.
2) To reduce the probability of a conflict, stop all the region controllers and rack controllers except one. The single remaining node should eventually be able to process its machine resources.
3) Stop that node and start another controller. As above, wait until the network config is updated. Repeat for every controller.
4) When you are done, start all the controllers.

Finally, to reduce the stress on the database, you can run a recurring job on one controller node that executes

```
UPDATE maasserver_node
SET skip_networking = true,
    skip_storage = true
WHERE node_type >= 2;
```

every minute or so. Of course, if the network config of one controller node changes, you have to stop this job and repeat the whole process from the beginning.
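
One possible shape for such a job is sketched below. It assumes the `psql` client is available on the controller and that you have a connection string for the MAAS database; the DSN in the usage comment is a placeholder, and the function names are made up for this example.

```python
import subprocess
import time

# The statement from the comment above, kept as a single line for psql -c.
SKIP_SQL = (
    "UPDATE maasserver_node "
    "SET skip_networking = true, skip_storage = true "
    "WHERE node_type >= 2;"
)


def psql_command(dsn):
    """Build the psql invocation; `dsn` is a connection string you supply."""
    return ["psql", "-d", dsn, "-c", SKIP_SQL]


def run_periodically(dsn, period=60.0, iterations=None):
    """Run the UPDATE every `period` seconds; loops forever if iterations is None."""
    count = 0
    while iterations is None or count < iterations:
        # check=True raises if psql fails, so a broken job does not stay silent.
        subprocess.run(psql_command(dsn), check=True)
        count += 1
        time.sleep(period)


# Usage (placeholder DSN, adjust to your database):
# run_periodically("postgresql://user@db-host/maas-database")
```

Stopping the job is then a matter of interrupting the process, which matters because the job has to be stopped whenever a controller's network config changes.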

Changed in maas:
status: New → Triaged
importance: Undecided → High
milestone: none → 3.7.x
tags: added: bug-council
tags: removed: bug-council