RPC RegisterRackController can saturate all the database threads, causing region controllers to become unresponsive for minutes

Bug #2130237 reported by Jacopo Rota
This bug affects 1 person
Affects | Status | Importance | Assigned to | Milestone
MAAS | Fix Committed | High | Jacopo Rota |
3.3 | Triaged | High | Jacopo Rota |
3.4 | Fix Committed | High | Jacopo Rota |
3.5 | Fix Released | High | Jacopo Rota |
3.6 | Fix Committed | High | Jacopo Rota |
3.7 | Fix Released | High | Jacopo Rota |

Bug Description

Describe the bug:

By design, rack controllers should connect to every region worker. If a region has 8 workers, every rack controller should establish a connection to each worker (8 connections in total).

In 3.3, logic to scale up RPC connections was added: when the existing connections are busy, the rack tries to establish new ephemeral connections. However, this is known to be affected by a bug, https://bugs.launchpad.net/maas/+bug/2074122, which causes the rack to keep adding RPC connections.

When the region handles the RegisterRackController RPC command, it will try to "register" the rack controller https://github.com/canonical/maas/blob/c91dd7ecf47ea3d266288949abfc474327b46cce/src/maasserver/rpc/regionservice.py#L608

```
            rack_controller = yield deferToDatabase(
                rackcontrollers.register,
                system_id=system_id,
                hostname=hostname,
                interfaces=interfaces,
                url=url,
                is_loopback=is_loopback,
                version=version,
            )
```

and in particular, the register function is synchronised on the startup lock:

```
@synchronous
@with_connection
@synchronised(locks.startup)
@transactional
def register(
    system_id=None,
    hostname="",
    interfaces=None,
    url=None,
    is_loopback=None,
    version=None,
):
```

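In some cases the transaction waits for other transactions to be committed, so the startup lock is not released in the meantime. Every other register call then blocks on that lock while still occupying a database thread, which is how the database threads end up saturated.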
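The effect described here, one shared lock held across a slow transaction inside a bounded thread pool, can be sketched in isolation. This is a minimal illustration, not MAAS code: `startup_lock`, `register` and `measure_unrelated_query_delay` are hypothetical stand-ins, and a plain `ThreadPoolExecutor` takes the place of the database thread pool.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-in for locks.startup: one lock shared by all registrations.
startup_lock = threading.Lock()


def register(slow_seconds):
    """Hypothetical stand-in for rackcontrollers.register."""
    with startup_lock:
        # Simulates a transaction that cannot commit quickly.
        time.sleep(slow_seconds)


def measure_unrelated_query_delay(pool_size=4, registrations=8, slow_seconds=0.05):
    """Return how long an unrelated task waits behind queued registrations."""
    with ThreadPoolExecutor(max_workers=pool_size) as pool:
        for _ in range(registrations):
            pool.submit(register, slow_seconds)
        submitted = time.monotonic()
        # This task needs no lock, yet it cannot start until a thread frees up.
        started = pool.submit(time.monotonic).result()
    return started - submitted
```

Because the lock serialises the registrations, the unrelated task has to wait for several of them to finish one after another, even though it never touches the lock itself. With more registrations than threads, the wait grows with the number of queued registrations.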

We tracked the problem down to provisioningserver.utils.network.get_all_interfaces_definition, which is executed from different processes. When multiple executions run in parallel, it becomes very slow, because the function lists every entry under /proc and reads each process's cmdline. It can take 1 to 2 minutes, and it prevents the transaction from being committed.
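To show why this kind of scan scales with the number of processes on the host, here is a rough sketch of the access pattern (one directory listing plus one open and read per process). It is not the MAAS implementation; `read_all_cmdlines` and its `proc_root` parameter are hypothetical names used only for illustration.

```python
import os


def read_all_cmdlines(proc_root="/proc"):
    """Return {pid: argv list} for every numeric entry under proc_root."""
    cmdlines = {}
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # skip non-process entries such as "meminfo"
        try:
            with open(os.path.join(proc_root, entry, "cmdline"), "rb") as f:
                raw = f.read()
        except OSError:
            continue  # the process exited, or its cmdline is not readable
        # Arguments are NUL-separated; kernel threads have an empty cmdline.
        cmdlines[int(entry)] = [
            arg.decode(errors="replace") for arg in raw.split(b"\0") if arg
        ]
    return cmdlines
```

On a controller with many processes, each call performs one file read per process, and several callers running at the same time all repeat the same work.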

Steps to reproduce:

Pick a couple of controllers running a lot of processes and install MAAS in HA. Add some debug statements around the register call and you should see the slowdown. In some environments the bug is triggered frequently and the deployment ends up in a very bad state.
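One possible shape for such debug statements is a small timing helper wrapped around the suspect call. This is only a sketch: the `timed` helper, the `rpc-debug` logger name and the one-second threshold are assumptions, not anything that exists in MAAS.

```python
import logging
import time
from contextlib import contextmanager

log = logging.getLogger("rpc-debug")  # hypothetical logger name


@contextmanager
def timed(label, warn_after=1.0):
    """Log how long the wrapped block took; warn at or above warn_after seconds."""
    start = time.monotonic()
    try:
        yield
    finally:
        elapsed = time.monotonic() - start
        level = logging.WARNING if elapsed >= warn_after else logging.DEBUG
        log.log(level, "%s took %.3fs", label, elapsed)
```

Wrapping a call as `with timed("get_all_interfaces_definition"): ...` would then emit a warning whenever the call is slow, which makes the stalls easy to spot in the logs.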

Expected behavior (what should have happened?):

MAAS should be simply up and running without downtime :)

Actual behavior (what actually happened?):

Every 10 to 15 minutes, the regions become unresponsive for a few minutes (roughly 2 to 3 minutes).

MAAS version and installation type (deb, snap):

From 3.3 on

MAAS setup (HA, single node, multiple regions/racks):

Host OS distro and version:

Any Ubuntu version

Additional context:

Jacopo Rota (r00ta)
Changed in maas:
importance: Undecided → High
assignee: nobody → Jacopo Rota (r00ta)
milestone: none → 3.8.x
Jacopo Rota (r00ta)
description: updated
Jacopo Rota (r00ta)
description: updated
Jacopo Rota (r00ta)
Changed in maas:
status: Triaged → In Progress
Changed in maas:
status: In Progress → Fix Committed