Concurrent API calls don't get balanced between regiond processes

Bug #2027735 reported by Joao Andre Simioni
This bug affects 5 people
Affects   Status         Importance   Assigned to   Milestone
MAAS      Fix Released   Critical     Jacopo Rota
3.2       Fix Released   Critical     Jacopo Rota
3.3       Fix Released   Critical     Jacopo Rota
3.4       Fix Released   Critical     Jacopo Rota
3.5       Fix Released   Critical     Jacopo Rota

Bug Description

[Problem Description]

I noticed that parallel requests to the MAAS API are all handled
by a single regiond process instead of being balanced across the
multiple spawned processes.

Below is the time for a single machines read request:

ubuntu@cli01:~$ time maas admin machines read > /dev/null

real 0m40.534s
user 0m1.445s
sys 0m0.161s

When the request is run simultaneously from more than one client,
the times increase significantly, and checking the load on the server
shows only one regiond process at high CPU usage:

top -u maas on the server:

top - 16:13:58 up 2:46, 1 user, load average: 1.33, 0.98, 0.95
Tasks: 296 total, 1 running, 294 sleeping, 0 stopped, 1 zombie
%Cpu(s): 12.8 us, 0.0 sy, 0.0 ni, 87.1 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
MiB Mem : 15991.5 total, 3062.5 free, 5723.7 used, 7205.4 buff/cache
MiB Swap: 0.0 total, 0.0 free, 0.0 used. 7811.7 avail Mem

    PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
   1316 maas 20 0 3907428 3.5g 20828 S 100.3 22.3 20:25.14 regiond
    803 maas 20 0 2608 476 408 S 0.0 0.0 0:00.00 sh
    868 maas 20 0 977104 118120 20316 S 0.0 0.7 1:08.37 regiond
    869 maas 20 0 7236 516 452 S 0.0 0.0 0:00.03 tee
    938 maas 20 0 2608 484 412 S 0.0 0.0 0:00.00 sh
    939 maas 20 0 381836 125828 15976 S 0.0 0.8 17:47.29 rackd
    940 maas 20 0 7236 452 388 S 0.0 0.0 0:00.01 tee
   1200 maas 20 0 63480 53720 9900 S 0.0 0.3 0:00.58 maas-common
   1309 maas 20 0 614948 137300 20488 S 0.0 0.8 0:13.89 regiond
   1310 maas 20 0 389428 113756 20356 S 0.0 0.7 0:11.40 regiond
   1312 maas 20 0 388144 113616 20348 S 0.0 0.7 0:11.61 regiond
   1314 maas 20 0 389436 113864 20560 S 0.0 0.7 0:11.35 regiond
   1315 maas 20 0 461876 113704 20400 S 0.0 0.7 0:11.76 regiond
   1317 maas 20 0 461876 113640 20288 S 0.0 0.7 0:11.38 regiond
   1318 maas 20 0 389440 113788 20408 S 0.0 0.7 0:11.17 regiond
   1322 maas 20 0 388148 113556 20384 S 0.0 0.7 0:11.39 regiond
   1324 maas 20 0 388404 113396 20244 S 0.0 0.7 0:11.20 regiond
   1325 maas 20 0 387868 112900 20416 S 0.0 0.7 0:09.33 regiond
   1326 maas 20 0 462140 113856 20480 S 0.0 0.7 0:11.10 regiond
   1435 maas 20 0 384700 4068 3456 S 0.0 0.0 0:00.02 rsyslogd

Notice that only PID 1316 is consuming CPU.

Another point is that all the client requests seem to finish at the same time,
which could indicate that they are all blocked on the same thing.

ubuntu@cli01:~$ time maas admin machines read > /dev/null

real 1m44.542s
user 0m1.189s
sys 0m0.176s

ubuntu@cli02:~$ time maas admin machines read > /dev/null

real 1m44.545s
user 0m1.155s
sys 0m0.206s

ubuntu@cli03:~$ time maas admin machines read > /dev/null

real 1m43.999s
user 0m1.180s
sys 0m0.157s
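
The same pattern can also be reproduced from a single host. Below is a minimal sketch (not one of the runs above; it assumes the maas CLI is already logged in with an admin profile) that fires three reads concurrently and prints each elapsed time:

    import subprocess
    import time
    from concurrent.futures import ThreadPoolExecutor

    def timed_read(i: int) -> str:
        start = time.monotonic()
        subprocess.run(
            ["maas", "admin", "machines", "read"],
            stdout=subprocess.DEVNULL,
            check=True,
        )
        return f"request {i}: {time.monotonic() - start:.1f}s"

    with ThreadPoolExecutor(max_workers=3) as pool:
        for line in pool.map(timed_read, range(3)):
            print(line)  # with the bug, all requests finish at about the same time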

This behavior is seen in MAAS 3.2, 3.3, and 3.4.

Jacopo Rota (r00ta)
Changed in maas:
status: New → Confirmed
Jacopo Rota (r00ta) wrote (last edit):

I was able to reproduce this. Thanks for reporting!

We start multiple regiond processes and each of them runs

    def _makeEndpoint(self):
        """Make the endpoint for the webapp."""

        socket_path = os.getenv(
            "MAAS_HTTP_SOCKET_PATH",
            get_maas_data_path("maas-regiond-webapp.sock"),
        )

        s = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
        if os.path.exists(socket_path):
            os.unlink(socket_path)

        s.bind(socket_path)
        # Use a backlog of 50, which seems to be fairly common.
        s.listen(50)

        # Adopt this socket into Twisted's reactor setting the endpoint.
        endpoint = AdoptedStreamServerEndpoint(reactor, s.fileno(), s.family)
        endpoint.socket = s  # Prevent garbage collection.
        return endpoint

Unfortunately, only one process ends up serving the requests, because there is no load balancing between processes that bind the same unix socket path (ignore the race condition on os.path.exists; I noticed that when we hit the race an exception is thrown and silently ignored).
When we execute
        if os.path.exists(socket_path):
            os.unlink(socket_path)
we remove the socket path of the worker that was previously bound to it, so only the last worker to bind the path receives new connections.
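
To illustrate the effect outside of MAAS, here is a minimal standalone sketch (two threads stand in for regiond workers, and /tmp/demo.sock is a hypothetical path): after the second "worker" unlinks and re-binds the path, every new client connection reaches only that worker, while the first keeps listening on an orphaned socket.

    import os
    import socket
    import threading
    import time

    SOCKET_PATH = "/tmp/demo.sock"  # hypothetical path, only for this demo

    def worker(name):
        s = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
        if os.path.exists(SOCKET_PATH):
            os.unlink(SOCKET_PATH)  # removes the path the previous worker bound
        s.bind(SOCKET_PATH)
        s.listen(50)
        while True:
            conn, _ = s.accept()  # only the last binder ever gets here
            conn.sendall(name.encode())
            conn.close()

    threading.Thread(target=worker, args=("worker-1",), daemon=True).start()
    time.sleep(0.2)
    threading.Thread(target=worker, args=("worker-2",), daemon=True).start()
    time.sleep(0.2)

    for _ in range(5):
        c = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
        c.connect(SOCKET_PATH)
        print(c.recv(32).decode())  # prints "worker-2" every time
        c.close()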

Unless I missed something, the only fix is to create a separate unix socket for each process and let nginx round-robin the requests across those sockets.
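
A rough sketch of what the per-process sockets could look like on the regiond side (the worker_id parameter, the ".<id>" suffix, and the fallback path are assumptions for illustration, not the actual MAAS change); nginx would then list each of these paths as a separate server entry in one upstream block, which round-robins across them by default:

    import os
    import socket

    def make_worker_socket(worker_id: int) -> socket.socket:
        # Derive a per-worker path; the fallback stands in for the
        # get_maas_data_path(...) default used in _makeEndpoint above.
        base = os.getenv(
            "MAAS_HTTP_SOCKET_PATH",
            "/var/lib/maas/maas-regiond-webapp.sock",
        )
        socket_path = f"{base}.{worker_id}"  # e.g. maas-regiond-webapp.sock.3

        s = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
        if os.path.exists(socket_path):
            os.unlink(socket_path)  # safe now: no other worker uses this path
        s.bind(socket_path)
        s.listen(50)
        return s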

Changed in maas:
milestone: 3.5.0 → 3.5.0-beta1
Changed in maas:
status: Fix Committed → Fix Released