Concurrent API calls don't get balanced between regiond processes

Bug #2027735 reported by Joao Andre Simioni
This bug affects 5 people
Affects   Status         Importance   Assigned to   Milestone
MAAS      Fix Released   Critical     Jacopo Rota
3.2       Fix Released   Critical     Jacopo Rota
3.3       Fix Released   Critical     Jacopo Rota
3.4       Fix Released   Critical     Jacopo Rota
3.5       Fix Released   Critical     Jacopo Rota

Bug Description

[Problem Description]

I noticed that parallel requests to the MAAS API are all handled
by a single regiond process instead of being balanced across the
multiple spawned processes.

Below is the time for a single machines read request:

ubuntu@cli01:~$ time maas admin machines read > /dev/null

real 0m40.534s
user 0m1.445s
sys 0m0.161s

When the request is run simultaneously from more than one client,
the times increase significantly, and checking the load on the server
shows only one regiond process at high CPU usage:

top -u maas on the server:

top - 16:13:58 up 2:46, 1 user, load average: 1.33, 0.98, 0.95
Tasks: 296 total, 1 running, 294 sleeping, 0 stopped, 1 zombie
%Cpu(s): 12.8 us, 0.0 sy, 0.0 ni, 87.1 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
MiB Mem : 15991.5 total, 3062.5 free, 5723.7 used, 7205.4 buff/cache
MiB Swap: 0.0 total, 0.0 free, 0.0 used. 7811.7 avail Mem

    PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
   1316 maas 20 0 3907428 3.5g 20828 S 100.3 22.3 20:25.14 regiond
    803 maas 20 0 2608 476 408 S 0.0 0.0 0:00.00 sh
    868 maas 20 0 977104 118120 20316 S 0.0 0.7 1:08.37 regiond
    869 maas 20 0 7236 516 452 S 0.0 0.0 0:00.03 tee
    938 maas 20 0 2608 484 412 S 0.0 0.0 0:00.00 sh
    939 maas 20 0 381836 125828 15976 S 0.0 0.8 17:47.29 rackd
    940 maas 20 0 7236 452 388 S 0.0 0.0 0:00.01 tee
   1200 maas 20 0 63480 53720 9900 S 0.0 0.3 0:00.58 maas-common
   1309 maas 20 0 614948 137300 20488 S 0.0 0.8 0:13.89 regiond
   1310 maas 20 0 389428 113756 20356 S 0.0 0.7 0:11.40 regiond
   1312 maas 20 0 388144 113616 20348 S 0.0 0.7 0:11.61 regiond
   1314 maas 20 0 389436 113864 20560 S 0.0 0.7 0:11.35 regiond
   1315 maas 20 0 461876 113704 20400 S 0.0 0.7 0:11.76 regiond
   1317 maas 20 0 461876 113640 20288 S 0.0 0.7 0:11.38 regiond
   1318 maas 20 0 389440 113788 20408 S 0.0 0.7 0:11.17 regiond
   1322 maas 20 0 388148 113556 20384 S 0.0 0.7 0:11.39 regiond
   1324 maas 20 0 388404 113396 20244 S 0.0 0.7 0:11.20 regiond
   1325 maas 20 0 387868 112900 20416 S 0.0 0.7 0:09.33 regiond
   1326 maas 20 0 462140 113856 20480 S 0.0 0.7 0:11.10 regiond
   1435 maas 20 0 384700 4068 3456 S 0.0 0.0 0:00.02 rsyslogd

Notice that only PID 1316 is consuming CPU.

Another point is that all the client requests seem to finish at the same time,
which could indicate that they are all blocked on the same thing.

ubuntu@cli01:~$ time maas admin machines read > /dev/null

real 1m44.542s
user 0m1.189s
sys 0m0.176s

ubuntu@cli02:~$ time maas admin machines read > /dev/null

real 1m44.545s
user 0m1.155s
sys 0m0.206s

ubuntu@cli03:~$ time maas admin machines read > /dev/null

real 1m43.999s
user 0m1.180s
sys 0m0.157s
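
The same pattern can also be reproduced from a single host. Below is a minimal sketch (not one of the runs above; it assumes the maas CLI is already logged in with an admin profile) that fires three reads concurrently and prints each elapsed time:

    import subprocess
    import time
    from concurrent.futures import ThreadPoolExecutor

    def timed_read(i: int) -> str:
        start = time.monotonic()
        subprocess.run(
            ["maas", "admin", "machines", "read"],
            stdout=subprocess.DEVNULL,
            check=True,
        )
        return f"request {i}: {time.monotonic() - start:.1f}s"

    with ThreadPoolExecutor(max_workers=3) as pool:
        for line in pool.map(timed_read, range(3)):
            print(line)  # with the bug, all requests finish at about the same time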

This behavior is seen in MAAS 3.2, 3.3, and 3.4.

Jacopo Rota (r00ta)
Changed in maas:
status: New → Confirmed
Jacopo Rota (r00ta) wrote (last edit):

I was able to reproduce this. Thanks for reporting!

We start multiple regiond processes and each of them runs

    def _makeEndpoint(self):
        """Make the endpoint for the webapp."""

        socket_path = os.getenv(
            "MAAS_HTTP_SOCKET_PATH",
            get_maas_data_path("maas-regiond-webapp.sock"),
        )

        s = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
        if os.path.exists(socket_path):
            os.unlink(socket_path)

        s.bind(socket_path)
        # Use a backlog of 50, which seems to be fairly common.
        s.listen(50)

        # Adopt this socket into Twisted's reactor setting the endpoint.
        endpoint = AdoptedStreamServerEndpoint(reactor, s.fileno(), s.family)
        endpoint.socket = s  # Prevent garbage collection.
        return endpoint

Unfortunately, only one process ends up serving the requests, because there is no load balancing between processes that bind the same unix socket path (ignore the race condition on os.path.exists; I noticed that when we hit the race an exception is thrown and silently ignored).
When we execute
        if os.path.exists(socket_path):
            os.unlink(socket_path)
we remove the socket path of the worker that was previously bound to it, so only the last worker to bind the path receives new connections.
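
To illustrate the effect outside of MAAS, here is a minimal standalone sketch (two threads stand in for regiond workers, and /tmp/demo.sock is a hypothetical path): after the second "worker" unlinks and re-binds the path, every new client connection reaches only that worker, while the first keeps listening on an orphaned socket.

    import os
    import socket
    import threading
    import time

    SOCKET_PATH = "/tmp/demo.sock"  # hypothetical path, only for this demo

    def worker(name):
        s = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
        if os.path.exists(SOCKET_PATH):
            os.unlink(SOCKET_PATH)  # removes the path the previous worker bound
        s.bind(SOCKET_PATH)
        s.listen(50)
        while True:
            conn, _ = s.accept()  # only the last binder ever gets here
            conn.sendall(name.encode())
            conn.close()

    threading.Thread(target=worker, args=("worker-1",), daemon=True).start()
    time.sleep(0.2)
    threading.Thread(target=worker, args=("worker-2",), daemon=True).start()
    time.sleep(0.2)

    for _ in range(5):
        c = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
        c.connect(SOCKET_PATH)
        print(c.recv(32).decode())  # prints "worker-2" every time
        c.close()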

Unless I missed something, the only fix is to create a separate unix socket for each process and let nginx round-robin the requests across those sockets.
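
A rough sketch of what the per-process sockets could look like on the regiond side (the worker_id parameter, the ".<id>" suffix, and the fallback path are assumptions for illustration, not the actual MAAS change); nginx would then list each of these paths as a separate server entry in one upstream block, which round-robins across them by default:

    import os
    import socket

    def make_worker_socket(worker_id: int) -> socket.socket:
        # Derive a per-worker path; the fallback stands in for the
        # get_maas_data_path(...) default used in _makeEndpoint above.
        base = os.getenv(
            "MAAS_HTTP_SOCKET_PATH",
            "/var/lib/maas/maas-regiond-webapp.sock",
        )
        socket_path = f"{base}.{worker_id}"  # e.g. maas-regiond-webapp.sock.3

        s = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
        if os.path.exists(socket_path):
            os.unlink(socket_path)  # safe now: no other worker uses this path
        s.bind(socket_path)
        s.listen(50)
        return s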

Changed in maas:
milestone: 3.5.0 → 3.5.0-beta1
Changed in maas:
status: Fix Committed → Fix Released