regiond at 100% CPU after UI reconnect causing API errors

Bug #2070304 reported by Elliott Rabe
This bug affects 2 people
Affects  Status         Importance  Assigned to     Milestone
MAAS     Fix Committed  High        Peter Makowski
3.4      Fix Committed  High        Unassigned
3.5      Fix Committed  High        Unassigned

Bug Description

Version and build:
================================
snap list maas
Name Version Rev Tracking Publisher Notes
maas 3.4.2-14353-g.5a5221d57 35359 3.4/stable canonical✓ -

Interface Used:
================================
UI/API
I believe use of the UI is integral to creating the problem. Once created, API usage can be disrupted.

What happened:
================================
When the problem manifests, a python process on the MaaS server will peg at 100% CPU utilization for an extended period and postgres activity will be noticeably increased. The python process corresponds to regiond. The excessive utilization will disrupt other MaaS requests inducing intermittent API request timeouts and errors.

This problem originally presented on our production server and it seemed as though regiond was in some kind of infinite retry loop, querying the database as fast as possible, indefinitely. Once hung, the regiond process would remain in this state for days causing intermittent errors until MaaS services were restarted. Things would then behave normally for days or a week, and the problem would manifest again.

After the problem was reproduced on an isolated, unused MaaS server with the TestDB, it appears that the regiond CPU utilization will eventually fall back from 100% to a more normal value, though database usage still appears to be elevated above normal.

I have added the following additional debugging flags to regiond.conf in an attempt to diagnose further.
cat /var/snap/maas/current/regiond.conf
...
debug: True
debug_queries: True
debug_http: True

I have done some review of the MaaS code, and although my knowledge of how all the pieces work together is increasing, there is still a lot about the websocket usage, handlers and notifications that I don't fully understand. I suspect that the increased CPU is due to an improper UI re-connection strategy that allows multiple websocket connections to be created to service one client UI. Multiple connections from the same client, compounded by the general inefficiency of the django/db/data-retrieval path needed to supply the UI, may push the server past its scalability limits by creating unnecessary traffic, especially if the error happens to multiple clients at once or these errant connections persist for long periods.
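
To make the suspicion concrete, the failure mode I have in mind looks roughly like the following hypothetical sketch (browser-style TypeScript, not the actual MaaS UI code; the payload shape is an assumption):

// Hypothetical reconnect loop: every retry interval opens a brand-new
// WebSocket without closing or checking the previous attempt. While the
// network is down the browser keeps each attempt pending, so when
// connectivity returns they all complete their handshakes at once.
function startReconnectLoop(url: string): void {
  setInterval(() => {
    const ws = new WebSocket(url); // no guard against pending attempts
    ws.onopen = () => {
      // Each socket that eventually opens issues its own initial data
      // requests, multiplying backend traffic (illustrative payload).
      ws.send(JSON.stringify({ method: "machine.list", type: 0 }));
    };
  }, 5000);
}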

I suspect this is the cause for two main reasons:
1) The Firefox Web Developer Tools show multiple /ws URL GET requests outstanding simultaneously. In the test scenario above, it is not uncommon to see a dozen outstanding GET requests at a time before any start failing with NS_ERROR_NET_TIMEOUT.

2) The regiond.log shows multiple near-simultaneous connection openings shortly after network connectivity is restored, even though only one browser tab/URL access was involved. From my limited understanding of the MaaS code, I expect this to result in unnecessary backend traffic and server utilization.

cat /var/snap/maas/common/log/regiond.log* | grep "maasserver.websockets.websockets"
2024-06-24 17:36:40 maasserver.websockets.websockets: [debug] Opening connection with UNIXAddress(None)
2024-06-24 17:36:41 maasserver.websockets.websockets: [debug] Opening connection with UNIXAddress(None)
2024-06-24 17:36:41 maasserver.websockets.websockets: [debug] Opening connection with UNIXAddress(None)
2024-06-24 17:36:42 maasserver.websockets.websockets: [debug] Opening connection with UNIXAddress(None)
2024-06-24 17:36:42 maasserver.websockets.websockets: [debug] Opening connection with UNIXAddress(None)
2024-06-24 17:36:42 maasserver.websockets.websockets: [debug] Opening connection with UNIXAddress(None)
2024-06-24 17:36:42 maasserver.websockets.websockets: [debug] Opening connection with UNIXAddress(None)
2024-06-24 17:36:43 maasserver.websockets.websockets: [debug] Opening connection with UNIXAddress(None)
2024-06-24 17:36:43 maasserver.websockets.websockets: [debug] Opening connection with UNIXAddress(None)
2024-06-24 17:36:43 maasserver.websockets.websockets: [debug] Opening connection with UNIXAddress(None)
2024-06-24 17:36:43 maasserver.websockets.websockets: [debug] Opening connection with UNIXAddress(None)
2024-06-24 17:36:43 maasserver.websockets.websockets: [debug] Opening connection with UNIXAddress(None)
2024-06-24 17:36:48 maasserver.websockets.websockets: [debug] Opening connection with UNIXAddress(None)
2024-06-24 17:36:48 maasserver.websockets.websockets: [debug] Opening connection with UNIXAddress(None)
2024-06-24 17:36:58 maasserver.websockets.websockets: [debug] Opening connection with UNIXAddress(None)
2024-06-24 17:36:58 maasserver.websockets.websockets: [debug] Opening connection with UNIXAddress(None)
2024-06-24 17:36:58 maasserver.websockets.websockets: [debug] Opening connection with UNIXAddress(None)
2024-06-24 17:36:58 maasserver.websockets.websockets: [debug] Opening connection with UNIXAddress(None)

It is somewhat interesting to note that once regiond is in the high CPU utilization state, closing the client browser seems to have no immediate remedial effect. If these websockets are indeed the cause of the problem, they appear to remain active between intermediate MaaS components (i.e. nginx and regiond?) without a running MaaS UI, requiring a regiond or MaaS restart to clean things up.

The UI reconnect path is expected to end up with at most one websocket connection servicing a single UI client. On a production system, the extra connections can create excessive load that leads to API request timeouts and other confusing errors.
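
For comparison, a reconnect strategy that can never hold more than one websocket per client might look like this minimal sketch (my assumption of a fix direction, not a proposed patch):

// At most one socket exists at any time: a new attempt is only scheduled
// once the previous socket has definitively opened and closed, or failed
// (onclose fires in both cases).
class SingleSocketReconnector {
  private ws: WebSocket | null = null;
  private delayMs = 1000; // backoff starting point

  constructor(private url: string) {}

  connect(): void {
    if (this.ws !== null) return; // guard: never a second socket
    const ws = new WebSocket(this.url);
    this.ws = ws;
    ws.onopen = () => {
      this.delayMs = 1000; // reset backoff after a successful connect
    };
    ws.onclose = () => {
      this.ws = null; // the old socket is fully gone before any retry
      setTimeout(() => this.connect(), this.delayMs);
      this.delayMs = Math.min(this.delayMs * 2, 30000); // exponential backoff
    };
  }
}

Used as new SingleSocketReconnector("ws://<ip>:5240/MAAS/ws?csrftoken=<blah>").connect(), this would cap the reconnect traffic at one handshake per backoff interval, regardless of how long the outage lasts.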

Steps to reproduce:
================================
+Created a new Ubuntu 22.04 install on a dedicated VM

+Installed MaaS and MaaS Postgres TestDB
  Steps include:
    snap install --channel=3.4 maas
    snap install maas-test-db
    maas init region+rack --database-uri maas-test-db:///
    maas createadmin --username admin --password <password> --email <email>

+Went to the MaaS UI and followed the install setup steps, selecting Ubuntu 22.04 images, generally accepting defaults, and skipping user setup and most everything else

+From a Firefox browser (which happened to be version 118.0.1), navigate to the MaaS UI and open web developer tools.
    -Using filtering parameters when querying machines has been shown to increase the likelihood and duration of the problem, especially on a real MaaS system: http://<ip>:5240/MAAS/r/machines?q=<username>

+Drop the network connection between the VM and the MaaS UI
    -Originally this happened through VPN disconnects when running against our production MaaS server
    -For reproduction convenience on a test VM, it proved more efficient to give the VM two network interfaces and simply down the interface used between the MaaS UI and the server, leaving the other interface available for ssh access to the VM for debugging

+After ~15-25 minutes, the MaaS UI will discover connectivity issues, declare failure and display a "Trying to reconnect..." overlay

+Wait another hour or so. Throughout this time, observe periodic GET request attempts, presumably attempting to reconnect...
GET ws://<ip>:5240/MAAS/ws?csrftoken=<blah>
GET ws://<ip>:5240/MAAS/ws?csrftoken=<blah>
GET ws://<ip>:5240/MAAS/ws?csrftoken=<blah>
...

+Restore the network connection between the VM and the MaaS UI

+If a python3 regiond process pegs a CPU at 100% for many minutes and database usage is noticeably elevated, the problem has been reproduced. Attempts to use the MaaS API during this time will result in intermittent errors, seemingly depending on whether the request's socket is assigned to the disrupted regiond worker. The more machines/resources/load on the MaaS server, and the more outstanding GET requests at the moment connectivity is restored, the longer the disruption seems to last.
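
As a rough way to quantify the disruption during reproduction, a small polling script along these lines can be used (a sketch only, assuming Node 18+ for the global fetch API; the URL and thresholds are placeholders to adjust):

// Poll the MaaS web root and report slow or failed responses, which
// correlate with a pegged regiond worker. Requires Node 18+.
const MAAS_URL = "http://<ip>:5240/MAAS/"; // substitute the region IP

async function pollOnce(timeoutMs: number): Promise<void> {
  const started = Date.now();
  try {
    const res = await fetch(MAAS_URL, { signal: AbortSignal.timeout(timeoutMs) });
    const elapsed = Date.now() - started;
    if (elapsed > 2000) {
      console.warn(`slow response: ${elapsed} ms (HTTP ${res.status})`);
    }
  } catch (err) {
    console.error(`request failed after ${Date.now() - started} ms: ${err}`);
  }
}

setInterval(() => void pollOnce(10000), 5000); // one probe every 5 seconds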

Revision history for this message
Jacopo Rota (r00ta) wrote :

Hi

Thank you very much for all the investigations and the very detailed bug report!

I have a first question: in the steps to reproduce instructions you said

> The more machines/resources/load on the MaaS server, and the more outstanding GET requests when network connectivity is restored seem to increase the duration of the disruption.

Did you manage to reproduce this issue on a MAAS instance with nothing else going on (no deployments, no external API calls to MAAS, or similar)? How many machines did you create on this test instance?

Revision history for this message
Elliott Rabe (erqtm) wrote :

> Did you manage to reproduce this issue on a MAAS instance with nothing going on (no deployments, no external API call to MAAS and similar)?
Yes I did. Since setting up the isolated test environment there are no deployments and I have been doing no API access.

> How many machines did you create on this test instance?
No MaaS machines. Browsing the UI to the machines endpoint shows nothing. I was using this endpoint with my username as a filter simply because on the production server that has hundreds of machines, this URL was where I observed the problem. When I switched to the test server, I continued to use the same endpoint, despite no machines existing.

Despite the lack of content, leaving the browser in a disconnected state for around a half hour after the "Trying to reconnect..." overlay appears seems to create around a dozen and a half connections and keeps regiond pegged for several minutes. I am reasonably sure that if I left it disconnected longer (i.e. a few hours), regiond would be pegged for longer, despite an empty MaaS test system.

Revision history for this message
Jacopo Rota (r00ta) wrote :

Thank you very much for the additional details! Tomorrow I'll try to reproduce it and I'll get back to you

Jacopo Rota (r00ta)
Changed in maas:
assignee: nobody → Jacopo Rota (r00ta)
status: New → In Progress
Revision history for this message
Jacopo Rota (r00ta) wrote :

I confirm this is a bug and that I was able to reproduce it.
Thank you again, Elliott, for reporting this and for the time you spent investigating it. We plan to fix this soon and include the fix in the next minor/major release.

Changed in maas:
importance: Undecided → High
status: In Progress → Triaged
milestone: none → 3.6.0
assignee: Jacopo Rota (r00ta) → nobody
Revision history for this message
Jacopo Rota (r00ta) wrote :

Apparently the UI is buffering a lot of commands while disconnected and flooding the backend with them when the connection comes back.
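
In other words, the behaviour appears to be something like this illustrative sketch (not the actual maas-ui source):

// Messages produced while the socket is down are queued, and the entire
// backlog is replayed the moment the connection is re-established.
const backlog: string[] = [];

function send(ws: WebSocket, message: string): void {
  if (ws.readyState !== WebSocket.OPEN) {
    backlog.push(message); // grows without bound while disconnected
    return;
  }
  ws.send(message);
}

function onReconnect(ws: WebSocket): void {
  // Hours of queued polling messages hit regiond in a single burst.
  while (backlog.length > 0) {
    ws.send(backlog.shift()!);
  }
}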

Revision history for this message
Peter Makowski (petermakowski) wrote :

fix: limit WebSocket message backlog on reconnection
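
A simplified sketch of the idea behind the fix (the shape and limit are assumed for illustration, not the real patch):

// Cap how many messages survive a disconnect, keeping only the most
// recent ones so a long outage cannot turn into a flood on reconnect.
const MAX_BACKLOG = 20; // assumed limit, for illustration only

function enqueue(backlog: string[], message: string): void {
  backlog.push(message);
  if (backlog.length > MAX_BACKLOG) {
    backlog.splice(0, backlog.length - MAX_BACKLOG); // drop the oldest
  }
}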

Jacopo Rota (r00ta)
Changed in maas:
assignee: nobody → Peter Makowski (petermakowski)
status: Triaged → Fix Committed
Revision history for this message
Jacopo Rota (r00ta) wrote :

The fix was not included in 3.4.3 because the git submodules failed to be updated before the release. It will be included in 3.4.4.
