Intermittent slowness on MAAS 2.8.1

Bug #1892543 reported by Derek DeMoss
12
This bug affects 2 people
Affects Status Importance Assigned to Milestone
MAAS
Fix Released
Undecided
Unassigned
maas-ui
Fix Released
Unknown

Bug Description

Hey there,
We've been seeing intermittent slow page loads on our MAAS server.
Sometimes on the machines list, sometimes on the KVM list, sometimes on the Images list.
When we hit a very slow patch it can take 20-40 seconds to load one of the given pages.

Here's our setup:
3x SuperMicro X9DRT (blades), running Ubuntu 18.04 Server, 1x Samsung 850 EVO 500GB (boot drive), 2x Samsung 860 EVOs (mirrored zpool for LXDs), 64GB RAM, 10G LAN connection
3x LXDs each with a region+rack server (snap version)
3x LXDs each running PostgreSQL (deployed with juju using the postgresql charm version 208 [postgresql version 10.12]) as the database for MAAS.
176x Hardware Machines are checked into MAAS
~35x KVMs deployed across 77 KVM hosts (in MAAS)

Watching usual system statistics has been no help (CPU/load/disk bandwith+IO/network bandwidth+IO), as far as these go they seem to be pretty quiet.

I've been struggling to find anything else that could be the cause, so I've been watching Postgres for slow queries (>5 seconds), but I have no idea if what I'm seeing is out of the ordinary or not:

```
2020-08-12 20:46:53 UTC [22532]: [3-1] db=maasdb,user=maas LOG: duration: 31784.074 ms fastpath function call: "lo_unlink" (OID 964)

2020-08-12 20:47:32 UTC [40006]: [3-1] db=maasdb,user=maas LOG: duration: 6950.174 ms statement: COMMIT

2020-08-11 22:30:34 UTC [27895]: [3-1] db=maasdb,user=maas LOG: duration: 19999.379 ms fastpath function call: "loread" (OID 954)

2020-08-11 22:30:34 UTC [9323]: [5-1] db=maasdb,user=maas LOG: duration: 15769.444 ms statement: SELECT "maasserver_node"."id", "maasserver_node"."created", "maasserver_node"."updated", "maasserver_node"."system_id", "maasserver_node"."hardware_uuid", "maasserver_node"."hostname", "maasserver_node"."description", "maasserver_node"."pool_id", "maasserver_node"."domain_id", "maasserver_node"."address_ttl", "maasserver_node"."status", "maasserver_node"."previous_status", "maasserver_node"."status_expires", "maasserver_node"."owner_id", "maasserver_node"."bios_boot_method", "maasserver_node"."osystem", "maasserver_node"."distro_series", "maasserver_node"."architecture", "maasserver_node"."min_hwe_kernel", "maasserver_node"."hwe_kernel", "maasserver_node"."node_type", "maasserver_node"."parent_id", "maasserver_node"."agent_name", "maasserver_node"."error_description", "maasserver_node"."zone_id", "maasserver_node"."cpu_count", "maasserver_node"."cpu_speed", "maasserver_node"."memory", "maasserver_node"."swap_size", "maasserver_node"."bmc_id", "maasserver_node"."instance_power_parameters", "maasserver_node"."power_state", "maasserver_node"."power_state_queried", "maasserver_node"."power_state_updated", "maasserver_node"."last_image_sync", "maasserver_node"."error", "maasserver_node"."netboot", "maasserver_node"."ephemeral_deploy", "maasserver_node"."license_key", "maasserver_node"."creation_type", "maasserver_node"."boot_interface_id", "maasserver_node"."boot_cluster_ip", "maasserver_node"."boot_disk_id", "maasserver_node"."gateway_link_ipv4_id", "maasserver_node"."gateway_link_ipv6_id", "maasserver_node"."default_user", "maasserver_node"."install_rackd", "maasserver_node"."install_kvm", "maasserver_node"."enable_ssh", "maasserver_node"."skip_bmc_config", "maasserver_node"."skip_networking", "maasserver_node"."skip_storage", "maasserver_node"."url", "maasserver_node"."dns_process_id", "maasserver_node"."managing_process_id", "maasserver_node"."current_commissioning_script_set_id", "maasserver_node"."current_installation_script_set_id", "maasserver_node"."current_testing_script_set_id", "maasserver_node"."locked" FROM "maasserver_node" WHERE ("maasserver_node"."node_type" IN (2, 4) AND "maasserver_node"."system_id" = 'ae7a7x')

2020-08-14 04:13:11 UTC [40654]: [3-1] db=maasdb,user=postgres LOG: duration: 8866.240 ms statement: COPY public.metadataserver_scriptresult (id, created, updated, status, exit_status, script_name, stdout, stderr, result, script_id, script_set_id, script_version_id, output, ended, started, parameters, physical_blockdevice_id, suppressed, interface_id) TO stdout;

2020-08-11 22:30:34 UTC [27881]: [3-1] db=maasdb,user=maas LOG: duration: 19160.838 ms statement: UPDATE "maasserver_node" SET "power_state_queried" = '2020-08-11T22:30:15.608069'::timestamp WHERE "maasserver_node"."system_id" IN ('4a6mff', '44f6e6', '6qk8bd', 'ah3b6c', 'akeeaq', 'c8rdnn', '4ff66m', 'b6cpwt', 'fkrbqk', '73rhne')

2020-08-11 23:36:23 UTC [13011]: [3-1] db=maasdb,user=maas LOG: duration: 26002.864 ms statement: SELECT pg_advisory_xact_lock(20120116, 7)
```

Current counts for the above queries:

```
grep lo_unlink postgresql-10-main.log | wc -l
24

grep COMMIT postgresql-10-main.log | wc -l
390

grep loread postgresql-10-main.log | wc -l
201

grep 'SELECT "maasserver_node"' postgresql-10-main.log | wc -l
201

grep COPY postgresql-10-main.log | wc -l
4

grep UPDATE postgresql-10-main.log | wc -l
943

grep pg_advisory_xact_lock postgresql-10-main.log | wc -l
94
```

For the queries above, the Durations listed are pretty representative of those queries (at least when they're >5 seconds).

Is the above strange or out of order?
Is there something else we should be looking into?
Or are the slow UI times we're seeing expected performance?

This is cross-posted on the MAAS Discourse: https://discourse.maas.io/t/2-8-1-intermittent-slowness-on-lxds/1988

Thank you!
-Derek

Tags: ui
Moula BADJI (moulab1)
tags: added: pb same
Revision history for this message
Adam Collard (adam-collard) wrote :

> 3x LXDs each with a region+rack server (snap version)

This could be a source of slowness, snaps in LXD use squashfuse which is not as performant as native squashfs. This issue is tracked in https://bugs.launchpad.net/snapd/+bug/1817276

tags: removed: pb same
Revision history for this message
Derek DeMoss (derek-pdl) wrote :

Hey Adam,
I don't know if squashfuse is the issue - not seeing any consistently high CPU for the squashfuse processes. It's mostly ~0% but occasionally spikes to 20/40/70(rare)% on a single process, but only for a second at a time for any of those spikes. Doesn't sound like the same issue unless I'm misunderstanding the other bug report's symptoms.

Thoughts?

Revision history for this message
Derek DeMoss (derek-pdl) wrote :

Hey there,
This has continued getting worse for us. It's gotten so bad as to regularly take multiple minutes for inventory to load, and almost every page faces 15-30 seconds minimum load time.

Anything else we can check on here?

We do have a (perhaps unusually?) large number of machines added to the KVMs (previously Pods) section: 119 VM Hosts available.

MAAS has automatically updated to 2.8.2 (8577-g.a3e674063), but there hasn't been any improvement, nor obvious change in speed whenever the upgrade ran.

Is MAAS still tested on LXDs? Do we need to dedicate full 40-core/64GB blades to this? Seems like that would be kind of absurd. :/

Revision history for this message
Derek DeMoss (derek-pdl) wrote :

Also seeing a lot of slowness when changing network interface configuration.
Hard to trust that it's updated the interface until I refresh the page (for another 15 seconds or so).
It does seem to *mostly* apply the changes reliably though.

Revision history for this message
Derek DeMoss (derek-pdl) wrote :

My coworker and I did some more troubleshooting on this a moment ago.
Per Firefox's debug/web inspector, it looks like the slowness while loading a Machine's page is mostly attributed to slow rendering as we were not seeing any messages go in nor out of the websocket (wait for the machines page to fully load, then click a machine and wait for the details to appear).

So this part appears to be strictly a UI rendering bug.

Alberto Donato (ack)
tags: added: ui
Changed in maas-ui:
importance: Undecided → Unknown
Revision history for this message
Huw Wilkins (huwshimi) wrote :

Hi Derek, we've implemented a number of performance improvements in MAAS 2.9 and 3.0, particularly when loading the machine details page. If you're able to upgrade to 3.0 I'd recommend that as the first thing to try.

Changed in maas-ui:
status: New → Fix Released
Changed in maas:
status: New → Fix Committed
Changed in maas:
milestone: none → 3.2.0-beta1
Changed in maas:
status: Fix Committed → Fix Released
To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.