Bug #1978225 “Unexpectedly high number of DB queries when hittin...” : Bugs : MAAS

Revision history for this message

Christian Grabowski (cgrabowski) wrote on 2022-06-09:

#1

Hi there, this is something we are/were aware of and addressed in 3.2.0, which has an RC out as of right now. Can you try with the release candidate?

Changed in maas:
status:	New → Incomplete

Revision history for this message

Stéphane Graber (stgraber) wrote on 2022-06-09:

#2

No. We're sticking to stable having been bitten several times with past MAAS rc.

Are you going to be pushing this to 3.1 stable users?

Revision history for this message

Alexsander de Souza (alexsander-souza) wrote on 2022-06-21:

#3

I understand. In this case I'm going to ask you to re-test this after the 3.2 release comes out.

This patch doesn't meet our requirements for backports, it's too big and it's not a critical bug nor a regression.

Revision history for this message

Stéphane Graber (stgraber) wrote on 2022-06-23:

#4

Hmm, I'm really confused by your standards for a critical bug then.

This is the most common endpoint (or at least you'd think it is) on your core REST API which is taking long enough to timeout in most software (10s timeout is common) rendering it entirely unusable.

The kind of load it creates against the database also means that this bug makes for very trivial DoS attack against MAAS. Throw a dozen GET /MAAS/api/2.0/machines/ in parallel on a moderately busy MAAS and MAAs will be completely dead.

Alexsander de Souza (alexsander-souza) on 2022-06-24

Changed in maas:
status:	Incomplete → New
status:	New → Incomplete

Revision history for this message

Stéphane Graber (stgraber) wrote on 2022-06-27:

#5

Download full text (3.4 KiB)

With 3.2 now in stable, I've refreshed our deployment over to it.

I can't say that things really improved that much though, we're still seeing a single GET of the machine list taking up to 18s. On the low end, we now get responses in 6s instead of 10s, but that's the odd fast response rather than the norm and 6s is still very very slow for a MAAS with less than 50 machines...

The number of queries did decrease though with the most optimal run being down to 460 now.
     86 "maasserver_vlan"
     80 "maasserver_bootresourcefile"
     42 "maasserver_bootresourceset"
     31 "maasserver_numanode"
     27 "maasserver_node"
     24 maasserver_node
     24 "metadataserver_scriptresult"
     24 "maasserver_event"
     21 "maasserver_service"
     16 "maasserver_config"
     13 "maasserver_filesystem"
      7 "maasserver_staticipaddress"
      7 "maasserver_interface"
      6 "maasserver_virtualmachinedisk"
      6 "maasserver_partitiontable"
      6 "maasserver_partition"
      6 "maasserver_fabric"
      5 "maasserver_space"
      4 "maasserver_subnet"
      4 "maasserver_interfacerelationship"
      3 "maasserver_virtualblockdevice"
      3 "maasserver_physicalblockdevice"
      3 "maasserver_notification"
      3 "maasserver_nodeconfig"
      2 "piston3_consumer"
      2 "maasserver_regioncontrollerprocessendpoint"
      2 "maasserver_regioncontrollerprocess"
      2 "maasserver_blockdevice"
      1 "piston3_token"
      1 "piston3_nonce"
      1 "maasserver_userprofile"
      1 "maasserver_tag"
      1 "maasserver_ownerdata"
      1 "maasserver_numanodehugepages"
      1 "maasserver_nodemetadata"
      1 "maasserver_globaldefault"
      1 "maasserver_dnsresource"
      1 "maasserver_bootresource"
      1 "auth_user"

However, background activity is really really bad. I had to do 12 individual tests to capture such an optimal result. All others looked more like:
   1494 "maasserver_bootresourceset"
    703 "maasserver_bootresource"
     88 "maasserver_vlan"
     80 "maasserver_bootresourcefile"
     41 "maasserver_node"
     34 "maasserver_numanode"
     29 "maasserver_config"
     24 maasserver_node
     24 "metadataserver_scriptresult"
     24 "maasserver_event"
     17 "maasserver_service"
     15 "maasserver_filesystem"
     10 "maasserver_interface"
      9 "maasserver_subnet"
      9 "maasserver_staticipaddress"
      8 "maasserver_resourcepool"
      8 "maasserver_partitiontable"
      8 "maasserver_partition"
      8 "maasserver_fabric"
      7 "maasserver_notification"
      6 "maasserver_virtualmachinedisk"
      6 "maasserver_space"
      6 "maasserver_regioncontrollerprocessendpoint"
      6 "maasserver_regioncontrollerprocess"
      5 "maasserver_physicalblockdevice"
      5 "maasserver_nodeconfig"
      4 "maasserver_virtualblockdevice"
      4 "maasserver_interfacerelationship"
      4 "maasserver_blockdevice"
      2 "piston3_consumer"
      2 "maasserver_tag"
      2 "maasserver_numanodehugepages"
      2 "maasserver_nodemetadata"
      2 "maasserver_controllerinfo"
      2 "auth_user"
      1 pg_locks,
      1 "piston3_token"
      1 "piston3_nonce"
      1 "maasserver_userprofile"
      1 "maasserver_ownerdata"
      1 "maas...

With 3.2 now in stable, I've refreshed our deployment over to it.

I can't say that things really improved that much though, we're still seeing a single GET of the machine list taking up to 18s. On the low end, we now get responses in 6s instead of 10s, but that's the odd fast response rather than the norm and 6s is still very very slow for a MAAS with less than 50 machines...

The number of queries did decrease though with the most optimal run being down to 460 now.
     86 "maasserver_vlan"
     80 "maasserver_bootresourcefile"
     42 "maasserver_bootresourceset"
     31 "maasserver_numanode"
     27 "maasserver_node"
     24 maasserver_node
     24 "metadataserver_scriptresult"
     24 "maasserver_event"
     21 "maasserver_service"
     16 "maasserver_config"
     13 "maasserver_filesystem"
      7 "maasserver_staticipaddress"
      7 "maasserver_interface"
      6 "maasserver_virtualmachinedisk"
      6 "maasserver_partitiontable"
      6 "maasserver_partition"
      6 "maasserver_fabric"
      5 "maasserver_space"
      4 "maasserver_subnet"
      4 "maasserver_interfacerelationship"
      3 "maasserver_virtualblockdevice"
      3 "maasserver_physicalblockdevice"
      3 "maasserver_notification"
      3 "maasserver_nodeconfig"
      2 "piston3_consumer"
      2 "maasserver_regioncontrollerprocessendpoint"
      2 "maasserver_regioncontrollerprocess"
      2 "maasserver_blockdevice"
      1 "piston3_token"
      1 "piston3_nonce"
      1 "maasserver_userprofile"
      1 "maasserver_tag"
      1 "maasserver_ownerdata"
      1 "maasserver_numanodehugepages"
      1 "maasserver_nodemetadata"
      1 "maasserver_globaldefault"
      1 "maasserver_dnsresource"
      1 "maasserver_bootresource"
      1 "auth_user"

However, background activity is really really bad. I had to do 12 individual tests to capture such an optimal result. All others looked more like:
   1494 "maasserver_bootresourceset"
    703 "maasserver_bootresource"
     88 "maasserver_vlan"
     80 "maasserver_bootresourcefile"
     41 "maasserver_node"
     34 "maasserver_numanode"
     29 "maasserver_config"
     24 maasserver_node
     24 "metadataserver_scriptresult"
     24 "maasserver_event"
     17 "maasserver_service"
     15 "maasserver_filesystem"
     10 "maasserver_interface"
      9 "maasserver_subnet"
      9 "maasserver_staticipaddress"
      8 "maasserver_resourcepool"
      8 "maasserver_partitiontable"
      8 "maasserver_partition"
      8 "maasserver_fabric"
      7 "maasserver_notification"
      6 "maasserver_virtualmachinedisk"
      6 "maasserver_space"
      6 "maasserver_regioncontrollerprocessendpoint"
      6 "maasserver_regioncontrollerprocess"
      5 "maasserver_physicalblockdevice"
      5 "maasserver_nodeconfig"
      4 "maasserver_virtualblockdevice"
      4 "maasserver_interfacerelationship"
      4 "maasserver_blockdevice"
      2 "piston3_consumer"
      2 "maasserver_tag"
      2 "maasserver_numanodehugepages"
      2 "maasserver_nodemetadata"
      2 "maasserver_controllerinfo"
      2 "auth_user"
      1 pg_locks,
      1 "piston3_token"
      1 "piston3_nonce"
      1 "maasserver_userprofile"
      1 "maasserver_ownerdata"
      1 "maasserver_globaldefault"
      1 "maasserver_dnsresource"

Where you can see upwards of 2000 queries against the bootresource tables.

Note that all of this is over just 5-10s of capturing queries, so the database is still a big big problem.

Revision history for this message

Stéphane Graber (stgraber) wrote on 2022-06-27:

#6

Just to clarify that I don't think my expectations are unrealistic here.

I did a quick check against my home LXD cluster which is running on the same systems as our MAAS deployment and so has to deal with the same network and CPU performance.

Fetching all details for my 105 instances across 8 projects and 7 clustered servers takes 1.5s.
This fetches all instances, instance snapshots, configuration, devices, profiles, storage and network. So very similar in the amount of data as the MAAS endpoint.

To be clear, I don't think LXD's performance is acceptable here either. We're doing around 250 DB queries to fetch that data when it should realistically just be < 10 queries. Our goal is to get that API to the < 500ms range.

For MAAS, what we need is for it to consistently be below 10s. Realistically, we'd like it to be below 1s but 10s is a common timeout value and right now we're seeing timeouts and retries which are causing even more load, in turn causing even more timeouts...

Adam Collard (adam-collard) on 2022-06-28

summary:

- Insane amount of DB queries when hitting /MAAS/api/2.0/machines/
+ Unexpectedly high number of DB queries when hitting
+ /MAAS/api/2.0/machines/

Revision history for this message

Alexsander de Souza (alexsander-souza) wrote on 2022-06-28:

#7

Your expectations about the REST API performance are perfectly reasonable, but we need to be transparent on how we are addressing this issue and to create realistic expectations about the time to fix it.

1) why this is not a critical/blocking bug

This is regarded as a product characteristic at this point (a bad one), and we are aware it exposes MAAS to a number of DoS attacks. MAAS API has always been slow, so it's not a regression of the last release, and customers have learned to live with this. Fixing this is not the top priority for the project stakeholders either, so we are not blocking new releases because of this.

2) what's our strategy for improving MAAS performance

During the 3.2 cycle we introduced performance testing and profiling into our CI pipeline. This gave us insight about how the time is spent and what actions would produce the biggest impact. The conclusion was that MAAS needs major overhaul in critical components, which couldn't all be done in a single release cycle. We did what we could in 3.2, and our management is committed to setting aside time on future cycles to make this happen.

3) when MAAS is going to be fast

We'll get there when we get there. There are many unknowns, like how many man-hours we can commit to this each cycle and how fast the MAAS team will grow. We set the goal of getting faster with each release, so you can expect this much.

This might not be the answer you were expecting, but this is the answer we can give you at this time.

Changed in maas:
status:	Incomplete → Triaged
importance:	Undecided → High

Revision history for this message

Jerzy Husakowski (jhusakowski) wrote on 2023-02-16:

#8

We have planned improvements to MAAS DB query performance, as it's not a simple bug fix. Work on this is tracked on the internal backlog (internal ref. PF-3688, PF-3256, and many others).

Changed in maas:
status:	Triaged → Invalid

MAAS

Unexpectedly high number of DB queries when hitting /MAAS/api/2.0/machines/

Bug Description

Related branches

Other bug subscribers

Remote bug watches