Unexpectedly high number of DB queries when hitting /MAAS/api/2.0/machines/

Bug #1978225 reported by Stéphane Graber
This bug affects 4 people
Affects: MAAS
Status: Invalid
Importance: High
Assigned to: Unassigned

Bug Description

While working on MAAS integration with an LXD-related project, we've noticed that a simple:

  GET /MAAS/api/2.0/machines/

over your REST API is taking upwards of 15s to get us a response.
Looking at MAAS, we can see a regiond process running at 100% CPU during the entire time.
Then, looking at traffic on the loopback device, I noticed that it's doing over 1 Mb/s of PostgreSQL traffic during that entire time.

Enabling query logging in PostgreSQL and extracting the tables accessed for that one API query, I'm getting:
   1976 "maasserver_physicalblockdevice"
    933 "maasserver_blockdevice"
    846 "maasserver_partitiontable"
    414 "maasserver_partition"
    164 "maasserver_vlan"
    103 "maasserver_virtualmachinedisk"
     94 "maasserver_node"
     80 "maasserver_bootresourcefile"
     72 "maasserver_service"
     69 "maasserver_filesystem"
     63 "maasserver_regioncontrollerprocess"
     56 "maasserver_regioncontrollerprocessendpoint"
     53 "maasserver_numanode"
     52 "maasserver_resourcepool"
     48 maasserver_node
     48 "metadataserver_scriptresult"
     48 "maasserver_event"
     42 "maasserver_bootresourceset"
     40 "maasserver_config"
     36 "maasserver_regionrackrpcconnection"
     23 "maasserver_interface"
     12 "maasserver_fabric"
     11 "maasserver_subnet"
     11 "maasserver_space"
     11 "maasserver_notification"
      4 "piston3_consumer"
      4 "maasserver_staticipaddress"
      4 "maasserver_interfacerelationship"
      3 "auth_user"
      2 "piston3_token"
      2 "piston3_nonce"
      2 "maasserver_userprofile"
      2 "maasserver_bmc"
      1 "maasserver_virtualmachine"
      1 "maasserver_virtualblockdevice"
      1 "maasserver_tag"
      1 "maasserver_ownerdata"
      1 "maasserver_numanodehugepages"
      1 "maasserver_nodemetadata"
      1 "maasserver_globaldefault"
      1 "maasserver_domain"
      1 "maasserver_dnsresource"
      1 "maasserver_controllerinfo"
      1 "maasserver_bootresource"

This is the number of QUERIES hitting each table, counted by looking at the FROM clause in each logged query.
It means that just getting the list of machines caused around 5340 SQL queries!
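
To make that counting step concrete, here is a minimal Python sketch of how such a per-table tally can be produced from the PostgreSQL statement log. It is an illustration only, assuming log_statement = 'all' and a plain-text log; the log path and the regex are placeholders rather than the exact method used above:

    # Sketch: tally how often each table appears in a FROM clause of the
    # PostgreSQL statement log. Assumes log_statement = 'all'; the path
    # below is a placeholder for your actual log file.
    import re
    from collections import Counter

    LOG_PATH = "/var/log/postgresql/postgresql.log"  # placeholder

    counts = Counter()
    from_clause = re.compile(r'FROM\s+("?[A-Za-z0-9_]+"?)')

    with open(LOG_PATH) as log:
        for line in log:
            if "statement:" not in line:  # only look at logged SQL statements
                continue
            counts.update(from_clause.findall(line))

    for table, n in counts.most_common():
        print(f"{n:7d} {table}")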

The MAAS in question only has around 50 machines in 4 resource pools, so it's not exactly massive.

As an additional datapoint, MAAS is also hitting "get_machine_default_gateway_ip" absolutely constantly during that time. I measured over 5000 such calls, surprisingly similar to the number of DB queries. Each of those calls hits the Linux kernel with a dozen-plus netlink queries, causing a lot of kernel load, and for no good reason, as the gateway isn't very likely to change every few milliseconds :)
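
Since the gateway can be expected to stay stable for seconds at a time, one obvious mitigation would be a short-lived cache in front of the netlink lookup. The sketch below is purely illustrative and not MAAS code; the decorator, the TTL value, and the wrapped function name are all assumptions:

    # Illustrative sketch only: a tiny time-based cache so the gateway lookup
    # hits netlink at most once per TTL window instead of on every call.
    # Nothing here is taken from MAAS; names and TTL are assumptions.
    import functools
    import time

    def ttl_cache(ttl_seconds=5.0):
        def decorator(func):
            cache = {}  # args -> (expires_at, value)

            @functools.wraps(func)
            def wrapper(*args):
                now = time.monotonic()
                hit = cache.get(args)
                if hit is not None and hit[0] > now:
                    return hit[1]
                value = func(*args)
                cache[args] = (now + ttl_seconds, value)
                return value

            return wrapper
        return decorator

    @ttl_cache(ttl_seconds=5.0)
    def get_machine_default_gateway_ip():
        ...  # the expensive netlink lookup would go here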

Tracing that get_machine_default_gateway_ip call, it looks like it's only ever hit from your Prometheus handling code, so I wonder if this whole mess isn't caused by fetching metrics for your Prometheus endpoint? Worth noting that in our case, we are not currently scraping MAAS' Prometheus endpoint at all.

This is MAAS 3.1.0 from the 3.1/stable snap, with one region daemon and 4 rack daemons.

Revision history for this message
Christian Grabowski (cgrabowski) wrote :

Hi there, this is something we were aware of and have addressed in 3.2.0, which has an RC out as of right now. Can you try with the release candidate?

Changed in maas:
status: New → Incomplete
Revision history for this message
Stéphane Graber (stgraber) wrote :

No. We're sticking to stable, having been bitten several times by past MAAS RCs.

Are you going to be pushing this to 3.1 stable users?

Revision history for this message
Alexsander de Souza (alexsander-souza) wrote :

I understand. In this case I'm going to ask you to re-test this after the 3.2 release comes out.

This patch doesn't meet our requirements for backports: it's too big, and it's neither a critical bug nor a regression.

Revision history for this message
Stéphane Graber (stgraber) wrote :

Hmm, I'm really confused by your standards for a critical bug then.

This is the most common endpoint (or at least you'd think it is) on your core REST API, and it is taking long enough to time out in most software (a 10s timeout is common), rendering it entirely unusable.

The kind of load it creates against the database also means that this bug makes for a very trivial DoS attack against MAAS. Throw a dozen GET /MAAS/api/2.0/machines/ requests in parallel at a moderately busy MAAS and MAAS will be completely dead.
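
To illustrate the timeout side of this, the sketch below just times one listing call against a client-side 10-second limit. The URL is a placeholder and authentication is omitted; a real call needs MAAS API credentials:

    # Timing sketch only: measure one listing call and surface the client-side
    # timeout most HTTP libraries would hit around 10s. The URL is a
    # placeholder and authentication is omitted for brevity.
    import time
    import requests

    MAAS_URL = "http://maas.example.com:5240/MAAS/api/2.0/machines/"  # placeholder

    start = time.monotonic()
    try:
        response = requests.get(MAAS_URL, timeout=10)  # common client timeout
        response.raise_for_status()
        machines = response.json()
        print(f"{len(machines)} machines in {time.monotonic() - start:.1f}s")
    except requests.exceptions.Timeout:
        print(f"gave up after {time.monotonic() - start:.1f}s (client timeout)")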

Changed in maas:
status: Incomplete → New
status: New → Incomplete
Revision history for this message
Stéphane Graber (stgraber) wrote :

With 3.2 now in stable, I've refreshed our deployment over to it.

I can't say that things really improved that much, though: we're still seeing a single GET of the machine list taking up to 18s. On the low end, we now get responses in 6s instead of 10s, but that's the odd fast response rather than the norm, and 6s is still very slow for a MAAS with fewer than 50 machines...

The number of queries did decrease, though, with the best run down to 460 now:
     86 "maasserver_vlan"
     80 "maasserver_bootresourcefile"
     42 "maasserver_bootresourceset"
     31 "maasserver_numanode"
     27 "maasserver_node"
     24 maasserver_node
     24 "metadataserver_scriptresult"
     24 "maasserver_event"
     21 "maasserver_service"
     16 "maasserver_config"
     13 "maasserver_filesystem"
      7 "maasserver_staticipaddress"
      7 "maasserver_interface"
      6 "maasserver_virtualmachinedisk"
      6 "maasserver_partitiontable"
      6 "maasserver_partition"
      6 "maasserver_fabric"
      5 "maasserver_space"
      4 "maasserver_subnet"
      4 "maasserver_interfacerelationship"
      3 "maasserver_virtualblockdevice"
      3 "maasserver_physicalblockdevice"
      3 "maasserver_notification"
      3 "maasserver_nodeconfig"
      2 "piston3_consumer"
      2 "maasserver_regioncontrollerprocessendpoint"
      2 "maasserver_regioncontrollerprocess"
      2 "maasserver_blockdevice"
      1 "piston3_token"
      1 "piston3_nonce"
      1 "maasserver_userprofile"
      1 "maasserver_tag"
      1 "maasserver_ownerdata"
      1 "maasserver_numanodehugepages"
      1 "maasserver_nodemetadata"
      1 "maasserver_globaldefault"
      1 "maasserver_dnsresource"
      1 "maasserver_bootresource"
      1 "auth_user"

However, background activity is really bad. I had to run 12 individual tests to capture such an optimal result. All the others looked more like:
   1494 "maasserver_bootresourceset"
    703 "maasserver_bootresource"
     88 "maasserver_vlan"
     80 "maasserver_bootresourcefile"
     41 "maasserver_node"
     34 "maasserver_numanode"
     29 "maasserver_config"
     24 maasserver_node
     24 "metadataserver_scriptresult"
     24 "maasserver_event"
     17 "maasserver_service"
     15 "maasserver_filesystem"
     10 "maasserver_interface"
      9 "maasserver_subnet"
      9 "maasserver_staticipaddress"
      8 "maasserver_resourcepool"
      8 "maasserver_partitiontable"
      8 "maasserver_partition"
      8 "maasserver_fabric"
      7 "maasserver_notification"
      6 "maasserver_virtualmachinedisk"
      6 "maasserver_space"
      6 "maasserver_regioncontrollerprocessendpoint"
      6 "maasserver_regioncontrollerprocess"
      5 "maasserver_physicalblockdevice"
      5 "maasserver_nodeconfig"
      4 "maasserver_virtualblockdevice"
      4 "maasserver_interfacerelationship"
      4 "maasserver_blockdevice"
      2 "piston3_consumer"
      2 "maasserver_tag"
      2 "maasserver_numanodehugepages"
      2 "maasserver_nodemetadata"
      2 "maasserver_controllerinfo"
      2 "auth_user"
      1 pg_locks,
      1 "piston3_token"
      1 "piston3_nonce"
      1 "maasserver_userprofile"
      1 "maasserver_ownerdata"
      1 "maas...

Revision history for this message
Stéphane Graber (stgraber) wrote :

Just to clarify that I don't think my expectations are unrealistic here.

I did a quick check against my home LXD cluster, which runs on the same systems as our MAAS deployment and so has to deal with the same network and CPU performance.

Fetching all details for my 105 instances across 8 projects and 7 clustered servers takes 1.5s.
This fetches all instances, instance snapshots, configuration, devices, profiles, storage, and networks, so it's very similar to the MAAS endpoint in the amount of data returned.

To be clear, I don't think LXD's performance is acceptable here either. We're doing around 250 DB queries to fetch that data when it should realistically just be < 10 queries. Our goal is to get that API to the < 500ms range.

For MAAS, what we need is for it to consistently be below 10s. Realistically, we'd like it to be below 1s, but 10s is a common timeout value, and right now we're seeing timeouts and retries which cause even more load, in turn causing even more timeouts...

summary: - Insane amount of DB queries when hitting /MAAS/api/2.0/machines/
+ Unexpectedly high number of DB queries when hitting
+ /MAAS/api/2.0/machines/
Revision history for this message
Alexsander de Souza (alexsander-souza) wrote :

Your expectations about the REST API performance are perfectly reasonable, but we need to be transparent about how we are addressing this issue and set realistic expectations about the time to fix it.

1) why this is not a critical/blocking bug

This is regarded as a product characteristic at this point (a bad one), and we are aware it exposes MAAS to a number of DoS attacks. The MAAS API has always been slow, so it's not a regression in the last release, and customers have learned to live with it. Fixing this is not the top priority for the project stakeholders either, so we are not blocking new releases because of it.

2) what's our strategy for improving MAAS performance

During the 3.2 cycle we introduced performance testing and profiling into our CI pipeline. This gave us insight into how the time is spent and which actions would produce the biggest impact. The conclusion was that MAAS needs a major overhaul of critical components, which couldn't all be done in a single release cycle. We did what we could in 3.2, and our management is committed to setting aside time in future cycles to make this happen. (A generic sketch of what such a query-count check can look like follows at the end of this comment.)

3) when MAAS is going to be fast

We'll get there when we get there. There are many unknowns, like how many man-hours we can commit to this each cycle and how fast the MAAS team will grow. We've set the goal of getting faster with each release, so you can expect that much.

This might not be the answer you were expecting, but this is the answer we can give you at this time.
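
As an aside on the performance testing mentioned in point 2, here is a generic pytest-django sketch of what a query-count guardrail can look like in a Django codebase. It is not MAAS's actual CI or test suite; the endpoint and the query budget are assumptions:

    # Generic sketch of a query-count regression guard for a Django view,
    # using pytest-django's django_assert_num_queries fixture. The URL and
    # the query budget are assumptions, not MAAS's real numbers.
    import pytest

    @pytest.mark.django_db
    def test_machine_listing_query_budget(client, django_assert_num_queries):
        # Fail if the listing view issues more queries than budgeted, so
        # N+1 regressions are caught before release.
        with django_assert_num_queries(25):
            response = client.get("/MAAS/api/2.0/machines/")
        assert response.status_code == 200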

Changed in maas:
status: Incomplete → Triaged
importance: Undecided → High
Revision history for this message
Jerzy Husakowski (jhusakowski) wrote :

We have planned improvements to MAAS DB query performance, as it's not a simple bug fix. Work on this is tracked on the internal backlog (internal ref. PF-3688, PF-3256, and many others).

Changed in maas:
status: Triaged → Invalid