Slow web performance (5+ minute response time) on MAAS with many nodes

Bug #1401707 reported by Jason Hobbs on 2014-12-11
8
This bug affects 1 person
Affects Status Importance Assigned to Milestone
MAAS
Critical
Blake Rouse
1.7
Undecided
Blake Rouse

Bug Description

I have a MAAS with 100+ nodes. It's taking 6-8 seconds to respond to some requests, like node page listings and image page listings.

Here's some examples; the last two numbers in the log are the log time in seconds/microseconds.

10.90.0.14 - - [11/Dec/2014:23:18:07 +0000] "GET /MAAS/nodes/node-5f9c14e6-ae98-11e3-b194-00163efc5068/view/?sequence=45 HTTP/1.1" 200 1282 "http://10.245.0.10/MAAS/nodes/node-5f9c14e6-ae98-11e3-b194-00163efc5068/view/" "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:32.0) Gecko/20100101 Firefox/32.0" **7 7626093**
10.90.0.14 - - [11/Dec/2014:23:18:07 +0000] "GET /MAAS/nodes/node-5f9c14e6-ae98-11e3-b194-00163efc5068/view/?sequence=35 HTTP/1.1" 200 1282 "http://10.245.0.10/MAAS/nodes/node-5f9c14e6-ae98-11e3-b194-00163efc5068/view/" "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:32.0) Gecko/20100101 Firefox/32.0" **7 7762253**
10.90.0.14 - - [11/Dec/2014:23:18:08 +0000] "GET /MAAS/nodes/node-a0620c92-c4cd-11e3-8102-00163efc5068/view/?sequence=1131 HTTP/1.1" 200 1453 "http://10.245.0.10/MAAS/nodes/node-a0620c92-c4cd-11e3-8102-00163efc5068/view/" "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Ubuntu Chromium/37.0.2062.120 Chrome/37.0.2062.120 Safari/537.36" **6 6048212**
10.90.0.38 - - [11/Dec/2014:23:17:56 +0000] "GET /MAAS/images/?sequence=10 HTTP/1.1" 200 638 "http://10.245.0.10/MAAS/images/" "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36" **7 7129086**

Not all requests take this long:
10.90.0.38 - - [11/Dec/2014:23:18:03 +0000] "GET /MAAS/nodes/?id=node-a08c25d6-c4cd-11e3-8102-00163efc5068&id=node-a0754dc0-c4cd-11e3-824b-00163efc5068&id=node-a0620c92-c4cd-11e3-8102-00163efc5068&id=node-a04d5ba8-c4cd-11e3-824b-00163efc5068&id=node-a036d91e-c4cd-11e3-8102-00163efc5068&id=node-a0223dec-c4cd-11e3-824b-00163efc5068&id=node-a00d6a0c-c4cd-11e3-8102-00163efc5068&id=node-9ff3a32e-c4cd-11e3-824b-00163efc5068&id=node-9fd976c0-c4cd-11e3-8102-00163efc5068&id=node-9fc3db80-c4cd-11e3-824b-00163efc5068&id=node-9faae724-c4cd-11e3-8102-00163efc5068&id=node-9f89500a-c4cd-11e3-824b-00163efc5068&id=node-9f70a10e-c4cd-11e3-8102-00163efc5068&id=node-9f580cca-c4cd-11e3-824b-00163efc5068&id=node-9f40d3ca-c4cd-11e3-8102-00163efc5068&id=node-9f22e392-c4cd-11e3-824b-00163efc5068&id=node-9f0983ac-c4cd-11e3-8102-00163efc5068&id=node-9eee5bae-c4cd-11e3-824b-00163efc5068&id=node-9ed45f60-c4cd-11e3-8102-00163efc5068&id=node-9eb864f4-c4cd-11e3-824b-00163efc5068&id=node-9e9e2eb8-c4cd-11e3-8102-00163efc5068&id=node-9e7a2a18-c4cd-11e3-824b-00163efc5068&id=node-9e5f6976-c4cd-11e3-8102-00163efc5068&id=node-9e496b8a-c4cd-11e3-824b-00163efc5068&id=node-9e308d0e-c4cd-11e3-8102-00163efc5068&id=node-9e0f7d44-c4cd-11e3-824b-00163efc5068&id=node-9df8a42a-c4cd-11e3-824b-00163efc5068&id=node-9dd8bed0-c4cd-11e3-8102-00163efc5068&id=node-9db5ca7e-c4cd-11e3-824b-00163efc5068&id=node-9d9cdafa-c4cd-11e3-824b-00163efc5068&id=node-9d847d52-c4cd-11e3-824b-00163efc5068&id=node-9d6c8a30-c4cd-11e3-8102-00163efc5068&id=node-9d4e6a5a-c4cd-11e3-824b-00163efc5068&id=node-9d30e4e4-c4cd-11e3-8102-00163efc5068&id=node-94afcdc8-aea0-11e3-9074-00163efc5068&id=node-30459fcc-ae9e-11e3-b194-00163efc5068&id=node-5a2fcafe-ae9b-11e3-b194-00163efc5068&id=node-49c2f628-ae9b-11e3-b194-00163efc5068&id=node-4990416a-ae9b-11e3-9074-00163efc5068&id=node-49653d80-ae9b-11e3-b194-00163efc5068&id=node-4951e5e6-ae9b-11e3-9074-00163efc5068&id=node-a5224922-ae98-11e3-b194-00163efc5068&id=node-601eca8a-ae98-11e3-9074-00163efc5068&id=node-600e169a-ae98-11e3-b194-00163efc5068&id=node-5fccd02c-ae98-11e3-9074-00163efc5068&id=node-5f9c14e6-ae98-11e3-b194-00163efc5068&id=node-4cb16f16-ae98-11e3-b194-00163efc5068&id=node-0f5dc0ee-ae1e-11e3-9074-00163efc5068&id=node-12dfa450-ae17-11e3-9074-00163efc5068&id=node-0fd68e9a-ae17-11e3-b194-00163efc5068&sequence=7 HTTP/1.1" 200 3362 "http://10.245.0.10/MAAS/nodes/?page=3" "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36" **0 356661**

10.245.0.10 - - [11/Dec/2014:23:18:26 +0000] "GET /MAAS/rpc/ HTTP/1.0" 200 1102 "-" "Twisted PageGetter" **0 51178**

This is a big system and the load average is low overall, but the twisted pserv process is taking up 100% of one core.

This is with MAAS 1.7.1~rc1+bzr3322-0ubuntu1~trusty.

Related branches

Jason Hobbs (jason-hobbs) wrote :

This MAAS server also has many nodes that are down and have BMCs that can't be reached. It's almost always trying to query a node, and ps output usually shows something like this:

http://paste.ubuntu.com/9481782/

Andres Rodriguez (andreserl) wrote :

Same case as what we saw in 1.7.0.

Changed in maas:
importance: Undecided → Critical
assignee: nobody → Blake Rouse (blake-rouse)
Jason Hobbs (jason-hobbs) wrote :
Jason Hobbs (jason-hobbs) wrote :

Some requests are taking over 5 minutes to respond to:

127.0.0.1 - - [12/Dec/2014:16:48:08 +0000] "GET /MAAS/api/1.0/pxeconfig/?cluster_uuid=037c960b-5b9f-4701-8366-eeda2c09d14e&local=10.245.0.10&mac=00-22-99-e0-05-46&remote=10.245.0.182 HTTP/1.0" 200 568 "-" "Twisted PageGetter" **263 263245907**
127.0.0.1 - - [12/Dec/2014:16:48:10 +0000] "GET /MAAS/api/1.0/pxeconfig/?cluster_uuid=037c960b-5b9f-4701-8366-eeda2c09d14e&local=10.245.0.10&mac=00-22-99-e0-03-06&remote=10.245.0.181 HTTP/1.0" 200 568 "-" "Twisted PageGetter" **266 266199798**
127.0.0.1 - - [12/Dec/2014:16:48:08 +0000] "GET /MAAS/api/1.0/pxeconfig/?cluster_uuid=037c960b-5b9f-4701-8366-eeda2c09d14e&local=10.245.0.10&mac=00-22-99-e0-05-46&remote=10.245.0.182 HTTP/1.0" 200 568 "-" "Twisted PageGetter" **279 279637542**
127.0.0.1 - - [12/Dec/2014:16:48:11 +0000] "GET /MAAS/api/1.0/pxeconfig/?cluster_uuid=037c960b-5b9f-4701-8366-eeda2c09d14e&local=10.245.0.10&mac=00-22-99-e0-06-16&remote=10.245.0.168 HTTP/1.0" 200 568 "-" "Twisted PageGetter" **278 278281991**
127.0.0.1 - - [12/Dec/2014:16:48:13 +0000] "GET /MAAS/api/1.0/pxeconfig/?cluster_uuid=037c960b-5b9f-4701-8366-eeda2c09d14e&local=10.245.0.10&mac=00-22-99-e0-02-16&remote=10.245.0.171 HTTP/1.0" 200 567 "-" "Twisted PageGetter" **284 284870560**
127.0.0.1 - - [12/Dec/2014:16:48:11 +0000] "GET /MAAS/api/1.0/pxeconfig/?cluster_uuid=037c960b-5b9f-4701-8366-eeda2c09d14e&local=10.245.0.10&mac=00-22-99-e0-07-16&remote=10.245.0.180 HTTP/1.0" 200 568 "-" "Twisted PageGetter" **287 287229599**
127.0.0.1 - - [12/Dec/2014:16:48:13 +0000] "GET /MAAS/api/1.0/pxeconfig/?cluster_uuid=037c960b-5b9f-4701-8366-eeda2c09d14e&local=10.245.0.10&mac=00-22-99-e0-00-46&remote=10.245.0.174 HTTP/1.0" 200 568 "-" "Twisted PageGetter" **287 287692603**
127.0.0.1 - - [12/Dec/2014:16:48:15 +0000] "GET /MAAS/api/1.0/pxeconfig/?cluster_uuid=037c960b-5b9f-4701-8366-eeda2c09d14e&local=10.245.0.10&mac=00-22-99-e0-06-46&remote=10.245.0.170 HTTP/1.0" 200 568 "-" "Twisted PageGetter" **298 298460868**
127.0.0.1 - - [12/Dec/2014:16:48:15 +0000] "GET /MAAS/api/1.0/pxeconfig/?cluster_uuid=037c960b-5b9f-4701-8366-eeda2c09d14e&local=10.245.0.10&mac=00-22-99-e0-03-16&remote=10.245.0.175 HTTP/1.0" 200 568 "-" "Twisted PageGetter" **306 306219258**
127.0.0.1 - - [12/Dec/2014:16:48:15 +0000] "GET /MAAS/api/1.0/pxeconfig/?cluster_uuid=037c960b-5b9f-4701-8366-eeda2c09d14e&local=10.245.0.10&mac=00-22-99-e0-05-46&remote=10.245.0.182 HTTP/1.0" 200 568 "-" "Twisted PageGetter" **308 308224281**
127.0.0.1 - - [12/Dec/2014:16:48:17 +0000] "GET /MAAS/api/1.0/pxeconfig/?cluster_uuid=037c960b-5b9f-4701-8366-eeda2c09d14e&local=10.245.0.10&mac=00-22-99-e0-03-06&remote=10.245.0.181 HTTP/1.0" 200 568 "-" "Twisted PageGetter" **308 308135835**

summary: - Slow performance on MAAS with many nodes
+ Slow web performance (5 minute response time) on MAAS with many nodes
summary: - Slow web performance (5 minute response time) on MAAS with many nodes
+ Slow web performance (5+ minute response time) on MAAS with many nodes
Jason Hobbs (jason-hobbs) wrote :

More logs of pxeconfig requests repeating endlessly.

http://paste.ubuntu.com/9491778/

Blake Rouse (blake-rouse) wrote :

There is a few things we can do to make MAAS perform better.

1. Stop using the pxeconfig API and use a proper RPC call, as the pxeconfig API call itself makes more class to the cluster, since this is originating from the cluster there is no need for this. We can reduce the number of RPC calls required and the amount of information sent between the region and the cluster.

2. Cache RPC calls on the cluster to the region for a short period of time, that way a request is not needed per file as it currently does. Using a cache will reduce the call count and increase the performance of the cluster controller. This will also take load off the region, increasing the performance there as well.

Changed in maas:
status: New → Triaged
Christian Reis (kiko) on 2014-12-12
Changed in maas:
milestone: none → 1.7.2
Blake Rouse (blake-rouse) wrote :

Targeting this to 1.7.2 is very dangerous. What I am proposing is a large change something that would need some serious QA.

I think I have a better quicker fix, that is less dangerous for 1.7.2.

1. Disable the WindowsPXEBootMethod as its no longer used to install Windows. We should just remove it as we probably don't want to support that method anymore. This would make only pxeconfig request per machine boot, instead of multiple per boot.

2. The fix I have for bug 1401983 should really improve performance.

Blake Rouse (blake-rouse) wrote :

So I have done some debugging to come up with the following.

1. the fix for bug 1401983 is needed, this helps a lot with the pxeconfig
2. removal of WindowsPXEBootMethod also helps a lot

Still twistd CPU% is very high, even without nodes booting. This is because of all of the RPC calls that are made from the region to the cluster anytime a web request comes it.

1. DescribePowerTypes is called 4 times per request. I implemented a quick caching of this method and it help lower the CPU usage of twisted.

2. ListBootImages which is a very heavy process on the cluster is called very often. This method needs to be cached as well. I think this should be cached on the region and the cluster. The cluster knows the last time it updated the boot images, so it only needs to update its cache on start and after each sync of boot images. The region can also cache this RPC call for a minute so not to hit the cluster every so often, but when the region does ask it will already be cached on the cluster.

Blake Rouse (blake-rouse) wrote :

Okay so after some more investigation and thought I have ruled down the biggest culpret!

list_boot_images() in pserv

This method which is very IO heavy, is called very often. I have a fix currently in OIL that cache this cluster side, and the improvement is huge. I will work on getting this merged so it can go into 1.7.

Christian Reis (kiko) on 2014-12-16
Changed in maas:
milestone: 1.7.2 → 1.7.1
milestone: 1.7.1 → none
Changed in maas:
status: Triaged → Fix Committed
Christian Reis (kiko) on 2014-12-18
Changed in maas:
milestone: none → next
Changed in maas:
status: Fix Committed → Fix Released
Andres Rodriguez (andreserl) wrote :

This bug has been reported and fixed on upstream MAAS. However, provided that the bug was listed on the debian changelog, this appears as needing verification for pending SRU [1]. This bug did not affect current MAAS in Ubuntu, hence setting this to verification-done to unblock pending SRU.

[1]:http://people.canonical.com/~ubuntu-archive/pending-sru.html

tags: added: verification-done
Changed in maas:
milestone: next → none
To post a comment you must log in.
This report contains Public information  Edit
Everyone can see this information.

Other bug subscribers

Bug attachments