Very slow load times for view node pages in MAAS

Bug #1430852 reported by Jason Hobbs on 2015-03-11
12
This bug affects 2 people
Affects Status Importance Assigned to Milestone
MAAS
High
Blake Rouse

Bug Description

Sometimes loading node view pages in our MAAS server takes a really long time - like 26 seconds. Here's some logs of the load time for viewing node pages in our MAAS. The response time is given at the end of each log line.

http://paste.ubuntu.com/10580586/

Related branches

Jason Hobbs (jason-hobbs) wrote :

This is still present in 1.8

Jason Hobbs (jason-hobbs) wrote :

The slowness is being caused by slow event queries:

2015-04-08 15:23:08 UTC LOG: duration: 12915.433 ms statement: SELECT COUNT(*) FROM "maasserver_event" WHERE "maasserver_event"."node_id" = 786

There is a btree index on maasserver_event.node_id - I think this should be a hash index instead, since we only care about equality for node_ids - we never do stuff like: node_id < x.

Jason Hobbs (jason-hobbs) wrote :

blake pointed out some psql documentation discouraging hash indexes:

15:36 < blake_r> jhobbs: Testing has shown PostgreSQL's hash indexes to perform no better than B-tree indexes, and the index size and build time for hash indexes is
                 much worse. Furthermore, hash index operations are not presently WAL-logged, so hash indexes might need to be rebuilt with REINDEX after a database
                 crash. For these reasons, hash index use is presently discouraged.
15:36 < blake_r> jhobbs: from psql documentation

Changed in maas:
milestone: none → 1.8.0
status: New → Triaged
importance: Undecided → High
Blake Rouse (blake-rouse) wrote :

Possible that we can create a field on the node that stores the total number of events for that node. That way a count will not need to be performed.

Note: that the COUNT(*) call is not always that slow. It only seems to occur on nodes that have node been loaded in a while. Making me think it has to do with postgres caching something to disk, and the time to load that data is what is slowing the query.

Andres Rodriguez (andreserl) wrote :

Marking this invalid and targetting this to maas-next because in 1.8 pages load much faster, although there are some corner cases we still need to improve.

Changed in maas:
milestone: 1.8.0 → next
summary: - Very slow load times for view node pages in MAAS 1.7.2
+ Very slow load times for view node pages in MAAS
Changed in maas:
status: Triaged → Incomplete
Jason Hobbs (jason-hobbs) wrote :

Sorry - this is still very valid and has not been addressed at all. The only reason pages are loading faster on OIL now is that I manually deleted most of the events. To do this, I had to delete the "delete event" trigger, delete the events, then recreate the "delete event" trigger, all manually via psql.

Changed in maas:
status: Incomplete → Triaged
Jason Hobbs (jason-hobbs) wrote :

Even with most of the events deleted, page load times for nodes in OIL are around 7 seconds - that's still very slow. They are always increasing and will be back up in the 20+ second range before too long if we don't clear out the events manually again.

Blake Rouse (blake-rouse) wrote :

I think the quick fix for this is to remove the total number events on the node details page. How do we feel about doing this? Because that is what is causing the page load to take a long time.

tags: added: ui ux
Changed in maas:
milestone: next → 1.8.0

I'm fine with that. A pretty good replacement would be to show the number
of displayed events which should be pretty easy client side.

On Fri, May 15, 2015 at 8:47 AM, Blake Rouse <email address hidden>
wrote:

> I think the quick fix for this is to remove the total number events on
> the node details page. How do we feel about doing this? Because that is
> what is causing the page load to take a long time.
>
> ** Tags added: ui ux
>
> ** Changed in: maas
> Milestone: next => 1.8.0
>
> --
> You received this bug notification because you are subscribed to the bug
> report.
> https://bugs.launchpad.net/bugs/1430852
>
> Title:
> Very slow load times for view node pages in MAAS
>
> Status in MAAS:
> Triaged
>
> Bug description:
> Sometimes loading node view pages in our MAAS server takes a really
> long time - like 26 seconds. Here's some logs of the load time for
> viewing node pages in our MAAS. The response time is given at the end
> of each log line.
>
> http://paste.ubuntu.com/10580586/
>
> To manage notifications about this bug go to:
> https://bugs.launchpad.net/maas/+bug/1430852/+subscriptions
>

Changed in maas:
status: Triaged → In Progress
assignee: nobody → Blake Rouse (blake-rouse)
Changed in maas:
status: In Progress → Fix Committed
Jason Hobbs (jason-hobbs) wrote :

Unfortunately this one is still an issue, even on 1.8.0 rc1. The node detail page load times are taking just as long as before.

Changed in maas:
status: Fix Committed → Triaged
Blake Rouse (blake-rouse) wrote :

Dang it.

Jason,

Are you still logging slow queries? Can you provide any log messages that show the queries that are taking a long period of time?

Jason Hobbs (jason-hobbs) wrote :

Sure - here are some slow queries from just earlier:

http://paste.ubuntu.com/11543728/

Changed in maas:
status: Triaged → Fix Committed
Blake Rouse (blake-rouse) wrote :

Here is the breakdown of the query plan for the first one in the paste which is the slowest one.

http://paste.ubuntu.com/11544024/

You can see that everything is using an index and that a LIMIT is set to 50, so the query should not be that long. Unless this information is cached to the disk and needs to be loaded into memory.

Have you tried placing the MAAS machine on an SSD or maybe a PCI-E SSD to speed up the loading of indexes from the disk? Bcache might be something to look into to fix this issue.

I really do not know of a way in MAAS to speed this up. With so many events in OIL I think you will need to look into upgrading the storage on the server that hosts the MAAS service.

Jason Hobbs (jason-hobbs) wrote :

On 06/03/2015 09:31 AM, Blake Rouse wrote:
> Here is the breakdown of the query plan for the first one in the paste
> which is the slowest one.
>
> http://paste.ubuntu.com/11544024/
>
> You can see that everything is using an index and that a LIMIT is set to
> 50, so the query should not be that long. Unless this information is
> cached to the disk and needs to be loaded into memory.
>
> Have you tried placing the MAAS machine on an SSD or maybe a PCI-E SSD
> to speed up the loading of indexes from the disk? Bcache might be
> something to look into to fix this issue.

Sure - SSD might help. If MAAS needs an SSD to be performant though,
that's not great.

> I really do not know of a way in MAAS to speed this up. With so many
> events in OIL I think you will need to look into upgrading the storage
> on the server that hosts the MAAS service.

How about giving us a way to clear out old events, or limit the number
of events?

Jason

Robert Collins (lifeless) wrote :

Re: "Here is the breakdown of the query plan for the first one in the paste which is the slowest one. ..."

Have a look at http://explain.depesz.com/s/H7X - thats a very useful site for quick analysis of an explain. BTW I *really* recomend using explain analyze, not just explain - and using the relatively recent features to gather more data in the analysis, which can turn what I say below (thats a guess) into a concrete 'yes thats the problem'.

Anyhow, the plan shows that its likely going to behave terribly - the cost estimate is 18K.

The eventtype table is being hashed once - thats cheap. But here's the thing I'd look at::
               -> Bitmap Heap Scan on maasserver_event (cost=505.91..18416.10 rows=5869 width=52)
                     Recheck Cond: (node_id = 827)
                     -> Bitmap Index Scan on maasserver_event_node_id (cost=0.00..504.44 rows=5869 width=0)
                           Index Cond: (node_id = 827)

Thats expecting to pull 6000 rows off of disk, but its doing it via a bitmap heap scan - so its going to be examining every single row in the table. It the table has (say) 2M rows, thats going to flatten your server while it walks every single tuple in the index to figure out which tuples from the table it actually needs. http://<email address hidden>

I'd seriously consider putting this data into a data mining form where you can avoid all full-table query plans.

Changed in maas:
status: Fix Committed → Triaged
Robert Collins (lifeless) wrote :

Wgrant has pointed out that my postgresql is a little rusty, and this is using the index - but its still having to hit all 6K rows - getting an explain (analyze on, buffers on) would help with determining if this is entirely IO driven, or something else; making it possible to answer the 50 most recent from an index should be possible fairly easily - and thats going to be needed if this is just IO seeks killing performance.

Raphaël Badin (rvb) wrote :

Here is "EXPLAIN ANALYSE" run twice (http://paste.ubuntu.com/11645381/). As you can see, once the data is cached it's much much faster. While this is expected, the magnitude of the speedup makes me think it's really IO seek that is killing performance.

Here is EXPLAIN (ANALYSE, BUFFERS) http://paste.ubuntu.com/11645601/.

A cheap improvement (doesn't really solve the main problem but improves things a bit) is to avoid the join and pre-compute the types the query should ignore (we know for a fact this will be a small list): http://paste.ubuntu.com/11645601/

William Grant (wgrant) wrote :

The filter isn't very selective, so postgres should be very happy to use an index for the sort. An index on maasserver_event(node_id, id) will probably get you the latest 50 in <200ms cold.

If you want a COUNT you're out of luck without faster IO (or more RAM), but a COUNT doesn't sound very useful here.

Raphaël Badin (rvb) wrote :

The speedup is indeed significant with the (node_id, id) index: http://paste.ubuntu.com/11666261/

Raphaël Badin (rvb) on 2015-06-09
Changed in maas:
status: Triaged → Fix Committed
Jason Hobbs (jason-hobbs) wrote :

Tested this on OIL - it works! Thanks everyone for your help on this!

Changed in maas:
status: Fix Committed → Fix Released
To post a comment you must log in.
This report contains Public information  Edit
Everyone can see this information.

Other bug subscribers