maasserver_event table grows without bounds, impacting UI performance

Bug #1860619 reported by Adam Beeman on 2020-01-23
This bug affects 2 people
Affects: MAAS
Importance: High
Assigned to: Björn Tillenius

Bug Description

There have been a number of performance-related bugs over time, in various states, so this may duplicate past concerns, but I could not find a currently active bug that reflects the issue in a recent MAAS version.

I have multiple MAAS regions, each with several hundred servers. Over time, the web interface becomes increasingly slow to load the list of systems. For example, I timed a refresh of the Machines tab on a server with 295 machines: it took 3 minutes and 45 seconds to load!
Inspection of the database shows a huge server events table:

postgres=# \c maasdb
You are now connected to database "maasdb" as user "postgres".
maasdb=# SELECT nspname || '.' || relname AS "relation",
maasdb-# pg_size_pretty(pg_total_relation_size(C.oid)) AS "total_size"
maasdb-# FROM pg_class C
maasdb-# LEFT JOIN pg_namespace N ON (N.oid = C.relnamespace)
maasdb-# WHERE nspname NOT IN ('pg_catalog', 'information_schema')
maasdb-# AND C.relkind <> 'i'
maasdb-# AND nspname !~ '^pg_toast'
maasdb-# ORDER BY pg_total_relation_size(C.oid) DESC
maasdb-# LIMIT 5;
              relation              | total_size
------------------------------------+------------
 public.maasserver_event            | 24 GB
 public.metadataserver_scriptresult | 73 MB
 public.metadataserver_nodeuserdata | 27 MB
 public.maasserver_node             | 1320 kB
 public.maasserver_neighbour        | 1256 kB
(5 rows)

maasdb=#

We have adopted the undesirable practice of truncating the events table periodically with:
truncate table public.maasserver_event;

... and this immediately speeds things up and makes the web interface usable again.
I suspect our heavy use of DHCP contributes to the bloat of this table, because I believe DHCP lease renewals are logged as events in it, though I may be wrong about that.

Is there something we can do to make this more manageable? Either an event log rotation/pruning mechanism, or perhaps we don't need to log as much into this table?
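
One shape such a pruning mechanism could take is a periodic job that deletes events older than a retention window, in small batches. The sketch below is an illustration only, not how MAAS does it: it uses Python's sqlite3 as a stand-in for PostgreSQL, assumes maasserver_event has an `id` primary key and a `created` timestamp column, and the `prune_events` helper, the 30-day retention and the batch size are all hypothetical.

```python
import sqlite3
from datetime import datetime, timedelta, timezone


def prune_events(conn, keep_days=30, batch_size=1000, now=None):
    """Delete events older than keep_days, batch_size rows at a time.

    Hypothetical helper; returns the total number of rows deleted.
    """
    now = now or datetime.now(timezone.utc)
    cutoff = (now - timedelta(days=keep_days)).isoformat()
    total = 0
    while True:
        cur = conn.execute(
            "DELETE FROM maasserver_event WHERE id IN ("
            " SELECT id FROM maasserver_event WHERE created < ? LIMIT ?)",
            (cutoff, batch_size),
        )
        conn.commit()  # one short transaction per batch
        if cur.rowcount == 0:
            return total
        total += cur.rowcount


# Stand-in table with one event per day for the last 100 days.
# Column names are assumptions about the real schema.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE maasserver_event ("
    " id INTEGER PRIMARY KEY, created TEXT, description TEXT)"
)
now = datetime(2020, 1, 23, tzinfo=timezone.utc)
rows = [((now - timedelta(days=d)).isoformat(), f"event {d}") for d in range(100)]
conn.executemany(
    "INSERT INTO maasserver_event (created, description) VALUES (?, ?)", rows
)

deleted = prune_events(conn, keep_days=30, batch_size=25, now=now)
remaining = conn.execute("SELECT COUNT(*) FROM maasserver_event").fetchone()[0]
print(deleted, remaining)
```

Against a real PostgreSQL database the parameter placeholders depend on the driver (psycopg2 uses `%s`). Unlike TRUNCATE, a DELETE in PostgreSQL leaves dead rows whose space only becomes reusable after a vacuum, so a first large prune may not shrink the table on disk right away.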

$ dpkg -l '*maas*'|cat
Desired=Unknown/Install/Remove/Purge/Hold
| Status=Not/Inst/Conf-files/Unpacked/halF-conf/Half-inst/trig-aWait/Trig-pend
|/ Err?=(none)/Reinst-required (Status,Err: uppercase=bad)
||/ Name Version Architecture Description
+++-===============================-======================================-============-=============================================
un maas <none> <none> (no description available)
ii maas-cli 2.6.1-7832-g17912cdc9-0ubuntu1~18.04.1 all MAAS client and command-line interface
un maas-cluster-controller <none> <none> (no description available)
ii maas-common 2.6.1-7832-g17912cdc9-0ubuntu1~18.04.1 all MAAS server common files
ii maas-dhcp 2.6.1-7832-g17912cdc9-0ubuntu1~18.04.1 all MAAS DHCP server
un maas-dns <none> <none> (no description available)
ii maas-proxy 2.6.1-7832-g17912cdc9-0ubuntu1~18.04.1 all MAAS Caching Proxy
ii maas-rack-controller 2.6.1-7832-g17912cdc9-0ubuntu1~18.04.1 all Rack Controller for MAAS
ii maas-region-api 2.6.1-7832-g17912cdc9-0ubuntu1~18.04.1 all Region controller API service for MAAS
ii maas-region-controller 2.6.1-7832-g17912cdc9-0ubuntu1~18.04.1 all Region Controller for MAAS
un maas-region-controller-min <none> <none> (no description available)
un python-django-maas <none> <none> (no description available)
un python-maas-client <none> <none> (no description available)
un python-maas-provisioningserver <none> <none> (no description available)
ii python3-django-maas 2.6.1-7832-g17912cdc9-0ubuntu1~18.04.1 all MAAS server Django web framework (Python 3)
ii python3-maas-client 2.6.1-7832-g17912cdc9-0ubuntu1~18.04.1 all MAAS python API client (Python 3)
ii python3-maas-provisioningserver 2.6.1-7832-g17912cdc9-0ubuntu1~18.04.1 all MAAS server provisioning libraries (Python 3)

Adam Beeman (abeeman) wrote :

See also: https://bugs.launchpad.net/maas/+bug/1860619. Though that bug says "Fix Committed", I don't believe it addresses the problem of database bloat.

Adam Beeman (abeeman) wrote :

Correction, https://bugs.launchpad.net/maas/+bug/1830365 is the older bug.

Alberto Donato (ack) on 2020-02-28
Changed in maas:
status: New → Triaged
importance: Undecided → High
Changed in maas:
milestone: none → 2.8.0b1
Alberto Donato (ack) on 2020-04-17
Changed in maas:
milestone: 2.8.0b1 → 2.8.0b2
Alberto Donato (ack) on 2020-04-24
Changed in maas:
milestone: 2.8.0b2 → 2.8.0rc1
Alberto Donato (ack) on 2020-05-01
Changed in maas:
milestone: 2.8.0b3 → 2.8.0rc1
Changed in maas:
assignee: nobody → Björn Tillenius (bjornt)
Alberto Donato (ack) on 2020-05-11
Changed in maas:
milestone: 2.8.0b4 → 2.8.0rc1
Changed in maas:
assignee: Björn Tillenius (bjornt) → Adam Collard (adam-collard)
Changed in maas:
assignee: Adam Collard (adam-collard) → Björn Tillenius (bjornt)
status: Triaged → In Progress
Björn Tillenius (bjornt) wrote :

abeeman (and anyone else experiencing problems), could you please run the following SQL to shed some light on what the top events are?

  https://paste.ubuntu.com/p/Nv3QhY4996/

Long-term, we'll probably need to cull the event table regularly, so that it can't grow too much. But short-term we're going to remove logging of the power queries, which we know cause a lot of events to be logged and aren't useful to have in the logs.

It'd be interesting to see what other events are being issued a lot. We know that installation and commissioning events may grow quite a lot, but it's harder to fix that, so it won't be done for 2.8.
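
The linked paste is not reproduced in this report. A query of roughly the following shape would produce per-type event counts; it assumes each row in maasserver_event references a row in a maasserver_eventtype table through a `type_id` column, which should be checked against the actual schema. Python's sqlite3 stands in for PostgreSQL so the sketch runs on its own, and the sample rows are made up.

```python
import sqlite3

# Count events per type, most frequent first.
# Table and column names are assumptions about the MAAS schema.
TOP_EVENTS_SQL = """
SELECT et.name, COUNT(*) AS event_count
FROM maasserver_event AS e
JOIN maasserver_eventtype AS et ON et.id = e.type_id
GROUP BY et.name
ORDER BY event_count DESC
LIMIT 20
"""

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE maasserver_eventtype (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE maasserver_event (id INTEGER PRIMARY KEY, type_id INTEGER);
    INSERT INTO maasserver_eventtype VALUES
        (1, 'NODE_POWER_QUERIED_DEBUG'),
        (2, 'NODE_STATUS_EVENT'),
        (3, 'NODE_PXE_REQUEST');
    """
)
# Made-up sample data: 5, 2 and 1 events of each type.
conn.executemany(
    "INSERT INTO maasserver_event (type_id) VALUES (?)",
    [(1,)] * 5 + [(2,)] * 2 + [(3,)],
)

top_events = conn.execute(TOP_EVENTS_SQL).fetchall()
for name, event_count in top_events:
    print(f"{name:32} {event_count:>11}")
```

The SQL in `TOP_EVENTS_SQL` uses only standard constructs, so it should run unchanged in psql if the assumed table and column names match.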

Changed in maas:
status: In Progress → Fix Committed
Alberto Donato (ack) on 2020-06-04
Changed in maas:
status: Fix Committed → Fix Released
György Szombathelyi (gyurco) wrote :

Having a _DEBUG event at the top might be easily fixable, I guess:

               name               | event_count
----------------------------------+-------------
 NODE_POWER_QUERIED_DEBUG         |    13633640
 NODE_POWER_QUERY_FAILED          |     1088125
 RACK_IMPORT_INFO                 |      179640
 NODE_STATUS_EVENT                |       71113
 NODE_INSTALL_EVENT               |       52484
 REGION_IMPORT_INFO               |       31408
 NODE_COMMISSIONING_EVENT         |       28222
 NODE_TFTP_REQUEST                |        5725
 NODE_CHANGED_STATUS              |        1082
 NODE_POWER_QUERIED               |         895
 NODE_PXE_REQUEST                 |         708
 NODE_POWERED_ON                  |         547
 NODE_POWER_ON_STARTING           |         520
 REQUEST_NODE_START_COMMISSIONING |         288
 NODE_POST_INSTALL_EVENT_FAILED   |         284
 REQUEST_NODE_START_DEPLOYMENT    |         251
 REQUEST_NODE_ACQUIRE             |         249
 NODE_INSTALLATION_FINISHED       |         228
 NODE_POWER_OFF_STARTING          |         145
 NODE_POWERED_OFF                 |         145
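
To put these counts in proportion, the arithmetic below sums the rows listed in this comment and computes the share taken by the power-query events. It assumes the listing is close to complete; any event types not shown would lower the shares slightly.

```python
# Event counts copied from the listing in this comment.
event_counts = {
    "NODE_POWER_QUERIED_DEBUG": 13633640,
    "NODE_POWER_QUERY_FAILED": 1088125,
    "RACK_IMPORT_INFO": 179640,
    "NODE_STATUS_EVENT": 71113,
    "NODE_INSTALL_EVENT": 52484,
    "REGION_IMPORT_INFO": 31408,
    "NODE_COMMISSIONING_EVENT": 28222,
    "NODE_TFTP_REQUEST": 5725,
    "NODE_CHANGED_STATUS": 1082,
    "NODE_POWER_QUERIED": 895,
    "NODE_PXE_REQUEST": 708,
    "NODE_POWERED_ON": 547,
    "NODE_POWER_ON_STARTING": 520,
    "REQUEST_NODE_START_COMMISSIONING": 288,
    "NODE_POST_INSTALL_EVENT_FAILED": 284,
    "REQUEST_NODE_START_DEPLOYMENT": 251,
    "REQUEST_NODE_ACQUIRE": 249,
    "NODE_INSTALLATION_FINISHED": 228,
    "NODE_POWER_OFF_STARTING": 145,
    "NODE_POWERED_OFF": 145,
}

total = sum(event_counts.values())
debug_share = event_counts["NODE_POWER_QUERIED_DEBUG"] / total
power_share = (
    event_counts["NODE_POWER_QUERIED_DEBUG"]
    + event_counts["NODE_POWER_QUERY_FAILED"]
) / total

print(f"listed events:             {total}")
print(f"NODE_POWER_QUERIED_DEBUG:  {debug_share:.1%}")
print(f"DEBUG + QUERY_FAILED:      {power_share:.1%}")
```

On these numbers NODE_POWER_QUERIED_DEBUG alone accounts for about 90% of the listed events, and the two power-query types together for about 97.5%, which is consistent with removing power-query logging as the short-term fix.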
