[2.5, websockets, performance] Loading network related page, such as 'interfaces' or 'add device' takes too long

Bug #1816452 reported by Andres Rodriguez
This bug affects 1 person
Affects  Status        Importance  Assigned to      Milestone
MAAS     Fix Released  Critical    Björn Tillenius
2.5      Fix Released  Critical    Björn Tillenius

Bug Description

Attempting to load the 'interfaces' tab of multiple controllers/machines, or attempting to 'add device', takes too long.

I've noticed that 'add device' takes at least 30s, which is a crazy amount of time. The similarity here is that both places access the network model, and this is an area that takes very long.

There are two things that cause the slowdown.

  1) The subnet.list websocket handler gets slow when
     a lot of IPs are observed
  2) We keep adding StaticIPAddresses with a NULL ip
     for some interfaces, causing them to have more
     than 10 000 records each.

This bug is specifically to fix 1) and to make sure things work
correctly even in the presence of 2).

I've filed bug 1817056 to track 2), which also should be backported
to 2.5.

Changed in maas:
importance: Undecided → Critical
status: New → Triaged
milestone: none → 2.5.2
assignee: nobody → Lee Trager (ltrager)
tags: added: performance track
Changed in maas:
assignee: Lee Trager (ltrager) → Björn Tillenius (bjornt)
Björn Tillenius (bjornt) wrote :

I've been looking through a tcpdump of the websocket traffic when loading the Devices page. Most websocket requests return quickly, but one of them, "subnet.list", took a minute to return 4kb of data.

Tomorrow I'm going to do a few more dumps to see whether that one always takes a long time, or whether it was just a coincidence that it took so long.

Changed in maas:
status: Triaged → In Progress
Björn Tillenius (bjornt) wrote :

Maybe it's not related, but here's a SQL query that takes around 0.5 seconds to execute, while the transaction in which it is executed takes around 1 minute to finish.

It uses an IN clause, but it passes a very long list of ids, so the query itself is about half a megabyte in size. I've yet to investigate which piece of code generates that query, but it will most likely take quite a while to work with that amount of data in Python.

Another data point is that the maasserver_interface_ip_addresses table has 67136 records.

Here's the query, without the IDs:

SELECT ("maasserver_interface_ip_addresses"."staticipaddress_id") AS "_prefetch_related_val_staticipaddress_id",
       "maasserver_interface"."id",
       "maasserver_interface"."created",
       "maasserver_interface"."updated",
       "maasserver_interface"."node_id",
       "maasserver_interface"."name",
       "maasserver_interface"."type",
       "maasserver_interface"."vlan_id",
       "maasserver_interface"."mac_address",
       "maasserver_interface"."ipv4_params",
       "maasserver_interface"."ipv6_params",
       "maasserver_interface"."params",
       "maasserver_interface"."tags",
       "maasserver_interface"."enabled",
       "maasserver_interface"."mdns_discovery_state",
       "maasserver_interface"."neighbour_discovery_state",
       "maasserver_interface"."acquired",
       "maasserver_interface"."vendor",
       "maasserver_interface"."product",
       "maasserver_interface"."firmware_version"
FROM "maasserver_interface"
INNER JOIN "maasserver_interface_ip_addresses"
    ON ("maasserver_interface"."id" = "maasserver_interface_ip_addresses"."interface_id")
WHERE "maasserver_interface_ip_addresses"."staticipaddress_id" IN (...)
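
As a rough sanity check on the "half a megabyte" figure, the size of the id list alone can be estimated. This is only a back-of-the-envelope sketch: it assumes one six-digit id per row of maasserver_interface_ip_addresses, and the id values themselves are made up. The real list may well be a different length.

```python
# Hypothetical illustration: estimate how big an IN (...) list gets.
# Only the row count (67136) comes from the report; the id values
# and their six-digit width are assumptions.


def in_clause_size(ids):
    """Return the number of characters in a comma-separated id list."""
    return len(", ".join(str(i) for i in ids))


row_count = 67136
ids = range(100_000, 100_000 + row_count)  # every id has six digits

size = in_clause_size(ids)
print(f"{size} characters, about {size / 1_000_000:.2f} MB")
```

Under those assumptions the id list alone is already in the region of half a megabyte, which matches the observed query size.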

Björn Tillenius (bjornt) wrote :

I reproduced the issue locally by inserting 60 000 records into maasserver_interface_ip_addresses, spread out over three devices. The most likely cause is that we allow multiple staticipaddress records with ip set to null for the same interface, but I'm going to look into why handling them is so inefficient.
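
A reproduction along those lines could look roughly like the sketch below. It uses an in-memory SQLite database with a heavily simplified, hypothetical schema (two tables named after the MAAS ones, but with only the columns needed here), and it assumes one interface per device. It is not the script that was actually used.

```python
import sqlite3

# Simplified stand-in schema; the real MAAS tables have many more columns.
SCHEMA = """
CREATE TABLE staticipaddress (
    id INTEGER PRIMARY KEY,
    ip TEXT
);
CREATE TABLE interface_ip_addresses (
    id INTEGER PRIMARY KEY,
    interface_id INTEGER NOT NULL,
    staticipaddress_id INTEGER NOT NULL REFERENCES staticipaddress(id)
);
"""


def build_test_db(total_records=60_000, interface_count=3):
    """Create NULL-ip address records spread evenly over a few interfaces."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    record_ids = range(1, total_records + 1)
    conn.executemany(
        "INSERT INTO staticipaddress (id, ip) VALUES (?, NULL)",
        [(n,) for n in record_ids],
    )
    conn.executemany(
        "INSERT INTO interface_ip_addresses (interface_id, staticipaddress_id)"
        " VALUES (?, ?)",
        [((n % interface_count) + 1, n) for n in record_ids],
    )
    conn.commit()
    return conn


conn = build_test_db()
counts = conn.execute(
    """
    SELECT l.interface_id, COUNT(*)
    FROM interface_ip_addresses AS l
    JOIN staticipaddress AS s ON s.id = l.staticipaddress_id
    WHERE s.ip IS NULL
    GROUP BY l.interface_id
    ORDER BY l.interface_id
    """
).fetchall()
print(counts)
```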

Björn Tillenius (bjornt) wrote :

There are two issues here. The first one is that we shouldn't allow an interface to have more than one staticipaddress record that has an empty ip. That should reduce the number of records, so that subnet.list should be a lot faster.
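
To illustrate the first point, here is a minimal sketch of reusing an existing empty-ip record instead of adding a new one each time. The schema is a simplified stand-in and the helper name ensure_single_null_ip is hypothetical. MAAS itself uses the Django ORM, so the real fix will look different.

```python
import sqlite3

# Simplified stand-in schema, not the real MAAS tables.
SCHEMA = """
CREATE TABLE staticipaddress (
    id INTEGER PRIMARY KEY,
    ip TEXT
);
CREATE TABLE interface_ip_addresses (
    id INTEGER PRIMARY KEY,
    interface_id INTEGER NOT NULL,
    staticipaddress_id INTEGER NOT NULL REFERENCES staticipaddress(id)
);
"""


def ensure_single_null_ip(conn, interface_id):
    """Return the id of the interface's NULL-ip record, creating it if needed."""
    row = conn.execute(
        """
        SELECT s.id
        FROM staticipaddress AS s
        JOIN interface_ip_addresses AS l ON l.staticipaddress_id = s.id
        WHERE l.interface_id = ? AND s.ip IS NULL
        ORDER BY s.id
        LIMIT 1
        """,
        (interface_id,),
    ).fetchone()
    if row is not None:
        return row[0]  # reuse the existing record
    cur = conn.execute("INSERT INTO staticipaddress (ip) VALUES (NULL)")
    conn.execute(
        "INSERT INTO interface_ip_addresses (interface_id, staticipaddress_id)"
        " VALUES (?, ?)",
        (interface_id, cur.lastrowid),
    )
    return cur.lastrowid


conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)
record_ids = [ensure_single_null_ip(conn, 1) for _ in range(5)]
null_count = conn.execute(
    "SELECT COUNT(*) FROM staticipaddress WHERE ip IS NULL"
).fetchone()[0]
print(record_ids, null_count)
```

Calling the helper repeatedly for the same interface keeps returning the same record, so the table no longer grows with every call.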

But I'm also investigating how well it scales when there are a lot of observed IP addresses. It's clear that we scale badly when one interface has thousands of addresses, but I've yet to investigate what happens when we see thousands of interfaces.

The reason we scale badly at the moment is that the query to get all the subnets uses prefetch to fetch everything in one query.

Björn Tillenius (bjornt) wrote :

I plan to fix this in three branches:

  1) Optimize the subnet.list websocket handler
  2) Stop creating multiple StaticIPAddress records that
     all have a null IP
  3) A database patch to clean up the redundant StaticIPAddress
     records.

1) is the most critical, so I'll link that to this bug and will
file new bugs for 2) and 3). In 1) I'll change the handler to
filter out null IP addresses in the database query itself, and
I'll also make it scale better. At the moment, if you have
many IP addresses, subnet.list will be slow even if 2) is fixed.
I tested with 10 000 IP addresses, and with the current code
it took around 20 seconds to run. With the optimizations in my
branch I got it down to 0.6 seconds.
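
The idea of filtering null IPs in the query itself, rather than after loading the rows, can be sketched as follows. This is not the actual handler change: it uses SQLite with a hypothetical one-table schema, and both function names are made up for the comparison.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Simplified stand-in table; subnet_id and ip are the only columns needed.
conn.execute(
    "CREATE TABLE staticipaddress ("
    "id INTEGER PRIMARY KEY, subnet_id INTEGER, ip TEXT)"
)
rows = [(1, None)] * 10_000 + [(1, f"10.0.0.{n}") for n in range(1, 4)]
conn.executemany(
    "INSERT INTO staticipaddress (subnet_id, ip) VALUES (?, ?)", rows
)


def ips_filtered_in_python(conn, subnet_id):
    """Load every row, then drop the NULL ips on the Python side."""
    all_rows = conn.execute(
        "SELECT ip FROM staticipaddress WHERE subnet_id = ? ORDER BY id",
        (subnet_id,),
    ).fetchall()
    return [ip for (ip,) in all_rows if ip is not None]


def ips_filtered_in_sql(conn, subnet_id):
    """Let the database drop NULL ips so they are never transferred."""
    kept_rows = conn.execute(
        "SELECT ip FROM staticipaddress"
        " WHERE subnet_id = ? AND ip IS NOT NULL ORDER BY id",
        (subnet_id,),
    ).fetchall()
    return [ip for (ip,) in kept_rows]


print(ips_filtered_in_python(conn, 1) == ips_filtered_in_sql(conn, 1))
```

Both functions return the same addresses, but the second one only ever materializes three rows in Python instead of 10 003.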

Ideally 2) and 3) would be in the same branch, but backporting
database patches is tricky. And the database patch only cleans
up the data, so it's not vital.
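
For 3), a cleanup could work roughly like this sketch, which keeps the lowest-id null-IP record per interface and deletes the rest. The schema is a simplified stand-in, the helper names are hypothetical, and the real patch would be a MAAS database migration rather than a standalone script.

```python
import sqlite3

# Simplified stand-in schema, not the real MAAS tables.
SCHEMA = """
CREATE TABLE staticipaddress (
    id INTEGER PRIMARY KEY,
    ip TEXT
);
CREATE TABLE interface_ip_addresses (
    id INTEGER PRIMARY KEY,
    interface_id INTEGER NOT NULL,
    staticipaddress_id INTEGER NOT NULL REFERENCES staticipaddress(id)
);
"""


def add_ip(conn, interface_id, ip):
    """Insert one address record and link it to an interface."""
    cur = conn.execute("INSERT INTO staticipaddress (ip) VALUES (?)", (ip,))
    conn.execute(
        "INSERT INTO interface_ip_addresses (interface_id, staticipaddress_id)"
        " VALUES (?, ?)",
        (interface_id, cur.lastrowid),
    )


def remove_redundant_null_ips(conn):
    """Keep one NULL-ip record per interface; delete all other NULL-ip records.

    Note: NULL-ip records not linked to any interface are deleted as well.
    """
    redundant = [
        row[0]
        for row in conn.execute(
            """
            SELECT id FROM staticipaddress
            WHERE ip IS NULL AND id NOT IN (
                SELECT MIN(s.id)
                FROM staticipaddress AS s
                JOIN interface_ip_addresses AS l
                    ON l.staticipaddress_id = s.id
                WHERE s.ip IS NULL
                GROUP BY l.interface_id
            )
            """
        ).fetchall()
    ]
    params = [(record_id,) for record_id in redundant]
    conn.executemany(
        "DELETE FROM interface_ip_addresses WHERE staticipaddress_id = ?", params
    )
    conn.executemany("DELETE FROM staticipaddress WHERE id = ?", params)
    return len(redundant)


conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)
for _ in range(5):
    add_ip(conn, 1, None)
add_ip(conn, 1, "10.0.0.5")
for _ in range(2):
    add_ip(conn, 2, None)

removed = remove_redundant_null_ips(conn)
remaining = conn.execute(
    """
    SELECT l.interface_id, COUNT(*)
    FROM interface_ip_addresses AS l
    JOIN staticipaddress AS s ON s.id = l.staticipaddress_id
    WHERE s.ip IS NULL
    GROUP BY l.interface_id
    ORDER BY l.interface_id
    """
).fetchall()
print(removed, remaining)
```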

description: updated
Changed in maas:
milestone: 2.5.2 → 2.6.0
Changed in maas:
status: In Progress → Fix Committed
Changed in maas:
milestone: 2.6.0 → 2.6.0alpha1
Changed in maas:
status: Fix Committed → Fix Released