MAAS has orphan IP addresses and DNS records that are slowing down the entire service

Bug #2049508 reported by Jacopo Rota
This bug affects 5 people
Affects  Status        Importance  Assigned to   Milestone
MAAS     Fix Released  High        Jacopo Rota
3.4      Fix Released  High        Jacopo Rota
3.5      Fix Released  High        Jacopo Rota

Bug Description

On the SOLQA env at X.X.164.2:5240, running the 3.4.0 snap, we see a lot of Postgres activity:

```
    PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
 654234 root 20 0 1187948 908072 26908 S 84.1 1.4 20:51.34 python3
 654236 root 20 0 1312776 952840 27144 S 55.5 1.4 22:23.74 python3
 654232 root 20 0 1311844 964052 26996 S 43.2 1.5 25:05.65 python3
 671936 postgres 20 0 239800 103008 90004 S 31.9 0.2 0:38.30 postgres
 672556 postgres 20 0 250784 118092 98500 R 31.6 0.2 0:34.93 postgres
 672900 postgres 20 0 235176 101228 92344 R 29.6 0.2 0:07.53 postgres
 672552 postgres 20 0 236608 79128 69008 S 28.9 0.1 0:21.86 postgres
 673032 postgres 20 0 239484 100504 87736 R 26.6 0.2 0:04.76 postgres
 672220 postgres 20 0 258404 103504 75952 R 15.6 0.2 0:04.42 postgres
 673275 postgres 20 0 235116 89260 81424 S 15.6 0.1 0:05.57 postgres
 672577 postgres 20 0 235636 82832 73604 S 14.3 0.1 0:45.81 postgres
```

After looking at the DB, most of the queries are DNS/static IP related, for example:

```
 SELECT "maasserver_staticipaddress"."id", "maasserver_staticipaddress"."created", "maasserver_staticipaddress"."updated", "maasserver_staticipaddress"."ip", "maasserver_staticipaddress"."alloc_type", "maasserver_staticipaddress"."subnet_id", "maasserver_staticipaddress"."user_id", "maasserver_staticipaddress"."lease_time", "maasserver_staticipaddress"."temp_expires_on" FROM "maasserver_staticipaddress" INNER JOIN "maasserver_dnsresource_ip_addresses" ON ("maasserver_staticipaddress"."id" = "maasserver_dnsresource_ip_addresses"."staticipaddress_id") WHERE "maasserver_dnsresource_ip_addresses"."dnsresource_id" = 2068
```

Looking at record 2068:

```
 2068 | 2023-10-23 19:21:19.843186+00 | 2023-10-23 19:21:19.843186+00 | network-poller-b2a8e551-1953-47e6-9026-d43056f11570 | 1 |
```
and the related IP address:
```
  id | created | updated | ip | alloc_type | subnet_id | user_id | lease_time | temp_expires_on
-------+------------------------+-------------------------------+----+------------+-----------+---------+------------+-----------------
 50843 | 2023-10-23 19:21:19+00 | 2023-10-23 19:31:19.913985+00 | | 6 | 2 | | 600 |

```

We spotted that this is coming from one of the MANY LXD containers that are spawned and deleted after a few minutes.

The number of such records is huge:

```
maasdb=# select subnet_id, count(*) from maasserver_staticipaddress group by subnet_id;
 subnet_id | count
-----------+-------
           | 121
         3 | 303
         5 | 246
         4 | 65
         2 | 13686
```

as well as the number of DNS records:

```
maasdb=# select domain_id, count(*) from maasserver_dnsresource group by domain_id;
 domain_id | count
-----------+-------
         8 | 43
         7 | 43
         1 | 11862
         5 | 43
         4 | 43
         6 | 43
         3 | 43
```

Most of these IP addresses are of `alloc_type=6` (DISCOVERED) and are really old:

```
   id | created | updated | ip | alloc_type | subnet_id | user_id | lease_time | temp_expires_on
--------+-------------------------------+-------------------------------+----------------+------------+-----------+---------+------------+-----------------
 169688 | 2023-12-01 20:57:44+00 | 2023-12-01 21:07:44.427741+00 | | 6 | 2 | | 600 |
  37082 | 2023-10-18 20:12:25+00 | 2023-10-18 20:22:25.756612+00 | | 6 | 2 | | 600 |
  30036 | 2023-10-17 17:14:27+00 | 2023-10-17 17:24:27.870446+00 | | 6 | 2 | | 600 |
  24933 | 2023-10-16 17:14:03+00 | 2023-10-16 17:24:03.284484+00 | | 6 | 2 | | 600 |
  27055 | 2023-10-17 03:28:39+00 | 2023-10-17 03:38:39.891357+00 | | 6 | 2 | | 600 |
  31218 | 2023-10-17 21:11:55+00 | 2023-10-17 21:21:55.570965+00 | | 6 | 2 | | 600 |
  22972 | 2023-10-16 11:08:21+00 | 2023-10-16 11:18:21.326206+00 | | 6 | 2 | | 600 |
```

meaning that we are failing to clean them up (these are probably created when a new host gets an IP from the MAAS DHCP server).
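The staleness condition itself is simple to state: a DISCOVERED address whose lease window has elapsed without a renewal should be eligible for cleanup. A minimal pure-Python sketch of that check (the helper name and signature are illustrative, not MAAS code):

```python
from datetime import datetime, timedelta, timezone

def is_stale_discovered(updated: datetime, lease_time_s: int, now: datetime) -> bool:
    """A DISCOVERED address is stale once its lease window has elapsed
    since the last renewal (the `updated` timestamp)."""
    return now > updated + timedelta(seconds=lease_time_s)

# Row 37082 above: last updated 2023-10-18, lease_time 600 s -> long expired.
updated = datetime(2023, 10, 18, 20, 22, 25, tzinfo=timezone.utc)
now = datetime(2023, 12, 2, tzinfo=timezone.utc)
print(is_stale_discovered(updated, 600, now))  # True
```

By this criterion every DISCOVERED row in the table above, months past its 600-second lease, should have been garbage-collected long ago.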

Related branches

Jacopo Rota (r00ta)
description: updated
Revision history for this message
Marian Gasparovic (marosg) wrote :

As discussed with Jacopo, I did a test on a freshly deployed MAAS 3.4. I had a loop which created an LXC container that got a DHCP address from MAAS, then immediately deleted the container.

After doing it around 100 times I could see the numbers in the above tables grow, and they never go down, even after the leases expire.
I started another round with more containers.

```
maasdb=# select domain_id, count(*) from maasserver_dnsresource group by domain_id;
 domain_id | count
-----------+-------
         1 | 2067
(1 row)

maasdb=# select subnet_id, count(*) from maasserver_staticipaddress group by subnet_id;
 subnet_id | count
-----------+-------
         2 | 17
         6 | 6
           | 6
         1 | 2233
         3 | 8
         5 | 6
(6 rows)
```

This is after all containers are gone and the leases have expired.

The MAAS UI says "5 hosts; 32 resource records", which is correct.

Revision history for this message
Jacopo Rota (r00ta) wrote :

I've loaded the SOLQA database into my homelab and collected some advanced Postgres metrics. I was able to see the large number of DNS/IP queries, as per the screenshot.

Revision history for this message
Jacopo Rota (r00ta) wrote :

This seems to be related to https://bugs.launchpad.net/maas/+bug/2025468 .

In this bug report we have to:
1) ensure that the bug is mitigated
2) provide a query to clean up the orphan IPs and DNS entries caused by https://bugs.launchpad.net/maas/+bug/2025468

In order to provide step 2, I'd do the following:
a) take a dump of the SOLQA environment
b) load it into a local environment to work on the cleanup query
c) once the query is finalized and the system is properly running, execute the query on the SOLQA env
d) if something breaks, roll back the env using the original dump
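For step b), one plausible orphan criterion the cleanup query could encode can be sketched in pure Python (the data, ids, and criterion below are illustrative assumptions, not MAAS code or the real schema):

```python
# Toy sketch: a dnsresource is considered orphaned when it has no
# linked addresses at all, or every linked address is a DISCOVERED
# entry with a NULL ip (the pattern seen in the rows above).
DISCOVERED = 6  # the alloc_type value observed in the example rows

# staticipaddress id -> (ip, alloc_type); values are made up
ip_rows = {
    50843: (None, DISCOVERED),   # NULL DISCOVERED address, like record 2068's
    50900: ("10.0.0.5", 1),      # a healthy, bound address
}
# dnsresource id -> linked staticipaddress ids
links = {2068: [50843], 2100: [50900], 2200: []}

orphans = sorted(
    rid for rid, ids in links.items()
    if all(ip_rows[i][0] is None and ip_rows[i][1] == DISCOVERED for i in ids)
)
print(orphans)  # [2068, 2200]
```

Here 2068 is caught because its only address is a NULL DISCOVERED entry, and 2200 because it links to nothing; 2100 survives because it still points at a bound address.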

Revision history for this message
Jacopo Rota (r00ta) wrote :

As you can see from the screenshot, cleaning up these records greatly improves CPU usage.

Changed in maas:
status: Triaged → Fix Committed
Jacopo Rota (r00ta)
Changed in maas:
assignee: nobody → Jacopo Rota (r00ta)
Revision history for this message
Jacopo Rota (r00ta) wrote :

For the record, the cleanup workaround for MAAS <= 3.2 is:

Run
```
sudo snap run --shell maas.supervisor -c "maas-region shell"
```

and then execute

```
from maasserver.enum import INTERFACE_TYPE, IPADDRESS_TYPE
from maasserver.models import Interface

# Find UNKNOWN interfaces whose addresses are DISCOVERED entries with a
# NULL ip, i.e. the orphaned records left behind by deleted hosts.
interfaces = Interface.objects.filter(
    type=INTERFACE_TYPE.UNKNOWN,
    ip_addresses__ip__isnull=True,
    ip_addresses__alloc_type=IPADDRESS_TYPE.DISCOVERED,
)
len_interfaces = len(interfaces)
for index, interface in enumerate(interfaces):
    print(f"\rDeleting interface {index}/{len_interfaces}", end="")
    interface.delete()

print("")
```

It will take several minutes to hours, depending on the number of zombie resources you have.

Changed in maas:
milestone: 3.5.0 → 3.5.0-beta1
status: Fix Committed → Fix Released