MAAS has orphan IP addresses and DNS records that are slowing down the entire service

Bug #2049508 reported by Jacopo Rota
This bug affects 5 people
Affects  Status        Importance  Assigned to   Milestone
MAAS     Fix Released  High        Jacopo Rota
3.4      Fix Released  High        Jacopo Rota
3.5      Fix Released  High        Jacopo Rota

Bug Description

On the SOLQA env at X.X.164.2:5240, running the 3.4.0 snap, we see a lot of Postgres activity:

```
    PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
 654234 root 20 0 1187948 908072 26908 S 84.1 1.4 20:51.34 python3
 654236 root 20 0 1312776 952840 27144 S 55.5 1.4 22:23.74 python3
 654232 root 20 0 1311844 964052 26996 S 43.2 1.5 25:05.65 python3
 671936 postgres 20 0 239800 103008 90004 S 31.9 0.2 0:38.30 postgres
 672556 postgres 20 0 250784 118092 98500 R 31.6 0.2 0:34.93 postgres
 672900 postgres 20 0 235176 101228 92344 R 29.6 0.2 0:07.53 postgres
 672552 postgres 20 0 236608 79128 69008 S 28.9 0.1 0:21.86 postgres
 673032 postgres 20 0 239484 100504 87736 R 26.6 0.2 0:04.76 postgres
 672220 postgres 20 0 258404 103504 75952 R 15.6 0.2 0:04.42 postgres
 673275 postgres 20 0 235116 89260 81424 S 15.6 0.1 0:05.57 postgres
 672577 postgres 20 0 235636 82832 73604 S 14.3 0.1 0:45.81 postgres
```

After looking at the DB, most of the queries are DNS/static IP related, for example:

```
 SELECT "maasserver_staticipaddress"."id", "maasserver_staticipaddress"."created", "maasserver_staticipaddress"."updated", "maasserver_staticipaddress"."ip", "maasserver_staticipaddress"."alloc_type", "maasserver_staticipaddress"."subnet_id", "maasserver_staticipaddress"."user_id", "maasserver_staticipaddress"."lease_time", "maasserver_staticipaddress"."temp_expires_on" FROM "maasserver_staticipaddress" INNER JOIN "maasserver_dnsresource_ip_addresses" ON ("maasserver_staticipaddress"."id" = "maasserver_dnsresource_ip_addresses"."staticipaddress_id") WHERE "maasserver_dnsresource_ip_addresses"."dnsresource_id" = 2068
```

Looking at record 2068:

```
 2068 | 2023-10-23 19:21:19.843186+00 | 2023-10-23 19:21:19.843186+00 | network-poller-b2a8e551-1953-47e6-9026-d43056f11570 | 1 |
```
and the related IP address:
```
  id | created | updated | ip | alloc_type | subnet_id | user_id | lease_time | temp_expires_on
-------+------------------------+-------------------------------+----+------------+-----------+---------+------------+-----------------
 50843 | 2023-10-23 19:21:19+00 | 2023-10-23 19:31:19.913985+00 | | 6 | 2 | | 600 |

```

We spotted that this is coming from one of the MANY LXD containers that are spawned and deleted after a few minutes.

The number of such records is huge:

```
maasdb=# select subnet_id, count(*) from maasserver_staticipaddress group by subnet_id;
 subnet_id | count
-----------+-------
           | 121
         3 | 303
         5 | 246
         4 | 65
         2 | 13686
```

as well as the number of DNS records:

```
maasdb=# select domain_id, count(*) from maasserver_dnsresource group by domain_id;
 domain_id | count
-----------+-------
         8 | 43
         7 | 43
         1 | 11862
         5 | 43
         4 | 43
         6 | 43
         3 | 43
```

Most of these IP addresses are of `alloc_type=6` (DISCOVERED) and are really old:

```
   id | created | updated | ip | alloc_type | subnet_id | user_id | lease_time | temp_expires_on
--------+-------------------------------+-------------------------------+----------------+------------+-----------+---------+------------+-----------------
 169688 | 2023-12-01 20:57:44+00 | 2023-12-01 21:07:44.427741+00 | | 6 | 2 | | 600 |
  37082 | 2023-10-18 20:12:25+00 | 2023-10-18 20:22:25.756612+00 | | 6 | 2 | | 600 |
  30036 | 2023-10-17 17:14:27+00 | 2023-10-17 17:24:27.870446+00 | | 6 | 2 | | 600 |
  24933 | 2023-10-16 17:14:03+00 | 2023-10-16 17:24:03.284484+00 | | 6 | 2 | | 600 |
  27055 | 2023-10-17 03:28:39+00 | 2023-10-17 03:38:39.891357+00 | | 6 | 2 | | 600 |
  31218 | 2023-10-17 21:11:55+00 | 2023-10-17 21:21:55.570965+00 | | 6 | 2 | | 600 |
  22972 | 2023-10-16 11:08:21+00 | 2023-10-16 11:18:21.326206+00 | | 6 | 2 | | 600 |
```

meaning that we are failing to clean them up (these are probably created when a new host gets an IP from the MAAS DHCP server).
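The staleness condition itself is simple to state: a DISCOVERED address whose lease window has elapsed without a renewal should be eligible for cleanup. A minimal pure-Python sketch of that check (the helper name and signature are illustrative, not MAAS code):

```python
from datetime import datetime, timedelta, timezone

def is_stale_discovered(updated: datetime, lease_time_s: int, now: datetime) -> bool:
    """A DISCOVERED address is stale once its lease window has elapsed
    since the last renewal (the `updated` timestamp)."""
    return now > updated + timedelta(seconds=lease_time_s)

# Row 37082 above: last updated 2023-10-18, lease_time 600 s -> long expired.
updated = datetime(2023, 10, 18, 20, 22, 25, tzinfo=timezone.utc)
now = datetime(2023, 12, 2, tzinfo=timezone.utc)
print(is_stale_discovered(updated, 600, now))  # True
```

By this criterion every DISCOVERED row in the table above, months past its 600-second lease, should have been garbage-collected long ago.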

Related branches

Jacopo Rota (r00ta)
description: updated
Revision history for this message
Marian Gasparovic (marosg) wrote :

As discussed with Jacopo, I did a test on a freshly deployed MAAS 3.4. I had a loop which created an LXC container that got a DHCP address from MAAS, then immediately deleted the container.

After doing it around 100 times I could see the numbers in the above tables grow, and they never go down, even after the leases expire.
I started another round with more containers.

```
maasdb=# select domain_id, count(*) from maasserver_dnsresource group by domain_id;
 domain_id | count
-----------+-------
         1 | 2067
(1 row)

maasdb=# select subnet_id, count(*) from maasserver_staticipaddress group by subnet_id;
 subnet_id | count
-----------+-------
         2 | 17
         6 | 6
           | 6
         1 | 2233
         3 | 8
         5 | 6
(6 rows)
```

This is after all containers are gone and the leases have expired.

The MAAS UI says "5 hosts; 32 resource records", which is correct.

Revision history for this message
Jacopo Rota (r00ta) wrote :

I've loaded the SOLQA database into my homelab and collected some advanced Postgres metrics. I was able to see the large number of DNS/IP queries, as per the screenshot.

Revision history for this message
Jacopo Rota (r00ta) wrote :

This seems to be related to https://bugs.launchpad.net/maas/+bug/2025468 .

In this bug report we have to:
1) ensure that the bug is mitigated
2) provide a query to clean up the orphan IPs and DNS entries caused by https://bugs.launchpad.net/maas/+bug/2025468

In order to provide step 2, I'd do the following:
a) take a dump of the SOLQA environment
b) load it into a local environment to work on the cleanup query
c) once the query is finalized and the system is properly running, execute the query on the SOLQA env
d) if something breaks, roll back the env using the original dump
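For step b), one plausible orphan criterion the cleanup query could encode can be sketched in pure Python (the data, ids, and criterion below are illustrative assumptions, not MAAS code or the real schema):

```python
# Toy sketch: a dnsresource is considered orphaned when it has no
# linked addresses at all, or every linked address is a DISCOVERED
# entry with a NULL ip (the pattern seen in the rows above).
DISCOVERED = 6  # the alloc_type value observed in the example rows

# staticipaddress id -> (ip, alloc_type); values are made up
ip_rows = {
    50843: (None, DISCOVERED),   # NULL DISCOVERED address, like record 2068's
    50900: ("10.0.0.5", 1),      # a healthy, bound address
}
# dnsresource id -> linked staticipaddress ids
links = {2068: [50843], 2100: [50900], 2200: []}

orphans = sorted(
    rid for rid, ids in links.items()
    if all(ip_rows[i][0] is None and ip_rows[i][1] == DISCOVERED for i in ids)
)
print(orphans)  # [2068, 2200]
```

Here 2068 is caught because its only address is a NULL DISCOVERED entry, and 2200 because it links to nothing; 2100 survives because it still points at a bound address.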

Revision history for this message
Jacopo Rota (r00ta) wrote :

As you can see from the screenshot, cleaning up these records greatly improves CPU usage.

Changed in maas:
status: Triaged → Fix Committed
Jacopo Rota (r00ta)
Changed in maas:
assignee: nobody → Jacopo Rota (r00ta)
Revision history for this message
Jacopo Rota (r00ta) wrote :

For the record, the cleanup workaround for MAAS <= 3.2 is:

Run
```
sudo snap run --shell maas.supervisor -c "maas-region shell"
```

and then execute

```
from maasserver.enum import INTERFACE_TYPE, IPADDRESS_TYPE
from maasserver.models import Interface

# Find UNKNOWN interfaces whose addresses are DISCOVERED entries with a
# NULL ip, i.e. the orphaned records left behind by deleted hosts.
interfaces = Interface.objects.filter(
    type=INTERFACE_TYPE.UNKNOWN,
    ip_addresses__ip__isnull=True,
    ip_addresses__alloc_type=IPADDRESS_TYPE.DISCOVERED,
)
len_interfaces = len(interfaces)
for index, interface in enumerate(interfaces):
    print(f"\rDeleting interface {index}/{len_interfaces}", end="")
    interface.delete()

print("")
```

It will take several minutes to hours, depending on the number of zombie resources you have.

Changed in maas:
milestone: 3.5.0 → 3.5.0-beta1
status: Fix Committed → Fix Released