MAAS 3.4.0 - python errors in regiond and rackd, controllers losing RPC connection
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
MAAS | Triaged | Medium | Unassigned |
Bug Description
Hello,
We have a setup with six MAAS systems in HA mode: three region+rack servers and three rack controllers.
The whole environment was upgraded from 3.0.1 to 3.2, and then from 3.2 to 3.4/stable, all via snap packages.
Current version: 3.4.0-14321-
Environment setup:
- 3 bare-metal servers, running regiond+rackd and libvirt with 2 KVM VMs on top (see below)
- 3 rack controllers, running as virtual machines on VMware
- all servers run Ubuntu 22.04 (Jammy) with the latest updates installed
Since the update, we can see controllers degrading in the MAAS UI and then coming back to normal (green state).
Analyzing the logs, we see the following errors on EVERY controller:
- in rackd.log on all 6 servers:
2024-02-29 12:26:34 provisioningser
2024-02-29 12:26:34 twisted.
2024-02-29 12:26:34 twisted.
Traceback (most recent call last):
Failure: twisted.
2024-02-29 12:26:34 twisted.
2024-02-29 12:26:34 twisted.
Traceback (most recent call last):
Failure: twisted.
- in regiond.log on the 3 servers that run regiond+rackd:
2024-02-29 12:27:10 maasserver.ipc: [info] Worker pid:481646 lost RPC connection to ('n6dwhp', '10.x.x.4', 5250).
2024-02-29 12:27:10 maasserver.ipc: [info] Worker pid:481646 lost RPC connection to ('n6dwhp', '10.x.x.4', 5250).
2024-02-29 12:27:10 maasserver.ipc: [info] Worker pid:481646 lost RPC connection to ('n6dwhp', '10.x.x.4', 5250).
2024-02-29 12:27:10 regiond: [info] 127.0.0.1 POST /MAAS/metadata/
2024-02-29 12:27:11 regiond: [info] 127.0.0.1 GET /MAAS/rpc/ HTTP/1.1 --> 200 OK (referrer: -; agent: provisioningser
2024-02-29 12:27:12 maasserver: [error] #######
2024-02-29 12:27:12 maasserver: [error] Traceback (most recent call last):
File "/snap/
response = wrapped_
File "/snap/
return view_atomic(*args, **kwargs)
File "/usr/lib/
return func(*args, **kwds)
File "/snap/
response = super()
File "/snap/
response = func(*args, **kwargs)
File "/snap/
result = self.error_
File "/snap/
result = meth(request, *args, **kwargs)
File "/snap/
return function(self, request, *args, **kwargs)
File "/snap/
target_status = process(node, request, status)
File "/snap/
self.
File "/snap/
script_
File "/snap/
self.status in SCRIPT_
AssertionError: Status for scriptresult 23445 is not running or pending (2)
2024-02-29 12:27:12 regiond: [info] 127.0.0.1 POST /MAAS/metadata/
2024-02-29 12:27:15 maasserver.ipc: [info] Worker pid:481647 registered RPC connection to ('w3dn6r', '10.x.x.4', 5251).
2024-02-29 12:27:15 maasserver.ipc: [info] Worker pid:481647 registered RPC connection to ('w3dn6r', '10.x.x.4', 5251).
2024-02-29 12:27:16 maasserver.ipc: [info] Worker pid:481647 registered RPC connection to ('w3dn6r', '10.x.x.4', 5251).
2024-02-29 12:27:25 maasserver.ipc: [info] Worker pid:481646 lost RPC connection to ('w3dn6r', '10.x.x.4', 5250).
2024-02-29 12:27:25 maasserver.ipc: [info] Worker pid:481646 lost RPC connection to ('w3dn6r', '10.x.x.4', 5250).
2024-02-29 12:27:25 maasserver.ipc: [info] Worker pid:481646 lost RPC connection to ('w3dn6r', '10.x.x.4', 5250).
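The AssertionError at the bottom of the traceback above looks like a status guard: a script result may only be updated while it is still running or pending, and the incoming update found it in some other state ("2" in the message). A minimal sketch of that kind of check, with illustrative names that are assumptions, not MAAS's actual identifiers:

```python
# Hypothetical sketch of a script-result status guard (names are illustrative,
# not MAAS's actual code). The enum values below are assumed for the example.
from enum import IntEnum

class ScriptStatus(IntEnum):
    PENDING = 0
    RUNNING = 1
    PASSED = 2   # already finished; must not accept further result updates

# Only results still in flight may be written to.
ALLOWED = {ScriptStatus.PENDING, ScriptStatus.RUNNING}

def store_result(script_id: int, status: ScriptStatus) -> None:
    # Mirrors the shape of the assertion message seen in the log excerpt.
    assert status in ALLOWED, (
        f"Status for scriptresult {script_id} is not running or pending "
        f"({status.value})"
    )
    # ... persist the incoming result here ...
```

If that reading is right, the error would mean a rack controller is posting results for a script the region already considers finished, which would fit the RPC connections flapping in between.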
We hadn't seen this behavior on 3.2, but all the systems are now on 3.4.0 and in production, and this issue sometimes prevents our Juju controller from deploying machines from MAAS.
Once in a while, when all 3 region servers lose their RPC connections to the rack controllers at the same time, we get an error from Juju.
If we restart MAAS via `snap restart maas`, the logs stay clean for 30 seconds to 1 minute, then the errors above start to show again.
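To gauge how often the flaps recur after a restart, the "lost RPC connection" lines can simply be counted. A minimal sketch, using an inline sample so it is self-contained; the real log path in the comment assumes the default snap layout:

```shell
# Count RPC-flap events. In practice, point grep at the real log instead of
# the printf sample, e.g.:
#   grep -c 'lost RPC connection' /var/snap/maas/common/log/regiond.log
# (that path is an assumption based on the default snap layout).
printf '%s\n' \
  '2024-02-29 12:27:10 maasserver.ipc: [info] Worker pid:481646 lost RPC connection' \
  '2024-02-29 12:27:15 maasserver.ipc: [info] Worker pid:481647 registered RPC connection' \
  '2024-02-29 12:27:25 maasserver.ipc: [info] Worker pid:481646 lost RPC connection' \
  | grep -c 'lost RPC connection'
# prints 2
```

Running this in a loop (or with `watch`) after `snap restart maas` would show whether the 30-60 second quiet window is consistent.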
On the 3 regiond+rackd servers we also run libvirt with 2 VMs on each MAAS server; these VMs act as a PostgreSQL 14 cluster.
With this setup, the system ran fine and stable on 3.0.1 and 3.2.10.
Long shot, but I gotta ask: are *all* the controllers up to 3.4.0-14321-g.1027c7664, no holdouts?