MAAS 3.4.0 - Python errors in regiond and rackd, controllers losing RPC connection

Bug #2055413 reported by Radu Malica
This bug affects 5 people
Affects: MAAS
Status: Triaged
Importance: Medium
Assigned to: Unassigned
Milestone: 3.5.x

Bug Description

Hello,

We have a setup with 6 MAAS systems, 3 region+rack servers and 3 rack controllers, in HA mode.
The whole environment has been updated from 3.0.1 to 3.2, and then from 3.2 to 3.4/stable, all via snap packages.

Current version is: 3.4.0-14321-g.1027c7664

Environment setup:
- 3 bare-metal servers, running regiond+rackd and libvirt with 2 KVM VMs on top (see below)
- 3 rack controllers, running as virtual machines on VMware
- all servers run Ubuntu 22.04 (Jammy) with the latest updates installed

Since the update, we can see controllers degrading in the MAAS UI and then coming back to normal (green state).

Analyzing the logs, we see the following errors on EVERY controller:

- in rackd.log on all 6 servers:

 2024-02-29 12:26:34 provisioningserver.rpc.clusterservice: [info] Rack controller 'hbf4b8' registered (via maas01:pid=1984695) with MAAS version 3.4.0-14321-g.1027c7664.
2024-02-29 12:26:34 twisted.internet.defer: [critical] Unhandled error in Deferred:
2024-02-29 12:26:34 twisted.internet.defer: [critical]
        Traceback (most recent call last):
        Failure: twisted.internet.error.MulticastJoinError: (b'\xe0\x00\x00v', b'\n{\x00\x04', 98, 'Address already in use')

2024-02-29 12:26:34 twisted.internet.defer: [critical] Unhandled error in Deferred:
2024-02-29 12:26:34 twisted.internet.defer: [critical]
        Traceback (most recent call last):
        Failure: twisted.internet.error.MulticastJoinError: (b'\xe0\x00\x00v', b'\xc0\xa8\xde\x01', 98, 'Address already in use')
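For reference, the tuple inside the MulticastJoinError looks like two packed IPv4 addresses (the multicast group and the local interface), followed by an errno and its message. A small throwaway snippet, not part of MAAS and using names of my own, just to decode the values from the excerpt above:

import errno
import socket

# Decode the packed addresses and errno from the MulticastJoinError tuples above.
def explain(group, interface, err):
    print(
        "join of group %s on interface %s failed with errno %d (%s)"
        % (socket.inet_ntoa(group), socket.inet_ntoa(interface),
           err, errno.errorcode.get(err, "?"))
    )

explain(b'\xe0\x00\x00v', b'\n{\x00\x04', 98)
explain(b'\xe0\x00\x00v', b'\xc0\xa8\xde\x01', 98)
# join of group 224.0.0.118 on interface 10.123.0.4 failed with errno 98 (EADDRINUSE)
# join of group 224.0.0.118 on interface 192.168.222.1 failed with errno 98 (EADDRINUSE)

So both failures are EADDRINUSE while joining the same multicast group on two different local interfaces.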

- in regiond.log on the 3 servers that run regiond+rackd:

2024-02-29 12:27:10 maasserver.ipc: [info] Worker pid:481646 lost RPC connection to ('n6dwhp', '10.x.x.4', 5250).
2024-02-29 12:27:10 maasserver.ipc: [info] Worker pid:481646 lost RPC connection to ('n6dwhp', '10.x.x.4', 5250).
2024-02-29 12:27:10 maasserver.ipc: [info] Worker pid:481646 lost RPC connection to ('n6dwhp', '10.x.x.4', 5250).
2024-02-29 12:27:10 regiond: [info] 127.0.0.1 POST /MAAS/metadata/2012-03-01/ HTTP/1.1 --> 200 OK (referrer: -; agent: Python-urllib/3.10)
2024-02-29 12:27:11 regiond: [info] 127.0.0.1 GET /MAAS/rpc/ HTTP/1.1 --> 200 OK (referrer: -; agent: provisioningserver.rpc.clusterservice.ClusterClientService)
2024-02-29 12:27:12 maasserver: [error] ################################ Exception: Status for scriptresult 23445 is not running or pending (2) ################################
2024-02-29 12:27:12 maasserver: [error] Traceback (most recent call last):
  File "/snap/maas/32469/usr/lib/python3/dist-packages/django/core/handlers/base.py", line 181, in _get_response
    response = wrapped_callback(request, *callback_args, **callback_kwargs)
  File "/snap/maas/32469/lib/python3.10/site-packages/maasserver/utils/views.py", line 298, in view_atomic_with_post_commit_savepoint
    return view_atomic(*args, **kwargs)
  File "/usr/lib/python3.10/contextlib.py", line 79, in inner
    return func(*args, **kwds)
  File "/snap/maas/32469/lib/python3.10/site-packages/maasserver/api/support.py", line 62, in __call__
    response = super().__call__(request, *args, **kwargs)
  File "/snap/maas/32469/usr/lib/python3/dist-packages/django/views/decorators/vary.py", line 20, in inner_func
    response = func(*args, **kwargs)
  File "/snap/maas/32469/usr/lib/python3.10/dist-packages/piston3/resource.py", line 197, in __call__
    result = self.error_handler(e, request, meth, em_format)
  File "/snap/maas/32469/usr/lib/python3.10/dist-packages/piston3/resource.py", line 195, in __call__
    result = meth(request, *args, **kwargs)
  File "/snap/maas/32469/lib/python3.10/site-packages/maasserver/api/support.py", line 371, in dispatch
    return function(self, request, *args, **kwargs)
  File "/snap/maas/32469/lib/python3.10/site-packages/metadataserver/api.py", line 858, in signal
    target_status = process(node, request, status)
  File "/snap/maas/32469/lib/python3.10/site-packages/metadataserver/api.py", line 680, in _process_commissioning
    self._store_results(
  File "/snap/maas/32469/lib/python3.10/site-packages/metadataserver/api.py", line 563, in _store_results
    script_result.store_result(
  File "/snap/maas/32469/lib/python3.10/site-packages/maasserver/models/scriptresult.py", line 270, in store_result
    self.status in SCRIPT_STATUS_RUNNING_OR_PENDING
AssertionError: Status for scriptresult 23445 is not running or pending (2)

2024-02-29 12:27:12 regiond: [info] 127.0.0.1 POST /MAAS/metadata/2012-03-01/ HTTP/1.1 --> 500 INTERNAL_SERVER_ERROR (referrer: -; agent: Python-urllib/3.10)
2024-02-29 12:27:15 maasserver.ipc: [info] Worker pid:481647 registered RPC connection to ('w3dn6r', '10.x.x.4', 5251).
2024-02-29 12:27:15 maasserver.ipc: [info] Worker pid:481647 registered RPC connection to ('w3dn6r', '10.x.x.4', 5251).
2024-02-29 12:27:16 maasserver.ipc: [info] Worker pid:481647 registered RPC connection to ('w3dn6r', '10.x.x.4', 5251).
2024-02-29 12:27:25 maasserver.ipc: [info] Worker pid:481646 lost RPC connection to ('w3dn6r', '10.x.x.4', 5250).
2024-02-29 12:27:25 maasserver.ipc: [info] Worker pid:481646 lost RPC connection to ('w3dn6r', '10.x.x.4', 5250).
2024-02-29 12:27:25 maasserver.ipc: [info] Worker pid:481646 lost RPC connection to ('w3dn6r', '10.x.x.4', 5250).
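
For what it's worth, the AssertionError comes from the status check in scriptresult.store_result shown in the last traceback frame: the region refuses to store output for a script result that is no longer running or pending, so a late or duplicate commissioning signal turns into the 500 response seen right after. A rough, simplified sketch of that check; the constants below are placeholders I made up, not the real MAAS definitions:

# Simplified illustration only; the SCRIPT_STATUS values are assumed placeholders.
SCRIPT_STATUS_RUNNING_OR_PENDING = frozenset({0, 1})  # assumed: 0 = pending, 1 = running

def store_result(scriptresult_id, status, output):
    # A signal arriving for a script result that has already finished fails here,
    # which regiond then reports as 500 INTERNAL_SERVER_ERROR on the metadata endpoint.
    assert status in SCRIPT_STATUS_RUNNING_OR_PENDING, (
        "Status for scriptresult %s is not running or pending (%s)"
        % (scriptresult_id, status)
    )
    # ... persist the output here ...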

We haven't seen this behavior on 3.2, but all the systems are now on 3.4.0 and in production, and this issue sometimes prevents our Juju controller from deploying machines from MAAS.

Once in a while, when all 3 region servers lose the RPC connection to the rack controllers at the same time, we get an error from Juju.

If we restart MAAS via 'snap restart maas', the logs are clean for 30 seconds to 1 minute, then the errors above start to appear.

On the 3 regiond+rackd servers we also run libvirt with 2 VMs each, which act as a PostgreSQL 14 cluster.

With this setup, the system ran fine and stable on 3.0.1 and 3.2.10.

Revision history for this message
Bill Wear (billwear) wrote :

Long shot, but I gotta ask: are *all* the controllers up to 3.4.0-14321-g.1027c7664, no holdouts?

Changed in maas:
status: New → Incomplete
Revision history for this message
Bill Wear (billwear) wrote :

Oops, never mind; just found a related bug. Triaging with a note to see if the solution for https://bugs.launchpad.net/maas/+bug/2020798 fixes this.

Changed in maas:
status: Incomplete → Triaged
importance: Undecided → Medium
milestone: none → 3.5.x
Revision history for this message
Radu Malica (radumalica) wrote :

Hello, to answer your question: yes, all the controllers across the installation, regiond and rackd, are the exact same version of 3.4.0, no exceptions.
