Rod Smith (rodsmith) wrote :

This has been happening to us recently with MAAS 3.3.4. It's a sporadic failure. I'm seeing the following in regiond.log:

2023-09-08 19:12:21 twisted.internet.protocol.Factory: [info] RegionServer connection established (HOST:IPv6Address(type='TCP', host='::ffff:', port=5252, flowInfo=0, scopeID=0) PEER:IPv6Address(type='TCP', host='::ffff:', port=58048, flowInfo=0, scopeID=0))
2023-09-08 19:12:21 maasserver.rpc.regionservice: [info] Rack controller authenticated from '::ffff:'.
2023-09-08 19:12:21 maasserver.rpc.regionservice: [info] Rack controller authenticated from '::ffff:'.
2023-09-08 19:12:23 maasserver.ipc: [info] Worker pid:29451 registered RPC connection to ('rsfc68', '', 5252).
2023-09-08 19:12:25 maasserver.ipc: [info] Worker pid:29451 registered RPC connection to ('rsfc68', '', 5252).
2023-09-08 19:12:25 maasserver.dhcp: [info] Successfully configured DHCPv4 on rack controller 'weavile (rsfc68)'.
2023-09-08 19:12:26 maasserver.dhcp: [info] Successfully configured DHCPv6 on rack controller 'weavile (rsfc68)'.
2023-09-08 19:12:39 regiond: [info] GET /MAAS/rpc/ HTTP/1.1 --> 200 OK (referrer: -; agent: provisioningserver.rpc.clusterservice.ClusterClientService
2023-09-08 19:13:02 maasserver.models.node: [info] hoggus: Turning on netboot for node
2023-09-08 19:13:02 maasserver.models.node: [info] hoggus: Turning ephemeral deploy off for node
2023-09-08 19:15:21 maasserver: [error] Error while calling ScanNetworks: Unable to get RPC connection for rack controller 'weavile' (rsfc68).
2023-09-08 19:15:21 maasserver.regiondservices.active_discovery: [info] Active network discovery: Unable to initiate network scanning on any rack controller. Verify that the rack controllers are started and have connected to the region.
2023-09-08 19:15:42 maasserver.models.signals.power: [critical] Failed to update power state of machine after state transition.
        Traceback (most recent call last):
          File "/snap/maas/28521/usr/lib/python3/dist-packages/twisted/internet/", line 857, in _runCallbacks
            current.result = callback( # type: ignore[misc]
          File "/snap/maas/28521/lib/python3.10/site-packages/maasserver/models/", line 6052, in cb_power_control
            d = getClientFromIdentifiers(client_idents)
          File "/snap/maas/28521/lib/python3.10/site-packages/provisioningserver/utils/", line 128, in wrapper
            return func(*args, **kwargs)
          File "/snap/maas/28521/lib/python3.10/site-packages/provisioningserver/utils/", line 60, in wrapper
            return maybeDeferred(func, *args, **kwargs)
        --- <exception caught here> ---
          File "/snap/maas/28521/usr/lib/python3/dist-packages/twisted/internet/", line 857, in _runCallbacks
            current.result = callback( # type: ignore[misc]
          File "/snap/maas/28521/lib/python3.10/site-packages/maasserver/models/signals/", line 46, in eb_error
            failure.trap(Node.DoesNotExist, UnknownPowerType, PowerProblem)
          File "/snap/maas/28521/usr/lib/python3/dist-packages/twisted/python/", line 451, in trap
          File "/snap/maas/28521/usr/lib/python3/dist-packages/twisted/python/", line 475, in raiseException
            raise self.value.with_traceback(self.tb)
          File "/snap/maas/28521/usr/lib/python3/dist-packages/twisted/internet/", line 190, in maybeDeferred
            result = f(*args, **kwargs)
          File "/snap/maas/28521/lib/python3.10/site-packages/maasserver/rpc/", line 25, in getClientFromIdentifiers
            raise exceptions.NoConnectionsAvailable(
        provisioningserver.rpc.exceptions.NoConnectionsAvailable: Unable to connect to any rack controller rsfc68; no connections available.

2023-09-08 19:17:01 RegionServer,5851,::ffff: [info] RegionServer connection lost (HOST:IPv6Address(type='TCP', host='::ffff:', port=5253, flowInfo=0, scopeID=0) PEER:IPv6Address(type='TCP', host='::ffff:', port=44146, flowInfo=0, scopeID=0))
2023-09-08 19:17:01 RegionServer,5852,::ffff: [info] RegionServer connection lost (HOST:IPv6Address(type='TCP', host='::ffff:', port=5253, flowInfo=0, scopeID=0) PEER:IPv6Address(type='TCP', host='::ffff:', port=44152, flowInfo=0, scopeID=0))
2023-09-08 19:17:01 maasserver.ipc: [info] Worker pid:29452 lost RPC connection to ('rsfc68', '', 5253).
2023-09-08 19:17:01 maasserver.ipc: [info] Worker pid:29452 lost RPC connection to ('rsfc68', '', 5253).
2023-09-08 19:17:21 twisted.internet.protocol.Factory: [info] RegionServer connection established (HOST:IPv6Address(type='TCP', host='::ffff:', port=5253, flowInfo=0, scopeID=0) PEER:IPv6Address(type='TCP', host='::ffff:', port=53994, flowInfo=0, scopeID=0))
2023-09-08 19:17:21 twisted.internet.protocol.Factory: [info] RegionServer connection established (HOST:IPv6Address(type='TCP', host='::ffff:', port=5253, flowInfo=0, scopeID=0) PEER:IPv6Address(type='TCP', host='::ffff:', port=54008, flowInfo=0, scopeID=0))

Based on the log entries around the Python traceback, it looks to me like the rack controller is OK, drops its RPC connection, and then reconnects; in the brief window when it's unresponsive, the region controller's RPC calls fail (ScanNetworks and the power-state update above). I'm also seeing a lot of Python tracebacks in rackd.log.
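For anyone who wants to check whether their own regiond.log shows the same pattern, here is a minimal sketch of how the relevant events could be pulled into a timeline. It is not part of MAAS; the function name `rpc_timeline` and the marker strings are my own choices, taken from the messages quoted above, and it assumes each log entry sits on a single unwrapped line that starts with a `YYYY-MM-DD HH:MM:SS` timestamp.

```python
import re
from datetime import datetime

# A log entry is assumed to start with a timestamp like "2023-09-08 19:12:21".
LINE_RE = re.compile(r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) (.*)$")

# Hypothetical classification: substrings copied from the log excerpt above.
MARKERS = [
    ("registered", "registered RPC connection"),
    ("lost", "lost RPC connection"),
    ("rpc_error", "Unable to get RPC connection"),
    ("rpc_error", "Failed to update power state"),
]


def rpc_timeline(lines):
    """Return (timestamp, kind) tuples for RPC-related entries, in file order."""
    events = []
    for line in lines:
        match = LINE_RE.match(line.rstrip("\n"))
        if not match:
            # Traceback and continuation lines carry no timestamp; skip them.
            continue
        timestamp = datetime.strptime(match.group(1), "%Y-%m-%d %H:%M:%S")
        message = match.group(2)
        for kind, marker in MARKERS:
            if marker in message:
                events.append((timestamp, kind))
                break
    return events
```

To use it, open the log and iterate over the result, for example `with open("regiond.log") as f: print(rpc_timeline(f))`. Comparing the times of the `rpc_error` entries with the nearest `lost` and `registered` entries shows whether the errors fall inside a reconnect window.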

I'll try to attach rackd.log and regiond.log to another comment; LP is flaking out when I try to attach them to this one.