Comment 5 for bug 1910783

Revision history for this message
Rod Smith (rodsmith) wrote :

This has been happening to us recently with MAAS 3.3.4. It's a sporadic failure. I'm seeing the following in regiond.log:

2023-09-08 19:12:21 twisted.internet.protocol.Factory: [info] RegionServer conne
ction established (HOST:IPv6Address(type='TCP', host='::ffff:10.1.16.3', port=52
52, flowInfo=0, scopeID=0) PEER:IPv6Address(type='TCP', host='::ffff:10.1.16.3',
 port=58048, flowInfo=0, scopeID=0))
2023-09-08 19:12:21 maasserver.rpc.regionservice: [info] Rack controller authent
icated from '::ffff:10.1.16.3:58044'.
2023-09-08 19:12:21 maasserver.rpc.regionservice: [info] Rack controller authent
icated from '::ffff:10.1.16.3:58048'.
2023-09-08 19:12:23 maasserver.ipc: [info] Worker pid:29451 registered RPC conne
ction to ('rsfc68', '10.1.16.3', 5252).
2023-09-08 19:12:25 maasserver.ipc: [info] Worker pid:29451 registered RPC conne
ction to ('rsfc68', '10.1.16.3', 5252).
2023-09-08 19:12:25 maasserver.dhcp: [info] Successfully configured DHCPv4 on ra
ck controller 'weavile (rsfc68)'.
2023-09-08 19:12:26 maasserver.dhcp: [info] Successfully configured DHCPv6 on ra
ck controller 'weavile (rsfc68)'.
2023-09-08 19:12:39 regiond: [info] 127.0.0.1 GET /MAAS/rpc/ HTTP/1.1 --> 200 OK
 (referrer: -; agent: provisioningserver.rpc.clusterservice.ClusterClientService
)
2023-09-08 19:13:02 maasserver.models.node: [info] hoggus: Turning on netboot fo
r node
2023-09-08 19:13:02 maasserver.models.node: [info] hoggus: Turning ephemeral dep
loy off for node
2023-09-08 19:15:21 maasserver: [error] Error while calling ScanNetworks: Unable
 to get RPC connection for rack controller 'weavile' (rsfc68).
2023-09-08 19:15:21 maasserver.regiondservices.active_discovery: [info] Active n
etwork discovery: Unable to initiate network scanning on any rack controller. Ve
rify that the rack controllers are started and have connected to the region.
2023-09-08 19:15:42 maasserver.models.signals.power: [critical] Failed to update
 power state of machine after state transition.
        Traceback (most recent call last):
          File "/snap/maas/28521/usr/lib/python3/dist-packages/twisted/internet/defer.py", line 857, in _runCallbacks
            current.result = callback( # type: ignore[misc]
          File "/snap/maas/28521/lib/python3.10/site-packages/maasserver/models/node.py", line 6052, in cb_power_control
            d = getClientFromIdentifiers(client_idents)
          File "/snap/maas/28521/lib/python3.10/site-packages/provisioningserver/utils/twisted.py", line 128, in wrapper
            return func(*args, **kwargs)
          File "/snap/maas/28521/lib/python3.10/site-packages/provisioningserver/utils/twisted.py", line 60, in wrapper
            return maybeDeferred(func, *args, **kwargs)
        --- <exception caught here> ---
          File "/snap/maas/28521/usr/lib/python3/dist-packages/twisted/internet/defer.py", line 857, in _runCallbacks
            current.result = callback( # type: ignore[misc]
          File "/snap/maas/28521/lib/python3.10/site-packages/maasserver/models/signals/power.py", line 46, in eb_error
            failure.trap(Node.DoesNotExist, UnknownPowerType, PowerProblem)
          File "/snap/maas/28521/usr/lib/python3/dist-packages/twisted/python/failure.py", line 451, in trap
            self.raiseException()
          File "/snap/maas/28521/usr/lib/python3/dist-packages/twisted/python/failure.py", line 475, in raiseException
            raise self.value.with_traceback(self.tb)
          File "/snap/maas/28521/usr/lib/python3/dist-packages/twisted/internet/defer.py", line 190, in maybeDeferred
            result = f(*args, **kwargs)
          File "/snap/maas/28521/lib/python3.10/site-packages/maasserver/rpc/__init__.py", line 25, in getClientFromIdentifiers
            raise exceptions.NoConnectionsAvailable(
        provisioningserver.rpc.exceptions.NoConnectionsAvailable: Unable to connect to any rack controller rsfc68; no connections available.

2023-09-08 19:17:01 RegionServer,5851,::ffff:10.1.20.3: [info] RegionServer connection lost (HOST:IPv6Address(type='TCP', host='::ffff:10.1.20.3', port=5253, flowInfo=0, scopeID=0) PEER:IPv6Address(type='TCP', host='::ffff:10.1.20.3', port=44146, flowInfo=0, scopeID=0))
2023-09-08 19:17:01 RegionServer,5852,::ffff:10.1.20.3: [info] RegionServer connection lost (HOST:IPv6Address(type='TCP', host='::ffff:10.1.20.3', port=5253, flowInfo=0, scopeID=0) PEER:IPv6Address(type='TCP', host='::ffff:10.1.20.3', port=44152, flowInfo=0, scopeID=0))
2023-09-08 19:17:01 maasserver.ipc: [info] Worker pid:29452 lost RPC connection to ('rsfc68', '10.1.20.3', 5253).
2023-09-08 19:17:01 maasserver.ipc: [info] Worker pid:29452 lost RPC connection to ('rsfc68', '10.1.20.3', 5253).
2023-09-08 19:17:21 twisted.internet.protocol.Factory: [info] RegionServer connection established (HOST:IPv6Address(type='TCP', host='::ffff:10.1.20.3', port=5253, flowInfo=0, scopeID=0) PEER:IPv6Address(type='TCP', host='::ffff:10.1.20.3', port=53994, flowInfo=0, scopeID=0))
2023-09-08 19:17:21 twisted.internet.protocol.Factory: [info] RegionServer connection established (HOST:IPv6Address(type='TCP', host='::ffff:10.1.20.3', port=5253, flowInfo=0, scopeID=0) PEER:IPv6Address(type='TCP', host='::ffff:10.1.20.3', port=54008, flowInfo=0, scopeID=0))

Based on the context around the Python traceback, it looks to me like the rack controller is OK, goes away, and then comes back; but in the brief time that it's non-responsive, the region controller fails. I'm also seeing a lot of Python tracebacks in rackd.log.

I'll try to attach rackd.log and regiond.log to another comment; LP is flaking out when I try to attach it to this one.