This has been happening to us recently with MAAS 3.3.4. It's a sporadic failure. I'm seeing the following in regiond.log:
2023-09-08 19:12:21 twisted.internet.protocol.Factory: [info] RegionServer conne
ction established (HOST:IPv6Address(type='TCP', host='::ffff:10.1.16.3', port=52
52, flowInfo=0, scopeID=0) PEER:IPv6Address(type='TCP', host='::ffff:10.1.16.3',
port=58048, flowInfo=0, scopeID=0))
2023-09-08 19:12:21 maasserver.rpc.regionservice: [info] Rack controller authent
icated from '::ffff:10.1.16.3:58044'.
2023-09-08 19:12:21 maasserver.rpc.regionservice: [info] Rack controller authent
icated from '::ffff:10.1.16.3:58048'.
2023-09-08 19:12:23 maasserver.ipc: [info] Worker pid:29451 registered RPC conne
ction to ('rsfc68', '10.1.16.3', 5252).
2023-09-08 19:12:25 maasserver.ipc: [info] Worker pid:29451 registered RPC conne
ction to ('rsfc68', '10.1.16.3', 5252).
2023-09-08 19:12:25 maasserver.dhcp: [info] Successfully configured DHCPv4 on ra
ck controller 'weavile (rsfc68)'.
2023-09-08 19:12:26 maasserver.dhcp: [info] Successfully configured DHCPv6 on ra
ck controller 'weavile (rsfc68)'.
2023-09-08 19:12:39 regiond: [info] 127.0.0.1 GET /MAAS/rpc/ HTTP/1.1 --> 200 OK
(referrer: -; agent: provisioningserver.rpc.clusterservice.ClusterClientService
)
2023-09-08 19:13:02 maasserver.models.node: [info] hoggus: Turning on netboot fo
r node
2023-09-08 19:13:02 maasserver.models.node: [info] hoggus: Turning ephemeral dep
loy off for node
2023-09-08 19:15:21 maasserver: [error] Error while calling ScanNetworks: Unable
to get RPC connection for rack controller 'weavile' (rsfc68).
2023-09-08 19:15:21 maasserver.regiondservices.active_discovery: [info] Active n
etwork discovery: Unable to initiate network scanning on any rack controller. Ve
rify that the rack controllers are started and have connected to the region.
2023-09-08 19:15:42 maasserver.models.signals.power: [critical] Failed to update
power state of machine after state transition.
Traceback (most recent call last):
File "/snap/maas/28521/usr/lib/python3/dist-packages/twisted/internet/defer.py", line 857, in _runCallbacks current.result = callback( # type: ignore[misc]
File "/snap/maas/28521/lib/python3.10/site-packages/maasserver/models/node.py", line 6052, in cb_power_control
d = getClientFromIdentifiers(client_idents)
File "/snap/maas/28521/lib/python3.10/site-packages/provisioningserver/utils/twisted.py", line 128, in wrapper
return func(*args, **kwargs)
File "/snap/maas/28521/lib/python3.10/site-packages/provisioningserver/utils/twisted.py", line 60, in wrapper
return maybeDeferred(func, *args, **kwargs)
--- <exception caught here> ---
File "/snap/maas/28521/usr/lib/python3/dist-packages/twisted/internet/defer.py", line 857, in _runCallbacks current.result = callback( # type: ignore[misc]
File "/snap/maas/28521/lib/python3.10/site-packages/maasserver/models/signals/power.py", line 46, in eb_error failure.trap(Node.DoesNotExist, UnknownPowerType, PowerProblem)
File "/snap/maas/28521/usr/lib/python3/dist-packages/twisted/python/failure.py", line 451, in trap self.raiseException()
File "/snap/maas/28521/usr/lib/python3/dist-packages/twisted/python/failure.py", line 475, in raiseException
raise self.value.with_traceback(self.tb)
File "/snap/maas/28521/usr/lib/python3/dist-packages/twisted/internet/defer.py", line 190, in maybeDeferred
result = f(*args, **kwargs)
File "/snap/maas/28521/lib/python3.10/site-packages/maasserver/rpc/__init__.py", line 25, in getClientFromIdentifiers
raise exceptions.NoConnectionsAvailable( provisioningserver.rpc.exceptions.NoConnectionsAvailable: Unable to connect to any rack controller rsfc68; no connections available.
Based on the context around the Python traceback, it looks to me like the rack controller is OK, goes away, and then comes back; but in the brief time that it's non-responsive, the region controller fails. I'm also seeing a lot of Python tracebacks in rackd.log.
I'll try to attach rackd.log and regiond.log to another comment; LP is flaking out when I try to attach it to this one.
This has been happening to us recently with MAAS 3.3.4. It's a sporadic failure. I'm seeing the following in regiond.log:
2023-09-08 19:12:21 twisted. internet. protocol. Factory: [info] RegionServer conne ss(type= 'TCP', host=': :ffff:10. 1.16.3' , port=52 s(type= 'TCP', host=': :ffff:10. 1.16.3' , rpc.regionservi ce: [info] Rack controller authent 10.1.16. 3:58044' . rpc.regionservi ce: [info] Rack controller authent 10.1.16. 3:58048' . ver.rpc. clusterservice. ClusterClientSe rvice models. node: [info] hoggus: Turning on netboot fo models. node: [info] hoggus: Turning ephemeral dep regiondservices .active_ discovery: [info] Active n models. signals. power: [critical] Failed to update maas/28521/ usr/lib/ python3/ dist-packages/ twisted/ internet/ defer.py" , line 857, in _runCallbacks
current. result = callback( # type: ignore[misc] maas/28521/ lib/python3. 10/site- packages/ maasserver/ models/ node.py" , line 6052, in cb_power_control entifiers( client_ idents) maas/28521/ lib/python3. 10/site- packages/ provisioningser ver/utils/ twisted. py", line 128, in wrapper maas/28521/ lib/python3. 10/site- packages/ provisioningser ver/utils/ twisted. py", line 60, in wrapper maas/28521/ usr/lib/ python3/ dist-packages/ twisted/ internet/ defer.py" , line 857, in _runCallbacks
current. result = callback( # type: ignore[misc] maas/28521/ lib/python3. 10/site- packages/ maasserver/ models/ signals/ power.py" , line 46, in eb_error
failure. trap(Node. DoesNotExist, UnknownPowerType, PowerProblem) maas/28521/ usr/lib/ python3/ dist-packages/ twisted/ python/ failure. py", line 451, in trap
self. raiseException( ) maas/28521/ usr/lib/ python3/ dist-packages/ twisted/ python/ failure. py", line 475, in raiseException with_traceback( self.tb) maas/28521/ usr/lib/ python3/ dist-packages/ twisted/ internet/ defer.py" , line 190, in maybeDeferred maas/28521/ lib/python3. 10/site- packages/ maasserver/ rpc/__init_ _.py", line 25, in getClientFromId entifiers NoConnectionsAv ailable(
provisioningse rver.rpc. exceptions. NoConnectionsAv ailable: Unable to connect to any rack controller rsfc68; no connections available.
ction established (HOST:IPv6Addre
52, flowInfo=0, scopeID=0) PEER:IPv6Addres
port=58048, flowInfo=0, scopeID=0))
2023-09-08 19:12:21 maasserver.
icated from '::ffff:
2023-09-08 19:12:21 maasserver.
icated from '::ffff:
2023-09-08 19:12:23 maasserver.ipc: [info] Worker pid:29451 registered RPC conne
ction to ('rsfc68', '10.1.16.3', 5252).
2023-09-08 19:12:25 maasserver.ipc: [info] Worker pid:29451 registered RPC conne
ction to ('rsfc68', '10.1.16.3', 5252).
2023-09-08 19:12:25 maasserver.dhcp: [info] Successfully configured DHCPv4 on ra
ck controller 'weavile (rsfc68)'.
2023-09-08 19:12:26 maasserver.dhcp: [info] Successfully configured DHCPv6 on ra
ck controller 'weavile (rsfc68)'.
2023-09-08 19:12:39 regiond: [info] 127.0.0.1 GET /MAAS/rpc/ HTTP/1.1 --> 200 OK
(referrer: -; agent: provisioningser
)
2023-09-08 19:13:02 maasserver.
r node
2023-09-08 19:13:02 maasserver.
loy off for node
2023-09-08 19:15:21 maasserver: [error] Error while calling ScanNetworks: Unable
to get RPC connection for rack controller 'weavile' (rsfc68).
2023-09-08 19:15:21 maasserver.
etwork discovery: Unable to initiate network scanning on any rack controller. Ve
rify that the rack controllers are started and have connected to the region.
2023-09-08 19:15:42 maasserver.
power state of machine after state transition.
Traceback (most recent call last):
File "/snap/
File "/snap/
d = getClientFromId
File "/snap/
return func(*args, **kwargs)
File "/snap/
return maybeDeferred(func, *args, **kwargs)
--- <exception caught here> ---
File "/snap/
File "/snap/
File "/snap/
File "/snap/
raise self.value.
File "/snap/
result = f(*args, **kwargs)
File "/snap/
raise exceptions.
2023-09-08 19:17:01 RegionServer, 5851,:: ffff:10. 1.20.3: [info] RegionServer connection lost (HOST:IPv6Addre ss(type= 'TCP', host=': :ffff:10. 1.20.3' , port=5253, flowInfo=0, scopeID=0) PEER:IPv6Addres s(type= 'TCP', host=': :ffff:10. 1.20.3' , port=44146, flowInfo=0, scopeID=0)) 5852,:: ffff:10. 1.20.3: [info] RegionServer connection lost (HOST:IPv6Addre ss(type= 'TCP', host=': :ffff:10. 1.20.3' , port=5253, flowInfo=0, scopeID=0) PEER:IPv6Addres s(type= 'TCP', host=': :ffff:10. 1.20.3' , port=44152, flowInfo=0, scopeID=0)) internet. protocol. Factory: [info] RegionServer connection established (HOST:IPv6Addre ss(type= 'TCP', host=': :ffff:10. 1.20.3' , port=5253, flowInfo=0, scopeID=0) PEER:IPv6Addres s(type= 'TCP', host=': :ffff:10. 1.20.3' , port=53994, flowInfo=0, scopeID=0)) internet. protocol. Factory: [info] RegionServer connection established (HOST:IPv6Addre ss(type= 'TCP', host=': :ffff:10. 1.20.3' , port=5253, flowInfo=0, scopeID=0) PEER:IPv6Addres s(type= 'TCP', host=': :ffff:10. 1.20.3' , port=54008, flowInfo=0, scopeID=0))
2023-09-08 19:17:01 RegionServer,
2023-09-08 19:17:01 maasserver.ipc: [info] Worker pid:29452 lost RPC connection to ('rsfc68', '10.1.20.3', 5253).
2023-09-08 19:17:01 maasserver.ipc: [info] Worker pid:29452 lost RPC connection to ('rsfc68', '10.1.20.3', 5253).
2023-09-08 19:17:21 twisted.
2023-09-08 19:17:21 twisted.
Based on the context around the Python traceback, it looks to me like the rack controller is OK, goes away, and then comes back; but in the brief time that it's non-responsive, the region controller fails. I'm also seeing a lot of Python tracebacks in rackd.log.
I'll try to attach rackd.log and regiond.log to another comment; LP is flaking out when I try to attach it to this one.