MAAS

RPC failure to contact rack/region - operations on closed handler

Bug #2029417 reported by Joao Andre Simioni on 2023-08-02

This bug affects 5 people

Affects	Status	Importance	Assigned to	Milestone
MAAS	Fix Released	High	Jacopo Rota	MAAS 3.5.0-beta1
3.2	Fix Released	High	Jacopo Rota	MAAS 3.2.10
3.3	Fix Released	High	Jacopo Rota	MAAS 3.3.5
3.4	Fix Released	High	Jacopo Rota	MAAS 3.4.1

Bug Description

[Problem Description]

After applying the fixes proposed in LP:2027735 to MAAS 3.2.8 (taken from ppa:r00ta/maas-2027735), MAAS started to behave well, with the expected improved performance. But after about 24 hours, provisioning of nodes started to fail, and the following traces were seen in:

rackd.log:
----------
2023-07-31 23:16:36 provisioningserver.rpc.clusterservice: [critical] Failed to contact region. (While requesting RPC info at http://10.217.0.11:5240/MAAS/, http://10.217.0.66:5240/MAAS/).
Traceback (most recent call last):
   File "/usr/lib/python3/dist-packages/twisted/internet/defer.py", line 460, in callback
     self._startRunCallbacks(result)
   File "/usr/lib/python3/dist-packages/twisted/internet/defer.py", line 568, in _startRunCallbacks
     self._runCallbacks()
   File "/usr/lib/python3/dist-packages/twisted/internet/defer.py", line 654, in _runCallbacks
     current.result = callback(current.result, *args, **kw)
   File "/usr/lib/python3/dist-packages/twisted/internet/defer.py", line 1475, in gotResult
     _inlineCallbacks(r, g, status)
--- <exception caught here> ---
   File "/usr/lib/python3/dist-packages/provisioningserver/rpc/clusterservice.py", line 1292, in _doUpdate
     eventloops, maas_url = yield self._get_rpc_info(urls)
   File "/usr/lib/python3/dist-packages/provisioningserver/rpc/clusterservice.py", line 1549, in _get_rpc_info
     raise config_exc
   File "/usr/lib/python3/dist-packages/provisioningserver/rpc/clusterservice.py", line 1520, in _get_rpc_info
     eventloops, maas_url = yield self._parallel_fetch_rpc_info(urls)
   File "/usr/lib/python3/dist-packages/twisted/internet/defer.py", line 654, in _runCallbacks
     current.result = callback(current.result, *args, **kw)
   File "/usr/lib/python3/dist-packages/provisioningserver/rpc/clusterservice.py", line 1494, in handle_responses
     errors[0].raiseException()
   File "/usr/lib/python3/dist-packages/twisted/python/failure.py", line 467, in raiseException
     raise self.value.with_traceback(self.tb)
   File "/usr/lib/python3/dist-packages/provisioningserver/rpc/clusterservice.py", line 1455, in _serial_fetch_rpc_info
     raise last_exc
   File "/usr/lib/python3/dist-packages/provisioningserver/rpc/clusterservice.py", line 1447, in _serial_fetch_rpc_info
     response = yield self._fetch_rpc_info(url, orig_url)
   File "/usr/lib/python3/dist-packages/twisted/internet/defer.py", line 1416, in _inlineCallbacks
     result = result.throwExceptionIntoGenerator(g)
   File "/usr/lib/python3/dist-packages/twisted/python/failure.py", line 491, in throwExceptionIntoGenerator
     return g.throw(self.type, self.value, self.tb)
   File "/usr/lib/python3/dist-packages/provisioningserver/rpc/clusterservice.py", line 1549, in _get_rpc_info
     raise config_exc
   File "/usr/lib/python3/dist-packages/provisioningserver/rpc/clusterservice.py", line 1520, in _get_rpc_info
     eventloops, maas_url = yield self._parallel_fetch_rpc_info(urls)
   File "/usr/lib/python3/dist-packages/twisted/internet/defer.py", line 654, in _runCallbacks
     current.result = callback(current.result, *args, **kw)
   File "/usr/lib/python3/dist-packages/provisioningserver/rpc/clusterservice.py", line 1494, in handle_responses
     errors[0].raiseException()
   File "/usr/lib/python3/dist-packages/twisted/python/failure.py", line 467, in raiseException
     raise self.value.with_traceback(self.tb)
   File "/usr/lib/python3/dist-packages/twisted/internet/defer.py", line 1416, in _inlineCallbacks
     result = result.throwExceptionIntoGenerator(g)
   File "/usr/lib/python3/dist-packages/twisted/python/failure.py", line 491, in throwExceptionIntoGenerator
     return g.throw(self.type, self.value, self.tb)
   File "/usr/lib/python3/dist-packages/provisioningserver/rpc/clusterservice.py", line 1455, in _serial_fetch_rpc_info
     raise last_exc
   File "/usr/lib/python3/dist-packages/provisioningserver/rpc/clusterservice.py", line 1447, in _serial_fetch_rpc_info
     response = yield self._fetch_rpc_info(url, orig_url)
twisted.internet.error.ConnectingCancelledError: HostnameAddress(hostname=b'10.217.0.11', port=5240)

2023-07-31 23:16:36 provisioningserver.rpc.common: [debug] [RPC -> sent] AmpBox({b'_command': b'Ping'})

regiond.log:
------------
2023-07-31 23:17:23 maasserver.dhcp: [critical] Error configuring DHCPv6 on rack controller 'pdx01-m01-c34-cpu-01 (xfhrbn)': unable to perform operation on <UVPoll closed=True 0x7f33f5cf0660>; the handler is closed
Traceback (most recent call last):
   File "/usr/lib/python3/dist-packages/provisioningserver/prometheus/utils.py", line 127, in wrapper
     result = func(*args, **kwargs)
   File "/usr/lib/python3/dist-packages/provisioningserver/utils/twisted.py", line 127, in wrapper
     return func(*args, **kwargs)
   File "/usr/lib/python3/dist-packages/provisioningserver/rpc/common.py", line 176, in __call__
     return deferWithTimeout(
   File "/usr/lib/python3/dist-packages/provisioningserver/utils/twisted.py", line 325, in deferWithTimeout
     d = maybeDeferred(func, *args, **kwargs)
--- <exception caught here> ---
   File "/usr/lib/python3/dist-packages/maasserver/dhcp.py", line 898, in configure_dhcp
     yield client(
   File "/usr/lib/python3/dist-packages/twisted/internet/defer.py", line 151, in maybeDeferred
     result = f(*args, **kw)
   File "/usr/lib/python3/dist-packages/twisted/protocols/amp.py", line 971, in callRemote
     return co._doCommand(self)
   File "/usr/lib/python3/dist-packages/twisted/protocols/amp.py", line 2000, in _doCommand
     d = proto._sendBoxCommand(self.commandName,
   File "/usr/lib/python3/dist-packages/provisioningserver/rpc/common.py", line 261, in _sendBoxCommand
     return super()._sendBoxCommand(
   File "/usr/lib/python3/dist-packages/twisted/protocols/amp.py", line 902, in _sendBoxCommand
     box._sendTo(self.boxSender)
   File "/usr/lib/python3/dist-packages/twisted/protocols/amp.py", line 723, in _sendTo
     proto.sendBox(self)
   File "/usr/lib/python3/dist-packages/twisted/protocols/amp.py", line 2386, in sendBox
     self.transport.write(box.serialize())
   File "/usr/lib/python3/dist-packages/twisted/internet/_newtls.py", line 191, in write
     FileDescriptor.write(self, bytes)
   File "/usr/lib/python3/dist-packages/twisted/internet/abstract.py", line 356, in write
     self.startWriting()
   File "/usr/lib/python3/dist-packages/twisted/internet/abstract.py", line 443, in startWriting
     self.reactor.addWriter(self)
   File "/usr/lib/python3/dist-packages/twisted/internet/asyncioreactor.py", line 173, in addWriter
     self._asyncioEventloop.add_writer(fd, callWithLogger, writer,
   File "uvloop/loop.pyx", line 2399, in uvloop.loop.Loop.add_writer
   File "uvloop/loop.pyx", line 808, in uvloop.loop.Loop._add_writer
   File "uvloop/handles/poll.pyx", line 122, in uvloop.loop.UVPoll.start_writing
   File "uvloop/handles/poll.pyx", line 39, in uvloop.loop.UVPoll._poll_start
   File "uvloop/handles/handle.pyx", line 159, in uvloop.loop.UVHandle._ensure_alive
builtins.RuntimeError: unable to perform operation on <UVPoll closed=True 0x7f33f5cf0660>; the handler is closed

2023-07-31 23:17:23 maasserver.rack_controller: [critical] Failed configuring DHCP on rack controller 'id:12'.
Traceback (most recent call last):
   File "/usr/lib/python3/dist-packages/twisted/internet/defer.py", line 1475, in gotResult
     _inlineCallbacks(r, g, status)
   File "/usr/lib/python3/dist-packages/twisted/internet/defer.py", line 1464, in _inlineCallbacks
     status.deferred.errback()
   File "/usr/lib/python3/dist-packages/twisted/internet/defer.py", line 501, in errback
     self._startRunCallbacks(fail)
   File "/usr/lib/python3/dist-packages/twisted/internet/defer.py", line 568, in _startRunCallbacks
     self._runCallbacks()
--- <exception caught here> ---
   File "/usr/lib/python3/dist-packages/twisted/internet/defer.py", line 654, in _runCallbacks
     current.result = callback(current.result, *args, **kw)
   File "/usr/lib/python3/dist-packages/maasserver/rack_controller.py", line 281, in <lambda>
     d.addErrback(lambda f: f.trap(NoConnectionsAvailable))
   File "/usr/lib/python3/dist-packages/twisted/python/failure.py", line 439, in trap
     self.raiseException()
   File "/usr/lib/python3/dist-packages/twisted/python/failure.py", line 467, in raiseException
     raise self.value.with_traceback(self.tb)
   File "/usr/lib/python3/dist-packages/twisted/internet/defer.py", line 1418, in _inlineCallbacks
     result = g.send(result)
   File "/usr/lib/python3/dist-packages/maasserver/dhcp.py", line 951, in configure_dhcp
     raise ipv4_exc
   File "/usr/lib/python3/dist-packages/maasserver/dhcp.py", line 869, in configure_dhcp
     yield client(
builtins.RuntimeError: unable to perform operation on <UVPoll closed=True 0x7f33f5cf0660>; the handler is closed

Ubuntu version: 20.04
MAAS: 3.2.99 (Interim version from PPA)
Format: Debian
PostgreSQL 12

Related branches

~r00ta/maas:lp-2029417--3.4

Merged into maas:3.4

Jack Lloyd-Walters: Approve on 2023-10-18

MAAS Lander: Approve on 2023-10-03

~r00ta/maas:lp-2029417--3.5

Merged into maas:master

Jerzy Husakowski: Approve on 2023-09-22

MAAS Lander: Approve on 2023-09-21

~r00ta/maas:lp-2029417-3.4

Merged into maas:3.4

MAAS Lander: Approve on 2023-08-30

Jacopo Rota: Approve on 2023-08-30

~r00ta/maas:lp-2029417-3.3

Merged into maas:3.3

MAAS Lander: Needs Fixing on 2023-08-30

Adam Collard (community): Approve on 2023-08-30

Jacopo Rota: Approve on 2023-08-30

~r00ta/maas:lp-2029417

Merged into maas:master

MAAS Lander: Approve on 2023-08-24

Adam Collard (community): Approve on 2023-08-24

~igor-brovtsin/maas:r00ta-lp-2029417-landing

Merged into maas:3.2

Adam Collard (community): Approve on 2023-08-22

MAAS Lander: Needs Fixing on 2023-08-22

Alan Baghumian (alanbach) wrote on 2023-08-02:

The planned MAAS 3.2.9 release includes the LP #2027735 fix and that might trigger this issue at more client locations once they upgrade.

Jacopo Rota (r00ta) wrote on 2023-08-02:

I see in the rack logs that the same call fails with
```
2023-07-31 23:07:35 provisioningserver.rpc.clusterservice: [critical] Failed to contact region. (While requesting RPC info at http://10.217.0.11:5240/MAAS/, http://10.217.0.66:5240/MAAS/).
...
response = yield self._fetch_rpc_info(url, orig_url)
twisted.internet.error.DNSLookupError: DNS lookup failed: Couldn't find the hostname '10.217.0.11'.
```

and

```
2023-07-31 23:09:50 provisioningserver.rpc.clusterservice: [critical] Failed to contact region. (While requesting RPC info at http://10.217.0.11:5240/MAAS/, http://10.217.0.66:5240/MAAS/).
...
response = yield self._fetch_rpc_info(url, orig_url)
twisted.web._newclient.ResponseNeverReceived: [<twisted.python.failure.Failure twisted.internet.defer.CancelledError: >]
```

many times.

Alan Baghumian (alanbach) wrote on 2023-08-02:

@r00ta Please note that we don't have any connectivity issues at the networking level.

Stopping and Starting Rackd & Regiond services fixes this issue. Everything works for quite some time after the restart, and then this starts happening.

Alan Baghumian (alanbach) wrote on 2023-08-02:

Just a side note: the region controllers have been configured to use 8 workers. What is your opinion on reducing them back to the default of 4?

Do you think that will make the situation better or not?

Jacopo Rota (r00ta) wrote on 2023-08-03:

@alan Every worker requires 10 database connections. So if you have two regions with 8 workers each, you are exceeding the maximum number of database connections (in postgres it's 100 by default).
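The arithmetic behind that remark can be checked with a short sketch. The figures (10 connections per worker, 2 regions, 8 or 4 workers, PostgreSQL's default limit of 100) come from the comments above; the function and constant names are only illustrative, not MAAS code.

```python
# Back-of-the-envelope check of the numbers quoted in the comment above.
CONNECTIONS_PER_WORKER = 10             # figure quoted in the thread
POSTGRES_DEFAULT_MAX_CONNECTIONS = 100  # PostgreSQL default for max_connections


def required_connections(regions: int, workers_per_region: int) -> int:
    """Return the database connections needed by all region workers."""
    return regions * workers_per_region * CONNECTIONS_PER_WORKER


default_setup = required_connections(regions=2, workers_per_region=4)
reported_setup = required_connections(regions=2, workers_per_region=8)

# Compare each setup against the default PostgreSQL limit.
print(default_setup, default_setup <= POSTGRES_DEFAULT_MAX_CONNECTIONS)
print(reported_setup, reported_setup <= POSTGRES_DEFAULT_MAX_CONNECTIONS)
```

With 8 workers per region the two regions need 160 connections, which is above the default limit but below a raised limit such as 400.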

Jerzy Husakowski (jhusakowski) on 2023-08-03

tags:

added: bug-council

Joao Andre Simioni (jasimioni) wrote on 2023-08-03:

Hi Jacopo,

max_connections is set to 400, and it is around 210 in normal operations, as rbac and candid are being used.

Adam Collard (adam-collard) on 2023-08-03

summary:	- After applying fix on LP#2027735 RPC Communitation is failing
	+ RPC failure to contact rack/region - operations on closed handler
description:	updated

Björn Tillenius (bjornt) on 2023-08-08

Changed in maas:
assignee:	nobody → Jacopo Rota (r00ta)
status:	New → In Progress

Jacopo Rota (r00ta) wrote on 2023-08-09:

You can find the investigations I've made here https://docs.google.com/presentation/d/1_YuUw7DlfghQxBakmnU4Fv_o8kp6PdnYEn4MxActi5I/edit#slide=id.p .

To summarize, based on my investigations, the issue should be at the Twisted/uvloop level. In particular, if libuv raises an exception (out of memory, too many open files, I/O error, whatever), uvloop might silently close the connection without invoking the `connection_lost` callback.

MAAS keeps track of the connections to the rack controllers using the `connection_made` and `connection_lost` callbacks. In particular, if `connection_lost` is NOT called, we don't remove this connection from the queue.
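The failure mode described above can be illustrated with a toy model. This is not MAAS or Twisted code: `ConnectionPool` and `FakeConnection` are hypothetical stand-ins that only mimic the `connection_made`/`connection_lost` bookkeeping, to show how a connection that closes silently stays tracked and fails on the next use.

```python
class ConnectionPool:
    """Toy stand-in for connection bookkeeping (hypothetical, not MAAS code)."""

    def __init__(self):
        self.connections = set()

    def connection_made(self, conn):
        self.connections.add(conn)

    def connection_lost(self, conn):
        self.connections.discard(conn)

    def pick(self):
        # Hands out any tracked connection without checking it is usable.
        return next(iter(self.connections))


class FakeConnection:
    """Toy transport that raises once it has been closed."""

    def __init__(self, name):
        self.name = name
        self.closed = False

    def write(self, data):
        if self.closed:
            raise RuntimeError("the handler is closed")
        return len(data)


pool = ConnectionPool()
conn = FakeConnection("rack-1")
pool.connection_made(conn)

# The transport closes silently: connection_lost is never invoked,
# so the pool still believes the connection is alive.
conn.closed = True

stale = pool.pick()
try:
    stale.write(b"ping")
    outcome = "ok"
except RuntimeError as exc:
    outcome = str(exc)

print(len(pool.connections), outcome)
```

In this model, every caller that picks the stale entry fails the same way until something removes it, which is consistent with a restart clearing the problem.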

I've opened https://github.com/MagicStack/uvloop/issues/552.

If the uvloop maintainers don't confirm this is a bug, we'll have to investigate on the Twisted side (for example, the exception should be handled here https://github.com/twisted/twisted/blob/a9ee8e59a5cd1950bf1a18c4b6ca813e5f56ad08/src/twisted/internet/asyncioreactor.py#L171 ; apart from that, Twisted should call `connection_lost` anyway).

Jerzy Husakowski (jhusakowski) wrote on 2023-08-10:

Let's compare the performance of MAAS with and without uvloop. If the performance tests show no meaningful difference when uvloop is used, we should consider removing it.

Jacopo Rota (r00ta) wrote on 2023-08-10:

I agree @jerzy. Will do

Björn Tillenius (bjornt) wrote on 2023-08-14:

#10

In what way did the machines fail to provision? Did they get turned on? Did they receive an IP? Did they download bootloaders? I.e. where did they fail?

While the tracebacks are concerning, I'm not sure they actually cause any fatal errors. There are a lot of different errors in the logs, but without knowing how things fail, we can only guess at possible causes.

Changed in maas:
status:	In Progress → Incomplete

Jacopo Rota (r00ta) wrote on 2023-08-14:

#11

connectionsfull.png (186.3 KiB, image/png)

As per the Mattermost chat, I think I finally found why loseConnection was not called, as I suspected last week: it is done on purpose in our code, see https://git.launchpad.net/maas/tree/src/provisioningserver/rpc/common.py#n318 .
Looking at Twisted's default implementation, unhandledError would call loseConnection (https://github.com/twisted/twisted/blob/a9ee8e59a5cd1950bf1a18c4b6ca813e5f56ad08/src/twisted/protocols/amp.py#L2507 ), but this behaviour was overridden in 2015 for https://bugs.launchpad.net/maas/+bug/1457799 .
The proof is that in the new logs it's much clearer that the number of connections goes wrong exactly when we hit the "Unhandled failure during AMP request" exception (see the screenshot: the blue vertical dashed line marks where we see those "Unhandled failure during AMP request" exceptions). It's now evident that the DHCP exceptions start after we hit the AMP exception.

So I think the fix could simply be something like

```
    def unhandledError(self, failure):
        """Terminal errback, after application code has seen the failure.

        `amp.BoxDispatcher.unhandledError` calls the `amp.IBoxSender`'s
        `unhandledError`. In the default implementation this disconnects the
        transport.

        Here we instead log the failure but do *not* disconnect because it's
        too disruptive to the running of MAAS. The exception is a closed
        handler: the transport is already unusable, so we let the default
        implementation drop the connection.
        """
        # The message logged by uvloop ends with "the handler is closed"
        # (lowercase), so the substring check must match that casing.
        if (
            failure.check(RuntimeError)
            and "the handler is closed" in failure.getErrorMessage()
        ):
            super().unhandledError(failure)
            log.info("The handler is closed, the connection will be dropped.")
        else:
            log.err(
                failure,
                (
                    "Unhandled failure during AMP request. This is probably a bug. "
                    "Please ensure that this error is handled within application "
                    "code."
                ),
            )
```
Can anybody in the team double-check my latest findings above and send an MP? I'm currently off and can't follow up during the team's working hours.
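The proposed condition depends on matching the text of the RuntimeError seen in regiond.log, which ends in a lowercase "the handler is closed". A minimal sketch of such a check, independent of Twisted, might look like the following; `is_closed_handler_error` is a hypothetical helper name, and the sample message is copied from the log above.

```python
def is_closed_handler_error(exc: BaseException) -> bool:
    """Return True for the 'handler is closed' RuntimeError seen in the logs.

    The message is lowercased before comparing, so the check does not
    depend on the capitalisation of the literal.
    """
    return (
        isinstance(exc, RuntimeError)
        and "the handler is closed" in str(exc).lower()
    )


# Message copied from the regiond.log excerpt in the bug description.
logged = RuntimeError(
    "unable to perform operation on <UVPoll closed=True 0x7f33f5cf0660>; "
    "the handler is closed"
)

print(is_closed_handler_error(logged))
print(is_closed_handler_error(RuntimeError("something else")))
print(is_closed_handler_error(ValueError("the handler is closed")))
```

A plain case-sensitive `in` test with a capitalised literal would not match the logged message, which is why the sketch normalises the case first.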

Jack Lloyd-Walters (lloydwaltersj) on 2023-08-17

Changed in maas:
status:	Incomplete → Confirmed

Jack Lloyd-Walters (lloydwaltersj) on 2023-08-17

Changed in maas:
status:	Confirmed → In Progress

Jacopo Rota (r00ta) on 2023-08-24

Changed in maas:
milestone:	none → 3.5.0
status:	In Progress → Fix Committed
status:	Fix Committed → In Progress

Jacopo Rota (r00ta) on 2023-08-24

Changed in maas:
importance:	Undecided → High

Jacopo Rota (r00ta) wrote on 2023-08-25:

#12

connectionsfull.png (186.3 KiB, image/png)

MAAS Lander (maas-lander) on 2023-08-29

Changed in maas:
status:	In Progress → Fix Committed

Jerzy Husakowski (jhusakowski) on 2023-08-31

tags:

removed: bug-council

Joao Andre Simioni (jasimioni) wrote on 2023-09-19:

#13

Hi team,

the problem manifested again after many days. I've shared the sos reports internally.

Jacopo Rota (r00ta) wrote on 2023-09-20:

#14

I have a small reproducer. I think it's (almost?) impossible to reproduce the race condition that leads to the closed handler: they have a pretty unique environment and we had to wait more than a month to hit the bug again.
So I focused on forcing the same exception at the same place, and I'm now working on a new fix. For the record, the reproducer is here: https://github.com/r00ta/maas-bugs-reproducers/tree/main/lp-2029417

Joao Andre Simioni (jasimioni) wrote on 2024-01-05:

#15

The customer is using the interim version we provided (3.2.10~alpha3-12060-g.ee175c971-0ubuntu1), and around 30 days after a restart, they start to see RPC problems again.

The MAAS environment has two servers - pdx01-m01-c33-cpu-01 and pdx01-m01-c34-cpu-01

Restarting the maas-rackd and maas-regiond on both servers brings the system back to normality.

Sos reports are available here:
https://drive.google.com/drive/folders/1u01dldSwTUoEYb6s3dz3n-539MlLEwSf?usp=drive_link

These logs are seen in maas.log:

2023-11-02T12:43:32+00:00 pdx01-m01-c34-cpu-01 maas.service_monitor_service: [error] Can't update service statuses, no RPC connection to region.
2023-11-02T13:04:14+00:00 pdx01-m01-c34-cpu-01 maas.service_monitor_service: message repeated 15 times: [ [error] Can't update service statuses, no RPC connection to region.]
2023-11-02T13:06:00+00:00 pdx01-m01-c34-cpu-01 maas.service_monitor_service: [error] Can't update service statuses, no RPC connection to region.
2023-11-02T13:13:15+00:00 pdx01-m01-c34-cpu-01 maas.service_monitor_service: message repeated 6 times: [ [error] Can't update service statuses, no RPC connection to region.]
2023-11-02T13:14:29+00:00 pdx01-m01-c34-cpu-01 maas.service_monitor_service: [error] Can't update service statuses, no RPC connection to region.
2023-11-02T13:59:57+00:00 pdx01-m01-c34-cpu-01 maas.service_monitor_service: [error] Can't update service statuses, no RPC connection to region.
2023-11-02T14:04:14+00:00 pdx01-m01-c34-cpu-01 maas.service_monitor_service: message repeated 3 times: [ [error] Can't update service statuses, no RPC connection to region.]
2023-11-02T14:05:44+00:00 pdx01-m01-c34-cpu-01 maas.service_monitor_service: [error] Can't update service statuses, no RPC connection to region.
2023-11-02T14:16:49+00:00 pdx01-m01-c34-cpu-01 maas.service_monitor_service: message repeated 7 times: [ [error] Can't update service statuses, no RPC connection to region.]

2023-12-04T20:03:50+00:00 pdx01-m01-c34-cpu-01 maas.service_monitor_service: [error] Can't update service statuses, no RPC connection to region.

2023-12-29T18:08:20+00:00 pdx01-m01-c34-cpu-01 maas.service_monitor_service: [error] Can't update service statuses, no RPC connection to region.

2023-12-30T06:07:42+00:00 pdx01-m01-c34-cpu-01 maas.service_monitor_service: [error] Can't update service statuses, no RPC connection to region.
2023-12-30T06:09:40+00:00 pdx01-m01-c34-cpu-01 maas.service_monitor_service: [error] Can't update service statuses, no RPC connection to region.
2023-12-30T06:10:53+00:00 pdx01-m01-c34-cpu-01 maas.service_monitor_service: [error] Can't update service statuses, no RPC connection to region.
2023-12-30T06:14:34+00:00 pdx01-m01-c34-cpu-01 maas.service_monitor_service: [error] Can't update service statuses, no RPC connection to region.
2023-12-30T06:15:02+00:00 pdx01-m01-c34-cpu-01 maas.dhcp.probe: [error] Can't initiate DHCP probe; no RPC connection to region.
2023-12-30T06:15:53+00:00 pdx01-m01-c34-cpu-01 maas.service_monitor_service: [error] Can't update service statuses, no RPC connection to region.

2024-01-03T06:29:23+00:00 pdx01-m01-c34-cpu-01 maas.service_monitor_service: [error] Can't update service statuses, no RPC connection to region.
2024-01-03T08:53:39+00:00 pdx01-m01-c34-cpu-01 maas.service_monitor_service: [error] Can't update service statuses, no RPC connection to region.
2024-01-03T08:54:32+00:00 pdx01-m01-c34-cpu-01 maas.dhcp.probe: [error] Can't initiate DHCP probe; no RPC connection to region.
2024-01-03T08:54:32+00:00 pdx01-m01-c34-cpu-01 maas.service_monitor_service: [error] Can't update service statuses, no RPC connection to region.

In regiond.log, different types of logs can be seen:

2023-12-04 20:02:47 -: [critical] Amp server or network failure unhandled by client application.  Dropping connection!  To avoid, add errbacks to ALL remote commands!
2023-12-05 03:51:46 -: [critical] Amp server or network failure unhandled by client application.  Dropping connection!  To avoid, add errbacks to ALL remote commands!
2023-12-05 21:34:47 -: [critical] Amp server or network failure unhandled by client application.  Dropping connection!  To avoid, add errbacks to ALL remote commands!
2023-12-06 16:27:34 -: [critical] Amp server or network failure unhandled by client application.  Dropping connection!  To avoid, add errbacks to ALL remote commands!
2023-12-07 16:04:18 -: [critical] Amp server or network failure unhandled by client application.  Dropping connection!  To avoid, add errbacks to ALL remote commands!

2023-12-29 18:06:49 -: [critical] Amp server or network failure unhandled by client application.  Dropping connection!  To avoid, add errbacks to ALL remote commands!
2023-12-30 06:04:48 -: [critical] Amp server or network failure unhandled by client application.  Dropping connection!  To avoid, add errbacks to ALL remote commands!
2023-12-30 11:02:20 -: [critical] Amp server or network failure unhandled by client application.  Dropping connection!  To avoid, add errbacks to ALL remote commands!
2023-12-30 21:25:47 -: [critical] Amp server or network failure unhandled by client application.  Dropping connection!  To avoid, add errbacks to ALL remote commands!

2024-01-03 06:30:17 -: [critical] Amp server or network failure unhandled by client application.  Dropping connection!  To avoid, add errbacks to ALL remote commands!
2024-01-05 06:40:47 -: [critical] Amp server or network failure unhandled by client application.  Dropping connection!  To avoid, add errbacks to ALL remote commands!
2024-01-05 08:12:30 -: [critical] Amp server or network failure unhandled by client application.  Dropping connection!  To avoid, add errbacks to ALL remote commands!
2024-01-05 08:41:16 -: [critical] Amp server or network failure unhandled by client application.  Dropping connection!  To avoid, add errbacks to ALL remote commands!

These messages usually precede the point where the system starts to fail. Below we can see that the "Error configuring DHCPv4" messages start to appear, up to the point where they occur every minute. This is when the customer recognizes that a restart is needed.

2024-01-03 06:29:55 maasserver.dhcp: [critical] Error configuring DHCPv4 on rack controller 'pdx01-m01-c34-cpu-01 (xfhrbn)': 
2024-01-03 06:29:55 maasserver.dhcp: [critical] Error configuring DHCPv4 on rack controller 'pdx01-m01-c34-cpu-01 (xfhrbn)': 
2024-01-03 06:30:17 maasserver.dhcp: [critical] Error configuring DHCPv6 on rack controller 'pdx01-m01-c34-cpu-01 (xfhrbn)': Connection was closed cleanly.
2024-01-03 06:30:17 maasserver.rack_controller: [critical] Failed configuring DHCP on rack controller 'id:12'.
2024-01-03 06:30:30 maasserver.dhcp: [critical] Error configuring DHCPv6 on rack controller 'pdx01-m01-c34-cpu-01 (xfhrbn)': 
2024-01-03 06:30:30 maasserver.rack_controller: [critical] Failed configuring DHCP on rack controller 'id:12'.
2024-01-03 06:31:06 maasserver.dhcp: [critical] Error configuring DHCPv4 on rack controller 'pdx01-m01-c34-cpu-01 (xfhrbn)': Connection was closed cleanly.
2024-01-03 06:31:06 maasserver.dhcp: [critical] Error configuring DHCPv6 on rack controller 'pdx01-m01-c34-cpu-01 (xfhrbn)': Connection was closed cleanly.
2024-01-03 06:31:06 maasserver.rack_controller: [critical] Failed configuring DHCP on rack controller 'id:12'.
2024-01-03 06:31:24 maasserver.dhcp: [critical] Error configuring DHCPv4 on rack controller 'pdx01-m01-c34-cpu-01 (xfhrbn)': 
2024-01-03 06:31:59 maasserver.rack_controller: [critical] Failed configuring DHCP on rack controller 'id:12'.
2024-01-03 06:32:44 maasserver.dhcp: [critical] Error configuring DHCPv4 on rack controller 'pdx01-m01-c34-cpu-01 (xfhrbn)': 
2024-01-03 06:33:01 maasserver.rack_controller: [critical] Failed configuring DHCP on rack controller 'id:12'.
2024-01-03 08:07:47 maasserver.dhcp: [critical] Error configuring DHCPv4 on rack controller 'pdx01-m01-c34-cpu-01 (xfhrbn)': 
2024-01-03 08:08:01 maasserver.rack_controller: [critical] Failed configuring DHCP on rack controller 'id:12'.
2024-01-03 08:51:32 maasserver.dhcp: [critical] Error configuring DHCPv4 on rack controller 'pdx01-m01-c34-cpu-01 (xfhrbn)': 
2024-01-03 08:51:41 maasserver.dhcp: [critical] Error configuring DHCPv6 on rack controller 'pdx01-m01-c34-cpu-01 (xfhrbn)': Connection was closed cleanly.
2024-01-03 08:51:41 maasserver.rack_controller: [critical] Failed configuring DHCP on rack controller 'id:12'.
2024-01-03 08:55:39 maasserver.dhcp: [critical] Error configuring DHCPv4 on rack controller 'pdx01-m01-c34-cpu-01 (xfhrbn)': 
2024-01-03 08:55:48 maasserver.dhcp: [critical] Error configuring DHCPv6 on rack controller 'pdx01-m01-c34-cpu-01 (xfhrbn)': Connection was closed cleanly.
2024-01-03 08:55:48 maasserver.rack_controller: [critical] Failed configuring DHCP on rack controller 'id:12'.
2024-01-03 08:57:16 maasserver.dhcp: [critical] Error configuring DHCPv4 on rack controller 'pdx01-m01-c34-cpu-01 (xfhrbn)': Connection was closed cleanly.
2024-01-03 08:57:16 maasserver.dhcp: [critical] Error configuring DHCPv6 on rack controller 'pdx01-m01-c34-cpu-01 (xfhrbn)': Connection was closed cleanly.
2024-01-03 08:57:16 maasserver.rack_controller: [critical] Failed configuring DHCP on rack controller 'id:12'.
2024-01-03 08:58:16 maasserver.dhcp: [critical] Error configuring DHCPv4 on rack controller 'pdx01-m01-c34-cpu-01 (xfhrbn)': Connection was closed cleanly.
2024-01-03 08:58:16 maasserver.dhcp: [critical] Error configuring DHCPv6 on rack controller 'pdx01-m01-c34-cpu-01 (xfhrbn)': Connection was closed cleanly.
2024-01-03 08:58:16 maasserver.rack_controller: [critical] Failed configuring DHCP on rack controller 'id:12'.
2024-01-03 08:59:05 maasserver.dhcp: [critical] Error configuring DHCPv4 on rack controller 'pdx01-m01-c34-cpu-01 (xfhrbn)': Connection was closed cleanly.
2024-01-03 08:59:05 maasserver.dhcp: [critical] Error configuring DHCPv6 on rack controller 'pdx01-m01-c34-cpu-01 (xfhrbn)': Connection was closed cleanly.
2024-01-03 08:59:05 maasserver.rack_controller: [critical] Failed configuring DHCP on rack controller 'id:12'.
2024-01-03 08:59:51 maasserver.dhcp: [critical] Error configuring DHCPv4 on rack controller 'pdx01-m01-c34-cpu-01 (xfhrbn)': 
2024-01-03 09:00:00 maasserver.rack_controller: [critical] Failed configuring DHCP on rack controller 'id:12'.
2024-01-03 09:00:45 maasserver.dhcp: [critical] Error configuring DHCPv4 on rack controller 'pdx01-m01-c34-cpu-01 (xfhrbn)': 
2024-01-03 09:00:52 maasserver.dhcp: [critical] Error configuring DHCPv4 on rack controller 'pdx01-m01-c34-cpu-01 (xfhrbn)':
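To spot the point where these errors become frequent, the regiond.log lines could be bucketed by hour. The sketch below is an assumption-laden helper, not a MAAS tool: `dhcp_errors_per_hour` is a hypothetical name, and it relies only on the timestamp layout visible in the excerpt above.

```python
from collections import Counter
from datetime import datetime

MARKER = "Error configuring DHCP"


def dhcp_errors_per_hour(lines):
    """Count 'Error configuring DHCP...' lines per hour of the log timestamp."""
    counts = Counter()
    for line in lines:
        if MARKER not in line:
            continue
        try:
            # The first 19 characters hold "YYYY-MM-DD HH:MM:SS".
            stamp = datetime.strptime(line[:19], "%Y-%m-%d %H:%M:%S")
        except ValueError:
            continue  # skip lines without a leading timestamp
        counts[stamp.strftime("%Y-%m-%d %H:00")] += 1
    return counts


# A few lines copied from the excerpt above.
sample = [
    "2024-01-03 06:29:55 maasserver.dhcp: [critical] Error configuring DHCPv4 on rack controller 'pdx01-m01-c34-cpu-01 (xfhrbn)': ",
    "2024-01-03 06:30:17 maasserver.dhcp: [critical] Error configuring DHCPv6 on rack controller 'pdx01-m01-c34-cpu-01 (xfhrbn)': Connection was closed cleanly.",
    "2024-01-03 06:30:17 maasserver.rack_controller: [critical] Failed configuring DHCP on rack controller 'id:12'.",
    "2024-01-03 08:51:32 maasserver.dhcp: [critical] Error configuring DHCPv4 on rack controller 'pdx01-m01-c34-cpu-01 (xfhrbn)': ",
]

for hour, count in sorted(dhcp_errors_per_hour(sample).items()):
    print(hour, count)
```

Feeding the whole file instead of `sample` (for example, the lines of an opened regiond.log) would give one count per hour, which makes a rising error rate easier to see than scanning the raw log.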

This is the full stack trace of these errors:

2024-01-05 11:11:11 maasserver.dhcp: [critical] Error configuring DHCPv4 on rack controller 'pdx01-m01-c33-cpu-01 (cgbctk)':
        Traceback (most recent call last):
        --- <exception caught here> ---
          File "/usr/lib/python3/dist-packages/maasserver/dhcp.py", line 875, in configure_dhcp
            yield client(
          File "/usr/lib/python3/dist-packages/twisted/internet/defer.py", line 654, in _runCallbacks
            current.result = callback(current.result, *args, **kw)
          File "/usr/lib/python3/dist-packages/provisioningserver/rpc/common.py", line 145, in _global_intercept_errback
            failure.raiseException()
          File "/usr/lib/python3/dist-packages/twisted/python/failure.py", line 467, in raiseException
            raise self.value.with_traceback(self.tb)
          File "/usr/lib/python3/dist-packages/twisted/internet/defer.py", line 654, in _runCallbacks
            current.result = callback(current.result, *args, **kw)
          File "/usr/lib/python3/dist-packages/twisted/protocols/amp.py", line 1994, in _massageError
            error.trap(RemoteAmpError)
          File "/usr/lib/python3/dist-packages/twisted/python/failure.py", line 439, in trap
            self.raiseException()
          File "/usr/lib/python3/dist-packages/twisted/python/failure.py", line 467, in raiseException 
            raise self.value.with_traceback(self.tb)
        twisted.internet.defer.CancelledError:

2024-01-05 11:11:17 twisted.internet.protocol.Factory: [info] RegionServer connection established (HOST:IPv6Address(type='TCP', host='::ffff:10.217.0.11', port=5250, flowInfo=0, scopeID=0) PEER:IPv6Address(type='TCP', host='::ffff:10.217.0.131', port=60518, flowInfo=0, scopeID=0))
2024-01-05 11:11:17 maasserver.rpc.regionservice: [info] Rack controller 'None' disconnected.
2024-01-05 11:11:17 RegionServer,91772,::ffff:10.217.0.131: [info] RegionServer connection lost (HOST:IPv6Address(type='TCP', host='::ffff:10.217.0.11', port=5250, flowInfo=0, scopeID=0) PEER:IPv6Address(type='TCP', host='::ffff:10.217.0.131', port=60518, flowInfo=0, scopeID=0))
2024-01-05 11:11:27 maasserver.dhcp: [info] Successfully configured DHCPv6 on rack controller 'pdx01-m01-c33-cpu-01 (cgbctk)'.
2024-01-05 11:11:27 maasserver.rack_controller: [critical] Failed configuring DHCP on rack controller 'id:1'.
        Traceback (most recent call last):
          File "/usr/lib/python3/dist-packages/twisted/internet/defer.py", line 1475, in gotResult
            _inlineCallbacks(r, g, status)
          File "/usr/lib/python3/dist-packages/twisted/internet/defer.py", line 1464, in _inlineCallbacks
            status.deferred.errback()
          File "/usr/lib/python3/dist-packages/twisted/internet/defer.py", line 501, in errback
            self._startRunCallbacks(fail)
          File "/usr/lib/python3/dist-packages/twisted/internet/defer.py", line 568, in _startRunCallbacks
            self._runCallbacks()
        --- <exception caught here> ---
          File "/usr/lib/python3/dist-packages/twisted/internet/defer.py", line 654, in _runCallbacks
            current.result = callback(current.result, *args, **kw)
          File "/usr/lib/python3/dist-packages/maasserver/rack_controller.py", line 281, in <lambda>
            d.addErrback(lambda f: f.trap(NoConnectionsAvailable))
          File "/usr/lib/python3/dist-packages/twisted/python/failure.py", line 439, in trap
            self.raiseException()
          File "/usr/lib/python3/dist-packages/twisted/python/failure.py", line 467, in raiseException
            raise self.value.with_traceback(self.tb)
          File "/usr/lib/python3/dist-packages/twisted/internet/defer.py", line 1418, in _inlineCallbacks
            result = g.send(result)
          File "/usr/lib/python3/dist-packages/maasserver/dhcp.py", line 957, in configure_dhcp
            raise ipv4_exc
          File "/usr/lib/python3/dist-packages/maasserver/dhcp.py", line 875, in configure_dhcp
            yield client(
        twisted.internet.defer.CancelledError:

I noticed MAAS 3.3 has some rework related to the connectivity between region and racks. Could that change this behavior, so that upgrading would be a recommended way to solve this issue?

Thanks

Joao Andre Simioni (jasimioni) wrote on 2024-01-05:

#16

I also attached the sos reports from when we were seeing the UVPoll errors, in case you want to compare. They also show the errors:

[critical] Amp server or network failure unhandled by client application. Dropping connection! To avoid, add errbacks to ALL remote commands!

maas.service_monitor_service: [error] Can't update service statuses, no RPC connection to region.

So even after the fix for UVPoll, some other issue is still being triggered.

Thanks

Jacopo Rota (r00ta) wrote on 2024-01-07:

#17

I suspect this is somehow a different issue. The _global_intercept_errback function that we added is supposed to intercept the exception, close the connection and re-raise (this is why you can still see the exception in the logs).
As a matter of fact, we no longer see the repeated RPC errors caused by the "the handler is closed" exception. However, it's really strange that we see the following logs:

2024-01-05 11:29:25 maasserver.dhcp: [info] Successfully configured DHCPv6 on rack controller 'pdx01-m01-c33-cpu-01 (cgbctk)'.
2024-01-05 11:29:25 maasserver.rack_controller: [critical] Failed configuring DHCP on rack controller 'id:1'.
which means that we managed to configure DHCPv6 but not DHCPv4. In other words, the region was actually able to talk to the rack, but the DHCPv4 RPC call failed somehow.
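
For readers following along, the "intercept, close the connection and re-raise" behaviour can be illustrated with a minimal sketch. This is not the actual MAAS code: it uses plain synchronous Python instead of Twisted's Deferred/errback API, and `FakeConnection` and `call_with_intercept` are hypothetical names made up for the example.

```python
class FakeConnection:
    """Hypothetical stand-in for an RPC connection (not a MAAS class)."""

    def __init__(self):
        self.closed = False

    def close(self):
        self.closed = True


def call_with_intercept(conn, rpc_call):
    """Run rpc_call; on any error, close the connection and re-raise."""
    try:
        return rpc_call()
    except Exception:
        # Drop the suspect connection so it is not reused for later calls.
        conn.close()
        # Re-raise unchanged: the caller (and the logs) still see the error.
        raise


if __name__ == "__main__":
    conn = FakeConnection()

    def failing_call():
        raise RuntimeError("simulated RPC failure")

    try:
        call_with_intercept(conn, failing_call)
    except RuntimeError as exc:
        print("error still propagated:", exc)
    print("connection closed:", conn.closed)
```

The point of the pattern is that the error is not swallowed: the connection is torn down as a side effect, while the original exception continues up to whoever issued the call, which is consistent with the traceback still appearing in regiond.log.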

As I see it, with the information we have right now there is not much we can do to triage this new issue. Can we include some more verbose logging in the PPA delivered to the customer, so that we can triage it if they hit it again?

Joao Andre Simioni (jasimioni) wrote on 2024-01-07:

#18

Sure. If you can release a new version with the additional logging I can work to upgrade the environment.

Also, rackd / regiond debug logging is turned off. If it helps, we can try to enable it.

Jacopo Rota (r00ta) wrote on 2024-01-08:

#19

Thanks, but I would not enable the `debug` option: it would just produce tons of logs, and I suspect they would not include what we are looking for. I'll add some specific log statements and get back to you when I'm done (tomorrow EOD).

Jacopo Rota (r00ta) wrote on 2024-01-11:

#20

Here's the PPA: https://launchpad.net/~r00ta/+archive/ubuntu/maas3.2-2027735-rpclogs