RPC failure to contact rack/region - operations on closed handler

Bug #2029417 reported by Joao Andre Simioni
38
This bug affects 5 people
Affects Status Importance Assigned to Milestone
MAAS
Fix Released
High
Jacopo Rota
3.2
Fix Released
High
Jacopo Rota
3.3
Fix Released
High
Jacopo Rota
3.4
Fix Released
High
Jacopo Rota

Bug Description

[Problem Description]

After applying the fixes proposed in LP:2027735 to MAAS 3.2.8 (taken from ppa:r00ta/maas-2027735), MAAS started to behave well, with the expected improved performance. But after around ~24 hours, provisioning of nodes started to fail, and the following traces were seen in:

rackd.log:
----------
2023-07-31 23:16:36 provisioningserver.rpc.clusterservice: [critical] Failed to contact region. (While requesting RPC info at http://10.217.0.11:5240/MAAS/, http://10.217.0.66:5240/MAAS/).
 Traceback (most recent call last):
   File "/usr/lib/python3/dist-packages/twisted/internet/defer.py", line 460, in callback
     self._startRunCallbacks(result)
   File "/usr/lib/python3/dist-packages/twisted/internet/defer.py", line 568, in _startRunCallbacks
     self._runCallbacks()
   File "/usr/lib/python3/dist-packages/twisted/internet/defer.py", line 654, in _runCallbacks
     current.result = callback(current.result, *args, **kw)
   File "/usr/lib/python3/dist-packages/twisted/internet/defer.py", line 1475, in gotResult
     _inlineCallbacks(r, g, status)
 --- <exception caught here> ---
   File "/usr/lib/python3/dist-packages/provisioningserver/rpc/clusterservice.py", line 1292, in _doUpdate
     eventloops, maas_url = yield self._get_rpc_info(urls)
   File "/usr/lib/python3/dist-packages/provisioningserver/rpc/clusterservice.py", line 1549, in _get_rpc_info
     raise config_exc
   File "/usr/lib/python3/dist-packages/provisioningserver/rpc/clusterservice.py", line 1520, in _get_rpc_info
     eventloops, maas_url = yield self._parallel_fetch_rpc_info(urls)
   File "/usr/lib/python3/dist-packages/twisted/internet/defer.py", line 654, in _runCallbacks
     current.result = callback(current.result, *args, **kw)
   File "/usr/lib/python3/dist-packages/provisioningserver/rpc/clusterservice.py", line 1494, in handle_responses
     errors[0].raiseException()
   File "/usr/lib/python3/dist-packages/twisted/python/failure.py", line 467, in raiseException
     raise self.value.with_traceback(self.tb)
   File "/usr/lib/python3/dist-packages/provisioningserver/rpc/clusterservice.py", line 1455, in _serial_fetch_rpc_info
     raise last_exc
   File "/usr/lib/python3/dist-packages/provisioningserver/rpc/clusterservice.py", line 1447, in _serial_fetch_rpc_info
     response = yield self._fetch_rpc_info(url, orig_url)
   File "/usr/lib/python3/dist-packages/twisted/internet/defer.py", line 1416, in _inlineCallbacks
     result = result.throwExceptionIntoGenerator(g)
   File "/usr/lib/python3/dist-packages/twisted/python/failure.py", line 491, in throwExceptionIntoGenerator
     return g.throw(self.type, self.value, self.tb)
   File "/usr/lib/python3/dist-packages/provisioningserver/rpc/clusterservice.py", line 1549, in _get_rpc_info
     raise config_exc
   File "/usr/lib/python3/dist-packages/provisioningserver/rpc/clusterservice.py", line 1520, in _get_rpc_info
     eventloops, maas_url = yield self._parallel_fetch_rpc_info(urls)
   File "/usr/lib/python3/dist-packages/twisted/internet/defer.py", line 654, in _runCallbacks
     current.result = callback(current.result, *args, **kw)
   File "/usr/lib/python3/dist-packages/provisioningserver/rpc/clusterservice.py", line 1494, in handle_responses
     errors[0].raiseException()
   File "/usr/lib/python3/dist-packages/twisted/python/failure.py", line 467, in raiseException
     raise self.value.with_traceback(self.tb)
   File "/usr/lib/python3/dist-packages/twisted/internet/defer.py", line 1416, in _inlineCallbacks
     result = result.throwExceptionIntoGenerator(g)
   File "/usr/lib/python3/dist-packages/twisted/python/failure.py", line 491, in throwExceptionIntoGenerator
     return g.throw(self.type, self.value, self.tb)
   File "/usr/lib/python3/dist-packages/provisioningserver/rpc/clusterservice.py", line 1455, in _serial_fetch_rpc_info
     raise last_exc
   File "/usr/lib/python3/dist-packages/provisioningserver/rpc/clusterservice.py", line 1447, in _serial_fetch_rpc_info
     response = yield self._fetch_rpc_info(url, orig_url)
 twisted.internet.error.ConnectingCancelledError: HostnameAddress(hostname=b'10.217.0.11', port=5240)

2023-07-31 23:16:36 provisioningserver.rpc.common: [debug] [RPC -> sent] AmpBox({b'_command': b'Ping'})

regiond.log:
------------
2023-07-31 23:17:23 maasserver.dhcp: [critical] Error configuring DHCPv6 on rack controller 'pdx01-m01-c34-cpu-01 (xfhrbn)': unable to perform operation on <UVPoll closed=True 0x7f33f5cf0660>; the handler is closed
 Traceback (most recent call last):
   File "/usr/lib/python3/dist-packages/provisioningserver/prometheus/utils.py", line 127, in wrapper
     result = func(*args, **kwargs)
   File "/usr/lib/python3/dist-packages/provisioningserver/utils/twisted.py", line 127, in wrapper
     return func(*args, **kwargs)
   File "/usr/lib/python3/dist-packages/provisioningserver/rpc/common.py", line 176, in __call__
     return deferWithTimeout(
   File "/usr/lib/python3/dist-packages/provisioningserver/utils/twisted.py", line 325, in deferWithTimeout
     d = maybeDeferred(func, *args, **kwargs)
 --- <exception caught here> ---
   File "/usr/lib/python3/dist-packages/maasserver/dhcp.py", line 898, in configure_dhcp
     yield client(
   File "/usr/lib/python3/dist-packages/twisted/internet/defer.py", line 151, in maybeDeferred
     result = f(*args, **kw)
   File "/usr/lib/python3/dist-packages/twisted/protocols/amp.py", line 971, in callRemote
     return co._doCommand(self)
   File "/usr/lib/python3/dist-packages/twisted/protocols/amp.py", line 2000, in _doCommand
     d = proto._sendBoxCommand(self.commandName,
   File "/usr/lib/python3/dist-packages/provisioningserver/rpc/common.py", line 261, in _sendBoxCommand
     return super()._sendBoxCommand(
   File "/usr/lib/python3/dist-packages/twisted/protocols/amp.py", line 902, in _sendBoxCommand
     box._sendTo(self.boxSender)
   File "/usr/lib/python3/dist-packages/twisted/protocols/amp.py", line 723, in _sendTo
     proto.sendBox(self)
   File "/usr/lib/python3/dist-packages/twisted/protocols/amp.py", line 2386, in sendBox
     self.transport.write(box.serialize())
   File "/usr/lib/python3/dist-packages/twisted/internet/_newtls.py", line 191, in write
     FileDescriptor.write(self, bytes)
   File "/usr/lib/python3/dist-packages/twisted/internet/abstract.py", line 356, in write
     self.startWriting()
   File "/usr/lib/python3/dist-packages/twisted/internet/abstract.py", line 443, in startWriting
     self.reactor.addWriter(self)
   File "/usr/lib/python3/dist-packages/twisted/internet/asyncioreactor.py", line 173, in addWriter
     self._asyncioEventloop.add_writer(fd, callWithLogger, writer,
   File "uvloop/loop.pyx", line 2399, in uvloop.loop.Loop.add_writer

   File "uvloop/loop.pyx", line 808, in uvloop.loop.Loop._add_writer

   File "uvloop/handles/poll.pyx", line 122, in uvloop.loop.UVPoll.start_writing

   File "uvloop/handles/poll.pyx", line 39, in uvloop.loop.UVPoll._poll_start

   File "uvloop/handles/handle.pyx", line 159, in uvloop.loop.UVHandle._ensure_alive

 builtins.RuntimeError: unable to perform operation on <UVPoll closed=True 0x7f33f5cf0660>; the handler is closed

2023-07-31 23:17:23 maasserver.rack_controller: [critical] Failed configuring DHCP on rack controller 'id:12'.
 Traceback (most recent call last):
   File "/usr/lib/python3/dist-packages/twisted/internet/defer.py", line 1475, in gotResult
     _inlineCallbacks(r, g, status)
   File "/usr/lib/python3/dist-packages/twisted/internet/defer.py", line 1464, in _inlineCallbacks
     status.deferred.errback()
   File "/usr/lib/python3/dist-packages/twisted/internet/defer.py", line 501, in errback
     self._startRunCallbacks(fail)
   File "/usr/lib/python3/dist-packages/twisted/internet/defer.py", line 568, in _startRunCallbacks
     self._runCallbacks()
 --- <exception caught here> ---
   File "/usr/lib/python3/dist-packages/twisted/internet/defer.py", line 654, in _runCallbacks
     current.result = callback(current.result, *args, **kw)
   File "/usr/lib/python3/dist-packages/maasserver/rack_controller.py", line 281, in <lambda>
     d.addErrback(lambda f: f.trap(NoConnectionsAvailable))
   File "/usr/lib/python3/dist-packages/twisted/python/failure.py", line 439, in trap
     self.raiseException()
   File "/usr/lib/python3/dist-packages/twisted/python/failure.py", line 467, in raiseException
     raise self.value.with_traceback(self.tb)
   File "/usr/lib/python3/dist-packages/twisted/internet/defer.py", line 1418, in _inlineCallbacks
     result = g.send(result)
   File "/usr/lib/python3/dist-packages/maasserver/dhcp.py", line 951, in configure_dhcp
     raise ipv4_exc
   File "/usr/lib/python3/dist-packages/maasserver/dhcp.py", line 869, in configure_dhcp
     yield client(
 builtins.RuntimeError: unable to perform operation on <UVPoll closed=True 0x7f33f5cf0660>; the handler is closed

Ubuntu version: 20.04
MAAS: 3.2.99 (Interim version from PPA)
Format: Debian
PostgreSQL 12

Related branches

Revision history for this message
Alan Baghumian (alanbach) wrote :

The planned MAAS 3.2.9 release includes the LP #2027735 fix and that might trigger this issue at more client locations once they upgrade.

Revision history for this message
Jacopo Rota (r00ta) wrote :

I see in the rack logs that the same call fails with
```
2023-07-31 23:07:35 provisioningserver.rpc.clusterservice: [critical] Failed to contact region. (While requesting RPC info at http://10.217.0.11:5240/MAAS/, http://10.217.0.66:5240/MAAS/).
...
     response = yield self._fetch_rpc_info(url, orig_url)
 twisted.internet.error.DNSLookupError: DNS lookup failed: Couldn't find the hostname '10.217.0.11'.
```

and

```
2023-07-31 23:09:50 provisioningserver.rpc.clusterservice: [critical] Failed to contact region. (While requesting RPC info at http://10.217.0.11:5240/MAAS/, http://10.217.0.66:5240/MAAS/).
...
     response = yield self._fetch_rpc_info(url, orig_url)
 twisted.web._newclient.ResponseNeverReceived: [<twisted.python.failure.Failure twisted.internet.defer.CancelledError: >]
```

a lot of times.

Revision history for this message
Alan Baghumian (alanbach) wrote :

@r00ta Please note that we don't have any connectivity issues at the networking level.

Stopping and Starting Rackd & Regiond services fixes this issue. Everything works for quite some time after the restart, and then this starts happening.

Revision history for this message
Alan Baghumian (alanbach) wrote :

Just a side note, region controllers have been configured to use 8 workers. What is you opinion regarding reducing them back to the default 4?

Do you think that will make the situation better or not?

Revision history for this message
Jacopo Rota (r00ta) wrote :

@alan Every worker requires 10 database connections. So if you have two regions with 8 workers each, you are exceeding the maximum number of database connections (in postgres it's 100 by default).

tags: added: bug-council
Revision history for this message
Joao Andre Simioni (jasimioni) wrote :

Hi Jacopo,

max_connections is set to 400, and it is around 210 in normal operations, as rbac and candid are being used.

summary: - After applying fix on LP#2027735 RPC Communitation is failing
+ RPC failure to contact rack/region - operations on closed handler
description: updated
Changed in maas:
assignee: nobody → Jacopo Rota (r00ta)
status: New → In Progress
Revision history for this message
Jacopo Rota (r00ta) wrote :

You can find the investigations I've made here https://docs.google.com/presentation/d/1_YuUw7DlfghQxBakmnU4Fv_o8kp6PdnYEn4MxActi5I/edit#slide=id.p .

To summarize, based on my investigations, the issue should be at twisted/uvloop level. In particular, if libuv raises an exception (out of memory/too many files open/IO error/whatever), uvloop might silently close the connection and the `connection_lost` callback is not invoked.

MAAS keeps track of the connections to the rack controllers using the `connection_made` and `connection_lost` callbacks. In particular, if `connection_lost` is NOT called, we don't remove this connection from the queue.

I've opened https://github.com/MagicStack/uvloop/issues/552.

If the uvloop maintainers don't confirm this is a bug, we'll have to investigate on twisted side (for example, the exception should be handled here https://github.com/twisted/twisted/blob/a9ee8e59a5cd1950bf1a18c4b6ca813e5f56ad08/src/twisted/internet/asyncioreactor.py#L171 - apart from that, twisted should anyway call the `connection_lost` accordingly).

Revision history for this message
Jerzy Husakowski (jhusakowski) wrote :

Let's compare the performance of MAAS with and without uvloop. If the performance tests show no meaningful difference when uvloop is used, we should consider removing it.

Revision history for this message
Jacopo Rota (r00ta) wrote :

I agree @jerzy. Will do

Revision history for this message
Björn Tillenius (bjornt) wrote :

In what way did the machines fail to provision? Did they get turned on? Did they receive an IP? Did they download bootloaders? I.e. where did they fail?

While the tracebacks are concerning, I'm not sure they actually cause any fatal errors. There are a lot of different errors in the logs, but without knowing how thing fails, we can only make guesses at possible causes.

Changed in maas:
status: In Progress → Incomplete
Revision history for this message
Jacopo Rota (r00ta) wrote :

As per mattermost chat, I think I finally found why the loseConnection was not called as I suspected last week: It is done on purpuse in our code according to https://git.launchpad.net/maas/tree/src/provisioningserver/rpc/common.py#n318 .
Looking at the defailt implementation of twisted, unhandledError would call loseConnection! https://github.com/twisted/twisted/blob/a9ee8e59a5cd1950bf1a18c4b6ca813e5f56ad08/src/twisted/protocols/amp.py#L2507 but this behaviour was overwritten in 2015 for https://bugs.launchpad.net/maas/+bug/1457799 .
The proof is that in the new logs it's much more clean that the number of connections are wrong exactly when we hit the exceptionUnhandled failure during AMP request (see the screenshot, the blue vertical dashed line is put when we see those Unhandled failure during AMP request exceptions - It's now evident that the DHCP exceptions start after we hit the AMP exception)

So I think the fix could simply be something like

```
    def unhandledError(self, failure):
        """Terminal errback, after application code has seen the failure.

        `amp.BoxDispatcher.unhandledError` calls the `amp.IBoxSender`'s
        `unhandledError`. In the default implementation this disconnects the
        transport.

        Here we instead log the failure but do *not* disconnect because it's
        too disruptive to the running of MAAS.
        """
        if failure.check(builtins.RuntimeError) and "The handler is closed" in failure.getErrorMessage():
            super().unhandledError(failure)
            log.info("The handler is closed, the connection will be dropped.")
        else:
            log.err(
                failure,
                (
                    "Unhandled failure during AMP request. This is probably a bug. "
                    "Please ensure that this error is handled within application "
                    "code."
                ),
            )
```
can anybody in the team double check my latest findings above and send a MP? I'm currently off and I can't follow up in the working hours of the team

Changed in maas:
status: Incomplete → Confirmed
Changed in maas:
status: Confirmed → In Progress
Jacopo Rota (r00ta)
Changed in maas:
milestone: none → 3.5.0
status: In Progress → Fix Committed
status: Fix Committed → In Progress
Jacopo Rota (r00ta)
Changed in maas:
importance: Undecided → High
Revision history for this message
Jacopo Rota (r00ta) wrote :
Changed in maas:
status: In Progress → Fix Committed
tags: removed: bug-council
Revision history for this message
Joao Andre Simioni (jasimioni) wrote :

Hi team,

the problem manifested again after many days. I've shared the sos reports internally.

Revision history for this message
Jacopo Rota (r00ta) wrote :

I have a small reproducer. I think it's (almost?) impossible to reproduce the race condition that leads to the closed handler: they have a pretty unique env and we had to wait more than 1 month to hit the bug again.
So I focused on forcing the same exception at the same place and I'm now working on a new fix. For the record the reproducer is here https://github.com/r00ta/maas-bugs-reproducers/tree/main/lp-2029417

Revision history for this message
Joao Andre Simioni (jasimioni) wrote :
Download full text (15.8 KiB)

The customer is using the interim version we provided (3.2.10~alpha3-12060-g.ee175c971-0ubuntu1), and around 30 days after a restart, they start to see RPC problems again.

The MAAS environment has two servers - pdx01-m01-c33-cpu-01 and pdx01-m01-c34-cpu-01

Restarting the maas-rackd and maas-regiond on both servers brings the system back to normality.

Sos reports are available here:
https://drive.google.com/drive/folders/1u01dldSwTUoEYb6s3dz3n-539MlLEwSf?usp=drive_link

These logs are seen in maas.log:

2023-11-02T12:43:32+00:00 pdx01-m01-c34-cpu-01 maas.service_monitor_service: [error] Can't update service statuses, no RPC connection to region.
2023-11-02T13:04:14+00:00 pdx01-m01-c34-cpu-01 maas.service_monitor_service: message repeated 15 times: [ [error] Can't update service statuses, no RPC connection to region.]
2023-11-02T13:06:00+00:00 pdx01-m01-c34-cpu-01 maas.service_monitor_service: [error] Can't update service statuses, no RPC connection to region.
2023-11-02T13:13:15+00:00 pdx01-m01-c34-cpu-01 maas.service_monitor_service: message repeated 6 times: [ [error] Can't update service statuses, no RPC connection to region.]
2023-11-02T13:14:29+00:00 pdx01-m01-c34-cpu-01 maas.service_monitor_service: [error] Can't update service statuses, no RPC connection to region.
2023-11-02T13:59:57+00:00 pdx01-m01-c34-cpu-01 maas.service_monitor_service: [error] Can't update service statuses, no RPC connection to region.
2023-11-02T14:04:14+00:00 pdx01-m01-c34-cpu-01 maas.service_monitor_service: message repeated 3 times: [ [error] Can't update service statuses, no RPC connection to region.]
2023-11-02T14:05:44+00:00 pdx01-m01-c34-cpu-01 maas.service_monitor_service: [error] Can't update service statuses, no RPC connection to region.
2023-11-02T14:16:49+00:00 pdx01-m01-c34-cpu-01 maas.service_monitor_service: message repeated 7 times: [ [error] Can't update service statuses, no RPC connection to region.]

2023-12-04T20:03:50+00:00 pdx01-m01-c34-cpu-01 maas.service_monitor_service: [error] Can't update service statuses, no RPC connection to region.

2023-12-29T18:08:20+00:00 pdx01-m01-c34-cpu-01 maas.service_monitor_service: [error] Can't update service statuses, no RPC connection to region.

2023-12-30T06:07:42+00:00 pdx01-m01-c34-cpu-01 maas.service_monitor_service: [error] Can't update service statuses, no RPC connection to region.
2023-12-30T06:09:40+00:00 pdx01-m01-c34-cpu-01 maas.service_monitor_service: [error] Can't update service statuses, no RPC connection to region.
2023-12-30T06:10:53+00:00 pdx01-m01-c34-cpu-01 maas.service_monitor_service: [error] Can't update service statuses, no RPC connection to region.
2023-12-30T06:14:34+00:00 pdx01-m01-c34-cpu-01 maas.service_monitor_service: [error] Can't update service statuses, no RPC connection to region.
2023-12-30T06:15:02+00:00 pdx01-m01-c34-cpu-01 maas.dhcp.probe: [error] Can't initiate DHCP probe; no RPC connection to region.
2023-12-30T06:15:53+00:00 pdx01-m01-c34-cpu-01 maas.service_monitor_service: [error] Can't update service statuses, no RPC connection to region.

2024-01-03T06:29:23+00:00 pdx01-m01-c34-cpu-01 maas.service_monitor_service: [error] Can't update service statuses, no R...

Revision history for this message
Joao Andre Simioni (jasimioni) wrote :

I also added the sos reports when we were seeing the UVPoll errors if you want to compare. They also show the errors:

[critical] Amp server or network failure unhandled by client application. Dropping connection! To avoid, add errbacks to ALL remote commands!

maas.service_monitor_service: [error] Can't update service statuses, no RPC connection to region.

So even after the fix for UVPoll some other issue is still being triggered.

Thanks

Revision history for this message
Jacopo Rota (r00ta) wrote :

I suspect this is somehow a different issue. The _global_intercept_errback function that we added is supposed to intercept the exception, close the connection and re-raise (this is why you can still see the exception in the logs).
As a matter of facts we don't see anymore the repeated rpc errors due to "the handler is closed" exception. However, it's really strange that we see the logs

2024-01-05 11:29:25 maasserver.dhcp: [info] Successfully configured DHCPv6 on rack controller 'pdx01-m01-c33-cpu-01 (cgbctk)'.
2024-01-05 11:29:25 maasserver.rack_controller: [critical] Failed configuring DHCP on rack controller 'id:1'.
which means that we managed to configure dhcpv6 but not dhcpv4. In other words, the region was actually able to talk to the rack but the dhcpv4 rpc call failed somehow.

At my eyes, with the information we have right now there is not much we can do to triage this new issue. Can we include some more verbose logging in the ppa delivered to the customer in order to triage this issue (in case they hit it again)?

Revision history for this message
Joao Andre Simioni (jasimioni) wrote :

Sure. If you can release a new version with the additional logging I can work to upgrade the environment.

Also, rackd / regiond debug is turned off. If it helps, we can try to enable them.

Revision history for this message
Jacopo Rota (r00ta) wrote :

Thanks but I would not enable the `debug` option as it would just produce tons of logs and I suspect it would not include what we are looking for. I'll add some specific log statements and get back to you when I'm done (tomorrow EOD)

Revision history for this message
Jacopo Rota (r00ta) wrote :
Changed in maas:
milestone: 3.5.0 → 3.5.0-beta1
Changed in maas:
status: Fix Committed → Fix Released
To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Duplicates of this bug

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.