rack can't contact region, deployments fail

Bug #1914807 reported by Dan Streetman
This bug affects 4 people
Affects        Status   Importance  Assigned to  Milestone
MAAS           Invalid  Undecided   Unassigned
maas (Ubuntu)  Invalid  Undecided   Unassigned

Bug Description

Seen with both maas 2.8 and maas 2.9; after running for a while, deployments stop working, and the rackd log has many messages like:

2021-02-05 18:13:56 provisioningserver.rpc.clusterservice: [critical] Failed to contact region. (While requesting RPC info at http://10.230.56.2:5240/MAAS).
        Traceback (most recent call last):
          File "/snap/maas/11322/usr/lib/python3/dist-packages/twisted/internet/defer.py", line 460, in callback
            self._startRunCallbacks(result)
          File "/snap/maas/11322/usr/lib/python3/dist-packages/twisted/internet/defer.py", line 568, in _startRunCallbacks
            self._runCallbacks()
          File "/snap/maas/11322/usr/lib/python3/dist-packages/twisted/internet/defer.py", line 654, in _runCallbacks
            current.result = callback(current.result, *args, **kw)
          File "/snap/maas/11322/usr/lib/python3/dist-packages/twisted/internet/defer.py", line 1475, in gotResult
            _inlineCallbacks(r, g, status)
        --- <exception caught here> ---
          File "/snap/maas/11322/lib/python3.8/site-packages/provisioningserver/rpc/clusterservice.py", line 1367, in _doUpdate
            eventloops, maas_url = yield self._get_rpc_info(urls)
          File "/snap/maas/11322/lib/python3.8/site-packages/provisioningserver/rpc/clusterservice.py", line 1631, in _get_rpc_info
            raise config_exc
          File "/snap/maas/11322/lib/python3.8/site-packages/provisioningserver/rpc/clusterservice.py", line 1602, in _get_rpc_info
            eventloops, maas_url = yield self._parallel_fetch_rpc_info(urls)
          File "/snap/maas/11322/usr/lib/python3/dist-packages/twisted/internet/defer.py", line 654, in _runCallbacks
            current.result = callback(current.result, *args, **kw)
          File "/snap/maas/11322/lib/python3.8/site-packages/provisioningserver/rpc/clusterservice.py", line 1576, in handle_responses
            errors[0].raiseException()
          File "/snap/maas/11322/usr/lib/python3/dist-packages/twisted/python/failure.py", line 467, in raiseException
            raise self.value.with_traceback(self.tb)
          File "/snap/maas/11322/lib/python3.8/site-packages/provisioningserver/rpc/clusterservice.py", line 1537, in _serial_fetch_rpc_info
            raise last_exc
          File "/snap/maas/11322/lib/python3.8/site-packages/provisioningserver/rpc/clusterservice.py", line 1529, in _serial_fetch_rpc_info
            response = yield self._fetch_rpc_info(url, orig_url)
          File "/snap/maas/11322/usr/lib/python3/dist-packages/twisted/internet/defer.py", line 1416, in _inlineCallbacks
            result = result.throwExceptionIntoGenerator(g)
          File "/snap/maas/11322/usr/lib/python3/dist-packages/twisted/python/failure.py", line 491, in throwExceptionIntoGenerator
            return g.throw(self.type, self.value, self.tb)
          File "/snap/maas/11322/lib/python3.8/site-packages/provisioningserver/rpc/clusterservice.py", line 1631, in _get_rpc_info
            raise config_exc
          File "/snap/maas/11322/lib/python3.8/site-packages/provisioningserver/rpc/clusterservice.py", line 1602, in _get_rpc_info
            eventloops, maas_url = yield self._parallel_fetch_rpc_info(urls)
          File "/snap/maas/11322/usr/lib/python3/dist-packages/twisted/internet/defer.py", line 654, in _runCallbacks
            current.result = callback(current.result, *args, **kw)
          File "/snap/maas/11322/lib/python3.8/site-packages/provisioningserver/rpc/clusterservice.py", line 1576, in handle_responses
            errors[0].raiseException()
          File "/snap/maas/11322/usr/lib/python3/dist-packages/twisted/python/failure.py", line 467, in raiseException
            raise self.value.with_traceback(self.tb)
          File "/snap/maas/11322/usr/lib/python3/dist-packages/twisted/internet/defer.py", line 1416, in _inlineCallbacks
            result = result.throwExceptionIntoGenerator(g)
          File "/snap/maas/11322/usr/lib/python3/dist-packages/twisted/python/failure.py", line 491, in throwExceptionIntoGenerator
            return g.throw(self.type, self.value, self.tb)
          File "/snap/maas/11322/lib/python3.8/site-packages/provisioningserver/rpc/clusterservice.py", line 1537, in _serial_fetch_rpc_info
            raise last_exc
          File "/snap/maas/11322/lib/python3.8/site-packages/provisioningserver/rpc/clusterservice.py", line 1529, in _serial_fetch_rpc_info
            response = yield self._fetch_rpc_info(url, orig_url)
        twisted.internet.error.ConnectingCancelledError: HostnameAddress(hostname=b'10.230.56.2', port=5240)

The region controller appears to be working fine and there are no errors in the regiond log. This deployment uses a single region and single rack, which are both located on a single VM.

To get maas working again, the system must be rebooted, or the maas snap service must be restarted. However, the problem begins occurring again after some number of hours or days.

Tags: sts seg
Dan Streetman (ddstreet)
tags: added: seg sts
description: updated
Nivedita Singhvi (niveditasinghvi) wrote :

We are resorting to rebooting maas every day just so we can avoid hitting this and having deployments fail.

Launchpad Janitor (janitor) wrote :

Status changed to 'Confirmed' because the bug affects multiple users.

Changed in maas (Ubuntu):
status: New → Confirmed
Björn Tillenius (bjornt) wrote :

Can you please attach maas.log, regiond.log and rackd.log?

The error doesn't provide many clues as to why this is happening.

Changed in maas:
status: New → Incomplete
Dan Streetman (ddstreet) wrote :

Attached just those log files; we also have older rotated regiond.log files.

Changed in maas:
status: Incomplete → New
Bill Wear (billwear) wrote :

Last time we saw a bug very much like this (https://bugs.launchpad.net/maas/+bug/1707971), it was fixed in the 2.2 release range. The error logs look almost identical, so I'm going to take a chance and triage this as if it were a regression. Someone can swat me if they prove me wrong, no worries.

Changed in maas:
status: New → Triaged
Björn Tillenius (bjornt) wrote :

I can't find anything obvious in the log files.

It would be good if you could provide the output of:

  curl http://10.230.56.2/MAAS/rpc/

from the rackd that is having problems. Both when it's working, and when you see this problem.

Changed in maas:
status: Triaged → Incomplete
Alexander Balderson (asbalderson) wrote :

We had this issue using the maas 2.9.2 snap in the new SQA lab.

Curling the address came back with a long-delayed "connection refused" and nothing else.

I was able to get everything back in working order by restarting maas.

Another odd symptom: when I looped connection attempts via the maas Python package, I would get 2-4 connects before a timeout and error.

Alberto Donato (ack) wrote :

Is there anything relevant showing in regiond.log when you curl and it returns connection refused?

(note that from the previous curl command, the 5240 port is missing)
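
For anyone trying to capture the requested output, here is a minimal sketch of a probe that hits the RPC info endpoint with the port included and records how long each attempt takes. It assumes the endpoint is the curl path from the earlier comment combined with port 5240 from the rackd log; the function names `rpc_info_url` and `probe` are hypothetical and not part of MAAS.

```python
import time
import urllib.error
import urllib.request


def rpc_info_url(host, port=5240):
    """Build the RPC info URL, including the port the earlier curl omitted.

    Assumption: the path is /MAAS/rpc/ as in the curl command above.
    """
    return f"http://{host}:{port}/MAAS/rpc/"


def probe(url, timeout=10.0, opener=urllib.request.urlopen):
    """Fetch url once and return (ok, detail, seconds_elapsed).

    The opener parameter exists only so the function can be exercised
    without a real region controller.
    """
    start = time.monotonic()
    try:
        with opener(url, timeout=timeout) as response:
            ok = response.status == 200
            detail = f"HTTP {response.status}"
    except urllib.error.HTTPError as exc:
        # The server answered, but with an error status.
        ok, detail = False, f"HTTP {exc.code}"
    except (urllib.error.URLError, OSError) as exc:
        # Covers refused connections and timeouts.
        ok, detail = False, f"{type(exc).__name__}: {exc}"
    return ok, detail, time.monotonic() - start
```

Calling `probe(rpc_info_url("10.230.56.2"))` from the rack host at intervals, and logging the result next to a timestamp, might show whether the long-delayed refusals line up with anything in regiond.log.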

Changed in maas:
status: Incomplete → New
status: New → Incomplete
Changed in maas (Ubuntu):
status: Confirmed → Incomplete
Mauricio Faria de Oliveira (mfo) wrote :

Marking as Invalid for MAAS, per the previous comment, the Incomplete status since then, and input from Björn and Jerzy; it is unclear whether this is still an issue or still relevant.

Changed in maas:
status: Incomplete → Invalid
Changed in maas (Ubuntu):
status: Incomplete → Invalid