maas hangs - 1 regiond spinning

Bug #2031451 reported by dann frazier
This bug affects 1 person
Affects: MAAS
Status: New
Importance: Undecided
Assigned to: Unassigned

Bug Description

installed: 3.3.4-13189-g.f88272d1e (28521) 138MB -

A user reported that the MAAS UI was hanging and all deployments were timing out. In the logs, I see rackd is reporting a critical error:

2023-08-15 23:03:05 provisioningserver.rpc.clusterservice: [critical] Failed to contact region. (While requesting RPC info at http://localhost:5240/MAAS).
 Traceback (most recent call last):
   File "/snap/maas/28521/usr/lib/python3/dist-packages/twisted/internet/defer.py", line 661, in callback
     self._startRunCallbacks(result)
   File "/snap/maas/28521/usr/lib/python3/dist-packages/twisted/internet/defer.py", line 763, in _startRunCallbacks
     self._runCallbacks()
   File "/snap/maas/28521/usr/lib/python3/dist-packages/twisted/internet/defer.py", line 857, in _runCallbacks
     current.result = callback( # type: ignore[misc]
   File "/snap/maas/28521/usr/lib/python3/dist-packages/twisted/internet/defer.py", line 1750, in gotResult
     current_context.run(_inlineCallbacks, r, gen, status)
 --- <exception caught here> ---
   File "/snap/maas/28521/lib/python3.10/site-packages/provisioningserver/rpc/clusterservice.py", line 1299, in _doUpdate
     eventloops, maas_url = yield self._get_rpc_info(urls)
   File "/snap/maas/28521/lib/python3.10/site-packages/provisioningserver/rpc/clusterservice.py", line 1558, in _get_rpc_info
     raise config_exc
   File "/snap/maas/28521/lib/python3.10/site-packages/provisioningserver/rpc/clusterservice.py", line 1529, in _get_rpc_info
     eventloops, maas_url = yield self._parallel_fetch_rpc_info(urls)
   File "/snap/maas/28521/usr/lib/python3/dist-packages/twisted/internet/defer.py", line 857, in _runCallbacks
     current.result = callback( # type: ignore[misc]
   File "/snap/maas/28521/lib/python3.10/site-packages/provisioningserver/rpc/clusterservice.py", line 1503, in handle_responses
     errors[0].raiseException()
   File "/snap/maas/28521/usr/lib/python3/dist-packages/twisted/python/failure.py", line 475, in raiseException
     raise self.value.with_traceback(self.tb)
   File "/snap/maas/28521/lib/python3.10/site-packages/provisioningserver/rpc/clusterservice.py", line 1464, in _serial_fetch_rpc_info
     raise last_exc
   File "/snap/maas/28521/lib/python3.10/site-packages/provisioningserver/rpc/clusterservice.py", line 1456, in _serial_fetch_rpc_info
     response = yield self._fetch_rpc_info(url, orig_url)
   File "/snap/maas/28521/usr/lib/python3/dist-packages/twisted/internet/defer.py", line 1656, in _inlineCallbacks
     result = current_context.run(
   File "/snap/maas/28521/usr/lib/python3/dist-packages/twisted/python/failure.py", line 489, in throwExceptionIntoGenerator
     return g.throw(self.type, self.value, self.tb)
   File "/snap/maas/28521/lib/python3.10/site-packages/provisioningserver/rpc/clusterservice.py", line 1558, in _get_rpc_info
     raise config_exc
   File "/snap/maas/28521/lib/python3.10/site-packages/provisioningserver/rpc/clusterservice.py", line 1529, in _get_rpc_info
     eventloops, maas_url = yield self._parallel_fetch_rpc_info(urls)
   File "/snap/maas/28521/usr/lib/python3/dist-packages/twisted/internet/defer.py", line 857, in _runCallbacks
     current.result = callback( # type: ignore[misc]
   File "/snap/maas/28521/lib/python3.10/site-packages/provisioningserver/rpc/clusterservice.py", line 1503, in handle_responses
     errors[0].raiseException()
   File "/snap/maas/28521/usr/lib/python3/dist-packages/twisted/python/failure.py", line 475, in raiseException
     raise self.value.with_traceback(self.tb)
   File "/snap/maas/28521/usr/lib/python3/dist-packages/twisted/internet/defer.py", line 1656, in _inlineCallbacks
     result = current_context.run(
   File "/snap/maas/28521/usr/lib/python3/dist-packages/twisted/python/failure.py", line 489, in throwExceptionIntoGenerator
     return g.throw(self.type, self.value, self.tb)
   File "/snap/maas/28521/lib/python3.10/site-packages/provisioningserver/rpc/clusterservice.py", line 1464, in _serial_fetch_rpc_info
     raise last_exc
   File "/snap/maas/28521/lib/python3.10/site-packages/provisioningserver/rpc/clusterservice.py", line 1456, in _serial_fetch_rpc_info
     response = yield self._fetch_rpc_info(url, orig_url)
 twisted.web._newclient.ResponseNeverReceived: [<twisted.python.failure.Failure twisted.internet.defer.CancelledError: >]
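
For context, rackd periodically fetches an RPC/eventloop advertisement from the region over HTTP, and the ResponseNeverReceived above means the region never answered that request before it was cancelled. A quick way to check whether the region answers at all is a minimal probe like the sketch below (the /MAAS/rpc/ path is an assumption for illustration only; use whatever URL rackd reports in its "While requesting RPC info at ..." message):

  #!/usr/bin/env python3
  # Minimal probe of the region endpoint rackd polls for RPC info.
  # NOTE: the /MAAS/rpc/ path below is an assumption for illustration;
  # substitute the URL rackd logs in "While requesting RPC info at ...".
  import sys
  import urllib.request

  URL = "http://localhost:5240/MAAS/rpc/"

  try:
      with urllib.request.urlopen(URL, timeout=10) as resp:
          print(f"HTTP {resp.status} from {URL}")
          print(resp.read(512).decode("utf-8", errors="replace"))
  except OSError as exc:
      # A hang or refusal here matches the ResponseNeverReceived in rackd's log.
      print(f"region did not answer: {exc}", file=sys.stderr)
      sys.exit(1)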

regionds are running:
root@maas:/var/snap/maas/common/log# ps -ef | grep regiond
root 82270 82197 1 Jun22 ? 16:41:35 python3 /snap/maas/28521/bin/regiond
root 82398 82270 29 Jun22 ? 16-08:57:38 python3 /snap/maas/28521/bin/regiond
root 82399 82270 0 Jun22 ? 06:39:56 python3 /snap/maas/28521/bin/regiond
root 82400 82270 0 Jun22 ? 13:07:29 python3 /snap/maas/28521/bin/regiond
root 82401 82270 0 Jun22 ? 06:04:17 python3 /snap/maas/28521/bin/regiond
root 3735835 3732995 0 23:04 pts/15 00:00:00 grep --color=auto regiond

But one of them is spinning on a CPU:
  82398 root 20 0 1349024 406760 12472 S 110.9 0.6 23577:53 python3
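
To see which thread inside that process is actually burning the CPU (before reaching for strace), per-thread CPU time can be read straight out of /proc; a small standalone sketch (plain Python, nothing MAAS-specific):

  #!/usr/bin/env python3
  # Print a process's threads sorted by accumulated CPU time (utime+stime).
  # Usage: python3 busy_threads.py 82398
  import os
  import sys

  pid = sys.argv[1]
  ticks = os.sysconf("SC_CLK_TCK")
  threads = []
  for tid in os.listdir(f"/proc/{pid}/task"):
      with open(f"/proc/{pid}/task/{tid}/stat") as f:
          # Take the fields after the ")" that closes the comm name;
          # utime and stime are the 12th and 13th of those (clock ticks).
          fields = f.read().rsplit(")", 1)[1].split()
      threads.append((int(fields[11]) + int(fields[12]), tid))

  for total, tid in sorted(threads, reverse=True):
      print(f"tid {tid}: {total / ticks:.1f}s of CPU")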

strace on that spinning regiond shows that it is getting a bunch of -EAGAINs when reading a socket:
futex(0x55e146830bf0, FUTEX_WAKE_PRIVATE, 1) = 1
recvfrom(4, 0x55e14a3c58c0, 4096, 0, NULL, NULL) = -1 EAGAIN (Resource temporarily unavailable)
futex(0x55e146830bec, FUTEX_WAIT_BITSET_PRIVATE, 0, {tv_sec=8404744, tv_nsec=220350015}, FUTEX_BITSET_MATCH_ANY) = 0
futex(0x55e146830bf0, FUTEX_WAKE_PRIVATE, 1) = 0
futex(0x55e146830be8, FUTEX_WAIT_BITSET_PRIVATE, 0, {tv_sec=8404744, tv_nsec=220577533}, FUTEX_BITSET_MATCH_ANY) = 0
futex(0x55e146830bd0, FUTEX_WAKE_PRIVATE, 2147483647) = 1
futex(0x55e146830bf0, FUTEX_WAKE_PRIVATE, 1) = 0
futex(0x55e146830be8, FUTEX_WAIT_BITSET_PRIVATE, 0, {tv_sec=8404744, tv_nsec=222496459}, FUTEX_BITSET_MATCH_ANY) = 0
futex(0x55e146830bf0, FUTEX_WAKE_PRIVATE, 1) = 0
futex(0x55e146830bec, FUTEX_WAKE_PRIVATE, 1) = 1
epoll_wait(3, [], 46, 0) = 0
futex(0x55e146830be8, FUTEX_WAIT_BITSET_PRIVATE, 0, {tv_sec=8404744, tv_nsec=223147519}, FUTEX_BITSET_MATCH_ANY) = 0
futex(0x55e146830bf0, FUTEX_WAKE_PRIVATE, 1) = 0
futex(0x55e146830bf0, FUTEX_WAIT_PRIVATE, 2, NULL) = -1 EAGAIN (Resource temporarily unavailable)
futex(0x55e146830bf0, FUTEX_WAKE_PRIVATE, 1) = 0
futex(0x55e146830be8, FUTEX_WAIT_BITSET_PRIVATE, 0, {tv_sec=8404744, tv_nsec=223931288}, FUTEX_BITSET_MATCH_ANY) = 0
futex(0x55e146830bf0, FUTEX_WAKE_PRIVATE, 1) = 0
futex(0x55e146830be8, FUTEX_WAKE_PRIVATE, 1) = 1
epoll_wait(3, [], 46, 5) = 0
futex(0x55e146830be8, FUTEX_WAKE_PRIVATE, 1) = 1
epoll_wait(3, [], 46, 0) = 0
futex(0x55e146830bec, FUTEX_WAIT_BITSET_PRIVATE, 0, {tv_sec=8404744, tv_nsec=231659458}, FUTEX_BITSET_MATCH_ANY) = 0
futex(0x55e146830bf0, FUTEX_WAKE_PRIVATE, 1) = 0
futex(0x55e146830bec, FUTEX_WAKE_PRIVATE, 1) = 1
epoll_wait(3, [{events=EPOLLIN, data={u32=4, u64=94424856002564}}], 46, 499) = 1
futex(0x55e146830be8, FUTEX_WAIT_BITSET_PRIVATE, 0, {tv_sec=8404744, tv_nsec=296551151}, FUTEX_BITSET_MATCH_ANY) = 0
futex(0x55e146830bd0, FUTEX_WAKE_PRIVATE, 2147483647) = 1
futex(0x55e146830bf0, FUTEX_WAKE_PRIVATE, 1) = 0
futex(0x55e146830bf0, FUTEX_WAKE_PRIVATE, 1) = 0
futex(0x55e146830be8, FUTEX_WAIT_BITSET_PRIVATE, 0, {tv_sec=8404744, tv_nsec=297757406}, FUTEX_BITSET_MATCH_ANY) = 0
futex(0x55e146830bf0, FUTEX_WAKE_PRIVATE, 1) = 0
futex(0x55e146830bec, FUTEX_WAKE_PRIVATE, 1) = 1
recvfrom(4, "\0", 4096, 0, NULL, NULL) = 1
futex(0x55e146830bf0, FUTEX_WAKE_PRIVATE, 1) = 0
futex(0x55e146830be8, FUTEX_WAIT_BITSET_PRIVATE, 0, {tv_sec=8404744, tv_nsec=299812592}, FUTEX_BITSET_MATCH_ANY) = -1 ETIMEDOUT (Connection timed out)
futex(0x55e146830bec, FUTEX_WAKE_PRIVATE, 1) = 1
futex(0x55e146830bf0, FUTEX_WAKE_PRIVATE, 1) = 0
futex(0x55e146830be8, FUTEX_WAIT_BITSET_PRIVATE, 0, {tv_sec=8404744, tv_nsec=305013714}, FUTEX_BITSET_MATCH_ANY) = 0
futex(0x55e146830bf0, FUTEX_WAKE_PRIVATE, 1) = 0
futex(0x55e146830bec, FUTEX_WAKE_PRIVATE, 1) = 1
recvfrom(4, 0x55e14a3c58c0, 4096, 0, NULL, NULL) = -1 EAGAIN (Resource temporarily unavailable)
futex(0x55e146830be8, FUTEX_WAIT_BITSET_PRIVATE, 0, {tv_sec=8404744, tv_nsec=309470966}, FUTEX_BITSET_MATCH_ANY) = 0
futex(0x55e146830bf0, FUTEX_WAKE_PRIVATE, 1) = 0
futex(0x55e146830bec, FUTEX_WAKE_PRIVATE, 1) = 1
epoll_wait(3, [], 46, 0) = 0
futex(0x55e146830be8, FUTEX_WAIT_BITSET_PRIVATE, 0, {tv_sec=8404744, tv_nsec=312523457}, FUTEX_BITSET_MATCH_ANY) = 0
futex(0x55e146830bf0, FUTEX_WAKE_PRIVATE, 1) = 0
futex(0x55e146830bec, FUTEX_WAIT_BITSET_PRIVATE, 0, {tv_sec=8404744, tv_nsec=313405840}, FUTEX_BITSET_MATCH_ANY) = 0
futex(0x55e146830bf0, FUTEX_WAKE_PRIVATE, 1) = 0
futex(0x55e146830be8, FUTEX_WAIT_BITSET_PRIVATE, 0, {tv_sec=8404744, tv_nsec=314754416}, FUTEX_BITSET_MATCH_ANY) = 0
futex(0x55e146830bd0, FUTEX_WAKE_PRIVATE, 2147483647) = 1
futex(0x55e146830bf0, FUTEX_WAKE_PRIVATE, 1) = 0
futex(0x55e146830be8, FUTEX_WAIT_BITSET_PRIVATE, 0, {tv_sec=8404744, tv_nsec=316072157}, FUTEX_BITSET_MATCH_ANY) = 0
futex(0x55e146830bd0, FUTEX_WAKE_PRIVATE, 2147483647) = 1
futex(0x55e146830bf0, FUTEX_WAKE_PRIVATE, 1) = 0
futex(0x55e146830be8, FUTEX_WAIT_BITSET_PRIVATE, 0, {tv_sec=8404744, tv_nsec=317530450}, FUTEX_BITSET_MATCH_ANY) = 0
futex(0x55e146830bf0, FUTEX_WAKE_PRIVATE, 1) = 0
futex(0x55e146830bec, FUTEX_WAKE_PRIVATE, 1) = 1
futex(0x55e146830bf0, FUTEX_WAKE_PRIVATE, 1) = 1
epoll_wait(3, ^Cstrace: Process 82398 detached
 <detached ...>

fd 4 is a unix socket:

root@maas:/var/snap/maas/common/log# lsof -p 82398 | grep 4u
python3 82398 root 4u unix 0x0000000000000000 0t0 241252057 type=STREAM

Who is on the other end? It looks like the same regiond process:

root@maas:/var/snap/maas/common/log# ss -xp | grep 241252057
u_str ESTAB 0 0 * 241252058 * 241252057 users:(("python3",pid=82398,fd=5))
u_str ESTAB 0 0 * 241252057 * 241252058 users:(("python3",pid=82398,fd=4))

... and that's about all I can discern quickly. I killed the spinning regiond at this point and things seemed to recover.
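
One note on the fd 4/fd 5 pair above: a stream socket whose peer is the same PID looks like an in-process wake-up socketpair (the kind an event loop such as Twisted's reactor uses to interrupt itself), and a non-blocking read on such a socket returning EAGAIN when no wake-up byte is pending is normal rather than an error in itself. A minimal sketch of that pattern (plain Python, not MAAS code, just to illustrate the strace shape):

  #!/usr/bin/env python3
  # Sketch of the wake-up socketpair pattern: the reader end is polled with
  # a non-blocking recv(), which fails with EAGAIN whenever nothing has been
  # written -- the same shape as the recvfrom(4, ...) = -1 EAGAIN lines above.
  import errno
  import socket

  reader, writer = socket.socketpair()
  reader.setblocking(False)

  try:
      reader.recv(4096)
  except BlockingIOError as exc:
      assert exc.errno == errno.EAGAIN
      print("recv -> EAGAIN (expected: no wake-up byte pending)")

  # Another thread would send a single byte to wake the loop; the reader
  # then drains it, matching the recvfrom(4, "\0", 4096, ...) = 1 line.
  writer.send(b"\0")
  print("recv ->", reader.recv(4096))

If that reading is right, the EAGAIN lines are a side effect of whatever is keeping the loop busy, not the cause of the spin.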

Revision history for this message
Jacopo Rota (r00ta) wrote (last edit):

Hi @dann, do you have the full logs of the region and the rack? An sos report would be ideal, by the way. Are you using the snap or the deb?

Revision history for this message
dann frazier (dannf) wrote :

Since you are a Canonical employee, I've posted an sosreport privately here:
  https://private-fileshare.canonical.com/~dannf/lp2031451/sosreport-maas-2023-08-25-qktpuem.tar.xz

We use the snap.

Revision history for this message
Jacopo Rota (r00ta) wrote (last edit):

Hmm, the logs are empty. Could you double-check?

Revision history for this message
dann frazier (dannf) wrote :

Indeed, I tried again and sos report reliably omits them. Perhaps the maas plugin timeout shown in the output below is related. I've captured the logs manually and uploaded them here:
  https://private-fileshare.canonical.com/~dannf/lp2031451/var_snap_maas_common_log-20230828.tar.gz

sos report output follows (maas plugin timeout at the end)
----------------------------------------------------------

ubuntu@maas:~$ sudo sos report

sosreport (version 4.4)

This command will collect system configuration and diagnostic
information from this Ubuntu system.

For more information on Canonical visit:

        Community Website : https://www.ubuntu.com/
        Commercial Support : https://www.canonical.com

The generated archive may contain data considered sensitive and its
content should be reviewed by the originating organization before being
passed to any third party.

No changes will be made to system configuration.

Press ENTER to continue, or CTRL-C to quit.

Optionally, please enter the case id that you are generating this report for []:

 Setting up archive ...
 Setting up plugins ...
[plugin:devicemapper] skipped command 'dmsetup info -c': required kmods missing: dm_mod.
[plugin:devicemapper] skipped command 'dmsetup table': required kmods missing: dm_mod.
[plugin:devicemapper] skipped command 'dmsetup status': required kmods missing: dm_mod.
[plugin:devicemapper] skipped command 'dmsetup ls --tree': required kmods missing: dm_mod.
[plugin:devicemapper] skipped command 'dmsetup udevcookies': required kmods missing: dm_mod.
[plugin:devicemapper] skipped command 'dmstats list': required kmods missing: dm_mod.
[plugin:devicemapper] skipped command 'dmstats print --allregions': required kmods missing: dm_mod.
[plugin:lxd] skipped command 'lxc image list': required kmods missing: ebtable_filter, iptable_raw, ebtables, ip6table_raw.
[plugin:lxd] skipped command 'lxc list': required kmods missing: ebtable_filter, iptable_raw, ebtables, ip6table_raw.
[plugin:lxd] skipped command 'lxc network list': required kmods missing: ebtable_filter, iptable_raw, ebtables, ip6table_raw.
[plugin:lxd] skipped command 'lxc profile list': required kmods missing: ebtable_filter, iptable_raw, ebtables, ip6table_raw.
[plugin:lxd] skipped command 'lxc storage list': required kmods missing: ebtable_filter, iptable_raw, ebtables, ip6table_raw.
[plugin:networking] skipped command 'ip -s macsec show': required kmods missing: macsec. Use '--allow-system-changes' to enable collection.
[plugin:networking] skipped command 'ss -peaonmi': required kmods missing: netlink_diag, inet_diag, tcp_diag, udp_diag, xsk_diag, af_packet_diag. Use '--allow-system-changes' to enable collection.
 Running plugins. Please wait ...

  Starting 3/69 apport [Running: apport] Starting 1/69 acpid [Running: apport acpid] Starting 4/69 apt [Running: apport apt] Starting 2/69 apparmor [Running: apport apt apparmor] Starting 5/69 ata [Running: apport apt apparmor ata] Starting 6/69 block ...


Revision history for this message
Jacopo Rota (r00ta) wrote (last edit):

The version you are running, 3.3.4-13189-g.f88272d1e, is affected by https://bugs.launchpad.net/maas/+bug/2027735. In that version a single process handles all of the HTTP requests, so it is likely that this is the process you see pegged at 100% CPU, unfortunately.

I'd suggest that you:
1) upgrade to 3.3/edge, which contains the fix for the bug above;
2) run the infrastructure for a few days to a week;
3) upload the new logs;
4) confirm that the issue is gone. If it is not, this needs more investigation.

Regarding the `recvfrom(4, 0x55e14a3c58c0, ...)` calls: every worker does this in that version of MAAS, so it's not specific to your environment or this bug. I don't know offhand what's behind it, but I'll try to investigate a bit anyway.

Does this make sense to you @dann?

Revision history for this message
dann frazier (dannf) wrote :

Thanks @r00ta - I've updated to 3.3.5-13205-g.3e0e5156e.
