[2.4] Cannot allocate memory when DescribePowerTypes happen

Bug #1749962 reported by Andres Rodriguez
This bug affects 1 person
Affects: MAAS
Status: Fix Released
Importance: Critical
Assigned to: Unassigned
Milestone: 2.4.0alpha2

Bug Description

After deploying multiple machines, I noticed that MAAS was not writing any logs the way it normally would. For example, I was browsing the UI, but nothing showed up in the logs.

After a few seconds, logging started again, but it also showed the following errors.

I believe these errors occurred when the CI was trying to create pods, but for some reason it ran out of memory.

That said, the only significant change that could really have affected this is the way the daemons are started.

So my take is that, because the workers are now started from the Python daemon itself (rather than as separate instances via systemd, as before), it is consuming more memory and impacting performance.

==> /var/log/maas/regiond.log <==
2018-02-16 13:47:09 maasserver: [warn] Exception during DescribePowerTypes() on rack controller 'autopkgtest' (ny36d7): UnhandledCommand: (b'UNHANDLED', 'Unknown Error [autopkgtest:pid=11814:cmd=DescribePowerTypes:ask=26a]')
2018-02-16 13:47:09 regiond: [info] 10.245.136.6 POST /MAAS/api/2.0/pods/ HTTP/1.1 --> 400 BAD_REQUEST (referrer: -; agent: Python-httplib2/0.9.2 (gzip))

==> /var/log/maas/rackd.log <==
2018-02-16 13:47:15 provisioningserver.rpc.common: [critical] Unhandled failure dispatching AMP command. This is probably a bug. Please ensure that this error is handled within application code or declared in the signature of the b'DescribePowerTypes' command. [autopkgtest:pid=11814:cmd=DescribePowerTypes:ask=8c4]
 Traceback (most recent call last):
   File "/usr/lib/python3/dist-packages/provisioningserver/rpc/common.py", line 241, in dispatchCommand
     d = super(RPCProtocol, self).dispatchCommand(box)
   File "/usr/lib/python3/dist-packages/twisted/protocols/amp.py", line 1101, in dispatchCommand
     return maybeDeferred(responder, box)
   File "/usr/lib/python3/dist-packages/twisted/internet/defer.py", line 150, in maybeDeferred
     result = f(*args, **kw)
   File "/usr/lib/python3/dist-packages/twisted/protocols/amp.py", line 1188, in doit
     return maybeDeferred(aCallable, **kw).addCallback(
 --- <exception caught here> ---
   File "/usr/lib/python3/dist-packages/twisted/internet/defer.py", line 150, in maybeDeferred
     result = f(*args, **kw)
   File "/usr/lib/python3/dist-packages/provisioningserver/rpc/clusterservice.py", line 316, in describe_power_types
     'power_types': list(PowerDriverRegistry.get_schema()),
   File "/usr/lib/python3/dist-packages/provisioningserver/drivers/power/registry.py", line 44, in get_schema
     for _, driver in cls
   File "/usr/lib/python3/dist-packages/provisioningserver/drivers/power/registry.py", line 44, in <listcomp>
     for _, driver in cls
   File "/usr/lib/python3/dist-packages/provisioningserver/drivers/power/__init__.py", line 242, in get_schema
     if detect_missing_packages else []))
   File "/usr/lib/python3/dist-packages/provisioningserver/drivers/power/amt.py", line 71, in detect_missing_packages
     if not shell.has_command_available(binary):
   File "/usr/lib/python3/dist-packages/provisioningserver/utils/shell.py", line 119, in has_command_available
     call_and_check(["which", command])
   File "/usr/lib/python3/dist-packages/provisioningserver/utils/shell.py", line 108, in call_and_check
     process = Popen(command, *args, stdout=PIPE, stderr=PIPE, **kwargs)
   File "/usr/lib/python3.6/subprocess.py", line 709, in __init__
     restore_signals, start_new_session)
   File "/usr/lib/python3.6/subprocess.py", line 1275, in _execute_child
     restore_signals, start_new_session, preexec_fn)
builtins.OSError: [Errno 12] Cannot allocate memory
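
The traceback shows the failing path: the power driver schema check ends in shell.has_command_available(), which runs "which <binary>" through subprocess.Popen. On CPython 3.6 that means fork()ing the whole rackd process before exec, so a sufficiently large parent can fail with ENOMEM even though the child it wants to run is tiny. The sketch below contrasts the subprocess-based check seen in the traceback with a fork-free alternative based on shutil.which; the function names are illustrative, not MAAS code.

import shutil
from subprocess import PIPE, Popen

def has_command_available_subprocess(command):
    """Check for a binary by spawning `which`, as in the traceback above.

    Popen has to fork the current process first; if the kernel cannot
    account for duplicating a large parent, fork() fails with
    OSError: [Errno 12] Cannot allocate memory.
    """
    process = Popen(["which", command], stdout=PIPE, stderr=PIPE)
    process.communicate()
    return process.returncode == 0

def has_command_available_no_fork(command):
    """Fork-free check: search PATH in-process with shutil.which."""
    return shutil.which(command) is not None

if __name__ == "__main__":
    # Both calls answer the same question; only the first one forks.
    print(has_command_available_subprocess("amttool"))
    print(has_command_available_no_fork("amttool"))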

Tags: performance
description: updated
Changed in maas:
importance: Undecided → Critical
status: New → Triaged
tags: added: performance
Changed in maas:
milestone: none → 2.4.0alpha2
assignee: nobody → Blake Rouse (blake-rouse)
description: updated
Revision history for this message
Blake Rouse (blake-rouse) wrote :

I think the question here is how much memory rackd was using. To spawn a new process, the current process has to fork; it seems that rackd was so large that forking caused your system to run out of memory.

Can you get me that information? I need to know more about the current memory usage and the running processes on the system.

I mean, if you don't have enough memory in your system (which would be surprising), it will never work.
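
For reference, one way to capture the memory figures asked for above is to read VmRSS and VmSize from /proc for the MAAS daemons. This is an illustrative diagnostic script (Linux only), not part of MAAS:

#!/usr/bin/env python3
"""Print RSS and virtual size of running rackd/regiond processes."""
import os

def read_cmdline(pid):
    with open("/proc/%d/cmdline" % pid, "rb") as f:
        return f.read().replace(b"\0", b" ").decode(errors="replace").strip()

def read_memory(pid):
    """Return {"VmRSS": kB, "VmSize": kB} from /proc/<pid>/status."""
    fields = {}
    with open("/proc/%d/status" % pid) as status:
        for line in status:
            key, _, value = line.partition(":")
            if key in ("VmRSS", "VmSize"):
                fields[key] = int(value.split()[0])  # values look like "123456 kB"
    return fields

for entry in os.listdir("/proc"):
    if not entry.isdigit():
        continue
    pid = int(entry)
    try:
        cmdline = read_cmdline(pid)
        if "rackd" not in cmdline and "regiond" not in cmdline:
            continue
        memory = read_memory(pid)
    except (FileNotFoundError, ProcessLookupError, PermissionError):
        continue  # process exited between listing and reading, or is unreadable
    print("pid %d: RSS %s kB, VmSize %s kB -- %s"
          % (pid, memory.get("VmRSS", "?"), memory.get("VmSize", "?"), cmdline))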

Revision history for this message
Andres Rodriguez (andreserl) wrote :

I just wanted to capture what we discussed: after further investigation and multiple test runs, I was able to pinpoint the problem better:

1. The first time I saw this, which initially prompted me to file the bug, it happened while adding KVM pods & machines. At that point, I stopped the test and restarted.

2. The second time, MAAS successfully added KVM pods, added machines, and commissioned machines. This time, however, I saw issues while I was deploying machines for the second time.

So, since it happened at different stages, it would seem that there's a memory leak somewhere that's causing the machine to run out of memory.

The test was done with a 2 GB machine, 1 CPU, and 100 nodes.
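
If the suspicion is a leak inside the Python daemons rather than pure fork pressure, one general way to narrow it down is to compare tracemalloc snapshots taken before and after the suspect workload. A minimal sketch follows; the workload function is a placeholder, not MAAS code:

import tracemalloc

def run_suspect_workload():
    # Placeholder for the activity under suspicion (adding pods, deploying
    # machines); here it just allocates something so the comparison has data.
    return ["x" * 1024 for _ in range(10000)]

tracemalloc.start(25)                       # keep up to 25 frames per allocation
baseline = tracemalloc.take_snapshot()

retained = run_suspect_workload()           # keep a reference so growth stays visible

current = tracemalloc.take_snapshot()
for stat in current.compare_to(baseline, "lineno")[:15]:
    print(stat)                             # top allocation sites, ordered by growth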

Changed in maas:
milestone: 2.4.0alpha2 → 2.4.0beta1
Revision history for this message
Andres Rodriguez (andreserl) wrote :

This was fixed as part of performance improvements.

Changed in maas:
status: Triaged → Fix Released
assignee: Blake Rouse (blake-rouse) → nobody
milestone: 2.4.0beta1 → 2.4.0alpha2
Revision history for this message
Dan Ackerson (dan.ackerson) wrote :

Is there a backport patch available for MAAS 2.3.x? As this is the standard version installed on Xenial, there are quite a few customers running this version.
