tgtadm: out of memory crash

Bug #1389811 reported by Christian Reis on 2014-11-05
This bug affects 4 people
Affects        Status   Importance   Assigned to   Milestone
MAAS                    Critical     Unassigned
tgt (Ubuntu)            High         Unassigned

Bug Description

Fresh install of RC1 on 14.04. I don't see what actually failed, but as I auto-enlist new machines I am seeing OOPSes generated:

2014-11-04 15:45:32-0200 [-] Logged OOPS id OOPS-cadddc8e814aa4a90f32cbd6cd1a8c3c: No exception type: No exception value
2014-11-04 15:45:32-0200 [-] Logged OOPS id OOPS-26933e64305da2663847ebb6302d38d3: ExternalProcessError: Command `sudo /usr/sbin/tgt-admin --conf /var/lib/maas/boot-resources/snapshot-20141104-174452/maas.tgt --update ALL` returned non-zero exit status 22:
        tgtadm: out of memory
2014-11-04 16:45:29-0200 [-] Unhandled error in Deferred:
2014-11-04 16:45:29-0200 [-] Unhandled Error
        Traceback (most recent call last):
          File "/usr/lib/python2.7/threading.py", line 783, in __bootstrap
            self.__bootstrap_inner()
          File "/usr/lib/python2.7/threading.py", line 810, in __bootstrap_inner
            self.run()
          File "/usr/lib/python2.7/threading.py", line 763, in run
            self.__target(*self.__args, **self.__kwargs)
        --- <exception caught here> ---
          File "/usr/lib/python2.7/dist-packages/twisted/python/threadpool.py", line 191, in _worker
            result = context.call(ctx, function, *args, **kwargs)
          File "/usr/lib/python2.7/dist-packages/twisted/python/context.py", line 118, in callWithContext
            return self.currentContext().callWithContext(ctx, func, *args, **kw)
          File "/usr/lib/python2.7/dist-packages/twisted/python/context.py", line 81, in callWithContext
            return func(*args,**kw)
          File "/usr/lib/python2.7/dist-packages/provisioningserver/utils/twisted.py", line 143, in wrapper
            return func(*args, **kwargs)
          File "/usr/lib/python2.7/dist-packages/provisioningserver/rpc/boot_images.py", line 67, in _run_import
            boot_resources.import_images(sources)
          File "/usr/lib/python2.7/dist-packages/provisioningserver/import_images/boot_resources.py", line 273, in import_images
            update_targets_conf(snapshot_path)
          File "/usr/lib/python2.7/dist-packages/provisioningserver/import_images/boot_resources.py", line 195, in update_targets_conf
            '--update', 'ALL',
          File "/usr/lib/python2.7/dist-packages/provisioningserver/utils/shell.py", line 125, in call_and_check
            raise ExternalProcessError(process.returncode, command, output=stderr)
        provisioningserver.utils.shell.ExternalProcessError: Command `sudo /usr/sbin/tgt-admin --conf /var/lib/maas/boot-resources/snapshot-20141104-184452/maas.tgt --update ALL` returned non-zero exit status 22:
        tgtadm: out of memory

Christian Reis (kiko) wrote :

There was no issue other than the crash, incidentally; MAAS enlisted and commissioned the nodes just fine.

Julian Edwards (julian-edwards) wrote :

There is zero error handling for tgt AFAICT. We need to catch these and put them somewhere in the admin's face.

Graham Binns (gmb) wrote :

Crash, so critical. Not sure there's much we can do besides handle it gracefully, though.
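
A minimal sketch of the "handle it gracefully" idea from the two comments above. It is not MAAS code: it uses the standard library's subprocess module rather than MAAS's own call_and_check, and the names update_targets and runner are hypothetical. The tgt-admin command line is the one shown in the log excerpt.

```python
import logging
import subprocess

log = logging.getLogger("tgt-update-sketch")  # hypothetical logger name


def update_targets(conf_path, runner=subprocess.run):
    """Run tgt-admin and report failure as (ok, message) instead of raising.

    `runner` is injectable so the function can be exercised without tgt.
    """
    # Command line copied from the log excerpts in this report.
    command = [
        "sudo", "/usr/sbin/tgt-admin",
        "--conf", conf_path, "--update", "ALL",
    ]
    try:
        runner(command, check=True, capture_output=True, text=True)
    except subprocess.CalledProcessError as error:
        # Non-zero exit, e.g. status 22 with "tgtadm: out of memory".
        stderr = (error.stderr or "").strip()
        message = "tgt-admin exited with status %d: %s" % (
            error.returncode, stderr)
        log.error(message)
        return False, message
    return True, "iSCSI targets updated"
```

The caller gets a message it can surface to the admin, rather than an unhandled error in a Deferred.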

Changed in maas:
status: New → Triaged
importance: Undecided → High
importance: High → Critical
tags: added: crash
Changed in maas:
milestone: none → 1.7.1
Changed in maas:
milestone: 1.7.1 → 1.7.2
Changed in maas:
milestone: 1.7.2 → 1.7.3
Ryan Collis (vyan) wrote :

I am seeing the same problem (as far as I can tell) with Xenial on MAAS 1.10 and 2.0 Alpha2. With only the Trusty images I have no issues; as soon as I add any other release I get the tgtadm out of memory error. I will add logs if this seems to be the same bug.

stsp (stsp-0) wrote :

I did a bit of investigation into this problem.
MaaS cannot be used without tgt, so there was no way around it
but to get this fixed.

First, tgtd creates too many threads: over 500 when maas is
configured with just 2 images. It soon goes OOM.
This may be a bug in itself, but there is also a switch, "-t", that
limits the number of threads.
So all you need to do is add "-t 1" to "ExecStart=" in /lib/systemd/system/tgt.service.
Sounds simple? Except that it's not.

On xenial there is also /etc/init.d/tgt, provided by the same
package, which you had better remove (or update in the same way).

Then one finds out that tgtd does not accept any numeric
arguments. I fixed this some time ago:
https://github.com/fujita/tgt/pull/18
but of course no one wants to take the patch.

After applying this patch, you need to build tgt with SD_NOTIFY=1
exported, or systemd will not be able to start it.

Now the OOM condition is gone, but tgtd sometimes fails
to start and keeps respawning. Even if you disable it in systemd,
it keeps doing so. It turns out provisioningserver/service_monitor.py
is trying to restart it at the same time as systemd, each defeating the other's
attempts. So I removed it from provisioningserver/service_monitor.py,
from maasserver/models/service.py and from maasserver/service_monitor.py.
Then it calmed down.

Someone may still want to investigate why tgt creates so many threads
when started without the limiting switch.
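
For anyone who wants to check the thread count on their own machine, here is a small sketch. It assumes a Linux host, where /proc/<pid>/status carries a "Threads:" field; the function names are hypothetical, and you need to supply the tgtd PID yourself.

```python
from pathlib import Path


def parse_thread_count(status_text):
    """Return the 'Threads:' value from the text of /proc/<pid>/status."""
    for line in status_text.splitlines():
        if line.startswith("Threads:"):
            # The field looks like "Threads:\t512"; int() ignores the tab.
            return int(line.split(":", 1)[1])
    raise ValueError("no Threads: field found")


def thread_count(pid):
    """Read the current thread count of a running process (Linux only)."""
    return parse_thread_count(Path("/proc/%d/status" % pid).read_text())
```

Calling thread_count() with the tgtd PID before and after adding the "-t" switch would show whether the limit is taking effect.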

Larry Michel (lmic) wrote :

After upgrading from 1.9.1 to beta2, I am hitting this whenever trying to import images:

2016-04-18 14:43:43+0000 [-] Unhandled error in Deferred:
2016-04-18 14:43:43+0000 [-] Unhandled Error
 Traceback (most recent call last):
   File "/usr/lib/python3.5/threading.py", line 914, in _bootstrap_inner
     self.run()
   File "/usr/lib/python3.5/threading.py", line 862, in run
     self._target(*self._args, **self._kwargs)
   File "/usr/lib/python3/dist-packages/twisted/_threads/_threadworker.py", line 46, in work
     task()
   File "/usr/lib/python3/dist-packages/twisted/_threads/_team.py", line 190, in doWork
     task()
 --- <exception caught here> ---
   File "/usr/lib/python3/dist-packages/twisted/python/threadpool.py", line 246, in inContext
     result = inContext.theWork()
   File "/usr/lib/python3/dist-packages/twisted/python/threadpool.py", line 262, in <lambda>
     inContext.theWork = lambda: context.call(ctx, func, *args, **kw)
   File "/usr/lib/python3/dist-packages/twisted/python/context.py", line 118, in callWithContext
     return self.currentContext().callWithContext(ctx, func, *args, **kw)
   File "/usr/lib/python3/dist-packages/twisted/python/context.py", line 81, in callWithContext
     return func(*args,**kw)
   File "/usr/lib/python3/dist-packages/provisioningserver/utils/twisted.py", line 201, in wrapper
     return func(*args, **kwargs)
   File "/usr/lib/python3/dist-packages/provisioningserver/rpc/boot_images.py", line 106, in _run_import
     boot_resources.import_images(sources)
   File "/usr/lib/python3/dist-packages/provisioningserver/import_images/boot_resources.py", line 281, in import_images
     update_targets_conf(snapshot_path)
   File "/usr/lib/python3/dist-packages/provisioningserver/import_images/boot_resources.py", line 196, in update_targets_conf
     '--update', 'ALL',
   File "/usr/lib/python3/dist-packages/provisioningserver/utils/shell.py", line 129, in call_and_check
     raise ExternalProcessError(process.returncode, command, output=stderr)
 provisioningserver.utils.shell.ExternalProcessError: Command `sudo /usr/sbin/tgt-admin --conf /var/lib/maas/boot-resources/snapshot-20160418-144327/maas.tgt --update ALL` returned non-zero exit status 22:
 tgtadm: out of memory

tags: added: oil
Dimiter Naydenov (dimitern) wrote :

A related issue in tgt 1.0.63 on xenial (see bug 1547060) makes it impossible to use the '-t 1' workaround (comment #5), as numeric arguments are not accepted.

Here is how I resolved it locally, in case somebody finds it useful. On my MAAS 2.0 beta3 xenial machine:

# apt-get build-dep tgt
# apt-get install devscripts git
# mkdir ~/tgt-source && cd ~/tgt-source
# git clone https://github.com/fujita/tgt.git .
# wget https://github.com/fujita/tgt/pull/18.patch
# git apply 18.patch
# export SD_NOTIFY=1
# make
# make deb
# dpkg -i pkg/tgt_1.0.63-1_amd64.deb
# cp -f pkg/tgt_1.0.63/scripts/tgtd.service /lib/systemd/system/tgt.service
# vi /lib/systemd/system/tgt.service
Changed the first ExecStart= to add -t1 -d1 arguments:
-ExecStart=/usr/sbin/tgtd
+ExecStart=/usr/sbin/tgtd -t1 -d1

# rm -f /etc/init.d/tgt
# systemctl --system daemon-reload
# systemctl --system enable tgt

Now I can again commission my NUCs with xenial and my MAAS overall *feels* a lot more responsive.
No more tgtadm: out of memory errors in /var/log/syslog
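
The manual edit to the service file described above can be sketched as a text transformation. This is only an illustration: the function name is hypothetical, the sample unit text in the usage note is made up, and per this thread the stock tgtd 1.0.63 rejects numeric arguments unless the patch from pull request 18 is applied.

```python
def add_execstart_args(unit_text, extra_args="-t1 -d1"):
    """Append extra_args to the first ExecStart= line of a unit file's text.

    Lines such as ExecStartPre= or ExecStartPost= are left untouched,
    because they do not start with the exact prefix "ExecStart=".
    """
    lines = unit_text.splitlines()
    for index, line in enumerate(lines):
        if line.startswith("ExecStart="):
            lines[index] = line.rstrip() + " " + extra_args
            break
    else:
        # The loop finished without finding a line to change.
        raise ValueError("no ExecStart= line found")
    return "\n".join(lines) + "\n"
```

Given a unit containing "ExecStart=/usr/sbin/tgtd", the result contains "ExecStart=/usr/sbin/tgtd -t1 -d1", matching the diff above. The file still has to be written back and followed by a daemon-reload, as in the steps listed.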

Changed in tgt (Ubuntu):
status: New → Confirmed
tags: added: trusty
Larry Michel (lmic) on 2016-05-20
tags: added: xenial
Changed in tgt (Ubuntu):
importance: Undecided → High
Blake Rouse (blake-rouse) wrote :

This is a huge problem affecting many people. If a user has lots of images imported into MAAS, tgt just fails to run.

Andres Rodriguez (andreserl) wrote :

I wonder if this is the issue caused by https://bugs.launchpad.net/ubuntu/+source/tgt/+bug/1547060

Christian Reis (kiko) wrote :

See also the Xenial issue, which may give a hint: https://bugs.launchpad.net/maas/+bug/1559088

I agree with Kiko's assumption and am currently fixing bug 1559088 along with the next tgt merge.
I'll ping you here once there is something available.
If that helps your case we can make this a duplicate and think about SRUs; if not, we will at least have lost nothing by trying.

Changed in tgt (Ubuntu):
assignee: nobody → ChristianEhrhardt (paelzer)
status: Confirmed → In Progress
assignee: ChristianEhrhardt (paelzer) → nobody
status: In Progress → Confirmed

Hi,
as planned I fixed bug 1559088 in the merge I made into zesty.

Could you check whether tgt (1:1.0.67-1ubuntu1) solves this issue for you, or at least makes it less severe? It does not fix why tgt spawns so many threads, but it removes the systemd service limit on them. If it helps, we could consider an SRU for Xenial.

Blake Rouse (blake-rouse) wrote :

Yes, this fix should be SRU'd to Xenial, as anyone with lots of MAAS images comes across this issue.

Hi Blake - I thought so, but could you check (on zesty, or by pulling the zesty package into your release) whether the fix for 1559088 actually fixes your issue on Xenial?
Let me know if you need a PPA.

Changed in maas:
importance: Critical → High
Changed in maas:
importance: High → Critical
Spyderdyne (spyderdyne) wrote :

Same issue, but on RPi3B

Not sure if it should even be able to run or if this qualifies as a bug since memory is so scarce on these:

tgtadm: out of memory

root@juju-rack2:/var/log/maas# cat /etc/issue
Ubuntu 16.04.1 LTS \n \l

root@juju-rack2:/var/log/maas# uname -a
Linux juju-rack2.home.spyderdyne.net 4.4.43-v7+ #948 SMP Sun Jan 15 22:20:07 GMT 2017 armv7l armv7l armv7l GNU/Linux

root@juju-rack2:/var/log/maas# free -m
              total used free shared buff/cache available
Mem: 925 521 25 36 379 341
Swap: 0 0 0

MAAS Version 2.1.3+bzr5573-0ubuntu1 (16.04.1)

Attempted overclocking after noticing that python-twist was using so many resources and it is much faster now, but the iSCSI daemon still dies on startup.

Spyderdyne (spyderdyne) wrote :

Same problem when running maas-rack controller on a second machine:

Intel NUC 5i5MYHE RACK CONTROLLER: /var/lib/maas/maas.log
Feb 15 19:58:13 rack2-maas-rack0 maas.import-images: [info] Updating boot image iSCSI targets.
Feb 15 19:58:13 rack2-maas-rack0 maas.boot_image_download_service: [error] Failed to download images: Command `sudo /usr/sbin/tgt-admin --conf /var/lib/maas/boot-resources/current/maas.tgt --update ALL` returned non-zero exit status 2:#012Config file /var/lib/maas/boot-resources/current/maas.tgt not found. Exiting...
Feb 15 19:58:17 rack2-maas-rack0 maas.dhcp.probe: [error] Unable to probe for DHCP servers: Connection was closed cleanly.

tgtd dies, rendering the rack controller unusable since it cannot complete the maas-region import.

Raspberry Pi 3B REGION CONTROLLER: /var/lib/rediogd.log

2017-02-15 15:06:02 twisted.python.log: [info] ::ffff:192.168.199.6 - - [15/Feb/2017:20:06:01 +0000] "GET /MAAS/rpc/ HTTP/1.0" 200 316 "-" "provisioningserver.rpc.clusterservice.ClusterClientService"
sudo2017-02-15 15:06:28 provisioningserver.utils.services: [info] Neighbour observation process for enxb827eb208904 started.
: a password is required
2017-02-15 15:06:26 maasserver.regiondservices.ntp: [critical] Failed to update NTP configuration.

Traceback (most recent call last):
  File "/usr/lib/python3.5/threading.py", line 862, in run
    self._target(*self._args, **self._kwargs)
  File "/usr/lib/python3/dist-packages/provisioningserver/utils/twisted.py", line 824, in worker
    return target()
  File "/usr/lib/python3/dist-packages/twisted/_threads/_threadworker.py", line 46, in work
    task()
  File "/usr/lib/python3/dist-packages/twisted/_threads/_team.py", line 190, in doWork
    task()
--- <exception caught here> ---
  File "/usr/lib/python3/dist-packages/twisted/python/threadpool.py", line 246, in inContext
    result = inContext.theWork()
  File "/usr/lib/python3/dist-packages/twisted/python/threadpool.py", line 262, in <lambda>
    inContext.theWork = lambda: context.call(ctx, func, *args, **kw)
  File "/usr/lib/python3/dist-packages/twisted/python/context.py", line 118, in callWithContext
    return self.currentContext().callWithContext(ctx, func, *args, **kw)
  File "/usr/lib/python3/dist-packages/twisted/python/context.py", line 81, in callWithContext
    return func(*args,**kw)
  File "/usr/lib/python3/dist-packages/provisioningserver/utils/twisted.py", line 857, in callInContext
    return func(*args, **kwargs)
  File "/usr/lib/python3/dist-packages/provisioningserver/utils/twisted.py", line 225, in wrapper
    result = func(*args, **kwargs)
  File "/usr/lib/python3/dist-packages/provisioningserver/ntp/config.py", line 49, in configure
    mode=0o644)
  File "/usr/lib/python3/dist-packages/provisioningserver/utils/fs.py", line 281, in sudo_write_file
    raise ExternalProcessError(proc.returncode, command, stderr)
provisioningserver.utils.shell.ExternalProcessError: Command `maas-rack atomic-write --filename /etc/ntp/maas.conf --mode 0644` returned non-zero exit status 1:

/var/lib/rackd.log

2017-02-10 15:42:28 sstreams: [info] maas:v2:download/maas:boot:ubuntu:amd64:ga-16.04:xenial: to_add=['20170207'] to_remove=[]
2...

Spyderdyne (spyderdyne) wrote :

I rolled both devices to the dev branch and this issue seems to be resolved.

RPi 3B region:

MAAS Version 2.2.0 (beta2+bzr5717)

NUC rack:

MAAS Version 2.2.0 (beta2+bzr5717)

While I am ecstatic that this is also working now, I am deeply concerned that none of the Ubuntu "stable" stack is currently functional in a MaaS + Juju deployment due to various bugs, and only the devel PPAs seem to function right now with LXD and a network bridge to MaaS.

(https://bugs.launchpad.net/juju/+bug/1633788)

FWIW

Spyderdyne (spyderdyne) wrote :

Now that I am on the dev branch, the TGT service works great. Unfortunately, the dev branch is unable to determine the correct IP address for the controller during node discovery, and if you set up a new machine entry in MaaS for it using its MAC address, it ignores the ACTUAL MaaS IP address on commissioning and attempts to hit the MaaS APIs on the default gateway instead.

Is there currently or has there ever been an actual working version of MaaS somewhere? Does anyone know where to find one?

Opening another new bug report...

Changed in maas:
status: Triaged → Won't Fix
Changed in maas:
status: Won't Fix → Invalid
Andres Rodriguez (andreserl) wrote :

@Christian,

We cannot test the zesty package nor grab the package and install it in xenial because it needs to be built against the xenial archive for the library linking. As such, I would recommend this patch is pushed to -proposed where we can test it against the libraries of the archive.

That would let us test it just once before it is pushed into xenial, instead of testing it in the PPA and then testing it again in -proposed.

Hi Andres,
My full sentence was "test on zesty OR grab the package"
I'm clear on the linking issues and that grabbing this implies way more messy changes - but in this case "grabbing" the change would have been like adding two lines to the service file of tgt.
Anyway of course I'm ok with building something for Xenial to test.

The fix there was about task-max (the number of tasks, not their size), so it wasn't certain that it would fix this issue as well. That was the reason we asked for a pre-check. The assumption is that it will fix the ability to spawn more threads (when serving many images), but not the general memory consumption (though that might be mitigated with swap).

Still, IMHO that is not what -proposed is for (trial and error to see if it fixes the issue); it is there to ensure we have no regressions - well, you know all that anyway. So if there is any chance you can test from a Xenial PPA, that would be much preferred to xenial-proposed. Only if there were no way to build a Xenial PPA would I think it OK to test directly in -proposed.

A ppa to test for your case is in [1].

[1]: https://launchpad.net/~ci-train-ppa-service/+archive/ubuntu/2991
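
To see which task limit systemd currently applies to the service, something like the sketch below could be used. It assumes the "task-max" mentioned above refers to systemd's TasksMax= property and that the unit is named tgt.service; the function names are hypothetical.

```python
import subprocess


def parse_tasks_max(show_output):
    """Extract the TasksMax value from `systemctl show -p TasksMax` output.

    Returns the raw string (a number or "infinity"), or None if absent.
    """
    for line in show_output.splitlines():
        key, separator, value = line.partition("=")
        if separator and key == "TasksMax":
            return value
    return None


def tasks_max(unit="tgt.service"):
    """Ask systemd for the task limit applied to the given unit."""
    result = subprocess.run(
        ["systemctl", "show", "-p", "TasksMax", unit],
        check=True, capture_output=True, text=True)
    return parse_tasks_max(result.stdout)
```

Comparing this value before and after installing the PPA package would confirm whether the limit was actually lifted.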
