Twisted Unhandled Error in Deferred - Too many open files: /var/log/

Bug #1405998 reported by Larry Michel on 2014-12-27
16
This bug affects 2 people
Affects Status Importance Assigned to Milestone
MAAS
Critical
Gavin Panella
1.7
Undecided
Unassigned

Bug Description

From pserv.log, there are 660227 such errors during a short period of time: ~ 1 hour and 34 minutes.

===========================================================
ubuntu@maas-trusty-back-may22:~$ grep " Unhandled error in Deferred" /var/log/maas/pserv.log|grep "2014-12-26" | wc -l
680227
===========================================================

There are also 386375 oops files.

They start at
2014-12-26 21:35:17+0000 [-] Unhandled error in Deferred
and end at
2014-12-26 22:09:40+0000 [-] Unhandled error in Deferred:

This is one entry:
===========================================================
2014-12-26 21:49:59+0000 [-] Unhandled error in Deferred:
2014-12-26 21:49:59+0000 [-] Unhandled Error
        Traceback (most recent call last):
          File "/usr/lib/python2.7/threading.py", line 783, in __bootstrap
            self.__bootstrap_inner()
          File "/usr/lib/python2.7/threading.py", line 810, in __bootstrap_inner
            self.run()
          File "/usr/lib/python2.7/threading.py", line 763, in run
            self.__target(*self.__args, **self.__kwargs)
        --- <exception caught here> ---
          File "/usr/lib/python2.7/dist-packages/twisted/python/threadpool.py", line 191, in _worker
            result = context.call(ctx, function, *args, **kwargs)
          File "/usr/lib/python2.7/dist-packages/twisted/python/context.py", line 118, in callWithContext
            return self.currentContext().callWithContext(ctx, func, *args, **kw)
          File "/usr/lib/python2.7/dist-packages/twisted/python/context.py", line 81, in callWithContext
            return func(*args,**kw)
          File "/usr/lib/python2.7/dist-packages/oops/publishers.py", line 38, in result
            return publisher(report)
          File "/usr/lib/python2.7/dist-packages/oops_datedir_repo/repository.py", line 133, in publish
            self.serializer.write(report, open(filename + '.tmp', 'wb'))
        exceptions.IOError: [Errno 24] Too many open files: '/var/log/maas/oops/2014-12-26/OOPS-9157e9b65c315009c17fbf642a0f0f46.tmp'
===========================================================

There were also 18 occurrences on 12/21 and 36 oops files.

Around time of these Unhandled errors, we saw deployment related failures such as this one:

===========================================================
Launching instance
WARNING picked arbitrary tools &{1.20.14-precise-amd64 https://streams.canonical.com/juju/tools/releases/juju-1.20.14-precise-amd64.tgz 932640702b5d9f08a312118a4afe330444308f92b7de350de9e547719f084bd9 8130412}
ERROR bootstrap failed: cannot start bootstrap instance: gomaasapi: got error back from server: 500 INTERNAL SERVER ERROR (u'ipmi')
Bootstrap failed, destroying environment
ERROR Bootstrap failed, and the environment could not be destroyed: gomaasapi: got error back from server: 503 SERVICE UNAVAILABLE (Unable to connect to cluster 'OIL Cluster' (037c960b-5b9f-4701-8366-eeda2c09d14e); no connections available.)
ERROR cannot start bootstrap instance: gomaasapi: got error back from server: 500 INTERNAL SERVER ERROR (u'ipmi')
===========================================================

This is for maas: 1.7.1~rc1+bzr3313-0ubuntu1~trusty1

Related branches

Larry Michel (lmic) on 2014-12-27
description: updated
tags: added: oil
Raphaël Badin (rvb) wrote :

> There are also 386375 oops files.

The OOPS system is an extra moving part that doesn't really add anything. We need to get rid of it. The logs should go in MAAS' normal logs.

Now, it seems your installation has a problem so bad it's creating a crazy amount of OOPS reports and this, in turn, becomes a problem ("Too many open files"). Could you please try seeing if you could get more information about the *original* problem? I think you will be able to find more information in the OOPS files (the ones that could be written before the "Too many open files" problem started happening). OOPS files are bson encoded files in /var/log/maas/oops/<day>/; usually, `strings <oops-file>` is all it takes to get a good idea about what the problem is.

Changed in maas:
status: New → Triaged
importance: Undecided → Critical
Changed in maas:
milestone: none → 1.7.2
Larry Michel (lmic) wrote :

Pserv logs:

Christian Reis (kiko) wrote :

<kiko> larrymi, when was the machine upgraded to 1.7.1RC1 and what from?
<larrymi> kiko: it was upgraded from 1.7.0. We got the beta first (i think very close to time it came out) then we went to rc1 around 1st week of December timeframe.

Christian Reis (kiko) wrote :

These pserv logs are a treasure trove of OOPSes. Just in pserv.log.4.gz we have a total of :

- provisioningserver.power.poweraction.PowerActionFail: sm15k failed with return code 1: Not Found: servers/26/0 (HTTP 404)
- Failure: provisioningserver.rpc.exceptions.NodeStateViolation: The status of the node is 6; this status cannot be transitioned to a corresponding failed status.
- tftp.errors.InvalidErrorcodeError: Unknown error code: 8
- exceptions.AttributeError: 'Port' object has no attribute 'socket'
- exceptions.OSError: [Errno 17] File exists: '/var/log/maas/oops/2014-12-09'
- Failure: twisted.internet.error.ConnectionDone: Connection was closed cleanly.
  (why would that be a Failure?)
- Failure: twisted.internet.error.ConnectionLost: Connection to the other side was lost in a non-clean fashion.
- provisioningserver.rpc.exceptions.NoConnectionsAvailable:
- Failure: twisted.internet.defer.CancelledError:
- Failed to login to virsh console.
- Failure: provisioningserver.rpc.exceptions.NodeStateViolation: The status of the node is 4; this status cannot be transitioned to a corresponding failed status.
- provisioningserver.power.poweraction.PowerActionFail: virsh failed with return code 1: Failed to login to virsh console.
- Failure: provisioningserver.rpc.exceptions.NodeStateViolation: The status of the node is 6; this status cannot be transitioned to a corresponding failed status.
- Failure: twisted.internet.error.TimeoutError: User timeout caused connection failure.
- exceptions.AttributeError: 'NoneType' object has no attribute 'stopListening

Christian Reis (kiko) wrote :

pserv.log.3.gz is similar, but it has a bunch more sm15k OOPSes and a large set of:

- Failure: twisted.web.error.Error: 500 INTERNAL SERVER ERROR

Christian Reis (kiko) wrote :
Download full text (3.8 KiB)

And pserv.log.2.gz has the actual issue reported, where we see complaints about too many open files. It starts with this:

2014-12-26 21:39:58+0000 [-] Unhandled Error
        Traceback (most recent call last):
          File "/usr/lib/python2.7/dist-packages/twisted/internet/defer.py", line 423, in errback
            self._startRunCallbacks(fail)
          File "/usr/lib/python2.7/dist-packages/twisted/internet/defer.py", line 490, in _startRunCallbacks
            self._runCallbacks()
          File "/usr/lib/python2.7/dist-packages/twisted/internet/defer.py", line 577, in _runCallbacks
            current.result = callback(current.result, *args, **kw)
          File "/usr/lib/python2.7/dist-packages/twisted/internet/defer.py", line 1155, in gotResult
            _inlineCallbacks(r, g, deferred)
        --- <exception caught here> ---
          File "/usr/lib/python2.7/dist-packages/twisted/internet/defer.py", line 1097, in _inlineCallbacks
            result = result.throwExceptionIntoGenerator(g)
          File "/usr/lib/python2.7/dist-packages/twisted/python/failure.py", line 389, in throwExceptionIntoGenerator
            return g.throw(self.type, self.value, self.tb)
          File "/usr/lib/python2.7/dist-packages/provisioningserver/rpc/power.py", line 234, in change_power_state
            'query', context)
          File "/usr/lib/python2.7/dist-packages/twisted/python/threadpool.py", line 191, in _worker
            result = context.call(ctx, function, *args, **kwargs)
          File "/usr/lib/python2.7/dist-packages/twisted/python/context.py", line 118, in callWithContext
            return self.currentContext().callWithContext(ctx, func, *args, **kw)
          File "/usr/lib/python2.7/dist-packages/twisted/python/context.py", line 81, in callWithContext
            return func(*args,**kw)
          File "/usr/lib/python2.7/dist-packages/provisioningserver/utils/twisted.py", line 148, in wrapper
            return func(*args, **kwargs)
          File "/usr/lib/python2.7/dist-packages/provisioningserver/rpc/power.py", line 111, in perform_power_change
            return action.execute(power_change=power_change, **context)
          File "/usr/lib/python2.7/dist-packages/provisioningserver/power/poweraction.py", line 138, in execute
            template = self.get_template()
          File "/usr/lib/python2.7/dist-packages/provisioningserver/power/poweraction.py", line 86, in get_template
            with open(self.path, "rb") as f:
        exceptions.IOError: [Errno 24] Too many open files: u'/etc/maas/templates/power/sm15k.template'

Nice that it waited for after xmas to start reporting problems.

Immediately after that, it starts to complain:

        Traceback (most recent call last):
          File "/usr/lib/python2.7/threading.py", line 783, in __bootstrap

          File "/usr/lib/python2.7/threading.py", line 810, in __bootstrap_inner

          File "/usr/lib/python2.7/threading.py", line 763, in run

        --- <exception caught here> ---
          File "/usr/lib/python2.7/dist-packages/twisted/python/threadpool.py", line 191, in _worker
            result = context.call(ctx, function,...

Read more...

Christian Reis (kiko) wrote :

Oh, I missed an error that happens a bit before the first one in my previous comment:

        Traceback (most recent call last):
          File "/usr/lib/python2.7/dist-packages/tftp/protocol.py", line 67, in _startSession
            context, self.backend.get_reader, datagram.filename)
          File "/usr/lib/python2.7/dist-packages/twisted/python/context.py", line 118, in callWithContext
            return self.currentContext().callWithContext(ctx, func, *args, **kw)
          File "/usr/lib/python2.7/dist-packages/twisted/python/context.py", line 81, in callWithContext
            return func(*args,**kw)
          File "/usr/lib/python2.7/dist-packages/provisioningserver/utils/twisted.py", line 62, in wrapper
            return maybeDeferred(func, *args, **kwargs)
        --- <exception caught here> ---
          File "/usr/lib/python2.7/dist-packages/twisted/internet/defer.py", line 139, in maybeDeferred
            result = f(*args, **kw)
          File "/usr/lib/python2.7/dist-packages/provisioningserver/pserv_services/tftp.py", line 216, in get_reader
            mac_address = get_remote_mac()
          File "/usr/lib/python2.7/dist-packages/provisioningserver/boot/__init__.py", line 106, in get_remote_mac
            return find_mac_via_arp(remote_host)
          File "/usr/lib/python2.7/dist-packages/provisioningserver/utils/network.py", line 101, in find_mac_via_arp
            output = call_and_check(['ip', 'neigh'], env={'LC_ALL': 'C'})
          File "/usr/lib/python2.7/dist-packages/provisioningserver/utils/shell.py", line 142, in call_and_check
            process = Popen(command, *args, stdout=PIPE, stderr=PIPE, **kwargs)
          File "/usr/lib/python2.7/subprocess.py", line 710, in __init__
            errread, errwrite)
          File "/usr/lib/python2.7/subprocess.py", line 1215, in _execute_child
            errpipe_read, errpipe_write = self.pipe_cloexec()
          File "/usr/lib/python2.7/subprocess.py", line 1167, in pipe_cloexec
            r, w = os.pipe()
        exceptions.OSError: [Errno 24] Too many open files

Could we be flooding the server with find_mac_via_arp() calls?

Christian Reis (kiko) wrote :

Or just leaking descriptors? At any rate, after that error the machine is wedged and goes into an OOPS generation loop until the last one is generated:

2014-12-26 22:09:40+0000 [-] Unhandled Error
        Traceback (most recent call last):
          File "/usr/lib/python2.7/threading.py", line 783, in __bootstrap
            self.__bootstrap_inner()
          File "/usr/lib/python2.7/threading.py", line 810, in __bootstrap_inner
            self.run()
          File "/usr/lib/python2.7/threading.py", line 763, in run
            self.__target(*self.__args, **self.__kwargs)
        --- <exception caught here> ---
          File "/usr/lib/python2.7/dist-packages/twisted/python/threadpool.py", line 191, in _worker
            result = context.call(ctx, function, *args, **kwargs)
          File "/usr/lib/python2.7/dist-packages/twisted/python/context.py", line 118, in callWithContext
            return self.currentContext().callWithContext(ctx, func, *args, **kw)
          File "/usr/lib/python2.7/dist-packages/twisted/python/context.py", line 81, in callWithContext
            return func(*args,**kw)
          File "/usr/lib/python2.7/dist-packages/oops/publishers.py", line 38, in result
            return publisher(report)
          File "/usr/lib/python2.7/dist-packages/oops_datedir_repo/repository.py", line 133, in publish
            self.serializer.write(report, open(filename + '.tmp', 'wb'))
        exceptions.IOError: [Errno 24] Too many open files: '/var/log/maas/oops/2014-12-26/OOPS-ce59fba5782424eb6106c025b4b53192.tmp'

At which point it goes back into normality:

2014-12-26 22:10:03+0000 [-] TFTP starting on 69
2014-12-26 22:10:03+0000 [-] TFTP starting on 69
[...]

Christian Reis (kiko) wrote :

After that the only tracebacks in pserv.log.2.gz are of the sm15k issue:

2014-12-27 21:54:04+0000 [-] Unhandled Error
        Traceback (most recent call last):
          File "/usr/lib/python2.7/dist-packages/twisted/internet/defer.py", line 423, in errback
            self._startRunCallbacks(fail)
          File "/usr/lib/python2.7/dist-packages/twisted/internet/defer.py", line 490, in _startRunCallbacks
            self._runCallbacks()
          File "/usr/lib/python2.7/dist-packages/twisted/internet/defer.py", line 577, in _runCallbacks
            current.result = callback(current.result, *args, **kw)
          File "/usr/lib/python2.7/dist-packages/twisted/internet/defer.py", line 1155, in gotResult
            _inlineCallbacks(r, g, deferred)
        --- <exception caught here> ---
          File "/usr/lib/python2.7/dist-packages/twisted/internet/defer.py", line 1097, in _inlineCallbacks
            result = result.throwExceptionIntoGenerator(g)
          File "/usr/lib/python2.7/dist-packages/twisted/python/failure.py", line 389, in throwExceptionIntoGenerator
            return g.throw(self.type, self.value, self.tb)
          File "/usr/lib/python2.7/dist-packages/provisioningserver/rpc/power.py", line 222, in change_power_state
            power_change, context)
          File "/usr/lib/python2.7/dist-packages/twisted/python/threadpool.py", line 191, in _worker
            result = context.call(ctx, function, *args, **kwargs)
          File "/usr/lib/python2.7/dist-packages/twisted/python/context.py", line 118, in callWithContext
            return self.currentContext().callWithContext(ctx, func, *args, **kw)
          File "/usr/lib/python2.7/dist-packages/twisted/python/context.py", line 81, in callWithContext
            return func(*args,**kw)
          File "/usr/lib/python2.7/dist-packages/provisioningserver/utils/twisted.py", line 148, in wrapper
            return func(*args, **kwargs)
          File "/usr/lib/python2.7/dist-packages/provisioningserver/rpc/power.py", line 111, in perform_power_change
            return action.execute(power_change=power_change, **context)
          File "/usr/lib/python2.7/dist-packages/provisioningserver/power/poweraction.py", line 142, in execute
            return self.run_shell(rendered)
          File "/usr/lib/python2.7/dist-packages/provisioningserver/power/poweraction.py", line 127, in run_shell
            process.returncode, shell, output))
        provisioningserver.power.poweraction.PowerActionFail: sm15k failed with return code 1:
        Not Found: servers/20/0 (HTTP 404)

Christian Reis (kiko) wrote :

(and the familiar provisioningserver.rpc.exceptions.NodeStateViolation with status 6)

Christian Reis (kiko) wrote :

Okay, and pserv.log.1.gz also has the Too many open files problem, starting with the familiar find_mac_via_arp() issue:

2015-01-03 10:07:12+0000 [-] Unhandled Error
        Traceback (most recent call last):
          File "/usr/lib/python2.7/dist-packages/tftp/protocol.py", line 67, in _startSession
            context, self.backend.get_reader, datagram.filename)
          File "/usr/lib/python2.7/dist-packages/twisted/python/context.py", line 118, in callWithContext
            return self.currentContext().callWithContext(ctx, func, *args, **kw)
          File "/usr/lib/python2.7/dist-packages/twisted/python/context.py", line 81, in callWithContext
            return func(*args,**kw)
          File "/usr/lib/python2.7/dist-packages/provisioningserver/utils/twisted.py", line 62, in wrapper
            return maybeDeferred(func, *args, **kwargs)
        --- <exception caught here> ---
          File "/usr/lib/python2.7/dist-packages/twisted/internet/defer.py", line 139, in maybeDeferred
            result = f(*args, **kw)
          File "/usr/lib/python2.7/dist-packages/provisioningserver/pserv_services/tftp.py", line 216, in get_reader
            mac_address = get_remote_mac()
          File "/usr/lib/python2.7/dist-packages/provisioningserver/boot/__init__.py", line 106, in get_remote_mac
            return find_mac_via_arp(remote_host)
          File "/usr/lib/python2.7/dist-packages/provisioningserver/utils/network.py", line 101, in find_mac_via_arp
            output = call_and_check(['ip', 'neigh'], env={'LC_ALL': 'C'})
          File "/usr/lib/python2.7/dist-packages/provisioningserver/utils/shell.py", line 142, in call_and_check
            process = Popen(command, *args, stdout=PIPE, stderr=PIPE, **kwargs)
          File "/usr/lib/python2.7/subprocess.py", line 710, in __init__
            errread, errwrite)
          File "/usr/lib/python2.7/subprocess.py", line 1215, in _execute_child
            errpipe_read, errpipe_write = self.pipe_cloexec()
          File "/usr/lib/python2.7/subprocess.py", line 1167, in pipe_cloexec
            r, w = os.pipe()
        exceptions.OSError: [Errno 24] Too many open files

After a few of those, a while later, we get a few other crashes caused by fd exhaustion (in check_lease_changes(), and find_ip_via_arp()) and then we go into the OOPS generation storm which lasts

Christian Reis (kiko) wrote :

until we generate the last OOPS generation traceback at 2015-01-03 12:28:08+0000.

Christian Reis (kiko) wrote :

The actual pserv.log file doesn't contain many OOPSes, and they seem to be relatively uninteresting ones with mainly two types of exception:

- twisted.internet.error.CannotListenError: Couldn't listen on 192.168.224.251:69: [Errno 98] Address already in use.
- provisioningserver.rpc.exceptions.NoConnectionsAvailable:

Christian Reis (kiko) wrote :

So my very simple-minded analysis suggests that we are doing something in MAAS which causes us to run out of FDs, possibly one of the issues for which we have a traceback above (such as the one with find_mac_via_arp() in its stack). This may be an FD leak or perhaps just the result of us trying to do too much at once, such as trying to query power status -- or something else.

As we run out of FDs we then crash in other places where new FDs are requested, finishing with the failure to generate an OOPS because that itself requires an FD, which then triggers recursive OOPS generation.

Jason Hobbs (jason-hobbs) wrote :

I've attached an lsof dump from the system this was happening on.

Jason Hobbs (jason-hobbs) wrote :

Why are there 144 open tcp connections from the cluster server to the region server?

I adjusted the ulimit on open files for our maas-clusterd to 65535 by adding this in /etc/init/maas.clusterd.conf:

ulimit -n 65535

.. right before twisted is started.

Christian Reis (kiko) wrote :

This is a count of all the unique items referred to in the lsof output.

I wonder if this lsof is from the middle of MAAS doing something (such as enlisting nodes)?

Christian Reis (kiko) wrote :

We should drop OOPS generation in 1.8; for 1.7, I'd like us to figure out if we are leaking fds and fix that if possible.

Jason Hobbs (jason-hobbs) wrote :

kiko, MAAS wasn't doing anything especially interesting at this point - no bulk enlistments.

Gavin Panella (allenap) wrote :

> Why are there 144 open tcp connections from the cluster server to the
> region server?

I think this is lsof reporting each connection for each thread. There
appear to be only 18 distinct connections.

Gavin Panella (allenap) wrote :

I've landed a change into MAAS trunk to remove all OOPS reporting. I'll mark this bug as fixed for now. Let's reopen it if we see this problem again.

Changed in maas:
status: Triaged → Fix Committed
assignee: nobody → Gavin Panella (allenap)
Christian Reis (kiko) on 2015-01-28
Changed in maas:
milestone: 1.7.2 → next
Gavin Panella (allenap) wrote :

A corresponding packaging change needs to be made. The change landed already - lp:~allenap/maas/oops-did-it-again - can be released in 1.7.2 and will hopefully alleviate the issue. The packaging change, which will clean-up and remove dependencies, can come with 1.8.0. The reason for this is to reduce the immediate delta and testing overhead.

Changed in maas:
milestone: next → 1.7.2
Raphaël Badin (rvb) wrote :

Note that we still need the packaging side fixed: remove the dependency on python-oops* and the creation of the oops directory.

Christian Reis (kiko) on 2015-02-11
Changed in maas:
milestone: 1.7.2 → next
Andres Rodriguez (andreserl) wrote :

This bug has been reported and fixed on upstream MAAS. However, provided that the bug was listed on the debian changelog, this appears as needing verification for pending SRU [1]. This bug did not affect current MAAS in Ubuntu, hence setting this to verification-done to unblock pending SRU.

[1]:http://people.canonical.com/~ubuntu-archive/pending-sru.html

tags: added: verification-done
Changed in maas:
status: Fix Committed → Fix Released
Changed in maas:
milestone: next → none
To post a comment you must log in.
This report contains Public information  Edit
Everyone can see this information.

Duplicates of this bug

Other bug subscribers