1.8rc1: failures including region not available due to '[Errno 24] too many open files'

Bug #1461863 reported by Larry Michel on 2015-06-04
This bug affects 2 people
Affects: MAAS
Importance: Critical
Assigned to: Unassigned

Bug Description

We are seeing the failures below in the MAAS logs:

From regiond.log:
====================================================================
pen files. (While requesting RPC info at http://10.245.0.10/MAAS/rpc/).
2015-06-04 10:06:25+0000 [-] Unhandled Error
 Traceback (most recent call last):
   File "/usr/lib/python2.7/dist-packages/twisted/internet/defer.py", line 423, in errback
     self._startRunCallbacks(fail)
   File "/usr/lib/python2.7/dist-packages/twisted/internet/defer.py", line 490, in _startRunCallbacks
     self._runCallbacks()
   File "/usr/lib/python2.7/dist-packages/twisted/internet/defer.py", line 577, in _runCallbacks
     current.result = callback(current.result, *args, **kw)
   File "/usr/lib/python2.7/dist-packages/twisted/internet/defer.py", line 1155, in gotResult
     _inlineCallbacks(r, g, deferred)
 --- <exception caught here> ---
   File "/usr/lib/python2.7/dist-packages/twisted/internet/defer.py", line 1097, in _inlineCallbacks
     result = result.throwExceptionIntoGenerator(g)
   File "/usr/lib/python2.7/dist-packages/twisted/python/failure.py", line 389, in throwExceptionIntoGenerator
     return g.throw(self.type, self.value, self.tb)
   File "/usr/lib/python2.7/dist-packages/provisioningserver/pserv_services/lease_upload_service.py", line 113, in _get_client_and_start_upload
     yield self._start_upload(client)
   File "/usr/lib/python2.7/dist-packages/twisted/internet/defer.py", line 1097, in _inlineCallbacks
     result = result.throwExceptionIntoGenerator(g)
   File "/usr/lib/python2.7/dist-packages/twisted/python/failure.py", line 389, in throwExceptionIntoGenerator
     return g.throw(self.type, self.value, self.tb)
   File "/usr/lib/python2.7/dist-packages/provisioningserver/pserv_services/lease_upload_service.py", line 118, in _start_upload
     updated_lease_info = yield deferToThread(check_lease_changes)
   File "/usr/lib/python2.7/dist-packages/twisted/python/threadpool.py", line 191, in _worker
     result = context.call(ctx, function, *args, **kwargs)
   File "/usr/lib/python2.7/dist-packages/twisted/python/context.py", line 118, in callWithContext
     return self.currentContext().callWithContext(ctx, func, *args, **kw)
   File "/usr/lib/python2.7/dist-packages/twisted/python/context.py", line 81, in callWithContext
     return func(*args,**kw)
   File "/usr/lib/python2.7/dist-packages/provisioningserver/dhcp/leases.py", line 118, in check_lease_changes
     with objectfork() as (pid, recv, send):
   File "/usr/lib/python2.7/contextlib.py", line 17, in __enter__
     return self.gen.next()
   File "/usr/lib/python2.7/dist-packages/provisioningserver/utils/shell.py", line 307, in objectfork
     with pipefork() as (pid, fin, fout):
   File "/usr/lib/python2.7/contextlib.py", line 17, in __enter__
     return self.gen.next()
   File "/usr/lib/python2.7/dist-packages/provisioningserver/utils/shell.py", line 207, in pipefork
     crashfile = TemporaryFile()
   File "/usr/lib/python2.7/tempfile.py", line 493, in TemporaryFile
     (fd, name) = _mkstemp_inner(dir, prefix, suffix, flags)
   File "/usr/lib/python2.7/tempfile.py", line 239, in _mkstemp_inner
     fd = _os.open(file, flags, 0600)
 exceptions.OSError: [Errno 24] Too many open files: '/tmp/tmpfXJQAl'

2015-06-04 10:06:27+0000 [-] Region not available: Couldn't bind: 24: Too many open files. (While requesting RPC info at http://10.245.0.10/MAAS/rpc/).
2015-06-04 10:06:29+0000 [-] Region not available: Couldn't bind: 24: Too man
====================================================================
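The OSError at the bottom of the traceback can be reproduced in isolation, which may help when testing a fix. The following is only a sketch, not MAAS code: the function name exhaust_fds and the limit of 64 are assumptions chosen for illustration. It temporarily lowers the soft RLIMIT_NOFILE of the current process and opens temporary files until the kernel refuses with errno 24 (EMFILE).

```python
import errno
import resource
import tempfile


def exhaust_fds(soft_limit=64):
    """Lower the soft fd limit, open temp files until failure, return the errno.

    Hypothetical helper for illustration; it restores the original limit
    and closes every file it opened before returning.
    """
    old_soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    resource.setrlimit(resource.RLIMIT_NOFILE, (soft_limit, hard))
    held = []
    try:
        while True:
            # Same call that fails in pipefork() in the traceback.
            held.append(tempfile.TemporaryFile())
    except OSError as exc:
        return exc.errno
    finally:
        for handle in held:
            handle.close()
        resource.setrlimit(resource.RLIMIT_NOFILE, (old_soft, hard))


if __name__ == "__main__":
    code = exhaust_fds()
    print("failed with errno {} ({})".format(code, errno.errorcode.get(code)))
```

Because the limit is per process, a leak in one daemon produces this error in every code path of that daemon that opens a file, socket or pipe, which would be consistent with the mix of failures in the logs.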

From maas.log:
====================================================================
Jun 4 09:42:25 maas-trusty-back-may22 maas.power: [ERROR] marmon.local: Failed to refresh power state: [Errno 24] Too many open files: u'/etc/maas/templates/power/ipmi.template'
Jun 4 09:42:25 maas-trusty-back-may22 maas.power: [ERROR] hayward-29: Failed to refresh power state: [Errno 24] Too many open files: u'/etc/maas/templates/power/sm15k.template'
Jun 4 09:42:25 maas-trusty-back-may22 maas.power: [ERROR] hayward-37: Failed to refresh power state: [Errno 24] Too many open files: u'/etc/maas/templates/power/sm15k.template'
Jun 4 09:42:25 maas-trusty-back-may22 maas.power: [ERROR] huffman-vm-06: Failed to refresh power state: [Errno 24] Too many open files: u'/etc/maas/templates/power/virsh.template'
Jun 4 09:42:25 maas-trusty-back-may22 maas.power: [ERROR] elgin.local: Failed to refresh power state: [Errno 24] Too many open files: u'/etc/maas/templates/power/ipmi.template'
Jun 4 09:42:25 maas-trusty-back-may22 maas.power: [ERROR] kobusch.local: Failed to refresh power state: [Errno 24] Too many open files: u'/etc/maas/templates/power/ipmi.template'
Jun 4 09:42:25 maas-trusty-back-may22 maas.power: [ERROR] gytrash.local: Failed to refresh power state: [Errno 24] Too many open files: u'/etc/maas/templates/power/ipmi.template'
Jun 4 09:42:25 maas-trusty-back-may22 maas.power: [ERROR] hayward-35: Failed to refresh power state: [Errno 24] Too many open files: u'/etc/maas/templates/power/sm15k.template'
Jun 4 09:43:25 maas-trusty-back-may22 maas.service_monitor: [ERROR] While monitoring service 'maas-dhcpd' an error was encountered: [Errno 24] Too many open

...

Jun 4 09:52:25 maas-trusty-back-may22 maas.power: [ERROR] hayward-35: Failed to refresh power state: [Errno 24] Too many open files: u'/etc/maas/templates/power/sm15k.template'
Jun 4 09:53:25 maas-trusty-back-may22 maas.service_monitor: [ERROR] While monitoring service 'maas-dhcpd' an error was encountered: [Errno 24] Too many open files
Jun 4 09:53:25 maas-trusty-back-may22 maas.service_monitor: [ERROR] While monitoring service 'maas-dhcpd6' an error was encountered: [Errno 24] Too many open files
Jun 4 09:53:25 maas-trusty-back-may22 maas.lease_upload_service: [ERROR] Failed to upload leases: [Errno 24] Too many open files: '/tmp/tmpkcxlFa'
Jun 4 09:53:25 maas-trusty-back-may22 maas.service_monitor: [ERROR] While monitoring service 'tgt' an error was encountered: [Errno 24] Too many open files
Jun 4 09:54:25 maas-trusty-back-may22 maas.lease_upload_service: [ERROR] Failed to upload leases: [Errno 24] Too many open files: '/tmp/tmp1c6i7Z'
Jun 4 09:55:25 maas-trusty-back-may22 maas.service_monitor: [ERROR] While monitoring service 'maas-dhcpd' an error was encountered: [Errno 24] Too many open files
Jun 4 09:55:25 maas-trusty-back-may22 maas.service_monitor: [ERROR] While monitoring service 'maas-dhcpd6' an error was encountered: [Errno 24] Too many open files
Jun 4 09:55:25 maas-trusty-back-may22 maas.service_monitor: [ERROR] While monitoring service 'tgt' an error was encountered: [Errno 24] Too many open files
Jun 4 09:55:25 maas-trusty-back-may22 maas.lease_upload_service: [ERROR] Failed to upload leases: [Errno 24] Too many open files: '/tmp/tmpitR9pp'
====================================================================

Releasing nodes is failing, and the build console shows:
====================================================================
2015-06-04 09:28:44,625 [INFO] oil_ci.juju.client: Boostrapping new environment
Bootstrapping environment "ci-oil-slave1"
Starting new instance for initial state server
Launching instance
WARNING no architecture was specified, acquiring an arbitrary node
WARNING no architecture was specified, acquiring an arbitrary node
WARNING no architecture was specified, acquiring an arbitrary node
WARNING no architecture was specified, acquiring an arbitrary node
Bootstrap failed, destroying environment
ERROR failed to bootstrap environment: cannot start bootstrap instance: gomaasapi: got error back from server: 400 BAD REQUEST ({"distro_series": ["'trusty' is not a valid distro_series. It should be one of: ''."]})
2015-06-04 09:29:11,727 [ERROR] oil_ci.juju.client: Calling "juju bootstrap" failed!
2015-06-04 09:29:11,727 [ERROR] oil_ci.cli: Deployment failed:
+ rc=1
+ echo 'Deployment returned: 1'
Deployment returned: 1
+ [[ 1 == 0 ]]
====================================================================

From lsof:
====================================================================
ubuntu@maas-trusty-back-may22:~$ sudo lsof|wc -l
111033
ubuntu@maas-trusty-back-may22:~$ lsof|wc -l
4873
====================================================================
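Note that lsof line counts include memory-mapped files, working directories and (depending on the lsof version) one row per thread, so they overstate the number of descriptors that count against the limit. A more direct check is to count the entries in /proc/<pid>/fd and compare them with the limit. The sketch below assumes Linux; the function names are made up for illustration.

```python
import os
import resource


def count_open_fds(pid="self"):
    """Return the number of descriptors listed in /proc/<pid>/fd (Linux only).

    When pid is "self", the directory listing itself briefly holds one
    descriptor, so the result can be one higher than the steady state.
    """
    return len(os.listdir("/proc/{}/fd".format(pid)))


def fd_headroom():
    """Return (open_fds, soft_limit, hard_limit) for the current process."""
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    return count_open_fds(), soft, hard


if __name__ == "__main__":
    used, soft, hard = fd_headroom()
    print("open fds: {} (soft limit {}, hard limit {})".format(used, soft, hard))
```

To inspect another process, pass its PID, for example count_open_fds(1234); reading another user's /proc/<pid>/fd generally requires root.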

Logs attached.

Larry Michel (lmic) wrote :
summary: - 1.8rc1: Failed to refresh power state: [Errno24] [Errno 24] Too many
- open files: u'/etc/maas/templates/power/ipmi.template'
+ 1.8rc1: Failed to refresh power state: [Errno 24] Too many open files:
+ u'/etc/maas/templates/power/ipmi.template'

output of sudo lsof

summary: - 1.8rc1: Failed to refresh power state: [Errno 24] Too many open files:
- u'/etc/maas/templates/power/ipmi.template'
+ 1.8rc1: number of different failures in maas.log due to '... [Errno 24]
+ too many open files ...'
description: updated
Larry Michel (lmic) on 2015-06-04
description: updated
summary: - 1.8rc1: number of different failures in maas.log due to '... [Errno 24]
- too many open files ...'
+ 1.8rc1: failures including region not available due to '[Errno 24] too
+ many open files'
Raphaël Badin (rvb) wrote :

The tgt server has a lot of threads and a lot of open files:
cat lsof.txt | grep tgt | wc -l => 76655
cat lsof.txt | grep tgt | awk '{ print $3 }' | uniq | wc -l => 834

Raphaël Badin (rvb) wrote :

Actually, the tgt daemon runs as root and thus has a soft/hard limit of 1048576/1048576.
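The limits that apply to a running daemon can be read from /proc/<pid>/limits rather than inferred from the user it runs as. A minimal sketch, assuming Linux and the usual layout of that file (a "Max open files" row with soft and hard columns); the function names are hypothetical.

```python
def parse_max_open_files(limits_text):
    """Return (soft, hard) as strings from the text of /proc/<pid>/limits."""
    for line in limits_text.splitlines():
        if line.startswith("Max open files"):
            # Row layout: Max open files <soft> <hard> files
            fields = line.split()
            return fields[3], fields[4]
    raise ValueError("no 'Max open files' row found")


def read_max_open_files(pid="self"):
    """Read and parse the open-files limit of a running process."""
    with open("/proc/{}/limits".format(pid)) as handle:
        return parse_max_open_files(handle.read())


if __name__ == "__main__":
    soft, hard = read_max_open_files()
    print("soft={} hard={}".format(soft, hard))
```

Running read_max_open_files() with the PID of the clusterd process would show the limit that process is actually hitting.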

Ryan Beisner (1chb1n) wrote :

We are periodically seeing this occur in 1.8.0~beta8+bzr3951-0ubuntu1~trusty1. Restarting the clusterd service restores operation for a number of days, after which the problem recurs.

The impact we observe is the inability to release or acquire nodes ('Releasing Failed' at the moment).

http://paste.ubuntu.com/11564921/

2015-06-04 12:31:51+0000 [-] Region not available: Couldn't bind: 24: Too many open files. (While requesting RPC info at http://10.245.168.2/MAAS/rpc/).
...
exceptions.OSError: [Errno 24] Too many open files: '/tmp/tmpvQtPf2'

# lsof summary
sudo lsof > lsof.txt

cat lsof.txt | grep tgt | wc -l
72109

cat lsof.txt | grep tgt | awk '{ print $3 }' | uniq | wc -l
802
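One caveat on the pipeline above: uniq only collapses adjacent duplicates, so without a preceding sort the second number can be higher than the number of distinct values in the third column. The sketch below does the same two counts in Python with a set; the function name is an assumption for illustration, and it assumes nothing about what the third column means (it may be a thread ID or a user name depending on the row and the lsof version).

```python
def summarize_lsof(lines, needle="tgt"):
    """Return (matching_lines, distinct_third_fields) for lsof output.

    Mirrors `grep needle | wc -l` and
    `grep needle | awk '{ print $3 }' | sort -u | wc -l`.
    """
    matching = [line for line in lines if needle in line]
    third_fields = set()
    for line in matching:
        fields = line.split()
        if len(fields) >= 3:
            third_fields.add(fields[2])
    return len(matching), len(third_fields)


# Usage against the file captured above:
#     with open("lsof.txt") as handle:
#         print(summarize_lsof(handle))
```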

Ryan Beisner (1chb1n) wrote :

Adding lsof attachment from dellstack maas lab.

Andres Rodriguez (andreserl) wrote :

I'm attaching a new lsof.

Changed in maas:
status: New → Confirmed
importance: Undecided → Critical
milestone: none → 1.8.0
Andres Rodriguez (andreserl) wrote :
Ryan Beisner (1chb1n) on 2015-06-04
tags: added: openstack uosci
Changed in maas:
milestone: 1.8.0 → none