nova-compute reports OSError: [Errno 24] Too many open files

Bug #1651526 reported by Mattia Belluco
This bug affects 6 people
Affects Status Importance Assigned to Milestone
OpenStack Compute (nova)
Confirmed
Low
Unassigned

Bug Description

Description
===========
After we upgraded our OpenStack deployment to Mitaka, we started noticing more and more VMs in an error and/or shutoff state. We found out that although nova service-list was reporting all nova-compute services as "up", nova-compute was logging:

ERROR nova.compute.manager OSError: [Errno 24] Too many open files

Our deployment uses Ceph as the storage backend for ephemeral/Cinder/Glance, currently with around 900 OSDs installed.
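
A quick way to confirm the descriptor exhaustion on the compute node is to compare the nova-compute process's open file descriptors against its limit. A minimal diagnostic sketch (assuming a Linux host and that the PID is supplied on the command line, e.g. from pgrep -f nova-compute):

#!/usr/bin/env python
# Diagnostic sketch: compare nova-compute's open descriptors to its soft
# RLIMIT_NOFILE. Assumes a Linux /proc filesystem; pass the PID as argv[1].
import os
import sys

pid = int(sys.argv[1])

# Descriptors currently open by the process.
open_fds = len(os.listdir('/proc/%d/fd' % pid))

# Soft limit ("Max open files") from /proc/<pid>/limits.
soft_limit = None
with open('/proc/%d/limits' % pid) as f:
    for line in f:
        if line.startswith('Max open files'):
            soft_limit = int(line.split()[3])
            break

print('open fds: %s / soft limit: %s' % (open_fds, soft_limit))
if soft_limit and open_fds > 0.9 * soft_limit:
    print('WARNING: close to EMFILE; expect "Too many open files" errors')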

Steps to reproduce
==================

Ask a compute node to perform several actions at once, such as:
- take a live snapshot
- delete a VM
- boot a VM

Depending on the task, the compute node will allow a certain number of jobs before complaining about the number of open files.
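
A rough reproduction sketch of the above, using concurrent CLI calls (server and image names are placeholders, and in practice the instances should all live on the same compute node, e.g. by pinning them to one host or availability zone):

# Reproduction sketch: launch several operations against one compute node at
# once and watch the nova-compute log for EMFILE. Names below are placeholders
# and the `openstack` CLI must already be configured with credentials.
import subprocess
import threading

commands = [
    ['openstack', 'server', 'image', 'create', '--name', 'snap-vm1', 'vm1'],  # live snapshot
    ['openstack', 'server', 'delete', 'vm2'],
    ['openstack', 'server', 'create', '--flavor', 'm1.small',
     '--image', 'cirros', '--network', 'private', 'vm3'],
]

threads = [threading.Thread(target=subprocess.check_call, args=(cmd,))
           for cmd in commands]
for t in threads:
    t.start()
for t in threads:
    t.join()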

Expected behaviour
==================
Nova-compute should be more robust in handling requests and should report when it is in an error state.

Environment
===========
We are currently running:

nova-common 2:13.1.2-0ubuntu2~cloud0
nova-compute 2:13.1.2-0ubuntu2~cloud0

hypervisor: Libvirt + KVM
nova-compute-kvm 2:13.1.2-0ubuntu2~cloud0
nova-compute-libvirt 2:13.1.2-0ubuntu2~cloud0

storage:
ceph 0.94.9-1trusty

networking: neutron + openvswitch
neutron-common 2:8.3.0-0ubuntu1.1~cloud0

Logs & Configs
==============

Revision history for this message
Mattia Belluco (blmt) wrote :
Matt Riedemann (mriedem)
tags: added: ceph compute
Revision history for this message
Matt Riedemann (mriedem) wrote :

The compute logs are all wrapped, which makes them hard to read, so I've reformatted them; the error is here:

http://paste.openstack.org/show/594211/

2016-12-20 17:02:11.554 102936 DEBUG oslo_service.periodic_task [req-442fe967-9b86-4ddc-9548-e569f226b508 - - - - -] Running periodic task ComputeManager.update_available_resource run_periodic_tasks /usr/lib/python2.7/dist-packages/oslo_service/periodic_task.py:215
2016-12-20 17:02:11.576 102936 INFO nova.compute.resource_tracker [req-442fe967-9b86-4ddc-9548-e569f226b508 - - - - -] Auditing locally available compute resources for node node-l4-21-16
2016-12-20 17:02:11.577 102936 ERROR nova.compute.manager [req-442fe967-9b86-4ddc-9548-e569f226b508 - - - - -] Error updating resources for node node-l4-21-16.
2016-12-20 17:02:11.577 102936 ERROR nova.compute.manager Traceback (most recent call last):
2016-12-20 17:02:11.577 102936 ERROR nova.compute.manager File "/usr/lib/python2.7/dist-packages/nova/compute/manager.py", line 6487, in update_available_resource
2016-12-20 17:02:11.577 102936 ERROR nova.compute.manager rt.update_available_resource(context)
2016-12-20 17:02:11.577 102936 ERROR nova.compute.manager File "/usr/lib/python2.7/dist-packages/nova/compute/resource_tracker.py", line 508, in update_available_resource
2016-12-20 17:02:11.577 102936 ERROR nova.compute.manager File "/usr/lib/python2.7/dist-packages/nova/virt/libvirt/driver.py", line 5365, in get_available_resource
2016-12-20 17:02:11.577 102936 ERROR nova.compute.manager File "/usr/lib/python2.7/dist-packages/nova/virt/libvirt/driver.py", line 5006, in _get_local_gb_info
2016-12-20 17:02:11.577 102936 ERROR nova.compute.manager File "/usr/lib/python2.7/dist-packages/nova/virt/libvirt/storage/rbd_utils.py", line 380, in get_pool_info
2016-12-20 17:02:11.577 102936 ERROR nova.compute.manager File "/usr/lib/python2.7/dist-packages/nova/virt/libvirt/storage/rbd_utils.py", line 107, in __init__
2016-12-20 17:02:11.577 102936 ERROR nova.compute.manager File "/usr/lib/python2.7/dist-packages/nova/virt/libvirt/storage/rbd_utils.py", line 136, in _connect_to_rados
2016-12-20 17:02:11.577 102936 ERROR nova.compute.manager File "/usr/lib/python2.7/dist-packages/rados.py", line 212, in __init__
2016-12-20 17:02:11.577 102936 ERROR nova.compute.manager File "/usr/lib/python2.7/ctypes/util.py", line 253, in find_library
2016-12-20 17:02:11.577 102936 ERROR nova.compute.manager File "/usr/lib/python2.7/ctypes/util.py", line 242, in _findSoname_ldconfig
2016-12-20 17:02:11.577 102936 ERROR nova.compute.manager OSError: [Errno 24] Too many open files

We end up failing here:

https://github.com/openstack/nova/blob/stable/mitaka/nova/compute/manager.py#L6498
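
The calls in that traceback boil down to a connect-per-call pattern: every pool-stats query builds a new RADOS client, and each client holds sockets and other descriptors open until it is shut down. A minimal illustration of that pattern using the python-rados API (a simplified sketch, not nova's actual rbd_utils code):

import rados

def get_pool_stats(conffile='/etc/ceph/ceph.conf'):
    # New client per call; connect() opens sockets to the cluster, and with
    # hundreds of OSDs several concurrent calls consume descriptors quickly.
    cluster = rados.Rados(conffile=conffile)
    cluster.connect()
    try:
        return cluster.get_cluster_stats()  # {'kb': ..., 'kb_used': ..., ...}
    finally:
        cluster.shutdown()  # descriptors pile up if this is skipped or delayed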

The resource tracker doesn't change the service record state; that's handled in the service group API based on whether or not the service is reporting in on a regular interval, which is different from what's failing here, namely getting local used disk info.

The compute node record itself doesn't have a state or status; that's all determined through the associated services record.

The libvirt driver could disable the service itself, like it does when libvirt is disconnected. That happens via even...


Revision history for this message
Matt Riedemann (mriedem) wrote :

I've also posted to the dev and operators mailing lists about this issue and to get some feedback on how to fix it:

http://lists.openstack.org/pipermail/openstack-dev/2017-January/109780.html

Changed in nova:
status: New → Confirmed
importance: Undecided → Low
Revision history for this message
Antonio Messina (arcimboldo) wrote :

Sorry, but disabling the compute node doesn't solve the issue; although you might want to change that behavior, it is a solution to a different problem.

The problem here is twofold:
* the Mitaka+ version of nova now creates multiple threads to delete VMs in parallel (and create snapshots, etc.), and as a consequence is more connection-hungry
* nova is not dealing properly with fd starvation.

Since the number of connections created is a function of the size of the Ceph cluster, I would expect either:
a) nova limits the number of parallel operations based on the maximum number of files it can create (man getrlimit)
b) (ugly but easier) a configuration option is provided to limit the number of parallel connections nova will make

Option a) has the advantage that in cases where the limits are too low to do anything at all, the issue might be spotted *before* actually failing, and a clear error might be printed in the log file.
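
A minimal sketch of what option a) could look like: read the soft limit with getrlimit, derive a concurrency cap from it, and gate the Ceph-backed operations with a semaphore (the per-connection fd cost and helper names below are illustrative assumptions, not nova's actual code):

import resource
import threading

FDS_PER_CONNECTION = 50   # assumed descriptors consumed per RADOS connection
RESERVED_FDS = 256        # headroom for libvirt, RPC, logging, ...

soft, _hard = resource.getrlimit(resource.RLIMIT_NOFILE)
max_parallel = max(1, (soft - RESERVED_FDS) // FDS_PER_CONNECTION)

if max_parallel < 2:
    # The other benefit of option a): fail loudly *before* hitting EMFILE.
    print('WARNING: RLIMIT_NOFILE=%d is too low for this cluster size' % soft)

rbd_semaphore = threading.Semaphore(max_parallel)

def run_rbd_operation(operation):
    # Run an RBD-heavy operation without exceeding the derived cap.
    with rbd_semaphore:
        return operation()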

So far we were able to *mitigate* the issue by:

1) setting EVENTLET_THREADPOOL_SIZE to a lower value in the upstart script
2) increasing nofile (the open-files ulimit in the upstart script)
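
For reference, a rough Python equivalent of those two upstart-level changes, applied before the service imports eventlet (the specific values are examples only and need to be sized per deployment):

import os
import resource

# 1) Shrink the eventlet thread pool; eventlet.tpool reads this variable at
#    import time, so it must be set before nova-compute imports it.
os.environ.setdefault('EVENTLET_THREADPOOL_SIZE', '10')

# 2) Raise the open-files soft limit for this process (the programmatic
#    equivalent of a higher nofile ulimit in the upstart job).
soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
new_soft = 65536 if hard == resource.RLIM_INFINITY else min(65536, hard)
resource.setrlimit(resource.RLIMIT_NOFILE, (new_soft, hard))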

Revision history for this message
Sean Dague (sdague) wrote :

Automatically discovered version mitaka in description. If this is incorrect, please update the description to include 'nova version: ...'

tags: added: openstack-version.mitaka
Revision history for this message
David Hill (david-hill-ubisoft) wrote :

We hit this issue in Newton too, in case this is being treated as Mitaka-only.
