nova-compute reports OSError: [Errno 24] Too many open files

Bug #1651526 reported by Mattia Belluco
This bug affects 6 people
Affects Status Importance Assigned to Milestone
OpenStack Compute (nova)
Confirmed
Low
Unassigned

Bug Description

Description
===========
After we upgraded our OpenStack deployment to Mitaka, we started noticing more and more VMs in an error and/or shutoff state. We found out that although nova service-list was reporting all nova-compute services as "up", nova-compute was logging:

ERROR nova.compute.manager OSError: [Errno 24] Too many open files

Our deployment uses Ceph as the storage backend for ephemeral/Cinder/Glance, currently with around 900 OSDs installed.
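
A quick way to confirm the descriptor exhaustion on the compute node is to compare the nova-compute process's open file descriptors against its limit. A minimal diagnostic sketch (assuming a Linux host and that the PID is supplied on the command line, e.g. from pgrep -f nova-compute):

#!/usr/bin/env python
# Diagnostic sketch: compare nova-compute's open descriptors to its soft
# RLIMIT_NOFILE. Assumes a Linux /proc filesystem; pass the PID as argv[1].
import os
import sys

pid = int(sys.argv[1])

# Descriptors currently open by the process.
open_fds = len(os.listdir('/proc/%d/fd' % pid))

# Soft limit ("Max open files") from /proc/<pid>/limits.
soft_limit = None
with open('/proc/%d/limits' % pid) as f:
    for line in f:
        if line.startswith('Max open files'):
            soft_limit = int(line.split()[3])
            break

print('open fds: %s / soft limit: %s' % (open_fds, soft_limit))
if soft_limit and open_fds > 0.9 * soft_limit:
    print('WARNING: close to EMFILE; expect "Too many open files" errors')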

Steps to reproduce
==================

Ask a compute node to perform several actions at once, such as:
- take a live snapshot
- delete a VM
- boot a VM

Depending on the task, the compute node will allow a certain number of jobs before complaining about the number of open files.
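
A rough reproduction sketch of the above, using concurrent CLI calls (server and image names are placeholders, and in practice the instances should all live on the same compute node, e.g. by pinning them to one host or availability zone):

# Reproduction sketch: launch several operations against one compute node at
# once and watch the nova-compute log for EMFILE. Names below are placeholders
# and the `openstack` CLI must already be configured with credentials.
import subprocess
import threading

commands = [
    ['openstack', 'server', 'image', 'create', '--name', 'snap-vm1', 'vm1'],  # live snapshot
    ['openstack', 'server', 'delete', 'vm2'],
    ['openstack', 'server', 'create', '--flavor', 'm1.small',
     '--image', 'cirros', '--network', 'private', 'vm3'],
]

threads = [threading.Thread(target=subprocess.check_call, args=(cmd,))
           for cmd in commands]
for t in threads:
    t.start()
for t in threads:
    t.join()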

Expected behaviour
==================
Nova-compute should be more robust in handling requests and should report when it is in an error state.

Environment
===========
We are currently running:

nova-common 2:13.1.2-0ubuntu2~cloud0
nova-compute 2:13.1.2-0ubuntu2~cloud0

hypervisor: Libvirt + KVM
nova-compute-kvm 2:13.1.2-0ubuntu2~cloud0
nova-compute-libvirt 2:13.1.2-0ubuntu2~cloud0

storage:
ceph 0.94.9-1trusty

networking: neutron + openvswitch
neutron-common 2:8.3.0-0ubuntu1.1~cloud0

Logs & Configs
==============

Revision history for this message
Mattia Belluco (blmt) wrote :
Matt Riedemann (mriedem)
tags: added: ceph compute
Revision history for this message
Matt Riedemann (mriedem) wrote :

The compute logs are all wrapped, which makes them hard to read, so I've reformatted them; the error is here:

http://paste.openstack.org/show/594211/

2016-12-20 17:02:11.554 102936 DEBUG oslo_service.periodic_task [req-442fe967-9b86-4ddc-9548-e569f226b508 - - - - -] Running periodic task ComputeManager.update_available_resource run_periodic_tasks /usr/lib/python2.7/dist-packages/oslo_service/periodic_task.py:215
2016-12-20 17:02:11.576 102936 INFO nova.compute.resource_tracker [req-442fe967-9b86-4ddc-9548-e569f226b508 - - - - -] Auditing locally available compute resources for node node-l4-21-16
2016-12-20 17:02:11.577 102936 ERROR nova.compute.manager [req-442fe967-9b86-4ddc-9548-e569f226b508 - - - - -] Error updating resources for node node-l4-21-16.
2016-12-20 17:02:11.577 102936 ERROR nova.compute.manager Traceback (most recent call last):
2016-12-20 17:02:11.577 102936 ERROR nova.compute.manager File "/usr/lib/python2.7/dist-packages/nova/compute/manager.py", line 6487, in update_available_resource
2016-12-20 17:02:11.577 102936 ERROR nova.compute.manager rt.update_available_resource(context)
2016-12-20 17:02:11.577 102936 ERROR nova.compute.manager File "/usr/lib/python2.7/dist-packages/nova/compute/resource_tracker.py", line 508, in update_available_resource
2016-12-20 17:02:11.577 102936 ERROR nova.compute.manager File "/usr/lib/python2.7/dist-packages/nova/virt/libvirt/driver.py", line 5365, in get_available_resource
2016-12-20 17:02:11.577 102936 ERROR nova.compute.manager File "/usr/lib/python2.7/dist-packages/nova/virt/libvirt/driver.py", line 5006, in _get_local_gb_info
2016-12-20 17:02:11.577 102936 ERROR nova.compute.manager File "/usr/lib/python2.7/dist-packages/nova/virt/libvirt/storage/rbd_utils.py", line 380, in get_pool_info
2016-12-20 17:02:11.577 102936 ERROR nova.compute.manager File "/usr/lib/python2.7/dist-packages/nova/virt/libvirt/storage/rbd_utils.py", line 107, in __init__
2016-12-20 17:02:11.577 102936 ERROR nova.compute.manager File "/usr/lib/python2.7/dist-packages/nova/virt/libvirt/storage/rbd_utils.py", line 136, in _connect_to_rados
2016-12-20 17:02:11.577 102936 ERROR nova.compute.manager File "/usr/lib/python2.7/dist-packages/rados.py", line 212, in __init__
2016-12-20 17:02:11.577 102936 ERROR nova.compute.manager File "/usr/lib/python2.7/ctypes/util.py", line 253, in find_library
2016-12-20 17:02:11.577 102936 ERROR nova.compute.manager File "/usr/lib/python2.7/ctypes/util.py", line 242, in _findSoname_ldconfig
2016-12-20 17:02:11.577 102936 ERROR nova.compute.manager OSError: [Errno 24] Too many open files

We end up failing here:

https://github.com/openstack/nova/blob/stable/mitaka/nova/compute/manager.py#L6498
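
The calls in that traceback boil down to a connect-per-call pattern: every pool-stats query builds a new RADOS client, and each client holds sockets and other descriptors open until it is shut down. A minimal illustration of that pattern using the python-rados API (a simplified sketch, not nova's actual rbd_utils code):

import rados

def get_pool_stats(conffile='/etc/ceph/ceph.conf'):
    # New client per call; connect() opens sockets to the cluster, and with
    # hundreds of OSDs several concurrent calls consume descriptors quickly.
    cluster = rados.Rados(conffile=conffile)
    cluster.connect()
    try:
        return cluster.get_cluster_stats()  # {'kb': ..., 'kb_used': ..., ...}
    finally:
        cluster.shutdown()  # descriptors pile up if this is skipped or delayed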

The resource tracker doesn't change the service record state; that's handled in the service group API based on whether or not the service is reporting in on a regular interval, which is different from what's failing here, namely getting local used disk info.

The compute node record itself doesn't have a state or status; that's all determined through the associated services record.

The libvirt driver could disable the service itself, like it does when libvirt is disconnected. That happens via even...


Revision history for this message
Matt Riedemann (mriedem) wrote :

I've also posted to the dev and operators mailing lists about this issue and to get some feedback on how to fix it:

http://lists.openstack.org/pipermail/openstack-dev/2017-January/109780.html

Changed in nova:
status: New → Confirmed
importance: Undecided → Low
Revision history for this message
Antonio Messina (arcimboldo) wrote :

Sorry, but disabling the compute node doesn't solve the issue; although you might want to change that behavior, it is a solution to a different problem.

The problem here is twofold:
* the Mitaka+ version of nova now creates multiple threads to delete VMs in parallel (and create snapshots, etc.), and as a consequence is more connection-hungry
* nova is not dealing properly with fd starvation.

Since the number of connections created is a function of the size of the Ceph cluster, I would expect either:
a) nova limits the number of parallel operations based on the maximum number of files it can create (man getrlimit)
b) (ugly but easier) a configuration option is provided to limit the number of parallel connections nova will make

Option a) has the advantage that in cases where the limits are too low to do anything at all, the issue might be spotted *before* actually failing, and a clear error might be printed in the log file.
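
A minimal sketch of what option a) could look like: read the soft limit with getrlimit, derive a concurrency cap from it, and gate the Ceph-backed operations with a semaphore (the per-connection fd cost and helper names below are illustrative assumptions, not nova's actual code):

import resource
import threading

FDS_PER_CONNECTION = 50   # assumed descriptors consumed per RADOS connection
RESERVED_FDS = 256        # headroom for libvirt, RPC, logging, ...

soft, _hard = resource.getrlimit(resource.RLIMIT_NOFILE)
max_parallel = max(1, (soft - RESERVED_FDS) // FDS_PER_CONNECTION)

if max_parallel < 2:
    # The other benefit of option a): fail loudly *before* hitting EMFILE.
    print('WARNING: RLIMIT_NOFILE=%d is too low for this cluster size' % soft)

rbd_semaphore = threading.Semaphore(max_parallel)

def run_rbd_operation(operation):
    # Run an RBD-heavy operation without exceeding the derived cap.
    with rbd_semaphore:
        return operation()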

So far we were able to *mitigate* the issue by:

1) setting EVENTLET_THREADPOOL_SIZE to a lower value in the upstart script
2) increasing nofile (the open-files ulimit in the upstart script)
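
For reference, a rough Python equivalent of those two upstart-level changes, applied before the service imports eventlet (the specific values are examples only and need to be sized per deployment):

import os
import resource

# 1) Shrink the eventlet thread pool; eventlet.tpool reads this variable at
#    import time, so it must be set before nova-compute imports it.
os.environ.setdefault('EVENTLET_THREADPOOL_SIZE', '10')

# 2) Raise the open-files soft limit for this process (the programmatic
#    equivalent of a higher nofile ulimit in the upstart job).
soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
new_soft = 65536 if hard == resource.RLIM_INFINITY else min(65536, hard)
resource.setrlimit(resource.RLIMIT_NOFILE, (new_soft, hard))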

Revision history for this message
Sean Dague (sdague) wrote :

Automatically discovered version mitaka in description. If this is incorrect, please update the description to include 'nova version: ...'

tags: added: openstack-version.mitaka
Revision history for this message
David Hill (david-hill-ubisoft) wrote :

We hit this issue in Newton too, in case this is being treated as Mitaka-only.
