nova-compute reports OSError: [Errno 24] Too many open files
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
OpenStack Compute (nova) | Confirmed | Low | Unassigned |
Bug Description
Description
===========
After we upgraded our OpenStack deployment to Mitaka, we started noticing more and more VMs in an error and/or shutoff state. We found that although nova service-list reported every nova-compute service as "up", nova-compute was logging:
ERROR nova.compute.
Our deployment uses Ceph as the storage backend for ephemeral/
Steps to reproduce
==================
Ask a compute node to perform several actions at once such as:
- take a live snapshot
- delete a VM
- boot a VM
Depending on the task, the compute node will complete a certain number of jobs before it starts complaining about the number of open files.
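A quick way to confirm the limit is actually being hit is to compare the process's open descriptor count against its soft RLIMIT_NOFILE. The sketch below is Linux-only (it reads /proc) and inspects the current process for illustration; on an affected node you would pass nova-compute's PID instead of "self":

```python
import os
import resource

def fd_usage(pid="self"):
    """Return (open_fds, soft_limit) for a process.

    Counts entries in /proc/<pid>/fd, so this only works on Linux and
    only for processes you are allowed to inspect. Pass the PID as a
    string, e.g. fd_usage("102936") for the nova-compute worker above.
    """
    open_fds = len(os.listdir("/proc/%s/fd" % pid))
    soft, _hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    return open_fds, soft

used, limit = fd_usage()
print("open fds: %d, soft limit: %d" % (used, limit))
```

If `used` keeps climbing toward `limit` while the node works through jobs, descriptors are leaking rather than merely spiking.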
Expected behaviour
==================
Nova-compute should handle bursts of requests more robustly and should report when it is in an error state.
Environment
===========
We are currently running:
nova-common 2:13.1.
nova-compute 2:13.1.
hypervisor: Libvirt + KVM
nova-compute-kvm 2:13.1.
nova-compute-
storage:
ceph 0.94.9-1trusty
networking: neutron + openvswitch
neutron-common 2:8.3.0-
Logs & Configs
==============
tags added: ceph compute
The compute logs are all line-wrapped, which makes them hard to read, so I've reformatted them; the error is here:
http://paste.openstack.org/show/594211/
2016-12-20 17:02:11.554 102936 DEBUG oslo_service.periodic_task [req-442fe967-9b86-4ddc-9548-e569f226b508 - - - - -] Running periodic task ComputeManager.update_available_resource run_periodic_tasks /usr/lib/python2.7/dist-packages/oslo_service/periodic_task.py:215
resource_tracker [req-442fe967-9b86-4ddc-9548-e569f226b508 - - - - -] Auditing locally available compute resources for node node-l4-21-16
manager [req-442fe967-9b86-4ddc-9548-e569f226b508 - - - - -] Error updating resources for node node-l4-21-16.
Traceback (most recent call last):
  File "/usr/lib/python2.7/dist-packages/nova/compute/manager.py", line 6487, in update_available_resource
    rt.update_available_resource(context)
  File "/usr/lib/python2.7/dist-packages/nova/compute/resource_tracker.py", line 508, in update_available_resource
  File "/usr/lib/python2.7/dist-packages/nova/virt/libvirt/driver.py", line 5365, in get_available_resource
  File "/usr/lib/python2.7/dist-packages/nova/virt/libvirt/driver.py", line 5006, in _get_local_gb_info
  File "/usr/lib/python2.7/dist-packages/nova/virt/libvirt/storage/rbd_utils.py", line 380, in get_pool_info
  File "/usr/lib/python2.7/dist-packages/nova/virt/libvirt/storage/rbd_utils.py", line 107, in __init__
  File "/usr/lib/python2.7/dist-packages/nova/virt/libvirt/storage/rbd_utils.py", line 136, in _connect_to_rados
  File "/usr/lib/python2.7/dist-packages/rados.py", line 212, in __init__
  File "/usr/lib/python2.7/ctypes/util.py", line 253, in find_library
  File "/usr/lib/python2.7/ctypes/util.py", line 242, in _findSoname_ldconfig
OSError: [Errno 24] Too many open files
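For reference, the failure mode at the bottom of that traceback is easy to reproduce in isolation; nothing here is nova-specific, it is just what any Python process sees once RLIMIT_NOFILE is exhausted. This sketch lowers the soft limit so exhaustion is quick to hit, then restores it:

```python
import errno
import resource

# Lower the soft fd limit so we exhaust it after a handful of opens.
soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
resource.setrlimit(resource.RLIMIT_NOFILE, (32, hard))

handles = []
caught = None
try:
    while True:
        handles.append(open("/dev/null"))
except OSError as exc:
    caught = exc          # OSError: [Errno 24] Too many open files
finally:
    for h in handles:
        h.close()
    # Restore the original soft limit.
    resource.setrlimit(resource.RLIMIT_NOFILE, (soft, hard))

assert caught.errno == errno.EMFILE
print(caught)
```

Once a process is in this state, even unrelated calls fail, which is why the crash surfaces in ctypes' `find_library` (it needs to spawn ldconfig) rather than in the code that actually leaked the descriptors.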
We end up failing here:
https://github.com/openstack/nova/blob/stable/mitaka/nova/compute/manager.py#L6498
The resource tracker doesn't change the service record state; that's handled in the servicegroup API based on whether or not the service is reporting in at a regular interval, which is a different thing from what's failing here, namely getting the local used-disk info.
The compute node record itself doesn't have a state or status; that's all determined through the associated services record.
The libvirt driver could disable the service itself, like it does when libvirt is disconnected. That happens via even...
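A minimal sketch of the idea in that last paragraph, with every name hypothetical (the real nova resource tracker and libvirt driver do not look like this; the point is only the pattern: catch EMFILE in the periodic task and disable the service with a reason instead of failing silently forever):

```python
import errno

class HypotheticalResourceTracker(object):
    """Sketch only: illustrates disabling a service on fd exhaustion.

    `service_api` is an assumed object exposing disable(reason), standing
    in for whatever mechanism the libvirt driver uses when it loses its
    libvirtd connection.
    """

    def __init__(self, service_api):
        self.service_api = service_api

    def update_available_resource(self, get_local_disk_info):
        try:
            return get_local_disk_info()
        except OSError as exc:
            if exc.errno == errno.EMFILE:
                # Mark the service disabled so schedulers and operators
                # can see the node is unhealthy, then re-raise.
                self.service_api.disable("fd exhaustion: %s" % exc)
            raise
```

This would at least make `nova service-list` reflect reality, which is the behaviour the reporter asked for, even before the underlying descriptor leak is fixed.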