The compute logs are all line-wrapped so they're hard to read; I've reformatted the trace and the error is here: http://paste.openstack.org/show/594211/

2016-12-20 17:02:11.554 102936 DEBUG oslo_service.periodic_task [req-442fe967-9b86-4ddc-9548-e569f226b508 - - - - -] Running periodic task ComputeManager.update_available_resource run_periodic_tasks /usr/lib/python2.7/dist-packages/oslo_service/periodic_task.py:215
2016-12-20 17:02:11.576 102936 INFO nova.compute.resource_tracker [req-442fe967-9b86-4ddc-9548-e569f226b508 - - - - -] Auditing locally available compute resources for node node-l4-21-16
2016-12-20 17:02:11.577 102936 ERROR nova.compute.manager [req-442fe967-9b86-4ddc-9548-e569f226b508 - - - - -] Error updating resources for node node-l4-21-16.
2016-12-20 17:02:11.577 102936 ERROR nova.compute.manager Traceback (most recent call last):
2016-12-20 17:02:11.577 102936 ERROR nova.compute.manager   File "/usr/lib/python2.7/dist-packages/nova/compute/manager.py", line 6487, in update_available_resource
2016-12-20 17:02:11.577 102936 ERROR nova.compute.manager     rt.update_available_resource(context)
2016-12-20 17:02:11.577 102936 ERROR nova.compute.manager   File "/usr/lib/python2.7/dist-packages/nova/compute/resource_tracker.py", line 508, in update_available_resource
2016-12-20 17:02:11.577 102936 ERROR nova.compute.manager   File "/usr/lib/python2.7/dist-packages/nova/virt/libvirt/driver.py", line 5365, in get_available_resource
2016-12-20 17:02:11.577 102936 ERROR nova.compute.manager   File "/usr/lib/python2.7/dist-packages/nova/virt/libvirt/driver.py", line 5006, in _get_local_gb_info
2016-12-20 17:02:11.577 102936 ERROR nova.compute.manager   File "/usr/lib/python2.7/dist-packages/nova/virt/libvirt/storage/rbd_utils.py", line 380, in get_pool_info
2016-12-20 17:02:11.577 102936 ERROR nova.compute.manager   File "/usr/lib/python2.7/dist-packages/nova/virt/libvirt/storage/rbd_utils.py", line 107, in __init__
2016-12-20 17:02:11.577 102936 ERROR nova.compute.manager   File "/usr/lib/python2.7/dist-packages/nova/virt/libvirt/storage/rbd_utils.py", line 136, in _connect_to_rados
2016-12-20 17:02:11.577 102936 ERROR nova.compute.manager   File "/usr/lib/python2.7/dist-packages/rados.py", line 212, in __init__
2016-12-20 17:02:11.577 102936 ERROR nova.compute.manager   File "/usr/lib/python2.7/ctypes/util.py", line 253, in find_library
2016-12-20 17:02:11.577 102936 ERROR nova.compute.manager   File "/usr/lib/python2.7/ctypes/util.py", line 242, in _findSoname_ldconfig
2016-12-20 17:02:11.577 102936 ERROR nova.compute.manager OSError: [Errno 24] Too many open files

We end up failing here:

https://github.com/openstack/nova/blob/stable/mitaka/nova/compute/manager.py#L6498

The resource tracker doesn't change the service record state; that's handled in the service group API based on whether the service is reporting in on a regular interval, which is a separate mechanism from what's failing here, which is getting local used disk info. The compute node record itself doesn't have a state or status; that's all determined through the associated services record. The libvirt driver could disable the service itself, though, like it does when libvirt is disconnected.
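To spell out why nothing gets disabled today, the handling at that line is roughly the following (paraphrased from the stable/mitaka code, not verbatim):

    # Sketch of the loop in ComputeManager.update_available_resource
    # (nova/compute/manager.py, stable/mitaka), paraphrased.
    for nodename in nodenames:
        rt = self._get_resource_tracker(nodename)
        try:
            # This walks down through the libvirt driver into
            # RBDDriver.get_pool_info(), which is where the
            # OSError (Errno 24) in the paste above escapes from.
            rt.update_available_resource(context)
        except Exception:
            # The failure is logged and swallowed; the service record
            # is never touched, so the host still looks up and enabled.
            LOG.exception(_LE("Error updating resources for node "
                              "%(node)s."), {'node': nodename})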
That happens via the event callback here:

https://github.com/openstack/nova/blob/stable/mitaka/nova/virt/libvirt/driver.py#L3445

So when we hit the too-many-open-files error here:

https://github.com/openstack/nova/blob/stable/mitaka/nova/virt/libvirt/driver.py#L5006

we could disable the service for that host... but I'm not sure how we'd re-enable it. I guess we could set a flag in memory when disabling the service because of that failure, and then on the next periodic check, if get_pool_info() passes and the flag was set, we enable the service again. It doesn't seem like an awesome solution, but it could work for automatically disabling/enabling the service during the update_available_resource periodic task, and disabling the service might trigger an alarm for the operator to see that things are going haywire in the cluster.
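Something like this is what I'm picturing. Pure sketch: it assumes _get_local_gb_info() becomes a bound method so it can reach driver state, the _pool_info_failed attribute is made up (it would be initialized to False in __init__), and _set_host_enabled() is the same helper the disconnect callback linked above uses:

    # Hypothetical auto disable/enable around the RBD pool info check in
    # nova/virt/libvirt/driver.py; names flagged above are invented.
    def _get_local_gb_info(self):
        if CONF.libvirt.images_type == 'rbd':
            try:
                info = LibvirtDriver._get_rbd_driver().get_pool_info()
            except OSError:
                # e.g. [Errno 24] Too many open files out of librados.
                if not self._pool_info_failed:
                    self._pool_info_failed = True
                    # Disabling the service makes the host show up as
                    # disabled in `nova service-list`, which doubles as
                    # the operator-visible alarm.
                    self._set_host_enabled(False,
                                           'RBD get_pool_info() failed')
                raise
            if self._pool_info_failed:
                # We were the ones who disabled the service, and the
                # check passes again, so it's safe to auto re-enable.
                self._pool_info_failed = False
                self._set_host_enabled(True)
            # (existing conversion of the pool stats to GB elided)
            return info
        ...

Re-raising after disabling keeps the current behavior of the periodic task, i.e. the error still gets logged in update_available_resource, we just also flip the service record so the scheduler stops sending builds to the host.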