Fuel 2014.2-6.0 / Juno Ubuntu 12.04.4
1 Controller / 40 Compute Nodes
over 900 vCPUs and 3TB of memory
Deployment completes fine. The problem crops up when using Rally to spawn a large number of m1.medium VMs (2 vCPU / 4 GB RAM).
With small batches (50-200 VMs) there is usually no problem. The issue starts around 300-500 VMs: the compute service starts to go down, usually on 1-3 nodes, and I see the nodes go offline in Fuel. Sometimes when I IPMI in, the system appears completely hung; other times it requires a forced reboot. Restarting nova-compute does NOT help. The error shown in Fuel is:
Failed to publish message to topic 'conductor': [Errno 110] Connection timed out
2015-06-09 22:12:12.913 18502 TRACE oslo.messaging._drivers.impl_rabbit Traceback (most recent call last):
2015-06-09 22:12:12.913 18502 TRACE oslo.messaging._drivers.impl_rabbit File "/usr/lib/python2.7/dist-packages/oslo/messaging/_drivers/impl_rabbit.py", line 681, in ensure
2015-06-09 22:12:12.913 18502 TRACE oslo.messaging._drivers.impl_rabbit return method()
2015-06-09 22:12:12.913 18502 TRACE oslo.messaging._drivers.impl_rabbit File "/usr/lib/python2.7/dist-packages/oslo/messaging/_drivers/impl_rabbit.py", line 784, in _publish
2015-06-09 22:12:12.913 18502 TRACE oslo.messaging._drivers.impl_rabbit publisher = cls(self.conf, self.channel, topic=topic, **kwargs)
2015-06-09 22:12:12.913 18502 TRACE oslo.messaging._drivers.impl_rabbit File "/usr/lib/python2.7/dist-packages/oslo/messaging/_drivers/impl_rabbit.py", line 395, in __init__
2015-06-09 22:12:12.913 18502 TRACE oslo.messaging._drivers.impl_rabbit **options)
2015-06-09 22:12:12.913 18502 TRACE oslo.messaging._drivers.impl_rabbit File "/usr/lib/python2.7/dist-packages/oslo/messaging/_drivers/impl_rabbit.py", line 340, in __init__
2015-06-09 22:12:12.913 18502 TRACE oslo.messaging._drivers.impl_rabbit self.reconnect(channel)
2015-06-09 22:12:12.913 18502 TRACE oslo.messaging._drivers.impl_rabbit File "/usr/lib/python2.7/dist-packages/oslo/messaging/_drivers/impl_rabbit.py", line 348, in reconnect
2015-06-09 22:12:12.913 18502 TRACE oslo.messaging._drivers.impl_rabbit routing_key=self.routing_key)
2015-06-09 22:12:12.913 18502 TRACE oslo.messaging._drivers.impl_rabbit File "/usr/lib/python2.7/dist-packages/kombu/messaging.py", line 84, in __init__
2015-06-09 22:12:12.913 18502 TRACE oslo.messaging._drivers.impl_rabbit self.revive(self._channel)
2015-06-09 22:12:12.913 18502 TRACE oslo.messaging._drivers.impl_rabbit File "/usr/lib/python2.7/dist-packages/kombu/messaging.py", line 218, in revive
2015-06-09 22:12:12.913 18502 TRACE oslo.messaging._drivers.impl_rabbit self.declare()
2015-06-09 22:12:12.913 18502 TRACE oslo.messaging._drivers.impl_rabbit File "/usr/lib/python2.7/dist-packages/kombu/messaging.py", line 104, in declare
2015-06-09 22:12:12.913 18502 TRACE oslo.messaging._drivers.impl_rabbit self.exchange.declare()
2015-06-09 22:12:12.913 18502 TRACE oslo.messaging._drivers.impl_rabbit File "/usr/lib/python2.7/dist-packages/kombu/entity.py", line 166, in declare
2015-06-09 22:12:12.913 18502 TRACE oslo.messaging._drivers.impl_rabbit nowait=nowait, passive=passive,
2015-06-09 22:12:12.913 18502 TRACE oslo.messaging._drivers.impl_rabbit File "/usr/lib/python2.7/dist-packages/amqp/channel.py", line 620, in exchange_declare
2015-06-09 22:12:12.913 18502 TRACE oslo.messaging._drivers.impl_rabbit (40, 11), # Channel.exchange_declare_ok
2015-06-09 22:12:12.913 18502 TRACE oslo.messaging._drivers.impl_rabbit File "/usr/lib/python2.7/dist-packages/amqp/abstract_channel.py", line 67, in wait
2015-06-09 22:12:12.913 18502 TRACE oslo.messaging._drivers.impl_rabbit self.channel_id, allowed_methods)
2015-06-09 22:12:12.913 18502 TRACE oslo.messaging._drivers.impl_rabbit File "/usr/lib/python2.7/dist-packages/amqp/connection.py", line 237, in _wait_method
2015-06-09 22:12:12.913 18502 TRACE oslo.messaging._drivers.impl_rabbit self.method_reader.read_method()
2015-06-09 22:12:12.913 18502 TRACE oslo.messaging._drivers.impl_rabbit File "/usr/lib/python2.7/dist-packages/amqp/method_framing.py", line 189, in read_method
2015-06-09 22:12:12.913 18502 TRACE oslo.messaging._drivers.impl_rabbit raise m
2015-06-09 22:12:12.913 18502 TRACE oslo.messaging._drivers.impl_rabbit error: [Errno 110] Connection timed out
2015-06-09 22:12:12.913 18502 TRACE oslo.messaging._drivers.impl_rabbit
Another node shows a different error:
Exception during message handling: Failed to terminate process 15995 with SIGKILL: Device or resource busy
2015-06-09 20:48:26.679 18667 TRACE oslo.messaging.rpc.dispatcher Traceback (most recent call last):
2015-06-09 20:48:26.679 18667 TRACE oslo.messaging.rpc.dispatcher File "/usr/lib/python2.7/dist-packages/oslo/messaging/rpc/dispatcher.py", line 137, in _dispatch_and_reply
2015-06-09 20:48:26.679 18667 TRACE oslo.messaging.rpc.dispatcher incoming.message))
2015-06-09 20:48:26.679 18667 TRACE oslo.messaging.rpc.dispatcher File "/usr/lib/python2.7/dist-packages/oslo/messaging/rpc/dispatcher.py", line 180, in _dispatch
2015-06-09 20:48:26.679 18667 TRACE oslo.messaging.rpc.dispatcher return self._do_dispatch(endpoint, method, ctxt, args)
2015-06-09 20:48:26.679 18667 TRACE oslo.messaging.rpc.dispatcher File "/usr/lib/python2.7/dist-packages/oslo/messaging/rpc/dispatcher.py", line 126, in _do_dispatch
2015-06-09 20:48:26.679 18667 TRACE oslo.messaging.rpc.dispatcher result = getattr(endpoint, method)(ctxt, **new_args)
2015-06-09 20:48:26.679 18667 TRACE oslo.messaging.rpc.dispatcher File "/usr/lib/python2.7/dist-packages/nova/exception.py", line 88, in wrapped
2015-06-09 20:48:26.679 18667 TRACE oslo.messaging.rpc.dispatcher payload)
2015-06-09 20:48:26.679 18667 TRACE oslo.messaging.rpc.dispatcher File "/usr/lib/python2.7/dist-packages/nova/openstack/common/excutils.py", line 82, in __exit__
2015-06-09 20:48:26.679 18667 TRACE oslo.messaging.rpc.dispatcher six.reraise(self.type_, self.value, self.tb)
2015-06-09 20:48:26.679 18667 TRACE oslo.messaging.rpc.dispatcher File "/usr/lib/python2.7/dist-packages/nova/exception.py", line 71, in wrapped
2015-06-09 20:48:26.679 18667 TRACE oslo.messaging.rpc.dispatcher return f(self, context, *args, **kw)
2015-06-09 20:48:26.679 18667 TRACE oslo.messaging.rpc.dispatcher File "/usr/lib/python2.7/dist-packages/nova/compute/manager.py", line 298, in decorated_function
2015-06-09 20:48:26.679 18667 TRACE oslo.messaging.rpc.dispatcher pass
2015-06-09 20:48:26.679 18667 TRACE oslo.messaging.rpc.dispatcher File "/usr/lib/python2.7/dist-packages/nova/openstack/common/excutils.py", line 82, in __exit__
2015-06-09 20:48:26.679 18667 TRACE oslo.messaging.rpc.dispatcher six.reraise(self.type_, self.value, self.tb)
2015-06-09 20:48:26.679 18667 TRACE oslo.messaging.rpc.dispatcher File "/usr/lib/python2.7/dist-packages/nova/compute/manager.py", line 284, in decorated_function
2015-06-09 20:48:26.679 18667 TRACE oslo.messaging.rpc.dispatcher return function(self, context, *args, **kwargs)
2015-06-09 20:48:26.679 18667 TRACE oslo.messaging.rpc.dispatcher File "/usr/lib/python2.7/dist-packages/nova/compute/manager.py", line 348, in decorated_function
2015-06-09 20:48:26.679 18667 TRACE oslo.messaging.rpc.dispatcher return function(self, context, *args, **kwargs)
2015-06-09 20:48:26.679 18667 TRACE oslo.messaging.rpc.dispatcher File "/usr/lib/python2.7/dist-packages/nova/compute/manager.py", line 326, in decorated_function
2015-06-09 20:48:26.679 18667 TRACE oslo.messaging.rpc.dispatcher kwargs['instance'], e, sys.exc_info())
2015-06-09 20:48:26.679 18667 TRACE oslo.messaging.rpc.dispatcher File "/usr/lib/python2.7/dist-packages/nova/openstack/common/excutils.py", line 82, in __exit__
2015-06-09 20:48:26.679 18667 TRACE oslo.messaging.rpc.dispatcher six.reraise(self.type_, self.value, self.tb)
2015-06-09 20:48:26.679 18667 TRACE oslo.messaging.rpc.dispatcher File "/usr/lib/python2.7/dist-packages/nova/compute/manager.py", line 314, in decorated_function
2015-06-09 20:48:26.679 18667 TRACE oslo.messaging.rpc.dispatcher return function(self, context, *args, **kwargs)
2015-06-09 20:48:26.679 18667 TRACE oslo.messaging.rpc.dispatcher File "/usr/lib/python2.7/dist-packages/nova/compute/manager.py", line 2510, in terminate_instance
2015-06-09 20:48:26.679 18667 TRACE oslo.messaging.rpc.dispatcher do_terminate_instance(instance, bdms)
2015-06-09 20:48:26.679 18667 TRACE oslo.messaging.rpc.dispatcher File "/usr/lib/python2.7/dist-packages/nova/openstack/common/lockutils.py", line 272, in inner
2015-06-09 20:48:26.679 18667 TRACE oslo.messaging.rpc.dispatcher return f(*args, **kwargs)
2015-06-09 20:48:26.679 18667 TRACE oslo.messaging.rpc.dispatcher File "/usr/lib/python2.7/dist-packages/nova/compute/manager.py", line 2508, in do_terminate_instance
2015-06-09 20:48:26.679 18667 TRACE oslo.messaging.rpc.dispatcher self._set_instance_error_state(context, instance)
2015-06-09 20:48:26.679 18667 TRACE oslo.messaging.rpc.dispatcher File "/usr/lib/python2.7/dist-packages/nova/openstack/common/excutils.py", line 82, in __exit__
2015-06-09 20:48:26.679 18667 TRACE oslo.messaging.rpc.dispatcher six.reraise(self.type_, self.value, self.tb)
2015-06-09 20:48:26.679 18667 TRACE oslo.messaging.rpc.dispatcher File "/usr/lib/python2.7/dist-packages/nova/compute/manager.py", line 2498, in do_terminate_instance
2015-06-09 20:48:26.679 18667 TRACE oslo.messaging.rpc.dispatcher self._delete_instance(context, instance, bdms, quotas)
2015-06-09 20:48:26.679 18667 TRACE oslo.messaging.rpc.dispatcher File "/usr/lib/python2.7/dist-packages/nova/hooks.py", line 131, in inner
2015-06-09 20:48:26.679 18667 TRACE oslo.messaging.rpc.dispatcher rv = f(*args, **kwargs)
2015-06-09 20:48:26.679 18667 TRACE oslo.messaging.rpc.dispatcher File "/usr/lib/python2.7/dist-packages/nova/compute/manager.py", line 2467, in _delete_instance
2015-06-09 20:48:26.679 18667 TRACE oslo.messaging.rpc.dispatcher quotas.rollback()
2015-06-09 20:48:26.679 18667 TRACE oslo.messaging.rpc.dispatcher File "/usr/lib/python2.7/dist-packages/nova/openstack/common/excutils.py", line 82, in __exit__
2015-06-09 20:48:26.679 18667 TRACE oslo.messaging.rpc.dispatcher six.reraise(self.type_, self.value, self.tb)
2015-06-09 20:48:26.679 18667 TRACE oslo.messaging.rpc.dispatcher File "/usr/lib/python2.7/dist-packages/nova/compute/manager.py", line 2444, in _delete_instance
2015-06-09 20:48:26.679 18667 TRACE oslo.messaging.rpc.dispatcher self._shutdown_instance(context, instance, bdms)
2015-06-09 20:48:26.679 18667 TRACE oslo.messaging.rpc.dispatcher File "/usr/lib/python2.7/dist-packages/nova/compute/manager.py", line 2374, in _shutdown_instance
2015-06-09 20:48:26.679 18667 TRACE oslo.messaging.rpc.dispatcher requested_networks)
2015-06-09 20:48:26.679 18667 TRACE oslo.messaging.rpc.dispatcher File "/usr/lib/python2.7/dist-packages/nova/openstack/common/excutils.py", line 82, in __exit__
2015-06-09 20:48:26.679 18667 TRACE oslo.messaging.rpc.dispatcher six.reraise(self.type_, self.value, self.tb)
2015-06-09 20:48:26.679 18667 TRACE oslo.messaging.rpc.dispatcher File "/usr/lib/python2.7/dist-packages/nova/compute/manager.py", line 2363, in _shutdown_instance
2015-06-09 20:48:26.679 18667 TRACE oslo.messaging.rpc.dispatcher block_device_info)
2015-06-09 20:48:26.679 18667 TRACE oslo.messaging.rpc.dispatcher File "/usr/lib/python2.7/dist-packages/nova/virt/libvirt/driver.py", line 1054, in destroy
2015-06-09 20:48:26.679 18667 TRACE oslo.messaging.rpc.dispatcher self._destroy(instance)
2015-06-09 20:48:26.679 18667 TRACE oslo.messaging.rpc.dispatcher File "/usr/lib/python2.7/dist-packages/nova/virt/libvirt/driver.py", line 1010, in _destroy
2015-06-09 20:48:26.679 18667 TRACE oslo.messaging.rpc.dispatcher instance=instance)
2015-06-09 20:48:26.679 18667 TRACE oslo.messaging.rpc.dispatcher File "/usr/lib/python2.7/dist-packages/nova/openstack/common/excutils.py", line 82, in __exit__
2015-06-09 20:48:26.679 18667 TRACE oslo.messaging.rpc.dispatcher six.reraise(self.type_, self.value, self.tb)
2015-06-09 20:48:26.679 18667 TRACE oslo.messaging.rpc.dispatcher File "/usr/lib/python2.7/dist-packages/nova/virt/libvirt/driver.py", line 979, in _destroy
2015-06-09 20:48:26.679 18667 TRACE oslo.messaging.rpc.dispatcher virt_dom.destroy()
2015-06-09 20:48:26.679 18667 TRACE oslo.messaging.rpc.dispatcher File "/usr/lib/python2.7/dist-packages/eventlet/tpool.py", line 183, in doit
2015-06-09 20:48:26.679 18667 TRACE oslo.messaging.rpc.dispatcher result = proxy_call(self._autowrap, f, *args, **kwargs)
2015-06-09 20:48:26.679 18667 TRACE oslo.messaging.rpc.dispatcher File "/usr/lib/python2.7/dist-packages/eventlet/tpool.py", line 141, in proxy_call
2015-06-09 20:48:26.679 18667 TRACE oslo.messaging.rpc.dispatcher rv = execute(f, *args, **kwargs)
2015-06-09 20:48:26.679 18667 TRACE oslo.messaging.rpc.dispatcher File "/usr/lib/python2.7/dist-packages/eventlet/tpool.py", line 122, in execute
2015-06-09 20:48:26.679 18667 TRACE oslo.messaging.rpc.dispatcher six.reraise(c, e, tb)
2015-06-09 20:48:26.679 18667 TRACE oslo.messaging.rpc.dispatcher File "/usr/lib/python2.7/dist-packages/eventlet/tpool.py", line 80, in tworker
2015-06-09 20:48:26.679 18667 TRACE oslo.messaging.rpc.dispatcher rv = meth(*args, **kwargs)
2015-06-09 20:48:26.679 18667 TRACE oslo.messaging.rpc.dispatcher File "/usr/lib/python2.7/dist-packages/libvirt.py", line 918, in destroy
2015-06-09 20:48:26.679 18667 TRACE oslo.messaging.rpc.dispatcher if ret == -1: raise libvirtError ('virDomainDestroy() failed', dom=self)
2015-06-09 20:48:26.679 18667 TRACE oslo.messaging.rpc.dispatcher libvirtError: Failed to terminate process 15995 with SIGKILL: Device or resource busy
2015-06-09 20:48:26.679 18667 TRACE oslo.messaging.rpc.dispatcher
This message also appears very often; I am not sure whether it means anything:
2015-06-09 23:02:18 INFO
Unable to import guestfs, falling back to VFSLocalFS
2015-06-09 23:02:17 INFO
Compute_service record updated for ('node-193', 'node-193.domain.tld')
2015-06-09 23:02:17 INFO
Creating image
2015-06-09 23:02:17 WARNING
No network configured!
2015-06-09 23:02:17 INFO
Compute_service record updated for ('node-193', 'node-193.domain.tld')
2015-06-09 23:02:17 INFO
Compute_service record updated for ('node-193', 'node-193.domain.tld')
2015-06-09 23:02:17 WARNING
Claim successful
2015-06-09 23:02:17 WARNING
disk limit: 5529.00 GB, free: 5329.00 GB
2015-06-09 23:02:17 WARNING
Total disk: 5529 GB, used: 200.00 GB
2015-06-09 23:02:17 WARNING
memory limit: 96677.00 MB, free: 75685.00 MB
2015-06-09 23:02:17 WARNING
Total memory: 96677 MB, used: 20992.00 MB
2015-06-09 23:02:17 WARNING
Attempting claim: memory 4096 MB, disk 40 GB
2015-06-09 23:02:17 WARNING
Starting instance...
2015-06-09 23:02:17 INFO
Compute_service record updated for node-193:node-193.domain.tld
2015-06-09 23:02:17 INFO
Compute_service record updated for ('node-193', 'node-193.domain.tld')
2015-06-09 23:52:14 ERR
Error from last host: node-198 (node node-198.domain.tld): [u'Traceback (most recent call last):\n', u' File "/usr/lib/python2.7/dist-packages/nova/compute/manager.py", line 2014, in do_build_and_run_instance\n filter_properties)\n', u' File "/usr/lib/python2.7/dist-packages/nova/compute/manager.py", line 2149, in _build_and_run_instance\n instance_uuid=instance.uuid, reason=six.text_type(e))\n', u'RescheduledException: Build of instance f52e67b9-720c-4438-b665-08c634f7fdcd was re-scheduled: Cannot write data: Broken pipe\n']
2015-06-09 23:37:37 ERR
Error from last host: node-158 (node node-158.domain.tld): [u'Traceback (most recent call last):\n', u' File "/usr/lib/python2.7/dist-packages/nova/compute/manager.py", line 2014, in do_build_and_run_instance\n filter_properties)\n', u' File "/usr/lib/python2.7/dist-packages/nova/compute/manager.py", line 2149, in _build_and_run_instance\n instance_uuid=instance.uuid, reason=six.text_type(e))\n', u"RescheduledException: Build of instance 4b7c3a2f-8592-4910-aa94-ac0f093535b3 was re-scheduled: [Errno 5] Input/output error: '/var/lib/nova/instances/4b7c3a2f-8592-4910-aa94-ac0f093535b3'\n"]
2015-06-09 23:37:35 ERR
Error from last host: node-158 (node node-158.domain.tld): [u'Traceback (most recent call last):\n', u' File "/usr/lib/python2.7/dist-packages/nova/compute/manager.py", line 2014, in do_build_and_run_instance\n filter_properties)\n', u' File "/usr/lib/python2.7/dist-packages/nova/compute/manager.py", line 2149, in _build_and_run_instance\n instance_uuid=instance.uuid, reason=six.text_type(e))\n', u"RescheduledException: Build of instance 71993a19-a08c-4f39-91db-0913e302b0ff was re-scheduled: [Errno 5] Input/output error: '/var/lib/nova/instances/71993a19-a08c-4f39-91db-0913e302b0ff'\n"]
2015-06-09 23:30:14 ERR
Error from last host: node-191 (node node-191.domain.tld): [u'Traceback (most recent call last):\n', u' File "/usr/lib/python2.7/dist-packages/nova/compute/manager.py", line 2014, in do_build_and_run_instance\n filter_properties)\n', u' File "/usr/lib/python2.7/dist-packages/nova/compute/manager.py", line 2149, in _build_and_run_instance\n instance_uuid=instance.uuid, reason=six.text_type(e))\n', u'RescheduledException: Build of instance c8237bb2-34a0-4ed9-aac3-b86ee68e019a was re-scheduled: Unexpected error while running command.\nCommand: qemu-img create -f qcow2 -o backing_file=/var/lib/nova/instances/_base/168203d2c86483edf54feb5a31fa52931fd0afeb /var/lib/nova/instances/c8237bb2-34a0-4ed9-aac3-b86ee68e019a/disk\nExit code: 1\nStdout: u"Formatting \'/var/lib/nova/instances/c8237bb2-34a0-4ed9-aac3-b86ee68e019a/disk\', fmt=qcow2 size=2361393152 backing_file=\'/var/lib/nova/instances/_base/168203d2c86483edf54feb5a31fa52931fd0afeb\' encryption=off cluster_size=65536 lazy_refcounts=off \\n"\nStderr: u\'qemu-img: /var/lib/nova/instances/c8237bb2-34a0-4ed9-aac3-b86ee68e019a/disk: Could not write qcow2 header: Input/output error\\n\'\n']
I see two possible reasons for nodes going down:
* either you overestimate the total capacity of your cloud,
* or the default overcommit ratios are too high in your case, which causes failures when the scheduler allocates too many instances to some compute nodes.
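As a rough sanity check of the second point, here is a back-of-the-envelope calculation using the cluster figures from the question and the Juno *default* overcommit ratios (cpu_allocation_ratio=16.0, ram_allocation_ratio=1.5 — your actual nova.conf may differ, so treat this as a sketch, not the scheduler's exact logic):

```python
# Back-of-the-envelope capacity check under assumed Nova overcommit defaults.
# Cluster figures come from the question; the ratios are Juno defaults and
# may not match this deployment's actual nova.conf.

PHYS_VCPUS = 900                # total vCPUs across 40 compute nodes
PHYS_RAM_MB = 3 * 1024 * 1024   # ~3 TB of memory, in MB

CPU_RATIO = 16.0                # assumed default cpu_allocation_ratio
RAM_RATIO = 1.5                 # assumed default ram_allocation_ratio

FLAVOR_VCPUS = 2                # m1.medium
FLAVOR_RAM_MB = 4096

# How many m1.medium instances the scheduler would consider placeable:
schedulable_by_cpu = int(PHYS_VCPUS * CPU_RATIO) // FLAVOR_VCPUS
schedulable_by_ram = int(PHYS_RAM_MB * RAM_RATIO) // FLAVOR_RAM_MB

# How many fit physically, with no overcommit at all:
physical_by_cpu = PHYS_VCPUS // FLAVOR_VCPUS
physical_by_ram = PHYS_RAM_MB // FLAVOR_RAM_MB

print("scheduler limit:", min(schedulable_by_cpu, schedulable_by_ram), "VMs")
print("physical limit: ", min(physical_by_cpu, physical_by_ram), "VMs")
```

Under these assumed defaults the scheduler would admit roughly 1150 instances (RAM-bound), while only about 450 fit before the CPUs are oversubscribed, so a 300-500 VM run already lands well inside the overcommitted zone even though cluster-wide RAM is not yet exhausted.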
To check the latter case, try reducing the default overcommit ratios and rerun the test. The parameters are named cpu_allocation_ratio and ram_allocation_ratio and can be found in /etc/nova/nova.conf on the controllers. You can read more about overcommitting and Nova parameters in these docs:
http://docs.openstack.org/openstack-ops/content/compute_nodes.html#overcommit
http://docs.openstack.org/juno/config-reference/content/list-of-compute-config-options.html
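For example, to halve the overcommit you could set something like the following in /etc/nova/nova.conf on the controllers and restart the scheduler (the values below are illustrative starting points, not recommendations for your hardware):

```ini
[DEFAULT]
# Juno defaults are cpu_allocation_ratio = 16.0 and ram_allocation_ratio = 1.5;
# these more conservative values are only an illustrative starting point.
cpu_allocation_ratio = 8.0
ram_allocation_ratio = 1.0
```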
If that does not help, try monitoring the compute nodes during the tests to understand what causes them to hang (CPU/memory exhaustion, something else?). The errors you observe in the OpenStack logs seem to be a consequence of the nodes hanging, not the cause.
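For the monitoring, something as simple as the sampler below, run every few seconds on a handful of compute nodes, is usually enough to capture load, memory pressure, and OOM-killer activity before a hard hang (a sketch; log path and commands are assumptions, adjust to taste):

```shell
#!/bin/sh
# One-shot sampler: append load, memory, and recent kernel messages to a log.
# Run it periodically, e.g.:  watch -n 10 ./sample.sh   or from a while loop.
LOG=${LOG:-/var/tmp/compute-monitor.log}

{
    date
    uptime                       # load averages: hung nodes often show runaway load
    free -m                      # memory and swap pressure
    dmesg 2>/dev/null | tail -n 5  # OOM kills and blocked-task warnings land here
    echo '---'
} >> "$LOG"
sync    # flush the log to disk so the data survives a hard hang
```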