Instances go to error state with RPC timeout
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
OpenStack Compute (nova) | Invalid | Undecided | Unassigned |
Bug Description
We've been seeing a trend of people being unable to start instances. Investigation turns up the following tracebacks in nova-compute.log on the affected compute nodes:
2012-04-04 14:21:10 ERROR nova.compute.
(nova.compute.
(nova.compute.
(nova.compute.
(nova.compute.
(nova.compute.
(nova.compute.
(nova.compute.
(nova.compute.
(nova.compute.
(nova.compute.
(nova.compute.
(nova.compute.
(nova.compute.
(nova.compute.
(nova.compute.
(nova.compute.
(nova.compute.
(nova.compute.
(nova.compute.
(nova.compute.
(nova.compute.
2012-04-04 14:21:10 ERROR nova.rpc.amqp [-] Exception during message handling
(nova.rpc.amqp): TRACE: Traceback (most recent call last):
(nova.rpc.amqp): TRACE: File "/usr/lib/
(nova.rpc.amqp): TRACE: rval = node_func(
(nova.rpc.amqp): TRACE: File "/usr/lib/
(nova.rpc.amqp): TRACE: return f(*args, **kw)
(nova.rpc.amqp): TRACE: File "/usr/lib/
(nova.rpc.amqp): TRACE: sys.exc_info())
(nova.rpc.amqp): TRACE: File "/usr/lib/
(nova.rpc.amqp): TRACE: self.gen.next()
(nova.rpc.amqp): TRACE: File "/usr/lib/
(nova.rpc.amqp): TRACE: return function(self, context, instance_uuid, *args, **kwargs)
(nova.rpc.amqp): TRACE: File "/usr/lib/
(nova.rpc.amqp): TRACE: self._run_
(nova.rpc.amqp): TRACE: File "/usr/lib/
(nova.rpc.amqp): TRACE: self._set_
(nova.rpc.amqp): TRACE: File "/usr/lib/
(nova.rpc.amqp): TRACE: self.gen.next()
(nova.rpc.amqp): TRACE: File "/usr/lib/
(nova.rpc.amqp): TRACE: requested_networks)
(nova.rpc.amqp): TRACE: File "/usr/lib/
(nova.rpc.amqp): TRACE: requested_
(nova.rpc.amqp): TRACE: File "/usr/lib/
(nova.rpc.amqp): TRACE: 'args': args})
(nova.rpc.amqp): TRACE: File "/usr/lib/
(nova.rpc.amqp): TRACE: return _get_impl(
(nova.rpc.amqp): TRACE: File "/usr/lib/
(nova.rpc.amqp): TRACE: return rpc_amqp.
(nova.rpc.amqp): TRACE: File "/usr/lib/
(nova.rpc.amqp): TRACE: rv = list(rv)
(nova.rpc.amqp): TRACE: File "/usr/lib/
(nova.rpc.amqp): TRACE: self._iterator.
(nova.rpc.amqp): TRACE: File "/usr/lib/
(nova.rpc.amqp): TRACE: yield self.ensure(
(nova.rpc.amqp): TRACE: File "/usr/lib/
(nova.rpc.amqp): TRACE: error_callback(e)
(nova.rpc.amqp): TRACE: File "/usr/lib/
(nova.rpc.amqp): TRACE: raise rpc_common.
(nova.rpc.amqp): TRACE: Timeout: Timeout while waiting on RPC response.
(nova.rpc.amqp): TRACE:
The instance is then destroyed because it couldn't come up cleanly. Restarting nova-network on the network manager and then nova-compute on the compute node seems to fix this for a time, but the problem recurs after a few hours. Is there further debugging information we can provide? I haven't found any log messages that appear related in the nova-network or rabbitmq logs.
Changed in nova: status: Incomplete → Fix Released
Changed in nova: status: Fix Released → Invalid
This should only happen if nova-network is blocking for a long time or the network node is not connected for some reason. Are you running only one nova-network? If so, it could be getting overloaded with requests. You might try using multi_host mode and running one nova-network on each compute host.
You might also try:
rabbitmqctl list_queues
It should show how many messages are waiting in each queue. If the number is greater than 0, either the worker is blocked doing some work or it can't keep up with the messages.
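If you want to watch this over time rather than eyeballing it, here is a rough sketch that parses the text output of `rabbitmqctl list_queues` and reports queues with messages waiting. It assumes the default output of one queue name and one message count per line, with banner lines around them. The function names and the queue names in the sample are made up for illustration, so adjust to whatever your broker actually prints.

```python
import subprocess


def parse_list_queues(output):
    """Turn 'rabbitmqctl list_queues' text into {queue_name: message_count}.

    Assumes each data line ends with an integer message count; banner
    lines such as 'Listing queues ...' or '...done.' are skipped.
    """
    queues = {}
    for line in output.splitlines():
        parts = line.rsplit(None, 1)  # split off the last whitespace-separated field
        if len(parts) != 2 or not parts[1].isdigit():
            continue  # not a "<name> <count>" line
        queues[parts[0].strip()] = int(parts[1])
    return queues


def backlogged(queues, threshold=0):
    """Return only the queues holding more than `threshold` messages."""
    return {name: count for name, count in queues.items() if count > threshold}


def check_broker():
    """Run rabbitmqctl on this host (usually needs root) and report backlogs."""
    result = subprocess.run(
        ["rabbitmqctl", "list_queues"],
        capture_output=True, text=True, check=True,
    )
    return backlogged(parse_list_queues(result.stdout))


# Illustrative sample only; these queue names are hypothetical.
SAMPLE = "Listing queues ...\nnetwork\t3\ncompute.node1\t0\n...done.\n"

if __name__ == "__main__":
    print(backlogged(parse_list_queues(SAMPLE)))
```

A queue that keeps showing up in the result across several runs would point at a consumer that is blocked or can't keep up.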
FYI, you can also change the timeout using:
rpc_response_timeout=<timeout_seconds>
so you could set it much higher and see if that helps.