Comment 0 for bug 1440134

Miroslav Anashkin (manashkin) wrote :

We've implemented the following workaround in Fuel 5.1.1:

https://bugs.launchpad.net/fuel/+bug/1322259/
Set delay to 5.0 to recover channel errors on highly loaded environments.

That fix was intended for slow environments.

However, it introduced a new issue that affects all Mirantis OpenStack versions between 5.1.1 and 6.1.
Instance launch may fail.
It is related to this HA+RabbitMQ+Oslo Messaging bug:
https://bugs.launchpad.net/mos/+bug/1415932

So far, customers have reported the following errors:

===================================================================

Error Message 1
Timed out waiting for a reply to message ID 7bc529a2d71141cf8e65cbee6402f817
Code
500
Details
  File "/usr/lib/python2.7/dist-packages/nova/conductor/manager.py", line 614, in build_instances
    request_spec, filter_properties)
  File "/usr/lib/python2.7/dist-packages/nova/scheduler/client/__init__.py", line 49, in select_destinations
    context, request_spec, filter_properties)
  File "/usr/lib/python2.7/dist-packages/nova/scheduler/client/__init__.py", line 35, in __run_method
    return getattr(self.instance, __name)(*args, **kwargs)
  File "/usr/lib/python2.7/dist-packages/nova/scheduler/client/query.py", line 34, in select_destinations
    context, request_spec, filter_properties)
  File "/usr/lib/python2.7/dist-packages/nova/scheduler/rpcapi.py", line 108, in select_destinations
    request_spec=request_spec, filter_properties=filter_properties)
  File "/usr/lib/python2.7/dist-packages/oslo/messaging/rpc/client.py", line 152, in call
    retry=self.retry)
  File "/usr/lib/python2.7/dist-packages/oslo/messaging/transport.py", line 90, in _send
    timeout=timeout, retry=retry)
  File "/usr/lib/python2.7/dist-packages/oslo/messaging/_drivers/amqpdriver.py", line 434, in send
    retry=retry)
  File "/usr/lib/python2.7/dist-packages/oslo/messaging/_drivers/amqpdriver.py", line 423, in _send
    result = self._waiter.wait(msg_id, timeout)
  File "/usr/lib/python2.7/dist-packages/oslo/messaging/_drivers/amqpdriver.py", line 289, in wait
    reply, ending = self._poll_connection(msg_id, timeout)
  File "/usr/lib/python2.7/dist-packages/oslo/messaging/_drivers/amqpdriver.py", line 239, in _poll_connection
    % msg_id)

===================================================================

Error Message 2
The problem I am seeing is that when cloud-init accesses the metadata service, each query takes 10 seconds to respond, which slows instance startup from ~30 seconds to over 3 minutes.

time curl http://169.254.169.254/2009-04-04/meta-data/instance-id
i-00000046
real 0m10.289s
user 0m0.000s
sys 0m0.003s

After some troubleshooting, I noticed that the neutron-metadata-agent was calling out to the neutron-server for locks. When I looked at the neutron-server logs, I started seeing error messages like this:

2015-03-12 21:00:34.247 41004 ERROR oslo.messaging._drivers.impl_rabbit [-] Failed to publish message to topic 'reply_948744ac899648c19260a53a8e5c0f4b': [Errno 32] Broken pipe
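
To gauge how often this error occurs on a controller, the log can be scanned for the same message. The following is only a rough sketch: the regular expression is modelled on the single log line quoted above, and the helper name find_publish_errors is hypothetical, not part of any OpenStack tool.

```python
import re

# Pattern modelled on the one neutron-server log line quoted above;
# other releases or log formats may need a different expression.
PUBLISH_ERROR = re.compile(
    r"^(?P<timestamp>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d+) \d+ ERROR "
    r"oslo\.messaging\._drivers\.impl_rabbit .*"
    r"Failed to publish message to topic '(?P<topic>[^']+)': (?P<reason>.*)$"
)


def find_publish_errors(lines):
    """Return (timestamp, topic, reason) for each matching log line."""
    hits = []
    for line in lines:
        match = PUBLISH_ERROR.match(line.strip())
        if match:
            hits.append(
                (match.group("timestamp"), match.group("topic"), match.group("reason"))
            )
    return hits
```

It accepts any iterable of lines, so an open log file object can be passed in directly.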

===================================================================

Should we set kombu_reconnect_delay back to its default value, 1.0?
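
To see which value a given service is currently using, its configuration file can be inspected. The sketch below is an illustration under assumptions: the helper name get_reconnect_delay is hypothetical, and the sections it checks ([oslo_messaging_rabbit], then [DEFAULT]) are a guess at where the option may live, since this depends on the release. If the option is absent, it falls back to the default of 1.0 mentioned above.

```python
import configparser

DEFAULT_DELAY = 1.0  # default value of kombu_reconnect_delay, as noted above


def get_reconnect_delay(conf_text,
                        sections=("oslo_messaging_rabbit", "DEFAULT")):
    """Return the effective kombu_reconnect_delay from config file text.

    The section list is an assumption; adjust it for the release in use.
    """
    # Treat [DEFAULT] as an ordinary section so each section is checked
    # on its own, and tolerate duplicate keys and value-less options.
    parser = configparser.ConfigParser(
        default_section="__unused__",
        interpolation=None,
        strict=False,
        allow_no_value=True,
    )
    parser.read_string(conf_text)
    for section in sections:
        value = parser.get(section, "kombu_reconnect_delay", fallback=None)
        if value is not None:
            return float(value)
    return DEFAULT_DELAY
```

For example, passing the contents of a service config file (such as nova.conf, path depending on the deployment) returns 5.0 on a node where the workaround is applied and 1.0 where the option is unset.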