Comment 15 for bug 1323277

I finally managed to reproduce it on the latest 5.1 ISO (#263):
{
    "api": "1.0",
    "astute_sha": "694b5a55695e01e1c42185bfac9cc7a641a9bd48",
    "build_id": "2014-06-23_00-31-14",
    "build_number": "265",
    "fuellib_sha": "dc2713b3ba20ccff2816cf61e8481fe2f17ed69b",
    "fuelmain_sha": "4394ca9be6540d652cc3919556633d9381e0db64",
    "mirantis": "yes",
    "nailgun_sha": "eaabb2c9bbe8e921aaa231960dcda74a7bc86213",
    "ostf_sha": "429c373fb79b1073aa336bc62c6aad45a8f93af6",
    "production": "docker",
    "release": "5.1"
}

The problem is caused by a RabbitMQ glitch on one of the remaining controller nodes (in my case node-2, after bringing down br-mgmt on node-1). Here is an example from the nova-compute log:

2014-06-23 11:31:49.244 25809 TRACE nova.servicegroup.drivers.db Traceback (most recent call last):
2014-06-23 11:31:49.244 25809 TRACE nova.servicegroup.drivers.db File "/usr/lib/python2.6/site-packages/nova/servicegroup/drivers/db.py", line 95, in _report_state
2014-06-23 11:31:49.244 25809 TRACE nova.servicegroup.drivers.db service.service_ref, state_catalog)
2014-06-23 11:31:49.244 25809 TRACE nova.servicegroup.drivers.db File "/usr/lib/python2.6/site-packages/nova/conductor/api.py", line 218, in service_update
2014-06-23 11:31:49.244 25809 TRACE nova.servicegroup.drivers.db return self._manager.service_update(context, service, values)
2014-06-23 11:31:49.244 25809 TRACE nova.servicegroup.drivers.db File "/usr/lib/python2.6/site-packages/nova/conductor/rpcapi.py", line 330, in service_update
2014-06-23 11:31:49.244 25809 TRACE nova.servicegroup.drivers.db service=service_p, values=values)
2014-06-23 11:31:49.244 25809 TRACE nova.servicegroup.drivers.db File "/usr/lib/python2.6/site-packages/oslo/messaging/rpc/client.py", line 150, in call
2014-06-23 11:31:49.244 25809 TRACE nova.servicegroup.drivers.db wait_for_reply=True, timeout=timeout)
2014-06-23 11:31:49.244 25809 TRACE nova.servicegroup.drivers.db File "/usr/lib/python2.6/site-packages/oslo/messaging/transport.py", line 90, in _send
2014-06-23 11:31:49.244 25809 TRACE nova.servicegroup.drivers.db timeout=timeout)
2014-06-23 11:31:49.244 25809 TRACE nova.servicegroup.drivers.db File "/usr/lib/python2.6/site-packages/oslo/messaging/_drivers/amqpdriver.py", line 409, in send
2014-06-23 11:31:49.244 25809 TRACE nova.servicegroup.drivers.db return self._send(target, ctxt, message, wait_for_reply, timeout)
2014-06-23 11:31:49.244 25809 TRACE nova.servicegroup.drivers.db File "/usr/lib/python2.6/site-packages/oslo/messaging/_drivers/amqpdriver.py", line 400, in _send
2014-06-23 11:31:49.244 25809 TRACE nova.servicegroup.drivers.db result = self._waiter.wait(msg_id, timeout)
2014-06-23 11:31:49.244 25809 TRACE nova.servicegroup.drivers.db File "/usr/lib/python2.6/site-packages/oslo/messaging/_drivers/amqpdriver.py", line 267, in wait
2014-06-23 11:31:49.244 25809 TRACE nova.servicegroup.drivers.db reply, ending = self._poll_connection(msg_id, timeout)
2014-06-23 11:31:49.244 25809 TRACE nova.servicegroup.drivers.db File "/usr/lib/python2.6/site-packages/oslo/messaging/_drivers/amqpdriver.py", line 217, in _poll_connection
2014-06-23 11:31:49.244 25809 TRACE nova.servicegroup.drivers.db % msg_id)
2014-06-23 11:31:49.244 25809 TRACE nova.servicegroup.drivers.db MessagingTimeout: Timed out waiting for a reply to message ID a735198df0b94436801231af311adb99
2014-06-23 11:31:49.244 25809 TRACE nova.servicegroup.drivers.db

According to tcpdump and the logs, such errors occurred only when "nova-compute" tried to send a message to RabbitMQ on node-2. Messages to RabbitMQ on node-4 were fine.
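
For anyone who wants to line these timeouts up against a tcpdump capture, here is a minimal sketch that pulls the timestamp and message ID out of each MessagingTimeout line. It only assumes the log line format shown in the traceback above; the function name is made up for illustration. The log line itself does not say which RabbitMQ node the message went to, so that part still has to come from the capture.

```python
import re

# Matches the last meaningful line of the traceback above, for example:
# "2014-06-23 11:31:49.244 25809 TRACE nova.servicegroup.drivers.db
#  MessagingTimeout: Timed out waiting for a reply to message ID a735..."
TIMEOUT_RE = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d+) "
    r"(?P<pid>\d+) TRACE (?P<logger>\S+) "
    r"MessagingTimeout: Timed out waiting for a reply to message ID "
    r"(?P<msg_id>[0-9a-f]+)"
)


def find_messaging_timeouts(lines):
    """Return (timestamp, message_id) pairs, one per MessagingTimeout line."""
    hits = []
    for line in lines:
        match = TIMEOUT_RE.match(line.strip())
        if match:
            hits.append((match.group("ts"), match.group("msg_id")))
    return hits
```

The function takes any iterable of lines, so an open log file object can be passed directly.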

RabbitMQ on node-2 was accepting connections, but it looks like it was not able to handle messages. Because of this, the "nova-compute" services kept flipping between "up" and "down" in "nova service-list", and all instances created in Horizon ended up in the ERROR state. RabbitMQ on node-2 even failed to stop cleanly via "service rabbitmq stop", so I had to kill it.
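
A rough way to tell "port is open" apart from "broker actually talks AMQP" is to send the AMQP 0-9-1 protocol header and wait for the first byte of the broker's reply (a healthy broker answers with a method frame, frame type 1). The sketch below is only a first-level check and not what I used here: the function name and the default port 5672 are assumptions, and a broker that completes the handshake may still fail to deliver messages.

```python
import socket

# AMQP 0-9-1 protocol header: literal "AMQP" followed by 0, 0, 9, 1.
AMQP_0_9_1_HEADER = b"AMQP\x00\x00\x09\x01"


def probe_amqp(host, port=5672, timeout=5.0):
    """Classify a broker endpoint.

    Returns one of:
      "unreachable" - TCP connect failed or the connection broke
      "silent"      - TCP connect worked, but no reply within the timeout
      "responding"  - broker replied with a method frame (Connection.Start)
      "unexpected"  - broker closed the socket or sent something else
    """
    try:
        sock = socket.create_connection((host, port), timeout=timeout)
    except OSError:
        return "unreachable"
    try:
        sock.settimeout(timeout)
        sock.sendall(AMQP_0_9_1_HEADER)
        first_byte = sock.recv(1)
    except socket.timeout:
        return "silent"
    except OSError:
        return "unreachable"
    finally:
        sock.close()
    # Frame type 1 is a method frame; the first one sent is Connection.Start.
    return "responding" if first_byte == b"\x01" else "unexpected"
```

Running this against each controller (node-2 and node-4 in my case) would show whether a broker is merely accepting connections or also answering the handshake.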

After I killed the faulty RabbitMQ on node-2 (leaving only one working RabbitMQ, on node-4), the nova-compute services recovered successfully and I was able to create instances and pass OSTF.

This intermittent bug should be fixed with https://blueprints.launchpad.net/fuel/+spec/rabbitmq-cluster-controlled-by-pacemaker.

Attaching a snapshot just in case.