I finally managed to reproduce it on the latest 5.1 ISO (#263)
{
"api": "1.0",
"astute_sha": "694b5a55695e01e1c42185bfac9cc7a641a9bd48",
"build_id": "2014-06-23_00-31-14",
"build_number": "265",
"fuellib_sha": "dc2713b3ba20ccff2816cf61e8481fe2f17ed69b",
"fuelmain_sha": "4394ca9be6540d652cc3919556633d9381e0db64",
"mirantis": "yes",
"nailgun_sha": "eaabb2c9bbe8e921aaa231960dcda74a7bc86213",
"ostf_sha": "429c373fb79b1073aa336bc62c6aad45a8f93af6",
"production": "docker",
"release": "5.1"
}
The problem is caused by a rabbitmq glitch on one of the remaining controller nodes (in my case node-2, after bringing down br-mgmt on node-1). Here is an excerpt from the nova-compute log:
2014-06-23 11:31:49.244 25809 TRACE nova.servicegroup.drivers.db Traceback (most recent call last):
2014-06-23 11:31:49.244 25809 TRACE nova.servicegroup.drivers.db File "/usr/lib/python2.6/site-packages/nova/servicegroup/drivers/db.py", line 95, in _report_state
2014-06-23 11:31:49.244 25809 TRACE nova.servicegroup.drivers.db service.service_ref, state_catalog)
2014-06-23 11:31:49.244 25809 TRACE nova.servicegroup.drivers.db File "/usr/lib/python2.6/site-packages/nova/conductor/api.py", line 218, in service_update
2014-06-23 11:31:49.244 25809 TRACE nova.servicegroup.drivers.db return self._manager.service_update(context, service, values)
2014-06-23 11:31:49.244 25809 TRACE nova.servicegroup.drivers.db File "/usr/lib/python2.6/site-packages/nova/conductor/rpcapi.py", line 330, in service_update
2014-06-23 11:31:49.244 25809 TRACE nova.servicegroup.drivers.db service=service_p, values=values)
2014-06-23 11:31:49.244 25809 TRACE nova.servicegroup.drivers.db File "/usr/lib/python2.6/site-packages/oslo/messaging/rpc/client.py", line 150, in call
2014-06-23 11:31:49.244 25809 TRACE nova.servicegroup.drivers.db wait_for_reply=True, timeout=timeout)
2014-06-23 11:31:49.244 25809 TRACE nova.servicegroup.drivers.db File "/usr/lib/python2.6/site-packages/oslo/messaging/transport.py", line 90, in _send
2014-06-23 11:31:49.244 25809 TRACE nova.servicegroup.drivers.db timeout=timeout)
2014-06-23 11:31:49.244 25809 TRACE nova.servicegroup.drivers.db File "/usr/lib/python2.6/site-packages/oslo/messaging/_drivers/amqpdriver.py", line 409, in send
2014-06-23 11:31:49.244 25809 TRACE nova.servicegroup.drivers.db return self._send(target, ctxt, message, wait_for_reply, timeout)
2014-06-23 11:31:49.244 25809 TRACE nova.servicegroup.drivers.db File "/usr/lib/python2.6/site-packages/oslo/messaging/_drivers/amqpdriver.py", line 400, in _send
2014-06-23 11:31:49.244 25809 TRACE nova.servicegroup.drivers.db result = self._waiter.wait(msg_id, timeout)
2014-06-23 11:31:49.244 25809 TRACE nova.servicegroup.drivers.db File "/usr/lib/python2.6/site-packages/oslo/messaging/_drivers/amqpdriver.py", line 267, in wait
2014-06-23 11:31:49.244 25809 TRACE nova.servicegroup.drivers.db reply, ending = self._poll_connection(msg_id, timeout)
2014-06-23 11:31:49.244 25809 TRACE nova.servicegroup.drivers.db File "/usr/lib/python2.6/site-packages/oslo/messaging/_drivers/amqpdriver.py", line 217, in _poll_connection
2014-06-23 11:31:49.244 25809 TRACE nova.servicegroup.drivers.db % msg_id)
2014-06-23 11:31:49.244 25809 TRACE nova.servicegroup.drivers.db MessagingTimeout: Timed out waiting for a reply to message ID a735198df0b94436801231af311adb99
2014-06-23 11:31:49.244 25809 TRACE nova.servicegroup.drivers.db
According to tcpdump and the logs, such errors occurred only when "nova-compute" tried to send a message to rabbitmq on node-2. Messages to rabbitmq on node-4 were fine.
Rabbitmq on node-2 was accepting connections, but it looks like it was not able to handle messages. Because of this, "nova-compute" services were flapping between "up" and "down" in "nova service-list", and all instances created in Horizon ended up in ERROR state. Rabbitmq on node-2 even failed to stop cleanly via "service rabbitmq stop" and I had to kill it.
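The up/down flapping follows from how the servicegroup "db" driver (the one in the traceback above) decides liveness: a service counts as "up" only while its last successful state report is fresh enough. A minimal sketch of that logic, with heartbeat-interval and freshness values chosen here for illustration (they mirror nova's report_interval/service_down_time options but are assumptions of this sketch, not nova's actual code):

```python
# Toy model of the servicegroup "db" driver's liveness check: a service is
# "up" only if its last heartbeat is newer than SERVICE_DOWN_TIME. When a
# heartbeat RPC lands on a wedged broker, it times out and the last good
# report ages out. Values below are illustrative assumptions.

REPORT_INTERVAL = 10    # seconds between _report_state heartbeats
SERVICE_DOWN_TIME = 60  # max heartbeat age before the service shows "down"

def is_service_up(last_heartbeat, now, service_down_time=SERVICE_DOWN_TIME):
    """A service counts as 'up' while its last report is fresh enough."""
    return (now - last_heartbeat) <= service_down_time

def simulate(rpc_timeouts, duration=300):
    """Walk the clock forward; each heartbeat either lands or is lost to a
    MessagingTimeout, depending on which broker the client happened to hit."""
    last_ok = 0
    states = []
    for t in range(0, duration, REPORT_INTERVAL):
        if t not in rpc_timeouts:   # heartbeat reached a healthy broker
            last_ok = t
        states.append((t, "up" if is_service_up(last_ok, t) else "down"))
    return states

# Heartbeats at t=60..170 hit the wedged broker and time out, so the service
# flaps to "down" once the last good report (t=50) ages past 60 s, then
# recovers as soon as a heartbeat lands again.
flapping = simulate(rpc_timeouts=set(range(60, 180, REPORT_INTERVAL)))
```

This matches the observed symptom: whenever the client was routed to node-2's broker the heartbeat timed out, and whenever it hit node-4 it succeeded, so the service bounced between "up" and "down" instead of staying cleanly in one state.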
After killing the problematic rabbitmq on node-2 (leaving only one working rabbitmq, on node-4), the nova-compute services successfully recovered and I was able to create instances and pass OSTF.
This intermittent bug should be fixed with https://blueprints.launchpad.net/fuel/+spec/rabbitmq-cluster-controlled-by-pacemaker.
Attaching a snapshot just in case.