Comment 4 for bug 1580634

Roman Podoliaka (rpodolyaka) wrote:

Alexander, I've checked the diagnostic snapshot you provided, and it looks like you don't give the RabbitMQ cluster enough time to fully recover before trying to boot an instance. There is a connection error right in the middle of the instance boot:

We start building an instance:

2016-05-11T14:26:37.695624+00:00 debug: 2016-05-11 14:26:34.755 1792 DEBUG nova.compute.manager [req-f44df37d-43bb-4655-9f52-7289823dfbf7 - - - - -] [instance: c2f69199-f757-4f12-add0-4c4cef0bd01e] Skipping network cache update for instance because it is Building. _heal_instance_info_cache /usr/lib/python2.7/dist-packages/nova/compute/manager.py:5553

Then there is a disconnect:

2016-05-11T14:26:42.796898+00:00 err: 2016-05-11 14:26:39.854 1792 ERROR oslo.messaging._drivers.impl_rabbit [-] AMQP server on 10.109.36.8:5673 is unreachable: [Errno 111] ECONNREFUSED. Trying again in 5 seconds.

The connection is only recovered after ~1.5 minutes:

2016-05-11T14:28:21.814615+00:00 warning: 2016-05-11 14:28:18.871 1792 WARNING nova.servicegroup.drivers.db [-] Lost connection to nova-conductor for reporting service status.
2016-05-11T14:28:21.814615+00:00 warning: 2016-05-11 14:28:18.872 1792 WARNING oslo.service.loopingcall [-] Function 'nova.servicegroup.drivers.db.DbDriver._report_state' run outlasted interval by 5.03 sec
2016-05-11T14:28:21.899409+00:00 info: 2016-05-11 14:28:18.957 1792 INFO nova.servicegroup.drivers.db [-] Recovered from being unable to report status.
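
For context, the two warnings above come from Nova's periodic service status reporting: nova-compute reports its state to nova-conductor on a fixed interval via an oslo.service looping call, and while the AMQP connection is down a tick blocks and "outlasts" its interval. A rough sketch of that mechanism (simplified, not the actual Nova code):

from oslo_service import loopingcall

def report_state():
    # In Nova this is nova.servicegroup.drivers.db.DbDriver._report_state,
    # which reports the service status via an RPC call to nova-conductor.
    # While the AMQP connection is broken the call blocks, so a single run
    # can take longer than the interval, which is what the warning complains about.
    print('reporting service status')

timer = loopingcall.FixedIntervalLoopingCall(report_state)
timer.start(interval=10)  # Nova's report_interval defaults to 10 seconds
timer.wait()              # blocks; in the real service this runs until shutdown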

The problem is that by the time Nova reconnects to the RabbitMQ cluster, the RPC call has already timed out (the reconnect took more than 60 seconds, which is the default RPC response timeout):

2016-05-11 14:28:24.465 1792 ERROR nova.compute.manager [instance: c2f69199-f757-4f12-add0-4c4cef0bd01e] MessagingTimeout: Timed out waiting for a reply to message ID b4eee6c75c274a93803f526007752c26

so the instance goes to the ERROR state.
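
To illustrate what happens there, here is a minimal sketch of an oslo.messaging RPC call hitting that timeout (hypothetical topic and method names, not the actual Nova code): call() waits for a reply and raises MessagingTimeout if none arrives within the timeout, which is exactly the exception in the traceback above.

from oslo_config import cfg
import oslo_messaging

conf = cfg.CONF
# transport_url would normally point at the RabbitMQ cluster,
# e.g. rabbit://nova:password@10.109.36.8:5673/
transport = oslo_messaging.get_transport(conf)
target = oslo_messaging.Target(topic='conductor')  # hypothetical topic
client = oslo_messaging.RPCClient(transport, target)

try:
    # call() publishes the request and then blocks waiting for a reply;
    # if no reply arrives within the timeout (rpc_response_timeout,
    # 60 seconds by default), MessagingTimeout is raised.
    client.prepare(timeout=60).call({}, 'ping')  # 'ping' is a made-up method
except oslo_messaging.MessagingTimeout:
    # nova-compute handles this by putting the instance into ERROR
    pass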

I quickly checked the RabbitMQ logs, and it looks like, although only one controller node was stopped, the RabbitMQ service was shut down and restarted on the two remaining nodes as well:

2016-05-11T14:28:48.803260+00:00 notice: =INFO REPORT==== 11-May-2016::14:28:42 ===
2016-05-11T14:28:48.803260+00:00 notice: started TCP Listener on 10.109.36.5:5673

2016-05-11T14:27:13.560595+00:00 notice: =INFO REPORT==== 11-May-2016::14:27:01 ===
2016-05-11T14:27:13.560595+00:00 notice: started TCP Listener on 10.109.36.8:5673

So for some time RabbitMQ was not available at all, which explains the long reconnect in Nova.
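
For reference, this is roughly what the long reconnect looks like from the Nova side: the rabbit driver keeps retrying the broken connection with an increasing delay ("Trying again in 5 seconds" in the log above) until a broker accepts the TCP connection again. A simplified sketch of that loop (not the actual oslo.messaging code; the parameter names merely mirror the rabbit_retry_interval / rabbit_retry_backoff / rabbit_interval_max options):

import socket
import time

def wait_for_broker(host, port, interval_start=1, interval_step=2, interval_max=30):
    """Retry a TCP connection to the broker, waiting a bit longer each time."""
    delay = interval_start
    while True:
        try:
            socket.create_connection((host, port), timeout=5).close()
            return
        except socket.error:
            # e.g. [Errno 111] ECONNREFUSED while RabbitMQ is down or restarting
            print('AMQP server on %s:%s is unreachable, trying again in %s seconds'
                  % (host, port, delay))
            time.sleep(delay)
            delay = min(delay + interval_step, interval_max)

Since RabbitMQ was restarted on the two remaining controllers as well, such a loop can only succeed once one of the brokers is listening again, which matches the ~1.5 minute gap in the Nova logs above.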

Oslo team, is this expected? ^