Handle RabbitMQ crash and attempt to reestablish connections and queues

Bug #794627 reported by Antony Messerli
This bug affects 4 people
Affects: OpenStack Compute (nova)
Status: Fix Released
Importance: High
Assigned to: Chris Behrens

Bug Description

If RabbitMQ fails, crashes, or is restarted, the services connected to it don't attempt to re-establish their connections. All services currently need to be restarted to reconnect and recreate the queues; otherwise they keep running but stay disconnected. Services should attempt to reconnect to RabbitMQ in that situation and re-establish their queues.

If RabbitMQ is shut down, services will receive this exception:

(nova.rpc): TRACE: Traceback (most recent call last):
(nova.rpc): TRACE: File "/usr/lib/pymodules/python2.6/nova/rpc.py", line 316, in wait
(nova.rpc): TRACE: it.next()
(nova.rpc): TRACE: File "/usr/lib/pymodules/python2.6/carrot/backends/pyamqplib.py", line 287, in consume
(nova.rpc): TRACE: self.channel.wait()
(nova.rpc): TRACE: File "/usr/lib/pymodules/python2.6/amqplib/client_0_8/abstract_channel.py", line 89, in wait
(nova.rpc): TRACE: self.channel_id, allowed_methods)
(nova.rpc): TRACE: File "/usr/lib/pymodules/python2.6/amqplib/client_0_8/connection.py", line 218, in _wait_method
(nova.rpc): TRACE: self.wait()
(nova.rpc): TRACE: File "/usr/lib/pymodules/python2.6/amqplib/client_0_8/abstract_channel.py", line 105, in wait
(nova.rpc): TRACE: return amqp_method(self, args)
(nova.rpc): TRACE: File "/usr/lib/pymodules/python2.6/amqplib/client_0_8/connection.py", line 367, in _close
(nova.rpc): TRACE: raise AMQPConnectionException(reply_code, reply_text, (class_id, method_id))
(nova.rpc): TRACE: AMQPConnectionException: (320, u"CONNECTION_FORCED - broker forced connection closure with reason 'shutdown'", (0, 0), '')

If RabbitMQ is killed, services will receive this exception:

(nova.rpc): TRACE: Traceback (most recent call last):
(nova.rpc): TRACE: File "/usr/lib/pymodules/python2.6/nova/rpc.py", line 316, in wait
(nova.rpc): TRACE: it.next()
(nova.rpc): TRACE: File "/usr/lib/pymodules/python2.6/carrot/backends/pyamqplib.py", line 287, in consume
(nova.rpc): TRACE: self.channel.wait()
(nova.rpc): TRACE: File "/usr/lib/pymodules/python2.6/amqplib/client_0_8/abstract_channel.py", line 89, in wait
(nova.rpc): TRACE: self.channel_id, allowed_methods)
(nova.rpc): TRACE: File "/usr/lib/pymodules/python2.6/amqplib/client_0_8/connection.py", line 198, in _wait_method
(nova.rpc): TRACE: self.method_reader.read_method()
(nova.rpc): TRACE: File "/usr/lib/pymodules/python2.6/amqplib/client_0_8/method_framing.py", line 215, in read_method
(nova.rpc): TRACE: raise m
(nova.rpc): TRACE: IOError: Socket closed

(Running 1156 of Nova)
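
The two tracebacks above surface as different exception types, so a consumer that wants to recover has to treat both as "connection lost". Below is a minimal sketch of that idea, not Nova's actual code: `BrokerClosedConnection` is a hypothetical stand-in for amqplib's `AMQPConnectionException`, and `FakeConnection`, `consume_one`, and the queue names are invented for illustration only.

```python
import socket


class BrokerClosedConnection(Exception):
    """Hypothetical stand-in for amqplib's AMQPConnectionException."""


# Errors that mean "the broker went away" rather than a bug in a handler.
# IOError covers the "Socket closed" case from the second traceback.
CONNECTION_ERRORS = (BrokerClosedConnection, IOError, socket.error)


class FakeConnection(object):
    """Hypothetical connection, present only to make the sketch runnable."""

    def __init__(self, failures):
        self.failures = list(failures)  # exceptions to raise, in order
        self.declared = []              # record of queue declarations
        self.connects = 0               # number of reconnect attempts

    def connect(self):
        self.connects += 1

    def declare_queue(self, name):
        self.declared.append(name)

    def wait(self):
        # Raise the next scripted failure, or report one delivered message.
        if self.failures:
            raise self.failures.pop(0)
        return "message"


def consume_one(conn, queues):
    """Wait for one message, reconnecting and re-declaring queues on failure."""
    while True:
        try:
            return conn.wait()
        except CONNECTION_ERRORS:
            # The broker dropped us: reconnect, then recreate every queue,
            # which is what the report asks services to do on their own.
            conn.connect()
            for name in queues:
                conn.declare_queue(name)
```

A real service would also pause between attempts and log each failure; the sketch leaves that out to keep the error handling visible.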


description: updated
vivek.ys (vivekys)
Changed in nova:
assignee: nobody → vivek.ys (vivekys)
vivek.ys (vivekys)
Changed in nova:
status: New → In Progress
Thierry Carrez (ttx) wrote :

This was supposed to be addressed in bug 718869?

Changed in nova:
importance: Undecided → High
Graham Hemingway (graham-hemingway) wrote :

This is still not fully addressed in the Diablo milestone 1 release.

Thierry Carrez (ttx) wrote :

@vivek.ys: are you working on this? Or should we unassign you and try to convince someone else to work on it?

Graham Hemingway (graham-hemingway) wrote :

This bit me again yesterday. Is there a plan to enable automatic reconnect for the nova-* services (nova-compute, nova-api and nova-scheduler especially)? I am on Diablo-2. Please, this would be a huge improvement in cloud reliability.

Thierry Carrez (ttx) wrote :

No response from assignee, unassigning to encourage someone else to have a shot at it.

Changed in nova:
assignee: vivek.ys (vivekys) → nobody
status: In Progress → Confirmed
Changed in nova:
assignee: nobody → William Wolf (throughnothing)
William Wolf (throughnothing) wrote :

Currently I get this trace from nova-compute.log when I restart rabbitmq on a freshly started nova install:

http://paste.openstack.org/show/2030/

William Wolf (throughnothing) wrote :

And this from nova-api (only sometimes, it seems):

http://paste.perldancer.org/18WxrnmfRSjtW

Chris Behrens (cbehrens) wrote :

I think it's best to wait for kombu for this... as the carrot code is hard to deal with. I'm linking my kombu branch here.

I can show it will be solved there for sure. Here's what happens when rabbit is restarted while one greenthread is pushing messages into a queue and another greenthread is pulling them out:

[...]
Message received: hi there meow 1216
Message received: hi there meow 1217
WARNING:root:consume reconnecting: [Errno 107] Transport endpoint is not connected
WARNING:root:publisher_send reconnecting: [Errno 32] Broken pipe
ERROR:root:AMQP server on localhost:5672 is unreachable: [Errno 111] ECONNREFUSED. Trying again in 1 seconds.
ERROR:root:AMQP server on localhost:5672 is unreachable: [Errno 111] ECONNREFUSED. Trying again in 1 seconds.
ERROR:root:Connected to AMQP server on localhost:5672
ERROR:root:Connected to AMQP server on localhost:5672
Message received: hi there meow 1218
Message received: hi there meow 1219
Message received: hi there meow 1220
[...]
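
The log above shows a fixed-interval retry: each failed connect is logged, the client waits one second, and tries again until the broker is back. A minimal sketch of such a loop follows. It is not the code from the kombu branch; `connect_with_retry` and all of its parameters are hypothetical names, and the `connect` callable stands in for whatever actually opens the AMQP connection.

```python
import logging
import time

LOG = logging.getLogger(__name__)


def connect_with_retry(connect, host, port, interval=1, max_retries=None,
                       sleep=time.sleep):
    """Call connect() until it succeeds, sleeping `interval` seconds between tries.

    All names here are hypothetical. max_retries caps the total number of
    attempts; None means retry forever. Returns the number of failed attempts
    that preceded the successful one.
    """
    attempt = 0
    while True:
        try:
            connect()
        except (IOError, OSError) as exc:
            attempt += 1
            if max_retries is not None and attempt >= max_retries:
                raise  # give up and let the caller see the last error
            LOG.error("AMQP server on %s:%d is unreachable: %s. "
                      "Trying again in %d seconds.", host, port, exc, interval)
            sleep(interval)
        else:
            LOG.info("Connected to AMQP server on %s:%d", host, port)
            return attempt
```

Passing `sleep` as a parameter is only a convenience for testing the loop without real delays. After a successful reconnect, a consumer would still need to re-declare its queues before resuming.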

Changed in nova:
assignee: William Wolf (throughnothing) → Chris Behrens (cbehrens)
Chris Behrens (cbehrens)
Changed in nova:
status: Confirmed → In Progress
Thierry Carrez (ttx)
Changed in nova:
status: In Progress → Fix Committed
Thierry Carrez (ttx)
Changed in nova:
milestone: none → diablo-rbp
Thierry Carrez (ttx)
Changed in nova:
milestone: diablo-rbp → 2011.3
status: Fix Committed → Fix Released