Encountering sporadic AMQPChannelException
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
OpenStack Compute (nova) | Fix Released | Critical | Vish Ishaya |
Diablo | Fix Released | Undecided | Unassigned |
nova (Ubuntu) | Fix Released | High | Unassigned |
Oneiric | Won't Fix | High | Unassigned |
Precise | Fix Released | High | Unassigned |
Bug Description
Running one controller and one compute node, using nova-network version 2011.3~
Create four instances, then terminate them.
This mostly works, but roughly one time in three or four, one of the VMs does not come up and I see the error below in the nova-network log. It can occur on either the controller node (which also runs compute) or the compute node. (The contents of nova.conf follow the logs.)
2011-09-20 13:22:59,295 DEBUG nova.utils [-] Attempting to grab semaphore "iptables" for method "apply"... from (pid=1082) inner /usr/lib/
2011-09-20 13:22:59,295 DEBUG nova.utils [-] Attempting to grab file lock "iptables" for method "apply"... from (pid=1082) inner /usr/lib/
2011-09-20 13:22:59,296 DEBUG nova.utils [-] Running cmd (subprocess): sudo iptables-save -t filter from (pid=1082) execute /usr/lib/
2011-09-20 13:22:59,311 DEBUG nova.utils [-] Running cmd (subprocess): sudo iptables-restore from (pid=1082) execute /usr/lib/
2011-09-20 13:22:59,350 DEBUG nova.utils [-] Running cmd (subprocess): sudo iptables-save -t nat from (pid=1082) execute /usr/lib/
2011-09-20 13:22:59,366 DEBUG nova.utils [-] Running cmd (subprocess): sudo iptables-restore from (pid=1082) execute /usr/lib/
2011-09-20 13:22:59,424 ERROR nova.rpc [-] Exception during message handling
(nova.rpc): TRACE: Traceback (most recent call last):
(nova.rpc): TRACE: File "/usr/lib/
(nova.rpc): TRACE: ctxt.reply(None, None)
(nova.rpc): TRACE: File "/usr/lib/
(nova.rpc): TRACE: msg_reply(
(nova.rpc): TRACE: File "/usr/lib/
(nova.rpc): TRACE: conn.direct_
(nova.rpc): TRACE: File "/usr/lib/
(nova.rpc): TRACE: self._done()
(nova.rpc): TRACE: File "/usr/lib/
(nova.rpc): TRACE: self.connection
(nova.rpc): TRACE: File "/usr/lib/
(nova.rpc): TRACE: self.channel.
(nova.rpc): TRACE: File "/usr/lib/
(nova.rpc): TRACE: (20, 41), # Channel.close_ok
(nova.rpc): TRACE: File "/usr/lib/
(nova.rpc): TRACE: return self.dispatch_
(nova.rpc): TRACE: File "/usr/lib/
(nova.rpc): TRACE: return amqp_method(self, args)
(nova.rpc): TRACE: File "/usr/lib/
(nova.rpc): TRACE: (class_id, method_id))
(nova.rpc): TRACE: AMQPChannelExce
(nova.rpc): TRACE:
2011-09-20 13:22:59,451 ERROR nova.rpc [-] Returning exception (404, u"NOT_FOUND - no exchange '3ff1ba7e274a4e
2011-09-20 13:22:59,452 ERROR nova.rpc [-] ['Traceback (most recent call last):\n', ' File "/usr/lib/
At the same time, I also see this in the RabbitMQ log:
=ERROR REPORT==== 20-Sep-
connection <0.1061.0>, channel 1 - error:
{amqp_error,
"no exchange '3ff1ba7e274a4e
Contents of nova.conf:
--flagfile=
--use_deprecate
--dhcpbridge_
--dhcpbridge=
--sql_connectio
--s3_host=
--rabbit_
--glance_
--logdir=
--state_
--lock_
--verbose
--ec2_url=http://
--dmz_cidr=
--fixed_
--network_size=8
--flat_
--image_
--bridge_
--flat_
--network_
--public_
--multi_host
--osapi_
Changed in nova: | |
status: | New → Incomplete |
Changed in nova: | |
status: | Incomplete → Confirmed |
importance: | Undecided → Critical |
assignee: | nobody → Vish Ishaya (vishvananda) |
milestone: | none → essex-2 |
Changed in nova (Ubuntu Oneiric): | |
importance: | Undecided → High |
status: | New → Triaged |
Changed in nova (Ubuntu Precise): | |
status: | New → Triaged |
importance: | Undecided → High |
Changed in nova (Ubuntu Precise): | |
status: | Triaged → Fix Committed |
status: | Fix Committed → Fix Released |
Changed in nova: | |
status: | Fix Committed → Fix Released |
Changed in nova: | |
milestone: | essex-2 → 2012.1 |
Pretty interesting. This has to be from an rpc.call. The caller creates a queue for the response from the worker, and it appears the worker is not finding this queue when it goes to 'msg_reply'.
The only ways I can think of that this could happen:
1) After the caller publishes a request for a worker, it gets an exception while waiting for the response. That would cause the queue to disappear. (Seems unlikely)
2) After the caller publishes a request for a worker, it gets disconnected from rabbit, which causes the queue to be deleted momentarily before it reconnects and re-declares it, and the worker tries to respond in the middle of this. (Seems unlikely)
3) rabbit is returning an ACK to the caller (amqplib) before it has actually finished creating the queue, so there's a race condition where the worker tries to reply before the queue is created. (Seems unlikely)
4) kombu+amqplib randomly doesn't send a declare_queue()? Seems unlikely without an exception thrown, and an exception would abort the publish to the worker.
I wonder what I'm missing. What version of kombu and what version of rabbit are you running? Have you tried falling back to carrot? You can specify 'rpc_backend=nova.rpc.impl_carrot' to try it.
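To make the suspected failure mode concrete, here is a rough sketch of the rpc.call reply path as a toy in-memory model. It is not nova, kombu, or amqplib code: `ToyBroker`, `NotFound`, `call`, `worker`, and the `drop_reply_exchange` switch are all hypothetical names made up for illustration. It models the reply target as an exchange named after the message id, since that is what the logged 404 complains about, and it only shows that a reply published after the target has disappeared fails the way the log does. It does not say which of the cases above is the real cause.

```python
import uuid


class NotFound(Exception):
    """Stand-in for the broker's 404 NOT_FOUND channel error (hypothetical)."""


class ToyBroker:
    """In-memory model of a broker; not the RabbitMQ or kombu API."""

    def __init__(self):
        # exchange name -> list of messages delivered to it
        self.exchanges = {}

    def declare(self, name):
        self.exchanges.setdefault(name, [])

    def delete(self, name):
        self.exchanges.pop(name, None)

    def publish(self, name, message):
        # Publishing to a missing exchange fails, as in the reported log.
        if name not in self.exchanges:
            raise NotFound("NOT_FOUND - no exchange '%s'" % name)
        self.exchanges[name].append(message)


def worker(broker, msg_id):
    """Worker side: do the work, then publish the result to the reply exchange."""
    broker.publish(msg_id, {"result": "ok"})


def call(broker, handler, drop_reply_exchange=False):
    """Caller side: declare a per-call reply exchange, then let the worker reply."""
    msg_id = uuid.uuid4().hex
    broker.declare(msg_id)
    if drop_reply_exchange:
        # Assumption: models cases 1 and 2, where the reply target
        # vanishes before the worker gets around to replying.
        broker.delete(msg_id)
    handler(broker, msg_id)
    return broker.exchanges[msg_id].pop(0)


if __name__ == "__main__":
    broker = ToyBroker()
    print(call(broker, worker))
    try:
        call(broker, worker, drop_reply_exchange=True)
    except NotFound as exc:
        print("worker reply failed:", exc)
```

The first call returns the worker's reply normally. The second raises on the worker side at publish time, which matches where the traceback in the report originates (inside the reply, not inside the request).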