RPC timeout in all neutron agents
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
Ubuntu Cloud Archive | Fix Released | Undecided | Unassigned |
Icehouse | Fix Released | High | Xiang Hui |
Juno | Fix Released | High | Xiang Hui |
neutron | Invalid | Undecided | Unassigned |
oslo.messaging | Fix Released | Medium | Dr. Jens Harbott |
neutron (Ubuntu) | | | |
Trusty | Fix Released | High | Xiang Hui |
oslo.messaging (Ubuntu) | Fix Released | High | Unassigned |
Trusty | Fix Released | High | Xiang Hui |
Bug Description
In the logs, the first traceback that appears is this:
[-] Unexpected exception occurred 1 time(s)... retrying.
Traceback (most recent call last):
  File "/opt/cloudbau/..."
    return infunc(*args, **kwargs)
  File "/opt/cloudbau/..."
  File "/opt/cloudbau/..."
    @excutils. ...
  File "/opt/cloudbau/..."
    try:
  File "/opt/cloudbau/..."
    def close(self):
  File "/opt/cloudbau/..."
    # to return an error not covered by its transport
  File "/opt/cloudbau/..."
    Will retry up to self.max_retries number of times.
  File "/opt/cloudbau/..."
    use the callback passed during __init__()
  File "/opt/cloudbau/..."
    self. ...
  File "/opt/cloudbau/..."
    self. ...
  File "/opt/cloudbau/..."
    nowait=nowait)
  File "/opt/cloudbau/..."
    (50, 21), # Channel. ...
  File "/opt/cloudbau/..."
    return self.dispatch_ ...
  File "/opt/cloudbau/..."
    return amqp_method(self, args)
  File "/opt/cloudbau/..."
    reply_code, reply_text, (class_id, method_id), ChannelError,
NotFound: Queue.bind: (404) NOT_FOUND - no exchange 'reply_...'
Then an RPC timeout is raised each second in all the agents:
ERROR neutron....
TRACE neutron....
...
This actually makes the agents useless until they are all restarted.
An analysis of what's going on is coming soon :)
-------
[Impact]
This patch addresses an issue where, when a RabbitMQ cluster node goes down, OpenStack services try to reconnect to another RabbitMQ node and then re-create everything from scratch. Because the 'auto-delete' flag is set, a race condition occurs between the re-creation and the deletion of exchanges, queues and bindings, which brings nova-compute and the neutron agents down.
[Test Case]
Note: these steps are for trusty-icehouse, including the latest oslo.messaging library (1.3.0-0ubuntu1.2 at the time of this writing).
Deploy an OpenStack cloud with multiple RabbitMQ nodes, then abruptly kill one of the RabbitMQ nodes (e.g. sudo service rabbitmq-server stop). Observe that the nova services and neutron agents do detect that the node went down and report that they have reconnected, but messages still time out and commands such as nova service-list report the services as down.
[Regression Potential]
None.
Related branches
- Ubuntu Cloud Archive Team: Pending requested
Diff: 9657 lines (+9295/-23), 29 files modified:
.pc/.quilt_patches (+1/-0)
.pc/.quilt_series (+1/-0)
.pc/.version (+1/-0)
.pc/0001-Fix-_poll_connection-not-timeout-issue-1-2.patch/oslo/messaging/_drivers/amqpdriver.py (+444/-0)
.pc/0001-Fix-_poll_connection-not-timeout-issue-1-2.patch/oslo/messaging/_drivers/common.py (+349/-0)
.pc/0001-Fix-_poll_connection-not-timeout-issue-1-2.patch/oslo/messaging/_drivers/impl_rabbit.py (+821/-0)
.pc/0002-Fix-_poll_connection-not-timeout-issue-2-2.patch/oslo/messaging/_drivers/common.py (+374/-0)
.pc/0002-Fix-_poll_connection-not-timeout-issue-2-2.patch/tests/test_utils.py (+49/-0)
.pc/0003-qpid-Always-auto-delete-queue-of-DirectConsumer.patch/oslo/messaging/_drivers/impl_qpid.py (+729/-0)
.pc/0004-rabbit-redeclare-consumers-when-ack-requeue-fail.patch/oslo/messaging/_drivers/impl_rabbit.py (+832/-0)
.pc/0004-rabbit-redeclare-consumers-when-ack-requeue-fail.patch/tests/drivers/test_impl_rabbit.py (+735/-0)
.pc/0005-Fix-possible-usage-of-undefined-variable.patch/oslo/messaging/_drivers/impl_rabbit.py (+839/-0)
.pc/0006-Declare-DirectPublisher-exchanges-with-passive-True.patch/oslo/messaging/_drivers/common.py (+374/-0)
.pc/0006-Declare-DirectPublisher-exchanges-with-passive-True.patch/oslo/messaging/_drivers/impl_rabbit.py (+839/-0)
.pc/0006-Declare-DirectPublisher-exchanges-with-passive-True.patch/tests/drivers/test_impl_rabbit.py (+756/-0)
.pc/0006-Declare-DirectPublisher-exchanges-with-passive-True.patch/tests/test_utils.py (+64/-0)
.pc/0007-fix-reconnect-race-condition-with-rabbitmq-cluster.patch/oslo/messaging/_drivers/impl_rabbit.py (+865/-0)
.pc/applied-patches (+8/-0)
.pc/zmq-server-routing.patch/oslo/messaging/_drivers/impl_zmq.py (+995/-0)
debian/changelog (+8/-0)
debian/patches/0007-fix-reconnect-race-condition-with-rabbitmq-cluster.patch (+24/-0)
debian/patches/series (+1/-0)
oslo/messaging/_drivers/amqpdriver.py (+16/-8)
oslo/messaging/_drivers/common.py (+26/-0)
oslo/messaging/_drivers/impl_qpid.py (+0/-1)
oslo/messaging/_drivers/impl_rabbit.py (+69/-12)
oslo/messaging/_drivers/impl_zmq.py (+2/-0)
tests/drivers/test_impl_rabbit.py (+41/-2)
tests/test_utils.py (+32/-0)
- Ubuntu Development Team: Pending requested
Diff: 1047 lines (+992/-1), 6 files modified:
.pc/applied-patches (+1/-0)
.pc/fix-reconnect-race-condition-with-rabbitmq-cluster.patch/neutron/openstack/common/rpc/impl_kombu.py (+858/-0)
debian/changelog (+8/-0)
debian/patches/fix-reconnect-race-condition-with-rabbitmq-cluster.patch (+77/-0)
debian/patches/series (+1/-0)
neutron/openstack/common/rpc/impl_kombu.py (+47/-1)
- Ubuntu Development Team: Pending requested
Diff: 992 lines (+936/-1), 6 files modified:
.pc/applied-patches (+1/-0)
.pc/fix-reconnect-race-condition-with-rabbitmq-cluster.patch/oslo/messaging/_drivers/impl_rabbit.py (+834/-0)
debian/changelog (+8/-0)
debian/patches/fix-reconnect-race-condition-with-rabbitmq-cluster.patch (+78/-0)
debian/patches/series (+1/-0)
oslo/messaging/_drivers/impl_rabbit.py (+14/-1)
- James Page: Approve
Diff: 104 lines (+84/-0), 3 files modified:
debian/changelog (+8/-0)
debian/patches/fix-reconnect-race-condition-with-rabbitmq-cluster.patch (+75/-0)
debian/patches/series (+1/-0)
affects: oslo-incubator → oslo.messaging
Changed in neutron:
  assignee: mouadino (mouadino) → Dr. Jens Rosenboom (j-rosenboom-j)
Changed in oslo.messaging:
  assignee: mouadino (mouadino) → Dr. Jens Rosenboom (j-rosenboom-j)
Changed in oslo.messaging:
  assignee: mouad.benchchaoui (mouad-benchchaoui) → Dr. Jens Rosenboom (j-rosenboom-j)
Changed in oslo.messaging:
  milestone: none → 1.6.0
  status: Fix Committed → Fix Released
description: updated
affects: ubuntu → neutron (Ubuntu)
Changed in neutron (Ubuntu):
  importance: Undecided → Medium
Changed in oslo.messaging (Ubuntu):
  importance: Undecided → Medium
no longer affects: oslo.messaging (Ubuntu)
Changed in neutron (Ubuntu Trusty):
  milestone: none → trusty-updates
Changed in oslo.messaging (Ubuntu Trusty):
  milestone: none → trusty-updates
Changed in cloud-archive:
  status: Invalid → Fix Released
Changed in oslo.messaging (Ubuntu):
  importance: Undecided → High
  status: Invalid → Fix Released
no longer affects: neutron (Ubuntu)
tags: added: sts-sru
tags: removed: sts-sru
Hi Again,
We have tracked this error, which happened a few times in our cluster, and we found that it only occurs when RabbitMQ clustering is used (I don't know if it is the same with Qpid), and that it starts when one of the cluster nodes goes down.
Basically, when a cluster node goes down, neutron tries to reconnect to another RabbitMQ node and then re-creates everything from scratch, i.e. exchanges, queues, bindings and so on. Because RabbitMQ clustering makes sure that everything is replicated, most of the re-creation ends up being a no-op on the cluster side. Up to here everything is fine, except for exchanges with the auto-delete flag set.
A bit of background first. As some of you may already know, the auto-delete flag on an exchange tells the RabbitMQ server to delete the exchange when no queue is using it any more [check footer]; likewise, when auto-delete is set on a queue, the RabbitMQ server deletes the queue when its last consumer disconnects. On the neutron side, all RPC queues and exchanges are created with auto-delete, and only the RPC queues/exchanges exhibit this problem, so from now on we will discuss only the RPC case.
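To make the auto-delete semantics concrete, here is a minimal kombu sketch; the broker URL and the 'reply_example' name are illustrative placeholders, not values from this cluster:

    from kombu import Connection, Exchange, Queue

    # auto_delete=True on the exchange: the broker deletes it as soon as
    # the last queue bound to it is gone. auto_delete=True on the queue:
    # the broker deletes it as soon as its last consumer disconnects.
    exchange = Exchange('reply_example', type='direct', auto_delete=True)
    queue = Queue('reply_example', exchange=exchange,
                  routing_key='reply_example', auto_delete=True)

    with Connection('amqp://guest:guest@node1:5672//') as conn:
        queue(conn.channel()).declare()

This is the shape of entity the RPC layer creates, which is why only the RPC queues/exchanges are affected.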
Also, for the purpose of this example, let's say that we have a 3-node RabbitMQ cluster (node1, node2, node3) and that some neutron agent X is using node1, by which I mean that the connections are made from the neutron X agent to node1. We will talk only about agents, because they are the ones that create the RPC queues/exchanges.
Now when node1 goes down [step 1], the queue consumers are removed because the connection from agent X to node1 is broken, which means the queue will be deleted, and when queues are deleted their exchanges are deleted too; all of this happens **eventually** on every cluster node that is still alive [step 2]. **At the same time**, when the neutron agent detects that node1 is down, it tries to reconnect to another node and re-create the same RPC queues and exchanges [step 3].
As you may have guessed, there is a race condition here over the queues and exchanges: on one side, in [step 2], the cluster is trying to delete them; on the other side, neutron is trying to create them.
In detail, this is what happens on the neutron side:
N1. Connect to node 2.
N2. Create Exchange X.
N3. Create Queue Q.
N4. Create Binding from Q to X.
NOTE: (N2, N3, N4) are all part of kombu.entity.Queue.declare()
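Spelled out with kombu, the N1-N4 sequence looks roughly like this; the connection URL and the 'reply_...'-style name are placeholders:

    from kombu import Connection, Exchange, Queue

    conn = Connection('amqp://guest:guest@node2:5672//')  # N1: connect to node2
    channel = conn.channel()

    exchange = Exchange('reply_abc123', type='direct', auto_delete=True)
    queue = Queue('reply_abc123', exchange=exchange,
                  routing_key='reply_abc123', auto_delete=True)

    # Queue.declare() performs Exchange.declare (N2), Queue.declare (N3)
    # and Queue.bind (N4) on the channel, in that order; three separate
    # AMQP operations, not one atomic step.
    queue(channel).declare()

The important point is the non-atomicity: the broker is free to delete the exchange between N2 and N4.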
And on each cluster node that is still alive:
R1. Delete consumer.
R2. Delete Queue Q (Binding is deleted explicitly).
R3. Delete Exchange X.
So there is actually a good chance that R3 has already been executed before N4 is done, which means that N4 will fail with Exchange NOT_FOUND (check the first traceback)!
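A minimal defensive sketch, assuming the py-amqp client where the 404 surfaces as amqp.exceptions.NotFound (as in the first traceback); the retry helper below is only illustrative and is not the actual oslo.messaging fix:

    import time
    from amqp.exceptions import NotFound

    def declare_with_retry(conn, queue, max_retries=3):
        # Retry the whole declare sequence if the auto-delete exchange
        # vanishes between Exchange.declare (N2) and Queue.bind (N4).
        for attempt in range(max_retries):
            try:
                # A channel error invalidates the channel, so take a
                # fresh one on every attempt.
                queue(conn.channel()).declare()
                return
            except NotFound:
                if attempt == max_retries - 1:
                    raise
                time.sleep(1)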
As for what happens after that: while the agent can still send RPC calls to the neutron server, the neutron server cannot send the reply back because there is no exchange to send it to, which means that the replies are dropped by the RabbitMQ server and the agent waits until the timeout is reached (check the second traceback).
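From the agent's point of view this looks like an RPC call that blocks until the RPC timeout expires; with the current oslo.messaging API the failing pattern would look roughly like this (the topic, method and arguments are placeholders):

    import oslo_messaging as messaging
    from oslo_config import cfg

    transport = messaging.get_transport(cfg.CONF)
    target = messaging.Target(topic='q-plugin')
    client = messaging.RPCClient(transport, target, timeout=60)

    # The reply exchange is gone, so the broker drops the server's reply;
    # this call blocks for the full 60 seconds and then raises
    # oslo_messaging.MessagingTimeout, matching the repeated timeouts
    # seen in the agent logs.
    client.call({}, 'get_device_details', device='tap0')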
To generalize, actu...