RPC timeout in all neutron agents

Bug #1318721 reported by mouadino on 2014-05-12
This bug affects 4 people
Affects                    Importance   Assigned to
Ubuntu Cloud Archive       Undecided    Unassigned
  Icehouse                 High         Xiang Hui
  Juno                     High         Xiang Hui
neutron                    Undecided    Unassigned
oslo.messaging             Medium       Dr. Jens Harbott
neutron (Ubuntu)
  Trusty                   High         Xiang Hui
oslo.messaging (Ubuntu)    High         Unassigned
  Trusty                   High         Xiang Hui

Bug Description

In the logs, the first traceback that happens is this:

[-] Unexpected exception occurred 1 time(s)... retrying.
Traceback (most recent call last):
  File "/opt/cloudbau/neutron-virtualenv/lib/python2.7/site-packages/neutron/openstack/common/excutils.py", line 62, in inner_func
    return infunc(*args, **kwargs)
  File "/opt/cloudbau/neutron-virtualenv/lib/python2.7/site-packages/neutron/openstack/common/rpc/impl_kombu.py", line 741, in _consumer_thread

  File "/opt/cloudbau/neutron-virtualenv/lib/python2.7/site-packages/neutron/openstack/common/rpc/impl_kombu.py", line 732, in consume
    @excutils.forever_retry_uncaught_exceptions
  File "/opt/cloudbau/neutron-virtualenv/lib/python2.7/site-packages/neutron/openstack/common/rpc/impl_kombu.py", line 660, in iterconsume
    try:
  File "/opt/cloudbau/neutron-virtualenv/lib/python2.7/site-packages/neutron/openstack/common/rpc/impl_kombu.py", line 590, in ensure
    def close(self):
  File "/opt/cloudbau/neutron-virtualenv/lib/python2.7/site-packages/neutron/openstack/common/rpc/impl_kombu.py", line 531, in reconnect
    # to return an error not covered by its transport
  File "/opt/cloudbau/neutron-virtualenv/lib/python2.7/site-packages/neutron/openstack/common/rpc/impl_kombu.py", line 513, in _connect
    Will retry up to self.max_retries number of times.
  File "/opt/cloudbau/neutron-virtualenv/lib/python2.7/site-packages/neutron/openstack/common/rpc/impl_kombu.py", line 150, in reconnect
    use the callback passed during __init__()
  File "/opt/cloudbau/neutron-virtualenv/lib/python2.7/site-packages/kombu/entity.py", line 508, in declare
    self.queue_bind(nowait)
  File "/opt/cloudbau/neutron-virtualenv/lib/python2.7/site-packages/kombu/entity.py", line 541, in queue_bind
    self.binding_arguments, nowait=nowait)
  File "/opt/cloudbau/neutron-virtualenv/lib/python2.7/site-packages/kombu/entity.py", line 551, in bind_to
    nowait=nowait)
  File "/opt/cloudbau/neutron-virtualenv/lib/python2.7/site-packages/amqp/channel.py", line 1003, in queue_bind
    (50, 21), # Channel.queue_bind_ok
  File "/opt/cloudbau/neutron-virtualenv/lib/python2.7/site-packages/amqp/abstract_channel.py", line 68, in wait
    return self.dispatch_method(method_sig, args, content)
  File "/opt/cloudbau/neutron-virtualenv/lib/python2.7/site-packages/amqp/abstract_channel.py", line 86, in dispatch_method
    return amqp_method(self, args)
  File "/opt/cloudbau/neutron-virtualenv/lib/python2.7/site-packages/amqp/channel.py", line 241, in _close
    reply_code, reply_text, (class_id, method_id), ChannelError,
NotFound: Queue.bind: (404) NOT_FOUND - no exchange 'reply_8f19344531b448c89d412ee97ff11e79' in vhost '/'

Then an RPC timeout is raised every second in all the agents:

ERROR neutron.agent.l3_agent [-] Failed synchronizing routers
TRACE neutron.agent.l3_agent Traceback (most recent call last):
TRACE neutron.agent.l3_agent File "/opt/cloudbau/neutron-virtualenv/lib/python2.7/site-packages/neutron/agent/l3_agent.py", line 702, in _rpc_loop
TRACE neutron.agent.l3_agent self.context, router_ids)
TRACE neutron.agent.l3_agent File "/opt/cloudbau/neutron-virtualenv/lib/python2.7/site-packages/neutron/agent/l3_agent.py", line 79, in get_routers
TRACE neutron.agent.l3_agent topic=self.topic)
TRACE neutron.agent.l3_agent File "/opt/cloudbau/neutron-virtualenv/lib/python2.7/site-packages/neutron/openstack/common/rpc/proxy.py", line 130, in call
TRACE neutron.agent.l3_agent exc.info, real_topic, msg.get('method'))
TRACE neutron.agent.l3_agent Timeout: Timeout while waiting on RPC response - topic: "q-l3-plugin", RPC method: "sync_routers" info: "<unknown>"

This actually makes the agents useless until they are all restarted.

An analysis of what's going on is coming soon :)

---------------------------

[Impact]

This patch addresses an issue seen when a RabbitMQ cluster node goes down: OpenStack services reconnect to another RabbitMQ node and re-create everything from scratch. Because the 'auto-delete' flag is set, a race condition occurs between the re-creation (by the client) and the deletion (by the surviving broker nodes) of exchanges, queues and bindings, which left nova-compute and the neutron agents down.

[Test Case]

Note: steps are for trusty-icehouse, including the latest oslo.messaging library (1.3.0-0ubuntu1.2 at the time of this writing).

Deploy an OpenStack cloud with multiple rabbit nodes and then abruptly kill one of the rabbit nodes (e.g. sudo service rabbitmq-server stop, etc.). Observe that the nova services and neutron agents do detect that the node went down and report that they have reconnected, but messages still time out, and nova service-list / neutron agent-list still report the compute services and agents as down.

[Regression Potential]

None.

Related branches

mouadino (mouadino) wrote :

Hi Again,

We have tracked this error, which happened a few times in our cluster, and we found that it happens only if you're using RabbitMQ clustering (I don't know whether it's the same with Qpid), and it starts when one of the cluster nodes goes down.

Basically, when a cluster node goes down, neutron tries to reconnect to another RabbitMQ node and then re-creates everything from scratch, i.e. exchanges, queues, bindings and so on. Because RabbitMQ clustering makes sure that everything is replicated, most of the re-creation ends up being a no-op on the cluster side. Up to here everything is fine, except for exchanges with the auto-delete flag set.

A bit of background first. As some of you may already know, the auto-delete flag on an exchange tells the RabbitMQ server to delete the exchange when there are no more queues using it [check footer]; likewise, when auto-delete is set on a queue, it tells the RabbitMQ server to delete the queue when the last consumer disconnects from it. On the neutron side, we should state that all RPC queues and exchanges are created with auto-delete, and that only RPC queues/exchanges exhibit this problem, which means that from now on we will only be discussing the RPC case.
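The auto-delete semantics can be sketched with a toy in-memory model (plain Python; this models the broker's bookkeeping for illustration only, it is not the real kombu/amqp API, and all names here are made up):

```python
# Toy model of AMQP auto-delete semantics (illustrative only).

class ToyBroker:
    def __init__(self):
        self.exchanges = {}   # exchange name -> set of queues bound to it
        self.queues = {}      # queue name -> consumer count

    def declare(self, exchange, queue):
        # Declaring is idempotent: a no-op if the entities already exist.
        self.exchanges.setdefault(exchange, set())
        self.queues.setdefault(queue, 0)
        self.exchanges[exchange].add(queue)

    def disconnect_consumer(self, exchange, queue):
        # auto-delete: losing the last consumer deletes the queue...
        self.queues.pop(queue, None)
        bound = self.exchanges.get(exchange, set())
        bound.discard(queue)
        # ...and deleting the last bound queue deletes the exchange.
        if not bound:
            self.exchanges.pop(exchange, None)

broker = ToyBroker()
broker.declare('reply_exchange', 'reply_queue')
broker.disconnect_consumer('reply_exchange', 'reply_queue')
print('reply_exchange' in broker.exchanges)  # False: both were auto-deleted
```

This cascade (consumer gone, then queue gone, then exchange gone) is exactly what the surviving cluster nodes perform when node1 dies.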

Also, for the purpose of this example, let's say that we have a 3-node RabbitMQ cluster (node1, node2, node3) and that some neutron agent X is using node1, by which I mean that its connections are made from neutron-x-agent to node1. We should also state that we will be talking about agents only, because they are the ones that create the RPC queues/exchanges.

Now, when node1 goes down [step 1], the queue consumers are removed because the connection from agent X to node1 is broken, which means that the queue will be deleted, and when queues are deleted their exchanges will also be deleted. This all happens **eventually** in all cluster nodes that are still alive [step 2]. **At the same time**, when neutron agents detect that node1 is down, each agent tries to reconnect to another node and re-create the same RPC queues and exchanges [step 3].

As you may have guessed, there is a race condition here over queues and exchanges: on one hand, in [step 2], the cluster is trying to delete the queues and exchanges, and on the other hand neutron is trying to create them.

In detail, this is what happens on the neutron side:

N1. Connect to node 2.
N2. Create Exchange X.
N3. Create Queue Q.
N4. Create Binding from Q to X.

NOTE: (N2, N3, N4) are all part of kombu.entity.Queue.declare()

And on each surviving node of the cluster:

R1. Delete consumer.
R2. Delete Queue Q (Binding is deleted explicitly).
R3. Delete Exchange X.

So there is actually a good chance that R3 has already been executed before N4 is done, which means that N4 will fail with Exchange NOT_FOUND (check the first traceback)!
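The unlucky interleaving can be sketched as a toy timeline (plain Python, not real AMQP; the step names follow the N/R numbering above, and the entity names X and Q are the ones from the example):

```python
# Toy timeline of the race. The surviving node still holds the
# replicated exchange X and queue Q from before node1 died.

class NotFound(Exception):
    pass

exchanges = {'X'}   # replicated state on the surviving node
queues = {'Q'}

def n2_declare_exchange():
    exchanges.add('X')          # no-op: still replicated

def n3_declare_queue():
    queues.add('Q')             # no-op: still replicated

def r2_r3_delete():
    queues.discard('Q')         # R2: queue lost its consumers
    exchanges.discard('X')      # R3: exchange lost its last queue

def n4_bind():
    if 'X' not in exchanges:
        raise NotFound("Queue.bind: (404) NOT_FOUND - no exchange 'X'")

n2_declare_exchange()
n3_declare_queue()
r2_r3_delete()                  # the delete lands between N3 and N4...
caught = None
try:
    n4_bind()
except NotFound as exc:
    caught = exc                # ...so the bind fails as in the traceback
print(caught)
```

Because N2 and N3 were no-ops rather than fresh creations, the broker's pending delete wins and the bind in N4 has nothing to bind to.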

As for what happens afterwards: while the agent can still send RPC calls to the neutron server, the neutron server cannot send replies back, because there is no exchange to send them to. This means that the replies are dropped by the RabbitMQ server, and the agent waits until the timeout is reached (check the second traceback).

To generalize, actu...


mouadino (mouadino) wrote :

Just to be clear about something: the fact that RabbitMQ deletes the exchange even though neutron re-created it (N2) can put some blame on RabbitMQ, because it should have treated N2 not just as a no-op (since the exchange already exists) but also as an update of the exchange, as if it were new, so that the **old** delete doesn't remove it again.

To make sure of my claim here, I will try to contact the RabbitMQ mailing list about this problem.

Fix proposed to branch: master
Review: https://review.openstack.org/93399

Changed in neutron:
assignee: nobody → mouadino (mouadino)
status: New → In Progress

Fix proposed to branch: master
Review: https://review.openstack.org/102555

Changed in oslo:
assignee: nobody → mouadino (mouadino)
status: New → In Progress

Change abandoned by mouad benchchaoui (<email address hidden>) on branch: master
Review: https://review.openstack.org/93399
Reason: Moved to oslo: https://review.openstack.org/#/c/102555/

mouadino (mouadino) wrote :

I am including here the discussion that i had with rabbitmq mailing list: http://lists.rabbitmq.com/pipermail/rabbitmq-discuss/2014-May/035923.html

tl;dr: if queue mirroring is not enabled there is a bigger chance of this bug happening, but it's still something that should be dealt with, according to the RabbitMQ mailing list folks.

affects: oslo-incubator → oslo.messaging
Changed in neutron:
assignee: mouadino (mouadino) → Dr. Jens Rosenboom (j-rosenboom-j)
Changed in oslo.messaging:
assignee: mouadino (mouadino) → Dr. Jens Rosenboom (j-rosenboom-j)
Mehdi Abaakouk (sileht) wrote :
Changed in neutron:
status: In Progress → Invalid
Changed in oslo.messaging:
assignee: Dr. Jens Rosenboom (j-rosenboom-j) → mouad.benchchaoui (mouad-benchchaoui)
Changed in neutron:
assignee: Dr. Jens Rosenboom (j-rosenboom-j) → nobody
Changed in oslo.messaging:
importance: Undecided → Medium
Changed in oslo.messaging:
assignee: mouad.benchchaoui (mouad-benchchaoui) → Dr. Jens Rosenboom (j-rosenboom-j)

Reviewed: https://review.openstack.org/103157
Committed: https://git.openstack.org/cgit/openstack/oslo.messaging/commit/?id=7ad0d7eaf9cb095a14b07a08c814d9f1f9c8ff12
Submitter: Jenkins
Branch: master

commit 7ad0d7eaf9cb095a14b07a08c814d9f1f9c8ff12
Author: Jens Rosenboom <email address hidden>
Date: Fri Jun 27 16:46:47 2014 +0200

    Fix reconnect race condition with RabbitMQ cluster

    Retry queue creation to work around a race condition that may
    happen when the client and the broker race over exchange creation
    and deletion respectively, which happens only when the
    queue/exchange were created with the auto-delete flag.

    Queues/exchanges declared with auto-delete instruct the broker to
    delete the queue when the last consumer disconnects from it, and
    the exchange when the last queue is deleted from that exchange.

    Now in a RabbitMQ cluster setup, if the cluster node that we are
    connected to goes down, two things will happen:

     1. On the RabbitMQ side, the queues with auto-delete will be
        deleted from the other cluster nodes, and then so will the
        exchanges that the queues are bound to, if they were also
        created with auto-delete.
     2. On the client side, the client will reconnect to another
        cluster node and call queue.declare(), which creates exchanges,
        then queues, then bindings, in that order.

    Now in the happy path the queues/exchanges are deleted from the
    broker before the client starts re-creating them, but it is also
    possible that the client first creates the queues/exchanges as
    part of the queue.declare() call, as no-op operations because they
    already existed, but before it can bind the queue to the exchange,
    the RabbitMQ nodes receive the 'signal' that the queue doesn't
    have any consumers, so it should be deleted, and the same for the
    exchanges, which leads to the binding failing with a NotFound
    error.

    Illustration of the time line from Client and RabbitMQ cluster
    respectively when the race condition happen:

           e-declare(E) q-declare(Q) q-bind(Q, E)
         -----+------------------+----------------+----------->
                                   e-delete(E)
         ------------------------------+---------------------->

    Change-Id: Ideb73af6f246a8282780cdb204d675d5d4555bf0
    Closes-Bug: #1318721
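The idea behind the committed fix can be sketched as a generic retry wrapper (an illustrative sketch only, assuming the bind error surfaces as an exception whose text contains the 404 code; the actual patch lives in oslo.messaging's kombu driver, and the helper names here are made up):

```python
# Sketch of the fix's idea: retry the whole declare sequence when the
# bind hits NOT_FOUND. On the second attempt the stale exchange is
# really gone, so declare re-creates it instead of hitting the no-op
# path, and the bind succeeds.

def declare_with_retry(declare, is_not_found, attempts=2):
    for attempt in range(attempts):
        try:
            declare()
            return
        except Exception as exc:
            # Only the transient 404 race is worth retrying; anything
            # else (or a final failed attempt) is re-raised as before.
            if not is_not_found(exc) or attempt == attempts - 1:
                raise

# Hypothetical usage: the first declare loses the race, the retry wins.
calls = []

def flaky_declare():
    calls.append(1)
    if len(calls) == 1:
        raise RuntimeError('404 NOT_FOUND')  # first bind races with delete

declare_with_retry(flaky_declare, lambda e: '404' in str(e))
print(len(calls))  # 2: first attempt failed, the retry succeeded
```

A single retry is enough here because the second declare happens after the broker has finished its deletes, so the entities are created fresh rather than matched against doomed replicas.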

Changed in oslo.messaging:
status: In Progress → Fix Committed
Changed in oslo.messaging:
milestone: none → 1.6.0
status: Fix Committed → Fix Released
Xiang Hui (xianghui) on 2015-12-17
description: updated

The attachment "trusty-juno.debdiff" seems to be a debdiff. The ubuntu-sponsors team has been subscribed to the bug report so that they can review and hopefully sponsor the debdiff. If the attachment isn't a patch, please remove the "patch" flag from the attachment, remove the "patch" tag, and if you are member of the ~ubuntu-sponsors, unsubscribe the team.

[This is an automated message performed by a Launchpad user owned by ~brian-murray, for any issue please contact him.]

tags: added: patch
Launchpad Janitor (janitor) wrote :

Status changed to 'Confirmed' because the bug affects multiple users.

Changed in neutron (Ubuntu):
status: New → Confirmed
affects: ubuntu → neutron (Ubuntu)
Changed in neutron (Ubuntu):
importance: Undecided → Medium
Changed in oslo.messaging (Ubuntu):
importance: Undecided → Medium
no longer affects: oslo.messaging (Ubuntu)
Xiang Hui (xianghui) wrote :
Xiang Hui (xianghui) wrote :
Corey Bryant (corey.bryant) wrote :

Hui,

Thanks for the patches. Your juno update to oslo.messaging is now being regression tested in the trusty juno-staging PPA. After testing passes we'll promote it to cloud-archive:juno-proposed, and at that point we'll also be able to upload the icehouse updates to trusty.

Regards,
Corey

Changed in neutron (Ubuntu):
status: New → In Progress
Changed in neutron (Ubuntu Trusty):
status: New → In Progress
Changed in oslo.messaging (Ubuntu Trusty):
status: New → In Progress
Changed in oslo.messaging (Ubuntu):
status: New → Invalid
Changed in neutron (Ubuntu):
status: In Progress → Invalid
Changed in cloud-archive:
status: New → Invalid
Changed in neutron (Ubuntu Trusty):
assignee: nobody → Xiang Hui (xianghui)
Changed in oslo.messaging (Ubuntu Trusty):
assignee: nobody → Xiang Hui (xianghui)
importance: Undecided → High
Changed in neutron (Ubuntu Trusty):
importance: Undecided → High

Hello mouadino, or anyone else affected,

Accepted neutron into trusty-proposed. The package will build now and be available at https://launchpad.net/ubuntu/+source/neutron/1:2014.1.5-0ubuntu2 in a few hours, and then in the -proposed repository.

Please help us by testing this new package. See https://wiki.ubuntu.com/Testing/EnableProposed for documentation how to enable and use -proposed. Your feedback will aid us getting this update out to other Ubuntu users.

If this package fixes the bug for you, please add a comment to this bug, mentioning the version of the package you tested, and change the tag from verification-needed to verification-done. If it does not fix the bug for you, please add a comment stating that, and change the tag to verification-failed. In either case, details of your testing will help us make a better decision.

Further information regarding the verification process can be found at https://wiki.ubuntu.com/QATeam/PerformingSRUVerification . Thank you in advance!

Changed in neutron (Ubuntu Trusty):
status: In Progress → Fix Committed
tags: added: verification-needed
Chris J Arges (arges) wrote :

Hello mouadino, or anyone else affected,

Accepted oslo.messaging into trusty-proposed. The package will build now and be available at https://launchpad.net/ubuntu/+source/oslo.messaging/1.3.0-0ubuntu1.4 in a few hours, and then in the -proposed repository.

Please help us by testing this new package. See https://wiki.ubuntu.com/Testing/EnableProposed for documentation how to enable and use -proposed. Your feedback will aid us getting this update out to other Ubuntu users.

If this package fixes the bug for you, please add a comment to this bug, mentioning the version of the package you tested, and change the tag from verification-needed to verification-done. If it does not fix the bug for you, please add a comment stating that, and change the tag to verification-failed. In either case, details of your testing will help us make a better decision.

Further information regarding the verification process can be found at https://wiki.ubuntu.com/QATeam/PerformingSRUVerification . Thank you in advance!

Changed in oslo.messaging (Ubuntu Trusty):
status: In Progress → Fix Committed
Changed in neutron (Ubuntu Trusty):
milestone: none → trusty-updates
Changed in oslo.messaging (Ubuntu Trusty):
milestone: none → trusty-updates
Xiang Hui (xianghui) wrote :

Used neutron 1:2014.1.5-0ubuntu2 and oslo.messaging 1.3.0-0ubuntu1.4 from trusty-proposed; confirmed that the problem no longer occurs.

tags: added: verification-done
removed: verification-needed
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package oslo.messaging - 1.3.0-0ubuntu1.4

---------------
oslo.messaging (1.3.0-0ubuntu1.4) trusty; urgency=medium

  * Backport upstream fix (LP: #1318721):
    - d/p/fix-reconnect-race-condition-with-rabbitmq-cluster.patch:
      Redeclare if exception is catched after self.queue.declare() failed.

 -- Hui Xiang <email address hidden> Thu, 17 Dec 2015 16:22:35 +0800

Changed in oslo.messaging (Ubuntu Trusty):
status: Fix Committed → Fix Released

The verification of the Stable Release Update for oslo.messaging has completed successfully and the package has now been released to -updates. Subsequently, the Ubuntu Stable Release Updates Team is being unsubscribed and will not receive messages about this bug report. In the event that you encounter a regression using the package from -updates please report a new bug using ubuntu-bug and tag the bug report regression-update so we can easily find any regressions.

Serge Hallyn (serge-hallyn) wrote :

Hi @arges,

I notice that neutron did not get promoted to trusty-updates, though oslo.messaging did.

Brian Murray (brian-murray) wrote :

Neutron was not promoted because it is failing its auto package tests.

http://autopkgtest.ubuntu.com/packages/n/neutron/trusty/armhf/

Xiang Hui (xianghui) wrote :

Hi Serge, Brian,

Thanks for looking at it. Could the auto package tests be re-run, or is there some other approach to identify the failure? I can't see how this simple fix could break the brocade-plugin test shown at http://autopkgtest.ubuntu.com/packages/n/neutron/trusty/armhf/; besides, if it passed -proposed, which runs the neutron tests, it should work for -updates as well.

brocade-plugin FAIL non-zero exit status 1

Thanks guys.

Xiang Hui (xianghui) wrote :
Xiang Hui (xianghui) wrote :

Hello all,

After reproducing the test on my environment, the error is reported as:

The following packages have unmet dependencies:
 neutron-plugin-hyperv : Depends: neutron-common (= 1:2014.1.5-0ubuntu1) but 1:2014.1.5-0ubuntu2 is to be installed
adt-run: & apt0t-hyperv-plugin: - - - - - - - - - - stderr - - - - - - - - - -
E: Unable to correct problems, you have held broken packages.

It seems unrelated to the neutron fix in the patch; it looks like a dependency problem instead.

Corey Bryant (corey.bryant) wrote :

Hui, I think you're seeing that error because you don't have the proposed universe pocket enabled.

Brian, I'm fairly certain this is a race in the test, where we restart neutron-server and immediately check whether it's running. [1] Can we try rerunning the test for now? I'll fix up the tests in the package the next time we upload a fix.

[1] snippet of code from test, could use a sleep after line 11

  11 service neutron-server restart > /dev/null 2>&1
  12 if pidof -x neutron-server > /dev/null; then
  13     apt-get -y remove --purge neutron-plugin-brocade
  14     echo "OK"
  15 else
  16     echo "ERROR: BROCADE PLUGIN IS NOT RUNNING"
  17     apt-get -y remove --purge neutron-plugin-brocade
  18     exit 1
  19 fi

Xiang Hui (xianghui) wrote :

Hi Corey,

Thanks for the fix.

And I am using 'sudo adt-run neutron -U --apt-pocket=proposed --- qemu adt-trusty-amd64-cloud.img -d -o /tmp/neutron/' for running the pkgtest.

Launchpad Janitor (janitor) wrote :

This bug was fixed in the package neutron - 1:2014.1.5-0ubuntu3

---------------
neutron (1:2014.1.5-0ubuntu3) trusty; urgency=medium

  [ Corey Bryant ]
  * d/p/make_del_fdb_flow_idempotent.patch: Cherry pick from Juno
    to prevent KeyError on duplicate port removal in del_fdb_flow()
    (LP: #1531963).
  * d/tests/*-plugin: Fix race between service restart and pidof test.

  [ James Page ]
  * d/p/ovs-restart.patch: Ensure that tunnels are fully reset on ovs
    restart (LP: #1460164).

 -- Corey Bryant <email address hidden> Wed, 10 Feb 2016 14:52:04 -0500

Changed in neutron (Ubuntu Trusty):
status: Fix Committed → Fix Released
Changed in cloud-archive:
status: Invalid → Fix Released
Changed in oslo.messaging (Ubuntu):
importance: Undecided → High
status: Invalid → Fix Released
no longer affects: neutron (Ubuntu)
tags: added: sts-sru
Louis Bouchard (louis) on 2016-11-09
tags: removed: sts-sru