Response Queues for RPC Calls aren't being torn down

Bug #803168 reported by Antony Messerli
Affects: OpenStack Compute (nova)
Status: Fix Released
Importance: Medium
Assigned to: Chris Behrens

Bug Description

I have a lot of queues in Rabbit that don't appear to be getting torn down from Nova builds.

Example:
ca866fee5eb748bd9526b2cf603e47a0 0
ff74b19d985c4353a0a3d611431f94b2 0
e9b4770ab96149b0b93669abecb237c8 0
44f063dc8b1044509f381137a1ed2cb2 0
9bd1e93e59444c8d92bdca72194e1666 0
b0e18bbb291b4277974eb61a2935e815 0
1dcdf4cd82754219859758ee9ae0f511 0
6f0d8278821342bea82a93ac28735028 0
9d91add65c074093bd44ff4a53f1f8b1 0
4736420006f84d1ea7845bc86c167dc0

root@z1-rabbit:~# rabbitmqctl list_queues | wc -l
10381

Most of these are queues of the type listed above. I'm currently running Rev 1215.
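A rough way to check how many of the listed queues are of this type is to filter the rabbitmqctl output by name. This is only a sketch: it assumes the leftover queues are the ones whose names are 32 lowercase hex characters, as in the example above. The helper name count_reply_queues and the "compute" line in the sample are hypothetical.

```python
import re

# Assumption: leftover response queues have names made of exactly
# 32 lowercase hex characters, like the ones pasted in this report.
REPLY_QUEUE_NAME = re.compile(r"^[0-9a-f]{32}$")


def count_reply_queues(listing):
    """Count lines of `rabbitmqctl list_queues` output whose queue name
    looks like a response queue. `listing` is the raw text output."""
    count = 0
    for line in listing.splitlines():
        fields = line.split()
        # The first column is the queue name; a message count may follow.
        if fields and REPLY_QUEUE_NAME.match(fields[0]):
            count += 1
    return count


# Two names taken from the report, plus one hypothetical service queue
# ("compute") that should not be counted.
sample = """ca866fee5eb748bd9526b2cf603e47a0 0
ff74b19d985c4353a0a3d611431f94b2 0
compute 0
"""
reply_queue_total = count_reply_queues(sample)
```

On a live system the same function could be fed the captured output of rabbitmqctl list_queues instead of the sample string.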


Revision history for this message
Chris Behrens (cbehrens) wrote :

These are the queues for responses from rpc.call/multicall. The queues are declared auto_delete=True, so all I can guess so far is that MulticallWaiter.close() is not being called. I don't see how this can occur right now... but it's definitely happening. If I kill off nova services, the queues are deleted.

Changed in nova:
assignee: nobody → Chris Behrens (cbehrens)
status: New → Confirmed
Thierry Carrez (ttx)
Changed in nova:
importance: Undecided → Medium
Revision history for this message
Johannes Erdfelt (johannes.erdfelt) wrote :

Turns out this problem happens regardless of whether there is an exception or not.

It stems from the awkward use of the carrot API, but I don't know exactly why it's causing the queues to not be deleted. I do know that switching over to iterconsume() solves the problem. However, it causes unit test failures that aren't trivial to fix.

I stopped working on this after having trouble with the unit tests because other people are working on replacing carrot with kombu. That might make fixing this bug moot.

Revision history for this message
Thierry Carrez (ttx) wrote :

@Chris: are you actually working on this, or should we unassign you to potentially let someone else have a shot at it?

Revision history for this message
Chris Behrens (cbehrens) wrote :

I'm working on it (sorta), but as part of a larger rpc refactor to use kombu instead of carrot. There doesn't seem to be a simple fix for carrot, according to Johannes's comments above. Someone else is welcome to take a look, though. I've really not focused specifically on fixing this bug... but I took ownership because I knew it'd go away when switching to kombu.

And kombu probably needs to wait until post-Diablo... unless there's a desire to try to get it in with carrot still being the default (so as not to introduce a new dependency this late).

Revision history for this message
Chris Behrens (cbehrens) wrote :

Crap. It actually appears that even when you call Consumer().close(), the consumer is not actually detached from the queue on the server. This seems to be the case even if you close the channel being used. The same issue happens with kombu, so it's not just a carrot issue. I wonder if it's a RabbitMQ server bug. The queue seems to be removed only when you close the actual connection to rabbit.

So, the only known workaround is to rip out connection pooling and only use a connection once for consumers. :-/ That has performance implications, but it's probably better than having a million queues in rabbit, which probably has its own issues.

Revision history for this message
Chris Behrens (cbehrens) wrote :

Dug more into this after talking with Johannes about what we had found so far.

Turns out that the queue is left behind when closing a channel only if you've not called the interface that issues an AMQP basic_consume command. I think this is technically a rabbit bug and will bring it up on the list. Closing a channel should forget about queues you've bound while using the channel, and remove queues that are auto_delete=True.

Anyway, this is not a huge problem if you can use an interface that calls AMQP's basic_consume command. Unfortunately, with carrot, we use Consumer.fetch(), which does a basic_get() command instead. This is where the real problem is. If we can switch this to use .wait() or .iterconsume(), this bug goes away in all cases except a very rare one where an exception occurs between declaring the queue and calling .wait(). Closing the channel without calling .wait() would leave the queue around.
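To make the difference concrete, here is a toy in-memory model of the auto-delete rule as the AMQP 0-9-1 specification describes it: an auto_delete queue is removed when its last consumer goes away, and a queue that never had a consumer is not removed. This is only an illustration, not carrot, kombu or RabbitMQ code. ToyBroker, poll_reply and consume_reply are hypothetical names.

```python
class ToyBroker:
    """Toy model of auto-delete queue semantics (not a real broker)."""

    def __init__(self):
        # queue name -> {"messages": list, "consumers": set, "auto_delete": bool}
        self.queues = {}

    def queue_declare(self, name, auto_delete=False):
        self.queues.setdefault(
            name,
            {"messages": [], "consumers": set(), "auto_delete": auto_delete},
        )

    def publish(self, name, body):
        self.queues[name]["messages"].append(body)

    def basic_get(self, name):
        # Polling: the message is fetched, but no consumer is registered.
        messages = self.queues[name]["messages"]
        return messages.pop(0) if messages else None

    def basic_consume(self, name, consumer_tag):
        # Subscribing registers a consumer on the queue.
        self.queues[name]["consumers"].add(consumer_tag)

    def basic_cancel(self, name, consumer_tag):
        queue = self.queues[name]
        if consumer_tag not in queue["consumers"]:
            return
        queue["consumers"].remove(consumer_tag)
        # Auto-delete fires only when the last consumer goes away.
        if queue["auto_delete"] and not queue["consumers"]:
            del self.queues[name]


def poll_reply(broker, queue_name):
    """fetch()-style wait: basic_get only, so no consumer ever exists."""
    broker.queue_declare(queue_name, auto_delete=True)
    broker.publish(queue_name, "reply")
    return broker.basic_get(queue_name)


def consume_reply(broker, queue_name):
    """wait()/iterconsume()-style wait: subscribe, read, then cancel."""
    broker.queue_declare(queue_name, auto_delete=True)
    broker.publish(queue_name, "reply")
    broker.basic_consume(queue_name, "tag-1")
    # The toy pops the message directly instead of modelling push delivery.
    reply = broker.queues[queue_name]["messages"].pop(0)
    broker.basic_cancel(queue_name, "tag-1")
    return reply


broker = ToyBroker()
polled = poll_reply(broker, "polled-queue")
consumed = consume_reply(broker, "consumed-queue")
remaining = sorted(broker.queues)
```

In this model the polled queue stays in broker.queues after the reply is read, while the consumed queue is gone as soon as its only consumer is cancelled. That matches the behaviour seen with fetch() versus wait().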

Now, when I was working on connection pooling and termie was working on multicall, there was difficulty getting all tests to pass when using Consumer.wait() vs Consumer.fetch(). This is why we stuck with fetch(), even though it is also less efficient than .wait().

I think the best way to solve this is to wait until the kombu work is done, since the tests have to be completely fixed there anyway. This is coming very soon. I'll link my kombu branch here. It is also tied to lp:798876 and will also fix lp:794627 (auto-reconnecting to rabbit when it restarts).

Thierry Carrez (ttx)
Changed in nova:
status: Confirmed → In Progress
Thierry Carrez (ttx)
Changed in nova:
status: In Progress → Fix Committed
Thierry Carrez (ttx)
Changed in nova:
milestone: none → diablo-rbp
Thierry Carrez (ttx)
Changed in nova:
milestone: diablo-rbp → 2011.3
status: Fix Committed → Fix Released