reply queues fill up with unacked messages

Bug #1362863 reported by Sam Morrison
This bug affects 4 people
Affects                         Status        Importance  Assigned to
OpenStack Compute (nova)        Invalid       High        Unassigned
oslo.messaging                  Fix Released  High        Mehdi Abaakouk
oslo.messaging (Ubuntu)         Fix Released  Undecided   Unassigned
oslo.messaging (Ubuntu Trusty)  Won't Fix     High        Jorge Niedbalski

Bug Description

Since upgrading to Icehouse we consistently get reply_xxxxx queues filling up with unacked messages. To fix this I have to restart the affected service. It seems to happen when something goes wrong for a short period of time and the service doesn't clean up after itself.

So far I've seen the issue with nova-api, nova-compute, nova-network, nova-api-metadata and cinder-api, but I'm sure there are others.
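
(For context: the reply_xxxxx queues are per-process queues that the oslo.messaging RabbitMQ driver creates so that blocking RPC calls can receive their responses. A minimal sketch of the calling pattern that relies on such a queue, using the Icehouse-era API, is below; the topic, server and method names are illustrative, not taken from this report.)

# Minimal sketch of an oslo.messaging RPC call; the topic/server/method
# names here are hypothetical.
from oslo.config import cfg
from oslo import messaging

transport = messaging.get_transport(cfg.CONF)    # reads the rabbit_* options
target = messaging.Target(topic='compute', server='compute-01')
client = messaging.RPCClient(transport, target)

# call() blocks until the response comes back on this process's
# reply_<uuid> queue -- the kind of queue seen filling up in this bug.
result = client.call({}, 'ping')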

Revision history for this message
Sam Morrison (sorrison) wrote :

I should also mention this is on Ubuntu 14.04

Revision history for this message
Ken Giusti (kgiusti) wrote :

Hi Sam - what messaging broker are you using when you hit this issue?

Revision history for this message
Sam Morrison (sorrison) wrote :

Sorry, I should've mentioned that: we're using RabbitMQ. We've seen it against the following versions: 3.3.5 (latest), 3.3.4, 3.1.5 and 3.1.3.

Revision history for this message
Sam Morrison (sorrison) wrote :

This is quite a big issue for us; it's happening 3-4 times a day.

Revision history for this message
Ken Giusti (kgiusti) wrote :

Ah, thanks - there was a qpid-specific issue (since fixed) that sounded similar to what you're hitting:

https://bugs.launchpad.net/oslo.messaging/+bug/1318742

That issue doesn't apply to RabbitMQ, though; just checking to be sure.

Revision history for this message
Tracy Jones (tjones-i) wrote :

can you attach logs please?

Changed in nova:
importance: Undecided → High
Revision history for this message
Matt Riedemann (mriedem) wrote :

What's your rpc configuration in nova.conf?

Tracy Jones (tjones-i)
Changed in nova:
status: New → Incomplete
Revision history for this message
Sam Morrison (sorrison) wrote :

Configs in nova and cinder:
rabbit_hosts="mq:5671,mq1:5671,mq2:5671,mq3:5671"
rabbit_ha_queues=True
rabbit_userid=nova
rabbit_password=XXXXXXX
rabbit_durable_queues=True
rabbit_use_ssl=True

Some related package versions:

apt-cache policy librabbitmq1
librabbitmq1:
  Installed: 0.4.1-1
  Candidate: 0.4.1-1

apt-cache policy python-amqp
python-amqp:
  Installed: 1.3.3-1ubuntu1
  Candidate: 1.3.3-1ubuntu1

apt-cache policy python-amqplib
python-amqplib:
  Installed: 1.0.2-1
  Candidate: 1.0.2-1

apt-cache policy python-kombu
python-kombu:
  Installed: 3.0.7-1ubuntu1
  Candidate: 3.0.7-1ubuntu1

The reply_xxxx queue in RabbitMQ that has unacked messages does have a consumer. I traced that back to the offending API host using the port number and saw that it's a nova-api process with an established connection to rabbit. I've then been using strace on the process to see what it's doing. All it shows is:

gettimeofday({1409785170, 966963}, NULL) = 0
gettimeofday({1409785170, 967066}, NULL) = 0
epoll_wait(6, {{EPOLLIN, {u32=7, u64=39432335262744583}}}, 1023, 482100) = 1
epoll_ctl(6, EPOLL_CTL_DEL, 7, {EPOLLWRNORM|EPOLLERR|EPOLLONESHOT|0xe0a1800, {u32=32681, u64=22396489217114025}}) = 0
accept(7, 0x7fffdf911350, [16]) = -1 EAGAIN (Resource temporarily unavailable)
epoll_ctl(6, EPOLL_CTL_ADD, 7, {EPOLLIN|EPOLLPRI|EPOLLERR|EPOLLHUP, {u32=7, u64=39432335262744583}}) = 0
gettimeofday({1409785188, 431971}, NULL) = 0
gettimeofday({1409785188, 432035}, NULL) = 0
epoll_wait(6, {{EPOLLIN, {u32=7, u64=39432335262744583}}}, 1023, 464636) = 1
epoll_ctl(6, EPOLL_CTL_DEL, 7, {EPOLLWRNORM|EPOLLERR|EPOLLONESHOT|0xe0a1800, {u32=32681, u64=22396489217114025}}) = 0
accept(7, 0x7fffdf911350, [16]) = -1 EAGAIN (Resource temporarily unavailable)
epoll_ctl(6, EPOLL_CTL_ADD, 7, {EPOLLIN|EPOLLPRI|EPOLLERR|EPOLLHUP, {u32=7, u64=39432335262744583}}) = 0

The API host isn't under any load, and neither are the rabbit and DB hosts.

This is happening for us multiple times a day, so it would be easy for me to get more information. I just don't know how to debug it further.
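
(For reference, the port-to-process step described above, i.e. mapping the RabbitMQ consumer's peer port back to a local process before attaching strace, can be scripted. A rough sketch follows; psutil and the port number are assumptions for illustration, not tools or values from this report.)

# Rough sketch: find the local process that owns the client side of a
# connection, given the port RabbitMQ shows for the consumer.
import psutil

def find_process_by_local_port(port):
    for conn in psutil.net_connections(kind='tcp'):
        if conn.laddr and conn.laddr[1] == port and conn.pid:
            return psutil.Process(conn.pid)
    return None

proc = find_process_by_local_port(54321)   # hypothetical consumer port
if proc:
    print('%d %s %s' % (proc.pid, proc.name(), ' '.join(proc.cmdline())))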

Revision history for this message
Sam Morrison (sorrison) wrote :

We are getting tracebacks like the one below in the nova-api logs for a few things, so this could be related. We aren't getting anything in the cinder-api logs though, and we are seeing the exact same issue there (just less frequently, as that API is used less).

2014-09-05 08:36:09.025 17883 TRACE nova.api.openstack Traceback (most recent call last):
2014-09-05 08:36:09.025 17883 TRACE nova.api.openstack File "/usr/lib/python2.7/dist-packages/nova/api/openstack/__init__.py", line 125, in __call__
2014-09-05 08:36:09.025 17883 TRACE nova.api.openstack return req.get_response(self.application)
2014-09-05 08:36:09.025 17883 TRACE nova.api.openstack File "/usr/lib/python2.7/dist-packages/webob/request.py", line 1320, in send
2014-09-05 08:36:09.025 17883 TRACE nova.api.openstack application, catch_exc_info=False)
2014-09-05 08:36:09.025 17883 TRACE nova.api.openstack File "/usr/lib/python2.7/dist-packages/webob/request.py", line 1284, in call_application
2014-09-05 08:36:09.025 17883 TRACE nova.api.openstack app_iter = application(self.environ, start_response)
2014-09-05 08:36:09.025 17883 TRACE nova.api.openstack File "/usr/lib/python2.7/dist-packages/webob/dec.py", line 144, in __call__
2014-09-05 08:36:09.025 17883 TRACE nova.api.openstack return resp(environ, start_response)
2014-09-05 08:36:09.025 17883 TRACE nova.api.openstack File "/usr/lib/python2.7/dist-packages/keystoneclient/middleware/auth_token.py", line 582, in __call__
2014-09-05 08:36:09.025 17883 TRACE nova.api.openstack return self.app(env, start_response)
2014-09-05 08:36:09.025 17883 TRACE nova.api.openstack File "/usr/lib/python2.7/dist-packages/webob/dec.py", line 144, in __call__
2014-09-05 08:36:09.025 17883 TRACE nova.api.openstack return resp(environ, start_response)
2014-09-05 08:36:09.025 17883 TRACE nova.api.openstack File "/usr/lib/python2.7/dist-packages/webob/dec.py", line 144, in __call__
2014-09-05 08:36:09.025 17883 TRACE nova.api.openstack return resp(environ, start_response)
2014-09-05 08:36:09.025 17883 TRACE nova.api.openstack File "/usr/lib/python2.7/dist-packages/routes/middleware.py", line 131, in __call__
2014-09-05 08:36:09.025 17883 TRACE nova.api.openstack response = self.app(environ, start_response)
2014-09-05 08:36:09.025 17883 TRACE nova.api.openstack File "/usr/lib/python2.7/dist-packages/webob/dec.py", line 144, in __call__
2014-09-05 08:36:09.025 17883 TRACE nova.api.openstack return resp(environ, start_response)
2014-09-05 08:36:09.025 17883 TRACE nova.api.openstack File "/usr/lib/python2.7/dist-packages/webob/dec.py", line 130, in __call__
2014-09-05 08:36:09.025 17883 TRACE nova.api.openstack resp = self.call_func(req, *args, **self.kwargs)
2014-09-05 08:36:09.025 17883 TRACE nova.api.openstack File "/usr/lib/python2.7/dist-packages/webob/dec.py", line 195, in call_func
2014-09-05 08:36:09.025 17883 TRACE nova.api.openstack return self.func(req, *args, **kwargs)
2014-09-05 08:36:09.025 17883 TRACE nova.api.openstack File "/usr/lib/python2.7/dist-packages/nova/api/openstack/wsgi.py", line 917, in __call__
2014-09-05 08:36:09.025 17883 TRACE nova.api.openstack content_type, body, accept)
2014-09-05 08...

Revision history for this message
Sam Morrison (sorrison) wrote :

I've provided more info, so moving this back to New.

Changed in nova:
status: Incomplete → New
Revision history for this message
Sam Morrison (sorrison) wrote :

Actually we do see this in cinder-api logs too:

2014-09-04 17:17:54.149 1542 TRACE cinder.api.middleware.fault Traceback (most recent call last):
2014-09-04 17:17:54.149 1542 TRACE cinder.api.middleware.fault File "/usr/lib/python2.7/dist-packages/cinder/api/middleware/fault.py", line 75, in __call__
2014-09-04 17:17:54.149 1542 TRACE cinder.api.middleware.fault return req.get_response(self.application)
2014-09-04 17:17:54.149 1542 TRACE cinder.api.middleware.fault File "/usr/lib/python2.7/dist-packages/webob/request.py", line 1320, in send
2014-09-04 17:17:54.149 1542 TRACE cinder.api.middleware.fault application, catch_exc_info=False)
2014-09-04 17:17:54.149 1542 TRACE cinder.api.middleware.fault File "/usr/lib/python2.7/dist-packages/webob/request.py", line 1284, in call_application
2014-09-04 17:17:54.149 1542 TRACE cinder.api.middleware.fault app_iter = application(self.environ, start_response)
2014-09-04 17:17:54.149 1542 TRACE cinder.api.middleware.fault File "/usr/lib/python2.7/dist-packages/webob/dec.py", line 144, in __call__
2014-09-04 17:17:54.149 1542 TRACE cinder.api.middleware.fault return resp(environ, start_response)
2014-09-04 17:17:54.149 1542 TRACE cinder.api.middleware.fault File "/usr/lib/python2.7/dist-packages/keystoneclient/middleware/auth_token.py", line 582, in __call__
2014-09-04 17:17:54.149 1542 TRACE cinder.api.middleware.fault return self.app(env, start_response)
2014-09-04 17:17:54.149 1542 TRACE cinder.api.middleware.fault File "/usr/lib/python2.7/dist-packages/webob/dec.py", line 144, in __call__
2014-09-04 17:17:54.149 1542 TRACE cinder.api.middleware.fault return resp(environ, start_response)
2014-09-04 17:17:54.149 1542 TRACE cinder.api.middleware.fault File "/usr/lib/python2.7/dist-packages/webob/dec.py", line 144, in __call__
2014-09-04 17:17:54.149 1542 TRACE cinder.api.middleware.fault return resp(environ, start_response)
2014-09-04 17:17:54.149 1542 TRACE cinder.api.middleware.fault File "/usr/lib/python2.7/dist-packages/routes/middleware.py", line 131, in __call__
2014-09-04 17:17:54.149 1542 TRACE cinder.api.middleware.fault response = self.app(environ, start_response)
2014-09-04 17:17:54.149 1542 TRACE cinder.api.middleware.fault File "/usr/lib/python2.7/dist-packages/webob/dec.py", line 144, in __call__
2014-09-04 17:17:54.149 1542 TRACE cinder.api.middleware.fault return resp(environ, start_response)
2014-09-04 17:17:54.149 1542 TRACE cinder.api.middleware.fault File "/usr/lib/python2.7/dist-packages/webob/dec.py", line 130, in __call__
2014-09-04 17:17:54.149 1542 TRACE cinder.api.middleware.fault resp = self.call_func(req, *args, **self.kwargs)
2014-09-04 17:17:54.149 1542 TRACE cinder.api.middleware.fault File "/usr/lib/python2.7/dist-packages/webob/dec.py", line 195, in call_func
2014-09-04 17:17:54.149 1542 TRACE cinder.api.middleware.fault return self.func(req, *args, **kwargs)
2014-09-04 17:17:54.149 1542 TRACE cinder.api.middleware.fault File "/usr/lib/python2.7/dist-packages/cinder/api/openstack/wsgi.py", line 895, in __call__
2014-09-04 17:17:54.149 1542 TRACE cinder.api.middleware.fault content_type, body, accept...


Revision history for this message
Sam Morrison (sorrison) wrote :

Some more observations: in our test cloud we have an environment with one controller (Icehouse/Trusty) and four compute nodes.

Two of the computes are Icehouse/Trusty and two are Havana/Precise.

If I turn off the controller, reboot it, etc., the two Icehouse computes get into a mess: messages start piling up unacked in reply queues and I need to restart nova-compute, nova-network and nova-api-metadata.

The Havana/Precise computes are fine and require no restarts.

Tom Fifield (fifieldt)
tags: added: ops
Sean Dague (sdague)
Changed in nova:
status: New → Confirmed
Changed in oslo.messaging:
importance: Undecided → High
Revision history for this message
Sam Morrison (sorrison) wrote :

We see things like this in the logs:

2014-10-22 11:18:18.358 30300 WARNING oslo.messaging._drivers.amqpdriver [-] _queues: {'40282b6608f245dc9903a6d31cbf09dd': <Queue at 0x7f5ebab97390 maxsize=None>}
2014-10-22 11:18:18.359 30300 WARNING oslo.messaging._drivers.amqpdriver [-] No calling threads waiting for msg_id : 2fdbf0667a1a49d4af6a022549f6737d, message : {u'_unique_id': u'cea90764a9714fe8ade5a5ff2dbd2a17', u'failure': None, u'result': None, u'ending': True}
2014-10-22 11:18:18.360 30300 WARNING oslo.messaging._drivers.amqpdriver [-] _queues: {'40282b6608f245dc9903a6d31cbf09dd': <Queue at 0x7f5ebab97390 maxsize=None>}
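
(Those warnings come from the driver's reply-waiter bookkeeping: each outstanding call registers its msg_id in a dict of queues, and a reply whose msg_id is no longer registered has no waiter to hand it to. A much simplified sketch of that pattern follows; it is illustrative, not the actual oslo.messaging code.)

# Simplified sketch of the msg_id -> queue bookkeeping behind the warning above.
try:
    import queue             # Python 3
except ImportError:
    import Queue as queue    # Python 2, as used at the time of this report

class ReplyWaiterSketch(object):

    def __init__(self):
        self._queues = {}

    def listen(self, msg_id):
        self._queues[msg_id] = queue.Queue()

    def forget(self, msg_id):
        self._queues.pop(msg_id, None)

    def on_reply(self, msg_id, message):
        q = self._queues.get(msg_id)
        if q is None:
            # The "No calling threads waiting for msg_id" case: the caller has
            # timed out or gone away, so the reply has nowhere to go and, in
            # the situation described in this bug, is left unacked.
            print('No calling threads waiting for msg_id : %s' % msg_id)
            return
        q.put(message)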

Mehdi Abaakouk (sileht)
Changed in oslo.messaging:
assignee: nobody → Mehdi Abaakouk (sileht)
Mehdi Abaakouk (sileht)
Changed in oslo.messaging:
status: New → Confirmed
Revision history for this message
Mehdi Abaakouk (sileht) wrote :

Hi,

I guess the issue was due to the fact that oslo.messaging previously used eventlet.kill instead of eventlet.cancel to stop the server thread, so this should be fixed now that this commit has been merged:

https://github.com/openstack/oslo.messaging/commit/7bce31a2d14de29f1fc306938c7085f5205c9110#diff-9e692a136baf4fc701a9868fa789b442R102

Please reopen the bug if this still occurs with oslo.messaging >= 1.5.0.
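
(A toy illustration, not the oslo.messaging code, of the point above: if a green thread is killed abruptly between receiving a message and acknowledging it, the ack never happens and the message stays unacked.)

# Toy sketch: the consumer thread is killed mid-processing, so the ack it
# would have sent never happens. Purely illustrative.
import eventlet

msg = {'id': 1, 'acked': False}

def consume(m):
    # message has been delivered to this consumer but not yet acked
    eventlet.sleep(0.5)    # simulate processing / handing off to a waiter
    m['acked'] = True      # the ack would only happen here

gt = eventlet.spawn(consume, msg)
eventlet.sleep(0.1)        # the consumer is now mid-processing
gt.kill()                  # abrupt stop, comparable to the old shutdown path
print(msg)                 # {'id': 1, 'acked': False}: left unacked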

Mehdi Abaakouk (sileht)
Changed in oslo.messaging:
status: Confirmed → Fix Released
Changed in nova:
status: Confirmed → Invalid
Revision history for this message
Jacob Godin (jacobgodin) wrote :

I'm still seeing exactly what comment #11 shows with cinder-api. Running oslo.messaging 1.6.0

Revision history for this message
Jacob Godin (jacobgodin) wrote :

Steps to reproduce:

- Create Volume
- Attach to instance
- Kill all rabbitmq connections (i.e. restart the service)
- Try to detach volume from instance

Log:

...
2015-04-09 16:15:22.574 27 ERROR cinder.api.middleware.fault [req-f423763d-ec3d-4f22-b079-e6d395b90c75 886bb3d50acc4a0c8f9d91d6ca97c298 831491c8ae8042e9b1099c7e05cbbc80 - - -] Caught error: Timed out waiting for a reply to message ID f3170587210c4df0affee0b2f9534e80.
2015-04-09 16:15:22.574 27 TRACE cinder.api.middleware.fault Traceback (most recent call last):
2015-04-09 16:15:22.574 27 TRACE cinder.api.middleware.fault File "/usr/local/lib/python2.7/dist-packages/cinder/api/middleware/fault.py", line 76, in __call__
2015-04-09 16:15:22.574 27 TRACE cinder.api.middleware.fault return req.get_response(self.application)
2015-04-09 16:15:22.574 27 TRACE cinder.api.middleware.fault File "/usr/local/lib/python2.7/dist-packages/webob/request.py", line 1320, in send
2015-04-09 16:15:22.574 27 TRACE cinder.api.middleware.fault application, catch_exc_info=False)
2015-04-09 16:15:22.574 27 TRACE cinder.api.middleware.fault File "/usr/local/lib/python2.7/dist-packages/webob/request.py", line 1284, in call_application
2015-04-09 16:15:22.574 27 TRACE cinder.api.middleware.fault app_iter = application(self.environ, start_response)
2015-04-09 16:15:22.574 27 TRACE cinder.api.middleware.fault File "/usr/local/lib/python2.7/dist-packages/webob/dec.py", line 144, in __call__
2015-04-09 16:15:22.574 27 TRACE cinder.api.middleware.fault return resp(environ, start_response)
2015-04-09 16:15:22.574 27 TRACE cinder.api.middleware.fault File "/usr/local/lib/python2.7/dist-packages/keystonemiddleware/auth_token.py", line 823, in __call__
2015-04-09 16:15:22.574 27 TRACE cinder.api.middleware.fault return self._call_app(env, start_response)
2015-04-09 16:15:22.574 27 TRACE cinder.api.middleware.fault File "/usr/local/lib/python2.7/dist-packages/keystonemiddleware/auth_token.py", line 758, in _call_app
2015-04-09 16:15:22.574 27 TRACE cinder.api.middleware.fault return self._app(env, _fake_start_response)
2015-04-09 16:15:22.574 27 TRACE cinder.api.middleware.fault File "/usr/local/lib/python2.7/dist-packages/webob/dec.py", line 144, in __call__
2015-04-09 16:15:22.574 27 TRACE cinder.api.middleware.fault return resp(environ, start_response)
2015-04-09 16:15:22.574 27 TRACE cinder.api.middleware.fault File "/usr/local/lib/python2.7/dist-packages/webob/dec.py", line 144, in __call__
2015-04-09 16:15:22.574 27 TRACE cinder.api.middleware.fault return resp(environ, start_response)
2015-04-09 16:15:22.574 27 TRACE cinder.api.middleware.fault File "/usr/local/lib/python2.7/dist-packages/routes/middleware.py", line 136, in __call__
2015-04-09 16:15:22.574 27 TRACE cinder.api.middleware.fault response = self.app(environ, start_response)
2015-04-09 16:15:22.574 27 TRACE cinder.api.middleware.fault File "/usr/local/lib/python2.7/dist-packages/webob/dec.py", line 144, in __call__
2015-04-09 16:15:22.574 27 TRACE cinder.api.middleware.fault return resp(environ, start_response)
2015-04-09 16:15:22.574 27 TRACE cinder.api.middleware.fault File "/usr/local/lib/pyt...

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to oslo.messaging (stable/icehouse)

Fix proposed to branch: stable/icehouse
Review: https://review.openstack.org/194296

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on oslo.messaging (stable/icehouse)

Change abandoned by Jorge Niedbalski R. (<email address hidden>) on branch: stable/icehouse
Review: https://review.openstack.org/194296
Reason: Not suitable for security update

Changed in oslo.messaging (Ubuntu Trusty):
assignee: nobody → Jorge Niedbalski (niedbalski)
status: New → In Progress
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to oslo.messaging (stable/icehouse)

Fix proposed to branch: stable/icehouse
Review: https://review.openstack.org/194460

Revision history for this message
Martin Pitt (pitti) wrote :

This patch is applied in wily's python-oslo.messaging source package. Jorge, please always check this for the current development series, as it's not at all obvious what the status of that is.

Changed in oslo.messaging (Ubuntu):
status: New → Invalid
Revision history for this message
Martin Pitt (pitti) wrote :

I uploaded Jorge's trusty patch to the trusty SRU review queue. Jorge, please add the necessary SRU information to this bug (https://wiki.ubuntu.com/StableReleaseUpdates#SRU_Bug_Template). Thanks!

Revision history for this message
Martin Pitt (pitti) wrote :

Sorry, I had to reject the upload. There is already a different oslo.messaging SRU waiting in the queue for bug 1448650. Can you please prepare a new debdiff relative to that? Please also add an (LP: #xxx) bug ref to your changelog. Thanks!

Please re-subscribe sponsors when this is done.

Revision history for this message
Jorge Niedbalski (niedbalski) wrote :

Martin @pitti,

I have attached the trusty patch on top of the current -proposed version.

Thanks for sponsoring this.

Revision history for this message
Chris J Arges (arges) wrote :

Sponsored for Trusty

Changed in oslo.messaging (Ubuntu):
status: Invalid → Fix Released
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on oslo.messaging (stable/icehouse)

Change abandoned by Jeremy Stanley (<email address hidden>) on branch: stable/icehouse
Review: https://review.openstack.org/194460
Reason: This branch has reached end of life and is being deleted.

Revision history for this message
Martin Pitt (pitti) wrote :

@Chris: Apparently you didn't? There is no new oslo.messaging in https://launchpad.net/ubuntu/trusty/+queue?queue_state=1&queue_text=oslo

This is blocked anyway until https://launchpad.net/ubuntu/+source/oslo.messaging/1.8.3-0ubuntu0.15.04.1 moves out of -proposed.

Changed in oslo.messaging (Ubuntu Trusty):
importance: Undecided → High
Revision history for this message
Jorge Niedbalski (niedbalski) wrote :
Revision history for this message
Chris J Arges (arges) wrote :

This time really sponsored for Trusty.

tags: added: sts
Revision history for this message
Chris J Arges (arges) wrote : Please test proposed package

Hello Sam, or anyone else affected,

Accepted oslo.messaging into trusty-proposed. The package will build now and be available at https://launchpad.net/ubuntu/+source/oslo.messaging/1.3.0-0ubuntu1.3 in a few hours, and then in the -proposed repository.

Please help us by testing this new package. See https://wiki.ubuntu.com/Testing/EnableProposed for documentation how to enable and use -proposed. Your feedback will aid us getting this update out to other Ubuntu users.

If this package fixes the bug for you, please add a comment to this bug, mentioning the version of the package you tested, and change the tag from verification-needed to verification-done. If it does not fix the bug for you, please add a comment stating that, and change the tag to verification-failed. In either case, details of your testing will help us make a better decision.

Further information regarding the verification process can be found at https://wiki.ubuntu.com/QATeam/PerformingSRUVerification . Thank you in advance!

Changed in oslo.messaging (Ubuntu Trusty):
status: In Progress → Fix Committed
tags: added: verification-needed
Revision history for this message
James Page (james-page) wrote :

This proposed package causes problems in the nova unit test suite; namely, the tests get a few in and then lock up, and the builds time out.

tags: added: verification-failed
removed: verification-needed
Revision history for this message
James Page (james-page) wrote :

Marked 'verification-failed'.

Revision history for this message
Chris J Arges (arges) wrote : Proposed package removed from archive

The version of oslo.messaging in the proposed pocket of Trusty that was purported to fix this bug report has been removed because the bugs that were to be fixed by the upload were not verified in a timely (105 days) fashion.

Changed in oslo.messaging (Ubuntu Trusty):
status: Fix Committed → Won't Fix
Revision history for this message
Edward Hope-Morley (hopem) wrote :

If you are hitting this issue and you are using RabbitMQ SSL with the latest Trusty/Icehouse oslo.messaging, then you will be hitting the bug from:

https://bugs.launchpad.net/ubuntu/+source/python-amqp/+bug/1472712

There is a fix for that in trusty-proposed, and it will shortly land in trusty-updates. I suggest you upgrade to that version of python-amqp to fix your problem.

no longer affects: cinder