OpenStack services can't find rabbitmq queues

Bug #1472593 reported by Leontii Istomin
This bug affects 1 person
Affects             Status        Importance  Assigned to      Milestone
Mirantis OpenStack  Fix Released  High        MOS Oslo
  5.1.x             Invalid       High        MOS Maintenance
  6.0.x             Invalid       High        MOS Maintenance
  6.1.x             Invalid       High        MOS Maintenance
  7.0.x             Fix Released  High        MOS Oslo

Bug Description

During the create_and_attach_volume Rally scenario, Nova could not create a snapshot (500 error):

from rally.log: http://paste.openstack.org/show/354825/

from haproxy:
<134>Jul 7 17:33:26 node-13 haproxy[16037]: 172.16.44.13:55788 [07/Jul/2015:17:32:26.473] nova-api-2 nova-api-2/node-13 0/0/0/60251/60252 500 354 - - ---- 9/9/9/3/0 0/0 "POST /v2/387a7feb33974707be92c177434c1d8e/servers/1989dd2c-a938-4305-839b-4d23ea2e5911/os-volume_attachments HTTP/1.1"

from nova-all.log on node-13: http://paste.openstack.org/show/354984/
The same errors on node-8 at the time: http://paste.openstack.org/show/355146/

from rabbitmq.log on node-8: http://paste.openstack.org/show/354985/
from rabbitmq.log on node-13: http://paste.openstack.org/show/354990/
from rabbitmq.log on node-15 (no messages at the time): http://paste.openstack.org/show/354991/

Similar from cinder-all on node-13: http://paste.openstack.org/show/355148/
Similar from cinder-all on node-8: http://paste.openstack.org/show/355149/
Similar from cinder-all on node-15: http://paste.openstack.org/show/355150/

pacemaker on node-8:
Jul 07 17:31:27 [26773] node-8.domain.tld pengine: info: master_color: Promoting p_rabbitmq-server:2 (Master node-13.domain.tld)
Jul 07 17:31:27 [26773] node-8.domain.tld pengine: info: master_color: master_p_rabbitmq-server: Promoted 1 instances of a possible 1 to master
...
Jul 07 17:32:05 [26773] node-8.domain.tld pengine: info: master_color: Promoting p_rabbitmq-server:2 (Master node-13.domain.tld)
Jul 07 17:32:05 [26773] node-8.domain.tld pengine: info: master_color: master_p_rabbitmq-server: Promoted 1 instances of a possible 1 to master

changing rabbit master:
Jul 07 11:47:21 [26773] node-8.domain.tld pengine: info: master_color: Promoting p_rabbitmq-server:0 (Master node-8.domain.tld)
Jul 07 11:47:21 [26773] node-8.domain.tld pengine: info: master_color: master_p_rabbitmq-server: Promoted 1 instances of a possible 1 to master
Jul 07 11:48:27 [26773] node-8.domain.tld pengine: info: master_color: Promoting p_rabbitmq-server:2 (Slave node-13.domain.tld)

atop on node-8 at Jul 07 17:33: http://paste.openstack.org/show/355151/
atop on node-13 at Jul 07 17:33: http://paste.openstack.org/show/355152/
atop on node-15 at Jul 07 17:33: http://paste.openstack.org/show/355153/

Diagnostic Snapshot is here: http://mos-scale-share.mirantis.com/fuel-snapshot-2015-07-08_11-04-15.tar.xz

Tags: scale
Revision history for this message
Leontii Istomin (listomin) wrote :

On each controller node we collected RabbitMQ statistics during the tests with http://paste.openstack.org/show/355156/
This information and the logs are attached.

Revision history for this message
Davanum Srinivas (DIMS) (dims-v) wrote :

Leontii,

Looking at ./fuel-snapshot-2015-07-08_11-04-15/10.20.0.2/var/log/docker-logs/remote/node-13.domain.tld/nova-api.log.1, I see the following.

2015-07-07T17:33:24.317514+00:00 info: 240.0.0.2 "GET /v2/c1c9ead90c0a4d9f948385bdde16f1ac/servers/0dd6c0f1-31bc-4b78-a845-e8652389f871 HTTP/1.1" status: 404 len: 286 time: 0.0425940
2015-07-07T17:33:24.710213+00:00 warning: Queue not found:Basic.consume: (404) NOT_FOUND - no queue 'reply_439e63e6bce44c1b9aaffdbe8f81f924' in vhost '/'. Waiting.
2015-07-07T17:33:25.714896+00:00 warning: Queue not found:Basic.consume: (404) NOT_FOUND - no queue 'reply_439e63e6bce44c1b9aaffdbe8f81f924' in vhost '/'. Waiting.
2015-07-07T17:33:26.722997+00:00 err: Caught error: Timed out waiting for a reply to message ID c694462f97b541cfa18972573897ae04.
2015-07-07 17:33:26.714 14808 TRACE nova.api.openstack Traceback (most recent call last):
2015-07-07 17:33:26.714 14808 TRACE nova.api.openstack File "/usr/lib/python2.6/site-packages/nova/api/openstack/__init__.py", line 124, in __call__
2015-07-07 17:33:26.714 14808 TRACE nova.api.openstack return req.get_response(self.application)
2015-07-07 17:33:26.714 14808 TRACE nova.api.openstack File "/usr/lib/python2.6/site-packages/webob/request.py", line 1296, in send
2015-07-07 17:33:26.714 14808 TRACE nova.api.openstack application, catch_exc_info=False)
2015-07-07 17:33:26.714 14808 TRACE nova.api.openstack File "/usr/lib/python2.6/site-packages/webob/request.py", line 1260, in call_application
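
For context, the warning above is what the AMQP client sees when it tries to consume from a reply queue that no longer exists on the broker: the broker closes the channel with a 404, the driver logs "Waiting." and retries, and eventually the RPC call gives up with the timeout. A minimal sketch of that failure mode, using pika directly against a local broker (with the queue name copied from the log purely for illustration, not the actual oslo.messaging code path):

import pika

conn = pika.BlockingConnection(pika.ConnectionParameters('localhost'))
ch = conn.channel()
try:
    # Consuming from a queue that was never declared (or has been deleted)
    # makes the broker close the channel with 404 NOT_FOUND.
    ch.basic_consume(queue='reply_439e63e6bce44c1b9aaffdbe8f81f924',
                     on_message_callback=lambda *args: None)
except pika.exceptions.ChannelClosedByBroker as exc:
    # Prints: 404 NOT_FOUND - no queue 'reply_...' in vhost '/'
    print(exc.reply_code, exc.reply_text)
finally:
    conn.close()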

Revision history for this message
Davanum Srinivas (DIMS) (dims-v) wrote :

This means the bug is a duplicate of https://bugs.launchpad.net/mos/+bug/1463802, and I don't see the fix from https://review.fuel-infra.org/#/c/7925/ applied in this environment.

Thanks,
dims

Revision history for this message
Dina Belova (dbelova) wrote :

One point worth noting here: no RabbitMQ failures were observed in the case of this bug. Alexey's fix from https://bugs.launchpad.net/mos/+bug/1463802 will certainly help to recreate the queues, but the messages will still be lost. And it is completely unclear *WHY* the queues are lost when there are no RabbitMQ failures.
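
To spell out the "messages will still be lost" part, here is a small sketch with pika against a local broker (queue name hypothetical, and the exact exchange wiring in oslo.messaging differs, but the AMQP behavior is the same): a message published with a routing key that no longer matches any queue is silently dropped, so recreating the queue afterwards cannot bring the reply back:

import pika

conn = pika.BlockingConnection(pika.ConnectionParameters('localhost'))
ch = conn.channel()

# The reply queue has already been deleted at this point, so the broker
# has nowhere to route this message and silently drops it (no mandatory
# flag is set on the publish).
ch.basic_publish(exchange='', routing_key='reply_deadbeef', body=b'the reply')

# Recreating the queue afterwards (which is what the re-declare fix does)
# gives you an empty queue - the reply published above is already gone.
ch.queue_declare(queue='reply_deadbeef', auto_delete=True)
method, properties, body = ch.basic_get(queue='reply_deadbeef', auto_ack=True)
print(method)  # None - nothing to fetch
conn.close()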

Revision history for this message
Sergey Yudin (tsipa740) wrote :

https://review.fuel-infra.org/#/c/10336/ - this fix can probably solve the issue for 6.1.

For 7.0 the fix has already been merged into oslo.messaging.

Revision history for this message
Sergey Yudin (tsipa740) wrote :

I've read the code and the trace more carefully and have to agree that https://bugs.launchpad.net/mos/+bug/1463802 and https://review.fuel-infra.org/#/c/7925/9 are more closely related to this bug.

Revision history for this message
Dmitry Mescheryakov (dmitrymex) wrote :

The issue has been fixed by the following CR - https://review.fuel-infra.org/#/c/10420/

Dina, regarding your question of how the queue can end up deleted: the commit message of this CR explains pretty well how using auto-delete queues causes such issues - https://review.openstack.org/#/c/103157/1

And we do use auto-delete queues (check the output of "rabbitmqctl list_queues auto_delete name").
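
For completeness, the auto-delete semantics that make this possible, as a small sketch with pika against a local broker (queue name hypothetical): the broker removes an auto-delete queue as soon as its last consumer goes away, e.g. on a client reconnect, and anyone who then addresses the old queue name gets the 404 seen in nova-api.log:

import pika

params = pika.ConnectionParameters('localhost')

# Connection A plays the RPC client that owns the reply queue.
client = pika.BlockingConnection(params)
ch_a = client.channel()
ch_a.queue_declare(queue='reply_example', auto_delete=True)
ch_a.basic_consume(queue='reply_example', on_message_callback=lambda *args: None)

# The consumer disappears (connection drop, reconnect after a network
# hiccup, ...), so the broker deletes the auto-delete queue right away.
client.close()

# Anyone who now looks for the old queue name gets 404 NOT_FOUND.
other = pika.BlockingConnection(params)
ch_b = other.channel()
try:
    ch_b.queue_declare(queue='reply_example', passive=True)  # existence check only
except pika.exceptions.ChannelClosedByBroker as exc:
    print(exc.reply_code, exc.reply_text)  # 404 NOT_FOUND - no queue 'reply_example' ...
other.close()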

Revision history for this message
Alexey Khivin (akhivin) wrote :

Fixed earlier for the old branches (see the cherry-pick list for https://review.fuel-infra.org/#/c/10420/).
