a lot of messages in scheduler rabbitmq queue during create-and-delete-volume rally scenario

Bug #1497961 reported by Leontii Istomin
Affects              Status        Importance  Assigned to
Mirantis OpenStack   Fix Released  High        Alex Schultz
7.0.x                Fix Released  High        Denis Meltsaykin
8.0.x                Fix Released  High        Alex Schultz

Bug Description

Steps to reproduce:

1. Install MOS environment with 3 controllers
2. Execute
     service nova-scheduler restart
     service cinder-scheduler restart
   on one of the controllers.
3. Execute 'pcs resource' on the same controller and in its output find a slave instance of RabbitMQ.
4. Log into the controller where that slave instance runs and kill RabbitMQ there with the 'kill' command.
5. Once Pacemaker restores the killed RabbitMQ, check the output of
     rabbitmqctl list_queues messages consumers name | grep scheduler_fanout
   You will find one 'cinder-scheduler_fanout_*' and one 'scheduler_fanout_*' queue from which nobody consumes.

The state of the queues can be seen in this paste: http://paste.openstack.org/show/476830/

===== RCA

The issue is caused by CR https://review.openstack.org/#/c/132967/21, in which Fuel dumps the state of RabbitMQ users, queues, etc. When a RabbitMQ slave node is restarted, the OCF script imports these definitions back into RabbitMQ. As a result, *_fanout_* queues that were previously dropped get recreated, but nobody consumes from them, so they grow without bound.
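
The eventual fix filters such queues out of the dump; as an illustration only, here is a minimal sketch of the same idea (assuming the definitions file is the JSON document produced by the management plugin's export, and that jq is available on the node):

     cp /etc/rabbitmq/definitions /etc/rabbitmq/definitions.backup
     # Drop the auto-generated fanout and reply queues from the dump so that
     # the OCF script does not recreate them on the next import.
     jq '.queues |= map(select(.name | test("_fanout_|^reply_") | not))' \
         /etc/rabbitmq/definitions.backup > /etc/rabbitmq/definitions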

===== Initial description from Scale team

During the create-and-delete-volume rally scenario we found that cinder can't fetch messages from the cinder-scheduler_fanout queues in time.
root@node-100:~# rabbitmqctl list_queues | grep -v 0$
Listing queues ...
cinder-scheduler_fanout_036c9e58e6d1400c9fca60ae9ee89088 28122
cinder-scheduler_fanout_69e02aa5e9b94e98b8fb712b3444e0c5 28122
cinder-scheduler_fanout_a30261e9ba58486bb977eac8a14ccd99 28122
reply_ba0c220da0ea44f6b1d59c04affaa945 1
scheduler_fanout_0cebacfc84894e569b8260d8ccfcf367 250015
scheduler_fanout_65388589e08b40a589bba14bfdf61759 250016
scheduler_fanout_7b0cdd5420bc44bb86c8df27097c7469 250016

Cluster configuration:
Baremetal,Ubuntu,IBP,HA,Neutron-vxlan,Ceph-all,Nova-debug,Nova-quotas,7.0-297
Controllers:3 Computes:178 Computes+Ceph:20 LMA:2

api: '1.0'
astute_sha: 6c5b73f93e24cc781c809db9159927655ced5012
auth_required: true
build_id: '298'
build_number: '298'
feature_groups:
- mirantis
fuel-agent_sha: 082a47bf014002e515001be05f99040437281a2d
fuel-library_sha: 0623b4daad438ceeb5dc41b10cdd3011795fff7e
fuel-nailgun-agent_sha: d7027952870a35db8dc52f185bb1158cdd3d1ebd
fuel-ostf_sha: 1f08e6e71021179b9881a824d9c999957fcc7045
fuelmain_sha: 6b83d6a6a75bf7bca3177fcf63b2eebbf1ad0a85
nailgun_sha: d590b26dbb09785b8a8b3651b0ef69746fcf9991
openstack_version: 2015.1.0-7.0
production: docker
python-fuelclient_sha: 486bde57cda1badb68f915f66c61b544108606f3
release: '7.0'

Diagnostic Snapshot: http://mos-scale-share.mirantis.com/fuel-snapshot-2015-09-22_06-53-27.tar.xz

Dina Belova (dbelova)
Changed in mos:
milestone: 8.0-updates → 8.0
Ivan Kolodyazhny (e0ne)
tags: added: oslo.messaging
Revision history for this message
Davanum Srinivas (DIMS) (dims-v) wrote :

The scheduler_fanout queues hold far more messages, and I believe they belong to Nova, so the Nova team should look at this as well.

Revision history for this message
Ivan Kolodyazhny (e0ne) wrote :

Errors from cinder-scheduler log:
2015-09-19 10:30:43.848 2497 INFO oslo_messaging._drivers.impl_rabbit [-] A recoverable connection/channel error occurred, trying to reconnect: [Errno 111] ECONNREFUSED
2015-09-19 10:30:58.861 2497 INFO oslo_messaging._drivers.impl_rabbit [-] A recoverable connection/channel error occurred, trying to reconnect: [Errno 111] ECONNREFUSED
2015-09-19 10:31:13.875 2497 INFO oslo_messaging._drivers.impl_rabbit [-] A recoverable connection/channel error occurred, trying to reconnect: [Errno 111] ECONNREFUSED
2015-09-19 10:31:28.882 2497 INFO oslo_messaging._drivers.impl_rabbit [-] A recoverable connection/channel error occurred, trying to reconnect: [Errno 111] ECONNREFUSED
2015-09-19 10:31:43.894 2497 INFO oslo_messaging._drivers.impl_rabbit [-] A recoverable connection/channel error occurred, trying to reconnect: [Errno 111] ECONNREFUSED

summary: - a lot of messages in cinder-scheduler rabbitmq queue during create-and-
- delete-volume rally scenario
+ a lot of messages in scheduler rabbitmq queue during create-and-delete-
+ volume rally scenario
Dina Belova (dbelova)
tags: added: cinder nova
Revision history for this message
Adam Heczko (aheczko-mirantis) wrote :

Maybe we should consider setting up TTL for some queues?
http://www.rabbitmq.com/ttl.html#queue-ttl
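
As a sketch of that idea (assuming the fanout queues live in the default vhost and a RabbitMQ version with policy support): expire any fanout queue once it has gone unused for 10 minutes.

     # Queues matching the pattern are deleted after 600000 ms (10 min)
     # without consumers; orphaned fanout queues then clean themselves up.
     rabbitmqctl set_policy --apply-to queues expire-fanout '.*_fanout_.*' \
         '{"expires":600000}'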

description: updated
description: updated
Revision history for this message
Dmitry Mescheryakov (dmitrymex) wrote :

An observation from my side: this paste was taken from the env when the bug was filed:
http://paste.openstack.org/show/472546/

And this one was taken a day later when rabbitmq was restarted:
http://paste.openstack.org/show/473479/

Note that:
a. The overflowed queues were deleted during the restart because they are not durable. It can also be seen that they hold far fewer messages after the restart.
b. But the services recreated queues with exactly the same names after reconnecting, and they still do not consume from them.
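
Point (a) can be confirmed directly; as a quick sketch, list the durable flag next to each fanout queue and expect 'false':

     rabbitmqctl list_queues name durable | grep _fanout_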

Revision history for this message
Dmitry Mescheryakov (dmitrymex) wrote :

Updated the description with exact steps to reproduce and RCA.

description: updated
Revision history for this message
Dmitry Mescheryakov (dmitrymex) wrote :

The issue in master (8.0) is fixed by this change request: https://review.openstack.org/#/c/239739/

Revision history for this message
Mike Nguyen (moozoo) wrote :

Until this is fixed in 7.0, can the cinder-scheduler_fanout_* and scheduler_fanout_* queues where these messages accumulate be purged without any consequences or risk?

We are seeing the same behaviour and, to be quite honest, probably would not have noticed it if not for LMA Grafana and Nagios, which are set to warn when there are >200 outstanding messages in RabbitMQ...

Revision history for this message
Sergii Rizvan (srizvan) wrote :

Here is the link to the CR for the stable/7.0 branch: https://review.openstack.org/#/c/249692/

Roman Rufanov (rrufanov)
tags: added: customer-found support
Revision history for this message
Dmitry Mescheryakov (dmitrymex) wrote :

Mike Nguyen: in case this is still relevant, yes, purging fanout queues is safe if they have no consumers.
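
As an illustrative sketch (assuming a RabbitMQ version whose rabbitmqctl provides purge_queue; the queue name below is taken from the listing in the description):

     # Show fanout queues that currently have zero consumers.
     rabbitmqctl list_queues name consumers | awk '$2 == 0 && $1 ~ /_fanout_/'
     # Purge one of them: its messages are discarded, the queue itself stays.
     rabbitmqctl purge_queue scheduler_fanout_0cebacfc84894e569b8260d8ccfcf367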

Alex Ermolov (aermolov)
tags: added: on-verification
Alex Ermolov (aermolov)
tags: removed: on-verification
tags: added: on-verification
Revision history for this message
Denis Meltsaykin (dmeltsaykin) wrote :

The 7.0 fix is being reverted in https://review.openstack.org/#/c/272497/, as it introduces a regression when a controller is added to an existing environment: Puppet expects to use the new Python script and may fail if the packages were not updated on _every_ controller. Moreover, it seems that the original fix only works during deployment and doesn't even resolve the issue described in this report.

Revision history for this message
Alexander Zatserklyany (zatserklyany) wrote :

Fix released

VERSION:
  feature_groups:
    - mirantis
  production: "docker"
  release: "8.0"
  api: "1.0"
  build_number: "507"

There are no 'cinder-scheduler_fanout_*' or 'scheduler_fanout_*' queues from which nobody consumes, checked 10 minutes after step 5.

See details at http://paste.openstack.org/show/485739/

tags: removed: on-verification
Revision history for this message
Andrey Grebennikov (agrebennikov) wrote :

Any progress on this bug resolution?

Revision history for this message
Andrey Grebennikov (agrebennikov) wrote :

Clarification: I'm asking about 7.0, since there are still a lot of deployments with it.

Revision history for this message
Andrey Grebennikov (agrebennikov) wrote :

@dmeltsaykin
Could you please explain the reason for reverting the patch for 7.0?
If you do not apply the patch for the manifests, there will be no regression.
If you update the manifests package (assuming the MU is applied), you consequently update the packages on the nodes as well. So if the new utility is included in the fuel-ha-utils package, it will be installed on all controllers.
I believe we should include the patch, rebuild the fuel-ha-utils package to include the rabbitmq-dump-clean utility, and add a notice about the procedure to the MU instructions.

Revision history for this message
Dmitry Mescheryakov (dmitrymex) wrote :

I guess this is a little late, but let me explain how the bug fix should be applied (if it is possible to push the fix into updates):

First we need the bug fix ported to 7.0-updates. Once that is done, the upgrade procedure will look like this:
 1. The user updates packages on the master node and all envs.
 2. The user manually invokes the script on each controller to 'purify' the existing /etc/rabbitmq/definitions file:
      cp /etc/rabbitmq/definitions /etc/rabbitmq/definitions.backup
      cat /etc/rabbitmq/definitions | rabbitmq-dump-clean.py > /tmp/newdefinitions
      mv /tmp/newdefinitions /etc/rabbitmq/definitions
 3. The user either manually deletes the existing fanout queues with no consumers, or (easier) simply shuts down the whole RabbitMQ cluster and starts it up again; the restart flushes all existing queues. See the sketch below.
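
For the restart in step 3, a sketch under Pacemaker (the resource name master_p_rabbitmq-server is an assumption based on typical MOS naming; verify it in 'pcs resource' output first):

     # Stop the whole RabbitMQ master/slave set, then bring it back up; all
     # non-durable queues, including the orphaned fanout ones, are dropped.
     pcs resource disable master_p_rabbitmq-server
     pcs resource enable master_p_rabbitmq-server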

Note: if you don't have the updates applied, you can still do steps #2 and #3 alone. That will fix your environment until you add another controller (or possibly delete an existing one), after which you will have to repeat steps #2 and #3 to fix your environment once again. It is safe to repeat step #2 on a controller; the script's actions are idempotent.

Another note: if the bug fix is ported into 7.0-updates, _new_ clusters will not be affected by the problem and users will not have to perform steps #2 and #3 for them.

Revision history for this message
Denis Meltsaykin (dmeltsaykin) wrote :

Let me elaborate on the topic.

First of all, the issue itself was introduced during the 7.0 release cycle in https://review.openstack.org/#/c/132967/. The corresponding bug report #1383258 doesn't clearly state why this is needed, what it is for, et cetera. So I assume it was just a badly designed feature whose purpose no one knows. Moreover, it doesn't take into account the nature of RPC and its specifics (such as randomly generated queues).

Secondly, this particular fix looks like a temporary solution on top of a previous temporary solution and introduces a regression in specific scenarios, which is not acceptable in a stable release, because we really do want to get this right. The fix won't land in a 7.0 Maintenance Update.

And finally, the issue can be avoided by setting an "expiration" policy for the affected queues in RabbitMQ (see https://www.rabbitmq.com/ttl.html for details), or by cleaning the useless contents out of the definitions file. I'm strongly against backporting something from oslo.messaging/master when there is a good workaround that can be applied in place.

Therefore, taking all of the above into account, I'm closing this bug as Opinion. Feel free to reopen it if you have a good solution that can be included in a maintenance update.

Revision history for this message
Alexander Rubtsov (arubtsov) wrote :

Denis,

Is it possible to implement one of the workarounds you mentioned (expiration policy / cleaning the definitions) in the manifests and put it into the next 7.0 Maintenance Update? Some customers still deploy MOS 7.0 from scratch, and it would be beneficial for them to have this issue eliminated in the MU rather than performing post-deployment steps manually.

Revision history for this message
Denis Meltsaykin (dmeltsaykin) wrote :

Alexander, I personally think it is not as harmful as stated. The definitions dump only occurs during a deployment, so basically there shouldn't be too many RPC queues dumped. The contents of the dump are only used on RabbitMQ restart, which is not a daily procedure and should occur rarely. The reverted fix was useful only in rare cases involving frequent re-deployments of controllers, so in a normal use case everything should be fine without it, I believe.

Revision history for this message
Alexander Bozhenko (alexbozhenko) wrote :

>Alexander, I personally think that it is not so harmful as it is stated.
Denis, do you mean the bug, or that workaround?

Revision history for this message
Denis Meltsaykin (dmeltsaykin) wrote :

Fix for 7.0 has been merged: https://review.openstack.org/#/c/321038/

tags: added: on-verification
Revision history for this message
TatyanaGladysheva (tgladysheva) wrote :

Verified on MOS 7.0 + MU5 updates.

Before updates:
root@node-3:~# cat /etc/rabbitmq/definitions | grep -c scheduler_fanout
1

After updates:
root@node-5:~# cat /etc/rabbitmq/definitions | grep -c scheduler_fanout
0

tags: removed: on-verification