Number of heat RabbitMQ queues grows from failover to failover

Bug #1599104 reported by Yves-Gwenael Bourhis
This bug affects 16 people
Affects              Status        Importance   Assigned to   Milestone
Fuel for OpenStack   In Progress   Medium       Rico Lin      Mitaka
OpenStack Heat       In Progress   Medium       Rico Lin

Bug Description

https://bugs.launchpad.net/heat/+bug/1414674 is not fixed in mitaka.

- Launch a Mitaka devstack
- Reattach the screen (screen -r)
- sudo rabbitmqctl list_queues | grep -i heat | wc -l  # I get 33
- Go to the h-eng session of screen, <Ctrl-C> it, and relaunch it.
- Go to the "shell" session of screen and rerun "sudo rabbitmqctl list_queues | grep -i heat | wc -l". I now get 65.
- Repeat again: I get 97 heat queues...

Every time I stop/start h-eng with:
/usr/local/bin/heat-engine --config-file=/etc/heat/heat.conf & echo $! >/opt/stack/status/stack/h-eng.pid; fg || echo "h-eng failed to start" | tee "/opt/stack/status/stack/h-eng.failure"
the heat queues increase.

When looking at the heat queues in RabbitMQ, we see tons of queues such as "heat-engine-listener.01ce4b29-7fac-44a1-873c-efb4f6817f88" which never expire.

When checking the queues of all the other OpenStack services (nova, neutron, cinder, etc.), every queue with a random ID in its name is set to expire; heat only sets its fanout queues as expirable.
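
For illustration, here is a minimal pika sketch of that distinction (assuming a local RabbitMQ broker with default credentials; the queue names are invented). A queue declared with the x-expires argument is removed by the broker after being unused for that long, while one declared without it lingers forever:

    import pika

    # Minimal sketch, assuming RabbitMQ on localhost with default credentials.
    connection = pika.BlockingConnection(pika.ConnectionParameters('localhost'))
    channel = connection.channel()

    # Expirable: the broker deletes this queue after 10 minutes without
    # consumers or reads (x-expires is in milliseconds).
    channel.queue_declare(queue='demo.expirable',
                          arguments={'x-expires': 600000})

    # Non-expirable: like the heat-engine-listener.<uuid> queues described
    # above, this queue stays around until someone deletes it explicitly.
    channel.queue_declare(queue='demo.lingering')

    connection.close()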

Revision history for this message
Anant Patil (ananta) wrote :

Could you please give details such as the number of engine workers? Also, please attach the heat.conf file.

Changed in heat:
status: New → Confirmed
Revision history for this message
Anant Patil (ananta) wrote :

<opinion>
I can see this clearly on master as well. At exit, heat or oslo.messaging is supposed to close the sessions. I wonder why RabbitMQ still keeps the queues. Maybe the keep-alive from the heat engine is disabled?
</opinion>

Revision history for this message
Anant Patil (ananta) wrote :

As per https://review.openstack.org/#/c/243845, the queues from a previous run are supposed to be deleted after 10 minutes, which is the default value. But that doesn't happen; the queues linger around. Note that the patch takes care of reply queues and fanout queues only.
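
For reference, the knob that review appears to introduce is the transient-queues TTL in the [oslo_messaging_rabbit] section of heat.conf. A hedged sketch (the option name comes from oslo.messaging itself, not from this bug report):

    [oslo_messaging_rabbit]
    # TTL in seconds, applied as x-expires to reply and fanout queues;
    # 600 matches the 10-minute default mentioned above.
    rabbit_transient_queues_ttl = 600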

Revision history for this message
Yves-Gwenael Bourhis (yves-gwenael-bourhis) wrote :

To reproduce the issue: http://paste.openstack.org/show/525938/
In devstack, "git checkout stable/mitaka" before launching ./stack.sh.

Since it's a devstack, I only have one engine worker.

Revision history for this message
Yves-Gwenael Bourhis (yves-gwenael-bourhis) wrote :

When I say "since it's a devstack", I mean that I reproduce it in a devstack; for production environments, however, it's a HUGE issue: we easily end up with thousands of trailing queues...

Revision history for this message
Yves-Gwenael Bourhis (yves-gwenael-bourhis) wrote :

Here is the heat.conf file generated by devstack: http://paste.openstack.org/show/525983/

Revision history for this message
Yves-Gwenael Bourhis (yves-gwenael-bourhis) wrote :

NOTE: I think lots of OpenStack sysadmins would welcome a "safe" method to clean the queues after the fix is released, because I don't think the fix will clean up the pre-existing queues.

I was thinking of suspending the heat services and deleting the heat queues before relaunching the services, but maybe someone has a better idea...

Revision history for this message
Anant Patil (ananta) wrote :

Not a bad workaround, but you have to be extra cautious not to delete queues holding unprocessed messages. You will have to make sure heat-api is not taking any requests, and that all the existing messages in the queues are processed. IMO, this needs to be fixed properly so that the queues are auto-deleted when the underlying connections are gone/reset. I am not sure about the how of it, though.

Revision history for this message
Yves-Gwenael Bourhis (yves-gwenael-bourhis) wrote :

Another workaround would be:

    sudo rabbitmqctl set_policy expiry ".*" '{"expires":43200000}' --apply-to queues

to delete all queues that have been idle for 12 hours.
It worked on devstack and didn't seem to impact other services (although it also auto-deletes the idle queues of other services), but I don't know if it's safe for a production environment...
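
The same policy can also be applied programmatically. A minimal Python sketch via the RabbitMQ management HTTP API (assuming the management plugin on localhost:15672, guest credentials, and the default vhost; adjust for a real deployment):

    import json
    import requests

    # Equivalent of the rabbitmqctl command above: expire any queue that
    # has been idle for 12 hours (43200000 ms) on the default vhost.
    resp = requests.put(
        'http://localhost:15672/api/policies/%2F/expiry',
        auth=('guest', 'guest'),
        headers={'content-type': 'application/json'},
        data=json.dumps({
            'pattern': '.*',
            'definition': {'expires': 43200000},
            'apply-to': 'queues',
        }))
    resp.raise_for_status()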

Revision history for this message
Anant Patil (ananta) wrote :

No, this cannot be the solution. What if the service is up but not used for that amount of time?

Revision history for this message
Yves-Gwenael Bourhis (yves-gwenael-bourhis) wrote :

Here is a quick workaround to automatically delete the idle queues without stopping heat services:

https://gist.github.com/ygbourhis/8d258ae76d62ef11a2f77f0251c906ff

Tested successfully on devstack. Would this be safe in production?
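
(For readers without access to the gist, here is a minimal sketch of this kind of cleanup, assuming the RabbitMQ management API is enabled and targeting only empty, consumer-less heat queues. This is illustrative, not the gist's actual code:)

    import requests

    API = 'http://localhost:15672/api'  # assumed management endpoint
    AUTH = ('guest', 'guest')           # assumed credentials
    VHOST = '%2F'                       # default vhost, URL-encoded

    # Delete heat queues that currently have no consumers and no messages.
    for q in requests.get('%s/queues/%s' % (API, VHOST), auth=AUTH).json():
        name = q['name']
        if not name.startswith(('heat-engine-listener.', 'engine_worker.')):
            continue
        if q.get('consumers', 0) == 0 and q.get('messages', 0) == 0:
            # if-empty guards against racing a message that just arrived.
            requests.delete('%s/queues/%s/%s?if-empty=true'
                            % (API, VHOST, name), auth=AUTH)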

FYI :
The faulty queues seem to be declared somewhere around
here: https://github.com/openstack/heat/blob/master/heat/engine/service.py#L267
and here: https://github.com/openstack/heat/blob/master/heat/rpc/listener_client.py#L37
But I may be wrong, because I can't find how to manage queues at all with oslo_messaging, and I find myself unable to propose any heat fix because of oslo_messaging's documentation... After 3 days with oslo_messaging's documentation I admit my total and absolute failure to understand it, while I understood pika and RabbitMQ within a few hours even though I had never used them before...
In fact, the oslo_messaging doc reads more like a reminder for those who already know it, but is inaccessible to newcomers.
I would fully accept being told that this bug is a side effect of a faulty oslo_messaging doc...
Unless I have the wrong doc: https://wiki.openstack.org/wiki/Oslo/Messaging
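
For context, here is a rough sketch of how a UUID-named listener queue like the ones above comes into being through oslo.messaging (illustrative only, not heat's exact code; the endpoint class is made up):

    import uuid

    import oslo_messaging
    from oslo_config import cfg

    transport = oslo_messaging.get_transport(cfg.CONF)
    engine_id = str(uuid.uuid4())

    # server=engine_id is what produces per-process queue names such as
    # "heat-engine-listener.<uuid>" in RabbitMQ.
    target = oslo_messaging.Target(topic='heat-engine-listener',
                                   server=engine_id)

    class ListenerEndpoint(object):
        def listening(self, ctxt):
            return True

    server = oslo_messaging.get_rpc_server(transport, target,
                                           [ListenerEndpoint()])
    server.start()

Because each start generates a fresh UUID, every restart declares a brand-new queue and abandons the previous one.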

Revision history for this message
Anant Patil (ananta) wrote :

I am not sure why the heat engine should create new message queues. It should connect to the already existing queues and start processing the messages left there before it crashed/restarted. There could be messages in the old queues, and I guess those queues cannot be deleted. In fact they should not be deleted, but reconnected to whenever the heat engine processes come up again.
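
A toy pika illustration of the distinction he is drawing (queue names invented for the example): a fixed name is reused across restarts, so pending messages survive, while a fresh UUID per start orphans the previous run's queue:

    import uuid

    import pika

    connection = pika.BlockingConnection(pika.ConnectionParameters('localhost'))
    channel = connection.channel()

    # Reused across restarts: same name every time, so the process picks up
    # any messages left behind and no orphan queue accumulates.
    channel.queue_declare(queue='engine_worker.shared')

    # Orphaned on every restart: a new UUID each run means the previous
    # run's queue lingers with no consumer.
    channel.queue_declare(queue='engine_worker.%s' % uuid.uuid4())

    connection.close()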

Rico Lin (rico-lin)
Changed in heat:
assignee: nobody → Rico Lin (rico-lin)
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to heat (master)

Fix proposed to branch: master
Review: https://review.openstack.org/353909

Changed in heat:
status: Confirmed → In Progress
Rico Lin (rico-lin)
Changed in heat:
importance: Undecided → Medium
milestone: none → newton-3
Changed in devstack:
assignee: nobody → Rico Lin (rico-lin)
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to devstack (master)

Fix proposed to branch: master
Review: https://review.openstack.org/355374

Changed in devstack:
status: New → In Progress
Rico Lin (rico-lin)
Changed in fuel:
assignee: nobody → Rico Lin (rico-lin)
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to fuel-library (master)

Related fix proposed to branch: master
Review: https://review.openstack.org/356272

Rico Lin (rico-lin)
Changed in fuel:
status: New → In Progress
Changed in fuel:
importance: Undecided → Medium
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to heat (master)

Fix proposed to branch: master
Review: https://review.openstack.org/357045

Changed in fuel:
milestone: none → 9.1
Thomas Herve (therve)
Changed in heat:
milestone: newton-3 → next
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on devstack (master)

Change abandoned by Rico Lin (<email address hidden>) on branch: master
Review: https://review.openstack.org/355374

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to fuel-library (master)

Reviewed: https://review.openstack.org/356272
Committed: https://git.openstack.org/cgit/openstack/fuel-library/commit/?id=46eed2a513518b34d26edb869c1710b91650add1
Submitter: Jenkins
Branch: master

commit 46eed2a513518b34d26edb869c1710b91650add1
Author: ricolin <email address hidden>
Date: Wed Aug 17 15:06:34 2016 +0800

    Add RabbitMQ expiration policies for convergence

    Heat now supports convergence mode, which gives it more chances to
    trigger bug 1599104: stopping the services with an interrupt signal
    leaves the target queues open, and the number of queues keeps growing
    each time this is repeated. This patch simply adds expiration policies
    for the worker queues, allowing a queue to be closed once it has been
    unused for a long period of time (1 hour). This should solve the
    problem of the growing number of RabbitMQ queues.

    Change-Id: Icda32000f391780c4e3d5d3ebcc519bf853283b7
    Related-Bug: #1599104

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to fuel-library (stable/mitaka)

Related fix proposed to branch: stable/mitaka
Review: https://review.openstack.org/366717

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to fuel-library (stable/mitaka)

Reviewed: https://review.openstack.org/366717
Committed: https://git.openstack.org/cgit/openstack/fuel-library/commit/?id=d2bb3637b38ea3defdf4fb3330c1c2766d66314c
Submitter: Jenkins
Branch: stable/mitaka

commit d2bb3637b38ea3defdf4fb3330c1c2766d66314c
Author: ricolin <email address hidden>
Date: Wed Aug 17 15:06:34 2016 +0800

    Add RabbitMQ expiration policies for convergence

    Heat now supports convergence mode, which gives it more chances to
    trigger bug 1599104: stopping the services with an interrupt signal
    leaves the target queues open, and the number of queues keeps growing
    each time this is repeated. This patch simply adds expiration policies
    for the worker queues, allowing a queue to be closed once it has been
    unused for a long period of time (1 hour). This should solve the
    problem of the growing number of RabbitMQ queues.

    Change-Id: Icda32000f391780c4e3d5d3ebcc519bf853283b7
    Related-Bug: #1599104
    (cherry picked from commit 46eed2a513518b34d26edb869c1710b91650add1)

tags: added: in-stable-mitaka
Roman Vyalov (r0mikiam)
Changed in fuel:
status: In Progress → Won't Fix
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on heat (master)

Change abandoned by Rico Lin (<email address hidden>) on branch: master
Review: https://review.openstack.org/353909
Reason: Part of this fix was already adopted by another patch and merged in master.
This patch no longer matches the current architecture, so it is abandoned.

Revision history for this message
JiaJunsu (jiajunsu) wrote :

I agree with Patil. We have found this problem in production environments. Setting a policy in RabbitMQ is a good way to delete the expired queues, but I think heat should reuse the same queues after it restarts.
I set the expiry time to 1 hour, but when heat was restarted twice within 1 hour (an upgrade scenario), it still led to thousands of useless queues.

Revision history for this message
Sean Dague (sdague) wrote :

This devstack bug was last updated over 180 days ago. As devstack
is a fast-moving project and we'd like to get the tracker down to
currently actionable bugs, this is being marked as Invalid. If the
issue still exists, please feel free to reopen it.

Changed in devstack:
status: In Progress → Invalid
Rico Lin (rico-lin)
no longer affects: devstack
no longer affects: fuel
Revision history for this message
OpenStack Infra (hudson-openstack) wrote :

Change abandoned by Rico Lin (<email address hidden>) on branch: master
Review: https://review.openstack.org/357045

Revision history for this message
Hua Zhang (zhhuabj) wrote :

Has this problem been solved? I can still reproduce it on the Ussuri release, following the steps in the bug description. Is the workaround mentioned in comment #9 the final recommended solution?

1, create a heat ussuri test env

2, queue number before the test is 100 (here's more detailed data - https://paste.ubuntu.com/p/sN5pnSP8Mc/)

# rabbitmqctl list_queues -p openstack | grep -E 'engine_worker|heat-engine-listener' |wc -l
100

3, restart heat-engine to trigger the problem

juju ssh heat/0 -- sudo systemctl restart heat-engine

4, queue number has increased from 100 to 108 (here's more detailed data - https://paste.ubuntu.com/p/GWqWhSGyXm/)

# rabbitmqctl list_queues -p openstack | grep -E 'engine_worker|heat-engine-listener' |wc -l
108

Two new services (42cee820-4f0c-4aef-b8b6-705e7db3253a and 8c12d70c-b00c-4e9b-b33e-7cbf0cb8c510) were created, so we see:

# rabbitmqctl list_queues -p openstack | grep -E 'engine_worker|heat-engine-listener' |grep -E '42cee820-4f0c-4aef-b8b6-705e7db3253a|8c12d70c-b00c-4e9b-b33e-7cbf0cb8c510'
engine_worker.42cee820-4f0c-4aef-b8b6-705e7db3253a 0
engine_worker.8c12d70c-b00c-4e9b-b33e-7cbf0cb8c510 0
heat-engine-listener.42cee820-4f0c-4aef-b8b6-705e7db3253a 0
heat-engine-listener.8c12d70c-b00c-4e9b-b33e-7cbf0cb8c510 0

5, 4550fe68-ecad-457c-b080-29d6b5fb2e7f is an old engine; it was also soft-deleted by L2360 [1], but the old queue records are still there, so the queues keep growing.

# rabbitmqctl list_queues -p openstack | grep -E 'engine_worker|heat-engine-listener' |grep -E '4550fe68-ecad-457c-b080-29d6b5fb2e7f'
heat-engine-listener.4550fe68-ecad-457c-b080-29d6b5fb2e7f 0
engine_worker.4550fe68-ecad-457c-b080-29d6b5fb2e7f 0

[1] https://github.com/openstack/heat/blob/stable/ussuri/heat/engine/service.py#L2360

Revision history for this message
David Hill (david-hill-ubisoft) wrote :

We can reproduce this in Train too...
[root@undercloud-0-rhosp-beta ~]# podman exec -it rabbitmq rabbitmqctl list_queues | grep heat | wc -l
9
[root@undercloud-0-rhosp-beta ~]# podman restart heat_engine
4ecc3695a065bce7cb4d112e4dc62b1c35908c558f354b48833843fa9f583b3f
[root@undercloud-0-rhosp-beta ~]# podman exec -it rabbitmq rabbitmqctl list_queues | grep heat | wc -l
5
[root@undercloud-0-rhosp-beta ~]# podman exec -it rabbitmq rabbitmqctl list_queues | grep heat | wc -l
5
[root@undercloud-0-rhosp-beta ~]# podman exec -it rabbitmq rabbitmqctl list_queues | grep heat | wc -l
13
[root@undercloud-0-rhosp-beta ~]# podman exec -it rabbitmq rabbitmqctl list_queues | grep heat | wc -l
13
[root@undercloud-0-rhosp-beta ~]# podman exec -it rabbitmq rabbitmqctl list_queues | grep heat | wc -l
13
[root@undercloud-0-rhosp-beta ~]# podman exec -it rabbitmq rabbitmqctl list_queues | grep heat | wc -l
13
[root@undercloud-0-rhosp-beta ~]# podman restart heat_engine
4ecc3695a065bce7cb4d112e4dc62b1c35908c558f354b48833843fa9f583b3f
[root@undercloud-0-rhosp-beta ~]# podman exec -it rabbitmq rabbitmqctl list_queues | grep heat | wc -l
17
[root@undercloud-0-rhosp-beta ~]# podman exec -it rabbitmq rabbitmqctl list_queues | grep heat | wc -l
17
[root@undercloud-0-rhosp-beta ~]# podman restart heat_engine
4ecc3695a065bce7cb4d112e4dc62b1c35908c558f354b48833843fa9f583b3f
[root@undercloud-0-rhosp-beta ~]# podman exec -it rabbitmq rabbitmqctl list_queues | grep heat | wc -l
21

Revision history for this message
Damian Dąbrowski (damiandabrowski) wrote :

I just tested this, and the issue still persists in 2023.2.

Revision history for this message
Christian Rohmann (christian-rohmann) wrote :

This is the same issue reported at https://storyboard.openstack.org/#!/story/2007843.

I suggested a fix at https://storyboard.openstack.org/#!/story/2007843#comment-216180
