[OpenStack Yoga] Creating a VM fails when only one rabbitmq is stopped

Bug #1990257 reported by Son Do Xuan
This bug affects 3 people
Affects                    Status    Importance  Assigned to  Milestone
OpenStack Compute (nova)   Opinion   Wishlist    Unassigned
RabbitMQ                   New       Undecided   Unassigned

Bug Description

Hi, I deployed a new OpenStack cluster (OpenStack Yoga) with kolla-ansible. Everything works fine.
Then I stopped only one rabbitmq-server in the cluster; after that, I can't create a new VM.

Reproduce (example commands below):
- Deploy a new OpenStack Yoga cluster with kolla-ansible
- Stop rabbitmq on one random node (docker stop rabbitmq)
- Try to create a new server
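
A rough sketch of those steps as commands (the inventory file, image, flavor and network names are placeholders, not taken from this report):

# Deploy the Yoga cluster with kolla-ansible
kolla-ansible -i multinode bootstrap-servers
kolla-ansible -i multinode deploy

# Stop the rabbitmq container on one of the controllers
ssh controller-02 docker stop rabbitmq

# Try to boot a new server; with the bug present this hangs or fails
openstack server create --image cirros --flavor m1.tiny --network demo-net test-vm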

Revision history for this message
Sofia Enriquez (lsofia-enriquez) wrote :

Hi Son Do Xuan, hope this email finds you well. This doesn't look like a Cinder bug. Moving to Nova and RabbitMQ. Thanks.

no longer affects: cinder
Revision history for this message
Rajat Dhasmana (whoami-rajat) wrote :

Hi,

Can you provide more details on how you are creating the VM? Is it from an image or from a bootable volume?
Can you also provide the ERROR trace from the nova logs for the failure, and check the cinder logs if the instance is launched from a bootable volume?
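
For a kolla-ansible deployment, the trace can usually be found with something like the commands below (the log paths are assumptions based on a default deployment, where logs live under /var/log/kolla/ inside the containers or in the kolla_logs Docker volume on the host):

# Look for the failure trace in the nova logs (conductor on controllers, compute on compute nodes)
grep -i ERROR /var/log/kolla/nova/nova-conductor.log
grep -i ERROR /var/log/kolla/nova/nova-compute.log
# If the instance boots from a volume, also check the cinder logs
grep -i ERROR /var/log/kolla/cinder/cinder-volume.log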

Revision history for this message
sean mooney (sean-k-mooney) wrote :

This is not really unexpected.

Nova, like most OpenStack services, depends on RabbitMQ being reliable, and if an RPC message is lost then things can break.

Fixing that is not in scope for nova.
It would require a fundamental rearchitecture to introduce tasks and track state, allowing us to resend messages if we see that the action was not performed after some time.
That is not something we currently have the development capacity or the inclination to address.

We do not have any logs to review, but I would guess there is an interval where the compute agent is unable to receive messages from its queues, or where messages are lost, when rabbit is stopped.

You might be able to address this by using durable or quorum queues, but that is out of scope for nova.
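
As a sketch of the durable-queue option mentioned above (whether it helps depends on the oslo.messaging version in use; this is an assumption, not something verified in this report):

[oslo_messaging_rabbit]
# Declare exchanges and queues as durable so they survive a broker restart
amqp_durable_queues = true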

Changed in nova:
importance: Undecided → Wishlist
status: New → Opinion
affects: kolla → kolla-ansible
Revision history for this message
Matt Crees (mattcrees) wrote :

Hi,

I've recently been exploring this same issue. This behaviour is caused by a bug in oslo.messaging; there are currently a couple of patches in progress to resolve it:

https://review.opendev.org/c/openstack/oslo.messaging/+/866616
https://review.opendev.org/c/openstack/oslo.messaging/+/866617

In the meantime, I've found that setting the config option `[oslo_messaging_rabbit] kombu_reconnect_delay = 0.1` fixes this bug. This should be set for all services which use RabbitMQ.
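
For example, with kolla-ansible one way to apply this to every service is a global config override (the path below assumes kolla-ansible's standard merge-config override location; adjust for your deployment):

# /etc/kolla/config/global.conf -- merged into every service's configuration by kolla-ansible
[oslo_messaging_rabbit]
kombu_reconnect_delay = 0.1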

These bug reports show the oslo.messaging patch that caused the issue and the discussion on how it is being resolved:

https://bugs.launchpad.net/cloud-archive/+bug/1993149
https://bugs.launchpad.net/kolla-ansible/+bug/1990257

Revision history for this message
Yusuf Güngör (yusuf2) wrote :

Hi everyone, there is a bug in oslo.messaging which breaks RabbitMQ HA. When the RabbitMQ leader unit is down, the consumers cannot connect to the other RabbitMQ nodes.

There is a fix in oslo.messaging 12.9.4-1.

But the installed oslo.messaging version is 12.9.4 in the xena images. Can we update oslo.messaging to 12.9.4-1 for all images which use oslo.messaging?

There is no problem with the Yoga images; they ship oslo.messaging 12.13.1, which includes the patch.

$ docker pull quay.io/openstack.kolla/ubuntu-source-neutron-server:xena
$ docker run -it quay.io/openstack.kolla/ubuntu-source-neutron-server:xena bash
()[neutron@51cabaa9dc23 /]$ pip freeze
oslo.messaging==12.9.4

Sources:

- https://bugs.launchpad.net/kolla-ansible/+bug/1993876
- https://bugs.launchpad.net/cloud-archive/+bug/1993149
- https://docs.openstack.org/releasenotes/oslo.messaging/en_GB/xena.html

Revision history for this message
Khoi (khoinh5) wrote :

Setting:

[oslo_messaging_rabbit]
kombu_reconnect_delay=0.5

will fix it. From Yoga onward we can use quorum queues to survive when 1 of the 3 nodes is down.

https://docs.openstack.org/releasenotes/oslo.messaging/yoga.html

If you use a SAN then use both of them:

https://docs.openstack.org/releasenotes/oslo.messaging/yoga.html
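
A sketch of the combined settings described above (kombu_reconnect_delay plus quorum queues, the latter available from Yoga's oslo.messaging onward); treat it as an illustration rather than verified guidance:

[oslo_messaging_rabbit]
kombu_reconnect_delay = 0.5
# Yoga and later: use RabbitMQ quorum queues instead of classic mirrored queues
rabbit_quorum_queue = true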

Revision history for this message
Yusuf Güngör (yusuf2) wrote :

Hi Khoi, thanks for reply.

The oslo.messaging team increased ACK_REQUEUE_EVERY_SECONDS_MAX to resolve issues with RabbitMQ HA failover.

The old default value of ACK_REQUEUE_EVERY_SECONDS_MAX was 1.0 seconds, and kombu_reconnect_delay is also 1.0 seconds. They have now changed ACK_REQUEUE_EVERY_SECONDS_MAX to 5 seconds to resolve the issue.

Adding the kombu_reconnect_delay config to oslo_messaging_rabbit is hard for us, because there are lots of units which use oslo.messaging.

We prefer to continue with the defaults after upgrading oslo.messaging to 12.9.4-1.

no longer affects: kolla
no longer affects: kolla-ansible