[OpenStack Yoga] Creating a VM fails when only one rabbitmq is stopped

Bug #1990257 reported by Son Do Xuan
This bug affects 3 people
Affects                    Status    Importance  Assigned to  Milestone
OpenStack Compute (nova)   Opinion   Wishlist    Unassigned
RabbitMQ                   New       Undecided   Unassigned

Bug Description

Hi, I deployed a new OpenStack cluster (OpenStack Yoga) with kolla-ansible. Everything works fine.
Then I stopped only one rabbitmq-server in the cluster; after that, I can't create a new VM.

Reproduce (example commands below):
- Deploy a new OpenStack Yoga cluster with kolla-ansible
- Stop rabbitmq on one random node (docker stop rabbitmq)
- Try to create a new server
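
A rough sketch of those steps as commands (the inventory file, image, flavor and network names are placeholders, not taken from this report):

# Deploy the Yoga cluster with kolla-ansible
kolla-ansible -i multinode bootstrap-servers
kolla-ansible -i multinode deploy

# Stop the rabbitmq container on one of the controllers
ssh controller-02 docker stop rabbitmq

# Try to boot a new server; with the bug present this hangs or fails
openstack server create --image cirros --flavor m1.tiny --network demo-net test-vm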

Revision history for this message
Sofia Enriquez (lsofia-enriquez) wrote :

Hi Son Do Xuan, hope this email finds you well. This doesn't look like a Cinder bug. Moving to Nova and RabbitMQ. Thanks.

no longer affects: cinder
Revision history for this message
Rajat Dhasmana (whoami-rajat) wrote :

Hi,

Can you provide more details on how you are creating the VM? Is it from an image or from a bootable volume?
Can you also provide the ERROR trace from the nova logs for the failure, and check the cinder logs if the instance is launched from a bootable volume?
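
For a kolla-ansible deployment, the trace can usually be found with something like the commands below (the log paths are assumptions based on a default deployment, where logs live under /var/log/kolla/ inside the containers or in the kolla_logs Docker volume on the host):

# Look for the failure trace in the nova logs (conductor on controllers, compute on compute nodes)
grep -i ERROR /var/log/kolla/nova/nova-conductor.log
grep -i ERROR /var/log/kolla/nova/nova-compute.log
# If the instance boots from a volume, also check the cinder logs
grep -i ERROR /var/log/kolla/cinder/cinder-volume.log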

Revision history for this message
sean mooney (sean-k-mooney) wrote :

This is not really unexpected.

Nova, like most OpenStack services, depends on RabbitMQ being reliable, and if an RPC message is lost then things can break.

Fixing that is not in scope for nova.
It would require a fundamental rearchitecture to introduce tasks and track state, allowing us to resend messages if we see that the action was not performed after some time.
That is not something we currently have the development capacity or the inclination to address.

We do not have any logs to review, but I would guess there is an interval where the compute agent is unable to receive messages from its queues, or where messages are lost, when rabbit is stopped.

You might be able to address this by using durable or quorum queues, but that is out of scope for nova.
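
As a sketch of the durable-queue option mentioned above (whether it helps depends on the oslo.messaging version in use; this is an assumption, not something verified in this report):

[oslo_messaging_rabbit]
# Declare exchanges and queues as durable so they survive a broker restart
amqp_durable_queues = true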

Changed in nova:
importance: Undecided → Wishlist
status: New → Opinion
affects: kolla → kolla-ansible
Revision history for this message
Matt Crees (mattcrees) wrote :

Hi,

I've recently been exploring this same issue. This behaviour is caused by a bug in oslo.messaging; there are currently a couple of patches in progress to resolve it:

https://review.opendev.org/c/openstack/oslo.messaging/+/866616
https://review.opendev.org/c/openstack/oslo.messaging/+/866617

In the meantime, I've found that setting the config option `[oslo_messaging_rabbit] kombu_reconnect_delay = 0.1` fixes this bug. This should be set for all services which use RabbitMQ.
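
For example, with kolla-ansible one way to apply this to every service is a global config override (the path below assumes kolla-ansible's standard merge-config override location; adjust for your deployment):

# /etc/kolla/config/global.conf -- merged into every service's configuration by kolla-ansible
[oslo_messaging_rabbit]
kombu_reconnect_delay = 0.1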

These bug reports show the oslo.messaging patch that caused the issue and the discussion on how it is being resolved:

https://bugs.launchpad.net/cloud-archive/+bug/1993149
https://bugs.launchpad.net/kolla-ansible/+bug/1990257

Revision history for this message
Yusuf Güngör (yusuf2) wrote :

Hi everyone, there is a bug in oslo.messaging which breaks RabbitMQ HA. When the RabbitMQ leader unit is down, the consumers cannot connect to the other RabbitMQ nodes.

There is a fix in oslo.messaging 12.9.4-1.

But the installed oslo.messaging version is 12.9.4 in the xena images. Can we update oslo.messaging to 12.9.4-1 for all images which use oslo.messaging?

There is no problem with the Yoga images; they ship oslo.messaging 12.13.1, which includes the patch.

$ docker pull quay.io/openstack.kolla/ubuntu-source-neutron-server:xena
$ docker run -it quay.io/openstack.kolla/ubuntu-source-neutron-server:xena bash
()[neutron@51cabaa9dc23 /]$ pip freeze
oslo.messaging==12.9.4

Sources:

- https://bugs.launchpad.net/kolla-ansible/+bug/1993876
- https://bugs.launchpad.net/cloud-archive/+bug/1993149
- https://docs.openstack.org/releasenotes/oslo.messaging/en_GB/xena.html

Revision history for this message
Khoi (khoinh5) wrote :

Setting:

[oslo_messaging_rabbit]
kombu_reconnect_delay=0.5

will fix it. From Yoga onward we can use quorum queues to survive when 1 of the 3 nodes is down.

https://docs.openstack.org/releasenotes/oslo.messaging/yoga.html

If you use a SAN then use both of them:

https://docs.openstack.org/releasenotes/oslo.messaging/yoga.html
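
A sketch of the combined settings described above (kombu_reconnect_delay plus quorum queues, the latter available from Yoga's oslo.messaging onward); treat it as an illustration rather than verified guidance:

[oslo_messaging_rabbit]
kombu_reconnect_delay = 0.5
# Yoga and later: use RabbitMQ quorum queues instead of classic mirrored queues
rabbit_quorum_queue = true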

Revision history for this message
Yusuf Güngör (yusuf2) wrote :

Hi Khoi, thanks for reply.

The oslo.messaging team increased ACK_REQUEUE_EVERY_SECONDS_MAX to resolve issues with RabbitMQ HA failover.

The old default value of ACK_REQUEUE_EVERY_SECONDS_MAX was 1.0 seconds, and kombu_reconnect_delay is also 1.0 seconds. They have now changed ACK_REQUEUE_EVERY_SECONDS_MAX to 5 seconds to resolve the issue.

Adding the kombu_reconnect_delay config to oslo_messaging_rabbit is hard for us, because there are lots of units which use oslo.messaging.

We prefer to continue with the defaults after upgrading oslo.messaging to 12.9.4-1.

no longer affects: kolla
no longer affects: kolla-ansible