VMs stay stuck in scheduling when the RabbitMQ leader unit is down
| Affects | Status | Importance | Assigned to | Milestone |
|---|---|---|---|---|
| OpenStack RabbitMQ Server Charm | Invalid | Low | Unassigned | |
| Ubuntu Cloud Archive | Fix Released | Critical | Unassigned | |
| Yoga | Fix Released | Critical | Unassigned | |
| Zed | Fix Released | Critical | Unassigned | |
| oslo.messaging | Fix Released | Undecided | Unassigned | |
| python-oslo.messaging (Ubuntu) | Fix Released | Critical | Corey Bryant | |
| Jammy | Fix Released | Critical | Unassigned | |
| Kinetic | Fix Released | Critical | Corey Bryant | |
Bug Description
When testing rabbitmq-server HA in our OpenStack Yoga cloud environment (rabbitmq-server charm release 3.9/stable) we faced the following issues:
- When the leader unit is down, we are unable to launch any VMs, and those already launched stay stuck in the 'BUILD' state.
- The logs show that several OpenStack services have issues communicating with rabbitmq-server.
- After restarting all the services that use RabbitMQ (Nova, Cinder, Neutron, etc.), the issue is resolved and VMs can be launched successfully.
The corresponding logs are available at: https:/
We also observed the same behaviour when shutting down the rabbitmq-server unit that is listed first in the 'nova.conf' file; after restarting that rabbitmq unit, VM scheduling works again.
This can also be seen in this part of the log:
"Reconnected to AMQP server on 192.168.34.251:5672 via [amqp] client with port 41922."
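The host ordering matters here: oslo.messaging connects to brokers in the order they appear in nova.conf's `transport_url`, which is why the first-listed unit going down is the problematic case. A minimal sketch of extracting that ordered host list (the URL below is illustrative, not taken from the affected deployment):

```python
def rabbit_hosts(transport_url):
    """Return broker hosts in the order they appear in an oslo.messaging
    transport URL such as rabbit://user:pw@h1:5672,user:pw@h2:5672/vhost."""
    netloc = transport_url.split("://", 1)[1].rsplit("/", 1)[0]
    return [part.rsplit("@", 1)[-1].split(":")[0] for part in netloc.split(",")]

# Illustrative two-node URL; the first host is the one clients try first.
url = "rabbit://nova:pw@192.168.34.250:5672,nova:pw@192.168.34.251:5672/openstack"
print(rabbit_hosts(url))  # ['192.168.34.250', '192.168.34.251']
```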
======= Ubuntu SRU Details =======
[Impact]
Active/active HA for rabbitmq is broken when a node goes down.
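The broken behaviour comes down to failover between brokers: when the first (or leader) node is unreachable, clients should move on to the next host in the list rather than stay wedged. A minimal sketch of that expected failover logic (illustrative only, not oslo.messaging's actual implementation):

```python
def connect_with_failover(hosts, is_up):
    """Try each broker in order and return the first reachable one.

    `hosts` is the ordered list from the transport URL; `is_up` is a probe
    standing in for a real TCP/AMQP connection attempt.
    """
    for host in hosts:
        if is_up(host):
            return host
    raise ConnectionError("no reachable RabbitMQ broker")

# Simulate the first-listed (leader) node being shut down: a correct
# client fails over to the next host instead of hanging.
hosts = ["192.168.34.250", "192.168.34.251", "192.168.34.252"]
up = lambda h: h != "192.168.34.250"
print(connect_with_failover(hosts, up))  # 192.168.34.251
```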
[Test Case]
Deploy OpenStack with 3 units of rabbitmq-server in active/active HA, shut down the leader unit, and confirm that newly launched VMs no longer hang in the 'BUILD' state.
[Regression Potential]
Due to the criticality of this issue, I've decided to revert the upstream change that is causing the problem as a stop-gap until a proper fix is in place. That fix came in via https:/
Changed in charm-rabbitmq-server:
status: New → Triaged
importance: Undecided → Critical
tags: added: verification-needed-zed
tags: added: verification-done; removed: verification-needed
We followed one of the steps below to reproduce the bug:
1. Log in to the leader rabbitmq-server unit and run "shutdown now".
2. Log in to the rabbitmq-server unit that is listed first in the 'nova.conf' file and run "shutdown now".
Workaround applied to restore VM scheduling when hit by this bug:
juju ssh cinder/0 "sudo systemctl restart cinder-scheduler.service"
juju ssh cinder/0 "sudo systemctl restart cinder-volume.service"
juju ssh cinder/1 "sudo systemctl restart cinder-scheduler.service"
juju ssh cinder/1 "sudo systemctl restart cinder-volume.service"
juju ssh cinder/2 "sudo systemctl restart cinder-scheduler.service"
juju ssh cinder/2 "sudo systemctl restart cinder-volume.service"
juju ssh neutron-api/0 "sudo systemctl restart neutron-server.service"
juju ssh neutron-api/1 "sudo systemctl restart neutron-server.service"
juju ssh neutron-api/2 "sudo systemctl restart neutron-server.service"
juju ssh nova-cloud-controller/0 "sudo systemctl restart nova-conductor.service"
juju ssh nova-cloud-controller/0 "sudo systemctl restart nova-scheduler.service"
juju ssh nova-cloud-controller/0 "sudo systemctl restart nova-spiceproxy.service"
juju ssh nova-cloud-controller/1 "sudo systemctl restart nova-conductor.service"
juju ssh nova-cloud-controller/1 "sudo systemctl restart nova-scheduler.service"
juju ssh nova-cloud-controller/1 "sudo systemctl restart nova-spiceproxy.service"
juju ssh nova-cloud-controller/2 "sudo systemctl restart nova-conductor.service"
juju ssh nova-cloud-controller/2 "sudo systemctl restart nova-scheduler.service"
juju ssh nova-cloud-controller/2 "sudo systemctl restart nova-spiceproxy.service"
for i in {6..14}; do juju ssh $i "sudo systemctl restart nova-compute"; done
for i in {6..14}; do juju ssh $i "sudo systemctl restart nova-api-metadata"; done