RPC clients cannot find a reply queue after restart of the last RabbitMQ server in the cluster
Affects | Status | Importance | Assigned to | Milestone |
---|---|---|---|---|
Mirantis OpenStack | Fix Released | High | Dmitry Mescheryakov | |
5.1.x | Fix Released | High | Alexey Khivin | |
6.0.x | Fix Released | High | Alexander Nevenchannyy | |
6.1.x | Fix Released | High | Alexey Khivin | |
7.0.x | Fix Released | High | Dmitry Mescheryakov | |
8.0.x | Fix Released | High | MOS Oslo | |
Bug Description
Steps to reproduce:
1. Deploy MOS environment in HA mode with several controllers
2. Shut down one of the controllers, either gracefully or not
3. Wait for MySQL, RabbitMQ and OpenStack to failover (several minutes)
4. Use an OpenStack API call that invokes internal messaging via RabbitMQ. For instance, view the console log of an instance, or create instances, networks, volumes, etc.
Some requests sent in step 4 might fail with a timeout. The following message can be seen in the logs of the affected service:
"Queue not found:Basic.
Conditions for reproduction:
The issue occurs rather infrequently, though we don't have exact data. So far only 4 reproductions have been reported during the last 2-3 weeks.
User impact:
While the issue exists, some requests to OpenStack might fail (those, which are processed on the affected controller/
Workaround:
The workaround is to restart the affected service. After restart, the service will immediately become operational.
Current plan:
We are planning to fix the issue in updates for 6.1. Right now we are reproducing it with additional logging enabled to understand the root cause.
Detailed analysis by Roman Podoliaka
=======
Reply queues created by oslo.messaging are not durable (i.e. they are gone after a restart of the last RabbitMQ server in the cluster). The problem is that after a successful RabbitMQ failover, OpenStack services reconnect correctly, but RPC calls stay broken until the affected service is restarted: the reply queue is not recreated, so no reply can be received for a given call, and the call eventually fails with TimeoutError.
As can be seen in the output of the commands below, this particular nova-conductor reply queue first migrated from one RabbitMQ node to another, then saw the death of another mirror. After the RabbitMQ server on node-16 was restarted, the queue was gone, yet the nova-conductor RPC client still tried to consume messages from it.
rabbitmqctl list_queues: http://
root@node-16:~# grep reply_f7cac1a24
This wouldn't be a problem if a new reply queue were created for new RPC calls, but currently it leaves the RPC client unusable until the whole process is restarted.
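The failure mode described above can be illustrated with a small, self-contained simulation. This is only a sketch under stated assumptions: `FakeBroker` and `RpcClient` are hypothetical stand-ins invented for illustration, not oslo.messaging or RabbitMQ APIs, and the `redeclare` flag shows the general idea of recreating the reply queue rather than the fix that was actually shipped.

```python
import uuid


class FakeBroker:
    """Hypothetical in-memory stand-in for a RabbitMQ cluster."""

    def __init__(self):
        # queue name -> {"durable": bool, "messages": list}
        self.queues = {}

    def declare(self, name, durable=False):
        # Declaring an existing queue is a no-op, as with an idempotent declare.
        self.queues.setdefault(name, {"durable": durable, "messages": []})

    def publish(self, name, message):
        if name not in self.queues:
            raise LookupError("queue not found: %s" % name)
        self.queues[name]["messages"].append(message)

    def restart(self):
        # Restart of the last node: only durable queues survive,
        # and their messages are dropped.
        self.queues = {
            name: {"durable": True, "messages": []}
            for name, queue in self.queues.items()
            if queue["durable"]
        }


class RpcClient:
    """Hypothetical RPC client that keeps one reply queue per process."""

    def __init__(self, broker, redeclare=False):
        self.broker = broker
        self.redeclare = redeclare
        self.reply_queue = "reply_" + uuid.uuid4().hex
        broker.declare(self.reply_queue)  # non-durable reply queue

    def call(self, request):
        if self.redeclare:
            # Sketch of the remedy: make sure the reply queue exists
            # before every call.
            self.broker.declare(self.reply_queue)
        try:
            # Stand-in for the server publishing its reply.
            self.broker.publish(self.reply_queue, "reply to " + request)
        except LookupError:
            # The reply can never arrive, so the caller times out.
            raise TimeoutError("no reply for " + request)
        return self.broker.queues[self.reply_queue]["messages"].pop()
```

With `redeclare=False`, a call made after `broker.restart()` raises `TimeoutError` for the lifetime of the client object, which mirrors the need to restart the service. With `redeclare=True`, the same call succeeds because the queue is declared again first.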
Note: description of the original error in nova-conductor has been put below.
Initial description by Artem Panchenko
=======
Fuel version info (6.1 build #521 RC1): http://
After the primary controller is shut down, OSTF tests that create Nova instances fail, because all newly booted instances end up in the ERROR state:
http://
Here is a part of nova-conductor.log (node-16):
http://
RabbitMQ cluster status looks good:
[root@fuel-
DEPRECATION WARNING: /etc/fuel/
node-16.
Cluster status of node 'rabbit@node-16' ...
[{nodes,
{running_
{cluster_
{partitions,[]}]
...done.
node-7.mirantis.com
Cluster status of node 'rabbit@node-7' ...
[{nodes,
{running_
{cluster_
{partitions,[]}]
...done.
Here is AMQP queues info:
http://
Steps to reproduce:
1. Create environment: Ubuntu, NeutronGRE, Ceph, Sahara, Ceilometer
2. Add 1 controller, 2 controller+ceph, 1 compute and 3 mongo nodes
3. Deploy changes.
4. Run OSTF
5. Shutdown primary controller (gracefully using `poweroff` command)
6. Run OSTF
Expected result:
- all tests passed except 'Check that required services are running'
Actual:
- all tests which create Nova instances fail
Also, I couldn't find out why, but all API requests to Nova take a long time. For example, a simple `nova list` command takes 17 seconds:
http://
Diagnostic snapshot (environment ID - 2, nodes: 5,16,7,6,11,13,14): https:/
Changed in mos: | |
assignee: | nobody → MOS Oslo (mos-oslo) |
milestone: | none → 6.1 |
importance: | Undecided → High |
no longer affects: | fuel |
summary: |
- Nova can't boot instances after primary controller graceful shutdown - 'MessagingTimeout: Timed out waiting for a reply to message ID xxx'
+ RPC clients do not recreate a reply queue after restart of the last RabbitMQ server in the cluster |
description: | updated |
Changed in mos: | |
status: | New → Confirmed |
description: | updated |
tags: | added: 6.1rc2 |
Changed in mos: | |
assignee: | MOS Oslo (mos-oslo) → Victor Sergeyev (vsergeyev) |
description: | updated |
Changed in mos: | |
milestone: | 6.1 → 6.1-updates |
Changed in mos: | |
status: | Confirmed → In Progress |
Changed in mos: | |
status: | In Progress → Fix Committed |
tags: | added: customer-found |
tags: | added: 6.1-mu-1 |
Changed in mos: | |
milestone: | 6.1-updates → 6.1-mu-1 |
tags: | added: support |
tags: | added: on-verification |
This issue can be resolved using the 'amqp_durable_queues' option, which makes RabbitMQ keep queues on the server. After a controller restart the messages are gone, but the queue persists under the same name. The option can be set to True in the Nova config without modifying oslo.messaging.
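As a rough illustration of that change, the sketch below sets `amqp_durable_queues = True` in a config text using Python's standard `configparser`. The sample file contents, the `enable_durable_queues` helper, and the choice of the `[DEFAULT]` section are assumptions made for this example; the section that holds the option can differ between releases, so check the deployed version before applying anything like this.

```python
import configparser
import io

# Hypothetical minimal nova.conf contents; a real file has many more options.
SAMPLE_NOVA_CONF = """\
[DEFAULT]
rpc_backend = rabbit
"""


def enable_durable_queues(conf_text, section="DEFAULT"):
    """Return conf_text with amqp_durable_queues set to True in `section`.

    The default section name is an assumption for illustration.
    """
    # Interpolation is disabled so that '%' characters in values are kept as-is.
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(conf_text)
    if section != "DEFAULT" and not parser.has_section(section):
        parser.add_section(section)
    parser[section]["amqp_durable_queues"] = "True"
    out = io.StringIO()
    parser.write(out)
    return out.getvalue()


if __name__ == "__main__":
    print(enable_durable_queues(SAMPLE_NOVA_CONF))
```

Note that `configparser` drops comments when it rewrites a file, so for a real deployment editing the option by hand or through the deployment tooling is likely the safer route. The service has to be restarted to pick up the new value.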