After restarting the 'rabbitmq-server' service on all controllers, previously generated messages were lost

Bug #1561894 reported by Alexander Koryagin
Affects: Mirantis OpenStack (status tracked in 10.0.x)
10.0.x: Status Invalid, Importance High, Assigned to Dmitry Mescheryakov

Bug Description

Hello,
Please take a look at the following issue:
After restarting the 'rabbitmq-server' service on all controllers, messages generated with 'oslo.messaging-check-tool' were lost.

   Note: I am not restarting the service on ALL controllers at ONCE.
   My actions are:
   - Restart the rabbit service on one controller;
   - Wait until it is up and running;
   - Wait until this controller is present in the cluster (a one-liner sketch of this wait follows the list);
   - Perform the same actions on the next controller.
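   The wait can be done with a one-liner like the following (a sketch, not from the original run; it assumes the cluster_status output format shown in step 2, and the 5-second poll interval is arbitrary):
    # while ! rabbitmqctl cluster_status | grep -A1 'running_nodes' | grep -q `hostname`; do sleep 5; done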

My env is MOS 9.0 (ISO: fuel-9.0-94-2016-03-21_14-00-00.iso)
with 3x controllers and 1x compute-cinder node.

Actions performed from controller(s):
1) OK - Install oslo.messaging-check-tool on the first controller:
    # apt-get update
    # apt-get install git python-pip python-dev -y
    # cd /root/
    # git clone https://github.com/dmitrymex/oslo.messaging-check-tool.git
    # cd /root/oslo.messaging-check-tool/
    # pip install -r requirements.txt -r test-requirements.txt
    # dpkg -i oslo.messaging-check-tool_1.0-1~u14.04+mos1_all.deb
    # apt-get -f install -y

2) OK - Get nodes inside RabbitMQ cluster:
# rabbitmqctl cluster_status | grep -A1 'running_nodes'
    {running_nodes,['rabbit@messaging-node-1','rabbit@messaging-node-4',
                 'rabbit@messaging-node-2']},

3) OK - Get IPs of nodes:
    # getent hosts node-1 --> 10.109.1.4 node-1.test.domain.local node-1
    # getent hosts node-2 --> 10.109.1.7 node-2.test.domain.local node-2
    # getent hosts node-4 --> 10.109.1.6 node-4.test.domain.local node-4

4) OK - Fill oslo config file:
# cat /root/oslo.messaging-check-tool/oslo_msg_check.conf
    [DEFAULT]
    debug=true
    [oslo_messaging_rabbit]
    rabbit_hosts = 10.109.1.4:5673, 10.109.1.6:5673, 10.109.1.7:5673
    rabbit_userid = nova
    rabbit_password = Ajl9OxOMW2mgB7fZRZbH6aPu
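
    An optional sanity check before running the tool (not part of the original steps; assumes netcat is installed) to confirm each listed host/port is reachable:
    # for ip in 10.109.1.4 10.109.1.6 10.109.1.7; do nc -zv -w3 $ip 5673; done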

5) OK - Generate and consume 10000 messages without rabbitmq-server restart:
    # cd /root/oslo.messaging-check-tool/
    # oslo_msg_load_generator --config-file oslo_msg_check.conf --messages-to-send 10000 --nodebug
    # oslo_msg_load_consumer --config-file oslo_msg_check.conf --nodebug
        >>> OK - Consumed 10000 messages

6) OK - Generate and consume 10000 messages WITH rabbitmq-server restart on ONE controller:
    # cd /root/oslo.messaging-check-tool/
    # oslo_msg_load_generator --config-file oslo_msg_check.conf --messages-to-send 10000 --nodebug
    # service rabbitmq-server restart && sleep 10
    # rabbitmqctl cluster_status | grep -A1 'running_nodes' #\\ check that all 3 nodes are present
    # oslo_msg_load_consumer --config-file oslo_msg_check.conf --nodebug
        >>> OK - Consumed 10000 messages

7) NOK - Generate and consume 10000 messages WITH rabbitmq-server restart on ALL controllers:
On first controller:
    # cd /root/oslo.messaging-check-tool/
    # oslo_msg_load_generator --config-file oslo_msg_check.conf --messages-to-send 10000 --nodebug

Perform the restart on each controller one-by-one and check the cluster status after it:
    # rabbitmqctl list_queues slave_pids name | grep `hostname` | wc -l #\\ Remember the number before restart. (For me: 20)
    # service rabbitmq-server restart && sleep 10
    # rabbitmqctl cluster_status | grep -A1 'running_nodes' #\\ check that all 3 nodes are present
    # rabbitmqctl list_queues slave_pids name | grep `hostname` | wc -l #\\ Check that the number is the same after restart. (For me: 20)
    After that, perform the same commands on the next controller.
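
    The restart-and-verify sequence can be wrapped into one snippet per controller (a sketch of the same commands above; the before/after variable names are mine):
    # before=$(rabbitmqctl list_queues slave_pids name | grep `hostname` | wc -l)
    # service rabbitmq-server restart && sleep 10
    # rabbitmqctl cluster_status | grep -A1 'running_nodes' #\\ all 3 nodes must be present
    # after=$(rabbitmqctl list_queues slave_pids name | grep `hostname` | wc -l)
    # [ "$before" = "$after" ] && echo "OK: $after" || echo "MISMATCH: $before -> $after"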

On first controller:
    # oslo_msg_load_consumer --config-file oslo_msg_check.conf --nodebug
        >>> NOK - Consumed 0 messages #\\ Should be 10000

Re-try on first controller after 5 minutes:
    # oslo_msg_load_consumer --config-file oslo_msg_check.conf --nodebug
        >>> NOK - Consumed 0 messages #\\ Should be 10000

root@node-2:~# rabbitmqctl list_policies
Listing policies ...
/ ha-notif all ^(event|metering|notifications)\\. {"ha-mode":"all","ha-sync-mode":"automatic"} 0
/ heat_rpc_expire all ^heat-engine-listener\\. {"expires":3600000} 1
/ tasks_expire all ^tasks\\. {"expires":3600000} 1
/ results_expire all ^results\\. {"expires":3600000} 1

Revision history for this message
Dina Belova (dbelova) wrote :

Alexander, this is by design and expected behaviour. MOS does not support dumping of queued messages if ALL RabbitMQ services are restarted at once. If services are restarted one by one and there is enough time for RabbitMQ to sync, everything will be OK.
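(For reference, one way to verify that mirrors have re-synced after each restart is to list the synchronised slaves per queue; a sketch using standard rabbitmqctl queue info items:)
    # rabbitmqctl list_queues name slave_pids synchronised_slave_pids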

Changed in mos:
status: New → Invalid
Revision history for this message
Alexander Koryagin (akoryagin) wrote :

Hello Dina,
Actually, I am not restarting the service on ALL controllers at once.
Sorry, this was not clear from my description.

My actions are:
- Restart the rabbit service on one controller;
- Wait until it is up and running;
- Wait until this controller is present in the cluster;
- Perform the same actions on the next controller.

Changed in mos:
status: Invalid → New
Revision history for this message
Alexander Koryagin (akoryagin) wrote :
description: updated
Revision history for this message
Dina Belova (dbelova) wrote :

Alexander, thanks for the clarification. Oslo team, please take a look.

Changed in mos:
status: New → Confirmed
importance: Undecided → High
assignee: nobody → MOS Oslo (mos-oslo)
milestone: none → 9.0
Revision history for this message
Bug Checker Bot (bug-checker) wrote : Autochecker

(This check is performed automatically)
Please make sure that the bug description contains the following sections, filled in with the appropriate data related to the bug you are describing:

actual result

expected result

steps to reproduce

For more detailed information on the contents of each of the listed sections see https://wiki.openstack.org/wiki/Fuel/How_to_contribute#Here_is_how_you_file_a_bug

tags: added: need-info
Revision history for this message
Alexander Koryagin (akoryagin) wrote :

Hello,
I have also found that this bug can be reproduced in the following way:
restart 2 controllers out of 3.

Actions:
1) Configuration with 3 controllers and 1 compute node;
2) On one of the controllers install "oslo.messaging-check-tool";
3) In the tool's configuration file add the IPs of all 3 controllers;
4) Generate 10000 messages with the help of the tool;
5) Kill the rabbit server on the two other controllers only (do not touch the first controller):
    # kill -9 $(rabbitmqctl status | grep '{pid' | tr -dc '0-9')
6) Wait and check that rabbit is OK and synchronized on all nodes:
    # rabbitmqctl cluster_status | grep -A1 "running_nodes"
7) NOK -- From the first controller, try to consume the 10000 messages back.
0 messages were consumed.
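
(To check which broker actually hosts the tool's queues, the owning node can be read from each queue's pid; a sketch using standard rabbitmqctl queue info items, to be run between steps 4 and 5:)
    # rabbitmqctl list_queues name pid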

Revision history for this message
Alexey Lebedeff (alebedev-a) wrote :

According to the 'list_policies' output, HA is enabled only for ceilometer queues, and that is the way it should be. For RPC, the loss of a RabbitMQ broker will result in the loss of all queue contents located on that broker. A queue is created on the node where some client issued 'queue.declare' for the first time, so it's evident that one of the brokers you killed contained those 10000 messages.

Actually, even in the pause_minority mode of clustering and with HA enabled, there are no guarantees against data loss (see https://aphyr.com/posts/315-jepsen-rabbitmq).
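
(For reference, if one wanted all queues mirrored, which is not the MOS default as explained above, a blanket HA policy could be set; a sketch, with the policy name being hypothetical:)
    # rabbitmqctl set_policy ha-everything ".*" '{"ha-mode":"all","ha-sync-mode":"automatic"}'
Even then, per the Jepsen post above, mirroring does not guarantee zero message loss.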

tags: added: swarm-blocker
Revision history for this message
Dmitry Mescheryakov (dmitrymex) wrote :

Alexey, to your point: the oslo_msg_load_generator utility generates notifications, not RPC messages, so they should be safe. There is no snapshot attached, but I strongly suspect that the restart of RabbitMQ on one of the controllers caused a restart of all the other RabbitMQs due to https://bugs.launchpad.net/fuel/+bug/1559136

Once the referenced bug is fixed, I will ask the QA team to retest this bug.

Changed in mos:
assignee: MOS Oslo (mos-oslo) → Dmitry Mescheryakov (dmitrymex)
Changed in mos:
status: Confirmed → In Progress
Revision history for this message
Dina Belova (dbelova) wrote :

Moving to 9.0-updates after a conversation with Dmitry. The changes are ready, but they require extended reliability testing to ensure the fix actually works. This work is still in progress.

tags: added: move-to-mu
Changed in mos:
milestone: 9.0 → 9.0-updates
tags: added: 10.0-reviewed
Revision history for this message
Sergey Shevorakov (sshevorakov) wrote :

Not a swarm-blocker anymore.
The last run where it was reproduced: https://mirantis.testrail.com/index.php?/plans/view/8318

tags: removed: swarm-blocker
Revision history for this message
Dmitry Mescheryakov (dmitrymex) wrote :

Marking as invalid for 10.0, since RabbitMQ's OCF script was moved to the rabbitmq-server package.

Revision history for this message
Dmitry Mescheryakov (dmitrymex) wrote :
Changed in mos:
status: In Progress → Fix Committed
Revision history for this message
Alexey Galkin (agalkin) wrote :

Verified on 9.1 snapshot #233.

Revision history for this message
Alexey Galkin (agalkin) wrote :
Changed in mos:
status: Fix Committed → Fix Released