After restarting the 'rabbitmq-server' service on all controllers, previously generated messages were lost

Bug #1561894 reported by Alexander Koryagin
Affects: Mirantis OpenStack (status tracked in 10.0.x)
10.0.x: Status Invalid, Importance High, Assigned to Dmitry Mescheryakov

Bug Description

Hello,
Please take a look at the following issue:
After restarting the 'rabbitmq-server' service on all controllers, messages generated with 'oslo.messaging-check-tool' were lost.

   Note: I am not restarting the service on ALL controllers at ONCE.
   My actions are:
   - Restart the rabbit service on one controller;
   - Wait until it is up and running;
   - Wait until this controller is present in the cluster (a one-liner sketch of this wait follows the list);
   - Perform the same actions on the next controller.
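   The wait can be done with a one-liner like the following (a sketch, not from the original run; it assumes the cluster_status output format shown in step 2, and the 5-second poll interval is arbitrary):
    # while ! rabbitmqctl cluster_status | grep -A1 'running_nodes' | grep -q `hostname`; do sleep 5; done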

My env is MOS 9.0 (ISO: fuel-9.0-94-2016-03-21_14-00-00.iso)
with 3x controllers and 1x compute-cinder node.

Actions performed from controller(s):
1) OK - Install oslo.messaging-check-tool on the first controller:
    # apt-get update
    # apt-get install git python-pip python-dev -y
    # cd /root/
    # git clone https://github.com/dmitrymex/oslo.messaging-check-tool.git
    # cd /root/oslo.messaging-check-tool/
    # pip install -r requirements.txt -r test-requirements.txt
    # dpkg -i oslo.messaging-check-tool_1.0-1~u14.04+mos1_all.deb
    # apt-get -f install -y

2) OK - Get nodes inside RabbitMQ cluster:
# rabbitmqctl cluster_status | grep -A1 'running_nodes'
    {running_nodes,['rabbit@messaging-node-1','rabbit@messaging-node-4',
                 'rabbit@messaging-node-2']},

3) OK - Get IPs of nodes:
    # getent hosts node-1 --> 10.109.1.4 node-1.test.domain.local node-1
    # getent hosts node-2 --> 10.109.1.7 node-2.test.domain.local node-2
    # getent hosts node-4 --> 10.109.1.6 node-4.test.domain.local node-4

4) OK - Fill oslo config file:
# cat /root/oslo.messaging-check-tool/oslo_msg_check.conf
    [DEFAULT]
    debug=true
    [oslo_messaging_rabbit]
    rabbit_hosts = 10.109.1.4:5673, 10.109.1.6:5673, 10.109.1.7:5673
    rabbit_userid = nova
    rabbit_password = Ajl9OxOMW2mgB7fZRZbH6aPu
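
    An optional sanity check before running the tool (not part of the original steps; assumes netcat is installed) to confirm each listed host/port is reachable:
    # for ip in 10.109.1.4 10.109.1.6 10.109.1.7; do nc -zv -w3 $ip 5673; done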

5) OK - Generate and consume 10000 messages without rabbitmq-server restart:
    # cd /root/oslo.messaging-check-tool/
    # oslo_msg_load_generator --config-file oslo_msg_check.conf --messages-to-send 10000 --nodebug
    # oslo_msg_load_consumer --config-file oslo_msg_check.conf --nodebug
        >>> OK - Consumed 10000 messages

6) OK - Generate and consume 10000 messages WITH rabbitmq-server restart on ONE controller:
    # cd /root/oslo.messaging-check-tool/
    # oslo_msg_load_generator --config-file oslo_msg_check.conf --messages-to-send 10000 --nodebug
    # service rabbitmq-server restart && sleep 10
    # rabbitmqctl cluster_status | grep -A1 'running_nodes' #\\ check that all 3 nodes are present
    # oslo_msg_load_consumer --config-file oslo_msg_check.conf --nodebug
        >>> OK - Consumed 10000 messages

7) NOK - Generate and consume 10000 messages WITH rabbitmq-server restart on ALL controllers:
On first controller:
    # cd /root/oslo.messaging-check-tool/
    # oslo_msg_load_generator --config-file oslo_msg_check.conf --messages-to-send 10000 --nodebug

Perform the restart on each controller one-by-one and check the cluster status after it:
    # rabbitmqctl list_queues slave_pids name | grep `hostname` | wc -l #\\ Remember the number before restart. (For me: 20)
    # service rabbitmq-server restart && sleep 10
    # rabbitmqctl cluster_status | grep -A1 'running_nodes' #\\ check that all 3 nodes are present
    # rabbitmqctl list_queues slave_pids name | grep `hostname` | wc -l #\\ Check that the number is the same after restart. (For me: 20)
    After that, perform the same commands on the next controller.
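
    The restart-and-verify sequence can be wrapped into one snippet per controller (a sketch of the same commands above; the before/after variable names are mine):
    # before=$(rabbitmqctl list_queues slave_pids name | grep `hostname` | wc -l)
    # service rabbitmq-server restart && sleep 10
    # rabbitmqctl cluster_status | grep -A1 'running_nodes' #\\ all 3 nodes must be present
    # after=$(rabbitmqctl list_queues slave_pids name | grep `hostname` | wc -l)
    # [ "$before" = "$after" ] && echo "OK: $after" || echo "MISMATCH: $before -> $after"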

On first controller:
    # oslo_msg_load_consumer --config-file oslo_msg_check.conf --nodebug
        >>> NOK - Consumed 0 messages #\\ Should be 10000

Re-try on first controller after 5 minutes:
    # oslo_msg_load_consumer --config-file oslo_msg_check.conf --nodebug
        >>> NOK - Consumed 0 messages #\\ Should be 10000

root@node-2:~# rabbitmqctl list_policies
Listing policies ...
/ ha-notif all ^(event|metering|notifications)\\. {"ha-mode":"all","ha-sync-mode":"automatic"} 0
/ heat_rpc_expire all ^heat-engine-listener\\. {"expires":3600000} 1
/ tasks_expire all ^tasks\\. {"expires":3600000} 1
/ results_expire all ^results\\. {"expires":3600000} 1

Revision history for this message
Dina Belova (dbelova) wrote :

Alexander, this is by design and expected behaviour. MOS does not support dumping of queued messages if ALL RabbitMQ services are restarted at once. If services are restarted one by one and there is enough time for RabbitMQ to sync, everything will be OK.
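(For reference, one way to verify that mirrors have re-synced after each restart is to list the synchronised slaves per queue; a sketch using standard rabbitmqctl queue info items:)
    # rabbitmqctl list_queues name slave_pids synchronised_slave_pids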

Changed in mos:
status: New → Invalid
Revision history for this message
Alexander Koryagin (akoryagin) wrote :

Hello Dina,
Actually, I am not restarting the service on ALL controllers at once.
Sorry, this was not clear from my description.

My actions are:
- Restart the rabbit service on one controller;
- Wait until it is up and running;
- Wait until this controller is present in the cluster;
- Perform the same actions on the next controller.

Changed in mos:
status: Invalid → New
Revision history for this message
Alexander Koryagin (akoryagin) wrote :
description: updated
Revision history for this message
Dina Belova (dbelova) wrote :

Alexander, thanks for the clarification. Oslo team, please take a look.

Changed in mos:
status: New → Confirmed
importance: Undecided → High
assignee: nobody → MOS Oslo (mos-oslo)
milestone: none → 9.0
Revision history for this message
Bug Checker Bot (bug-checker) wrote : Autochecker

(This check is performed automatically)
Please make sure that the bug description contains the following sections, filled in with the appropriate data related to the bug you are describing:

actual result

expected result

steps to reproduce

For more detailed information on the contents of each of the listed sections see https://wiki.openstack.org/wiki/Fuel/How_to_contribute#Here_is_how_you_file_a_bug

tags: added: need-info
Revision history for this message
Alexander Koryagin (akoryagin) wrote :

Hello,
I have also found that this bug can be reproduced in the following way:
restart 2 controllers out of 3.

Actions:
1) Configuration with 3 controllers and 1 compute node;
2) On one of the controllers install "oslo.messaging-check-tool";
3) In the tool's configuration file add the IPs of all 3 controllers;
4) Generate 10000 messages with the help of the tool;
5) Kill the rabbit server on the two other controllers only (do not touch the first controller):
    # kill -9 $(rabbitmqctl status | grep '{pid' | tr -dc '0-9')
6) Wait and check that rabbit is OK and synchronized on all nodes:
    # rabbitmqctl cluster_status | grep -A1 "running_nodes"
7) NOK -- From the first controller, try to consume the 10000 messages back.
0 messages were consumed.
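
(To check which broker actually hosts the tool's queues, the owning node can be read from each queue's pid; a sketch using standard rabbitmqctl queue info items, to be run between steps 4 and 5:)
    # rabbitmqctl list_queues name pid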

Revision history for this message
Alexey Lebedeff (alebedev-a) wrote :

According to the 'list_policies' output, HA is enabled only for ceilometer queues, and that is the way it should be. For RPC, the loss of a RabbitMQ broker will result in the loss of all queue contents located on that broker. A queue is created on the node where some client issued 'queue.declare' for the first time, so it's evident that one of the brokers you killed contained those 10000 messages.

Actually, even in the pause_minority mode of clustering and with HA enabled, there are no guarantees against data loss (see https://aphyr.com/posts/315-jepsen-rabbitmq).
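
(For reference, if one wanted all queues mirrored, which is not the MOS default as explained above, a blanket HA policy could be set; a sketch, with the policy name being hypothetical:)
    # rabbitmqctl set_policy ha-everything ".*" '{"ha-mode":"all","ha-sync-mode":"automatic"}'
Even then, per the Jepsen post above, mirroring does not guarantee zero message loss.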

tags: added: swarm-blocker
Revision history for this message
Dmitry Mescheryakov (dmitrymex) wrote :

Alexey, to your point: the oslo_msg_load_generator utility generates notifications, not RPC messages, so they should be safe. There is no snapshot attached, but I strongly suspect that the restart of RabbitMQ on one of the controllers caused a restart of all the other RabbitMQs due to https://bugs.launchpad.net/fuel/+bug/1559136

Once the referenced bug is fixed, I will ask the QA team to retest this bug.

Changed in mos:
assignee: MOS Oslo (mos-oslo) → Dmitry Mescheryakov (dmitrymex)
Changed in mos:
status: Confirmed → In Progress
Revision history for this message
Dina Belova (dbelova) wrote :

Moving to 9.0-updates after a conversation with Dmitry. The changes are ready, but they require extended reliability testing to ensure the fix actually works. This work is still in progress.

tags: added: move-to-mu
Changed in mos:
milestone: 9.0 → 9.0-updates
tags: added: 10.0-reviewed
Revision history for this message
Sergey Shevorakov (sshevorakov) wrote :

Not a swarm-blocker anymore.
The last run where it was reproduced: https://mirantis.testrail.com/index.php?/plans/view/8318

tags: removed: swarm-blocker
Revision history for this message
Dmitry Mescheryakov (dmitrymex) wrote :

Marking as invalid for 10.0, since RabbitMQ's OCF script was moved to the rabbitmq-server package.

Revision history for this message
Dmitry Mescheryakov (dmitrymex) wrote :
Changed in mos:
status: In Progress → Fix Committed
Revision history for this message
Alexey Galkin (agalkin) wrote :

Verified on 9.1 snapshot #233.

Revision history for this message
Alexey Galkin (agalkin) wrote :
Changed in mos:
status: Fix Committed → Fix Released