Prefer duplicate messages delivery and reordering to data loss caused by the built-in RabbitMQ partitions recovery

Bug #1495125 reported by Bogdan Dobrelya
Affects                 Status                     Importance   Assigned to                 Milestone
Fuel for OpenStack      Invalid                    Wishlist     Fuel Sustaining
  Mitaka                Won't Fix                  Wishlist     Fuel Library (Deprecated)
Mirantis OpenStack      Status tracked in 10.0.x
  10.0.x                Invalid                    Wishlist     MOS Oslo
  9.x                   Won't Fix                  Wishlist     MOS Oslo

Bug Description

According to the Jepsen testing results [0], RabbitMQ's partition recovery logic normally wipes the minority nodes' state, causing significant data loss: up to 35% of enqueued messages for the given test case.
This issue will persist unless RabbitMQ's autoheal and pause_minority modes recover by taking the union of the messages extant on both sides of the partition, rather than blindly destroying all data on one replica.
This also applies to Fuel's OCF agent, when a RabbitMQ node fails to join the cluster.

It is not yet known which would be preferable for OpenStack applications (the Oslo.messaging library): lost state, or reordering and duplicate delivery of messages. So this is an architecture change request that should be based on research results, rather than a bug. If the latter behaviour fits OpenStack better, the OCF agent in Fuel should detect, enter and recover partitions instead. Also, when a node fails to join the cluster or recovers from a minority partition, its enqueued messages must be taken care of before the mnesia reset (see the sketch after the reference below): "isolate one of the nodes from all clients, drain all of its messages, and enqueue them into a selected primary. Finally, restart that node and it’ll pick up the primary’s state. Repeat the process for node which was isolated, and you’ll have a single authoritative cluster again–albeit with duplicates for each copy of the message on each node."

[0] https://aphyr.com/posts/315-call-me-maybe-rabbitmq
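
For illustration, here is a minimal sketch of the drain-and-requeue step quoted above (Python with pika; the host names, queue name and connection details are hypothetical placeholders, not part of any existing tooling):

    # Drain messages from the isolated node and re-enqueue them on the
    # selected primary before the isolated node's mnesia state is reset.
    import pika

    ISOLATED_NODE = 'rabbit-minority.example'  # node being drained (assumed)
    PRIMARY_NODE = 'rabbit-primary.example'    # selected primary (assumed)
    QUEUE = 'demo_queue'                       # hypothetical queue name

    src = pika.BlockingConnection(pika.ConnectionParameters(host=ISOLATED_NODE))
    dst = pika.BlockingConnection(pika.ConnectionParameters(host=PRIMARY_NODE))
    src_ch, dst_ch = src.channel(), dst.channel()

    while True:
        # Fetch one message from the isolated node without auto-acknowledging it.
        method, properties, body = src_ch.basic_get(QUEUE, auto_ack=False)
        if method is None:
            break  # the isolated node's queue is drained
        # Re-publish on the primary; this is where duplicates may appear,
        # since the primary may already hold its own copy of the message.
        dst_ch.basic_publish(exchange='', routing_key=QUEUE,
                             body=body, properties=properties)
        # Ack on the source only after the copy has been published.
        src_ch.basic_ack(method.delivery_tag)

    src.close()
    dst.close()

The duplicates produced by such a merge are exactly the trade-off this report asks to prefer over data loss.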

summary: - Prefer duplicate messages delivery and reordering to message loss caused
- by the built-in RabbitMQ partitions recovery
+ Prefer duplicate messages delivery and reordering to data loss caused by
+ the built-in RabbitMQ partitions recovery
Changed in fuel:
importance: Undecided → Wishlist
milestone: none → 8.0
tags: added: ha rabbitmq
Changed in fuel:
assignee: nobody → Fuel Library Team (fuel-library)
status: New → Confirmed
Dmitry Pyzhov (dpyzhov)
tags: added: area-library
tags: added: team-bugfix
Dmitry Pyzhov (dpyzhov)
Changed in fuel:
milestone: 8.0 → 9.0
tags: added: feature
Revision history for this message
Bogdan Dobrelya (bogdando) wrote :

An update. Now it *is* known what would be more preferable for OpenStack apps (the Oslo.messaging library) - lost state.
Reordering and duplicate delivery of messages is a major threat to the current state of OpenStack apps relying on Oslo.messaging RPC calls, which are only at-most-once and will seemingly never be at-least-once. More details here: https://blueprints.launchpad.net/oslo.messaging/+spec/at-least-once-guarantee
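
To make that concrete, here is a sketch of the kind of consumer-side deduplication that at-least-once delivery would require before duplicates became safe for RPC. It is not part of oslo.messaging today; the dispatcher class, the msg_id field and the history size are assumptions for illustration only:

    import collections

    class DedupDispatcher:
        """Drop redelivered RPC requests that were already processed."""

        def __init__(self, handler, history=10000):
            self._handler = handler
            self._seen = collections.OrderedDict()  # message id -> cached reply
            self._history = history

        def dispatch(self, msg_id, payload):
            if msg_id in self._seen:
                # Duplicate delivery (e.g. after a partition heal): return the
                # cached reply instead of re-executing a non-idempotent call.
                return self._seen[msg_id]
            reply = self._handler(payload)
            self._seen[msg_id] = reply
            if len(self._seen) > self._history:
                self._seen.popitem(last=False)  # evict the oldest entry
            return reply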

Changed in mos:
status: New → Confirmed
importance: Undecided → Wishlist
assignee: nobody → MOS Oslo (mos-oslo)
milestone: none → 9.0
Revision history for this message
Dmitry Mescheryakov (dmitrymex) wrote :

We will not fix this issue by 9.0 SCF, hence bumping the target milestone.

Changed in mos:
status: Confirmed → Won't Fix
Revision history for this message
Michael Polenchuk (mpolenchuk) wrote :

Please be aware that partition handling has been disabled by
https://review.openstack.org/322269

Revision history for this message
Alexey Lebedeff (alebedev-a) wrote :

Partition handling is not actually disabled; it is just handled entirely by the OCF script. Before that, both autoheal and the OCF script were responsible for resetting RabbitMQ nodes, and now the responsibility lies completely with the OCF script.
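
To illustrate what that shift means in practice: with the daemon-side handling out of the picture, the external agent has to detect partitions itself, for example by inspecting rabbitmqctl cluster_status. A rough sketch (Python here purely for illustration; the real agent is a shell OCF resource agent, and the output shown is the classic Erlang-term format, which varies between RabbitMQ versions):

    import subprocess

    def has_partitions():
        # Ask the local broker for its cluster status and look at the
        # partitions field; a healthy cluster reports "{partitions,[]}".
        out = subprocess.check_output(
            ['rabbitmqctl', 'cluster_status'], universal_newlines=True)
        return '{partitions,[]}' not in out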

no longer affects: fuel/newton
Revision history for this message
Michael Polenchuk (mpolenchuk) wrote :

I meant disabled in terms of the RabbitMQ daemon itself.

Revision history for this message
Bogdan Dobrelya (bogdando) wrote :

Marking as Invalid, as it cannot be fixed downstream in MOS or in Fuel. The upstream implementation cannot tolerate duplicate delivery of messages without disrupting services, so it prefers message loss over duplicates. See Mehdi's comments here for details: https://review.openstack.org/#/c/229186/

Changed in fuel:
status: Confirmed → Invalid