Prefer duplicate messages delivery and reordering to data loss caused by the built-in RabbitMQ partitions recovery
| Affects | Status | Importance | Assigned to | Milestone |
|---|---|---|---|---|
| Fuel for OpenStack | Invalid | Wishlist | Fuel Sustaining | |
| Mitaka | Won't Fix | Wishlist | Fuel Library (Deprecated) | |
| Mirantis OpenStack | Status tracked in 10.0.x | | | |
| 10.0.x | Invalid | Wishlist | MOS Oslo | |
| 9.x | Won't Fix | Wishlist | MOS Oslo | |
Bug Description
According to Jepsen testing results [0], RabbitMQ's built-in partition recovery logic normally wipes the minority nodes' state, causing significant data loss: up to 35% of enqueued messages in the given test case.
This issue will remain relevant unless RabbitMQ's autoheal and pause_minority modes recover by taking the union of the messages extant on both sides of the partition, rather than blindly destroying all data on one replica.
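For reference, the built-in recovery behavior discussed here is selected via RabbitMQ's cluster_partition_handling setting; a classic-format rabbitmq.config fragment (the chosen value below is illustrative) looks like:

```erlang
%% rabbitmq.config (classic Erlang-term format)
[
  {rabbit, [
    %% one of: ignore | pause_minority | autoheal
    {cluster_partition_handling, pause_minority}
  ]}
].
```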
This also applies to Fuel's OCF agent when a RabbitMQ node fails to join the cluster.
It is not yet known which would be preferable for OpenStack applications (the Oslo.messaging library): lost state, or reordering and duplicated delivery of messages. So this is an architecture change request that should be based on research results rather than a bug. If the latter behavior fits OpenStack better, the OCF agent in Fuel should instead detect, enter, and recover partitions. Also, when a node fails to join the cluster or recovers from a minority partition, its enqueued messages must be taken care of before the Mnesia erase: "isolate one of the nodes from all clients, drain all of its messages, and enqueue them into a selected primary. Finally, restart that node and it'll pick up the primary's state. Repeat the process for the node which was isolated, and you'll have a single authoritative cluster again, albeit with duplicates for each copy of the message on each node."
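The union-based recovery the report asks for can be illustrated with a minimal sketch. This is pure Python over hypothetical queue snapshots (the message-id maps and the merge_partitions helper are illustrative, not RabbitMQ internals):

```python
# Sketch: merge the message sets from both sides of a partition instead of
# wiping the minority. Each snapshot maps message-id -> payload; duplicates
# (same id present on both sides) collapse to a single copy, so no message
# is lost, at the cost of possible duplicate delivery downstream.

def merge_partitions(majority, minority):
    """Return the union of two queue snapshots; the majority copy wins ties."""
    merged = dict(minority)   # start from the minority's messages
    merged.update(majority)   # majority's copies take precedence on conflict
    return merged

majority = {1: "msg-a", 2: "msg-b"}
minority = {2: "msg-b", 3: "msg-c"}   # message 3 would be lost by a wipe

merged = merge_partitions(majority, minority)
print(sorted(merged))  # [1, 2, 3] -- message 3 survives the recovery
```

Contrast this with the current behavior, which is equivalent to discarding the minority dict entirely, silently dropping message 3.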
summary: | Prefer duplicate messages delivery and reordering to message loss caused by the built-in RabbitMQ partitions recovery → Prefer duplicate messages delivery and reordering to data loss caused by the built-in RabbitMQ partitions recovery |
Changed in fuel: | |
importance: | Undecided → Wishlist |
milestone: | none → 8.0 |
tags: | added: ha rabbitmq |
Changed in fuel: | |
assignee: | nobody → Fuel Library Team (fuel-library) |
status: | New → Confirmed |
tags: | added: area-library |
tags: | added: team-bugfix |
Changed in fuel: | |
milestone: | 8.0 → 9.0 |
tags: | added: feature |
no longer affects: | fuel/newton |
An update. Now it *is* known what would be preferable for OpenStack apps (the Oslo.messaging library): lost state. Reordering and duplicated delivery of messages is a major threat to the current state of OpenStack apps relying on Oslo.messaging RPC calls, which are only at-most-once and seem unlikely to ever become at-least-once. More details here: https://blueprints.launchpad.net/oslo.messaging/+spec/at-least-once-guarantee
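For completeness: if duplicated delivery were ever accepted, consumers would need idempotent handling. A minimal consumer-side deduplication sketch (the Deduplicator class and msg_id field are illustrative, not Oslo.messaging API; a real service would also need expiry and persistence of the seen-set):

```python
# Sketch: drop re-delivered duplicates by remembering message ids already
# processed, making an at-least-once stream safe for at-most-once consumers.

class Deduplicator:
    def __init__(self):
        self.seen = set()

    def handle(self, msg_id, payload, process):
        """Process the payload once per msg_id; silently skip duplicates."""
        if msg_id in self.seen:
            return False          # duplicate delivery, skip
        self.seen.add(msg_id)
        process(payload)
        return True

dedup = Deduplicator()
results = []
# the same message delivered twice, e.g. after a partition recovery
for msg_id, payload in [("m1", "x"), ("m2", "y"), ("m1", "x")]:
    dedup.handle(msg_id, payload, results.append)

print(results)  # ['x', 'y'] -- the duplicate of m1 was dropped
```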