Prefer duplicate messages delivery and reordering to data loss caused by the built-in RabbitMQ partitions recovery

Bug #1495125 reported by Bogdan Dobrelya
Affects                 Status                     Importance   Assigned to                 Milestone
Fuel for OpenStack      Invalid                    Wishlist     Fuel Sustaining
  Mitaka                Won't Fix                  Wishlist     Fuel Library (Deprecated)
Mirantis OpenStack      Status tracked in 10.0.x
  10.0.x                Invalid                    Wishlist     MOS Oslo
  9.x                   Won't Fix                  Wishlist     MOS Oslo

Bug Description

According to the Jepsen testing results [0], RabbitMQ's partition recovery logic normally wipes the minority nodes' state, causing significant data loss: up to 35% of enqueued messages for the given test case.
This issue will persist unless RabbitMQ's autoheal and pause_minority modes recover by taking the union of the messages extant on both sides of the partition, rather than blindly destroying all data on one replica.
This also applies to Fuel's OCF agent, when a RabbitMQ node fails to join the cluster.

It is not yet known which would be preferable for OpenStack applications (the Oslo.messaging library): lost state, or reordering and duplicate delivery of messages. So this is an architecture change request that should be based on research results, rather than a bug. If the latter behaviour fits OpenStack better, the OCF agent in Fuel should detect, enter and recover partitions instead. Also, when a node fails to join the cluster or recovers from a minority partition, its enqueued messages must be taken care of before the mnesia reset (see the sketch after the reference below): "isolate one of the nodes from all clients, drain all of its messages, and enqueue them into a selected primary. Finally, restart that node and it’ll pick up the primary’s state. Repeat the process for node which was isolated, and you’ll have a single authoritative cluster again–albeit with duplicates for each copy of the message on each node."

[0] https://aphyr.com/posts/315-call-me-maybe-rabbitmq
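
For illustration, here is a minimal sketch of the drain-and-requeue step quoted above (Python with pika; the host names, queue name and connection details are hypothetical placeholders, not part of any existing tooling):

    # Drain messages from the isolated node and re-enqueue them on the
    # selected primary before the isolated node's mnesia state is reset.
    import pika

    ISOLATED_NODE = 'rabbit-minority.example'  # node being drained (assumed)
    PRIMARY_NODE = 'rabbit-primary.example'    # selected primary (assumed)
    QUEUE = 'demo_queue'                       # hypothetical queue name

    src = pika.BlockingConnection(pika.ConnectionParameters(host=ISOLATED_NODE))
    dst = pika.BlockingConnection(pika.ConnectionParameters(host=PRIMARY_NODE))
    src_ch, dst_ch = src.channel(), dst.channel()

    while True:
        # Fetch one message from the isolated node without auto-acknowledging it.
        method, properties, body = src_ch.basic_get(QUEUE, auto_ack=False)
        if method is None:
            break  # the isolated node's queue is drained
        # Re-publish on the primary; this is where duplicates may appear,
        # since the primary may already hold its own copy of the message.
        dst_ch.basic_publish(exchange='', routing_key=QUEUE,
                             body=body, properties=properties)
        # Ack on the source only after the copy has been published.
        src_ch.basic_ack(method.delivery_tag)

    src.close()
    dst.close()

The duplicates produced by such a merge are exactly the trade-off this report asks to prefer over data loss.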

summary: - Prefer duplicate messages delivery and reordering to message loss caused
- by the built-in RabbitMQ partitions recovery
+ Prefer duplicate messages delivery and reordering to data loss caused by
+ the built-in RabbitMQ partitions recovery
Changed in fuel:
importance: Undecided → Wishlist
milestone: none → 8.0
tags: added: ha rabbitmq
Changed in fuel:
assignee: nobody → Fuel Library Team (fuel-library)
status: New → Confirmed
Dmitry Pyzhov (dpyzhov)
tags: added: area-library
tags: added: team-bugfix
Dmitry Pyzhov (dpyzhov)
Changed in fuel:
milestone: 8.0 → 9.0
tags: added: feature
Revision history for this message
Bogdan Dobrelya (bogdando) wrote :

An update. Now it *is* known what would be more preferable for OpenStack apps (the Oslo.messaging library) - lost state.
Reordering and duplicate delivery of messages is a major threat to the current state of OpenStack apps relying on Oslo.messaging RPC calls, which are only at-most-once and will seemingly never be at-least-once. More details here: https://blueprints.launchpad.net/oslo.messaging/+spec/at-least-once-guarantee
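
To make that concrete, here is a sketch of the kind of consumer-side deduplication that at-least-once delivery would require before duplicates became safe for RPC. It is not part of oslo.messaging today; the dispatcher class, the msg_id field and the history size are assumptions for illustration only:

    import collections

    class DedupDispatcher:
        """Drop redelivered RPC requests that were already processed."""

        def __init__(self, handler, history=10000):
            self._handler = handler
            self._seen = collections.OrderedDict()  # message id -> cached reply
            self._history = history

        def dispatch(self, msg_id, payload):
            if msg_id in self._seen:
                # Duplicate delivery (e.g. after a partition heal): return the
                # cached reply instead of re-executing a non-idempotent call.
                return self._seen[msg_id]
            reply = self._handler(payload)
            self._seen[msg_id] = reply
            if len(self._seen) > self._history:
                self._seen.popitem(last=False)  # evict the oldest entry
            return reply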

Changed in mos:
status: New → Confirmed
importance: Undecided → Wishlist
assignee: nobody → MOS Oslo (mos-oslo)
milestone: none → 9.0
Revision history for this message
Dmitry Mescheryakov (dmitrymex) wrote :

We will not fix this issue by 9.0 SCF, hence bumping the target milestone.

Changed in mos:
status: Confirmed → Won't Fix
Revision history for this message
Michael Polenchuk (mpolenchuk) wrote :

Please be aware that partition handling has been disabled by
https://review.openstack.org/322269

Revision history for this message
Alexey Lebedeff (alebedev-a) wrote :

Partition handling is not actually disabled; it is just handled entirely by the OCF script. Before that, both autoheal and the OCF script were responsible for resetting RabbitMQ nodes, and now the responsibility lies completely with the OCF script.
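
To illustrate what that shift means in practice: with the daemon-side handling out of the picture, the external agent has to detect partitions itself, for example by inspecting rabbitmqctl cluster_status. A rough sketch (Python here purely for illustration; the real agent is a shell OCF resource agent, and the output shown is the classic Erlang-term format, which varies between RabbitMQ versions):

    import subprocess

    def has_partitions():
        # Ask the local broker for its cluster status and look at the
        # partitions field; a healthy cluster reports "{partitions,[]}".
        out = subprocess.check_output(
            ['rabbitmqctl', 'cluster_status'], universal_newlines=True)
        return '{partitions,[]}' not in out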

no longer affects: fuel/newton
Revision history for this message
Michael Polenchuk (mpolenchuk) wrote :

I meant disabled in terms of the RabbitMQ daemon itself.

Revision history for this message
Bogdan Dobrelya (bogdando) wrote :

Marking as Invalid, as it cannot be fixed downstream in MOS or in Fuel. The upstream implementation cannot tolerate duplicate delivery of messages without disrupting services, so it prefers message loss over duplicates. See Mehdi's comments here for details: https://review.openstack.org/#/c/229186/

Changed in fuel:
status: Confirmed → Invalid