3-way split brain during NIC outage for dedicated replication network

Bug #1390472 reported by Brian Cline
This bug affects 1 person
Affects: OpenStack Object Storage (swift)
Status: Invalid
Importance: Undecided
Assigned to: Brian Cline

Bug Description

We've seen a three-way split-brain scenario crop up for a container database during a replication network NIC outage on one replica holder. Because the replication network is physically separate, that host was still able to accept requests from proxy servers and continued processing them. However, when the NIC link came back up for that host, there somehow ended up being three different copies of that container database. After several weeks, replication has not reconciled any of the different copies.

This is a known issue that has cropped up before with several other major vendors, so I'm documenting it here.

Working on a test that can reliably reproduce this issue.

Brian Cline (briancline)
description: updated
Changed in swift:
assignee: nobody → Brian Cline (briancline)
status: New → Incomplete
Brian Cline (briancline)
summary: - 3-way split brain during rpelication network NIC outage on one object-server
+ 3-way split brain during NIC outage for dedicated replication network
clayg (clay-gerrard) wrote :

Still need more info: examples or representative databases that we can use to understand what operations are being performed and how progress is failing to be made.
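One way to gather that kind of information, sketched under assumptions: since the container databases are SQLite files, copies of the three replicas could be summarized side by side with nothing but the standard library. The helper names `summarize_db` and `diff_summaries` are hypothetical, and the sketch deliberately assumes nothing about the schema; it only counts rows per table. It should be run against copies, not the live files.

```python
import sqlite3


def summarize_db(path):
    """Return {table_name: row_count} for every table in a SQLite file."""
    conn = sqlite3.connect(path)
    try:
        tables = [row[0] for row in conn.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name")]
        # Table names come from sqlite_master itself, so quoting them is enough.
        return {
            table: conn.execute('SELECT COUNT(*) FROM "%s"' % table).fetchone()[0]
            for table in tables
        }
    finally:
        conn.close()


def diff_summaries(summaries):
    """Given {replica_label: summary}, return only the tables whose counts differ."""
    all_tables = set()
    for summary in summaries.values():
        all_tables.update(summary)
    diffs = {}
    for table in sorted(all_tables):
        counts = {label: s.get(table) for label, s in summaries.items()}
        if len(set(counts.values())) > 1:
            diffs[table] = counts
    return diffs
```

Calling `diff_summaries({"node1": summarize_db(p1), "node2": summarize_db(p2), "node3": summarize_db(p3)})` on the three copies would show which tables disagree, which is a starting point for deciding what to attach to the bug.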

Even if we don't understand how it got into this state, it would help spark new ideas if we understood simply what the replicator is doing, if not *getting the databases in-sync*. Even if they're in-sync with the *wrong* information, it seems like they should eventually get themselves to agree on a single version of the truth (even though it may be a lie). If there's a db that can exist where this doesn't work, we should understand why: what information is wrong enough that it cannot be fixed?
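To make the expectation above concrete, here is a toy model, not Swift's replicator: each replica is a dict mapping an object name to a `(timestamp, deleted)` tuple, and a merge keeps whichever row compares greater. The names `merge` and `sync_until_stable` are hypothetical. Under this kind of newest-wins rule, any set of replicas converges to the same contents no matter how they diverged, which is why a set of databases that never converges is surprising.

```python
def merge(local, remote):
    """Merge remote rows into local; the greater (timestamp, deleted) tuple wins.

    Returns True if local was modified.
    """
    changed = False
    for name, row in remote.items():
        if name not in local or row > local[name]:
            local[name] = row
            changed = True
    return changed


def sync_until_stable(replicas, max_passes=10):
    """Push every replica to every other until a full pass changes nothing.

    Returns the number of passes that were needed.
    """
    for passes in range(1, max_passes + 1):
        changed = False
        for src in replicas:
            for dst in replicas:
                if src is not dst:
                    changed |= merge(dst, src)
        if not changed:
            return passes
    raise RuntimeError("replicas did not converge")


# Three divergent copies, loosely mirroring the 3-way split described above.
a = {"x": (1.0, False), "y": (2.0, False)}
b = {"x": (3.0, True)}
c = {"z": (1.5, False), "y": (1.0, False)}
sync_until_stable([a, b, c])
assert a == b == c
```

A real database that stays split for weeks must be violating some assumption that this model takes for granted, and identifying which one is the information this bug needs.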

Side note: as an orthogonal pursuit, additional debug logging lines could be merged to master to help gather information from other people who might be affected by this scenario.

Tim Burke (1-tim-z) wrote :

Closing as we haven't gotten more info in two years. I'd love to re-open if we can repro the problem, though!

Changed in swift:
status: Incomplete → Invalid