3-way split brain during NIC outage for dedicated replication network
Bug #1390472 reported by
Brian Cline
This bug affects 1 person
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
OpenStack Object Storage (swift) | Invalid | Undecided | Brian Cline |
Bug Description
We've seen a three-way split brain scenario crop up for a container database during a replication network NIC outage that occurred on one replica holder. Since the replication network is physically separate, that host was still able to accept requests from proxy servers over its other interface, and continued processing them. However, when the NIC link came back up for that host, there somehow ended up being three different copies of that container database. After several weeks, replication has still not reconciled the differing copies.
This is a known issue that has cropped up before with several other major vendors, so I'm documenting it here.
Working on a test that can reliably reproduce this issue.
description: updated
Changed in swift:
assignee: nobody → Brian Cline (briancline)
status: New → Incomplete
summary:
- 3-way split brain during rpelication network NIC outage on one object-server
+ 3-way split brain during NIC outage for dedicated replication network
Still need more info: examples or representative databases that we can use to understand what operations are being performed and how progress is failing to be made.
Even if we don't understand how it got into this state, it would help spark new ideas if we simply understood what the replicator is doing, if not *getting the databases in-sync*. Even if they're in sync with the *wrong* information, it seems like they should eventually agree on a single version of the truth (even though it may be a lie). If there's a db that can exist where this doesn't work, we should understand why: what information is wrong enough that it cannot be fixed?
Side note: as an orthogonal pursuit, additional debug logging lines could be merged to master to help gather information from other people who might be affected by this scenario.
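As a starting point for collecting the kind of comparison asked for above, here is a minimal sketch that summarizes each copy of a SQLite database and reports which tables disagree. It is not part of Swift and makes no assumptions about the container schema: `summarize_replica` and `differing_tables` are hypothetical helper names, and the script only relies on the database files being SQLite. Some tables may legitimately differ between replicas (per-replica bookkeeping, for example), so a mismatch is a place to start looking rather than a verdict.

```python
import hashlib
import sqlite3


def summarize_replica(conn):
    """Return {table_name: (row_count, digest_of_rows)} for one SQLite database."""
    summary = {}
    # List user tables, skipping SQLite's own internal tables.
    tables = [
        row[0]
        for row in conn.execute(
            "SELECT name FROM sqlite_master "
            "WHERE type = 'table' AND name NOT LIKE 'sqlite_%' ORDER BY name"
        )
    ]
    for table in tables:
        digest = hashlib.sha256()
        # Sort rows in Python so physical row order does not affect the digest.
        rows = sorted(repr(row) for row in conn.execute(f'SELECT * FROM "{table}"'))
        for row in rows:
            digest.update(row.encode("utf-8"))
        summary[table] = (len(rows), digest.hexdigest())
    return summary


def differing_tables(summaries):
    """Given {replica_name: summary}, return sorted table names whose contents disagree."""
    all_tables = set()
    for summary in summaries.values():
        all_tables.update(summary)
    return sorted(
        table
        for table in all_tables
        if len({summary.get(table) for summary in summaries.values()}) > 1
    )
```

To use it, open a copy of each replica's database file with `sqlite3.connect(...)` (working on copies, not the live files), build a dict mapping a replica label to `summarize_replica(conn)`, and pass that dict to `differing_tables`. The row counts in each summary give a quick first hint about which copy is ahead or behind.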