The multi-zeoraid startup problem.
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
gocept.zeoraid | Confirmed | Medium | Unassigned |
Bug Description
The setup: one machine hosts zeoraid1 and zeo1, another machine hosts zeoraid2 and zeo2. Both zeoraids connect to both zeos, and multiple client machines have both zeoraid1 and zeoraid2 configured as their zeo servers.
Imagine a power failure to the rack holding all these machines. When power is restored, all machines start at roughly the same time.
Bad things may happen:
- the zeoraids may come up before the zeos are ready, causing both to fail because neither has a backend available
- worst case: zeoraid1 ends up connected only to zeo2, zeoraid2 ends up connected only to zeo1, and clients are connected to a mix of zeoraid1 and zeoraid2, writing transactions to both. zeo1 and zeo2 then get out of sync in an unrecoverable fashion :-(
The optimal case would be to have zeoraid1 and zeoraid2 up and connected to zeo1 and zeo2, with everything in sync.
An acceptable case would be to have one zeoraid up and both zeos in sync.
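To make the outcomes above concrete, here is a minimal sketch that classifies a startup state from which backends each zeoraid managed to reach. The function name `classify_startup` and the outcome labels are hypothetical and not part of gocept.zeoraid. It only looks at connectivity, not at whether the zeos actually hold the same transactions.

```python
def classify_startup(connections, all_backends):
    """Classify a startup state.

    connections: dict mapping each zeoraid name to the set of zeo
    backends it connected to (an empty set means it found none).
    all_backends: every zeo backend that should be reachable.
    """
    all_backends = frozenset(all_backends)
    # A zeoraid with no backend at all cannot serve clients, so treat it as down.
    up = {name: frozenset(b) for name, b in connections.items() if b}
    if not up:
        return "failed"  # every zeoraid came up before any zeo was ready
    views = set(up.values())
    if len(views) > 1:
        # Running zeoraids write to different backend sets, so the zeos can diverge.
        return "divergent"
    if views == {all_backends}:
        # All running zeoraids see every backend.
        return "optimal" if len(up) == len(connections) else "acceptable"
    # All running zeoraids agree, but at least one backend is missing.
    return "degraded"
```

For example, `classify_startup({"zeoraid1": {"zeo2"}, "zeoraid2": {"zeo1"}}, {"zeo1", "zeo2"})` yields `"divergent"`, which corresponds to the worst case described above.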
Changed in gocept.zeoraid:
- status: New → Confirmed
- importance: Undecided → Medium
A possible solution would be to allow a grace period on startup (e.g. 10-15 seconds) for storages to connect successfully.
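A rough sketch of what such a grace period could look like, assuming each backend can be checked with a simple probe callable. The names `wait_for_backends`, `probes`, `grace_period` and `poll_interval` are hypothetical and do not reflect gocept.zeoraid's actual API. The clock and sleep functions are injectable only so the loop can be tested without real waiting.

```python
import time


def wait_for_backends(probes, grace_period=15.0, poll_interval=0.5,
                      clock=time.monotonic, sleep=time.sleep):
    """Poll backends until all are connected or the grace period expires.

    probes: dict mapping a backend name to a zero-argument callable that
    returns True once that backend is reachable (hypothetical interface).
    Returns the set of backend names that connected in time.
    """
    deadline = clock() + grace_period
    pending = dict(probes)
    connected = set()
    while True:
        # Probe every backend that has not connected yet.
        for name in list(pending):
            if pending[name]():
                connected.add(name)
                del pending[name]
        # Stop as soon as everything is connected or time has run out.
        if not pending or clock() >= deadline:
            return connected
        sleep(poll_interval)
```

A caller could then decide what to do with a partial result, for instance refusing to serve writes unless every backend connected within the grace period. That policy is a design choice left open here.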