writes to zeoraid2 when zeoraid1 is recovering a backend cause recovery to fail
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
gocept.zeoraid | Fix Released | Critical | Christian Theune |
Bug Description
The exception on zeoraid1 was:
2009-11-18T10:45:32 INFO ZEO.zrpc.
lastTransaction() raised exception: RAID is inconsistent and was closed.
Traceback (most recent call last):
  File "/var/buildout-", line 581, in handle_request
    ret = meth(*args)
  File "/var/buildout-", line 229, in lastTransaction
    return self._apply_
  File "/var/buildout-", line 48, in check_open
    return method(self, *args, **kw)
  File "/var/buildout-", line 823, in _apply_all_storages
    "RAID is inconsistent and was closed.")
RuntimeError: RAID is inconsistent and was closed.
Christian said:
This means that a call to lastTransaction was made to the ZEORaid server, which delegated it to all back-end storages that were optimal at that time.
However, the results were inconsistent (meaning the storages had differing views on what the last transaction was), so ZEORaid refuses to do anything else. (Unfortunately, in this situation we can only fail explicitly, not recover automatically.)
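The consistency check Christian describes can be sketched roughly as follows. This is a hypothetical illustration, not the actual gocept.zeoraid code; the names `RAIDInconsistentError` and `last_transaction` are made up:

```python
class RAIDInconsistentError(RuntimeError):
    """Raised when the optimal backends disagree on the last transaction."""


def last_transaction(storages):
    """Ask every optimal backend for its last transaction ID and
    require them all to agree, failing explicitly otherwise.

    ``storages`` maps backend names to objects with a ZODB-style
    ``lastTransaction()`` method.
    """
    tids = {name: storage.lastTransaction()
            for name, storage in storages.items()}
    if len(set(tids.values())) > 1:
        # The backends disagree and we cannot tell which one is right,
        # so fail loudly instead of guessing.
        raise RAIDInconsistentError(
            "RAID is inconsistent and was closed: %r" % tids)
    return next(iter(tids.values()))
```

The key design point is the one Christian states: when the backends diverge, there is no safe way to pick a winner automatically, so the only correct behavior is an explicit failure.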
However, my guess was:
> Looking at the logs, it happened when I was attempting to recover zeo1
> from zeo2, after having stopped zeo1 for a while. There's no real
> indication of what went wrong to end up in an inconsistent state.
> Here's my guess...
>
> I took down zeoraid1 to do the b6 upgrade. zeoraid2 is still up. The
> customer edited some page templates in the unpacked storage (which would
> have gone through zeoraid2). I attempted to do a recover on zeoraid1. My
> guess is that the customer edited something via zeoraid2 while zeoraid1
> was recovering, causing this inconsistency. Sound plausible?
I still stand by this. I'm not sure how zeoraid1 is supposed to know that zeoraid2 has been writing transactions to zeo2 while zeoraid1 is trying to recover zeo1 from zeo2.
In any case, what worked for me was doing the recovery only from the ZEORaid server that is actually serving the clients.
Changed in gocept.zeoraid:
assignee: nobody → Christian Theune (ct-gocept)
importance: Undecided → Critical
milestone: none → 1.0b7
Changed in gocept.zeoraid:
status: Fix Committed → Fix Released
Hmm. Looking at the scenario I find this:
- The recovery itself should not fail. It also sees the new transactions that were written to zeo2 in the meantime - that's what ZEO is for, after all.
- However, the backends definitely will get out of sync immediately if someone writes to zeoraid1, because all of zeoraid2's writes target only zeo2. This will cause zeoraid1 to either degrade zeo1 very quickly or even become inconsistent and shut down.
- Interestingly enough, zeoraid2 will continue to function properly on the remaining zeo2.
I think what we can learn from this is: during recovery you can only run a single ZEORaid server that all new transactions go through. Otherwise, as soon as recovery finishes, the redundant ZEORaid servers won't know about the recovered storage. This needs documentation.
I was also able to see the failed lastTransaction call. I did not expect this, so I need to look into where that actually came from.
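The failure mode above can be illustrated with a toy model: two RAID front-ends share the same backends, zeoraid2 has degraded zeo1 and so writes only to zeo2, and zeoraid1's consistency check then sees the backends disagree. This is a deliberately simplified sketch, not the gocept.zeoraid implementation:

```python
# Each backend's transaction log; both start empty and in sync.
backends = {"zeo1": [], "zeo2": []}


def write_via(front_end_targets, tid):
    """Simulate a front-end committing ``tid`` to the backends it
    currently considers optimal."""
    for name in front_end_targets:
        backends[name].append(tid)


# zeoraid2 has degraded zeo1, so its writes go to zeo2 only.
write_via(["zeo2"], "tx-1")

# zeoraid1 still treats both backends as optimal and compares their
# last transactions, as lastTransaction() does.
last = {name: (log[-1] if log else None) for name, log in backends.items()}
consistent = len(set(last.values())) == 1
print(consistent)  # False: zeo1 and zeo2 now disagree, so zeoraid1 must fail
```

This is exactly why only one ZEORaid server may accept writes during recovery: any write routed around the recovering server makes the backends diverge from its point of view.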