writes to zeoraid2 when zeoraid1 is recovering a backend cause recovery to fail

Bug #484727 reported by ChrisW
Affects: gocept.zeoraid
Status: Fix Released
Importance: Critical
Assigned to: Christian Theune
Milestone: 1.0b7

Bug Description

The exception on zeoraid1 was:

2009-11-18T10:45:32 INFO ZEO.zrpc.Connection(S) (127.0.0.1:51611) lastTransaction() raised exception: RAID is inconsistent and was closed.
Traceback (most recent call last):
  File "/var/buildout-eggs/ZODB3-3.9.3-py2.6-linux-i686.egg/ZEO/zrpc/connection.py", line 581, in handle_request
    ret = meth(*args)
  File "/var/buildout-eggs/gocept.zeoraid-1.0b6-py2.6.egg/gocept/zeoraid/storage.py", line 229, in lastTransaction
    return self._apply_all_storages('lastTransaction')
  File "/var/buildout-eggs/gocept.zeoraid-1.0b6-py2.6.egg/gocept/zeoraid/storage.py", line 48, in check_open
    return method(self, *args, **kw)
  File "/var/buildout-eggs/gocept.zeoraid-1.0b6-py2.6.egg/gocept/zeoraid/storage.py", line 823, in _apply_all_storages
    "RAID is inconsistent and was closed.")
RuntimeError: RAID is inconsistent and was closed.

Christian said:
This means that a call for lastTransaction was made to the ZEORaid server, which delegated it to all back-end storages that were optimal at that time.
However, the results were inconsistent (meaning the storages had differing views on what the last transaction was), and thus ZEORaid refuses to do anything else. (Unfortunately, in this situation we can only fail explicitly, not recover automatically.)
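
As a rough illustration, here is a minimal sketch of the check described above; it is not gocept.zeoraid's actual code, and the function name and structure are illustrative. The call is delegated to every optimal backend, and when the answers differ the only safe option is an explicit failure:

    # Minimal sketch of the consistency check; names are illustrative,
    # not gocept.zeoraid's real internals.
    def last_transaction(backends):
        # `backends` is assumed to be the currently optimal back-end
        # storages (objects implementing the ZODB storage API).
        tids = set(storage.lastTransaction() for storage in backends)
        if len(tids) > 1:
            # The backends disagree about the last committed transaction;
            # the RAID cannot tell which one is right, so it fails explicitly.
            raise RuntimeError('RAID is inconsistent and was closed.')
        return tids.pop()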

However, my guess was:

> Looking at the logs, it happened when I was attempting to recover zeo1
> from zeo2, after having stopped zeo1 for a while. There's no real
> indication of what went wrong to end up in an inconsistent state.
> Here's my guess...
>
> I took down zeoraid1 to do the b6 upgrade. zeoraid2 is still up. The
> customer edited some page templates in the unpacked storage (which would
> have gone through zeoraid2). I attempted to do a recover on zeoraid1. My
> guess is that the customer edited something via zeoraid2 while zeoraid1
> was recovering, causing this inconsistency. Sound plausible?

I still stand by this. I'm not sure how zeoraid1 is supposed to know that zeoraid2 has been writing transactions to zeo2 while zeoraid1 is trying to recover zeo1 from zeo2.

In any case, what worked for me was only doing the recovery from the zeoraid that's actually serving the clients.

Changed in gocept.zeoraid:
assignee: nobody → Christian Theune (ct-gocept)
importance: Undecided → Critical
milestone: none → 1.0b7
Christian Theune (ctheune) wrote:

Hmm. Looking at the scenario I find this:

- The recovery itself should not fail. It also sees the new transactions that were written to zeo2; that's what ZEO is for, after all.

- However, the backends will definitely get out of sync immediately, because all of zeoraid2's writes target only zeo2; as soon as someone writes to zeoraid1, this will cause zeoraid1 to either degrade zeo1 very quickly or even become inconsistent and shut down.

- Interestingly enough, zeoraid2 will continue to function properly on the remaining zeo2.

I think what we can learn from this is: during recovery you can only run a single ZEORaid server, through which all new transactions go, because otherwise, as soon as recovery is finished, the redundant ZEORaid servers won't know about the recovered storage. This needs documentation.
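
For illustration, here is a hedged sketch of a check an operator could run before bringing the redundant ZEORaid servers back up, assuming the backends are plain ZEO servers; the addresses are hypothetical:

    # Verify that all ZEO backends agree on the last committed
    # transaction before restarting redundant ZEORaid servers.
    from ZEO.ClientStorage import ClientStorage

    def backends_in_sync(addresses):
        tids = set()
        for address in addresses:
            storage = ClientStorage(address, read_only=True)
            try:
                tids.add(storage.lastTransaction())
            finally:
                storage.close()
        return len(tids) == 1

    # Hypothetical backend addresses for zeo1 and zeo2.
    print backends_in_sync([('127.0.0.1', 8100), ('127.0.0.1', 8101)])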

I was also able to see the failed lastTransaction call. I did not expect this, so I need to look into where it actually came from.

Changed in gocept.zeoraid:
status: New → In Progress
Christian Theune (ctheune) wrote:

Ok, here's why zeoraid1 shut down: it *did* recover completely. Only after that, during a regular ZEO request, did it shut down. So who talked to it? Of course, it was the manager script trying to get the status, which caused the lastTransaction call as part of the ZEO initialisation protocol.
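
To make that concrete: any ZEO client connection is enough to trigger the call, even a read-only status check. A hedged sketch (the address is hypothetical):

    # Merely opening a ZEO client connection makes the server answer
    # lastTransaction() as part of the connection handshake; with
    # inconsistent backends, that server-side call is what fails in
    # the log above. The address is hypothetical.
    from ZEO.ClientStorage import ClientStorage

    storage = ClientStorage(('127.0.0.1', 8100), read_only=True)
    storage.close()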

Christian Theune (ctheune) wrote:

Documented in 105855, and I'm writing a new issue that describes a feature to avoid this behaviour in future versions of ZEORaid.

Changed in gocept.zeoraid:
status: In Progress → Fix Committed
ChrisW (chris-simplistix) wrote:

OK, so until #485210 is implemented, the short version of how to recover a storage is:

- shut down all zeoraid servers bar one

- recover the storage

- start up the remaining zeoraid servers

Christian Theune (ctheune) wrote:

Yup.

Changed in gocept.zeoraid:
status: Fix Committed → Fix Released