one zeo marked as failed for one storage, inconsistent data seen

Bug #499888 reported by ChrisW
Affects: gocept.zeoraid
Status: New
Importance: Undecided
Assigned to: Unassigned
Milestone: (none)

Bug Description

This may be a duplicate of #485976, but the observed behaviour was different, so reporting again.

So, I got an issue raised by the customer saying that the batch job which clears all content out of the database wasn't working.

I manually ran the batch job with a pdb breakpoint in it and verified that the transaction had committed fine and that there was no content present.
However, in the web app, I still saw all the content.

So, I looked at the zeoraid status.

On zeoraid1, zeo1 was marked as failed: inconsistent oids.
On zeoraid2, zeo2 was marked as failed: inconsistent oids.

So, each zeoraid thought its local zeo had failed but kept on working with the one on the other server :-( :-( :-(

I shut down zeoraid2 and started to recover zeo1 from zeoraid1. I await with bated breath what might happen next.

However, as soon as I shut down zeoraid2, the content in the web app showed once more as correctly cleared.

So, this seems like a real world example of the "split brain" problem, and kinda CRITICAL for me...
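To make the failure mode concrete, here is a minimal sketch of the check that would have flagged this state. It assumes the set of backends each zeoraid still considers healthy has been read off its status output by hand; the function name and the data layout are hypothetical and not part of gocept.zeoraid.

```python
def find_split_brain(views):
    """Return pairs of controllers whose healthy-backend sets do not overlap.

    `views` maps a controller name to the set of backends that controller
    still treats as healthy. This layout is an assumption for illustration,
    not a zeoraid API.
    """
    conflicts = []
    names = sorted(views)
    for i, first in enumerate(names):
        for second in names[i + 1:]:
            # Two controllers that are both still serving, but share no
            # backend, will write to different storages and diverge.
            if views[first] and views[second] and not (views[first] & views[second]):
                conflicts.append((first, second))
    return conflicts


# The state described in this report:
views = {
    "zeoraid1": {"zeo2"},  # zeo1 marked as failed: inconsistent oids
    "zeoraid2": {"zeo1"},  # zeo2 marked as failed: inconsistent oids
}
print(find_split_brain(views))  # [('zeoraid1', 'zeoraid2')]
```

With both backends healthy on both controllers the sets overlap and the function returns an empty list, so it only fires on the disjoint case seen here.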

Hope you can help!

Chris

ChrisW (chris-simplistix) wrote :

Of course, the recovery failed but, worse still, it failed leaving the zeoraid saying "verifying transaction".
I've grepped the zeoraid logs and found no indication of why the recovery failed (I can guess...), and no real indication that it has actually failed, other than that it's been verifying the same transaction for several days now.

I guess I'll recover the storage by copying the .fs file across...
