Random ValueError due to inconsistent ZEO persistent cache code

Bug #143744 reported by ChrisW
2
Affects Status Importance Assigned to Milestone
Zope 2
Fix Released
High
Unassigned

Bug Description

Any server running an affected version of ZEO with a persistent client side cache may, at random intervals, die and start emitting tracebacks as below until the .zec files are removed and the app server is restarted:

Traceback (most recent call last):
  File "/usr/local/zope/2.9.3/lib/python/ZEO/zrpc/client.py", line 506, in notify_client
    self.client.notifyConnected(self.conn)
  File "/usr/local/zope/2.9.3/lib/python/ZEO/ClientStorage.py", line 521, in notifyConnected
    self.verify_cache(stub)
  File "/usr/local/zope/2.9.3/lib/python/ZEO/ClientStorage.py", line 589, in verify_cache
    self.invalidateTransaction(*pair)
  File "/usr/local/zope/2.9.3/lib/python/ZEO/ClientStorage.py", line 1114, in invalidateTransaction
    self._cache.setLastTid(tid)
  File "/usr/local/zope/2.9.3/lib/python/ZEO/cache.py", line 151, in setLastTid
    self.fc.settid(tid)
  File "/usr/local/zope/2.9.3/lib/python/ZEO/cache.py", line 1060, in settid
    raise ValueError("new last tid (%s) must be greater than "
ValueError: new last tid (245115767599186653) must be greater than previous one (245115767599186653)

Okay, so why have I logged this as critical?
Well, because no-one seems to understand this code and it's got pretty wide ranging effects. The quick solution is do remove persistent client caches completely, but this would be a big shame as they can be a huge performance win in certain circumstances.

So, does ANYONE want to stump up and make a call on whether the < should be changed to <= or whether this is actualyl a real bgu caused by some other race condition?

Some Zope 2 related details:
http://mail.zope.org/pipermail/zodb-dev/2006-April/010031.html
http://mail.zope.org/pipermail/zodb-dev/2006-June/010157.html

A standalone ZODB example:
http://www.zope.org/Collectors/Zope/2038

A Zope 3 example:
http://mail.zope.org/pipermail/zope3-dev/2006-July/019697.html

Tags: bug database
Revision history for this message
Christian Theune (ctheune) wrote :

Wow. This *is* a nasty error.

I spent some time trying to make sens out of this. I've tried to trace all possible callers and their interactions starting from settid().

I've ruled out a number of edge cases that might conflict here, but I didn't finally solve it.

I'm pretty sure that the tid passed to settid() must be strictly larger than the one before, but I can't prove it (yet). (My feeling
comes from the fact that most places that cause settid() to be called come from the invalidation code. This shouldn't cause
the same tid to be passed into the settid() twice.)

I do have the feeling that some race condition is around. The only
reason I can see for this is that another ZEO client might be
committing at the very same moment and the server's tpc_finish
gets called. This would trigger an invalidation from the server and calls invalidateTransaction.

I can imagine an unfortunate situation where the client would ask
for the current validations *and* get a new invalidation request
from the server at the same time.

This could be ruled out, if the server can't send any requests
to the client until notifyConnected has finished.

Andreas Jung (ajung)
Changed in zope2:
importance: Critical → High
Revision history for this message
ChrisW (chris-simplistix) wrote : Re: [Bug 143744] Re: Random ValueError due to inconsistent ZEO persistent cache code

Andreas Jung wrote:
> ** Changed in: zope2
> Importance: Critical => High

Why the arbitary dropping from Critical to High?

For the people this affects, which can be anyone, it's pretty critical...

Chris

--
Simplistix - Content Management, Zope & Python Consulting
            - http://www.simplistix.co.uk

Revision history for this message
Andreas Jung (ajung) wrote :

Critical=Blocker=needs to be fixed for the next release= affects everyone
Since we do releases with this bug it can not be blocker.

Revision history for this message
Patrick Gerken (do3cc) wrote :

The way this is handled has been changed on the 5.9.2008 in changeset 90876 by jim to fix some threading bugs. The way it has been changed, it cannot throw this exception any longer. So maybe it can be closed?

Revision history for this message
ChrisW (chris-simplistix) wrote :

Patrick Gerken wrote:
> The way this is handled has been changed on the 5.9.2008 in changeset
> 90876 by jim to fix some threading bugs. The way it has been changed, it
> cannot throw this exception any longer. So maybe it can be closed?

I guess so. I'd like to know when this is going to land in a real ZODB
release first though...

Chris

--
Simplistix - Content Management, Zope & Python Consulting
            - http://www.simplistix.co.uk

Revision history for this message
Andreas Jung (ajung) wrote :

Appearently this issue can be closed since the fix has been committed

Changed in zope2:
status: New → Fix Committed
Revision history for this message
ChrisW (chris-simplistix) wrote :

Is this going to make it into Zope 2.12?

Revision history for this message
Andreas Jung (ajung) wrote :

On 07.05.09 19:54, ChrisW wrote:
> Is this going to make it into Zope 2.12?
>
>
Should be fixed in ZODB 3.9.0

-aj

--
ZOPYX Ltd. & Co. KG - Charlottenstr. 37/1 - 72070 Tübingen - Germany
Web: www.zopyx.com - Email: <email address hidden> - Phone +49 - 7071 - 793376
Registergericht: Amtsgericht Stuttgart, Handelsregister A 381535
Geschäftsführer/Gesellschafter: ZOPYX Limited, Birmingham, UK
------------------------------------------------------------------------
E-Publishing, Python, Zope & Plone development, Consulting

Revision history for this message
ChrisW (chris-simplistix) wrote :

Okay lets close then.

Changed in zope2:
status: Fix Committed → Fix Released
To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.