Random ValueError due to inconsistent ZEO persistent cache code
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
Zope 2 | Fix Released | High | Unassigned |
Bug Description
Any server running an affected version of ZEO with a persistent client side cache may, at random intervals, die and start emitting tracebacks as below until the .zec files are removed and the app server is restarted:
Traceback (most recent call last):
File "/usr/local/
self.
File "/usr/local/
self.
File "/usr/local/
self.
File "/usr/local/
self.
File "/usr/local/
self.
File "/usr/local/
raise ValueError("new last tid (%s) must be greater than "
ValueError: new last tid (24511576759918
Okay, so why have I logged this as critical?
Well, because no-one seems to understand this code and it has pretty wide-ranging effects. The quick solution is to remove persistent client caches completely, but that would be a big shame, as they can be a huge performance win in certain circumstances.
So, does ANYONE want to stump up and make a call on whether the < should be changed to <=, or whether this is actually a real bug caused by some other race condition?
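To make the "< vs <=" question concrete, here is a minimal sketch of the guard that produces the ValueError in the traceback above. The class and method body are illustrative, not a verbatim copy of ZEO's cache code; only the error message is taken from the traceback.

```python
class ClientCache:
    """Minimal sketch of the last-tid bookkeeping in a ZEO client cache."""

    def __init__(self):
        self.tid = None  # last transaction id this cache has seen

    def setLastTid(self, tid):
        # The disputed comparison: the guard requires each new tid to be
        # *strictly* greater than the previous one.  Relaxing it to allow
        # equality (i.e. raising only when tid < self.tid) would tolerate
        # a duplicated invalidation for the same transaction.
        if self.tid is not None and tid <= self.tid:
            raise ValueError("new last tid (%s) must be greater than "
                             "previous one (%s)" % (tid, self.tid))
        self.tid = tid
```

With the strict check, delivering the same tid twice (for example via a duplicated invalidation) raises; with the relaxed check it would be a harmless no-op. Which of the two is correct depends on whether a duplicate tid is ever legitimate, which is exactly the open question.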
Some Zope 2 related details:
http://
http://
A standalone ZODB example:
http://
A Zope 3 example:
http://
Changed in zope2: | |
importance: | Critical → High |
Wow. This *is* a nasty error.
I spent some time trying to make sense of this. I've tried to trace all possible callers and their interactions, starting from setLastTid().
I've ruled out a number of edge cases that might conflict here, but I haven't finally solved it.
I'm pretty sure that the tid passed to setLastTid() must be strictly larger than the previous one, but I can't prove it (yet). (My feeling comes from the fact that most places that cause setLastTid() to be called come from the invalidation code. This shouldn't cause the same tid to be passed into setLastTid() twice.)
I do have the feeling that some race condition is around. The only reason I can see for this is that another ZEO client might be committing at the very same moment that the server's tpc_finish gets called. This would trigger an invalidation from the server and a call to invalidateTransaction().
I can imagine an unfortunate situation where the client asks for the current invalidations *and* gets a new invalidation request from the server at the same time.
This could be ruled out if the server can't send any requests to the client until notifyConnected has finished.
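The mitigation suggested above could be sketched as follows: hold server-pushed invalidations back until the connection-verification step has finished, so the "what is the current last tid" answer and a concurrent invalidation cannot interleave. All class and method names here are illustrative assumptions, not actual ZEO APIs, and the duplicate-dropping behaviour is just one possible interpretation of the "< vs <=" question.

```python
import threading


class Cache:
    """Toy stand-in for the persistent client cache."""
    tid = None

    def setLastTid(self, tid):
        if self.tid is not None and tid <= self.tid:
            raise ValueError("new last tid (%s) must be greater than "
                             "previous one (%s)" % (tid, self.tid))
        self.tid = tid


class Client:
    """Hypothetical client that serializes verification and invalidations."""

    def __init__(self):
        self.cache = Cache()
        self._verified = threading.Event()

    def notifyConnected(self, server_tid):
        # Verification phase: record the server's idea of the last tid
        # before any pushed invalidation is allowed through.
        self.cache.setLastTid(server_tid)
        self._verified.set()

    def invalidateTransaction(self, tid):
        # Block server-pushed invalidations until verification is done,
        # and drop stale/duplicate tids instead of letting them reach
        # the strict check in setLastTid().
        self._verified.wait()
        if self.cache.tid is None or tid > self.cache.tid:
            self.cache.setLastTid(tid)
```

For example, an invalidation carrying the same tid that notifyConnected just stored is silently dropped instead of killing the app server with a ValueError, while genuinely newer tids advance the cache as before.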