Corruption of ZEO Persistent Cache

Bug #143618 reported by Sidnei da Silva
2
Affects Status Importance Assigned to Milestone
ZODB
Fix Released
Medium
Unassigned

Bug Description

After a "accidental shutdown" (power failure), the persistent zeo cache on my blog got corrupted.

Here's the traceback:

File "/home/sidnei/src/zope/28five/lib/python/Zope2/__init__.py",
line 47, in startup
  _startup()
File "/home/sidnei/src/zope/28five/lib/python/Zope2/App/startup.py",
line 56, in startup
  DB = configuration.dbtab.getDatabase('/', is_root=1)
File "/home/sidnei/src/zope/28five/lib/python/DBTab/DBTab.py", line
96, in getDatabase
  db = self._createDatabase(name, is_root)
File "/home/sidnei/src/zope/28five/lib/python/DBTab/DBTab.py", line
113, in _createDatabase
  db = factory.open()
File
"/home/sidnei/src/zope/28five/lib/python/Zope2/Startup/datatypes.py",
line 163, in open
  DB = self.createDB()
File
"/home/sidnei/src/zope/28five/lib/python/Zope2/Startup/datatypes.py",
line 160, in createDB
  return ZODBDatabase.open(self)
File "/home/sidnei/src/zope/28five/lib/python/ZODB/config.py", line
97, in open
  storage = section.storage.open()
File "/home/sidnei/src/zope/28five/lib/python/ZODB/config.py", line
153, in open
  read_only_fallback=self.config.read_only_fallback)
File "/home/sidnei/src/zope/28five/lib/python/ZEO/ClientStorage.py",
line 314, in __init__
  self._cache.open()
File "/home/sidnei/src/zope/28five/lib/python/ZEO/cache.py", line
112, in open self.fc.scan(self.install)
File "/home/sidnei/src/zope/28five/lib/python/ZEO/cache.py", line
835, in scan install(self.f, ent)
File "/home/sidnei/src/zope/28five/lib/python/ZEO/cache.py", line
121, in install
  o = Object.fromFile(f, ent.key, skip_data=True)
File "/home/sidnei/src/zope/28five/lib/python/ZEO/cache.py", line
630, in fromFile
  raise ValueError("corrupted record, oid")
ValueError: corrupted record, oid

Here's the cache files (zeo1-1.zec is the important one apparently):
http://awkly.org/files/zeo-cache.tar.bz2

Tags: bug zope
Revision history for this message
Sidnei da Silva (sidnei) wrote :
Revision history for this message
Andreas Jung (ajung) wrote :

Why is this an error? You can't expect that a file remains consistent while possibly writing to it?!

Revision history for this message
Tim Peters (tim-one) wrote :

Andreas, the idea here is that when ZEO can _detect_ that a persistent cache file is corrupt upon startup, it would be much better for ZEO to throw the cache file away. As is, Sidnei couldn't start Zope, and it wasn't 100% obvious why not. More info in the zodb-dev thread.

Revision history for this message
alzhimer (alzhimer) wrote :

I also think catching and handling such corruptions would be a good thing... see http://www.zope.org/Collectors/Zope/1878

Revision history for this message
Tim Peters (tim-one) wrote :

In a later zodb-dev msg, someone else added:

    I have seen a similar problems bringing up a system
    which has been brought down with the usual

 zeoctl stop
 zopectl start

    mantra. The zope/zeo connection is via a network
    connection.

Revision history for this message
Tim Peters (tim-one) wrote :

It looks like the cache file would be OK if the 214400 bytes
in slice 17883865:18098265 were removed from the file. IOW,
that's the only part of the file that appears to be corrupted.

The record starting at offset 17883865 looks OK at first. The
record claims to span 156151 bytes (which is not insane -- some
of the uncorrupted records span more than a million bytes), to
contain data for oid 0x9727 (which fits very comfortably in the
range of oids in the file), and to represent object state
written by tid 0x03566b3a5ac52acc (2004-07-12 13:14:21.27 --
also reasonable when compared to the other data here). It also
claims the version string is empty (and versions aren't used
anywhere else here either, so that makes sense), and that
MVCC hasn't "kicked in" for this oid (i.e., the cache doesn't
believe any newer data exists for this oid, and that looks
reasonable too).

Then it says the pickled state is 156108 bytes. A well-formed
30-byte pickle to construct an empty OFS.Image.Pdata object
follows, and then a large pickle to construct its __dict__.
The bulk of that pickle is consumed by a large BINSTRING
that would become the value of the `data` attribute. The
corruption must start in the interior of that large BINSTRING,
because pickletools.dis() blows up trying to makes sense of
the senseless 0xc1 "opcode" immediately following the BINSTRING
portion.

Alas, I have to no way to guess _where_ the corruption starts --
the BINSTRING contains binary data that would be meaningless
to me even if it weren't corrupted. It looks like it's raw JPEG
data: it starts with 0xffd8 and "JFIF" appears soon after.
Indeed, when I took the 156058 bytes (that's what the BINSTRING
pickle opcode says the length is) starting at offset 17883946
(the starting offset of the BINSTRING data), and stuffed them
in a file, a JPEG viewer had no problem displaying a picture of
two people.

The rest of the pickle is nevertheless fatally corrupted, and
so is the low-level structure of the ZEO cache file in this area.
There aren't any long strings of NUL bytes, and it doesn't look
like cache.py tried to write another record "in the middle" of
this one, so there's no obvious theory for how or why the
corruption occurred. In particular, because cache.py doesn't
know anything about the internal structure of pickles (they're
just meaningless giant strings to the ZEO cache, read and
written in one gulp), it's hard to imagine a way in which a
cache.py bug could cause corruption in the _interior_ of a
pickle string.

So, in all, best guess is that not all the I/O buffers got
flushed to disk when the power went out, or even that the disk
controller wrote some nonsense bytes while it was running out
of power. It doesn't look plausible that it's a cache.py bug,
but it remains attractive to teach cache.py to throw away a
cache file when it detects corruption.

affects: zope2 → zodb
Revision history for this message
Jim Fulton (jim-zope) wrote :

1. Cache corruption is far less likely now. :)

2. In ZODB 3.10, if there is a problem reading a cache file, the
    cache file is discarded.

Changed in zodb:
status: New → Fix Released
To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.