Intermittent inconsistency in pgtextindex

Bug #920667 reported by Chris Rossi
12
This bug affects 2 people
Affects Status Importance Assigned to Milestone
KARL3
Fix Released
Medium
Shane Hathaway

Bug Description

A while back I was investigating an issue where under some circumstances the catalog would return a docid for a document that was no longer in the system, and therefore couldn't be loaded. Usually this was some non-sensical looking error like "None has no attribute, 'title'".

I wrote a test script, attached, which goes through and checks the catalog for consistency and set it up to run and email me the results every four hours via cron. What I've seen is the vast majority of the time, the catalog is consistent. But occasionally the pgtextindex will have a docid in it that is not in the system. This usually clears up by the next run. So it doesn't seem to be a matter of the database getting corrupted, but seems more a matter of an inconsistent read creeping in.

Revision history for this message
Chris Rossi (chris-archimedeanco) wrote :

from repoze.catalog.indexes.field import CatalogFieldIndex
from repoze.catalog.indexes.keyword import CatalogKeywordIndex
from repoze.catalog.indexes.path2 import CatalogPathIndex2
from repoze.catalog.indexes.text import CatalogTextIndex
from karl.tagging.index import TagIndex
from karl.utils import find_catalog
from karlserve.textindex import KarlPGTextIndex
from pyramid.traversal import find_resource

catalog = find_catalog(root)
dm = catalog.document_map
for docid in dm.docid_to_address.keys():
    assert dm.address_for_docid(docid), docid
for address in dm.address_to_docid.keys():
    assert find_resource(root, address), address

for name, index in catalog.items():
    if isinstance(index, (CatalogFieldIndex, CatalogKeywordIndex)):
        print "Checking", name
        for docids in index._fwd_index.values():
            for docid in docids:
                path = dm.address_for_docid(docid)
                assert path, docid
        for docid in index._rev_index.keys():
            path = dm.address_for_docid(docid)
            assert path, docid

    elif isinstance(index, CatalogTextIndex):
        print "Checking", name
        for docid in index.index._docwords.keys():
            path = dm.address_for_docid(docid)
            assert path, docid

    elif isinstance(index, KarlPGTextIndex):
        print "Checking", name
        for docid in index.docids():
            path = dm.address_for_docid(docid)
            assert path, docid

    elif isinstance(index, CatalogPathIndex2):
        print "Checking", name
        for docid in index.docid_to_path.keys():
            path = dm.address_for_docid(docid)
            assert path, docid

    elif isinstance(index, TagIndex):
        print "Checking", name
        engine = index.site.tags
        for doc in engine._tagid_to_obj.values():
            path = dm.address_for_docid(doc.item)
            assert path, doc.item

    else:
        print "Skipping", name

Revision history for this message
Paul Everitt (paul-agendaless) wrote :

Assigning this to Shane to noodle over *next* week in M89.

Changed in karl3:
assignee: nobody → Shane Hathaway (shane-hathawaymix)
importance: Undecided → Medium
milestone: none → m89
Changed in karl3:
milestone: m89 → m90
Changed in karl3:
milestone: m90 → m89
Revision history for this message
Chris Rossi (chris-archimedeanco) wrote :

Got one this morning, here's what the output looks like:

Checking creator
Checking member_name
Checking creation_date
Checking texts
Traceback (most recent call last):
 File "/srv/osfkarl/production/current/bin/karlserve", line 106, in <module>
   karlserve.scripts.main.main()
 File "/srv/osfkarl/production/39/eggs/karlserve-1.12-py2.6.egg/karlserve/scripts/main.py", line 79, in main
   func(args)
 File "/srv/osfkarl/production/39/eggs/karlserve-1.12-py2.6.egg/karlserve/scripts/debug.py", line 28, in main
   exec code in local_ns
 File "/srv/osfkarl/production/current/scripts/check_catalog.py", line 39, in <module>
   assert path, docid
AssertionError: 884868959

Revision history for this message
Shane Hathaway (shane-hathawaymix) wrote :

There will always be temporary inconsistencies between the database and the text index. The test script, as written, magnifies the window of apparent inconsistency because it uses long transactions. The object with docid 884868959 is a comment that was created on 2012-1-24 at 04:13:22. Here is what probably happened:

 - At 4 AM, the test script started. It opened a long-running ZODB transaction, so its view of the database was a snapshot of 4:00 AM.

 - From 4 to 4:30 (guessing), the test script checked ZODB-only indexes. They were all consistent.

 - At 4:30, the script started checking a pgtextindex. The pgtextindex opened its database connection at this time, so the pgtextindex snapshot was for 4:30, while the ZODB snapshot stayed at 4:00.

 - The script found a document that had been created between 4 and 4:30. The document existed in the 4:30 pgtextindex snapshot but not in the 4:00 ZODB snapshot, so it reported an error.

I suggest the script should abort the transaction for each top-level loop iteration, which should reduce the window of apparent inconsistency to a very short time. That way, the script will compare a 4:30 pgtextindex with a 4:30 ZODB.

Changed in karl3:
assignee: Shane Hathaway (shane-hathawaymix) → Chris Rossi (chris-archimedeanco)
Changed in karl3:
status: New → In Progress
Revision history for this message
Chris Rossi (chris-archimedeanco) wrote :

Hi Shane,

Your theory seems pretty credible. But, remember that the reason for looking into this was to try go figure out why we could see inconsistencies at the application level that could lead to occasional user visible exceptions. You can see here, for example, that Carlos has added a guard to the code utilizing the catalog to prevent the user visible exception in the case of a catalog inconsistency:

https://github.com/karlproject/karl/commit/366cc8a5ac4ebe11f48958e2d165b08c98f11306

Paul, do you know which LP bug that fix was done for?

I would like to be able to take out guards like the one Carlos has written above. They're not going to applied consistently to every place the catalog is used--it's a band-aid applied haphazardly in the most recent spot someone saw the problem that doesn't address the underlying cause.

So, if the issue is that the transaction for pgtextindex is taking a different starting point than the transaction for relstorage, how can we sync those up? Can we/should we be sharing the database connection? What are our other options?

Changed in karl3:
assignee: Chris Rossi (chris-archimedeanco) → Shane Hathaway (shane-hathawaymix)
Revision history for this message
Shane Hathaway (shane-hathawaymix) wrote :

That was the fix for lp:918255, which involved a problematic folder in the trash of a large community. It was a long-lived inconsistency, not an intermittent inconsistency like the kind this bug is trying to deal with. It might have been caused by some bug in the trash restore code, which is to be replaced with code that hopefully reindexes more thoroughly (on the shane-trash-redo branch).

Regarding intermittent inconsistency, I feel like text indexes are special because of their performance constraints. Text indexes should be considered eventually consistent, not always consistent like ZODB. That means to me that text index queries should sometimes return docids that aren't in the rest of the catalog.

We could hide the intermittent inconsistency from application code by storing, in ZODB, a TreeSet of all the docids that the pgtextindex will eventually have when it's consistent, then at query time, we could remove docids from the result set that aren't in that TreeSet. It will make text queries slightly more expensive, but it will prevent the need for application code hacks. OTOH it would have no effect on lp:918255, since that was caused by a more permanent inconsistency.

Revision history for this message
Chris Rossi (chris-archimedeanco) wrote :

I'll but that.

Changed in karl3:
status: In Progress → Fix Released
Revision history for this message
Chris Rossi (chris-archimedeanco) wrote :

Err, *buy* it, even.

Revision history for this message
claytron (clayton-sixfeetup) wrote :

We are still seeing this issue on portlets quite often:

https://github.com/karlproject/karl/blob/master/karl/content/views/adapters.py#L672

Here is a traceback that we often see:

2011-11-17 11:39:06,078 ERROR karl.system: Exception when processing https://karl.sixfeetup.com/offices/fortville/
Referer:
Traceback (most recent call last):
  File "/usr/local/karl/production/1/eggs/karl-3.72-py2.6.egg/karl/errorlog.py", line 18, in middleware
    return app(environ, start_response)
  File "/usr/local/karl/production/1/eggs/repoze.retry-0.9.4-py2.6.egg/repoze/retry/__init__.py", line 88, in __call__
    app_iter = self.application(environ, replace_start_response)
  File "/usr/local/karl/production/1/eggs/repoze.zodbconn-0.11-py2.6.egg/repoze/zodbconn/connector.py", line 21, in __call__
    result = self.next_app(environ, start_response)
  File "/usr/local/karl/production/1/eggs/repoze.tm2-1.0a5-py2.6.egg/repoze/tm/__init__.py", line 23, in __call__
    result = self.application(environ, save_status_and_headers)
  File "/usr/local/karl/production/1/eggs/repoze.who-1.0.15-py2.6.egg/repoze/who/middleware.py", line 107, in __call__
    app_iter = app(environ, wrapper.wrap_start_response)
  File "/usr/local/karl/production/1/eggs/repoze.bfg-1.2-py2.6.egg/repoze/bfg/router.py", line 130, in __call__
    response = view_callable(context, request)
  File "/usr/local/karl/production/1/eggs/repoze.bfg-1.2-py2.6.egg/repoze/bfg/configuration.py", line 1753, in _secured_view
    return view(context, request)
  File "/usr/local/karl/production/1/eggs/karl-3.72-py2.6.egg/karl/views/retail.py", line 40, in retail_view
    context.right_portlets)
  File "/usr/local/karl/production/1/eggs/karl-3.72-py2.6.egg/karl/views/retail.py", line 75, in _get_portlet_html
    html += portlet_info.asHTML
  File "/usr/local/karl/production/1/eggs/karl-3.72-py2.6.egg/karl/content/views/adapters.py", line 695, in asHTML
    entries = self.entries
  File "/usr/local/karl/production/1/eggs/karl-3.72-py2.6.egg/karl/content/views/adapters.py", line 678, in entries
    'title': doc.title,
AttributeError: 'NoneType' object has no attribute 'title'

Revision history for this message
Paul Everitt (paul-agendaless) wrote : Re: [Bug 920667] Intermittent inconsistency in pgtextindex
Download full text (3.4 KiB)

Clayton, are you updated to the latest-greatest that has this fix?

--Paul

On Feb 9, 2012, at 9:02 AM, claytron wrote:

> We are still seeing this issue on portlets quite often:
>
> https://github.com/karlproject/karl/blob/master/karl/content/views/adapters.py#L672
>
> Here is a traceback that we often see:
>
> 2011-11-17 11:39:06,078 ERROR karl.system: Exception when processing https://karl.sixfeetup.com/offices/fortville/
> Referer:
> Traceback (most recent call last):
> File "/usr/local/karl/production/1/eggs/karl-3.72-py2.6.egg/karl/errorlog.py", line 18, in middleware
> return app(environ, start_response)
> File "/usr/local/karl/production/1/eggs/repoze.retry-0.9.4-py2.6.egg/repoze/retry/__init__.py", line 88, in __call__
> app_iter = self.application(environ, replace_start_response)
> File "/usr/local/karl/production/1/eggs/repoze.zodbconn-0.11-py2.6.egg/repoze/zodbconn/connector.py", line 21, in __call__
> result = self.next_app(environ, start_response)
> File "/usr/local/karl/production/1/eggs/repoze.tm2-1.0a5-py2.6.egg/repoze/tm/__init__.py", line 23, in __call__
> result = self.application(environ, save_status_and_headers)
> File "/usr/local/karl/production/1/eggs/repoze.who-1.0.15-py2.6.egg/repoze/who/middleware.py", line 107, in __call__
> app_iter = app(environ, wrapper.wrap_start_response)
> File "/usr/local/karl/production/1/eggs/repoze.bfg-1.2-py2.6.egg/repoze/bfg/router.py", line 130, in __call__
> response = view_callable(context, request)
> File "/usr/local/karl/production/1/eggs/repoze.bfg-1.2-py2.6.egg/repoze/bfg/configuration.py", line 1753, in _secured_view
> return view(context, request)
> File "/usr/local/karl/production/1/eggs/karl-3.72-py2.6.egg/karl/views/retail.py", line 40, in retail_view
> context.right_portlets)
> File "/usr/local/karl/production/1/eggs/karl-3.72-py2.6.egg/karl/views/retail.py", line 75, in _get_portlet_html
> html += portlet_info.asHTML
> File "/usr/local/karl/production/1/eggs/karl-3.72-py2.6.egg/karl/content/views/adapters.py", line 695, in asHTML
> entries = self.entries
> File "/usr/local/karl/production/1/eggs/karl-3.72-py2.6.egg/karl/content/views/adapters.py", line 678, in entries
> 'title': doc.title,
> AttributeError: 'NoneType' object has no attribute 'title'
>
> --
> You received this bug notification because you are subscribed to KARL3.
> https://bugs.launchpad.net/bugs/920667
>
> Title:
> Intermittent inconsistency in pgtextindex
>
> Status in KARL3:
> Fix Released
>
> Bug description:
> A while back I was investigating an issue where under some
> circumstances the catalog would return a docid for a document that was
> no longer in the system, and therefore couldn't be loaded. Usually
> this was some non-sensical looking error like "None has no attribute,
> 'title'".
>
> I wrote a test script, attached, which goes through and checks the
> catalog for consistency and set it up to run and email me the results
> every four hours via cron. What I've seen is the vast majority of the
> time, the catalog is consistent. But occasionally the pgtextindex
> will have a docid in it that is not in the system. This usually
...

Read more...

Revision history for this message
claytron (clayton-sixfeetup) wrote :

If I understand the ticket correctly, this was only fixed for the tags. The portlet code still assumes that all entries coming back from the catalog are valid. It needs the same `if doc is None: continue` treatment.

Revision history for this message
claytron (clayton-sixfeetup) wrote :

And to answer your question, no. This is still the repoze.bfg KARL.

Revision history for this message
Paul Everitt (paul-agendaless) wrote :

Chris, do you think we need to wrap this portlet case, or continue trying to get to the bottom of the problem?

--Paul

On Feb 9, 2012, at 10:02 AM, claytron wrote:

> If I understand the ticket correctly, this was only fixed for the tags.
> The portlet code still assumes that all entries coming back from the
> catalog are valid. It needs the same `if doc is None: continue`
> treatment.
>
> --
> You received this bug notification because you are subscribed to KARL3.
> https://bugs.launchpad.net/bugs/920667
>
> Title:
> Intermittent inconsistency in pgtextindex
>
> Status in KARL3:
> Fix Released
>
> Bug description:
> A while back I was investigating an issue where under some
> circumstances the catalog would return a docid for a document that was
> no longer in the system, and therefore couldn't be loaded. Usually
> this was some non-sensical looking error like "None has no attribute,
> 'title'".
>
> I wrote a test script, attached, which goes through and checks the
> catalog for consistency and set it up to run and email me the results
> every four hours via cron. What I've seen is the vast majority of the
> time, the catalog is consistent. But occasionally the pgtextindex
> will have a docid in it that is not in the system. This usually
> clears up by the next run. So it doesn't seem to be a matter of the
> database getting corrupted, but seems more a matter of an inconsistent
> read creeping in.
>
> To manage notifications about this bug go to:
> https://bugs.launchpad.net/karl3/+bug/920667/+subscriptions

Revision history for this message
Chris Rossi (chris-archimedeanco) wrote :

If it's happening often, I wouldn't mind taking a closer look. But I don't
know how practical that is. We're not getting these errors often on our
end, I don't think.

Chris

On Thu, Feb 9, 2012 at 10:35 AM, Paul Everitt <email address hidden> wrote:

>
> Chris, do you think we need to wrap this portlet case, or continue trying
> to get to the bottom of the problem?
>
> --Paul
>
> On Feb 9, 2012, at 10:02 AM, claytron wrote:
>
> > If I understand the ticket correctly, this was only fixed for the tags.
> > The portlet code still assumes that all entries coming back from the
> > catalog are valid. It needs the same `if doc is None: continue`
> > treatment.
> >
> > --
> > You received this bug notification because you are subscribed to KARL3.
> > https://bugs.launchpad.net/bugs/920667
> >
> > Title:
> > Intermittent inconsistency in pgtextindex
> >
> > Status in KARL3:
> > Fix Released
> >
> > Bug description:
> > A while back I was investigating an issue where under some
> > circumstances the catalog would return a docid for a document that was
> > no longer in the system, and therefore couldn't be loaded. Usually
> > this was some non-sensical looking error like "None has no attribute,
> > 'title'".
> >
> > I wrote a test script, attached, which goes through and checks the
> > catalog for consistency and set it up to run and email me the results
> > every four hours via cron. What I've seen is the vast majority of the
> > time, the catalog is consistent. But occasionally the pgtextindex
> > will have a docid in it that is not in the system. This usually
> > clears up by the next run. So it doesn't seem to be a matter of the
> > database getting corrupted, but seems more a matter of an inconsistent
> > read creeping in.
> >
> > To manage notifications about this bug go to:
> > https://bugs.launchpad.net/karl3/+bug/920667/+subscriptions
>
>

To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.