lib/lp/services/database/doc/textsearching.txt fails intermittently/rarely on parallel tests

Bug #992184 reported by Gary Poster
18
This bug affects 2 people
Affects Status Importance Assigned to Milestone
Launchpad itself
Fix Released
High
Stuart Bishop

Bug Description

======================================================================
FAILURE: lib/lp/services/database/doc/textsearching.txt
worker ID: worker-2
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/usr/lib/python2.6/unittest.py", line 279, in run
    testMethod()
  File "/usr/lib/python2.6/doctest.py", line 2152, in runTest
    raise self.failureException(self.format_failure(new.getvalue()))
AssertionError: Failed doctest test for textsearching.txt
  File "lib/lp/services/database/doc/textsearching.txt", line 0

----------------------------------------------------------------------
File "lib/lp/services/database/doc/textsearching.txt", line 579, in textsearching.txt
Failed example:
    nl_term_candidates('firefox foo-bar give me trouble')
Differences (ndiff with -expected +actual):
    - [u'firefox', u'foo', u'bar', u'foobar', u'give', u'troubl']
    ? - ^ --- ^^^
    + [u'ive', u'bl', u'bar', u'foobar', u'give', u'troubl']
    ? ^ ^^

worker-2's test list is here: http://pastebin.ubuntu.com/958370/

Tags: paralleltest
Gary Poster (gary)
Changed in launchpad:
status: New → Triaged
importance: Undecided → High
tags: added: paralleltest
Revision history for this message
Gary Poster (gary) wrote :

nl_term_candidates is something that defers to a postgres function, ftq. It looks like that's what was misbehaving somehow. If this is not a test interaction I'll have no idea. :-/

Revision history for this message
Gary Poster (gary) wrote :

I could not duplicate this in a lucid lxc with the tests from worker 2's test list. :-/ I'm not sure where to take this next, other than maybe to stub.

Revision history for this message
Stuart Bishop (stub) wrote :

This may be sensitive to PostgreSQL versions.

Revision history for this message
Stuart Bishop (stub) wrote :

ftq() is a python stored procedure. Even though it is a horrible collection of regexps, it looks deterministic. At least until the end, when it sends the reformatted query to to_tsquery (formerly part of tsearch2, now part of PostgreSQL core). I cannot imagine how we end up with the results we are seeing though. The word 'firefox' is lost entirely, and there is no way that will ever be a stop word. The word 'give' seems to end up in the results twice, once correctly and the other time just 'ive'; I don't think words should ever lose their prefixes (just suffixes, like 'trouble' after going through the english language stemmer). 'trouble' also got duplicated, and I have no idea how 'bl' managed to get extracted from the middle of it.

Because of the duplication of words in the query, I suspect this isn't something odd like the stemmer misbehaving but some sort of corruption of the query. Possibly converting the PostgreSQL string to a Python string when invoking the ftq() stored procedure. I would suspect a PG 9.1 issue rather than a multiple process issue, as the separate backend database processes should not be able to corrupt each other in this way.

If you want to test this theory, make a database patch containing a copy of the 'CREATE FUNCTION ftq(text)' from database/schema/launchpad-2209-00-0.sql. Uncomment one of the debug lines or add debugging using plpy.notice or plpy.warning. Provided log_min_messages is set to a suitable value in postgresql.conf, these messages will end up in the postgresql log file and you can check that what you think is being passed in is actually what is being received. Or, as you have full access to a Python interpreter, open('mylog','w+') can let you emit log messages wherever you like. Alternatively, for nl_term_candidates you could move the body of the ftq() stored procedure client side, leaving just the final call to to_tsquery to be sent to the database.

Benji York (benji)
Changed in launchpad:
assignee: nobody → Benji York (benji)
status: Triaged → In Progress
Revision history for this message
Gary Poster (gary) wrote :

Thank you very much Stuart. We'll check into it.

It happened again twice today. Test list for the pertinent workers: http://pastebin.ubuntu.com/965317/ and http://pastebin.ubuntu.com/965512/ .

Revision history for this message
Benji York (benji) wrote :

Using Postgres 9.1 I am able to reproduce the test failure by running the test in question in a loop until it fails. It usually only takes a dozen or so runs before I get a failure.

By logging the progress of the ftq stored procedure, I was able to verify that the corruption is introduced by the following two lines of code:

        p = plpy.prepare("SELECT to_tsquery('default', $1) AS x", ["text"])
        query = plpy.execute(p, [query], 1)[0]["x"]

Benji York (benji)
Changed in launchpad:
status: In Progress → Confirmed
assignee: Benji York (benji) → Stuart Bishop (stub)
Revision history for this message
Stuart Bishop (stub) wrote :

I've been unable to reproduce with Precise + PG 9.1:

$ for i in {1..100}; do echo ======= $i =======; bin/test -q --test=doc/textsearching.txt; done 2>&1 | grep failures
  Ran 1 tests with 0 failures and 0 errors in 4.914 seconds.
  Ran 1 tests with 0 failures and 0 errors in 2.763 seconds.
  Ran 1 tests with 0 failures and 0 errors in 1.501 seconds.
  Ran 1 tests with 0 failures and 0 errors in 1.969 seconds.
  Ran 1 tests with 0 failures and 0 errors in 1.066 seconds.
  Ran 1 tests with 0 failures and 0 errors in 0.367 seconds.
  Ran 1 tests with 0 failures and 0 errors in 0.386 seconds.
  Ran 1 tests with 0 failures and 0 errors in 4.172 seconds.
  Ran 1 tests with 0 failures and 0 errors in 6.243 seconds.
  Ran 1 tests with 0 failures and 0 errors in 0.365 seconds.
  Ran 1 tests with 0 failures and 0 errors in 0.356 seconds.
[...]

Revision history for this message
Stuart Bishop (stub) wrote :

Need to discover how to set up an environment where this can be reproduced. lxc? Lucid? I don't know the trigger.

Changed in launchpad:
status: Confirmed → Incomplete
Revision history for this message
Gary Poster (gary) wrote :

I expect that benji has been working with a lucid lxc, since that's the environment we have been working in for months. That said, he is the one who knows the details, and we won't be back till the middle of next week.

Revision history for this message
Benji York (benji) wrote :

This is a message from the F-U-T-U-R-E...

Yep, I'm using lucid in an lxc. I've seen it with both 8.4 and 9.1. It normally takes several test runs to produce the error.

There might be a timing component because if I attach ltrace (but not strace) to the PG process (one of them, I forget which) I never see the error.

Revision history for this message
Gary Poster (gary) wrote :

Repeated today.

Changed in launchpad:
status: Incomplete → Triaged
Revision history for this message
Gary Poster (gary) wrote :

To be clear, Stuart, this may well end up being an lxc bug of some sort, but we really need/want your help to identify the problem so we can direct it their way productively, if so.

If you run into any issues with setting up an lxc launchpad environment, please let us know if we can help. https://dev.launchpad.net/Running/LXC is supposed to work and did for us in past months, but we've been using automation for some time to set this up so the page may have bitrotted. The automation we use is available, but we haven't gotten around to preparing it for other devs yet (install lpsetup from our PPA, https://launchpad.net/~yellow/+archive/ppa , if you want, and then look at the lpsetup command).

Revision history for this message
Gary Poster (gary) wrote :

2 more instances.

Revision history for this message
Stuart Bishop (stub) wrote :

I've setup a Lucid amd64 lxc container under precise with PG 9.1 per the lxc setup docs and standard rocketfuel-setup/launchpad-database-setup scripts. 60 test runs and no fails.

Revision history for this message
Benji York (benji) wrote :

I wrote a small script to execute this command

    psql launchpad_dev -c "select ftq('firefox foo-bar give me trouble')::text"

in a loop and stop if the result ever changes. It ran for many hours without seeing the bug in question. I can reproduce the bug by running the failing test in a loop for only a few minutes.

Revision history for this message
Robert Collins (lifeless) wrote : Re: [Bug 992184] Re: lib/lp/services/database/doc/textsearching.txt fails intermittently/rarely on parallel tests

That suggests an isolation issue of some sort; possibly lxc /
overlayfs related, or possibly test code related. (The test run in a
loop drops-and-adds db's a lot).

Revision history for this message
Stuart Bishop (stub) wrote :

Next up, I'd add an explicit store.commit or transaction.commit()
before the failing statement in the test.

Then, remove the entirety of the test except for the failing statement.

This should let us see if it is the particular statement failing in
isolation in the test suite, or if it requires a particular sequence
of operations before it to trigger the failure.

--
Stuart Bishop <email address hidden>

Revision history for this message
Benji York (benji) wrote :

Per Suart's suggestion I pared the test down to just an import and the call to nl_term_candidates. After a few loops I got the same error.

I then added an import of transaction and call to transaction.commit() before the nl_term_candidates call and also got an error after a few loops.

Revision history for this message
Benji York (benji) wrote :

I put the call to nl_term_candidates in a loop inside the test (as opposed to running the test itself in a loop) and couldn't reproduce the error (both with a transaction.commit() in the loop and not).

Another thing to note: there are other calls to nl_term_candidates in that test, but this one is the only one that ever fails.

Revision history for this message
Benji York (benji) wrote :

Gary suggested using bin/harness to run the call in question in a smaller environment. I hacked the "python" function in lib/lp/scripts/harness.py to run the call in question and assert that the result was right in a loop. I couldn't get it to fail.

Revision history for this message
Benji York (benji) wrote :

Another data point: I modified the test in question to just run the single call to nl_term_candidates in question (as above) but I made the call twice and verified the output after each call. After running the test in a loop for a while the *second* call produced the error.

Revision history for this message
Benji York (benji) wrote :

We believe that this bug has been fixed upstream (see https://groups.google.com/forum/?fromgroups#!topic/pgsql.committers/vIkcjHqVfjY).

Stuart, is there anything to be done to expedite getting this fix in to production?

Revision history for this message
Stuart Bishop (stub) wrote :

On Sat, May 19, 2012 at 12:32 AM, Benji York <email address hidden> wrote:
> We believe that this bug has been fixed upstream (see
> https://groups.google.com/forum/?fromgroups#!topic/pgsql.committers/vIkcjHqVfjY).
>
> Stuart, is there anything to be done to expedite getting this fix in to
> production?

Production, we are probably best waiting on real releases to be made.
I can try to build 9.1 packages with the patch for testing (and maybe
production - it doesn't seem to be a crasher, and if you have
identified this correctly its been around since at least PG 8.4).

--
Stuart Bishop <email address hidden>

Revision history for this message
Gary Poster (gary) wrote :

Great, Stuart, please do. Thank you.

Revision history for this message
Stuart Bishop (stub) wrote :

PostgreSQL 9.1 + patch packages are in in ppa:stub/junk

Revision history for this message
Gary Poster (gary) wrote :

Thank you. I've started running tests with this today and will report back.

Revision history for this message
Gary Poster (gary) wrote :

I have 13 runs without this error using the new PPA. The most test runs in-between incidents of this error before the PPA was 39, so we have quite a while to go before we have any confidence that we identified this one. Right now, I'm thinking 60 runs would be good.

Revision history for this message
Gary Poster (gary) wrote :

34 runs without this error...about halfway there.

Revision history for this message
Stuart Bishop (stub) wrote :

The PG patch is included in the 9.1.4 security fix, out now.

--
Stuart Bishop <email address hidden>

Revision history for this message
Gary Poster (gary) wrote :

Great, Stuart, thank you. I'm feeling pretty optimistic about this: 48 runs without this error now. Again, the longest run we had before without this error was 39, so we're already well past that. If we don't have any of these errors by the end of today I'm going to mark it fix released.

Revision history for this message
Gary Poster (gary) wrote :

Yay. Thanks.

Changed in launchpad:
status: Triaged → Fix Released
Revision history for this message
Gary Poster (gary) wrote :

Stuart, I stopped using your junk PPA package, and within a few runs I saw the error again, after more than 50 without the error. It's possible that we simply were lucky all this time, but it goes against the odds. Could you tell me what version of Precise Postgres has this patch, so I can verify that we have it on EC2? Thank you.

Changed in launchpad:
status: Fix Released → Triaged
Revision history for this message
Gary Poster (gary) wrote :

Stuart, you said this was in 9.1.4, and I verified that I saw it in the changelog for that version (9.1.4-0ubuntu12.04, 'Fix memory copying bug in to_tsquery().').

We have 9.1.3-1~lucid installed in the LXC containers in EC2.

I see http://www.ubuntuupdates.org/ppa/pitti_postgresql?dist=lucid has 9.1.4, as does precise, but Launchpad-on-Lucid does not yet (without installing pitti's PPA).

Should we just copy the packages (postgresql-9.1 postgresql-contrib-9.1 postgresql-plpython-9.1) from pitti's PPA over into https://launchpad.net/~launchpad/+archive/ppa? (Wow, that's a lot of "p"s in that sentence.)

Thanks

Gary

Revision history for this message
Stuart Bishop (stub) wrote :

On Wed, Jun 13, 2012 at 12:33 AM, Gary Poster <email address hidden> wrote:
> Stuart, you said this was in 9.1.4, and I verified that I saw it in the
> changelog for that version (9.1.4-0ubuntu12.04, 'Fix memory copying bug
> in to_tsquery().').
>
> We have 9.1.3-1~lucid installed in the LXC containers in EC2.
>
> I see http://www.ubuntuupdates.org/ppa/pitti_postgresql?dist=lucid has
> 9.1.4, as does precise, but Launchpad-on-Lucid does not yet (without
> installing pitti's PPA).
>
> Should we just copy the packages (postgresql-9.1 postgresql-contrib-9.1
> postgresql-plpython-9.1) from pitti's PPA over into
> https://launchpad.net/~launchpad/+archive/ppa?  (Wow, that's a lot of
> "p"s in that sentence.)

I think we need 9.1.4 for Lucid everywhere as this is a security fix.
IS might already be on it. I haven't trawled the docs to see if the
security fix bites us.

Copying packages from pitti's PPA into the Launchpad PPA should be
fine, and how we have pulled in updates before. The bits and pieces
not in the PPA (postgresql-9.1-debversion) shouldn't need to be
rebuilt. Just do lucid - we should stick to the shipped packages under
Precise.

--
Stuart Bishop <email address hidden>

security vulnerability: no → yes
visibility: public → private
Revision history for this message
Gary Poster (gary) wrote :

I added the copied the package to the Launchpad PPA, and verified this morning that it is installing as I had hoped. Thanks, Stuart.

Changed in launchpad:
status: Triaged → Fix Released
William Grant (wgrant)
security vulnerability: yes → no
visibility: private → public
To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Duplicates of this bug

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.