test_full_integration fails intermittently

Bug #745738 reported by Jonathan Lange
6
This bug affects 1 person
Affects Status Importance Assigned to Milestone
Launchpad itself
Fix Released
Critical
Unassigned
loggerhead
Fix Released
Critical
Unassigned
loggerhead-breezy
Fix Released
Critical
Unassigned

Bug Description

WARNING: A test appears to be hung. There has been no output for 600 seconds.
Forcibly shutting down the test suite
Process group 10522 will be killed
Sending signal 15 to process group 10522
Sending signal 2 to process group 10522
Sending signal 1 to process group 10522
Sending signal 9 to process group 10522
Process group 10522 is now empty.
The dying processes left behind the following output:
--------------- BEGIN OUTPUT ---------------

---------------- END OUTPUT ----------------

Tests failed (exit code -15)

But it earlier includes the error:

 test_full_integration (loggerhead.tests.test_load_test.TestActionScriptIntegration)

Error in test test_full_integration (loggerhead.tests.test_load_test.TestActionScriptIntegration)
Traceback (most recent call last):
_StringException: Text attachment: log
------------
439.318 opening working tree '/tmp/testbzr-ZP4bnH.tmp'
------------
Text attachment: traceback
------------
Traceback (most recent call last):
  File "/srv/buildbot/slaves/launchpad/lucid-devel/build/orig_sourcecode/eggs/testtools-0.9.8-py2.6.egg/testtools/runtest.py", line 169, in _run_user
    return fn(*args, **kwargs)
  File "/srv/buildbot/slaves/launchpad/lucid-devel/build/orig_sourcecode/eggs/testtools-0.9.8-py2.6.egg/testtools/testcase.py", line 499, in _run_test_method
    return self._get_test_method()()
  File "/srv/buildbot/slaves/launchpad/lucid-devel/build/lib/loggerhead/tests/test_load_test.py", line 298, in test_full_integration
    script.run()
  File "/srv/buildbot/slaves/launchpad/lucid-devel/build/lib/loggerhead/load_test.py", line 220, in run
    worker.queue.put(full_url, True, self.blocking_timeout)
  File "/usr/lib/python2.6/Queue.py", line 134, in put
    raise Full
Full
------------

The following test left new threads behind:
test_full_integration (loggerhead.tests.test_load_test.TestActionScriptIntegration)

Both a Launchpad and a loggerhead bug because of the hang.

Related branches

Revision history for this message
Jonathan Lange (jml) wrote :
Changed in loggerhead:
status: New → Triaged
importance: Undecided → Critical
Revision history for this message
John A Meinel (jameinel) wrote :

The test has a 0.1s timeout:
            "parameters": {"base_url": "%s", "blocking_timeout": 0.1},

On my system, the test completes in 63ms. I would be fine increasing that timeout. It is mostly there so that if the test *would* fail, it would do so without blocking forever. I don't have an idea of what would be a good time. 10s should be more than generous.

Revision history for this message
John A Meinel (jameinel) wrote :

Note that the timeout *shouldn't* be triggered. It is there to make sure the test would fail quickly if there was a bug. Even with the timeout changed, it still completes in 60ms for me.

You'll need an updated loggerhead and a matching Launchpad update to start using the new version of loggerhead.

=== modified file 'loggerhead/tests/test_load_test.py'
--- loggerhead/tests/test_load_test.py 2011-03-25 14:14:42 +0000
+++ loggerhead/tests/test_load_test.py 2011-03-30 14:34:46 +0000
@@ -286,7 +286,7 @@
         self.build_tree(['first', 'second', 'third', 'fourth'])
         url = self.get_readonly_url()
         script = load_test.ActionScript.parse("""{
- "parameters": {"base_url": "%s", "blocking_timeout": 0.1},
+ "parameters": {"base_url": "%s", "blocking_timeout": 10.0},
             "requests": [
                 {"thread": "1", "relpath": "first"},
                 {"thread": "2", "relpath": "second"},

Revision history for this message
Robert Collins (lifeless) wrote : Re: [Bug 745738] Re: test_full_integration fails intermittently

A test failure in buildbot costs us *5 hours*.

What does the test demonstrate? Why can it *ever* fail?

Revision history for this message
Robert Collins (lifeless) wrote :

Failed again; I'm backing out the patch to run the loggerhead tests on changes to LP; we don't generally do this now, and choosing this route multiplies any fragility together; I think we should find a different way to be confident loggerhead works with the very of bzr launchpad uses.

Changed in launchpad:
status: Triaged → Fix Released
Revision history for this message
John A Meinel (jameinel) wrote :

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

On 3/31/2011 5:21 AM, Robert Collins wrote:
> Failed again; I'm backing out the patch to run the loggerhead tests on
> changes to LP; we don't generally do this now, and choosing this route
> multiplies any fragility together; I think we should find a different
> way to be confident loggerhead works with the very of bzr launchpad
> uses.
>

If it is just the one test, it is pretty easy to change the test *as I
described* and have it pass regularly rather than back it out. I realize
the test itself was fragile, you could set the timeout to 60min if you
are unhappy. The point was that if the test was every broken (by a
real-bug somewhere) I didn't want it to block indefinitely waiting to
find that bug.

In this case, it is giving false positives, so should be set to a higher
timeout threshold.

I'm pretty sure this isn't the first flaky test that has been landed in
the launchpad test suite. I'm not sure why you felt the need to ignore
the advice I posted here and back out all of the loggerhead tests.

John
=:->

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.9 (Cygwin)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org/

iEYEARECAAYFAk2UDisACgkQJdeBCYSNAAM7SgCfS4mjNrdCB6plIh9qmxZlMrI1
A3QAn21Nh/qz/fjkrmBbU3dncIUzZ/sZ
=tU5C
-----END PGP SIGNATURE-----

Revision history for this message
Robert Collins (lifeless) wrote :

On Thu, Mar 31, 2011 at 6:16 PM, John A Meinel <email address hidden> wrote:
> If it is just the one test, it is pretty easy to change the test *as I
> described* and have it pass regularly rather than back it out. I realize
> the test itself was fragile, you could set the timeout to 60min if you
> are unhappy. The point was that if the test was every broken (by a
> real-bug somewhere) I didn't want it to block indefinitely waiting to
> find that bug.
>
> In this case, it is giving false positives, so should be set to a higher
> timeout threshold.
>
> I'm pretty sure this isn't the first flaky test that has been landed in
> the launchpad test suite. I'm not sure why you felt the need to ignore
> the advice I posted here and back out all of the loggerhead tests.

Because of a few things that combine to have me thinking that having
the loggerhead tests run with the lp tests is a mistake.

But first, on the flakiness: we don't know that its a simple timeout
threshold issue (well, you may, *I* don't). It has cost us at least 10
hours of latency so far, and our policy for dealing with flaky tests
is to disable them; as this test was in loggerhead itself, the way to
do that was to stop running the loggerhead tests when changing LP.

On the larger issue; we don't run the tests for other projects when
committing to LP - not bzr, not zope, etc. All these components can
fail to integrate. This leaves me asking 'why is loggerhead special' -
and I don't have a good answer to that. It seems to me that running
the loggerhead test suite when changing LP is a very heavy hammer to
use to solve the bug 'we deployed an incompatible loggerhead + bzr'.
Some other ways we can catch that:
 - have loggerhead declare its bzr compatibility and check that that
is ok in LP.
 - test loggerhead with the bzr that launchpad is using
 - manual qa

-Rob

Revision history for this message
John A Meinel (jameinel) wrote :
Download full text (3.7 KiB)

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Coming back to this, since it was in my inbox.

...
> Because of a few things that combine to have me thinking that having
> the loggerhead tests run with the lp tests is a mistake.
>
> But first, on the flakiness: we don't know that its a simple timeout
> threshold issue (well, you may, *I* don't). It has cost us at least 10
> hours of latency so far, and our policy for dealing with flaky tests
> is to disable them; as this test was in loggerhead itself, the way to
> do that was to stop running the loggerhead tests when changing LP.
>
> On the larger issue; we don't run the tests for other projects when
> committing to LP - not bzr, not zope, etc. All these components can
> fail to integrate. This leaves me asking 'why is loggerhead special' -
> and I don't have a good answer to that. It seems to me that running
> the loggerhead test suite when changing LP is a very heavy hammer to
> use to solve the bug 'we deployed an incompatible loggerhead + bzr'.

The reason to add the tests is because we had a specific integration
failure that caused a regression in production, which would have been
caught by the test suite, which only occurred because of the specific
way which launchpad combined bzr and loggerhead (the specific version of
bzr vs the specific version of loggerhead).

And further, at least on my machine, the entire Loggerhead test suite
runs in <4s. So it doesn't seem like a very heavy hammer. Which is very
different from running a 30min bzr test suite, or a ??? zope suite.

Arguably, loggerhead is special because of how Launchpad is *combining*
it with a specific version of bzr, etc. If Launchpad was providing
special Paramiko libraries for bzr, it seems like it would make sense to
at least run bzr tests that exercise its paramiko capabilities.
(Certainly we run the lpserve tests which is just a plugin for bzr...)

> Some other ways we can catch that:
> - have loggerhead declare its bzr compatibility and check that that
> is ok in LP.

While it is true we could do this, it is something that you have to
regression test on every update to loggerhead. As in, run 'bzr selftest
- -s bp.loggerhead' against all versions of bzr in which you are claiming
compatibility. I don't see how that is going to be maintained in the
general case.

> - test loggerhead with the bzr that launchpad is using

Sure. I was just codifying that (so people like me and WGrant don't
forget to do so) by adding it to the launchpad test suite.

I do, at this point, manually run bzr22 selftest -s bp.loggerhead and
bzr selftest -s bp.loggerhead to exercise bzr2.2 versus loggerhead
before I commit to trunk. But this is *very* manual and certainly
doesn't scale well. (Will anyone else do the same?)

Also, it is whatever bzr-2.2 version I happen to have (generally
lp:bzr/2.2). Which isn't the exact code that Launchpad uses (because it
uses an egg of a very specific packaged version).

> - manual qa

WGrant did QA the change. He happened to check N revisions, when the N+1
revision had a tag and would have exposed the breakage in this
particular case. (It was something like 5 revisions that he checked.)

>
> -Rob

I do und...

Read more...

Revision history for this message
Robert Collins (lifeless) wrote :

I'm not clear what remains to fix this in LH? Is it that 0.1s isn't reliably small, or that its timing based at all?

John, you say you have an unlanded fix - is it the one pasted in this bug, or something else?

Jelmer Vernooij (jelmer)
Changed in loggerhead-breezy:
status: New → Triaged
status: Triaged → Fix Released
importance: Undecided → Critical
Changed in loggerhead:
status: Triaged → Fix Released
To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers