ec2test sometimes hangs when running the windmill test suite

Bug #570380 reported by Māris Fogels
This bug affects 3 people
Affects: Launchpad itself
Status: Fix Released
Importance: High
Assigned to: Māris Fogels

Bug Description

On occasion our test suite will hang when loading the first windmill test via ec2test. This only appears to affect developer systems running ec2test.

If you encounter this problem, rerunning the test suite should work.

The following stack trace was pulled from a hung system:

> Thread 2
> #0 0x00002b85227fd7fb in accept () from None
> #1 0x00002b852388f947 in sock_accept (s=0x94409c0) from /build/buildd/python2.5-2.5.2/Modules/socketmodule.c
> /usr/lib/python2.5/socket.py (167): accept
> /usr/lib/python2.5/SocketServer.py (374): get_request
> /usr/lib/python2.5/SocketServer.py (216): handle_request
> /var/launchpad/tmp/eggs/windmill-1.3beta3_lp_r1440-py2.5.egg/windmill/server/https.py (394): start
> /usr/lib/python2.5/threading.py (445): run
> /usr/lib/python2.5/threading.py (469): __bootstrap_inner
> /usr/lib/python2.5/threading.py (461): __bootstrap

MaxB said: "This must be the culprit of the hang; it appears similar to one I've been looking at for the Python 2.6 migration. Whatever was supposed to knock this thread out of its accept loop hasn't."
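
To illustrate the kind of hang MaxB describes, here is a minimal sketch (modern Python 3, not windmill's actual code) of an accept loop that cannot get stuck forever. The names `serve_until_stopped`, `stop_event`, and `demo` are hypothetical. The idea is that a socket timeout makes `accept()` return periodically, so the thread can notice a shutdown request even when no client ever connects.

```python
import socket
import threading


def serve_until_stopped(server_sock, stop_event, poll_interval=0.2):
    """Accept connections until stop_event is set.

    A bare accept() blocks forever if no client ever connects. Giving the
    listening socket a timeout makes accept() hand control back
    periodically, so the loop can notice a shutdown request.
    """
    server_sock.settimeout(poll_interval)
    accepted = 0
    while not stop_event.is_set():
        try:
            conn, _addr = server_sock.accept()
        except socket.timeout:
            continue  # nobody connected; re-check the stop flag
        conn.close()
        accepted += 1
    return accepted


def demo():
    """Start the loop in a thread, request shutdown, report whether it exited."""
    server_sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    server_sock.bind(("127.0.0.1", 0))  # port 0: let the OS pick a free port
    server_sock.listen(1)
    stop_event = threading.Event()
    worker = threading.Thread(
        target=serve_until_stopped, args=(server_sock, stop_event))
    worker.start()
    stop_event.set()  # no client ever connects, yet the thread still exits
    worker.join(timeout=5)
    server_sock.close()
    return not worker.is_alive()
```

Without the `settimeout()` call, the worker thread in `demo()` would sit in `accept()` indefinitely, which matches the shape of the stack trace above.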

Attached are two log files from two simultaneously hung test runs. They both hang in the same place, just after the RegistryWindmillLayer setUp() function.

The following thread is also relevant: https://lists.launchpad.net/launchpad-dev/msg03237.html

Māris Fogels (mars) wrote :

Quoted from the mailing list thread:

"Windmill implements a custom HTTPS web server, which is waiting for data. I would guess that something in the web browser itself hung: either loading the test harness, loading the site under test, or passing back test results. We need a log file to know for sure."

I think the next thing we should do is look for a robust mechanism for running and restarting hung tests. Even if we find what is hanging in the test harness (browser, server, network, moon phase) we will probably end up fixing it with a full restart anyway.
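
As a rough sketch of what "running and restarting hung tests" could look like (an assumption for illustration, not the mechanism Launchpad adopted), the hypothetical helper `run_with_retries` below gives a command a wall-clock budget, kills it if the budget is exceeded, and tries again a limited number of times. It is written for modern Python 3.

```python
import subprocess
import sys


def run_with_retries(command, timeout_seconds, max_attempts=3):
    """Run command; if it exceeds timeout_seconds, kill it and try again.

    Returns (exit_code, attempts_used). Raises RuntimeError when every
    attempt hangs.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            completed = subprocess.run(command, timeout=timeout_seconds)
        except subprocess.TimeoutExpired:
            # subprocess.run() kills the child before re-raising, so it is
            # safe to start a fresh attempt here.
            continue
        return completed.returncode, attempt
    raise RuntimeError(
        "command hung on all %d attempts: %r" % (max_attempts, command))


if __name__ == "__main__":
    # Trivial command that finishes at once, just to show the call shape.
    print(run_with_retries([sys.executable, "-c", "pass"], timeout_seconds=30))
```

Note that `subprocess.run()` only kills the direct child. A real test runner that spawns a browser and servers would also need to clean up those grandchildren before retrying.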

Māris Fogels (mars)
description: updated
Michael Hudson-Doyle (mwhudson) wrote :

I have this semi-suspicion that firefox is dying on us somehow. I don't know how to confirm or rule out this suspicion -- there might be something in .xsession-errors?

Māris Fogels (mars) wrote : Re: [Bug 570380] Re: ec2test sometimes hangs on the first windmill test

Firefox could be outright dying, or it could be blocking on a dialog
window. There is no way to know without running the test suite using
a live console. Since this problem is intermittent, I suspect it is
simply the browser dying, and not blocking. If it were blocking on a
dialog, it would hang every time, not at random.

I do not know the frequency with which the browser hangs, so I assume it
would be annoying to try to catch the hang by watching a live window.

Even if we do discover what is hanging, I am not sure we will be able
to do anything besides restarting the suite anyway.

I am hesitant to hack windmill itself to fix this. Adding both
browser death detection and the ability to restart the browser
mid-test is more work than we can do. However, we may be able to add
just the browser death detection to windmill, cause windmill to die
really loudly, then use that death to signal a retry for our own
suite. Worth thinking about.
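
A minimal sketch of the "detect the death and fail loudly" idea, assuming the harness is the direct parent of the browser process and holds its `subprocess.Popen` handle. This is not windmill's API; `ChildDiedError` and `wait_for_result` are hypothetical names, and the code targets modern Python 3.

```python
import subprocess
import sys
import time


class ChildDiedError(RuntimeError):
    """Raised when the watched process exits before the work is done."""


def wait_for_result(child, result_ready, poll_interval=0.1, deadline=60.0):
    """Poll until result_ready() is true, failing fast if child has exited.

    child is a subprocess.Popen; result_ready is a zero-argument callable.
    """
    give_up_at = time.monotonic() + deadline
    while time.monotonic() < give_up_at:
        if result_ready():
            return True
        exit_code = child.poll()  # None while the process is still running
        if exit_code is not None:
            raise ChildDiedError(
                "watched process exited with code %s" % exit_code)
        time.sleep(poll_interval)
    raise TimeoutError("no result within %.1f seconds" % deadline)


if __name__ == "__main__":
    # Stand-in for a browser that dies right after launch.
    browser = subprocess.Popen([sys.executable, "-c", "import sys; sys.exit(3)"])
    try:
        wait_for_result(browser, lambda: False, deadline=10.0)
    except ChildDiedError as error:
        print("loud failure:", error)
```

An outer script could treat `ChildDiedError` as the signal to retry the suite, instead of waiting on a server that will never receive another request.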

Michael Hudson-Doyle (mwhudson) wrote : Re: ec2test sometimes hangs on the first windmill test

Firefox definitely isn't blocking -- there's only a zombie firefox on the system when it's all hung up.

Loud clanging and horns when firefox dies sounds like it would be a useful addition to windmill, it might even help the windmill developers fix the problem one day...

Michael Hudson-Doyle (mwhudson) wrote :

Oh, and it's not always the first windmill test -- I just had a case when it was at least the third chunk of windmill tests to run.

Māris Fogels (mars) wrote :

Guilherme confirms that it doesn't always happen on the first test.

Another symptom of this may be the disappearance of the ec2 instance entirely. ec2 instances shut themselves down after 8 hours of inactivity.

As discussed in today's Reviewers' meeting, this bug appears to be affecting a number of people with "annoying regularity", and it has been happening for a few weeks now.

ec2test changed recently, but evidence suggests this bug predates the ec2test changes.

summary: - ec2test sometimes hangs on the first windmill test
+ ec2test sometimes hangs when running the windmill test suite
Māris Fogels (mars) wrote :

Here is the process tree for a hung system. You can clearly see that the RegistryWindmillLayer is stuck, and that there are zombie firefox and memcached processes.

Changed in launchpad-foundations:
assignee: nobody → Māris Fogels (mars)
status: Triaged → In Progress
Māris Fogels (mars) wrote :

It appears the test_on_merge.py script should be killing the testrunner after 15 minutes of inactivity, but this is not happening. Fixing this will at least cause the server to shut down earlier. See bug 578886 about that.
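
For readers unfamiliar with the pattern, an inactivity watchdog can be sketched roughly as follows. This is not the code in test_on_merge.py; `run_with_inactivity_timeout` is a hypothetical helper, written for modern Python 3 on a POSIX system. It kills the child when no output arrives within `idle_seconds` (15 minutes would be `idle_seconds=900`).

```python
import os
import select
import subprocess
import sys


def run_with_inactivity_timeout(command, idle_seconds):
    """Run command, killing it if it prints nothing for idle_seconds.

    Returns (exit_code, killed, output_bytes). POSIX-only: select() does
    not work on pipes on Windows.
    """
    child = subprocess.Popen(
        command, stdout=subprocess.PIPE, stderr=subprocess.STDOUT)
    fd = child.stdout.fileno()
    chunks = []
    killed = False
    while True:
        # Wait until the child writes something or the idle budget runs out.
        readable, _, _ = select.select([fd], [], [], idle_seconds)
        if not readable:
            child.kill()  # no output for idle_seconds: assume it is hung
            killed = True
            break
        chunk = os.read(fd, 4096)
        if not chunk:  # EOF: the child closed its output
            break
        chunks.append(chunk)
    child.stdout.close()
    return child.wait(), killed, b"".join(chunks)


if __name__ == "__main__":
    # A child that goes silent for longer than the idle budget gets killed.
    silent = [sys.executable, "-c", "import time; time.sleep(60)"]
    print(run_with_inactivity_timeout(silent, idle_seconds=1.0))
```

Unlike a fixed wall-clock limit, this lets a long but healthy run continue as long as it keeps producing output.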

Ursula Junque (ursinha) wrote : Bug fixed by a commit
Changed in launchpad-foundations:
milestone: none → 10.05
status: In Progress → Fix Committed
tags: added: qa-needstesting
Māris Fogels (mars)
Changed in launchpad-foundations:
status: Fix Committed → In Progress
Curtis Hovey (sinzui)
Changed in launchpad-foundations:
milestone: 10.05 → 10.06
Curtis Hovey (sinzui)
tags: removed: qa-needstesting
Māris Fogels (mars) wrote :

This bug was ultimately caused by a cascading failure of our test infrastructure, rooted in a bug in the zope.testing package's subunit output. See bug 589787 for the root cause.

Māris Fogels (mars)
Changed in launchpad-foundations:
status: In Progress → Fix Committed
Ursula Junque (ursinha) wrote :
tags: added: qa-needstesting
Māris Fogels (mars)
tags: added: qa-done
removed: qa-needstesting
Ursula Junque (ursinha)
tags: added: qa-ok
removed: qa-done
Curtis Hovey (sinzui)
Changed in launchpad-foundations:
status: Fix Committed → Fix Released