ec2test sometimes hangs when running the windmill test suite

Bug #570380 reported by Māris Fogels on 2010-04-26

This bug affects 3 people

Affects:	Launchpad itself
Status:	Fix Released
Importance:	High
Assigned to:	Māris Fogels
Milestone:	Launchpad itself 10.07

Bug Description

On occasion our test suite will hang when loading the first windmill test via ec2test. This only appears to affect developer systems running ec2test.

If you encounter this problem, rerunning the test suite should work.

The following stack trace was pulled from a hung system:

> Thread 2
> #0 0x00002b85227fd7fb in accept () from None
> #1 0x00002b852388f947 in sock_accept (s=0x94409c0) from /build/buildd/python2.5-2.5.2/Modules/socketmodule.c
> /usr/lib/python2.5/socket.py (167): accept
> /usr/lib/python2.5/SocketServer.py (374): get_request
> /usr/lib/python2.5/SocketServer.py (216): handle_request
> /var/launchpad/tmp/eggs/windmill-1.3beta3_lp_r1440-py2.5.egg/windmill/server/https.py (394): start
> /usr/lib/python2.5/threading.py (445): run
> /usr/lib/python2.5/threading.py (469): __bootstrap_inner
> /usr/lib/python2.5/threading.py (461): __bootstrap

MaxB said: "This must be the culprit of the hang; it appears similar to one I've been looking at for the Python 2.6 migration. Whatever was supposed to knock this thread out of its accept loop hasn't."
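
For illustration, here is a minimal sketch of an accept loop that can be knocked out of accept(). It is not Windmill's code: the function names, the stop event, and the polling interval are all assumptions made for the example, and it is written for current Python 3 rather than the Python 2.5 shown in the trace.

```python
import socket
import threading


def serve_until_stopped(listener, stop_event, poll_seconds=0.5):
    """Accept connections until stop_event is set; return how many were handled."""
    # With a timeout set, accept() raises socket.timeout instead of blocking forever.
    listener.settimeout(poll_seconds)
    handled = 0
    while not stop_event.is_set():
        try:
            conn, _addr = listener.accept()
        except socket.timeout:
            continue  # no client yet; loop around and re-check the stop flag
        conn.close()
        handled += 1
    return handled


def run_demo():
    """Start the loop in a thread, ask it to stop, and report whether it exited."""
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.bind(("127.0.0.1", 0))  # port 0: let the OS pick a free port
    listener.listen(1)
    stop_event = threading.Event()
    worker = threading.Thread(
        target=serve_until_stopped, args=(listener, stop_event, 0.1)
    )
    worker.start()
    stop_event.set()  # the "knock" that the hung thread never received
    worker.join(timeout=5)
    listener.close()
    return not worker.is_alive()
```

A thread blocked in a plain accept() with no timeout, as in the trace above, only wakes when a client connects, so a stop flag alone would not be enough there.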

Attached are two log files from two simultaneously hung test runs. They both hang in the same place, just after the RegistryWindmillLayer setUp() function.

The following thread is also relevant: https://lists.launchpad.net/launchpad-dev/msg03237.html

Related branches

lp:~mars/launchpad/disable-windmill-tests

Merged into lp:launchpad at revision 10873

Gary Poster (community): Approve on 2010-05-17

lp:~mars/launchpad/re-enable-windmill-ec2-suite

Merged into lp:launchpad at revision 10973

Leonard Richardson (community): Approve on 2010-06-08

Māris Fogels (mars) wrote on 2010-04-26:

2010-04-26 01 current_test.log.gz (180.8 KiB, application/octet-stream)

Māris Fogels (mars) wrote on 2010-04-26:

2010-04-26 02 current_test.log.gz (175.8 KiB, application/octet-stream)

Māris Fogels (mars) wrote on 2010-04-26:

Quoted from the mailing list thread:

"Windmill implements a custom HTTPS web server, which is waiting for data. I would guess that something in the web browser itself hung: either loading the test harness, loading the site under test, or passing back test results. We need a log file to know for sure."

I think the next thing we should do is look for a robust mechanism for running and restarting hung tests. Even if we find what is hanging in the test harness (browser, server, network, moon phase) we will probably end up fixing it with a full restart anyway.
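
As a rough sketch of what "running and restarting hung tests" could look like, the following wrapper reruns a command when it exceeds a wall-clock limit. Everything here is an assumption for illustration (the function name, the limit, the attempt count, and the stand-in command); it is not how ec2test works.

```python
import os
import subprocess
import sys
import tempfile


def run_with_retries(command, timeout_seconds, max_attempts=3):
    """Run command, restarting it if it exceeds timeout_seconds.

    Returns (attempts_used, returncode); returncode is None if every
    attempt hung.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            completed = subprocess.run(command, timeout=timeout_seconds)
        except subprocess.TimeoutExpired:
            continue  # subprocess.run kills the child on timeout; try again
        return attempt, completed.returncode
    return max_attempts, None


def demo():
    """Stand-in test suite: hangs on its first run, passes on the second."""
    script = (
        "import os, sys, time\n"
        "marker = sys.argv[1]\n"
        "if not os.path.exists(marker):\n"
        "    open(marker, 'w').close()\n"
        "    time.sleep(60)\n"
    )
    with tempfile.TemporaryDirectory() as tmp:
        marker = os.path.join(tmp, "first-run-done")
        command = [sys.executable, "-c", script, marker]
        return run_with_retries(command, timeout_seconds=3.0)
```

A fixed wall-clock limit is the simplest option; it does not distinguish a slow run from a hung one, which is why an inactivity-based limit may be a better fit for a long test suite.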

Māris Fogels (mars) on 2010-04-26

description:	updated

Michael Hudson-Doyle (mwhudson) wrote on 2010-04-26:

I have this semi-suspicion that firefox is dying on us somehow. I don't know how to confirm or refute this suspicion -- there might be something in .xsession-errors?

Māris Fogels (mars) wrote on 2010-04-27: Re: [Bug 570380] Re: ec2test sometimes hangs on the first windmill test

Firefox could be outright dying, or it could be blocking on a dialog
window. There is no way to know without running the test suite using
a live console. Since this problem is intermittent, I suspect it is
simply the browser dying, and not blocking. If it were blocking on a
dialog, it would die every time, not at random.

I do not know how frequently the browser hangs, so I assume it
would be annoying to try to catch the hang by watching a live window.

Even if we do discover what is hanging, I am not sure we will be able
to do anything besides restarting the suite anyway.

I am hesitant to hack windmill itself to fix this. Adding both
browser death detection and the ability to restart the browser
mid-test is more work than we can do. However, we may be able to add
just the browser death detection to windmill, cause windmill to die
really loudly, then use that death to signal a retry for our own
suite. Worth thinking about.
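
A minimal sketch of the "detect browser death and die loudly" idea, assuming the harness holds a subprocess handle for the browser. The names BrowserDied, wait_for_results, and results_ready are hypothetical, and nothing here is part of windmill's API; a short-lived Python child process stands in for Firefox.

```python
import subprocess
import sys
import time


class BrowserDied(Exception):
    """Raised when the watched process exits before results arrive."""


def wait_for_results(process, results_ready, timeout=60.0, poll_seconds=0.1):
    """Poll until results_ready() is true; fail loudly if the process dies first."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if results_ready():
            return True
        # poll() returns None while the child is running; once it has
        # exited, poll() also reaps it so it does not linger as a zombie.
        returncode = process.poll()
        if returncode is not None:
            raise BrowserDied("browser exited with code %d" % returncode)
        time.sleep(poll_seconds)
    raise TimeoutError("no results after %.0f seconds" % timeout)


def demo():
    # Stand-in for Firefox: a child process that exits at once with status 3.
    fake_browser = subprocess.Popen(
        [sys.executable, "-c", "import sys; sys.exit(3)"]
    )
    try:
        wait_for_results(fake_browser, results_ready=lambda: False, timeout=10.0)
    except BrowserDied as error:
        return str(error)
    return "no failure detected"
```

The caller can treat BrowserDied as the signal to retry the suite, rather than waiting on a server that will never receive results.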

Michael Hudson-Doyle (mwhudson) wrote on 2010-04-29: Re: ec2test sometimes hangs on the first windmill test

Firefox definitely isn't blocking -- there's only a zombie firefox on the system when it's all hung up.

Loud clanging and horns when firefox dies sounds like it would be a useful addition to windmill, it might even help the windmill developers fix the problem one day...

Michael Hudson-Doyle (mwhudson) wrote on 2010-04-29:

Oh, and it's not always the first windmill test -- I just had a case when it was at least the third chunk of windmill tests to run.

Māris Fogels (mars) wrote on 2010-05-05:

Guilherme confirms that it doesn't always happen on the first test.

Another symptom of this may be the disappearance of the ec2 instance entirely. ec2 instances shut themselves down after 8 hours of inactivity.

As discussed in today's Reviewer's meeting, this bug also appears to be affecting a number of people with "annoying regularity". It has also been happening for a few weeks now.

ec2test changed recently, but evidence suggests this bug predates the ec2test changes.

summary:

- ec2test sometimes hangs on the first windmill test
+ ec2test sometimes hangs when running the windmill test suite

Māris Fogels (mars) wrote on 2010-05-05:

ps.output (7.2 KiB, text/plain)

Here is the process tree for a hung system. You can clearly see that the RegistryWindmillLayer is stuck, and that there are zombie firefox and memcached processes.

Changed in launchpad-foundations:
assignee:	nobody → Māris Fogels (mars)
status:	Triaged → In Progress

Māris Fogels (mars) wrote on 2010-05-11:

#10

It appears the test_on_merge.py script should be killing the testrunner after 15 minutes of inactivity, but this is not happening. Fixing this will at least cause the server to shut down earlier. See bug 578886 about that.
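
For reference, an inactivity timeout of this kind can be sketched as below. This is not the code in test_on_merge.py: the function name, the queue-based reader thread, and the short idle window used in the demo are assumptions chosen to keep the example small (the real limit discussed here is 15 minutes).

```python
import queue
import subprocess
import sys
import threading


def run_with_inactivity_timeout(command, idle_seconds):
    """Run command; kill it if it prints nothing for idle_seconds.

    Returns (killed, lines): killed is True when the watchdog fired.
    """
    process = subprocess.Popen(
        command, stdout=subprocess.PIPE, stderr=subprocess.STDOUT, text=True
    )
    output = queue.Queue()

    def pump():
        # Forward each output line to the queue as it arrives.
        for line in process.stdout:
            output.put(line)
        output.put(None)  # sentinel: the child closed its output

    threading.Thread(target=pump, daemon=True).start()
    lines = []
    killed = False
    while True:
        try:
            line = output.get(timeout=idle_seconds)
        except queue.Empty:
            process.kill()  # nothing printed within the window: assume a hang
            killed = True
            break
        if line is None:
            break
        lines.append(line.rstrip("\n"))
    process.wait()
    return killed, lines


def demo():
    # Stand-in test runner: prints one line, then goes silent.
    hung = [
        sys.executable, "-u", "-c",
        "import time; print('setUp done'); time.sleep(60)",
    ]
    return run_with_inactivity_timeout(hung, idle_seconds=3.0)
```

The timer restarts on every line of output, so a slow but progressing suite is left alone and only a silent one is killed.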

Ursula Junque (ursinha) wrote on 2010-05-18: Bug fixed by a commit

#11

Fixed in stable r10873 <http://bazaar.launchpad.net/~launchpad-pqm/launchpad/stable/revision/10873>

Changed in launchpad-foundations:
milestone:	none → 10.05
status:	In Progress → Fix Committed
tags:	added: qa-needstesting

Māris Fogels (mars) on 2010-05-18

Changed in launchpad-foundations:
status:	Fix Committed → In Progress

Curtis Hovey (sinzui) on 2010-05-31

Changed in launchpad-foundations:
milestone:	10.05 → 10.06

Curtis Hovey (sinzui) on 2010-06-01

tags:	removed: qa-needstesting

Māris Fogels (mars) wrote on 2010-06-08:

#12

This bug was ultimately caused by a cascading failure of our test infrastructure, rooted in a bug in the zope.testing package's subunit output. See bug 589787 for the root cause.

Māris Fogels (mars) on 2010-06-09

Changed in launchpad-foundations:
status:	In Progress → Fix Committed

Ursula Junque (ursinha) wrote on 2010-06-11:

#13

Fixed in stable r10973 <http://bazaar.launchpad.net/~launchpad-pqm/launchpad/stable/revision/10973>

tags:	added: qa-needstesting

Māris Fogels (mars) on 2010-06-15

tags:	added: qa-done
	removed: qa-needstesting

Ursula Junque (ursinha) on 2010-06-17

tags:	added: qa-ok
	removed: qa-done

Curtis Hovey (sinzui) on 2010-07-07

Changed in launchpad-foundations:
status:	Fix Committed → Fix Released
