supermirror-pull mirror sometimes hangs

Bug #121331 reported by Michael Hudson-Doyle
10
Affects Status Importance Assigned to Milestone
Launchpad itself
Fix Released
High
Michael Hudson-Doyle

Bug Description

We thought that not using pycurl any more might fix this, but it seems not.

Here's what gdb thinks we're doing:

(gdb) pystack
/usr/lib/python2.4/socket.py (71): ssl
/usr/lib/python2.4/httplib.py (1068): connect
/usr/lib/python2.4/httplib.py (1068): send
/usr/lib/python2.4/httplib.py (642): _send_output
/usr/lib/python2.4/httplib.py (671): endheaders
/usr/lib/python2.4/httplib.py (790): _send_request
/srv/sm-ng/production/launchpad/cronscripts/../lib/bzrlib/transport/http/_urllib2_wrappers.py (812): do_open
/srv/sm-ng/production/launchpad/cronscripts/../lib/bzrlib/transport/http/_urllib2_wrappers.py (501): https_open
/usr/lib/python2.4/urllib2.py (330): _call_chain
/usr/lib/python2.4/urllib2.py (368): _open
/usr/lib/python2.4/urllib2.py (341): open
/srv/sm-ng/production/launchpad/cronscripts/../lib/bzrlib/transport/http/_urllib.py (77): _perform
/srv/sm-ng/production/launchpad/cronscripts/../lib/bzrlib/transport/http/_urllib.py (77): _get
/srv/sm-ng/production/launchpad/cronscripts/../lib/bzrlib/transport/http/__init__.py (222): get
/srv/sm-ng/production/launchpad/cronscripts/../lib/bzrlib/bzrdir.py (222): probe_transport
/srv/sm-ng/production/launchpad/cronscripts/../lib/bzrlib/bzrdir.py (1248): find_format
/srv/sm-ng/production/launchpad/cronscripts/../lib/bzrlib/bzrdir.py (540): find_format
/srv/sm-ng/production/launchpad/cronscripts/../lib/bzrlib/transport/__init__.py (1069): do_catching_redirections
/srv/sm-ng/production/launchpad/cronscripts/../lib/bzrlib/bzrdir.py (1069): open_from_transport
/srv/sm-ng/production/launchpad/cronscripts/../lib/bzrlib/bzrdir.py (521): open
/srv/sm-ng/production/launchpad/cronscripts/../lib/bzrlib/branch.py (126): open
/srv/sm-ng/production/launchpad/cronscripts/../lib/canonical/launchpad/scripts/supermirror/branchtomirror.py (79): _openSourceBranch
/srv/sm-ng/production/launchpad/cronscripts/../lib/canonical/launchpad/scripts/supermirror/branchtomirror.py (79): mirror
/srv/sm-ng/production/launchpad/cronscripts/../lib/canonical/launchpad/scripts/supermirror/jobmanager.py (165): run
/srv/sm-ng/production/launchpad/cronscripts/../lib/canonical/launchpad/scripts/supermirror/__init__.py (28): mirror
/srv/sm-ng/production/launchpad/cronscripts/supermirror-pull.py (11): ?
(gdb) bt
#0 0xb7fca1ee in __read_nocancel () from /lib/tls/libpthread.so.0
#1 0xb7c5936b in BIO_sock_should_retry () from /usr/lib/i686/cmov/libcrypto.so.0.9.8
#2 0xb7c57013 in BIO_read () from /usr/lib/i686/cmov/libcrypto.so.0.9.8
#3 0xb7d14f0e in ssl23_read_bytes () from /usr/lib/i686/cmov/libssl.so.0.9.8
#4 0xb7d1426d in ssl23_connect () from /usr/lib/i686/cmov/libssl.so.0.9.8
#5 0xb7d205f4 in SSL_connect () from /usr/lib/i686/cmov/libssl.so.0.9.8
#6 0xb7d39c17 in PySocket_ssl (self=0x0, args=0xfffffe00) at /build/buildd/python2.4-2.4.3/Modules/_ssl.c:263
#7 0x080b63c7 in PyEval_EvalFrame (f=0x841cf1c) at ../Python/ceval.c:3563
[...]

lsof says we're connected to i5387A6BB.versanet.de, but experience says that this happens on more than just one host.

I don't really know how to go about fixing this, or even diagnosing it properly. Some kind of watchdog script that killed any supermirror-pull processes older than an hour or something might do it.

Revision history for this message
David Allouche (ddaa) wrote :

"I said it so", when we wrote the branch-puller script I argued that the actual bzr pull should be done in a subprocess, so we can 1. easily kill it if it hangs 2. happily go forward if it fail with any exception whatsoever.

I still think this would be the right way to handle this problem.

Jonathan Lange (jml)
Changed in launchpad-bazaar:
assignee: nobody → jml
importance: Undecided → High
status: New → Incomplete
Revision history for this message
Michael Hudson-Doyle (mwhudson) wrote :

We so have to fix/work around this soon. At least the database script monitoring coming in 1.1.7 will give us a way to find out this has happened more quickly... I killed a process that had been stuck for three days today :(

I would be much happier if I had any idea what the real problem was.

Revision history for this message
Tim Penhey (thumper) wrote :

Is this just a time-out on connect problem?

Revision history for this message
Michael Hudson-Doyle (mwhudson) wrote :

Timeout on connect is the problem we are seeing currently. This bug report is about something else, though -- for a start, the true hangs always seem associated with SSL.

Revision history for this message
David Allouche (ddaa) wrote :

In my opinion, it would be appropriate to use a big hammer here: pull branches in a subprocess, kill if we see no progress in some time.

Revision history for this message
Jonathan Lange (jml) wrote :

<jml> ddaa: what were you thinking of when you said "if we see no progress"
<ddaa> jml: similar mechanism to what we have in buildbot
<ddaa> 1. use a custom logger to print one line frequently (progress bar support enables this)
 2. have a periodic timer in a parent process
 3. kill subprocess if it has not produced any output in "a long time"

Revision history for this message
Jonathan Lange (jml) wrote :

Once we have the timeout system in place, there's still an underlying (perhaps SSL-related) bug that we have to fix.

Changed in launchpad-bazaar:
status: Incomplete → Triaged
Jonathan Lange (jml)
Changed in launchpad-bazaar:
status: Triaged → In Progress
Revision history for this message
Jonathan Lange (jml) wrote :

RF 5027 has landed, which means that a hanging connection won't hold up the entire mirroring process. Timeout still needs to be implemented though.

Revision history for this message
Jonathan Lange (jml) wrote :

Significant chunk done. Still need to implement timeout.

Changed in launchpad-bazaar:
milestone: 1.1.10 → 1.1.11
Revision history for this message
Jonathan Lange (jml) wrote :

There's a branch that adds timeout logic to the puller, but still needs to have its progress indicator hooked up to Bazaar.

The branch as-is is fairly self-contained, but landing it without the progress indicator will make lots of mirrored branches fail spuriously.
bzr+ssh://devpad.canonical.com/code/jml/launchpad/puller-timeout-bug-156897

Changed in launchpad-bazaar:
assignee: jml → nobody
Revision history for this message
Michael Hudson-Doyle (mwhudson) wrote :

Well, I don't know. Having a timeout of 12 hours shouldn't make anything fail spuriously and would still be an improvement.

Revision history for this message
Jonathan Lange (jml) wrote :

Then probably the best thing to do is to review the branch and see if it's acceptable to land with a bigger timeout.

Tim Penhey (thumper)
Changed in launchpad-bazaar:
milestone: 1.1.11 → 1.1.12
Revision history for this message
Jonathan Lange (jml) wrote :

I gather that the branch that I put up for review stagnated?

Revision history for this message
Michael Hudson-Doyle (mwhudson) wrote :

No, I'm working on it: bzr+ssh://devpad.canonical.com/code/mwh/launchpad/puller-timeout-bug-121331

Currently stalled on how to test the progress bar stuff (and the launchpad meeting).

Revision history for this message
Michael Hudson-Doyle (mwhudson) wrote :

RF launchpad/devel 5189.

(Strictly speaking it's not a "fix" that got committed, but it should make the problem go away, which is good enough for me)

Changed in launchpad-bazaar:
milestone: 1.1.12 → 1.1.11
status: In Progress → Fix Committed
Changed in launchpad-bazaar:
assignee: nobody → mwhudson
Changed in launchpad-bazaar:
status: Fix Committed → Fix Released
To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Duplicates of this bug

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.