supermirror-pull mirror sometimes hangs

Bug #121331 reported by Michael Hudson-Doyle on 2007-06-20
10
Affects Status Importance Assigned to Milestone
Launchpad itself
High
Michael Hudson-Doyle

Bug Description

We thought that not using pycurl any more might fix this, but it seems not.

Here's what gdb thinks we're doing:

(gdb) pystack
/usr/lib/python2.4/socket.py (71): ssl
/usr/lib/python2.4/httplib.py (1068): connect
/usr/lib/python2.4/httplib.py (1068): send
/usr/lib/python2.4/httplib.py (642): _send_output
/usr/lib/python2.4/httplib.py (671): endheaders
/usr/lib/python2.4/httplib.py (790): _send_request
/srv/sm-ng/production/launchpad/cronscripts/../lib/bzrlib/transport/http/_urllib2_wrappers.py (812): do_open
/srv/sm-ng/production/launchpad/cronscripts/../lib/bzrlib/transport/http/_urllib2_wrappers.py (501): https_open
/usr/lib/python2.4/urllib2.py (330): _call_chain
/usr/lib/python2.4/urllib2.py (368): _open
/usr/lib/python2.4/urllib2.py (341): open
/srv/sm-ng/production/launchpad/cronscripts/../lib/bzrlib/transport/http/_urllib.py (77): _perform
/srv/sm-ng/production/launchpad/cronscripts/../lib/bzrlib/transport/http/_urllib.py (77): _get
/srv/sm-ng/production/launchpad/cronscripts/../lib/bzrlib/transport/http/__init__.py (222): get
/srv/sm-ng/production/launchpad/cronscripts/../lib/bzrlib/bzrdir.py (222): probe_transport
/srv/sm-ng/production/launchpad/cronscripts/../lib/bzrlib/bzrdir.py (1248): find_format
/srv/sm-ng/production/launchpad/cronscripts/../lib/bzrlib/bzrdir.py (540): find_format
/srv/sm-ng/production/launchpad/cronscripts/../lib/bzrlib/transport/__init__.py (1069): do_catching_redirections
/srv/sm-ng/production/launchpad/cronscripts/../lib/bzrlib/bzrdir.py (1069): open_from_transport
/srv/sm-ng/production/launchpad/cronscripts/../lib/bzrlib/bzrdir.py (521): open
/srv/sm-ng/production/launchpad/cronscripts/../lib/bzrlib/branch.py (126): open
/srv/sm-ng/production/launchpad/cronscripts/../lib/canonical/launchpad/scripts/supermirror/branchtomirror.py (79): _openSourceBranch
/srv/sm-ng/production/launchpad/cronscripts/../lib/canonical/launchpad/scripts/supermirror/branchtomirror.py (79): mirror
/srv/sm-ng/production/launchpad/cronscripts/../lib/canonical/launchpad/scripts/supermirror/jobmanager.py (165): run
/srv/sm-ng/production/launchpad/cronscripts/../lib/canonical/launchpad/scripts/supermirror/__init__.py (28): mirror
/srv/sm-ng/production/launchpad/cronscripts/supermirror-pull.py (11): ?
(gdb) bt
#0 0xb7fca1ee in __read_nocancel () from /lib/tls/libpthread.so.0
#1 0xb7c5936b in BIO_sock_should_retry () from /usr/lib/i686/cmov/libcrypto.so.0.9.8
#2 0xb7c57013 in BIO_read () from /usr/lib/i686/cmov/libcrypto.so.0.9.8
#3 0xb7d14f0e in ssl23_read_bytes () from /usr/lib/i686/cmov/libssl.so.0.9.8
#4 0xb7d1426d in ssl23_connect () from /usr/lib/i686/cmov/libssl.so.0.9.8
#5 0xb7d205f4 in SSL_connect () from /usr/lib/i686/cmov/libssl.so.0.9.8
#6 0xb7d39c17 in PySocket_ssl (self=0x0, args=0xfffffe00) at /build/buildd/python2.4-2.4.3/Modules/_ssl.c:263
#7 0x080b63c7 in PyEval_EvalFrame (f=0x841cf1c) at ../Python/ceval.c:3563
[...]

lsof says we're connected to i5387A6BB.versanet.de, but experience says that this happens on more than just one host.

I don't really know how to go about fixing this, or even diagnosing it properly. Some kind of watchdog script that killed any supermirror-pull processes older than an hour or something might do it.

David Allouche (ddaa) wrote :

"I said it so", when we wrote the branch-puller script I argued that the actual bzr pull should be done in a subprocess, so we can 1. easily kill it if it hangs 2. happily go forward if it fail with any exception whatsoever.

I still think this would be the right way to handle this problem.

Jonathan Lange (jml) on 2007-07-09
Changed in launchpad-bazaar:
assignee: nobody → jml
importance: Undecided → High
status: New → Incomplete
Michael Hudson-Doyle (mwhudson) wrote :

We so have to fix/work around this soon. At least the database script monitoring coming in 1.1.7 will give us a way to find out this has happened more quickly... I killed a process that had been stuck for three days today :(

I would be much happier if I had any idea what the real problem was.

Tim Penhey (thumper) wrote :

Is this just a time-out on connect problem?

Michael Hudson-Doyle (mwhudson) wrote :

Timeout on connect is the problem we are seeing currently. This bug report is about something else, though -- for a start, the true hangs always seem associated with SSL.

David Allouche (ddaa) wrote :

In my opinion, it would be appropriate to use a big hammer here: pull branches in a subprocess, kill if we see no progress in some time.

Jonathan Lange (jml) wrote :

<jml> ddaa: what were you thinking of when you said "if we see no progress"
<ddaa> jml: similar mechanism to what we have in buildbot
<ddaa> 1. use a custom logger to print one line frequently (progress bar support enables this)
 2. have a periodic timer in a parent process
 3. kill subprocess if it has not produced any output in "a long time"

Jonathan Lange (jml) wrote :

Once we have the timeout system in place, there's still an underlying (perhaps SSL-related) bug that we have to fix.

Changed in launchpad-bazaar:
status: Incomplete → Triaged
Jonathan Lange (jml) on 2007-10-06
Changed in launchpad-bazaar:
status: Triaged → In Progress
Jonathan Lange (jml) wrote :

RF 5027 has landed, which means that a hanging connection won't hold up the entire mirroring process. Timeout still needs to be implemented though.

Jonathan Lange (jml) wrote :

Significant chunk done. Still need to implement timeout.

Changed in launchpad-bazaar:
milestone: 1.1.10 → 1.1.11
Jonathan Lange (jml) wrote :

There's a branch that adds timeout logic to the puller, but still needs to have its progress indicator hooked up to Bazaar.

The branch as-is is fairly self-contained, but landing it without the progress indicator will make lots of mirrored branches fail spuriously.
bzr+ssh://devpad.canonical.com/code/jml/launchpad/puller-timeout-bug-156897

Changed in launchpad-bazaar:
assignee: jml → nobody

Well, I don't know. Having a timeout of 12 hours shouldn't make anything fail spuriously and would still be an improvement.

Jonathan Lange (jml) wrote :

Then probably the best thing to do is to review the branch and see if it's acceptable to land with a bigger timeout.

Tim Penhey (thumper) on 2007-11-15
Changed in launchpad-bazaar:
milestone: 1.1.11 → 1.1.12
Jonathan Lange (jml) wrote :

I gather that the branch that I put up for review stagnated?

No, I'm working on it: bzr+ssh://devpad.canonical.com/code/mwh/launchpad/puller-timeout-bug-121331

Currently stalled on how to test the progress bar stuff (and the launchpad meeting).

RF launchpad/devel 5189.

(Strictly speaking it's not a "fix" that got committed, but it should make the problem go away, which is good enough for me)

Changed in launchpad-bazaar:
milestone: 1.1.12 → 1.1.11
status: In Progress → Fix Committed
Changed in launchpad-bazaar:
assignee: nobody → mwhudson
Changed in launchpad-bazaar:
status: Fix Committed → Fix Released
To post a comment you must log in.
This report contains Public information  Edit
Everyone can see this information.

Duplicates of this bug

Other bug subscribers