infinite recursion trying to format NotBranchError

Bug #687653 reported by Steve McInerney on 2010-12-09
44
This bug affects 1 person
Affects Status Importance Assigned to Milestone
Bazaar
Critical
Andrew Bennetts
Launchpad itself
Critical
Aaron Bentley
bzr (Ubuntu)
Undecided
Unassigned
Maverick
Undecided
Unassigned

Bug Description

Have seen at least 2 branches apparently send codehost into a recursion death sprial.
eg:
https://pastebin.canonical.com/40701/
https://pastebin.canonical.com/40702/
https://pastebin.canonical.com/40703/

note to anyone else - do NOT try and gdb bt one of these without pager being on (the default is on).

https://code.launchpad.net/tsumego20k is the project and code that's notably causing the pain.

Have had to kill the processes off, eventually with enough attempts - eating swap and cpu.

Related branches

Steve McInerney (spm) on 2010-12-09
tags: added: canonical-losa-lp
Steve McInerney (spm) wrote :

marked as critical as this causes severe pain on codehost, necessitating process kills on the offending.

Changed in launchpad-code:
importance: Undecided → Critical
Andrew Bennetts (spiv) wrote :

Maybe related to, or even the same as, <https://bugs.launchpad.net/bugs/668176>.

Martin Pool (mbp) wrote :

marked private for now, because it can generate high load on the server because of client actions

security vulnerability: no → yes
visibility: public → private
Andrew Bennetts (spiv) wrote :

The recursion chain is pretty long, here's what it appears to be:

repr(NotBranchError) -> calls open_repository -> calls into codehosting vfs -> XMLRPC -> uses Twisted callbacks -> something breaks -> Twisted stringifies the error/traceback -> repr(NotBranchError) -> ...

It's a new instance of NotBranchError in every round of the recursion, although presumably with the same parameters.

It's hard for bzrlib to mitigate this without losing functionality (it calls open_repository there to give more helpful errors to users in the case of "bzr branch shared_repo_with_no_branch", which happens sometimes and can be confusing for new users, and it doesn't call open_repository until error formatting time to avoid paying the price of the IO unless the error is being displayed somewhere). bzrlib could do something ugly like set a global flag while in NotBranchError's _format method, and if it is already set skip the open_repository call. That would probably work, although that's what I thought about the previous mitigation added to bzrlib.

It would also paper over the other factors here: why is NotBranchError being triggered and propagated? Can we fix Twisted to avoid expensive and dangerous stringifying here?

I spent a lot of time yesterday looking over the codehosting code and I don't see an obvious way to trigger this sort of problem, but clearly there is one lurking somewhere.

We don't seem to have a reliable way to reproduce this problem. Part of what confounded us yesterday was the branches on disk were changing while we were investigating as the user worked on them. e.g. at one point there was an empty bzrdir (no repo, branch or tree), then a few minutes later it had a branch and repo.

Other facts from the pasted traceback:
 * there's a NoSuchFile for /srv/bazaar.launchpad.net/mirrors/00/06/43/cc/.bzr/repository/format in some frames. Presumably the result of the open_repository call.
 * The branch involved appears to be <lp:~goadict/tsumego20k/release-1.0>.
 * the XML-RPC call is almost certainly translatePath

Other observations:
 * some of the processes appeared to be connected to /+branch/...
 * some processes appeared to be for non-existent branches under +branch, e.g. BRANCH:%2Bbranch/tsumego20k/1.0
 * the ps listing of all lp-serve processes showed both /+branch/... and /%2Bbranch/... which seems a bit odd
 * the branches for this project are small, and several lp-serve processes were stuck on it. I think that suggests it's not a race condition with e.g. removing a branch via the web UI (or if it is a race, it's remarkably easy to hit repeatedly)

Many of these observations are likely to be red herrings :(

Martin Pool (mbp) on 2010-12-10
Changed in bzr:
status: New → Confirmed
importance: Undecided → Critical
Martin Pool (mbp) wrote :

<poolie> <poolie> istm that NotBranchError shouldn't be opening the repository
<poolie> <poolie> i think that's a pretty weird thing for an exception class to ever do
<spiv> Yeah, it is a bit weird.
<spiv> Maybe removing it would be best, but it would be a regression in user experience unless we think of an alternative.
<spiv> https://bugs.launchpad.net/bzr/+bug/440952 is why we added that.
<_mup_> Bug #440952: be more helpful when attempting to branch a shared repo <Bazaar:Fix Released by spiv> <https://launchpad.net/bugs/440952>

Martin Pool (mbp) wrote :

I think as a general principle we really shouldn't be doing much processing when formatting an exception, and especially not large complicated (relatively) slow things like opening a repository.

The question then is whether we can keep the benefits of <https://code.launchpad.net/~spiv/bzr/bug-440952-bzrdir/+merge/17183> without trying to do IO within the exception. The specific thing that was wanted there was that trying to branch a directory that contains only a repository not a branch should give you a sensible error.

If we're going to fix this we ought to do it in (at least) the 2.2 branch, so that it can easily be deployed onto Launchpad. The bug is probably also present in 2.1, but since it seems a bit hard to hit and nobody else has reported problems, I'm not convinced it's worth merging the fix there.

Changed in bzr:
assignee: nobody → Martin Pool (mbp)
status: Confirmed → In Progress
Martin Pool (mbp) on 2010-12-10
summary: - codehost error handling infinite recursion death spiral of doom
+ infinite recursion trying to format NotBranchError
Andrew Bennetts (spiv) wrote :

<https://code.launchpad.net/~spiv/bzr/just-add-repr-687653-2.2/+merge/43319> has been cowboyed onto production by spm, and poolie indicated he's happy with it for bzr 2.2.

Tim Penhey (thumper) on 2010-12-12
Changed in launchpad-code:
status: New → Triaged
Martin Pool (mbp) wrote :

Spiv has an mp <https://code.launchpad.net/~spiv/bzr/just-add-repr-687653-2.2/+merge/43319> that fixes this, which is approved for 2.2.

lp:~mbp/bzr/doc adds a coding guideline to try to avoid this in future.

I don't think any specific changes are required in launchpad-bazaar beyond retaining the cowboyed patch until you next update to bzr 2.2.

It would be nice if we could actually try to reproduce this and confirm it's gone.

Changed in bzr:
assignee: Martin Pool (mbp) → Andrew Bennetts (spiv)
Andrew Bennetts (spiv) wrote :

Fix landed on lp:bzr/2.2

Changed in bzr:
milestone: none → 2.2.3
status: In Progress → Fix Released
Changed in launchpad:
status: Triaged → In Progress
assignee: nobody → Aaron Bentley (abentley)
milestone: none → 11.01
tags: added: qa-needstesting
Changed in launchpad:
status: In Progress → Fix Committed
Aaron Bentley (abentley) on 2011-01-07
tags: added: qa-ok
removed: qa-needstesting
Curtis Hovey (sinzui) on 2011-01-13
Changed in launchpad:
status: Fix Committed → Fix Released
Vincent Ladeuil (vila) on 2011-01-20
visibility: private → public
Vincent Ladeuil (vila) on 2011-01-28
tags: added: sru
Vincent Ladeuil (vila) wrote :

Dear ~ubuntu-sru, we'd like to get this into maverick-updates.
Per <https://wiki.ubuntu.com/StableReleaseUpdates/MicroReleaseExceptions>, we have an exception to take all of 2.2.3, conditional on running the tests during the package build.

The tests have been run and passed successfully.

jelmer prepared a branch for this: lp:~jelmer/ubuntu/maverick/bzr/proposed-2.2.3

affects: ubuntu → bzr (Ubuntu)
Max Bowsher (maxb) wrote :

Dear all,

bzr 2.2.3 contains a regression in communication with the Launchpad web service for commands like "bzr lp-propose" - see bug 707075.

Therefore this SRU needs to be aborted.

Martin Pitt (pitti) wrote :

Max,

it hasn't even been uploaded/accepted yet, so nothing to abort yet.

Max Bowsher (maxb) wrote :

Yes, but by commenting now, I can save anyone the effort of uploading/reviewing :-)

Thanks, Max. We should do a 2.2.4 with the fix and try to SRU that.

Kees Cook (kees) on 2011-02-04
security vulnerability: yes → no
Jelmer Vernooij (jelmer) on 2011-03-13
Changed in bzr (Ubuntu):
status: New → Fix Released

Accepted bzr into maverick-proposed, the package will build now and be available in a few hours. Please test and give feedback here. See https://wiki.ubuntu.com/Testing/EnableProposed for documentation how to enable and use -proposed. Thank you in advance!

Changed in bzr (Ubuntu Maverick):
status: New → Fix Committed
tags: added: verification-needed
Martin Pool (mbp) wrote :

SRU verification:

This is not trivial to reproduce in isolation (see #4 above), though probably possible if someone wants to write Python code for it. I have inspected the installed copy of errors.py in maverick 2.2.4-0ubuntu1, and I can confirm that it has the code that's not prone to recursion.

so, unless someone objects, verification done

tags: added: verification-done
removed: verification-needed
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package bzr - 2.2.4-0ubuntu1

---------------
bzr (2.2.4-0ubuntu1) maverick-proposed; urgency=low

  [ Jelmer Vernooij ]
  * Update watch file to use 2.2 series.
  * New upstream release.
   + Fixes closing of leaked sockets to SSH subprocesses, which causes
     dput sftp uploads to hang. LP: #659590
   + Fixes the use of 'lp:' urls behind a http proxy. LP: #558343
   + Correctly sets the Content-Type header when http POSTing to comply
     with stricter web frameworks. LP: #665100
   + Fixes propagating tags to the master branch in a bound branch or
     heavyweight checkout. LP: #603395
   + Fixes the use of 'bzr resolve --take-other' if the file is
     involved in an unresolved text conflict. LP: #646961
   + Fixes https access with newer versions of python2.7. LP: #693880
   + Fixes crash during pack caused by a concurrent repository pack
     operation. LP: #701940
   + Fixes communication with the Launchpad web service when using
     launchpadlib >= 1.5.5. LP: #707075
   + Switches away from deprecated 'edge.launchpad.net' LP: #583667
   + Fixes resolving of content (and path) conflicts for files in subdirs.
     LP: #660935
   + Fixes nasty recursion loop while displaying branch opening error.
     LP: #687653

  [ Martin Pool ]
  * Propose for maverick SRU.
 -- Martin Pool <email address hidden> Thu, 07 Apr 2011 15:30:17 +1000

Changed in bzr (Ubuntu Maverick):
status: Fix Committed → Fix Released
To post a comment you must log in.
This report contains Public information  Edit
Everyone can see this information.

Other bug subscribers