Timeout running branch scanner job

Bug #808930 reported by Diogo Matsubara on 2011-07-11
24
This bug affects 3 people
Affects Status Importance Assigned to Milestone
Launchpad itself
Critical
Aaron Bentley

Bug Description

OOPS-2017SMS6 OOPS-2007SMS10 TimeoutError: Job ran too long.

Francis J. Lacoste (flacoste) wrote :

Escalated by James as it interferes with Linaro code review workflows.

tags: added: escalated linaro

Half of the OOPSes are generally useless in that they contain no proper traceback.

It seems the relevant OOPSes for branches above are OOPS-2076SMS9, OOPS-2075SMS15, OOPS-2075SMS17, OOPS-2075SMS19 (btw, each of these OOPSes is paired with a relatively useless OOPS after these which is a re-raised TimeoutError from within _logTimeout which we should be able to get rid of).

From IRC:

<abentley> danilos: I think the way to close this bug is to optimize bzr.
 danilos: I would start by downloading the branch locally and profiling the script.
<abentley> danilos: it is already using the fastest bzrlib operations I know of. I wrote the script, and I also wrote the Graph operations, several years before that.
<abentley> danilos: it's possible the algorithm could be improved. it's also possible that we have to switch to C or Pyrex to get that performance.
 danilos: Or we may need to implement a greatest-distance-from-origin cache in order to get good performance.
 danilos: jam has also done a lot of work in this area, and may have advice.

(note that the above OOPSes are not available yet because it seems our OOPS scanner is not scanning ackee OOPSes yet)

William Grant (wgrant) wrote :

We had some more fun cascading scanner failures this evening.

2011-09-14 09:00:04 INFO Creating lockfile: /var/lock/launchpad-runscanbranches.lock
2011-09-14 09:00:11 INFO Running through Twisted.
2011-09-14 09:00:15 INFO Running BranchScanJob (ID 9734101).
2011-09-14 09:00:26 INFO Running BranchScanJob (ID 9734103).
2011-09-14 09:01:02 INFO Creating lockfile: /var/lock/launchpad-runscanbranches.lock
2011-09-14 09:02:02 INFO Creating lockfile: /var/lock/launchpad-runscanbranches.lock
2011-09-14 09:03:02 INFO Creating lockfile: /var/lock/launchpad-runscanbranches.lock
2011-09-14 09:04:02 INFO Creating lockfile: /var/lock/launchpad-runscanbranches.lock
2011-09-14 09:05:03 INFO Creating lockfile: /var/lock/launchpad-runscanbranches.lock
2011-09-14 09:05:29 INFO Job resulted in OOPS: OOPS-2083SMS11
2011-09-14 09:05:32 INFO Running BranchScanJob (ID 9734104).
Unhandled error in Deferred:
Unhandled Error
Traceback (most recent call last):
Failure: twisted.internet.error.ProcessTerminated: A process has ended with a probable error condition: process ended with exit code 42.
Unhandled Error
[... snip the usual traceback ...]
  File "/srv/bzrsyncd.launchpad.net/production/launchpad-rev-13921/lib/lp/services/job/runner.py", line 441, in update
    if response['success']:
exceptions.TypeError: 'NoneType' object is unsubscriptable
2011-09-14 09:05:32 INFO Running BranchScanJob (ID 9734109).
Unhandled error in Deferred:
Unhandled Error
Traceback (most recent call last):
Failure: twisted.internet.error.ProcessTerminated: A process has ended with a probable error condition: process ended with exit code 42.
[... snip ...]

gmb noticed that his lp:~gmb/launchpad/scratch hadn't scanned, so pushed it again once the tracebacks had stopped. The tracebacks immediately started again. Examining the hanging and subsequent job IDs from the snippet above and the second incident, we get <https://pastebin.canonical.com/52731/>. 9734103 and 9734574 are the two that hung, and are both for branch 62300, which is lp:~gmb/launchpad/scratch.

Something about the job to bring that branch (which isn't new -- it was last scanned more than a year ago, and was originally older still) up to date causes the scanner to take more than 5 minutes. We may finally be able to track this down.

William Grant (wgrant) wrote :

Also note from the paste that the jobs mentioned in the fallout tracebacks were not actually marked as failed, so the failure doesn't seem to be harmful to anything but the actual problematic job.

Aaron Bentley (abentley) wrote :

Danilos, the reason two oopses are emitted is because if the timeout happens outside the job-running code, only the "relatively useless" oops will be emitted.

Michael Hope (michaelh1) wrote :

bzr 2.3.4 and 2.4.0 are significantly faster for many operations. I wonder if they'd help? I generate the diff between the ancestor and merge request as part of my system and, although I haven't measured it, a straight bzr diff -r xxx..yyy takes less than a minute.

Michael Hope (michaelh1) wrote :

This is causing more pain. See the list of code reviews at:
 https://code.launchpad.net/gcc-linaro/+activereviews

Most of the 'ready to land' proposals have been landed, but the branch scanner isn't detecting this and isn't closing them out. It's hard to tell what needs to be done.

Abel Deuring (adeuring) wrote :

Running a version of cronscripts/scan_branches.py modified for profiling:

--- cronscripts/scan_branches.py 2011-06-20 20:07:33 +0000
+++ cronscripts/scan_branches.py 2011-10-10 11:13:03 +0000
@@ -8,11 +8,12 @@
 __metaclass__ = type

 import _pythonpath
+import cProfile

 from lp.code.interfaces.branchjob import IBranchScanJobSource
 from lp.services.job.runner import (
     JobCronScript,
- TwistedJobRunner,
+ JobRunner,
     )

@@ -23,10 +24,17 @@
     source_interface = IBranchScanJobSource

     def __init__(self):
- super(RunScanBranches, self).__init__(runner_class=TwistedJobRunner)
+ super(RunScanBranches, self).__init__(runner_class=JobRunner)

 if __name__ == '__main__':

     script = RunScanBranches()
- script.lock_and_run()
+ import os
+ all_names = os.listdir('.')
+ index = 0
+ fname = 'scan-branches-%04i.profile' % index
+ while fname in all_names:
+ index += 1
+ fname = 'scan-branches-%04i.profile' % index
+ cProfile.run('script.lock_and_run()', fname)

for this branch: for this branch
bzr+ssh://bazaar.launchpad.net/~ramana/gcc-linaro/fix-lp-838994/

shows that approximately 50% (604s) of the total run time (1196s)
is spent in Storm. (note: of this time, only 268s are spent in
psycopg2._psycopg.cursor.execute(), and 417s in
storm.database.raw_execute())

The only obvious hotspot I am seeing is the method
acquireRevisionAuthor() in lp.code.mode.revision, which needs
464s.

Danilo dug up several OOPSes, where the timeout errors always
occur in methods called by planDatabaseChanges(). Interestingly,
this method has a runtime of only 4.3s in my test run.

Benji York (benji) on 2011-10-28
Changed in launchpad:
assignee: nobody → Benji York (benji)
status: Triaged → In Progress
Benji York (benji) wrote :

Here's a version of the patch without dropped whitespace: http://paste.ubuntu.com/728245/

Benji York (benji) wrote :

Michael Hope (michaelh1) wrote on 2011-09-21:
> bzr 2.3.4 and 2.4.0 are significantly faster for many operations.

The newer Bazaar in use by Launchpad does appear to be a win here. In profiling branch scan runs I see much better performance. I'll mark this Fix Released until evidence to the contrary arises.

Changed in launchpad:
status: In Progress → Fix Released
Francis J. Lacoste (flacoste) wrote :

Michael Hope reports that this merge proposal from 2 days ago is still broken: https://code.launchpad.net/~irar/gcc-linaro/slp-for-any-nops-4.6/+merge/82130

Changed in launchpad:
status: Fix Released → In Progress

If most of the time is spent in Storm then we may need to do more than
just update bzr.

In the mean time perhaps the timeout can be increased?

Robert Collins (lifeless) wrote :

The scanner causes UI timeouts, running it for longer means more
likelyhood of those timeouts occuring.

So no, this is a case where we need to fix it.

Benji York (benji) on 2011-12-13
Changed in launchpad:
assignee: Benji York (benji) → Graham Binns (gmb)
Graham Binns (gmb) wrote :

It seems likely that bug 904683 is symptomatic of this bug.

Francesco Banconi (frankban) wrote :

Same problem for me today:
OOPS-0ab3b93f9dcb0b2aecdc0dcac9993c85 and OOPS-23e8246e0c6eee1e8e2303fc34753355
Related merge proposals are not broken.

Graham Binns (gmb) on 2012-01-10
Changed in launchpad:
status: In Progress → Fix Committed
status: Fix Committed → Triaged
assignee: Graham Binns (gmb) → nobody
Jelmer Vernooij (jelmer) wrote :

I'm landing two fixes that will reduce the number of queries done by the scanner and might fix some of the timeouts.

See bug 903703 and bug 930166

Aaron Bentley (abentley) wrote :

I've landed some improvements that make branch scanning faster (r14865), so you can try triggering a scan (by pushing to the relevant branches) and see if it works now.

Changed in launchpad:
status: Triaged → In Progress
assignee: nobody → Aaron Bentley (abentley)
William Grant (wgrant) wrote :

We probably have a few different bugs here, but here's some diagnosis from experiments with lp:~wallyworld/launchpad/create-aag-privacy-transitions-1009364, a Launchpad branch which is now permanently cursed and unable to scan within the timeout.

wallyworld pushed the branch at around 01:04. It starts to scan, but times out 5 minutes later.

2012-06-18 01:05:11 INFO Running <SCAN_BRANCH branch job (4134693) for ~wallyworld/launchpad/create-aag-privacy-transitions-1009364> (ID 13838195).
2012-06-18 01:10:15 INFO Job resulted in OOPS: OOPS-415b4f54e93cb4a66345cf3807221142

It is probably relevant that MP 110705 was created for this branch seconds before 01:06:40, although no job appears to have run for it until 01:22:10.

Another scan is attempted at 01:11:10, timing out at 01:16:23 with OOPS-d214e0439183a7ec5ccb2bf73c34efec. wallyworld pushed it to a new name at 01:22, and the new branch scanned successfully in 109s.

I noticed the accursed branch and coordinated with wallyworld and spm to debug it a little. A scan was forced by pushing a new rev at 03:33, and the scan timed out as expected with OOPS-d83500d8b2cd4f2355127bb914508982. The branch state after this was https://pastebin.canonical.com/68306/ -- the most notable takeaway being that the branch has a full set of BranchRevisions, but no last_scanned_id. This was likely the state after the initial failed scan.

https://pastebin.canonical.com/68305/ is a series of snapshots of the relevant bits of pg_stat_activity throughout the scan. They reveal that it spent more than 4 minutes in the existing_branchrevisions query of planDatabaseChanges, checking for existing BranchRevisions for more than 122000 revids. While that query should in principle be satisfiable in just a second or so, it's not inconceivable that the parsing and planning overhead could be somewhat ridiculous. It may also generate a terrible plan when run with 122000 revids on a branch which already has 122000 BranchRevisions, as that's a situation that would not normally be encountered.

While that query explains the slowness of the scan that we watched, it doesn't explain the timeout of the initial one. There would have been no BranchRevisions to cause the slow query, and it managed to get through the slow bit to create and commit the new BranchRevisions, but failed to set last_scanned_id. This means that it got past the post-BranchRevision commit (bzrsync.py:109) without trouble, but then timed out during TipChanged, updateBranchStatus or ScanCompleted. So there was probably some other long job holding a write lock on the branch row.

So there's at least two separate bugs here: the slow query when searching a full set of BranchRevisions for a full set of BranchRevisions, and the thing that's probably keeping the branch row write-locked. I have a suspicion that the latter is related to the merge proposal that was created during the scan, but I can find no evidence that the job was running at that time.

Aaron Bentley (abentley) wrote :

I've filed bug #1018460 for the issue that a failed initial scan causes later scans to fail.

Francis J. Lacoste (flacoste) wrote :

All fixes related to this are now live.

Changed in launchpad:
status: In Progress → Fix Released
To post a comment you must log in.
This report contains Public information  Edit
Everyone can see this information.

Other bug subscribers