Bug #808930 “Timeout running branch scanner job” : Bugs : Launchpad itself

Revision history for this message

Michael Hope (michaelh1) wrote on 2011-09-06:

#1

The following merge requests are stuck due to the fault:
https://code.launchpad.net/~ramana/gcc-linaro/fix-lp-838994/+merge/74230
https://code.launchpad.net/~ams-codesourcery/gcc-linaro/merge-from-fsf-20110906-4.5/+merge/74213
https://code.launchpad.net/~ams-codesourcery/gcc-linaro/merge-from-fsf-20110906-4.6/+merge/74199

Revision history for this message

Francis J. Lacoste (flacoste) wrote on 2011-09-07:

#2

Escalated by James as it interferes with Linaro code review workflows.

tags:

added: escalated linaro

Revision history for this message

Данило Шеган (danilo) wrote on 2011-09-13:

#3

Half of the OOPSes are generally useless in that they contain no proper traceback.

It seems the relevant OOPSes for branches above are OOPS-2076SMS9, OOPS-2075SMS15, OOPS-2075SMS17, OOPS-2075SMS19 (btw, each of these OOPSes is paired with a relatively useless OOPS after these which is a re-raised TimeoutError from within _logTimeout which we should be able to get rid of).

Revision history for this message

Данило Шеган (danilo) wrote on 2011-09-13:

#4

From IRC:

<abentley> danilos: I think the way to close this bug is to optimize bzr.
danilos: I would start by downloading the branch locally and profiling the script.
<abentley> danilos: it is already using the fastest bzrlib operations I know of. I wrote the script, and I also wrote the Graph operations, several years before that.
<abentley> danilos: it's possible the algorithm could be improved. it's also possible that we have to switch to C or Pyrex to get that performance.
danilos: Or we may need to implement a greatest-distance-from-origin cache in order to get good performance.
danilos: jam has also done a lot of work in this area, and may have advice.

(note that the above OOPSes are not available yet because it seems our OOPS scanner is not scanning ackee OOPSes yet)

Revision history for this message

William Grant (wgrant) wrote on 2011-09-14:

#5

We had some more fun cascading scanner failures this evening.

2011-09-14 09:00:04 INFO Creating lockfile: /var/lock/launchpad-runscanbranches.lock
2011-09-14 09:00:11 INFO Running through Twisted.
2011-09-14 09:00:15 INFO Running BranchScanJob (ID 9734101).
2011-09-14 09:00:26 INFO Running BranchScanJob (ID 9734103).
2011-09-14 09:01:02 INFO Creating lockfile: /var/lock/launchpad-runscanbranches.lock
2011-09-14 09:02:02 INFO Creating lockfile: /var/lock/launchpad-runscanbranches.lock
2011-09-14 09:03:02 INFO Creating lockfile: /var/lock/launchpad-runscanbranches.lock
2011-09-14 09:04:02 INFO Creating lockfile: /var/lock/launchpad-runscanbranches.lock
2011-09-14 09:05:03 INFO Creating lockfile: /var/lock/launchpad-runscanbranches.lock
2011-09-14 09:05:29 INFO Job resulted in OOPS: OOPS-2083SMS11
2011-09-14 09:05:32 INFO Running BranchScanJob (ID 9734104).
Unhandled error in Deferred:
Unhandled Error
Traceback (most recent call last):
Failure: twisted.internet.error.ProcessTerminated: A process has ended with a probable error condition: process ended with exit code 42.
Unhandled Error
[... snip the usual traceback ...]
File "/srv/bzrsyncd.launchpad.net/production/launchpad-rev-13921/lib/lp/services/job/runner.py", line 441, in update
if response['success']:
exceptions.TypeError: 'NoneType' object is unsubscriptable
2011-09-14 09:05:32 INFO Running BranchScanJob (ID 9734109).
Unhandled error in Deferred:
Unhandled Error
Traceback (most recent call last):
Failure: twisted.internet.error.ProcessTerminated: A process has ended with a probable error condition: process ended with exit code 42.
[... snip ...]

gmb noticed that his lp:~gmb/launchpad/scratch hadn't scanned, so pushed it again once the tracebacks had stopped. The tracebacks immediately started again. Examining the hanging and subsequent job IDs from the snippet above and the second incident, we get <https://pastebin.canonical.com/52731/>. 9734103 and 9734574 are the two that hung, and are both for branch 62300, which is lp:~gmb/launchpad/scratch.

Something about the job to bring that branch (which isn't new -- it was last scanned more than a year ago, and was originally older still) up to date causes the scanner to take more than 5 minutes. We may finally be able to track this down.

We had some more fun cascading scanner failures this evening.

2011-09-14 09:00:04 INFO    Creating lockfile: /var/lock/launchpad-runscanbranches.lock
2011-09-14 09:00:11 INFO    Running through Twisted.
2011-09-14 09:00:15 INFO    Running BranchScanJob (ID 9734101).
2011-09-14 09:00:26 INFO    Running BranchScanJob (ID 9734103).
2011-09-14 09:01:02 INFO    Creating lockfile: /var/lock/launchpad-runscanbranches.lock
2011-09-14 09:02:02 INFO    Creating lockfile: /var/lock/launchpad-runscanbranches.lock
2011-09-14 09:03:02 INFO    Creating lockfile: /var/lock/launchpad-runscanbranches.lock
2011-09-14 09:04:02 INFO    Creating lockfile: /var/lock/launchpad-runscanbranches.lock
2011-09-14 09:05:03 INFO    Creating lockfile: /var/lock/launchpad-runscanbranches.lock
2011-09-14 09:05:29 INFO    Job resulted in OOPS: OOPS-2083SMS11
2011-09-14 09:05:32 INFO    Running BranchScanJob (ID 9734104).
Unhandled error in Deferred:
Unhandled Error
Traceback (most recent call last):
Failure: twisted.internet.error.ProcessTerminated: A process has ended with a probable error condition: process ended with exit code 42.
Unhandled Error
[... snip the usual traceback ...]
  File "/srv/bzrsyncd.launchpad.net/production/launchpad-rev-13921/lib/lp/services/job/runner.py", line 441, in update
    if response['success']:
exceptions.TypeError: 'NoneType' object is unsubscriptable
2011-09-14 09:05:32 INFO    Running BranchScanJob (ID 9734109).
Unhandled error in Deferred:
Unhandled Error
Traceback (most recent call last):
Failure: twisted.internet.error.ProcessTerminated: A process has ended with a probable error condition: process ended with exit code 42.
[... snip ...]

gmb noticed that his lp:~gmb/launchpad/scratch hadn't scanned, so pushed it again once the tracebacks had stopped. The tracebacks immediately started again. Examining the hanging and subsequent job IDs from the snippet above and the second incident, we get  <https://pastebin.canonical.com/52731/>. 9734103 and 9734574 are the two that hung, and are both for branch 62300, which is lp:~gmb/launchpad/scratch.

Something about the job to bring that branch (which isn't new -- it was last scanned more than a year ago, and was originally older still) up to date causes the scanner to take more than 5 minutes. We may finally be able to track this down.

Revision history for this message

William Grant (wgrant) wrote on 2011-09-14:

#6

Also note from the paste that the jobs mentioned in the fallout tracebacks were not actually marked as failed, so the failure doesn't seem to be harmful to anything but the actual problematic job.

Revision history for this message

Aaron Bentley (abentley) wrote on 2011-09-14:

#7

Danilos, the reason two oopses are emitted is because if the timeout happens outside the job-running code, only the "relatively useless" oops will be emitted.

Revision history for this message

Michael Hope (michaelh1) wrote on 2011-09-21:

#8

bzr 2.3.4 and 2.4.0 are significantly faster for many operations. I wonder if they'd help? I generate the diff between the ancestor and merge request as part of my system and, although I haven't measured it, a straight bzr diff -r xxx..yyy takes less than a minute.

Revision history for this message

Michael Hope (michaelh1) wrote on 2011-10-05:

#9

This is causing more pain. See the list of code reviews at:
https://code.launchpad.net/gcc-linaro/+activereviews

Most of the 'ready to land' proposals have been landed, but the branch scanner isn't detecting this and isn't closing them out. It's hard to tell what needs to be done.

Revision history for this message

Abel Deuring (adeuring) wrote on 2011-10-10:

#10

raw profile data Edit (4.2 MiB, application/octet-stream)

Running a version of cronscripts/scan_branches.py modified for profiling:

--- cronscripts/scan_branches.py 2011-06-20 20:07:33 +0000
+++ cronscripts/scan_branches.py 2011-10-10 11:13:03 +0000
@@ -8,11 +8,12 @@
__metaclass__ = type

import _pythonpath
+import cProfile

from lp.code.interfaces.branchjob import IBranchScanJobSource
from lp.services.job.runner import (
JobCronScript,
- TwistedJobRunner,
+ JobRunner,
)

@@ -23,10 +24,17 @@
source_interface = IBranchScanJobSource

def __init__(self):
- super(RunScanBranches, self).__init__(runner_class=TwistedJobRunner)
+ super(RunScanBranches, self).__init__(runner_class=JobRunner)

if __name__ == '__main__':

script = RunScanBranches()
- script.lock_and_run()
+ import os
+ all_names = os.listdir('.')
+ index = 0
+ fname = 'scan-branches-%04i.profile' % index
+ while fname in all_names:
+ index += 1
+ fname = 'scan-branches-%04i.profile' % index
+ cProfile.run('script.lock_and_run()', fname)

for this branch: for this branch
bzr+ssh://bazaar.launchpad.net/~ramana/gcc-linaro/fix-lp-838994/

shows that approximately 50% (604s) of the total run time (1196s)
is spent in Storm. (note: of this time, only 268s are spent in
psycopg2._psycopg.cursor.execute(), and 417s in
storm.database.raw_execute())

The only obvious hotspot I am seeing is the method
acquireRevisionAuthor() in lp.code.mode.revision, which needs
464s.

Danilo dug up several OOPSes, where the timeout errors always
occur in methods called by planDatabaseChanges(). Interestingly,
this method has a runtime of only 4.3s in my test run.

Benji York (benji) on 2011-10-28

Changed in launchpad:
assignee:	nobody → Benji York (benji)
status:	Triaged → In Progress

Revision history for this message

Benji York (benji) wrote on 2011-11-04:

#11

Here's a version of the patch without dropped whitespace: http://paste.ubuntu.com/728245/

Revision history for this message

Benji York (benji) wrote on 2011-11-15:

#12

Michael Hope (michaelh1) wrote on 2011-09-21:
> bzr 2.3.4 and 2.4.0 are significantly faster for many operations.

The newer Bazaar in use by Launchpad does appear to be a win here. In profiling branch scan runs I see much better performance. I'll mark this Fix Released until evidence to the contrary arises.

Changed in launchpad:
status:	In Progress → Fix Released

Revision history for this message

Francis J. Lacoste (flacoste) wrote on 2011-11-15:

#13

Michael Hope reports that this merge proposal from 2 days ago is still broken: https://code.launchpad.net/~irar/gcc-linaro/slp-for-any-nops-4.6/+merge/82130

Changed in launchpad:
status:	Fix Released → In Progress

Revision history for this message

Martin Pool (mbp) wrote on 2011-11-16: Re: [Bug 808930] Re: Timeout running branch scanner job

#14

If most of the time is spent in Storm then we may need to do more than
just update bzr.

In the mean time perhaps the timeout can be increased?

Revision history for this message

Robert Collins (lifeless) wrote on 2011-11-16:

#15

The scanner causes UI timeouts, running it for longer means more
likelyhood of those timeouts occuring.

So no, this is a case where we need to fix it.

Benji York (benji) on 2011-12-13

Changed in launchpad:
assignee:	Benji York (benji) → Graham Binns (gmb)

Revision history for this message

Graham Binns (gmb) wrote on 2011-12-15:

#16

It seems likely that bug 904683 is symptomatic of this bug.

Revision history for this message

Francesco Banconi (frankban) wrote on 2012-01-09:

#17

Same problem for me today:
OOPS-0ab3b93f9dcb0b2aecdc0dcac9993c85 and OOPS-23e8246e0c6eee1e8e2303fc34753355
Related merge proposals are not broken.

Graham Binns (gmb) on 2012-01-10

Changed in launchpad:
status:	In Progress → Fix Committed
status:	Fix Committed → Triaged
assignee:	Graham Binns (gmb) → nobody

Revision history for this message

Jelmer Vernooij (jelmer) wrote on 2012-02-10:

#18

I'm landing two fixes that will reduce the number of queries done by the scanner and might fix some of the timeouts.

See bug 903703 and bug 930166

Revision history for this message

Aaron Bentley (abentley) wrote on 2012-02-27:

#19

I've landed some improvements that make branch scanning faster (r14865), so you can try triggering a scan (by pushing to the relevant branches) and see if it works now.

Francis J. Lacoste (flacoste) on 2012-05-22

Changed in launchpad:
status:	Triaged → In Progress
assignee:	nobody → Aaron Bentley (abentley)

Revision history for this message

William Grant (wgrant) wrote on 2012-06-18:

#20

We probably have a few different bugs here, but here's some diagnosis from experiments with lp:~wallyworld/launchpad/create-aag-privacy-transitions-1009364, a Launchpad branch which is now permanently cursed and unable to scan within the timeout.

wallyworld pushed the branch at around 01:04. It starts to scan, but times out 5 minutes later.

2012-06-18 01:05:11 INFO Running <SCAN_BRANCH branch job (4134693) for ~wallyworld/launchpad/create-aag-privacy-transitions-1009364> (ID 13838195).
2012-06-18 01:10:15 INFO Job resulted in OOPS: OOPS-415b4f54e93cb4a66345cf3807221142

It is probably relevant that MP 110705 was created for this branch seconds before 01:06:40, although no job appears to have run for it until 01:22:10.

Another scan is attempted at 01:11:10, timing out at 01:16:23 with OOPS-d214e0439183a7ec5ccb2bf73c34efec. wallyworld pushed it to a new name at 01:22, and the new branch scanned successfully in 109s.

I noticed the accursed branch and coordinated with wallyworld and spm to debug it a little. A scan was forced by pushing a new rev at 03:33, and the scan timed out as expected with OOPS-d83500d8b2cd4f2355127bb914508982. The branch state after this was https://pastebin.canonical.com/68306/ -- the most notable takeaway being that the branch has a full set of BranchRevisions, but no last_scanned_id. This was likely the state after the initial failed scan.

https://pastebin.canonical.com/68305/ is a series of snapshots of the relevant bits of pg_stat_activity throughout the scan. They reveal that it spent more than 4 minutes in the existing_branchrevisions query of planDatabaseChanges, checking for existing BranchRevisions for more than 122000 revids. While that query should in principle be satisfiable in just a second or so, it's not inconceivable that the parsing and planning overhead could be somewhat ridiculous. It may also generate a terrible plan when run with 122000 revids on a branch which already has 122000 BranchRevisions, as that's a situation that would not normally be encountered.

While that query explains the slowness of the scan that we watched, it doesn't explain the timeout of the initial one. There would have been no BranchRevisions to cause the slow query, and it managed to get through the slow bit to create and commit the new BranchRevisions, but failed to set last_scanned_id. This means that it got past the post-BranchRevision commit (bzrsync.py:109) without trouble, but then timed out during TipChanged, updateBranchStatus or ScanCompleted. So there was probably some other long job holding a write lock on the branch row.

So there's at least two separate bugs here: the slow query when searching a full set of BranchRevisions for a full set of BranchRevisions, and the thing that's probably keeping the branch row write-locked. I have a suspicion that the latter is related to the merge proposal that was created during the scan, but I can find no evidence that the job was running at that time.

We probably have a few different bugs here, but here's some diagnosis from experiments with lp:~wallyworld/launchpad/create-aag-privacy-transitions-1009364, a Launchpad branch which is now permanently cursed and unable to scan within the timeout.

wallyworld pushed the branch at around 01:04. It starts to scan, but times out 5 minutes later.

2012-06-18 01:05:11 INFO    Running <SCAN_BRANCH branch job (4134693) for ~wallyworld/launchpad/create-aag-privacy-transitions-1009364> (ID 13838195).
2012-06-18 01:10:15 INFO    Job resulted in OOPS: OOPS-415b4f54e93cb4a66345cf3807221142

It is probably relevant that MP 110705 was created for this branch seconds before 01:06:40, although no job appears to have run for it until 01:22:10.

Another scan is attempted at 01:11:10, timing out at 01:16:23 with OOPS-d214e0439183a7ec5ccb2bf73c34efec. wallyworld pushed it to a new name at 01:22, and the new branch scanned successfully in 109s.

I noticed the accursed branch and coordinated with wallyworld and spm to debug it a little. A scan was forced by pushing a new rev at 03:33, and the scan timed out as expected with OOPS-d83500d8b2cd4f2355127bb914508982. The branch state after this was https://pastebin.canonical.com/68306/ -- the most notable takeaway being that the branch has a full set of BranchRevisions, but no last_scanned_id. This was likely the state after the initial failed scan.

https://pastebin.canonical.com/68305/ is a series of snapshots of the relevant bits of pg_stat_activity throughout the scan. They reveal that it spent more than 4 minutes in the existing_branchrevisions query of planDatabaseChanges, checking for existing BranchRevisions for more than 122000 revids. While that query should in principle be satisfiable in just a second or so, it's not inconceivable that the parsing and planning overhead could be somewhat ridiculous. It may also generate a terrible plan when run with 122000 revids on a branch which already has 122000 BranchRevisions, as that's a situation that would not normally be encountered.

While that query explains the slowness of the scan that we watched, it doesn't explain the timeout of the initial one. There would have been no BranchRevisions to cause the slow query, and it managed to get through the slow bit to create and commit the new BranchRevisions, but failed to set last_scanned_id. This means that it got past the post-BranchRevision commit (bzrsync.py:109) without trouble, but then timed out during TipChanged, updateBranchStatus or ScanCompleted. So there was probably some other long job holding a write lock on the branch row.

So there's at least two separate bugs here: the slow query when searching a full set of BranchRevisions for a full set of BranchRevisions, and the thing that's probably keeping the branch row write-locked. I have a suspicion that the latter is related to the merge proposal that was created during the scan, but I can find no evidence that the job was running at that time.

Revision history for this message

Aaron Bentley (abentley) wrote on 2012-06-27:

#21

I've filed bug #1018460 for the issue that a failed initial scan causes later scans to fail.

Revision history for this message

Francis J. Lacoste (flacoste) wrote on 2012-08-16:

#22

All fixes related to this are now live.

Changed in launchpad:
status:	In Progress → Fix Released

Launchpad itself

Timeout running branch scanner job

Bug Description

Other bug subscribers

Bug attachments

Remote bug watches