Failure during initial branch scan increases scan time enormously

Bug #1018460 reported by Aaron Bentley on 2012-06-27
20
This bug affects 3 people
Affects Status Importance Assigned to Milestone
Launchpad itself
Critical
Unassigned

Bug Description

Excerpted from https://bugs.launchpad.net/launchpad/+bug/808930/comments/20 by William Grant:

...here's some diagnosis from experiments with lp:~wallyworld/launchpad/create-aag-privacy-transitions-1009364, a Launchpad branch which is now permanently cursed and unable to scan within the timeout.

wallyworld pushed the branch at around 01:04. It starts to scan, but times out 5 minutes later.

2012-06-18 01:05:11 INFO Running <SCAN_BRANCH branch job (4134693) for ~wallyworld/launchpad/create-aag-privacy-transitions-1009364> (ID 13838195).
2012-06-18 01:10:15 INFO Job resulted in OOPS: OOPS-415b4f54e93cb4a66345cf3807221142

Another scan is attempted at 01:11:10, timing out at 01:16:23 with OOPS-d214e0439183a7ec5ccb2bf73c34efec. wallyworld pushed it to a new name at 01:22, and the new branch scanned successfully in 109s.

I noticed the accursed branch and coordinated with wallyworld and spm to debug it a little. A scan was forced by pushing a new rev at 03:33, and the scan timed out as expected with OOPS-d83500d8b2cd4f2355127bb914508982. The branch state after this was https://pastebin.canonical.com/68306/ -- the most notable takeaway being that the branch has a full set of BranchRevisions, but no last_scanned_id. This was likely the state after the initial failed scan.

https://pastebin.canonical.com/68305/ is a series of snapshots of the relevant bits of pg_stat_activity throughout the scan. They reveal that it spent more than 4 minutes in the existing_branchrevisions query of planDatabaseChanges, checking for existing BranchRevisions for more than 122000 revids. While that query should in principle be satisfiable in just a second or so, it's not inconceivable that the parsing and planning overhead could be somewhat ridiculous. It may also generate a terrible plan when run with 122000 revids on a branch which already has 122000 BranchRevisions, as that's a situation that would not normally be encountered.

Aaron Bentley (abentley) on 2012-06-27
summary: - Failure to scan branch increases scan time enormously
+ Failure during initial branch scan increases scan time enormously
William Grant (wgrant) wrote :

Branches hit by this bug can now be unstuck by running 'scripts/unscan-branch.py --rescan lp:~PATH/TO/BRANCH' as bzrsyncd@ackee (or anywhere with access to the branchscanner DB user). It erases the BranchRevisions, last_scanned_id, and some other stuff, and requests a rescan from the normal branch scanner. For really big branches the DB cache may be too cold for the first scan to complete, so it might need to be retried a couple of times.

To post a comment you must log in.
This report contains Public information  Edit
Everyone can see this information.

Duplicates of this bug

Other bug subscribers