Comment 20 for bug 808930

Revision history for this message
William Grant (wgrant) wrote :

We probably have a few different bugs here, but here's some diagnosis from experiments with lp:~wallyworld/launchpad/create-aag-privacy-transitions-1009364, a Launchpad branch which is now permanently cursed and unable to scan within the timeout.

wallyworld pushed the branch at around 01:04. It starts to scan, but times out 5 minutes later.

2012-06-18 01:05:11 INFO Running <SCAN_BRANCH branch job (4134693) for ~wallyworld/launchpad/create-aag-privacy-transitions-1009364> (ID 13838195).
2012-06-18 01:10:15 INFO Job resulted in OOPS: OOPS-415b4f54e93cb4a66345cf3807221142

It is probably relevant that MP 110705 was created for this branch seconds before 01:06:40, although no job appears to have run for it until 01:22:10.

Another scan is attempted at 01:11:10, timing out at 01:16:23 with OOPS-d214e0439183a7ec5ccb2bf73c34efec. wallyworld pushed it to a new name at 01:22, and the new branch scanned successfully in 109s.

I noticed the accursed branch and coordinated with wallyworld and spm to debug it a little. A scan was forced by pushing a new rev at 03:33, and the scan timed out as expected with OOPS-d83500d8b2cd4f2355127bb914508982. The branch state after this was https://pastebin.canonical.com/68306/ -- the most notable takeaway being that the branch has a full set of BranchRevisions, but no last_scanned_id. This was likely the state after the initial failed scan.

https://pastebin.canonical.com/68305/ is a series of snapshots of the relevant bits of pg_stat_activity throughout the scan. They reveal that it spent more than 4 minutes in the existing_branchrevisions query of planDatabaseChanges, checking for existing BranchRevisions for more than 122000 revids. While that query should in principle be satisfiable in just a second or so, it's not inconceivable that the parsing and planning overhead could be somewhat ridiculous. It may also generate a terrible plan when run with 122000 revids on a branch which already has 122000 BranchRevisions, as that's a situation that would not normally be encountered.

While that query explains the slowness of the scan that we watched, it doesn't explain the timeout of the initial one. There would have been no BranchRevisions to cause the slow query, and it managed to get through the slow bit to create and commit the new BranchRevisions, but failed to set last_scanned_id. This means that it got past the post-BranchRevision commit (bzrsync.py:109) without trouble, but then timed out during TipChanged, updateBranchStatus or ScanCompleted. So there was probably some other long job holding a write lock on the branch row.

So there's at least two separate bugs here: the slow query when searching a full set of BranchRevisions for a full set of BranchRevisions, and the thing that's probably keeping the branch row write-locked. I have a suspicion that the latter is related to the merge proposal that was created during the scan, but I can find no evidence that the job was running at that time.