Launchpad itself

Bug #1031764
Comment #2

Comment 2 for bug 1031764

Revision history for this message

John A Meinel (jameinel) wrote on 2012-08-01:

For the second query, it takes ~1200ms on staging as well. The query planner says:
Limit (cost=50118.35..50118.60 rows=101 width=830) (actual time=1304.710..1304.751 rows=101 loops=1)
   InitPlan 1 (returns $0)
     -> Aggregate (cost=59.47..59.48 rows=1 width=4) (never executed)
           -> Index Scan using teamparticipation_person_idx on teamparticipation (cost=0.00..59.34 rows=51 width=4) (never executed)
                 Index Cond: (person = 10583)
   -> Sort (cost=50058.86..50669.06 rows=244079 width=830) (actual time=1304.707..1304.725 rows=101 loops=1)
         Sort Key: branch.date_last_modified, branch.target_suffix, branch.lifecycle_status, branch.owner_name, branch.name
         Sort Method: top-N heapsort Memory: 77kB
         -> Seq Scan on branch (cost=3245.05..40712.82 rows=244079 width=830) (actual time=0.027..782.749 rows=373887 loops=1)
               Filter: ((owner = 2866082) AND (lifecycle_status = ANY ('{10,30,50}'::integer[])) AND ((information_type = ANY ('{1,2}'::integer[])) OR COALESCE((access_grants && $0), false) OR (hashed SubPlan 2)))
               SubPlan 2
                 -> Hash Join (cost=2167.74..3245.04 rows=6 width=4) (never executed)
                       Hash Cond: (public.teamparticipation.team = accesspolicygrant.grantee)
                       -> Index Scan using teamparticipation_person_idx on teamparticipation (cost=0.00..59.34 rows=51width=4) (never executed)
                             Index Cond: (person = 10583)
                       -> Hash (cost=1229.55..1229.55 rows=75055 width=8) (never executed)
                             -> Seq Scan on accesspolicygrant (cost=0.00..1229.55 rows=75055 width=8) (never executed)
Total runtime: 1304.974 ms
(18 rows)

That also shows the expensive part being:
-> Seq Scan on branch (cost=3245.05..40712.82 rows=244079 width=830) (actual time=0.027..782.749 rows=373887 loops=1)

So it is starting with a sequential scan of all 370k branches, it then creates a 100 node bucket, which as it finds nodes 'newer' than what is in the bucket, it puts them in there, caveat the following hash join (which never executes for some reason).

Now, maybe this is actually reasonable for ~ubuntu-branches. By my count, there are 562,000 branches, and according to this query, 373,887 of them are part of ~ubuntu-branches. That is far more than the standard >10% => full-table-scan.

Note that part of the query:
launchpad_staging=> select count(*) from branch where lifecycle_status in (10,30,50) and (information_type in (1,2));
count
--------
455713

Since it does need so much information, but it is filtering from 300k rows down to 100. It feels like having an index on those exact fields would help a lot here.

I can't create an actual index to test it, but something like:
CREATE INDEX CONCURRENTLY branch_recent_branches_idx ON Branch(date_last_modified DESC, target_suffix ASC, lifecycle_status DESC, owner_name ASC, name ASC);

For the second query, it takes ~1200ms on staging as well. The query planner says:
 Limit  (cost=50118.35..50118.60 rows=101 width=830) (actual time=1304.710..1304.751 rows=101 loops=1)
   InitPlan 1 (returns $0)
     ->  Aggregate  (cost=59.47..59.48 rows=1 width=4) (never executed)
           ->  Index Scan using teamparticipation_person_idx on teamparticipation  (cost=0.00..59.34 rows=51 width=4) (never executed)
                 Index Cond: (person = 10583)
   ->  Sort  (cost=50058.86..50669.06 rows=244079 width=830) (actual time=1304.707..1304.725 rows=101 loops=1)
         Sort Key: branch.date_last_modified, branch.target_suffix, branch.lifecycle_status, branch.owner_name, branch.name
         Sort Method: top-N heapsort  Memory: 77kB
         ->  Seq Scan on branch  (cost=3245.05..40712.82 rows=244079 width=830) (actual time=0.027..782.749 rows=373887 loops=1)
               Filter: ((owner = 2866082) AND (lifecycle_status = ANY ('{10,30,50}'::integer[])) AND ((information_type = ANY ('{1,2}'::integer[])) OR COALESCE((access_grants && $0), false) OR (hashed SubPlan 2)))
               SubPlan 2
                 ->  Hash Join  (cost=2167.74..3245.04 rows=6 width=4) (never executed)
                       Hash Cond: (public.teamparticipation.team = accesspolicygrant.grantee)
                       ->  Index Scan using teamparticipation_person_idx on teamparticipation  (cost=0.00..59.34 rows=51width=4) (never executed)
                             Index Cond: (person = 10583)
                       ->  Hash  (cost=1229.55..1229.55 rows=75055 width=8) (never executed)
                             ->  Seq Scan on accesspolicygrant  (cost=0.00..1229.55 rows=75055 width=8) (never executed)
 Total runtime: 1304.974 ms
(18 rows)

That also shows the expensive part being:
         ->  Seq Scan on branch  (cost=3245.05..40712.82 rows=244079 width=830) (actual time=0.027..782.749 rows=373887 loops=1)

So it is starting with  a sequential scan of all 370k branches, it then creates a 100 node bucket, which as it finds nodes 'newer' than what is in the bucket, it puts them in there, caveat the following hash join (which never executes for some reason).

Note that part of the query:
launchpad_staging=> select count(*) from branch where lifecycle_status in (10,30,50) and (information_type in (1,2));
 count
--------
 455713

Since it does need so much information, but it is filtering from 300k rows down to 100. It feels like having an index on those exact fields would help a lot here.