BugNomination:+edit-form structural subscriptions taking 4.8 seconds during nomination editing POST

Bug #723999 reported by Robert Collins
This bug affects 1 person
Affects: Launchpad itself
Status: Fix Released
Importance: Critical
Assigned to: Yellow Squad
Milestone: 11.03

Bug Description

OOPS-1880K1665

HTTP method: POST
URL: https://bugs.launchpad.net/ubuntu/+source/linux/+bug/723945/nominations/39265/+edit-form

Branch: launchpad-rev-12432
Revno: 12432
SQL time: 11979 ms
Non-sql time: 2622 ms
Total time: 14601 ms
Statement Count: 367

4 -very- long queries, each both long to copy and long to execute, are taking up 4.8 seconds on this page on their own. This may be related to the number of subscribers - I haven't analysed the query plans.

On top of that, there is a lot of late evaluation of email addresses, totalling 483 ms.
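
The 483 ms of late evaluation is consistent with the ORM issuing one small EmailAddress lookup per recipient. A minimal sketch of the batched alternative, assuming an EmailAddress table with a person column as in the Launchpad schema (the ids are placeholders):

    -- One query instead of one round trip per recipient (sketch only;
    -- column names assumed, ids are placeholders):
    SELECT person, email
      FROM EmailAddress
     WHERE person IN (1, 2, 3);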

Related branches

description: updated
description: updated
Gary Poster (gary)
Changed in launchpad:
assignee: nobody → Launchpad Yellow Squad (yellow)
tags: added: story-better-bug-notification
Revision history for this message
Gary Poster (gary) wrote :

I'm fairly sure this comes from code in lp.bugs.model.structuralsubscription.BugFilterSetBuilder.

To satisfy my own curiosity, some small hand-made stats about the long queries.

- There are four verrrry similar long queries, each taking between 1 and 1.5 seconds. They may (hopefully) differ in the variables they use when they are called. I wouldn't be surprised if the four are looking at different duped bugs.

- Each long query unions three distinct select statements, each of which is repeated several times (sketched schematically below). In the first long query, for instance, the first and second are *each* repeated (unioned) five times; the last is repeated four times. Other long queries apparently vary this somewhat (the second has five, five, and one), but the three unioned queries are identical across all four long queries.
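
Schematically, the shape being described is roughly the following. This is a sketch only: the real statements run to hundreds of lines, and the table/column names and literal values here are assumptions rather than anything taken from the OOPS.

    -- Near-identical arms glued together with UNION, one per target:
    SELECT ss.id FROM StructuralSubscription ss
     WHERE ss.distribution = 1 AND ss.sourcepackagename = 101
    UNION
    SELECT ss.id FROM StructuralSubscription ss
     WHERE ss.distribution = 1 AND ss.sourcepackagename = 102
    UNION
    -- ... repeated for the remaining bugtasks and the distroseries targets ...
    SELECT ss.id FROM StructuralSubscription ss
     WHERE ss.distroseries = 5;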

Revision history for this message
Gary Poster (gary) wrote :

As an aside, I was already wondering why we calculate any part of the recipients live at all (as opposed to waiting for the cron script), but I was hoping not to have to address that potential change as part of this story.

Revision history for this message
Gavin Panella (allenap) wrote :

> I'm fairly sure this comes from code in
> lp.bugs.model.structuralsubscription.BugFilterSetBuilder.

Yes :)

> To satisfy my own curiosity, some small hand-made stats about the
> long queries.
>
> - There are four verrrry similar long queries, each taking between 1
> and 1.5 seconds. They may (hopefully) differ in the variables they
> use when they are called. I wouldn't be surprised if the four are
> looking at different duped bugs.

They may well be identical queries. Fixing this was part of the point
of my work on BugSubscriptionInfo.

> - Each long query unions three distinct select statements, each of
> which is repeated several times. In the first long query, for
> instance, the first and second are *each* repeated (unioned) five
> times; the last is repeated four times. Other long queries
> apparently vary this somewhat (the second has five, five, and
> one), but the three unioned queries are identical across all four
> long queries.

Here's my analysis of just the first query in that OOPS (a schematic SQL sketch follows the list):

- The "three distinct selects" look for subscription filter matches
  against each of status, importance and tags.

- The subscription ID is pulled from the intersection of these
  queries, and is UNIONed with all subscriptions without filter.

- The full structural subscription rows are materialized from these
  IDs.

- *14* of these queries are UNIONed together:

  - The last 4 are looking for distroseries subscriptions (and any
    associated filters). I'm not sure why there are 4, and not say 5,
    of these, but BugTaskSet._getStructuralSubscriptionTargets()
    probably will explain more.

  - Of the remaining 10, 5 are for source package subscriptions, one
    for each bugtask.

  - The other 5 are looking for subscriptions on the distribution, one
    for each bugtask.
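
Putting these points together, each of the 14 arms has roughly the following shape. This is a schematic sketch, not the generated SQL: the BugSubscriptionFilterStatus/Importance/Tag table names, the column names, and the literal values are assumptions based on the schema discussed in this bug.

    SELECT StructuralSubscription.*
      FROM StructuralSubscription
     WHERE StructuralSubscription.id IN (
               -- filters matching the bugtask's status
               SELECT f.structuralsubscription
                 FROM BugSubscriptionFilter f
                 JOIN BugSubscriptionFilterStatus fs ON fs.filter = f.id
                WHERE fs.status = 10
            INTERSECT
               -- filters matching the bugtask's importance
               SELECT f.structuralsubscription
                 FROM BugSubscriptionFilter f
                 JOIN BugSubscriptionFilterImportance fi ON fi.filter = f.id
                WHERE fi.importance = 40
            INTERSECT
               -- filters matching the bug's tags
               SELECT f.structuralsubscription
                 FROM BugSubscriptionFilter f
                 JOIN BugSubscriptionFilterTag ft ON ft.filter = f.id
                WHERE ft.tag IN ('regression')
            UNION
               -- subscriptions with no filter at all
               SELECT ss.id
                 FROM StructuralSubscription ss
                 LEFT JOIN BugSubscriptionFilter f
                        ON f.structuralsubscription = ss.id
                WHERE f.id IS NULL);
    -- Each arm would additionally constrain the subscription target
    -- (omitted here), and 14 arms of this shape are UNIONed together.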

> As an aside, I was already wondering why we calculate any part of the
> recipients live at all (as opposed to waiting for the cron script),
> but I was hoping not to have to address that potential change as part
> of this story.

Like you, I think it's there simply because it's more work to move
it. BugSubscriptionInfo is meant to make offlining this stuff easier,
because it's intended to be able to work with a bug snapshot too,
rather than requiring a "live" bug in the correct state.

Revision history for this message
Gary Poster (gary) wrote :

I thanked Gavin on IRC for the reply. I also clarified my understanding of it. Here's a summary.

1) The query is doing the right/expected thing right now.
2) He does not see a way to improve it (at this time) without off-lining the recipient calculation.

Perhaps we can get to offlining the recipient calculation while we are still in feature rotation, but if not, this task might be a reasonable thing to do when we first move to a bug rotation.

Revision history for this message
Francis J. Lacoste (flacoste) wrote :

> 1) The query is doing the right/expected thing right now.

I'm not sure I understand this part? Is this a regression? Or was structural subscription always timing out in these cases? If it's a regression (not timing out before, now timing out), we shouldn't wait until the end of the feature to decide if it is in scope or not.

Revision history for this message
Robert Collins (lifeless) wrote :

Bug subscribers are shown in the UI: we absolutely need to be able to generate the list of folk being notified on a bug in under a second, because we show that list in the subscriber portlets on every bug load. (In fact we calculate the subscribers when rendering the core page too - it controls the non-ajax subscribers portlet content, i.e. whether it says 'subscribe to' or 'edit subscription'.)

Given that, off-lining really isn't worth doing at this point. Also, in terms of offlining things, we're going to get timeouts on backend scripts soon, because such scripts are causing web request timeouts when they take foreign row locks out. Offline tasks *also* need to be fast.

So, the question here is - why is this slow?

A few things occur to me:
 * A one second query is probably doing full materialization. When we do that, having complex conditions doesn't slow things down much, and is considerably more efficient than doing full materialization 4 times.
 * It's a complex query: 300 lines of SQL is unlikely to be optimised well by the PostgreSQL query planner. We need to get an explain for the plans from qastaging and see whether that's the case.
 * The schema may not be suitable for efficient querying of the data we need. We may need to improve the schema so that we can efficiently query whatever is needed.

Stuart and I would be delighted to help with any of these - just ask.
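
For the second point above, the usual check is to copy the substituted statement out of the OOPS report and look at its plan in psql against qastaging; a minimal sketch, where the placeholder stands for the real statement:

    \timing on
    -- estimated plan and costs only:
    EXPLAIN <statement copied from the OOPS>;
    -- actually executes the statement and reports real row counts and timings:
    EXPLAIN ANALYZE <statement copied from the OOPS>;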

summary: - structural subscriptions taking 4.8 seconds during nomination editing
- POST
+ BugNomination:+edit-form structural subscriptions taking 4.8 seconds
+ during nomination editing POST
Revision history for this message
Gary Poster (gary) wrote :

@Francis: I have no idea if this should be classified as a regression, to be honest. This is code that existed before the Yellow squad existed, but was developed for this story. It's conceivable that we might have aggravated something somehow, but I'm not immediately clear on how. I'd suggest using another metric to decide whether this should be addressed immediately/sooner.

@Robert: A few replies, for a lengthier comment.
 - Showing the subscribers abstractly--for the web UI, in particular--is a different question than the one this query is asking. The latter takes structural subscription filters into account, and the former does not. The code that is timing out is exclusively about figuring out who should be notified for a particular notification, not about who is subscribed in the abstract. The notification code could be run offline.
 - Offline tasks need to be fast, but they can be slower than in-browser tasks, or we have yet another problem IMO. [A side thought: in this particular case if a 1.5 second transaction were too long (and I am aware of the dangers of long write transactions) then we also might be able to divide up the work into a slave-based read transaction and a gracefully-recover-if-it-fails write transaction.]
 - My immediate suspect would be that perhaps filters need an index we do not have, but investigation will presumably point our way to this relatively quickly.
 - It's possible that a quick run through with the query planner will be a smoking gun to an immediate remediation. I tried to do that just now but I'm not clear on how to quickly resolve the query variables from the OOPS, not having done this before. If Robert/Stuart can get data from the query planner that would be cool, though if we are told to drop everything and focus on this I suspect someone else on the yellow squad will know how to do it.

In sum, I hope to assign myself (so I can learn) with Danilo's assistance (so he can teach me) when we tackle this task. For obvious reasons, I would prefer not to treat this as a burning task that requires us to drop our current feature work.
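
On the index suspicion in the list above: if EXPLAIN shows sequential scans on the filter table for these lookups, the obvious candidate would be something like the following. This is purely hypothetical - the column name is an assumption, and whether such an index is in fact missing would need checking against the live schema.

    -- Hypothetical: only worth creating if the plan shows a sequential scan
    -- on BugSubscriptionFilter for these lookups (column name assumed).
    CREATE INDEX bugsubscriptionfilter__structuralsubscription__idx
        ON BugSubscriptionFilter (structuralsubscription);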

Revision history for this message
Robert Collins (lifeless) wrote :

> - Showing the subscribers abstractly--for the web UI, in
>   particular--is a different question than the one this query is
>   asking. The latter takes structural subscription filters into
>   account, and the former does not. The code that is timing out is
>   exclusively about figuring out who should be notified for a
>   particular notification, not about who is subscribed in the
>   abstract. The notification code could be run offline.

This is very relevant to the discussion about whether we should show subscribers: direct subscribers are equally interesting to 'who will be notified', if the question people want to answer is 'will fred be told'.

> - Offline tasks need to be fast, but they can be slower than
>   in-browser tasks, or we have yet another problem IMO. [A side
>   thought: in this particular case if a 1.5 second transaction were
>   too long (and I am aware of the dangers of long write transactions)
>   then we also might be able to divide up the work into a slave-based
>   read transaction and a gracefully-recover-if-it-fails write
>   transaction.]

Indeed, that's the sort of logic that's needed. Offline tasks cannot have write transactions longer than in-browser transactions, or in-browser transactions will serialise behind offline tasks and fail. E.g. if the timeout is 10 seconds, most pages have to happen in e.g. 2 seconds, and we need write transactions to reliably complete within 4 seconds (allows 2 backend transactions and one webapp one serialised on the thing within 10 seconds). {*: exact numbers are made up}.

> - My immediate suspect would be that perhaps filters need an index
>   we do not have, but investigation will presumably point our way to
>   this relatively quickly.
> - It's possible that a quick run through with the query planner will
>   be a smoking gun to an immediate remediation. I tried to do that
>   just now but I'm not clear on how to quickly resolve the query
>   variables from the OOPS, not having done this before. If
>   Robert/Stuart can get data from the query planner that would be
>   cool, though if we are told to drop everything and focus on this I
>   suspect someone else on the yellow squad will know how to do it.

The bind variables are now rendered in the OOPS: skip the 'long requests' and 'repeated requests' sections - go down to the complete query list and each of those has been substituted already - you may need to handle the occasional u"foo" because we capture the variables outside the sql layer, but other than string literals it should be good to go.

Revision history for this message
Gary Poster (gary) wrote : Re: [Bug 723999] Re: BugNomination:+edit-form structural subscriptions taking 4.8 seconds during nomination editing POST

On Feb 24, 2011, at 3:08 PM, Robert Collins wrote:

> - Showing the subscribers abstractly--for the web UI, in particular--is
> a different question than the one this query is asking. The latter takes
> structural subscription filters into account, and the former does not.
> The code that is timing out is exclusively about figuring out who should
> be notified for a particular notification, not about who is subscribed
> in the abstract. The notification code could be run offline.
>
> This is very relevant to the discussion about whether we should show
> subscribers : direct subscribers are equally interesting to 'who will be
> notified', if the question people want to answer is 'will fred be
> told'.

Direct subscribers are not part of the problem here. Structural subscription filters are the key difference, as I said. Perhaps I miss your point.

In any case, it is a good question whether these pages will need to consider filters in the future, and one we had not considered deeply. I will ask Jono. It would have pros and cons.

> The bind variables are now rendered in the OOPS: skip the 'long
> requests' and 'repeated requests' sections - go down to the complete
> query list and each of those has been substituted already - you may need
> to handle the occasional u"foo" because we capture the variables outside
> the sql layer, but other than string literals it should be good to go.

Ah, thank you! Cool, and good to know.

I did do an explain on staging and nothing looked obviously wrong to my untrained eye other than it being very large. I then did an explain analyze and was surprised to see much shorter query times than I expected. I then turned on \timing and ran the queries. I consistently got around 60-70 ms for the first query, returning 78 rows. This is in comparison to 1499 ms on the OOPS report.

Maybe something else was going on in the production database at the time? Maybe the production database is significantly different?

We've had about 5000 of these in the past week, though, according to the OOPS tool. I don't know. :-/

I'll ask a LOSA to try this on production if I haven't heard back from Stuart or others.

Revision history for this message
Robert Collins (lifeless) wrote :

Could you paste the plan into the bug? I'll happily have a look for
gotchas for you.

Revision history for this message
Gary Poster (gary) wrote :

Awesome, thank you.

explain output: http://pastebin.ubuntu.com/572019/
explain analyze output: http://pastebin.ubuntu.com/572020/

Revision history for this message
Robert Collins (lifeless) wrote :

Ok, this suggests it is moderately cheap (4000 estimated cost) - note
that in a cold cache situation that is several seconds on its own.

The plan is very complex, but wgrant pointed out a key element: there
are only -4- subscription filters on qastaging at the moment, so we're
not seeing anything matching what prod is doing.

e.g.

    -> Hash (cost=1.02..1.02 rows=2 width=8)
          -> Seq Scan on bugsubscriptionfilter (cost=0.00..1.02 rows=2 width=8)

is reading every bugsubscription filter and hashing it with no
condition - it's then filtering *after* doing the hash join. We should
expect this to change radically if we put the filters we have on prod
into the staging environment - 24K worth.
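
To make the staging planner see production-scale row counts (the ~24K filters mentioned above), one could bulk-create placeholder filter rows along the following lines. This is a sketch under assumed column names, for staging only, and is roughly what Danilo describes having done in a later comment.

    -- Create one empty filter per structural subscription that lacks one,
    -- so row counts resemble production (column names assumed; staging only):
    INSERT INTO BugSubscriptionFilter (structuralsubscription)
    SELECT ss.id
      FROM StructuralSubscription ss
     WHERE NOT EXISTS (SELECT 1
                         FROM BugSubscriptionFilter f
                        WHERE f.structuralsubscription = ss.id);
    ANALYZE BugSubscriptionFilter;  -- refresh planner statistics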

Revision history for this message
Robert Collins (lifeless) wrote :

I don't know if it's possible, but one way we could fix the critical component of this bug is to nuke all the bugsubscription filters that were recently added to production; then we can test with them on staging before reintroducing them.
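
The SQL eventually run for this lived in a private pastebin, so the following is only an illustrative sketch of the idea: delete the auto-created filters that carry no conditions. The condition-table and column names are assumptions and would need checking against the real schema before running anything like this.

    -- Illustrative only; not the SQL that was actually run.
    DELETE FROM BugSubscriptionFilter f
     WHERE NOT EXISTS (SELECT 1 FROM BugSubscriptionFilterStatus s
                        WHERE s.filter = f.id)
       AND NOT EXISTS (SELECT 1 FROM BugSubscriptionFilterImportance i
                        WHERE i.filter = f.id)
       AND NOT EXISTS (SELECT 1 FROM BugSubscriptionFilterTag t
                        WHERE t.filter = f.id);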

Revision history for this message
Gavin Panella (allenap) wrote :

> In any case, it is a good question whether these pages will need to
> consider filters in the future, and one we had not considered
> deeply. I will ask Jono. It would have pros and cons.

My feeling is that the list of subscribers shown on the page should
either not include structural subscribers, or show all structural
subscribers regardless of filters.

We don't know what the next mutation of a bug is going to be, so we
don't know which filters that change is going to match ahead of
time. A list of subscribers that considers filters against the current
state of the bug is often going to be misleading.

Revision history for this message
Данило Шеган (danilo) wrote :

A few notes myself:
 - we can get rid of LEFT JOINs with BugSubscriptionFilter now, since we'll always have at least one
 - clearing empty BSFs out would help only for existing filters, unless we also roll out a change that auto-creates them
 - and this definitely means that BSFs are largely unusable so I think we'll have to fix this anyway

With 24k BugSubscriptionFilters added to staging DB, I am seeing roughly the same performance of the query with warm caches as on production, so staging DB should now be more suitable for optimization work. Though, I think it's quite clear that the only thing that needs to happen is to not do so many very similar subselects in a single query and instead do that with joins. Of course, as Rob points out, we need to ensure that the most restricting conditions apply first if Postgres doesn't do that for us (and we can sometimes achieve that with subselects). I'll probably be looking at this further with Gary later today.
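
A minimal sketch of the join-based direction described here, not the actual fix that landed: resolve the bug's targets once and join to subscriptions and filters, rather than unioning a near-identical subselect per target. Table and column names are assumptions, and 723945 is the bug from the OOPS URL.

    SELECT DISTINCT ss.id
      FROM StructuralSubscription ss
      JOIN BugSubscriptionFilter f ON f.structuralsubscription = ss.id
     WHERE (ss.distribution, ss.sourcepackagename) IN
               (SELECT bt.distribution, bt.sourcepackagename
                  FROM BugTask bt
                 WHERE bt.bug = 723945)
        OR ss.distroseries IN
               (SELECT bt.distroseries
                  FROM BugTask bt
                 WHERE bt.bug = 723945);
    -- Filter conditions (status, importance, tags) then become further joins
    -- against f, instead of repeated INTERSECTed subselects.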

Revision history for this message
Данило Шеган (danilo) wrote :

FWIW, I believe this is something where using our existing Collection pattern would be quite suitable.

Revision history for this message
Gary Poster (gary) wrote :

1) Danilo and I agree with Robert that we should remove the empty BugSubscriptionFilters from production, at least for now while we dig into the problem. 5000+ timeouts a week from this probably means that pursuing a short-term band-aid is worth it, especially when we have a relatively easy one within our grasp. Danilo is crafting the SQL now.

2) Afterwards, we will try out the ideas Danilo mentions in comment #15 on staging.

3) If those ideas are insufficient, we will reconsider the empty filters generally. They make things nicer from the perspective of normalizing our data to correspond better to our UI, but there are other approaches available to us if they are prohibitively expensive. I'm hopeful work on #2 will make this reconsideration unnecessary though, because if we can't perform #2 well then that calls into question our ability to execute on our plans generally, if filters become as important as we hope and expect them to.

In regards to Gavin's comment #14: thank you for the thoughts. That makes sense to me too. I'll pass those on to Jono today.

Revision history for this message
Данило Шеган (danilo) wrote :

SQL has been run. This should buy us some time to solve the actual query problem.

Revision history for this message
Данило Шеган (danilo) wrote :

FWIW, we should re-do the query to create missing BSFs once this lands (and is proven to work): https://pastebin.canonical.com/43604/

Revision history for this message
Gary Poster (gary) wrote :

Данило's reminder from comment #20 is now tracked in bug 728818.

Revision history for this message
Launchpad QA Bot (lpqabot) wrote :
Changed in launchpad:
milestone: none → 11.03
tags: added: qa-needstesting
Changed in launchpad:
status: Triaged → Fix Committed
Revision history for this message
Gary Poster (gary) wrote :

I'm marking this qa-ok so that I don't block deployment; I won't know if it actually helps things until it makes it to staging, and it won't do that for another 9 hours or so.

tags: added: qa-ok
removed: qa-needstesting
tags: added: qa-needstesting
removed: qa-ok
Revision history for this message
Gary Poster (gary) wrote :

belay that last; I'm doing basic testing in qastaging to make sure I haven't made things worse.

Revision history for this message
Gary Poster (gary) wrote :

qastaging seems no worse, at least; we'll have to see if staging, with all the filters, is significantly better.

tags: added: qa-ok
removed: qa-needstesting
Revision history for this message
Gary Poster (gary) wrote :

staging seems pretty snappy now that the branch is there. We should now try reinstating the filters on production, bug 728818.

Revision history for this message
Robert Collins (lifeless) wrote :

Prod is only at 12515, so we need to wait for the branch to reach
production first. We'll do a nodowntime deploy today with (hopefully)
all the changes leading up to the db-devel merge. If you wait 24 hours
after that to get a reasonable baseline and then put the filters in on
Tuesday, I think that would give enough room to see whether they cause
timeouts before the db deploy happens.

-Rob

William Grant (wgrant)
Changed in launchpad:
status: Fix Committed → Fix Released
Revision history for this message
Gary Poster (gary) wrote :

The OOPS tool does not provide a reliable measurement of the scope of this problem. :-(

The OOPS grouping for this traceback does not distinguish this cause from many others. For instance, when I just checked it thought that https://lp-oops.canonical.com/oops/?oopsid=OOPS-1894C1466 is the most recent instance of the problem, and it has nothing to do with it. Therefore the previous count of "5000+ timeouts in a week" was spurious--the count is down to 4375 now, but 4375 of what?

The good news is that, looking at staging, reports like https://lp-oops.canonical.com/oops.py/?oopsid=OOPS-1894STAGING110 and https://lp-oops.canonical.com/oops.py/?oopsid=OOPS-1894STAGING109 show radically smaller queries, and runtimes of 1ms + 30ms for the two pertinent queries. (I'm not sure why bug heat is timing out like that, but that's not where I need to focus right now.)

Therefore, unless I hear otherwise, I think we have as good an indication as I can find that we are OK to reinstate filters on production. I'll wait till tomorrow to see if there are any objections or thoughts, then give it a whirl.

Revision history for this message
Robert Collins (lifeless) wrote :

That's great news. Please don't enable the filters on production until Monday:

 - on Thursday we have a db deploy (hard to roll back from)
 - on Friday, if it does have trouble, we'll have no one around to fix it
   due to the weekend.

We are also still short asia-pac losa coverage due to sickness.

We should be good to do this on Monday, and if there is a simple
script to set them up I'd be delighted to arrange for that to run on
Monday morning for you if we have asia-pac losa coverage.

Revision history for this message
Robert Collins (lifeless) wrote :

> We should be good to do this on Monday, and if there is a simple
> script to set them up I'd be delighted to arrange for that to run on
> Monday morning for you if we have asia-pac losa coverage.

(And if there isn't asia pac coverage I'm totally happy for you to
arrange it first thing your Monday)

Revision history for this message
Gary Poster (gary) wrote :

Thank you! The bug for reinstating the filters is bug 728818, which has the SQL: https://pastebin.canonical.com/43604/
