Log query plan for timeout oopses

Bug #604509 reported by Jeroen T. Vermeulen
16
This bug affects 1 person
Affects Status Importance Assigned to Milestone
Launchpad itself
Triaged
Low
Stuart Bishop

Bug Description

For timeout oopses with long queries in them, I think it would be helpful to run an EXPLAIN on the longest query, and show the query plan in the oops report. (Not an EXPLAIN ANALYZE; that would execute the query and cause lots of trouble).

On the one hand this would save us a step in reproducing slow queries. On the other, it will tell us something we've been wondering about mysterious timeouts that we can't reproduce: are we testing the same query plan that applied at the time and it sometimes behaves unpredictably, or has it changed? So far we've been guessing.

(For merely repetitive queries this information isn't interesting).

Curtis Hovey (sinzui)
affects: launchpad → launchpad-foundations
summary: - Log EXPLAIN timeout oopses
+ Log query plan for timeout oopses
Gary Poster (gary)
Changed in launchpad-foundations:
status: New → Incomplete
Revision history for this message
Gary Poster (gary) wrote :

I assume/hope EXPLAIN is very very fast, so it's probably no problem to do this, other than prioritization to find time to do it. However, I'm not clear on when this is useful.

This is for a case in which we have rare timeouts, I guess? If the timeout is frequent, then presumably an EXPLAIN on the production (or maybe even staging?) database will be sufficient, I would think. So, this is only helpful for rare timeouts? ...because when you run EXPLAIN on the production database later, it doesn't seem to be a problem any more?

Revision history for this message
Robert Collins (lifeless) wrote : Re: [Bug 604509] Re: Log query plan for timeout oopses

Right, the use case Jeroen presented to me was things where ever so
slightly stale statistics on prod cause bad behaviour, which we can't
then reproduce later: so we should consistently and pervasively gather
this data to counter that.

I do have a concern about perf impact on prod doing this, would like
Stuarts input there.

Revision history for this message
Gary Poster (gary) wrote :

Stuart, I assigned for you to comment on the idea when you get a chance. Thank you

Changed in launchpad-foundations:
status: Incomplete → Triaged
importance: Undecided → Low
assignee: nobody → Stuart Bishop (stub)
Revision history for this message
Jeroen T. Vermeulen (jtv) wrote :

EXPLAIN is fast—definitely sub-second even in tragically difficult cases; I doubt we'll ever spend more than 20 milliseconds on one.

It's EXPLAIN ANALYZE that's slow. But we shouldn't do that anyway because it executes the query in order to get the timings.

The practical use of this feature isn't clear until we try it. Right now we often face the question: "it runs fine for me now, and the plan looks good... did the plan change or did something else happen?" We may find that the plan is always the same, in which case we can stop worrying about it—and the devs can be confident that they get representative oopses that they can use for optimization. Conversely we may find that the plans change, in which case we can stop agonizing over the exact plans and start worrying about avoiding unpredictability.

Revision history for this message
Gary Poster (gary) wrote :

Understood.

FWIW, 20 milliseconds per query would add up unpleasantly in some cases, I think. Perhaps "log EXPLAIN output for the first N queries in the request?" Maybe combine that with "if time of individual query > X, log EXPLAIN output."

I'll leave this for Stuart's opinion, though.

Revision history for this message
Jeroen T. Vermeulen (jtv) wrote :

Two points:

1. Based on discussion on the postgres developers' list on this topic, I would believe 20ms to be an extreme case. Take the last digit off for a more typical number.

2. If it's still too long, well that's why the bug only asks for the longest query to be explained. :-)

Revision history for this message
Robert Collins (lifeless) wrote :

I think doing this in the oops for the longest query is sensible, but
only once we're in oops territory and only the longest query as a
starting point.

Revision history for this message
Stuart Bishop (stub) wrote :

EXPLAIN (without analyze) is fast.

We only want to do this for queries that take longer than some threshold (eg. 5 seconds, or 50% of the soft timeone, or whatever). We don't want query plans for all queries because there might be an awful lot of them, making the whole process slow and consuming gobs of disk space.

Revision history for this message
Jeroen T. Vermeulen (jtv) wrote :

My original call was for "the longest query in a timeout oops," but what Stuart says is just as good for me.

To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Duplicates of this bug

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.