Launchpad should return 503 error pages when database is unavailable

Reported by Stuart Bishop on 2011-09-08
This bug affects 4 people
Affects: Launchpad itself
Importance: Critical
Assigned to: Gary Poster

Bug Description

When a PostgreSQL database is unavailable, we should return a 503 error code instead of a 500.

We should still log an OOPS.

We can make the page pretty. Wording might be tricky, as Launchpad will not know if this was a scheduled outage or if a power cord has been kicked out.
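
A minimal sketch of the requested behaviour, written as generic WSGI middleware rather than as Launchpad's actual publisher code. `DatabaseUnavailable`, `report_oops`, the `X-Oops-Id` header and the `Retry-After` value are all assumptions made for illustration; in practice the caught exception would be whatever disconnection error the ORM raises, and the OOPS would go through the real error-reporting machinery.

```python
import logging
import sys
import uuid


class DatabaseUnavailable(Exception):
    """Hypothetical stand-in for the ORM's disconnection error."""


def report_oops(exc):
    """Hypothetical helper: log the failure and return an identifier for it."""
    oops_id = "OOPS-" + uuid.uuid4().hex[:8].upper()
    logging.getLogger("oops").error("%s: %r", oops_id, exc)
    return oops_id


def database_outage_middleware(app):
    """Wrap a WSGI app so a lost database connection yields 503, not 500."""

    def wrapper(environ, start_response):
        try:
            return app(environ, start_response)
        except DatabaseUnavailable as exc:
            # The outage is still recorded, as the report asks.
            oops_id = report_oops(exc)
            # Neutral wording: the server cannot tell a scheduled outage
            # from an unexpected one.
            body = b"Launchpad is temporarily unavailable. Please try again shortly.\n"
            headers = [
                ("Content-Type", "text/plain; charset=utf-8"),
                ("Content-Length", str(len(body))),
                ("Retry-After", "60"),   # assumed value
                ("X-Oops-Id", oops_id),  # hypothetical header
            ]
            start_response("503 Service Unavailable", headers, sys.exc_info())
            return [body]

    return wrapper
```

The point of the sketch is only the mapping: a database disconnection becomes a 503 response, while every other exception continues to propagate and is handled as before.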

Related branches

lp:~gary/launchpad/bug844631
Merged into lp:launchpad at revision 14096
Brad Crittenden: Approve (code) on 2011-09-30
Robert Collins (community): Approve on 2011-09-30
Stuart Bishop (stub) on 2011-09-08
Changed in launchpad:
status: New → Triaged
importance: Undecided → High
tags: added: fastdowntime-later
Stuart Bishop (stub) wrote :

This is different to Bug #373191, as that bug is about what HAProxy or Apache should return if Launchpad is not available. This bug is about what Launchpad should return if PostgreSQL is not available.

Stuart Bishop (stub) wrote :

OOPS-2077A100 - "AssertionError: Bug #504291: Store left in a disconnected state."

Stuart Bishop (stub) wrote :

Also OOPS-2077A101 - DisconnectionError

(this is the expected exception)

Stuart Bishop (stub) wrote :

OOPS-2077A73 for the initial shutdown

Changed in launchpad:
assignee: nobody → Huw Wilkins (huwshimi)
Changed in launchpad:
status: Triaged → In Progress
Matthew Revell (matthew.revell) wrote :

Huw has produced a design for this page and is now implementing it.

Vincent Ladeuil (vila) wrote :

\o/

Gary Poster (gary) wrote :

I linked Huw's branch, which I expect is the UI. Yellow can hook it up.

Matthew Revell (matthew.revell) wrote :

I marked this Critical as this is producing a user-visible OOPS.

Gary, thanks for linking the branch with the UI.

Changed in launchpad:
importance: High → Critical
Robert Collins (lifeless) wrote :

I've uncriticaled this because it's a *symptom*, not a cause, of users seeing OOPSes: fastdowntime is the cause, and we'll still be generating OOPSes when this event happens, because the appservers cannot tell fastdowntime from the-network-is-broken.

I'd also like to note that we should include the OOPS in the 503 page (but very small, in a corner, or even just in the source) - so that if someone gets one of these when we are *not* deploying, we can track it down. e.g. users don't need to see it, but if someone hits the page and asks in IRC, we can tell them what to do to get the OOPS id out.
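
One way the "just in the source" option could look, as a rough sketch: the OOPS id is written into an HTML comment so that it is invisible on the rendered page but can be recovered with "view source". The function name and page wording are assumptions for illustration, not the design from Huw's branch.

```python
from html import escape


def render_unavailable_page(oops_id):
    """Return a 503 page body with the OOPS id hidden in an HTML comment."""
    # Escape the id, and break up any "--" so it cannot end the comment early.
    safe_id = escape(oops_id).replace("--", "- -")
    return (
        "<!DOCTYPE html>\n"
        "<html>\n"
        "<head><title>Launchpad is temporarily unavailable</title></head>\n"
        "<body>\n"
        "<p>Launchpad is temporarily unavailable. Please try again shortly.</p>\n"
        "<!-- OOPS id: %s -->\n"
        "</body>\n"
        "</html>\n"
    ) % safe_id
```

Someone asking for help in IRC could then be told to view the page source and quote the id from the comment.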

Changed in launchpad:
importance: Critical → High
Gary Poster (gary) on 2011-09-21
Changed in launchpad:
assignee: Huw Wilkins (huwshimi) → Launchpad Yellow Squad (yellow)
status: In Progress → Triaged
importance: High → Critical
Robert Collins (lifeless) wrote :

(reverting the high->critical as there was no comment explaining it)

Changed in launchpad:
importance: Critical → High
Changed in launchpad:
importance: High → Critical
Matthew Revell (matthew.revell) wrote :

To clarify why I consider this bug Critical:

Our daily fast down-time is a great boost to the speed of getting new db-change-reliant features released. It has also got rid of that awful 90 minute window where parts of LP were down and others were read-only.

However, it has given us a user-experience regression in that we now present an OOPS rather than a friendlier "Hey, Launchpad is read-only but you can still do some stuff" message.

In my view:

 * an unexplained OOPS for up to five minutes a day, potentially every day, just looks awful and I think it's safe to say that most people seeing the OOPS won't have read the blog nor will they have found out through our stakeholder process
 * Huw's new design brings in updates from the status feed and sends people to subscribe to that, thereby making us appear more reliable in future as they'll have had warning of such down-time.

Rob said: "I'd also like to note that we should include the OOPS in the 503 page (but very small, in a corner, or even just in the source)".

Thanks for that Rob.

Yellow squad, could you put "(OOPS 123456)" directly after "...because our database has gone offline." please?

Huw, if you have a better suggestion then please update your branch and let us know in a comment here.

Gary Poster (gary) wrote :

@Matthew
Will do.

tags: added: escalatedd
tags: added: escalated
removed: escalatedd
Stuart Bishop (stub) wrote :

The assert I mentioned in comment #2 should probably go. I think it was added to track down disconnection errors we were seeing (which have now been fixed in Storm).

http://bazaar.launchpad.net/~launchpad-pqm/launchpad/devel/view/head:/lib/canonical/librarian/tests/test_db_outage.py has tests for how the Librarian handles an outage. A similar approach should be doable to bring up an appserver connected via pgbouncer. If it is a problem, I was toying with the idea of installing the pgbouncer fixture once for everything and running all tests through it.

Stuart Bishop (stub) wrote :

Bug #846162 is related. Please update that bug if your tests show that the 503 page is taking a long time to render.

Gary Poster (gary) wrote :

Thanks, Stuart. I'm giving it a whirl.

Changed in launchpad:
assignee: Launchpad Yellow Squad (yellow) → Gary Poster (gary)
status: Triaged → In Progress
William Grant (wgrant) wrote :

This caused test failures in buildbot (see <https://lpbuildbot.canonical.com/builders/lucid_lp/builds/1428/steps/shell_6/logs/summary>), so has been rolled back.

Launchpad QA Bot (lpqabot) wrote :
tags: added: qa-needstesting
Changed in launchpad:
status: In Progress → Fix Committed
William Grant (wgrant) on 2011-10-03
Changed in launchpad:
status: Fix Committed → In Progress
Steve Kowalik (stevenk) on 2011-10-03
tags: added: qa-untestable
removed: qa-needstesting
Launchpad QA Bot (lpqabot) wrote :
tags: added: qa-needstesting
removed: qa-untestable
Changed in launchpad:
status: In Progress → Fix Committed
Gary Poster (gary) on 2011-10-04
tags: added: qa-ok
removed: qa-needstesting
William Grant (wgrant) on 2011-10-08
Changed in launchpad:
status: Fix Committed → Fix Released