Make PostgreSQL replicable on Open Library

Bug #600018 reported by George
Affects: Open Library
Status: New
Importance: Critical
Assigned to: Jim Shankland
Milestone: stability

Bug Description

Our database is a single point of failure today. We need to address this promptly.

Jim, please do the first round of analysis of our options here, and come up with a recommended course of action. Ideally, within a week.

George (george-archive)
Changed in openlibrary:
importance: Undecided → Critical
milestone: none → stability
George (george-archive)
Changed in openlibrary:
assignee: nobody → Jim Shankland (jim-archive)
Jim Shankland (jim-archive) wrote:

We are now running the Open Library database on a pair of SSDs, which has greatly improved its performance and made nightly backups possible. At present, if we suffer a hard failure, we will bring up a replacement machine and restore the database from the nightly backup. That will take several hours and lose some or all of the database updates made since the last backup.
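The comment does not say how the nightly backup is taken. As a sketch only, assuming a logical dump made with the standard `pg_dump` and `pg_restore` tools (the database name `openlibrary` and the `/backups` directory are hypothetical), the backup and restore commands could be built like this:

```python
import datetime
import subprocess
from pathlib import Path


def nightly_dump_command(dbname: str, backup_dir: Path, day: datetime.date) -> list:
    """Build a pg_dump argv that writes a custom-format archive named by date."""
    outfile = backup_dir / f"{dbname}-{day.isoformat()}.dump"
    # --format=custom produces a compressed archive that pg_restore can read.
    return ["pg_dump", "--format=custom", f"--file={outfile}", dbname]


def restore_command(dbname: str, dump_file: Path) -> list:
    """Build a pg_restore argv that loads a dump into an existing, empty database."""
    return ["pg_restore", f"--dbname={dbname}", str(dump_file)]


def run(argv: list) -> None:
    # check=True raises CalledProcessError if the tool exits with a nonzero status.
    subprocess.run(argv, check=True)
```

A dump reflects the database as of the moment it started, so anything written afterwards is missing from it; that is why restoring from a nightly backup can lose up to roughly a day of updates.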

Work is in progress to bring up a "warm standby" server, which will reduce the maximum data loss in case of a hard failure to 5 minutes' worth, and should reduce downtime to under an hour. Development work on the warm standby server will be completed this week. The standby server will also require its own pair of SSDs before it can be put into full production. I've ordered these, but I'm not sure when they'll get delivered.
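The comment does not say how the 5-minute bound is reached. One common way with file-based log shipping is to set `archive_timeout = 300`, which caps how long a partially filled WAL segment can wait before being archived, and to use an `archive_command` that copies each finished segment to where the standby can read it. The script below is a hypothetical `archive_command` helper; its name and the archive path `/mnt/standby_archive` are assumptions. It follows the documented contract: exit 0 only once the segment is safely stored, and refuse to overwrite a file that is already archived.

```python
#!/usr/bin/env python3
"""Hypothetical archive_command helper for file-based WAL shipping.

Assumed postgresql.conf settings on the primary (not from the original):
    wal_level = archive          # needed on 9.0; absent in 8.x
    archive_mode = on
    archive_command = '/usr/local/bin/archive_wal.py %p %f'
    archive_timeout = 300        # archive a partial segment after at most 5 minutes
"""
import os
import shutil
import sys
from pathlib import Path

ARCHIVE_DIR = "/mnt/standby_archive"  # hypothetical directory shared with the standby


def archive_wal(src_path: str, wal_name: str, archive_dir: str = ARCHIVE_DIR) -> int:
    """Copy one WAL file into the archive. Return 0 only on success."""
    dest = Path(archive_dir) / wal_name
    if dest.exists():
        # Never overwrite an archived segment. A nonzero status makes
        # PostgreSQL keep the segment and retry, so the conflict gets noticed.
        return 1
    tmp = dest.parent / (dest.name + ".tmp")
    try:
        with open(src_path, "rb") as src, open(tmp, "wb") as out:
            shutil.copyfileobj(src, out)
            out.flush()
            os.fsync(out.fileno())  # make the bytes durable before reporting success
        os.replace(tmp, dest)  # atomic rename, so the standby never sees a partial file
    except OSError:
        return 1
    return 0


if __name__ == "__main__" and len(sys.argv) == 3:
    # PostgreSQL substitutes %p (path to the segment) and %f (its file name).
    sys.exit(archive_wal(sys.argv[1], sys.argv[2]))
```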

Will Glynn (willglynn) wrote:

PostgreSQL 9.0 added streaming replication:
    http://www.postgresql.org/docs/9.0/interactive/warm-standby.html#STREAMING-REPLICATION

This lets you cut down the replication lag (and thus potential for data loss) from a whole WAL file to the level of individual transactions as they occur.
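To watch how far a streaming standby trails the primary, one can compare the primary's `pg_current_xlog_location()` with the standby's `pg_last_xlog_receive_location()` (the 9.0 function names). Both return text such as `0/3000060`. The sketch below converts those strings to byte positions. The `pre_93` switch assumes default 16 MB segments: before PostgreSQL 9.3 each logical log file spanned 0xFF000000 bytes, whereas 9.3 and later use a flat 64-bit position.

```python
def lsn_to_bytes(lsn: str, pre_93: bool = False) -> int:
    """Convert an 'X/Y' WAL location (two hex numbers) to a byte position."""
    hi_hex, lo_hex = lsn.split("/")
    hi, lo = int(hi_hex, 16), int(lo_hex, 16)
    # Before 9.3 the last 16 MB segment of each 4 GB log file was never used,
    # so each step in 'X' covered 255 segments (0xFF000000 bytes) instead of 2**32.
    span = 0xFF000000 if pre_93 else 1 << 32
    return hi * span + lo


def replication_lag_bytes(primary_lsn: str, standby_lsn: str, pre_93: bool = False) -> int:
    """Bytes of WAL the primary has written that the standby has not yet received."""
    return lsn_to_bytes(primary_lsn, pre_93) - lsn_to_bytes(standby_lsn, pre_93)


print(replication_lag_bytes("0/3000060", "0/3000000", pre_93=True))  # 96
```

With file-based shipping this gap can reach a whole 16 MB segment (or whatever accumulates before `archive_timeout` fires); with streaming it typically stays at the size of the most recent transactions.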

Additionally, replication can be optionally paired with the new hot standby function:
  http://www.postgresql.org/docs/9.0/interactive/hot-standby.html

This would permit read-only access to the slave database(s) without triggering a failover event, which is useful for spreading load, or even for separating different types of load. For example, interactive users go straight to the master, while backups, exports, etc. run on the slave.
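The split described above can live in the application's connection logic. A minimal sketch, assuming two hypothetical connection strings and a caller that states whether a job writes; the standby itself would run with `hot_standby = on`:

```python
PRIMARY_DSN = "host=db-primary dbname=openlibrary"  # hypothetical master
STANDBY_DSN = "host=db-standby dbname=openlibrary"  # hypothetical hot standby

# Read-only batch jobs that can tolerate slightly stale data.
STANDBY_WORKLOADS = frozenset({"backup", "export"})


def choose_dsn(workload: str, writes: bool) -> str:
    """Pick the server for a job: anything that writes must use the master."""
    if writes:
        return PRIMARY_DSN  # a hot standby rejects writes
    if workload in STANDBY_WORKLOADS:
        return STANDBY_DSN
    # Interactive reads stay on the master so users see their own edits at once.
    return PRIMARY_DSN
```

One design caveat: queries on a hot standby can be canceled when they conflict with WAL replay, so long exports may need a larger `max_standby_streaming_delay` on the standby, at the cost of more replication lag.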
