Critical data loss bug in postgresql-common initscript

Bug #1042556 reported by Craig Ringer
10
This bug affects 1 person
Affects Status Importance Assigned to Milestone
postgresql-common (Debian)
Fix Released
Unknown
postgresql-common (Ubuntu)
Fix Released
High
Unassigned
Lucid
Fix Released
High
Unassigned
Precise
Fix Released
High
Martin Pitt

Bug Description

Hi

The Debian packages for PostgreSQL (and thus the Ubuntu packages because of the shared use of pg_wrapper) are subject to a potentially critical data loss bug because of an unsafe procedure for restarting PostgreSQL.

This issue has been recognised and patched in Debian:

    http://anonscm.debian.org/loggerhead/pkg-postgresql/postgresql-common/trunk/revision/1181
    http://archives.postgresql.org/pgsql-general/2012-07/msg00501.php

but should be urgently included in Ubuntu and backported.

I quote Tom Lane (key PostgreSQL dev):

        [The] forced unlink on the postmaster.pid file [...] (a) is entirely
        unnecessary, and (b) defeats the safety interlock against starting a
        new postmaster before all the old backends have flushed out.

It is VITAL that pg_wrapper NEVER unlink the postmaster.pid file. The postmaster will do that its self if it finds the pid to be stale, but only after performing some checks to make sure there are no backends still running and to ensure that there's no other postmaster running against the database.

See:
    http://archives.postgresql.org/pgsql-general/2012-07/msg00475.php

Context here:

    http://archives.postgresql.org/pgsql-general/2012-07/msg00350.php
    http://dba.stackexchange.com/questions/20959/recover-postgresql-database-from-wal-errors-on-startup/20961

SRU INFORMATION:
 * Impact: Severe data loss in rare corner cases.

 * Regression potential: Very low. The change has been in Debian, Quantal, and my very popular PostgreSQL backports repository for quite some time. pg_ctlcluster has a function start_check_pid_file() which cleans up a stale PID file on startup if it still exists after pg_ctlcluster stop --force goes to kill -9 the postmaster, so that does not stop a subsequent startup. The test suite (t/030_errors.t) explicitly covers scenarios with missing, broken, and stale PID files and ensures that they are handled properly.

 * Test case: I do not know a realistic and reliable test case to cause the data loss, but the analysis of the bug in above ML thread is very clear. I suggest to regression-test the change only, i. e. run the postgresql-common test suite and a manual check that starting a cluster still works with a stale pid file being around:

  sudo pg_createcluster 9.1 test --start
  sudo cp /var/lib/postgresql/9.1/test/postmaster.pid{,.save}
  sudo pg_ctlcluster 9.1 test stop
  # now cause a stale pid file
  sudo cp /var/lib/postgresql/9.1/test/postmaster.pid{.save,}

  # this should succeed and say "Removed stale pid file."
  sudo pg_ctlcluster 9.1 test start

  # this should say that 9.1/test is online
  pg_lsclusters

Revision history for this message
Craig Ringer (ringerc) wrote :
Changed in postgresql-common (Ubuntu):
status: New → Confirmed
Revision history for this message
Craig Ringer (ringerc) wrote :

BTW, while the Debian patch says "potentially dangerous", Tom Lane (core Pg developer & Red Hat package maintainer) says "horridly, horridly dangerous and stupid".

Revision history for this message
Martin Pitt (pitti) wrote :

This is fixed in Quantal.

Changed in postgresql-common (Ubuntu):
status: Confirmed → Fix Released
importance: Undecided → High
Changed in postgresql-common (Ubuntu Lucid):
status: New → Triaged
Changed in postgresql-common (Ubuntu Precise):
status: New → Triaged
Revision history for this message
Martin Pitt (pitti) wrote :

Adding tasks for LTSes.

Changed in postgresql-common (Ubuntu Precise):
assignee: nobody → Martin Pitt (pitti)
importance: Undecided → High
Changed in postgresql-common (Ubuntu Lucid):
importance: Undecided → High
Revision history for this message
Craig Ringer (ringerc) wrote :

Thanks. I wanted to make sure this didn't slip under the radar in terms of merging with Debian etc.

Changed in postgresql-common (Debian):
status: Unknown → New
Revision history for this message
Martin Pitt (pitti) wrote :

Uploaded to precise-proposed. I do not know a sensible test case how to reproduce the failure reliably, but the discussion on the mailing list makes it rather clear. I'll update the description for SRU as good as I can.

Changed in postgresql-common (Ubuntu Precise):
status: Triaged → Fix Committed
Revision history for this message
Martin Pitt (pitti) wrote :

Uploaded to lucid-proposed unapproved queue as well. This now needs to be reviewed by the SRU team.

description: updated
Changed in postgresql-common (Ubuntu Lucid):
status: Triaged → Fix Committed
Revision history for this message
Scott Kitterman (kitterman) wrote : Please test proposed package

Hello Craig, or anyone else affected,

Accepted into precise-proposed. The package will build now and be available in a few hours in the -proposed repository.

Please help us by testing this new package. See https://wiki.ubuntu.com/Testing/EnableProposed for documentation how to enable and use -proposed. Your feedback will aid us getting this update out to other Ubuntu users.

If this package fixes the bug for you, please change the bug tag from verification-needed to verification-done. If it does not, change the tag to verification-failed. In either case, details of your testing will help us make a better decision.

Further information regarding the verification process can be found at https://wiki.ubuntu.com/QATeam/PerformingSRUVerification . Thank you in advance!

tags: added: verification-needed
Revision history for this message
Scott Kitterman (kitterman) wrote :

Hello Craig, or anyone else affected,

Accepted into lucid-proposed. The package will build now and be available in a few hours in the -proposed repository.

Please help us by testing this new package. See https://wiki.ubuntu.com/Testing/EnableProposed for documentation how to enable and use -proposed. Your feedback will aid us getting this update out to other Ubuntu users.

If this package fixes the bug for you, please change the bug tag from verification-needed to verification-done. If it does not, change the tag to verification-failed. In either case, details of your testing will help us make a better decision.

Further information regarding the verification process can be found at https://wiki.ubuntu.com/QATeam/PerformingSRUVerification . Thank you in advance!

Revision history for this message
Martin Pitt (pitti) wrote :

I run the p-common tests and manual tests for precise with the packages from -proposed, and they all succeed, marking verified for precise.

For lucid I now noticed a slight regression: pg_ctlcluster prints out a warning

   Name "main::result" used only once: possible typo at /usr/bin/pg_ctlcluster line 292

which is harmless, but wasn't there before, and breaks expected script output. When merging the patch I accidentally added an additional "$result = 0" which doesn't belong there. Sorry for missing that earlier! I uploaded 106ubuntu3 to the unapproved queue which fixes this.

tags: added: verification-done-precise
Martin Pitt (pitti)
Changed in postgresql-common (Ubuntu Lucid):
status: Fix Committed → In Progress
Revision history for this message
Scott Kitterman (kitterman) wrote : Update Released

The verification of this Stable Release Update has completed successfully and the package has now been released to -updates. Subsequently, the Ubuntu Stable Release Updates Team is being unsubscribed and will not receive messages about this bug report. In the event that you encounter a regression using the package from -updates please report a new bug using ubuntu-bug and tag the bug report regression-update so we can easily find any regresssions.

Revision history for this message
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package postgresql-common - 129ubuntu1

---------------
postgresql-common (129ubuntu1) precise-proposed; urgency=low

  * pg_ctlcluster: Do not remove the PID file after SIGKILLing the
    postmaster in the "last-ditch effort to shut down" in --force mode. This
    is a potentially dangerous thing to do when trying to start a second
    postmaster in parallel while the first one is still being shut down.
    (see http://archives.postgresql.org/pgsql-general/2012-07/msg00475.php)
    Backported from trunk r1181. (LP: #1042556)
  * debian/control: Update Vcs-* for new precise branch.
 -- Martin Pitt <email address hidden> Thu, 06 Sep 2012 15:01:02 +0200

Changed in postgresql-common (Ubuntu Precise):
status: Fix Committed → Fix Released
Revision history for this message
Scott Kitterman (kitterman) wrote : Please test proposed package

Hello Craig, or anyone else affected,

Accepted postgresql-common into precise-proposed. The package will build now and be available at http://launchpad.net/ubuntu/+source/postgresql-common/106ubuntu3 in a few hours, and then in the -proposed repository.

Please help us by testing this new package. See https://wiki.ubuntu.com/Testing/EnableProposed for documentation how to enable and use -proposed. Your feedback will aid us getting this update out to other Ubuntu users.

If this package fixes the bug for you, please change the bug tag from verification-needed to verification-done. If it does not, change the tag to verification-failed. In either case, details of your testing will help us make a better decision.

Further information regarding the verification process can be found at https://wiki.ubuntu.com/QATeam/PerformingSRUVerification . Thank you in advance!

Changed in postgresql-common (Debian):
status: New → Fix Released
Revision history for this message
Brian Murray (brian-murray) wrote : Verification still needed

The fix for this bug has been awaiting testing feedback in the -proposed repository for lucid for more than 90 days. Please test this fix and update the bug appropriately with the results. In the event that the fix for this bug is still not verified 15 days from now, the package will be removed from the -proposed repository.

tags: added: removal-candidate
Revision history for this message
Martin Pitt (pitti) wrote :

Sorry, I didn't notice that/when the updated lucid package got accepted into -proposed. I ran the regression tests again, and they are all happy.

tags: added: verification-done
removed: verification-done-precise verification-needed
Revision history for this message
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package postgresql-common - 106ubuntu3

---------------
postgresql-common (106ubuntu3) lucid-proposed; urgency=low

  * pg_ctlcluster: Drop erroneous $result assignment which was introduced in
    the previous version due to a mis-merge.

postgresql-common (106ubuntu2) lucid-proposed; urgency=low

  * pg_ctlcluster: Do not remove the PID file after SIGKILLing the
    postmaster in the "last-ditch effort to shut down" in --force mode. This
    is a potentially dangerous thing to do when trying to start a second
    postmaster in parallel while the first one is still being shut down.
    (see http://archives.postgresql.org/pgsql-general/2012-07/msg00475.php)
    Patch taken from trunk r1181. (LP: #1042556)
 -- Martin Pitt <email address hidden> Tue, 18 Sep 2012 08:05:03 +0200

Changed in postgresql-common (Ubuntu Lucid):
status: In Progress → Fix Released
To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.