cstore/pcrud/rstore can silently hang during initialization

Bug #702206 reported by Galen Charlton
This bug affects 1 person
Affects: Evergreen
Status: Won't Fix
Importance: Low
Assigned to: Unassigned
Milestone: (none)

Bug Description

Evergreen version: 2.0 (and presumed to affect earlier versions)

oilsExtendIDL(), which is invoked during initialization of the cstore, pcrud, and rstore services, can silently hang if a given table cannot be selected from because it is locked or tied up by another process's hanging transaction. If this occurs during an app server or brick start, the result can be that every service except cstore/pcrud/rstore is running, with zero hint to the sysadmin that cstore failed to initialize and, more importantly, why.

To avoid this problem, I suggest having oilsExtendIDL() set a session statement timeout to something reasonable (like 5 seconds) and either abort the initialization of cstore/pcrud/rstore if an IDL query times out or at least squawk loudly enough that a sysadmin has a better chance of figuring out that she needs to check the database server for stuck transactions. Since currently all clients of oilsExtendIDL() immediately close the database connection after the IDL has been scanned, a session statement timeout for oilsExtendIDL() will not interfere with other queries.
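
A minimal sketch of the suggestion above, not the actual Evergreen code. It assumes the per-table IDL probe is a zero-row SELECT, and it hides the database call behind a hypothetical callback (sql_exec_fn) that, in a real service, would wrap whatever libdbi query call oilsExtendIDL() uses. The names probe_with_timeout, recorder, and record_exec are invented for illustration.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical callback: runs one SQL statement, returns 0 on success. */
typedef int (*sql_exec_fn)(void *ctx, const char *sql);

/*
 * Probe one table with a session statement timeout in force.
 * Returns 0 if the probe succeeded, -1 if any statement failed.
 */
int probe_with_timeout(sql_exec_fn exec, void *ctx,
                       const char *table, int timeout_ms)
{
    char sql[256];
    int n, rc;

    assert(exec != NULL && table != NULL && timeout_ms > 0);

    /* PostgreSQL reads a bare integer here as milliseconds. */
    n = snprintf(sql, sizeof sql, "SET statement_timeout = %d", timeout_ms);
    if (n < 0 || (size_t)n >= sizeof sql || exec(ctx, sql) != 0)
        return -1;

    /* Assumed probe shape: no rows come back, but the statement still
     * has to wait for the table lock, which is where the hang occurs. */
    n = snprintf(sql, sizeof sql, "SELECT * FROM %s WHERE 1=0", table);
    rc = (n < 0 || (size_t)n >= sizeof sql) ? -1 : exec(ctx, sql);

    /* Put the timeout back even when the probe failed. */
    if (exec(ctx, "RESET statement_timeout") != 0)
        return -1;

    return rc == 0 ? 0 : -1;
}

/* Stand-in executor for trying the sketch without a database: records
 * each statement and fails any statement that mentions "locked_tbl". */
struct recorder { char log[4][256]; int n; };

int record_exec(void *ctx, const char *sql)
{
    struct recorder *r = ctx;
    if (r->n < 4) {
        strncpy(r->log[r->n], sql, 255);
        r->log[r->n][255] = '\0';
        r->n++;
    }
    return strstr(sql, "locked_tbl") != NULL;
}
```

Because the connection is closed right after the IDL scan, the RESET is belt and braces; it matters only if a future caller keeps the connection open.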

Galen Charlton (gmc)
Changed in evergreen:
importance: Undecided → Low
Mike Rylander (mrylander) wrote :

This seems like a reasonable plan. It will require some investigation of how libdbi handles statements that time out. When one does, we should loudly log the failure and stop trying to do anything -- the cstore-ish backend should exit immediately after logging the failure. Otherwise the backend in question will have incomplete metadata about the database, which will cause failures when it tries to manipulate the timed-out database object.
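
A rough sketch of the fail-fast behaviour described here, under stated assumptions: the per-table probe is passed in as a hypothetical callback (probe_fn), logging goes to stderr rather than the service's real logger, and scan_all_or_fail and fake_probe are invented names. It is not the Evergreen implementation.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical probe: returns 0 if the table's metadata could be read. */
typedef int (*probe_fn)(const char *table);

/*
 * Scan every table. On the first failure, log which table failed and
 * return -1 so the caller can exit instead of running with partial
 * metadata. Tables after the failing one are deliberately not probed.
 */
int scan_all_or_fail(probe_fn probe, const char *const *tables, size_t count)
{
    assert(probe != NULL && tables != NULL);

    for (size_t i = 0; i < count; i++) {
        if (probe(tables[i]) != 0) {
            fprintf(stderr,
                    "FATAL: could not read metadata for table \"%s\"; "
                    "check the database server for stuck transactions\n",
                    tables[i]);
            return -1;
        }
    }
    return 0;
}

/* Stand-in probe for illustration: pretends "stuck" cannot be read. */
int probe_calls = 0;

int fake_probe(const char *table)
{
    probe_calls++;
    return strcmp(table, "stuck") == 0;
}
```

A caller would treat -1 as fatal, for example by calling exit(EXIT_FAILURE) from the service's initialization routine, so the backend never starts serving requests with an incomplete view of the schema.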

Mike Rylander (mrylander) wrote :

Attaching the beginnings of an implementation. This just blindly sets the statement_timeout (and sets it back), but we need to check the error number returned by dbi_conn_error() to see if the timeout was the cause, or if the table/view just didn't exist. To do that, we'll need to actually simulate the failures, as I don't see a normative list of error codes in the libdbi docs, nor a mapping from SQLCODE for the libdbd-pgsql driver.
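
To illustrate the distinction being drawn, here is a small sketch that classifies a failure by PostgreSQL SQLSTATE. It assumes the caller can get hold of the five-character SQLSTATE at all; whether libdbi surfaces it is exactly the open question above (libpq exposes it through PQresultErrorField() with PG_DIAG_SQLSTATE). The enum and function names are invented for illustration.

```c
#include <assert.h>
#include <string.h>

enum probe_failure { FAIL_TIMEOUT, FAIL_MISSING_RELATION, FAIL_OTHER };

/*
 * Map a PostgreSQL SQLSTATE to the cases this bug cares about.
 * 57014 is query_canceled, which PostgreSQL reports for a statement
 * timeout (and also for an explicit cancel request).
 * 42P01 is undefined_table.
 */
enum probe_failure classify_sqlstate(const char *sqlstate)
{
    assert(sqlstate != NULL);

    if (strcmp(sqlstate, "57014") == 0)
        return FAIL_TIMEOUT;
    if (strcmp(sqlstate, "42P01") == 0)
        return FAIL_MISSING_RELATION;
    return FAIL_OTHER;
}
```

Since 57014 also covers cancellations requested by an administrator, a timeout diagnosis based on it alone is a strong hint rather than proof.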

Jason Stephenson (jstephenson) wrote :

Branchified on git://git.evergreen-ils.org/working/Evergreen.git as:

collab/dyrcona/lp702206
collab/dyrcona/lp702206_2_1
collab/dyrcona/lp702206_2_0

tags: added: pullrequest
tags: removed: pullrequest
Jason Stephenson (jstephenson) wrote :

What is the status on this one? Should it remain incomplete or should it be changed to Won't Fix?

Ben Shum (bshum)
no longer affects: evergreen/master
Ben Shum (bshum)
no longer affects: evergreen/2.0
no longer affects: evergreen/2.1
no longer affects: evergreen/2.2
Elaine Hardy (ehardy)
tags: added: silentfailure
Changed in evergreen:
status: Incomplete → Won't Fix