open-ils.storage does not recognize loss of database connections

Bug #1830968 reported by Galen Charlton on 2019-05-29
This bug affects 1 person
Affects: Evergreen
Importance: Medium
Assigned to: Unassigned

Bug Description

I've observed that an open-ils.storage drone can have its database connection terminated (e.g., via pg_terminate_backend()) without noticing. When that happens, the drone continues to receive and attempt to process requests but, of course, fails to return results.

A better outcome would be for the drone to either attempt to reconnect to the database or to terminate itself.

Evergreen 3.1+

Galen Charlton (gmc) on 2019-05-29
description: updated
Changed in evergreen:
importance: Undecided → Medium
Galen Charlton (gmc) wrote :

Some further research:

- it looks like DBD::Pg does support meaningful values for ->state(), with 'S8006' in particular representing a connection issue
- if we go the route of letting the child terminate, setting force_recycle (see OpenSRF bug 1706147) is an option
- slightly smarter might be letting the drone make one attempt to reconnect, then retrying the method from the beginning if the connection succeeds or terminating the drone otherwise
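
The third option above can be sketched as control flow. This is only an illustration in Python, not Evergreen code (open-ils.storage is Perl). Everything named here is hypothetical: `DatabaseError`, `DroneShouldExit`, `run_with_one_reconnect`, and the `method` and `reconnect` callables. The only value taken from this report is the 'S8006' state, and the sketch assumes the database layer surfaces it on the raised error.

```python
# Hypothetical sketch: one reconnect attempt, then retry or give up.
CONNECTION_FAILURE = "S8006"  # state value mentioned in the comment above


class DatabaseError(Exception):
    """Stand-in for an error raised by the database layer (hypothetical)."""

    def __init__(self, state):
        super().__init__(f"database error (state {state})")
        self.state = state


class DroneShouldExit(Exception):
    """Signals that the drone should terminate instead of taking more requests."""


def run_with_one_reconnect(method, reconnect):
    """Run method(); on a connection failure, reconnect once and retry.

    method    -- zero-argument callable that performs the whole request
    reconnect -- zero-argument callable returning True if reconnecting worked
    """
    try:
        return method()
    except DatabaseError as err:
        if err.state != CONNECTION_FAILURE:
            raise  # not a connection problem; leave it to normal error handling
        if not reconnect():
            raise DroneShouldExit("reconnect failed") from err

    # Reconnected: retry the method from the beginning, exactly once.
    try:
        return method()
    except DatabaseError as retry_err:
        if retry_err.state == CONNECTION_FAILURE:
            raise DroneShouldExit("connection lost again after reconnect") from retry_err
        raise
```

Retrying from the beginning is only safe if the method has not already sent partial results to the client, which would need checking for each storage method.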

Galen Charlton (gmc) wrote :

Also noting that drones of open-ils.cstore and its peers will terminate themselves (and drop the current request) if their database connection goes away.

Galen Charlton (gmc) wrote :

Continuing the monologue, currently it looks like error handling in open-ils.storage is all over the map, with some exceptions being caught and others being allowed to propagate all the way up to OpenSRF::Application. Consequently, it may take some doing to fully implement a solution.
