Storage Listener Dies During Fine Generation

Bug #1819798 reported by Jason Stephenson on 2019-03-13
6
This bug affects 1 person
Affects Status Importance Assigned to Milestone
Evergreen
Undecided
Unassigned

Bug Description

Evergreen version: 3.0
OpenSRF Version: 3.0 and 3.1/master
PostgreSQL version: 9.5.16
O/S Version: Ubuntu 16.04 & Ubuntu 18.04

In several cases, I have seen the storage listener die, with drones still running, while using a parallel setting of 6 for the fine generator.

osrf_control --diagnostic says: ERR open-ils.storage Has PID file entry [1625], which matches no running open-ils.storage processes

And pgrep -af storage shows 7 drones running:
21046 OpenSRF Drone [open-ils.storage]
21052 OpenSRF Drone [open-ils.storage]
21059 OpenSRF Drone [open-ils.storage]
21063 OpenSRF Drone [open-ils.storage]
21064 OpenSRF Drone [open-ils.storage]
21070 OpenSRF Drone [open-ils.storage]
21077 OpenSRF Drone [open-ils.storage]

The fine generator itself is still running. pgrep -af fine reports:
21026 /usr/bin/perl /openils/bin/fine_generator.pl /openils/conf/opensrf_core.xml

The lock file is still in place.

My latest occurrence of this happened today after the 5:00 pm run. When the 6:00 pm run started, it reported that the fine generator seemed to already be running, so I had a look.

Looking through the 50MB of syslog that exist on this one server for the hour of 17:00, I see no signs of errors or other problems. I've checked kern.log for OOM Killer events, and nothing there, either.

However, the Evergreen syslog entries just stop at 17:15:56. A "normal" syslog has double the number of lines with log entries ending at 19:38:16. In the case of both logged hours, the fine generator was the only Evergreen process running.

This is not the first time that this has happened. It was a fairly frequent occurrence after we "improved" our database configuration until I moved the fine generator and some other cron jobs to a separate VM from the main utility server. I also seem to recall seeing this happen in the past, prior to the database changes, but I don't have very good documentation of the past events.

description: updated
To post a comment you must log in.
This report contains Public information  Edit
Everyone can see this information.

Other bug subscribers