Storage Listener Dies During Fine Generation

Bug #1819798 reported by Jason Stephenson
This bug affects 1 person
Affects: Evergreen
Status: New
Importance: Undecided
Assigned to: Unassigned

Bug Description

Evergreen version: 3.0
OpenSRF Version: 3.0 and 3.1/master
PostgreSQL version: 9.5.16
O/S Version: Ubuntu 16.04 & Ubuntu 18.04

In several cases, I have seen the storage listener die, with drones still running, while using a parallel setting of 6 for the fine generator.

osrf_control --diagnostic says: ERR open-ils.storage Has PID file entry [1625], which matches no running open-ils.storage processes

And pgrep -af storage shows 7 drones running:
21046 OpenSRF Drone [open-ils.storage]
21052 OpenSRF Drone [open-ils.storage]
21059 OpenSRF Drone [open-ils.storage]
21063 OpenSRF Drone [open-ils.storage]
21064 OpenSRF Drone [open-ils.storage]
21070 OpenSRF Drone [open-ils.storage]
21077 OpenSRF Drone [open-ils.storage]

The fine generator itself is still running. pgrep -af fine reports:
21026 /usr/bin/perl /openils/bin/fine_generator.pl /openils/conf/opensrf_core.xml

The lock file is still in place.
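
For what it's worth, the stale PID condition that --diagnostic reports is easy to confirm by hand: read the PID out of the service's PID file and check whether any process with that PID still exists. A minimal Perl sketch of that check follows; the PID file path is an assumption for illustration, not necessarily where OpenSRF keeps it on this system.

    use strict;
    use warnings;

    # Assumed location of the open-ils.storage PID file; adjust for your install.
    my $pid_file = '/openils/var/run/opensrf/open-ils.storage.pid';

    open my $fh, '<', $pid_file or die "Cannot open $pid_file: $!";
    chomp(my $pid = <$fh>);
    close $fh;

    # Signal 0 delivers nothing; it only tests whether the process exists
    # and whether we have permission to signal it.
    if (kill 0, $pid) {
        print "PID file entry [$pid] matches a running process\n";
    } else {
        print "PID file entry [$pid] matches no running process\n";
    }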

My latest occurrence of this happened today after the 5:00 pm run. When the 6:00 pm run started, it reported that the fine generator seemed to already be running, so I had a look.

Looking through the 50 MB of syslog that exists on this server for the 17:00 hour, I see no signs of errors or other problems. I've also checked kern.log for OOM Killer events and found nothing there, either.

However, the Evergreen syslog entries just stop at 17:15:56. A "normal" syslog has double the number of lines, with log entries ending at 19:38:16. In both logged hours, the fine generator was the only Evergreen process running.

This is not the first time this has happened. It was a fairly frequent occurrence after we "improved" our database configuration, until I moved the fine generator and some other cron jobs from the main utility server to a separate VM. I also seem to recall seeing this happen before the database changes, but I don't have good documentation of those earlier events.

Tags: performance
Revision history for this message
Jason Stephenson (jstephenson) wrote :

So, this still happens to us occasionally with Evergreen 3.7.3, but with a twist.

It happened just now, and the only thing running was the storage listener: osrf_control --diagnostic reported the listener running with no drones, and pgrep -af open-ils.storage also showed the listener running but no drones. Trying to restart the service in the normal way failed, as the listener did not respond to the TERM signal. I ended up killing the listener manually, removing its lock file, and doing a --start --service via osrf_control. (Unfortunately, the output of the above commands has exceeded the scrollback buffer in my terminal.)
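
The manual recovery boiled down to: TERM the listener, KILL it if it ignores TERM, remove the stale PID/lock file, then start the service again. A rough Perl sketch of that sequence (the PID and file path below are placeholders, not the actual values from this incident):

    use strict;
    use warnings;

    my $listener_pid = 12345;   # placeholder; use the PID shown by pgrep/--diagnostic
    my $pid_file     = '/openils/var/run/opensrf/open-ils.storage.pid';   # assumed path

    # Ask the listener to exit, then force it if it is still around.
    kill 'TERM', $listener_pid;
    sleep 5;
    kill 'KILL', $listener_pid if kill 0, $listener_pid;

    # Remove the stale PID/lock file so a fresh start is not refused.
    unlink $pid_file if -e $pid_file;

    # Finally restart the service, e.g.:
    #   osrf_control --start --service open-ils.storage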

Revision history for this message
Jason Stephenson (jstephenson) wrote :

This might prove useful. The following messages were found in the log file for this incident:

[2022-12-09 14:01:03] open-ils.storage [ERR :2718415:System.pm:123:] server: died with error Use of freed value in iteration at /usr/lib/x86_64-linux-gnu/perl/5.26/IO/Select.pm line 71.
[2022-12-09 14:01:03] open-ils.storage [INFO:2718415:Server.pm:72:] server: shutting down and cleaning up...
[2022-12-09 14:01:03] open-ils.storage [INFO:2718415:Server.pm:643:] server: disconnecting from router <email address hidden>/router
[2022-12-09 14:01:03] open-ils.storage [INFO:2718415:System.pm:125:] server: restarting after fatal crash...
[2022-12-09 14:01:05] open-ils.storage [INFO:2718415:Server.pm:622:] server: registering with router <email address hidden>/router
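
"Use of freed value in iteration" is the Perl fatal error you get when a foreach loop's list has elements freed out from under it, typically because the array being iterated was modified inside the loop. Here the loop in question is inside IO::Select, so presumably the select set's handle list is being changed while it is being walked, though that is an inference from the message rather than something the logs confirm. A standalone reproducer of the generic pattern (adapted from the perldiag entry for this error, not OpenSRF code):

    use strict;
    use warnings;

    my @a = (3, 4);

    # The foreach list (1, 2, @a) grabs pointers to @a's elements up front.
    # Clearing @a inside the loop frees those elements, and when the
    # iterator reaches one of them Perl dies with
    # "Use of freed value in iteration".
    for my $x (1, 2, @a) {
        @a = ();
    }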
