Storage Listener Dies During Fine Generation
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
Evergreen | New | Undecided | Unassigned |
Bug Description
Evergreen version: 3.0
OpenSRF Version: 3.0 and 3.1/master
PostgreSQL version: 9.5.16
O/S Version: Ubuntu 16.04 & Ubuntu 18.04
In several cases, I have seen the storage listener die, with drones still running, while using a parallel setting of 6 for the fine generator.
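For reference, the parallel drone count is set in opensrf.xml. A quick way to confirm the active value (the file path and the <parallel>/<fine_generator> element names here follow a stock install's opensrf.xml.example, so verify against your own config):

# Show the parallel settings for open-ils.storage (path and elements assumed):
grep -A 3 '<parallel>' /openils/conf/opensrf.xml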
osrf_control --diagnostic says:
ERR open-ils.storage Has PID file entry [1625], which matches no running open-ils.storage processes
(A manual cross-check of this is sketched below, after the pgrep output.)
And pgrep -af storage shows 7 drones running:
21046 OpenSRF Drone [open-ils.storage]
21052 OpenSRF Drone [open-ils.storage]
21059 OpenSRF Drone [open-ils.storage]
21063 OpenSRF Drone [open-ils.storage]
21064 OpenSRF Drone [open-ils.storage]
21070 OpenSRF Drone [open-ils.storage]
21077 OpenSRF Drone [open-ils.storage]
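To cross-check what --diagnostic reports, one can compare the recorded PID against the live listener process. The PID file location below is an assumption from a stock install (it's governed by the <dirs> section of opensrf_core.xml), and the listener's process title is assumed to mirror the drone form above:

# Compare the PID file entry with the actual listener process
# (PID file path assumed; see <dirs> in opensrf_core.xml):
cat /openils/var/run/open-ils.storage.pid
pgrep -af 'OpenSRF Listener \[open-ils.storage\]'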
The fine generator itself is still running. pgrep -af fine reports:
21026 /usr/bin/perl /openils/
The lock file is still in place.
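The lock file path is whatever the cron entry hands to the fine generator on its command line (if I recall the example crontab correctly), so it varies by install; one way to find it, assuming the job runs from the opensrf user's crontab:

# Locate the fine generator cron entry, which should name its lock file:
crontab -l -u opensrf | grep -i fine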
My latest occurrence of this happened today after the 5:00 pm run. When the 6:00 pm run started, it reported that the fine generator seemed to already be running, so I had a look.
Looking through the 50MB of syslog that exists on this one server for the hour of 17:00, I see no signs of errors or other problems. I've checked kern.log for OOM killer events, and there's nothing there, either.
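For anyone retracing that check, it amounts to scanning the kernel log for OOM killer messages, e.g. with the stock Ubuntu log path:

# Look for OOM killer activity in the kernel log:
grep -iE 'out of memory|oom-killer' /var/log/kern.log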
However, the Evergreen syslog entries just stop at 17:15:56. A "normal" syslog from a comparable hour has double the number of lines, with log entries ending at 19:38:16. During both logged hours, the fine generator was the only Evergreen process running.
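That comparison is just a per-hour tally of log lines; a sketch of the count, assuming the traditional rsyslog timestamp format where the third field is HH:MM:SS:

# Count syslog lines stamped within the 17:00 hour:
awk '$3 ~ /^17:/' /var/log/syslog | wc -l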
This is not the first time that this has happened. It was a fairly frequent occurrence after we "improved" our database configuration, until I moved the fine generator and some other cron jobs off the main utility server to a separate VM. I also seem to recall seeing this happen in the past, prior to the database changes, but I don't have good documentation of those earlier events.
So, this still happens to us occasionally with Evergreen 3.7.3, but with a twist.
It happened just now, and the only thing running was the storage listener: osrf_control --diagnostic reported the listener running with no drones, and pgrep -af open-ils.storage likewise showed the listener but no drones. Trying to restart the service in the normal way failed, as the listener did not respond to the TERM signal. I ended up killing the listener manually, removed its lock file, and did a --start --service via osrf_control. (Unfortunately, the output of the above commands has exceeded the scrollback buffer in my terminal.)
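For the record, the manual recovery boiled down to something like the following. The PID is illustrative, and the PID file path is an assumption (it's set by the <dirs> section of opensrf_core.xml):

# 12345 stands in for the stuck listener's PID:
kill -TERM 12345                            # polite attempt first
kill -KILL 12345                            # escalate only if TERM is ignored
rm /openils/var/run/open-ils.storage.pid    # stale PID/lock file; path assumed
osrf_control --localhost --start --service open-ils.storage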