Websocketd graceful shutdown support

Bug #1803182 reported by Bill Erickson on 2018-11-13
This bug affects 1 person
Affects: OpenSRF
Importance: Wishlist
Assigned to: Unassigned

Bug Description

OpenSRF 3.1

When stopping websocketd-osrf, the flow of data is immediately broken between the client and the websocketd-osrf instance. If a client is in the middle of a request at shutdown time, the client will be disconnected before the response is delivered. Because of this, it's not possible to gracefully "detach" a server from a load-balanced group (e.g. for maintenance) without potentially disrupting clients.

Contrary to what I originally thought, there does not appear to be a way to make websocketd send a signal and then wait before closing STDIN on the websocketd-osrf instance. (There is, however, a way to add a delay between closing STDIN and sending SIGTERM, which doesn't help us here.)

I propose a graceful shutdown signal, similar to the apache-websockets graceful shutdown signal.

Essentially, we inform the websocketd-osrf back-end instances of a pending websocket shutdown. Once an instance receives the signal, it enters shutdown mode: it continues replying to the client until a gap in the communication opens where no requests or stateful connections are pending, at which point the websocketd-osrf back-end instance disconnects the client and shuts itself down.

At this point, the client will detect the severed websocket connection and open a new connection with another available server.

The key differences between this and the apache2-websockets approach are that it can be done without threads (in the main event loop) and that we will likely have to send the signal ourselves to the back-end processes (via process group?) instead of signaling websocketd directly, which IIUC does not relay signals.
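To make the proposed shutdown mode concrete, here is a minimal single-threaded sketch. It is written in Python for brevity and is not the osrf-websocket-stdio.c code; the Backend class, its counters, and the event names are hypothetical. The signal handler only sets a flag, and the main loop exits once no requests or stateful connections are pending.

```python
import signal


class Backend:
    """Hypothetical stand-in for one websocketd-osrf back-end process."""

    def __init__(self):
        self.shutdown_requested = False
        self.pending_requests = 0      # requests still awaiting a response
        self.stateful_connections = 0  # open stateful sessions
        self.running = True

    def request_shutdown(self, signum=None, frame=None):
        # Signal handler: only record the request; the main loop acts on it.
        self.shutdown_requested = True

    def step(self, event=None):
        """Handle one loop wake-up; event is None when only a signal arrived."""
        if event == "request":
            self.pending_requests += 1
        elif event == "response":
            self.pending_requests -= 1
        elif event == "connect":
            self.stateful_connections += 1
        elif event == "disconnect":
            self.stateful_connections -= 1

        idle = self.pending_requests == 0 and self.stateful_connections == 0
        if self.shutdown_requested and idle:
            # The gap has opened: disconnect the client and exit.
            self.running = False
        return self.running


def install_handler(backend):
    # POSIX only: SIGUSR1 is not available on Windows.
    signal.signal(signal.SIGUSR1, backend.request_shutdown)
```

Because the check runs on every wake-up of the loop, an instance that is already idle when the signal arrives exits right away, while a busy one finishes its in-flight work first.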

Bill Erickson (berick) wrote :

osrf-websocket-stdio.c changes pushed to:

http://git.evergreen-ils.org/?p=working/OpenSRF.git;a=shortlog;h=refs/heads/user/berick/lp1803182-websocketd-graceful-shutdown

This teaches the back-ends to perform a graceful shutdown when receiving a SIGUSR1 signal. I have confirmed that websocketd ignores this signal. I've also confirmed this works as expected:

kill -s USR1 -<websocketd-parent-pid>
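The leading minus sign makes kill target the whole process group rather than a single PID, which presumably works here because the back-ends stay in the websocketd parent's process group. As a rough illustration of the same mechanism (a self-contained sketch, not part of the branch; the CHILD script and signal_group_demo are made up for the demo), Python's os.killpg does the equivalent:

```python
import os
import signal
import subprocess
import sys

# Hypothetical child: exits cleanly on SIGUSR1, standing in for a back-end.
CHILD = r"""
import signal, sys
signal.signal(signal.SIGUSR1, lambda signum, frame: sys.exit(0))
print("ready", flush=True)
signal.pause()
"""


def signal_group_demo():
    # start_new_session=True makes the child the leader of its own group,
    # so signaling the group cannot reach the calling process.
    proc = subprocess.Popen(
        [sys.executable, "-c", CHILD],
        stdout=subprocess.PIPE,
        start_new_session=True,
    )
    try:
        # Wait until the child has installed its handler.
        proc.stdout.readline()
        # Equivalent of: kill -s USR1 -<pgid>
        os.killpg(os.getpgid(proc.pid), signal.SIGUSR1)
        return proc.wait(timeout=10)
    finally:
        proc.stdout.close()
```

A return value of 0 means the child handled SIGUSR1 and exited on its own terms; a negative value would mean it was killed by the signal's default action.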

Next question is whether we can make these changes to the sample systemd service files and/or if we need to add websocketd stop/start support to osrf_control.

Bill Erickson (berick) wrote :

In response to my last question, systemd can also manage the signals for us pretty easily:

$ sudo systemctl kill -s USR1 websocketd-osrf

This sends SIGUSR1 to all websocketd-osrf processes (not just the parent), putting the back-ends in shutdown mode and leaving the main process alive.

Since we have 2 reasonably simple ways to signal a graceful shutdown, I'm adding a pullrequest for 3.1. Suggestions on how/where best to document this appreciated.

Changed in opensrf:
milestone: none → 3.1-beta
tags: added: pullrequest
Changed in opensrf:
assignee: Bill Erickson (berick) → nobody
Bill Erickson (berick) wrote :

Force-pushed back to same branch with improved commit message.

Galen Charlton (gmc) on 2018-12-13
Changed in opensrf:
importance: Undecided → Wishlist
status: New → Confirmed
assignee: nobody → Galen Charlton (gmc)
Galen Charlton (gmc) wrote :

I've successfully tested use of USR1 to gracefully shut down osrf-websocket-stdio backends, but I'm not clear how one goes about cleanly shutting down the websocketd process itself (or more precisely, putting it in a state where it won't accept any more connection requests while the backends are gracefully shutting down). Bill, do you have any insight on that?

Of course, if something like ldirectord or NGINX is being used as a proxy and/or load balancer, I can envision ways to make the load balancer stop directing new connections to a websocketd instance that is being shut down, but I'm curious whether there's any way to make websocketd itself do it.

Bill Erickson (berick) wrote :

Thanks, Galen.

I don't believe there's a way to make websocketd proper reject new connections while leaving the child processes alive (w/ communication channels intact) for graceful shutdown. It requires something to sit out front (ldirectord, etc.) to block new connections.

Galen Charlton (gmc) wrote :

Thanks, Bill. I've pushed this to master for inclusion in 3.1-beta and will draft something suitable for the release notes.

Changed in opensrf:
status: Confirmed → Fix Committed
Galen Charlton (gmc) on 2018-12-20
Changed in opensrf:
assignee: Galen Charlton (gmc) → nobody
Galen Charlton (gmc) on 2019-01-10
Changed in opensrf:
status: Fix Committed → Fix Released