OpenSRF Router should be more resilient against dropped connections

Bug #1954519 reported by Jason Boyer
This bug affects 1 person
Affects: OpenSRF
Status: New
Importance: Undecided
Assigned to: Unassigned

Bug Description

Currently, if connectivity to the ejabberd server is lost for any reason (ejabberd crash or restart, network shenanigans, etc.), the OpenSRF routers just disappear. I feel that it would be better for system stability if they instead noted the disruption, discarded all registrations, and tried to reconnect.

In an HA setup this would allow an ejabberd server to be restarted at will without having to restart all routers connected to it. (Though in-flight requests would be dropped on the floor; nothing's perfect.) This would additionally allow reliable router startup at system boot time via systemd or your init system of choice, even if ejabberd is not installed on the local host (meaning the init system can't reliably order the dependent services).

A note about reconnecting: ideally this wouldn't be a hot loop like while (connected == false) { try_again(); }. Instead, the router should wait between attempts, delaying a little longer each time (with a small random stagger), and then cycle back to the shorter delays after a set number of tries. Something like this:

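tries = 0;

while (not_connected) {
  tries++;
  // Wait a little longer on each attempt, plus a small random stagger.
  // Units are either seconds or some number of hundreds of ms.
  sleep((tries * 2) + rand(2));
  connect(); // on success, not_connected becomes false and the loop ends

  // After reconn_cycle_max tries, cycle back to the shortest delay.
  if (tries > reconn_cycle_max) { tries = 0; }
}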

Where reconn_cycle_max is either set in opensrf_core.xml or via #define. An additional setting could determine when to just give up entirely and exit if that's desired.
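
For illustration only, the pseudocode above could be fleshed out in C roughly as follows. This is a sketch, not OpenSRF code: struct backoff, backoff_next_delay(), reconnect_loop(), the try_connect callback, and the max_attempts "give up" parameter are all hypothetical names, and it assumes rand(2) in the pseudocode means a random 0 or 1 and that delays are in seconds.

```c
#include <stdlib.h>   /* rand */
#include <unistd.h>   /* sleep (POSIX) */

/* Hypothetical backoff state; cycle_max plays the role of reconn_cycle_max. */
struct backoff {
    unsigned int tries;      /* attempts made in the current cycle */
    unsigned int cycle_max;  /* once tries exceeds this, start the cycle over */
};

/* Return the delay in seconds before the next attempt: 2, 4, 6, ...
 * and back to 2 once the cycle is exhausted. The jitter is passed in
 * so that the schedule itself stays deterministic and easy to check. */
unsigned int backoff_next_delay(struct backoff *b, unsigned int jitter)
{
    if (b->tries > b->cycle_max) {
        b->tries = 0;        /* cycle back to the faster checks */
    }
    b->tries++;
    return (b->tries * 2) + jitter;
}

/* Keep trying until try_connect() returns nonzero. A max_attempts of 0
 * means never give up; otherwise return 0 after that many failures.
 * try_connect is a stand-in for whatever the router uses to connect. */
int reconnect_loop(int (*try_connect)(void),
                   unsigned int cycle_max,
                   unsigned int max_attempts)
{
    struct backoff b = { 0, cycle_max };
    unsigned int attempts = 0;

    while (max_attempts == 0 || attempts < max_attempts) {
        /* rand() % 2 adds 0 or 1 second of stagger between routers. */
        sleep(backoff_next_delay(&b, (unsigned int)(rand() % 2)));
        attempts++;
        if (try_connect()) {
            return 1;        /* connected */
        }
    }
    return 0;                /* gave up */
}
```

With cycle_max set to 3 and no jitter, the delays would run 2, 4, 6, 8 seconds and then start again at 2, which matches the order of operations in the pseudocode (the counter is only reset after it exceeds the maximum).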

Jason Boyer (jboyer) wrote :

Though I'll note that many, if not most, of these benefits could also come from the init system, an external supervisor process, etc. I'm happy to hear what others are doing, or any opinions.
