Websocket translator responder thread loops on broken jabber socket

Bug #1746577 reported by Bill Erickson on 2018-01-31

Bug Description

OpenSRF 3.0+

If the Jabber socket is disconnected after a request has been relayed to OpenSRF but before its response has been delivered back to the websocket client (e.g. if Jabber disconnects on max-stanza-size for the response message), the responder thread will loop rapidly and indefinitely, attempting to read data from the broken Jabber socket.
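To illustrate the failure mode in general terms, here is a minimal Python sketch of a reader loop that stops when its upstream socket is closed. It is not the OpenSRF patch (the websocket translator is not written in Python), and the names responder_loop, deliver, and demo are hypothetical. It only shows the general point: a read that reports end-of-stream has to end the loop, otherwise the thread retries immediately and spins.

```python
import socket


def responder_loop(sock, deliver, max_iterations=1000):
    """Relay data from sock to deliver() until the peer disconnects.

    Returns the reason the loop stopped: "disconnected" or "iteration-limit".
    The iteration cap exists only to keep this sketch bounded.
    """
    for _ in range(max_iterations):
        try:
            data = sock.recv(4096)
        except OSError:
            # A hard socket error also means the connection is unusable.
            return "disconnected"
        if not data:
            # recv() returning b"" means the peer closed the connection.
            # Without this check the loop would retry immediately, forever.
            return "disconnected"
        deliver(data)
    return "iteration-limit"


def demo():
    # socketpair() stands in for the translator's upstream connection.
    upstream, responder_side = socket.socketpair()
    received = []
    upstream.sendall(b"<message/>")
    upstream.close()  # simulate the upstream server going away
    reason = responder_loop(responder_side, received.append)
    responder_side.close()
    return reason, b"".join(received)
```

Calling demo() relays the one pending message and then returns once the closed socket is detected, instead of polling it again.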

This is the responder thread analog to bug #1744158.

I was able to manually confirm the bug by adding a 10 second sleep to a Perl API, calling the API via websockets, then stopping ejabberd during the sleep. The result is an Apache websocket process spinning on high CPU, with strace showing futex() call loops.

Patch en route.

Bill Erickson (berick) wrote :

Fix pushed:


Fix confirmed by applying the same test as above and verifying that the gateway logs the new disconnect warning, the client shows a "closing websocket" message in the browser console, and no looping websocket processes persist. (Note that the above test can still lead to other opensrf processes looping, because killing ejabberd can have that effect.)

Changed in opensrf:
milestone: none → 3.0.1
tags: added: pullrequest
Changed in opensrf:
assignee: Bill Erickson (berick) → nobody
Chris Sharp (chrissharp123) wrote :
tags: added: signedoff
Changed in opensrf:
status: New → Confirmed
Changed in opensrf:
assignee: nobody → Jason Stephenson (jstephenson)
Jason Stephenson (jstephenson) wrote :

The fix also works for me. I verified that the apache drones do not sit there spinning when the ejabberd connection is cut. I saw a 50% reduction in server load with the patch applied. I still had a couple of opensrf-c drones spinning, but this branch is not intended to address that.

I've added my signoff and pushed to master and rel_3_0.

Thanks, Bill and Chris (for testing)!

Changed in opensrf:
assignee: Jason Stephenson (jstephenson) → nobody
status: Confirmed → Fix Committed
Galen Charlton (gmc) on 2018-05-07
Changed in opensrf:
importance: Undecided → Medium
Changed in opensrf:
status: Fix Committed → Fix Released