OpenSRF

Websocket translator responder thread loops on broken jabber socket

Bug #1746577 reported by Bill Erickson on 2018-01-31

Affects	Status	Importance	Assigned to	Milestone
OpenSRF	Fix Released	Medium	Unassigned	OpenSRF 3.0.1

Bug Description

OpenSRF 3.0+

If the Jabber socket is disconnected after a request has been relayed to OpenSRF but before its response has been delivered back to the websocket client (e.g. if Jabber disconnects on max-stanza-size for the response message), the responder thread will loop fast and forever, attempting to read data from the broken Jabber socket.

This is the responder thread analog to bug #1744158.
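
For readers unfamiliar with this failure mode, here is a minimal sketch of a read loop that exits when its socket breaks instead of spinning. It is not the OpenSRF translator code or the actual patch. The names responder_loop, deliver, on_disconnect and demo are hypothetical, and a local socket pair stands in for the Jabber connection.

```python
import socket
import threading


def responder_loop(sock, deliver, on_disconnect):
    """Relay data read from sock until the peer goes away (hypothetical helper)."""
    while True:
        try:
            data = sock.recv(4096)
        except OSError:
            # A hard socket error: report it and stop reading.
            on_disconnect("read error")
            return
        if not data:
            # recv() returning b"" means the peer closed the connection.
            # Without this check the loop would call recv() again at once
            # and spin at full CPU on the dead socket.
            on_disconnect("peer closed connection")
            return
        deliver(data)


def demo():
    # socketpair() stands in for the translator's upstream connection.
    upstream, local = socket.socketpair()
    delivered, events = [], []
    worker = threading.Thread(
        target=responder_loop,
        args=(local, delivered.append, events.append),
    )
    worker.start()
    upstream.sendall(b"response")
    upstream.close()  # simulate the upstream side disconnecting
    worker.join(timeout=5)
    local.close()
    return b"".join(delivered), events, worker.is_alive()


if __name__ == "__main__":
    print(demo())
```

The point of the sketch is only that both the error path and the zero-length read must end the loop; how the real gateway detects and logs the disconnect is described in the comments below.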

I was able to manually confirm the bug by adding a 10 second sleep to a Perl API, calling the API via websockets, then stopping ejabberd during the sleep. The result is an Apache websocket process spinning at high CPU, with strace showing a loop of futex() calls.

Patch en route.

Bill Erickson (berick) wrote on 2018-01-31:

Fix pushed:

http://git.evergreen-ils.org/?p=working/OpenSRF.git;a=shortlog;h=refs/heads/user/berick/lp1746577-ws-gateway-broken-socket-responder-loop

Fix confirmed by applying the same test as above and verifying that the gateway logs the new disconnect warning, the client shows a "closing websocket" message in the browser console, and no looping websocket processes persist. (Note that the above test can still leave other OpenSRF processes looping, because killing ejabberd can have that effect.)

Changed in opensrf:
milestone:	none → 3.0.1
tags:	added: pullrequest
Changed in opensrf:
assignee:	Bill Erickson (berick) → nobody

Chris Sharp (chrissharp123) wrote on 2018-02-01:

Works as advertised:

http://git.evergreen-ils.org/?p=working/OpenSRF.git;a=shortlog;h=refs/heads/user/csharp/lp1746577-ws-gateway-broken-socket-responder-loop

tags:	added: signedoff
Changed in opensrf:
status:	New → Confirmed

Jason Stephenson (jstephenson) on 2018-02-01

Changed in opensrf:
assignee:	nobody → Jason Stephenson (jstephenson)

Jason Stephenson (jstephenson) wrote on 2018-02-01:

The fix also works for me. I verified that the Apache drones do not sit there spinning when the ejabberd connection is cut. I saw a 50% reduction in server load with the patch applied. I still had a couple of opensrf-c drones spinning, but this branch is not intended to address that.

I've added my signoff and pushed to master and rel_3_0.

Thanks, Bill and Chris (for testing)!

Changed in opensrf:
assignee:	Jason Stephenson (jstephenson) → nobody
status:	Confirmed → Fix Committed

Galen Charlton (gmc) on 2018-05-07

Changed in opensrf:
importance:	Undecided → Medium

Evergreen Bug Maintenance (bugmaster) on 2018-05-29

Changed in opensrf:
status:	Fix Committed → Fix Released
