Websockets processes locked at 100% CPU
Affects | Status | Importance | Assigned to | Milestone |
---|---|---|---|---|
OpenSRF | Won't Fix | Undecided | Unassigned | |
Bug Description
OpenSRF 3.0.1
Multiple busy sites are reporting sporadic occurrences of Apache Websockets processes locked at 100% CPU and no longer functioning. The same sites are also reporting segfaults in the websockets processes. So far, we have no reliable steps to reproduce either.
One issue raised in the IRC discussion:
https:/
This could potentially explain both, since the same underlying fault sometimes crashes the process with a segfault and sometimes leaves it running in a broken state.
This led to the discovery of https:/
Testing this fork now. Will record results in a comment.
Changed in opensrf:
status: New → Confirmed
milestone: none → 3.0.2
Changed in opensrf:
milestone: 3.0.2 → none
Test results for https://github.com/jchampio/apache-websocket
1. Install (as root)
cd /tmp
git clone https://github.com/jchampio/apache-websocket
cd apache-websocket
apxs2 -i -a -c mod_websocket.c
2. This code has additional cross-site scripting security features. If you have a test server setup where the browser URL host does not match the Apache hostname, add this to the websockets config in /etc/apache2-websockets/apache2.conf (otherwise you'll get Forbidden errors in the JS console):
WebSocketOriginCheck Off
# Or add a whitelist -- see github for docs
3. During initial testing, I found that I was able to create a thread deadlock under heavy traffic. I traced this to an apparent race condition between the opensrf thread locks and the new apache-websocket read/write threading additions. In essence, the inbound thread was holding the apache-websocket mutex while the outbound thread was holding the osrf-mutex, and each waited on the other. (trans->server->send was blocking in the responder thread.)
It's not unreasonable to suspect this same scenario could have resulted in the segfaults reported in the pre-patched code, since the reader and writer threads are acting in a way that bumped against the new thread safety changes.
I was able to resolve this by limiting the scope of the osrf-mutex thread locking, specifically to unlock the osrf mutex before contention can occur in trans->server->send(). Patch with this change en route.
Note also, for reference, that the osrf websockets thread mutexes are used out of an abundance of caution. It's entirely possible we don't need them at all, given the limited amount of data that's shared between threads. But I'll post the more conservative patch that just fixes the known issue for now.
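For illustration only, here is a minimal sketch of the lock-scope change described above. It is not the actual patch: every name in it (osrf_mutex, ws_mutex, server_send, run_demo) is a hypothetical stand-in, and it assumes the module holds its own mutex both while dispatching inbound data and inside send. The responder copies what it needs under the osrf mutex and releases it before calling the blocking send, so no thread ever holds the osrf mutex while waiting on the module's mutex.

```c
#include <assert.h>
#include <pthread.h>
#include <string.h>

#define ITERATIONS 10000

/* Hypothetical stand-ins, not the real OpenSRF / apache-websocket symbols. */
static pthread_mutex_t osrf_mutex = PTHREAD_MUTEX_INITIALIZER; /* the "osrf-mutex" */
static pthread_mutex_t ws_mutex = PTHREAD_MUTEX_INITIALIZER;   /* module-internal mutex (assumed) */

static char shared_msg[64] = "response"; /* guarded by osrf_mutex */
static int sent_count = 0;               /* guarded by ws_mutex */

/* Stand-in for trans->server->send(): takes the module's own mutex. */
static void server_send(const char *buf) {
    pthread_mutex_lock(&ws_mutex);
    if (buf[0] != '\0') sent_count++;
    pthread_mutex_unlock(&ws_mutex);
}

/* Outbound (responder) thread: narrow osrf_mutex scope, then send. */
static void *responder(void *arg) {
    (void)arg;
    char local[64];
    for (int i = 0; i < ITERATIONS; i++) {
        pthread_mutex_lock(&osrf_mutex);
        strncpy(local, shared_msg, sizeof(local) - 1);
        local[sizeof(local) - 1] = '\0';
        pthread_mutex_unlock(&osrf_mutex); /* released BEFORE the blocking send */
        server_send(local);                /* only ws_mutex is taken here */
    }
    return NULL;
}

/* Inbound thread: the module holds ws_mutex while the handler takes osrf_mutex. */
static void *inbound(void *arg) {
    (void)arg;
    for (int i = 0; i < ITERATIONS; i++) {
        pthread_mutex_lock(&ws_mutex);
        pthread_mutex_lock(&osrf_mutex);
        shared_msg[0] = 'r'; /* touch the shared state */
        pthread_mutex_unlock(&osrf_mutex);
        pthread_mutex_unlock(&ws_mutex);
    }
    return NULL;
}

/* Runs both threads to completion and returns the number of sends. */
int run_demo(void) {
    pthread_t in_thread, out_thread;
    int rc;
    sent_count = 0;
    rc = pthread_create(&in_thread, NULL, inbound, NULL);
    assert(rc == 0);
    rc = pthread_create(&out_thread, NULL, responder, NULL);
    assert(rc == 0);
    pthread_join(in_thread, NULL);
    pthread_join(out_thread, NULL);
    return sent_count;
}
```

If server_send() were instead called while osrf_mutex was still held, the two threads would acquire the mutexes in opposite orders, which is the deadlock pattern described in step 3. To try the sketch, call run_demo() from a small main() and build with something like cc -pthread.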