Websockets processes locked at 100% CPU

Bug #1774703 reported by Bill Erickson on 2018-06-01
6
This bug affects 1 person
Affects Status Importance Assigned to Milestone
OpenSRF
Undecided
Unassigned

Bug Description

OpenSRF 3.0.1

Multiple busy sites reporting sporadic occurrences of Apache Websockets processes locked at 100% and no longer functioning. The same sites are also reporting occurrences of segfaults in the websockets processes. As of yet, we have no reliable steps to reproduce either.

One issue raised in the IRC discussion:

https://github.com/disconnect/apache-websocket/pull/41

This could potentially explain both, since sometimes segfaults crash and sometimes they cause chaos.

This lead to the discovery of https://github.com/jchampio/apache-websocket, which is a fork of the code we use now. This fork includes the patches from the pull request and other fixes.

Testing this fork now. Will record results in comment.

Revision history for this message
Bill Erickson (berick) wrote :

Test results for https://github.com/jchampio/apache-websocket

1. Install (as root)

cd /tmp
git clone https://github.com/jchampio/apache-websocket
cd apache-websocket
apxs2 -i -a -c mod_websocket.c

2. This code has additional x-site scripting security features. If you have a test server setup where the browser url host does not match the apache hostname, add this to the websockets config in /etc/apache2-websockets/apache2.conf (otherwise you'll get Forbidden errors in the JS console):

WebSocketOriginCheck Off
# Or add a whitelist -- see github for docs

3. During initial testing, I found that I was able to create a thread contention lock under heavy traffic. I traced this to an apparent race condition between the opensrf thread locks and the new apache-websocket read/write threading additions. In essence, the inbound thread was locking the apache-websocket mutex while the outbound thread was locking the osrf-mutex, while each waited on the other. (trans->server->send was blocking in the responder thread).

It's not unreasonable to suspect this same scenario could have resulted in the segfault's reported in the pre-patched code, since the reader and writer threads are acting in a way that bumped against the new thread safety changes.

I was able to resolve this by limiting the scope of the osrf-mutex thread locking, specifically to unlock the osrf mutex before contention can occur in trans-server-send(). Patch with this change en route.

Note also for reference the osrf websockets thread mutexes are used out of an abundance of caution. It's entirely possible we don't need them at all, given the limited amount of data that's shared between threads. But, I'll post the more conservative patch that just fixes the known issue for now.

Revision history for this message
Bill Erickson (berick) wrote :

Patch the fixes the thread contention issues on my test server:

http://git.evergreen-ils.org/?p=working/OpenSRF.git;a=shortlog;h=refs/heads/user/berick/lp1774703-websocket-thread-lock

My hope is this patch paired with the updated code in https://github.com/jchampio/apache-websocket will resolve these issues.

Note the patch will likely not help against the un-patched (current) github apache websockets code.

Changed in opensrf:
status: New → Confirmed
milestone: none → 3.0.2
Revision history for this message
Jason Stephenson (jstephenson) wrote :

I have applied the above fix to two virtual machines this morning: 1 running master OpenSRF & Evergreen with the concerto data set, and the other running OpenSRF 3.0.1 and Evergreen 3.0.8 with a copy of production data.

All OpenSRF and Evergreen tests continue to succeed on the concerto VM and basic cataloging, circulation, search, and OPAC functions continue to work.

Everything looks OK on the VM with the 3.0 versions and production data, too.

I want to install this in production and run it for a few days before signing off on it. I will add a comment when we have installed it in production, etc.

Revision history for this message
Jason Stephenson (jstephenson) wrote :

I've noticed that the origin check includes the port as well, so if you are not using a proxy for websockets, it is probably safer to use the following in your websockets configuration:

WebSocketOriginCheck Trusted
WebSocketTrustedOrigin https://host.domain.tld

Where host, domain, and tld are naturally replaced with your own host, domain, and top-level domain names.

Using WebSocketOriginCheck Off potentially leaves your users open to Cross Site Websocket Hijacking:
http://www.christian-schneider.net/CrossSiteWebSocketHijacking.html

On a test vm local to your workstation or behind a firewall that restricts access, this may be no bit deal.

Revision history for this message
Jeff Davis (jdavis-sitka) wrote :

I'm testing this too. It works for me so far in a test environment. I'll put it in production tomorrow and we'll see if it prevents our spinning websockets processes.

Revision history for this message
Jason Stephenson (jstephenson) wrote :

We put this patch and the jchampio websockets branch in production last night.

We have 3 spinning apache2-websockets processes today, but the strace is very different from before:

munmap(0x7f65e530b000, 181358592) = 0
mmap(NULL, 181362688, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f65e530a000
munmap(0x7f65f530b000, 181358592) = 0
mmap(NULL, 181362688, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f65f530a000
munmap(0x7f65e530a000, 181362688) = 0
mmap(NULL, 181366784, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f65e5309000
munmap(0x7f65f530a000, 181362688) = 0
mmap(NULL, 181366784, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f65f5309000
munmap(0x7f65e5309000, 181366784) = 0
mmap(NULL, 181370880, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f65e5308000
^Cstrace: Process 15654 detached
% time seconds usecs/call calls errors syscall
------ ----------- ----------- --------- --------- ----------------
100.00 0.109640 13705 8 munmap
  0.00 0.000000 0 8 mmap
  0.00 0.000000 0 1 brk
------ ----------- ----------- --------- --------- ----------------
100.00 0.109640 17 total

Revision history for this message
Jason Stephenson (jstephenson) wrote :
Download full text (5.5 KiB)

I have another one doing this in the main process:

strace: Process 15800 attached [76/116]
fcntl(9, F_GETFL) = 0x2 (flags O_RDWR)
fcntl(9, F_SETFL, O_RDWR|O_NONBLOCK) = 0
read(9, 0x7fae6fe93048, 8000) = -1 EAGAIN (Resource temporarily unava$
lable)
fcntl(9, F_GETFL) = 0x802 (flags O_RDWR|O_NONBLOCK)
fcntl(9, F_SETFL, O_RDWR) = 0
epoll_wait(12, [{EPOLLIN, {u32=1877571296, u64=140387178608352}}], 2, -1) = 1
fcntl(9, F_GETFL) = 0x2 (flags O_RDWR)
fcntl(9, F_SETFL, O_RDWR|O_NONBLOCK) = 0
read(9, "\27\3\3\0020\363vf\230\30\257K\302k\33\260\214\366d\265#\355\344|\304\$
57\21<\314\257\201p"..., 8000) = 565
fcntl(9, F_GETFL) = 0x802 (flags O_RDWR|O_NONBLOCK)
fcntl(9, F_SETFL, O_RDWR) = 0
sendto(10, "<142>Jun 7 13:36:52 osrf_websoc"..., 385, MSG_NOSIGNAL, NULL, 0) =
385
sendto(10, "<182>Jun 7 13:36:52 osrf_websoc"..., 386, MSG_NOSIGNAL, NULL, 0) =
386
futex(0x7fae6c4396a8, FUTEX_WAKE_PRIVATE, 2147483647) = 0
rt_sigaction(SIGPIPE, {SIG_IGN, [PIPE], SA_RESTORER|SA_RESTART, 0x7fae6f0904b0},
 {SIG_IGN, [PIPE], SA_RESTORER|SA_RESTART, 0x7fae6f0904b0}, 8) = 0
sendto(11, "<message to=\"<email address hidden>/o"..., 724, 0, NULL, 0) = 724
epoll_wait(12, [], 2, 0) = 0
fcntl(9, F_GETFL) = 0x2 (flags O_RDWR) [53/116]
fcntl(9, F_SETFL, O_RDWR|O_NONBLOCK) = 0
read(9, "\27\3\3\2P\271:~\0059\275\4\17\276\374\262\301%;\36\344-\272p\330\266n\
230\243d\202o"..., 8000) = 597
fcntl(9, F_GETFL) = 0x802 (flags O_RDWR|O_NONBLOCK)
fcntl(9, F_SETFL, O_RDWR) = 0
sendto(10, "<142>Jun 7 13:36:52 osrf_websoc"..., 422, MSG_NOSIGNAL, NULL, 0) =
422
sendto(10, "<182>Jun 7 13:36:52 osrf_websoc"..., 423, MSG_NOSIGNAL, NULL, 0) =
423
futex(0x7fae6c4396a8, FUTEX_WAKE_PRIVATE, 2147483647) = 0
rt_sigaction(SIGPIPE, {SIG_IGN, [PIPE], SA_RESTORER|SA_RESTART, 0x7fae6f0904b0},
 {SIG_IGN, [PIPE], SA_RESTORER|SA_RESTART, 0x7fae6f0904b0}, 8) = 0
sendto(11, "<message to=\"<email address hidden>/o"..., 761, 0, NULL, 0) = 761
epoll_wait(12, [], 2, 0) = 0
fcntl(9, F_GETFL) = 0x2 (flags O_RDWR)
fcntl(9, F_SETFL, O_RDWR|O_NONBLOCK) = 0
read(9, "\27\3\3\1\300@\\\243\200\10`Kc9\353v\363)&\211M\36\2-\7\1%Eg\227\336\34
5"..., 8000) = 1620
fcntl(9, F_GETFL) = 0x802 (flags O_RDWR|O_NONBLOCK)
fcntl(9, F_SETFL, O_RDWR) = 0
sendto(10, "<142>Jun 7 13:36:52 osrf_websoc"..., 275, MSG_NOSIGNAL, NULL, 0) =
275
sendto(10, "<182>Jun 7 13:36:52 osrf_websoc"..., 276, MSG_NOSIGNAL, NUL[30/116]
276
futex(0x7fae6c4396a8, FUTEX_WAKE_PRIVATE, 2147483647) = 0
rt_sigaction(SIGPIPE, {SIG_IGN, [PIPE], SA_RESTORER|SA_RESTART, 0x7fae6f0904b0},
 {SIG_IGN, [PIPE], SA_RESTORER|SA_RESTART, 0x7fae6f0904b0}, 8) = 0
sendto(11, "<message to=\"<email address hidden>/o"..., 614, 0, NULL, 0) = 614
epoll_wait(12, [], 2, 0) = 0
sendto(10, "<142>Jun 7 13:36:52 osrf_websoc"..., 218, MSG_NOSIGNAL, NULL, 0) =
218
sendto(10, "<182>Jun 7 13:36:52 osrf_websoc"..., 219, MSG_NOSIGNAL, NULL, 0) =
219
futex(...

Read more...

Revision history for this message
Jason Stephenson (jstephenson) wrote :

So, after getting reports of other issues today, bug 1773249 among them, and still seeing stuck apache2 websockets processes, we reverted both changes on our production system tonight. We are back to disconnect's old apache-websocket module and straight up OpenSRF 3.0.1. I suspect some of the issues we encountered will need to be addressed in the websocket module itself and some possibly in OpenSRF or Evergreen.

Revision history for this message
Jeff Davis (jdavis-sitka) wrote :

Sitka applied Bill's patch and the jchampio fork two days ago. We're still seeing spinning websockets, but things don't seem any worse than before either.

In our environment, strace on a spinning process consistently shows a stream of mmap/munmap output, as in Jason's comment #6.

Revision history for this message
Jeff Davis (jdavis-sitka) wrote :

As a workaround for this issue, Sitka has switched from Apache Websockets to websocketd, using Bill's branch from bug 1777180. It's working well for us: no spinning websockets, segfaults, or other problems that I know of.

Galen Charlton (gmc) on 2018-11-06
Changed in opensrf:
milestone: 3.0.2 → none
Revision history for this message
Galen Charlton (gmc) wrote :

Marking this as wont-fix as far as apache2-websockets is concerned, since websocketd has proven to be a viable alternative.

Changed in opensrf:
status: Confirmed → Won't Fix
To post a comment you must log in.
This report contains Public information  Edit
Everyone can see this information.

Other bug subscribers