Bug #1774703 “Websockets processes locked at 100% CPU” : Bugs : OpenSRF

Revision history for this message

Bill Erickson (berick) wrote on 2018-06-01:

#1

Test results for https://github.com/jchampio/apache-websocket

1. Install (as root)

cd /tmp
git clone https://github.com/jchampio/apache-websocket
cd apache-websocket
apxs2 -i -a -c mod_websocket.c

2. This code has additional x-site scripting security features. If you have a test server setup where the browser url host does not match the apache hostname, add this to the websockets config in /etc/apache2-websockets/apache2.conf (otherwise you'll get Forbidden errors in the JS console):

WebSocketOriginCheck Off
# Or add a whitelist -- see github for docs

3. During initial testing, I found that I was able to create a thread contention lock under heavy traffic. I traced this to an apparent race condition between the opensrf thread locks and the new apache-websocket read/write threading additions. In essence, the inbound thread was locking the apache-websocket mutex while the outbound thread was locking the osrf-mutex, while each waited on the other. (trans->server->send was blocking in the responder thread).

It's not unreasonable to suspect this same scenario could have resulted in the segfault's reported in the pre-patched code, since the reader and writer threads are acting in a way that bumped against the new thread safety changes.

I was able to resolve this by limiting the scope of the osrf-mutex thread locking, specifically to unlock the osrf mutex before contention can occur in trans-server-send(). Patch with this change en route.

Note also for reference the osrf websockets thread mutexes are used out of an abundance of caution. It's entirely possible we don't need them at all, given the limited amount of data that's shared between threads. But, I'll post the more conservative patch that just fixes the known issue for now.

Revision history for this message

Bill Erickson (berick) wrote on 2018-06-01:

#2

Patch the fixes the thread contention issues on my test server:

http://git.evergreen-ils.org/?p=working/OpenSRF.git;a=shortlog;h=refs/heads/user/berick/lp1774703-websocket-thread-lock

My hope is this patch paired with the updated code in https://github.com/jchampio/apache-websocket will resolve these issues.

Note the patch will likely not help against the un-patched (current) github apache websockets code.

Jason Stephenson (jstephenson) on 2018-06-02

Changed in opensrf:
status:	New → Confirmed
milestone:	none → 3.0.2

Revision history for this message

Jason Stephenson (jstephenson) wrote on 2018-06-02:

#3

I have applied the above fix to two virtual machines this morning: 1 running master OpenSRF & Evergreen with the concerto data set, and the other running OpenSRF 3.0.1 and Evergreen 3.0.8 with a copy of production data.

All OpenSRF and Evergreen tests continue to succeed on the concerto VM and basic cataloging, circulation, search, and OPAC functions continue to work.

Everything looks OK on the VM with the 3.0 versions and production data, too.

I want to install this in production and run it for a few days before signing off on it. I will add a comment when we have installed it in production, etc.

Revision history for this message

Jason Stephenson (jstephenson) wrote on 2018-06-03:

#4

I've noticed that the origin check includes the port as well, so if you are not using a proxy for websockets, it is probably safer to use the following in your websockets configuration:

WebSocketOriginCheck Trusted
WebSocketTrustedOrigin https://host.domain.tld

Where host, domain, and tld are naturally replaced with your own host, domain, and top-level domain names.

Using WebSocketOriginCheck Off potentially leaves your users open to Cross Site Websocket Hijacking:
http://www.christian-schneider.net/CrossSiteWebSocketHijacking.html

On a test vm local to your workstation or behind a firewall that restricts access, this may be no bit deal.

Revision history for this message

Jeff Davis (jdavis-sitka) wrote on 2018-06-04:

#5

I'm testing this too. It works for me so far in a test environment. I'll put it in production tomorrow and we'll see if it prevents our spinning websockets processes.

Revision history for this message

Jason Stephenson (jstephenson) wrote on 2018-06-07:

#6

We put this patch and the jchampio websockets branch in production last night.

We have 3 spinning apache2-websockets processes today, but the strace is very different from before:

munmap(0x7f65e530b000, 181358592) = 0
mmap(NULL, 181362688, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f65e530a000
munmap(0x7f65f530b000, 181358592) = 0
mmap(NULL, 181362688, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f65f530a000
munmap(0x7f65e530a000, 181362688) = 0
mmap(NULL, 181366784, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f65e5309000
munmap(0x7f65f530a000, 181362688) = 0
mmap(NULL, 181366784, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f65f5309000
munmap(0x7f65e5309000, 181366784) = 0
mmap(NULL, 181370880, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f65e5308000
^Cstrace: Process 15654 detached
% time seconds usecs/call calls errors syscall
------ ----------- ----------- --------- --------- ----------------
100.00 0.109640 13705 8 munmap
0.00 0.000000 0 8 mmap
0.00 0.000000 0 1 brk
------ ----------- ----------- --------- --------- ----------------
100.00 0.109640 17 total

Revision history for this message

Jason Stephenson (jstephenson) wrote on 2018-06-07:

#7

Download full text (5.5 KiB)

I have another one doing this in the main process:

strace: Process 15800 attached [76/116]
fcntl(9, F_GETFL) = 0x2 (flags O_RDWR)
fcntl(9, F_SETFL, O_RDWR|O_NONBLOCK) = 0
read(9, 0x7fae6fe93048, 8000) = -1 EAGAIN (Resource temporarily unava$
lable)
fcntl(9, F_GETFL) = 0x802 (flags O_RDWR|O_NONBLOCK)
fcntl(9, F_SETFL, O_RDWR) = 0
epoll_wait(12, [{EPOLLIN, {u32=1877571296, u64=140387178608352}}], 2, -1) = 1
fcntl(9, F_GETFL) = 0x2 (flags O_RDWR)
fcntl(9, F_SETFL, O_RDWR|O_NONBLOCK) = 0
read(9, "\27\3\3\0020\363vf\230\30\257K\302k\33\260\214\366d\265#\355\344|\304\$
57\21<\314\257\201p"..., 8000) = 565
fcntl(9, F_GETFL) = 0x802 (flags O_RDWR|O_NONBLOCK)
fcntl(9, F_SETFL, O_RDWR) = 0
sendto(10, "<142>Jun 7 13:36:52 osrf_websoc"..., 385, MSG_NOSIGNAL, NULL, 0) =
385
sendto(10, "<182>Jun 7 13:36:52 osrf_websoc"..., 386, MSG_NOSIGNAL, NULL, 0) =
386
futex(0x7fae6c4396a8, FUTEX_WAKE_PRIVATE, 2147483647) = 0
rt_sigaction(SIGPIPE, {SIG_IGN, [PIPE], SA_RESTORER|SA_RESTART, 0x7fae6f0904b0},
{SIG_IGN, [PIPE], SA_RESTORER|SA_RESTART, 0x7fae6f0904b0}, 8) = 0
sendto(11, "<message to=\"<email address hidden>/o"..., 724, 0, NULL, 0) = 724
epoll_wait(12, [], 2, 0) = 0
fcntl(9, F_GETFL) = 0x2 (flags O_RDWR) [53/116]
fcntl(9, F_SETFL, O_RDWR|O_NONBLOCK) = 0
read(9, "\27\3\3\2P\271:~\0059\275\4\17\276\374\262\301%;\36\344-\272p\330\266n\
230\243d\202o"..., 8000) = 597
fcntl(9, F_GETFL) = 0x802 (flags O_RDWR|O_NONBLOCK)
fcntl(9, F_SETFL, O_RDWR) = 0
sendto(10, "<142>Jun 7 13:36:52 osrf_websoc"..., 422, MSG_NOSIGNAL, NULL, 0) =
422
sendto(10, "<182>Jun 7 13:36:52 osrf_websoc"..., 423, MSG_NOSIGNAL, NULL, 0) =
423
futex(0x7fae6c4396a8, FUTEX_WAKE_PRIVATE, 2147483647) = 0
rt_sigaction(SIGPIPE, {SIG_IGN, [PIPE], SA_RESTORER|SA_RESTART, 0x7fae6f0904b0},
{SIG_IGN, [PIPE], SA_RESTORER|SA_RESTART, 0x7fae6f0904b0}, 8) = 0
sendto(11, "<message to=\"<email address hidden>/o"..., 761, 0, NULL, 0) = 761
epoll_wait(12, [], 2, 0) = 0
fcntl(9, F_GETFL) = 0x2 (flags O_RDWR)
fcntl(9, F_SETFL, O_RDWR|O_NONBLOCK) = 0
read(9, "\27\3\3\1\300@\\\243\200\10`Kc9\353v\363)&\211M\36\2-\7\1%Eg\227\336\34
5"..., 8000) = 1620
fcntl(9, F_GETFL) = 0x802 (flags O_RDWR|O_NONBLOCK)
fcntl(9, F_SETFL, O_RDWR) = 0
sendto(10, "<142>Jun 7 13:36:52 osrf_websoc"..., 275, MSG_NOSIGNAL, NULL, 0) =
275
sendto(10, "<182>Jun 7 13:36:52 osrf_websoc"..., 276, MSG_NOSIGNAL, NUL[30/116]
276
futex(0x7fae6c4396a8, FUTEX_WAKE_PRIVATE, 2147483647) = 0
rt_sigaction(SIGPIPE, {SIG_IGN, [PIPE], SA_RESTORER|SA_RESTART, 0x7fae6f0904b0},
{SIG_IGN, [PIPE], SA_RESTORER|SA_RESTART, 0x7fae6f0904b0}, 8) = 0
sendto(11, "<message to=\"<email address hidden>/o"..., 614, 0, NULL, 0) = 614
epoll_wait(12, [], 2, 0) = 0
sendto(10, "<142>Jun 7 13:36:52 osrf_websoc"..., 218, MSG_NOSIGNAL, NULL, 0) =
218
sendto(10, "<182>Jun 7 13:36:52 osrf_websoc"..., 219, MSG_NOSIGNAL, NULL, 0) =
219
futex(...

I have another one doing this in the main process:

strace: Process 15800 attached                                          [76/116]
fcntl(9, F_GETFL)                       = 0x2 (flags O_RDWR)
fcntl(9, F_SETFL, O_RDWR|O_NONBLOCK)    = 0
read(9, 0x7fae6fe93048, 8000)           = -1 EAGAIN (Resource temporarily unava$
lable)
fcntl(9, F_GETFL)                       = 0x802 (flags O_RDWR|O_NONBLOCK)
fcntl(9, F_SETFL, O_RDWR)               = 0
epoll_wait(12, [{EPOLLIN, {u32=1877571296, u64=140387178608352}}], 2, -1) = 1
fcntl(9, F_GETFL)                       = 0x2 (flags O_RDWR)
fcntl(9, F_SETFL, O_RDWR|O_NONBLOCK)    = 0
read(9, "\27\3\3\0020\363vf\230\30\257K\302k\33\260\214\366d\265#\355\344|\304\$
57\21<\314\257\201p"..., 8000) = 565
fcntl(9, F_GETFL)                       = 0x802 (flags O_RDWR|O_NONBLOCK)
fcntl(9, F_SETFL, O_RDWR)               = 0
sendto(10, "<142>Jun  7 13:36:52 osrf_websoc"..., 385, MSG_NOSIGNAL, NULL, 0) = 
385
sendto(10, "<182>Jun  7 13:36:52 osrf_websoc"..., 386, MSG_NOSIGNAL, NULL, 0) = 
386
futex(0x7fae6c4396a8, FUTEX_WAKE_PRIVATE, 2147483647) = 0
rt_sigaction(SIGPIPE, {SIG_IGN, [PIPE], SA_RESTORER|SA_RESTART, 0x7fae6f0904b0},
 {SIG_IGN, [PIPE], SA_RESTORER|SA_RESTART, 0x7fae6f0904b0}, 8) = 0
sendto(11, "<message to=\"router@bh1.public/o"..., 724, 0, NULL, 0) = 724
epoll_wait(12, [], 2, 0)                = 0
fcntl(9, F_GETFL)                       = 0x2 (flags O_RDWR)            [53/116]
fcntl(9, F_SETFL, O_RDWR|O_NONBLOCK)    = 0
read(9, "\27\3\3\2P\271:~\0059\275\4\17\276\374\262\301%;\36\344-\272p\330\266n\
230\243d\202o"..., 8000) = 597
fcntl(9, F_GETFL)                       = 0x802 (flags O_RDWR|O_NONBLOCK)
fcntl(9, F_SETFL, O_RDWR)               = 0
sendto(10, "<142>Jun  7 13:36:52 osrf_websoc"..., 422, MSG_NOSIGNAL, NULL, 0) = 
422
sendto(10, "<182>Jun  7 13:36:52 osrf_websoc"..., 423, MSG_NOSIGNAL, NULL, 0) = 
423
futex(0x7fae6c4396a8, FUTEX_WAKE_PRIVATE, 2147483647) = 0
rt_sigaction(SIGPIPE, {SIG_IGN, [PIPE], SA_RESTORER|SA_RESTART, 0x7fae6f0904b0},
 {SIG_IGN, [PIPE], SA_RESTORER|SA_RESTART, 0x7fae6f0904b0}, 8) = 0
sendto(11, "<message to=\"router@bh1.public/o"..., 761, 0, NULL, 0) = 761
epoll_wait(12, [], 2, 0)                = 0
fcntl(9, F_GETFL)                       = 0x2 (flags O_RDWR)
fcntl(9, F_SETFL, O_RDWR|O_NONBLOCK)    = 0
read(9, "\27\3\3\1\300@\\\243\200\10`Kc9\353v\363)&\211M\36\2-\7\1%Eg\227\336\34
5"..., 8000) = 1620
fcntl(9, F_GETFL)                       = 0x802 (flags O_RDWR|O_NONBLOCK)
fcntl(9, F_SETFL, O_RDWR)               = 0
sendto(10, "<142>Jun  7 13:36:52 osrf_websoc"..., 275, MSG_NOSIGNAL, NULL, 0) = 
275
sendto(10, "<182>Jun  7 13:36:52 osrf_websoc"..., 276, MSG_NOSIGNAL, NUL[30/116]
276
futex(0x7fae6c4396a8, FUTEX_WAKE_PRIVATE, 2147483647) = 0
rt_sigaction(SIGPIPE, {SIG_IGN, [PIPE], SA_RESTORER|SA_RESTART, 0x7fae6f0904b0},
 {SIG_IGN, [PIPE], SA_RESTORER|SA_RESTART, 0x7fae6f0904b0}, 8) = 0
sendto(11, "<message to=\"router@bh1.public/o"..., 614, 0, NULL, 0) = 614
epoll_wait(12, [], 2, 0)                = 0
sendto(10, "<142>Jun  7 13:36:52 osrf_websoc"..., 218, MSG_NOSIGNAL, NULL, 0) = 
218
sendto(10, "<182>Jun  7 13:36:52 osrf_websoc"..., 219, MSG_NOSIGNAL, NULL, 0) = 
219
futex(0x7fae6c4396a8, FUTEX_WAKE_PRIVATE, 2147483647) = 0
rt_sigaction(SIGPIPE, {SIG_IGN, [PIPE], SA_RESTORER|SA_RESTART, 0x7fae6f0904b0},
 {SIG_IGN, [PIPE], SA_RESTORER|SA_RESTART, 0x7fae6f0904b0}, 8) = 0
sendto(11, "<message to=\"router@bh1.public/o"..., 560, 0, NULL, 0) = 560
epoll_wait(12, [], 2, 0)                = 0
sendto(10, "<142>Jun  7 13:36:52 osrf_websoc"..., 199, MSG_NOSIGNAL, NULL, 0) = 
199
sendto(10, "<182>Jun  7 13:36:52 osrf_websoc"..., 200, MSG_NOSIGNAL, NULL, 0) = 
200
futex(0x7fae6c4396a8, FUTEX_WAKE_PRIVATE, 2147483647) = 0
rt_sigaction(SIGPIPE, {SIG_IGN, [PIPE], SA_RESTORER|SA_RESTART, 0x7fae6f0904b0},
 {SIG_IGN, [PIPE], SA_RESTORER|SA_RESTART, 0x7sendto(11, "<message to=\"router@bh1.public/o"..., 539, 0, NULL, 0) = 539[7/116]
epoll_wait(12, [], 2, 0)                = 0
sendto(10, "<142>Jun  7 13:36:52 osrf_websoc"..., 224, MSG_NOSIGNAL, NULL, 0) = 
224
sendto(10, "<182>Jun  7 13:36:52 osrf_websoc"..., 225, MSG_NOSIGNAL, NULL, 0) = 
225
futex(0x7fae6c4396a8, FUTEX_WAKE_PRIVATE, 2147483647) = 0
rt_sigaction(SIGPIPE, {SIG_IGN, [PIPE], SA_RESTORER|SA_RESTART, 0x7fae6f0904b0},
 {SIG_IGN, [PIPE], SA_RESTORER|SA_RESTART, 0x7fae6f0904b0}, 8) = 0
sendto(11, "<message to=\"router@bh1.public/o"..., 563, 0, NULL, 0) = 563
epoll_wait(12, [], 2, 0)                = 0
fcntl(9, F_GETFL)                       = 0x2 (flags O_RDWR)
fcntl(9, F_SETFL, O_RDWR|O_NONBLOCK)    = 0
read(9, 0x7fae6fe93048, 8000)           = -1 EAGAIN (Resource temporarily unavai
lable)
fcntl(9, F_GETFL)                       = 0x802 (flags O_RDWR|O_NONBLOCK)
fcntl(9, F_SETFL, O_RDWR)               = 0
epoll_wait(12, ^Cstrace: Process 15800 detached
 <detached ...>
% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
  0.00    0.000000           0         5         2 read
  0.00    0.000000           0         6           rt_sigaction
  0.00    0.000000           0        18           sendto
  0.00    0.000000           0        20           fcntl
  0.00    0.000000           0         6           futex
  0.00    0.000000           0         7           epoll_wait
------ ----------- ----------- --------- --------- ----------------
100.00    0.000000                    62         2 total

While its first child process reports being in x32 mode and the second is doing the sleep handler.

This looks like maybe something on our end, but I am uncertain.

Revision history for this message

Jason Stephenson (jstephenson) wrote on 2018-06-08:

#8

So, after getting reports of other issues today, bug 1773249 among them, and still seeing stuck apache2 websockets processes, we reverted both changes on our production system tonight. We are back to disconnect's old apache-websocket module and straight up OpenSRF 3.0.1. I suspect some of the issues we encountered will need to be addressed in the websocket module itself and some possibly in OpenSRF or Evergreen.

Revision history for this message

Jeff Davis (jdavis-sitka) wrote on 2018-06-09:

#9

Sitka applied Bill's patch and the jchampio fork two days ago. We're still seeing spinning websockets, but things don't seem any worse than before either.

In our environment, strace on a spinning process consistently shows a stream of mmap/munmap output, as in Jason's comment #6.

Revision history for this message

Jeff Davis (jdavis-sitka) wrote on 2018-07-27:

#10

As a workaround for this issue, Sitka has switched from Apache Websockets to websocketd, using Bill's branch from bug 1777180. It's working well for us: no spinning websockets, segfaults, or other problems that I know of.

Galen Charlton (gmc) on 2018-11-06

Changed in opensrf:
milestone:	3.0.2 → none

Revision history for this message

Galen Charlton (gmc) wrote on 2019-03-01:

#11

Marking this as wont-fix as far as apache2-websockets is concerned, since websocketd has proven to be a viable alternative.

Changed in opensrf:
status:	Confirmed → Won't Fix

OpenSRF

Websockets processes locked at 100% CPU

Bug Description

Other bug subscribers

Remote bug watches