OpenSRF

Apache process spin out of control

Bug #690910 reported by Steven Chan on 2010-12-16

This bug affects 1 person

Affects		Status	Importance	Assigned to	Milestone
	OpenSRF	Fix Released	Undecided	Unassigned	OpenSRF 2.1.0-alpha

Bug Description

Every so often, an Apache process will get stuck and consume 100% of a CPU. To clear up the problem, we cycle out the application brick and restart Apache, killing the process. The problem happens unpredictably and multiple processes can be stucked, decreasing the number of CPUs available on the Apache server until we kill the processes to free up the CPUs. We reported the problem to ESI in March 2010 but have not made a bug report until now.

We are using open-ils version 1.6 and opensrf version 1.4. The uname signature of the Apache server is:
opensrf@app1-1:~$ uname -a
Linux app1-1 2.6.24-19-server #1 SMP Wed Aug 20 18:43:06 UTC 2008 x86_64 GNU/Linux

We have done a preliminary amount of diagnosis. First, we used the unix command 'lsof' ('list open files') on pid 13510:
# sudo lsof -p 13510

Excerpts from five interesting lines of output:
1. apache2 13510 opensrf 0u IPv4 46580313 TCP app2-1.sitka.bclibraries.ca:www->s142-179-54-149.bc.hsia.telus.net:3198 (CLOSE_WAIT)
2. apache2 13510 opensrf 1u IPv4 46564127 TCP app2-1.sitka.bclibraries.ca:54767->stanns.sitka.bclibraries.ca:11211 (ESTABLISHED)
3. apache2 13510 opensrf 9u IPv4 46563750 TCP app2-1.sitka.bclibraries.ca:53443->app2-1.sitka.bclibraries.ca:xmpp-client (ESTABLISHED)
4. apache2 13510 opensrf 12u IPv4 46563731 TCP app2-1.sitka.bclibraries.ca:56502->app2-1.sitka.bclibraries.ca:xmpp-client (ESTABLISHED)
5. apache2 13510 opensrf 14r REG 9,0 5754993 483709 /usr/local/share/perl/5.8.8/MARC/Charset/Table

Line 1 shows the system is waiting to close a TCP connection between pid 13510 and a workstation at 142.179.54.149. The workstation has sent a FIN packet and is waiting for pid 13510 to finish the closure, but it cannot since it got into a spin. Line 2 shows a connection from pid 13510 to the memcache service. Lines 3 and 4 show connections to the XMPP service. Line 5 shows an opened regular file.

Second, we used 'strace' ('system trace') on a similarly stuck process, pid 7180, on Apache server 192.168.0.170:
# sudo strace -p 7180

The output we get is the following repeating set of lines:
getpeername(12, {sa_family=AF_INET, sin_port=htons(5222), sin_addr=inet_addr("192.168.0.170")}, [180481603884023824]) = 0
fcntl(12, F_GETFL) = 0x2 (flags O_RDWR)
fcntl(12, F_SETFL, O_RDWR) = 0
select(16, [12], NULL, NULL, {0, 0}) = 0 (Timeout)

The system repeatedly gets the address information of a socket identified as file descriptor 12, which turns out to be connected to TCP port 5222 or xmpp-client; sets the socket to read/write access; and polls the socket for available data.

Although our lsof and strace commands were done on two different stuck processes, the outputs are the same for any stuck processes. An Apache child process seems to spin because it is continuously waiting for data from the XMPP service, which never appears. It seems to point to a troubled spot in the opensrf software.

Tags:

Revision history for this message

Dan Scott (denials) wrote on 2011-02-01:

We have OpenSRF 1.6.2 installed on Debian Lenny.

I ran strace against one of our wildly CPU-consuming processes and generated almost identical output as Sitka:

getpeername(15, {sa_family=AF_INET, sin_port=htons(5222), sin_addr=inet_addr("192.168.0.170")}, [127712192576356368]) = 0
fcntl(15, F_GETFL) = 0x2 (flags O_RDWR)
fcntl(15, F_SETFL, O_RDWR) = 0
select(16, [15], NULL, NULL, {0, 0}) = 0 (Timeout)

Changed in evergreen:
status:	New → Confirmed

Revision history for this message

George Duimovich (george-duimovich) wrote on 2011-02-01:

We get this occasionally too at NRCan -- and just checked and we had 4 x 100 % CPU "stuck" processes that needed to be cleared out.

Revision history for this message

George Duimovich (george-duimovich) wrote on 2011-12-14:

lsof on spun apache pid Edit (15.4 KiB, text/plain)

Seems to me that since upgrading to EG 2.1.1 we've been getting these more and more. Don't seem to find the time to troubleshoot this further but let me know if anybody has ideas.

Revision history for this message

George Duimovich (george-duimovich) wrote on 2011-12-14:

strace of spun apache pid Edit (21.9 KiB, text/plain)

strace of spun apache pid

Revision history for this message

Dan Scott (denials) wrote on 2011-12-19:

FWIW, we still get this problem regularly on our Debian Lenny / Evergreen 2.0.10 production system - and would concur that the problems seem more frequent since upgrading to 2.0.x. At least recently.

Revision history for this message

George Duimovich (george-duimovich) wrote on 2011-12-23:

ho, ho, ho... For us, no more spun processes since this update (increased max_stanza):

http://markmail.org/message/b2mgfb4udqvo2u6a

Prior to this update, couldn't go more than 8-12 hours before seeing some 100%CPU Apache processes spinning away, a problem of increased frequency since upgrade couple of months ago (?? 2.0.8 >> 2.1.1 if I recall).

So fingers crossed, now going on to 48 hours later and still no problems so looks like the fix for our situation. But still, not clear to me why ejabberd (or whatever) might be playing bad at smaller max_stanza settings.

Revision history for this message

George Duimovich (george-duimovich) wrote on 2012-02-01:

FWIW - we've reduced max_stanza and increased debug to assist with related ticket with ESI.

Recently found 3 spun apache processes (100% CPU) and all three had following similar output from "lsof" command

lsof -p 28523
COMMAND PID USER FD TYPE DEVICE SIZE/OFF NODE NAME
apache2 28523 opensrf cwd unknown /proc/28523/cwd (readlink: Permission denied)
apache2 28523 opensrf rtd unknown /proc/28523/root (readlink: Permission denied)
apache2 28523 opensrf txt unknown /proc/28523/exe (readlink: Permission denied)
apache2 28523 opensrf NOFD /proc/28523/fd (opendir: Permission denied)

only thing different with the three outputs was <pid> reference number after the /proc/<pid>/* etc.

Revision history for this message

Dan Scott (denials) wrote on 2012-02-29:

I got really sick of being a slave to the machine and having to manually kill processes that were misbehaving, so I added the following cron job to opensrf's crontab. It will kill apache processes owned by the opensrf user that have a run time of > 5 minutes and a consistent CPU usage of over 85%.

## Kill long-running annoyingly high-CPU Apache processes
*/5 * * * * ps -eo pid,pcpu,time,comm,user --sort=-pcpu | grep apache2 | grep opensrf | perl -ane '($h,$m,$s) = split /:/, $F[2]; if (int($m) > 5 && $F[1] > 85) { print("Killing " . join(" ", $F[0]."\n")); kill 15, $F[0]; }'

Revision history for this message

George Duimovich (george-duimovich) wrote on 2012-02-29:

Thanks Dan -- this is just what we needed to.

Strange, I thought we had this one licked but it's been a persistent one lately. I was thinking maybe one of our customizations was at cause but unlikely (??) that we have the same customization trigger this going on. Major annoyance lately..

Revision history for this message

Bill Erickson (berick) wrote on 2012-02-29:

#10

Here is a patch that I believe addresses (or, at minimum, verifies) the issue. It needs testing in an environment where the problem is likely to occur. Depending on how the client code is calling the OpenSRF libs, though, there is a chance this patch as-is will not resolve the looping, but only log that it is happening. This could result in a spew of log output. Testers be warned.

If the bug is confirmed, the next step will be to determine how best to handle the problem from the Apache side. E.g. do we kill the process and let it respawn (simplest), re-connect to jabber, etc.?

http://git.evergreen-ils.org/?p=working/OpenSRF.git;a=shortlog;h=refs/heads/user/berick/perl-loop-on-dead-socket

Revision history for this message

Dan Scott (denials) wrote on 2012-03-11:

#11

Pulled this into OpenSRF rel_2_1 so that we can test it with the 2.1.0-alpha1 release. Thanks Bill!

I'm thinking we should probably keep this open until such time as we confirm that it does resolve the problem, or that we need to take the next step - which would be, I think, just killing the process; simple is good!

Changed in opensrf:
status:	New → Fix Committed
milestone:	none → 2.1.0-alpha

Revision history for this message

George Duimovich (george-duimovich) wrote on 2012-03-12:

#12

Just to clarify, re: Bill's patch above (no. 10) : does this require recompiling OpenSRF or just drop in and restart?

Revision history for this message

Bill Erickson (berick) wrote on 2012-03-12: Re: [Bug 690910] Re: Apache process spin out of control

#13

On Mon, Mar 12, 2012 at 04:50:41PM -0000, George Duimovich wrote:
> Just to clarify, re: Bill's patch above (no. 10) : does this require
> recompiling OpenSRF or just drop in and restart?

It should be enough to drop the Perl changes in place and restart
Apache.

-b

Revision history for this message

Jason Stephenson (jstephenson) wrote on 2012-05-21:

#14

Since making the change mentioned in this blog post: http://blog.mvlcstaff.org/2012/05/smp-what-is-it-and-why-do-you-care.html

We have not had any problems with apache processes spinning out of control.

Revision history for this message

Jason Stephenson (jstephenson) wrote on 2012-07-18:

#15

There were some OpenSRF issues that were resolved on this front.

I am fairly confident that on Ubuntu (and possibly Debian) using SMP with ejabbered resolves a lot of these issues.

However, I am not certain that this "bug" is totally resolved by the above steps so I am setting to incomplete until someone can confirm that there are still issues with apache performance after making sure that the advice above has been followed.

Changed in opensrf:
status:	Fix Committed → Fix Released
Changed in evergreen:
status:	Confirmed → Incomplete

Revision history for this message

Dan Wells (dbw2) wrote on 2013-03-20:

#16

Download full text (3.7 KiB)

Hello all,

This problem still exists, and has become a much worse problem for us since moving to TPAC (not sure of any actual causation, but just an observation).

I have implemented both Bill's code fix and Jason's ejabberd config change, but the problem still happens. It used to be maybe once a week, now it is multiple times per day.

I have attempted to get a Carp stacktrace using the kill -USR2 technique, and here is the result (apologies for the lack of formatting) using EG 2.3 and OSRF 2.1:

[Wed Mar 20 18:36:12 2013] [error] [client 153.106.144.241] egweb: Context Loader error: caught SIGUSR2! at /etc/apache2/startup.pl line 29\n\tStartUp::__ANON__('USR2')

called at /usr/local/share/perl/5.10.1/OpenSRF/Transport/SlimJabber/XMPPReader.pm line 177\n\teval {...}

called at /usr/local/share/perl/5.10.1/OpenSRF/Transport/SlimJabber/XMPPReader.pm line 177\n\tOpenSRF::Transport::SlimJabber::XMPPReader::set_block('IO::Socket::INET=GLOB(0x7f521399f1f8)')

called at /usr/local/share/perl/5.10.1/OpenSRF/Transport/SlimJabber/XMPPReader.pm line 202\n\tOpenSRF::Transport::SlimJabber::XMPPReader::wait('OpenSRF::Transport::SlimJabber::XMPPReader=HASH(0x7f521399f000)', 0)

called at /usr/local/share/perl/5.10.1/OpenSRF/Transport/SlimJabber/XMPPReader.pm line 257\n\tOpenSRF::Transport::SlimJabber::XMPPReader::wait_msg('OpenSRF::Transport::SlimJabber::XMPPReader=HASH(0x7f521399f000)', 0)

called at /usr/local/share/perl/5.10.1/OpenSRF/Transport/SlimJabber/Client.pm line 199\n\tOpenSRF::Transport::SlimJabber::Client::process('OpenSRF::Transport::PeerHandle=HASH(0x7f521387abd8)', 0)

called at /usr/local/share/perl/5.10.1/OpenSRF/Transport/SlimJabber/PeerConnection.pm line 84\n\tOpenSRF::Transport::SlimJabber::PeerConnection::process('OpenSRF::Transport::PeerHandle=HASH(0x7f521387abd8)', 0)

called at /usr/local/share/perl/5.10.1/OpenSRF/AppSession.pm line 790\n\tOpenSRF::AppSession::queue_wait('OpenSRF::AppSession=HASH(0x7f5214edfa10)', 0)

called at /usr/local/share/perl/5.10.1/OpenSRF/AppSession.pm line 911\n\tOpenSRF::AppRequest::complete('OpenSRF::AppRequest=HASH(0x7f5214c2d968)')

called at /usr/local/share/perl/5.10.1/OpenSRF/MultiSession.pm line 240\n\tOpenSRF::MultiSession::session_reap('OpenSRF::MultiSession=HASH(0x7f5214ded850)')

called at /usr/local/share/perl/5.10.1/OpenSRF/MultiSession.pm line 223\n\tOpenSRF::MultiSession::session_wait('OpenSRF::MultiSession=HASH(0x7f5214ded850)', 1)

called at /usr/local/share/perl/5.10.1/OpenILS/WWW/EGCatLoader/Util.pm line 306\n\tOpenILS::WWW::EGCatLoader::get_records_and_facets('OpenILS::WWW::EGCatLoader=HASH(0x7f5215202a58)', 'ARRAY(0x7f52152a31f0)', 'open-ils.search_9c4a6eb89e374cd0d19abe19d3fb4ac5_facets', 'HASH(0x7f5214def980)')

called at /usr/local/share/perl/5.10.1/OpenILS/WWW/EGCatLoader/Search.pm line 447\n\tOpenILS::WWW::EGCatLoader::load_rresults('OpenILS::WWW::EGCatLoader=HASH(0x7f5215202a58)')

called at /usr/local/share/perl/5.10.1/OpenILS/WWW/EGCatLoader.pm line 127\n\tOpenILS::WWW::EGCatLoader::load('OpenILS::WWW::EGCatLoader=HASH(0x7f5215202a58)')

called at /usr/local/share/perl/5.10.1/OpenILS/WWW/EGWeb.pm line 98\n\teval {...}

called at /usr/local/share/perl/5.10.1/OpenILS/WWW/EGWe...

Hello all,

This problem still exists, and has become a much worse problem for us since moving to TPAC (not sure of any actual causation, but just an observation).

I have implemented both Bill's code fix and Jason's ejabberd config change, but the problem still happens.  It used to be maybe once a week, now it is multiple times per day.