Apache processes spin out of control
Affects | Status | Importance | Assigned to | Milestone | |
---|---|---|---|---|---|
OpenSRF | Fix Released | Undecided | Unassigned | | |
Bug Description
Every so often, an Apache process gets stuck and consumes 100% of a CPU. To clear the problem, we cycle the application brick out and restart Apache, killing the process. The problem occurs unpredictably, and multiple processes can get stuck at once, reducing the number of CPUs available on the Apache server until we kill the stuck processes. We reported the problem to ESI in March 2010 but had not filed a bug report until now.
We are using open-ils version 1.6 and opensrf version 1.4. The uname signature of the Apache server is:
opensrf@app1-1:~$ uname -a
Linux app1-1 2.6.24-19-server #1 SMP Wed Aug 20 18:43:06 UTC 2008 x86_64 GNU/Linux
We have done some preliminary diagnosis. First, we ran the Unix command 'lsof' ('list open files') on pid 13510:
# sudo lsof -p 13510
Excerpts from five interesting lines of output:
1. apache2 13510 opensrf 0u IPv4 46580313 TCP app2-1.
2. apache2 13510 opensrf 1u IPv4 46564127 TCP app2-1.
3. apache2 13510 opensrf 9u IPv4 46563750 TCP app2-1.
4. apache2 13510 opensrf 12u IPv4 46563731 TCP app2-1.
5. apache2 13510 opensrf 14r REG 9,0 5754993 483709 /usr/local/
Line 1 shows the system waiting to close a TCP connection between pid 13510 and a workstation at 142.179.54.149. The workstation has sent a FIN packet and is waiting for pid 13510 to finish the closure, which it cannot do because it is stuck in a spin. Line 2 shows a connection from pid 13510 to the memcache service. Lines 3 and 4 show connections to the XMPP service. Line 5 shows an open regular file.
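To illustrate the half-closed state described for line 1, here is a minimal Python sketch. It is not OpenSRF or Apache code; the loopback sockets and the variable names are assumptions made for illustration only. Once the peer sends its FIN, the local side sees end-of-file, but the connection is only fully torn down when the local process calls close(), which a spinning process never gets around to doing.

```python
import select
import socket

# Hypothetical stand-ins: a loopback listener plays the Apache child,
# and a local client plays the workstation.
listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
listener.bind(("127.0.0.1", 0))  # port 0 lets the OS pick a free port
listener.listen(1)

client = socket.create_connection(listener.getsockname())
server_side, _ = listener.accept()

# The workstation closes first, which sends a FIN to the server side.
client.close()

# The server side becomes readable, and reading yields EOF (empty bytes).
readable, _, _ = select.select([server_side], [], [], 5.0)
data = server_side.recv(1024) if readable else None
print("readable:", bool(readable), "data:", data)

# Until close() is called here, the server side of the connection stays
# half-closed (CLOSE_WAIT on Linux), as in the lsof output.
still_open = server_side.fileno() != -1
server_side.close()
listener.close()
```

A healthy process would reach the close() call shortly after reading EOF; a process trapped in a loop elsewhere leaves the descriptor open indefinitely.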
Second, we used 'strace' ('system trace') on a similarly stuck process, pid 7180, on Apache server 192.168.0.170:
# sudo strace -p 7180
The output we get is the following repeating set of lines:
getpeername(12, {sa_family=AF_INET, sin_port=
fcntl(12, F_GETFL) = 0x2 (flags O_RDWR)
fcntl(12, F_SETFL, O_RDWR) = 0
select(16, [12], NULL, NULL, {0, 0}) = 0 (Timeout)
The process repeatedly gets the peer address of the socket on file descriptor 12, which turns out to be connected to TCP port 5222 (xmpp-client); resets the socket's file status flags to O_RDWR; and polls the socket for available data with a zero timeout, so each select call returns immediately.
Although our lsof and strace commands were run on two different stuck processes, the output is the same for every stuck process we have checked. An Apache child process seems to spin because it is continuously polling for data from the XMPP service, which never arrives. This seems to point to a problem in the OpenSRF software.
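The effect of the zero timeout in the strace output can be sketched in Python. This is an illustration under stated assumptions, not the OpenSRF code path: the socket pair stands in for the XMPP connection, and the function names poll_without_blocking and poll_with_timeout are hypothetical. A select call with a {0, 0} timeout returns at once when no data is waiting, so a loop around it burns CPU, whereas a nonzero timeout lets the process sleep between checks.

```python
import select
import socket
import time

# Hypothetical stand-in for the XMPP connection: a connected pair where
# the peer never sends anything.
local, peer = socket.socketpair()


def poll_without_blocking(sock, duration):
    """Poll with a zero timeout, like the strace loop; return the loop count."""
    iterations = 0
    deadline = time.monotonic() + duration
    while time.monotonic() < deadline:
        # Timeout 0 means "check and return immediately".
        readable, _, _ = select.select([sock], [], [], 0)
        if readable:
            break
        iterations += 1
    return iterations


def poll_with_timeout(sock, duration, timeout):
    """Poll with a blocking timeout so the process sleeps between checks."""
    iterations = 0
    deadline = time.monotonic() + duration
    while time.monotonic() < deadline:
        readable, _, _ = select.select([sock], [], [], timeout)
        if readable:
            break
        iterations += 1
    return iterations


busy = poll_without_blocking(local, 0.2)
idle = poll_with_timeout(local, 0.2, 0.05)
print("zero-timeout iterations:", busy)
print("50 ms timeout iterations:", idle)

local.close()
peer.close()
```

The zero-timeout loop runs far more iterations in the same wall-clock interval, which is the pattern that shows up as a process pinned at 100% CPU.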
no longer affects: | evergreen |
We have OpenSRF 1.6.2 installed on Debian Lenny.
I ran strace against one of our wildly CPU-consuming processes and got output almost identical to Sitka's:
getpeername(15, {sa_family=AF_INET, sin_port=htons(5222), sin_addr=inet_addr("192.168.0.170")}, [127712192576356368]) = 0
fcntl(15, F_GETFL) = 0x2 (flags O_RDWR)
fcntl(15, F_SETFL, O_RDWR) = 0
select(16, [15], NULL, NULL, {0, 0}) = 0 (Timeout)