System stops responding
Affects | Status | Importance | Assigned to | Milestone | |
---|---|---|---|---|---|
linux (Ubuntu) | Expired | Medium | Unassigned | | |
Bug Description
Presumably this is a kernel problem, but it has some curious aspects
to it and is being reported here for lack of a better place to turn.
lsb release: Ubuntu 8.04.4 LTS
uname -a: Linux www 2.6.24-26-server #1 SMP Tue Dec 1 19:19:20 UTC 2009 i686
GNU/Linux
Summary: server crash
Current Hardware: Dell PowerEdge 2650
Profile: Web server with vhosted clients, and basic LAMP functionality.
Typical load: less than .20, rarely above .50 (currently .03)
Symptom summary: System fails to fully respond. System is "running", and
answers pings quite normally, but ALL services fail to respond (apache, sshd,
etc.), requiring a reboot to restore "normal" functionality.
Related log data: Nothing, that I could find.
I've run into a troubling situation that has followed me from one hardware
profile to something radically different, with the same nasty results. As
mentioned above this system supports several client web sites. Its main
purpose is Apache/php. Mysql is running on a separate system. ftp is installed
but firewalled and really not used. Mail is only there to relay out mail from
the vhosted web clients. No incoming mail.
What is most troubling is that 2 months ago we moved everything from a
completely different 8.04 system (an IBM x330 server) because of the same
problem, i.e. the system dies mysteriously with no log data, pings normally, nmap
shows all services running, but none of those services respond fully. I had
assumed we had some obscure hardware related problem, and moved all the
clients over to the current system. But something else is going on since the
problem has followed me to the current system, which would rule out faulty
hardware, I would think.
The best I can get from the logs is that the last Apache request was served at
16:40 (looked normal). Syslogd left its ---MARK--- entry in syslog for the last time at
16:56, which is the last entry that I can find in any log, until a reboot at
17:33. Absolutely nothing unusual in syslog, kern.log, or any other log,
during any of this timeframe. Nothing really unusual in any Apache log either
just prior to this.
I have reported a strange php/suhosin related error to the Ubuntu php team,
that is memory related
(https:/
related to this somehow. Possibly something happened there, and it was not
able to be logged. Hard to say.
As another note, I have several systems running 8.04 now with very similar
configurations and these issues have not been a problem (except the previous
incarnation of this particular system).
Remote diagnostics after the problem started at approx 17:10:
$ ping www.example.net
PING www.example.net (212.253.111.163) 56(84) bytes of data.
64 bytes from www.example.net (212.253.111.163): icmp_seq=1 ttl=63 time=4.67 ms
64 bytes from www.example.net (212.253.111.163): icmp_seq=2 ttl=63 time=4.61 ms
64 bytes from www.example.net (212.253.111.163): icmp_seq=3 ttl=63 time=4.39 ms
64 bytes from www.example.net (212.253.111.163): icmp_seq=4 ttl=63 time=3.99 ms
64 bytes from www.example.net (212.253.111.163): icmp_seq=5 ttl=63 time=3.78 ms
64 bytes from www.example.net (212.253.111.163): icmp_seq=6 ttl=63 time=4.77 ms
64 bytes from www.example.net (212.253.111.163): icmp_seq=7 ttl=63 time=4.57 ms
64 bytes from www.example.net (212.253.111.163): icmp_seq=8 ttl=63 time=4.42 ms
^C
--- www.example.net ping statistics ---
8 packets transmitted, 8 received, 0% packet loss, time 7007ms
Starting Nmap 4.76 ( http://
Interesting ports on www.example.net (212.253.111.163):
Not shown: 994 closed ports
PORT STATE SERVICE
21/tcp open ftp
22/tcp open ssh
25/tcp open smtp
80/tcp open http
443/tcp open https
1720/tcp filtered H.323/Q.931
Everything *looks* very normal at this point. But none of those services fully
respond, and I can't open a usable connection. There is not even any indication
of attempted logins despite multiple attempts at new ssh connections. A
pre-existing ssh connection that had been open for weeks was likewise
totally unresponsive. The patient looks alive, but is quite dead.
wget -S www.example.net
--2010-02-03 17:16:08-- http://
Resolving www.example.net... 212.253.111.163
Connecting to www.example.
HTTP request sent, awaiting response... ^C
Hangs at that point. Same with ssh. All other systems in the same rack,
connected to the same switch, are 100% normal at this time too.
This is a remote system to my location so diagnostics had to be run remotely.
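For anyone wanting to repeat the remote checks above in one step, here is a minimal sketch (not something that was run during the incident) that separates "port accepts a TCP connection" from "service actually answers". The function name `probe_service`, the result strings, and the 5 second timeout are assumptions chosen for illustration. The idea is that the kernel can still complete the TCP handshake, so nmap reports the port as open, while the daemon behind it never sends a byte.

```python
import socket


def probe_service(host, port, payload=None, timeout=5.0):
    """Classify a TCP service by whether it sends any data back.

    Returns one of (names are illustrative, not from any standard):
      "unreachable"         - the TCP connection could not be established
      "open_but_silent"     - connected, but no data arrived before the timeout
      "reset"               - connected, then the connection failed while reading
      "closed_without_data" - peer closed the connection without sending anything
      "responding"          - at least one byte came back
    """
    try:
        # create_connection applies the timeout to connect and to later reads.
        sock = socket.create_connection((host, port), timeout=timeout)
    except OSError:
        return "unreachable"
    with sock:
        try:
            if payload:
                # Protocols such as HTTP stay quiet until the client speaks first.
                sock.sendall(payload)
            data = sock.recv(256)
        except socket.timeout:
            return "open_but_silent"
        except OSError:
            return "reset"
    return "responding" if data else "closed_without_data"
```

Services that greet first (ssh, smtp, ftp) can be probed with `probe_service("www.example.net", 22)`. For http, something like `probe_service("www.example.net", 80, payload=b"HEAD / HTTP/1.0\r\n\r\n")` would be needed. In the state described above I would expect "open_but_silent" on every port, though that is a guess and not a measured result.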
Attaching files I forgot about.
As another note, possibly relevant and possibly not: on the previous system, I had observed what seemed to be very weird clock behavior during these episodes where the system "was not responding". For instance, there were Apache log files where it looked like the clock had jumped backwards. And after logging onto the console, repeated 'date' commands showed the clock standing still. Strange. I have not noticed this on the current system (but there has been only one episode and not many diagnostics).
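To check the Apache logs for backward clock jumps without eyeballing them, something along these lines might help. It is only a sketch: the function name `find_backward_jumps` and the 60 second tolerance are assumptions, and it assumes the standard common/combined log format with English month names. Some tolerance is needed because Apache records the time a request was received but writes the line when the request finishes, so small out-of-order entries are normal.

```python
import re
from datetime import datetime, timedelta

# Timestamp field of the common/combined log format,
# e.g. [03/Feb/2010:16:40:12 +0000]
_TIMESTAMP = re.compile(r"\[(\d{2}/\w{3}/\d{4}:\d{2}:\d{2}:\d{2} [+-]\d{4})\]")


def find_backward_jumps(lines, tolerance=timedelta(seconds=60)):
    """Return (line_number, previous_time, current_time) for each entry whose
    timestamp is earlier than the preceding entry by more than `tolerance`."""
    jumps = []
    previous = None
    for line_number, line in enumerate(lines, start=1):
        match = _TIMESTAMP.search(line)
        if not match:
            continue  # skip lines without a recognizable timestamp
        current = datetime.strptime(match.group(1), "%d/%b/%Y:%H:%M:%S %z")
        if previous is not None and previous - current > tolerance:
            jumps.append((line_number, previous, current))
        previous = current
    return jumps
```

It could be run over a log with `find_backward_jumps(open("access.log"))`, where the path is a placeholder. Any hits would only show that the logged time went backwards, not why.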