Comment 12 for bug 1154876

Marc Hasson (mhassonsuspect) wrote :

My testing on the 3.9 kernel has been underway since the note above. It has now surpassed 11 days of running the loads from the attached scripts, and even higher loads. The previous 3.2 and 3.5 kernel tests never exceeded 4.5 days before hanging solidly, and usually lasted less.

So, at the very least, the 3.9 kernel appears to be considerably more robust, since I could not cause it to hang solidly as I could in my 2.6 and 3.2/3.5 kernel testbeds. It would therefore be good to see 3.9 backported to Precise for supported usage on our deployed 12.04 systems. I will also write another bug for the 2.6 systems, which are suffering the most, so that perhaps something can be done there as well. BUT...

... I could not tag this bug as either "kernel-bug-exists-upstream" or "kernel-fixed-upstream". While the "solid hang/failure" symptom *is* fixed in the upstream kernel, we *still* experienced the same hangs, but lasting only 5-10 minutes per event, through at least the latter half of the 3.9 kernel testing. I had no way to measure these hangs other than my own observations at my testing consoles; my impression is that they occurred a couple of times a day. I first noticed them a few days into the test, and cannot say for sure whether they were there from the beginning or not.

To most network operations folks, 5-10 minutes of outage from our servers would look the same as a permanent solid hang. One can't have customers twiddling their thumbs for that long when engaged in transactions of some kind. I believe these "transient" hangs were also seen in my 3.2/3.5 testing, but I didn't time them since I was most concerned about the solid hang/failure.

When any of the kernels, including this 3.9 test, hangs like this, I can see that all CPUs are 100% busy. I presume it's the same symptom I reported earlier as part of my kdb session attachment: the constant rescheduling of all processes for another page. But I did not break in with kdb to confirm that in this round of testing, since I didn't want to risk disrupting the longer-term survival testing that was my primary goal. I can confirm that pings were still answered during these hangs and that the serial console remained unresponsive for the 5-10 minutes of each hang.
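
For timing transient hangs like these without watching a console, one possible approach is a heartbeat loop that appends a timestamp to a log every second; a hang then shows up afterwards as a gap between consecutive timestamps. The following is only a minimal sketch and was not part of the testing described above. The names `heartbeat` and `find_stalls`, the log path, the 1-second interval and the 60-second threshold are all hypothetical choices, and it assumes the heartbeat process is starved along with everything else during a hang.

```python
import time


def heartbeat(path, interval=1.0):
    """Append a wall-clock timestamp to `path` every `interval` seconds.

    Runs until interrupted. If the machine stalls, this loop stalls too,
    which leaves a gap in the log. (Hypothetical helper, not from the
    original test scripts.)
    """
    with open(path, "a") as log:
        while True:
            log.write(f"{time.time():.3f}\n")
            log.flush()
            time.sleep(interval)


def find_stalls(timestamps, interval=1.0, threshold=60.0):
    """Return (last_timestamp_before_gap, excess_seconds) for each gap
    that exceeds the expected interval by at least `threshold` seconds."""
    stalls = []
    for prev, cur in zip(timestamps, timestamps[1:]):
        excess = (cur - prev) - interval
        if excess >= threshold:
            stalls.append((prev, excess))
    return stalls


def load_timestamps(path):
    """Read the heartbeat log back as a list of floats, skipping blank lines."""
    with open(path) as log:
        return [float(line) for line in log if line.strip()]


# Example use (assumed log location):
#   heartbeat("/var/tmp/heartbeat.log")        # run in the background during the test
#   find_stalls(load_timestamps("/var/tmp/heartbeat.log"))
```

Because the sketch records wall-clock time, clock adjustments during the run (for example by NTP) could show up as false gaps or hide real ones, so its results would need to be cross-checked against console observations.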