Comment 10 for bug 669296

Gary Poster (gary) wrote : Re: lpnet11 "critical timeout" to nagios, non responsive

The best conclusions so far are these, excerpted from Robert's notes above.

"""
Thread 2 is deleting itself and is waiting for the thread HEAD_LOCK
to do so. It holds the GIL. .... Thread 2 is the reason other threads are not shutting down. Now, why
is thread 2 waiting on the head_mutex?
"""
"""
t_bootstrap calls DeleteCurrent only when finished with the thread: so
thread 2 also is busy finishing with itself.

So the question is, what could possibly be holding the tstate head_lock.

One possibility is that _PyThreadState_Current is set wrongly.

Nothing is actively restoring 0x81ba540
"""

Robert also reported that there seemed to be a correlation with logrotate. Perhaps he was referring to something similar to https://bugs.launchpad.net/launchpad/+bug/680333 (which was reported after Robert's comment)? As that bug says, we have had problems in this area before, per https://bugs.launchpad.net/launchpad/+bug/481512 . However, if the only evidence connecting logrotate to the hang is timing, that is indirect and circumstantial. More importantly, if we can treat the incident report log as at least a rough indicator of when each hang occurred, the timing is inconsistent. That bug shows log rotation at 01:00 UTC, but hangs have been recorded in the incident log since Dec 1 at these times (24-hour clock, UTC): 11:38, 01:45, 08:30, 08:19, 01:39, 09:06, 11:04, 11:46, 02:25, 12:45, 05:20, 05:20, 23:50, 23:30. If we assume the SMS alerts arrive within a couple of hours before each recorded time, the hangs are oddly clustered in the first 12 hours of the day (UTC), but they do not point to an event at 01:00 as that bug would suggest. Of course, we may have more than one cause of hangs.
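A quick tally of those times makes the clustering concrete (a sketch; it assumes each entry's hour is a fair proxy for when the hang started):

```python
# Hang times recorded in the incident log since Dec 1 (24-hour clock, UTC).
times = ["11:38", "01:45", "08:30", "08:19", "01:39", "09:06", "11:04",
         "11:46", "02:25", "12:45", "05:20", "05:20", "23:50", "23:30"]
hours = [int(t.split(":")[0]) for t in times]
before_noon = sum(1 for h in hours if h < 12)
near_one = sum(1 for h in hours if 1 <= h <= 3)  # within a few hours of 01:00
print(f"{before_noon}/{len(times)} before 12:00, "
      f"{near_one}/{len(times)} within a few hours after 01:00")
# -> 11/14 before 12:00, 3/14 within a few hours after 01:00
```

So most entries fall in the first half of the day, but only a few sit close enough to 01:00 to plausibly implicate the log rotation.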

We asked the LOSAs to use Michael's newer version of pygdb, and to analyze the hanging process both before and after they try to stop it. They have done so at least once, and we have a number of other instances with less complete data, so I'd like to reopen the discussion with the new data.

This is the most complete data we have for a single instance: https://pastebin.canonical.com/40836/ (from https://bugs.launchpad.net/launchpad/+bug/689463).

We have had several other instances recently--one or two every few days, per https://wiki.canonical.com/InformationInfrastructure/OSA/LPIncidentLog . Some of them have debugging output, which may also provide clues, though the data is less complete:

https://pastebin.canonical.com/41445/ (33 threads?!)
https://pastebin.canonical.com/41452/ (debugging didn't analyze properly)
https://pastebin.canonical.com/40529/
https://pastebin.canonical.com/40527/
https://pastebin.canonical.com/40245/
https://pastebin.canonical.com/40206/

I'm going to send this out to Michael, Barry, and Robert for more input, and I invite any other launchpad-dev participants to contribute as well. Then I'll stare at these pastebins some more myself.