Launchpad itself

Bug #669296
Comment #12

Gary Poster (gary) wrote on 2011-01-05: Re: lpnet11 "critical timeout" to nagios, non responsive

Gotcha, thank you Michael.

FWIW, for my own understanding, I summarized all the pastebins at https://pastebin.canonical.com/41547/. They share many similarities. Something I noted while assembling the data was that Zope starts worker threads as needed, up to a maximum pool size, currently configured to 4. Therefore, it is very likely that the "odd" threads that are hung while starting (which exist in every pastebin) are usually supposed to be Zope worker threads. Only in https://pastebin.canonical.com/40245/ does it appear that the badly-started thread might be a zope sendmail thread (by process of elimination from the usual pattern).

I saw a couple of notable exceptions from the pattern. First, thread 7 of https://pastebin.canonical.com/41445/ and thread 4 of https://pastebin.canonical.com/40529/ both were interesting because they were doing a storm teardown after a session database access *and* they were an odd "just starting" thread. Second, https://pastebin.canonical.com/41445/ had 33 threads, rather than the usual 5 or 6. I suspect that's just configuration--the extra threads all looked like Zope server threads. I have a feeling that both of these are red herrings, but they are worth noting.

All that said, I guess I have a few things to pursue ATM.
1) ask Robert to see if there are any other indications of the logrotate signal being correlated with these problems, other than time (which does not appear to correlate in the collection of incidents).
2) consider how we can catch the interpreter exiting. Do atexit calls run before or after __del__ methods are called for cleanup? I'd guess before. If so, can we think of anything to look at to tell us why we are exiting? Maybe sys.last_* will be set with something helpful? Maybe we can install signal handlers?
3) I haven't looked at logs yet. I need to do that, for the pertinent times and machines.
4) ...talk to Robert about his "_PyThreadState_Current is set wrongly" observation and work out how to explore it
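
To make item 2 concrete, here is a minimal sketch of what such instrumentation might look like. It is not Launchpad or Zope code: the function names (describe_exit, log_exit, install_signal_logging), the choice of signals, and the use of stderr are all assumptions made for illustration. In particular, which signal logrotate actually sends is not stated above, and sys.last_* is only populated when the interpreter has printed an unhandled exception, so it may well be empty on a clean exit.

```python
import atexit
import os
import signal
import sys
import threading
import traceback


def describe_exit():
    """Summarize what we can see at exit time (hypothetical helper)."""
    # sys.last_type / sys.last_value may not exist at all, hence getattr.
    last_type = getattr(sys, "last_type", None)
    last_value = getattr(sys, "last_value", None)
    names = sorted(t.name for t in threading.enumerate())
    return "interpreter exiting; last exception: %r %r; threads: %s" % (
        last_type, last_value, ", ".join(names))


def log_exit():
    """atexit hook: write the summary to stderr (assumed log destination)."""
    sys.stderr.write(describe_exit() + "\n")
    sys.stderr.flush()


def install_signal_logging(signums=None, record=None):
    """Record each listed signal, then hand it on to whatever was installed.

    Must be called from the main thread. The default signal list is a guess.
    """
    if signums is None:
        signums = (signal.SIGTERM, signal.SIGHUP,
                   signal.SIGUSR1, signal.SIGUSR2)
    if record is None:
        record = lambda item: sys.stderr.write("signal %d at:\n%s\n" % item)
    previous = {}

    def handler(signum, frame):
        # Note where the main thread was when the signal arrived.
        record((signum, "".join(traceback.format_stack(frame))))
        old = previous.get(signum)
        if callable(old):
            old(signum, frame)
        elif old == signal.SIG_DFL:
            # Restore the default action and re-send, so behavior is unchanged.
            signal.signal(signum, signal.SIG_DFL)
            os.kill(os.getpid(), signum)

    for signum in signums:
        previous[signum] = signal.signal(signum, handler)
    return previous


if __name__ == "__main__":
    atexit.register(log_exit)
    install_signal_logging()
```

The signal handler chains to the previously installed handler (or re-raises with the default action) so that adding the logging should not change how the process reacts. An atexit hook will not fire if the process is killed by a signal whose default action terminates it, or if it exits via os._exit, which is itself a useful clue when no exit line appears in the log.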

On another note, while, like Robert, I do also wonder why the zope sendmail queue is hanging around, I don't think it has anything to do with this issue. I was one of the people who told him that it was not being used, and I did so both from talking with people and from looking at the configuration. As Stuart B described it to me, though, it did not cause a hang in the past. Instead, it caused a growing buildup of unsendable mail that the Zope code kept retrying. The symptoms were very different, and to my knowledge they are not occurring now.
