App servers die and hang

Bug #669296 reported by Steve McInerney
This bug affects 2 people
Affects: Launchpad itself
Status: Fix Released
Importance: Critical
Assigned to: Unassigned
Milestone: (none)

Bug Description

See https://pastebin.canonical.com/39208/ for a pygdb backtrace.

Had to manually kill it; the default stop action didn't work.

Requested info on next hang:
 - grab a core before manually killing (perhaps you did this time too)
 - gdb 'thread apply all bt' + pygdb's rev52 run on the core.
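
Roughly what the core + gdb request means, as a sketch (the paths, core prefix and interpreter binary below are illustrative, not the real deployment values; running pygdb rev52 over the core is a separate step):

    # Hypothetical helper: snapshot the hung appserver and pull a C-level
    # backtrace of every thread from the core, without killing the process.
    import subprocess

    def capture_hang(pid, prefix="/tmp/appserver"):
        # gcore (ships with gdb) writes <prefix>.<pid> and leaves the process running.
        subprocess.check_call(["gcore", "-o", prefix, str(pid)])
        core = "%s.%d" % (prefix, pid)
        # 'thread apply all bt' dumps each thread's C stack from the core.
        with open(core + ".backtrace", "w") as out:
            subprocess.check_call(
                ["gdb", "--batch", "-ex", "thread apply all bt",
                 "/usr/bin/python2.6",  # assumed interpreter path
                 core],
                stdout=out)
        return core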

Revision history for this message
Gary Poster (gary) wrote :

AFAICT, and probably stating the obvious, every thread except thread 2 was waiting for the GIL. Thread 2 was starting a thread and waiting on a lock to delete the current GIL state pointer. Looks like a Python or OS bug to me.

Revision history for this message
Robert Collins (lifeless) wrote : Re: [Bug 669296] Re: lpnet11 "critical timeout" to nagios, non responsive

\o/ so we need to debug what it is.

j.c.sackett (jcsackett)
affects: launchpad → launchpad-foundations
Revision history for this message
Gary Poster (gary) wrote : Re: lpnet11 "critical timeout" to nagios, non responsive

Triaging as low unless/until incidence rises.

Changed in launchpad-foundations:
status: New → Triaged
importance: Undecided → Low
Revision history for this message
Gary Poster (gary) wrote :

See also bug 669776.

Thread 2 in this pastebin and thread 3 in the other bug's pastebin both seem to be the "odd men out". The other one is not as obviously all about the GIL, but both of the "unusual" threads are calling PyThreadState_DeleteCurrent. Here's a pithy description of what that call is about: http://code.activestate.com/lists/python-list/269073/ . Here are the good bits:

"""
PyThreadState_DeleteCurrent() takes no argument, and deletes the current
thread state. You must hold the GIL while calling it, and the thread
calling it loses its tstate, and loses the GIL, and must never call anything
in the Python C API ever again (unless it acquires another tstate first, but
that's an intended use case). As that description hints <wink>, it was
introduced to plug a Python shutdown race.

A detailed explanation of "the bug" can be found in the comments here:

    http://www.python.org/sf/225673
"""

That's from 2002, but still seems like a good pointer.

Lucid is using Python 2.6.5 locally, and 2.6.6 has been released, but nothing in the NEWS seems to be pertinent to thread issues. http://python.org/download/releases/2.6.6/NEWS.txt

I'll ask Michael Hudson, Barry Warsaw, and Robert to take a look in case they can provide some insight. Other things to look at are changes made before this incident: we've been running Python 2.6 for a few weeks now, so it seems like the hang might be triggered by something else.

If there's one more incident somewhat like this soon, it will push everything else to the back of the queue. I'd rather not drop everything until that happens, though.

Revision history for this message
Michael Hudson-Doyle (mwhudson) wrote :

Again, it looks like this is in interpreter teardown, but I guess the traceback was taken after the stop script had been run?

Revision history for this message
Robert Collins (lifeless) wrote : Re: [Bug 669296] Re: lpnet11 "critical timeout" to nagios, non responsive

So, let's see:
Thread 6 is at /srv/launchpad.net/production/launchpad-rev-11793/eggs/zope.sendmail-3.7.1-py2.6.egg/zope/sendmail/queue.py (155): run
- I thought we had disabled the zope mail queue because it crashed appservers?
- waiting on lock=0x2e67e50 (aka the GIL)

Thread 5 has crashed - the entire thread has gone boom; it's written
the error to a file and is trying to obtain the GIL to return to Python
code. Note that it's in 'excepthook' here.

Thread 4 is in a __del__ handler and we've lost the bottom of the
thread - it may even be approximately empty. It's waiting on the GIL,
and it was handling
"
/srv/launchpad.net/production/launchpad-rev-11793/eggs/storm-0.18-py2.6-linux-x86_64.egg/storm/database.py
(245): close
/srv/launchpad.net/production/launchpad-rev-11793/eggs/storm-0.18-py2.6-linux-x86_64.egg/storm/database.py
(188): __del__
"

call_function is the basic trampoline, AIUI, so it could be any of the
lines in those close methods - but WAG: the postgresql line.

Thread 3 has also blown up, and has been writing to the *exact same file*
as Thread 5: '0x2b119cb631e0'. It's also blown up the entire stack to
the excepthook default handler.

Thread 2 is deleting itself and is waiting for the thread HEAD_LOCK
to do so. It holds the GIL.

Thread 1 is in a __del__ handler, had been inserting into the request
timeline the fact that the database connection is being closed, and is
waiting for the GIL.

Thread 2 is the reason other threads are not shutting down. Now, why
is thread 2 waiting on the head_mutex?

This looks to me like the slow shutdown bug spm filed a few weeks
back, FWIW. The big blowups on threads are when the interpreter is
shutting down, and the hang is due to a deadlock around the
head_mutex.

Revision history for this message
Robert Collins (lifeless) wrote :

> Thread 2 is deleting itself and is waiting for the thread HEAD_LOCK
> to do so. It holds the GIL.

t_bootstrap calls DeleteCurrent only when finished with the thread: so
thread 2 also is busy finishing with itself.

So the question is, what could possibly be holding the tstate head_lock.

One possibility is that _PyThreadState_Current is set wrongly.

Nothing is actively restoring 0x81ba540

description: updated
Revision history for this message
Michael Hudson-Doyle (mwhudson) wrote : Re: lpnet11 "critical timeout" to nagios, non responsive

FWIW, r52 of lp:pygdb should give a more useful traceback if this happens again.

The part that isn't _completely_ clear to me is whether the appserver hung before it was asked to shut down. If it is a hang during shutdown, that's much less surprising to me, for better or worse (interpreter teardown is messy).

Revision history for this message
Robert Collins (lifeless) wrote : Re: [Bug 669296] Re: lpnet11 "critical timeout" to nagios, non responsive

Apparently we have correlation with log rotation, which is (IIRC) done via SIGUSR2 handlers.
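
For context, the rotate hook is (I believe, I haven't checked the exact code) a SIGUSR2 handler that reopens the log files in-process, something along these lines - illustrative only, not the actual Launchpad/Zope code:

    # Rough illustration of a SIGUSR2-driven "rotate" handler: reopen the log
    # files in-process so logrotate can move the old ones aside.
    import logging
    import signal

    def reopen_logs(signum, frame):
        # Runs in the main thread between bytecodes while worker threads keep
        # going, which is why a rotate racing with shutdown is at least plausible.
        for handler in logging.getLogger().handlers:
            if isinstance(handler, logging.FileHandler):
                handler.acquire()
                try:
                    handler.stream.close()
                    handler.stream = open(handler.baseFilename, handler.mode)
                finally:
                    handler.release()

    signal.signal(signal.SIGUSR2, reopen_logs)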

Revision history for this message
Gary Poster (gary) wrote : Re: lpnet11 "critical timeout" to nagios, non responsive

The best conclusions so far are these, excerpted from Robert's notes above.

"""
Thread 2 is deleting itself and is waiting for the thread HEAD_LOCK
to do so. It holds the GIL. .... Thread 2 is the reason other threads are not shutting down. Now, why
is thread 2 waiting on the head_mutex?
"""
"""
t_bootstrap calls DeleteCurrent only when finished with the thread: so
thread 2 also is busy finishing with itself.

So the question is, what could possibly be holding the tstate head_lock.

One possibility is that _PyThreadState_Current is set wrongly.

Nothing is actively restoring 0x81ba540
"""

Robert also reported that there seemed to be correlation with logrotate. Perhaps he is referring to something similar to https://bugs.launchpad.net/launchpad/+bug/680333 (which was reported later than Robert's comment)? As that bug says, we have had problems with this in the past, per https://bugs.launchpad.net/launchpad/+bug/481512 . However, if the only evidence relating the logrotate and the hang is timing, that is indirect and circumstantial.

More importantly, if we can regard the incident report log as at least a rough indicator of the time of occurrence, the timing is inconsistent. That bug shows the log rotation at 1:00 UTC, but hangs have been recorded in the incident log since Dec 1 at these times (24 hour clock, UTC): 11:38, 01:45, 08:30, 08:19, 01:39, 09:06, 11:04, 11:46, 02:25, 12:45, 05:20, 05:20, 23:50, 23:30. If we estimate that the SMS alerts happen within a couple of hours before the recorded time, then the occurrences are oddly clustered in the first 12 hours UTC, but not indicative of an event at 01:00, as that bug would suggest. Of course, perhaps we have more than one cause of hangs.
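
For what it's worth, a quick way to sanity-check that clustering claim (the times are just the list above, shifted back two hours to approximate when the hang happened rather than when it was recorded; the two-hour offset is a guess, not a measurement):

    # Rough check of how the estimated hang times distribute across the day.
    from collections import Counter

    recorded = ["11:38", "01:45", "08:30", "08:19", "01:39", "09:06", "11:04",
                "11:46", "02:25", "12:45", "05:20", "05:20", "23:50", "23:30"]
    # Assume the hang happened up to ~2 hours before the recorded time.
    estimated_hours = [(int(t.split(":")[0]) - 2) % 24 for t in recorded]
    print(sorted(Counter(estimated_hours).items()))
    # The estimates spread from roughly 21:00 through 10:00 UTC rather than
    # piling up around 01:00, so a single 01:00 logrotate trigger looks unlikely.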

We requested that the LOSAs use Michael's newer version of pygdb, and analyze the hanging process both before and after they try to stop it. They have done so at least once, and we have a number of other instances with less complete data, so I'd like to reopen the discussion with the new data.

This is the most complete data for one instance. https://pastebin.canonical.com/40836/ (from https://bugs.launchpad.net/launchpad/+bug/689463)

We have several other instances recently--one or two every few days, per https://wiki.canonical.com/InformationInfrastructure/OSA/LPIncidentLog . Some of them have debugging output. They can also provide information, but the data is less complete:

https://pastebin.canonical.com/41445/ (33 threads?!)
https://pastebin.canonical.com/41452/ (debugging didn't analyze properly)
https://pastebin.canonical.com/40529/
https://pastebin.canonical.com/40527/
https://pastebin.canonical.com/40245/
https://pastebin.canonical.com/40206/

I'm going to send this out to Michael, Barry, and Robert for more input, also inviting any other launchpad-dev participants to contribute. Then I'll stare some more at these pastebins myself.

Changed in launchpad:
importance: Low → Medium
Revision history for this message
Michael Hudson-Doyle (mwhudson) wrote :

The processes all seem to be in interpreter teardown before the LOSAs try and stop them again. I don't know how easy it is to figure out _why_ the interpreter is trying to exit, but it might be an interesting thing to try and find out...

Revision history for this message
Gary Poster (gary) wrote :

Gotcha, thank you Michael.

FWIW, for my own understanding, I summarized all the pastebins in https://pastebin.canonical.com/41547/. They all share many similarities. Something I noted while assembling the data was that Zope starts worker threads as needed, up to a maximum pool size currently configured to 4. Therefore, it is very likely that the "odd" threads that are hung while starting (which exist in every pastebin) are usually supposed to be Zope worker threads. Only in https://pastebin.canonical.com/40245/ does it appear that the badly-started thread might be a zope sendmail thread (by process of elimination from the usual pattern).
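
To illustrate what I mean by "as needed" (a toy sketch only, nothing like Zope's actual implementation):

    # Toy on-demand, bounded worker pool: a thread is only created when work
    # arrives and the pool is below its cap, so a snapshot taken at the wrong
    # moment can easily catch a thread that is still in the middle of starting.
    import queue
    import threading

    class OnDemandPool:
        def __init__(self, max_workers=4):  # 4 matches our configured pool size
            self.max_workers = max_workers
            self.work = queue.Queue()
            self.workers = []
            self.lock = threading.Lock()

        def submit(self, func):
            self.work.put(func)
            with self.lock:
                if len(self.workers) < self.max_workers:
                    worker = threading.Thread(target=self._run, daemon=True)
                    worker.start()
                    self.workers.append(worker)

        def _run(self):
            while True:
                self.work.get()()  # run queued jobs forever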

I saw a couple of notable exceptions to the pattern. First, thread 7 of https://pastebin.canonical.com/41445/ and thread 4 of https://pastebin.canonical.com/40529/ were both interesting because they were doing a Storm teardown after a session database access *and* they were the odd "just starting" thread. Second, https://pastebin.canonical.com/41445/ had 33 threads, rather than the usual 5 or 6. I suspect that's just configuration--the extra threads all looked like Zope server threads. I have a feeling that both of these are red herrings, but they are worth noting.

All that said, I guess I have a few things to pursue ATM.
1) ask Robert to see if there are any other indications of the logrotate signal being correlated with these problems, other than time (which does not appear to correlate in the collection of incidents).
2) consider how we can catch the interpreter exiting. Do atexit calls run before or after __del__s are called for cleanup? I'd guess before. If so, can we think of anything to look at to tell us why we are exiting? Maybe sys.last_* will be set with something helpful? Maybe we can install signal handlers? (See the rough sketch after this list.)
3) I haven't looked at logs yet. I need to do that, for the pertinent times and machines.
4) ...look at Robert's "_PyThreadState_Current is set wrongly" observation and work out how to explore it
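
For (2), I'm imagining instrumentation roughly like the following. The log path and the signal list are placeholders, and a real version would have to chain to the previously installed handlers rather than replace them:

    # Hypothetical sketch: record why the interpreter is exiting.
    import atexit
    import signal
    import sys
    import traceback

    LOG = '/var/tmp/appserver-exit.log'  # placeholder path

    def note(msg):
        with open(LOG, 'a') as f:
            f.write(msg + '\n')

    def log_exit():
        # atexit handlers run during normal interpreter shutdown, before the
        # messy teardown in which the __del__s in these tracebacks fire.
        note('interpreter exiting')
        # sys.last_* is only set once an unhandled exception has reached the
        # top level, so it may well be absent - but it is cheap to check.
        if getattr(sys, 'last_value', None) is not None:
            note('last exception: %r' % (sys.last_value,))

    def log_signal(signum, frame):
        note('received signal %d' % signum)
        note(''.join(traceback.format_stack(frame)))

    atexit.register(log_exit)
    for sig in (signal.SIGTERM, signal.SIGUSR2):
        # NB: this replaces the existing handlers; the real version would save
        # them and re-invoke them (e.g. re-raise SIGTERM so we actually exit).
        signal.signal(sig, log_signal)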

On another note: like Robert, I do also wonder why the zope sendmail queue is hanging around, but I don't think it has anything to do with this issue. I was one of the people who told him that it was not being used, and I did so both from talking with people and from looking at the configuration. It did not cause a hang in the past, though, as Stuart B described it to me; it caused a growing buildup of bad mail that could not be sent and that the Zope code kept retrying. The symptoms were very different--and not occurring now, to my knowledge. I don't believe it has anything to do with this bug.

Revision history for this message
Robert Collins (lifeless) wrote : Re: [Bug 669296] Re: lpnet11 "critical timeout" to nagios, non responsive

The 33-threads case can be explained by a misconfigured appserver I
found while working on the single-threaded-appserver RT; lpnet 10 was
running with 32 workers.

summary: - lpnet11 "critical timeout" to nagios, non responsive
+ App servers die and hang
Revision history for this message
Gary Poster (gary) wrote :

From the incident log, it seems that some app servers are "successfully" dying, and others, like the ones reported here, are trying to die and hanging. We still do not know why they might be trying to die in the first place.

A first look at the logs does not show a traceback near the time of death, but I need to look at more logs.

The Zope queued mail thing is in fact collecting bad mail. Bug 698213. I intend to address that before this one. I don't see how it would be related, but it's the only clearly broken thing in the logs so far, so it might be worth addressing first since we actually have a relatively clear action to perform, unlike this bug.

Changed in launchpad:
importance: Medium → Critical
Revision history for this message
Robert Collins (lifeless) wrote :

We had lpnet6 hang today, and also lost - just AWOL (so this could be a different issue) - soybean_lpnet_1, soybean_lpnet_2, soybean_lpnet_9, soybean_lpnet_10, soybean_edge_1 and wampee_lpnet_15.

This led to an in-DC queue 1000 requests long.

We probably need an alert that requests through the front end work too, but possibly we already have that.
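
Something as simple as this as a nagios-style check would cover it (URL, timeout and thresholds made up for illustration):

    # Illustrative end-to-end check: CRITICAL if a request through the front
    # end does not come back promptly with a 200.
    import sys
    import urllib.request

    URL = "https://launchpad.net/"  # placeholder; the real check would hit the LB
    try:
        with urllib.request.urlopen(URL, timeout=10) as resp:
            ok = (resp.status == 200)
    except Exception:
        ok = False

    if ok:
        print("OK: %s responding" % URL)
        sys.exit(0)
    print("CRITICAL: %s not responding" % URL)
    sys.exit(2)  # nagios CRITICAL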

Revision history for this message
Steve McInerney (spm) wrote :

We've just had a few repeats of this lately.
All on wampee, which is curious.
We had lpnets 88, 91, 92 and 95 all die, needing starting, and lpnet 93 in a non-responsive state needing to be killed and then started.
These all died today, and all shortly after the scheduled logrotation.

We had a similar pattern yesterday:
lpnets 83, 84, 87 and 92 all hung needing to be restarted or killed.
Again at a similar time. And again in a single batch of failures.

Revision history for this message
Robert Collins (lifeless) wrote : Re: [Bug 669296] Re: App servers die and hang

Can we get a core from the next hang, and a timestamp for it, for log
correlation?

Revision history for this message
Robert Collins (lifeless) wrote :

Also, can we get the exact command/thing doing the rotation? If it is that, we'll want to reproduce on a developer's machine, and knowing how we're doing it will help with that.

Revision history for this message
Robert Collins (lifeless) wrote :

spm says 'reload is a /bin/kill -SIGUSR2'
Apparently the common case is a hard death now, no hang; so we either have a deliberate shutdown (which the logs should tell us) or a SIG-death of some sort, but the appservers are not configured to core, AIUI.

So next steps - let's get the logs for an 'it died' server and see if it was deliberate or sudden death; if sudden death, then we need to turn core gathering on.
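
For turning core gathering on, something like this early in appserver startup, plus whatever ulimit/core_pattern changes the init scripts need (the snippet is illustrative, not a recipe):

    # Raise the core file size limit as far as the hard limit allows, so a
    # SIG-death leaves a core behind.  If the hard limit is 0 this won't help;
    # the startup scripts would need 'ulimit -c unlimited' as well.
    import resource

    soft, hard = resource.getrlimit(resource.RLIMIT_CORE)
    resource.setrlimit(resource.RLIMIT_CORE, (hard, hard))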

Revision history for this message
William Grant (wgrant) wrote :

The main issue is fixed. The remaining ~monthly hang is bug #861742.

Changed in launchpad:
status: Triaged → Fix Released
Revision history for this message
Haw Loeung (hloeung) wrote :

This is still happening.
