app servers stop responding and need restarting

Bug #861742 reported by Deryck Hodge on 2011-09-28
This bug report is a duplicate of:  Bug #481512: race condition when rotating logs. Edit Remove
12
This bug affects 2 people
Affects Status Importance Assigned to Milestone
Launchpad itself
Critical
Unassigned

Bug Description

There is an incident log entry on 2011-09-27 23:20:14 UTC about an app server that stopped responding and needed restarting. We don't have much, if anything, to go on at this point. This bug is meant to track the issue and gather any data we can when it happens again.

James Troup (elmo) wrote :

backtraces from two hung app servers are here:

 https://pastebin.canonical.com/57609/
 https://pastebin.canonical.com/57610/

They didn't respond to SIGTERM but the backtraces were no different
afterwards.

Robert Collins (lifeless) wrote :
Download full text (3.2 KiB)

The traces both are stuck in logging - looks like another variation of the log rotation hang.

Thread 2
#0 0x00002b4ed1308b50 in ?? () from None
#1 0x0000000003ff2410 in ?? () from None
#2 0x00000000004d4498 in PyThread_acquire_lock (lock=0x3ff2410, waitflag=128) from ../Python/thread_pthread.h
#3 0x00000000004d7932 in lock_PyThread_acquire_lock (self=0x420d108, args=<value optimised out>) from ../Modules/threadmodule.c
#4 0x00000000004a7c5e in call_function () from ../Python/ceval.c
/usr/lib/python2.6/threading.py (117): acquire
/usr/lib/python2.6/logging/__init__.py (621): acquire
/usr/lib/python2.6/logging/__init__.py (669): handle
/usr/lib/python2.6/logging/__init__.py (1206): callHandlers
/usr/lib/python2.6/logging/__init__.py (1174): handle
/usr/lib/python2.6/logging/__init__.py (1152): _log
/usr/lib/python2.6/logging/__init__.py (1047): info
/srv/launchpad.net/production/launchpad-rev-14560/eggs/zc.zservertracelog-1.1.5-py2.6.egg/zc/zservertracelog/tracelog.py (33): _log
/srv/launchpad.net/production/launchpad-rev-14560/eggs/zc.zservertracelog-1.1.5-py2.6.egg/zc/zservertracelog/tracelog.py (95): executeRequest

is acquiring a lock and

Thread 4
#0 0x00002b4ed1308b50 in ?? () from None
#1 0x0000000003ff2410 in ?? () from None
#2 0x00000000004d4498 in PyThread_acquire_lock (lock=0x3ff2410, waitflag=128) from ../Python/thread_pthread.h
#3 0x00000000004d7932 in lock_PyThread_acquire_lock (self=0x420d108, args=<value optimised out>) from ../Modules/threadmodule.c
#4 0x00000000004a7c5e in call_function () from ../Python/ceval.c
/usr/lib/python2.6/threading.py (117): acquire
/usr/lib/python2.6/logging/__init__.py (621): acquire
/srv/launchpad.net/production/launchpad-rev-14560/eggs/ZConfig-2.9.1dev_20110728-py2.6.egg/ZConfig/components/logger/loghandler.py (87): reopen
/srv/launchpad.net/production/launchpad-rev-14560/eggs/ZConfig-2.9.1dev_20110728-py2.6.egg/ZConfig/components/logger/loghandler.py (42): reopenFiles
/srv/launchpad.net/production/launchpad-rev-14560/lib/canonical/launchpad/webapp/sigusr2.py (20): sigusr2_handler
/usr/lib/python2.6/threading.py (117): acquire
/usr/lib/python2.6/logging/__init__.py (621): acquire
/usr/lib/python2.6/logging/__init__.py (669): handle
/usr/lib/python2.6/logging/__init__.py (1206): callHandlers
/usr/lib/python2.6/logging/__init__.py (1174): handle
/usr/lib/python2.6/logging/__init__.py (1152): _log
/usr/lib/python2.6/logging/__init__.py (1047): info
/srv/launchpad.net/production/launchpad-rev-14560/eggs/zc.zservertracelog-1.1.5-py2.6.egg/zc/zservertracelog/tracelog.py (33): _log
/srv/launchpad.net/production/launchpad-rev-14560/eggs/zc.zservertracelog-1.1.5-py2.6.egg/zc/zservertracelog/tracelog.py (71): handle_request
/srv/launchpad.net/production/launchpad-rev-14560/eggs/zope.server-3.6.1-py2.6.egg/zope/server/serverchannelbase.py (119): received
/srv/launchpad.net/production/launchpad-rev-14560/eggs/zope.server-3.6.1-py2.6.egg/zope/server/dualmodechannel.py (85): handle_read

is aquiring the *same* lock, which it looks like it was trying to acquire before the siguse2 handler was called.

So I suspect the signal handler has found a bug with lock reeentrancy.

can we get a more fine grained trace (do...

Read more...

Robert Collins (lifeless) wrote :
Download full text (57.9 KiB)

James got me a more detailed access:

Thread 4 (Thread 26660):
#0 0x00002ae656230b50 in ?? ()
#1 0x0000000002f039d0 in ?? ()
#2 0x00000000004d4498 in PyThread_acquire_lock (lock=0x2f039d0, waitflag=128) at ../Python/thread_pthread.h:349
#3 0x00000000004d7932 in lock_PyThread_acquire_lock (self=0x30c00f0, args=<value optimized out>) at ../Modules/threadmodule.c:46
#4 0x00000000004a7c5e in call_function (f=
    Frame 0x1cb45450, for file /usr/lib/python2.6/threading.py, line 123, in acquire (self=<_RLock(_Verbose__verbose=False, _RLock__owner=None, _RLock__block=<thread.lock at remote 0x30c00f0>, _RLock__count=0) at remote 0x30fa250>, blocking=1, me=47168797054336), throwflag=<value optimized out>) at ../Python/ceval.c:3750
#5 PyEval_EvalFrameEx (f=
    Frame 0x1cb45450, for file /usr/lib/python2.6/threading.py, line 123, in acquire (self=<_RLock(_Verbose__verbose=False, _RLock__owner=None, _RLock__block=<thread.lock at remote 0x30c00f0>, _RLock__count=0) at remote 0x30fa250>, blocking=1, me=47168797054336), throwflag=<value optimized out>) at ../Python/ceval.c:2412
#6 0x00000000004a9671 in PyEval_EvalCodeEx (co=0xc0edc8, globals=<value optimized out>, locals=<value optimized out>, args=0x2, argcount=<value optimized out>, kws=<value optimized out>, kwcount=0, defs=0xc23068, defcount=1, closure=0x0)
    at ../Python/ceval.c:3000
#7 0x00000000004a7809 in fast_function (f=
    Frame 0x142916f0, for file /usr/lib/python2.6/logging/__init__.py, line 622, in acquire (self=<FileHandler(stream=<file at remote 0x1702f150>, level=0, lock=<_RLock(_Verbose__verbose=False, _RLock__owner=None, _RLock__block=<thread.lock at remote 0x30c00f0>, _RLock__count=0) at remote 0x30fa250>, _wr=<weakref at remote 0x30b8c00>, baseFilename='/srv/launchpad.net/production-logs/launchpad-trace125.log', mode='a', filters=[], formatter=<Formatter(datefmt='%Y-%m-%dT%H:%M:%S', _fmt='%(message)s') at remote 0x30f8878>) at remote 0x30f8830>), throwflag=<value optimized out>) at ../Python/ceval.c:3846
#8 call_function (f=
    Frame 0x142916f0, for file /usr/lib/python2.6/logging/__init__.py, line 622, in acquire (self=<FileHandler(stream=<file at remote 0x1702f150>, level=0, lock=<_RLock(_Verbose__verbose=False, _RLock__owner=None, _RLock__block=<thread.lock at remote 0x30c00f0>, _RLock__count=0) at remote 0x30fa250>, _wr=<weakref at remote 0x30b8c00>, baseFilename='/srv/launchpad.net/production-logs/launchpad-trace125.log', mode='a', filters=[], formatter=<Formatter(datefmt='%Y-%m-%dT%H:%M:%S', _fmt='%(message)s') at remote 0x30f8878>) at remote 0x30f8830>), throwflag=<value optimized out>) at ../Python/ceval.c:3771
#9 PyEval_EvalFrameEx (f=
    Frame 0x142916f0, for file /usr/lib/python2.6/logging/__init__.py, line 622, in acquire (self=<FileHandler(stream=<file at remote 0x1702f150>, level=0, lock=<_RLock(_Verbose__verbose=False, _RLock__owner=None, _RLock__block=<thread.lock at remote 0x30c00f0>, _RLock__count=0) at remote 0x30fa250>, _wr=<weakref at remote 0x30b8c00>, baseFilename='/srv/launchpad.net/production-logs/launchpad-trace125.log', mode='a', filters=[], formatter=<Formatter(datefmt='%Y-%m-%dT%H:%M:%S', _fmt='%(message)s') at remote 0x30f8878>) at remote 0x...

Robert Collins (lifeless) wrote :

frame 2 is hung on
status = fix_status(sem_wait(thelock));

The two key python lines
/srv/launchpad.net/production/launchpad-rev-14560/lib/canonical/launchpad/webapp/sigusr2.py (20): sigusr2_handler
/usr/lib/python2.6/threading.py (117): acquire

sigusr2_handler is simple - its just calling into reopenFiles

threading.py line 117 is a call to thread.get_ident() called simply as 'lock.acquire' from logging/__init__.py

however the core thinks we're on line 123:
#22 0x00000000004a1b03 in PyEval_CallObjectWithKeywords (func=<function at remote 0x75ea848>, arg=
    (12, Frame 0x1431c760, for file /usr/lib/python2.6/threading.py, line 123, in acquire (self=<_RLock(_Verbose__verbose=False, _RLock__owner=None, _RLock__block=<thread.lock at remote 0x30c00f0>, _RLock__count=0) at remote 0x30fa250>, blocking=1, me=47168797054336)), kw=0x0) at ../Python/ceval.c:3619
#23 0x00000000004d861f in PyErr_CheckSignals () at ../Modules/signalmodule.c:868
#24 0x00000000004a2037 in Py_MakePendingCalls () at ../Python/ceval.c:452
#25 0x00000000004a337c in PyEval_EvalFrameEx (f=
    Frame 0x1431c760, for file /usr/lib/python2.6/threading.py, line 123, in acquire (self=<_RLock(_Verbose__verbose=False, _RLock__owner=None, _RLock__block=<thread.lock at remote 0x30c00f0>, _RLock__count=0) at remote 0x30fa250>, blocking=1, me=47168797054336), throwflag=<value optimized out>) at ../Python/ceval.c:871

which is much less innocuous - that is a call to the underlying python lock.
Note this *not-threadsafe* code:
        rc = self.__block.acquire(blocking)
        if rc:
            self.__owner = me
            self.__count = 1
            if __debug__:
                self._note("%s.acquire(%s): initial success", self, blocking)
        else:
            if __debug__:
                self._note("%s.acquire(%s): failure", self, blocking)

If the low level acquire succeeds, but the low level primitive is not naturally reentrant, then until the 'if rc' code path executes, other calls to this threads acquire method will deadlock.

I need to check the low level primitive code path still, but this is a good contender for the bug.

Robert Collins (lifeless) wrote :

(oh, and the underlying lock isn't reentrant) -
__block is assigned by a call to _allocate_lock which is thread.allocate_lock

So RLock has been built on a non-reentrant lock primitive, which would be fine if we didn't reenter here.

Robert Collins (lifeless) wrote :

05:30 < lifeless> I will file upstream when I wake again
05:30 < lifeless> tell me, our log rotate stuff, does it matter if it takes a little longer (a couple of thread context switches..)
05:30 < lifeless> the rotate is 'mv foo; signal appserver process' right ?
05:31 < elmo> checking
05:32 < elmo> (but I'd imagine so)
05:32 < elmo> lifeless: yeah, it is - so it doesn't matter at all
05:32 < lifeless> great
05:32 < lifeless> I'll move the reopening out of the signal handler
05:33 < lifeless> set a non-threadsafe flag when signaled
05:33 < lifeless> and approximately the very next log message will read it and close+open
05:33 < lifeless> (and set the flag back to false)
05:34 < lifeless> this will avoid using threading code from the signal handler, introducing a small potential race if you rotate twice the second time within the rotation code for the first
05:35 < lifeless> said race will result in the second rotate being ignored (because it happened while the first rotate was doing its thing). Shrug.

Robert Collins (lifeless) wrote :

small test program to test hangs with signals.

Robert Collins (lifeless) wrote :

First hang:
python2.6 ./hang.py &
[1] 10538
robertc@lifeless-64:/tmp$ 10538
..k.il.l ...-.SIG.U.SR.2.. 1..0.5.3.8..
robertc@lifeless-64:/tmp$ Traceback (most recent call last):
  File "./hang.py", line 15, in <module>
    lock_release()
  File "./hang.py", line 8, in lock_release
    l.release()
  File "/usr/lib/python2.6/threading.py", line 139, in release
    self.__count = count = self.__count - 1
TypeError: lock_release() takes no arguments (2 given)

[1]+ Exit 1 python2.6 ./hang.py

Robert Collins (lifeless) wrote :

Bah, thats just me failing to use the signal correctly. Updating and redoing.

Robert Collins (lifeless) wrote :

Fixed test.

Robert Collins (lifeless) wrote :

Race during release:

Traceback (most recent call last):
  File "./hang.py", line 15, in <module>
    lock_release()
  File "./hang.py", line 8, in lock_release
    l.release()
  File "/usr/lib/python2.6/threading.py", line 138, in release
    raise RuntimeError("cannot release un-acquired lock")
RuntimeError: cannot release un-acquired lock

Robert Collins (lifeless) wrote :

And similarly it will hang sometimes (observed by a sudden lack of '.'s). Strace shows
 sudo strace -p 10569
Process 10569 attached - interrupt to quit
futex(0xbf9730, FUTEX_WAIT_PRIVATE, 0, NULL

which is consistent with (shock, horror) the hangs LP sees.

Robert Collins (lifeless) wrote :

Now, in ticket 3001 (http://bugs.python.org/issue3001) upstream has reimplemented RLock in C, so this may be fixed (or worse); need a 3.x syntax version of hang.py

Robert Collins (lifeless) wrote :

I can't manually trigger a failure on 3.2, so this is limited to the 2.x releases of python (which we'll be on for a bit more time).

Robert Collins (lifeless) wrote :

(Probably limited to).

Robert Collins (lifeless) wrote :

Ah, they kept a python version of RLock around, it will be vulnerable :).

Robert Collins (lifeless) wrote :

And yes it is - modified version to disable the python implementation for demonstration purposes.

Robert Collins (lifeless) wrote :

bah, disable the *C* implementation. /etired.

Robert Collins (lifeless) wrote :

Filed upstream as http://bugs.python.org/issue13697.

Next step for us is to fix ZConfig to not use locks.

Robert Collins (lifeless) wrote :

(from the signal handler that is - locking is fine, just not within signal handling :))

Robert Collins (lifeless) wrote :

escalating - this is consuming regular ops time

Changed in launchpad:
importance: High → Critical
tags: added: canonical-losa-lp
Robert Collins (lifeless) wrote :

As this is fallout from a prior change which will need reverting, I'm going to duplicate on that earlier bug.

To post a comment you must log in.
This report contains Public information  Edit
Everyone can see this information.

Other bug subscribers

Bug attachments

Remote bug watches

Bug watches keep track of this bug in other bug trackers.