Using logging module from signal handler can wedge process

Bug #1029727 reported by Adam Gandelman
This bug affects 1 person
Affects                     Status         Importance   Assigned to         Milestone
Glance                      Fix Released   Low          Mark Washenberger
Glance (Grizzly series)     Fix Released   Low          Mark Washenberger
OpenStack Compute (nova)    Fix Released   Low          Johannes Erdfelt

Bug Description

Running into an issue with nova-api which may be threading related. Having lots of fun trying to reproduce and debug, but not having much luck yet. Filing here to track my progress and hopefully solicit input from others.

First, some background: we're triggering multi-node test deployments for upstream commits using Juju + MAAS. The process of deploying services with Juju is very much event-driven. A service is deployed, and relations are added to other services. Each new relation added fires a number of hook scripts on the machine that do various bits of configuration. For most of the nova components, this involves making a few modifications to nova.conf or the paste config and restarting the running services (using upstart).

In the case of nova-api, the service is deployed to an empty machine, the packages are installed and the services are stopped. Then a number of relations are added to the other services in the environment (keystone, rabbit, mysql, glance, etc). The nova-api configuration flow usually looks something like:

deploy -> install nova-api, stop service
mysql relation added -> set sql_connection, restart nova-api
rabbit relation added -> update nova.conf's rabbit settings, restart nova-api
keystone relation added -> update paste.ini with credentials, restart nova-api
etc, etc, etc

This all happens in rapid succession. The hooks themselves are not executed concurrently, but since nova doesn't fork and upstart launches it in the foreground, service restarts are basically asynchronous. Depending on how quickly the hooks fire off, the service may be stopped and started (restarted) while the previous process is still going through its initialization. So it's likely that every time the service is restarted, it's in a different stage of init, with a different configuration.

It seems that restarting nova-api in this manner is not reliable. It's not 100% reproducible, but calls to stop nova-api will sometimes leave an orphaned child, still bound to the port it was listening on (but unresponsive to connections, sitting in epoll_wait()). AFAICS, the parent process is mistakenly getting killed off before its children. At this point, nova-api is stopped as far as upstart is concerned. The following hook will try to start nova-api and the service will fail because the port is already in use. The only way to recover is to manually kill the orphaned child and restart the service.

I *was* seeing the following traceback in the API log whenever it hit this state, but these have disappeared from the logs over the last day or two, though the problem still persists.

2012-07-25 13:07:20 INFO nova.wsgi [-] WSGI server has stopped.
2012-07-25 13:07:20 INFO nova.wsgi [-] WSGI server has stopped.
2012-07-25 13:07:20 INFO nova.service [-] Child 6519 exited with status 0
2012-07-25 13:07:20 INFO nova.service [-] Child 6518 exited with status 0
2012-07-25 13:07:20 INFO nova.service [-] Child 6517 exited with status 0
2012-07-25 13:07:20 ERROR nova.service [-] Unhandled exception
2012-07-25 13:07:20 TRACE nova.service Traceback (most recent call last):
2012-07-25 13:07:20 TRACE nova.service File "/usr/lib/python2.7/dist-packages/nova/service.py", line 289, in _start_child
2012-07-25 13:07:20 TRACE nova.service self._child_process(wrap.server)
2012-07-25 13:07:20 TRACE nova.service File "/usr/lib/python2.7/dist-packages/nova/service.py", line 266, in _child_process
2012-07-25 13:07:20 TRACE nova.service launcher.run_server(server)
2012-07-25 13:07:20 TRACE nova.service File "/usr/lib/python2.7/dist-packages/nova/service.py", line 127, in run_server
2012-07-25 13:07:20 TRACE nova.service server.wait()
2012-07-25 13:07:20 TRACE nova.service File "/usr/lib/python2.7/dist-packages/nova/service.py", line 619, in wait
2012-07-25 13:07:20 TRACE nova.service self.server.wait()
2012-07-25 13:07:20 TRACE nova.service File "/usr/lib/python2.7/dist-packages/nova/wsgi.py", line 113, in wait
2012-07-25 13:07:20 TRACE nova.service self._server.wait()
2012-07-25 13:07:20 TRACE nova.service File "/usr/lib/python2.7/dist-packages/eventlet/greenthread.py", line 166, in wait
2012-07-25 13:07:20 TRACE nova.service return self._exit_event.wait()
2012-07-25 13:07:20 TRACE nova.service File "/usr/lib/python2.7/dist-packages/eventlet/event.py", line 116, in wait
2012-07-25 13:07:20 TRACE nova.service return hubs.get_hub().switch()
2012-07-25 13:07:20 TRACE nova.service File "/usr/lib/python2.7/dist-packages/eventlet/hubs/hub.py", line 177, in switch
2012-07-25 13:07:20 TRACE nova.service return self.greenlet.switch()
2012-07-25 13:07:20 TRACE nova.service File "/usr/lib/python2.7/dist-packages/eventlet/hubs/hub.py", line 226, in run
2012-07-25 13:07:20 TRACE nova.service self.wait(sleep_time)
2012-07-25 13:07:20 TRACE nova.service File "/usr/lib/python2.7/dist-packages/eventlet/hubs/poll.py", line 84, in wait
2012-07-25 13:07:20 TRACE nova.service presult = self.do_poll(seconds)
2012-07-25 13:07:20 TRACE nova.service File "/usr/lib/python2.7/dist-packages/nova/service.py", line 245, in _sigterm
2012-07-25 13:07:20 TRACE nova.service LOG.info(_('Received SIGTERM, stopping'))
2012-07-25 13:07:20 TRACE nova.service File "/usr/lib/python2.7/logging/__init__.py", line 1420, in info
2012-07-25 13:07:20 TRACE nova.service self.logger.info(msg, *args, **kwargs)
2012-07-25 13:07:20 TRACE nova.service File "/usr/lib/python2.7/logging/__init__.py", line 1140, in info
2012-07-25 13:07:20 TRACE nova.service self._log(INFO, msg, args, **kwargs)
2012-07-25 13:07:20 TRACE nova.service File "/usr/lib/python2.7/logging/__init__.py", line 1258, in _log
2012-07-25 13:07:20 TRACE nova.service self.handle(record)
2012-07-25 13:07:20 TRACE nova.service File "/usr/lib/python2.7/logging/__init__.py", line 1268, in handle
2012-07-25 13:07:20 TRACE nova.service self.callHandlers(record)
2012-07-25 13:07:20 TRACE nova.service File "/usr/lib/python2.7/logging/__init__.py", line 1308, in callHandlers
2012-07-25 13:07:20 TRACE nova.service hdlr.handle(record)
2012-07-25 13:07:20 TRACE nova.service File "/usr/lib/python2.7/logging/__init__.py", line 746, in handle
2012-07-25 13:07:20 TRACE nova.service self.acquire()
2012-07-25 13:07:20 TRACE nova.service File "/usr/lib/python2.7/logging/__init__.py", line 697, in acquire
2012-07-25 13:07:20 TRACE nova.service self.lock.acquire()
2012-07-25 13:07:20 TRACE nova.service File "/usr/lib/python2.7/threading.py", line 127, in acquire
2012-07-25 13:07:20 TRACE nova.service rc = self.__block.acquire(blocking)
2012-07-25 13:07:20 TRACE nova.service File "/usr/lib/python2.7/dist-packages/eventlet/semaphore.py", line 71, in acquire
2012-07-25 13:07:20 TRACE nova.service hubs.get_hub().switch()
2012-07-25 13:07:20 TRACE nova.service File "/usr/lib/python2.7/dist-packages/eventlet/hubs/hub.py", line 162, in switch
2012-07-25 13:07:20 TRACE nova.service assert cur is not self.greenlet, 'Cannot switch to MAINLOOP from MAINLOOP'
2012-07-25 13:07:20 TRACE nova.service AssertionError: Cannot switch to MAINLOOP from MAINLOOP
2012-07-25 13:07:20 TRACE nova.service
2012-07-25 13:07:22 INFO nova.service [-] Parent process has died unexpectedly, exiting

I can't seem to reproduce this manually, but I am in the process of adding debugging capabilities to the systems during provisioning, so I should be able to get gdb attached to one of the stuck processes tomorrow. Any clues in the meantime would be appreciated.

I partly want to migrate to using individual API servers instead of the monolithic nova-api, but if nova-api remains it should be able to handle restarts robustly.

Revision history for this message
Adam Gandelman (gandelman-a) wrote :

I've also observed the same thing happen to nova-compute processes. In that case, upstart loses track of the process and calls to restart it end up launching a second nova-compute process, though the first remains responsive, servicing requests from the scheduler with a bad config (the defaults).

I have a feeling https://review.openstack.org/#/c/8733/ has something to do with this, and also that the disappearance of the original traceback is the result of moving to openstack-common logging functionality.

The thread at https://lists.secondlife.com/pipermail/eventletdev/2011-February/000952.html seems consistent with what I'm seeing. I've applied a small test patch to our repository that disables the logging calls from the signal handler to see if that helps.
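
For reference, the failure in the traceback above boils down to calling into the logging module from inside a signal handler while eventlet is monkey-patched. A rough sketch of the unsafe pattern is below; the handler name and log message are taken from the traceback, the rest is illustrative rather than the actual nova code.

import logging
import signal

LOG = logging.getLogger(__name__)

def _sigterm(signum, frame):
    # Unsafe: logging handlers serialize emit() with a lock. Under eventlet
    # monkey-patching that lock is backed by a green semaphore, so if the
    # signal is delivered while the hub greenlet is sitting in epoll_wait(),
    # acquiring it tries to switch greenlets from inside the hub and trips
    # the "Cannot switch to MAINLOOP from MAINLOOP" assertion seen above; in
    # other interleavings the lock may simply already be held and the
    # process deadlocks.
    LOG.info('Received SIGTERM, stopping')
    # ... stop the WSGI server, reap children, etc.

signal.signal(signal.SIGTERM, _sigterm)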

Revision history for this message
Adam Gandelman (gandelman-a) wrote :

Removing the logging calls from signal handlers helps reduce the number of times this deadlock triggers, but I'm still seeing it, with less frequency, in different places in the Python code path, still in epoll_wait(). Now I'm finding the wedged child processes locked at:

(gdb) pystack
/usr/lib/python2.7/dist-packages/eventlet/hubs/epolls.py (59): do_poll
/usr/lib/python2.7/dist-packages/eventlet/hubs/poll.py (84): wait
/usr/lib/python2.7/dist-packages/eventlet/hubs/hub.py (226): run

Revision history for this message
Johannes Erdfelt (johannes.erdfelt) wrote :

There are only two layers of processes in multi-process mode: the master process and the child processes. nova-api will wait for its children to exit before it exits.

If there's a process still running afterwards, then either it wasn't created by the master process (or by the code that tracks the children), or the master never waited for its children to finish before exiting.

Your traceback makes it look like the master never waited, which would be consistent with the behavior you saw.

It does look like using logging in a signal handler is a bad idea. I'll put a patch up for review to avoid that.
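
To make that two-layer layout concrete, here is a minimal sketch of the master/worker pattern (hypothetical helper names, not the actual nova service code). A child that the master never reaps is exactly the orphan described in the original report: re-parented to init but still bound to the listen socket.

import os

def launch_workers(nworkers, run_worker):
    children = set()
    for _ in range(nworkers):
        pid = os.fork()
        if pid == 0:
            # Child: serve requests on the already-bound socket, never return.
            run_worker()
            os._exit(0)
        children.add(pid)  # Parent: track every child it created.

    # The master must reap every child before it exits. If it dies early
    # (for example, wedged inside a signal handler), the children keep the
    # listening socket open and the next service start fails because the
    # address is already in use.
    while children:
        pid, _status = os.wait()
        children.discard(pid)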

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (master)

Fix proposed to branch: master
Review: https://review.openstack.org/10675

Changed in nova:
assignee: nobody → Johannes Erdfelt (johannes.erdfelt)
status: New → In Progress
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to nova (master)

Reviewed: https://review.openstack.org/10675
Committed: http://github.com/openstack/nova/commit/5de983ae390993450ac182eff2de3f358593f847
Submitter: Jenkins
Branch: master

commit 5de983ae390993450ac182eff2de3f358593f847
Author: Johannes Erdfelt <email address hidden>
Date: Wed Aug 1 18:19:22 2012 +0000

    Avoid using logging in signal handler

    Fixes bug 1029727

    In some cases, logging can grab locks and thusly attempt to reschedule,
    which will fail in signal handlers.

    This patch simplifies the signal handlers for multi-process support to
    only reset the signal handlers and then raise an exception. This will
    move all logging and other cleanup out of the signal handler and into
    safer places.

    Change-Id: I0878adf6ef5c54e577ab2ea39e2ff9845e0e9191
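
A minimal sketch of the approach the commit message describes is below. The SignalExit class and serve() wrapper are illustrative names rather than the actual nova code: the handler only restores default signal handling and raises, and any logging happens afterwards in ordinary, non-handler context.

import signal

class SignalExit(SystemExit):
    def __init__(self, signo):
        super(SignalExit, self).__init__()
        self.signo = signo

def _handle_signal(signo, frame):
    # Do the bare minimum here: restore the default handlers so a second
    # signal is not swallowed, then raise. No logging, no lock acquisition.
    signal.signal(signal.SIGTERM, signal.SIG_DFL)
    signal.signal(signal.SIGINT, signal.SIG_DFL)
    raise SignalExit(signo)

def serve(server, log):
    signal.signal(signal.SIGTERM, _handle_signal)
    signal.signal(signal.SIGINT, _handle_signal)
    try:
        server.wait()
    except SignalExit as exc:
        # Safe to log here: we are back in normal execution, outside the
        # signal handler, so taking the logging lock cannot wedge the hub.
        log.info('Caught signal %d, exiting', exc.signo)
    finally:
        server.stop()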

Changed in nova:
status: In Progress → Fix Committed
Thierry Carrez (ttx)
Changed in nova:
milestone: none → folsom-3
status: Fix Committed → Fix Released
Revision history for this message
Johannes Erdfelt (johannes.erdfelt) wrote :

Adding glance, since I noticed from perusing the code that it has the same problem.

summary: - nova-api gets process wedged at service shutdown, cannot be restarted
+ Using logging module from signal handler can wedge process
Thierry Carrez (ttx)
Changed in nova:
milestone: folsom-3 → 2012.2
Brian Waldon (bcwaldon)
Changed in glance:
status: New → Triaged
importance: Undecided → Low
tags: added: folsom-backport-potential
Changed in nova:
importance: Undecided → Low
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to glance (master)

Fix proposed to branch: master
Review: https://review.openstack.org/22760

Changed in glance:
assignee: nobody → Mark Washenberger (markwash)
status: Triaged → In Progress
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to glance (master)

Reviewed: https://review.openstack.org/22760
Committed: http://github.com/openstack/glance/commit/42ae9dae931e2afebf916d82b39cc3481b300718
Submitter: Jenkins
Branch: master

commit 42ae9dae931e2afebf916d82b39cc3481b300718
Author: Mark J. Washenberger <email address hidden>
Date: Fri Feb 22 14:59:02 2013 -0800

    Avoid using logging in signal handlers

    Fixes bug 1029727

    It is possible for logging to cause problems during signal handlers.
    Fortunately, the logging we have in signal handlers is minimal and does
    not add enough value to try to keep through workarounds.

    Change-Id: I80708874c9740c906e5a49e0d9c1b45174470a85

Changed in glance:
status: In Progress → Fix Committed
Thierry Carrez (ttx)
Changed in glance:
milestone: none → grizzly-rc1
status: Fix Committed → Fix Released