Using the logging module from a signal handler can wedge the process
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
Glance | Fix Released | Low | Mark Washenberger |
Grizzly | Fix Released | Low | Mark Washenberger |
OpenStack Compute (nova) | Fix Released | Low | Johannes Erdfelt |
Bug Description
Running into an issue with nova-api which may be threading related. Having lots of fun trying to reproduce and debug, but not having much luck yet. Filing here to track my progress and hopefully solicit input from others.
First some background... We're triggering multi-node test deployments for upstream commits using Juju + MAAS. The process of deploying services with Juju is very much event driven. A service is deployed, and relations are added to other services. Each new relation added fires a number of hook scripts on the machine that do various bits of configuration. For most of the nova components, this involves making a few modifications to nova.conf or paste config and restarting the running services (using upstart).
In the case of nova-api, the service is deployed to an empty machine, the packages install and the services are stopped. Then a number of relations are added to the other services in the environment (keystone, rabbit, mysql, glance, etc). The nova-api configuration flow usually looks something like:
deploy -> install nova-api, stop service
mysql relation added -> set sql_connection, restart nova-api
rabbit relation added -> update nova.conf's rabbit settings, restart nova-api
keystone relation added -> update paste.ini with credentials, restart nova-api
etc, etc, etc
This all happens in rapid succession. The hooks themselves are not executed concurrently, but since nova doesn't fork and upstart launches it in the foreground, service restarts are effectively asynchronous. Depending on how quickly the hooks fire, the service may be stopped and started (restarted) while the previous process is still going through its initialization. So it's likely that every time the service is restarted, it's in a different stage of init, with a different configuration.
It seems that nova-api being restarted in this manner is not reliable. It's not 100% reproducible, but calls to stop nova-api will sometimes leave an orphaned child, still bound to the port it was listening on (but unresponsive to connections, sitting in epoll_wait()). AFAICS, the parent process is mistakenly getting killed off before its children. At this point, nova-api is stopped as far as upstart is concerned. The following hook will try to start nova-api and the service will fail because the port is already in use. The only way to recover is to manually kill the orphaned child and restart the service.
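The port-in-use failure mode is easy to show in isolation: SO_REUSEADDR only helps with sockets left in TIME_WAIT, not with a port that another live process still holds in LISTEN, which is exactly the orphaned-child situation above. A minimal standalone sketch (plain Python, not nova code; port 0 just picks a free port for the demonstration):

```python
import socket

# An "orphaned child" socket, still listening.
first = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
first.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
first.bind(("127.0.0.1", 0))
first.listen(5)
port = first.getsockname()[1]

# The "restarted service" trying to bind the same port: even with
# SO_REUSEADDR set, bind() fails with EADDRINUSE while the first
# socket is still in LISTEN.
second = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
second.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
try:
    second.bind(("127.0.0.1", port))
    bound = True
except OSError:
    bound = False
```
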
I *was* seeing the following traceback in the api log whenever it hit this state, but those entries have disappeared from the logs over the last day or two, though the problem persists.
2012-07-25 13:07:20 INFO nova.wsgi [-] WSGI server has stopped.
2012-07-25 13:07:20 INFO nova.wsgi [-] WSGI server has stopped.
2012-07-25 13:07:20 INFO nova.service [-] Child 6519 exited with status 0
2012-07-25 13:07:20 INFO nova.service [-] Child 6518 exited with status 0
2012-07-25 13:07:20 INFO nova.service [-] Child 6517 exited with status 0
2012-07-25 13:07:20 ERROR nova.service [-] Unhandled exception
2012-07-25 13:07:20 TRACE nova.service Traceback (most recent call last):
2012-07-25 13:07:20 TRACE nova.service File "/usr/lib/
2012-07-25 13:07:20 TRACE nova.service self._child_
2012-07-25 13:07:20 TRACE nova.service File "/usr/lib/
2012-07-25 13:07:20 TRACE nova.service launcher.
2012-07-25 13:07:20 TRACE nova.service File "/usr/lib/
2012-07-25 13:07:20 TRACE nova.service server.wait()
2012-07-25 13:07:20 TRACE nova.service File "/usr/lib/
2012-07-25 13:07:20 TRACE nova.service self.server.wait()
2012-07-25 13:07:20 TRACE nova.service File "/usr/lib/
2012-07-25 13:07:20 TRACE nova.service self._server.wait()
2012-07-25 13:07:20 TRACE nova.service File "/usr/lib/
2012-07-25 13:07:20 TRACE nova.service return self._exit_
2012-07-25 13:07:20 TRACE nova.service File "/usr/lib/
2012-07-25 13:07:20 TRACE nova.service return hubs.get_
2012-07-25 13:07:20 TRACE nova.service File "/usr/lib/
2012-07-25 13:07:20 TRACE nova.service return self.greenlet.
2012-07-25 13:07:20 TRACE nova.service File "/usr/lib/
2012-07-25 13:07:20 TRACE nova.service self.wait(
2012-07-25 13:07:20 TRACE nova.service File "/usr/lib/
2012-07-25 13:07:20 TRACE nova.service presult = self.do_
2012-07-25 13:07:20 TRACE nova.service File "/usr/lib/
2012-07-25 13:07:20 TRACE nova.service LOG.info(
2012-07-25 13:07:20 TRACE nova.service File "/usr/lib/
2012-07-25 13:07:20 TRACE nova.service self.logger.
2012-07-25 13:07:20 TRACE nova.service File "/usr/lib/
2012-07-25 13:07:20 TRACE nova.service self._log(INFO, msg, args, **kwargs)
2012-07-25 13:07:20 TRACE nova.service File "/usr/lib/
2012-07-25 13:07:20 TRACE nova.service self.handle(record)
2012-07-25 13:07:20 TRACE nova.service File "/usr/lib/
2012-07-25 13:07:20 TRACE nova.service self.callHandle
2012-07-25 13:07:20 TRACE nova.service File "/usr/lib/
2012-07-25 13:07:20 TRACE nova.service hdlr.handle(record)
2012-07-25 13:07:20 TRACE nova.service File "/usr/lib/
2012-07-25 13:07:20 TRACE nova.service self.acquire()
2012-07-25 13:07:20 TRACE nova.service File "/usr/lib/
2012-07-25 13:07:20 TRACE nova.service self.lock.acquire()
2012-07-25 13:07:20 TRACE nova.service File "/usr/lib/
2012-07-25 13:07:20 TRACE nova.service rc = self.__
2012-07-25 13:07:20 TRACE nova.service File "/usr/lib/
2012-07-25 13:07:20 TRACE nova.service hubs.get_
2012-07-25 13:07:20 TRACE nova.service File "/usr/lib/
2012-07-25 13:07:20 TRACE nova.service assert cur is not self.greenlet, 'Cannot switch to MAINLOOP from MAINLOOP'
2012-07-25 13:07:20 TRACE nova.service AssertionError: Cannot switch to MAINLOOP from MAINLOOP
2012-07-25 13:07:20 TRACE nova.service
2012-07-25 13:07:22 INFO nova.service [-] Parent process has died unexpectedly, exiting
I can't seem to reproduce this manually, but I am in the process of adding debugging capabilities to the systems during provisioning, so I should be able to get gdb attached to one of the stuck processes tomorrow. Any clues in the meantime would be appreciated.
I partly want to migrate to using individual API servers instead of the monolithic nova-api, but as long as nova-api remains, it should be able to handle restarts robustly.
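For reference, the traceback above bottoms out in logging's lock acquire (`self.lock.acquire()`) on the child-exit handling path, which is why logging from signal context is the suspect. The usual remedy is the self-pipe trick: the handler does only an async-signal-safe write(), and all reaping and logging happen later in normal main-loop context. A minimal sketch under those assumptions (hypothetical names, not nova's actual service code):

```python
import os
import signal
import logging

logging.basicConfig(level=logging.INFO)
LOG = logging.getLogger("child-reaper")

# Self-pipe trick: the handler only writes a wakeup byte; reaping and
# logging are deferred to the main loop, outside signal context.
r_fd, w_fd = os.pipe()
os.set_blocking(w_fd, False)

def on_sigchld(signum, frame):
    # No locks, no logging, no greenlet switches in here.
    try:
        os.write(w_fd, b"\0")
    except BlockingIOError:
        pass  # pipe already full; a wakeup is pending anyway

signal.signal(signal.SIGCHLD, on_sigchld)

def reap_children():
    """Call from the main loop once r_fd becomes readable."""
    os.read(r_fd, 512)  # drain wakeup bytes
    reaped = []
    while True:
        try:
            pid, status = os.waitpid(-1, os.WNOHANG)
        except ChildProcessError:
            break  # no children left
        if pid == 0:
            break  # children exist but none have exited
        # Safe to log here: we are no longer inside a signal handler.
        LOG.info("Child %d exited with status %d", pid, status)
        reaped.append(pid)
    return reaped
```
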
Changed in nova:
milestone: none → folsom-3
status: Fix Committed → Fix Released
Changed in nova:
milestone: folsom-3 → 2012.2
Changed in glance:
status: New → Triaged
importance: Undecided → Low
tags: added: folsom-backport-potential
Changed in nova:
importance: Undecided → Low
Changed in glance:
milestone: none → grizzly-rc1
status: Fix Committed → Fix Released
I've also observed the same thing happen to nova-compute processes. In that case, upstart 'loses track' of the process, and calls to restart it end up launching a second nova-compute process, though the first remains responsive, servicing requests from the scheduler with a bad config (the defaults).
I have a feeling https://review.openstack.org/#/c/8733/ has something to do with this, and also that the disappearance of the original traceback is the result of moving to the openstack-common logging functionality.
The thread at https://lists.secondlife.com/pipermail/eventletdev/2011-February/000952.html seems consistent with what I'm seeing. I've applied a small testing patch to our repository that disables the logging calls from the signal handler to see if that helps.
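The rough shape of such a patch: the signal handler records the event and nothing more, and the service's main loop performs the logging later. A minimal flag-based sketch (hypothetical names, standalone, not the actual patch):

```python
import os
import signal
import logging

logging.basicConfig(level=logging.INFO)
LOG = logging.getLogger("service")

_sigterm_pending = False

def on_sigterm(signum, frame):
    # Record the event only. Logging takes a lock (and, under eventlet
    # monkeypatching, can switch greenlets), which is unsafe to do from
    # signal context.
    global _sigterm_pending
    _sigterm_pending = True

signal.signal(signal.SIGTERM, on_sigterm)

def poll_signals():
    """Call periodically from the main loop; returns True on shutdown."""
    global _sigterm_pending
    if _sigterm_pending:
        _sigterm_pending = False
        # Safe here: we are in ordinary main-loop context.
        LOG.info("Caught SIGTERM, beginning shutdown")
        return True
    return False
```
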