Comment 3 for bug 1589490

Roman Podoliaka (rpodolyaka) wrote :

I checked Georgy's environment where this was reproduced and we found that:

1) Georgy changes the nova-compute configuration and restarts the process right *before* the test

2) the restart actually happens in the middle of the test, because it is asynchronous

3) the logs show that nova-compute was stopped abruptly with a KILL signal:

root@node-3:~# dmesg | grep -i nova
[ 1876.863558] init: nova-compute main process (25612) killed by KILL signal
[ 2705.071550] init: nova-compute main process (6303) killed by KILL signal

4) tracing with perf_events shows that this happens on "service nova-compute restart" (http://paste.openstack.org/show/508564/): upstart first sends a TERM signal and waits for the process to terminate gracefully, and after $timeout seconds (defaults to 5 - http://upstart.ubuntu.com/cookbook/#kill-timeout) it sends a KILL signal

5) the problem is that this kill timeout is not aligned with the graceful shutdown timeout value in oslo.service - https://github.com/openstack/oslo.service/blob/master/oslo_service/_options.py#L49 - which is 60 seconds, so a service that is still finishing its work is killed after only 5 seconds.
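The TERM-then-KILL sequence from points 4 and 5 can be reproduced in miniature. This is only a sketch for POSIX systems, not upstart's or oslo.service's actual code: `stop_like_init` and `CHILD_SOURCE` are hypothetical names, and the timings are scaled down from the real 5 and 60 seconds.

```python
import signal
import subprocess
import sys

# Hypothetical stand-in for a service: on SIGTERM it spends a while
# "finishing in-flight requests" and then exits cleanly with status 0.
CHILD_SOURCE = """
import signal, sys, time

def on_term(signum, frame):
    time.sleep(float(sys.argv[1]))  # simulated graceful shutdown work
    sys.exit(0)

signal.signal(signal.SIGTERM, on_term)
print("ready", flush=True)          # tell the parent the handler is installed
while True:
    time.sleep(0.05)
"""


def stop_like_init(shutdown_secs, kill_timeout):
    """Send TERM, wait up to kill_timeout seconds, then send KILL.

    Returns the child's return code: 0 for a graceful exit, or a negative
    signal number if the child was killed by a signal.
    """
    child = subprocess.Popen(
        [sys.executable, "-c", CHILD_SOURCE, str(shutdown_secs)],
        stdout=subprocess.PIPE,
    )
    child.stdout.readline()             # block until the child prints "ready"
    child.send_signal(signal.SIGTERM)   # step 1: ask for a graceful stop
    try:
        rc = child.wait(timeout=kill_timeout)
    except subprocess.TimeoutExpired:
        child.kill()                    # step 2: timeout expired, send KILL
        rc = child.wait()
    child.stdout.close()
    return rc


if __name__ == "__main__":
    # Kill timeout shorter than the shutdown work: the child gets KILLed.
    print(stop_like_init(shutdown_secs=1.0, kill_timeout=0.2))
    # Kill timeout longer than the shutdown work: the child exits cleanly.
    print(stop_like_init(shutdown_secs=0.2, kill_timeout=3.0))
```

When the kill timeout is shorter than the time the child needs to shut down, the return code is the negative KILL signal number, which corresponds to the "killed by KILL signal" lines in dmesg above.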

In my opinion we should tweak the upstart scripts of all OpenStack services which use oslo.service (basically every one of them) and set the "kill timeout" value in upstart to (graceful_shutdown_timeout + 5) to make sure we first give services a chance to terminate gracefully and finish all in-flight requests.
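As a rough illustration of the proposal, assuming graceful_shutdown_timeout is left at its default of 60 seconds, the change in an upstart job file would be a single stanza. The file path in the comment is an assumption and may differ per package:

```
# /etc/init/nova-compute.conf (excerpt; path is an assumption)
# graceful_shutdown_timeout (60) + 5 seconds of slack before KILL
kill timeout 65
```

If a deployment overrides graceful_shutdown_timeout, the value here would need to be adjusted to match.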