I checked Georgy's environment where this was reproduced and we found that:
1) Georgy changes the nova-compute configuration and restarts the process right *before* the test
2) the restart actually happens in the middle of the test, because it is asynchronous
3) from the logs we see that nova-compute was stopped abruptly with KILL
root@node-3:~# dmesg | grep -i nova
[ 1876.863558] init: nova-compute main process (25612) killed by KILL signal
[ 2705.071550] init: nova-compute main process (6303) killed by KILL signal
4) tracing with perf_events shows that this happens on "service nova-compute restart" (http://paste.openstack.org/show/508564/): upstart first sends a TERM signal and waits for the process to terminate gracefully, and after $timeout seconds (defaults to 5 - http://upstart.ubuntu.com/cookbook/#kill-timeout) it sends a KILL signal
5) the problem is that this kill timeout is not aligned with the graceful shutdown timeout value in oslo.service - https://github.com/openstack/oslo.service/blob/master/oslo_service/_options.py#L49 - which is 60 seconds.
In my opinion we should tweak the upstart scripts of all OpenStack services which use oslo.service (basically every one of them) and set the "kill timeout" value in upstart to (graceful_shutdown_timeout + 5) to make sure we first give services a chance to terminate gracefully and finish all in-flight requests.
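
To illustrate the timing mismatch from points 4 and 5, here is a minimal Python sketch of the TERM-then-KILL sequence. It is not upstart or oslo.service code: the helper name stop_like_upstart, the child script, and the scaled-down timings are all made up for the illustration, and it assumes a POSIX system.

```python
import signal
import subprocess
import sys

# Hypothetical stand-in for a service: on SIGTERM it spends `drain` seconds
# "finishing in-flight requests" and then exits cleanly with status 0.
CHILD = r"""
import signal, sys, time

def on_term(signum, frame):
    time.sleep(float(sys.argv[1]))  # pretend to drain in-flight requests
    sys.exit(0)

signal.signal(signal.SIGTERM, on_term)
print("ready", flush=True)
while True:
    time.sleep(0.05)
"""


def stop_like_upstart(drain, kill_timeout):
    """Send TERM, wait up to kill_timeout seconds, then send KILL.

    Returns the child's return code: 0 for a graceful exit, or a negative
    signal number if the child was killed by a signal (POSIX only).
    """
    proc = subprocess.Popen(
        [sys.executable, "-c", CHILD, str(drain)],
        stdout=subprocess.PIPE,
    )
    proc.stdout.readline()  # block until the child has installed its handler
    proc.send_signal(signal.SIGTERM)
    try:
        proc.wait(timeout=kill_timeout)
    except subprocess.TimeoutExpired:
        # The kill timeout expired before the graceful shutdown finished.
        proc.kill()
        proc.wait()
    proc.stdout.close()
    return proc.returncode


if __name__ == "__main__":
    # Kill timeout shorter than the drain time: the child gets SIGKILL.
    print("short kill timeout:", stop_like_upstart(drain=2.0, kill_timeout=0.2))
    # Kill timeout longer than the drain time: the child exits on its own.
    print("long kill timeout: ", stop_like_upstart(drain=0.1, kill_timeout=5.0))
```

With the real values, a 5 second kill timeout against a 60 second graceful shutdown window is the first case. Assuming graceful_shutdown_timeout is left at 60, the proposal above would correspond to a "kill timeout 65" stanza in each service's upstart job.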