Comment 14 for bug 1357578

Robert Collins (lifeless) wrote:

I've deleted the testr task: testr is (correctly) waiting for the test workers to finish (http://paste.openstack.org/show/112990/) - we can open a feature request for a testr supervisor timeout facility, but it's orthogonal to this nova issue.

Sean has grabbed stuff with gdb:
20727 - a worker we think - http://paste.openstack.org/show/112998/ - this appears to be waiting for a new request
20725 - parent of the worker, stuck in the poll hub - http://paste.openstack.org/show/113014/

Looking at the test code there are lots of issues that will make this test fragile:
 - it assumes 0 delay between /a/ worker being spawned and all of them being spawned

 - I was going to say that we don't set the multiprocessing workers' 'daemon' flag - https://docs.python.org/2/library/multiprocessing.html - and that this will cause a parent process that exits to hang [same as with threads] unless the process object is killed out of band. OTOH if we want to guarantee that all in-progress requests finish cleanly, that behaviour is appropriate - but then we have no timeout protection on each request. However, I observe that we don't actually use multiprocessing at all - we've grown our own thing :(.

 - it uses tearDown, which is not guaranteed to run if setUp fails.
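
As an aside on the daemon flag point: here's a minimal sketch of the stdlib multiprocessing behaviour described above (illustrative only - nova doesn't actually use multiprocessing, and spawn_worker is a name of my own):

```python
import multiprocessing
import time


def handle_request():
    # Stand-in for a long-running request handler in a worker.
    time.sleep(60)


def spawn_worker(daemon=True):
    # Daemonic children are terminated automatically when the parent exits;
    # non-daemonic children are joined at interpreter exit, so a hung request
    # blocks the parent forever unless the Process is killed out of band.
    ctx = multiprocessing.get_context('fork')  # keep the sketch POSIX-simple
    p = ctx.Process(target=handle_request)
    p.daemon = daemon
    p.start()
    return p
```

With daemon=False the parent would sit in the implicit join for the full 60 seconds at exit - which is the clean-shutdown guarantee, but it's only safe in combination with a per-request timeout.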

The hanging process is from test_restart_sighup.

This test doesn't seem to test what SIGHUP does: SIGHUP doesn't restart any children, and the test doesn't strace the processes to see whether SIGHUP is actually propagated to the children, so the call to _wait should be a 100% no-op, *except* that we want to know the processes aren't being killed - that's why we cross-check the pids. But the use of _wait means we'll exit on the first check, not allowing any time for the service process to do the wrong thing. That is, it can falsely pass.
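
To illustrate, a hypothetical checker (the names here are mine, not the test's) that holds the pid assertion for the whole window instead of exiting on the first successful check:

```python
import time


def assert_pids_stable(get_pids, expected, timeout=5.0, interval=0.1):
    # Poll for the entire window: a single check can falsely pass if the
    # service kills or restarts its children a moment after we look.
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        assert get_pids() == expected, "worker pids changed after SIGHUP"
        time.sleep(interval)
```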

The service code for handling signals is racy: it sets sigcaught to the received signal, but multiple signals may be delivered before any are processed, so earlier signals can be lost. That said, I have no reason to believe that that is what's happening here.

This is a worry:
            except fixtures.TimeoutException:

we catch it, but we don't re-raise it. So we could 'pass' after being timed out, if it happens in tearDown. That seems wrong to me.
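
The safer shape re-raises after cleanup; a sketch with a stand-in for fixtures.TimeoutException and a hypothetical teardown path:

```python
class TimeoutException(Exception):
    """Stand-in for fixtures.TimeoutException."""


def stop_workers(timed_out):
    # Hypothetical helper: raises when the fixture timeout fires mid-teardown.
    if timed_out:
        raise TimeoutException()


def teardown(timed_out=False):
    try:
        stop_workers(timed_out)
    except TimeoutException:
        # ...any last-ditch cleanup...
        raise  # without this, a timed-out test can still report success
```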