Comment 14 for bug 1357578

Robert Collins (lifeless) wrote:

I've deleted the testr task: testr is (correctly) waiting for the test workers to finish (http://paste.openstack.org/show/112990/) - we can open a feature request for a testr supervisor timeout facility, but it's orthogonal to this nova issue.

Sean has grabbed stuff with gdb:
20727 - a worker we think - http://paste.openstack.org/show/112998/ - this appears to be waiting for a new request
20725 - parent of the worker, stuck in the poll hub - http://paste.openstack.org/show/113014/

Looking at the test code there are lots of issues that will make this test fragile:
 - it assumes 0 delay between /a/ worker being spawned and all of them being spawned

 - I was going to say that we don't set the multiprocessing workers' 'daemon' flag - https://docs.python.org/2/library/multiprocessing.html - and that this will cause a parent process that exits to hang [same as with threads] unless the process object is killed out of band. OTOH if we want to guarantee that all in-progress requests finish cleanly, that behaviour is appropriate - but then we have no timeout protection on each request. However, I observe that we don't actually use multiprocessing at all - we've grown our own thing :(.

 - it uses tearDown, which is not guaranteed to run if setUp fails.
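
As an aside on the daemon flag point: here's a minimal sketch of the stdlib multiprocessing behaviour described above (illustrative only - nova doesn't actually use multiprocessing, and spawn_worker is a name of my own):

```python
import multiprocessing
import time


def handle_request():
    # Stand-in for a long-running request handler in a worker.
    time.sleep(60)


def spawn_worker(daemon=True):
    # Daemonic children are terminated automatically when the parent exits;
    # non-daemonic children are joined at interpreter exit, so a hung request
    # blocks the parent forever unless the Process is killed out of band.
    ctx = multiprocessing.get_context('fork')  # keep the sketch POSIX-simple
    p = ctx.Process(target=handle_request)
    p.daemon = daemon
    p.start()
    return p
```

With daemon=False the parent would sit in the implicit join for the full 60 seconds at exit - which is the clean-shutdown guarantee, but it's only safe in combination with a per-request timeout.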

The hanging process is from test_restart_sighup.

This test doesn't seem to test what SIGHUP does: SIGHUP doesn't restart any children, and the test doesn't strace the processes to see whether SIGHUP is actually propagated to the children, so the call to _wait should be a 100% no-op, *except* that we want to know the processes aren't being killed - that's why we cross-check the pids. But the use of _wait means we'll exit on the first check, not allowing any time for the service process to do the wrong thing. That is, it can falsely pass.
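
To illustrate, a hypothetical checker (the names here are mine, not the test's) that holds the pid assertion for the whole window instead of exiting on the first successful check:

```python
import time


def assert_pids_stable(get_pids, expected, timeout=5.0, interval=0.1):
    # Poll for the entire window: a single check can falsely pass if the
    # service kills or restarts its children a moment after we look.
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        assert get_pids() == expected, "worker pids changed after SIGHUP"
        time.sleep(interval)
```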

The service code for handling signals is racy: it sets sigcaught to the received signal, but multiple signals may be delivered before any are processed, so earlier signals can be lost. That said, I have no reason to believe that that is what's happening here.

This is a worry:
            except fixtures.TimeoutException:

we catch it, but we don't re-raise it. So we could 'pass' after being timed out, if it happens in tearDown. That seems wrong to me.
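
The safer shape re-raises after cleanup; a sketch with a stand-in for fixtures.TimeoutException and a hypothetical teardown path:

```python
class TimeoutException(Exception):
    """Stand-in for fixtures.TimeoutException."""


def stop_workers(timed_out):
    # Hypothetical helper: raises when the fixture timeout fires mid-teardown.
    if timed_out:
        raise TimeoutException()


def teardown(timed_out=False):
    try:
        stop_workers(timed_out)
    except TimeoutException:
        # ...any last-ditch cleanup...
        raise  # without this, a timed-out test can still report success
```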