The glance.tests.functional.test_reload.TestReload.test_reload() test has been causing spurious deadlocks in functional test jobs, resulting in TIMED_OUT job statuses when the global timeout expires. It can be reproduced locally given enough runs, and Zuul runs the job often enough to hit it fairly regularly.
I have tracked this down to the test_reload() test. When I reproduce the hang locally, I find the test stuck in an infinite waitpid() on the API master process that the FunctionalTest base class started for it. The test tracks child PIDs of the master as it initiates several SIGHUP operations. Upon exit, the FunctionalTest.cleanup() routine runs and ends up waitpid()ing on the master process forever. A process list shows all the other stestr workers in Z state waiting for the final worker to complete, and that final worker, stuck on waitpid(), still has the master process and both of its worker processes running. Upon killing the master, stestr frees up, reports status from the test, and exits normally.
The stack trace of the hung test process, taken after manually signaling the master it is waiting for, is:
Traceback (most recent call last):
File "/usr/lib64/python3.7/runpy.py", line 193, in _run_module_as_main
"__main__", mod_spec)
File "/usr/lib64/python3.7/runpy.py", line 85, in _run_code
exec(code, run_globals)
File "/home/dan/glance/.tox/functional/lib/python3.7/site-packages/stestr/subunit_runner/run.py", line 93, in <module>
main()
File "/home/dan/glance/.tox/functional/lib/python3.7/site-packages/stestr/subunit_runner/run.py", line 89, in main
testRunner=partial(runner, stdout=sys.stdout))
File "/home/dan/glance/.tox/functional/lib/python3.7/site-packages/stestr/subunit_runner/program.py", line 185, in __init__
self.runTests()
File "/home/dan/glance/.tox/functional/lib/python3.7/site-packages/stestr/subunit_runner/program.py", line 226, in runTests
self.result = testRunner.run(self.test)
File "/home/dan/glance/.tox/functional/lib/python3.7/site-packages/stestr/subunit_runner/run.py", line 52, in run
test(result)
File "/usr/lib64/python3.7/unittest/suite.py", line 84, in __call__
return self.run(*args, **kwds)
File "/usr/lib64/python3.7/unittest/suite.py", line 122, in run
test(result)
File "/usr/lib64/python3.7/unittest/suite.py", line 84, in __call__
return self.run(*args, **kwds)
File "/usr/lib64/python3.7/unittest/suite.py", line 122, in run
test(result)
File "/usr/lib64/python3.7/unittest/suite.py", line 84, in __call__
return self.run(*args, **kwds)
File "/usr/lib64/python3.7/unittest/suite.py", line 122, in run
test(result)
File "/home/dan/glance/.tox/functional/lib/python3.7/site-packages/unittest2/case.py", line 673, in __call__
return self.run(*args, **kwds)
File "/home/dan/glance/.tox/functional/lib/python3.7/site-packages/testtools/testcase.py", line 675, in run
return run_test.run(result)
File "/home/dan/glance/.tox/functional/lib/python3.7/site-packages/testtools/runtest.py", line 80, in run
return self._run_one(actual_result)
File "/home/dan/glance/.tox/functional/lib/python3.7/site-packages/testtools/runtest.py", line 94, in _run_one
return self._run_prepared_result(ExtendedToOriginalDecorator(result))
File "/home/dan/glance/.tox/functional/lib/python3.7/site-packages/testtools/runtest.py", line 119, in _run_prepared_result
raise e
File "/home/dan/glance/.tox/functional/lib/python3.7/site-packages/testtools/runtest.py", line 191, in _run_user
return fn(*args, **kwargs)
File "/home/dan/glance/glance/tests/functional/__init__.py", line 881, in cleanup
s.stop()
File "/home/dan/glance/glance/tests/functional/__init__.py", line 293, in stop
rc = test_utils.wait_for_fork(self.process_pid, raise_error=False)
File "/home/dan/glance/glance/tests/utils.py", line 294, in wait_for_fork
(pid, rc) = os.waitpid(pid, 0)
File "/home/dan/glance/.tox/functional/lib/python3.7/site-packages/eventlet/green/os.py", line 96, in waitpid
greenthread.sleep(0.01)
File "/home/dan/glance/.tox/functional/lib/python3.7/site-packages/eventlet/greenthread.py", line 36, in sleep
hub.switch()
File "/home/dan/glance/.tox/functional/lib/python3.7/site-packages/eventlet/hubs/hub.py", line 298, in switch
return self.greenlet.switch()
File "/home/dan/glance/.tox/functional/lib/python3.7/site-packages/eventlet/hubs/hub.py", line 350, in run
self.wait(sleep_time)
File "/home/dan/glance/.tox/functional/lib/python3.7/site-packages/eventlet/hubs/poll.py", line 80, in wait
presult = self.do_poll(seconds)
File "/home/dan/glance/.tox/functional/lib/python3.7/site-packages/eventlet/hubs/epolls.py", line 31, in do_poll
return self.poll.poll(seconds)
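The bottom frames explain why this never returns: under eventlet's monkeypatched os, a blocking waitpid() is emulated by polling the real waitpid() with WNOHANG and yielding to the hub between attempts, so if the child never exits the loop spins forever. A rough paraphrase of that loop (not the literal eventlet source):

import os
import eventlet

def green_waitpid(pid, options):
    # Emulate a blocking waitpid() without blocking the whole process:
    # poll the real waitpid() non-blockingly and yield between attempts.
    while True:
        rpid, status = os.waitpid(pid, options | os.WNOHANG)
        if rpid:
            return rpid, status
        eventlet.sleep(0.01)  # the frame the hung stestr worker is parked in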
With some more debugging, I think what is happening when the hang occurs is that test_reload() fails, and we then block during cleanup waiting for a process to exit that never will.
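The obvious way to keep that from wedging the whole worker is to bound the wait. A minimal sketch of what a bounded wait_for_fork() could look like (signature and timeout here are illustrative; the real change is in the review linked below), so that a failure surfaces as a RuntimeError instead of a hang:

import os
import time

def wait_for_fork(pid, raise_error=True, timeout=30):
    # Poll with WNOHANG instead of os.waitpid(pid, 0), which never returns
    # if the child never exits; give up loudly after the timeout.
    deadline = time.time() + timeout
    while time.time() < deadline:
        waited, status = os.waitpid(pid, os.WNOHANG)
        if waited == pid:
            rc = os.WEXITSTATUS(status)
            if raise_error and rc != 0:
                raise RuntimeError('Child %i exited with code %i' % (pid, rc))
            return rc
        time.sleep(0.1)
    raise RuntimeError('Gave up waiting for %i to exit!' % pid)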
Using the actual proposed fix (https://review.opendev.org/#/c/745763/), which avoids the infinite deadlock, plus some modifications to remove the SIGKILL, I can make the test cleanup fail when the waitpid() never returns and thus see what the test was going to do. That happens about 5% of the time. When it does, I see failures like these from test_reload():
File "/home/ dan/glance/ .tox/functional /lib/python3. 7/site- packages/ unittest2/ case.py" , line 690, in fail eption( msg)
raise self.failureExc
AssertionError: Worker change timeout
and
File "/home/ dan/glance/ .tox/functional /lib/python3. 7/site- packages/ unittest2/ case.py" , line 690, in fail eption( msg)
raise self.failureExc
AssertionError: http bind_host timeout
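Both of those messages come from polling-style waits in the test giving up: test_reload() repeatedly re-checks something (the set of worker PIDs, or the reloaded bind_host) until a deadline and then fails the test with one of those messages. Purely illustrative, not the actual test helpers:

import time

def wait_until(check, timeout, fail_msg, fail):
    # Poll until check() is true or the deadline passes, then fail the test,
    # e.g. fail('Worker change timeout') or fail('http bind_host timeout').
    deadline = time.time() + timeout
    while time.time() < deadline:
        if check():
            return
        time.sleep(0.5)
    fail(fail_msg)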
And after that happens, I end up with one or more leaked cmd.api processes, in addition to the one the test fails to see exit. At least once I have seen those processes be children of the master process we are waiting for, but ones the master never logged:
RuntimeError: Gave up waiting for 2660 to exit!
2020-08-12 12:01:39.628 31144 INFO eventlet.wsgi.server [-] (2728) wsgi exited, is_accepting=False
2020-08-12 12:01:39.629 31144 INFO glance.common.wsgi [-] Child 2728 exiting normally
2020-08-12 12:01:39.637 31143 INFO eventlet.wsgi.server [-] (2729) wsgi exited, is_accepting=False
2020-08-12 12:01:39.637 31143 INFO glance.common.wsgi [-] Child 2729 exiting normally
% ps -eo pid,ppid,cmd | grep cmd.api
2660     1 /home/dan/glance/.tox/functional/bin/python3 -m glance.cmd.api --config-file /tmp/tmpapio5hw1/etc/api.conf
2728  2660 /home/dan/glance/.tox/functional/bin/python3 -m glance.cmd.api --config-file /tmp/tmpapio5hw1/etc/api.conf
2729  2660 /home/dan/glance/.tox/functional/bin/python3 -m glance.cmd.api --config-file /tmp/tmpapio5hw1/etc/api.conf
If I look in the test log, I don't see either of those PIDs logged, but they are children of the process we spawned and are waiting for:
% .tox/functional-py37/bin/stestr failing | grep 2728
% .tox/functional-py37/bin/stestr failing | grep 2729
%
However, if I look at the log of the still-running API master, I can see that after the test ended, the missing PIDs are accounted for:
% tail -n4 /proc/2660/fd/5
2020-08-12 12:19:25.853 2660 INFO glance.common.wsgi [-] Starting 2 workers
2020-08-12 12:19:25.857 2660 INFO glance.common.wsgi [-] Started child 2750
2020-08-12 12:19:25.863 2660 INFO glance.common.wsgi [-] Started child 2751
2020-08-12 12:19:25.887 2751 INFO eventlet.wsgi.server [-] (2751) wsgi starting up on http://127.0.0.1:58509
If I try to curl against that master, it dies immediately:
% curl http://127.0.0.1:58509
curl: (56) Recv failure: Connection reset by peer
% ps -eo pid,ppid,cmd | grep cmd.api
%
So it seems like something is wedged in that master process: it ignores the SIGTERM that the base infrastructure sends right at the end, and potentially it also missed one or more SIGHUPs from the test itself. What is left is stuck in such a way that it won't continue to work, which means (to me) t...