test_reload() functional test causes hang and jobs TIMED_OUT

Bug #1891190 reported by Dan Smith
Affects: Glance
Status: Fix Released
Importance: Undecided
Assigned to: Dan Smith
Milestone: (none)

Bug Description

The glance.tests.functional.test_reload.TestReload.test_reload() test has been causing spurious deadlocks in functional test jobs, resulting in TIMED_OUT job statuses when the global timeout expires. This can be reproduced locally given enough runs, and Zuul runs the jobs often enough to hit it fairly regularly.

I have tracked this down to the test_reload() test. When I reproduce the hang locally, I find the test stuck in an infinite waitpid() on the API master process that the FunctionalTest base class started for it. The test tracks child PIDs of the master as it initiates several SIGHUP operations. Upon exit, the FunctionalTest.cleanup() routine runs and ends up waitpid()ing on the master process forever. A process listing shows all the other stestr workers in Z (zombie) state waiting for the final worker to complete; the final worker, stuck on waitpid(), still has the master process and both worker processes running. Upon killing the master, stestr frees up, reports the status from the test, and exits normally.

The stack trace of the hung test process, captured after manually signaling the master process it was waiting on, is:

Traceback (most recent call last):
  File "/usr/lib64/python3.7/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/usr/lib64/python3.7/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/home/dan/glance/.tox/functional/lib/python3.7/site-packages/stestr/subunit_runner/run.py", line 93, in <module>
    main()
  File "/home/dan/glance/.tox/functional/lib/python3.7/site-packages/stestr/subunit_runner/run.py", line 89, in main
    testRunner=partial(runner, stdout=sys.stdout))
  File "/home/dan/glance/.tox/functional/lib/python3.7/site-packages/stestr/subunit_runner/program.py", line 185, in __init__
    self.runTests()
  File "/home/dan/glance/.tox/functional/lib/python3.7/site-packages/stestr/subunit_runner/program.py", line 226, in runTests
    self.result = testRunner.run(self.test)
  File "/home/dan/glance/.tox/functional/lib/python3.7/site-packages/stestr/subunit_runner/run.py", line 52, in run
    test(result)
  File "/usr/lib64/python3.7/unittest/suite.py", line 84, in __call__
    return self.run(*args, **kwds)
  File "/usr/lib64/python3.7/unittest/suite.py", line 122, in run
    test(result)
  File "/usr/lib64/python3.7/unittest/suite.py", line 84, in __call__
    return self.run(*args, **kwds)
  File "/usr/lib64/python3.7/unittest/suite.py", line 122, in run
    test(result)
  File "/usr/lib64/python3.7/unittest/suite.py", line 84, in __call__
    return self.run(*args, **kwds)
  File "/usr/lib64/python3.7/unittest/suite.py", line 122, in run
    test(result)
  File "/home/dan/glance/.tox/functional/lib/python3.7/site-packages/unittest2/case.py", line 673, in __call__
    return self.run(*args, **kwds)
  File "/home/dan/glance/.tox/functional/lib/python3.7/site-packages/testtools/testcase.py", line 675, in run
    return run_test.run(result)
  File "/home/dan/glance/.tox/functional/lib/python3.7/site-packages/testtools/runtest.py", line 80, in run
    return self._run_one(actual_result)
  File "/home/dan/glance/.tox/functional/lib/python3.7/site-packages/testtools/runtest.py", line 94, in _run_one
    return self._run_prepared_result(ExtendedToOriginalDecorator(result))
  File "/home/dan/glance/.tox/functional/lib/python3.7/site-packages/testtools/runtest.py", line 119, in _run_prepared_result
    raise e
  File "/home/dan/glance/.tox/functional/lib/python3.7/site-packages/testtools/runtest.py", line 191, in _run_user
    return fn(*args, **kwargs)
  File "/home/dan/glance/glance/tests/functional/__init__.py", line 881, in cleanup
    s.stop()
  File "/home/dan/glance/glance/tests/functional/__init__.py", line 293, in stop
    rc = test_utils.wait_for_fork(self.process_pid, raise_error=False)
  File "/home/dan/glance/glance/tests/utils.py", line 294, in wait_for_fork
    (pid, rc) = os.waitpid(pid, 0)
  File "/home/dan/glance/.tox/functional/lib/python3.7/site-packages/eventlet/green/os.py", line 96, in waitpid
    greenthread.sleep(0.01)
  File "/home/dan/glance/.tox/functional/lib/python3.7/site-packages/eventlet/greenthread.py", line 36, in sleep
    hub.switch()
  File "/home/dan/glance/.tox/functional/lib/python3.7/site-packages/eventlet/hubs/hub.py", line 298, in switch
    return self.greenlet.switch()
  File "/home/dan/glance/.tox/functional/lib/python3.7/site-packages/eventlet/hubs/hub.py", line 350, in run
    self.wait(sleep_time)
  File "/home/dan/glance/.tox/functional/lib/python3.7/site-packages/eventlet/hubs/poll.py", line 80, in wait
    presult = self.do_poll(seconds)
  File "/home/dan/glance/.tox/functional/lib/python3.7/site-packages/eventlet/hubs/epolls.py", line 31, in do_poll
    return self.poll.poll(seconds)
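
For reference, the bottom frames correspond to a helper shaped roughly like the sketch below (a minimal sketch only; the real wait_for_fork() in glance/tests/utils.py does more). The key point is that os.waitpid(pid, 0) is a blocking wait, and eventlet's monkey-patched os module turns it into a greenthread poll/sleep loop that never gives up if the child refuses to exit:

import os

def wait_for_fork(pid, raise_error=True):
    # Sketch: block until the child exits. Under eventlet, os.waitpid()
    # polls with WNOHANG and green-sleeps in between, so if the child
    # never exits this call never returns and cleanup() hangs forever.
    (pid, rc) = os.waitpid(pid, 0)
    if raise_error and rc != 0:
        raise RuntimeError('The process terminated with error')
    return rc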

Revision history for this message
Dan Smith (danms) wrote :

With some more debugging, I think what happens when the hang occurs is that test_reload() fails, and we then block during cleanup waiting for a process to exit that never will.

Using the proposed fix (https://review.opendev.org/#/c/745763/), which avoids the infinite deadlock, plus some local modifications to remove the SIGKILL, I can make the test cleanup fail if the waitpid() never returns, and thus see what the test itself was about to report. This happens about 5% of the time. When I do that, I see failures like this from test_reload():

  File "/home/dan/glance/.tox/functional/lib/python3.7/site-packages/unittest2/case.py", line 690, in fail
    raise self.failureException(msg)
AssertionError: Worker change timeout

and

  File "/home/dan/glance/.tox/functional/lib/python3.7/site-packages/unittest2/case.py", line 690, in fail
    raise self.failureException(msg)
AssertionError: http bind_host timeout

After that happens, I end up with one or more leaked cmd.api processes, in addition to the one the test fails to see exit. At least once I have seen those processes be children of the master process we are waiting for, but never logged by the master during the test:

RuntimeError: Gave up waiting for 2660 to exit!
2020-08-12 12:01:39.628 31144 INFO eventlet.wsgi.server [-] (2728) wsgi exited, is_accepting=False
2020-08-12 12:01:39.629 31144 INFO glance.common.wsgi [-] Child 2728 exiting normally
2020-08-12 12:01:39.637 31143 INFO eventlet.wsgi.server [-] (2729) wsgi exited, is_accepting=False
2020-08-12 12:01:39.637 31143 INFO glance.common.wsgi [-] Child 2729 exiting normally

% ps -eo pid,ppid,cmd | grep cmd.api
2660 1 /home/dan/glance/.tox/functional/bin/python3 -m glance.cmd.api --config-file /tmp/tmpapio5hw1/etc/api.conf
2728 2660 /home/dan/glance/.tox/functional/bin/python3 -m glance.cmd.api --config-file /tmp/tmpapio5hw1/etc/api.conf
2729 2660 /home/dan/glance/.tox/functional/bin/python3 -m glance.cmd.api --config-file /tmp/tmpapio5hw1/etc/api.conf

If I look in the test log, I don't see either of those PIDs logged, but they are children of the process we spawned and are waiting for:

% .tox/functional-py37/bin/stestr failing | grep 2728
% .tox/functional-py37/bin/stestr failing | grep 3729
%

However, if I look at the log of the still-running API master, I can see that after the test ended, the missing PIDs are accounted for:

% tail -n4 /proc/2660/fd/5
2020-08-12 12:19:25.853 2660 INFO glance.common.wsgi [-] Starting 2 workers
2020-08-12 12:19:25.857 2660 INFO glance.common.wsgi [-] Started child 2750
2020-08-12 12:19:25.863 2660 INFO glance.common.wsgi [-] Started child 2751
2020-08-12 12:19:25.887 2751 INFO eventlet.wsgi.server [-] (2751) wsgi starting up on http://127.0.0.1:58509

If I try to curl against that master, it dies immediately:

% curl http://127.0.0.1:58509
curl: (56) Recv failure: Connection reset by peer
% ps -eo pid,ppid,cmd | grep cmd.api
%

So it seems like something is wedged in that master process, which ignores the SIGTERM that the test base infrastructure sends right at the end. Potentially it also missed one or more SIGHUPs from the test itself. What is left is stuck in such a way that it won't continue to work, which means (to me) t...


Revision history for this message
Dan Smith (danms) wrote :

At least once I have seen the master process remain with no children, where the last line in the log file (both in the test report and in the post-failure examination) looks like this:

2020-08-12 12:29:30.493 4049 INFO eventlet.wsgi.server [-] 127.0.0.1 - - [12/Aug/2020 12:29:30] "GET / HTTP/1.1" 300 1104 0.004106

In that case, curl just hangs after connecting but never receives a response. A netstat shows the socket is open for listening and has a backlog queue, which the process is clearly holding but will not answer:

% sudo netstat -nlpt | grep python
tcp 2 0 127.0.0.1:45641 0.0.0.0:* LISTEN 4026/python3

An strace shows it is blocked in select() without waiting on any I/O, which I assume is the eventlet main loop with nothing to do, but also with no sockets to monitor:

% sudo strace -p 4026
strace: Process 4026 attached
select(0, NULL, NULL, NULL, {tv_sec=42, tv_usec=688978}) = 0 (Timeout)
select(0, NULL, NULL, NULL, {tv_sec=60, tv_usec=0}) = 0 (Timeout)
select(0, NULL, NULL, NULL, {tv_sec=60, tv_usec=0}
... and so on

Revision history for this message
Dan Smith (danms) wrote :

By the way, in the scenario from comment #2, SIGTERM won't even interrupt it:

select(0, NULL, NULL, NULL, {tv_sec=60, tv_usec=0}) = ? ERESTARTNOHAND (To be restarted if no handler)
--- SIGTERM {si_signo=SIGTERM, si_code=SI_USER, si_pid=5485, si_uid=0} ---
select(0, NULL, NULL, NULL, {tv_sec=32, tv_usec=459938}

That explains why it ignored the initial SIGTERM sent from the test base.
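
For what it's worth, that strace pattern is what you would expect if SIGTERM's disposition were set to "ignore": the signal is delivered, but since there is no handler installed the kernel simply restarts the interrupted select() (ERESTARTNOHAND) with the remaining timeout. A trivial standalone illustration of the behavior (not glance code):

import os
import select
import signal

# With SIGTERM ignored, sending it interrupts select() at the kernel level,
# but the syscall is restarted and the process never dies, matching the
# strace output above.
signal.signal(signal.SIGTERM, signal.SIG_IGN)
print('kill -TERM %d from another shell; nothing will happen' % os.getpid())
while True:
    select.select([], [], [], 60)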

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to glance (master)

Reviewed: https://review.opendev.org/745763
Committed: https://git.openstack.org/cgit/openstack/glance/commit/?id=8e8072c16a63ee652b6cc383f6ad81808b319c2f
Submitter: Zuul
Branch: master

commit 8e8072c16a63ee652b6cc383f6ad81808b319c2f
Author: Dan Smith <email address hidden>
Date: Tue Aug 11 16:45:38 2020 -0700

    Make wait_for_fork() more robust against infinite deadlock

    There is something about the process handling in the functional test base
    that opens the scenario where we might signal our API master child,
    waitpid() on it and never return (see bug 1891190). Specifically the
    test_reload() case, where the API master is being sent SIGHUP various times,
    which may cause it to temporarily mask out SIGTERM or something, which
    means it ignores our request. Either way, a deadlock in the tests can
    occur when the cleanup() routine goes to waitpid() on that master,
    forever.

    This change makes our wait_for_fork() helper call waitpid() with WNOHANG
    and progressively ramp up the aggression on asking it to stop. First, it
    waits for the original signal to be honored, then it starts sending SIGTERM,
    followed by SIGKILLs until the overall timeout expires, after which it
    fails the test. The latter should only ever happen if something gets
    wedged in uninterruptible sleep.

    Related-Bug: #1891190
    Change-Id: Ie0e91e2ee86f33bcdb63eed74f2d9948108acfd4
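
Roughly, the escalation described in that commit looks like the sketch below (a minimal sketch, assuming the timeout budget is split into thirds; the real helper in glance/tests/utils.py differs in the details):

import os
import signal
import time

def wait_for_fork(pid, raise_error=True, timeout=30):
    # Sketch of the non-blocking escalation: poll with WNOHANG, and as the
    # deadline approaches send SIGTERM, then SIGKILL, before finally
    # giving up and failing the test.
    deadline = time.time() + timeout
    while True:
        (waited, status) = os.waitpid(pid, os.WNOHANG)
        if waited == pid:
            if raise_error and status != 0:
                raise RuntimeError('The process terminated with error')
            return status
        remaining = deadline - time.time()
        if remaining <= 0:
            raise RuntimeError('Gave up waiting for %i to exit!' % pid)
        elif remaining < timeout / 3:
            os.kill(pid, signal.SIGKILL)   # last resort
        elif remaining < 2 * timeout / 3:
            os.kill(pid, signal.SIGTERM)   # be more insistent
        time.sleep(0.1)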

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to glance (master)

Related fix proposed to branch: master
Review: https://review.opendev.org/747991

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to glance (master)

Reviewed: https://review.opendev.org/747991
Committed: https://git.openstack.org/cgit/openstack/glance/commit/?id=76667323d7203a87447269e4f834f7d9339244fd
Submitter: Zuul
Branch: master

commit 76667323d7203a87447269e4f834f7d9339244fd
Author: Dan Smith <email address hidden>
Date: Tue Aug 25 10:29:58 2020 -0700

    Disable wait_for_fork() kill aggression if expect_exit=True

    In some cases, we spawn a process that we expect to run to completion
    and exit. In those cases, disable the force-kill logic in wait_for_fork()
    to avoid killing those processes too early.

    Related-Bug: #1891190
    Change-Id: Ia78d101ae3702d9f761ca8f04b4b10af4a6d84fb
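
In terms of the previous sketch, this second change just gates the escalation on a new flag; again a sketch only, with the expect_exit name taken from the commit subject:

import os
import signal
import time

def wait_for_fork(pid, raise_error=True, timeout=30, expect_exit=False):
    # Same polling loop as before, but skip the SIGTERM/SIGKILL escalation
    # when the caller expects the child to run to completion on its own.
    deadline = time.time() + timeout
    while True:
        (waited, status) = os.waitpid(pid, os.WNOHANG)
        if waited == pid:
            if raise_error and status != 0:
                raise RuntimeError('The process terminated with error')
            return status
        remaining = deadline - time.time()
        if remaining <= 0:
            raise RuntimeError('Gave up waiting for %i to exit!' % pid)
        if not expect_exit:
            if remaining < timeout / 3:
                os.kill(pid, signal.SIGKILL)
            elif remaining < 2 * timeout / 3:
                os.kill(pid, signal.SIGTERM)
        time.sleep(0.1)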

Changed in glance:
status: New → Fix Released
assignee: nobody → Dan Smith (danms)