Unit test: nova.tests.integrated.test_multiprocess_api.MultiprocessWSGITest.test_terminate_sigterm timing out in gate

Bug #1357578 reported by Joe Gordon on 2014-08-15
34
This bug affects 5 people
Affects Status Importance Assigned to Milestone
OpenStack Compute (nova)
Critical
Sean Dague
Icehouse
Critical
Ihar Hrachyshka

Bug Description

http://logs.openstack.org/62/114062/3/gate/gate-nova-python27/2536ea4/console.html

 FAIL: nova.tests.integrated.test_multiprocess_api.MultiprocessWSGITest.test_terminate_sigterm

014-08-15 13:46:09.155 | INFO [nova.tests.integrated.api.client] Doing GET on /v2/openstack//flavors/detail
2014-08-15 13:46:09.155 | INFO [nova.tests.integrated.test_multiprocess_api] sent launcher_process pid: 10564 signal: 15
2014-08-15 13:46:09.155 | INFO [nova.tests.integrated.test_multiprocess_api] waiting on process 10566 to exit
2014-08-15 13:46:09.155 | INFO [nova.wsgi] Stopping WSGI server.
2014-08-15 13:46:09.155 | }}}
2014-08-15 13:46:09.156 |
2014-08-15 13:46:09.156 | Traceback (most recent call last):
2014-08-15 13:46:09.156 | File "nova/tests/integrated/test_multiprocess_api.py", line 206, in test_terminate_sigterm
2014-08-15 13:46:09.156 | self._terminate_with_signal(signal.SIGTERM)
2014-08-15 13:46:09.156 | File "nova/tests/integrated/test_multiprocess_api.py", line 194, in _terminate_with_signal
2014-08-15 13:46:09.156 | self.wait_on_process_until_end(pid)
2014-08-15 13:46:09.156 | File "nova/tests/integrated/test_multiprocess_api.py", line 146, in wait_on_process_until_end
2014-08-15 13:46:09.157 | time.sleep(0.1)
2014-08-15 13:46:09.157 | File "/home/jenkins/workspace/gate-nova-python27/.tox/py27/local/lib/python2.7/site-packages/eventlet/greenthread.py", line 31, in sleep
2014-08-15 13:46:09.157 | hub.switch()
2014-08-15 13:46:09.157 | File "/home/jenkins/workspace/gate-nova-python27/.tox/py27/local/lib/python2.7/site-packages/eventlet/hubs/hub.py", line 287, in switch
2014-08-15 13:46:09.157 | return self.greenlet.switch()
2014-08-15 13:46:09.157 | File "/home/jenkins/workspace/gate-nova-python27/.tox/py27/local/lib/python2.7/site-packages/eventlet/hubs/hub.py", line 339, in run
2014-08-15 13:46:09.158 | self.wait(sleep_time)
2014-08-15 13:46:09.158 | File "/home/jenkins/workspace/gate-nova-python27/.tox/py27/local/lib/python2.7/site-packages/eventlet/hubs/poll.py", line 82, in wait
2014-08-15 13:46:09.158 | sleep(seconds)
2014-08-15 13:46:09.158 | File "/home/jenkins/workspace/gate-nova-python27/.tox/py27/local/lib/python2.7/site-packages/fixtures/_fixtures/timeout.py", line 52, in signal_handler
2014-08-15 13:46:09.158 | raise TimeoutException()
2014-08-15 13:46:09.158 | TimeoutException

Tracy Jones (tjones-i) on 2014-08-27
tags: added: testing
Joe Gordon (jogo) wrote :
Changed in nova:
status: New → Confirmed
importance: Undecided → High
Matt Riedemann (mriedem) wrote :

Bug 1364494 appears to be related.

Matt Riedemann (mriedem) wrote :

Maybe some clues here, i.e. there might be an eventlet.Timeout that's leaking:

http://lists.openstack.org/pipermail/openstack-dev/2014-September/045272.html

Jay Pipes (jaypipes) wrote :

Matt, I think the eventlet.Timeout thing is a red herring. fixtures.Timeout uses the signal module from the standard Python library to interrupt the process after timeout_secs. It does not use eventlet.Timeout. That said, there are two places I found that are incorrectly using eventlet.timeout.Timeout() without a context manager or calling close():

jaypipes@cranky:~/repos/openstack/nova/nova$ ack-grep "Timeout\("
tests/virt/libvirt/test_driver.py
8925: raise eventlet.timeout.Timeout()

tests/virt/hyperv/test_vmops.py
196: mock_with_timeout.side_effect = etimeout.Timeout()

Change abandoned by Jay Pipes (<email address hidden>) on branch: master
Review: https://review.openstack.org/120901
Reason: going to split this into log-only and other stuff.

Joe Gordon (jogo) wrote :

marked as critical because seeing this in the gate, 7 hits in 24 hours

Changed in nova:
importance: High → Critical

Reviewed: https://review.openstack.org/120909
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=608630b0b27dd13847b1e149541c2e1799500421
Submitter: Jenkins
Branch: master

commit 608630b0b27dd13847b1e149541c2e1799500421
Author: Jay Pipes <email address hidden>
Date: Thu Sep 11 16:59:28 2014 -0400

    Adds LOG statements in multiprocess API test

    Renames _terminate_workers() to _wait_for_workers_to_end() to accurately
    match what is done in that method. Adds a bunch of LOG.info() and
    LOG.warn() calls in the various methods in the test to get better
    debugging information when bug # 1357578 rears its ugly head.

    Change-Id: I6698b7c71ec651c812138c8dd93b1e1b33ee1178
    Related-bug: #1357578

Jay,

would adding a loop with the following help?
os.waitpid(-1, os.WNOHANG)

https://gist.github.com/matthewstory/4547282
https://github.com/kanaka/noVNC/issues/47

thanks,
dims

Brent Eagles (beagles) wrote :

I'm afraid I don't quite understand the train of thought implied by the discussion about what the actual *issue* is. On the face of it, it simply looks like the test took too long and the stack trace is "interesting" because we just happened to be in the eventlet trampoline code. This doesn't seem too surprising, so is it suspected that there is a worker hung somewhere or some other obvious thing that shouldn't be happening?

Brent Eagles (beagles) wrote :

Okay, after reviewing the commit messages in the patch(es) and the
comments in the code, it seems pretty obvious. It does appear that the
gist of the bug is simply "this times out but we don't know why at the
moment". Unless the timeout is set wicked short, considering the test a
valid timeout scenario seems pretty improbable. Good test to find
something so weird.

So are these a few of the choices?

- Workers don't shutdown properly when they get the signal.

- Somehow the signal isn't getting to a worker in the first place. At
  the moment, this seems possible only if something were to happen to
  the ProcessLauncher so that it would "forget" about the child/worker
  PID, i.e. leaking a child process. One scenario would be some kind of
  oddness where a worker process caused os.waitpid() to return some kind
  of status for itself in _wait_child(). The logging Jay Pipes added
  will hilight any weirdness like that.

- By some very strange twist, a previous test case didn't cleanup
  properly and there is a timeout hanging around to cause trouble? Is
  that even possible?

Joe Gordon (jogo) on 2014-09-18
Changed in nova:
milestone: none → juno-rc1
Sean Dague (sdague) wrote :

I managed to get a reproduce by creating a slow vm: ubuntu 14.04 in vbox, 1 G ram, 2 vcpu, set to 50% of cpu performance.

tox -epy27 -- --until-fail multiprocess

On the 3rd time through I got the following:

running=OS_STDOUT_CAPTURE=${OS_STDOUT_CAPTURE:-1} \
OS_STDERR_CAPTURE=${OS_STDERR_CAPTURE:-1} \
OS_TEST_TIMEOUT=${OS_TEST_TIMEOUT:-160} \
${PYTHON:-python} -m subunit.run discover -t ./ ./nova/tests --load-list /tmp/tmp7YX5uh
running=OS_STDOUT_CAPTURE=${OS_STDOUT_CAPTURE:-1} \
OS_STDERR_CAPTURE=${OS_STDERR_CAPTURE:-1} \
OS_TEST_TIMEOUT=${OS_TEST_TIMEOUT:-160} \
${PYTHON:-python} -m subunit.run discover -t ./ ./nova/tests --load-list /tmp/tmpU5Qsw_
{0} nova.tests.integrated.test_multiprocess_api.MultiprocessWSGITest.test_killed_worker_recover [5.688803s] ... ok

Captured stderr:
~~~~~~~~~~~~~~~~
    /home/sdague/nova/.tox/py27/local/lib/python2.7/site-packages/sqlalchemy/sql/default_comparator.py:35: SAWarning: The IN-predicate on "instances.uuid" was invoked with an empty sequence. This results in a contradiction, which nonetheless can be expensive to evaluate. Consider alternative strategies for improved performance.
      return o[0](self, self.expr, op, *(other + o[1:]), **kwargs)

{0} nova.tests.integrated.test_multiprocess_api.MultiprocessWSGITestV3.test_killed_worker_recover [2.634592s] ... ok
{0} nova.tests.integrated.test_multiprocess_api.MultiprocessWSGITestV3.test_restart_sighup [1.565492s] ... ok
{0} nova.tests.integrated.test_multiprocess_api.MultiprocessWSGITestV3.test_terminate_sigterm [2.400319s] ... ok
{1} nova.tests.integrated.test_multiprocess_api.MultiprocessWSGITest.test_restart_sighup [160.043131s] ... FAILED
{1} nova.tests.integrated.test_multiprocess_api.MultiprocessWSGITest.test_terminate_sigkill [2.317150s] ... ok
{1} nova.tests.integrated.test_multiprocess_api.MultiprocessWSGITest.test_terminate_sigterm [2.274788s] ... ok
{1} nova.tests.integrated.test_multiprocess_api.MultiprocessWSGITestV3.test_terminate_sigkill [2.089225s] ... ok

and.... hang

So, testr is correctly killing the restart test when it times out. It is also correctly moving on to additional tests. However it is then in a hung state and can't finish once the tests are done.

Why the test timed out, I don't know. However the fact that testr is going crazy is an issue all by itself.

no longer affects: testrepository
Robert Collins (lifeless) wrote :

I've deleted the testr task: testr is (correctly) waiting for the test workers to finish (http://paste.openstack.org/show/112990/) - we can open a feature request to have a testr supervisor timeout facility, but its orthogonal to this nova issue.

Sean has grabbed stuff with gdb:
20727 - a worker we think - http://paste.openstack.org/show/112998/ - this appears to be waiting for a new request
20725 - parent of the worker, stuck in the poll hub - http://paste.openstack.org/show/113014/

Looking at the test code there are lots of issues that will make this test fragile:
 - it assumes 0 delay between /a/ worker being spawned and all of them being spawned

 - I was going to say that we don't set the multiprocessing workers 'daemon' flag - https://docs.python.org/2/library/multiprocessing.html and this will cause a parent process that exits to hang [same as with threads] unless the process object is killed out of band. OTOH if we want to guarantee all in-progress requests finish cleanly, this is appropriate - but we have no timeout protection on each request - but I observe that actually we don't even use multiprocessing - we've grown our own thing :(.

 - it uses tearDown which is not guaranteed to run if setUp has issues.

The hanging process is from test_restart_sighup.

This test doesn't seem to test what SIGHUP does - SIGHUP doesn't restart any children, and the test doesn't strace the processes to see if SIGHUP is actually propogated to children, so the call to _wait should be a 100% no-op, *except* that we want to know the processes aren't being killed - thats why we cross check the pids. But the use of _wait means we'll exit on the first check - not allowing any time for the service process to do the wrong thing. That is, it can falsely pass.

The service code or handling signals is racy: it sets sigcaught to the received signal but multiple signals may be delivered before any are processed. I don't have reason to believe that that is whats happening here.

This is a worry:
            except fixtures.TimeoutException:

we catch it, but we don't re-raise it. So we could 'pass' after being timed out, if it happens in tearDown. That seems wrong to me.

Robert Collins (lifeless) wrote :

Looking further, _start_child in service.py is leaky - interrupt that at the wrong time and we won't capture the child pid, which will lead to us not killing it, and that will result in a stuck process tree.

Robert Collins (lifeless) wrote :

Oh, for clarity; testr is hung here because the pipe it opened to the test process is still open. That means that this leaked worker process has the same stdout as the subunit.run process. Pushing up a simple fix which will stop this timing out in the same way (but obviously won't fix the underlying bug).

Fix proposed to branch: master
Review: https://review.openstack.org/122543

Changed in nova:
assignee: nobody → Robert Collins (lifeless)
status: Confirmed → In Progress

Fix proposed to branch: master
Review: https://review.openstack.org/122545

Reviewed: https://review.openstack.org/122545
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=4acf31fb3ac97fe84efb5833da58fa90c145d947
Submitter: Jenkins
Branch: master

commit 4acf31fb3ac97fe84efb5833da58fa90c145d947
Author: Robert Collins <email address hidden>
Date: Fri Sep 19 10:14:48 2014 +1200

    tests: kill worker pids as well on timeouts

    Workers that don't exit can also trigger a timeout, and we want our
    test environment to be super robust - kill them aggressively too.

    Also raise the timeout error so that we don't mask this when it
    happens.

    Change-Id: Iee606720f98e9a5e8a11f162eb885c0a00a2345e
    Partial-Bug: #1357578

Reviewed: https://review.openstack.org/122543
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=7d4a17f2ad72bf1ffa979f57a59bb2bd2727afe3
Submitter: Jenkins
Branch: master

commit 7d4a17f2ad72bf1ffa979f57a59bb2bd2727afe3
Author: Robert Collins <email address hidden>
Date: Fri Sep 19 10:07:52 2014 +1200

    Close standard fds in test child process

    The std fd's are shared with parent processes, which makes testr think
    that we still have test workers, even though this isn't a test
    process, its a live server.

    Change-Id: I516a6a9bd9a98abffeb182189038763ea94618f6
    Partial-Bug: #1357578

Michael Still (mikal) wrote :

I don't see any outstanding reviews for this bug. Is this correct or have I missed something?

Jay Pipes (jaypipes) wrote :

There are no outstanding reviews, no, but there are a still a dozen or more failures in the gate since the last patch from lifeless was merged. So, unfortunately, I don't believe that those two patches have resolved this issue entirely.

My patches were intended to help the diagnosis of the failure, not
prevent it: the issue is real - we're leaking processes that we fail
to shut down; my patches should have helped us get more logging and
diagnostics out when it happens.

Changed in nova:
assignee: Robert Collins (lifeless) → nobody
status: In Progress → Triaged

Related fix proposed to branch: master
Review: https://review.openstack.org/123427

Sean Dague (sdague) wrote :

So... my current theory is the following:

    def _get_workers(self):
        # NOTE(hartsocks): use of ps checks the process table for child pid
        # entries these processes may be ended but not reaped so ps may
        # show processes that are still being cleaned out of the table.
        f = os.popen('ps ax -o pid,ppid,command')
        # Skip ps header
        f.readline()

        processes = [tuple(int(p) for p in l.strip().split()[:2])
                     for l in f.readlines()]
        workers = [p for p, pp in processes if pp == self.pid]
        LOG.info('in _get_workers(), workers: %r for PPID %d',
                 workers, self.pid)
        return workers

Collects the subprocesses via ps.

However, when we loop on them, we assume that they were all sent the correct SIGTERM previously. And we os.kill(pid, 0), which doesn't actually send a kill signal.

So, if some earlier test failed on us, and it spawned a worker that we didn't expect, we could very easily stall polling the status of that process. And we'll never even try the rest of them based on the way the for loop works.

Proposed fix is to send a SIGTERM on every loop iteration.

Matt Riedemann (mriedem) wrote :

This article from Doug Hellmann might be useful:

http://pymotw.com/2/signal/

Related fix proposed to branch: master
Review: https://review.openstack.org/123467

Fix proposed to branch: master
Review: https://review.openstack.org/123521

Changed in nova:
assignee: nobody → Jay Pipes (jaypipes)
status: Triaged → In Progress
Nikola Đipanov (ndipanov) wrote :

Here's one more theory - Lifeless noticed correctly that "_start_child in service.py is leaky" but that really should not matter because child processes should die if the parent dies an unexpected death (from openstack/common/service.py):

This bit happens in the forked worker:

 def _child_process(self, service):
        self._child_process_handle_signal()

        # Reopen the eventlet hub to make sure we don't share an epoll
        # fd with parent and/or siblings, which would be bad
        eventlet.hubs.use_hub()

        # Close write to ensure only parent has it open
        os.close(self.writepipe)
        # Create greenthread to watch for parent to close pipe
        eventlet.spawn_n(self._pipe_watcher)

and the _pipe_watcher method:

   def _pipe_watcher(self):
        # This will block until the write end is closed when the parent
        # dies unexpectedly
        self.readpipe.read()

        LOG.info(_LI('Parent process has died unexpectedly, exiting'))

        sys.exit(1)

interesting bit is that it will only raise a SystemExit (not sure why it doesn't do the os._exit there) in a fork of a testr process that might actually be handling it in some way that can cause unexpected behavior?

Reviewed: https://review.openstack.org/123467
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=ff259324abf66f436b5098a6eaf8d781ac57140d
Submitter: Jenkins
Branch: master

commit ff259324abf66f436b5098a6eaf8d781ac57140d
Author: Sean Dague <email address hidden>
Date: Tue Sep 23 11:02:55 2014 -0400

    Add further debug logging for multiprocess test

    In order to track down where we are stalling add some additional
    informational logging into the cleanup path.

    Change-Id: Ia12b1049b3a63851bde69cb5f7c3fe31b0d12c41
    Related-Bug: #1357578

Nikola Đipanov (ndipanov) wrote :

Further to comment #30 - actually after looking at it more closely it seems that we do check for SystemExit in _spawn which should account for the _pipe_watching green thread killing a worker, so it's probably not that.

Nikola Đipanov (ndipanov) wrote :

So one more thing I noticed while looking at this mess. The 'launcher' worker we spawn is forked off of a prcess that ran the whole setUp for the integrated_helpers._IntegratedTestBase that starts a bunch of services in the same process (meaning a bunch of greentrheads). I am not sure how they play with fork(), but usually best practice is to re-initiate the eventlet hub after fork.

I can imagine that we may be timing out because of a deadlock especially in the SIGINT case since the 'launcher' process needs to get to calling os.kill(SIGINT) on it's children to kill them.

How about we attempt to re-initialize eventlet in the 'launcher' process?

Fix proposed to branch: master
Review: https://review.openstack.org/123667

Changed in nova:
assignee: Jay Pipes (jaypipes) → Nikola Đipanov (ndipanov)

Fix proposed to branch: master
Review: https://review.openstack.org/123742

Changed in nova:
assignee: Nikola Đipanov (ndipanov) → Sean Dague (sdague)

Change abandoned by Sean Dague (<email address hidden>) on branch: master
Review: https://review.openstack.org/123427

Change abandoned by Sean Dague (<email address hidden>) on branch: master
Review: https://review.openstack.org/123426

Reviewed: https://review.openstack.org/123742
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=f3c99ba1efb1c039ec47dfdf29781f2e05f32b81
Submitter: Jenkins
Branch: master

commit f3c99ba1efb1c039ec47dfdf29781f2e05f32b81
Author: Sean Dague <email address hidden>
Date: Wed Sep 24 10:18:48 2014 -0400

    remove test_multiprocess_api

    This test suite should be removed for the following reasons.

    1) it's not a unit test

    2) it's failing randomly, a lot, and depends on timing of the host for
    process start and signal handling which are provably not predicable on
    slower machines

    3) it's triggering testr bugs in the ways that it fails which means we
    don't actually get any subunit in the fail scenario, which makes it
    actually impossible to move forward on fixes.

    Change-Id: Ieb7c02115c12cd9f6c9fa67691a643aed7632784
    Closes-Bug: #1357578

Changed in nova:
status: In Progress → Fix Committed

Change abandoned by Jay Pipes (<email address hidden>) on branch: master
Review: https://review.openstack.org/123521
Reason:
This test has been removed now.

Change abandoned by Nikola Dipanov (<email address hidden>) on branch: master
Review: https://review.openstack.org/123667
Reason: We decided to rip the code out instead of fixing it! :)

Reviewed: https://review.openstack.org/123976
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=4e6371b82a68c265c010fb2a2edb7235319d36b7
Submitter: Jenkins
Branch: stable/icehouse

commit 4e6371b82a68c265c010fb2a2edb7235319d36b7
Author: Sean Dague <email address hidden>
Date: Wed Sep 24 10:18:48 2014 -0400

    remove test_multiprocess_api

    This test suite should be removed for the following reasons.

    1) it's not a unit test

    2) it's failing randomly, a lot, and depends on timing of the host for
    process start and signal handling which are provably not predicable on
    slower machines

    3) it's triggering testr bugs in the ways that it fails which means we
    don't actually get any subunit in the fail scenario, which makes it
    actually impossible to move forward on fixes.

    Conflicts:
     nova/tests/integrated/test_multiprocess_api.py

    Change-Id: Ieb7c02115c12cd9f6c9fa67691a643aed7632784
    Closes-Bug: #1357578
    (cherry picked from commit f3c99ba1efb1c039ec47dfdf29781f2e05f32b81)

tags: added: in-stable-icehouse
Thierry Carrez (ttx) on 2014-10-01
Changed in nova:
status: Fix Committed → Fix Released
Thierry Carrez (ttx) on 2014-10-16
Changed in nova:
milestone: juno-rc1 → 2014.2
To post a comment you must log in.
This report contains Public information  Edit
Everyone can see this information.

Other bug subscribers