OpenStack Compute (nova)

Comment 10 for bug 1946339

OpenStack Infra (hudson-openstack) wrote on 2021-11-03: Fix merged to nova (master)

Reviewed: https://review.opendev.org/c/openstack/nova/+/814036
Committed: https://opendev.org/openstack/nova/commit/61fc81a6761d34afdfc4a6d1c4c953802fd8a179
Submitter: "Zuul (22348)"
Branch: master

commit 61fc81a6761d34afdfc4a6d1c4c953802fd8a179
Author: Balazs Gibizer <email address hidden>
Date: Thu Oct 14 18:09:18 2021 +0200

Prevent leaked eventlets to send notifications

    In our functional tests we run nova services as eventlets. Those
    services can also spawn their own eventlets for RPC or other parallel
    processing. The test case executor only sees and tracks the main
    eventlet where the code of the test case is running. When that
    eventlet finishes, the test executor considers the test case finished
    regardless of the other spawned eventlets. This can lead to leaked
    eventlets that run in parallel with later test cases.

    One way this can cause trouble is via the global variables in the
    nova.rpc module. Those globals are re-initialized for each test case, so
    they do not directly leak information between test cases. However, if
    a late eventlet calls nova.rpc.get_versioned_notifier() it will get a
    totally usable FakeVersionedNotifier object regardless of which test
    case this notifier belongs to or which test case the eventlet belongs
    to. This way the late eventlet can send a notification to the currently
    running test case and therefore make it fail.
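
The effect of the re-initialized global can be illustrated with a small stand-alone sketch. Everything here is hypothetical (`FakeNotifier`, `init_for_test`, `get_notifier` are illustrative names, not the nova.rpc implementation); it only assumes what the text states: a module-level notifier is replaced per test case and looked up at call time.

```python
class FakeNotifier:
    """Illustrative stand-in for a per-test fake notifier."""

    def __init__(self, owner):
        self.owner = owner
        self.received = []

    def error(self, payload):
        self.received.append(payload)


# Module-level global, re-initialized for every test case.
_NOTIFIER = None


def init_for_test(test_id):
    global _NOTIFIER
    _NOTIFIER = FakeNotifier(test_id)
    return _NOTIFIER


def get_notifier():
    # Looked up at call time, so the caller gets whatever is current.
    return _NOTIFIER


def late_work():
    # Code started by test_a that only runs after test_a has finished.
    get_notifier().error({"sent_by": "test_a"})


notifier_a = init_for_test("test_a")
notifier_b = init_for_test("test_b")  # test_a is over, test_b is running
late_work()
```

No state leaks through the global itself, yet the late caller's notification lands in the notifier of `test_b`, which never asked for it.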

    The current case we saw is the following:

    1) The test case
       nova.tests.functional.test_servers.ServersTestV219.test_description_errors
       creates a server but does not wait for it to reach a terminal state
       (ACTIVE / ERROR). This test case finishes quickly but leaks running
       eventlets in the background waiting for some RPC call to return.
    2) As the test case has finished, the cleanup code deletes the test case
       specific setup, including the DB.
    3) The test executor moves forward and starts running another test case.
    4) 60 seconds later the leaked eventlet times out waiting for the RPC
       call to return and tries to carry on, but fails as the DB is already
       gone. Then it tries to report this as an error notification. It calls
       nova.rpc.get_versioned_notifier() and gets a fresh notifier that is
       connected to the currently running test case. Then it emits the error
       notification there.
    5) The currently running test case also waits for an error notification
       to be triggered by the currently running test code. But it gets the
       notification from the late eventlet first. As the content of the
       notification does not match the expectations, the currently
       running test case fails. The late eventlet also prints a lot of
       errors about the DB being gone, making troubleshooting pretty hard.

    This patch proposes a way to fix this by marking each eventlet at spawn
    time with the id of the test case that directly or indirectly started
    it.

    Then, when the NotificationFixture gets a notification, it compares the
    test case id stored in the calling eventlet with the id of the test case
    that initialized the NotificationFixture. If the two ids do not match,
    the fixture ignores the notification and raises an exception to the
    calling eventlet to make it terminate.
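
The marking and checking idea can be sketched as follows. This is not the code of the patch: it uses stdlib threads and `threading.local` in place of eventlets, and `spawn`, `current_test_id`, `StaleNotification`, `send` and this simplified `NotificationFixture` are hypothetical names chosen for illustration. The sketch catches the exception only to record the outcome; in the scheme described above it is raised to the caller so that the caller terminates.

```python
import threading

_local = threading.local()


def current_test_id():
    return getattr(_local, "test_id", None)


def spawn(func, *args):
    # Capture the spawner's test id and stamp it on the new worker, so
    # indirectly spawned workers inherit the same id.
    owner = current_test_id()

    def marked():
        _local.test_id = owner
        func(*args)

    worker = threading.Thread(target=marked, daemon=True)
    worker.start()
    return worker


class StaleNotification(Exception):
    pass


class NotificationFixture:
    def __init__(self, test_id):
        self.test_id = test_id
        self.notifications = []

    def notify(self, payload):
        # Ignore notifications sent by workers of another test case.
        if current_test_id() != self.test_id:
            raise StaleNotification(payload)
        self.notifications.append(payload)


results = {}
current = {}  # the running test's fixture, reachable like a global
release = threading.Event()


def send(name, payload, wait_for=None):
    if wait_for is not None:
        wait_for.wait()
    try:
        current["fixture"].notify(payload)
        results[name] = "delivered"
    except StaleNotification:
        results[name] = "rejected"


# "test_a" runs and leaks a worker that is still waiting.
_local.test_id = "test_a"
current["fixture"] = NotificationFixture("test_a")
leaked = spawn(send, "leaked", "error from test_a", release)

# "test_b" runs; its own worker is accepted by its fixture.
_local.test_id = "test_b"
fixture_b = NotificationFixture("test_b")
current["fixture"] = fixture_b
spawn(send, "own", "error from test_b").join(timeout=5)

# The leaked worker wakes up during test_b and is turned away.
release.set()
leaked.join(timeout=5)
```

Because the id is copied at spawn time rather than looked up when the notification is sent, the leaked worker keeps the id of the test that started it, and the fixture of the later test can tell it apart from its own workers.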

    Change-Id: I012dcf63306bae624dc4f66aae6c6d96a20d4327
    Closes-Bug: #1946339
