Weird functional test failures hitting neutron API in unrelated resize flows since 8/5

Bug #1839515 reported by Matt Riedemann
This bug affects 1 person
Affects: OpenStack Compute (nova)
Status: Fix Released
Importance: High
Assigned to: Balazs Gibizer

Bug Description

Noticed here:

https://logs.opendev.org/32/634832/43/check/nova-tox-functional-py36/d4f3be5/testr_results.html.gz

With this test:

nova.tests.functional.notification_sample_tests.test_service.TestServiceUpdateNotificationSampleLatest.test_service_disabled

That's a simple test which disables a service and then asserts there is a service.update notification, but there is another notification happening as well:

Traceback (most recent call last):
  File "/home/zuul/src/opendev.org/openstack/nova/nova/tests/functional/notification_sample_tests/test_service.py", line 122, in test_service_disabled
    'uuid': self.service_uuid})
  File "/home/zuul/src/opendev.org/openstack/nova/nova/tests/functional/notification_sample_tests/test_service.py", line 37, in _verify_notification
    base._verify_notification(sample_file_name, replacements, actual)
  File "/home/zuul/src/opendev.org/openstack/nova/nova/tests/functional/notification_sample_tests/notification_sample_base.py", line 148, in _verify_notification
    self.assertEqual(1, len(fake_notifier.VERSIONED_NOTIFICATIONS))
  File "/home/zuul/src/opendev.org/openstack/nova/.tox/functional-py36/lib/python3.6/site-packages/testtools/testcase.py", line 411, in assertEqual
    self.assertThat(observed, matcher, message)
  File "/home/zuul/src/opendev.org/openstack/nova/.tox/functional-py36/lib/python3.6/site-packages/testtools/testcase.py", line 498, in assertThat
    raise mismatch_error
testtools.matchers._impl.MismatchError: 1 != 2

The error output also contains this odd traceback of a resize revert failing because the NeutronFixture isn't being used:

2019-08-07 23:22:23,621 ERROR [nova.network.neutronv2.api] The [neutron] section of your nova configuration file must be configured for authentication with the networking service endpoint. See the networking service install guide for details: https://docs.openstack.org/neutron/latest/install/
2019-08-07 23:22:23,634 ERROR [nova.compute.manager] Setting instance vm_state to ERROR
Traceback (most recent call last):
  File "/home/zuul/src/opendev.org/openstack/nova/nova/compute/manager.py", line 8656, in _error_out_instance_on_exception
    yield
  File "/home/zuul/src/opendev.org/openstack/nova/nova/compute/manager.py", line 4830, in _resize_instance
    migration_p)
  File "/home/zuul/src/opendev.org/openstack/nova/nova/network/neutronv2/api.py", line 2697, in migrate_instance_start
    client = _get_ksa_client(context, admin=True)
  File "/home/zuul/src/opendev.org/openstack/nova/nova/network/neutronv2/api.py", line 215, in _get_ksa_client
    auth_plugin = _get_auth_plugin(context, admin=admin)
  File "/home/zuul/src/opendev.org/openstack/nova/nova/network/neutronv2/api.py", line 151, in _get_auth_plugin
    _ADMIN_AUTH = _load_auth_plugin(CONF)
  File "/home/zuul/src/opendev.org/openstack/nova/nova/network/neutronv2/api.py", line 82, in _load_auth_plugin
    raise neutron_client_exc.Unauthorized(message=err_msg)
neutronclient.common.exceptions.Unauthorized: Unknown auth type: None

According to logstash this started showing up around 8/5:

http://logstash.openstack.org/#dashboard/file/logstash.json?query=message%3A%5C%22ERROR%20%5Bnova.network.neutronv2.api%5D%20The%20%5Bneutron%5D%20section%20of%20your%20nova%20configuration%20file%20must%20be%20configured%20for%20authentication%20with%20the%20networking%20service%20endpoint.%5C%22%20AND%20tags%3A%5C%22console%5C%22&from=7d

Which makes me think this change, which is restarting a compute service and sleeping in a stub:

https://review.opendev.org/#/c/670393/

Might be screwing up concurrently running tests.

Looking at when that test runs and when the failing one runs:

2019-08-07 23:21:54.157918 | ubuntu-bionic | {4} nova.tests.functional.compute.test_init_host.ComputeManagerInitHostTestCase.test_migrate_disk_and_power_off_crash_finish_revert_migration [4.063814s] ... ok

2019-08-07 23:25:00.073443 | ubuntu-bionic | {4} nova.tests.functional.notification_sample_tests.test_service.TestServiceUpdateNotificationSampleLatest.test_service_disabled [160.155643s] ... FAILED

We can see they are on the same worker process and run at about the same time.
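
To make the timing concrete, here is a rough sketch that derives approximate start times from the two job-output lines above and compares them with the timestamp of the neutron error. It assumes the leading timestamp on each job-output line is roughly when the test finished, which may be off by a little; the helper name start_time is made up for this illustration.

```python
from datetime import datetime, timedelta

JOB_FMT = "%Y-%m-%d %H:%M:%S.%f"   # timestamp format of the job-output lines
LOG_FMT = "%Y-%m-%d %H:%M:%S,%f"   # timestamp format of the captured log lines


def start_time(reported_at, duration_seconds):
    """Approximate start: reported (end) time minus the test duration.

    Assumption: the reported timestamp is close to when the test ended.
    """
    end = datetime.strptime(reported_at, JOB_FMT)
    return end - timedelta(seconds=duration_seconds)


# Values copied from the job-output lines above.
init_host_start = start_time("2019-08-07 23:21:54.157918", 4.063814)
service_disabled_start = start_time("2019-08-07 23:25:00.073443", 160.155643)

# Timestamp of the "[neutron] section ... must be configured" error.
neutron_error = datetime.strptime("2019-08-07 23:22:23,621", LOG_FMT)

since_init_host = (neutron_error - init_host_start).total_seconds()
since_service_disabled = (neutron_error - service_disabled_start).total_seconds()

print("init_host test started:       ", init_host_start.time())
print("service_disabled test started:", service_disabled_start.time())
print("error after init_host start:        %.1fs" % since_init_host)
print("error after service_disabled start: %.1fs" % since_service_disabled)
```

Under that assumption the neutron error lands about 33.5 seconds after the init_host test started and about 3.7 seconds after test_service_disabled started, which lines up with something from the earlier test waking up roughly 30 seconds later, inside the later test.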

Furthermore, we can see that TestServiceUpdateNotificationSampleLatest.test_service_disabled eventually times out after 160 seconds and this is in the error output:

2019-08-07 23:24:59,911 ERROR [nova.compute.api] An error occurred while updating the COMPUTE_STATUS_DISABLED trait on compute node resource providers managed by host host1. The trait will be synchronized automatically by the compute service when the update_available_resource periodic task runs.
Traceback (most recent call last):
  File "/home/zuul/src/opendev.org/openstack/nova/nova/compute/api.py", line 5034, in _update_compute_provider_status
    self.rpcapi.set_host_enabled(context, service.host, enabled)
  File "/home/zuul/src/opendev.org/openstack/nova/nova/compute/rpcapi.py", line 996, in set_host_enabled
    return cctxt.call(ctxt, 'set_host_enabled', enabled=enabled)
  File "/home/zuul/src/opendev.org/openstack/nova/.tox/functional-py36/lib/python3.6/site-packages/oslo_messaging/rpc/client.py", line 181, in call
    transport_options=self.transport_options)
  File "/home/zuul/src/opendev.org/openstack/nova/.tox/functional-py36/lib/python3.6/site-packages/oslo_messaging/transport.py", line 129, in _send
    transport_options=transport_options)
  File "/home/zuul/src/opendev.org/openstack/nova/.tox/functional-py36/lib/python3.6/site-packages/oslo_messaging/_drivers/impl_fake.py", line 224, in send
    transport_options)
  File "/home/zuul/src/opendev.org/openstack/nova/.tox/functional-py36/lib/python3.6/site-packages/oslo_messaging/_drivers/impl_fake.py", line 208, in _send
    reply, failure = reply_q.get(timeout=timeout)
  File "/home/zuul/src/opendev.org/openstack/nova/.tox/functional-py36/lib/python3.6/site-packages/eventlet/queue.py", line 322, in get
    return waiter.wait()
  File "/home/zuul/src/opendev.org/openstack/nova/.tox/functional-py36/lib/python3.6/site-packages/eventlet/queue.py", line 141, in wait
    return get_hub().switch()
  File "/home/zuul/src/opendev.org/openstack/nova/.tox/functional-py36/lib/python3.6/site-packages/eventlet/hubs/hub.py", line 298, in switch
    return self.greenlet.switch()
  File "/home/zuul/src/opendev.org/openstack/nova/.tox/functional-py36/lib/python3.6/site-packages/eventlet/hubs/hub.py", line 350, in run
    self.wait(sleep_time)
  File "/home/zuul/src/opendev.org/openstack/nova/.tox/functional-py36/lib/python3.6/site-packages/eventlet/hubs/poll.py", line 77, in wait
    time.sleep(seconds)
  File "/home/zuul/src/opendev.org/openstack/nova/.tox/functional-py36/lib/python3.6/site-packages/fixtures/_fixtures/timeout.py", line 52, in signal_handler
    raise TimeoutException()
fixtures._fixtures.timeout.TimeoutException

So test_migrate_disk_and_power_off_crash_finish_revert_migration is probably not cleaning up properly.
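
A minimal sketch of the suspected leak, using only the standard library. This is not the nova test code: it uses plain threads instead of eventlet green threads, and every name in it (stubbed_resize, run_crash_simulation_test, run_service_disabled_test, the notification strings) is hypothetical. It only illustrates how a sleeping stub started by one test can wake up while a later test on the same worker is counting notifications in a shared list.

```python
import threading
import time

# Stand-in for the process-wide list the fake notifier appends to.
VERSIONED_NOTIFICATIONS = []


def stubbed_resize(sleep_seconds):
    """Hypothetical stub: pause mid-resize, then fail and notify."""
    time.sleep(sleep_seconds)
    # By the time this wakes up, the originating test and its fixtures are
    # gone, so the operation fails and emits an error notification.
    VERSIONED_NOTIFICATIONS.append("compute.exception")


def run_crash_simulation_test(sleep_seconds):
    """Starts the stubbed resize in the background and returns at once."""
    worker = threading.Thread(
        target=stubbed_resize, args=(sleep_seconds,), daemon=True)
    worker.start()
    return worker


def run_service_disabled_test(duration):
    """Resets the list, emits one notification, returns the final count."""
    VERSIONED_NOTIFICATIONS.clear()
    VERSIONED_NOTIFICATIONS.append("service.update")
    time.sleep(duration)  # still running when the leaked stub wakes up
    return len(VERSIONED_NOTIFICATIONS)


# The first "test" finishes immediately but leaves the sleeper behind.
leaked = run_crash_simulation_test(sleep_seconds=0.2)
# The second "test" expects exactly one notification.
count = run_service_disabled_test(duration=0.5)
leaked.join()
print("notifications seen by the second test:", count)
```

With these timings the second test sees 2 notifications instead of the 1 it expects, the same shape as the 1 != 2 mismatch above.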

Matt Riedemann (mriedem)
Changed in nova:
status: New → Confirmed
importance: Undecided → High
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to nova (master)

Related fix proposed to branch: master
Review: https://review.opendev.org/675417

OpenStack Infra (hudson-openstack) wrote : Related fix merged to nova (master)

Reviewed: https://review.opendev.org/675417
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=4156571d2cbda764d79d873dc9686c6b12a46380
Submitter: Zuul
Branch: master

commit 4156571d2cbda764d79d873dc9686c6b12a46380
Author: Matt Riedemann <email address hidden>
Date: Thu Aug 8 13:09:18 2019 -0400

    Skip test_migrate_disk_and_power_off_crash_finish_revert_migration

    The stub with the compute service restart + sleep in this test is
    leaking into other tests on the same worker process causing other
    tests to fail, like versioned notification sample tests which are
    asserting an expected number of notifications but getting more than
    expected because the revert resize is waking up after the sleep and
    failing which triggers an error notification.

    Change-Id: I8da4caebe4c574280b1cfdb76a93cc899b807f2e
    Related-Bug: #1839515

Balazs Gibizer (balazs-gibizer) wrote :

I think the obvious but hacky fix is to increase the 30-second sleep to something much bigger.

OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (master)

Fix proposed to branch: master
Review: https://review.opendev.org/675553

Changed in nova:
assignee: nobody → Balazs Gibizer (balazs-gibizer)
status: Confirmed → In Progress
Changed in nova:
assignee: Balazs Gibizer (balazs-gibizer) → Matt Riedemann (mriedem)
Matt Riedemann (mriedem)
Changed in nova:
assignee: Matt Riedemann (mriedem) → Balazs Gibizer (balazs-gibizer)
OpenStack Infra (hudson-openstack) wrote : Fix merged to nova (master)

Reviewed: https://review.opendev.org/675553
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=f875c9d12fa03e1eb0d6ac0f3dd95f502ae1e6a1
Submitter: Zuul
Branch: master

commit f875c9d12fa03e1eb0d6ac0f3dd95f502ae1e6a1
Author: Balazs Gibizer <email address hidden>
Date: Fri Aug 9 09:53:45 2019 +0200

    Prevent init_host test to interfere with other tests

    The test_migrate_disk_and_power_off_crash_finish_revert_migration test
    needs to simulate a compute host crash at a certain point. It stops the
    execution at a certain point by injecting a sleep then simulating a
    compute restart. However the sleep is just 30 seconds which allows the
    stopped function to return while other functional tests are running in
    the same test worker process making those tests fail in a weird way.

    One simple solution is to add a big enough sleep to the test that will
    never return before the whole functional test execution. This patch
    proposes a million seconds which is more than 277 hours. Similar to how
    the other test in this test package works. This solution is hacky but
    simple. A better solution would be to further enhance the capabilities
    of the functional test env supporting nova-compute service crash / kill
    + restart.

    Change-Id: Ib0d142806804e9113dd61d3a7ec15a98232775c8
    Closes-Bug: #1839515
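
As a quick check of the numbers in the commit message above, a small sketch that converts the new sleep into hours and days. The constant names are made up for this illustration; the 30-second and million-second values come from the commit message.

```python
# Sleep durations mentioned in the commit message, in seconds.
OLD_SLEEP_SECONDS = 30
NEW_SLEEP_SECONDS = 1000000

SECONDS_PER_HOUR = 60 * 60
SECONDS_PER_DAY = 24 * SECONDS_PER_HOUR

new_sleep_hours = NEW_SLEEP_SECONDS / SECONDS_PER_HOUR
new_sleep_days = NEW_SLEEP_SECONDS / SECONDS_PER_DAY

print("old sleep: %d seconds" % OLD_SLEEP_SECONDS)
print("new sleep: %.1f hours (%.1f days)" % (new_sleep_hours, new_sleep_days))
```

That works out to a bit under 278 hours, or roughly eleven and a half days, so the stubbed function should not wake up during any realistic functional test run.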

Changed in nova:
status: In Progress → Fix Released
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/nova 20.0.0.0rc1

This issue was fixed in the openstack/nova 20.0.0.0rc1 release candidate.

OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (stable/stein)

Fix proposed to branch: stable/stein
Review: https://review.opendev.org/687579

OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (stable/rocky)

Fix proposed to branch: stable/rocky
Review: https://review.opendev.org/687862

OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (stable/queens)

Fix proposed to branch: stable/queens
Review: https://review.opendev.org/687876

OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (stable/pike)

Fix proposed to branch: stable/pike
Review: https://review.opendev.org/687916

OpenStack Infra (hudson-openstack) wrote : Fix merged to nova (stable/stein)

Reviewed: https://review.opendev.org/687579
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=080b4759d3a9500266543989cb92b10cc87ea8d5
Submitter: Zuul
Branch: stable/stein

commit 080b4759d3a9500266543989cb92b10cc87ea8d5
Author: Balazs Gibizer <email address hidden>
Date: Fri Aug 9 09:53:45 2019 +0200

    Prevent init_host test to interfere with other tests

    Change-Id: Ib0d142806804e9113dd61d3a7ec15a98232775c8
    Closes-Bug: #1839515
    (cherry picked from commit f875c9d12fa03e1eb0d6ac0f3dd95f502ae1e6a1)

tags: added: in-stable-stein
OpenStack Infra (hudson-openstack) wrote : Fix merged to nova (stable/rocky)

Reviewed: https://review.opendev.org/687862
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=042746e68b33eba1fe0fe90e14fe487e8d336c2c
Submitter: Zuul
Branch: stable/rocky

commit 042746e68b33eba1fe0fe90e14fe487e8d336c2c
Author: Balazs Gibizer <email address hidden>
Date: Fri Aug 9 09:53:45 2019 +0200

    Prevent init_host test to interfere with other tests

    Change-Id: Ib0d142806804e9113dd61d3a7ec15a98232775c8
    Closes-Bug: #1839515
    (cherry picked from commit f875c9d12fa03e1eb0d6ac0f3dd95f502ae1e6a1)
    (cherry picked from commit 080b4759d3a9500266543989cb92b10cc87ea8d5)

tags: added: in-stable-rocky
OpenStack Infra (hudson-openstack) wrote : Fix merged to nova (stable/queens)

Reviewed: https://review.opendev.org/687876
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=5fdbac1307a8bb26819bf6c87b38a45081c1c330
Submitter: Zuul
Branch: stable/queens

commit 5fdbac1307a8bb26819bf6c87b38a45081c1c330
Author: Balazs Gibizer <email address hidden>
Date: Fri Aug 9 09:53:45 2019 +0200

    Prevent init_host test to interfere with other tests

    Change-Id: Ib0d142806804e9113dd61d3a7ec15a98232775c8
    Closes-Bug: #1839515
    (cherry picked from commit f875c9d12fa03e1eb0d6ac0f3dd95f502ae1e6a1)
    (cherry picked from commit 080b4759d3a9500266543989cb92b10cc87ea8d5)
    (cherry picked from commit 042746e68b33eba1fe0fe90e14fe487e8d336c2c)

tags: added: in-stable-queens
OpenStack Infra (hudson-openstack) wrote : Fix merged to nova (stable/pike)

Reviewed: https://review.opendev.org/687916
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=dfda1bcbd3e598ab111e48c31fbd514b175bd5d2
Submitter: Zuul
Branch: stable/pike

commit dfda1bcbd3e598ab111e48c31fbd514b175bd5d2
Author: Balazs Gibizer <email address hidden>
Date: Fri Aug 9 09:53:45 2019 +0200

    Prevent init_host test to interfere with other tests

    Change-Id: Ib0d142806804e9113dd61d3a7ec15a98232775c8
    Closes-Bug: #1839515
    (cherry picked from commit f875c9d12fa03e1eb0d6ac0f3dd95f502ae1e6a1)
    (cherry picked from commit 080b4759d3a9500266543989cb92b10cc87ea8d5)
    (cherry picked from commit 042746e68b33eba1fe0fe90e14fe487e8d336c2c)
    (cherry picked from commit 5fdbac1307a8bb26819bf6c87b38a45081c1c330)

tags: added: in-stable-pike
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/nova 19.1.0

This issue was fixed in the openstack/nova 19.1.0 release.

OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/nova 18.3.0

This issue was fixed in the openstack/nova 18.3.0 release.

OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/nova pike-eol

This issue was fixed in the openstack/nova pike-eol release.

OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/nova queens-eol

This issue was fixed in the openstack/nova queens-eol release.
