test_fail_set_az fails intermittently with "AssertionError: OpenStackApiException not raised by _set_az_aggregate"

Bug #1844174 reported by Eric Fried on 2019-09-16
This bug affects 1 person
Affects: OpenStack Compute (nova)
Importance: High
Assigned to: Matt Riedemann

Bug Description

Since 20190910 we've hit this 10x: 8x in functional and 2x in functional-py36

http://logstash.openstack.org/#dashboard/file/logstash.json?query=message%3A%5C%22OpenStackApiException%20not%20raised%20by%20_set_az_aggregate%5C%22

It looks to be a NoValidHost failure caused by the AvailabilityZoneFilter rejecting every host:

2019-09-16 15:10:21,389 INFO [nova.filters] Filter AvailabilityZoneFilter returned 0 hosts
2019-09-16 15:10:21,390 INFO [nova.filters] Filtering removed all hosts for the request with instance ID 'e1ae6109-2bc2-4a40-9249-3dee7d5e80b5'. Filter results: ['AvailabilityZoneFilter: (start: 2, end: 0)']

Here's one example: https://14cb8680ad7e2d5893c2-a0a2161f988b6356e48326da15450ffb.ssl.cf1.rackcdn.com/671800/36/check/nova-tox-functional-py36/abc690a/testr_results.html.gz

or pasted here for when ^ expires: http://paste.openstack.org/raw/776821/

Matt Riedemann (mriedem) on 2019-09-16
tags: added: gate-failure testing
Matt Riedemann (mriedem) wrote :

Looks like it started with this change:

https://review.opendev.org/#/c/671075/21/nova/tests/functional/test_aggregates.py

I'm not sure what about that is triggering the failure, but changing AggregateRequestFiltersTest to inherit from ProviderUsageBaseTestCase might have something to do with it.

My guess is that ProviderUsageBaseTestCase already does a lot of fixture and service setup (api, conductor, scheduler), and the existing AggregateRequestFiltersTest class repeats that setup. So when we set AZ metadata on aggregates, it is only getting synced to one scheduler process and not both, which would explain the intermittent failures.
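
To make that guess concrete, here is a toy model of the suspected race. It is not nova code: ToyScheduler, set_az and the delivered_to parameter are hypothetical names invented for illustration, and the partial delivery is hard-coded rather than reproduced from the fake messaging driver.

```python
class ToyScheduler:
    """Hypothetical stand-in for one scheduler worker with its own in-memory cache."""

    def __init__(self, hosts):
        self.hosts = list(hosts)
        self.host_az = {}  # host name -> availability zone known to this worker

    def update_aggregate_az(self, hosts, az):
        # What a worker would do on receiving the aggregate metadata update.
        for host in hosts:
            self.host_az[host] = az

    def filter_by_az(self, requested_az):
        # Keep only hosts this worker believes are in the requested AZ.
        return [h for h in self.hosts if self.host_az.get(h) == requested_az]


def set_az(schedulers, hosts, az, delivered_to):
    """Model a fanout that only reaches the workers whose index is in delivered_to."""
    for index, scheduler in enumerate(schedulers):
        if index in delivered_to:
            scheduler.update_aggregate_az(hosts, az)


# Two workers, as would happen if setUp started the scheduler twice.
schedulers = [ToyScheduler(["host1", "host2"]), ToyScheduler(["host1", "host2"])]

# Assumption for the sketch: the update reaches worker 0 only.
set_az(schedulers, ["host1"], "az1", delivered_to={0})

# The same request gets a different answer depending on which worker handles it.
results = [s.filter_by_az("az1") for s in schedulers]
```

If the build request lands on the second worker, its filter starts with 2 hosts and ends with 0, which matches the AvailabilityZoneFilter log lines in the description.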

Changed in nova:
status: New → Confirmed
importance: Undecided → High
Matt Riedemann (mriedem) on 2019-09-16
Changed in nova:
assignee: nobody → Matt Riedemann (mriedem)

Fix proposed to branch: master
Review: https://review.opendev.org/682475

Changed in nova:
status: Confirmed → In Progress
Matt Riedemann (mriedem) wrote :

The test was definitely starting two scheduler workers:

2019-09-16 15:10:22.512312 | ubuntu-bionic | b'2019-09-16 15:10:19,455 INFO [nova.service] Starting conductor node (version 19.1.0)'
2019-09-16 15:10:22.512458 | ubuntu-bionic | b'2019-09-16 15:10:19,491 INFO [nova.service] Starting scheduler node (version 19.1.0)'
2019-09-16 15:10:22.512633 | ubuntu-bionic | b'2019-09-16 15:10:19,517 WARNING [placement.db_api] TransactionFactory already started, not reconfiguring.'
2019-09-16 15:10:22.512779 | ubuntu-bionic | b'2019-09-16 15:10:20,113 INFO [nova.service] Starting conductor node (version 19.1.0)'
2019-09-16 15:10:22.512925 | ubuntu-bionic | b'2019-09-16 15:10:20,156 INFO [nova.service] Starting scheduler node (version 19.1.0)'

We should consider changing the ServiceFixture to blow up if you try to create more than one nova-scheduler service since I can't think of any functional tests that will rely on that.
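
A rough sketch of what such a guard could look like, written as a standalone class and not as a patch to the real ServiceFixture. SingletonServiceGuard, DuplicateServiceError and register are hypothetical names, and how this would hook into the fixture's setup is left open.

```python
class DuplicateServiceError(RuntimeError):
    """Raised when a test tries to start a second copy of a singleton service."""


class SingletonServiceGuard:
    """Hypothetical helper that tracks which service binaries a test has started."""

    def __init__(self, singleton_binaries=("nova-scheduler",)):
        self._singleton = set(singleton_binaries)
        self._started = set()

    def register(self, binary):
        # Only binaries listed as singletons are restricted; others may repeat.
        if binary in self._singleton and binary in self._started:
            raise DuplicateServiceError(
                "%s was already started in this test" % binary)
        self._started.add(binary)

    def reset(self):
        # Would be called from test cleanup so state does not leak between tests.
        self._started.clear()


guard = SingletonServiceGuard()
guard.register("nova-conductor")
guard.register("nova-scheduler")
try:
    guard.register("nova-scheduler")
    second_scheduler_allowed = True
except DuplicateServiceError:
    second_scheduler_allowed = False
```

With something like this wired into service startup, the duplicated setUp would have failed immediately instead of surfacing as an intermittent scheduling failure.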

Reviewed: https://review.opendev.org/682475
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=c074a7bd314889024c648e7f6f038242209927d9
Submitter: Zuul
Branch: master

commit c074a7bd314889024c648e7f6f038242209927d9
Author: Matt Riedemann <email address hidden>
Date: Mon Sep 16 16:33:58 2019 -0400

    Remove redundancies from AggregateRequestFiltersTest.setUp

    Change I9ab9d7d65378be564b3731b5227ede8cece71bef made
    AggregateRequestFiltersTest extend ProviderUsageBaseTestCase
    but left a lot of redundant setUp for fixtures and services
    in it which might be contributing to test_fail_set_az failing
    intermittently. My theory is that we're starting multiple API
    and scheduler workers and when setting an AZ on an aggregate
    the API will RPC cast to all schedulers to store that metadata
    information in the scheduler process. Since we have more than
    one scheduler process, and I'm not sure how stable the RPC cast
    fanout capability is in the fake messaging driver, we could be
    hitting a scheduler worker during the test that does not have
    the AZ metadata and the AvailabilityZoneFilter filters out the
    expected host.

    This simply removes the redundant setUp. Even if this isn't the
    root of the problem it at least rules it out.

    Change-Id: I45107ff686456d5db91c615830fa443b3902ddfb
    Partial-Bug: #1844174

Change abandoned by Matt Riedemann (<email address hidden>) on branch: master
Review: https://review.opendev.org/682485
Reason: Drop in favor of https://review.opendev.org/#/c/682486/
