nova.tests.functional.test_server_group.ServerGroupTestV264.test_boot_servers_with_affinity_no_valid_host intermittently failing with "OpenStackApiNotFoundException: Item not found"

Bug #1800472 reported by Matt Riedemann
Affects: OpenStack Compute (nova)
Status: Fix Released
Importance: Medium
Assigned to: Matt Riedemann

Bug Description

http://logs.openstack.org/48/613348/1/check/nova-tox-functional/16b5d01/job-output.txt.gz#_2018-10-29_10_14_46_484703

2018-10-29 10:14:46.550516 | ubuntu-xenial | 2018-10-29 10:14:45,457 INFO [nova.api.openstack.requestlog] 127.0.0.1 "POST /v2.1/6f70656e737461636b20342065766572/servers" status: 202 len: 480 microversion: 2.64 time: 0.182153
2018-10-29 10:14:46.550841 | ubuntu-xenial | 2018-10-29 10:14:45,485 INFO [nova.api.openstack.requestlog] 127.0.0.1 "GET /v2.1/6f70656e737461636b20342065766572/servers/e5f8520d-9f2f-4895-bcf8-485e50898bd0" status: 200 len: 1806 microversion: 2.64 time: 0.023981
2018-10-29 10:14:46.551172 | ubuntu-xenial | 2018-10-29 10:14:45,639 INFO [nova.api.openstack.placement.requestlog] 127.0.0.1 "GET /placement/allocation_candidates?limit=1000&resources=DISK_GB%3A40%2CMEMORY_MB%3A4096%2CVCPU%3A2" status: 200 len: 467 microversion: 1.29
2018-10-29 10:14:46.551333 | ubuntu-xenial | 2018-10-29 10:14:45,657 INFO [nova.filters] Filter ServerGroupAffinityFilter returned 0 hosts
2018-10-29 10:14:46.552003 | ubuntu-xenial | 2018-10-29 10:14:45,657 INFO [nova.filters] Filtering removed all hosts for the request with instance ID 'e5f8520d-9f2f-4895-bcf8-485e50898bd0'. Filter results: ['RetryFilter: (start: 1, end: 1)', 'AvailabilityZoneFilter: (start: 1, end: 1)', 'ComputeFilter: (start: 1, end: 1)', 'ComputeCapabilitiesFilter: (start: 1, end: 1)', 'ImagePropertiesFilter: (start: 1, end: 1)', 'ServerGroupAntiAffinityFilter: (start: 1, end: 1)', 'ServerGroupAffinityFilter: (start: 1, end: 0)']
2018-10-29 10:14:46.552140 | ubuntu-xenial | 2018-10-29 10:14:45,658 ERROR [nova.conductor.manager] Failed to schedule instances
2018-10-29 10:14:46.552209 | ubuntu-xenial | Traceback (most recent call last):
2018-10-29 10:14:46.552345 | ubuntu-xenial | File "nova/conductor/manager.py", line 1255, in schedule_and_build_instances
2018-10-29 10:14:46.552423 | ubuntu-xenial | instance_uuids, return_alternates=True)
2018-10-29 10:14:46.552545 | ubuntu-xenial | File "nova/conductor/manager.py", line 750, in _schedule_instances
2018-10-29 10:14:46.552626 | ubuntu-xenial | return_alternates=return_alternates)
2018-10-29 10:14:46.552740 | ubuntu-xenial | File "nova/scheduler/utils.py", line 953, in wrapped
2018-10-29 10:14:46.552810 | ubuntu-xenial | return func(*args, **kwargs)
2018-10-29 10:14:46.552942 | ubuntu-xenial | File "nova/scheduler/client/__init__.py", line 53, in select_destinations
2018-10-29 10:14:46.553051 | ubuntu-xenial | instance_uuids, return_objects, return_alternates)
2018-10-29 10:14:46.553167 | ubuntu-xenial | File "nova/scheduler/client/__init__.py", line 37, in __run_method
2018-10-29 10:14:46.553268 | ubuntu-xenial | return getattr(self.instance, __name)(*args, **kwargs)
2018-10-29 10:14:46.553387 | ubuntu-xenial | File "nova/scheduler/client/query.py", line 42, in select_destinations
2018-10-29 10:14:46.553483 | ubuntu-xenial | instance_uuids, return_objects, return_alternates)
2018-10-29 10:14:46.553596 | ubuntu-xenial | File "nova/scheduler/rpcapi.py", line 158, in select_destinations
2018-10-29 10:14:46.553703 | ubuntu-xenial | return cctxt.call(ctxt, 'select_destinations', **msg_args)
2018-10-29 10:14:46.553934 | ubuntu-xenial | File "/home/zuul/src/git.openstack.org/openstack/nova/.tox/py27/local/lib/python2.7/site-packages/oslo_messaging/rpc/client.py", line 179, in call
2018-10-29 10:14:46.553995 | ubuntu-xenial | retry=self.retry)
2018-10-29 10:14:46.554235 | ubuntu-xenial | File "/home/zuul/src/git.openstack.org/openstack/nova/.tox/py27/local/lib/python2.7/site-packages/oslo_messaging/transport.py", line 128, in _send
2018-10-29 10:14:46.554280 | ubuntu-xenial | retry=retry)
2018-10-29 10:14:46.554693 | ubuntu-xenial | File "/home/zuul/src/git.openstack.org/openstack/nova/.tox/py27/local/lib/python2.7/site-packages/oslo_messaging/_drivers/impl_fake.py", line 222, in send
2018-10-29 10:14:46.554906 | ubuntu-xenial | return self._send(target, ctxt, message, wait_for_reply, timeout)
2018-10-29 10:14:46.555228 | ubuntu-xenial | File "/home/zuul/src/git.openstack.org/openstack/nova/.tox/py27/local/lib/python2.7/site-packages/oslo_messaging/_drivers/impl_fake.py", line 209, in _send
2018-10-29 10:14:46.555319 | ubuntu-xenial | raise failure
2018-10-29 10:14:46.555485 | ubuntu-xenial | NoValidHost: No valid host was found. There are not enough hosts available.
2018-10-29 10:14:46.555816 | ubuntu-xenial | 2018-10-29 10:14:45,697 WARNING [nova.scheduler.utils] Failed to compute_task_build_instances: No valid host was found. There are not enough hosts available.
2018-10-29 10:14:46.556006 | ubuntu-xenial | 2018-10-29 10:14:45,698 WARNING [nova.scheduler.utils] Setting instance to ERROR state.

It looks like the test creates two servers in an affinity group and then tries to create a third server in the group, expecting it to fail because there are not enough resources left on the host that holds the other two servers.
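Roughly, the flow is this (a sketch reconstructed from the traceback below, not the verbatim test code; the group setup helper is an assumption):

class ServerGroupAffinitySketch(object):
    def test_boot_servers_with_affinity_no_valid_host(self):
        group = self._create_affinity_group()   # assumed helper, not real code
        self._boot_a_server_to_group(group)     # 1st server: OK
        self._boot_a_server_to_group(group)     # 2nd server: OK, host now full
        # 3rd server: ServerGroupAffinityFilter returns 0 hosts, so the build
        # fails with NoValidHost and the test waits for the server to go to
        # ERROR. That wait polls GET /servers/{id}, which is what 404s here.
        self._boot_a_server_to_group(group, expected_status='ERROR')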

That all seems to work. The problem is that while the test is waiting for the third server's status to change to ERROR, the API returns a 404:

2018-10-29 10:14:46.484852 | ubuntu-xenial | Captured traceback:
2018-10-29 10:14:46.484897 | ubuntu-xenial | ~~~~~~~~~~~~~~~~~~~
2018-10-29 10:14:46.484967 | ubuntu-xenial | Traceback (most recent call last):
2018-10-29 10:14:46.485159 | ubuntu-xenial | File "nova/tests/functional/test_server_group.py", line 303, in test_boot_servers_with_affinity_no_valid_host
2018-10-29 10:14:46.485247 | ubuntu-xenial | expected_status='ERROR')
2018-10-29 10:14:46.485394 | ubuntu-xenial | File "nova/tests/functional/test_server_group.py", line 112, in _boot_a_server_to_group
2018-10-29 10:14:46.485490 | ubuntu-xenial | self.admin_api, created_server, expected_status)
2018-10-29 10:14:46.485633 | ubuntu-xenial | File "nova/tests/functional/integrated_helpers.py", line 259, in _wait_for_state_change
2018-10-29 10:14:46.485743 | ubuntu-xenial | admin_api, server, {'status': expected_status}, max_retries)
2018-10-29 10:14:46.485892 | ubuntu-xenial | File "nova/tests/functional/integrated_helpers.py", line 243, in _wait_for_server_parameter
2018-10-29 10:14:46.485980 | ubuntu-xenial | server = admin_api.get_server(server['id'])
2018-10-29 10:14:46.486104 | ubuntu-xenial | File "nova/tests/functional/api/client.py", line 268, in get_server
2018-10-29 10:14:46.486218 | ubuntu-xenial | return self.api_get('/servers/%s' % server_id).body['server']
2018-10-29 10:14:46.486330 | ubuntu-xenial | File "nova/tests/functional/api/client.py", line 225, in api_get
2018-10-29 10:14:46.486439 | ubuntu-xenial | return APIResponse(self.api_request(relative_uri, **kwargs))
2018-10-29 10:14:46.486557 | ubuntu-xenial | File "nova/tests/functional/api/client.py", line 207, in api_request
2018-10-29 10:14:46.486671 | ubuntu-xenial | raise OpenStackApiNotFoundException(response=response)
2018-10-29 10:14:46.486811 | ubuntu-xenial | nova.tests.functional.api.client.OpenStackApiNotFoundException: Item not found
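For context, this is roughly what the polling helper in integrated_helpers.py does (a simplified paraphrase, not the real code; the retry count and sleep interval are assumptions):

import time


def _wait_for_server_parameter(admin_api, server, expected, max_retries=10):
    """Simplified paraphrase of the functional-test poller (not verbatim).

    The point: get_server() raises OpenStackApiNotFoundException on a 404,
    which aborts the test immediately instead of retrying until the server
    reaches the expected (ERROR) status.
    """
    retries = 0
    while True:
        server = admin_api.get_server(server['id'])  # raises on a 404
        if all(server.get(k) == v for k, v in expected.items()):
            return server
        retries += 1
        if retries > max_retries:
            raise AssertionError('Timed out waiting for %s, got %s'
                                 % (expected, server))
        time.sleep(0.5)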

My guess is that something in this recent change is causing a problem looking up the instance in the cell0 DB:

https://review.openstack.org/#/c/611665/

Revision history for this message
Matt Riedemann (mriedem) wrote :
summary: nova.tests.functional.test_server_group.ServerGroupTestV264.test_boot_servers_with_affinity_no_valid_host
- intermittently failing with "Filter ServerGroupAffinityFilter returned 0
- hosts"
+ intermittently failing with "OpenStackApiNotFoundException: Item not
+ found"
Revision history for this message
Matt Riedemann (mriedem) wrote :

The earlier occurrences of this failure, before 10/24, hit on changes from this series:

https://review.openstack.org/#/q/topic:bp/handling-down-cell+(status:open+OR+status:merged)

So I have to think it's something in https://review.openstack.org/#/c/611665/ that is causing the problem.

Revision history for this message
Matt Riedemann (mriedem) wrote :

I saw something similar today in another test; it hits NoValidHost and then we get a 404 while waiting for the instance to go to ERROR state:

http://logs.openstack.org/34/607934/15/gate/nova-tox-functional-py35/25b8065/testr_results.html.gz

ft1.2: nova.tests.functional.wsgi.test_servers.EnforceVolumeBackedForZeroDiskFlavorTestCase.test_create_volume_backed_server_with_zero_disk_allowed_StringException: pythonlogging:'': {{{
2018-11-02 01:11:15,309 INFO [nova.api.openstack.requestlog] 127.0.0.1 "GET /v2.1" status: 204 len: 0 microversion: - time: 0.000513
2018-11-02 01:11:15,345 INFO [nova.api.openstack.requestlog] 127.0.0.1 "POST /v2.1/6f70656e737461636b20342065766572/flavors" status: 200 len: 577 microversion: 2.1 time: 0.031808
2018-11-02 01:11:15,358 INFO [nova.api.openstack.placement.objects.resource_provider] Synced traits from os_traits into API DB: {'HW_NIC_OFFLOAD_TXVLAN', 'HW_CPU_AARCH64_JSCVT', 'HW_CPU_X86_SSSE3', 'HW_NIC_OFFLOAD_RXHASH', 'HW_GPU_CUDA_COMPUTE_CAPABILITY_V3_0', 'HW_CPU_AARCH64_FP', 'HW_CPU_AARCH64_SM4', 'HW_CPU_AARCH64_EVTSTRM', 'HW_CPU_HYPERTHREADING', 'COMPUTE_DEVICE_TAGGING', 'HW_GPU_API_DIRECTX_V10', 'HW_CPU_X86_MMX', 'HW_GPU_API_OPENGL_V1_2', 'HW_CPU_X86_SSE', 'HW_CPU_AARCH64_CPUID', 'HW_CPU_X86_SGX', 'HW_GPU_API_OPENGL_V2_1', 'HW_NIC_OFFLOAD_TCS', 'HW_CPU_AARCH64_SHA512', 'HW_GPU_RESOLUTION_W1920H1080', 'HW_NIC_OFFLOAD_GSO', 'HW_GPU_API_OPENGL_V4_3', 'HW_NIC_OFFLOAD_TX', 'HW_NIC_OFFLOAD_RX', 'HW_NIC_SRIOV_TRUSTED', 'HW_GPU_API_VULKAN', 'HW_NIC_OFFLOAD_RXVLAN', 'HW_CPU_X86_ABM', 'HW_NIC_OFFLOAD_TXUDP', 'HW_CPU_X86_CLMUL', 'HW_GPU_API_OPENGL_V4_4', 'HW_NIC_DCB_QCN', 'HW_CPU_X86_FMA3', 'HW_GPU_RESOLUTION_W1366H768', 'HW_CPU_AARCH64_LRCPC', 'HW_GPU_CUDA_SDK_V6_5', 'HW_GPU_CUDA_COMPUTE_CAPABILITY_V6_0', 'HW_GPU_MAX_DISPLAY_HEADS_1', 'HW_GPU_CUDA_SDK_V9_2', 'HW_GPU_API_OPENGL_V4_2', 'HW_GPU_API_DIRECT2D', 'HW_GPU_CUDA_COMPUTE_CAPABILITY_V2_0', 'HW_GPU_CUDA_COMPUTE_CAPABILITY_V7_1', 'HW_GPU_RESOLUTION_W640H480', 'HW_GPU_RESOLUTION_W800H600', 'HW_CPU_X86_SSE3', 'HW_CPU_AARCH64_SHA2', 'HW_CPU_X86_AVX', 'HW_GPU_CUDA_SDK_V7_5', 'HW_GPU_RESOLUTION_W2560H1440', 'HW_GPU_CUDA_COMPUTE_CAPABILITY_V5_2', 'HW_NIC_OFFLOAD_RDMA', 'HW_CPU_X86_MPX', 'HW_GPU_API_DIRECT3D_V8_1', 'HW_CPU_X86_AVX2', 'HW_GPU_API_OPENCL_V1_0', 'HW_GPU_API_DIRECT3D_V10_0', 'HW_CPU_X86_SSE4A', 'HW_GPU_RESOLUTION_W1280H800', 'HW_CPU_AARCH64_ASIMDDP', 'HW_CPU_X86_SSE2', 'HW_GPU_API_DIRECT3D_V10_1', 'HW_GPU_CUDA_COMPUTE_CAPABILITY_V6_2', 'HW_CPU_X86_TBM', 'HW_NIC_ACCEL_IPSEC', 'HW_GPU_API_DIRECT3D_V6_0', 'HW_GPU_RESOLUTION_W1440H900', 'COMPUTE_NET_ATTACH_INTERFACE_WITH_TAG', 'HW_CPU_X86_AESNI', 'HW_GPU_API_DIRECTX_V11', 'HW_CPU_AARCH64_SHA3', 'HW_GPU_API_DIRECT3D_V11_0', 'HW_GPU_CUDA_COMPUTE_CAPABILITY_V5_3', 'HW_GPU_MAX_DISPLAY_HEADS_2', 'HW_GPU_API_DIRECTX_V12', 'HW_NIC_OFFLOAD_FDF', 'HW_NIC_ACCEL_SSL', 'HW_GPU_CUDA_COMPUTE_CAPABILITY_V5_0', 'HW_CPU_X86_VMX', 'HW_GPU_RESOLUTION_W1600H900', 'HW_CPU_X86_XOP', 'HW_NIC_VMDQ', 'HW_GPU_RESOLUTION_W1600H1200', 'HW_GPU_API_DIRECT3D_V12_0', 'HW_GPU_API_OPENGL_V3_2', 'HW_NIC_ACCEL_RSA', 'HW_GPU_API_OPENGL_V3_0', 'HW_NIC_OFFLOAD_UFO', 'HW_CPU_X86_AVX512DQ', 'HW_GPU_RESOLUTION_W7680H4320', 'HW_CPU_AARCH64_ATOMICS', 'H...

Revision history for this message
Matt Riedemann (mriedem) wrote :

I think this is probably the key issue:

2018-11-02 01:11:16,254 WARNING [nova.compute.api] The nova-osapi_compute service version is from before Ocata and may cause problems looking up instances in a cells v2 setup. Check your nova-api service configuration and cell mappings. You may need to remove stale nova-osapi_compute service records from the cell database.

We hit that here:

https://github.com/openstack/nova/blob/4f5cb30a2f88965772bcbff0a19a1dcfcd3836c6/nova/compute/api.py#L2411

We then look up the instance in the main database, which in our tests defaults to cell1, but for a NoValidHost error the instance will be in cell0, and we're not looking there because the nova-api service in the tests doesn't have a service version record.
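Roughly, the code path being described looks like this (my paraphrase, not the actual nova code; the osapi_is_new_enough flag stands in for the real service-version check):

from nova import context as nova_context
from nova import objects


def _lookup_instance_sketch(ctxt, instance_uuid, osapi_is_new_enough):
    # Paraphrase of the compat branch in compute API.get(); osapi_is_new_enough
    # is a stand-in for the nova-osapi_compute service version check.
    if osapi_is_new_enough:
        # Normal cells v2 path: the instance mapping records which cell the
        # instance lives in, so an instance buried in cell0 after NoValidHost
        # is still found.
        im = objects.InstanceMapping.get_by_instance_uuid(ctxt, instance_uuid)
        with nova_context.target_cell(ctxt, im.cell_mapping) as cctxt:
            return objects.Instance.get_by_uuid(cctxt, instance_uuid)
    # Compat path taken in these tests (no usable service version record):
    # only the "main" database is searched, which the test fixtures point at
    # cell1, so the cell0 instance is missed and the API returns a 404.
    return objects.Instance.get_by_uuid(ctxt, instance_uuid)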

Changed in nova:
status: Confirmed → Triaged
Revision history for this message
Matt Riedemann (mriedem) wrote :

The confusing thing is that the API fixture should set up and create a service record:

https://github.com/openstack/nova/blob/4f5cb30a2f88965772bcbff0a19a1dcfcd3836c6/nova/tests/fixtures.py#L931

https://github.com/openstack/nova/blob/4f5cb30a2f88965772bcbff0a19a1dcfcd3836c6/nova/api/openstack/wsgi_app.py#L44

But maybe that record is getting created in cell1 by default, when it should be created against whatever database connection is configured for the API service (which probably just defaults to cell1 anyway).
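To illustrate the suspicion (a hedged sketch, not the actual fixture or wsgi_app code): the service record is written to whatever database the admin context currently points at, so if the fixtures leave that pointed at cell1, the record and its version never show up where the check later looks.

from nova import context, objects


def _record_api_service_version(host):
    # Hedged illustration only: Service.create() writes to the database the
    # context is pointed at. If the test fixtures leave that at cell1, the
    # nova-osapi_compute record (and its version) lands only in cell1.
    ctxt = context.get_admin_context()
    service = objects.Service(context=ctxt, host=host,
                              binary='nova-osapi_compute')
    service.create()
    return service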

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (master)

Fix proposed to branch: master
Review: https://review.openstack.org/615347

Changed in nova:
assignee: nobody → Matt Riedemann (mriedem)
status: Triaged → In Progress
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to nova (master)

Reviewed: https://review.openstack.org/615347
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=097693aaa3f57011e38188a94687e02bc9b074ab
Submitter: Zuul
Branch: master

commit 097693aaa3f57011e38188a94687e02bc9b074ab
Author: Matt Riedemann <email address hidden>
Date: Fri Nov 2 17:36:41 2018 -0400

    Drop pre-cellsv2 compat in compute API.get()

    This removes the nova-osapi_compute service version
    compatibility check for cells v2 which was added in
    Newton.

    A "nova-status upgrade check" was added for that
    in Rocky:

      Ie2bc4616439352850cf29a9de7d33a06c8f7c2b8

    And it is also causing issues with functional tests
    where instances are buried in cell0 and because we
    don't use the instance mapping, we fail to find the
    instance.

    As a result of this, compute API tests that rely on
    retrieving a real instance out of the database will
    also need to make sure an associated instance mapping
    record exists.

    Change-Id: I11083aa3c78bd8b6201558561457f3d65007a177
    Closes-Bug: #1800472
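For reference, "make sure an associated instance mapping record exists" can look roughly like this in a test (a hedged sketch, not code from the patch):

from nova import context, objects


def _map_instance_to_cell(instance_uuid, cell_mapping, project_id):
    # Hedged sketch: create the InstanceMapping the compute API now relies on
    # to find which cell (e.g. cell0 after NoValidHost) holds the instance.
    ctxt = context.get_admin_context()
    im = objects.InstanceMapping(context=ctxt,
                                 instance_uuid=instance_uuid,
                                 cell_mapping=cell_mapping,
                                 project_id=project_id)
    im.create()
    return im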

Changed in nova:
status: In Progress → Fix Released
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/nova 19.0.0.0rc1

This issue was fixed in the openstack/nova 19.0.0.0rc1 release candidate.
