heal_allocations test randomly failing with "ValueError: Field `compute_node_uuid' cannot be None"

Bug #1781648 reported by Matt Riedemann on 2018-07-13
6
This bug affects 1 person
Affects Status Importance Assigned to Milestone
OpenStack Compute (nova)
Medium
Matt Riedemann
Rocky
Medium
Matt Riedemann

Bug Description

http://logs.openstack.org/44/564444/14/gate/nova-tox-functional/75cad04/job-output.txt.gz#_2018-07-13_16_27_07_833394

Things are failing during scheduling:

2018-07-13 16:27:07.846188 | ubuntu-xenial | 2018-07-13 16:27:02,302 INFO [nova.scheduler.host_manager] Host filter forcing available hosts to cell1
2018-07-13 16:27:07.846244 | ubuntu-xenial | 2018-07-13 16:27:02,303 ERROR [oslo_messaging.rpc.server] Exception during message handling
2018-07-13 16:27:07.846273 | ubuntu-xenial | Traceback (most recent call last):
2018-07-13 16:27:07.846367 | ubuntu-xenial | File "/home/zuul/src/git.openstack.org/openstack/nova/.tox/functional/local/lib/python2.7/site-packages/oslo_messaging/rpc/server.py", line 163, in _process_incoming
2018-07-13 16:27:07.846400 | ubuntu-xenial | res = self.dispatcher.dispatch(message)
2018-07-13 16:27:07.846492 | ubuntu-xenial | File "/home/zuul/src/git.openstack.org/openstack/nova/.tox/functional/local/lib/python2.7/site-packages/oslo_messaging/rpc/dispatcher.py", line 265, in dispatch
2018-07-13 16:27:07.846532 | ubuntu-xenial | return self._do_dispatch(endpoint, method, ctxt, args)
2018-07-13 16:27:07.846626 | ubuntu-xenial | File "/home/zuul/src/git.openstack.org/openstack/nova/.tox/functional/local/lib/python2.7/site-packages/oslo_messaging/rpc/dispatcher.py", line 194, in _do_dispatch
2018-07-13 16:27:07.846663 | ubuntu-xenial | result = func(ctxt, **new_args)
2018-07-13 16:27:07.846762 | ubuntu-xenial | File "/home/zuul/src/git.openstack.org/openstack/nova/.tox/functional/local/lib/python2.7/site-packages/oslo_messaging/rpc/server.py", line 226, in inner
2018-07-13 16:27:07.846791 | ubuntu-xenial | return func(*args, **kwargs)
2018-07-13 16:27:07.846840 | ubuntu-xenial | File "nova/scheduler/manager.py", line 169, in select_destinations
2018-07-13 16:27:07.846877 | ubuntu-xenial | allocation_request_version, return_alternates)
2018-07-13 16:27:07.846925 | ubuntu-xenial | File "nova/scheduler/filter_scheduler.py", line 91, in select_destinations
2018-07-13 16:27:07.846966 | ubuntu-xenial | allocation_request_version, return_alternates)
2018-07-13 16:27:07.847010 | ubuntu-xenial | File "nova/scheduler/filter_scheduler.py", line 177, in _schedule
2018-07-13 16:27:07.847031 | ubuntu-xenial | hosts, num_alts)
2018-07-13 16:27:07.847080 | ubuntu-xenial | File "nova/scheduler/filter_scheduler.py", line 320, in _legacy_find_hosts
2018-07-13 16:27:07.847108 | ubuntu-xenial | spec_obj, hosts, num, num_alts)
2018-07-13 16:27:07.847157 | ubuntu-xenial | File "nova/scheduler/filter_scheduler.py", line 356, in _get_alternate_hosts
2018-07-13 16:27:07.847197 | ubuntu-xenial | allocation_request_version=allocation_request_version)
2018-07-13 16:27:07.847239 | ubuntu-xenial | File "nova/objects/selection.py", line 60, in from_host_state
2018-07-13 16:27:07.847279 | ubuntu-xenial | allocation_request_version=allocation_request_version)
2018-07-13 16:27:07.847369 | ubuntu-xenial | File "/home/zuul/src/git.openstack.org/openstack/nova/.tox/functional/local/lib/python2.7/site-packages/oslo_versionedobjects/base.py", line 307, in __init__
2018-07-13 16:27:07.847398 | ubuntu-xenial | setattr(self, key, kwargs[key])
2018-07-13 16:27:07.847494 | ubuntu-xenial | File "/home/zuul/src/git.openstack.org/openstack/nova/.tox/functional/local/lib/python2.7/site-packages/oslo_versionedobjects/base.py", line 72, in setter
2018-07-13 16:27:07.847536 | ubuntu-xenial | field_value = field.coerce(self, name, value)
2018-07-13 16:27:07.847627 | ubuntu-xenial | File "/home/zuul/src/git.openstack.org/openstack/nova/.tox/functional/local/lib/python2.7/site-packages/oslo_versionedobjects/fields.py", line 199, in coerce
2018-07-13 16:27:07.847662 | ubuntu-xenial | return self._null(obj, attr)
2018-07-13 16:27:07.847757 | ubuntu-xenial | File "/home/zuul/src/git.openstack.org/openstack/nova/.tox/functional/local/lib/python2.7/site-packages/oslo_versionedobjects/fields.py", line 177, in _null
2018-07-13 16:27:07.847800 | ubuntu-xenial | raise ValueError(_("Field `%s' cannot be None") % attr)
2018-07-13 16:27:07.847837 | ubuntu-xenial | ValueError: Field `compute_node_uuid' cannot be None
2018-07-13 16:27:07.847889 | ubuntu-xenial | 2018-07-13 16:27:02,305 ERROR [nova.conductor.manager] Failed to schedule instances
2018-07-13 16:27:07.847917 | ubuntu-xenial | Traceback (most recent call last):
2018-07-13 16:27:07.847967 | ubuntu-xenial | File "nova/conductor/manager.py", line 1189, in schedule_and_build_instances
2018-07-13 16:27:07.848000 | ubuntu-xenial | instance_uuids, return_alternates=True)
2018-07-13 16:27:07.848044 | ubuntu-xenial | File "nova/conductor/manager.py", line 714, in _schedule_instances
2018-07-13 16:27:07.848075 | ubuntu-xenial | return_alternates=return_alternates)
2018-07-13 16:27:07.848113 | ubuntu-xenial | File "nova/scheduler/utils.py", line 840, in wrapped
2018-07-13 16:27:07.848139 | ubuntu-xenial | return func(*args, **kwargs)
2018-07-13 16:27:07.848187 | ubuntu-xenial | File "nova/scheduler/client/__init__.py", line 53, in select_destinations
2018-07-13 16:27:07.848225 | ubuntu-xenial | instance_uuids, return_objects, return_alternates)
2018-07-13 16:27:07.848270 | ubuntu-xenial | File "nova/scheduler/client/__init__.py", line 37, in __run_method
2018-07-13 16:27:07.848318 | ubuntu-xenial | return getattr(self.instance, __name)(*args, **kwargs)
2018-07-13 16:27:07.848368 | ubuntu-xenial | File "nova/scheduler/client/query.py", line 42, in select_destinations
2018-07-13 16:27:07.848406 | ubuntu-xenial | instance_uuids, return_objects, return_alternates)
2018-07-13 16:27:07.848449 | ubuntu-xenial | File "nova/scheduler/rpcapi.py", line 158, in select_destinations
2018-07-13 16:27:07.848491 | ubuntu-xenial | return cctxt.call(ctxt, 'select_destinations', **msg_args)
2018-07-13 16:27:07.848579 | ubuntu-xenial | File "/home/zuul/src/git.openstack.org/openstack/nova/.tox/functional/local/lib/python2.7/site-packages/oslo_messaging/rpc/client.py", line 179, in call
2018-07-13 16:27:07.848600 | ubuntu-xenial | retry=self.retry)
2018-07-13 16:27:07.848688 | ubuntu-xenial | File "/home/zuul/src/git.openstack.org/openstack/nova/.tox/functional/local/lib/python2.7/site-packages/oslo_messaging/transport.py", line 133, in _send
2018-07-13 16:27:07.864805 | ubuntu-xenial | retry=retry)
2018-07-13 16:27:07.864917 | ubuntu-xenial | File "/home/zuul/src/git.openstack.org/openstack/nova/.tox/functional/local/lib/python2.7/site-packages/oslo_messaging/_drivers/impl_fake.py", line 222, in send
2018-07-13 16:27:07.864965 | ubuntu-xenial | return self._send(target, ctxt, message, wait_for_reply, timeout)
2018-07-13 16:27:07.865058 | ubuntu-xenial | File "/home/zuul/src/git.openstack.org/openstack/nova/.tox/functional/local/lib/python2.7/site-packages/oslo_messaging/_drivers/impl_fake.py", line 209, in _send
2018-07-13 16:27:07.865078 | ubuntu-xenial | raise failure
2018-07-13 16:27:07.865115 | ubuntu-xenial | ValueError: Field `compute_node_uuid' cannot be None

Must have some weird issues with multi-cell setup in the test or maybe something with the way the compute services are started?

http://logstash.openstack.org/#dashboard/file/logstash.json?query=message%3A%5C%22Failed%20to%20compute_task_build_instances%3A%20Field%20%60compute_node_uuid'%20cannot%20be%20None%5C%22%20AND%20tags%3A%5C%22console%5C%22&from=7d

Matt Riedemann (mriedem) on 2018-07-13
Changed in nova:
status: New → Confirmed
importance: Undecided → Medium
Matt Riedemann (mriedem) wrote :

According to logstash, this started around July 12.

Matt Riedemann (mriedem) wrote :

https://github.com/openstack/nova/commit/6b6d81cf2b4600bafaca05117c079e8a1d240756 merged on July 13 but I'm not sure what it might have caused for a problem, unless it was just something lurking in the tests and adding another test tipped something over.

Balazs Gibizer (balazs-gibizer) wrote :

From the log [1] it seems that host cell1 and cell2 creation overlaps with the scheduling of the instances [2]. In a run where the test passes the host creation clearly happens before the instance boot. Interestingly the test setUp() asserts the number of hypervisors and that assert passes.

[1] http://logs.openstack.org/44/564444/14/gate/nova-tox-functional/75cad04/job-output.txt.gz#_2018-07-13_16_27_07_833394
[2] http://logs.openstack.org/44/564444/14/gate/nova-tox-functional/75cad04/job-output.txt.gz#_2018-07-13_16_27_07_843004

Matt Riedemann (mriedem) wrote :
Download full text (4.8 KiB)

Potentially related failure:

http://logs.openstack.org/34/585034/20/check/nova-tox-functional/2f1e5d1/job-output.txt.gz#_2018-08-28_22_17_16_476578

But NoValidHost in this case. And I think gibi is onto something here - in this failure, the compute service starts up but the node isn't in the DB before the scheduler starts filtering so we get NoValidHost:

2018-08-28 22:17:16.508812 | ubuntu-xenial | 2018-08-28 22:17:10,004 INFO [nova.service] Starting compute node (version 18.0.0)
2018-08-28 22:17:16.509229 | ubuntu-xenial | 2018-08-28 22:17:10,046 WARNING [nova.compute.manager] No compute node record found for host cell2. If this is the first time this service is starting on this host, then you can ignore this warning.
2018-08-28 22:17:16.509585 | ubuntu-xenial | 2018-08-28 22:17:10,047 WARNING [nova.compute.monitors] Excluding nova.compute.monitors.cpu monitor virt_driver. Not in the list of enabled monitors (CONF.compute_monitors).
2018-08-28 22:17:16.509862 | ubuntu-xenial | 2018-08-28 22:17:10,053 WARNING [nova.compute.resource_tracker] No compute node record for cell2:cell2
2018-08-28 22:17:16.510072 | ubuntu-xenial | 2018-08-28 22:17:10,060 INFO [nova.filters] Filter RetryFilter returned 0 hosts
2018-08-28 22:17:16.510476 | ubuntu-xenial | 2018-08-28 22:17:10,060 INFO [nova.filters] Filtering removed all hosts for the request with instance ID '275c6524-e5d8-4082-b8ff-61ba9f1f58a2'. Filter results: ['RetryFilter: (start: 0, end: 0)']
2018-08-28 22:17:16.510792 | ubuntu-xenial | 2018-08-28 22:17:10,061 INFO [nova.compute.resource_tracker] Compute node record created for cell2:cell2 with uuid: fd8fbe55-302c-4bfc-a9d9-fa636813b377
2018-08-28 22:17:16.510977 | ubuntu-xenial | 2018-08-28 22:17:10,069 ERROR [nova.conductor.manager] Failed to schedule instances
2018-08-28 22:17:16.511074 | ubuntu-xenial | Traceback (most recent call last):
2018-08-28 22:17:16.511248 | ubuntu-xenial | File "nova/conductor/manager.py", line 1206, in schedule_and_build_instances
2018-08-28 22:17:16.511361 | ubuntu-xenial | instance_uuids, return_alternates=True)
2018-08-28 22:17:16.511517 | ubuntu-xenial | File "nova/conductor/manager.py", line 723, in _schedule_instances
2018-08-28 22:17:16.511624 | ubuntu-xenial | return_alternates=return_alternates)
2018-08-28 22:17:16.511755 | ubuntu-xenial | File "nova/scheduler/utils.py", line 907, in wrapped
2018-08-28 22:17:16.511848 | ubuntu-xenial | return func(*args, **kwargs)
2018-08-28 22:17:16.512017 | ubuntu-xenial | File "nova/scheduler/client/__init__.py", line 53, in select_destinations
2018-08-28 22:17:16.512148 | ubuntu-xenial | instance_uuids, return_objects, return_alternates)
2018-08-28 22:17:16.516439 | ubuntu-xenial | File "nova/scheduler/client/__init__.py", line 37, in __run_method
2018-08-28 22:17:16.516634 | ubuntu-xenial | return getattr(self.instance, __name)(*args, **kwargs)
2018-08-28 22:17:16.517085 | ubuntu-xenial | File "nova/scheduler/client/query.py", line 42, in select_destinations
2018-08-28 22:17:16.517225 | ubuntu-xenial | instance_uuids, return_objects, return_alterna...

Read more...

Fix proposed to branch: master
Review: https://review.openstack.org/597571

Changed in nova:
assignee: nobody → Matt Riedemann (mriedem)
status: Confirmed → In Progress

Reviewed: https://review.openstack.org/597571
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=4fe4fbe7a43ac7e971b0aabc8a9079c7701cae55
Submitter: Zuul
Branch: master

commit 4fe4fbe7a43ac7e971b0aabc8a9079c7701cae55
Author: Matt Riedemann <email address hidden>
Date: Wed Aug 29 11:58:12 2018 -0400

    Restart scheduler in TestNovaManagePlacementHealAllocations

    TestNovaManagePlacementHealAllocations tests rely on the
    CachingScheduler specifically because it does not use
    placement and therefore does not create allocations during
    scheduling, which gives us instances that need to heal
    allocations.

    However, we have a race in the test setup where the scheduler
    is started before the compute services. The CachingScheduler
    runs a periodic task on startup to fetch the hosts from the
    DB to initialize it's cache, and then uses that during scheduling
    until the periodic runs again. So what we can hit is the
    scheduler starts, loads an empty cache, then we start the computes
    and try to create an instance but because of the empty cache we
    fail with a NoValidHost error.

    This restarts and resets the CachingScheduler cache *after* we have
    started the computes and asserted they are available in the API.

    Change-Id: I32f607a436e9851a96877123ae3d1fe51f444f73
    Closes-Bug: #1781648

Changed in nova:
status: In Progress → Fix Released

Reviewed: https://review.openstack.org/597606
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=5fe80b6339b8a514bb4c1aee1f2d47cd83be592d
Submitter: Zuul
Branch: master

commit 5fe80b6339b8a514bb4c1aee1f2d47cd83be592d
Author: Matt Riedemann <email address hidden>
Date: Wed Aug 29 13:16:09 2018 -0400

    (Re)start caching scheduler after starting computes in tests

    This moves the scheduler restart logic from the heal_allocations
    tests into a more generic location (like restart_compute_service)
    so that we can re-use it in other functional tests that rely on
    the caching scheduler. There are a couple of other tests that don't
    need to restart the scheduler if we just move the start of the
    scheduler to after we start the computes.

    Change-Id: I7720fe4a3a0e537b7b356947317766597d4b47cf
    Related-Bug: #1781648

Reviewed: https://review.openstack.org/598152
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=49ce59e1119bbb1d9c837dde1e4ab28888c75d58
Submitter: Zuul
Branch: stable/rocky

commit 49ce59e1119bbb1d9c837dde1e4ab28888c75d58
Author: Matt Riedemann <email address hidden>
Date: Wed Aug 29 11:58:12 2018 -0400

    Restart scheduler in TestNovaManagePlacementHealAllocations

    TestNovaManagePlacementHealAllocations tests rely on the
    CachingScheduler specifically because it does not use
    placement and therefore does not create allocations during
    scheduling, which gives us instances that need to heal
    allocations.

    However, we have a race in the test setup where the scheduler
    is started before the compute services. The CachingScheduler
    runs a periodic task on startup to fetch the hosts from the
    DB to initialize it's cache, and then uses that during scheduling
    until the periodic runs again. So what we can hit is the
    scheduler starts, loads an empty cache, then we start the computes
    and try to create an instance but because of the empty cache we
    fail with a NoValidHost error.

    This restarts and resets the CachingScheduler cache *after* we have
    started the computes and asserted they are available in the API.

    Change-Id: I32f607a436e9851a96877123ae3d1fe51f444f73
    Closes-Bug: #1781648
    (cherry picked from commit 4fe4fbe7a43ac7e971b0aabc8a9079c7701cae55)

This issue was fixed in the openstack/nova 18.0.1 release.

This issue was fixed in the openstack/nova 19.0.0.0rc1 release candidate.

To post a comment you must log in.
This report contains Public information  Edit
Everyone can see this information.

Other bug subscribers