scheduler.utils.set_vm_state_and_notify() race between ERROR state save() and compute_utils.add_instance_fault_from_ex()

Bug #2045921 reported by melanie witt
6
This bug affects 1 person
Affects Status Importance Assigned to Milestone
OpenStack Compute (nova)
New
Undecided
Unassigned

Bug Description

Maybe "race" isn't the right word, but the ordering of the Instance.save() of ERROR state and the creation of the instance.fault record can result in the instance.fault not having been created yet after the instance is visibly showing ERROR state.

Seen in the gate:

test_qos_min_bw_allocation_basic fails because the expected 'fault' field is missing in the server
Symptom: test_qos_min_bw_allocation_basic fails because the expected 'fault' field is not found in the server from GET /servers/{server_id}

testr_results.html:

testtools.matchers._impl.MismatchError: 'fault' not in {'id': '19b6c95f-b91b-4949-b72e-3f7fea0d1a49', 'name': 'tempest-MinBwAllocationPlacementTest-server-1096644195', 'status': 'ERROR', 'tenant_id': '1f52110a4e8649d78861c38daca6f179', 'user_id': '8ef9e2bc05034b03af2d7323155cb71f', 'metadata': {}, 'hostId': '', 'image': {'id': '3cb38f9c-a86e-47c8-984f-74efc924120c', 'links': [{'rel': 'bookmark', 'href': 'https://199.19.213.27/compute/images/3cb38f9c-a86e-47c8-984f-74efc924120c'}]}, 'flavor': {'vcpus': 1, 'ram': 128, 'disk': 1, 'ephemeral': 0, 'swap': 0, 'original_name': 'm1.nano', 'extra_specs': {'hw_rng:allowed': 'True'}}, 'created': '2023-12-07T15:00:24Z', 'updated': '2023-12-07T15:00:30Z', 'addresses': {}, 'accessIPv4': '', 'accessIPv6': '', 'links': [{'rel': 'self', 'href': 'https://199.19.213.27/compute/v2.1/servers/19b6c95f-b91b-4949-b72e-3f7fea0d1a49'}, {'rel': 'bookmark', 'href': 'https://199.19.213.27/compute/servers/19b6c95f-b91b-4949-b72e-3f7fea0d1a49'}], 'OS-DCF:diskConfig': 'MANUAL', 'OS-EXT-AZ:availability_zone': '', 'config_drive': '', 'key_name': None, 'OS-SRV-USG:launched_at': None, 'OS-SRV-USG:terminated_at': None, 'OS-EXT-STS:task_state': None, 'OS-EXT-STS:vm_state': 'error', 'OS-EXT-STS:power_state': 0, 'os-extended-volumes:volumes_attached': [], 'locked': False, 'description': None, 'tags': [], 'trusted_image_certificates': None, 'server_groups': []}

screen-placement-api.txt:

found no providers with 2147483647 NET_BW_IGR_KILOBIT_PER_SEC
this ^ is expected for this part of the test

OpenSearch query:

message:"testtools.matchers._impl.MismatchError: 'fault' not in"

Comments:

This may be a race because ERROR state is set on the instance and save()'ed before the 'fault' record is created
https://github.com/openstack/nova/blob/302e286408cce2c8df43d6742ca490405a20011d/nova/scheduler/utils.py#L902-L910
and the test waits for ERROR state before checking for the 'fault' field, so maybe sometimes it GETs the instance before the fault was able to be added
https://github.com/openstack/tempest/blob/a0b161bbde6d7734833a26ced76ca44b888fe152/tempest/scenario/test_network_qos_placement.py#L269-L276

Code:

    vm_state = updates['vm_state']
    properties = request_spec.get('instance_properties', {})
    notifier = rpc.get_notifier(service)
    state = vm_state.upper()
    LOG.warning('Setting instance to %s state.', state,
                instance_uuid=instance_uuid)

    instance = objects.Instance(context=context, uuid=instance_uuid,
                                **updates)
    instance.obj_reset_changes(['uuid'])
    instance.save()
    compute_utils.add_instance_fault_from_exc(
        context, instance, ex, sys.exc_info())

I wonder if it would be legit to swap the ordering to do compute_utils.add_instance_fault_from_exc() before instance.save() of ERROR state?

Tags: gate-failure
melanie witt (melwitt)
description: updated
To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.