evacuation failed: Port update failed : Unable to correlate PCI slot

Bug #1896463 reported by Wonil Choi
This bug affects 1 person
Affects                    Status        Importance  Assigned to      Milestone
OpenStack Compute (nova)   Fix Released  Low         Balazs Gibizer
Queens                     Triaged       Low         Unassigned
Rocky                      In Progress   Low         Unassigned
Stein                      Triaged       Low         Unassigned
Train                      Fix Released  Low         Unassigned
Ussuri                     Fix Released  Low         Balazs Gibizer
Victoria                   Fix Released  Low         Balazs Gibizer

Bug Description

Description
===========
If the resource tracker's _update_available_resource() runs between _do_rebuild_instance_with_claim() and instance.save() while a VM instance is being evacuated to the destination host (see the snippet below),

nova/compute/manager.py

2931 def rebuild_instance(self, context, instance, orig_image_ref, image_ref,
2932 +-- 84 lines: injected_files, new_pass, orig_sys_metadata,-------------------------------------------------------------------
3016 claim_ctxt = rebuild_claim(
3017 context, instance, scheduled_node,
3018 limits=limits, image_meta=image_meta,
3019 migration=migration)
3020 self._do_rebuild_instance_with_claim(
3021 +-- 47 lines: claim_ctxt, context, instance, orig_image_ref,-----------------------------------------------------------------
3068 instance.apply_migration_context()
3069 # NOTE (ndipanov): This save will now update the host and node
3070 # attributes making sure that next RT pass is consistent since
3071 # it will be based on the instance and not the migration DB
3072 # entry.
3073 instance.host = self.host
3074 instance.node = scheduled_node
3075 instance.save()
3076 instance.drop_migration_context()

then the instance is not yet handled as an instance managed by the destination host, because its host field has not been updated in the DB yet.

2020-09-19 07:27:36.321 8 WARNING nova.compute.resource_tracker [req-b35d5b9a-0786-4809-bd81-ad306cdda8d5 - - - - -] Instance 22f6ca0e-f964-4467-83a3-f2bf12bb05ae is not being actively managed by this compute host but has allocations referencing this compute host: {u'resources': {u'MEMORY_MB': 12288, u'VCPU': 2, u'DISK_GB': 10}}. Skipping heal of allocation because we do not know what to do.

As a result, the SRIOV ports (PCI devices) were freed by clean_usage() even though the VM already had the VF ports attached.

 743 def _update_available_resource(self, context, resources):
 744 +-- 45 lines: # initialize the compute node object, creating it--------------------------------------------------------------
 789 self.pci_tracker.clean_usage(instances, migrations, orphans)
 790 dev_pools_obj = self.pci_tracker.stats.to_device_pools_obj()

After that, when this VM was evacuated to another compute host again, we got the error below.
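
To summarize the interleaving described above, here is a rough timeline of the race. The function names are the nova ones quoted in this report; the timeline itself is only an illustration, not nova code:

# rebuild_instance() on the destination     # _update_available_resource() periodic
# -------------------------------------     # --------------------------------------
# rebuild_claim()            # VF claimed
# _do_rebuild_instance_with_claim()
#                                             instances = InstanceList.get_by_host_and_node(...)
#                                             # instance.host still points at the source host,
#                                             # so the evacuated instance is NOT in this list
# instance.host = self.host
# instance.node = scheduled_node
# instance.save()
# self._set_migration_status(migration, 'done')
#                                             migrations = MigrationList.get_in_progress_by_host_and_node(...)
#                                             # the migration is already 'done', so it is
#                                             # not returned either
#                                             self.pci_tracker.clean_usage(instances, migrations, orphans)
#                                             # -> the VF still used by the VM is freed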

Steps to reproduce
==================
1. create a VM on com1 with SRIOV VF ports.
2. stop and disable nova-compute service on com1
3. wait 60 sec (nova-compute reporting interval)
4. evacuate the VM to com2
5. wait the VM is active on com2
6. enable and start nova-compute on com1
7. wait 60 sec (nova-compute reporting interval)
8. stop and disable nova-compute service on com2
9. wait 60 sec (nova-compute reporting interval)
10. evacuate the VM to com1
11. wait the VM is active on com1
12. enable and start nova-compute on com2
13. wait 60 sec (nova-compute reporting interval)
14. go to step 2.

Expected result
===============
Evacuation should be done without errors.

Actual result
=============
Evacuation failed with "Port update failed"

Environment
===========
openstack-nova-compute-18.0.1-1 is used with SRIOV ports; libvirt is the virt driver.

Logs & Configs
==============
2020-09-19 07:34:22.670 8 ERROR nova.compute.manager [req-38dd0be2-7223-4a59-8073-dd1b072125c5 c424fbb3d41f444bb7a025266fda36da 6255a6910b9b4d3ba34a93624fe7fb22 - default default] [instance: 22f6ca0e-f964-4467-83a3-f2bf12bb05ae] Setting instance vm_state to ERROR: PortUpdateFailed: Port update failed for port 76dc33dc-5b3b-4c45-b2cb-fd59025a4dbd: Unable to correlate PCI slot 0000:05:12.2
2020-09-19 07:34:22.670 8 ERROR nova.compute.manager [instance: 22f6ca0e-f964-4467-83a3-f2bf12bb05ae] Traceback (most recent call last):
2020-09-19 07:34:22.670 8 ERROR nova.compute.manager [instance: 22f6ca0e-f964-4467-83a3-f2bf12bb05ae] File "/usr/lib/python2.7/site-packages/nova/compute/manager.py", line 7993, in _error_out_instance_on_exception
2020-09-19 07:34:22.670 8 ERROR nova.compute.manager [instance: 22f6ca0e-f964-4467-83a3-f2bf12bb05ae] yield
2020-09-19 07:34:22.670 8 ERROR nova.compute.manager [instance: 22f6ca0e-f964-4467-83a3-f2bf12bb05ae] File "/usr/lib/python2.7/site-packages/nova/compute/manager.py", line 3025, in rebuild_instance
2020-09-19 07:34:22.670 8 ERROR nova.compute.manager [instance: 22f6ca0e-f964-4467-83a3-f2bf12bb05ae] migration, request_spec)
2020-09-19 07:34:22.670 8 ERROR nova.compute.manager [instance: 22f6ca0e-f964-4467-83a3-f2bf12bb05ae] File "/usr/lib/python2.7/site-packages/nova/compute/manager.py", line 3087, in _do_rebuild_instance_with_claim
2020-09-19 07:34:22.670 8 ERROR nova.compute.manager [instance: 22f6ca0e-f964-4467-83a3-f2bf12bb05ae] self._do_rebuild_instance(*args, **kwargs)
2020-09-19 07:34:22.670 8 ERROR nova.compute.manager [instance: 22f6ca0e-f964-4467-83a3-f2bf12bb05ae] File "/usr/lib/python2.7/site-packages/nova/compute/manager.py", line 3190, in _do_rebuild_instance
2020-09-19 07:34:22.670 8 ERROR nova.compute.manager [instance: 22f6ca0e-f964-4467-83a3-f2bf12bb05ae] context, instance, self.host, migration)
2020-09-19 07:34:22.670 8 ERROR nova.compute.manager [instance: 22f6ca0e-f964-4467-83a3-f2bf12bb05ae] File "/usr/lib/python2.7/site-packages/nova/network/neutronv2/api.py", line 2953, in setup_instance_network_on_host
2020-09-19 07:34:22.670 8 ERROR nova.compute.manager [instance: 22f6ca0e-f964-4467-83a3-f2bf12bb05ae] migration)
2020-09-19 07:34:22.670 8 ERROR nova.compute.manager [instance: 22f6ca0e-f964-4467-83a3-f2bf12bb05ae] File "/usr/lib/python2.7/site-packages/nova/network/neutronv2/api.py", line 3058, in _update_port_binding_for_instance
2020-09-19 07:34:22.670 8 ERROR nova.compute.manager [instance: 22f6ca0e-f964-4467-83a3-f2bf12bb05ae] pci_slot)
2020-09-19 07:34:22.670 8 ERROR nova.compute.manager [instance: 22f6ca0e-f964-4467-83a3-f2bf12bb05ae] PortUpdateFailed: Port update failed for port 76dc33dc-5b3b-4c45-b2cb-fd59025a4dbd: Unable to correlate PCI slot 0000:05:12.2
2020-09-19 07:34:22.670 8 ERROR nova.compute.manager [instance: 22f6ca0e-f964-4467-83a3-f2bf12bb05ae]

Revision history for this message
Wonil Choi (wonil22) wrote :

Before the instance is saved with the new host (destination host) in rebuild_instance(),

2931 def rebuild_instance(self, context, instance, orig_image_ref, image_ref,
2932 +--136 lines: injected_files, new_pass, orig_sys_metadata,-------------------------------------------------------------------
3068 instance.apply_migration_context()
3069 # NOTE (ndipanov): This save will now update the host and node
3070 # attributes making sure that next RT pass is consistent since
3071 # it will be based on the instance and not the migration DB
3072 # entry.
3073 instance.host = self.host
3074 instance.node = scheduled_node
3075 instance.save()
3076 instance.drop_migration_context()
3077
3078 # NOTE (ndipanov): Mark the migration as done only after we
3079 # mark the instance as belonging to this host.
3080 self._set_migration_status(migration, 'done')

the resource tracker (RT) may fetch the instances via get_by_host_and_node():

 743 def _update_available_resource(self, context, resources):
 744 +-- 12 lines: # initialize the compute node object, creating it--------------------------------------------------------------
 756 # Grab all instances assigned to this node:
 757 instances = objects.InstanceList.get_by_host_and_node(

and after the instance host (destination host) is saved and the migration status is set to 'done', the RT fetches the migrations:

 767 # Grab all in-progress migrations:
 768 migrations = objects.MigrationList.get_in_progress_by_host_and_node(
 769 context, self.host, nodename)
 770

In this situation the PCI devices are freed by clean_usage():

 789 self.pci_tracker.clean_usage(instances, migrations, orphans)
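
For context, clean_usage() frees every claimed or allocated PCI device whose owning instance is neither in the instances list nor referenced by an in-progress migration (nor an orphan). A simplified sketch of that behaviour (paraphrased from nova/pci/manager.py, not the exact code):

def clean_usage(self, instances, migrations, orphans):
    # UUIDs that are still allowed to hold PCI devices on this host.
    existed = set(inst['uuid'] for inst in instances)
    existed |= set(mig['instance_uuid'] for mig in migrations)
    existed |= set(inst['uuid'] for inst in orphans)

    # Everything owned by a UUID outside that set is released. In this race
    # the evacuated instance is in neither list, so its VF is freed here.
    for uuid in list(self.claims):
        if uuid not in existed:
            for dev in self.claims.pop(uuid, []):
                self._free_device(dev)
    for uuid in list(self.allocations):
        if uuid not in existed:
            for dev in self.allocations.pop(uuid, []):
                self._free_device(dev)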

tags: added: evacuate pci resource-tracker
Revision history for this message
Balazs Gibizer (balazs-gibizer) wrote :

I was not able to reproduce this issue on master in a multinode devstack.

Could you please attach the debug logs from the two nova-compute services from the reproduction?

How frequently could the problem be reproduced? Once in 10 attempts or 100 attempts?

Nova does consider ongoing migrations even if the instance is not assigned to the given host[1]. During evacuation we do have a Migration object[2]. This object is created at [3]. At that point the destination is only set if the evacuation is forced to a target host.

Later, during rebuild_instance, there is a claim. That claim updates the migration object[4][5]. Also, _move_claim[6] first updates the migration object and then claims the PCI devices. So I don't see how this can race with the _update_available_resource() run.

I am marking this bug Incomplete as I don't have a reproduction and I don't see how the described race happens. If you have any more input or ideas, please set it back to New.

[1] https://github.com/openstack/nova/blob/5645f75d6b3adac00f6be8e0eae4565c4eb2ab5d/nova/pci/manager.py#L346-L353
[2] https://github.com/openstack/nova/blob/5645f75d6b3adac00f6be8e0eae4565c4eb2ab5d/nova/compute/manager.py#L3135
[3] https://github.com/openstack/nova/blob/b5330a97ae6f874b75fa9588ae122a1d1e827035/nova/compute/api.py#L5206-L5209
[4] https://github.com/openstack/nova/blob/5645f75d6b3adac00f6be8e0eae4565c4eb2ab5d/nova/compute/resource_tracker.py#L282
[5] https://github.com/openstack/nova/blob/5645f75d6b3adac00f6be8e0eae4565c4eb2ab5d/nova/compute/resource_tracker.py#L393
[6] https://github.com/openstack/nova/blob/5645f75d6b3adac00f6be8e0eae4565c4eb2ab5d/nova/compute/resource_tracker.py#L282-L322

Changed in nova:
status: New → Incomplete
Revision history for this message
Minho Ban (mhban) wrote :

> How frequently could the problem be reproduced? Once in 10 attempts or 100 attempts?

Very rarely, because the racy section is short.

> Also the _move_claim[6] first updates the migration object and then claims the pci devices. So I don't see how this can race with the _update_available_resource() run.

The problem is not in the claim step but in the free step. Currently the clean_usage() path is not protected against the lines below:

3075 instance.save()
3076 instance.drop_migration_context()

Revision history for this message
Wonil Choi (wonil22) wrote :

Reproduction ratio: about 1/1000 to 1/5000; I actually tried to reproduce this by evacuating 20 VMs on a host.
Please review the following sequence.

1. (manager) Evacuation is requested, move_claim() is called, and the instance is spawned on the destination host[1].
2020-09-22 19:16:48.440 8 INFO nova.virt.libvirt.driver [-] [instance: d74302ce-5c79-43f0-a035-98449a6aa62b] Instance spawned successfully.

2. (RT) _update_available_resource() is entered and the instances are fetched from the DB[2]. The instance's host has not been changed to the destination yet, so get_by_host_and_node()[2] did not include the evacuated instance (i.e. d74302ce-5c79-43f0-a035-98449a6aa62b).
2020-09-22 19:16:48.765 8 DEBUG oslo_concurrency.lockutils [req-0689f6b0-8e5b-41e3-a56a-d5439474ead4 - - - - -] Lock "compute_resources" acquired by "nova.compute.resource_tracker._update_available_resource" :: waited 0.000s inner /usr/lib/python2.7/site-packages/oslo_concurrency/lockutils.py:273
...
2020-09-22 19:16:48.960 8 INFO oslo.messaging.notification.compute.instance.rebuild.end [req-eaa15672-5559-4bb0-9a02-aad134ccd60b c424fbb3d41f444bb7a025266fda36da 6255a6910b9b4d3ba34a93624fe7fb22 - default default] {"event_type": "compute.instance.rebuild.end", "timestamp": "2020-09-22 10:16:48.960283", "payload": {"state_description": "", "availability_zone": "nova", "terminated_at": "", "ephemeral_gb": 0, "instance_type_id": 280, "deleted_at": "", "reservation_id": "r-vwlai4vk", "memory_mb": 10240, "display_name": "VIDA_VM3", "fixed_ips": [{"version": 4, "vif_mac": "fa:16:3e:87:c4:46", "floating_ips": [], "label": "sriov-network", "meta": {}, "address": "100.90.80.18", "type": "fixed"}, {"version": 4, "vif_mac": "fa:16:3e:86:76:c2", "floating_ips": [], "label": "sriov-network", "meta": {}, "address": "100.90.80.69", "type": "fixed"}, {"version": 4, "vif_mac": "fa:16:3e:bb:88:c6", "floating_ips": [], "label": "sriov-network", "meta": {}, "address": "100.90.80.37", "type": "fixed"}], "hostname": "vida-vm3", "state": "active", "progress": "", "launched_at": "2020-09-22T10:16:48.508530", "metadata": {}, "node": "com11", "ramdisk_id": "", "access_ip_v6": null, "disk_gb": 10, "access_ip_v4": null, "kernel_id": "", "image_name": "rhel7.4", "host": "com11", "user_id": "c424fbb3d41f444bb7a025266fda36da", "image_ref_url": "http://114.128.0.211:9292/images/28a91968-9df2-4b02-8212-78a86a56e353", "cell_name": "", "root_gb": 10, "tenant_id": "6255a6910b9b4d3ba34a93624fe7fb22", "created_at": "2020-09-22 06:51:54+00:00", "instance_id": "d74302ce-5c79-43f0-a035-98449a6aa62b", "instance_type": "VIDA_flavor", "vcpus": 2, "image_meta": {"min_disk": "10", "container_format": "bare", "min_ram": "0", "disk_format": "raw", "base_image_ref": "28a91968-9df2-4b02-8212-78a86a56e353"}, "architecture": null, "os_type": null, "instance_flavor_id": "302"}, "priority": "INFO", "publisher_id": "compute.com10", "message_id": "3e96ce39-fd53-4324-be64-b3a822db5215"}

3. (manager) The instance is saved with the destination host and the migration status is set to 'done'[3].
(I added temporary logging at the end of rebuild_instance()[4].)
2020-09-22 19:16:49.349 8 DEBUG nova.compute.manager [req-eaa15672-5559-4bb0-9a02-aad134ccd60b c424fbb3d41f444bb7a025266fda36da 6255a6910b9b4d3ba34a93624fe7fb22 - def...


Wonil Choi (wonil22)
Changed in nova:
status: Incomplete → New
Revision history for this message
Wonil Choi (wonil22) wrote :

I can reproduce this easily by adding a 10-second delay between InstanceList.get_by_host_and_node() and MigrationList.get_in_progress_by_host_and_node() [1].

[1] https://github.com/openstack/nova/blob/5645f75d6b3adac00f6be8e0eae4565c4eb2ab5d/nova/compute/resource_tracker.py#L766
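
In other words, the window between the two DB reads in _update_available_resource() is held open, roughly like this (a hypothetical local edit for debugging only, paraphrasing the resource tracker code quoted earlier in this report; argument lists are shortened):

    import time

    # Grab all instances assigned to this node; the evacuated instance is
    # missing because its instance.host still points at the source host.
    instances = objects.InstanceList.get_by_host_and_node(
        context, self.host, nodename)

    time.sleep(10)  # debugging only: keep the window open so that
                    # rebuild_instance() can save instance.host and mark
                    # the migration 'done' in the meantime

    # Grab all in-progress migrations; the evacuation migration is already
    # 'done' by now, so it is not returned either.
    migrations = objects.MigrationList.get_in_progress_by_host_and_node(
        context, self.host, nodename)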

Revision history for this message
Balazs Gibizer (balazs-gibizer) wrote :

Thanks for the additional information. I will try the reproduction with the 10-second delay.

Revision history for this message
Balazs Gibizer (balazs-gibizer) wrote :

OK, I was able to reproduce the bug. Setting this to Confirmed. I will push a functional test that shows the problem soon.

Changed in nova:
status: New → Confirmed
assignee: nobody → Balazs Gibizer (balazs-gibizer)
importance: Undecided → Low
Revision history for this message
Wonil Choi (wonil22) wrote :

After the evacuation fails, the instance cannot be evacuated again. I think this is a critical situation for the evacuation scenario even though the reproduction ratio is low.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to nova (master)

Related fix proposed to branch: master
Review: https://review.opendev.org/754100

Revision history for this message
Balazs Gibizer (balazs-gibizer) wrote :

I've pushed a reproducer, see comment #9.

Revision history for this message
sean mooney (sean-k-mooney) wrote :

Just adding the previously filed downstream Red Hat bug:
https://bugzilla.redhat.com/show_bug.cgi?id=1852110

For context, this can happen in Queens, so once we root-cause the issue and fix it, the fix should likely be backported to Queens. There are other older bugs from Newton that look similar, related to unshelve, so it is possible that the same issue affects multiple move operations.

Revision history for this message
sean mooney (sean-k-mooney) wrote :

I have tagged this against the older releases and set the same importance. I'm not sure we will backport that far, so we might set those older branches to Won't Fix at some point, but we can do that later.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (master)

Fix proposed to branch: master
Review: https://review.opendev.org/754815

Changed in nova:
status: Confirmed → In Progress
Revision history for this message
Balazs Gibizer (balazs-gibizer) wrote :

@Wonil: could you try the fix I proposed[1] in your env? I think it should be fairly simple to backport it to the Rocky version.

[1] https://review.opendev.org/754815

Revision history for this message
Balazs Gibizer (balazs-gibizer) wrote :

I did some analysis about the other move operations to see if they are affected by the same issue.

During cold migrate and resize the instance.host is set to the dest host when the instance reaches the VERIFY_RESIZE state. At this point the migration is set to 'finished', but 'finished' is not an end state of the migration; a migration in the 'finished' state is still considered in progress by the resource tracker. Later, when the resize is confirmed (or reverted), the migration status is set to 'confirmed' ('reverted'), which is an end state. So a simple delay in the resource tracker is not enough to trigger the race: the delay would have to be long enough for the user to actually confirm the migration by hand, or the automatic confirmation task would have to be configured and actually run. This makes the race pretty unlikely.

In the case of live migration the process is a lot more complex. There are multiple steps in the live migration process that grab the COMPUTE_RESOURCE_SEMAPHORE in the resource tracker (e.g. the PCI claim), so simply starting and then slowing down the update_available_resources task in the resource tracker does not work as a reproduction. I gave up on trying to reproduce the same problem, but I'm not convinced that such a fault does not exist for live migration. A reproduction would need to trigger the periodic task after the last COMPUTE_RESOURCE_SEMAPHORE-grabbing live migration step but before the live migration finishes.

The unshelve (after shelve offload) is not affected by the bug as it uses the same instance_claim as the boot process. The instance_claim sets instance.host under the COMPUTE_RESOURCE_SEMAPHORE lock, so there is no chance of an overlapping update_available_resource periodic run.

Revision history for this message
Wonil Choi (wonil22) wrote :

@Balazs, Sure I will try the fix and share the result.

Revision history for this message
Wonil Choi (wonil22) wrote :

I cannot reproduce it with the patch backported to my Rocky env. Thanks.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (stable/rocky)

Fix proposed to branch: stable/rocky
Review: https://review.opendev.org/756421

Revision history for this message
Balazs Gibizer (balazs-gibizer) wrote :

@Wonil: thanks for the feedback

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to nova (master)

Reviewed: https://review.opendev.org/754100
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=3f348602ae4a40c52c7135b2cb48deaa6052c488
Submitter: Zuul
Branch: master

commit 3f348602ae4a40c52c7135b2cb48deaa6052c488
Author: Balazs Gibizer <email address hidden>
Date: Thu Sep 24 15:04:21 2020 +0200

    Reproduce bug 1896463 in func env

    There is a race condition between the rebuild and the
    _update_available_resource periodic task on the compute. This patch adds
    a reproducer functional test. Unfortunately it needs some injected sleep
    to make the race happen in a stable way. This is suboptimal but only
    adds 3 seconds of slowness to the test execution.

    Change-Id: Id0577bceed9808b52da4acc352cf9c204f6c8861
    Related-Bug: #1896463

Changed in nova:
status: In Progress → Fix Released
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to nova (master)

Reviewed: https://review.opendev.org/754815
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=7675964af81472f3dd57db952d704539e61a3e6e
Submitter: Zuul
Branch: master

commit 7675964af81472f3dd57db952d704539e61a3e6e
Author: Balazs Gibizer <email address hidden>
Date: Mon Sep 28 18:44:02 2020 +0200

    Set instance host and drop migration under lock

    The _update_available_resources periodic makes resource allocation
    adjustments while holding the COMPUTE_RESOURCE_SEMAPHORE based on the
    list of instances assigned to this host of the resource tracker and
    based on the migrations where the source or the target host is the host
    of the resource tracker. So if the instance.host or the migration
    context changes without holding the COMPUTE_RESOURCE_SEMAPHORE while
    the _update_available_resources task is running, there will be data
    inconsistency in the resource tracker.

    This patch makes sure that during evacuation the instance.host and the
    migration context is changed while holding the semaphore.

    Change-Id: Ica180165184b319651d22fe77e076af036228860
    Closes-Bug: #1896463
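
A minimal sketch of the approach the commit message describes (not the exact nova change): the instance.host update, the migration context drop and the migration status change are moved into one resource tracker method that holds the same lock as the periodic task. The method name finish_evacuation and the fair=True detail are taken from the stable/train backport note later in this report; the body below is paraphrased:

from nova import utils

COMPUTE_RESOURCE_SEMAPHORE = 'compute_resources'

class ResourceTracker(object):

    @utils.synchronized(COMPUTE_RESOURCE_SEMAPHORE, fair=True)
    def finish_evacuation(self, instance, node, migration):
        # Holding the same semaphore as _update_available_resource() means
        # the periodic task can never observe the half-updated state
        # (instance still on the old host but migration already 'done')
        # that made clean_usage() free the instance's PCI devices.
        instance.apply_migration_context()
        instance.host = self.host
        instance.node = node
        instance.save()
        instance.drop_migration_context()
        if migration:
            migration.status = 'done'
            migration.save()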

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to nova (stable/victoria)

Related fix proposed to branch: stable/victoria
Review: https://review.opendev.org/763152

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (stable/victoria)

Fix proposed to branch: stable/victoria
Review: https://review.opendev.org/763154

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/nova 22.1.0

This issue was fixed in the openstack/nova 22.1.0 release.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/nova 21.2.0

This issue was fixed in the openstack/nova 21.2.0 release.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/nova 23.0.0.0rc1

This issue was fixed in the openstack/nova 23.0.0.0rc1 release candidate.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to nova (stable/train)

Related fix proposed to branch: stable/train
Review: https://review.opendev.org/c/openstack/nova/+/841288

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (stable/train)

Fix proposed to branch: stable/train
Review: https://review.opendev.org/c/openstack/nova/+/841444

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on nova (stable/rocky)

Change abandoned by "Elod Illes <email address hidden>" on branch: stable/rocky
Review: https://review.opendev.org/c/openstack/nova/+/756421
Reason: This branch transitioned to End of Life for this project, open patches needs to be closed to be able to delete the branch.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to nova (stable/train)

Reviewed: https://review.opendev.org/c/openstack/nova/+/841288
Committed: https://opendev.org/openstack/nova/commit/e558aa2046b649696b69551657e5d429e8117427
Submitter: "Zuul (22348)"
Branch: stable/train

commit e558aa2046b649696b69551657e5d429e8117427
Author: Balazs Gibizer <email address hidden>
Date: Thu Sep 24 15:04:21 2020 +0200

    Reproduce bug 1896463 in func env

    There is a race condition between the rebuild and the
    _update_available_resource periodic task on the compute. This patch adds
    a reproducer functional test. Unfortunately it needs some injected sleep
    to make the race happen in a stable way. This is suboptimal but only
    adds 3 seconds of slowness to the test execution.

    Conflicts:
      nova/tests/functional/regressions/test_bug_1896463.py due to
      I84c58de90dad6d86271767363aef90ddac0f1730 and
      I8c96b337f32148f8f5899c9b87af331b1fa41424 is not in stable/train.
      Also I had to make the test code python 2 compatible.

    Change-Id: Id0577bceed9808b52da4acc352cf9c204f6c8861
    Related-Bug: #1896463
    (cherry picked from commit 3f348602ae4a40c52c7135b2cb48deaa6052c488)
    (cherry picked from commit d768cdbb88d0b0b3ca38623c4bb26d5eabdf1596)
    (cherry picked from commit 02114a9d7f2e8b62d3a7091ca3fde251dfffa860)

tags: added: in-stable-train
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to nova (stable/train)

Reviewed: https://review.opendev.org/c/openstack/nova/+/841444
Committed: https://opendev.org/openstack/nova/commit/d5506014837097261231b205e9b3acbfa503b354
Submitter: "Zuul (22348)"
Branch: stable/train

commit d5506014837097261231b205e9b3acbfa503b354
Author: Balazs Gibizer <email address hidden>
Date: Mon Sep 28 18:44:02 2020 +0200

    Set instance host and drop migration under lock

    The _update_available_resources periodic makes resource allocation
    adjustments while holding the COMPUTE_RESOURCE_SEMAPHORE based on the
    list of instances assigned to this host of the resource tracker and
    based on the migrations where the source or the target host is the host
    of the resource tracker. So if the instance.host or the migration
    context changes without holding the COMPUTE_RESOURCE_SEMAPHORE while
    the _update_available_resources task is running, there will be data
    inconsistency in the resource tracker.

    This patch makes sure that during evacuation the instance.host and the
    migration context is changed while holding the semaphore.

    stable/train specific change: fair locking was introduced from ussuri
    forward. See Ia5e521e0f0c7a78b5ace5de9f343e84d872553f9
    So the train backport does not add fair=True to the synchronized
    decorator of finish_evacuation as that would interfere with the other
    functions taking the same lock without fair=True.

    Change-Id: Ica180165184b319651d22fe77e076af036228860
    Closes-Bug: #1896463
    (cherry picked from commit 7675964af81472f3dd57db952d704539e61a3e6e)
    (cherry picked from commit 3ecc098d28addd8f3e1da16b29940f4337209e62)
    (cherry picked from commit 3b497ad7d41bc84ec50124c4d55f4138c4878751)

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/nova train-eol

This issue was fixed in the openstack/nova train-eol release.
