live migration of instance should claim resources on target compute node

Bug #1289064 reported by Chris Friesen on 2014-03-06
This bug affects 11 people
Affects: OpenStack Compute (nova); Importance: Medium; Assigned to: Artom Lifshitz
Affects: Train; Importance: Undecided; Assigned to: Unassigned

Bug Description

I'm looking at the current Icehouse code, but this applies to previous versions as well.

When we create a new instance via _build_instance() or _build_and_run_instance(), in both cases we call instance_claim() to test for resources and reserve them.

During a cold migration we call prep_resize() which calls resize_claim() to reserve resources.

However, when we live-migrate or evacuate an instance we don't do this. As far as I can see, the current code just spawns the new instance, and the resource usage isn't updated until the audit runs at some unknown later time. Only then does the audit add the new instance to self.tracked_instances and update the resource usage.

This means that until the audit runs the scheduler has a stale view of system resources.
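
To illustrate the window described above, here is a minimal toy model. It is a sketch only: `ToyTracker` and its methods are hypothetical names invented for this example and are not nova's resource tracker API. It assumes a single resource (memory) and shows that a spawn without a claim leaves the reported free memory stale until the audit runs.

```python
# Toy model only: ToyTracker and its methods are hypothetical and are not
# nova's resource tracker API.
class ToyTracker:
    def __init__(self, total_mb):
        self.total_mb = total_mb
        self.tracked = {}   # instance id -> MB the tracker has accounted for
        self.running = {}   # instance id -> MB actually in use on the host

    @property
    def free_mb(self):
        # The value a scheduler would see as free on this host.
        return self.total_mb - sum(self.tracked.values())

    def claim(self, instance_id, mb):
        # Test for resources and reserve them before anything is spawned.
        if mb > self.free_mb:
            raise RuntimeError("insufficient memory for %s" % instance_id)
        self.tracked[instance_id] = mb

    def spawn(self, instance_id, mb):
        # Start the instance on the host; the accounting is not touched.
        self.running[instance_id] = mb

    def audit(self):
        # Resynchronise the accounting with what is really running.
        self.tracked = dict(self.running)


host = ToyTracker(total_mb=4096)

# Boot path: claim first, then spawn.
host.claim("a", 2048)
host.spawn("a", 2048)
free_after_boot = host.free_mb

# Live-migration path as described in this report: spawn with no claim.
host.spawn("b", 2048)
free_before_audit = host.free_mb  # stale: instance "b" is not counted yet
host.audit()
free_after_audit = host.free_mb   # corrected only once the audit has run
```

Between the unclaimed spawn and the audit, the toy host still advertises 2048 MB free although none is left, so a scheduler consulting it during that window could overcommit the host.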

Michael Still (mikal) on 2014-03-07
tags: added: compute
Changed in nova:
status: New → Triaged
importance: Undecided → Medium
Rohan (kanaderohan) on 2014-03-07
Changed in nova:
assignee: nobody → Rohan (kanaderohan)
Chris Friesen (cbf123) on 2014-03-11
Changed in nova:
assignee: Rohan (kanaderohan) → Chris Friesen (cbf123)

Fix proposed to branch: master
Review: https://review.openstack.org/79806

Changed in nova:
status: Triaged → In Progress
Sean Dague (sdague) wrote :

The upstream patch is stalled. New owner welcomed.

Changed in nova:
assignee: Chris Friesen (cbf123) → nobody
status: In Progress → Confirmed

Change abandoned by Sean Dague (<email address hidden>) on branch: master
Review: https://review.openstack.org/79806
Reason: This review is > 4 weeks without comment, and failed Jenkins the last time it was checked. We are abandoning this for now. Feel free to reactivate the review by pressing the restore button and leaving a 'recheck' comment to get fresh test results.

Fix proposed to branch: master
Review: https://review.openstack.org/142001

Changed in nova:
assignee: nobody → jichenjc (jichenjc)
status: Confirmed → In Progress

Change abandoned by jichenjc (<email address hidden>) on branch: master
Review: https://review.openstack.org/142001
Reason: wrong direction

Fix proposed to branch: master
Review: https://review.openstack.org/142740

Changed in nova:
assignee: jichenjc (jichenjc) → Alex Xu (xuhj)

Reviewed: https://review.openstack.org/142739
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=737fb8e7a7db775e937fe8b8a5f0ca148e1641be
Submitter: Jenkins
Branch: master

commit 737fb8e7a7db775e937fe8b8a5f0ca148e1641be
Author: jichenjc <email address hidden>
Date: Thu Dec 18 18:23:57 2014 +0800

    Enhance EvacuateHostTestCase test cases

    Currently even if the EvacuateHostTestCase test cases pass,
    there are some error log in the logs, it may lead to
    confusion when debug the problem, and more important,
    it will fail if the 'node' variable is used in the
    compute layer code since the 'node' is None and the
    cases will fail.
    Use stub by purpose because don't want to change current
    test structure.

    2014-12-18 18:20:23,694 ERROR [nova.compute.manager] Failed to get compute_info for fake-mini
    Traceback (most recent call last):
      File "/home/jichen/git/nova/nova/compute/manager.py", line 2797, in rebuild_instance
        compute_node = self._get_compute_info(context, self.host)
      File "/home/jichen/git/nova/nova/compute/manager.py", line 4859, in _get_compute_info
        service = objects.Service.get_by_compute_host(context, host)
      File "/home/jichen/git/nova/nova/objects/base.py", line 156, in wrapper
        result = fn(cls, context, *args, **kwargs)
      File "/home/jichen/git/nova/nova/objects/service.py", line 111, in get_by_compute_host
        db_service = db.service_get_by_compute_host(context, host)
      File "/home/jichen/git/nova/nova/db/api.py", line 131, in service_get_by_compute_host
        use_slave=use_slave)
      File "/home/jichen/git/nova/nova/db/sqlalchemy/api.py", line 127, in wrapper
        return f(*args, **kwargs)
      File "/home/jichen/git/nova/nova/db/sqlalchemy/api.py", line 431, in service_get_by_compute_host
        raise exception.ComputeHostNotFound(host=host)
    ComputeHostNotFound: Compute host fake-mini could not be found.

    Change-Id: I5541fc27afc23346ddcd685667737548b2a813c7
    Partial-Bug: #1289064

Changed in nova:
assignee: Alex Xu (xuhj) → jichenjc (jichenjc)
Bart Wensley (bartwensley) wrote :

It looks to me like the fixes being delivered against this bug are for evacuate - not live migration. The bug is specifically for the live migration case.

Note that as part of the work I am doing to fix bug 1417667, I plan to add resource claims for both evacuate and live migration. We could mark 1289064 as a duplicate of 1417667.

Bart, the title itself mentions only live migration, but the description also includes some information about the evacuate operation. Also, I'm already working on a fix for the live migration issue.

Change abandoned by Joe Gordon (<email address hidden>) on branch: master
Review: https://review.openstack.org/142740
Reason: This review is > 4 weeks without comment, and failed Jenkins the last time it was checked. We are abandoning this for now. Feel free to reactivate the review by pressing the restore button and leaving a 'recheck' comment to get fresh test results.

tags: added: live-migrate

Change abandoned by Michael Still (<email address hidden>) on branch: master
Review: https://review.openstack.org/142740
Reason: This patch has been stalled for quite a while, so I am going to abandon it to keep the code review queue sane. Please restore the change when it is ready for review.

Change abandoned by John Garbutt (<email address hidden>) on branch: master
Review: https://review.openstack.org/142740
Reason: This seems like a duplicate, so abandoning this for now. If that's not true, feel free to bring it back again.

Chris Friesen (cbf123) wrote :

Not a duplicate...as far as I know this is still a problem for live migration though Nikola has done some work for other scenarios.

Paul Murray (pmurray) on 2015-11-06
tags: added: live-migration
removed: live-migrate

Nikola will probably fix this issue, so I'm assigning it to him: https://review.openstack.org/#/q/topic:bug/1417667,n,z

Changed in nova:
assignee: jichenjc (jichenjc) → Nikola Đipanov (ndipanov)
Changed in nova:
assignee: Nikola Đipanov (ndipanov) → Sylvain Bauza (sylvain-bauza)
Changed in nova:
assignee: Sylvain Bauza (sylvain-bauza) → sahid (sahid-ferdjaoui)

Change abandoned by Daniel Berrange (<email address hidden>) on branch: master
Review: https://review.openstack.org/286742
Reason: Abandoning since it's obsolete and Nikola no longer works on nova

Changed in nova:
assignee: sahid (sahid-ferdjaoui) → Sylvain Bauza (sylvain-bauza)
Changed in nova:
assignee: Sylvain Bauza (sylvain-bauza) → sahid (sahid-ferdjaoui)
Changed in nova:
assignee: sahid (sahid-ferdjaoui) → Stephen Finucane (stephenfinucane)
Changed in nova:
assignee: Stephen Finucane (stephenfinucane) → Pawel Koniszewski (pawel-koniszewski)
Changed in nova:
assignee: Pawel Koniszewski (pawel-koniszewski) → sahid (sahid-ferdjaoui)
Changed in nova:
assignee: sahid (sahid-ferdjaoui) → Pawel Koniszewski (pawel-koniszewski)
Changed in nova:
assignee: Pawel Koniszewski (pawel-koniszewski) → Andrey Volkov (avolkov)
Sean Dague (sdague) wrote :

Automatically discovered version icehouse in description. If this is incorrect, please update the description to include 'nova version: ...'

tags: added: openstack-version.icehouse

Change abandoned by Sean Dague (<email address hidden>) on branch: master
Review: https://review.openstack.org/244489
Reason: This review is > 4 weeks without comment, and is not mergeable in its current state. We are abandoning this for now. Feel free to reactivate the review by pressing the restore button and leaving a 'recheck' comment to get fresh test results.

Change abandoned by Sean Dague (<email address hidden>) on branch: master
Review: https://review.openstack.org/286744
Reason: This review is > 4 weeks without comment, and is not mergeable in its current state. We are abandoning this for now. Feel free to reactivate the review by pressing the restore button and leaving a 'recheck' comment to get fresh test results.

Changed in nova:
assignee: Andrey Volkov (avolkov) → Stephen Finucane (stephenfinucane)
Changed in nova:
assignee: Stephen Finucane (stephenfinucane) → sahid (sahid-ferdjaoui)

Change abandoned by Stephen Finucane (<email address hidden>) on branch: master
Review: https://review.openstack.org/244489
Reason: Safe to say this is dead in the water and should finally be put out of its misery. artom: your turn.

Changed in nova:
assignee: sahid (sahid-ferdjaoui) → Artom Lifshitz (notartom)

Reviewed: https://review.openstack.org/611088
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=ae2e5650d14a2c81dd397727d67b60f9b8dd0dd7
Submitter: Zuul
Branch: master

commit ae2e5650d14a2c81dd397727d67b60f9b8dd0dd7
Author: Stephen Finucane <email address hidden>
Date: Tue Oct 16 17:41:17 2018 +0100

    Fail to live migration if instance has a NUMA topology

    Live migration is currently totally broken if a NUMA topology is
    present. This affects everything that's been regrettably stuffed in with
    NUMA topology including CPU pinning, hugepage support and emulator
    thread support. Side effects can range from simple unexpected
    performance hits (due to instances running on the same cores) to
    complete failures (due to instance cores or huge pages being mapped to
    CPUs/NUMA nodes that don't exist on the destination host).

    Until such a time as we resolve these issues, we should alert users to
    the fact that such issues exist. A workaround option is provided for
    operators that _really_ need the broken behavior, but it's defaulted to
    False to highlight the brokenness of this feature to unsuspecting
    operators.

    Change-Id: I217fba9138132b107e9d62895d699d238392e761
    Signed-off-by: Stephen Finucane <email address hidden>
    Related-bug: #1289064

Reviewed: https://review.openstack.org/625880
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=52b89734426253f64b6d4797ba4d849c3020fb52
Submitter: Zuul
Branch: stable/rocky

commit 52b89734426253f64b6d4797ba4d849c3020fb52
Author: Stephen Finucane <email address hidden>
Date: Tue Oct 16 17:41:17 2018 +0100

    Fail to live migration if instance has a NUMA topology

    Live migration is currently totally broken if a NUMA topology is
    present. This affects everything that's been regrettably stuffed in with
    NUMA topology including CPU pinning, hugepage support and emulator
    thread support. Side effects can range from simple unexpected
    performance hits (due to instances running on the same cores) to
    complete failures (due to instance cores or huge pages being mapped to
    CPUs/NUMA nodes that don't exist on the destination host).

    Until such a time as we resolve these issues, we should alert users to
    the fact that such issues exist. A workaround option is provided for
    operators that _really_ need the broken behavior, but it's defaulted to
    False to highlight the brokenness of this feature to unsuspecting
    operators.

    Conflicts:
     nova/conf/workarounds.py
     nova/tests/unit/api/openstack/compute/admin_only_action_common.py
     nova/tests/unit/api/openstack/compute/test_migrate_server.py

    NOTE(stephenfin): Conflicts due to removal of
    'report_ironic_standard_resource_class_inventory' option and addition of
    change Iaea1cb4ed93bb98f451de4f993106d7891ca3682 on master.

    Change-Id: I217fba9138132b107e9d62895d699d238392e761
    Signed-off-by: Stephen Finucane <email address hidden>
    Related-bug: #1289064
    (cherry picked from commit ae2e5650d14a2c81dd397727d67b60f9b8dd0dd7)

tags: added: in-stable-rocky

Reviewed: https://review.opendev.org/629597
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=9999bce00f5bea5f3e90ab9e16625d4237504bcb
Submitter: Zuul
Branch: stable/queens

commit 9999bce00f5bea5f3e90ab9e16625d4237504bcb
Author: Stephen Finucane <email address hidden>
Date: Tue Oct 16 17:41:17 2018 +0100

    Fail to live migration if instance has a NUMA topology

    Live migration is currently totally broken if a NUMA topology is
    present. This affects everything that's been regrettably stuffed in with
    NUMA topology including CPU pinning, hugepage support and emulator
    thread support. Side effects can range from simple unexpected
    performance hits (due to instances running on the same cores) to
    complete failures (due to instance cores or huge pages being mapped to
    CPUs/NUMA nodes that don't exist on the destination host).

    Until such a time as we resolve these issues, we should alert users to
    the fact that such issues exist. A workaround option is provided for
    operators that _really_ need the broken behavior, but it's defaulted to
    False to highlight the brokenness of this feature to unsuspecting
    operators.

    Conflicts:
     nova/conf/workarounds.py
     nova/tests/unit/api/openstack/compute/admin_only_action_common.py
     nova/tests/unit/api/openstack/compute/test_migrate_server.py
     nova/tests/unit/conductor/tasks/test_live_migrate.py

    NOTE(stephenfin): stable/rocky conflicts due to removal of
    'report_ironic_standard_resource_class_inventory' option and addition of
    change Iaea1cb4ed93bb98f451de4f993106d7891ca3682 on master.

    NOTE(stephenfin): stable/queens conflicts due to presence of
    the 'enable_consoleauth' configuration option and change
    I83b473e9ba557545b5c186f979e068e442de2424 (Mox to mock) in stable/rocky.
    A hyperlink is removed from the config option help text as the version
    of 'oslo.config' used here does not parse help text as rST (bug 1755783).

    Change-Id: I217fba9138132b107e9d62895d699d238392e761
    Signed-off-by: Stephen Finucane <email address hidden>
    Related-bug: #1289064
    (cherry picked from commit ae2e5650d14a2c81dd397727d67b60f9b8dd0dd7)
    (cherry picked from commit 52b89734426253f64b6d4797ba4d849c3020fb52)

tags: added: in-stable-queens

I think this bug can be considered "Fix Released", since blueprint [1] was implemented and merged for Train under topic [2].

[1] https://specs.openstack.org/openstack/nova-specs/specs/rocky/approved/numa-aware-live-migration.html
[2] https://review.opendev.org/#/q/topic:bp/numa-aware-live-migration+(status:open+OR+status:merged)

Artom Lifshitz (notartom) wrote :

So, only live migrations of instances with a NUMA topology do MoveClaims. SRIOV live migration is handled without MoveClaims, but updates the resource tracker as well. Everything else is covered by placement, I believe. So I'd say this bug is safe to close. Worst case, if there's a resource that we aren't accounting correctly during live migration, a new bug can be opened.

sean mooney (sean-k-mooney) wrote :

That was not completed until Train, so I think we can now mark this as Fix Released, but only from Train on.
For earlier releases, Stephen's change blocks NUMA live migration by default, so that could be considered a fix in that it prevents this race from happening.

Artom Lifshitz (notartom) wrote :

Based on comments #31 and #30, we consider this as 'Fix released' in Train.

Changed in nova:
status: In Progress → Fix Released
Yi Yang (yangyi01) wrote :

Hi, guys

I tried live migration with OVS DPDK on OpenStack Rocky and got the error below:

Migration pre-check error: Instance has an associated NUMA topology. Instance NUMA topologies, including related attributes such as CPU pinning, huge page and emulator thread pinning information, are not currently recalculated on live migration. See bug #1289064 for more information. (HTTP 400) (Request-ID: req-6cabd79c-bb0c-4008-866c-14d1f6587bee)

Can anybody confirm whether live migration is unsupported with OVS DPDK?

I see this bug has been marked as Fix Released, so why doesn't it work in Rocky?

Artom Lifshitz (notartom) wrote :

NUMA live migration support was added in Train. So Rocky, which is 2 releases *before* Train, will not have NUMA live migration support. You can still try to live migrate, but there are a lot of caveats. This is why the enable_numa_live_migration workaround config option [1] exists. It looks like in your case, it is not set to True, which is blocking live migration. You can set it to True if you understand the caveats and still want to perform a NUMA live migration.

[1] https://docs.openstack.org/nova/latest/configuration/config.html#workarounds.enable_numa_live_migration
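
For reference, a small sketch of what enabling the option looks like and how one might check it. The section and option name are the ones given above. The helper function is hypothetical, and it uses Python's standard `configparser` rather than oslo.config (which nova itself uses), so treat it as an approximation that may not handle every real nova.conf.

```python
import configparser

# Example nova.conf fragment. Only the [workarounds] section and the option
# name come from this thread; everything else here is illustrative.
NOVA_CONF_FRAGMENT = """
[workarounds]
enable_numa_live_migration = True
"""


def numa_live_migration_enabled(conf_text):
    """Return True if the workaround is switched on in the given config text.

    Hypothetical helper: it falls back to False when the section or option
    is absent, matching the default described in this bug.
    """
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(conf_text)
    return parser.getboolean(
        "workarounds", "enable_numa_live_migration", fallback=False
    )


enabled = numa_live_migration_enabled(NOVA_CONF_FRAGMENT)
unset = numa_live_migration_enabled("[workarounds]\n")
```

Which nova services need the setting, and whether they must be restarted to pick it up, is not stated in this thread, so check the configuration reference linked above for your release.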

Yi Yang (yangyi01) wrote :

@Artom Lifshitz (notartom) Thank you so much for the clarification. Yes, I can live migrate my instance after enabling enable_numa_live_migration, but I'm not clear on what those caveats are. Does OpenStack have more detailed documentation about them? I'd like to know whether a live-migrated instance keeps working as it did before and, if not, why. Are there any existing ways to fix these issues on Rocky or Stein?

Artom Lifshitz (notartom) wrote :

I think the best documentation we have of the caveats is the help text for the [workarounds]/enable_numa_live_migration option [1]:

> Live migration of instances with NUMA topologies when using the libvirt driver is only supported
> in deployments that have been fully upgraded to Train. In previous versions, or in mixed
> Stein/Train deployments with a rolling upgrade in progress, live migration of instances with NUMA
> topologies is disabled by default when using the libvirt driver. This includes live migration of
> instances with CPU pinning or hugepages. CPU pinning and huge page information for such instances
> is not currently re-calculated, as noted in bug #1289064. This means that if instances were
> already present on the destination host, the migrated instance could be placed on the same
> dedicated cores as these instances or use hugepages allocated for another instance. Alternately,
> if the host platforms were not homogeneous, the instance could be assigned to non-existent cores
> or be inadvertently split across host NUMA nodes.
>
> Despite these known issues, there may be cases where live migration is necessary. By enabling
> this option, operators that are aware of the issues and are willing to manually work around them
> can enable live migration support for these instances.

[1] https://docs.openstack.org/nova/latest/configuration/config.html#workarounds.enable_numa_live_migration

Yi Yang (yangyi01) wrote :

@Artom Lifshitz (notartom) Thanks a lot. What I take from reading it carefully is that a migrated instance may be assigned dedicated CPU cores that other instances are already using, and may use hugepages that other instances are using.
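
The CPU pinning caveats in the help text can be sketched with plain set arithmetic. This is an illustration only: `pinning_conflicts` is a hypothetical function and the core numbers are made up. It assumes the instance keeps its source-host pinning unchanged on the destination, which is the behaviour the help text describes for releases before Train.

```python
def pinning_conflicts(instance_cpus, dest_pinned, dest_cpus):
    """Report problems with reusing source-host pinning on a destination.

    Hypothetical helper, not part of nova:
    - "shared" lists cores another instance on the destination already has
      dedicated to it;
    - "missing" lists cores that do not exist on the destination at all.
    """
    instance_cpus = set(instance_cpus)
    return {
        "shared": sorted(instance_cpus & set(dest_pinned)),
        "missing": sorted(instance_cpus - set(dest_cpus)),
    }


# Instance pinned to cores 2, 3 and 10 on the source host; the destination
# has 8 cores (0-7), of which 3 and 4 are already dedicated to other guests.
result = pinning_conflicts(
    instance_cpus={2, 3, 10},
    dest_pinned={3, 4},
    dest_cpus=range(8),
)
```

Here core 3 would end up shared with an existing guest and core 10 does not exist on the destination. These correspond to the two failure modes quoted above: instances placed on the same dedicated cores, and instances assigned to non-existent cores.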
