Force live migrate doesn't claim resources on the target host

Bug #1712008 reported by Lajos Katona
This bug affects 2 people
Affects                    Status         Importance  Assigned to     Milestone
OpenStack Compute (nova)   Fix Released   Critical    Matt Riedemann
Pike                       Fix Committed  Critical    Matt Riedemann

Bug Description

During a forced live migration, nova doesn't claim the resources on the target host as expected. See the sequence:
* Boot a VM.
* Force live migrate the VM.
* Check the allocations:
** the claims are still on the source host.
** on the destination there is no claim.
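
The observed state can be illustrated with a small in-memory model. This is only a sketch: FakePlacement, its methods, forced_live_migrate_buggy, the host names and the resource amounts are all hypothetical stand-ins, not nova or Placement APIs.

```python
# Toy model of Placement allocations; none of this is nova code.
class FakePlacement:
    """Maps resource provider name -> {consumer id -> resources}."""

    def __init__(self, providers):
        self.allocations = {name: {} for name in providers}

    def claim(self, provider, consumer, resources):
        self.allocations[provider][consumer] = dict(resources)

    def usage(self, provider, consumer):
        return self.allocations[provider].get(consumer, {})


def forced_live_migrate_buggy(placement, consumer, source, dest):
    """Models the reported behaviour: the instance moves, but nothing is
    claimed on the destination and the source claim is left untouched."""
    return dest  # the instance now runs on dest


flavor = {"VCPU": 1, "MEMORY_MB": 512, "DISK_GB": 1}
placement = FakePlacement(["host1", "host2"])
placement.claim("host1", "vm-1", flavor)  # boot the VM on host1

running_on = forced_live_migrate_buggy(placement, "vm-1", "host1", "host2")

# Expected after a successful migration: host2 holds the flavor amounts
# and host1 holds nothing. Observed instead:
observed_source = placement.usage("host1", "vm-1")  # still the full flavor
observed_dest = placement.usage("host2", "vm-1")    # empty
```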

This situation doesn't change after running the periodics.
The test that contains the expected assertions (commented out now):
https://review.openstack.org/495170

nova commit: 08ec8a1ad3f3492b99db48d9e8fa132cb1bb3e8c

tags: added: live-migration placement
Matt Riedemann (mriedem)
tags: added: pike-rc-potential
Matt Riedemann (mriedem)
Changed in nova:
status: New → Triaged
Matt Riedemann (mriedem) wrote :

The problem is here:

https://github.com/openstack/nova/blob/16.0.0.0rc1/nova/conductor/tasks/live_migrate.py#L51-L56

When a host is forced, conductor bypasses the call to scheduler_client.select_destinations which is the code that eventually creates the allocation on the destination host:

https://github.com/openstack/nova/blob/16.0.0.0rc1/nova/scheduler/client/report.py#L147

And due to this change:

https://review.openstack.org/#/c/491012/

If all of your computes are upgraded, the resource tracker isn't going to "heal" the allocations on the target host during its update_available_resources periodic task.
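
To make the branch concrete, here is a simplified sketch of the control flow described above. FakeScheduler and FakeLiveMigrationTask are hypothetical stand-ins written for illustration; they are not the classes in live_migrate.py, and the "claim" is reduced to appending to a list.

```python
class FakeScheduler:
    """Hypothetical scheduler: picks a host and records a claim for it."""

    def __init__(self, hosts):
        self.hosts = list(hosts)
        self.claimed = []

    def select_destinations(self):
        host = self.hosts[0]
        # In this model the claim is a side effect of scheduling.
        self.claimed.append(host)
        return host


class FakeLiveMigrationTask:
    def __init__(self, scheduler, destination=None):
        self.scheduler = scheduler
        self.destination = destination  # set when the caller forces a host

    def execute(self):
        if not self.destination:
            # Unforced: the scheduler picks the host and makes the claim.
            self.destination = self.scheduler.select_destinations()
        else:
            # Forced: only pre-checks run, so no claim is ever made.
            self._check_requested_destination()
        return self.destination

    def _check_requested_destination(self):
        pass  # pre-migration checks would go here


scheduler = FakeScheduler(["host2", "host3"])
unforced = FakeLiveMigrationTask(scheduler).execute()
claims_after_unforced = list(scheduler.claimed)

forced = FakeLiveMigrationTask(scheduler, destination="host3").execute()
claims_after_forced = list(scheduler.claimed)  # unchanged: host3 not claimed
```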

Thinking of solutions:

1. Both paths are going to eventually call check_can_live_migrate_destination on the destination compute host so we could create the allocation there, although that gets tricky since it could overwrite any allocations that the scheduler created via select_destinations if a host isn't forced.

2. Just call placement from conductor when a host is forced, somewhere in this else block:

https://github.com/openstack/nova/blob/16.0.0.0rc1/nova/conductor/tasks/live_migrate.py#L56

That's probably the cleanest since it wouldn't overwrite any allocations by the scheduler, since the scheduler isn't called, and it would actually make the destination host allocations correct before the RT could heal them, assuming not all compute nodes are upgraded yet.

Matt Riedemann (mriedem) wrote :

After discussing this with Dan Smith a bit in IRC, solution #2 in comment 1 isn't great either, since it would mean duplicating more of the allocation logic in multiple places, which is something we're already having to deal with between the compute service and the scheduler service right now. We really want the scheduler to be responsible for this, so we should do something where conductor calls select_destinations with a target host and just makes it do the claim (and, I guess, run the filters?).

Matt Riedemann (mriedem) wrote :

Semi-related, but another problem here is that when forcing a host during live migration, we don't check that the source and destination hosts are within the same cell if you're running a multi-cell deployment with cells v2.

Changed in nova:
importance: Undecided → High
importance: High → Critical
OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (master)

Fix proposed to branch: master
Review: https://review.openstack.org/496031

Changed in nova:
assignee: nobody → Matt Riedemann (mriedem)
status: Triaged → In Progress
OpenStack Infra (hudson-openstack) wrote : Related fix merged to nova (master)

Reviewed: https://review.openstack.org/495170
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=52d732d6bdb78ca5c940112e8dff6eef7f6828e8
Submitter: Jenkins
Branch: master

commit 52d732d6bdb78ca5c940112e8dff6eef7f6828e8
Author: Lajos Katona <email address hidden>
Date: Fri Aug 18 13:02:48 2017 +0200

    Add functional force live migrate test

    Forced live migration now seems to be failing, as after the movement of
    the VM, the allocations are only present on the source host, and not on
    the destination host.
    The related assertions are commented out and replaced with wrong ones to
    make the test pass.

    Related-Bug: #1712008
    Change-Id: I6856f57426db6f2f49daea86679b50d5d019fe19

OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (master)

Fix proposed to branch: master
Review: https://review.openstack.org/496419

OpenStack Infra (hudson-openstack) wrote : Related fix proposed to nova (stable/pike)

Related fix proposed to branch: stable/pike
Review: https://review.openstack.org/496725

OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (stable/pike)

Fix proposed to branch: stable/pike
Review: https://review.openstack.org/496727

OpenStack Infra (hudson-openstack) wrote :

Fix proposed to branch: stable/pike
Review: https://review.openstack.org/496729

OpenStack Infra (hudson-openstack) wrote : Fix merged to nova (master)

Reviewed: https://review.openstack.org/496031
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=5d3a11b9c9a6a5aecd46ad7ecc635215184d930e
Submitter: Jenkins
Branch: master

commit 5d3a11b9c9a6a5aecd46ad7ecc635215184d930e
Author: Matt Riedemann <email address hidden>
Date: Mon Aug 21 18:35:07 2017 -0400

    Allocate resources on forced dest host during live migration

    When forcing a host during live migration, conductor bypasses
    the scheduler so the scheduler won't create an allocation in
    Placement against the destination host.

    With change Ia93168b1560267178059284186fb2b7096c7e81f, once all
    computes are upgraded to Pike, the computes won't auto-heal the
    allocations for their respective nodes either, so we end up with
    no allocation for the destination node during a live migration when
    the host is forced.

    This change makes conductor use the source compute node allocations
    for the instance to claim the same resource amounts on the forced
    destination host in Placement. If the claim fails, a
    MigrationPreCheckError is raised.

    This is a short-term fix for Pike. A longer-term fix to avoid this
    duplication with the scheduler is to have conductor call the
    scheduler even when force=True but pass a flag to the scheduler
    so it skips the filters but still makes the claim on the destination
    node.

    Finally, some comments are left in the live_migrate method in the
    compute API code since this is all tightly-coupled between the
    API and conductor when a host is specified in the request, and it's
    easy to get lost on what the API is doing to the request spec which
    changes how conductor behaves, i.e. if it calls the scheduler or not.

    Change-Id: I40b5af5e85b1266402a7e4bdeb3705e1b0bd6f3b
    Closes-Bug: #1712008
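
The approach in this commit message (reuse the source node allocation amounts to claim on the forced destination, and fail the pre-check if the claim is refused) can be sketched roughly as follows. This is an illustration under stated assumptions, not the merged code: FakePlacement, claim_on_forced_destination, the provider names and the inventory numbers are hypothetical, and MigrationPreCheckError here is a local stand-in for the nova exception of the same name.

```python
class MigrationPreCheckError(Exception):
    """Local stand-in for the nova exception of the same name."""


class FakePlacement:
    """Hypothetical in-memory model of free capacity and allocations."""

    def __init__(self, inventory):
        # provider -> remaining free amount per resource class
        self.free = {p: dict(inv) for p, inv in inventory.items()}
        self.allocations = {p: {} for p in inventory}

    def get_allocations(self, provider, consumer):
        return dict(self.allocations[provider].get(consumer, {}))

    def claim(self, provider, consumer, resources):
        free = self.free[provider]
        # Refuse the claim if any resource class lacks capacity.
        if any(free.get(rc, 0) < amt for rc, amt in resources.items()):
            return False
        for rc, amt in resources.items():
            free[rc] -= amt
        self.allocations[provider][consumer] = dict(resources)
        return True


def claim_on_forced_destination(placement, consumer, source, dest):
    """Claim on dest the same amounts the consumer holds on source."""
    source_allocs = placement.get_allocations(source, consumer)
    if not placement.claim(dest, consumer, source_allocs):
        raise MigrationPreCheckError(
            "Unable to claim %s on forced destination %s"
            % (source_allocs, dest))
    return source_allocs


placement = FakePlacement({
    "source": {"VCPU": 4, "MEMORY_MB": 4096},
    "big": {"VCPU": 4, "MEMORY_MB": 4096},
    "small": {"VCPU": 1, "MEMORY_MB": 256},
})
placement.claim("source", "vm-1", {"VCPU": 2, "MEMORY_MB": 1024})

# Forcing a host with enough room succeeds and records the claim.
claimed = claim_on_forced_destination(placement, "vm-1", "source", "big")

# Forcing a host that is too small fails the pre-check.
try:
    claim_on_forced_destination(placement, "vm-1", "source", "small")
    small_failed = False
except MigrationPreCheckError:
    small_failed = True
```

Note that in this sketch the source allocation is left in place while the destination is claimed; what happens to it after the migration completes is outside what the sketch models.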

Changed in nova:
status: In Progress → Fix Released
OpenStack Infra (hudson-openstack) wrote :

Reviewed: https://review.openstack.org/496419
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=8fc789deb7c774bcc5b5128d638c6c7e30bf0a54
Submitter: Jenkins
Branch: master

commit 8fc789deb7c774bcc5b5128d638c6c7e30bf0a54
Author: Matt Riedemann <email address hidden>
Date: Tue Aug 22 16:54:44 2017 -0400

    Restrict live migration to same cell

    We do not yet support live migrating an instance
    across cells. This change handles two cases for
    live migration:

    1. The destination host is forced so the scheduler
       is bypassed. In this case we directly compare the
       source cell against the destination cell and fail
       if they are not the same with a MigrationPreCheckError.

    2. If no destination host is specified, or it's not forced,
       we update the RequestSpec sent to the scheduler so it
       will restrict the compute nodes it pulls from the same
       cell that the instance lives in. If a host is requested
       in this case but it's in a different cell, it would result
       in a NoValidHost error from the scheduler.

    Change-Id: I66fc72d402ac118270a835cf929fe1ea387d78cd
    Closes-Bug: #1712008
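
The two cases in this commit message can be sketched with a toy host-to-cell mapping. Everything below is hypothetical and for illustration only: HOST_TO_CELL, pick_destination and the host and cell names are made up, the two exception classes are local stand-ins for the nova exceptions of the same names, and the real change works on cell mappings and the RequestSpec rather than a dict.

```python
class MigrationPreCheckError(Exception):
    """Local stand-in for the nova exception of the same name."""


class NoValidHost(Exception):
    """Local stand-in for the nova exception of the same name."""


# Hypothetical deployment: two hosts in cell1, one host in cell2.
HOST_TO_CELL = {"host1": "cell1", "host2": "cell1", "host3": "cell2"}


def pick_destination(source, requested=None, force=False):
    source_cell = HOST_TO_CELL[source]

    if requested and force:
        # Case 1: scheduler bypassed, so compare the cells directly.
        if HOST_TO_CELL[requested] != source_cell:
            raise MigrationPreCheckError(
                "%s is not in the same cell as %s" % (requested, source))
        return requested

    # Case 2: the "scheduler" only considers hosts in the source cell.
    candidates = [
        host for host, cell in sorted(HOST_TO_CELL.items())
        if cell == source_cell and host != source
    ]
    if requested:
        candidates = [host for host in candidates if host == requested]
    if not candidates:
        raise NoValidHost("no host available in cell %s" % source_cell)
    return candidates[0]


same_cell_forced = pick_destination("host1", "host2", force=True)
scheduled = pick_destination("host1")

try:
    pick_destination("host1", "host3", force=True)
    forced_error = None
except MigrationPreCheckError as exc:
    forced_error = type(exc).__name__

try:
    pick_destination("host1", "host3")
    requested_error = None
except NoValidHost as exc:
    requested_error = type(exc).__name__
```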

OpenStack Infra (hudson-openstack) wrote : Related fix merged to nova (stable/pike)

Reviewed: https://review.openstack.org/496725
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=c12446ecabd5987f0a3a9184691a1e42352e3070
Submitter: Jenkins
Branch: stable/pike

commit c12446ecabd5987f0a3a9184691a1e42352e3070
Author: Lajos Katona <email address hidden>
Date: Fri Aug 18 13:02:48 2017 +0200

    Add functional force live migrate test

    Forced live migration now seems to be failing, as after the movement of
    the VM, the allocations are only present on the source host, and not on
    the destination host.
    The related assertions are commented out and replaced with wrong ones to
    make the test pass.

    Related-Bug: #1712008
    Change-Id: I6856f57426db6f2f49daea86679b50d5d019fe19
    (cherry picked from commit 52d732d6bdb78ca5c940112e8dff6eef7f6828e8)

tags: added: in-stable-pike
OpenStack Infra (hudson-openstack) wrote : Fix merged to nova (stable/pike)

Reviewed: https://review.openstack.org/496727
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=7f282839911222223652245b2dd12ca26d42a9d3
Submitter: Jenkins
Branch: stable/pike

commit 7f282839911222223652245b2dd12ca26d42a9d3
Author: Matt Riedemann <email address hidden>
Date: Mon Aug 21 18:35:07 2017 -0400

    Allocate resources on forced dest host during live migration

    When forcing a host during live migration, conductor bypasses
    the scheduler so the scheduler won't create an allocation in
    Placement against the destination host.

    With change Ia93168b1560267178059284186fb2b7096c7e81f, once all
    computes are upgraded to Pike, the computes won't auto-heal the
    allocations for their respective nodes either, so we end up with
    no allocation for the destination node during a live migration when
    the host is forced.

    This change makes conductor use the source compute node allocations
    for the instance to claim the same resource amounts on the forced
    destination host in Placement. If the claim fails, a
    MigrationPreCheckError is raised.

    This is a short-term fix for Pike. A longer-term fix to avoid this
    duplication with the scheduler is to have conductor call the
    scheduler even when force=True but pass a flag to the scheduler
    so it skips the filters but still makes the claim on the destination
    node.

    Finally, some comments are left in the live_migrate method in the
    compute API code since this is all tightly-coupled between the
    API and conductor when a host is specified in the request, and it's
    easy to get lost on what the API is doing to the request spec which
    changes how conductor behaves, i.e. if it calls the scheduler or not.

    Change-Id: I40b5af5e85b1266402a7e4bdeb3705e1b0bd6f3b
    Closes-Bug: #1712008
    (cherry picked from commit 5d3a11b9c9a6a5aecd46ad7ecc635215184d930e)

OpenStack Infra (hudson-openstack) wrote :

Reviewed: https://review.openstack.org/496729
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=b37411c1e3d167156b6e54f837fc291c9e61eeef
Submitter: Jenkins
Branch: stable/pike

commit b37411c1e3d167156b6e54f837fc291c9e61eeef
Author: Matt Riedemann <email address hidden>
Date: Tue Aug 22 16:54:44 2017 -0400

    Restrict live migration to same cell

    We do not yet support live migrating an instance
    across cells. This change handles two cases for
    live migration:

    1. The destination host is forced so the scheduler
       is bypassed. In this case we directly compare the
       source cell against the destination cell and fail
       if they are not the same with a MigrationPreCheckError.

    2. If no destination host is specified, or it's not forced,
       we update the RequestSpec sent to the scheduler so it
       will restrict the compute nodes it pulls from the same
       cell that the instance lives in. If a host is requested
       in this case but it's in a different cell, it would result
       in a NoValidHost error from the scheduler.

    Change-Id: I66fc72d402ac118270a835cf929fe1ea387d78cd
    Closes-Bug: #1712008
    (cherry picked from commit 8fc789deb7c774bcc5b5128d638c6c7e30bf0a54)

OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/nova 16.0.0.0rc2

This issue was fixed in the openstack/nova 16.0.0.0rc2 release candidate.

OpenStack Infra (hudson-openstack) wrote : Related fix proposed to nova (master)

Related fix proposed to branch: master
Review: https://review.openstack.org/501314

OpenStack Infra (hudson-openstack) wrote : Related fix proposed to nova (stable/pike)

Related fix proposed to branch: stable/pike
Review: https://review.openstack.org/501477

OpenStack Infra (hudson-openstack) wrote : Related fix merged to nova (master)

Reviewed: https://review.openstack.org/501314
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=02a82c41f90a64f261d81cdc0bc57471d15a7c8e
Submitter: Jenkins
Branch: master

commit 02a82c41f90a64f261d81cdc0bc57471d15a7c8e
Author: Matt Riedemann <email address hidden>
Date: Wed Sep 6 11:13:04 2017 -0400

    Add release note for force live migration allocations

    When forcing a destination host during live migration, conductor
    bypasses the scheduler, performs some pre-migration checks and then
    casts to the specified destination compute host directly.

    With change I40b5af5e85b1266402a7e4bdeb3705e1b0bd6f3b we are still
    bypassing the scheduler but conductor will attempt to allocate
    resources against the specified destination host, which could fail
    and result in the live migration failing even though the force flag
    was specified in the API.

    This change simply adds a release note for the new behavior which
    was missing from the original fix.

    Change-Id: I1811dfa59865c0a878522007e0070f0fde8344f0
    Related-Bug: #1712008

OpenStack Infra (hudson-openstack) wrote : Related fix merged to nova (stable/pike)

Reviewed: https://review.openstack.org/501477
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=7d220b3dd7f9a52773604eab8160503b05f586b3
Submitter: Jenkins
Branch: stable/pike

commit 7d220b3dd7f9a52773604eab8160503b05f586b3
Author: Matt Riedemann <email address hidden>
Date: Wed Sep 6 11:13:04 2017 -0400

    Add release note for force live migration allocations

    When forcing a destination host during live migration, conductor
    bypasses the scheduler, performs some pre-migration checks and then
    casts to the specified destination compute host directly.

    With change I40b5af5e85b1266402a7e4bdeb3705e1b0bd6f3b we are still
    bypassing the scheduler but conductor will attempt to allocate
    resources against the specified destination host, which could fail
    and result in the live migration failing even though the force flag
    was specified in the API.

    This change simply adds a release note for the new behavior which
    was missing from the original fix.

    Change-Id: I1811dfa59865c0a878522007e0070f0fde8344f0
    Related-Bug: #1712008
    (cherry picked from commit 02a82c41f90a64f261d81cdc0bc57471d15a7c8e)

OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/nova 17.0.0.0b1

This issue was fixed in the openstack/nova 17.0.0.0b1 development milestone.
