VM deployed with availability-zone (force_hosts) cannot be live migrated to an untargeted host

Bug #1561357 reported by Taylor Peoples
This bug affects 1 person
Affects                   Status        Importance  Assigned to    Milestone
OpenStack Compute (nova)  Fix Released  High        Sylvain Bauza
Mitaka                    Fix Released  High        Sylvain Bauza

Bug Description

Steps:
1) Deploy a VM to a specific host using availability zones (i.e., do a targeted deploy).
2) Attempt to live migrate the VM from (1) letting the scheduler decide what host to live migrate to (i.e., do an untargeted live migration).

Outcome:
The live migration will always fail.

Version: mitaka

This is happening because of the following recent change: https://github.com/openstack/nova/commit/111a852e79f0d9e54228d8e2724dc4183f737397. That change pulls the request spec from the original deploy out of the DB and uses it for the live migration. Since the initial deploy of the VM was targeted, the request spec object saved in the DB has the "force_hosts" field set to a specific host. Part of the live migrate flow then sets the "ignore_hosts" field of the same request spec object to that same host, since it doesn't make sense to live migrate to the source host. The scheduler is therefore asked for a host that is both forced and ignored, which is an unsolvable set of constraints.

nova/compute/api.py::live_migrate():
    ...
    try:
        # This fetches the request spec from the DB, which will have
        # force_hosts set.
        request_spec = objects.RequestSpec.get_by_instance_uuid(
            context, instance.uuid)
    ...
    self.compute_task_api.live_migrate_instance(context, instance,
        host_name, block_migration=block_migration,
        disk_over_commit=disk_over_commit,
        request_spec=request_spec)

After a lot of API plumbing, the flow ends up in nova/conductor/tasks/live_migrate.py::_find_destination():
    ...
    attempted_hosts = [self.source]
    ...
    host = None
    while host is None:
        ...
        # The source host is put into the "ignore_hosts" field here.
        request_spec.ignore_hosts = attempted_hosts
        try:
            # The scheduler is now handed an unsolvable request_spec and
            # will never find a valid host to migrate to.
            host = self.scheduler_client.select_destinations(
                self.context, request_spec)[0]['host']
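
The conflict between the two fields can be reproduced outside of nova with a small sketch. This is a simplified stand-in for the scheduler's host filtering, not nova's actual code: the function name `candidate_hosts` and the second host name `host614` are assumptions made for illustration only.

```python
# Simplified stand-in for the scheduler's host filtering; not nova code.
def candidate_hosts(all_hosts, force_hosts=None, ignore_hosts=None):
    """Return the hosts that survive both ignore_hosts and force_hosts."""
    hosts = set(all_hosts)
    if ignore_hosts:
        hosts -= set(ignore_hosts)  # drop every ignored host
    if force_hosts:
        hosts &= set(force_hosts)   # keep only the forced hosts
    return sorted(hosts)


cluster = ["host613", "host614"]  # host614 is a hypothetical second node

# Initial targeted boot: only force_hosts is set.
boot = candidate_hosts(cluster, force_hosts=["host613"])

# Untargeted live migration reusing the stored spec: the source host is
# ignored while force_hosts still names that same host.
migrate = candidate_hosts(
    cluster, force_hosts=["host613"], ignore_hosts=["host613"])

print(boot)     # ['host613']
print(migrate)  # []
```

Under these assumptions the candidate list is always empty once the forced host and the ignored host are the same, regardless of how many other hosts exist.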

Example on a multi-node (2) devstack environment:

stack@controller:~/devstack$ nova boot tdp-server --image 13a9f724-36ef-46ae-896d-f4f003ac1a10 --flavor m1.tiny --availability-zone nova:host613

stack@controller:~/devstack$ nova list --fields name,status,OS-EXT-SRV-ATTR:host
+--------------------------------------+------------+--------+-----------------------+
| ID                                   | Name       | Status | OS-EXT-SRV-ATTR: Host |
+--------------------------------------+------------+--------+-----------------------+
| a9fe19e4-5528-40f2-af08-031eaf4c33a6 | tdp-server | ACTIVE | host613               |
+--------------------------------------+------------+--------+-----------------------+

mysql> select spec from request_specs where instance_uuid="a9fe19e4-5528-40f2-af08-031eaf4c33a6";
{
    ...
    "nova_object.name":"RequestSpec",
    "nova_object.data":{
        "instance_uuid":"a9fe19e4-5528-40f2-af08-031eaf4c33a6",
        ...,
        "availability_zone":"nova",
        "force_nodes":null,
        ...,
        "force_hosts":[
            "host613"
        ],
        "ignore_hosts":null,
        ...,
        "scheduler_hints":{}
    },
 ...
}

stack@controller:~/devstack$ nova live-migration tdp-server
ERROR (BadRequest): No valid host was found. There are not enough hosts available. (HTTP 400) (Request-ID: req-78725630-e87b-426c-a4f6-dc31f9c08223)

/opt/stack/logs/n-sch.log:2016-03-24 02:25:27.515 INFO nova.scheduler.host_manager [req-78725630-e87b-426c-a4f6-dc31f9c08223 admin admin] Host filter ignoring hosts: host613
...
/opt/stack/logs/n-sch.log:2016-03-24 02:25:27.515 INFO nova.scheduler.host_manager [req-78725630-e87b-426c-a4f6-dc31f9c08223 admin admin] No hosts matched due to not matching 'force_hosts' value of 'host613'

This is breaking previous behavior - the force_hosts field was not "sticky" in that it did not prevent the scheduler from moving the VM to another host after initial deploy. It previously only forced the initial deploy to go to a specific host.

Two possible fixes come to mind:

1) Do not save the force_hosts field in the DB. This may have unintended consequences that I have not thought through.
2) Remove the force_hosts field from the request_spec object that is used for the live migration task.
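
A minimal sketch of what option (2) could look like, assuming a plain stand-in object rather than nova's real RequestSpec. The class `FakeRequestSpec` and the helper `spec_for_move` are hypothetical names used only for illustration; this is not the patch that was eventually merged.

```python
import copy
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class FakeRequestSpec:
    """Hypothetical stand-in for the stored spec; not nova's RequestSpec."""
    instance_uuid: str
    force_hosts: Optional[List[str]] = None
    force_nodes: Optional[List[str]] = None
    ignore_hosts: Optional[List[str]] = None


def spec_for_move(stored_spec, source_host):
    """Build the spec handed to the scheduler for an untargeted move."""
    spec = copy.deepcopy(stored_spec)  # leave the stored spec untouched
    spec.force_hosts = None            # the forced host applied to the boot only
    spec.force_nodes = None
    spec.ignore_hosts = [source_host]  # never pick the source host
    return spec


stored = FakeRequestSpec(
    instance_uuid="a9fe19e4-5528-40f2-af08-031eaf4c33a6",
    force_hosts=["host613"],
)
move_spec = spec_for_move(stored, source_host="host613")

print(move_spec.force_hosts, move_spec.ignore_hosts)  # None ['host613']
print(stored.force_hosts)                             # ['host613']
```

Working on a copy keeps the original deploy request intact in this sketch, so option (2) would not need the DB change that option (1) implies.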

summary: VM deployed with availability-zone (force_hosts) cannot be live migrated
+ to an untargeted host
Matt Riedemann (mriedem)
tags: added: availability-zones live-migration mitaka-rc-potential
Changed in nova:
status: New → Confirmed
assignee: nobody → Sylvain Bauza (sylvain-bauza)
importance: Undecided → High
Sylvain Bauza (sylvain-bauza) wrote :

So, evacuate, live-migrate and unshelve now all use the existing RequestSpec.
Evacuate and live-migrate accept an optional destination in the API. When a destination is given, the scheduler is not called, so the stored force_hosts value is not a problem for those requests.

That said, unshelve doesn't take a destination, so the scheduler would only ever offer the original forced host as the destination for unshelving, which could be a problem.

I'm working on a patch for fixing all of that.
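
The difference between the targeted and untargeted paths described above can be sketched as follows. This is an illustration under stated assumptions, not nova code: `pick_destination` and the host name `host614` are hypothetical.

```python
def pick_destination(requested_host, force_hosts, all_hosts):
    """Illustrative only: an explicit destination bypasses the scheduler."""
    if requested_host is not None:
        return requested_host
    # Stand-in for scheduling: only forced hosts are eligible when set.
    candidates = [h for h in all_hosts if not force_hosts or h in force_hosts]
    return candidates[0] if candidates else None


cluster = ["host613", "host614"]  # host614 is a hypothetical second node
stored_force_hosts = ["host613"]  # value kept from the original boot

# Targeted live-migrate or evacuate: the stored force_hosts is never consulted.
targeted = pick_destination("host614", stored_force_hosts, cluster)

# Unshelve has no destination argument, so the stored value pins the result.
unshelved = pick_destination(None, stored_force_hosts, cluster)

print(targeted)   # host614
print(unshelved)  # host613
```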

OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (master)

Fix proposed to branch: master
Review: https://review.openstack.org/297387

Changed in nova:
status: Confirmed → In Progress
OpenStack Infra (hudson-openstack) wrote : Fix merged to nova (master)

Reviewed: https://review.openstack.org/297387
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=446d15568e00a483d909dc5c565d70baf29179ad
Submitter: Jenkins
Branch: master

commit 446d15568e00a483d909dc5c565d70baf29179ad
Author: Sylvain Bauza <email address hidden>
Date: Thu Mar 24 23:07:54 2016 +0100

    Stop providing force_hosts to the scheduler for move ops

    Since now we provide the original RequestSpec for move operations (unshelve,
    live-migrate and evacuate), it can also provide the original force_hosts/nodes
    to the scheduler.
    In that case, it means that if an admin was asking to boot an instance forcing
    to an host, a later move operation could then give again the forced value and
    then wouldn't permit to get a different destination which is an issue.

    TBH, that is not a problem for live-migrate and evacuate that do provide an
    optional host value (which bypasses then the scheduler) but since unshelve
    is not having this optional value, it would mean that we could only unshelve
    an forced instance to the same host.

    Change-Id: I03c22ff757d0ee1da9d69fa48cc4bdd036e6b13f
    Closes-Bug: #1561357

Changed in nova:
status: In Progress → Fix Released
OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (stable/mitaka)

Fix proposed to branch: stable/mitaka
Review: https://review.openstack.org/297846

OpenStack Infra (hudson-openstack) wrote : Fix merged to nova (stable/mitaka)

Reviewed: https://review.openstack.org/297846
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=c71c4e01b21da5b5b7cb262969e30d6a68b3c58a
Submitter: Jenkins
Branch: stable/mitaka

commit c71c4e01b21da5b5b7cb262969e30d6a68b3c58a
Author: Sylvain Bauza <email address hidden>
Date: Thu Mar 24 23:07:54 2016 +0100

    Stop providing force_hosts to the scheduler for move ops

    Since now we provide the original RequestSpec for move operations (unshelve,
    live-migrate and evacuate), it can also provide the original force_hosts/nodes
    to the scheduler.
    In that case, it means that if an admin was asking to boot an instance forcing
    to an host, a later move operation could then give again the forced value and
    then wouldn't permit to get a different destination which is an issue.

    TBH, that is not a problem for live-migrate and evacuate that do provide an
    optional host value (which bypasses then the scheduler) but since unshelve
    is not having this optional value, it would mean that we could only unshelve
    an forced instance to the same host.

    Change-Id: I03c22ff757d0ee1da9d69fa48cc4bdd036e6b13f
    Closes-Bug: #1561357
    (cherry picked from commit 446d15568e00a483d909dc5c565d70baf29179ad)

tags: added: in-stable-mitaka
Thierry Carrez (ttx) wrote : Fix included in openstack/nova 13.0.0.0rc3

This issue was fixed in the openstack/nova 13.0.0.0rc3 release candidate.

OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (master)

Fix proposed to branch: master
Review: https://review.openstack.org/302578

OpenStack Infra (hudson-openstack) wrote : Fix merged to nova (master)

Reviewed: https://review.openstack.org/302578
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=a8ebbebd4ee0c3bb1452ea32f92e1588a6b35067
Submitter: Jenkins
Branch: master

commit 7105f888ee1f52d2a462fc0ece3130dc0d3d49f5
Author: OpenStack Proposal Bot <email address hidden>
Date: Thu Mar 31 06:28:06 2016 +0000

    Imported Translations from Zanata

    For more information about this automatic import see:
    https://wiki.openstack.org/wiki/Translations/Infrastructure

    Change-Id: Ibe5d4d38834fbcb99c0332d3375659a21d94154e

commit 5de98cb2de2eca3d061488c55f96e6f7c9bc56a8
Author: OpenStack Proposal Bot <email address hidden>
Date: Wed Mar 30 06:41:25 2016 +0000

    Imported Translations from Zanata

    For more information about this automatic import see:
    https://wiki.openstack.org/wiki/Translations/Infrastructure

    Change-Id: Ia46d661560b1141c1c1522c9477c510d28a0d0e7

commit a9d55427b6e8d2472088e3d40a8a5151ce408283
Author: Moshe Levi <email address hidden>
Date: Wed Mar 23 10:59:04 2016 +0200

    Fix detach SR-IOV when using LibvirtConfigGuestHostdevPCI

    This patch fixes an issue which was introduced by this
    change If3edc1965c01a077eb61984a442e0d778d870d75.
    Usually the vif config is of type LibvirtConfigGuestInterface,
    but some vif use LibvirtConfigGuestHostdevPCI config
    (e.g. the ib_hostdev). The difference is that
    LibvirtConfigGuestInterface keeps the pci address in source_dev
    while LibvirtConfigGuestHostdevPCI has domain, bus, slot and
    function, instead of relying on the vif config type we can take the
    pci address for the neutron port.

    Closes-Bug: #1560860

    Change-Id: I62a7ff16f1c9c5da923451520fbeeabb5cc0c5c6
    (cherry picked from commit f15d9a9693b19393fcde84cf4bc6f044d39ffdca)

commit 5b6ee702df7ad901f68bec2ed8d43b66aa6d98c1
Author: OpenStack Proposal Bot <email address hidden>
Date: Tue Mar 29 06:37:30 2016 +0000

    Imported Translations from Zanata

    For more information about this automatic import see:
    https://wiki.openstack.org/wiki/Translations/Infrastructure

    Change-Id: Iad0e42a18bd3a7dcf216b4df17b9893e13382efe

commit 29042e06f7e570bd13607b62b997a6ae21db80c5
Author: OpenStack Proposal Bot <email address hidden>
Date: Mon Mar 28 06:34:19 2016 +0000

    Imported Translations from Zanata

    For more information about this automatic import see:
    https://wiki.openstack.org/wiki/Translations/Infrastructure

    Change-Id: If159133a2e32c6ef53ba104751a3eb054a95b733

commit 3e9819dab8249ec9993b0b9874e80a78f2ed1754
Author: Matt Riedemann <email address hidden>
Date: Sun Mar 27 19:31:32 2016 -0400

    Update cells blacklist regex for test_server_basic_ops

    Tempest change 9bee3b92f1559cb604c8bd74dcca57805a85a97a
    renamed a test in our blacklist so update the filter to
    handle the old and new name.

    The Tempest team is hesitant to revert the change so we
    should handle it ourselves and eventually move to using
    test uuids for our blacklist, but there might need to
    be work in devstack-gate for that fi...

Davanum Srinivas (DIMS) (dims-v) wrote : Fix included in openstack/nova 14.0.0.0b1

This issue was fixed in the openstack/nova 14.0.0.0b1 development milestone.
