Failed cold migration with SR-IOV

Bug #1512880 reported by Ludovic Beliveau
This bug affects 4 people
Affects                   Status        Importance  Assigned to       Milestone
OpenStack Compute (nova)  Fix Released  Medium      Ludovic Beliveau
Newton                    Fix Released  Medium      Unassigned

Bug Description

Cold migration of an instance that has an SR-IOV interface fails because nova on the destination compute tries to use the PCI device address that was allocated on the source compute. This fails because that PCI device is not present on the destination compute.

See the error "libvirtError: Device 0000:83:10.6 not found: could not access /sys/bus/pci/devices/0000:83:10.6/config: No such file or directory" in the log in the attachment.

Nova should allocate a new PCI device based on the hardware configuration of the compute where the instance is being migrated, and this PCI device should be used to create the instance XML.

Nova version:
commit 2397d636ff6ea3767fe62ee681d609fce4fc98ca
Author: OpenStack Proposal Bot <email address hidden>
Date: Tue Oct 27 06:30:34 2015 +0000

    Imported Translations from Zanata

    For more information about this automatic import see:
    https://wiki.openstack.org/wiki/Translations/Infrastructure

    Change-Id: I38f537e37972e5ddae13d388021412d85f6be898

Devstack setup:

* One server configured with controller and compute functions
    * Intel 10G port is configured with 8 VFs: $ echo 8 > /sys/bus/pci/devices/0000\:85\:00.0/sriov_numvfs
    * /etc/nova/nova.conf: pci_passthrough_whitelist = {"address":"*:85:10.*","physical_network":"default"}
* One server configured with compute function only
    * Intel 10G port is configured with 8 VFs: $ echo 8 > /sys/bus/pci/devices/0000\:83\:00.0/sriov_numvfs
    * /etc/nova/nova.conf: pci_passthrough_whitelist = {"address":"*:83:10.*","physical_network":"default"}
* Note that it is important for this test that the PCI addresses of the SR-IOV interfaces differ between the two servers. We want to validate that new PCI devices are claimed/allocated on the destination compute.
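
Why the differing whitelists matter can be shown with a small sketch. It uses Python's fnmatch as a stand-in for nova's whitelist address matching, which is an assumption made purely for illustration; the helper name matches_whitelist is hypothetical. The VF address is the one from the libvirt error in the bug description.

```python
from fnmatch import fnmatch

# Whitelist address globs copied from the two nova.conf files above.
CONTROLLER_WHITELIST = "*:85:10.*"
COMPUTE_WHITELIST = "*:83:10.*"


def matches_whitelist(pci_address: str, pattern: str) -> bool:
    """Return True if a PCI address matches a whitelist glob.

    Hypothetical helper: fnmatch only approximates nova's own matching.
    """
    return fnmatch(pci_address, pattern)


# VF address taken from the libvirt error in the bug description.
vf_address = "0000:83:10.6"

on_compute = matches_whitelist(vf_address, COMPUTE_WHITELIST)
on_controller = matches_whitelist(vf_address, CONTROLLER_WHITELIST)
print(on_compute, on_controller)
```

An address claimed on the compute-only server matches only that server's whitelist, so reusing it on the other server cannot work.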

Steps to reproduce:

1) Boot an instance with an SR-IOV interface:

$ NETID=`neutron net-list | grep default | awk '{print $2}'`
$ neutron port-create $NETID --binding:vnic-type direct --name p-direct
$ PORTID=`neutron port-list | grep "p-direct" | awk '{print $2}'`
$ nova boot test --image=ubuntu --nic port-id=$PORTID --flavor=m1.small

2) Migrate the instance to the other compute:

$ nova migrate test

Expected result:

The instance is successfully migrated to the other server.

Actual result:

The instance fails to migrate and is stuck in the error state. See the log in the attachment for more information.

OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (master)

Fix proposed to branch: master
Review: https://review.openstack.org/242573

Changed in nova:
assignee: nobody → Ludovic Beliveau (ludovic-beliveau)
status: New → In Progress
OpenStack Infra (hudson-openstack) wrote :

Fix proposed to branch: master
Review: https://review.openstack.org/305978

Changed in nova:
assignee: Ludovic Beliveau (ludovic-beliveau) → Moshe Levi (moshele)
Moshe Levi (moshele)
Changed in nova:
assignee: Moshe Levi (moshele) → nobody
assignee: nobody → Ludovic Beliveau (ludovic-beliveau)
Changed in nova:
assignee: Ludovic Beliveau (ludovic-beliveau) → Moshe Levi (moshele)
Moshe Levi (moshele)
Changed in nova:
assignee: Moshe Levi (moshele) → Ludovic Beliveau (ludovic-beliveau)
Changed in nova:
importance: Undecided → Medium
OpenStack Infra (hudson-openstack) wrote : Change abandoned on nova (master)

Change abandoned by Ludovic Beliveau (<email address hidden>) on branch: master
Review: https://review.openstack.org/305978
Reason: Another patch set implemented a fix for this bug using migration context.

OpenStack Infra (hudson-openstack) wrote : Fix merged to nova (master)

Reviewed: https://review.openstack.org/242573
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=dfdae01bd03f522ffab7876b253ec41641934702
Submitter: Jenkins
Branch: master

commit dfdae01bd03f522ffab7876b253ec41641934702
Author: Ludovic Beliveau <email address hidden>
Date: Fri Nov 6 12:22:16 2015 -0500

    Update binding:profile for SR-IOV ports

    The libvirt driver relies on the network information for configuring the
    SR-IOV interfaces (binding:vnic_type). The network information is specified
    when creating a Neutron port (binding:profile - which contain the PCI address
    of the device).

    During a migration, the Neutron port information needs to be updated based on
    the allocated PCI device on the migrated node.

    Co-Authored-By: Moshe Levi <email address hidden>

    Closes-Bug: #1512880

    Change-Id: I7423907ef33648669decc561fc3461415b877ba6
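
The idea described in this commit message can be sketched roughly as follows. This is not the code from the patch: the function name update_binding_profile, the shape of pci_mapping, and the pci_slot key are assumptions for illustration, and the destination address is an invented example value.

```python
import copy


def update_binding_profile(port: dict, pci_mapping: dict) -> dict:
    """Return a copy of the port whose binding:profile points at the new VF.

    pci_mapping maps the PCI address used on the source host to the address
    allocated on the destination host (hypothetical structure).
    """
    updated = copy.deepcopy(port)
    profile = updated.get("binding:profile") or {}
    old_slot = profile.get("pci_slot")
    # Only SR-IOV (vnic_type "direct") ports with a known mapping are touched.
    if updated.get("binding:vnic_type") == "direct" and old_slot in pci_mapping:
        profile["pci_slot"] = pci_mapping[old_slot]
        updated["binding:profile"] = profile
    return updated


port = {
    "name": "p-direct",
    "binding:vnic_type": "direct",
    "binding:profile": {
        "pci_slot": "0000:83:10.6",
        "physical_network": "default",
    },
}
# The destination address below is illustrative, not taken from the report.
new_port = update_binding_profile(port, {"0000:83:10.6": "0000:85:10.2"})
print(new_port["binding:profile"]["pci_slot"])
```

Returning a copy keeps the original port data available, which is convenient if the old binding has to be restored later.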

Changed in nova:
status: In Progress → Fix Released
OpenStack Infra (hudson-openstack) wrote : Change abandoned on nova (master)

Change abandoned by Ludovic Beliveau (<email address hidden>) on branch: master
Review: https://review.openstack.org/331830
Reason: Functionality included in https://review.openstack.org/#/c/242573

OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (master)

Fix proposed to branch: master
Review: https://review.openstack.org/347444

OpenStack Infra (hudson-openstack) wrote :

Fix proposed to branch: master
Review: https://review.openstack.org/348522

OpenStack Infra (hudson-openstack) wrote :

Fix proposed to branch: master
Review: https://review.openstack.org/349060

OpenStack Infra (hudson-openstack) wrote :

Fix proposed to branch: master
Review: https://review.openstack.org/349061

OpenStack Infra (hudson-openstack) wrote : Fix merged to nova (master)

Reviewed: https://review.openstack.org/348522
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=3db49d339e6586c92bfe27a8e8edbd7cee6f864f
Submitter: Jenkins
Branch: master

commit 3db49d339e6586c92bfe27a8e8edbd7cee6f864f
Author: Ludovic Beliveau <email address hidden>
Date: Thu Jul 28 15:03:07 2016 -0400

    Follow up on Update binding:profile for SR-IOV ports

    This patch addresses comments from https://review.openstack.org/#/c/242573

    - Only load PCI mappings when there is at least one port of type SR-IOV.
    - Fixed exception message on neutron port update.

    Change-Id: I88586c66a2b2b84a2ff4a5c00857d3da19148268
    Partial-bug: #1512880

OpenStack Infra (hudson-openstack) wrote :

Reviewed: https://review.openstack.org/328983
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=5129f48226cb06a77faa89dd04d290d49e76036f
Submitter: Jenkins
Branch: master

commit 5129f48226cb06a77faa89dd04d290d49e76036f
Author: Ludovic Beliveau <email address hidden>
Date: Mon Jun 13 08:30:15 2016 -0400

    Allocate PCI devices on migration

    Enable the resource tracker to allocate PCI devices on migration. The
    resource tracker which correspond to the destination node in the migration
    object is responsible to allocate the PCI devices.

    Change-Id: If58a1483930d6666c59fd98c00c4e16a297ddead
    Partial-bug: #1512880
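
A toy model of the behaviour described in this commit message, for illustration only. The PciPool class, claim_for_migration function, host names, and free-address lists are all hypothetical and do not reflect nova's resource tracker implementation.

```python
class PciPool:
    """Toy per-host pool of free VF addresses (hypothetical, not nova's class)."""

    def __init__(self, host, free_addresses):
        self.host = host
        self.free = list(free_addresses)
        self.claimed = {}  # instance name -> PCI address

    def claim(self, instance):
        """Take the first free VF and record it against the instance."""
        if not self.free:
            raise RuntimeError(f"no free VF on {self.host}")
        address = self.free.pop(0)
        self.claimed[instance] = address
        return address


def claim_for_migration(migration, pools, instance):
    """Claim a VF only from the pool of the migration's destination host."""
    return pools[migration["dest_host"]].claim(instance)


# Illustrative hosts and addresses; not taken from the report.
pools = {
    "compute-83": PciPool("compute-83", ["0000:83:10.7"]),
    "compute-85": PciPool("compute-85", ["0000:85:10.0", "0000:85:10.1"]),
}
migration = {"source_host": "compute-83", "dest_host": "compute-85"}
new_address = claim_for_migration(migration, pools, "test")
print(new_address)
```

The source host's pool is left untouched, so the instance keeps its original device there until the migration is confirmed or reverted.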

OpenStack Infra (hudson-openstack) wrote : Change abandoned on nova (master)

Change abandoned by Ludovic Beliveau (<email address hidden>) on branch: master
Review: https://review.openstack.org/326174
Reason: Patch was splitted

OpenStack Infra (hudson-openstack) wrote : Fix merged to nova (master)

Reviewed: https://review.openstack.org/349061
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=9a6fd30d2d854a4a1bedbbc22cb5236fb8ca0fad
Submitter: Jenkins
Branch: master

commit 9a6fd30d2d854a4a1bedbbc22cb5236fb8ca0fad
Author: Ludovic Beliveau <email address hidden>
Date: Fri Jul 29 15:57:07 2016 -0400

    PCI: Fix network calls order on finish_revert_resize()

    The driver was spawning the guest before migrate_instance_finish() was being
    called which caused the neutron ports to not be updated with the old PCI
    devices information. Doing so results in spawning the guest to go in error
    since the driver is using the wrong PCI devices.

    Change-Id: I9b715f638fda65dd2be6c6693e55c8502156ed57
    Partial-bug: #1512880

OpenStack Infra (hudson-openstack) wrote :

Reviewed: https://review.openstack.org/347444
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=d31a57654f42a8c04ec4ee826530f17e844ed3dd
Submitter: Jenkins
Branch: master

commit d31a57654f42a8c04ec4ee826530f17e844ed3dd
Author: Moshe Levi <email address hidden>
Date: Tue Jul 26 16:45:08 2016 +0300

    Update binding:profile for SR-IOV ports on resize-revert

    commit I7423907ef33648669decc561fc3461415b877ba6 updated
    the Neutron port information of the allocated PCI device on
    migration resize-confirm. This patch adds the logic for
    migration resize-revert

    Co-Authored-By: Ludovic Beliveau <email address hidden>

    Partial-bug: #1512880

    Change-Id: I376e482c2fb494ebd5a6b9e27c8cf2c78ce4874b

OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/nova 14.0.0.0b3

This issue was fixed in the openstack/nova 14.0.0.0b3 development milestone.

OpenStack Infra (hudson-openstack) wrote : Fix merged to nova (master)

Reviewed: https://review.openstack.org/349060
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=2f073fd8df4da9d828f4d36357dbf6605ad3dcb1
Submitter: Jenkins
Branch: master

commit 2f073fd8df4da9d828f4d36357dbf6605ad3dcb1
Author: Ludovic Beliveau <email address hidden>
Date: Fri Jul 29 11:51:47 2016 -0400

    Fix drop_move_claim() on revert resize

    On revert_resize(), the migration context was dropped hence setting it
    to None. The problem is that the migration context is needed when
    executing finish_revert_resize() so that the neutron port can be updated
    with the right binding profile.

    This patch moves the instance.drop_migration_context() call out from the
    resource tracker so that the migration context can be dropped at the
    appropriate times: after RT.drop_move_claim() is called in the compute
    manager's confirm_resize() call and after the virt driver's
    finish_revert_resize() in the compute manager's finish_revert_resize()
    call.

    Change-Id: I1b7b2573de4380576dd8b801ed59d8955b0ab72a
    Partial-bug: #1512880
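
The ordering problem described in this commit message can be modelled with a minimal sketch. The classes and functions below are simplified stand-ins that borrow names from the commit message; they are not nova's implementation, and the dictionary keys are hypothetical.

```python
class Instance:
    """Toy instance that only carries a migration context."""

    def __init__(self, migration_context):
        self.migration_context = migration_context

    def drop_migration_context(self):
        self.migration_context = None


def update_port_binding(instance):
    """Stand-in for the neutron port update; it needs the migration context."""
    if instance.migration_context is None:
        raise RuntimeError("migration context already dropped")
    return instance.migration_context["old_pci_slot"]


def finish_revert_resize(instance):
    # Use the context first, then drop it once it is no longer needed.
    restored_slot = update_port_binding(instance)
    instance.drop_migration_context()
    return restored_slot


# Illustrative addresses; the keys are assumptions for this sketch.
instance = Instance(
    {"old_pci_slot": "0000:83:10.6", "new_pci_slot": "0000:85:10.0"}
)
restored = finish_revert_resize(instance)
print(restored, instance.migration_context)
```

If the context were dropped before update_port_binding() ran, the sketch would raise, which mirrors the failure the patch avoids by moving the drop to a later point.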

OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (stable/newton)

Fix proposed to branch: stable/newton
Review: https://review.openstack.org/384316

OpenStack Infra (hudson-openstack) wrote : Fix merged to nova (stable/newton)

Reviewed: https://review.openstack.org/384316
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=8a2d7d4d4f774f692af291f29c0616c6113d5b6a
Submitter: Jenkins
Branch: stable/newton

commit 8a2d7d4d4f774f692af291f29c0616c6113d5b6a
Author: Ludovic Beliveau <email address hidden>
Date: Fri Jul 29 11:51:47 2016 -0400

    Fix drop_move_claim() on revert resize

    On revert_resize(), the migration context was dropped hence setting it
    to None. The problem is that the migration context is needed when
    executing finish_revert_resize() so that the neutron port can be updated
    with the right binding profile.

    This patch moves the instance.drop_migration_context() call out from the
    resource tracker so that the migration context can be dropped at the
    appropriate times: after RT.drop_move_claim() is called in the compute
    manager's confirm_resize() call and after the virt driver's
    finish_revert_resize() in the compute manager's finish_revert_resize()
    call.

    NOTE(lyarwood): Conflict caused by the DB warning clean up in I4529b4bdc.

    Conflicts:
            nova/tests/unit/compute/test_compute_mgr.py

    Change-Id: I1b7b2573de4380576dd8b801ed59d8955b0ab72a
    Partial-bug: #1512880
    (cherry picked from commit 2f073fd8df4da9d828f4d36357dbf6605ad3dcb1)

tags: added: in-stable-newton
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to nova (master)

Related fix proposed to branch: master
Review: https://review.openstack.org/603909

OpenStack Infra (hudson-openstack) wrote : Related fix merged to nova (master)

Reviewed: https://review.openstack.org/603909
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=fc8fe92945f69880a7fc2424fb5ea7ca27e79db1
Submitter: Zuul
Branch: master

commit fc8fe92945f69880a7fc2424fb5ea7ca27e79db1
Author: Matt Riedemann <email address hidden>
Date: Wed Sep 19 16:53:00 2018 -0400

    Mention SR-IOV cold migration limitation in admin docs

    It wasn't possible to perform a cold migration with
    SR-IOV devices attached to a server until Newton.

    This adds a mention of that limitation in the admin
    docs on PCI pass-through. Since the note this is going
    into already has a known limitation, it's re-formatted
    to be a list for readability and emphasis.

    Change-Id: I983107b5bf43cd46adfc8ca02e20dd30cb0b404c
    Related-Bug: #1512880
