Failed cold migration with SR-IOV

Bug #1512880 reported by Ludovic Beliveau on 2015-11-03
This bug affects 4 people
Affects                    Status  Importance  Assigned to       Milestone
OpenStack Compute (nova)           Medium      Ludovic Beliveau
Newton                             Medium      Unassigned

Bug Description

Cold migration of an instance that has an SR-IOV interface fails because nova on the destination compute tries to use the PCI device address that was allocated on the source compute. This fails because that PCI device is not present on the destination compute.

See the error "libvirtError: Device 0000:83:10.6 not found: could not access /sys/bus/pci/devices/0000:83:10.6/config: No such file or directory" in the log in the attachment.
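
The libvirt error above comes down to a missing sysfs entry on the destination host. As a minimal sketch (the helper name `pci_device_present` and its `sysfs_root` parameter are hypothetical and not part of nova), the same check can be made by hand on each compute:

```python
import os


def pci_device_present(address, sysfs_root="/sys/bus/pci/devices"):
    """Return True if <sysfs_root>/<address>/config exists.

    Hypothetical helper for illustration only. It tests for the same
    sysfs path that appears in the libvirt error message.
    """
    return os.path.exists(os.path.join(sysfs_root, address, "config"))
```

Calling `pci_device_present("0000:83:10.6")` on both computes should show which host actually owns the VF named in the error.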

Nova should allocate a new PCI device based on the hardware configuration of the compute to which the instance is being migrated, and this PCI device should be used to create the instance XML.

Nova version:
commit 2397d636ff6ea3767fe62ee681d609fce4fc98ca
Author: OpenStack Proposal Bot <email address hidden>
Date: Tue Oct 27 06:30:34 2015 +0000

    Imported Translations from Zanata

    For more information about this automatic import see:
    https://wiki.openstack.org/wiki/Translations/Infrastructure

    Change-Id: I38f537e37972e5ddae13d388021412d85f6be898

Devstack setup:

* One server configured with controller and compute functions
    * Intel 10G port is configured with 8 VFs: $ echo 8 > /sys/bus/pci/devices/0000\:85\:00.0/sriov_numvfs
    * /etc/nova/nova.conf: pci_passthrough_whitelist = {"address":"*:85:10.*","physical_network":"default"}
* One server configured with compute function only
    * Intel 10G port is configured with 8 VFs: $ echo 8 > /sys/bus/pci/devices/0000\:83\:00.0/sriov_numvfs
    * /etc/nova/nova.conf: pci_passthrough_whitelist = {"address":"*:83:10.*","physical_network":"default"}
* Note that it is important for this test that the PCI addresses for the SR-IOV interfaces are different. We want to validate that new PCI devices are claimed/allocated on the incoming compute.
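
To see why the two whitelists in this setup never overlap, here is a rough sketch that treats the "address" value as a shell-style glob. This is an assumption for illustration: nova's real whitelist parsing may work differently, and `is_whitelisted` is a hypothetical name.

```python
from fnmatch import fnmatchcase

# Address patterns copied from the two nova.conf entries above.
CONTROLLER_WHITELIST = "*:85:10.*"
COMPUTE_ONLY_WHITELIST = "*:83:10.*"


def is_whitelisted(address, pattern):
    """Glob-match a PCI address against a whitelist pattern.

    Assumption: shell-style globbing approximates the matching; this is
    not nova's implementation.
    """
    return fnmatchcase(address, pattern)
```

Under that assumption, the VF from the error message (0000:83:10.6) matches only the compute-only server's pattern, so a device with that address can never be claimed on the controller.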

Reproduce steps:

1) Boot an instance with an SR-IOV interface:

$ NETID=`neutron net-list | grep default | awk '{print $2}'`
$ neutron port-create $NETID --binding:vnic-type direct --name p-direct
$ PORTID=`neutron port-list | grep "p-direct" | awk '{print $2}'`
$ nova boot test --image=ubuntu --nic port-id=$PORTID --flavor=m1.small

2) Migrate the instance to the other compute:

$ nova migrate test

Expected result:

The instance is successfully migrated to the other server.

Actual result:

The instance fails to migrate and is stuck in the error state. See the log in the attachment for more information.

Fix proposed to branch: master
Review: https://review.openstack.org/242573

Changed in nova:
assignee: nobody → Ludovic Beliveau (ludovic-beliveau)
status: New → In Progress

Fix proposed to branch: master
Review: https://review.openstack.org/305978

Changed in nova:
assignee: Ludovic Beliveau (ludovic-beliveau) → Moshe Levi (moshele)
Moshe Levi (moshele) on 2016-06-01
Changed in nova:
assignee: Moshe Levi (moshele) → nobody
assignee: nobody → Ludovic Beliveau (ludovic-beliveau)
Changed in nova:
assignee: Ludovic Beliveau (ludovic-beliveau) → Moshe Levi (moshele)
Moshe Levi (moshele) on 2016-07-14
Changed in nova:
assignee: Moshe Levi (moshele) → Ludovic Beliveau (ludovic-beliveau)
Changed in nova:
importance: Undecided → Medium

Change abandoned by Ludovic Beliveau (<email address hidden>) on branch: master
Review: https://review.openstack.org/305978
Reason: Another patch set implemented a fix for this bug using migration context.

Reviewed: https://review.openstack.org/242573
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=dfdae01bd03f522ffab7876b253ec41641934702
Submitter: Jenkins
Branch: master

commit dfdae01bd03f522ffab7876b253ec41641934702
Author: Ludovic Beliveau <email address hidden>
Date: Fri Nov 6 12:22:16 2015 -0500

    Update binding:profile for SR-IOV ports

    The libvirt driver relies on the network information for configuring the
    SR-IOV interfaces (binding:vnic_type). The network information is specified
    when creating a Neutron port (binding:profile, which contains the PCI
    address of the device).

    During a migration, the Neutron port information needs to be updated based on
    the PCI device allocated on the destination node.

    Co-Authored-By: Moshe Levi <email address hidden>

    Closes-Bug: #1512880

    Change-Id: I7423907ef33648669decc561fc3461415b877ba6
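
The port update described in this commit can be pictured as rewriting one field of the binding profile. The sketch below is illustrative only: `rebind_profile` is a hypothetical helper, and it assumes the PCI address is stored under a `pci_slot` key in binding:profile. It does not reproduce the code in the review.

```python
import copy


def rebind_profile(profile, new_pci_address):
    """Return a copy of a binding:profile dict that points at a new VF.

    Assumption: the PCI address lives under the "pci_slot" key. The
    original dict is left untouched so the old value stays available,
    for example when a migration is reverted.
    """
    updated = copy.deepcopy(profile)
    updated["pci_slot"] = new_pci_address
    return updated
```

For example, `rebind_profile({"pci_slot": "0000:83:10.6", "physical_network": "default"}, "0000:85:10.2")` returns a profile whose `pci_slot` is the destination VF, where 0000:85:10.2 is a made-up address that matches the controller's whitelist.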

Changed in nova:
status: In Progress → Fix Released

Change abandoned by Ludovic Beliveau (<email address hidden>) on branch: master
Review: https://review.openstack.org/331830
Reason: Functionality included in https://review.openstack.org/#/c/242573

Fix proposed to branch: master
Review: https://review.openstack.org/348522

Fix proposed to branch: master
Review: https://review.openstack.org/349060

Fix proposed to branch: master
Review: https://review.openstack.org/349061

Reviewed: https://review.openstack.org/348522
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=3db49d339e6586c92bfe27a8e8edbd7cee6f864f
Submitter: Jenkins
Branch: master

commit 3db49d339e6586c92bfe27a8e8edbd7cee6f864f
Author: Ludovic Beliveau <email address hidden>
Date: Thu Jul 28 15:03:07 2016 -0400

    Follow up on Update binding:profile for SR-IOV ports

    This patch addresses comments from https://review.openstack.org/#/c/242573

    - Only load PCI mappings when there is at least one port of type SR-IOV.
    - Fixed exception message on neutron port update.

    Change-Id: I88586c66a2b2b84a2ff4a5c00857d3da19148268
    Partial-bug: #1512880

Reviewed: https://review.openstack.org/328983
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=5129f48226cb06a77faa89dd04d290d49e76036f
Submitter: Jenkins
Branch: master

commit 5129f48226cb06a77faa89dd04d290d49e76036f
Author: Ludovic Beliveau <email address hidden>
Date: Mon Jun 13 08:30:15 2016 -0400

    Allocate PCI devices on migration

    Enable the resource tracker to allocate PCI devices on migration. The
    resource tracker that corresponds to the destination node in the migration
    object is responsible for allocating the PCI devices.

    Change-Id: If58a1483930d6666c59fd98c00c4e16a297ddead
    Partial-bug: #1512880
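
As a toy model of the idea in this commit (not nova's resource tracker; `VfPool` and `claim_on_destination` are hypothetical names), each host can be thought of as owning its own pool of free VFs, with the destination claiming from its own pool instead of reusing the source address:

```python
class VfPool:
    """Toy per-host pool of free SR-IOV VF addresses (illustration only)."""

    def __init__(self, addresses):
        self.free = list(addresses)
        self.claimed = {}  # instance id -> claimed PCI address

    def claim(self, instance_id):
        """Claim the first free VF for an instance and return its address."""
        if not self.free:
            raise RuntimeError("no free VF on this host")
        address = self.free.pop(0)
        self.claimed[instance_id] = address
        return address


def claim_on_destination(destination_pool, instance_id):
    """Allocate a VF from the destination host's own pool.

    The source host's claim is deliberately not touched here, so it
    remains in place in case the migration is reverted.
    """
    return destination_pool.claim(instance_id)
```

The point of the sketch is only that the address returned comes from the destination's pool, so it is guaranteed to exist on the host that will spawn the guest.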

Change abandoned by Ludovic Beliveau (<email address hidden>) on branch: master
Review: https://review.openstack.org/326174
Reason: Patch was split

Reviewed: https://review.openstack.org/349061
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=9a6fd30d2d854a4a1bedbbc22cb5236fb8ca0fad
Submitter: Jenkins
Branch: master

commit 9a6fd30d2d854a4a1bedbbc22cb5236fb8ca0fad
Author: Ludovic Beliveau <email address hidden>
Date: Fri Jul 29 15:57:07 2016 -0400

    PCI: Fix network calls order on finish_revert_resize()

    The driver was spawning the guest before migrate_instance_finish() was
    called, so the neutron ports were not updated with the old PCI device
    information. As a result, spawning the guest failed because the driver
    was using the wrong PCI devices.

    Change-Id: I9b715f638fda65dd2be6c6693e55c8502156ed57
    Partial-bug: #1512880

Reviewed: https://review.openstack.org/347444
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=d31a57654f42a8c04ec4ee826530f17e844ed3dd
Submitter: Jenkins
Branch: master

commit d31a57654f42a8c04ec4ee826530f17e844ed3dd
Author: Moshe Levi <email address hidden>
Date: Tue Jul 26 16:45:08 2016 +0300

    Update binding:profile for SR-IOV ports on resize-revert

    Commit I7423907ef33648669decc561fc3461415b877ba6 updated
    the Neutron port information of the allocated PCI device on
    migration resize-confirm. This patch adds the logic for
    migration resize-revert.

    Co-Authored-By: Ludovic Beliveau <email address hidden>

    Partial-bug: #1512880

    Change-Id: I376e482c2fb494ebd5a6b9e27c8cf2c78ce4874b

This issue was fixed in the openstack/nova 14.0.0.0b3 development milestone.

Reviewed: https://review.openstack.org/349060
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=2f073fd8df4da9d828f4d36357dbf6605ad3dcb1
Submitter: Jenkins
Branch: master

commit 2f073fd8df4da9d828f4d36357dbf6605ad3dcb1
Author: Ludovic Beliveau <email address hidden>
Date: Fri Jul 29 11:51:47 2016 -0400

    Fix drop_move_claim() on revert resize

    On revert_resize(), the migration context was dropped hence setting it
    to None. The problem is that the migration context is needed when
    executing finish_revert_resize() so that the neutron port can be updated
    with the right binding profile.

    This patch moves the instance.drop_migration_context() call out from the
    resource tracker so that the migration context can be dropped at the
    appropriate times: after RT.drop_move_claim() is called in the compute
    manager's confirm_resize() call and after the virt driver's
    finish_revert_resize() in the compute manager's finish_revert_resize()
    call.

    Change-Id: I1b7b2573de4380576dd8b801ed59d8955b0ab72a
    Partial-bug: #1512880
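
The ordering problem described in this commit can be shown with a toy model. The function names below echo the commit message, but the bodies and the `old_pci_address` key are hypothetical and are not nova's implementation:

```python
class Instance:
    """Toy instance that carries a migration context (illustration only)."""

    def __init__(self, migration_context):
        self.migration_context = migration_context

    def drop_migration_context(self):
        self.migration_context = None


def update_port_binding(instance):
    """Read the old PCI address from the migration context.

    Fails if the context was already dropped, which is the situation
    the commit message describes.
    """
    if instance.migration_context is None:
        raise RuntimeError("migration context already dropped")
    return instance.migration_context["old_pci_address"]


def finish_revert_resize(instance):
    """Update the port first, then drop the migration context."""
    address = update_port_binding(instance)
    instance.drop_migration_context()
    return address
```

Dropping the context before calling `update_port_binding()` raises in this model, which mirrors why the drop had to move out of the resource tracker and into the compute manager.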

Reviewed: https://review.openstack.org/384316
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=8a2d7d4d4f774f692af291f29c0616c6113d5b6a
Submitter: Jenkins
Branch: stable/newton

commit 8a2d7d4d4f774f692af291f29c0616c6113d5b6a
Author: Ludovic Beliveau <email address hidden>
Date: Fri Jul 29 11:51:47 2016 -0400

    Fix drop_move_claim() on revert resize

    On revert_resize(), the migration context was dropped hence setting it
    to None. The problem is that the migration context is needed when
    executing finish_revert_resize() so that the neutron port can be updated
    with the right binding profile.

    This patch moves the instance.drop_migration_context() call out from the
    resource tracker so that the migration context can be dropped at the
    appropriate times: after RT.drop_move_claim() is called in the compute
    manager's confirm_resize() call and after the virt driver's
    finish_revert_resize() in the compute manager's finish_revert_resize()
    call.

    NOTE(lyarwood): Conflict caused by the DB warning clean up in I4529b4bdc.

    Conflicts:
            nova/tests/unit/compute/test_compute_mgr.py

    Change-Id: I1b7b2573de4380576dd8b801ed59d8955b0ab72a
    Partial-bug: #1512880
    (cherry picked from commit 2f073fd8df4da9d828f4d36357dbf6605ad3dcb1)

tags: added: in-stable-newton
