VMs with vif_type bridge/tap started before Rocky upgrade cannot be live migrated

Bug #1800511 reported by Mohammed Naser on 2018-10-29
This bug affects 1 person
Affects: OpenStack Compute (nova); Importance: High; Assigned to: Mohammed Naser
Affects: Rocky; Importance: High; Assigned to: Mohammed Naser

Bug Description

In Rocky, the following patch started adding an MTU to the network configuration of VMs:

https://github.com/openstack/nova/commit/f02b3800051234ecc14f3117d5987b1a8ef75877

However, this had little effect on live migrations, because Nova did not touch the network portion of the XML during live migration until this patch:

https://github.com/openstack/nova/commit/2b52cde565d542c03f004b48ee9c1a6a25f5b7cd

With that change, the MTU is added to the interface configuration, so the destination is launched with host_mtu=N, which apparently changes the guest ABI (see: https://bugzilla.redhat.com/show_bug.cgi?id=1449346). As a result, the live migration fails with an error like this:

2018-10-29 14:59:15.126+0000: 5289: error : qemuProcessReportLogError:1914 : internal error: qemu unexpectedly closed the monitor: 2018-10-29T14:59:14.977084Z qemu-kvm: get_pci_config_device: Bad config data: i=0x10 read: 61 device: 1 cmask: ff wmask: c0 w1cmask:0
2018-10-29T14:59:14.977105Z qemu-kvm: Failed to load PCIDevice:config
2018-10-29T14:59:14.977109Z qemu-kvm: Failed to load virtio-net:virtio
2018-10-29T14:59:14.977112Z qemu-kvm: error while loading state for instance 0x0 of device '0000:00:03.0/virtio-net'
2018-10-29T14:59:14.977283Z qemu-kvm: load of migration failed: Invalid argument

I was able to verify this further: `host_mtu` appears in the QEMU command line recorded in the destination host's instance log, /var/log/libvirt/qemu/instance-foo.log
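
For anyone who wants to repeat that check, here is a minimal sketch of scanning the text of a libvirt QEMU instance log for `host_mtu`. The helper name `find_host_mtu` and the sample command-line fragment are made up for illustration; only the `host_mtu=N` option itself comes from this report.

```python
import re


def find_host_mtu(qemu_log_text):
    """Return every host_mtu value found in the given log text, as ints."""
    return [int(value) for value in re.findall(r"\bhost_mtu=(\d+)", qemu_log_text)]


# Hypothetical fragment of a QEMU command line, not copied from a real log.
sample_cmdline = (
    "-device virtio-net-pci,host_mtu=1450,netdev=hostnet0,"
    "id=net0,mac=fa:16:3e:00:00:01,bus=pci.0,addr=0x3"
)

print(find_host_mtu(sample_cmdline))  # [1450]
```

An empty list for the source host's log and a non-empty list for the destination's would be consistent with the mismatch described above.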

Mohammed Naser (mnaser) on 2018-10-29
Changed in nova:
assignee: nobody → Mohammed Naser (mnaser)
Matt Riedemann (mriedem) on 2018-10-29
tags: added: libvirt live-migration upgrade
Changed in nova:
importance: Undecided → High
status: New → Triaged
Matt Riedemann (mriedem) wrote :

FWIW I don't think https://github.com/openstack/nova/commit/2b52cde565d542c03f004b48ee9c1a6a25f5b7cd really changed how https://github.com/openstack/nova/commit/f02b3800051234ecc14f3117d5987b1a8ef75877 could have broken anything. _update_vif_xml is called from the source host using migrate data from the dest host, but as far as I know that migrate data doesn't have any information about mtu from the dest to determine what to set in the source vif config. Before _update_vif_xml, we would have just sent the source guest xml vif config to the dest and if the dest didn't support mtu it would have failed also.

Matt Riedemann (mriedem) wrote :

(12:54:52 PM) cfriesen: mriedem: I think the issue is that the instance originally didn't have mtu in it
(12:54:59 PM) cfriesen: then we upgraded nova, and now it would have mtu
(12:55:53 PM) cfriesen: nova on the dest doesn't need to have the mtu, but the xml that we pass the live migration needs to match the xml that was used to start the running instance.

Given that, then yes https://github.com/openstack/nova/commit/2b52cde565d542c03f004b48ee9c1a6a25f5b7cd would have caused the regression since it changed the vif xml config to add the mtu setting which otherwise wasn't in the guest xml when it was last started.
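
To illustrate the point about the XML needing to match, here is a rough diagnostic sketch that compares the running domain XML with the XML that would be sent for the migration and reports interfaces whose MTU setting differs. This is not Nova code; the function names and the sample XML are assumptions made up for this example, and it only looks at the libvirt `<mtu size='...'/>` element.

```python
import xml.etree.ElementTree as ET


def mtu_by_mac(domain_xml):
    """Map each interface MAC address to its MTU size (None if no <mtu>)."""
    result = {}
    for iface in ET.fromstring(domain_xml).findall("./devices/interface"):
        mac = iface.find("mac")
        mtu = iface.find("mtu")
        if mac is not None:
            result[mac.get("address")] = None if mtu is None else mtu.get("size")
    return result


def mtu_mismatches(running_xml, migration_xml):
    """Return the MACs whose MTU setting differs between the two documents."""
    running = mtu_by_mac(running_xml)
    target = mtu_by_mac(migration_xml)
    return sorted(
        mac for mac in running if mac in target and running[mac] != target[mac]
    )


# Hypothetical domain XML: the guest was started without an MTU ...
running_xml = """
<domain><devices>
  <interface type='bridge'>
    <mac address='fa:16:3e:00:00:01'/>
    <source bridge='br0'/>
  </interface>
</devices></domain>
"""

# ... but the XML prepared for the migration now carries one.
migration_xml = """
<domain><devices>
  <interface type='bridge'>
    <mac address='fa:16:3e:00:00:01'/>
    <source bridge='br0'/>
    <mtu size='1450'/>
  </interface>
</devices></domain>
"""

print(mtu_mismatches(running_xml, migration_xml))  # ['fa:16:3e:00:00:01']
```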

Matt Riedemann (mriedem) on 2018-10-29
summary: - VMs started before Rocky upgrade cannot be live migrated
+ VMs with vif_type bridge/tap started before Rocky upgrade cannot be live
+ migrated

Fix proposed to branch: master
Review: https://review.openstack.org/614008

Changed in nova:
status: Triaged → In Progress

Reviewed: https://review.openstack.org/614004
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=29ee8011b4e1cf5371b1aa9c6c0f930eb49fe795
Submitter: Zuul
Branch: master

commit 29ee8011b4e1cf5371b1aa9c6c0f930eb49fe795
Author: Mohammed Naser <email address hidden>
Date: Mon Oct 29 19:18:26 2018 +0100

    Add tests for bug #1800511

    In Rocky, we started including the MTU for networks when vif_type
    is bridge or tap and libvirt >= 3.3.0, however this also meant
    that we started specifying the MTU when doing live migrations.

    It seems that the guest ABI changes when setting the MTU which
    means that live migrations will fail. This broke the live migration
    of any instance that was launched prior to upgrading to Rocky, as
    it was not loaded with the ABI having host_mtu specified.

    This is a passing (negative) test-case which will include a follow
    up patch that contains a fix and correcting tests.

    Related-Bug: #1800511
    Change-Id: Ia2fe50d727b1f83e808cb9dda3a55f853f048a3e

Reviewed: https://review.openstack.org/614008
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=643f53f5e9544d6c98833b1ae3dd472602118a1f
Submitter: Zuul
Branch: master

commit 643f53f5e9544d6c98833b1ae3dd472602118a1f
Author: Mohammed Naser <email address hidden>
Date: Mon Oct 29 19:49:41 2018 +0100

    libvirt: Avoid setting MTU during live migration if unset

    If there is a live migration of an instance that was launched
    before change Iecc265fb25e88fa00a66f1fd38e215cad53e7669, it
    would not have an mtu set and therefore it wouldn't have it in
    the XML.

    When live migrating, the mtu is added which changes the
    guest ABI[1], causing the live migration to fail. The failure
    occurs when trying to live migrate an instance that:

    - Launched before change Iecc265fb25e88fa00a66f1fd38e215cad53e7669
    - It has not been rebooted (i.e. XML has not changed since)
    - It's using bridge/tap networking
    - Migration attempted after change Iecc265fb25e88fa00a66f1fd38e215cad53e7669

    This patch prevents this by avoiding setting MTU if the running
    instance does not have one configured in its domain XML.

    [1]: https://bugzilla.redhat.com/show_bug.cgi?id=1449346

    Closes-Bug #1800511
    Change-Id: I6e2e6437a7c826dc425d8b353c38670d6eece0b5
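
The idea in the commit message above (leave the MTU out when the running instance has none in its domain XML) can be sketched roughly as follows. This is an illustration only and not the code from the patch: the function name `drop_mtu_if_unset`, the use of the standard-library XML parser, and the sample interface XML are all assumptions.

```python
import xml.etree.ElementTree as ET


def drop_mtu_if_unset(running_iface_xml, new_iface_xml):
    """Return new_iface_xml, minus <mtu> if the running interface has none."""
    running = ET.fromstring(running_iface_xml)
    new = ET.fromstring(new_iface_xml)
    if running.find("mtu") is None:
        # The guest was started without an MTU, so keep the XML consistent
        # with what the running QEMU process was launched with.
        for mtu in new.findall("mtu"):
            new.remove(mtu)
    return ET.tostring(new, encoding="unicode")


# Hypothetical interface definitions for a bridge vif.
running_iface = (
    "<interface type='bridge'>"
    "<mac address='fa:16:3e:00:00:01'/><source bridge='br0'/>"
    "</interface>"
)
new_iface = (
    "<interface type='bridge'>"
    "<mac address='fa:16:3e:00:00:01'/><source bridge='br0'/>"
    "<mtu size='1450'/>"
    "</interface>"
)

print(drop_mtu_if_unset(running_iface, new_iface))
```

If the running interface already has an `<mtu>` element (for example, because the instance was rebooted after the upgrade), the new definition is returned with its MTU intact.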

Changed in nova:
status: In Progress → Fix Released

Reviewed: https://review.openstack.org/614040
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=d28e7346ec1b3730d9bedf4a1b84223b89083635
Submitter: Zuul
Branch: stable/rocky

commit d28e7346ec1b3730d9bedf4a1b84223b89083635
Author: Mohammed Naser <email address hidden>
Date: Mon Oct 29 19:18:26 2018 +0100

    Add tests for bug #1800511

    In Rocky, we started including the MTU for networks when vif_type
    is bridge or tap and libvirt >= 3.3.0, however this also meant
    that we started specifying the MTU when doing live migrations.

    It seems that the guest ABI changes when setting the MTU which
    means that live migrations will fail. This broke the live migration
    of any instance that was launched prior to upgrading to Rocky, as
    it was not loaded with the ABI having host_mtu specified.

    This is a passing (negative) test-case which will include a follow
    up patch that contains a fix and correcting tests.

    (cherry picked from commit 29ee8011b4e1cf5371b1aa9c6c0f930eb49fe795)

    Related-Bug: #1800511
    Change-Id: Ia2fe50d727b1f83e808cb9dda3a55f853f048a3e

tags: added: in-stable-rocky

Reviewed: https://review.openstack.org/614041
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=66ec5b4334cd0e1fc65f21ab21b4b088dca87f2b
Submitter: Zuul
Branch: stable/rocky

commit 66ec5b4334cd0e1fc65f21ab21b4b088dca87f2b
Author: Mohammed Naser <email address hidden>
Date: Mon Oct 29 19:49:41 2018 +0100

    libvirt: Avoid setting MTU during live migration if unset

    If there is a live migration of an instance that was launched
    before change Iecc265fb25e88fa00a66f1fd38e215cad53e7669, it
    would not have an mtu set and therefore it wouldn't have it in
    the XML.

    When live migrating, the mtu is added which changes the
    guest ABI[1], causing the live migration to fail. The failure
    occurs when trying to live migrate an instance that:

    - Launched before change Iecc265fb25e88fa00a66f1fd38e215cad53e7669
    - It has not been rebooted (i.e. XML has not changed since)
    - It's using bridge/tap networking
    - Migration attempted after change Iecc265fb25e88fa00a66f1fd38e215cad53e7669

    This patch prevents this by avoiding setting MTU if the running
    instance does not have one configured in its domain XML.

    [1]: https://bugzilla.redhat.com/show_bug.cgi?id=1449346

    (cherry picked from commit 643f53f5e9544d6c98833b1ae3dd472602118a1f)

    Closes-Bug #1800511
    Change-Id: I6e2e6437a7c826dc425d8b353c38670d6eece0b5
