live migrations slow

Bug #1786346 reported by Matthew Thode on 2018-08-09
This bug affects 7 people
Affects                    Importance   Assigned to
OpenStack Compute (nova)   High         Dan Smith
Pike                       High         Dan Smith
Queens                     High         Dan Smith
Rocky                      High         Matt Riedemann

Bug Description

This is on pike with ff747792b8f5aefe1bebb01bdf49dacc01353348 (on b58c7f033771e3ea228e4b40c796d1bc95a087f5 to be precise). Live migrations seem to be stuck at 1MiB/s for linuxbridge VMs. The code actually looks like it should fail with a timeout (we never see the network-interface-created debug message); not sure why it succeeds.

Matthew Thode (prometheanfire) wrote :
melanie witt (melwitt) on 2018-08-09
tags: added: live-migration rocky-rc-potential
Changed in nova:
importance: Undecided → High
status: New → Confirmed
Dan Smith (danms) wrote :

This appears to be now fundamentally broken on Rocky, and racy on Queens and Pike. Below is the current thinking from live conversation and debugging of this with the reporter.

The event that triggers VIF plugging happens here:

https://github.com/openstack/nova/blob/master/nova/virt/libvirt/driver.py#L7775

in pre_live_migration(). That is well before the wait_for_instance_event() call is made in the libvirt driver as added here:

https://review.openstack.org/#/c/497457/30/nova/virt/libvirt/driver.py

This means that in Pike/Queens we race to the point at which the event shows up. If we win, we receive it and do the slow-to-fast dance as expected. If we lose the race, the event arrives before we start listening for it, in which case we time out but don't actually stop the live migration, which then continues at 1MB/s until it completes.
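Conceptually, the lost-event race looks like this minimal sketch (illustrative Python only; `EventWaiter`, `wait_for`, and `fire` are simplified stand-ins, not nova's actual API). An event that fires before a waiter is registered is simply dropped, so a later wait can only time out:

```python
import threading

class EventWaiter:
    """Delivers named events only to waiters registered *before* the event fires."""
    def __init__(self):
        self._waiters = {}
        self._lock = threading.Lock()

    def wait_for(self, name, timeout):
        # Register interest first, then block; returns False on timeout.
        with self._lock:
            ev = self._waiters.setdefault(name, threading.Event())
        return ev.wait(timeout)

    def fire(self, name):
        with self._lock:
            ev = self._waiters.get(name)
        if ev:
            ev.set()
        # No registered waiter: the event is silently lost.

waiter = EventWaiter()

# Case 1: we win the race -- the waiter is registered before the event fires.
threading.Timer(0.05, waiter.fire, args=("network-vif-plugged",)).start()
won = waiter.wait_for("network-vif-plugged", timeout=1.0)   # True

# Case 2: we lose the race -- the event fires before anyone is waiting.
waiter.fire("network-vif-plugged-lost")                     # dropped
lost = waiter.wait_for("network-vif-plugged-lost", timeout=0.1)  # False
```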

On Rocky, we now always listen for that event a level up in the compute manager:

https://review.openstack.org/#/c/558001/10/nova/compute/manager.py

which is properly wrapped around pre_live_migration(), and will always wait for and consume the plugging event before calling into the driver. Thus, the driver will never receive the event it is waiting for, so we will always time out and always run migrations at 1MB/s until completion.
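The Rocky failure mode is not a race but a guaranteed miss: the event is one-shot, and the compute manager consumes it before the driver ever waits for it. A minimal sketch of that one-shot consumption (illustrative only, not nova code):

```python
import queue

# Model the event as a one-shot item on a queue: whoever gets it first eats it.
events = queue.Queue()
events.put("network-vif-plugged")

def wait(timeout):
    try:
        return events.get(timeout=timeout)
    except queue.Empty:
        return None

first = wait(0.1)    # compute manager's waiter consumes the event
second = wait(0.1)   # libvirt driver's waiter finds nothing and times out
```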

Matt Riedemann (mriedem) wrote :

I've got a test patch here:

https://review.openstack.org/#/c/553608/

It drops vif_plugging_timeout to 30 seconds in a multi-node linuxbridge job (so we run live migration) and sets live_migration_wait_for_vif_plug=True so the compute manager will wait for and consume the network-vif-plugged event before the driver.live_migration method runs and adds its own waiter. Hopefully, with the 1-second bandwidth toggle on the guest, that will exceed the 30-second vif_plugging_timeout threshold and we'll see the live migration fail.
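For reference, the job configuration described above corresponds roughly to this nova.conf fragment on the compute node (option names are from the comment; the section placement reflects my understanding of nova's config layout and should be checked against the configuration reference for your release):

```ini
[DEFAULT]
# Fail VIF plugging if the network-vif-plugged event does not arrive in time.
vif_plugging_timeout = 30

[compute]
# Wait for (and consume) network-vif-plugged in the compute manager
# before calling into the driver during live migration.
live_migration_wait_for_vif_plug = True
```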

Matt Riedemann (mriedem) wrote :

> On Rocky, we now always listen for that event a level up in the compute manager:

By default we don't actually wait. That wait option was made False by default for backward compatibility with the todo to make it True by default in Stein.

Dan Smith (danms) wrote :

> By default we don't actually wait. That wait option was made False by default for backward
> compatibility with the todo to make it True by default in Stein.

Ack, yep. But the version inside libvirt will be racy at best (the Pike/Queens behavior above) if the compute feature is disabled, and enabling the compute feature will conflict with the libvirt version (the Rocky behavior above) and make migrations take forever.

Dan Smith (danms) wrote :

Just some further clarification:

- The few hacked-up gate jobs we used to test this feature at merge time likely didn't notice the race because the migrations finished before the potential timeout, and/or ran on systems so loaded that the neutron event came late enough for us to win the race repeatedly.

- The fact that we raise but don't actually cancel the ongoing live migration task means that we log an error and report an error on the migration object, but the actual migration thread is left running, completing at extremely low speed over a very long time.
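The "raise but don't cancel" behavior can be sketched like so (illustrative Python, not nova code): the waiter's timeout produces the error path, but nothing signals the thread doing the actual transfer, so it keeps running:

```python
import threading

def migrate():
    # Stand-in for the throttled live-migration transfer loop.
    threading.Event().wait(0.3)

mig = threading.Thread(target=migrate)
mig.start()

# The waiter gives up after 0.05s; in nova this raises and logs an error...
timed_out = not threading.Event().wait(0.05)

# ...but the migration thread is still alive, transferring at 1MB/s.
still_running = mig.is_alive()
mig.join()
```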

Matt Riedemann (mriedem) on 2018-08-09
Changed in nova:
assignee: nobody → Dan Smith (danms)
Changed in nova:
status: Confirmed → In Progress
melanie witt (melwitt) wrote :

Just adding a note here that given the poor user experience with live migrations running at 1MiB/s because of the racing, we've decided we need to remove the slow => fast live migration functionality for the release and are targeting the removal for RC2.

Related fix proposed to branch: master
Review: https://review.openstack.org/590814

Reviewed: https://review.openstack.org/590538
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=411c45842f6b4bef21710b4d7fa2f473976bc566
Submitter: Zuul
Branch: master

commit 411c45842f6b4bef21710b4d7fa2f473976bc566
Author: Dan Smith <email address hidden>
Date: Thu Aug 9 14:37:28 2018 -0700

    Revert "libvirt: slow live-migration to ensure network is ready"

    This reverts commit 8e6d5d404cf49e5b68b43c62e7f6d7db2771a1f4.

    As detailed in the bug, this is overly racy, waiting for the event
    potentially a long ways (and a couple RPC calls) away from where
    the event will be triggered. The compute manager now has a generic
    mechanism to do this which conflicts and replaces this functionality,
    if enabled (default is off in Rocky, to be defaulted to on in Stein).

    Conflicts:
     nova/tests/unit/virt/libvirt/test_driver.py
     nova/virt/libvirt/driver.py

    Change-Id: Ibf2b5eeafd962e93ae4ab6290015d58c33024132
    Closes-Bug: #1786346

Changed in nova:
status: In Progress → Fix Released

The bug was reported on Pike; why has this been reverted from master without clearly identifying the root cause? Declaring everywhere that it's "racy" without any explanation does not really make sense, sorry.

How is the patch on the compute manager going to address the issue? It's not.

The specific work done in the libvirt driver for linux bridge to slow the migration exists because the TAP device is created by libvirt itself when calling migrateToURI; the neutron agent then discovers it and correctly sets up the L2 network. That was basically the whole purpose :/

The issue was noticed on Pike; it would have been interesting to get more details, particularly the neutron logs, since we all know this patch came with some fixes in Neutron, and perhaps one of them was badly backported. I mean, at least a bit more investigation would have been appreciated.

Reviewed: https://review.openstack.org/591275
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=afd1c1e6e152b6337d6e71af1c908c73c847b6f3
Submitter: Zuul
Branch: stable/rocky

commit afd1c1e6e152b6337d6e71af1c908c73c847b6f3
Author: Dan Smith <email address hidden>
Date: Thu Aug 9 14:37:28 2018 -0700

    Revert "libvirt: slow live-migration to ensure network is ready"

    This reverts commit 8e6d5d404cf49e5b68b43c62e7f6d7db2771a1f4.

    As detailed in the bug, this is overly racy, waiting for the event
    potentially a long ways (and a couple RPC calls) away from where
    the event will be triggered. The compute manager now has a generic
    mechanism to do this which conflicts and replaces this functionality,
    if enabled (default is off in Rocky, to be defaulted to on in Stein).

    Conflicts:
     nova/tests/unit/virt/libvirt/test_driver.py
     nova/virt/libvirt/driver.py

    Change-Id: Ibf2b5eeafd962e93ae4ab6290015d58c33024132
    Closes-Bug: #1786346
    (cherry picked from commit 411c45842f6b4bef21710b4d7fa2f473976bc566)

Reviewed: https://review.openstack.org/591761
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=dee99b1ed03de4b6ded94f3cf6d2ea7214bca93b
Submitter: Zuul
Branch: stable/queens

commit dee99b1ed03de4b6ded94f3cf6d2ea7214bca93b
Author: Dan Smith <email address hidden>
Date: Tue Aug 14 08:27:20 2018 -0700

    Revert "libvirt: slow live-migration to ensure network is ready"

    This reverts commit 841b0fbea6373ea2ae123f851fb90555faff12e2.

    Change-Id: Ibf2b5eeafd962e93ae4ab6290015d58c33024132
    Closes-Bug: #1786346

Reviewed: https://review.openstack.org/591762
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=b1b5a5eb96ec8f222e5bad9055ec7baeed9767cd
Submitter: Zuul
Branch: stable/pike

commit b1b5a5eb96ec8f222e5bad9055ec7baeed9767cd
Author: Dan Smith <email address hidden>
Date: Tue Aug 14 08:31:41 2018 -0700

    Revert "libvirt: slow live-migration to ensure network is ready"

    This reverts commit ff747792b8f5aefe1bebb01bdf49dacc01353348.

    Change-Id: Ibf2b5eeafd962e93ae4ab6290015d58c33024132
    Closes-Bug: #1786346

This issue was fixed in the openstack/nova 18.0.0.0rc2 release candidate.

Satish Patel (satish-txt) wrote :

For the record,

I am running Queens 17.0.7-6-g9187bb1 and I am having the same problem: my live migration is capped at 8Mbps. Waiting for a stable release :)

Satish Patel (satish-txt) wrote :

Today I upgraded my nova to pick up this fix, but it looks like I am still seeing the same issue; my live migration is running slow. What else should I look for?

Satish Patel (satish-txt) wrote :

This is interesting. I have two hosts, compute-06 & compute-07:

06 --> 07 (very slow migration)
07 --> 06 ( super fast migration)

[root@ostack-compute-07 ~]# nova --version
10.1.0

Here are the logs of the 06-to-07 migration; you can see it's very slow even after your patch:

2018-08-23 18:11:38.026 37085 WARNING nova.compute.manager [req-a9fe2b70-af22-4d6e-bee0-ddc64be0cf09 c3c091aa18b340fdaad9f95afeb4493c 328fc75d4f944a64ad1b8699c02350ca - default default] [instance: 6ce9af65-4e5d-45a2-91a5-0d42f5622022] Received unexpected event network-vif-plugged-1399100a-77c1-43eb-8d95-fc73a91b6bed for instance with vm_state active and task_state migrating.
2018-08-23 18:11:57.759 37085 INFO nova.virt.libvirt.driver [req-a9fe2b70-af22-4d6e-bee0-ddc64be0cf09 c3c091aa18b340fdaad9f95afeb4493c 328fc75d4f944a64ad1b8699c02350ca - default default] [instance: 6ce9af65-4e5d-45a2-91a5-0d42f5622022] Migration running for 30 secs, memory 44% remaining; (bytes processed=33072816, remaining=484225024, total=1091379200)
2018-08-23 18:11:57.821 37085 WARNING nova.compute.resource_tracker [req-a9fe2b70-af22-4d6e-bee0-ddc64be0cf09 c3c091aa18b340fdaad9f95afeb4493c 328fc75d4f944a64ad1b8699c02350ca - default default] [instance: 6ce9af65-4e5d-45a2-91a5-0d42f5622022] Instance not resizing, skipping migration.
2018-08-23 18:11:57.853 37085 INFO nova.compute.resource_tracker [req-a9fe2b70-af22-4d6e-bee0-ddc64be0cf09 c3c091aa18b340fdaad9f95afeb4493c 328fc75d4f944a64ad1b8699c02350ca - default default] Final resource view: name=ostack-compute-07.foo.net phys_ram=32733MB used_ram=5120MB phys_disk=9314GB used_disk=20GB total_vcpus=40 used_vcpus=2 pci_stats=[]
2018-08-23 18:12:28.721 37085 INFO nova.virt.libvirt.driver [req-a9fe2b70-af22-4d6e-bee0-ddc64be0cf09 c3c091aa18b340fdaad9f95afeb4493c 328fc75d4f944a64ad1b8699c02350ca - default default] [instance: 6ce9af65-4e5d-45a2-91a5-0d42f5622022] Migration running for 60 secs, memory 41% remaining; (bytes processed=66046896, remaining=451284992, total=1091379200)
2018-08-23 18:12:37.212 37085 INFO nova.compute.manager [req-a9fe2b70-af22-4d6e-bee0-ddc64be0cf09 c3c091aa18b340fdaad9f95afeb4493c 328fc75d4f944a64ad1b8699c02350ca - default default] [instance: 6ce9af65-4e5d-45a2-91a5-0d42f5622022] During sync_power_state the instance has a pending task (migrating). Skip.
2018-08-23 18:12:58.700 37085 WARNING nova.compute.resource_tracker [req-a9fe2b70-af22-4d6e-bee0-ddc64be0cf09 c3c091aa18b340fdaad9f95afeb4493c 328fc75d4f944a64ad1b8699c02350ca - default default] [instance: 6ce9af65-4e5d-45a2-91a5-0d42f5622022] Instance not resizing, skipping migration.
2018-08-23 18:12:58.734 37085 INFO nova.compute.resource_tracker [req-a9fe2b70-af22-4d6e-bee0-ddc64be0cf09 c3c091aa18b340fdaad9f95afeb4493c 328fc75d4f944a64ad1b8699c02350ca - default default] Final resource view: name=ostack-compute-07.foo.net phys_ram=32733MB used_ram=5120MB phys_disk=9314GB used_disk=20GB total_vcpus=40 used_vcpus=2 pci_stats=[]
2018-08-23 18:12:59.712 37085 INFO nova.virt.libvirt.driver [req-a9fe2b70-af22-4d6e-bee0-ddc64be0cf09 c3c091aa18b340fdaad9f95afeb4493c 328fc75d4f944a64ad1b8699c02350ca - default default] [instance: 6ce9af65-4e5d-45a2-91a5-0d42f5622022] Migration running for...


Satish Patel (satish-txt) wrote :

Folks,

Forgive me; by mistake I grabbed the wrong commit, and that is the reason I didn't see any change after applying the patch.

It works after applying correct version :) Thanks

Reviewed: https://review.openstack.org/591735
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=bfe89fec4669df6f4ac48dbb56fde3db0a24cbac
Submitter: Zuul
Branch: stable/queens

commit bfe89fec4669df6f4ac48dbb56fde3db0a24cbac
Author: Matt Riedemann <email address hidden>
Date: Fri Mar 30 17:16:08 2018 -0400

    Wait for network-vif-plugged before starting live migration

    This adds a new config option which is read on the source host
    during pre_live_migration which can be used to determine
    if it should wait for a "network-vif-plugged" event due to VIFs
    being plugged on the destination host. This helps us to
    avoid the guest transfer at all if vif plugging failed on the dest
    host, which we just wouldn't find out until post live migration
    and then we have to rollback.

    The option is disabled by default for backward compatibility and
    also because certain networking backends, like OpenDaylight, are
    known to not send network-vif-plugged events unless the port host
    binding information changes, which for live migration doesn't happen
    until after the guest is transferred to the destination host.

    Related to blueprint neutron-new-port-binding-api
    Related-Bug: #1786346

    NOTE(danms): Stable-only changes to this patch from master include
    removing the RPC-related communication from the destination
    to the source node. As such, the new option is read on the source
    node so the conf option help and release note are updated. This is
    OK before Rocky since we don't claim support to live migrate
    between different networking backends (vif types), so operators
    would need to set the option universally, or at least have host
    aggregates in place if they are using different network types.

    Conflicts:
            nova/conf/compute.py
            nova/tests/unit/objects/test_migrate_data.py

    Change-Id: I0f3ab6604d8b79bdb75cf67571e359cfecc039d8
    (cherry picked from commit 5aadff75c3ac4f2019838600df6580481a96db0f)

tags: added: in-stable-queens

Reviewed: https://review.openstack.org/590814
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=79a74d8185ed99be0e5041724f32af82ad8fe8d9
Submitter: Zuul
Branch: master

commit 79a74d8185ed99be0e5041724f32af82ad8fe8d9
Author: Matt Riedemann <email address hidden>
Date: Fri Aug 10 12:21:19 2018 -0400

    Revert "libvirt: add method to configure migration speed"

    This reverts commit 23446a9552b5be3b040278646149a0f481d0a005.

    With change Ibf2b5eeafd962e93ae4ab6290015d58c33024132 there
    is nothing using the migrate_configure_max_speed method any
    longer and can be removed. An additional mock, added after
    the change being reverted, is also removed.

    Change-Id: I90d6e14bf9383bf71d65d2180474ba228db2feab
    Related-Bug: #1786346

Has this bug been resolved? And in what version of nova?

Dan Smith (danms) wrote :

This should be fixed in the GA of Rocky AFAIK, and is being backported to the other branches (releases pending in the next day or so).

This issue was fixed in the openstack/nova 16.1.5 release.

This issue was fixed in the openstack/nova 17.0.6 release.

Alex Walender (awalende) wrote :

How long does it usually take for those updates to become available in the standard bionic apt repository?

The currently available version is 2:17.0.5-0ubuntu2 for Ubuntu 18.04.

This issue was fixed in the openstack/nova 19.0.0.0rc1 release candidate.
