live migrations slow
Bug #1786346 reported by Matthew Thode
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
OpenStack Compute (nova) | Fix Released | High | Dan Smith |
Pike | Fix Released | High | Dan Smith |
Queens | Fix Released | High | Dan Smith |
Rocky | Fix Committed | High | Matt Riedemann |
Bug Description
This is on Pike with ff747792b8f5aef.
tags: added: live-migration rocky-rc-potential
Changed in nova:
importance: Undecided → High
status: New → Confirmed
Changed in nova:
assignee: nobody → Dan Smith (danms)
Changed in nova:
status: Confirmed → In Progress
This now appears to be fundamentally broken on Rocky, and racy on Queens and Pike. Below is the current thinking from live conversation and debugging of this with the reporter.
The event that triggers VIF plugging happens here:
https://github.com/openstack/nova/blob/master/nova/virt/libvirt/driver.py#L7775
in pre_live_migration(). That is well before the wait_for_instance_event() call is made in the libvirt driver, as added here:
https://review.openstack.org/#/c/497457/30/nova/virt/libvirt/driver.py
This means that in Pike/Queens we race against the point at which the event shows up. If we win, we receive it and do the slow-to-fast dance as expected. If we lose the race, the event arrives before we start listening for it, in which case we time out but don't actually stop the live migration, which then continues at 1MB/s until it completes.
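To illustrate the lost-event race, here is a minimal Python sketch. The EventWaiter class and its method names are hypothetical simplifications of nova's external-event machinery, not nova's actual API; the point is only that an event arriving before the listener registers is silently dropped, so the waiter times out while the migration keeps crawling.

```python
import threading

class EventWaiter:
    """Simplified stand-in for nova's external event plumbing (hypothetical)."""
    def __init__(self):
        self._events = {}            # event name -> threading.Event
        self._lock = threading.Lock()

    def prepare(self, name):
        # Start listening: only events arriving AFTER this call are caught.
        with self._lock:
            self._events[name] = threading.Event()

    def notify(self, name):
        # Incoming external event; silently dropped if nobody is listening yet.
        with self._lock:
            ev = self._events.get(name)
        if ev:
            ev.set()

    def wait(self, name, timeout):
        with self._lock:
            ev = self._events.get(name)
        return ev.wait(timeout) if ev else False

waiter = EventWaiter()

# Destination host: pre_live_migration() plugs VIFs, which makes Neutron
# emit network-vif-plugged. This can fire before prepare() below runs.
threading.Thread(target=waiter.notify, args=('network-vif-plugged',)).start()

waiter.prepare('network-vif-plugged')   # may register too late
if waiter.wait('network-vif-plugged', timeout=2.0):
    print('won the race: switch migration from slow to full speed')
else:
    # Lost the race: we time out, but the migration is NOT aborted,
    # so it keeps running at the throttled 1MB/s rate to completion.
    print('timed out: migration crawls at 1MB/s until it completes')
```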
On Rocky, we now always listen for that event a level up in the compute manager:
https://review.openstack.org/#/c/558001/10/nova/compute/manager.py
which is properly wrapped around pre_live_migration() and will always wait for and eat the plugging event before calling into the driver. Thus, the driver will never receive the event it is waiting for; we will always time out and always run migrations at 1MB/s until completion.
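A minimal sketch of the Rocky failure mode, again with hypothetical simplified helpers rather than nova's real signatures: the manager-level wait consumes the single network-vif-plugged event, so the driver-level wait that was added for Pike/Queens can never see it and deterministically times out.

```python
import threading
from contextlib import contextmanager

_event = threading.Event()   # stands in for one network-vif-plugged event

def neutron_sends_vif_plugged():
    _event.set()

@contextmanager
def manager_wait_for_instance_event(timeout):
    # Compute-manager wrapper (simplified): run the body, then wait for
    # and "eat" the plugging event before control returns to the driver.
    yield
    if _event.wait(timeout):
        _event.clear()       # event consumed here, at the manager level
    else:
        raise TimeoutError('vif plugging event never arrived')

def driver_live_migration(timeout=2.0):
    # Driver-level wait added for Pike/Queens: the event was already
    # consumed above, so this path always times out.
    if _event.wait(timeout):
        print('driver saw event: run migration at full speed')
    else:
        print('driver timed out: migrate at 1MB/s until completion')

with manager_wait_for_instance_event(timeout=5.0):
    # pre_live_migration() on the destination plugs VIFs -> event fires.
    neutron_sends_vif_plugged()

driver_live_migration()      # always takes the 1MB/s branch on "Rocky"
```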