[ovn] No connection to VM during live-migration

Bug #2069718 reported by Stefan Hoffmann
This bug affects 4 people
Affects: neutron
Status: In Progress
Importance: Medium
Assigned to: Stefan Hoffmann

Bug Description

Problem: In environments with many hypervisors and VMs, a live migration leaves VMs unreachable for several seconds (4-20 s).

Description:
We run a large environment with many hypervisors and VMs, so northd reconcile cycles take some time.
During a live migration, even though nova has live_migration_wait_for_vif_plug=true configured, neutron sends the vif-plugged event before northd has processed the change that adds the VM's port to the destination hypervisor (the multi-chassis feature is enabled).
Nova then starts the live migration in libvirt, and it completes before the southbound DB and the destination's ovn-controller have received the change.
So the VM is started on the destination hypervisor, but the port setup is not done yet.

From what I saw, neutron generates the vif-plugged event when the transaction to the northbound ovsdb has finished [1].

Is there a way to wait until the change has propagated to the southbound ovsdb?
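
To make the question concrete, here is a minimal sketch of what "waiting for southbound" could look like: poll a condition with a timeout until the destination chassis shows up for the port. `wait_until`, `FakeSouthbound` and `port_is_bound_to` are hypothetical names used only for illustration; they are not neutron or ovsdbapp APIs, and a real fix would more likely react to southbound events than poll.

```python
import time
from typing import Callable, Dict, Set


def wait_until(predicate: Callable[[], bool], timeout: float = 5.0,
               interval: float = 0.05) -> bool:
    """Poll predicate until it returns True or timeout seconds elapse."""
    deadline = time.monotonic() + timeout
    while True:
        if predicate():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)


class FakeSouthbound:
    """In-memory stand-in for southbound port bindings (hypothetical)."""

    def __init__(self) -> None:
        self._bindings: Dict[str, Set[str]] = {}

    def bind(self, port_id: str, chassis: str) -> None:
        """Record that chassis has claimed port_id."""
        self._bindings.setdefault(port_id, set()).add(chassis)

    def port_is_bound_to(self, port_id: str, chassis: str) -> bool:
        """Return True if chassis is recorded for port_id."""
        return chassis in self._bindings.get(port_id, set())


sb = FakeSouthbound()
sb.bind("port-1", "src-host")

# northd has not propagated the destination binding yet: the wait times out.
ready_before = wait_until(
    lambda: sb.port_is_bound_to("port-1", "dst-host"), timeout=0.2)

# Once the destination binding is visible, the wait returns immediately.
sb.bind("port-1", "dst-host")
ready_after = wait_until(
    lambda: sb.port_is_bound_to("port-1", "dst-host"), timeout=0.2)
```

In this sketch the vif-plugged event would only be sent once the wait succeeds; what to do on timeout is left open.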

Version:
neutron-server 21.2.1 zed / unmaintained/zed
ml2 plugin: ovn
at neutron: ovsdb-client (Open vSwitch) 3.3.0
Nova zed / unmaintained/zed
nova.conf: live_migration_wait_for_vif_plug=true ([2])
Hypervisor OS: Ubuntu 22.04 with newer kernel (but that shouldn't be relevant here)

Steps to Reproduce:

1. Run neutron with ovn setup and create a VM that you can ping (via FIP or other VM in same private network)
2. Stop northd
3. Start live-migration
4. Wait till live-migration is done - VM is not reachable anymore
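
To put a number on the outage seen in step 4, one option is to log a reachability probe (for example one ping per second) during the migration and compute the longest unreachable span afterwards. The helper below is a hedged sketch; `longest_outage` and the sample values are illustrative and not measurements from our environment.

```python
from typing import Iterable, Optional, Tuple


def longest_outage(samples: Iterable[Tuple[float, bool]]) -> float:
    """Return the longest span in seconds during which probes failed.

    samples: (timestamp_seconds, reachable) pairs in ascending time order.
    An outage runs from the first failed probe to the next successful one;
    an outage still open at the end of the samples is not counted.
    """
    longest = 0.0
    outage_start: Optional[float] = None
    for timestamp, reachable in samples:
        if not reachable and outage_start is None:
            outage_start = timestamp
        elif reachable and outage_start is not None:
            longest = max(longest, timestamp - outage_start)
            outage_start = None
    return longest


# Illustrative probe log: one sample per second, unreachable from t=2 to t=6.
probe_log = [
    (0.0, True), (1.0, True), (2.0, False), (3.0, False),
    (4.0, False), (5.0, False), (6.0, True), (7.0, True),
]
outage_seconds = longest_outage(probe_log)
```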

More info:

This problem has two parts.
First, nova doesn't wait for the network-vif-plugged event when using the ovn backend, because the port binding options are missing an attribute.
Second, the Neutron OVN plugin currently sends vif-plugged events as soon as the northbound ovsdb has the update, and on LogicalSwitchPort events (so at northbound updates).

[1] https://opendev.org/openstack/neutron/src/branch/unmaintained/zed/neutron/plugins/ml2/drivers/ovn/mech_driver/mech_driver.py#L836
[2] https://docs.openstack.org/nova/latest/configuration/config.html#compute.live_migration_wait_for_vif_plug

Tags: ovn
description: updated
tags: added: ovn
Brian Haley (brian-haley) wrote :

You raise a good question, I will bring it up at the next Neutron meeting to see what others think.

Changed in neutron:
status: New → Triaged
importance: Undecided → Medium
Stefan Hoffmann (shoffmann) wrote :

I'm currently testing a patch and will provide it here soon. It can serve as a starting point for discussing where and how to fix this.

OpenStack Infra (hudson-openstack) wrote : Fix proposed to neutron (master)

Fix proposed to branch: master
Review: https://review.opendev.org/c/openstack/neutron/+/922746

Changed in neutron:
status: Triaged → In Progress
Changed in neutron:
assignee: nobody → Stefan Hoffmann (mr-hopeman)
description: updated
Stefan Hoffmann (shoffmann) wrote :

As noted in the updated bug description, these are two problems.

Nova doesn't wait for the network-vif-plugged event when using the ovn backend, because the port binding options are missing an attribute.

Also, the Neutron OVN plugin currently sends vif-plugged events as soon as the northbound ovsdb has the update, and on LogicalSwitchPort events (so at northbound updates).

I prepared a patch for each issue and am testing them now.

This means a change in how the neutron ovn plugin handles ovsdb events. Instead of sending network-vif-plugged on LSP events (triggered by setting the port down and up), neutron must not send an event on a port update in the case of a migration. Instead, on an update of the ChassisPortBinding, we need to check whether there are two requested chassis and additional_chassis is set, which means ovn-controller has applied the change.
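
The check described above could be sketched roughly as follows. This is only an illustration of the condition, not the code from the proposed patches: `PortBindingView`, its field names and `migration_destination_ready` are hypothetical and merely mirror the wording "two requested chassis and additional_chassis is set".

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class PortBindingView:
    """Simplified, hypothetical view of a southbound port binding row."""
    logical_port: str
    requested_chassis: List[str] = field(default_factory=list)
    additional_chassis: List[str] = field(default_factory=list)


def migration_destination_ready(binding: PortBindingView) -> bool:
    """True once two chassis are requested and additional_chassis is set.

    The second condition is taken as the sign that ovn-controller on the
    destination has applied the change.
    """
    return (len(binding.requested_chassis) == 2
            and bool(binding.additional_chassis))


# Port update written by neutron, not yet applied on the destination.
pending = PortBindingView("port-1", requested_chassis=["src-host", "dst-host"])

# Same port after the destination chassis has been added.
applied = PortBindingView(
    "port-1",
    requested_chassis=["src-host", "dst-host"],
    additional_chassis=["dst-host"],
)

send_event_for_pending = migration_destination_ready(pending)
send_event_for_applied = migration_destination_ready(applied)
```

With this condition, network-vif-plugged would be held back for `pending` and sent for `applied`.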

OpenStack Infra (hudson-openstack) wrote :

Fix proposed to branch: master
Review: https://review.opendev.org/c/openstack/neutron/+/923962

OpenStack Infra (hudson-openstack) wrote :

Fix proposed to branch: master
Review: https://review.opendev.org/c/openstack/neutron/+/923963

OpenStack Infra (hudson-openstack) wrote : Change abandoned on neutron (master)

Change abandoned by "Stefan Hoffmann <email address hidden>" on branch: master
Review: https://review.opendev.org/c/openstack/neutron/+/922746
Reason: Abandon due to new patches 923962 and 923963

Stefan Hoffmann (shoffmann) wrote :

In the nova code I found that get_live_migration_plug_time_events()[1] calls (via has_live_migration_plug_time_event()) is_hybrid_plug_enabled()[2], which checks whether VIF_DETAILS_OVS_HYBRID_PLUG is set in VIF.details.
Also, while debugging on my cluster with an OVN setup, I saw nova-compute passing through that function.

So although this flag/option is meant for something else, nova uses it to decide whether it needs to wait for a plug event.
(Setting live_migration_wait_for_vif_plug doesn't help; it is true by default anyway.)

[1] https://opendev.org/openstack/nova/src/branch/master/nova/network/model.py#L563
[2] https://opendev.org/openstack/nova/src/branch/master/nova/network/model.py#L499
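
As I read it, the call chain boils down to something like the following. This is a simplified paraphrase for discussion, not nova's actual code: `SimpleVIF` is a hypothetical stand-in for nova's VIF model, and the constant value "ovs_hybrid_plug" is an assumption on my side.

```python
from typing import Any, Dict, Iterable, List, Tuple

# Assumed key name in VIF.details; check nova/network/model.py for the real value.
VIF_DETAILS_OVS_HYBRID_PLUG = "ovs_hybrid_plug"


class SimpleVIF:
    """Hypothetical, stripped-down stand-in for nova's VIF model."""

    def __init__(self, vif_id: str, details: Dict[str, Any]) -> None:
        self.id = vif_id
        self.details = details

    def is_hybrid_plug_enabled(self) -> bool:
        """True only if the hybrid plug flag is present and truthy."""
        return bool(self.details.get(VIF_DETAILS_OVS_HYBRID_PLUG, False))

    def has_live_migration_plug_time_event(self) -> bool:
        """Whether nova expects a plug-time event for this VIF."""
        return self.is_hybrid_plug_enabled()


def get_live_migration_plug_time_events(
        vifs: Iterable[SimpleVIF]) -> List[Tuple[str, str]]:
    """Return the (event name, vif id) pairs nova would wait for."""
    return [("network-vif-plugged", vif.id)
            for vif in vifs if vif.has_live_migration_plug_time_event()]


# An OVN-bound port without the flag yields no events to wait for.
ovn_events = get_live_migration_plug_time_events([SimpleVIF("vif-ovn", {})])

# A port with the flag set yields one event.
hybrid_events = get_live_migration_plug_time_events(
    [SimpleVIF("vif-ovs", {VIF_DETAILS_OVS_HYBRID_PLUG: True})])
```

With an empty event list there is nothing for nova to wait on, regardless of live_migration_wait_for_vif_plug.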

sean mooney (sean-k-mooney) wrote :

the approach you are taking to fix this is wrong

you should not be setting VIF_DETAILS_OVS_HYBRID_PLUG

if you're doing that you're basically telling nova that this is ml2/ovs not ovn

the code path that activates in nova is not the correct way to resolve the reported issue.

Stefan Hoffmann (shoffmann) wrote :

@sean-k-mooney thanks for your hint. I'm working on a different approach, including a nova-compute change, to wait for events in nova.

The VIF_DETAILS_OVS_HYBRID_PLUG fix will be abandoned.

Bug at nova: https://bugs.launchpad.net/nova/+bug/2073254

OpenStack Infra (hudson-openstack) wrote :

Change abandoned by "Stefan Hoffmann <email address hidden>" on branch: master
Review: https://review.opendev.org/c/openstack/neutron/+/923962
Reason: Fix the issue at nova instead https://bugs.launchpad.net/nova/+bug/2073254

sean mooney (sean-k-mooney) wrote :

this is not a nova bug

in wallaby we encoded the behaviour that neutron and the OVN driver had at the time to resolve several other live migration downtime issues, including the development of the multiple chassis support.

neutron with ovn does not reliably send plug-time events so nova cannot wait for them at plug time; ovn sends them at bind time.

now there have been efforts to make the ovn driver work more like the ml2/ovs and linux bridge drivers and actually send plug-time events

but as far as i am aware that has not been done and no work has been proposed to nova for that.

the wallaby behavior of not waiting for the plug events with ovn was heavily tested, including with the multiple chassis feature, which was backported downstream as https://bugzilla.redhat.com/show_bug.cgi?id=2104522

if the behaviour has changed in neutron/ovn without a nova spec to detail the change in the contract this is a neutron bug.

the code related to this in nova has been stable and unchanged for 4 years; that means this was regressed by either a change in ovn that needs to be understood and documented or a change in neutron that should be reverted.
