[RFE] reduce the duration of network interrupt during live migration in DVR scenario

Bug #1715340 reported by zhaobo
This bug affects 2 people
Affects: neutron
Status: Won't Fix
Importance: Wishlist
Assigned to: Unassigned
Milestone:

Bug Description

Nova processes live migration in 3 stages:
1. pre_live_migration
2. migrating
3. post_live_migration
In the current implementation, Nova plugs a new VIF on the target host. The ovs-agent on the target host processes this new VIF and tries to bring the port up there, but the port's host_id still points to the source host, so the agent's RPC to the server returns nothing.
Nova then performs the actual migration in stage 2; if the instance flavor is small, this stage can be very short. In stage 3, Nova calls Neutron to update the port's host_id, and this is when the network interruption begins. Throughout the live migration the VM status stays ACTIVE, but users cannot log into the VM and the applications running in it are offline for a while, because Neutron wires up the traffic too late. By the time Nova has migrated the instance to the target host and libvirt has set it up there, the network plumbing provided by Neutron is not yet ready, so we need a way to verify that both the L2 and the L3 connectivity are in place.
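For context, here is a minimal sketch of the kind of port update that happens in stage 3, written with openstacksdk; the cloud name, port id and hostname are placeholders, and this is not Nova's actual code path:

import openstack

# Connect using a clouds.yaml entry named 'mycloud' (placeholder).
conn = openstack.connect(cloud='mycloud')

port_id = '11111111-2222-3333-4444-555555555555'   # placeholder port uuid
target_host = 'compute-2'                          # placeholder target hypervisor

# Updating binding:host_id is what finally makes Neutron rewire L2/L3 on the
# target host; until this call the traffic is still plumbed on the source
# host, which is where the interruption window described above starts.
conn.network.update_port(port_id, binding_host_id=target_host)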

We tested this in our production environment, which runs the old Mitaka release (I believe the same issue exists on master). The interruption time depends on the number of ports in the router's subnets and on whether the port has a floating IP associated. With fewer than 20 ports, the interruption lasts up to 8 seconds, plus about 5 more seconds if the port has a floating IP. With more than 20 ports, the interruption lasts up to 30 seconds, again with about 5 extra seconds for a floating IP.

This is not acceptable in NFV scenarios or for some telecommunications operators. Even though the spec [1] aims to pre-configure the network during live migration and let the migration and the network configuration proceed asynchronously, the key issue is not solved: we also need a mechanism like a provisioning block so that L2 and L3 are processed in a synchronized way, plus a way for Neutron to tell Nova when its work is done, so that Nova can move on to the next step of the live migration.

[1]http://specs.openstack.org/openstack/neutron-specs/specs/pike/portbinding_information_for_nova.html
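To make the provisioning-block idea above concrete, here is a self-contained, hypothetical sketch (not Neutron's actual provisioning_blocks module) of a gate that only reports a port as ready, e.g. back to Nova, once both the L2 and the L3 wiring have confirmed completion:

class PortProvisioningGate:
    def __init__(self, entities=('L2', 'L3')):
        self._required = set(entities)
        self._pending = {}  # port_id -> set of entities still pending

    def add_block(self, port_id):
        # Called when migration wiring starts for the port.
        self._pending[port_id] = set(self._required)

    def complete(self, port_id, entity):
        # Called by the L2 agent / L3 agent when its side is wired.
        pending = self._pending.get(port_id)
        if pending is None:
            return False
        pending.discard(entity)
        if not pending:
            del self._pending[port_id]
            self._notify_nova(port_id)
            return True
        return False

    def _notify_nova(self, port_id):
        # Placeholder for a network-vif-plugged style notification to Nova.
        print('port %s fully provisioned, notifying Nova' % port_id)

gate = PortProvisioningGate()
gate.add_block('port-1')
gate.complete('port-1', 'L2')   # not ready yet, L3 still pending
gate.complete('port-1', 'L3')   # both done -> notify Nova

The point of the gate is that neither the L2 nor the L3 completion alone triggers the notification; Nova would only proceed once both sides have reported in.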

zhaobo (zhaobo6)
description: updated
Miguel Lavalle (minsel)
Changed in neutron:
importance: Undecided → Wishlist
Revision history for this message
Miguel Lavalle (minsel) wrote :

I did some digging about this. It turns out that for the DVR case, the L3 network wiring has been carried out during the pre_live_migration stage since the Mitaka release. These are the patchsets that implemented this:

1) Nova: https://review.openstack.org/#/c/275073
2) Neutron, server side: https://review.openstack.org/#/c/275420
3) Neutron, agent side: https://review.openstack.org/#/c/260738

So we can leverage this functionality, adapting it if needed, to work with the implementation of https://specs.openstack.org/openstack/neutron-specs/specs/backlog/ocata/portbinding_information_for_nova.html. I will bring this up tomorrow during the L3 weekly IRC meeting, so that the L3 team members are aware.

I also want to point out that the Nova-side spec for port binding information for Nova is still a work in progress: https://review.openstack.org/#/c/375580. So we need to get that going.

Revision history for this message
Swaminathan Vasudevan (swaminathan-vasudevan) wrote :

In the case of DVR we handle it in pre-live migration. If the VM port is migrating, we do the initial setup of the fip namespace on the destination host, and as soon as the VM lands on the new host, we configure the rules for the FIP.

Still, we lag on the Neutron status or state update to Nova during live migration.

Revision history for this message
Na Zhu (nazhu) wrote :

@Swaminathan, I see the initial setup for DVR, but I am confused about the FIP implementation. The FIP exists on both the source host and the destination host; for incoming FIP traffic from the internet in DVR mode, how do we know which fg port is the right next hop, given that both fg ports proxy ARP for the port?

Revision history for this message
Rodolfo Alonso (rodolfo-alonso-hernandez) wrote :

Bug closed due to lack of activity, please feel free to reopen if needed.

Changed in neutron:
status: New → Won't Fix
Revision history for this message
norman shen (jshen28) wrote (last edit):

We are also facing this issue on a DVR setup with Neutron versions Rocky and Victoria. The problem is that whenever I update the port binding_profile with

openstack port set --binding-profile migrating_to=<some host> <port uuid>

then on the target host a fip namespace is created and proxy ARP rules are set up as well, which I think is wrong. This setup occasionally makes the virtual machine's FIP unusable. The code is https://github.com/openstack/neutron/blob/7471b8590c6455dec61cc050a4201ca669d30d2c/neutron/db/l3_dvrscheduler_db.py#L154, which does a routers_updated for some reason. The issue is more apparent when the VM has a large amount of memory, since the migration can then take several minutes to complete.
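For reference, a hedged sketch of how the migration hint mentioned above can be inspected with openstacksdk; the cloud name and port uuid are placeholders, and this is not the server-side scheduler code linked above:

import openstack

conn = openstack.connect(cloud='mycloud')  # placeholder clouds.yaml entry

port = conn.network.get_port('11111111-2222-3333-4444-555555555555')  # placeholder uuid
profile = port.binding_profile or {}

# The L3 DVR scheduler reacts to this hint and notifies the L3 agent on the
# destination host, which is what pre-creates the fip namespace there.
dest_host = profile.get('migrating_to')
print('currently bound to: %s' % port.binding_host_id)
print('migrating to: %s' % dest_host)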

Honestly, I do not know why we need to set up everything on the destination before the migration has even executed; this at least affects instances on the target host that have a FIP.

norman shen (jshen28)
information type: Public → Public Security
information type: Public Security → Private Security
information type: Private Security → Public