Network downtime during live migration through routers

Bug #1631647 reported by Drew Thorstensen
This bug affects 2 people
Affects: neutron
Status: Won't Fix
Importance: Undecided
Assigned to: Unassigned

Bug Description

neutron/master (close to stable/newton)
VXLAN networks with simple network node (not DVR)

There is network down time of several seconds during a live migration. The amount of time depends on when the VM resumes on the target host versus when the migration ‘completes’.

When a live migration occurs, there is a point in its life cycle where the VM pauses on the source and starts up (or resumes) on the target. At that point the migration isn't complete, but the system has determined it is now best to be running on the target. This of course varies per hypervisor, but that is the general flow for most of them.

So during the migration the port goes through a few states:
1) Pre-migration: it is tied solely to the source host.
2) During migration: it is still tied to the source host, but the port's binding profile has a 'migrating_to' attribute that identifies the target host.
3) Post-migration: the port is tied solely to the target host.
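The three states above can be sketched as the port's binding attributes. This is a simplified sketch, not Neutron's actual data model; the key names follow the portbindings extension (`binding:host_id`, `binding:profile`), and the host names are placeholders:

```python
# Sketch of a Neutron port's binding attributes through a live migration.
# Simplified: key names follow the portbindings extension; 'migrating_to'
# in the binding profile is the hint set while the migration is in flight.

def port_binding(phase, source="host-a", target="host-b"):
    """Return simplified binding attributes for each migration phase."""
    if phase == "pre":
        # 1) Pre-migration: bound solely to the source host.
        return {"binding:host_id": source, "binding:profile": {}}
    if phase == "during":
        # 2) During migration: still bound to the source, but the profile
        #    carries a 'migrating_to' entry naming the target host.
        return {"binding:host_id": source,
                "binding:profile": {"migrating_to": target}}
    if phase == "post":
        # 3) Post-migration: the port is rebound solely to the target.
        return {"binding:host_id": target, "binding:profile": {}}
    raise ValueError("unknown phase: %s" % phase)
```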

The OVS agent handles the migration well. It detects the port, sees the UUID, and treats the port properly. But things like the router don’t seem to handle it properly, at least in my testing.

It seems that only once the port hits step 3 (post-migration, where Nova updates the port to be solely on the target host) does the routing information in the router get updated.
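A hypothetical sketch of the symptom: if the router-side code resolves a port's location from `binding:host_id` alone and ignores the 'migrating_to' hint, its state keeps pointing at the source host until Nova rebinds the port in step 3. The helper below is illustrative only, not Neutron code; the `honor_migration_hint` flag is an assumption showing how provisioning the target early could shrink the blackout window:

```python
# Hypothetical illustration: host resolution that only reads
# binding:host_id keeps directing traffic to the source host for the
# entire 'during migration' phase, which matches the observed downtime.

def hosts_for_port(binding, honor_migration_hint=False):
    """Return the hosts that should receive traffic for this port."""
    hosts = [binding["binding:host_id"]]
    migrating_to = binding.get("binding:profile", {}).get("migrating_to")
    if honor_migration_hint and migrating_to:
        # Also provisioning the target host during the migration would
        # let traffic flow as soon as the VM resumes there.
        hosts.append(migrating_to)
    return hosts

# Binding as it looks mid-migration (placeholder host names).
during = {"binding:host_id": "host-a",
          "binding:profile": {"migrating_to": "host-b"}}
```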

In fact, it's quite interesting. I've been running a constant ping through the router during the live migration and watching it on both sides with tcpdump. When the VM resumes on the target but the live migration has not yet completed, the following happens:
 - Ping request goes out from target server
 - Goes out through the router
 - Comes back into the router
 - Gets sent to the source server instead of the target

I’m not sure if this is somehow specific to vxlan. I haven’t had a chance to try Geneve yet.

This could impact projects like Watcher, which will use live migration to constantly optimize the system. That kind of optimization could become undesirable if it introduces down time on the workloads being moved around.

If the time between the VM resuming and the live migration completing is minimal, the impact can be quite small (a couple of seconds). If KVM uses post-copy, it should be susceptible to this as well: http://wiki.qemu.org/Features/PostCopyLiveMigration

Revision history for this message
Rodolfo Alonso (rodolfo-alonso-hernandez) wrote :

Bug closed due to lack of activity, please feel free to reopen if needed.

Changed in neutron:
status: New → Won't Fix