neutron

Network downtime during live migration through routers

Bug #1631647 reported by Drew Thorstensen on 2016-10-08

This bug affects 2 people

Affects	Status	Importance	Assigned to	Milestone
neutron	Won't Fix	Undecided	Unassigned

Bug Description

neutron/master (close to stable/newton)
VXLAN networks with simple network node (not DVR)

There is network downtime of several seconds during a live migration. The length of the outage depends on the gap between when the VM resumes on the target host and when the migration ‘completes’.

When a live migration occurs, there is a point in its life cycle where the VM pauses on the source and starts up (or resumes) on the target. At that point the migration isn’t complete; the system has only determined that the VM is now best run on the target. The details vary per hypervisor, but that is the general flow for most hypervisors.

So during the migration the port goes through a few states.
1) Pre migration, it’s tied solely to the source host.
2) During migration, it’s still tied to the source host. The port profile has a ‘migrating_to’ attribute that identifies the target host.
3) Post migration, the port is tied solely to the target host.
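
To make the three states concrete, here is a minimal sketch of how they could be told apart from a port dictionary. It assumes the port carries the usual `binding:host_id` and `binding:profile` attributes, with ‘migrating_to’ inside the profile as described above. The helper name `migration_phase` and the host names are hypothetical, and this is not how any neutron agent actually implements the check.

```python
from typing import Any, Dict


def migration_phase(port: Dict[str, Any]) -> str:
    """Classify a port dict into the states listed above (illustrative only)."""
    host = port.get("binding:host_id")
    profile = port.get("binding:profile") or {}
    target = profile.get("migrating_to")
    # State 2: still bound to the source, but a target host is announced.
    if target and target != host:
        return f"migrating: bound to {host}, target {target}"
    # States 1 and 3: bound to exactly one host, no migration in flight.
    return f"settled: bound to {host}"


# Hypothetical host names, one port dict per state.
pre = {"binding:host_id": "source-host", "binding:profile": {}}
during = {
    "binding:host_id": "source-host",
    "binding:profile": {"migrating_to": "target-host"},
}
post = {"binding:host_id": "target-host", "binding:profile": {}}

for port in (pre, during, post):
    print(migration_phase(port))
```

Note that states 1 and 3 look identical from the port alone; only the host name differs.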

The OVS agent handles the migration well. It detects the port, sees the UUID, and treats the port properly. But things like the router don’t seem to handle it properly, at least in my testing.

It seems the routing information in the router only gets updated once the VM hits step 3 (post migration, where nova updates the port to be bound solely to the target host).

In fact, it’s kind of interesting. I’ve been running a constant ping through the router during the live migration and watching it on both sides with tcpdump. When the VM has resumed on the target but the live migration has not yet completed, the following happens:
- Ping request goes out from target server
- Goes out through the router
- Comes back into the router
- Gets sent to the source server
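
The trace above can be mimicked with a toy model. The sketch below assumes, purely for illustration, that return traffic is delivered to whichever host the port is currently bound to (`binding:host_id`); that assumption matches the observed behaviour but is not taken from neutron's code. The function `reply_destination` and the host names are hypothetical.

```python
from typing import Any, Dict, Tuple


def reply_destination(port: Dict[str, Any], vm_running_on: str) -> Tuple[str, bool]:
    """Return (host the reply is sent to, whether the reply reaches the VM).

    Toy assumption: replies follow the port's bound host only.
    """
    bound_host = port["binding:host_id"]
    return bound_host, bound_host == vm_running_on


# Step 2: the VM has resumed on the target, the port is still bound to the source.
during = {
    "binding:host_id": "source-host",
    "binding:profile": {"migrating_to": "target-host"},
}
# Step 3: nova has updated the port to the target host.
post = {"binding:host_id": "target-host", "binding:profile": {}}

print(reply_destination(during, vm_running_on="target-host"))  # ('source-host', False)
print(reply_destination(post, vm_running_on="target-host"))    # ('target-host', True)
```

In this model the ping replies are lost for exactly as long as the VM runs on the target while the port is still bound to the source.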

I’m not sure if this is somehow specific to VXLAN. I haven’t had a chance to try Geneve yet.

This could impact projects like Watcher, which will use live migration to continuously optimize the system. Such optimization could become undesirable because it would introduce downtime for the workloads being moved around.

If the time between the VM resuming and the live migration completing is minimal, then the impact can be quite small (a couple of seconds). If KVM uses post-copy, it should be susceptible to this problem as well. http://wiki.qemu.org/Features/PostCopyLiveMigration

Rodolfo Alonso (rodolfo-alonso-hernandez) wrote on 2023-01-17:

Bug closed due to lack of activity, please feel free to reopen if needed.

Changed in neutron:
status:	New → Won't Fix
