Network downtime during live migration through routers
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
neutron | Won't Fix | Undecided | Unassigned |
Bug Description
neutron/master (close to stable/newton)
VXLAN networks with simple network node (not DVR)
There is network downtime of several seconds during a live migration. The amount of time depends on the gap between when the VM resumes on the target host and when the migration ‘completes’.
When a live migration occurs, there is a point in its life cycle where the VM pauses on the source and starts up (or resumes) on the target. At that point the migration isn’t complete; the system has only determined that the VM is now best run on the target. This of course varies per hypervisor, but that is the general flow for most hypervisors.
So during the migration the port goes through a few states.
1) Pre migration, it’s tied solely to the source host.
2) During migration, it’s still tied to the source host. The port profile has a ‘migrating_to’ attribute that identifies the target host.
3) Post migration, the port is tied solely to the target host.
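The three states above can be told apart from the port itself. Below is a minimal sketch, assuming the port is available as a dict carrying `binding:host_id` and a `binding:profile` dict that holds the ‘migrating_to’ key; the function name `migration_phase` and the host names are hypothetical, chosen only for illustration.

```python
from typing import Any, Mapping


def migration_phase(port: Mapping[str, Any], source: str, target: str) -> str:
    """Classify a port as 'pre', 'during', 'post' or 'unknown'.

    Assumption: the port dict exposes 'binding:host_id' and a
    'binding:profile' dict that may contain 'migrating_to'.
    """
    host = port.get("binding:host_id")
    profile = port.get("binding:profile") or {}
    migrating_to = profile.get("migrating_to")

    if host == source and migrating_to == target:
        return "during"   # step 2: bound to source, target announced
    if host == source and not migrating_to:
        return "pre"      # step 1: bound solely to source
    if host == target and not migrating_to:
        return "post"     # step 3: bound solely to target
    return "unknown"


# Hypothetical example ports for the three steps.
pre = {"binding:host_id": "src", "binding:profile": {}}
during = {"binding:host_id": "src", "binding:profile": {"migrating_to": "dst"}}
post = {"binding:host_id": "dst", "binding:profile": {}}
```

A component that only reacts to the change in `binding:host_id` would not notice anything until the ‘post’ state, which matches the behaviour described for the router in this report.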
The OVS agent handles the migration well. It detects the port, sees the UUID, and treats the port properly. But things like the router don’t seem to handle it properly, at least in my testing.
It seems only once the VM hits step 3 (post migration, where nova updates the port to be on the target host solely) does the routing information get updated in the router.
In fact, it’s kind of interesting. I’ve been running a constant ping during the live migration through the router and watching it on both sides with tcpdump. When the VM resumes on the target but the live migration has not completed, the following happens:
- Ping request goes out from target server
- Goes out through the router
- Comes back into the router
- Gets sent to the source server
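To put a number on the downtime seen with the constant ping, the longest gap between two successful replies can be computed from the ping results. This is only a sketch: the input format, a list of `(timestamp_seconds, got_reply)` pairs, and the function name `longest_outage` are assumptions for illustration, not the tooling used in this report.

```python
from typing import Iterable, Optional, Tuple


def longest_outage(samples: Iterable[Tuple[float, bool]]) -> float:
    """Return the longest time in seconds between two successful
    replies that have at least one lost ping between them."""
    longest = 0.0
    last_ok: Optional[float] = None
    lost_since_ok = False

    for timestamp, got_reply in samples:
        if got_reply:
            if last_ok is not None and lost_since_ok:
                longest = max(longest, timestamp - last_ok)
            last_ok = timestamp
            lost_since_ok = False
        else:
            lost_since_ok = True
    return longest


# Hypothetical data: one ping per second, two replies lost
# while the VM is resumed on the target.
samples = [(0.0, True), (1.0, True), (2.0, False),
           (3.0, False), (4.0, True), (5.0, True)]
outage = longest_outage(samples)
```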
I’m not sure if this is somehow specific to VXLAN. I haven’t had a chance to try Geneve yet.
This could impact projects like Watcher, which will be using live migration to constantly optimize the system. That optimization could become undesirable because it would introduce downtime on the workloads being moved around.
If the time between a VM resume and live migration complete is minimal, then the impact can be quite small (a couple of seconds). If KVM uses post-copy, it should be susceptible to this issue. http://
Bug closed due to lack of activity, please feel free to reopen if needed.