live migrations slow
Bug #1786346 reported by Matthew Thode
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
OpenStack Compute (nova) | Fix Released | High | Dan Smith |
Pike | Fix Released | High | Dan Smith |
Queens | Fix Released | High | Dan Smith |
Rocky | Fix Committed | High | Matt Riedemann |
Bug Description
This is on Pike with ff747792b8f5aef.
tags: added: live-migration rocky-rc-potential
Changed in nova:
importance: Undecided → High
status: New → Confirmed
Changed in nova:
assignee: nobody → Dan Smith (danms)
Changed in nova:
status: Confirmed → In Progress
This now appears to be fundamentally broken on Rocky, and racy on Queens and Pike. Below is the current thinking from live conversation and debugging of this with the reporter.
The event that triggers VIF plugging happens here:
https://github.com/openstack/nova/blob/master/nova/virt/libvirt/driver.py#L7775
in pre_live_migration(). That is well before the wait_for_instance_event() call is made in the libvirt driver, as added here:
https://review.openstack.org/#/c/497457/30/nova/virt/libvirt/driver.py
This means that in Pike/Queens we race against the point at which the event shows up. If we win, we receive it and do the slow-to-fast dance as expected. If we lose the race, the event arrives before we start listening for it, in which case we time out but don't actually stop the live migration, which then continues at 1MB/s until it completes.
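To illustrate the lost-event race, here is a minimal Python sketch. The EventWaiter class and its method names are hypothetical simplifications of nova's external-event machinery, not nova's actual API; the point is only that an event arriving before the listener registers is silently dropped, so the waiter times out while the migration keeps crawling.

```python
import threading

class EventWaiter:
    """Simplified stand-in for nova's external event plumbing (hypothetical)."""
    def __init__(self):
        self._events = {}            # event name -> threading.Event
        self._lock = threading.Lock()

    def prepare(self, name):
        # Start listening: only events arriving AFTER this call are caught.
        with self._lock:
            self._events[name] = threading.Event()

    def notify(self, name):
        # Incoming external event; silently dropped if nobody is listening yet.
        with self._lock:
            ev = self._events.get(name)
        if ev:
            ev.set()

    def wait(self, name, timeout):
        with self._lock:
            ev = self._events.get(name)
        return ev.wait(timeout) if ev else False

waiter = EventWaiter()

# Destination host: pre_live_migration() plugs VIFs, which makes Neutron
# emit network-vif-plugged. This can fire before prepare() below runs.
threading.Thread(target=waiter.notify, args=('network-vif-plugged',)).start()

waiter.prepare('network-vif-plugged')   # may register too late
if waiter.wait('network-vif-plugged', timeout=2.0):
    print('won the race: switch migration from slow to full speed')
else:
    # Lost the race: we time out, but the migration is NOT aborted,
    # so it keeps running at the throttled 1MB/s rate to completion.
    print('timed out: migration crawls at 1MB/s until it completes')
```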
On Rocky, we now always listen for that event a level up in the compute manager:
https://review.openstack.org/#/c/558001/10/nova/compute/manager.py
which is properly wrapped around pre_live_migration() and will always wait for and eat the plugging event before calling into the driver. Thus, the driver will never receive the event it is waiting for; we will always time out and always run migrations at 1MB/s until completion.
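A minimal sketch of the Rocky failure mode, again with hypothetical simplified helpers rather than nova's real signatures: the manager-level wait consumes the single network-vif-plugged event, so the driver-level wait that was added for Pike/Queens can never see it and deterministically times out.

```python
import threading
from contextlib import contextmanager

_event = threading.Event()   # stands in for one network-vif-plugged event

def neutron_sends_vif_plugged():
    _event.set()

@contextmanager
def manager_wait_for_instance_event(timeout):
    # Compute-manager wrapper (simplified): run the body, then wait for
    # and "eat" the plugging event before control returns to the driver.
    yield
    if _event.wait(timeout):
        _event.clear()       # event consumed here, at the manager level
    else:
        raise TimeoutError('vif plugging event never arrived')

def driver_live_migration(timeout=2.0):
    # Driver-level wait added for Pike/Queens: the event was already
    # consumed above, so this path always times out.
    if _event.wait(timeout):
        print('driver saw event: run migration at full speed')
    else:
        print('driver timed out: migrate at 1MB/s until completion')

with manager_wait_for_instance_event(timeout=5.0):
    # pre_live_migration() on the destination plugs VIFs -> event fires.
    neutron_sends_vif_plugged()

driver_live_migration()      # always takes the 1MB/s branch on "Rocky"
```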