libvirt virt driver does not wait for network-vif-plugged event during hard reboot
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
OpenStack Compute (nova) | Fix Released | Medium | Balazs Gibizer |
Bug Description
The libvirt virt driver has logic during spawn to create the domain in libvirt, pause it, and only resume it after the network-vif-plugged events are received from neutron for the ports of the instance being spawned. This is in place to avoid starting the guest OS before the networking backend has finished setting up the networking for the ports. Without this, a guest might start and request an IP via DHCP before the networking setup is finished and therefore might not get an IP at all.
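The spawn-time handshake described above can be sketched as follows. This is a minimal toy model, not nova's actual code: `VifPluggedWaiter`, `spawn`, and the callback names are illustrative, and the real driver uses libvirt domain APIs and nova's external-event machinery instead of bare threading primitives.

```python
import threading


class VifPluggedWaiter:
    """Toy stand-in for waiting on neutron's network-vif-plugged events.

    One event per port; neutron (simulated here) signals each port as the
    backend finishes wiring it up.
    """

    def __init__(self, expected_ports):
        self._events = {port: threading.Event() for port in expected_ports}

    def notify_vif_plugged(self, port_id):
        # Called when a network-vif-plugged event arrives for a port.
        self._events[port_id].set()

    def wait_all(self, timeout=1.0):
        # True only if every expected port reported plugged in time.
        return all(ev.wait(timeout) for ev in self._events.values())


def spawn(waiter, create_domain, resume_domain):
    # 1) create the domain paused, 2) wait for neutron, 3) resume the guest.
    create_domain(paused=True)
    if waiter.wait_all():
        resume_domain()
        return "running"
    return "stuck-paused"  # the real driver would raise or retry here
```

Because the guest is created paused, its OS cannot issue a DHCP request until `resume_domain()` runs, which only happens after all plug events arrived.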
In case of hard reboot (and start, as that is a hard reboot too) nova cleans up the instance from the hypervisor (except the local disk), including unplugging the vifs of the instance. Then nova recreates everything, including re-plugging the vifs. This is intentional, as hard reboot is considered an operation that is capable of recovering instances in bad / inconsistent states. However, during the hard reboot nova does not wait for the network-vif-plugged events before it lets the domain start running. In a mass instance startup scenario (e.g. after a compute host recovery) a potentially large number of vif unplug/plug requests hit the networking backend. Processing these replugs takes time. Since nova does not wait for the network-vif-plugged event, the guest OS can start the DHCP request well before the networking backend can catch up with the unplug/plug requests. This leads to connectivity issues in the guest.
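The hard-reboot sequence and the fix it implies can be sketched like this. Again a toy model under stated assumptions: the function and parameter names are hypothetical, and a plain `threading.Event` stands in for the neutron event delivery, which in nova flows through the os-server-external-events API.

```python
import threading


def hard_reboot(unplug_vifs, plug_vifs, start_domain,
                wait_for_vif_plugged=None, timeout=1.0):
    """Toy sketch of the hard-reboot flow described above.

    wait_for_vif_plugged: an optional threading.Event that the (simulated)
    networking backend sets once re-plugging is actually finished.
    """
    unplug_vifs()   # tear everything down first, as hard reboot does
    plug_vifs()     # the backend processes the re-plug asynchronously
    if wait_for_vif_plugged is not None:
        # Proposed behaviour: mirror spawn and block until neutron confirms
        # the ports are wired up before the guest can issue DHCP requests.
        if not wait_for_vif_plugged.wait(timeout):
            raise TimeoutError("network-vif-plugged not received in time")
    start_domain()  # without the wait above, the guest races the backend
```

With `wait_for_vif_plugged=None` this reproduces the buggy behaviour reported here: `start_domain()` runs regardless of whether the backend has caught up.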
Changed in nova:
status: New → In Progress
Changed in nova:
importance: Undecided → Medium
assignee: nobody → Balazs Gibizer (balazs-gibizer)
tags: added: compute libvirt reboot
First, thanks a lot for raising this bug.
It would be great to have both vif_plugged and vif_unplugged handshakes between nova and the networking backend. That would enable closer coordination and make it easier to troubleshoot which part of the VM start activity failed during a mass start of VMs on a typical batch of 50 compute hosts.