Comment 0 for bug 1329546

Salvatore Orlando (salvatore-orlando) wrote :

VMware Minesweeper for Neutron (*) recently showed a 100% failure rate on tempest.api.compute.v3.servers.test_server_actions.

Logs for two occurrences of this failure are available at [1] and [2].
The failure manifested as an instance unable to go active after a rebuild.
A bit of instrumentation and log analysis revealed no obvious error on the neutron side. It also showed that the instance was actually in the "running" state even though its task state was "rebuilding/spawning".

The n-cpu logs [3] revealed that the instance spawn was timing out on a missed notification from neutron regarding VIF plug. However, the same log showed that this notification had been received [4].

It turns out that, after the rebuild, the instance network cache still had 'active': False for the instance's VIF, even though the status of the corresponding port was 'ACTIVE'. This happened because, after the network-vif-plugged event was received, nothing triggered a refresh of the instance network info. For this reason the VM, after a rebuild, kept waiting for an event which neutron never sent, since the port was already 'ACTIVE'.
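
The mismatch described above can be illustrated with a toy model. This is only a sketch: the dictionaries, function names and port id below are hypothetical and are not nova or neutron code; they just mirror the 'active': False cache entry and the 'ACTIVE' port status mentioned in this report.

```python
# Toy model only: every name here is hypothetical, not a nova or neutron API.

# Status the network service reports for the port.
port = {"id": "port-1", "status": "ACTIVE"}

# Cached view held for the instance, written before the port went ACTIVE.
network_info_cache = [{"id": "port-1", "active": False}]

received_events = []


def on_vif_plugged(port_id):
    """Record the event; note that the cache is not refreshed here."""
    received_events.append(port_id)


def vifs_to_wait_for(cache):
    """Return ids of VIFs the cache still marks inactive."""
    return [vif["id"] for vif in cache if not vif["active"]]


on_vif_plugged(port["id"])                         # the event does arrive
waiting_on = vifs_to_wait_for(network_info_cache)  # rebuild consults stale cache
print(waiting_on)
```

In this model the rebuild path still lists port-1 as something to wait for, although the event was already received and the port is 'ACTIVE'.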

Although this manifested only on Minesweeper, it appears to be a nova bug. It shows up on VMware Minesweeper only because of the way the plugin synchronizes with the backend to report the operational status of a port.
A simple solution for this problem would be to reload the instance network info cache when network-vif-plugged events are received by nova. (But as the reporter knows nothing about nova, this might be a very bad idea as well.)
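
A rough sketch of that idea, under the same caveat: this is not nova code, and fetch_port_status, on_vif_plugged and the data layout are made-up stand-ins used only to show the cache being refreshed when the event is handled.

```python
# Hypothetical sketch: none of these names are real nova or neutron APIs.

def fetch_port_status(ports, port_id):
    """Stand-in for asking the network service for a port's current status."""
    return ports[port_id]["status"]


def on_vif_plugged(cache, ports, port_id):
    """Refresh the cached 'active' flag for the VIF named in the event."""
    for vif in cache:
        if vif["id"] == port_id:
            vif["active"] = fetch_port_status(ports, port_id) == "ACTIVE"
    return cache


ports = {"port-1": {"status": "ACTIVE"}}
cache = [{"id": "port-1", "active": False}]

on_vif_plugged(cache, ports, "port-1")
still_inactive = [vif["id"] for vif in cache if not vif["active"]]
print(still_inactive)
```

With the refresh in place, a later rebuild that consults the cache would find no inactive VIF and would have nothing to wait for. Whether this is the right place to do the refresh in nova is left open, as noted above.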

[1] http://208.91.1.172/logs/neutron/98278/2/413209/testr_results.html
[2] http://208.91.1.172/logs/neutron/73234/34/413213/testr_results.html
[3] http://208.91.1.172/logs/neutron/73234/34/413213/logs/screen-n-cpu.txt.gz?level=WARNING#_2014-06-06_01_46_36_219
[4] http://208.91.1.172/logs/neutron/73234/34/413213/logs/screen-n-cpu.txt.gz?level=DEBUG#_2014-06-06_01_41_31_767

(*) runs libvirt/KVM + NSX