I'm not 100% sure what causes this issue, but I think the sequence goes like this:
- Instance gets started, rebuild triggered
- Destroy starts
- Agent loop detects that the related tap device exists
- Instance destroyed, and its tap device with it
- Instance start triggered, nova starts listening for the network-vif-plugged event
2016-01-11 04:00:46.887 DEBUG nova.compute.manager [req-d99131b4-6ad7-4d60-8d1f-5c44075c719c tempest-ServerPersonalityTestJSON-1787158816 tempest-ServerPersonalityTestJSON-675574703] [instance: 5cc0ffad-aa65-4780-8cdc-a7c5040824fe] Preparing to wait for external event network-vif-plugged-7094cb28-f242-4e85-85a2-7da2a147e4f0 prepare_for_instance_event /opt/stack/new/nova/nova/compute/manager.py:514
- Instance created and tap device created
- Agent loop detects that the device is still there and assumes it is active - it did not realize that it disappeared and appeared again (see the sketch further below)
---> so the device-up message is missing --> the port will never become ACTIVE!
- Instance stays in paused state until it gets deleted
2016-01-11 04:03:37.974 DEBUG nova.compute.manager [req-a7841493-f6c3-485d-a9e2-720d5386dcf2 tempest-ServerPersonalityTestJSON-1787158816 tempest-ServerPersonalityTestJSON-675574703] [instance: 5cc0ffad-aa65-4780-8cdc-a7c5040824fe] Events pending at deletion: network-vif-plugged-7094cb28-f242-4e85-85a2-7da2a147e4f0 _delete_instance /opt/stack/new/nova/nova/compute/manager.py:2391
A long time after the instance got deleted, the network-vif-plugged timeout occurs.
So the delete was probably triggered by the test case itself...
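To illustrate the race, here is a minimal Python sketch (illustrative only, not the actual agent code) of a purely set-difference based device scan. A tap device that is deleted and recreated between two polls ends up in neither the "added" nor the "removed" set, so the agent never reports a device-down followed by a device-up:

    def scan_devices(previous, current):
        """Diff two device-name snapshots the way a polling loop would."""
        added = current - previous
        removed = previous - current
        return added, removed

    # poll N: old instance still running, tap device present
    prev_snapshot = {"tap7094cb28-f2"}   # tap name for the port above (illustrative)
    # ...between the polls: instance destroyed (tap gone), then
    # recreated with the same port, so the tap name is identical...
    # poll N+1: tap device present again
    curr_snapshot = {"tap7094cb28-f2"}

    added, removed = scan_devices(prev_snapshot, curr_snapshot)
    print(added, removed)  # set() set() -> device looks unchanged, no
                           # device-up is reported, port never goes ACTIVE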
I cannot verify this assumption from the logs with 100% certainty, as I cannot figure out the exact point in time when the tap device got created. But at least the agent never detected it as down. The timeline looks like this:
agent-loop scan: [6] 2016-01-11 04:00:41.727 (device in current + added due to resync)
libvirt: [2] 2016-01-11 04:00:45.875+0000: 16554: VM destroyed
nova log: [5] 2016-01-11 04:00:45.928 - VM destroyed
agent-loop scan: [7] 2016-01-11 04:00:47.560 Loop iteration exceeded interval (2 vs. 5.83269119263)! daemon_loop /opt/stack/
agent-loop scan: [7] 2016-01-11 04:00:47.560 (device in current)
libvirt: [1] 2016-01-11 04:00:47.731+0000: 16552: instance started (or about to be started?)
nova log [4] 2016-01-11 04:00:47.957 instance started
agent-loop scan: [8] 2016-01-11 04:00:51.560 (device in current)
nova log [9] deletion triggered although still waiting for event
nova log [3] vif timeout
--Device should have been detected as deleted by the agent somewhere in between!
So is the solution to reduce the polling interval? I'm not sure that would help in this case. I think the root cause is that the agent loop (daemon_loop) executes its tasks synchronously, so the effective interval can grow whenever those actions take longer - that is exactly what the "Loop iteration exceeded interval (2 vs. 5.83269119263)!" message in [7] shows. It would also explain why the 04:00:47.560 scan still reported the device as present: its device list was probably gathered several seconds earlier, before the destroy at 04:00:45.875, and by the next scan at 04:00:51.560 the device had already been recreated.
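For illustration, a simplified sketch of such a synchronous loop (not the actual neutron daemon_loop, which does considerably more). The sleep only fills up the remainder of the nominal interval; a slow iteration simply starts the next scan late, which widens the window in which a delete+recreate can go unnoticed:

    import time

    POLLING_INTERVAL = 2  # seconds, as in the log message above

    def daemon_loop(scan_and_process):
        while True:
            start = time.time()
            scan_and_process()          # executed synchronously in the loop
            elapsed = time.time() - start
            if elapsed > POLLING_INTERVAL:
                # corresponds to "Loop iteration exceeded interval
                # (2 vs. 5.83269119263)!" in [7]
                print("Loop iteration exceeded interval (%s vs. %s)!"
                      % (POLLING_INTERVAL, elapsed))
            else:
                time.sleep(POLLING_INTERVAL - elapsed)

    # e.g. daemon_loop(lambda: time.sleep(5.8)) would scan roughly every
    # 5.8 s instead of every 2 s, with no catch-up afterwards.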
[1] http://logs.openstack.org/99/233499/16/check/gate-tempest-dsvm-neutron-linuxbridge/d2c2c00/logs/libvirt/libvirtd.txt.gz#_2016-01-11_04_00_47_731
[2] http://logs.openstack.org/99/233499/16/check/gate-tempest-dsvm-neutron-linuxbridge/d2c2c00/logs/libvirt/libvirtd.txt.gz#_2016-01-11_04_00_45_875
[3] http://logs.openstack.org/99/233499/16/check/gate-tempest-dsvm-neutron-linuxbridge/d2c2c00/logs/screen-n-cpu.txt.gz#_2016-01-11_04_05_47_959
[4] http://logs.openstack.org/99/233499/16/check/gate-tempest-dsvm-neutron-linuxbridge/d2c2c00/logs/screen-n-cpu.txt.gz#_2016-01-11_04_00_47_957
[5] http://logs.openstack.org/99/233499/16/check/gate-tempest-dsvm-neutron-linuxbridge/d2c2c00/logs/screen-n-cpu.txt.gz#_2016-01-11_04_00_45_873
[6] http://logs.openstack.org/99/233499/16/check/gate-tempest-dsvm-neutron-linuxbridge/d2c2c00/logs/screen-q-agt.txt.gz#_2016-01-11_04_00_41_727
[7] http://logs.openstack.org/99/233499/16/check/gate-tempest-dsvm-neutron-linuxbridge/d2c2c00/logs/screen-q-agt.txt.gz#_2016-01-11_04_00_47_560
[8] http://logs.openstack.org/99/233499/16/check/gate-tempest-dsvm-neutron-linuxbridge/d2c2c00/logs/screen-q-agt.txt.gz#_2016-01-11_04_00_51_560
[9] http://logs.openstack.org/99/233499/16/check/gate-tempest-dsvm-neutron-linuxbridge/d2c2c00/logs/screen-n-cpu.txt.gz#_2016-01-11_04_03_37_973