Comment 1 for bug 1531862

Andreas Scheuring (andreas-scheuring) wrote : Re: ServerPersonalityTestJSON:test_rebuild_server_with_personality fails intermittent

I'm not 100% sure what causes this issue, but I think it goes like this:

- Instance gets started, rebuild triggered
- Destroy starts
- Agent loop detects that the related tap device exists
- Instance destroyed, and its tap device with it
- Instance start triggered; nova starts listening for the network-vif-plugged event:
2016-01-11 04:00:46.887 DEBUG nova.compute.manager [req-d99131b4-6ad7-4d60-8d1f-5c44075c719c tempest-ServerPersonalityTestJSON-1787158816 tempest-ServerPersonalityTestJSON-675574703] [instance: 5cc0ffad-aa65-4780-8cdc-a7c5040824fe] Preparing to wait for external event network-vif-plugged-7094cb28-f242-4e85-85a2-7da2a147e4f0 prepare_for_instance_event /opt/stack/new/nova/nova/compute/manager.py:514

- Instance created, and the tap device re-created
- Agent loop detects that the device is still there and assumes it is active - it did not realize that it disappeared and appeared again (see the sketch below)
 ---> so the device-up message is missing --> the port will never become ACTIVE!
- Instance stays in the paused state until it gets deleted:
2016-01-11 04:03:37.974 DEBUG nova.compute.manager [req-a7841493-f6c3-485d-a9e2-720d5386dcf2 tempest-ServerPersonalityTestJSON-1787158816 tempest-ServerPersonalityTestJSON-675574703] [instance: 5cc0ffad-aa65-4780-8cdc-a7c5040824fe] Events pending at deletion: network-vif-plugged-7094cb28-f242-4e85-85a2-7da2a147e4f0 _delete_instance /opt/stack/new/nova/nova/compute/manager.py:2391
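
To illustrate the suspected mechanism: the agent only diffs the device set between polling iterations. A rough sketch of that poll-and-diff pattern (simplified, with illustrative names - not the actual linuxbridge agent code; the tap name is derived from the port id in the log above):

    # Simplified poll-and-diff sketch (illustrative, not the real agent code).
    # Each iteration compares the current set of tap devices against the
    # previous scan.
    def scan_devices(current, previous):
        added = current - previous    # would trigger wiring + the "up" message
        removed = previous - current  # would trigger the "down" message
        return added, removed

    # The tap device is destroyed and re-created between two scans, so both
    # sets are identical: neither "added" nor "removed" fires, no device-up
    # message is sent, and the port never becomes ACTIVE.
    previous = {'tap7094cb28-f2'}   # scan before the rebuild
    current = {'tap7094cb28-f2'}    # scan after the rebuild, same device name
    print(scan_devices(current, previous))   # (set(), set())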

A long time after the instance was deleted, the network-vif-plugged timeout occurs.

So the test case probably triggered the delete...
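
The long gap makes sense: nova only gives up waiting for the event after vif_plugging_timeout (300 seconds by default), which roughly matches the five minutes between the start of the wait (04:00:46) and the timeout [3]. A minimal sketch of that wait-with-timeout pattern (a simplified stand-in for nova's external-event machinery, not its actual implementation):

    # Minimal wait-with-timeout sketch (illustrative; nova's real code lives
    # around prepare_for_instance_event in nova/compute/manager.py).
    import threading

    VIF_PLUGGING_TIMEOUT = 300  # nova's vif_plugging_timeout default, seconds

    plugged = threading.Event()  # would be set by the network-vif-plugged callback

    def wait_for_vif_plugged():
        # The agent never reports the device up, so neutron never sends the
        # event and this only returns after the full timeout.
        if not plugged.wait(timeout=VIF_PLUGGING_TIMEOUT):
            raise RuntimeError('Timeout waiting for network-vif-plugged')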

I cannot verify this assumption from the logs with 100% certainty, as I cannot figure out the exact point in time when the tap device was created. But at least the agent never detected it as down. The timeline looks like this:

agent-loop scan: [6] 2016-01-11 04:00:41.727 (device in current + added due to resync)
libvirt: [2] 2016-01-11 04:00:45.875+0000: 16554: VM destroyed
nova log: [5] 2016-01-11 04:00:45.928 VM destroyed

agent-loop scan: [7] 2016-01-11 04:00:47.560 Loop iteration exceeded interval (2 vs. 5.83269119263)! daemon_loop /opt/stack/
agent-loop scan: [7] 2016-01-11 04:00:47.560 (device in current)
libvirt: [1] 2016-01-11 04:00:47.731+0000: 16552: instance started (or about to be started?)
nova log: [4] 2016-01-11 04:00:47.957 instance started
agent-loop scan: [8] 2016-01-11 04:00:51.560 (device in current)
nova log: [9] 2016-01-11 04:03:37.973 deletion triggered although still waiting for the event

nova log: [3] 2016-01-11 04:05:47.959 vif timeout
--> The device should have been detected as deleted by the agent somewhere in between!

[1] http://logs.openstack.org/99/233499/16/check/gate-tempest-dsvm-neutron-linuxbridge/d2c2c00/logs/libvirt/libvirtd.txt.gz#_2016-01-11_04_00_47_731
[2] http://logs.openstack.org/99/233499/16/check/gate-tempest-dsvm-neutron-linuxbridge/d2c2c00/logs/libvirt/libvirtd.txt.gz#_2016-01-11_04_00_45_875
[3] http://logs.openstack.org/99/233499/16/check/gate-tempest-dsvm-neutron-linuxbridge/d2c2c00/logs/screen-n-cpu.txt.gz#_2016-01-11_04_05_47_959
[4] http://logs.openstack.org/99/233499/16/check/gate-tempest-dsvm-neutron-linuxbridge/d2c2c00/logs/screen-n-cpu.txt.gz#_2016-01-11_04_00_47_957
[5] http://logs.openstack.org/99/233499/16/check/gate-tempest-dsvm-neutron-linuxbridge/d2c2c00/logs/screen-n-cpu.txt.gz#_2016-01-11_04_00_45_873
[6] http://logs.openstack.org/99/233499/16/check/gate-tempest-dsvm-neutron-linuxbridge/d2c2c00/logs/screen-q-agt.txt.gz#_2016-01-11_04_00_41_727
[7] http://logs.openstack.org/99/233499/16/check/gate-tempest-dsvm-neutron-linuxbridge/d2c2c00/logs/screen-q-agt.txt.gz#_2016-01-11_04_00_47_560
[8] http://logs.openstack.org/99/233499/16/check/gate-tempest-dsvm-neutron-linuxbridge/d2c2c00/logs/screen-q-agt.txt.gz#_2016-01-11_04_00_51_560
[9] http://logs.openstack.org/99/233499/16/check/gate-tempest-dsvm-neutron-linuxbridge/d2c2c00/logs/screen-n-cpu.txt.gz#_2016-01-11_04_03_37_973

So is the solution to reduce the polling interval? I'm not sure that would help in this case. I think the root cause is that the agent loop (daemon_loop) executes its tasks synchronously, which means the interval can increase if actions take longer...
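
Roughly, the loop behaves like this (a simplified sketch of the synchronous daemon_loop pattern, not the exact neutron code):

    # Simplified synchronous daemon_loop sketch (illustrative). Because the
    # scan/wiring work runs in the same loop, a slow iteration delays the
    # next scan instead of overlapping with it - which is exactly the
    # "Loop iteration exceeded interval (2 vs. 5.83...)!" warning in [7].
    import time

    POLLING_INTERVAL = 2  # seconds

    def daemon_loop(process):
        while True:
            start = time.time()
            process()  # synchronous: blocks the loop while it runs
            elapsed = time.time() - start
            if elapsed < POLLING_INTERVAL:
                time.sleep(POLLING_INTERVAL - elapsed)
            else:
                print('Loop iteration exceeded interval (%s vs. %s)!'
                      % (POLLING_INTERVAL, elapsed))

The longer an iteration runs, the wider the window in which a tap device can be deleted and re-created without the agent ever seeing it gone.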