Comment 1 for bug 1531862

Andreas Scheuring (andreas-scheuring) wrote : Re: ServerPersonalityTestJSON:test_rebuild_server_with_personality fails intermittent

I'm not 100% sure what causes this issue, but I think it goes like this:

- Instance gets started, rebuild triggered
- Destroy starts
- Agent loop detects that the related tap device exists
- Instance destroyed, and its tap device with it
- Instance start triggered; nova starts listening for the network-vif-plugged event:
2016-01-11 04:00:46.887 DEBUG nova.compute.manager [req-d99131b4-6ad7-4d60-8d1f-5c44075c719c tempest-ServerPersonalityTestJSON-1787158816 tempest-ServerPersonalityTestJSON-675574703] [instance: 5cc0ffad-aa65-4780-8cdc-a7c5040824fe] Preparing to wait for external event network-vif-plugged-7094cb28-f242-4e85-85a2-7da2a147e4f0 prepare_for_instance_event /opt/stack/new/nova/nova/compute/manager.py:514

- Instance created, and the tap device re-created
- Agent loop detects that the device is still there and assumes it is active - it did not realize that it disappeared and appeared again (see the sketch below)
 ---> so the device-up message is missing --> the port will never become ACTIVE!
- Instance stays in the paused state until it gets deleted:
2016-01-11 04:03:37.974 DEBUG nova.compute.manager [req-a7841493-f6c3-485d-a9e2-720d5386dcf2 tempest-ServerPersonalityTestJSON-1787158816 tempest-ServerPersonalityTestJSON-675574703] [instance: 5cc0ffad-aa65-4780-8cdc-a7c5040824fe] Events pending at deletion: network-vif-plugged-7094cb28-f242-4e85-85a2-7da2a147e4f0 _delete_instance /opt/stack/new/nova/nova/compute/manager.py:2391
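
To illustrate the suspected mechanism: the agent only diffs the device set between polling iterations. A rough sketch of that poll-and-diff pattern (simplified, with illustrative names - not the actual linuxbridge agent code; the tap name is derived from the port id in the log above):

    # Simplified poll-and-diff sketch (illustrative, not the real agent code).
    # Each iteration compares the current set of tap devices against the
    # previous scan.
    def scan_devices(current, previous):
        added = current - previous    # would trigger wiring + the "up" message
        removed = previous - current  # would trigger the "down" message
        return added, removed

    # The tap device is destroyed and re-created between two scans, so both
    # sets are identical: neither "added" nor "removed" fires, no device-up
    # message is sent, and the port never becomes ACTIVE.
    previous = {'tap7094cb28-f2'}   # scan before the rebuild
    current = {'tap7094cb28-f2'}    # scan after the rebuild, same device name
    print(scan_devices(current, previous))   # (set(), set())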

A long time after the instance was deleted, the network-vif-plugged timeout occurs.

So the test case probably triggered the delete...
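
The long gap makes sense: nova only gives up waiting for the event after vif_plugging_timeout (300 seconds by default), which roughly matches the five minutes between the start of the wait (04:00:46) and the timeout [3]. A minimal sketch of that wait-with-timeout pattern (a simplified stand-in for nova's external-event machinery, not its actual implementation):

    # Minimal wait-with-timeout sketch (illustrative; nova's real code lives
    # around prepare_for_instance_event in nova/compute/manager.py).
    import threading

    VIF_PLUGGING_TIMEOUT = 300  # nova's vif_plugging_timeout default, seconds

    plugged = threading.Event()  # would be set by the network-vif-plugged callback

    def wait_for_vif_plugged():
        # The agent never reports the device up, so neutron never sends the
        # event and this only returns after the full timeout.
        if not plugged.wait(timeout=VIF_PLUGGING_TIMEOUT):
            raise RuntimeError('Timeout waiting for network-vif-plugged')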

I cannot verify this assumption from the logs with 100% certainty, as I cannot figure out the exact point in time when the tap device was created. But at least the agent never detected it as down. The timeline looks like this:

agent-loop scan: [6] 2016-01-11 04:00:41.727 (device in current + added due to resync)
libvirt: [2] 2016-01-11 04:00:45.875+0000: 16554: VM destroyed
nova log: [5] 2016-01-11 04:00:45.928 VM destroyed

agent-loop scan: [7] 2016-01-11 04:00:47.560 Loop iteration exceeded interval (2 vs. 5.83269119263)! daemon_loop /opt/stack/
agent-loop scan: [7] 2016-01-11 04:00:47.560 (device in current)
libvirt: [1] 2016-01-11 04:00:47.731+0000: 16552: instance started (or about to be started?)
nova log: [4] 2016-01-11 04:00:47.957 instance started
agent-loop scan: [8] 2016-01-11 04:00:51.560 (device in current)
nova log: [9] 2016-01-11 04:03:37.973 deletion triggered although still waiting for the event

nova log: [3] 2016-01-11 04:05:47.959 vif timeout
--> The device should have been detected as deleted by the agent somewhere in between!

[1] http://logs.openstack.org/99/233499/16/check/gate-tempest-dsvm-neutron-linuxbridge/d2c2c00/logs/libvirt/libvirtd.txt.gz#_2016-01-11_04_00_47_731
[2] http://logs.openstack.org/99/233499/16/check/gate-tempest-dsvm-neutron-linuxbridge/d2c2c00/logs/libvirt/libvirtd.txt.gz#_2016-01-11_04_00_45_875
[3] http://logs.openstack.org/99/233499/16/check/gate-tempest-dsvm-neutron-linuxbridge/d2c2c00/logs/screen-n-cpu.txt.gz#_2016-01-11_04_05_47_959
[4] http://logs.openstack.org/99/233499/16/check/gate-tempest-dsvm-neutron-linuxbridge/d2c2c00/logs/screen-n-cpu.txt.gz#_2016-01-11_04_00_47_957
[5] http://logs.openstack.org/99/233499/16/check/gate-tempest-dsvm-neutron-linuxbridge/d2c2c00/logs/screen-n-cpu.txt.gz#_2016-01-11_04_00_45_873
[6] http://logs.openstack.org/99/233499/16/check/gate-tempest-dsvm-neutron-linuxbridge/d2c2c00/logs/screen-q-agt.txt.gz#_2016-01-11_04_00_41_727
[7] http://logs.openstack.org/99/233499/16/check/gate-tempest-dsvm-neutron-linuxbridge/d2c2c00/logs/screen-q-agt.txt.gz#_2016-01-11_04_00_47_560
[8] http://logs.openstack.org/99/233499/16/check/gate-tempest-dsvm-neutron-linuxbridge/d2c2c00/logs/screen-q-agt.txt.gz#_2016-01-11_04_00_51_560
[9] http://logs.openstack.org/99/233499/16/check/gate-tempest-dsvm-neutron-linuxbridge/d2c2c00/logs/screen-n-cpu.txt.gz#_2016-01-11_04_03_37_973

So is the solution to reduce the polling interval? I'm not sure that would help in this case. I think the root cause is that the agent loop (daemon_loop) executes its tasks synchronously, which means the interval can increase if actions take longer...
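
Roughly, the loop behaves like this (a simplified sketch of the synchronous daemon_loop pattern, not the exact neutron code):

    # Simplified synchronous daemon_loop sketch (illustrative). Because the
    # scan/wiring work runs in the same loop, a slow iteration delays the
    # next scan instead of overlapping with it - which is exactly the
    # "Loop iteration exceeded interval (2 vs. 5.83...)!" warning in [7].
    import time

    POLLING_INTERVAL = 2  # seconds

    def daemon_loop(process):
        while True:
            start = time.time()
            process()  # synchronous: blocks the loop while it runs
            elapsed = time.time() - start
            if elapsed < POLLING_INTERVAL:
                time.sleep(POLLING_INTERVAL - elapsed)
            else:
                print('Loop iteration exceeded interval (%s vs. %s)!'
                      % (POLLING_INTERVAL, elapsed))

The longer an iteration runs, the wider the window in which a tap device can be deleted and re-created without the agent ever seeing it gone.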