Comment 15 for bug 1567668

Jakub Libosvar (libosvar) wrote:

Update: After investigation I finally found the root cause of why our jobs get stuck.

When using the ovsdb native interface, we use the ovs python library, which encapsulates the connection from our code to ovsdb. Part of this connection is a keepalive feature: during periods of inactivity it sends an "inactivity probe" in the form of an echo request every 5 seconds. The connection to ovsdb is driven by a FSM which, after sending the probe, enters an idle state and waits for an echo reply from Neutron (also handled by the ovs library). In the logs I see that after 5 seconds no reply was received and ovsdb closed the connection.

2016-05-08T04:43:21.031Z|109380|reconnect|DBG|tcp:127.0.0.1:42822: idle 5000 ms, sending inactivity probe
2016-05-08T04:43:21.031Z|109381|reconnect|DBG|tcp:127.0.0.1:42822: entering IDLE
2016-05-08T04:43:21.031Z|109382|jsonrpc|DBG|tcp:127.0.0.1:42822: send request, method="echo", params=[], id="echo"
2016-05-08T04:43:21.032Z|109384|jsonrpc|DBG|tcp:127.0.0.1:42822: received reply, result=[], id="echo"
2016-05-08T04:43:21.032Z|109385|reconnect|DBG|tcp:127.0.0.1:42822: entering ACTIVE
2016-05-08T04:43:26.031Z|112393|reconnect|DBG|tcp:127.0.0.1:42822: idle 5000 ms, sending inactivity probe
2016-05-08T04:43:26.031Z|112394|reconnect|DBG|tcp:127.0.0.1:42822: entering IDLE
2016-05-08T04:43:26.031Z|112395|jsonrpc|DBG|tcp:127.0.0.1:42822: send request, method="echo", params=[], id="echo"
2016-05-08T04:43:31.032Z|113938|reconnect|ERR|tcp:127.0.0.1:42822: no response to inactivity probe after 5 seconds, disconnecting
2016-05-08T04:43:31.032Z|113939|reconnect|DBG|tcp:127.0.0.1:42822: connection dropped
2016-05-08T04:43:31.032Z|113940|reconnect|DBG|tcp:127.0.0.1:42822: entering VOID
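
To make the mechanism clearer, below is a simplified Python sketch of the keepalive exchange described above. This is not the actual ovs library code; the class and names are invented purely to illustrate how an inactivity probe with a 5-second interval, as seen in the log, leads to a dropped connection when the echo reply never arrives.

import json
import time


class KeepaliveConnection:
    # Simplified, invented illustration of the inactivity-probe keepalive;
    # not the real ovs library implementation.
    ACTIVE, IDLE, DISCONNECTED = "ACTIVE", "IDLE", "DISCONNECTED"

    def __init__(self, sock, probe_interval=5.0):
        self.sock = sock                      # an already-connected socket
        self.probe_interval = probe_interval  # 5 s, as in the log above
        self.state = self.ACTIVE
        self.last_activity = time.monotonic()

    def note_activity(self):
        # Called whenever any message (including an echo reply) is received.
        self.last_activity = time.monotonic()
        self.state = self.ACTIVE

    def run(self):
        # Periodic tick mirroring the FSM transitions visible in the log.
        idle_for = time.monotonic() - self.last_activity
        if self.state == self.ACTIVE and idle_for >= self.probe_interval:
            # "idle 5000 ms, sending inactivity probe" / "entering IDLE"
            probe = {"method": "echo", "params": [], "id": "echo"}
            self.sock.sendall(json.dumps(probe).encode())
            self.state = self.IDLE
        elif self.state == self.IDLE and idle_for >= 2 * self.probe_interval:
            # "no response to inactivity probe after 5 seconds, disconnecting"
            self.sock.close()
            self.state = self.DISCONNECTED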

This leads to blocking behavior in the ovsdb native interface (part of the Neutron code), where we wait for communication from the ovsdb server that never actually happens. We wait indefinitely for ovsdb, and so after 180 seconds the test case correctly raises a TimeoutException. Unfortunately, we keep using ovsdb in this stuck state without any timeout fixture, which leads to the overall job timeout (I wrote this just for clarity on why our jobs hang).
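
As a purely hypothetical sketch (not Neutron's actual code), the indefinite wait could be bounded with a timeout so that a lost ovsdb connection surfaces as an error instead of a hang; wait_for_result, OvsdbTimeoutError and the queue-based result passing below are all assumptions for illustration.

import queue


class OvsdbTimeoutError(RuntimeError):
    """Raised when ovsdb does not answer within the allowed time."""


def wait_for_result(result_queue, timeout=10):
    # result_queue is assumed to be a queue.Queue that the connection thread
    # fills with the ovsdb reply; once the connection is dropped nothing will
    # ever arrive, and without a timeout the caller blocks forever.
    try:
        return result_queue.get(timeout=timeout)
    except queue.Empty:
        # Fail loudly so the caller can reconnect or abort cleanly
        # instead of hanging until the job-level timeout kills the run.
        raise OvsdbTimeoutError(
            "no reply from ovsdb within %s seconds" % timeout)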

I reckon there is a bug in both the ovs library and the Neutron code: the ovs library probably contains a regression in the keepalive feature, and Neutron shouldn't block indefinitely when the connection to ovsdb is lost.

I'll go over ovs patches to see whether I can find the culprit.