Comment 7 for bug 1606827

John Schwarz (jschwarz) wrote :

@Kevin, the scenario we detected doesn't involve only the neutron-server being restarted, but also the rabbitmq server. The resulting state is that, for reasons we haven't identified, even though the rabbitmq server is up, the report_state RPC doesn't return immediately and in fact stays "hung" until the timeout expires. Then, once the timeout expires and report_state retries, the call succeeds almost immediately.

In some cases we've observed that if the timeout is 600 seconds (the maximum), it can take quite a long time (we observed just under 10 minutes) before the call is actually retried and succeeds.

I've pasted the traceback we encountered at [1], which demonstrates this issue. Granted, this "restart all the controllers" business had to be repeated a few times before the timeout reached its 600-second maximum, but it's a possible scenario nonetheless (power outages, etc). Notice that in the traceback there is a just-under-10-minute gap starting at 15:28:45, the call times out at 15:38:30, the code sleeps for a second and executes again, and finally at 15:38:32 the call succeeds immediately.
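
To illustrate how a few repeated failures can push the timeout up to the maximum, here is a rough sketch of a backing-off timeout. Only the 600-second cap comes from this report; the 60-second starting value, the doubling factor and the function names are assumptions made up for the example, not the actual neutron code.

```python
def timeout_schedule(initial=60, maximum=600, factor=2):
    """Yield the timeout used for each successive attempt.

    Assumed behaviour: every timed-out call multiplies the timeout by
    `factor`, capped at `maximum` (600 seconds in this report).
    """
    timeout = initial
    while True:
        yield timeout
        timeout = min(timeout * factor, maximum)


def timeouts_for(attempts, initial=60, maximum=600, factor=2):
    """Return the timeouts used by the first `attempts` calls."""
    schedule = timeout_schedule(initial, maximum, factor)
    return [next(schedule) for _ in range(attempts)]


if __name__ == "__main__":
    # With these assumed values the fifth attempt already waits the full cap.
    print(timeouts_for(5))
```

Under these assumptions, four consecutive timeouts are enough for the next hung call to block for the full 600 seconds, which matches the "went on for a few times" description above.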

Thus, "once the server restarts, everything will work" is one possible scenario, but this bug is about a different one :) Either way, lowering the timeout, as the suggested merged patch did, doesn't hurt either scenario.

[1]: http://paste.openstack.org/show/544333/