functional: DbAddCommand sometimes times out after 10 seconds

Bug #1741889 reported by Jakub Libosvar
14
This bug affects 2 people
Affects Status Importance Assigned to Milestone
neutron
Fix Released
High
Unassigned
ovsdbapp
Fix Released
Undecided
Unassigned

Bug Description

Traceback (most recent call last):
  File "/opt/stack/new/neutron/.tox/dsvm-functional/local/lib/python2.7/site-packages/fixtures/fixture.py", line 197, in setUp
    self._setUp()
  File "neutron/tests/common/net_helpers.py", line 721, in _setUp
    self.bridge = common_base.create_resource(self.prefix, ovs.add_bridge)
  File "neutron/tests/common/base.py", line 47, in create_resource
    return creation_func(name, *args, **kwargs)
  File "neutron/agent/common/ovs_lib.py", line 140, in add_bridge
    br.create()
  File "neutron/agent/common/ovs_lib.py", line 255, in create
    FAILMODE_SECURE))
  File "/usr/lib/python2.7/contextlib.py", line 24, in __exit__
    self.gen.next()
  File "/opt/stack/new/neutron/.tox/dsvm-functional/local/lib/python2.7/site-packages/ovsdbapp/api.py", line 94, in transaction
    self._nested_txn = None
  File "/opt/stack/new/neutron/.tox/dsvm-functional/local/lib/python2.7/site-packages/ovsdbapp/api.py", line 54, in __exit__
    self.result = self.commit()
  File "/opt/stack/new/neutron/.tox/dsvm-functional/local/lib/python2.7/site-packages/ovsdbapp/backend/ovs_idl/transaction.py", line 57, in commit
    timeout=self.timeout)
ovsdbapp.exceptions.TimeoutException: Commands [<ovsdbapp.schema.open_vswitch.commands.AddBridgeCommand object at 0x7fdcc274a9d0>, <ovsdbapp.schema.open_vswitch.commands.DbAddCommand object at 0x7fdcc274aad0>] exceeded timeout 10 seconds

logstash query: http://logstash.openstack.org/#dashboard/file/logstash.json?query=build_name%3A%5C%22legacy-neutron-dsvm-functional%5C%22%20AND%20message%3A%5C%22ovsdbapp.exceptions.TimeoutException%3A%20Commands%5C%22%20AND%20tags%3Aconsole%20AND%20build_status%3AFAILURE

In the logs we can see AddBridgeCommand result returning TRY_AGAIN over and over again:
http://logs.openstack.org/70/531070/4/check/legacy-neutron-dsvm-functional/9aa7107/logs/dsvm-functional-logs/neutron.tests.functional.agent.test_dhcp_agent.DHCPAgentOVSTestCase.test_agent_mtu_set_on_interface_driver.txt.gz#_2018-01-08_04_20_06_754

Revision history for this message
Jakub Libosvar (libosvar) wrote :
Revision history for this message
Jakub Libosvar (libosvar) wrote :

The content of the file from comment #1 is just following repeating endlessly:

2018-01-08 13:14:19.695 10019 DEBUG ovsdbapp.backend.ovs_idl.transaction [-] Running txn command(idx=0): AddBridgeCommand(datapath_type=system, may_exist=True, name=test-bref309aba) do_commit /opt/stack/new/neutron/.tox/dsvm-functional/local/lib/python2.7/site-packages/ovsdbapp/backend/ovs_idl/transaction.py:84
2018-01-08 13:14:19.696 10019 DEBUG ovsdbapp.backend.ovs_idl.transaction [-] Running txn command(idx=1): DbAddCommand(record=test-bref309aba, column=protocols, table=Bridge, values=('OpenFlow10',)) do_commit /opt/stack/new/neutron/.tox/dsvm-functional/local/lib/python2.7/site-packages/ovsdbapp/backend/ovs_idl/transaction.py:84
2018-01-08 13:14:19.696 10019 DEBUG ovsdbapp.backend.ovs_idl.transaction [-] OVSDB transaction returned TRY_AGAIN, retrying do_commit /opt/stack/new/neutron/.tox/dsvm-functional/local/lib/python2.7/site-packages/ovsdbapp/backend/ovs_idl/transaction.py:93

Revision history for this message
Slawek Kaplonski (slaweq) wrote :

I don't know how it can be related but in ovsdb-server log there is a lot of connection errors, like:

2018-01-08T04:20:06.753Z|00126|jsonrpc|WARN|tcp:127.0.0.1:35670: receive error: Connection reset by peer
2018-01-08T04:20:06.753Z|00127|reconnect|WARN|tcp:127.0.0.1:35670: connection dropped (Connection reset by peer)

e.g. on http://logs.openstack.org/70/531070/4/check/legacy-neutron-dsvm-functional/9aa7107/logs/openvswitch/ovsdb-server.txt.gz which is corresponding log from first comment.

Revision history for this message
Ihar Hrachyshka (ihar-hrachyshka) wrote :

We see similar failures in fullstack around same time: http://logs.openstack.org/98/513398/10/check/neutron-fullstack/b62a726/logs/testr_results.html.gz

And this doesn't seem to affect gate as much as it did when it was reported. Maybe it's gone and was related to unstable infra.

Changed in neutron:
importance: Critical → High
Revision history for this message
Ihar Hrachyshka (ihar-hrachyshka) wrote :
Revision history for this message
Ihar Hrachyshka (ihar-hrachyshka) wrote :
Revision history for this message
Terry Wilson (otherwiseguy) wrote :

https://review.openstack.org/#/c/537241/ should at least fix the runaway TRY_AGAIN issue. The situation:

If there is a database disconnection between the call to idl.run() and do_commit() calls in Connection.run(), then we call txn.commit_block() which calls txn.commit() which calls _session.send() which will return with an error and the transaction status is set to TRY_AGAIN. commit_block() will then immediately return that status and not call idl.run(). We will see the TRY_AGAIN and eventually call commit_block() again and since idl.run() has not been run, it hasn't run through the logic to reconnect to the database. After we time out in do_commit(), Connection.run() will eventually loop back around to calling idl.run() and reconnect to the database.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to ovsdbapp (master)

Reviewed: https://review.openstack.org/537241
Committed: https://git.openstack.org/cgit/openstack/ovsdbapp/commit/?id=1810faecc9ad2345f3e2f9185ac64194c5a0d711
Submitter: Zuul
Branch: master

commit 1810faecc9ad2345f3e2f9185ac64194c5a0d711
Author: Terry Wilson <email address hidden>
Date: Wed Jan 24 03:41:13 2018 +0000

    Ensure idl.run() called on TRY_AGAIN

    If the DB connection drops after Connection.run() calls idl.run()
    but before Transaction.do_commit(), then it is possible for
    idl.commit() to return TRY_AGAIN due to the failure of session.send().
    This prevents commit_block() from running idl.run() which would
    actually trigger the DB reconnect, causing each successive call to
    commit_block() to fail in the same way until we stop trying again
    due to timeout and Connection.run() calls idl.run() again.

    For now, we can just ensure that idl.run() is called when we get a
    TRY_AGAIN. It might be worth it to refactor this a bit to not even
    use commit_block().

    Change-Id: I77402bde3fc4d45c770b9f48882870cfc91d719d
    Closes-Bug: #1741889

Changed in ovsdbapp:
status: New → Fix Released
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to ovsdbapp (stable/queens)

Fix proposed to branch: stable/queens
Review: https://review.openstack.org/538941

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to ovsdbapp (stable/pike)

Fix proposed to branch: stable/pike
Review: https://review.openstack.org/538942

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to ovsdbapp (stable/pike)

Reviewed: https://review.openstack.org/538942
Committed: https://git.openstack.org/cgit/openstack/ovsdbapp/commit/?id=e5ded80fe579d967642ee9173874ba924395d3af
Submitter: Zuul
Branch: stable/pike

commit e5ded80fe579d967642ee9173874ba924395d3af
Author: Terry Wilson <email address hidden>
Date: Wed Jan 24 03:41:13 2018 +0000

    Ensure idl.run() called on TRY_AGAIN

    If the DB connection drops after Connection.run() calls idl.run()
    but before Transaction.do_commit(), then it is possible for
    idl.commit() to return TRY_AGAIN due to the failure of session.send().
    This prevents commit_block() from running idl.run() which would
    actually trigger the DB reconnect, causing each successive call to
    commit_block() to fail in the same way until we stop trying again
    due to timeout and Connection.run() calls idl.run() again.

    For now, we can just ensure that idl.run() is called when we get a
    TRY_AGAIN. It might be worth it to refactor this a bit to not even
    use commit_block().

    Change-Id: I77402bde3fc4d45c770b9f48882870cfc91d719d
    Closes-Bug: #1741889
    (cherry picked from commit 1810faecc9ad2345f3e2f9185ac64194c5a0d711)

tags: added: in-stable-pike
tags: added: in-stable-queens
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to ovsdbapp (stable/queens)

Reviewed: https://review.openstack.org/538941
Committed: https://git.openstack.org/cgit/openstack/ovsdbapp/commit/?id=c3288aec527ef21eb57d755540e34e6ac5382a7c
Submitter: Zuul
Branch: stable/queens

commit c3288aec527ef21eb57d755540e34e6ac5382a7c
Author: Terry Wilson <email address hidden>
Date: Wed Jan 24 03:41:13 2018 +0000

    Ensure idl.run() called on TRY_AGAIN

    If the DB connection drops after Connection.run() calls idl.run()
    but before Transaction.do_commit(), then it is possible for
    idl.commit() to return TRY_AGAIN due to the failure of session.send().
    This prevents commit_block() from running idl.run() which would
    actually trigger the DB reconnect, causing each successive call to
    commit_block() to fail in the same way until we stop trying again
    due to timeout and Connection.run() calls idl.run() again.

    For now, we can just ensure that idl.run() is called when we get a
    TRY_AGAIN. It might be worth it to refactor this a bit to not even
    use commit_block().

    Change-Id: I77402bde3fc4d45c770b9f48882870cfc91d719d
    Closes-Bug: #1741889
    (cherry picked from commit 1810faecc9ad2345f3e2f9185ac64194c5a0d711)

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/ovsdbapp 0.9.1

This issue was fixed in the openstack/ovsdbapp 0.9.1 release.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/ovsdbapp 0.4.2

This issue was fixed in the openstack/ovsdbapp 0.4.2 release.

Revision history for this message
Ihar Hrachyshka (ihar-hrachyshka) wrote :

After ovsdbapp release, we don't see it happening in functional job anymore. There are some timeout hits in logstash in tempest jobs, but happening at the same time as other actions executed by the agent (e.g. update state RPC call) are timing out too, so probably host overloaded.

Changed in neutron:
status: New → Fix Released
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/ovsdbapp 0.11.0

This issue was fixed in the openstack/ovsdbapp 0.11.0 release.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/ovsdbapp 0.12.0

This issue was fixed in the openstack/ovsdbapp 0.12.0 release.

To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Duplicates of this bug

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.