neutron

functional: DbAddCommand sometimes times out after 10 seconds

Bug #1741889 reported by Jakub Libosvar on 2018-01-08

This bug affects 2 people

Affects		Status	Importance	Assigned to	Milestone
	neutron	Fix Released	High	Unassigned
	ovsdbapp	Fix Released	Undecided	Unassigned

Bug Description

Traceback (most recent call last):
  File "/opt/stack/new/neutron/.tox/dsvm-functional/local/lib/python2.7/site-packages/fixtures/fixture.py", line 197, in setUp
    self._setUp()
  File "neutron/tests/common/net_helpers.py", line 721, in _setUp
    self.bridge = common_base.create_resource(self.prefix, ovs.add_bridge)
  File "neutron/tests/common/base.py", line 47, in create_resource
    return creation_func(name, *args, **kwargs)
  File "neutron/agent/common/ovs_lib.py", line 140, in add_bridge
    br.create()
  File "neutron/agent/common/ovs_lib.py", line 255, in create
    FAILMODE_SECURE))
  File "/usr/lib/python2.7/contextlib.py", line 24, in __exit__
    self.gen.next()
  File "/opt/stack/new/neutron/.tox/dsvm-functional/local/lib/python2.7/site-packages/ovsdbapp/api.py", line 94, in transaction
    self._nested_txn = None
  File "/opt/stack/new/neutron/.tox/dsvm-functional/local/lib/python2.7/site-packages/ovsdbapp/api.py", line 54, in __exit__
    self.result = self.commit()
  File "/opt/stack/new/neutron/.tox/dsvm-functional/local/lib/python2.7/site-packages/ovsdbapp/backend/ovs_idl/transaction.py", line 57, in commit
    timeout=self.timeout)
ovsdbapp.exceptions.TimeoutException: Commands [<ovsdbapp.schema.open_vswitch.commands.AddBridgeCommand object at 0x7fdcc274a9d0>, <ovsdbapp.schema.open_vswitch.commands.DbAddCommand object at 0x7fdcc274aad0>] exceeded timeout 10 seconds

logstash query: http://logstash.openstack.org/#dashboard/file/logstash.json?query=build_name%3A%5C%22legacy-neutron-dsvm-functional%5C%22%20AND%20message%3A%5C%22ovsdbapp.exceptions.TimeoutException%3A%20Commands%5C%22%20AND%20tags%3Aconsole%20AND%20build_status%3AFAILURE

In the logs we can see AddBridgeCommand result returning TRY_AGAIN over and over again:
http://logs.openstack.org/70/531070/4/check/legacy-neutron-dsvm-functional/9aa7107/logs/dsvm-functional-logs/neutron.tests.functional.agent.test_dhcp_agent.DHCPAgentOVSTestCase.test_agent_mtu_set_on_interface_driver.txt.gz#_2018-01-08_04_20_06_754

Tags:

Revision history for this message

Jakub Libosvar (libosvar) wrote on 2018-01-08:

It also causes functional tests to hang: http://logs.openstack.org/84/531684/2/check/neutron-functional/2ed0f9e/job-output.txt.gz#_2018-01-08_14_11_58_988652

This is the last running test: (huge file, needs to be downloaded)
http://logs.openstack.org/84/531684/2/check/neutron-functional/2ed0f9e/logs/dsvm-functional-logs/neutron.tests.functional.agent.linux.test_ovsdb_monitor.TestSimpleInterfaceMonitor.test_get_events_includes_ofport_native_.txt.gz

Revision history for this message

Jakub Libosvar (libosvar) wrote on 2018-01-08:

The content of the file from comment #1 is just following repeating endlessly:

2018-01-08 13:14:19.695 10019 DEBUG ovsdbapp.backend.ovs_idl.transaction [-] Running txn command(idx=0): AddBridgeCommand(datapath_type=system, may_exist=True, name=test-bref309aba) do_commit /opt/stack/new/neutron/.tox/dsvm-functional/local/lib/python2.7/site-packages/ovsdbapp/backend/ovs_idl/transaction.py:84
2018-01-08 13:14:19.696 10019 DEBUG ovsdbapp.backend.ovs_idl.transaction [-] Running txn command(idx=1): DbAddCommand(record=test-bref309aba, column=protocols, table=Bridge, values=('OpenFlow10',)) do_commit /opt/stack/new/neutron/.tox/dsvm-functional/local/lib/python2.7/site-packages/ovsdbapp/backend/ovs_idl/transaction.py:84
2018-01-08 13:14:19.696 10019 DEBUG ovsdbapp.backend.ovs_idl.transaction [-] OVSDB transaction returned TRY_AGAIN, retrying do_commit /opt/stack/new/neutron/.tox/dsvm-functional/local/lib/python2.7/site-packages/ovsdbapp/backend/ovs_idl/transaction.py:93

Revision history for this message

Slawek Kaplonski (slaweq) wrote on 2018-01-08:

I don't know how it can be related but in ovsdb-server log there is a lot of connection errors, like:

e.g. on http://logs.openstack.org/70/531070/4/check/legacy-neutron-dsvm-functional/9aa7107/logs/openvswitch/ovsdb-server.txt.gz which is corresponding log from first comment.

Revision history for this message

Ihar Hrachyshka (ihar-hrachyshka) wrote on 2018-01-09:

We see similar failures in fullstack around same time: http://logs.openstack.org/98/513398/10/check/neutron-fullstack/b62a726/logs/testr_results.html.gz

And this doesn't seem to affect gate as much as it did when it was reported. Maybe it's gone and was related to unstable infra.

Changed in neutron:
importance:	Critical → High

Revision history for this message

Ihar Hrachyshka (ihar-hrachyshka) wrote on 2018-01-11:

Happened again in: http://logs.openstack.org/43/532343/1/check/neutron-functional/66eeba8/logs/dsvm-functional-logs/neutron.tests.functional.agent.test_dhcp_agent.DHCPAgentOVSTestCase.test_metadata_proxy_respawned.txt.gz

Before the failure, we see https://bugs.launchpad.net/neutron/+bug/1687074 hit the test thread, so those may be related.

Revision history for this message

Ihar Hrachyshka (ihar-hrachyshka) wrote on 2018-01-11:

Happened here too: http://logs.openstack.org/43/532343/1/check/legacy-neutron-dsvm-functional/f64455f/logs/dsvm-functional-logs/neutron.tests.functional.agent.test_dhcp_agent.DHCPAgentOVSTestCase.test_good_address_allocation.txt.gz

Revision history for this message

Terry Wilson (otherwiseguy) wrote on 2018-01-24:

https://review.openstack.org/#/c/537241/ should at least fix the runaway TRY_AGAIN issue. The situation:

If there is a database disconnection between the call to idl.run() and do_commit() calls in Connection.run(), then we call txn.commit_block() which calls txn.commit() which calls _session.send() which will return with an error and the transaction status is set to TRY_AGAIN. commit_block() will then immediately return that status and not call idl.run(). We will see the TRY_AGAIN and eventually call commit_block() again and since idl.run() has not been run, it hasn't run through the logic to reconnect to the database. After we time out in do_commit(), Connection.run() will eventually loop back around to calling idl.run() and reconnect to the database.

Revision history for this message

OpenStack Infra (hudson-openstack) wrote on 2018-01-26: Fix merged to ovsdbapp (master)

Reviewed: https://review.openstack.org/537241
Committed: https://git.openstack.org/cgit/openstack/ovsdbapp/commit/?id=1810faecc9ad2345f3e2f9185ac64194c5a0d711
Submitter: Zuul
Branch: master

commit 1810faecc9ad2345f3e2f9185ac64194c5a0d711
Author: Terry Wilson <email address hidden>
Date: Wed Jan 24 03:41:13 2018 +0000

Ensure idl.run() called on TRY_AGAIN

    If the DB connection drops after Connection.run() calls idl.run()
    but before Transaction.do_commit(), then it is possible for
    idl.commit() to return TRY_AGAIN due to the failure of session.send().
    This prevents commit_block() from running idl.run() which would
    actually trigger the DB reconnect, causing each successive call to
    commit_block() to fail in the same way until we stop trying again
    due to timeout and Connection.run() calls idl.run() again.

    For now, we can just ensure that idl.run() is called when we get a
    TRY_AGAIN. It might be worth it to refactor this a bit to not even
    use commit_block().

Change-Id: I77402bde3fc4d45c770b9f48882870cfc91d719d
Closes-Bug: #1741889

Changed in ovsdbapp:
status:	New → Fix Released

Revision history for this message

OpenStack Infra (hudson-openstack) wrote on 2018-01-29: Fix proposed to ovsdbapp (stable/queens)

Fix proposed to branch: stable/queens
Review: https://review.openstack.org/538941

Revision history for this message

OpenStack Infra (hudson-openstack) wrote on 2018-01-29: Fix proposed to ovsdbapp (stable/pike)

#10

Fix proposed to branch: stable/pike
Review: https://review.openstack.org/538942

Revision history for this message

OpenStack Infra (hudson-openstack) wrote on 2018-01-30: Fix merged to ovsdbapp (stable/pike)

#11

Reviewed: https://review.openstack.org/538942
Committed: https://git.openstack.org/cgit/openstack/ovsdbapp/commit/?id=e5ded80fe579d967642ee9173874ba924395d3af
Submitter: Zuul
Branch: stable/pike

commit e5ded80fe579d967642ee9173874ba924395d3af
Author: Terry Wilson <email address hidden>
Date: Wed Jan 24 03:41:13 2018 +0000

Ensure idl.run() called on TRY_AGAIN

    For now, we can just ensure that idl.run() is called when we get a
    TRY_AGAIN. It might be worth it to refactor this a bit to not even
    use commit_block().

    Change-Id: I77402bde3fc4d45c770b9f48882870cfc91d719d
    Closes-Bug: #1741889
    (cherry picked from commit 1810faecc9ad2345f3e2f9185ac64194c5a0d711)

tags:	added: in-stable-pike
tags:	added: in-stable-queens

Revision history for this message

OpenStack Infra (hudson-openstack) wrote on 2018-01-30: Fix merged to ovsdbapp (stable/queens)

#12

Reviewed: https://review.openstack.org/538941
Committed: https://git.openstack.org/cgit/openstack/ovsdbapp/commit/?id=c3288aec527ef21eb57d755540e34e6ac5382a7c
Submitter: Zuul
Branch: stable/queens

commit c3288aec527ef21eb57d755540e34e6ac5382a7c
Author: Terry Wilson <email address hidden>
Date: Wed Jan 24 03:41:13 2018 +0000

Ensure idl.run() called on TRY_AGAIN

    For now, we can just ensure that idl.run() is called when we get a
    TRY_AGAIN. It might be worth it to refactor this a bit to not even
    use commit_block().

    Change-Id: I77402bde3fc4d45c770b9f48882870cfc91d719d
    Closes-Bug: #1741889
    (cherry picked from commit 1810faecc9ad2345f3e2f9185ac64194c5a0d711)

Revision history for this message

OpenStack Infra (hudson-openstack) wrote on 2018-02-05: Fix included in openstack/ovsdbapp 0.9.1

#13

This issue was fixed in the openstack/ovsdbapp 0.9.1 release.

Revision history for this message

OpenStack Infra (hudson-openstack) wrote on 2018-02-06: Fix included in openstack/ovsdbapp 0.4.2

#14

This issue was fixed in the openstack/ovsdbapp 0.4.2 release.

Revision history for this message

Ihar Hrachyshka (ihar-hrachyshka) wrote on 2018-03-20:

#15

After ovsdbapp release, we don't see it happening in functional job anymore. There are some timeout hits in logstash in tempest jobs, but happening at the same time as other actions executed by the agent (e.g. update state RPC call) are timing out too, so probably host overloaded.