TimeoutException while executing tests adding bridge using OVSDB native

Bug #1627106 reported by Ihar Hrachyshka
28
This bug affects 4 people
Affects Status Importance Assigned to Milestone
neutron
Fix Released
High
Terry Wilson
ovsdbapp
Fix Released
Undecided
Unassigned

Bug Description

http://logs.openstack.org/91/366291/12/check/gate-neutron-dsvm-functional-ubuntu-trusty/a23c816/testr_results.html.gz

Traceback (most recent call last):
  File "neutron/tests/base.py", line 125, in func
    return f(self, *args, **kwargs)
  File "neutron/tests/functional/agent/ovsdb/test_impl_idl.py", line 62, in test_post_commit_vswitchd_completed_no_failures
    self._add_br_and_test()
  File "neutron/tests/functional/agent/ovsdb/test_impl_idl.py", line 56, in _add_br_and_test
    self._add_br()
  File "neutron/tests/functional/agent/ovsdb/test_impl_idl.py", line 52, in _add_br
    tr.add(ovsdb.add_br(self.brname))
  File "neutron/agent/ovsdb/api.py", line 76, in __exit__
    self.result = self.commit()
  File "neutron/agent/ovsdb/impl_idl.py", line 72, in commit
    'timeout': self.timeout})
neutron.agent.ovsdb.api.TimeoutException: Commands [AddBridgeCommand(name=test-br6925d8e2, datapath_type=None, may_exist=True)] exceeded timeout 10 seconds

I suspect this one may hit us because we finally made timeout working with Icd745514adc14730b9179fa7a6dd5c115f5e87a5.

Changed in neutron:
importance: Undecided → High
tags: added: functional-tests ovs
tags: added: gate-failure
Changed in neutron:
status: New → Confirmed
Revision history for this message
Armando Migliaccio (armando-migliaccio) wrote :

Let'see how burning hot this gets.

Revision history for this message
Jakub Libosvar (libosvar) wrote :

Weird, from the logs it doesn't even look like it took 10 seconds:

2016-09-23 13:56:30.642 27268 DEBUG neutron.agent.ovsdb.impl_idl [-] Running txn command(idx=0): AddBridgeCommand(name=test-br6925d8e2, datapath_type=None, may_exist=True) do_commit neutron/agent/ovsdb/impl_idl.py:98
2016-09-23 13:56:30.643 27268 DEBUG neutron.agent.ovsdb.impl_idl [-] OVSDB transaction returned TRY_AGAIN, retrying do_commit neutron/agent/ovsdb/impl_idl.py:109
2016-09-23 13:56:31.708 27268 DEBUG neutron.agent.ovsdb.impl_idl [-] Running txn command(idx=0): AddBridgeCommand(name=test-br6925d8e2, datapath_type=None, may_exist=True) do_commit neutron/agent/ovsdb/impl_idl.py:98
2016-09-23 13:56:31.754 27268 DEBUG neutron.agent.ovsdb.impl_idl [-] Running txn command(idx=0): DelBridgeCommand(name=test-br6925d8e2, if_exists=True) do_commit neutron/agent/ovsdb/impl_idl.py:98

I wonder whether it's possible we got stuck at do_post_commit() introduced recently.

Revision history for this message
Ihar Hrachyshka (ihar-hrachyshka) wrote :
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to neutron (master)

Related fix proposed to branch: master
Review: https://review.openstack.org/381465

Revision history for this message
Ihar Hrachyshka (ihar-hrachyshka) wrote : Re: TimeoutException while executing test_post_commit_vswitchd_completed_no_failures

Hit same in http://logs.openstack.org/06/363206/25/check/gate-neutron-dsvm-functional-ubuntu-trusty/b339cc6/testr_results.html.gz

Traceback (most recent call last):
  File "neutron/tests/functional/agent/test_firewall.py", line 99, in setUp
    self.tester, self.firewall = getattr(self, self.initialize)()
  File "neutron/tests/functional/agent/test_firewall.py", line 123, in initialize_ovs
    if not checks.ovs_conntrack_supported():
  File "neutron/cmd/sanity/checks.py", line 381, in ovs_conntrack_supported
    with ovs_lib.OVSBridge(br_name) as br:
  File "neutron/agent/common/ovs_lib.py", line 584, in __enter__
    self.create()
  File "neutron/agent/common/ovs_lib.py", line 211, in create
    FAILMODE_SECURE))
  File "neutron/agent/ovsdb/api.py", line 76, in __exit__
    self.result = self.commit()
  File "neutron/agent/ovsdb/impl_idl.py", line 72, in commit
    'timeout': self.timeout})
neutron.agent.ovsdb.api.TimeoutException: Commands [AddBridgeCommand(name=ovs-test-ea0186, may_exist=True, datapath_type=system)] exceeded timeout 10 seconds

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on neutron (master)

Change abandoned by Armando Migliaccio (<email address hidden>) on branch: master
Review: https://review.openstack.org/381465
Reason: This review is > 4 weeks without comment, and failed Jenkins the last time it was checked. We are abandoning this for now. Feel free to reactivate the review by pressing the restore button and leaving a 'recheck' comment to get fresh test results.

Revision history for this message
Ihar Hrachyshka (ihar-hrachyshka) wrote : Re: TimeoutException while executing test_post_commit_vswitchd_completed_no_failures
Revision history for this message
Miguel Angel Ajo (mangelajo) wrote :

I'm pinging Terry Wilson, shall we try to increase ovsdb timeout? May be this happens when nodes are overloaded somehow?

Revision history for this message
Miguel Angel Ajo (mangelajo) wrote :

I see that with current code, commit would also emit a TimeoutException if a Queue.Empty result is returned:

     def commit(self):
         self.ovsdb_connection.queue_txn(self)
         try:
             result = self.results.get(timeout=self.timeout)
         except Queue.Empty:
             raise api.TimeoutException(
                 _("Commands %(commands)s exceeded timeout %(timeout)d "
                   "seconds") % {'commands': self.commands,
                                 'timeout': self.timeout})
         if isinstance(result, idlutils.ExceptionResult):
             if self.log_errors:
                 LOG.error(result.tb)
             if self.check_error:
                 raise result.ex
         return result

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to neutron (master)

Fix proposed to branch: master
Review: https://review.openstack.org/425623

Changed in neutron:
assignee: nobody → Miguel Angel Ajo (mangelajo)
status: Confirmed → In Progress
Revision history for this message
Miguel Angel Ajo (mangelajo) wrote : Re: TimeoutException while executing test_post_commit_vswitchd_completed_no_failures
Revision history for this message
Miguel Angel Ajo (mangelajo) wrote :

There goes my naive patch.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to neutron (master)

Related fix proposed to branch: master
Review: https://review.openstack.org/426032

Changed in neutron:
milestone: none → ocata-rc1
Revision history for this message
IWAMOTO Toshihiro (iwamoto) wrote : Re: TimeoutException while executing test_post_commit_vswitchd_completed_no_failures

This comment is about "no response to inactivity probe after 5 seconds, disconnecting" messages from ovsdb-server and ovs-vswitchd. I believe this is from the same cause as this bug report, but it can turn out to be otherwise.

1. Query the logstash with message:"no response to inactivity probe". Use tcpdump or whatsoever to get the json.

import simplejson
j=simplejson.load(open('/tmp/logstash.json'))
import collections
for k,v in collections.Counter([x['_source']['log_url'] for x in j['hits']['hits']]).most_common():
    print('%d\t%s' % (v,k))

gives something like:

58 http://logs.openstack.org/24/425924/2/gate/gate-tempest-dsvm-neutron-full-ubuntu-xenial/e2834c2/logs/syslog.txt
49 http://logs.openstack.org/13/426813/2/check/gate-cinder-dsvm-apache-ubuntu-xenial-nv/04f9e80/logs/syslog.txt
19 http://logs.openstack.org/38/426938/2/check/gate-trove-scenario-dsvm-mariadb-single-ubuntu-xenial-nv/8bc7cbb/logs/syslog.txt
15 http://logs.openstack.org/96/426896/2/check/gate-manila-tempest-dsvm-scenario-ubuntu-xenial-nv/4a29497/logs/syslog.txt
14 http://logs.openstack.org/59/418559/12/check/gate-manila-tempest-dsvm-postgres-zfsonlinux-ubuntu-xenial-nv/c9137c2/logs/syslog.txt
9 http://logs.openstack.org/55/421155/4/check/gate-neutron-dsvm-fullstack-ubuntu-xenial/632669b/logs/syslog.txt

Note the issue is not limited to neutron or tempest. functional and fullstack are also on the list, but not useful for this analysis as they don't leave dstat logs.

2. Get syslog.txt and dstat-csv_log.txt.
3. Compare them. for example:

$ grep 'no response to inacti' 426842/syslog.txt.gz |colrm 80
$ python findjump.py < 426842/dstat-csv_log.txt.gz |colrm 80
$ cat findjump.py
import calendar
import time
import sys

yr = time.gmtime().tm_year
told = None
thres = 3 # there is some room for tuning :)
for l in sys.stdin:
    l = l[:-1]
    ll = l.split(',')
    try:
        tm = time.strptime(ll[0], "%d-%m %H:%M:%S")
        tnew = calendar.timegm(time.struct_time((yr,) + tm[1:]))
        if told and tnew - told > thres:
            print(tnew - told)
            print(l)
        told = tnew
    except ValueError:
        pass

Their timestamps are not identical, but highly correlated imo.

Revision history for this message
IWAMOTO Toshihiro (iwamoto) wrote :

Forgot to note that dstat is expected to give data every second. The above python script checks that.
If even dstat cannot run every second, ovs-agent etc won't be able to response in a timely manner.

One example of the above:

http://logs.openstack.org/24/425924/2/gate/gate-tempest-dsvm-neutron-full-ubuntu-xenial/e2834c2/logs/syslog.txt#_Jan_30_14_58_42
http://logs.openstack.org/24/425924/2/gate/gate-tempest-dsvm-neutron-full-ubuntu-xenial/e2834c2/logs/screen-dstat.txt.gz (search for timestamp 14:58:42)

dstat output is missing from 14:58:37 to 14:58:40.

Revision history for this message
Jakub Libosvar (libosvar) wrote :

IIUC if the inactivity probe disconnect, the next attempt of communication should reconnect, no? I also tried to disable inactivity probe - https://review.openstack.org/#/c/315042/2 - do you think it would help? What if we would increase the probe timeout?

Revision history for this message
Kevin Benton (kevinbenton) wrote :

Increasing the probe timeout may be a viable option if we are just running into system slow-downs.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to neutron (master)

Reviewed: https://review.openstack.org/426032
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=6ce47f1d62519a7de3abaef64383c3b927d7fe0d
Submitter: Jenkins
Branch: master

commit 6ce47f1d62519a7de3abaef64383c3b927d7fe0d
Author: Kevin Benton <email address hidden>
Date: Thu Jan 26 18:15:55 2017 -0800

    Stop passing datapath as 'may_exist' argument

    While debugging the related bug, this showed up in the logs:
    AddBridgeCommand(datapath_type=None, may_exist=system, name=test-brd8f3648c)

    This was caused by the add_bridge command not passing in the datapath
    type as named argument so it was taking the positional argument of
    may_exist. So the datapath type was being lost whenever this command
    was being used.

    This fixes it by makeing add_bridge use a named argument for
    datapath_type and it also switches it to using the
    OVSBridge.create() method to unify the code-paths for bridge
    creation.

    It's not clear that this was a cause of the bug though, so
    this fix is only related.

    Change-Id: I5711c29c21741dd847f2c27fb621f723b24cb7fd
    Related-Bug: #1627106

tags: added: auto-allocated-topology
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on neutron (master)

Change abandoned by Miguel Angel Ajo (<email address hidden>) on branch: master
Review: https://review.openstack.org/425623
Reason: Terry Wilson is investigating it more deeply, he found that even rising the timeout to 1000 seconds would make it fail, so IWAMOTO analysis seems more accurate. This would be of no help.

tags: added: neutron-proactive-backport-potential
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to neutron (master)

Related fix proposed to branch: master
Review: https://review.openstack.org/429095

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to neutron (stable/ocata)

Related fix proposed to branch: stable/ocata
Review: https://review.openstack.org/429951

Revision history for this message
Ihar Hrachyshka (ihar-hrachyshka) wrote :

This is probably the most important failure we see right now for functional tests.

summary: - TimeoutException while executing
- test_post_commit_vswitchd_completed_no_failures
+ TimeoutException while executing tests adding bridge using OVSDB native
Changed in neutron:
importance: High → Critical
Changed in neutron:
milestone: ocata-rc1 → ocata-rc2
tags: added: ocata-rc-potential
Revision history for this message
IWAMOTO Toshihiro (iwamoto) wrote :

The commit ad99fb0dd added dstat logs for functional and fullstack jobs.
I picked up a few of them with inactivity probe failures.
In those logs, it seems there are plenty of free mem for those jobs and no dstat timestamp jumps.

So, there are some other reason than system load that cause ovsdb/openflow connection problem.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to neutron (master)

Reviewed: https://review.openstack.org/429095
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=26766963c6e6b2105718e21cd14dfc6df201154c
Submitter: Jenkins
Branch: master

commit 26766963c6e6b2105718e21cd14dfc6df201154c
Author: Terry Wilson <email address hidden>
Date: Fri Feb 3 16:39:24 2017 -0600

    Clean up ovsdb-native's use of verify()

    When updating an ovsdb set-type column, the existing code does
    the following:

    1. Read the existing column value
    2. Call verify() to cause a write failure if something else
       modifies the column before we commit
    3. Append the value to our local copy of the column
    4. Assign the local value to the object.column to trigger
       __setattr__ to write the value to the database

    If verify() fails, which it *very* often does in a test
    environment or a busy system, ovsdb-server will respond with a
    TRY_AGAIN response which will cause the whole process to start
    over.

    In the 2.6 cycle, Row.addvalue()/delvalue() methods were added
    which use OVDB's native 'mutate' methods to handle adding/deleting
    from a set/map-type column. Using these means calling verify() is
    no longer required and things will proceed *much* more efficiently.

    Bug #1627106 where we get frequent TimeoutExceptions appears to be
    related. Eliminating the frequent TRY_AGAIN responses, in my local
    testing, also eliminates the TimeoutExceptions. This doesn't mean
    that the underlying issue is resolved, but it may be avoided.
    Related-Bug: #1627106

    Change-Id: I26c7731f5dbd3bd2955dbfa18a7c41517da63e6e

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to neutron (stable/ocata)

Reviewed: https://review.openstack.org/429951
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=339855ff09916ed3573786395eace0e06011677d
Submitter: Jenkins
Branch: stable/ocata

commit 339855ff09916ed3573786395eace0e06011677d
Author: Terry Wilson <email address hidden>
Date: Fri Feb 3 16:39:24 2017 -0600

    Clean up ovsdb-native's use of verify()

    When updating an ovsdb set-type column, the existing code does
    the following:

    1. Read the existing column value
    2. Call verify() to cause a write failure if something else
       modifies the column before we commit
    3. Append the value to our local copy of the column
    4. Assign the local value to the object.column to trigger
       __setattr__ to write the value to the database

    If verify() fails, which it *very* often does in a test
    environment or a busy system, ovsdb-server will respond with a
    TRY_AGAIN response which will cause the whole process to start
    over.

    In the 2.6 cycle, Row.addvalue()/delvalue() methods were added
    which use OVDB's native 'mutate' methods to handle adding/deleting
    from a set/map-type column. Using these means calling verify() is
    no longer required and things will proceed *much* more efficiently.

    Bug #1627106 where we get frequent TimeoutExceptions appears to be
    related. Eliminating the frequent TRY_AGAIN responses, in my local
    testing, also eliminates the TimeoutExceptions. This doesn't mean
    that the underlying issue is resolved, but it may be avoided.
    Related-Bug: #1627106

    (cherry picked from commit 26766963c6e6b2105718e21cd14dfc6df201154c)
    Change-Id: I26c7731f5dbd3bd2955dbfa18a7c41517da63e6e

tags: added: in-stable-ocata
Changed in neutron:
status: In Progress → Fix Released
Revision history for this message
IWAMOTO Toshihiro (iwamoto) wrote :
Revision history for this message
IWAMOTO Toshihiro (iwamoto) wrote :

This bug isn't fixed. Please update the bug status.

Revision history for this message
Gary Kotton (garyk) wrote :

Moving to triaged as the issue has not been addressed.

Changed in neutron:
status: Fix Released → Triaged
Akihiro Motoki (amotoki)
tags: added: ocata-backport-potential
removed: ocata-rc-potential
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on neutron (master)

Change abandoned by Miguel Angel Ajo (<email address hidden>) on branch: master
Review: https://review.openstack.org/425623

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to neutron (master)

Reviewed: https://review.openstack.org/441258
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=1698bee770b84a2663ba940a6ded5d4b9733101a
Submitter: Jenkins
Branch: master

commit 1698bee770b84a2663ba940a6ded5d4b9733101a
Author: Terry Wilson <email address hidden>
Date: Fri Mar 3 11:46:16 2017 -0600

    Set OVS inactivity_probe to vsctl_timeout when adding manager

    If the vsctl_timeout > OVS's inactivity probe interval and a
    transaction execution time exceeds the probe interval, OVS will
    disconnect and the transaction will return TRY_AGAIN and most
    likely repeat failing until the vsctl_timeout is reached. This
    change ensures that the "failsafe" creation of the manager also
    sets the inactivity probe to the vsctl_timeout value.

    Currently the patch doesn't override the probe_interval on an
    existing Manager since it is possible that connection is used by
    outside applications and it theoretically should be handled at
    deployment.

    Related-Bug: #1627106
    Change-Id: I76fa0a0cf04a166edf062086fceb2fd90960ad6b

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to neutron (stable/ocata)

Related fix proposed to branch: stable/ocata
Review: https://review.openstack.org/443965

Changed in neutron:
assignee: Miguel Angel Ajo (mangelajo) → Terry Wilson (otherwiseguy)
milestone: ocata-rc2 → pike-1
status: Triaged → In Progress
tags: removed: auto-allocated-topology
Revision history for this message
Ihar Hrachyshka (ihar-hrachyshka) wrote :

I suspect it does not affect us anymore, or not as much as before. Reducing importance and will monitor before closing.

Changed in neutron:
importance: Critical → High
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to neutron (stable/ocata)

Reviewed: https://review.openstack.org/443965
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=72def16669682f34731d85e9fc5bbece8f55037c
Submitter: Jenkins
Branch: stable/ocata

commit 72def16669682f34731d85e9fc5bbece8f55037c
Author: Terry Wilson <email address hidden>
Date: Fri Mar 3 11:46:16 2017 -0600

    Set OVS inactivity_probe to vsctl_timeout when adding manager

    If the vsctl_timeout > OVS's inactivity probe interval and a
    transaction execution time exceeds the probe interval, OVS will
    disconnect and the transaction will return TRY_AGAIN and most
    likely repeat failing until the vsctl_timeout is reached. This
    change ensures that the "failsafe" creation of the manager also
    sets the inactivity probe to the vsctl_timeout value.

    Currently the patch doesn't override the probe_interval on an
    existing Manager since it is possible that connection is used by
    outside applications and it theoretically should be handled at
    deployment.

    Related-Bug: #1627106
    Change-Id: I76fa0a0cf04a166edf062086fceb2fd90960ad6b
    (cherry picked from commit 1698bee770b84a2663ba940a6ded5d4b9733101a)

Revision history for this message
Ihar Hrachyshka (ihar-hrachyshka) wrote :

0 hits in 10 days, closing the bug.

Changed in neutron:
status: In Progress → Fix Released
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to neutron (stable/newton)

Related fix proposed to branch: stable/newton
Review: https://review.openstack.org/454611

Revision history for this message
OpenStack Infra (hudson-openstack) wrote :

Related fix proposed to branch: stable/newton
Review: https://review.openstack.org/454612

tags: removed: neutron-proactive-backport-potential ocata-backport-potential
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on neutron (master)

Change abandoned by Kevin Benton (<email address hidden>) on branch: master
Review: https://review.openstack.org/425623
Reason: Restore if we start getting OVS Timeout failures again

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to neutron (stable/newton)

Reviewed: https://review.openstack.org/454611
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=4c9e3edc758699674deed83893cf314481dc6ee1
Submitter: Jenkins
Branch: stable/newton

commit 4c9e3edc758699674deed83893cf314481dc6ee1
Author: Terry Wilson <email address hidden>
Date: Fri Feb 3 16:39:24 2017 -0600

    Clean up ovsdb-native's use of verify()

    When updating an ovsdb set-type column, the existing code does
    the following:

    1. Read the existing column value
    2. Call verify() to cause a write failure if something else
       modifies the column before we commit
    3. Append the value to our local copy of the column
    4. Assign the local value to the object.column to trigger
       __setattr__ to write the value to the database

    If verify() fails, which it *very* often does in a test
    environment or a busy system, ovsdb-server will respond with a
    TRY_AGAIN response which will cause the whole process to start
    over.

    In the 2.6 cycle, Row.addvalue()/delvalue() methods were added
    which use OVDB's native 'mutate' methods to handle adding/deleting
    from a set/map-type column. Using these means calling verify() is
    no longer required and things will proceed *much* more efficiently.

    Bug #1627106 where we get frequent TimeoutExceptions appears to be
    related. Eliminating the frequent TRY_AGAIN responses, in my local
    testing, also eliminates the TimeoutExceptions. This doesn't mean
    that the underlying issue is resolved, but it may be avoided.
    Related-Bug: #1627106

    Conflicts:
     neutron/agent/ovsdb/native/commands.py
     neutron/tests/functional/agent/test_ovs_lib.py

    Change-Id: I26c7731f5dbd3bd2955dbfa18a7c41517da63e6e
    (cherry picked from commit 26766963c6e6b2105718e21cd14dfc6df201154c)
    (cherry picked from commit 339855ff09916ed3573786395eace0e06011677d)

tags: added: in-stable-newton
Revision history for this message
OpenStack Infra (hudson-openstack) wrote :

Reviewed: https://review.openstack.org/454612
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=634d7b02e25a95f479f2bc42b0a01f1c503bfba4
Submitter: Jenkins
Branch: stable/newton

commit 634d7b02e25a95f479f2bc42b0a01f1c503bfba4
Author: Terry Wilson <email address hidden>
Date: Fri Mar 3 11:46:16 2017 -0600

    Set OVS inactivity_probe to vsctl_timeout when adding manager

    If the vsctl_timeout > OVS's inactivity probe interval and a
    transaction execution time exceeds the probe interval, OVS will
    disconnect and the transaction will return TRY_AGAIN and most
    likely repeat failing until the vsctl_timeout is reached. This
    change ensures that the "failsafe" creation of the manager also
    sets the inactivity probe to the vsctl_timeout value.

    Currently the patch doesn't override the probe_interval on an
    existing Manager since it is possible that connection is used by
    outside applications and it theoretically should be handled at
    deployment.

    Related-Bug: #1627106

    Conflicts:
     neutron/agent/ovsdb/native/connection.py
     neutron/agent/ovsdb/native/idlutils.py

    Change-Id: I76fa0a0cf04a166edf062086fceb2fd90960ad6b
    (cherry picked from commit 1698bee770b84a2663ba940a6ded5d4b9733101a)
    (cherry picked from commit 72def16669682f34731d85e9fc5bbece8f55037c)

Revision history for this message
Ihar Hrachyshka (ihar-hrachyshka) wrote :

This still happens. At least once in Ocata functional job: http://logs.openstack.org/67/460867/1/check/gate-neutron-dsvm-functional-ubuntu-xenial/382d800/console.html

Also logstash shows 45 hits overall for "message:"exceeded timeout 10 seconds"", almost all of them in fullstack now.

Changed in neutron:
status: Fix Released → Confirmed
milestone: pike-1 → pike-2
tags: added: fullstack
Revision history for this message
Ihar Hrachyshka (ihar-hrachyshka) wrote :

Added ovsdbapp to the list of affected projects because I believe this error comes from the library.

Changed in neutron:
milestone: pike-2 → none
Changed in ovsdbapp:
status: New → Confirmed
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to ovsdbapp (stable/pike)

Fix proposed to branch: stable/pike
Review: https://review.openstack.org/520180

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to ovsdbapp (master)

Reviewed: https://review.openstack.org/520176
Committed: https://git.openstack.org/cgit/openstack/ovsdbapp/commit/?id=158ae06bce0f56e93677f94c59f81e5e76ee1ccc
Submitter: Zuul
Branch: master

commit 158ae06bce0f56e93677f94c59f81e5e76ee1ccc
Author: Terry Wilson <email address hidden>
Date: Wed Nov 15 13:05:18 2017 -0600

    Don't wait on TRY_AGAIN when calling commit_block()

    commit_block() calls idl.run() already, so the seqno has already
    changed, and waiting again could potentially wait until a timeout
    occurs. From the ovs.db.idl docs:

     If the final status is Transaction.TRY_AGAIN, wait for
     Idl.change_seqno to change from the saved 'seqno' (it's possible
     that it's already changed, in which case the client should not
     wait at all), then start over from step 1. Only a call to
     Idl.run() will change the return value of Idl.change_seqno.
     (Transaction.commit_block() calls Idl.run().)

    Closes-Bug: #1627106
    Change-Id: I1e5f4627f0488ba15ba2fb3509504bcf1b787b07

Changed in ovsdbapp:
status: Confirmed → Fix Released
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to neutron (stable/ocata)

Fix proposed to branch: stable/ocata
Review: https://review.openstack.org/520202

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to ovsdbapp (stable/pike)

Reviewed: https://review.openstack.org/520180
Committed: https://git.openstack.org/cgit/openstack/ovsdbapp/commit/?id=742754bce3c9453f8c7186455a92e4f6d6b18ace
Submitter: Zuul
Branch: stable/pike

commit 742754bce3c9453f8c7186455a92e4f6d6b18ace
Author: Terry Wilson <email address hidden>
Date: Wed Nov 15 13:05:18 2017 -0600

    Don't wait on TRY_AGAIN when calling commit_block()

    commit_block() calls idl.run() already, so the seqno has already
    changed, and waiting again could potentially wait until a timeout
    occurs. From the ovs.db.idl docs:

     If the final status is Transaction.TRY_AGAIN, wait for
     Idl.change_seqno to change from the saved 'seqno' (it's possible
     that it's already changed, in which case the client should not
     wait at all), then start over from step 1. Only a call to
     Idl.run() will change the return value of Idl.change_seqno.
     (Transaction.commit_block() calls Idl.run().)

    Closes-Bug: #1627106
    Change-Id: I1e5f4627f0488ba15ba2fb3509504bcf1b787b07
    (cherry picked from commit 158ae06bce0f56e93677f94c59f81e5e76ee1ccc)

tags: added: in-stable-pike
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to neutron (stable/ocata)

Reviewed: https://review.openstack.org/520202
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=319ac92d9b47442368452b910f0985a2006eaa7a
Submitter: Zuul
Branch: stable/ocata

commit 319ac92d9b47442368452b910f0985a2006eaa7a
Author: Terry Wilson <email address hidden>
Date: Wed Nov 15 14:58:18 2017 -0600

    Don't wait on TRY_AGAIN when calling commit_block()

    commit_block() calls idl.run() already, so the seqno has already
    changed, and waiting again could potentially wait until a timeout
    occurs. From the ovs.db.idl docs:

     If the final status is Transaction.TRY_AGAIN, wait for
     Idl.change_seqno to change from the saved 'seqno' (it's possible
     that it's already changed, in which case the client should not
     wait at all), then start over from step 1. Only a call to
     Idl.run() will change the return value of Idl.change_seqno.
     (Transaction.commit_block() calls Idl.run().)

    This is backported from I1e5f4627f0488ba15ba2fb3509504bcf1b787b07
    in the ovsdbapp project.

    Closes-Bug: #1627106
    Change-Id: I1e5f4627f0488ba15ba2fb3509504bcf1b787b07

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/ovsdbapp 0.8.0

This issue was fixed in the openstack/ovsdbapp 0.8.0 release.

Revision history for this message
Jakub Libosvar (libosvar) wrote :

Fixes backported to all stable branches.

Changed in neutron:
status: Confirmed → Fix Released
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/ovsdbapp 0.4.1

This issue was fixed in the openstack/ovsdbapp 0.4.1 release.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/neutron 10.0.5

This issue was fixed in the openstack/neutron 10.0.5 release.

To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Duplicates of this bug

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.