neutron-functional-with-uwsgi fails with "KeyError: UUID" during create_default_drop_port_group

Bug #1938766 reported by Slawek Kaplonski
This bug affects 3 people
Affects: neutron
Status: Fix Released
Importance: Critical
Assigned to: Jakub Libosvar

Bug Description

The key failure is the KeyError raised in create_default_drop_port_group().
The "No such file or directory" error and the error from self.mech_driver.nb_ovn.ovsdb_connection.stop() are part of the cleanup process.

--

Recently we have been seeing functional test failures pretty often, with errors like:

ft1.13: neutron.tests.functional.plugins.ml2.drivers.ovn.mech_driver.ovsdb.test_maintenance.TestMaintenance.test_check_for_port_security_unknown_address
testtools.testresult.real._StringException: traceback-1: {{{
Traceback (most recent call last):
  File "/home/zuul/src/opendev.org/openstack/neutron/neutron/tests/functional/base.py", line 363, in stop
    self.mech_driver.nb_ovn.ovsdb_connection.stop()
  File "/home/zuul/src/opendev.org/openstack/neutron/neutron/plugins/ml2/drivers/ovn/mech_driver/mech_driver.py", line 153, in nb_ovn
    self._post_fork_event.wait()
  File "/usr/lib/python3.8/threading.py", line 558, in wait
    signaled = self._cond.wait(timeout)
  File "/usr/lib/python3.8/threading.py", line 302, in wait
    waiter.acquire()
  File "/home/zuul/src/opendev.org/openstack/neutron/.tox/dsvm-functional/lib/python3.8/site-packages/eventlet/semaphore.py", line 120, in acquire
    hubs.get_hub().switch()
  File "/home/zuul/src/opendev.org/openstack/neutron/.tox/dsvm-functional/lib/python3.8/site-packages/eventlet/hubs/hub.py", line 313, in switch
    return self.greenlet.switch()
  File "/home/zuul/src/opendev.org/openstack/neutron/.tox/dsvm-functional/lib/python3.8/site-packages/eventlet/hubs/hub.py", line 365, in run
    self.wait(sleep_time)
  File "/home/zuul/src/opendev.org/openstack/neutron/.tox/dsvm-functional/lib/python3.8/site-packages/eventlet/hubs/poll.py", line 80, in wait
    presult = self.do_poll(seconds)
  File "/home/zuul/src/opendev.org/openstack/neutron/.tox/dsvm-functional/lib/python3.8/site-packages/eventlet/hubs/epolls.py", line 31, in do_poll
    return self.poll.poll(seconds)
  File "/home/zuul/src/opendev.org/openstack/neutron/.tox/dsvm-functional/lib/python3.8/site-packages/fixtures/_fixtures/timeout.py", line 52, in signal_handler
    raise TimeoutException()
fixtures._fixtures.timeout.TimeoutException
}}}

traceback-2: {{{
Traceback (most recent call last):
  File "/home/zuul/src/opendev.org/openstack/neutron/neutron/tests/functional/base.py", line 351, in _collect_processes_logs
    self._copy_log_file("%s.log" % northd_log, dst_northd)
  File "/home/zuul/src/opendev.org/openstack/neutron/neutron/tests/functional/base.py", line 356, in _copy_log_file
    shutil.copyfile(
  File "/usr/lib/python3.8/shutil.py", line 264, in copyfile
    with open(src, 'rb') as fsrc, open(dst, 'wb') as fdst:
FileNotFoundError: [Errno 2] No such file or directory: '/tmp/tmps9cyr99c/ovn_northd.log'
}}}

Traceback (most recent call last):
  File "/home/zuul/src/opendev.org/openstack/neutron/neutron/tests/functional/plugins/ml2/drivers/ovn/mech_driver/ovsdb/test_maintenance.py", line 41, in setUp
    super(_TestMaintenanceHelper, self).setUp()
  File "/home/zuul/src/opendev.org/openstack/neutron/neutron/tests/functional/base.py", line 217, in setUp
    self._start_idls()
  File "/home/zuul/src/opendev.org/openstack/neutron/neutron/tests/functional/base.py", line 320, in _start_idls
    self.mech_driver.pre_fork_initialize(
  File "/home/zuul/src/opendev.org/openstack/neutron/neutron/plugins/ml2/drivers/ovn/mech_driver/mech_driver.py", line 254, in pre_fork_initialize
    self._create_neutron_pg_drop()
  File "/home/zuul/src/opendev.org/openstack/neutron/neutron/plugins/ml2/drivers/ovn/mech_driver/mech_driver.py", line 272, in _create_neutron_pg_drop
    create_default_drop_port_group(pre_ovn_nb_api)
  File "/home/zuul/src/opendev.org/openstack/neutron/neutron/plugins/ml2/drivers/ovn/mech_driver/mech_driver.py", line 1241, in create_default_drop_port_group
    txn.add(nb_idl.pg_add_ports(pg_name, list(ports_with_pg)))
  File "/usr/lib/python3.8/contextlib.py", line 120, in __exit__
    next(self.gen)
  File "/home/zuul/src/opendev.org/openstack/neutron/neutron/plugins/ml2/drivers/ovn/mech_driver/ovsdb/impl_idl_ovn.py", line 274, in transaction
    yield t
  File "/usr/lib/python3.8/contextlib.py", line 120, in __exit__
    next(self.gen)
  File "/home/zuul/src/opendev.org/openstack/neutron/.tox/dsvm-functional/lib/python3.8/site-packages/ovsdbapp/api.py", line 110, in transaction
    del self._nested_txns_map[cur_thread_id]
  File "/home/zuul/src/opendev.org/openstack/neutron/.tox/dsvm-functional/lib/python3.8/site-packages/ovsdbapp/api.py", line 61, in __exit__
    self.result = self.commit()
  File "/home/zuul/src/opendev.org/openstack/neutron/.tox/dsvm-functional/lib/python3.8/site-packages/ovsdbapp/backend/ovs_idl/transaction.py", line 65, in commit
    raise result.ex
  File "/home/zuul/src/opendev.org/openstack/neutron/.tox/dsvm-functional/lib/python3.8/site-packages/ovsdbapp/backend/ovs_idl/connection.py", line 131, in run
    txn.results.put(txn.do_commit())
  File "/home/zuul/src/opendev.org/openstack/neutron/.tox/dsvm-functional/lib/python3.8/site-packages/ovsdbapp/backend/ovs_idl/transaction.py", line 136, in do_commit
    self.post_commit(txn)
  File "/home/zuul/src/opendev.org/openstack/neutron/.tox/dsvm-functional/lib/python3.8/site-packages/ovsdbapp/backend/ovs_idl/transaction.py", line 73, in post_commit
    command.post_commit(txn)
  File "/home/zuul/src/opendev.org/openstack/neutron/.tox/dsvm-functional/lib/python3.8/site-packages/ovsdbapp/backend/ovs_idl/command.py", line 94, in post_commit
    row = self.api.tables[self.table_name].rows[real_uuid]
  File "/usr/lib/python3.8/collections/__init__.py", line 1010, in __getitem__
    raise KeyError(key)
KeyError: UUID('c9570a47-64da-4360-9c94-12ba459344c2')

Examples of such failures:

https://bd41d5ab9bd61c5e04d6-86672578d4e6ceb498f2d932b0da6815.ssl.cf5.rackcdn.com/798961/7/check/neutron-functional-with-uwsgi/772270c/testr_results.html

https://storage.gra.cloud.ovh.net/v1/AUTH_dcaab5e32b234d56b626f72581e3644c/zuul_opendev_logs_856/801310/2/check/neutron-functional-with-uwsgi/856f539/testr_results.html

https://7ee9afbce159ed2d3c7a-ae7c228c39808cabcdb014f8881dacdb.ssl.cf5.rackcdn.com/800278/7/check/neutron-functional-with-uwsgi/6332cdf/testr_results.html

https://fc20912585430bbde389-7e8720184c7a4cb82719c44bcc579362.ssl.cf5.rackcdn.com/802687/1/check/neutron-functional-with-uwsgi/d5d2430/testr_results.html

In all cases the stack trace is pretty much the same, but the failing tests are different.

Revision history for this message
Jakub Libosvar (libosvar) wrote :
Revision history for this message
Akihiro Motoki (amotoki) wrote :

Looking at the log linked above and at recent neutron-functional-with-uwsgi failures in neutron-lib changes, I think the failures happen in create_default_drop_port_group(), which is called in setUp().

The "No such file or directory" error happened in _collect_processes_logs, which is triggered as part of the cleanup process [1]. self.mech_driver.nb_ovn.ovsdb_connection.stop() hit a Timeout exception, but that is also part of the cleanup process [2].

create_default_drop_port_group() fails with the KeyError in all of these failures, so it looks like the root failure and we need to fix it.

[1] https://opendev.org/openstack/neutron/src/commit/dea5a9d3e302774932aa7122a7a9085b7a2cfa7b/neutron/tests/functional/base.py#L302
[2] https://opendev.org/openstack/neutron/src/commit/dea5a9d3e302774932aa7122a7a9085b7a2cfa7b/neutron/tests/functional/base.py#L316

summary: - Functional tests related to ovn failing with No such file or directory:
- '/tmp/tmps9cyr99c/ovn_northd.log'
+ neutron-functional-with-uwsgi fails with "KeyError: UUID" during
+ create_default_drop_port_group
Akihiro Motoki (amotoki)
description: updated
Revision history for this message
Slawek Kaplonski (slaweq) wrote :
Changed in neutron:
assignee: nobody → Jakub Libosvar (libosvar)
Revision history for this message
Terry Wilson (otherwiseguy) wrote :

I think I've figured out what's causing the issue.

Ultimately, it is a race condition in the order of messages from the ovsdb-server. There is the Idl's monitor request, through which it gets a stream of updates whenever the database changes, and there are also the replies to individual transactions.

When doing an OVSDB insert, you assign a temporary UUID to the object you are creating and then after the server inserts it, its reply will include the permanent UUID that it chose for the object.

ovsdbapp's AddCommand, which PgAddCommand subclasses, calls txn.get_insert_uuid(self.result) in post_commit(), which gives us the real UUID from the server. This will always be available because it comes from the reply to the transaction that just committed. The problem is that we then try to look up the row in memory with this UUID, but if we haven't already gotten the *monitor request notification* for that row, it will not be found and we will get the KeyError.

It doesn't look easy to solve this in python-ovs, because we don't have access to the full Row object when processing the insert operation's reply; the reply just contains the UUID. I believe I can work around this in ovsdbapp without changing the API by storing the temporary Row object when saving the result and then adding the ability to override the uuid presented in the RowView result.
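
To make the race easier to see, here is a minimal, self-contained sketch of the sequence. The names handle_monitor_notification and post_commit_lookup are hypothetical stand-ins for illustration, not ovsdbapp or python-ovs APIs:

import uuid

# The in-memory table kept by the Idl; it is only populated when a monitor
# notification for the row arrives from ovsdb-server.
rows = {}

def handle_monitor_notification(real_uuid, row_data):
    """Arrives asynchronously, possibly *after* the transaction reply."""
    rows[real_uuid] = row_data

def post_commit_lookup(real_uuid):
    """Mimics what AddCommand.post_commit() does: look up the new row by the
    real UUID taken from the transaction reply."""
    return rows[real_uuid]  # KeyError if the monitor update has not arrived yet

server_uuid = uuid.uuid4()  # the permanent UUID the server chose for the row

try:
    post_commit_lookup(server_uuid)   # transaction reply processed first: race lost
except KeyError as exc:
    print("KeyError:", exc)           # the failure seen in this bug

handle_monitor_notification(server_uuid, {"name": "neutron_pg_drop"})
print(post_commit_lookup(server_uuid))  # once the notification lands, the lookup works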

Revision history for this message
Terry Wilson (otherwiseguy) wrote :

Also, just to leave myself a note for writing the patch in a few hours: it looks like one reason the issue happens so frequently with the short-lived connection used for the neutron_pg_drop creation is that we end up sending the transaction *before* we even send the initial monitor request.

jlibosvar has traced some other issues down to a particular python-ovs commit, which I believe was the one that added support for clustered dbs. Part of that change (if I am remembering correctly) added connecting to the _Server db. ovsdbapp had a hack, idlutils.wait_for_change(), to wait until we'd gotten the initial db. Since there are now several different things going on, wait_for_change() probably doesn't do what we really want, which is "wait until we get the initial dump of the OVN dbs". So we sometimes start sending transactions too early.
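
As a rough sketch of that "wait until we get the initial dump" behaviour (illustrative only: wait_for_initial_dump is a hypothetical helper, and it assumes idl.run() processes pending messages and idl.has_ever_connected() becomes true once the initial contents have been retrieved):

import time

def wait_for_initial_dump(idl, timeout=10):
    # Block until the IDL has retrieved the initial database contents, so
    # that no transaction is sent before monitoring has actually started.
    deadline = time.time() + timeout
    while not idl.has_ever_connected():
        if time.time() > deadline:
            raise TimeoutError("initial OVN db dump not received in time")
        idl.run()          # process pending messages from the server
        time.sleep(0.01)   # avoid a tight busy loop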

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to neutron (master)

Fix proposed to branch: master
Review: https://review.opendev.org/c/openstack/neutron/+/806938

Changed in neutron:
status: Confirmed → In Progress
Revision history for this message
Terry Wilson (otherwiseguy) wrote :

And ultimately, https://patchwork.ozlabs<email address hidden>/ is the fix. It will prevent python-ovs from allowing transactions before it has started monitoring and retrieved the initial db, telling clients to try the transaction again (which ovsdbapp does).
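
In client terms that looks roughly like the retry loop below (a sketch with stand-in names; the TRY_AGAIN constant is modelled on the python-ovs transaction status of that name, and commit_with_retry is not actual ovsdbapp code):

TRY_AGAIN = "try again"  # stand-in for the python-ovs Transaction.TRY_AGAIN status

def commit_with_retry(build_txn, max_attempts=5):
    # Re-run a transaction while the IDL reports it should be retried,
    # e.g. because it was submitted before monitoring had started.
    for _ in range(max_attempts):
        txn = build_txn()            # assumed to rebuild the transaction operations
        status = txn.commit_block()  # assumed to block until a final status is known
        if status != TRY_AGAIN:
            return status
    raise RuntimeError("transaction not accepted after %d attempts" % max_attempts)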

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to neutron (master)

Reviewed: https://review.opendev.org/c/openstack/neutron/+/806938
Committed: https://opendev.org/openstack/neutron/commit/76ee64f4a8095ba207872d5de9613f93f208fdd4
Submitter: "Zuul (22348)"
Branch: master

commit 76ee64f4a8095ba207872d5de9613f93f208fdd4
Author: Terry Wilson <email address hidden>
Date: Wed Sep 1 14:19:19 2021 +0000

    Fix neutron_pg_drop-related startup issues

    There is a python-ovs bug that allows transactions to be executed
    before it has started monitoring the db for changes. This breaks
    ovsdbapp's AddCommand behavior, which looks up a row in memory
    post-commit to return it to the caller, because we process the
    reply from ovsdb-server that has the UUID in it, but when we try to
    look it up in memory, it isn't there since that is handled via the
    monitor notifications.

    Since we don't care about the return value, we can just ignore the
    KeyError. In addition, the request to Idl.cond_change() to monitor
    only the neutron_pg_drop table was failing for a similar reason--
    it is called before the initial monitor request is sent, so
    directly setting table.condition will allow the condition to be
    added to the initial monitor request.

    Closes-Bug: #1938766
    Change-Id: I78af15970a86923fb810e903d38634b6e60a8fe4
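
    For context, a minimal sketch of the shape of the first part of that workaround follows (illustrative only; create_drop_port_group_ignoring_race is a hypothetical name, not the actual neutron change, and it assumes an ovsdbapp OVN northbound API object providing pg_add()):

    def create_drop_port_group_ignoring_race(nb_api, pg_name="neutron_pg_drop"):
        # Create the default drop port group, tolerating the KeyError that
        # AddCommand's post-commit row lookup can raise when the transaction
        # reply is processed before the matching monitor notification.
        try:
            return nb_api.pg_add(pg_name, may_exist=True).execute(check_error=True)
        except KeyError:
            # The row was committed; only the in-memory lookup failed, and the
            # return value is not used, so ignoring the error is safe.
            return None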

Changed in neutron:
status: In Progress → Fix Released
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to neutron (stable/wallaby)

Fix proposed to branch: stable/wallaby
Review: https://review.opendev.org/c/openstack/neutron/+/808958

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to neutron (stable/victoria)

Fix proposed to branch: stable/victoria
Review: https://review.opendev.org/c/openstack/neutron/+/808864

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to neutron (stable/ussuri)

Fix proposed to branch: stable/ussuri
Review: https://review.opendev.org/c/openstack/neutron/+/808865

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/neutron 19.0.0.0rc1

This issue was fixed in the openstack/neutron 19.0.0.0rc1 release candidate.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to neutron (stable/wallaby)

Reviewed: https://review.opendev.org/c/openstack/neutron/+/808958
Committed: https://opendev.org/openstack/neutron/commit/651fce3fba083c8c5000221104bacd5747044b94
Submitter: "Zuul (22348)"
Branch: stable/wallaby

commit 651fce3fba083c8c5000221104bacd5747044b94
Author: Terry Wilson <email address hidden>
Date: Wed Sep 1 14:19:19 2021 +0000

    Fix neutron_pg_drop-related startup issues

    There is a python-ovs bug that allows transactions to be executed
    before it has started monitoring the db for changes. This breaks
    ovsdbapp's AddCommand behavior, which looks up a row in memory
    post-commit to return it to the caller, because we process the
    reply from ovsdb-server that has the UUID in it, but when we try to
    look it up in memory, it isn't there since that is handled via the
    monitor notifications.

    Since we don't care about the return value, we can just ignore the
    KeyError. In addition, the request to Idl.cond_change() to monitor
    only the neutron_pg_drop table was failing for a similar reason--
    it is called before the initial monitor request is sent, so
    directly setting table.condition will allow the condition to be
    added to the initial monitor request.

    Conflicts:
      neutron/plugins/ml2/drivers/ovn/mech_driver/ovsdb/ovsdb_monitor.py

    Closes-Bug: #1938766
    Change-Id: I78af15970a86923fb810e903d38634b6e60a8fe4
    (cherry picked from commit 76ee64f4a8095ba207872d5de9613f93f208fdd4)

tags: added: in-stable-wallaby
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to neutron (stable/ussuri)

Reviewed: https://review.opendev.org/c/openstack/neutron/+/808865
Committed: https://opendev.org/openstack/neutron/commit/1327a543370abcedcbee75d4653987fa27f34c83
Submitter: "Zuul (22348)"
Branch: stable/ussuri

commit 1327a543370abcedcbee75d4653987fa27f34c83
Author: Terry Wilson <email address hidden>
Date: Wed Sep 1 14:19:19 2021 +0000

    Fix neutron_pg_drop-related startup issues

    There is a python-ovs bug that allows transactions to be executed
    before it has started monitoring the db for changes. This breaks
    ovsdbapp's AddCommand behavior, which looks up a row in memory
    post-commit to return it to the caller, because we process the
    reply from ovsdb-server that has the UUID in it, but when we try to
    look it up in memory, it isn't there since that is handled via the
    monitor notifications.

    Since we don't care about the return value, we can just ignore the
    KeyError. In addition, the request to Idl.cond_change() to monitor
    only the neutron_pg_drop table was failing for a similar reason--
    it is called before the initial monitor request is sent, so
    directly setting table.condition will allow the condition to be
    added to the initial monitor request.

    Conflicts:
      neutron/plugins/ml2/drivers/ovn/mech_driver/ovsdb/ovsdb_monitor.py

    Closes-Bug: #1938766
    Change-Id: I78af15970a86923fb810e903d38634b6e60a8fe4
    (cherry picked from commit 76ee64f4a8095ba207872d5de9613f93f208fdd4)

tags: added: in-stable-ussuri
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to neutron (stable/victoria)

Reviewed: https://review.opendev.org/c/openstack/neutron/+/808864
Committed: https://opendev.org/openstack/neutron/commit/be6872be5a8b59c0eba222879404a2803c7e441d
Submitter: "Zuul (22348)"
Branch: stable/victoria

commit be6872be5a8b59c0eba222879404a2803c7e441d
Author: Terry Wilson <email address hidden>
Date: Wed Sep 1 14:19:19 2021 +0000

    Fix neutron_pg_drop-related startup issues

    There is a python-ovs bug that allows transactions to be executed
    before it has started monitoring the db for changes. This breaks
    ovsdbapp's AddCommand behavior, which looks up a row in memory
    post-commit to return it to the caller, because we process the
    reply from ovsdb-server that has the UUID in it, but when we try to
    look it up in memory, it isn't there since that is handled via the
    monitor notifications.

    Since we don't care about the return value, we can just ignore the
    KeyError. In addition, the request to Idl.cond_change() to monitor
    only the neutron_pg_drop table was failing for a similar reason--
    it is called before the initial monitor request is sent, so
    directly setting table.condition will allow the condition to be
    added to the initial monitor request.

    Conflicts:
      neutron/plugins/ml2/drivers/ovn/mech_driver/ovsdb/ovsdb_monitor.py

    Closes-Bug: #1938766
    Change-Id: I78af15970a86923fb810e903d38634b6e60a8fe4
    (cherry picked from commit 76ee64f4a8095ba207872d5de9613f93f208fdd4)

tags: added: in-stable-victoria
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/neutron 16.4.2

This issue was fixed in the openstack/neutron 16.4.2 release.

tags: added: neutron-proactive-backport-potential
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/neutron 17.3.0

This issue was fixed in the openstack/neutron 17.3.0 release.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/neutron 18.2.0

This issue was fixed in the openstack/neutron 18.2.0 release.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/networking-ovn train-eol

This issue was fixed in the openstack/networking-ovn train-eol release.
