Data not present in OVN IDL when requested

Bug #1888878 reported by Rodolfo Alonso
6
This bug affects 1 person
Affects Status Importance Assigned to Milestone
neutron
Fix Released
Medium
Terry Wilson
ovsdbapp
Fix Released
Undecided
Unassigned

Bug Description

During a test execution, we were expecting the NB "NB_Global" table to have the cfg values. After the test execution (and the error presented here), "NB_Global" table was consulted via cli command (ovn-nbctl) and it was correctly populated.

Error: http://paste.openstack.org/show/796293/

Between the reconnection and the command (and the exception), there are is a time gap of only 36ms. Is it possible that the in-memory DB was still being populated?

Revision history for this message
Terry Wilson (otherwiseguy) wrote :

Which test was failing? Do we know what is causing the reconnection? While we can wait for a connection to be started initially, we don't really have any knowledge of when a reconnection is going on due to it being hidden by the python-ovs idl stuff. Any time idl.run() is called, it handles potential reconnections in the background in the Connection thread. So this looks like either an API request happened for agent list or the periodic health check happened immediately after a reconnect?

Revision history for this message
Maciej Jozefczyk (maciejjozefczyk) wrote :

Hey

I believe it is related to:

https://review.opendev.org/#/c/742868/ - TestMaintenance.test_port failing

I remember it is pretty the same issue as we with Terry discovered recently. The Logical_Switch_Port row seems not to be in memory replica in time we request for it.

Revision history for this message
Eduardo Olivares (eolivare) wrote :

This occurred during a tobiko test. The test restarts OVN DB master container (which causes the reconnection). Then, it waits until "pcs status" shows that all ovn-dbs-bundle resources are up (master role was moved to a different controller), which happened in ~8 seconds. Finally, ~4 seconds later, it sends an agent list request to the API and receives an HTTPInternalServerError.

The test may need to wait a bit more before performing the agent list validation or to retry when it gets the HTTPInternalServerError.

Revision history for this message
Terry Wilson (otherwiseguy) wrote :

I'm not sure why we're not getting the automatic updates when posting a patch, but I have a patch here that should start to make it possible to fix this. https://review.opendev.org/#/c/745746/

For ml2/ovn and the sepcific issue here, some things like the nb_global property would need to either a) specifically hold the lock when doing the read from self.tables or have it return the output of a ReadOnlyCommand.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to neutron (master)

Fix proposed to branch: master
Review: https://review.opendev.org/747465

Changed in neutron:
assignee: nobody → Terry Wilson (otherwiseguy)
status: New → In Progress
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to ovsdbapp (master)

Reviewed: https://review.opendev.org/745746
Committed: https://git.openstack.org/cgit/openstack/ovsdbapp/commit/?id=3cf8a427c0a4dfc248e0778c3b1ee65671d1d07b
Submitter: Zuul
Branch: master

commit 3cf8a427c0a4dfc248e0778c3b1ee65671d1d07b
Author: Terry Wilson <email address hidden>
Date: Tue Aug 11 13:12:47 2020 -0500

    Add lock when calling idl.run()

    python-ovs transparently handles DB reconnections during calls to
    Idl.run(). Since ovsdbapp uses a separate thread for the
    Connection (so it can handle responding to OVSDB keep-alive echo
    requests) and for processing RowEvent notifcations, it is possible
    that a thread could be trying to access OVSDB data at the same
    time that the Connection thread is updating it. Especially during
    a reconnect where the in-memory copy of the DB will be completely
    re-written, this causes a problem.

    Theoretically, an application that uses ovsdbapp shouldn't access
    anything in the backend code directly, but since there is now only
    one supported Backend, that restriction has slipped a bit. If
    all accesses were in Command objects, mostly there wouldn't be a
    problem as they would be handled in the Connection thread. But
    with both ReadOnlyCommands and use of lookup() and anything in
    idlutils, the main thread can access the in-memory db directly and
    potentially cause problems.

    The most common problematic method to be called outside of Command
    objects is lookup() and ReadOnlyCommands. This patch adds a lock
    around calls to idl.run() and lookup() and execute() for the
    ReadOnlyCommand case. row_by_value() is another target, but it is
    left untouched because it doesn't have access to the API instance
    and it is *mostly* called from inside Command objects. User code
    that doesn't will be easy to find because user code basically
    shouldn't use idlutils.

    Change-Id: I98c37771883103e1fb0468de9cf85364071993fa
    Closes-Bug: #1888878

Changed in ovsdbapp:
status: In Progress → Fix Released
Changed in neutron:
importance: Undecided → Medium
status: In Progress → Fix Released
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to neutron (master)

Reviewed: https://review.opendev.org/747465
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=e1582e39a1ba7f620e9cea7e5c531dc03e881ced
Submitter: Zuul
Branch: master

commit e1582e39a1ba7f620e9cea7e5c531dc03e881ced
Author: Terry Wilson <email address hidden>
Date: Fri Aug 21 13:48:52 2020 -0500

    Don't directly access the in-memory DB for NB_Global row

    Directly accessing the in-memory OVSDB via 'tables' outside of a
    Command object isn't safe. For example, if we lose connection to
    the OVSDB and reconnect, the in-memory copy will be replaced by
    the Connection thread. If we get a call to get_agents() that is
    handled by the main thread and we look up the value from there,
    the value we expect may be missing.

    The dependent patch in ovsdbapp adds some locking to help with
    this some API commands like lookup() that are heavily used and
    with ReadOnlyCommands which also aren't processed like normal
    Command objects inside the connection thread.

    There will be many more changes in the ml2/ovn code similar to
    this one, as we directly access the in-memory copy a lot.

    Depends-On: https://review.opendev.org/#/c/745746/
    Change-Id: I328391c88af1ef2f6b1af3787c63d8591fb317b5
    Closes-Bug: #1888878

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to ovsdbapp (stable/ussuri)

Fix proposed to branch: stable/ussuri
Review: https://review.opendev.org/760394

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to ovsdbapp (stable/train)

Fix proposed to branch: stable/train
Review: https://review.opendev.org/760395

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to ovsdbapp (stable/ussuri)

Reviewed: https://review.opendev.org/760394
Committed: https://git.openstack.org/cgit/openstack/ovsdbapp/commit/?id=9168ddcfa8a6cef68f89252dbda5173f58ec7c5c
Submitter: Zuul
Branch: stable/ussuri

commit 9168ddcfa8a6cef68f89252dbda5173f58ec7c5c
Author: Terry Wilson <email address hidden>
Date: Tue Aug 11 13:12:47 2020 -0500

    Add lock when calling idl.run()

    python-ovs transparently handles DB reconnections during calls to
    Idl.run(). Since ovsdbapp uses a separate thread for the
    Connection (so it can handle responding to OVSDB keep-alive echo
    requests) and for processing RowEvent notifcations, it is possible
    that a thread could be trying to access OVSDB data at the same
    time that the Connection thread is updating it. Especially during
    a reconnect where the in-memory copy of the DB will be completely
    re-written, this causes a problem.

    Theoretically, an application that uses ovsdbapp shouldn't access
    anything in the backend code directly, but since there is now only
    one supported Backend, that restriction has slipped a bit. If
    all accesses were in Command objects, mostly there wouldn't be a
    problem as they would be handled in the Connection thread. But
    with both ReadOnlyCommands and use of lookup() and anything in
    idlutils, the main thread can access the in-memory db directly and
    potentially cause problems.

    The most common problematic method to be called outside of Command
    objects is lookup() and ReadOnlyCommands. This patch adds a lock
    around calls to idl.run() and lookup() and execute() for the
    ReadOnlyCommand case. row_by_value() is another target, but it is
    left untouched because it doesn't have access to the API instance
    and it is *mostly* called from inside Command objects. User code
    that doesn't will be easy to find because user code basically
    shouldn't use idlutils.

    Conflicts:
      ovsdbapp/tests/unit/backend/test_ovs_idl.py

    Change-Id: I98c37771883103e1fb0468de9cf85364071993fa
    Closes-Bug: #1888878
    (cherry picked from commit 3cf8a427c0a4dfc248e0778c3b1ee65671d1d07b)

tags: added: in-stable-ussuri
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to ovsdbapp (stable/train)

Reviewed: https://review.opendev.org/760395
Committed: https://git.openstack.org/cgit/openstack/ovsdbapp/commit/?id=876da6f7c809cb17ae7bf507e165fd7765af5b52
Submitter: Zuul
Branch: stable/train

commit 876da6f7c809cb17ae7bf507e165fd7765af5b52
Author: Terry Wilson <email address hidden>
Date: Tue Aug 11 13:12:47 2020 -0500

    Add lock when calling idl.run()

    python-ovs transparently handles DB reconnections during calls to
    Idl.run(). Since ovsdbapp uses a separate thread for the
    Connection (so it can handle responding to OVSDB keep-alive echo
    requests) and for processing RowEvent notifcations, it is possible
    that a thread could be trying to access OVSDB data at the same
    time that the Connection thread is updating it. Especially during
    a reconnect where the in-memory copy of the DB will be completely
    re-written, this causes a problem.

    Theoretically, an application that uses ovsdbapp shouldn't access
    anything in the backend code directly, but since there is now only
    one supported Backend, that restriction has slipped a bit. If
    all accesses were in Command objects, mostly there wouldn't be a
    problem as they would be handled in the Connection thread. But
    with both ReadOnlyCommands and use of lookup() and anything in
    idlutils, the main thread can access the in-memory db directly and
    potentially cause problems.

    The most common problematic method to be called outside of Command
    objects is lookup() and ReadOnlyCommands. This patch adds a lock
    around calls to idl.run() and lookup() and execute() for the
    ReadOnlyCommand case. row_by_value() is another target, but it is
    left untouched because it doesn't have access to the API instance
    and it is *mostly* called from inside Command objects. User code
    that doesn't will be easy to find because user code basically
    shouldn't use idlutils.

    Conflicts:
      ovsdbapp/tests/unit/backend/test_ovs_idl.py

    Change-Id: I98c37771883103e1fb0468de9cf85364071993fa
    Closes-Bug: #1888878
    (cherry picked from commit 3cf8a427c0a4dfc248e0778c3b1ee65671d1d07b)

tags: added: in-stable-train
Revision history for this message
Rodolfo Alonso (rodolfo-alonso-hernandez) wrote :
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to ovsdbapp (master)

Related fix proposed to branch: master
Review: https://review.opendev.org/762730

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to neutron (master)

Related fix proposed to branch: master
Review: https://review.opendev.org/762737

To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.