ovn: filtering hash ring nodes is not time zone aware

Bug #1894117 reported by Jakub Libosvar
6
This bug affects 1 person
Affects Status Importance Assigned to Milestone
neutron
Fix Released
Medium
Jakub Libosvar

Bug Description

Nodes are registered using local machine timezone but then when checking alive nodes. After first touch, the update_at gets fixed using UTC timezone and then getting nodes work.

Changed in neutron:
importance: Undecided → Medium
Revision history for this message
Jakub Libosvar (libosvar) wrote :

Just for the record, this is the error when starting Neutron server:

Sep 08 04:36:33 devstack neutron-server[585144]: ERROR neutron.plugins.ml2.drivers.ovn.mech_driver.ovsdb.ovsdb_monitor [-] HashRing is empty, error: Hash Ring returned empty when hashing "b'd18b6f9d-82bc-4056-9ec1-7604b1470a68'". This should never happen in a normal situation, please check the status of your cluster: neutron.common.ovn.exceptions.HashRingIsEmpty: Hash Ring returned empty when hashing "b'd18b6f9d-82bc-4056-9ec1-7604b1470a68'". This should never happen in a normal situation, please check the status of your cluster

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to neutron (master)

Fix proposed to branch: master
Review: https://review.opendev.org/750295

Changed in neutron:
status: New → In Progress
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to neutron (master)

Reviewed: https://review.opendev.org/750295
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=a84462f698f91b10e332e3403ecc9a68c3e6a884
Submitter: Zuul
Branch: master

commit a84462f698f91b10e332e3403ecc9a68c3e6a884
Author: Jakub Libosvar <email address hidden>
Date: Tue Sep 8 04:38:42 2020 -0400

    ovn: Always use UTC for Hash ring timestamps

    When a node is created in ovn_hash_ring table, it used to use timestamp
    depending on the system or mysql timezone. This caused troubles for
    machines using western timezones when getting active nodes before the
    first liveness check, becaus the time difference between UTC and local
    time was always in hours and the limit is 30 seconds.

    This patch configures OVN Hash Ring model to always use UTC when
    creating a node so the times are always compared to UTC, regardless of
    the used timezone. Note that the created_at and updated_at are not user
    facing columns.

    Change-Id: I9172254ce9b1880833128bdcaacacb93821933dd
    Closes-bug: #1894117
    Signed-off-by: Jakub Libosvar <email address hidden>

Changed in neutron:
status: In Progress → Fix Released
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to neutron (stable/ussuri)

Fix proposed to branch: stable/ussuri
Review: https://review.opendev.org/752906

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to ovn-octavia-provider (master)

Related fix proposed to branch: master
Review: https://review.opendev.org/755111

Revision history for this message
Rodolfo Alonso (rodolfo-alonso-hernandez) wrote :

Hello:

Same as commented in [1], there is a very small time gap between "created_at" and "updated_at". This could lead to the following error: [2][3].

I'll push a patch, related to this bug, to solve this issue.

Regards.

[1]https://github.com/openstack/neutron/blob/4d31284373e89cb2b29539d6718f90a4c4d8284b/neutron/common/ovn/hash_ring_manager.py#L57-L61
[2]https://7b27dceb4aae1028bffd-4ce80c083e58c5170f82b282f505c84d.ssl.cf5.rackcdn.com/754938/7/check/openstack-tox-cover/828ce35/testr_results.html
[3]http://paste.openstack.org/show/798859/

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to neutron (master)

Related fix proposed to branch: master
Review: https://review.opendev.org/756860

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to ovn-octavia-provider (master)

Reviewed: https://review.opendev.org/755111
Committed: https://git.openstack.org/cgit/openstack/ovn-octavia-provider/commit/?id=b68d2a78a4b34497e376a7be545a20a23d036191
Submitter: Zuul
Branch: master

commit b68d2a78a4b34497e376a7be545a20a23d036191
Author: Flavio Fernandes <email address hidden>
Date: Tue Sep 29 15:12:30 2020 -0400

    Fix and enable test_port_forwarding

    This test began failing because of inconsistencies in the
    in-memory neutron db that occur since the change
    https://review.opendev.org/#/c/750295/ got merged.

    The failure observed was that an entry in the port forwarding
    table was created but never got added as a row. When the get
    operation for the entry created happened, it raised a not found
    exception. To avoid this situation the test will now use
    OS_TEST_DBAPI_ADMIN_CONNECTION=sqlite:///sqlite.db

    This reverts commit 481f0b3b3c6014c23b9619b6c361754d0185dce1.
    Related-bug: #1894117
    Closes-bug: #1896678
    Co-Authored-By: Jakub Libosvar <email address hidden>
    Change-Id: Ic753232ee576fb8b663af5c4a68635bb40a40edc

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to ovn-octavia-provider (stable/victoria)

Related fix proposed to branch: stable/victoria
Review: https://review.opendev.org/757575

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to ovn-octavia-provider (stable/victoria)

Reviewed: https://review.opendev.org/757575
Committed: https://git.openstack.org/cgit/openstack/ovn-octavia-provider/commit/?id=f92961e8c658d540a7bb0f657a65f960d2e4e82a
Submitter: Zuul
Branch: stable/victoria

commit f92961e8c658d540a7bb0f657a65f960d2e4e82a
Author: Flavio Fernandes <email address hidden>
Date: Tue Sep 29 15:12:30 2020 -0400

    Fix and enable test_port_forwarding

    This test began failing because of inconsistencies in the
    in-memory neutron db that occur since the change
    https://review.opendev.org/#/c/750295/ got merged.

    The failure observed was that an entry in the port forwarding
    table was created but never got added as a row. When the get
    operation for the entry created happened, it raised a not found
    exception. To avoid this situation the test will now use
    OS_TEST_DBAPI_ADMIN_CONNECTION=sqlite:///sqlite.db

    This reverts commit 481f0b3b3c6014c23b9619b6c361754d0185dce1.
    Related-bug: #1894117
    Closes-bug: #1896678
    Co-Authored-By: Jakub Libosvar <email address hidden>
    (cherry picked from commit b68d2a78a4b34497e376a7be545a20a23d036191)

    Change-Id: Ic753232ee576fb8b663af5c4a68635bb40a40edc

tags: added: in-stable-victoria
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to neutron (master)

Reviewed: https://review.opendev.org/756860
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=a6a99bc9a06585e7cf68a8e66085488e9c089062
Submitter: Zuul
Branch: master

commit a6a99bc9a06585e7cf68a8e66085488e9c089062
Author: Rodolfo Alonso Hernandez <email address hidden>
Date: Thu Oct 8 14:02:12 2020 +0000

    Match created_at and updated_at timestamp in TestHashRing

    When a "ovn_hash_ring" record is created in the database,
    there is a small difference between "created_at" and
    "updated_at", assigned during the Python OVO creation.

    This patch equates both values during testing if the
    difference between both is less than 10ms.

    Related-Bug: #1894117

    Change-Id: Ib3b405104d746255f869a94898d4ba14ac600909

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/neutron 16.3.0

This issue was fixed in the openstack/neutron 16.3.0 release.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/networking-ovn 7.4.0

This issue was fixed in the openstack/networking-ovn 7.4.0 release.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to neutron (stable/ussuri)

Related fix proposed to branch: stable/ussuri
Review: https://review.opendev.org/c/openstack/neutron/+/788543

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to neutron (stable/ussuri)

Reviewed: https://review.opendev.org/c/openstack/neutron/+/788543
Committed: https://opendev.org/openstack/neutron/commit/7cf9597570f288d27768dc5ff7be04824d09f8bc
Submitter: "Zuul (22348)"
Branch: stable/ussuri

commit 7cf9597570f288d27768dc5ff7be04824d09f8bc
Author: Terry Wilson <email address hidden>
Date: Mon Dec 7 20:46:53 2020 +0000

    Rely on worker count for HashRing caching

    The current code looks at a hash ring node's created_at/updated_at
    fields and tries to determine whether the node has been updated
    based on whether updated_at - created_at > 1 second (due to the
    method that initially fills them being different by microseconds).
    Unfortunately, due to the notify() method being called which calls
    the hash ring node's touch_node(), a node can be updated in under
    a second, meaning we will prevent caching for much longer than
    we intend.

    When using sqlite in-memory db, this continually re-creating the
    Hash Ring objects for every event that is processed is exposing an
    issue where rows that should be in the db just *aren't*.

    This patch instead limits the hash ring nodes to api workers and
    prevents caching only until the number of nodes == number of api
    workers on the host. The switch from spawning hash ring nodes
    where !is_maintenance to is_api_worker is primarily because it
    seems to be difficult to get a list of *all* workers from which to
    subtract the maintenance worker so that _wait_startup_before_caching
    can wait for that specific number of workers. In practice, this
    means that RpcWorker and ServiceWorker workers would not process
    HashRing events.

    A note on bug 1903008: While this change will greatly reduce the
    likelihood of this issue taking place, we still have some work to
    do in order to fully understand why it rubs the database backend
    in the wrong way. Thus, we will make this change 'related to'
    instead of closing the bug.

    Conflicts:
            neutron/tests/unit/common/ovn/test_hash_ring_manager.py

    Related-Bug: #1894117
    Related-Bug: #1903008
    Change-Id: Ia198d45f49bddda549a0e70a3374b8339f88887b
    (cherry picked from commit c4007b0833111a25d24f597161d39ee9ccd37189)

tags: added: in-stable-ussuri
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to neutron (stable/victoria)

Reviewed: https://review.opendev.org/c/openstack/neutron/+/786097
Committed: https://opendev.org/openstack/neutron/commit/1f30f2dfff722ea65c811b4b99243fae51a2d688
Submitter: "Zuul (22348)"
Branch: stable/victoria

commit 1f30f2dfff722ea65c811b4b99243fae51a2d688
Author: Terry Wilson <email address hidden>
Date: Mon Dec 7 20:46:53 2020 +0000

    Rely on worker count for HashRing caching

    The current code looks at a hash ring node's created_at/updated_at
    fields and tries to determine whether the node has been updated
    based on whether updated_at - created_at > 1 second (due to the
    method that initially fills them being different by microseconds).
    Unfortunately, due to the notify() method being called which calls
    the hash ring node's touch_node(), a node can be updated in under
    a second, meaning we will prevent caching for much longer than
    we intend.

    When using sqlite in-memory db, this continually re-creating the
    Hash Ring objects for every event that is processed is exposing an
    issue where rows that should be in the db just *aren't*.

    This patch instead limits the hash ring nodes to api workers and
    prevents caching only until the number of nodes == number of api
    workers on the host. The switch from spawning hash ring nodes
    where !is_maintenance to is_api_worker is primarily because it
    seems to be difficult to get a list of *all* workers from which to
    subtract the maintenance worker so that _wait_startup_before_caching
    can wait for that specific number of workers. In practice, this
    means that RpcWorker and ServiceWorker workers would not process
    HashRing events.

    A note on bug 1903008: While this change will greatly reduce the
    likelihood of this issue taking place, we still have some work to
    do in order to fully understand why it rubs the database backend
    in the wrong way. Thus, we will make this change 'related to'
    instead of closing the bug.

    Related-Bug: #1894117
    Related-Bug: #1903008
    Change-Id: Ia198d45f49bddda549a0e70a3374b8339f88887b
    (cherry picked from commit c4007b0833111a25d24f597161d39ee9ccd37189)

To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.