Hash Ring: Unreliable health check for workers

Bug #1834498 reported by Lucas Alvares Gomes on 2019-06-27
This bug affects 1 person
Affects: networking-ovn
Importance: High
Assigned to: Lucas Alvares Gomes

Bug Description

When under pressure, futurist (the library we use in the maintenance thread) may not honor the time interval at which its tasks should run. For certain tasks, such as touching the Hash Ring nodes, this can have a large impact on the system (no events are processed if all nodes are shown as dead in the ring), so we need to make it more reliable.

This behavior was first seen in this patch https://review.opendev.org/#/c/666587/ (patch set #2).

In the q-svc log of the rally job we can see many errors like:

"ERROR networking_ovn.ovsdb.ovsdb_monitor [-] HashRing is empty, error: An unknown exception occurred.: HashRingIsEmpty: An unknown exception occurred."

The reason is that the node is too busy, so the maintenance thread task that health checks the workers in the hash ring does not run at the expected time. See:

Jun 26 19:41:21.839331 ubuntu-bionic-rax-dfw-0008389515 neutron-server[7549]: DEBUG futurist.periodics [None req-4678e35e-7aa5-44a4-ad54-7530a69c567e None None] Submitting periodic callback 'networking_ovn.common.maintenance.DBInconsistenciesPeriodics.touch_hash_ring_nodes' {{(pid=7938) _process_scheduled /usr/local/lib/python2.7/dist-packages/futurist/periodics.py:639}}

The next run happens about 91 seconds later, at:

Jun 26 19:42:53.172458 ubuntu-bionic-rax-dfw-0008389515 neutron-server[7549]: DEBUG futurist.periodics [None req-4678e35e-7aa5-44a4-ad54-7530a69c567e None None] Submitting periodic callback 'networking_ovn.common.maintenance.DBInconsistenciesPeriodics.touch_hash_ring_nodes' {{(pid=7938) _process_scheduled /usr/local/lib/python2.7/dist-packages/futurist/periodics.py:639}}

The current timeout is 60s (https://github.com/openstack/networking-ovn/blob/8bb5e7ec20e107a3440f405c41d1b301a325cc2f/networking_ovn/common/constants.py#L157), which is shorter than the gap between the two runs above.
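To make the arithmetic concrete, here is a minimal sketch that computes the gap between the two log timestamps and compares it against the 60s timeout. The `is_alive` helper and the assumption that a node is touched at the moment the callback is submitted are simplifications for illustration, not the actual networking-ovn code.

```python
from datetime import datetime, timedelta

# The 60s timeout mentioned in the bug description.
TOUCH_TIMEOUT = timedelta(seconds=60)

# Timestamps of the two "Submitting periodic callback" log lines.
first_run = datetime(2019, 6, 26, 19, 41, 21, 839331)
second_run = datetime(2019, 6, 26, 19, 42, 53, 172458)

gap = second_run - first_run
print(gap.total_seconds())


def is_alive(last_touched, now, timeout=TOUCH_TIMEOUT):
    """Hypothetical liveness rule: a node is alive if it was touched
    within the timeout window."""
    return now - last_touched <= timeout


# Assuming the node was last touched at first_run, it is already seen
# as dead just before the second run gets a chance to touch it again.
print(is_alive(first_run, second_run))
```

Under these assumptions, every node whose only source of touches is the delayed periodic task falls outside the window, which matches the "HashRing is empty" errors in the log.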

Changed in networking-ovn:
status: New → Incomplete
status: Incomplete → Confirmed
importance: Undecided → High
assignee: nobody → Lucas Alvares Gomes (lucasagomes)
Changed in networking-ovn:
status: Confirmed → In Progress

Reviewed: https://review.opendev.org/667953
Committed: https://git.openstack.org/cgit/openstack/networking-ovn/commit/?id=f93638b04233b2baf7590398ae384ccd0f51985e
Submitter: Zuul
Branch: master

commit f93638b04233b2baf7590398ae384ccd0f51985e
Author: Lucas Alvares Gomes <email address hidden>
Date: Thu Jun 27 15:55:43 2019 +0100

    Hash Ring: Make workers health check more reliable

    When under pressure the maintenance thread may not honor the time interval
    that it should run its tasks. For something more sensitive like the task
    that health checks the nodes from the Hash Ring this can be problematic
    because it may cause it to appear that the workers have died and the
    ring will get rebalanced; we do not want to have a long interval for it
    because in case of a real failure we do want the ring to rebalance fast.

    This patch implements a new approach to enhance the health check of the
    Hash Ring workers by allowing the worker to "touch" the ring to say that
    it's alive.

    The patch also includes unit tests for the OvnIdlDistributedLock class
    that were missing before.

    Closes-Bug: #1834498
    Change-Id: If06d8580e8e2637b19ff5d76e16f635a5c1d328e
    Signed-off-by: Lucas Alvares Gomes <email address hidden>
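The idea described in the commit message can be illustrated with a small sketch in which a worker refreshes its own entry in the ring whenever it handles an event, rate-limited so the store is not hit on every event. All names here (`Worker`, `TOUCH_INTERVAL`, the dict used as the ring) are hypothetical and chosen for illustration; this is not the code from the patch.

```python
import threading
import time

# Hypothetical minimum number of seconds between two self-touches.
TOUCH_INTERVAL = 30.0


class Worker:
    """Toy worker that marks itself alive while it processes events."""

    def __init__(self, node_id, ring, clock=time.monotonic):
        self._node_id = node_id
        self._ring = ring            # dict: node id -> last touch time
        self._clock = clock
        self._last_touch = None
        self._lock = threading.Lock()

    def handle_event(self, event):
        # Processing an event is itself evidence that the worker is
        # alive, so refresh the ring entry here instead of relying only
        # on a periodic task that may be delayed under load.
        self._maybe_touch()
        return event

    def _maybe_touch(self):
        now = self._clock()
        with self._lock:
            recently_touched = (
                self._last_touch is not None
                and now - self._last_touch < TOUCH_INTERVAL
            )
            if recently_touched:
                return
            self._last_touch = now
            self._ring[self._node_id] = now
```

With this shape, a busy worker keeps its entry fresh as a side effect of doing work, while an idle worker still depends on the periodic health check.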

Changed in networking-ovn:
status: In Progress → Fix Released