Hash Ring: Unreliable health check for workers
Affects | Status | Importance | Assigned to | Milestone | |
---|---|---|---|---|---|
networking-ovn | Fix Released | High | Lucas Alvares Gomes | | |
Bug Description
When under pressure, futurist (the library we use in the maintenance thread) may not honor the time interval at which it should run its tasks. Some tasks, such as touching the Hash Ring nodes, have a large impact on the system (no events are processed if all nodes are shown as dead in the ring), so we need to make them more reliable.
This behavior was first seen in this patch https:/
In the q-svc log of the rally job we can see many errors like:
"ERROR networking_
The reason is that the node is too busy, and the maintenance thread task that health checks the workers in the hash ring is not running at the expected time, see:
Jun 26 19:41:21.839331 ubuntu-
The next run happens at:
Jun 26 19:42:53.172458 ubuntu-
The current timeout is 60s (https:/
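To make the gap concrete, the two timestamps quoted above can be compared against the 60s timeout. This is only a small arithmetic sketch: the helper name `seconds_between` and the constant `HEALTH_CHECK_TIMEOUT` are made up for illustration and are not part of networking-ovn, and only the time-of-day portion is parsed because both log lines fall on the same day.

```python
from datetime import datetime

# 60s timeout taken from the report; the constant name is illustrative.
HEALTH_CHECK_TIMEOUT = 60


def seconds_between(first: str, second: str) -> float:
    """Return the seconds elapsed between two same-day log timestamps."""
    fmt = "%H:%M:%S.%f"
    # Keep only the trailing "HH:MM:SS.ffffff" field of each timestamp.
    start = datetime.strptime(first.split()[-1], fmt)
    end = datetime.strptime(second.split()[-1], fmt)
    return (end - start).total_seconds()


gap = seconds_between("Jun 26 19:41:21.839331", "Jun 26 19:42:53.172458")
print(f"gap between runs: {gap:.3f}s, over timeout: {gap > HEALTH_CHECK_TIMEOUT}")
# prints "gap between runs: 91.333s, over timeout: True"
```

The two runs are roughly 91 seconds apart, about 31 seconds past the timeout.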
Changed in networking-ovn: | |
status: | New → Incomplete |
status: | Incomplete → Confirmed |
importance: | Undecided → High |
assignee: | nobody → Lucas Alvares Gomes (lucasagomes) |
Changed in networking-ovn: | |
status: | Confirmed → In Progress |
tags: | added: networking-ovn-proactive-backport-potential |
tags: | removed: networking-ovn-proactive-backport-potential |
Reviewed: https://review.opendev.org/667953
Committed: https://git.openstack.org/cgit/openstack/networking-ovn/commit/?id=f93638b04233b2baf7590398ae384ccd0f51985e
Submitter: Zuul
Branch: master
commit f93638b04233b2baf7590398ae384ccd0f51985e
Author: Lucas Alvares Gomes <email address hidden>
Date: Thu Jun 27 15:55:43 2019 +0100
Hash Ring: Make workers health check more reliable
When under pressure the maintenance thread may not honor the time interval
that it should run its tasks. For something more sensitive like the task
that health checks the nodes from the Hash Ring this can be problematic
because it may cause it to appear that the workers have died and the
ring will get rebalanced; we do not want to have a long interval for it
because in case of a real failure we do want the ring to rebalance fast.
This patch implements a new approach to enhance the health check of the
Hash Ring workers by allowing the worker to "touch" the ring to say that
it's alive.
The patch also includes unittests for the OvnIdlDistributedLock class
that were missing before.
Closes-Bug: #1834498
Change-Id: If06d8580e8e2637b19ff5d76e16f635a5c1d328e
Signed-off-by: Lucas Alvares Gomes <email address hidden>
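The "touch" idea described in the commit message can be sketched roughly as follows. This is not the networking-ovn implementation: `RingLiveness`, its methods, and the in-memory dictionary are hypothetical stand-ins for whatever storage the real Hash Ring uses, and the injectable clock exists only so the behavior can be shown without sleeping. The point is that each worker records its own liveness, so a delayed maintenance task no longer makes healthy workers look dead.

```python
import time


class RingLiveness:
    """Hypothetical in-memory stand-in for the Hash Ring node table."""

    def __init__(self, timeout=60.0, clock=time.monotonic):
        self.timeout = timeout    # seconds without a touch before a node counts as dead
        self._clock = clock       # injectable so the example runs without sleeping
        self._last_touched = {}   # node id -> time of the last touch

    def touch(self, node_id):
        """Called by a worker itself to say that it is alive."""
        self._last_touched[node_id] = self._clock()

    def alive_nodes(self):
        """Return the nodes touched within the timeout window, sorted by id."""
        now = self._clock()
        return sorted(
            node for node, touched in self._last_touched.items()
            if now - touched <= self.timeout
        )


# Usage with a fake clock: worker-a keeps touching the ring, worker-b stops.
now = [0.0]
ring = RingLiveness(timeout=60.0, clock=lambda: now[0])
ring.touch("worker-a")
ring.touch("worker-b")
now[0] = 50.0
ring.touch("worker-a")
now[0] = 91.0
print(ring.alive_nodes())  # prints ['worker-a']
```

In this sketch a busy maintenance thread cannot cause a rebalance by itself; only a worker that actually stops touching the ring ages out after the timeout.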