Redis switchover breaks coordination for Gnocchi

Bug #1841589 reported by Gabor Orosz
Affects: tooz
Status: In Progress
Importance: Undecided
Assigned to: Gabor Orosz
Milestone: (none)

Bug Description

Reproduction:
Given three OpenStack controllers, each running a single Redis instance configured in master-slave mode. HAProxy is used to direct sessions to the current master instance. Pacemaker manages the Redis instances, HAProxy, its network namespace, and the Virtual IP. Gnocchi runs on all controllers and is configured to use Redis as a coordination backend through the Virtual IP.
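(For reference, "Redis as a coordination backend through the Virtual IP" amounts to a tooz call along the following lines; this is only an illustrative sketch, and the address and member id are placeholders rather than values from this deployment.)

import uuid
from tooz import coordination

# 192.0.2.10 stands in for the Pacemaker-managed Virtual IP (placeholder).
coordinator = coordination.get_coordinator(
    "redis://192.0.2.10:6379", uuid.uuid4().hex.encode())
coordinator.start()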
1. Trigger a graceful switchover of the Redis service by banning the current Redis master instance. Pacemaker will demote the master instance and promote a slave node to become the new master.
2. As a result, the gnocchi-metricd workers get disconnected, and some of them start reporting the following kind of error after they manage to re-establish the connection to Redis:

2019-08-15T15:28:56.841791+02:00 cic-1.domain.tld gnocchi-metricd[8043]: 2019-08-15 15:28:56,841 [8043] ERROR futurist.periodics: Failed to call periodic 'gnocchi.cli.run_watchers' (it runs every 30.00 seconds)
Traceback (most recent call last):
File "/usr/lib/python2.7/dist-packages/futurist/periodics.py", line 290, in run
work()
File "/usr/lib/python2.7/dist-packages/futurist/periodics.py", line 64, in __call__
return self.callback(*self.args, **self.kwargs)
File "/usr/lib/python2.7/dist-packages/futurist/periodics.py", line 178, in decorator
return f(*args, **kwargs)
File "/usr/lib/python2.7/dist-packages/gnocchi/cli.py", line 215, in run_watchers
self.coord.run_watchers()
File "/usr/lib/python2.7/dist-packages/tooz/drivers/redis.py", line 747, in run_watchers
result = super(RedisDriver, self).run_watchers(timeout=timeout)
File "/usr/lib/python2.7/dist-packages/tooz/coordination.py", line 763, in run_watchers
MemberLeftGroup(group_id, member_id)))
File "/usr/lib/python2.7/dist-packages/tooz/coordination.py", line 120, in run
return list(map(lambda cb: cb(*args, **kwargs), self))
File "/usr/lib/python2.7/dist-packages/tooz/coordination.py", line 120, in <lambda>
return list(map(lambda cb: cb(*args, **kwargs), self))
File "/usr/lib/python2.7/dist-packages/tooz/partitioner.py", line 50, in _on_member_leave
self.ring.remove_node(event.member_id)
File "/usr/lib/python2.7/dist-packages/tooz/hashring.py", line 92, in remove_node
raise UnknownNode(node)
UnknownNode: Unknown node `fc3584da-6583-45fd-9ab2-1442bf996f72'
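
The failing call is tooz's own HashRing.remove_node() (tooz/hashring.py): when a membership event refers to a member id that is not, or is no longer, in the ring, the ring raises UnknownNode rather than ignoring the event. A minimal sketch of that behaviour, assuming only that tooz is installed (the member names are made up):

from tooz import hashring

ring = hashring.HashRing(["member-a", "member-b"])
ring.remove_node("member-a")          # known member: removed cleanly
try:
    ring.remove_node("member-a")      # already gone -> UnknownNode
except hashring.UnknownNode as exc:
    print(exc)                        # prints: Unknown node `member-a'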

The same issue is reported against Gnocchi in the following ticket:
https://github.com/gnocchixyz/gnocchi/issues/185

However, our troubleshooting and investigation indicate that this is a bug in the Tooz library's HashRing and coordination implementation.
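
For context, the tooz code path implicated by the traceback is roughly the following (a hedged sketch, not Gnocchi's actual code; the address, group name, and member id are placeholders). The Partitioner returned by join_partitioned_group() mirrors group membership into a HashRing through watch_join_group()/watch_leave_group() callbacks, and run_watchers() later replays membership changes into those callbacks; after the Redis failover, a replayed leave event can name a member the ring does not contain, which is where UnknownNode escapes.

import uuid
from tooz import coordination

coordinator = coordination.get_coordinator(
    "redis://192.0.2.10:6379", uuid.uuid4().hex.encode())  # placeholder VIP
coordinator.start()

# Join the group and get a Partitioner; its watch callbacks keep an
# internal HashRing in sync with group membership.
partitioner = coordinator.join_partitioned_group(b"gnocchi-metricd")  # placeholder group name

# Gnocchi's periodic task (gnocchi.cli.run_watchers in the traceback)
# eventually calls this, which dispatches join/leave events to the
# partitioner's callbacks -- including HashRing.remove_node().
coordinator.run_watchers()

coordinator.stop()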

OpenStack Infra (hudson-openstack) wrote : Fix proposed to tooz (master)

Fix proposed to branch: master
Review: https://review.opendev.org/678842

Changed in python-tooz:
assignee: nobody → Gabor Orosz (gabor.orosz)
status: New → In Progress
OpenStack Infra (hudson-openstack) wrote : Change abandoned on tooz (master)

Change abandoned by Stephen Finucane (<email address hidden>) on branch: master
Review: https://review.opendev.org/678842
Reason: This is failing pretty hard and hasn't been updated since the initial contribution. I'm going to abandon this so hopefully someone else can go fix this up
