Unable to extend Redis lock in certain cases

Bug #1557593 reported by gordon chung
This bug affects 3 people

Affects   Status        Importance  Assigned to    Milestone
Gnocchi   Fix Released  Low         Unassigned     2.2.0
tooz      Fix Released  Medium      Julien Danjou

Bug Description

Seeing quite a few tooz lock errors when running master Gnocchi with Redis coordination:

[gchung@localhost devstack]$ sudo pip show tooz
---
Metadata-Version: 2.0
Name: tooz
Version: 1.34.0

coordination_url = redis://localhost:6379

2016-03-15 11:22:23.075 1214 WARNING tooz.drivers.redis [-] Unable to heartbeat lock '<tooz.drivers.redis.RedisLock object at 0x7f5e208fc9d0>'
2016-03-15 11:22:23.075 1214 ERROR tooz.drivers.redis Traceback (most recent call last):
2016-03-15 11:22:23.075 1214 ERROR tooz.drivers.redis File "/usr/lib/python2.7/site-packages/tooz/drivers/redis.py", line 507, in heartbeat
2016-03-15 11:22:23.075 1214 ERROR tooz.drivers.redis lock.heartbeat()
2016-03-15 11:22:23.075 1214 ERROR tooz.drivers.redis File "/usr/lib/python2.7/site-packages/tooz/drivers/redis.py", line 103, in heartbeat
2016-03-15 11:22:23.075 1214 ERROR tooz.drivers.redis self._lock.extend(self._lock.timeout)
2016-03-15 11:22:23.075 1214 ERROR tooz.drivers.redis File "/usr/lib64/python2.7/contextlib.py", line 35, in __exit__
2016-03-15 11:22:23.075 1214 ERROR tooz.drivers.redis self.gen.throw(type, value, traceback)
2016-03-15 11:22:23.075 1214 ERROR tooz.drivers.redis File "/usr/lib/python2.7/site-packages/tooz/drivers/redis.py", line 55, in _translate_failures
2016-03-15 11:22:23.075 1214 ERROR tooz.drivers.redis cause=e)
2016-03-15 11:22:23.075 1214 ERROR tooz.drivers.redis File "/usr/lib/python2.7/site-packages/tooz/coordination.py", line 669, in raise_with_cause
2016-03-15 11:22:23.075 1214 ERROR tooz.drivers.redis excutils.raise_with_cause(exc_cls, message, *args, **kwargs)
2016-03-15 11:22:23.075 1214 ERROR tooz.drivers.redis File "/usr/lib/python2.7/site-packages/oslo_utils/excutils.py", line 143, in raise_with_cause
2016-03-15 11:22:23.075 1214 ERROR tooz.drivers.redis six.raise_from(exc_cls(message, *args, **kwargs), kwargs.get('cause'))
2016-03-15 11:22:23.075 1214 ERROR tooz.drivers.redis File "/usr/lib/python2.7/site-packages/six.py", line 718, in raise_from
2016-03-15 11:22:23.075 1214 ERROR tooz.drivers.redis raise value
2016-03-15 11:22:23.075 1214 ERROR tooz.drivers.redis ToozError: Cannot extend an unlocked lock
2016-03-15 11:22:23.075 1214 ERROR tooz.drivers.redis
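The failure is easy to model without a Redis server: once a lock's TTL has elapsed, there is nothing left to extend. A minimal stdlib-only sketch of the pattern (`ToyLock` is a hypothetical stand-in, not the tooz or redis-py API):

```python
import time


class ToyLock:
    """Hypothetical stand-in for a TTL-based lock such as tooz's RedisLock."""

    def __init__(self, timeout):
        self.timeout = timeout
        self.expires_at = None

    def acquire(self):
        self.expires_at = time.monotonic() + self.timeout

    def extend(self, additional):
        # Mirrors the observed behavior: extending an expired or unheld
        # lock is an error rather than a re-acquisition.
        if self.expires_at is None or time.monotonic() > self.expires_at:
            raise RuntimeError("Cannot extend an unlocked lock")
        self.expires_at += additional


lock = ToyLock(timeout=0.05)   # 50 ms TTL, far shorter than the work below
lock.acquire()
time.sleep(0.1)                # processing outlives the TTL
try:
    lock.extend(0.05)          # the heartbeat arrives too late
    error = None
except RuntimeError as exc:
    error = str(exc)
print(error)
```

This is the same shape as the traceback above: the heartbeat thread calls `extend()` on a lock Redis has already expired.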

Revision history for this message
gordon chung (chungg) wrote :
Julien Danjou (jdanjou)
no longer affects: python-tooz
Changed in gnocchi:
status: New → Triaged
importance: Undecided → Low
Revision history for this message
Julien Danjou (jdanjou) wrote :

Going to fix it in Gnocchi by not using Redis in the gate but the default values instead.

I think it's a bug in tooz or in redis.lock anyhow; it's worth taking a look at some point.

Changed in python-tooz:
status: New → Triaged
importance: Undecided → Medium
summary: - tooz lock errors
+ Unable to extend Redis lock in certain cases
Revision history for this message
Julien Danjou (jdanjou) wrote :

To me, the problem is that the timeout is set to 5 in the gate, which is way too short. There's at least one metric in the log that took >3s to be processed, so likely this one took 5.1s and then the lock blew.

Changed in python-tooz:
status: Triaged → Incomplete
Revision history for this message
gordon chung (chungg) wrote :

Is that why it's blowing up? Just for reference, when I was speaking with nicodemus, they were getting roughly 3-5 seconds per processing period... but they also got 10-140+ seconds.

http://eavesdrop.openstack.org/irclogs/%23openstack-telemetry/%23openstack-telemetry.2016-03-15.log.html#t2016-03-15T19:01:42

Revision history for this message
Julien Danjou (jdanjou) wrote :

That's why, and also because we heartbeat every 10s. That's hardcoded. That sucks, that's why there's a FIXME(jd) in the code. Currently tooz does not expose that timeout info so we're blind. This needs to be fixed. Therefore I set the bug as Triaged for tooz, na!

The processing time it takes is unrelated to this issue, I mean, it's not making it shorter or longer.
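The arithmetic behind this comment: with the heartbeat period hardcoded to 10 s and the gate's lock timeout set to 5 s, the lock is guaranteed to expire before the first extension ever arrives (the processing time only determines how soon that matters):

```python
# Values from this bug: tooz heartbeats every 10 s (hardcoded at the time),
# while the gate configured a 5 s lock timeout.
HEARTBEAT_INTERVAL = 10
LOCK_TIMEOUT = 5

first_heartbeat_at = HEARTBEAT_INTERVAL
lock_expires_at = LOCK_TIMEOUT

# The lock dies before the first heartbeat can extend it.
assert first_heartbeat_at > lock_expires_at
gap = first_heartbeat_at - lock_expires_at
print("lock expires %d s before the first heartbeat" % gap)
```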

Changed in python-tooz:
status: Incomplete → Triaged
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to tooz (master)

Fix proposed to branch: master
Review: https://review.openstack.org/293894

Changed in python-tooz:
assignee: nobody → Julien Danjou (jdanjou)
status: Triaged → In Progress
Revision history for this message
OpenStack Infra (hudson-openstack) wrote :

Fix proposed to branch: master
Review: https://review.openstack.org/295075

Changed in python-tooz:
assignee: Julien Danjou (jdanjou) → Joshua Harlow (harlowja)
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on tooz (master)

Change abandoned by Julien Danjou (<email address hidden>) on branch: master
Review: https://review.openstack.org/293894
Reason: Let's go with Josh's approach.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to tooz (master)

Reviewed: https://review.openstack.org/295075
Committed: https://git.openstack.org/cgit/openstack/tooz/commit/?id=10b971150088e3e6830571c08d7928e5855c21ff
Submitter: Jenkins
Branch: master

commit 10b971150088e3e6830571c08d7928e5855c21ff
Author: Julien Danjou <email address hidden>
Date: Wed Mar 16 18:09:56 2016 +0100

    coordination: expose a heartbeat loop method

    This allows the heartbeat() loop to be delegated entirely to Tooz. Tooz
    has various information, such as the time required between heartbeats,
    that the consumer application does not possess, which makes it hard to
    build something reliable.

    Change-Id: Ib87aae318285cceb91c6df1946e3cddababedf1b
    Closes-Bug: #1557593
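A rough illustration of the idea behind this commit, using a hypothetical pure-Python `Coordinator` (all names here are illustrative, not the tooz API): the library owns the heartbeat loop, because only it knows the interval its internal locks require.

```python
import threading
import time


class Coordinator:
    """Hypothetical sketch: the coordination library runs its own
    heartbeat loop instead of asking the application to call
    heartbeat() at the right (unknown to it) frequency."""

    def __init__(self, interval):
        self._interval = interval        # internal detail, not exposed
        self._stop = threading.Event()
        self.beats = 0

    def heartbeat(self):
        self.beats += 1                  # this is where held locks get extended

    def _run_heartbeat_loop(self):       # illustrative name only
        # Event.wait() doubles as an interruptible sleep.
        while not self._stop.wait(self._interval):
            self.heartbeat()

    def start(self):
        self._thread = threading.Thread(target=self._run_heartbeat_loop)
        self._thread.daemon = True
        self._thread.start()

    def stop(self):
        self._stop.set()
        self._thread.join()


coord = Coordinator(interval=0.005)
coord.start()
time.sleep(0.06)                         # the application just does its work
coord.stop()
print("beats:", coord.beats)
```

The consumer never sees the interval; it only starts and stops the coordinator, which removes the mismatch described above.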

Changed in python-tooz:
status: In Progress → Fix Released
Revision history for this message
Davanum Srinivas (DIMS) (dims-v) wrote : Fix included in openstack/tooz 1.38.0

This issue was fixed in the openstack/tooz 1.38.0 release.

Revision history for this message
Doug Hellmann (doug-hellmann) wrote : Fix included in openstack/gnocchi 2.2.0

This issue was fixed in the openstack/gnocchi 2.2.0 release.

Julien Danjou (jdanjou)
Changed in gnocchi:
status: Triaged → Fix Released
milestone: none → 2.2.0
Revision history for this message
Mehdi Abaakouk (sileht) wrote :

I got this again with recent gnocchi and tooz versions:

# pip list | grep -e 'gnocchi ' -e tooz
gnocchi (3.0.1.dev110)
tooz (1.44.0)

2016-11-22 11:54:28.063 17906 WARNING tooz.drivers.redis [-] Unable to heartbeat lock '<tooz.drivers.redis.RedisLock object at 0x7fd902a10810>'
2016-11-22 11:54:28.063 17906 ERROR tooz.drivers.redis Traceback (most recent call last):
2016-11-22 11:54:28.063 17906 ERROR tooz.drivers.redis File "/nagiosdata/gnocchi/venv/local/lib/python2.7/site-packages/tooz/drivers/redis.py", line 506, in heartbeat
2016-11-22 11:54:28.063 17906 ERROR tooz.drivers.redis lock.heartbeat()
2016-11-22 11:54:28.063 17906 ERROR tooz.drivers.redis File "/nagiosdata/gnocchi/venv/local/lib/python2.7/site-packages/tooz/drivers/redis.py", line 102, in heartbeat
2016-11-22 11:54:28.063 17906 ERROR tooz.drivers.redis self._lock.extend(self._lock.timeout)
2016-11-22 11:54:28.063 17906 ERROR tooz.drivers.redis File "/usr/lib/python2.7/contextlib.py", line 35, in __exit__
2016-11-22 11:54:28.063 17906 ERROR tooz.drivers.redis self.gen.throw(type, value, traceback)
2016-11-22 11:54:28.063 17906 ERROR tooz.drivers.redis File "/nagiosdata/gnocchi/venv/local/lib/python2.7/site-packages/tooz/drivers/redis.py", line 54, in _translate_failures
2016-11-22 11:54:28.063 17906 ERROR tooz.drivers.redis cause=e)
2016-11-22 11:54:28.063 17906 ERROR tooz.drivers.redis File "/nagiosdata/gnocchi/venv/local/lib/python2.7/site-packages/tooz/coordination.py", line 763, in raise_with_cause
2016-11-22 11:54:28.063 17906 ERROR tooz.drivers.redis excutils.raise_with_cause(exc_cls, message, *args, **kwargs)
2016-11-22 11:54:28.063 17906 ERROR tooz.drivers.redis File "/nagiosdata/gnocchi/venv/local/lib/python2.7/site-packages/oslo_utils/excutils.py", line 143, in raise_with_cause
2016-11-22 11:54:28.063 17906 ERROR tooz.drivers.redis six.raise_from(exc_cls(message, *args, **kwargs), kwargs.get('cause'))
2016-11-22 11:54:28.063 17906 ERROR tooz.drivers.redis File "/nagiosdata/gnocchi/venv/local/lib/python2.7/site-packages/six.py", line 718, in raise_from
2016-11-22 11:54:28.063 17906 ERROR tooz.drivers.redis raise value
2016-11-22 11:54:28.063 17906 ERROR tooz.drivers.redis ToozError: Cannot extend an unlocked lock
2016-11-22 11:54:28.063 17906 ERROR tooz.drivers.redis

Julien Danjou (jdanjou)
Changed in python-tooz:
status: Fix Released → Triaged
Revision history for this message
gordon chung (chungg) wrote :

Yeah, I definitely still get it in my test environment... but it seems to eventually resolve itself, so maybe we should catch it in Gnocchi?
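A sketch of the defensive wrapper this comment suggests, with a stand-in `ToozError` and a hypothetical `safe_heartbeat` helper (not Gnocchi's actual code): treat a failed heartbeat as transient and retry on the next period instead of letting it propagate.

```python
import logging

LOG = logging.getLogger(__name__)


class ToozError(Exception):
    """Stand-in for tooz's error class in this sketch."""


def safe_heartbeat(coordinator):
    """Swallow transient heartbeat failures; the next cycle retries."""
    try:
        coordinator.heartbeat()
        return True
    except ToozError:
        LOG.warning("heartbeat failed; lock may have expired, will retry")
        return False


class FlakyCoordinator:
    """Simulates the failure seen in the logs above."""

    def heartbeat(self):
        raise ToozError("Cannot extend an unlocked lock")


ok = safe_heartbeat(FlakyCoordinator())
print("heartbeat ok:", ok)
```

The trade-off is that a genuinely lost lock is only noticed later, which is why the eventual fix went into tooz itself rather than the caller.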

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to tooz (master)

Fix proposed to branch: master
Review: https://review.openstack.org/402055

Changed in python-tooz:
assignee: Joshua Harlow (harlowja) → Julien Danjou (jdanjou)
status: Triaged → In Progress
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to tooz (master)

Reviewed: https://review.openstack.org/402055
Committed: https://git.openstack.org/cgit/openstack/tooz/commit/?id=486524c37fff2c826f934ac40fd07e2003074569
Submitter: Jenkins
Branch: master

commit 486524c37fff2c826f934ac40fd07e2003074569
Author: Julien Danjou <email address hidden>
Date: Thu Nov 24 14:54:23 2016 +0100

    redis: make sure we don't release and heartbeat a lock at the same time

    Otherwise that makes Redis unhappy.

    Change-Id: I4659bee93b2112733fe76d329bf73327878c5641
    Closes-Bug: #1557593
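The idea of this fix can be sketched with a plain `threading.Lock`, assuming (hypothetically) that one mutex guards both paths: release() and heartbeat() never run concurrently on the same lock, and a heartbeat that finds the lock already released simply skips instead of crashing.

```python
import threading


class GuardedLock:
    """Sketch of the commit's approach: serialize release() and
    heartbeat() so the heartbeat never extends a just-released lock."""

    def __init__(self):
        self._exclusive = threading.Lock()
        self._held = False

    def acquire(self):
        with self._exclusive:
            self._held = True

    def release(self):
        with self._exclusive:
            self._held = False           # TTL key would be deleted here

    def heartbeat(self):
        with self._exclusive:
            if not self._held:
                return False             # released meanwhile: skip, don't raise
            # the TTL would be extended here
            return True


lock = GuardedLock()
lock.acquire()
print(lock.heartbeat())   # extended
lock.release()
print(lock.heartbeat())   # skipped, no "Cannot extend an unlocked lock"
```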

Changed in python-tooz:
status: In Progress → Fix Released
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/tooz 1.47.0

This issue was fixed in the openstack/tooz 1.47.0 release.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to tooz (master)

Related fix proposed to branch: master
Review: https://review.openstack.org/447074

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to tooz (master)

Fix proposed to branch: master
Review: https://review.openstack.org/447199

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to tooz (master)

Reviewed: https://review.openstack.org/447074
Committed: https://git.openstack.org/cgit/openstack/tooz/commit/?id=c76ce8c2fceb00ed53050c204d14e21aac541e82
Submitter: Jenkins
Branch: master

commit c76ce8c2fceb00ed53050c204d14e21aac541e82
Author: Julien Danjou <email address hidden>
Date: Fri Mar 17 18:17:09 2017 +0100

    Enhance heartbeat sleep timer

    Change-Id: I5cb9ceea69a62ccb2d3f00f1d419f32e3ce7dc5b
    Related-Bug: #1557593

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to tooz (master)

Reviewed: https://review.openstack.org/447199
Committed: https://git.openstack.org/cgit/openstack/tooz/commit/?id=059f2ae5406a325017ad181ace1e5b7c5bff0d22
Submitter: Jenkins
Branch: master

commit 059f2ae5406a325017ad181ace1e5b7c5bff0d22
Author: Julien Danjou <email address hidden>
Date: Sat Mar 18 10:33:57 2017 +0100

    redis: fix concurrent access on acquire()

    Redis-py is not thread safe in its lock management, which means the
    following can happen:
    1. lock.acquire()
    2a. lock.release() ---\_ at the same time for all 2. operations
    2b. lock.acquire() ---/
    2c. lock.heartbeat() /

    which means heartbeat() can run while an already-acquired lock is in the
    process of being released and re-acquired in parallel:
    1. acquire() acquires the lock (self._lock.acquire())
    2. release() releases the lock (self._lock.release()) and removes it from
       _acquired_locks, where it may not be (discard() does not raise on a
       non-existing element)
    3. acquire() puts the lock in _acquired_locks
    4. heartbeat() iterates on _acquired_locks and tries to extend a lock that
       indeed was acquired but was in the meantime released by Redis.

    Change-Id: Ib9549b44485cb15de312ec9dfa57b148cb45e2c9
    Closes-Bug: #1557593

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/tooz 1.54.0

This issue was fixed in the openstack/tooz 1.54.0 release.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to tooz (stable/newton)

Fix proposed to branch: stable/newton
Review: https://review.openstack.org/478131

Revision history for this message
OpenStack Infra (hudson-openstack) wrote :

Fix proposed to branch: stable/newton
Review: https://review.openstack.org/478196

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to tooz (stable/newton)

Reviewed: https://review.openstack.org/478196
Committed: https://git.openstack.org/cgit/openstack/tooz/commit/?id=44e488e46da756227554be3319cb6e7497324426
Submitter: Jenkins
Branch: stable/newton

commit 44e488e46da756227554be3319cb6e7497324426
Author: Julien Danjou <email address hidden>
Date: Thu Nov 24 14:54:23 2016 +0100

    redis: make sure we don't release and heartbeat a lock at the same time

    Otherwise that makes Redis unhappy.

    Change-Id: I4659bee93b2112733fe76d329bf73327878c5641
    Closes-Bug: #1557593
    (cherry picked from commit 486524c37fff2c826f934ac40fd07e2003074569)

tags: added: in-stable-newton
Revision history for this message
OpenStack Infra (hudson-openstack) wrote :

Reviewed: https://review.openstack.org/478131
Committed: https://git.openstack.org/cgit/openstack/tooz/commit/?id=1b5b21e7c01998bbb972920ec869e51d1c20f105
Submitter: Jenkins
Branch: stable/newton

commit 1b5b21e7c01998bbb972920ec869e51d1c20f105
Author: Julien Danjou <email address hidden>
Date: Sat Mar 18 10:33:57 2017 +0100

    redis: fix concurrent access on acquire()

    Redis-py is not thread safe in its lock management, which means the
    following can happen:
    1. lock.acquire()
    2a. lock.release() ---\_ at the same time for all 2. operations
    2b. lock.acquire() ---/
    2c. lock.heartbeat() /

    which means heartbeat() can run while an already-acquired lock is in the
    process of being released and re-acquired in parallel:
    1. acquire() acquires the lock (self._lock.acquire())
    2. release() releases the lock (self._lock.release()) and removes it from
       _acquired_locks, where it may not be (discard() does not raise on a
       non-existing element)
    3. acquire() puts the lock in _acquired_locks
    4. heartbeat() iterates on _acquired_locks and tries to extend a lock that
       indeed was acquired but was in the meantime released by Redis.

    Change-Id: Ib9549b44485cb15de312ec9dfa57b148cb45e2c9
    Closes-Bug: #1557593
    (cherry picked from commit 059f2ae5406a325017ad181ace1e5b7c5bff0d22)

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/tooz 1.43.1

This issue was fixed in the openstack/tooz 1.43.1 release.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to tooz (stable/ocata)

Fix proposed to branch: stable/ocata
Review: https://review.openstack.org/484880

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to tooz (stable/ocata)

Reviewed: https://review.openstack.org/484880
Committed: https://git.openstack.org/cgit/openstack/tooz/commit/?id=0dc996b92dd66a5da5e75966df3a59fabc6df420
Submitter: Jenkins
Branch: stable/ocata

commit 0dc996b92dd66a5da5e75966df3a59fabc6df420
Author: Julien Danjou <email address hidden>
Date: Sat Mar 18 10:33:57 2017 +0100

    redis: fix concurrent access on acquire()

    Redis-py is not thread safe in its lock management, which means the
    following can happen:
    1. lock.acquire()
    2a. lock.release() ---\_ at the same time for all 2. operations
    2b. lock.acquire() ---/
    2c. lock.heartbeat() /

    which means heartbeat() can run while an already-acquired lock is in the
    process of being released and re-acquired in parallel:
    1. acquire() acquires the lock (self._lock.acquire())
    2. release() releases the lock (self._lock.release()) and removes it from
       _acquired_locks, where it may not be (discard() does not raise on a
       non-existing element)
    3. acquire() puts the lock in _acquired_locks
    4. heartbeat() iterates on _acquired_locks and tries to extend a lock that
       indeed was acquired but was in the meantime released by Redis.

    Change-Id: Ib9549b44485cb15de312ec9dfa57b148cb45e2c9
    Closes-Bug: #1557593
    (cherry picked from commit 059f2ae5406a325017ad181ace1e5b7c5bff0d22)

tags: added: in-stable-ocata
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/tooz 1.48.2

This issue was fixed in the openstack/tooz 1.48.2 release.

Revision history for this message
Michal Arbet (michalarbet) wrote :

Hi,

We are facing something similar with:

(cinder-volume)[root@cn36x005 /]# apt policy python3-redis
python3-redis:
  Installed: 3.3.11-3~bpo10+1
  Candidate: 3.3.11-3~bpo10+1
  Version table:
*** 3.3.11-3~bpo10+1 100
        100 /var/lib/dpkg/status
(cinder-volume)[root@cn36x005 /]# apt policy python3-tooz
python3-tooz:
  Installed: 2.7.2-1~bpo10+1
  Candidate: 2.7.2-1~bpo10+1
  Version table:
*** 2.7.2-1~bpo10+1 100
        100 /var/lib/dpkg/status
(cinder-volume)[root@cn36x005 /]#

LOG:

2022-04-04 09:57:11.558 1074 WARNING tooz.drivers.redis [-] Unable to heartbeat lock '<tooz.drivers.redis.RedisLock object at 0x7fdcda6ee208>': tooz.ToozError: Cannot reacquire a lock that's no longer owned
2022-04-04 09:57:11.558 1074 ERROR tooz.drivers.redis Traceback (most recent call last):
2022-04-04 09:57:11.558 1074 ERROR tooz.drivers.redis File "/usr/lib/python3/dist-packages/tooz/drivers/redis.py", line 58, in wrapper
2022-04-04 09:57:11.558 1074 ERROR tooz.drivers.redis return func(*args, **kwargs)
2022-04-04 09:57:11.558 1074 ERROR tooz.drivers.redis File "/usr/lib/python3/dist-packages/tooz/drivers/redis.py", line 132, in heartbeat
2022-04-04 09:57:11.558 1074 ERROR tooz.drivers.redis self._lock.reacquire()
2022-04-04 09:57:11.558 1074 ERROR tooz.drivers.redis File "/usr/lib/python3/dist-packages/redis/lock.py", line 265, in reacquire
2022-04-04 09:57:11.558 1074 ERROR tooz.drivers.redis return self.do_reacquire()
2022-04-04 09:57:11.558 1074 ERROR tooz.drivers.redis File "/usr/lib/python3/dist-packages/redis/lock.py", line 272, in do_reacquire
2022-04-04 09:57:11.558 1074 ERROR tooz.drivers.redis raise LockNotOwnedError("Cannot reacquire a lock that's"
2022-04-04 09:57:11.558 1074 ERROR tooz.drivers.redis redis.exceptions.LockNotOwnedError: Cannot reacquire a lock that's no longer owned
2022-04-04 09:57:11.558 1074 ERROR tooz.drivers.redis
2022-04-04 09:57:11.558 1074 ERROR tooz.drivers.redis The above exception was the direct cause of the following exception:
2022-04-04 09:57:11.558 1074 ERROR tooz.drivers.redis
2022-04-04 09:57:11.558 1074 ERROR tooz.drivers.redis Traceback (most recent call last):
2022-04-04 09:57:11.558 1074 ERROR tooz.drivers.redis File "/usr/lib/python3/dist-packages/tooz/drivers/redis.py", line 544, in heartbeat
2022-04-04 09:57:11.558 1074 ERROR tooz.drivers.redis lock.heartbeat()
2022-04-04 09:57:11.558 1074 ERROR tooz.drivers.redis File "/usr/lib/python3/dist-packages/tooz/drivers/redis.py", line 74, in wrapper
2022-04-04 09:57:11.558 1074 ERROR tooz.drivers.redis cause=e)
2022-04-04 09:57:11.558 1074 ERROR tooz.drivers.redis File "/usr/lib/python3/dist-packages/tooz/utils.py", line 224, in raise_with_cause
2022-04-04 09:57:11.558 1074 ERROR tooz.drivers.redis excutils.raise_with_cause(exc_cls, message, *args, **kwargs)
2022-04-04 09:57:11.558 1074 ERROR tooz.drivers.redis File "/usr/lib/python3/dist-packages/oslo_utils/excutils.py", line 143, in raise_with_cause
2022-04-04 09:57:11.558 1074 ERROR tooz.drivers.redis six.raise_from(exc_cls(message, *args, **kwargs), kwargs.get('cause'))
2022-04-04 09:57:11.558 1074 ERROR tooz.drivers.redis File "<string>", line 3, ...

