abandoned lock files

Bug #1432387 reported by John Griffith
This bug affects 5 people
Affects             Status         Importance   Assigned to       Milestone
Cinder              Fix Released   Medium       Gorka Eguileor
oslo.concurrency    Confirmed      Wishlist     Unassigned

Bug Description

I still contend that the new trend of using lockfiles for everything is not a great idea; but the real problem is that everybody just throws a lock decorator up and forgets about it. Turns out, we never seem to clean up our lock files. If you check out /opt/stack/data/cinder/ after a tempest run for example there's a boat load of lock files left around. Seems like folks are really good at creating lockfiles, but not so good at removing them.

Also kinda makes me wonder if they're even useful anyway but that's a whole different bug.

Tags: files lock
Changed in cinder:
importance: Undecided → High
Changed in cinder:
assignee: nobody → Michal Dulko (michal-dulko-f)
status: New → In Progress
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to cinder (master)

Fix proposed to branch: master
Review: https://review.openstack.org/164745

Revision history for this message
Michal Dulko (michal-dulko-f) wrote :

Turns out that implementing removal of unused locks isn't trivial in oslo_concurrency. I was able to come up with a somewhat hacky solution; comments are very welcome.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on cinder (master)

Change abandoned by Michal Dulko (<email address hidden>) on branch: master
Review: https://review.openstack.org/164745
Reason: I've done more thinking and I don't see a way to safely delete the lock files. There's always a race condition between checking for other processes waiting on a lock file and deleting it.

In that time window a process can open the lock file and wait to acquire the lock. Then the lock file gets deleted and the waiting process may hang. This is certainly unacceptable.

Revision history for this message
Michal Dulko (michal-dulko-f) wrote :

I've done some research on how to implement this and I think it is impossible right now. I'll try to explain in more detail.

First of all, you need to realize that oslo_concurrency file locks use fcntl (https://github.com/openstack/oslo.concurrency/blob/master/oslo_concurrency/lockutils.py#L261). This means that the existence of a lock file doesn't indicate that the lock is taken. When the lock is acquired, fcntl.LOCK_EX is set on it, and only that indicates that someone owns the lock.

While acquiring the lock, the file is opened in append mode by the process (https://github.com/openstack/oslo.concurrency/blob/master/oslo_concurrency/lockutils.py#L200), which creates the file if it doesn't exist. The lock file *isn't* removed when the lock is released, because other processes may be waiting for the lock, and removing the file would make it impossible to set fcntl.LOCK_EX on the file descriptor they own, since it would point to a removed file. In my tests the processes simply hung.
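To make that concrete, here is a minimal sketch of the scheme described above (my own illustration, not the actual oslo_concurrency/fasteners code): the file is just a handle for the fcntl lock, so it is created on open and never removed on release.

    import fcntl

    def acquire(path):
        handle = open(path, 'a')            # creates the lock file if missing
        fcntl.lockf(handle, fcntl.LOCK_EX)  # blocking exclusive lock; this, not
                                            # the file's existence, marks ownership
        return handle

    def release(handle):
        fcntl.lockf(handle, fcntl.LOCK_UN)  # drop the lock...
        handle.close()                      # ...but the file stays on disk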

This means that to remove the file we need to be sure that there are no other processes waiting. We could do that by iterating through the processes and looking at their open files. Unfortunately, a new process can still open the file between our check and the removal, which would leave it hanging.

Therefore I think it is impossible to safely remove the lock file in the current implementation of oslo_concurrency. A possible solution would be to make oslo's FileLock open the file again on each attempt to acquire the lock, so that if the file gets removed a new one is created.

Changed in cinder:
assignee: Michal Dulko (michal-dulko-f) → nobody
Alex Xu (xuhj)
Changed in cinder:
assignee: nobody → Alex Xu (xuhj)
Revision history for this message
Ryan McNair (rdmcnair) wrote :

The lock files are left around because the fasteners library used for locking closes files after releasing them, as opposed to deleting them (https://github.com/harlowja/fasteners/blob/master/fasteners/process_lock.py#L169). This makes sense for the current implementation, as other processes waiting on the lock will already have the same file open, so deleting the file would cause further complications.

Some options that jump to mind for fixing are:
    1. Modify fasteners/process_lock.trylock to check each time that the file exists and, if it doesn't, re-open it (like Michal suggested). The downside would be a performance hit from calling os.path.isfile every time a process tries to re-grab a lock (a rough sketch of this approach follows the list).
    2. Have some periodic cleanup that deletes locks which haven't been used in a certain amount of time. We'd have to guess at the maximum amount of time a locked operation can take. This could be useful for deleting locks related to volumes which have already been deleted.
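For illustration, here is a rough sketch of option 1 (my own approximation, not the fasteners code): keep re-opening the lock file until we hold a lock on a file that is still present at the path. Comparing inodes rather than just calling os.path.isfile also catches the case where the file was replaced while we waited.

    import fcntl
    import os

    def acquire_reopening(path):
        while True:
            handle = open(path, 'a')            # re-creates the file if deleted
            fcntl.lockf(handle, fcntl.LOCK_EX)  # block until we own this inode
            try:
                alive = os.stat(path).st_ino == os.fstat(handle.fileno()).st_ino
            except OSError:
                alive = False                   # file vanished while we waited
            if alive:
                return handle                   # lock held on the live file
            handle.close()                      # stale file; try again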

FWIW this issue is not only in Cinder - it's in other projects as well (look in /opt/stack/data/nova/instances/locks). The only place I've found where we currently clean up these locks is in nova/virt/libvirt/imagecache, where we call lockutils.remove_external_lock_file.

Revision history for this message
Joshua Harlow (harlowja) wrote :

Perhaps use something like:

    import os
    import fasteners

    class LockBucket(object):
        def __init__(self, lock_path, amount,
                     lock_prefix='', lock_postfix='.lock'):
            if amount <= 0:
                raise ValueError("At least one lock must be created")
            self.lock_path = lock_path
            self.locks = []
            for i in range(0, amount):
                i_lock_path = os.path.join(lock_path,
                                           "%s%s%s" % (lock_prefix, i + 1,
                                                       lock_postfix))
                self.locks.append(fasteners.InterProcessLock(i_lock_path))

        def __getitem__(self, item):
            m = hash(item)
            r = m % len(self.locks)
            return self.locks[r]

    lock_dir = "/tmp/cinder-locks"
    try:
        os.mkdir(lock_dir)
    except OSError:
        pass  # directory already exists
    locks = LockBucket(lock_dir, 10)
    print(locks['b'])

Creating random arbitrary locks per instance or per resource is not going to end well for anyone, since, as you discovered, deleting locks (actually deleting the file associated with that lock) across processes is not easily made safe, because other processes trying to wait on the lock still refer to the old file handle. The above instead creates a stable bucket of a known number of locks that will exist forever and can be well defined to be only X locks (what X is needs to be decided, but 100 seems reasonable). Then there will never be more than 100 locks, but there will also never be more than 100 concurrent requests for those locks, so 100 should probably equal the maximum number of greenthreads that can be active per process (so that a greenthread is never unable to get a lock/starved).

Do note that http://semanchuk.com/philip/sysv_ipc/ and others also have this same problem...

'''
remove()

    Removes (deletes) the semaphore from the system.

    As far as I can tell, the effect of deleting a semaphore that other processes are still using is OS-dependent. Check your system's man pages for semctl(IPC_RMID).
'''

http://semanchuk.com/philip/posix_ipc/ is, I think, closer to what you want; that library has the following:

'''
unlink()

    Destroys the semaphore, with a caveat. If any processes have the semaphore open when unlink is called, the call to unlink returns immediately but destruction of the semaphore is postponed until all processes have closed the semaphore.

    Note, however, that once a semaphore has been unlinked, calls to open() with the same name should refer to a new semaphore. Sound confusing? It is, and you'd probably be wise to structure your code so as to avoid this situation.
'''

I forget which one (sysv_ipc or posix_ipc) supports releasing the lock automatically on unintentional process death (kill -9), but only one of the two does (file-based locks do this automatically, because when a process is killed its held file locks are released)...

Revision history for this message
Joshua Harlow (harlowja) wrote :

Found it: http://linux.die.net/man/2/semop describes SEM_UNDO, but that's only sysv_ipc, and see the remove() caveat quoted above for sysv_ipc locks... Sad face...

Revision history for this message
Joshua Harlow (harlowja) wrote :

Another option that just came to mind: you can acquire a lock on an offset within a larger file. So, like the lock buckets above, there could be a fixed number of offsets in a single lock file, say 100, and each lock in the bucket would reference an offset. That way there would be only a single lock file used for all of cinder (although the number of locks/offsets, 100, 1000 or other, would need to be determined). A rough sketch of the idea is below.
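For what it's worth, here is a rough sketch of the byte-offset idea using plain fcntl.lockf (my own illustration, not the fasteners PR; the class and method names are made up):

    import fcntl
    import os

    class OffsetLockBucket(object):
        # Hash keys onto byte ranges of a single shared lock file.

        def __init__(self, lock_path, amount=100):
            self.amount = amount
            self.fd = os.open(lock_path, os.O_RDWR | os.O_CREAT, 0o644)
            os.ftruncate(self.fd, amount)       # one byte per possible lock

        def acquire(self, key):
            offset = hash(key) % self.amount
            # Lock one byte at the key's offset; other offsets stay usable.
            fcntl.lockf(self.fd, fcntl.LOCK_EX, 1, offset, os.SEEK_SET)
            return offset

        def release(self, offset):
            fcntl.lockf(self.fd, fcntl.LOCK_UN, 1, offset, os.SEEK_SET)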

Revision history for this message
Joshua Harlow (harlowja) wrote :

For the last point, https://github.com/harlowja/fasteners/pull/10 makes this possible.

Then cinder can just have one lock file and use something like:

    import os

    import fasteners

    class LockBucket(object):
        def __init__(self, lock_path, amount):
            # make_offset_locks is the helper proposed in fasteners PR #10:
            # a single shared file with each lock tied to its own byte offset.
            self.locks = fasteners.InterProcessLock.make_offset_locks(lock_path,
                                                                      amount)
            self.lock_path = lock_path

        def __getitem__(self, item):
            m = hash(item)
            r = m % len(self.locks)
            return self.locks[r]

    lock_path = "/tmp/cinder.lock"
    locks = LockBucket(lock_path, 100)

    # then get locks from the above lock bucket

Revision history for this message
Joshua Harlow (harlowja) wrote :

I'd prefer the comment #9 solution imho, create 10000 locks in that bucket (each lock consumes a byte) and have fun...

Revision history for this message
Michal Dulko (michal-dulko-f) wrote :

Byte-range locks? So now locks would actually *use* disk space, but not that much… What about collisions? This could slow down the system, as many unrelated operations could end up blocking each other.

Revision history for this message
Ryan McNair (rdmcnair) wrote :

Another possibility for cleaning up the lock files is to run a periodic task which deletes any lock files corresponding to volumes that have been deleted. We can have some delay between a volume being successfully deleted and when we remove its lock files, giving time for other processes holding the same lock to work through the pipeline.

This is sort of a hack around the fact that we can't tell if other processes are using a lock - we know that once a volume is deleted, nobody else will come into the pipeline and grab a lock for that non-existent volume.

I'll see if I can get this cleanup working; otherwise I'll try out Josh's suggested approach using lock buckets.
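For illustration, a sketch of what that cleanup might look like (a hypothetical helper; the lock directory and file-naming assumptions are mine, not Cinder's actual layout - the real primitive referenced earlier is lockutils.remove_external_lock_file):

    import os

    def cleanup_locks_for_deleted_volume(lock_dir, volume_id):
        # Assumes lock file names embed the volume UUID and that enough time
        # has passed for any in-flight holders of the lock to finish.
        for name in os.listdir(lock_dir):
            if volume_id in name:
                try:
                    os.remove(os.path.join(lock_dir, name))
                except OSError:
                    pass  # already removed by another worker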

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to cinder (master)

Fix proposed to branch: master
Review: https://review.openstack.org/239678

Changed in cinder:
assignee: Alex Xu (xuhj) → Ryan McNair (rdmcnair)
Revision history for this message
Joshua Harlow (harlowja) wrote :

Up to you guys:

https://github.com/harlowja/fasteners/pull/10 allows for byte locks,

One could use that, for example, to make a 1MB lock file; if a 1MB lock file, which can hold 1,048,576 locks, isn't enough for you then I'm not sure what cinder is doing. Of course, choose a smaller/larger size as needed...

Revision history for this message
Duncan Thomas (duncan-thomas) wrote :

The trouble isn't the number of locks, it is the fact we need to tie locks to a volume id without collisions. Honestly, I don't think that leaving some zero byte files around until the next reboot is all that bad, at least for now. Certainly not worth breaking things to fix - it's a cosmetic issue.

Changed in cinder:
importance: High → Low
importance: Low → Medium
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on cinder (master)

Change abandoned by Ryan McNair (<email address hidden>) on branch: master
Review: https://review.openstack.org/239678
Reason: Doesn't seem to be a lot of interest in fixing this with the current locking scheme (seems it's more a minor annoyance currently than a real issue). Other good alternatives were proposed but would/will require significantly more work (e.g. byte-offset locks or removing these locks altogether).

Revision history for this message
Ryan McNair (rdmcnair) wrote :

Doesn't seem to be a lot of interest in fixing this with the current locking scheme (seems it's more a minor annoyance currently than a real issue). Other good alternatives were proposed but would/will require significantly more work (e.g. byte-offset locks or removing these locks altogether).

I'm going to abandon my patch and un-assign myself for the time being.

Changed in cinder:
assignee: Ryan McNair (rdmcnair) → nobody
Robin Naundorf (senk)
Changed in cinder:
status: In Progress → Confirmed
Revision history for this message
Szymon Wróblewski (bluex) wrote :

Now that we support tooz and can use any supported locking backend, is this still an important issue?
File locks are used by default and they still leave files behind, but the locking mechanism can easily be changed to something that doesn't leave temporary files behind.

Eric Harney (eharney)
Changed in cinder:
status: Confirmed → Incomplete
Ben Nemec (bnemec)
Changed in oslo.concurrency:
status: New → Confirmed
importance: Undecided → Wishlist
Changed in cinder:
status: Incomplete → Won't Fix
Revision history for this message
Eduard Barrera (eduard-barrera) wrote :

This problem is making some system administration tasks really difficult due to the large number of files to process.

Revision history for this message
Herve Beraud (herveberaud) wrote :

I proposed some patches through heat templates and puppet-cinder to remove lock files older than 1 week and avoid the file system growing.

This is a cron-based solution to fix the issue on stable branches. As a second step I'll help the fasteners project fix the root cause by reviewing and testing the proposed patch (locks based on file offsets). In future versions I hope we will use a patched fasteners, so we could drop the cron-based solution.

Please can you take a look at:
- https://review.opendev.org/688413
- https://review.opendev.org/688414
- https://review.opendev.org/688415

Revision history for this message
Gorka Eguileor (gorka) wrote :

We should provide something so users can live with the current situation.
I'm proposing 2 patches:
- Make cinder automatically clean locks when a resource (volume or snapshot) is deleted
- Command in cinder-manage to remove locks

Changed in cinder:
assignee: nobody → Gorka Eguileor (gorka)
status: Won't Fix → Confirmed
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to cinder (master)

Fix proposed to branch: master
Review: https://review.opendev.org/734144

Changed in cinder:
status: Confirmed → In Progress
tags: added: files lock
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to cinder (master)

Reviewed: https://review.opendev.org/c/openstack/cinder/+/689486
Committed: https://opendev.org/openstack/cinder/commit/d2f6ec5569b1c1b68f567392cfc03a6476169b4b
Submitter: "Zuul (22348)"
Branch: master

commit d2f6ec5569b1c1b68f567392cfc03a6476169b4b
Author: Gorka Eguileor <email address hidden>
Date: Fri Oct 18 15:51:13 2019 +0200

    Remove file locks once we delete a resource

    File locks are never removed from the system, so they keep increasing in
    the locks directory, which can become problematic.

    In this patch we start trying to delete these lock files when we delete
    a volume or a snapshot.

    This affects the 2 type of file locks we currently have:
    - Using oslo lockutils synchronized with external=True
    - Using coordination.synchronized when deployed in Active-Passive and no
      DLM

    This will alleviate the ever increasing files in the locks directory.
    Deployment tools should implement a service that runs when the host is
    booting and cleans of the locks directory before the OpenStack services
    are started.

    Partial-Bug: #1432387
    Change-Id: Ic73ee64257aeb024383c6cb79f2e8c04810aaf69

Revision history for this message
OpenStack Infra (hudson-openstack) wrote :

Reviewed: https://review.opendev.org/c/openstack/cinder/+/734144
Committed: https://opendev.org/openstack/cinder/commit/dbcbfefe517268c3a2159b3bcd2d9a865b9d0a81
Submitter: "Zuul (22348)"
Branch: master

commit dbcbfefe517268c3a2159b3bcd2d9a865b9d0a81
Author: Gorka Eguileor <email address hidden>
Date: Mon Jun 8 18:35:06 2020 +0200

    Add cinder-manage command to remove file locks

    This patch adds a new subgroup of commands called `util` with a single
    command ``remove_locks`` that takes care of deleting locks that will no
    longer be used because the resources they are related to are no longer
    there.

    This works for volumes and snapshots and with the cinder services
    running or stopped.

    It gets the file locks filtering by name (must have the "cinder-" prefix
    and a UUID) and then if the service is online it will not delete file
    locks for resources that are still present in the database.

    Closes-Bug: #1432387
    Change-Id: I2535017e112c8bcb9a2e516876f52a945e9c7da8

Changed in cinder:
status: In Progress → Fix Released
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to cinder (master)

Related fix proposed to branch: master
Review: https://review.opendev.org/c/openstack/cinder/+/804071

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to cinder (master)

Reviewed: https://review.opendev.org/c/openstack/cinder/+/804071
Committed: https://opendev.org/openstack/cinder/commit/d4c112de26c415d430d8bac629b23b5c0f1fff66
Submitter: "Zuul (22348)"
Branch: master

commit d4c112de26c415d430d8bac629b23b5c0f1fff66
Author: Gorka Eguileor <email address hidden>
Date: Tue Aug 10 14:07:22 2021 +0200

    Fix cinder-manage clean_locks command

    On change I2535017e112c8bcb9a2e516876f52a945e9c7da8 the cinder-manage
    clean_locks command was introduced, but it has a bug where the OVO
    `exists` method is incorrectly called, since it's missing a context.

    This patch fixes this issue and properly calls the Snapshot and Volume
    `exists` method.

    Related-Bug: #1432387
    Change-Id: Ibed1e92dd63d7bcc60b4a9c4a05b15402c9eefa9

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/cinder 19.0.0.0rc1

This issue was fixed in the openstack/cinder 19.0.0.0rc1 release candidate.
