In/Out error when acquiring lock

Bug #882261 reported by Johan Hake on 2011-10-26
This bug affects 1 person
Affects: flufl.lock
Importance: Medium
Assigned to: Barry Warsaw

Bug Description

Pretty often I submit a lot of runs on a cluster, all writing lock files to the same NFS-mounted directory in my own home area. Out of some 50 submitted runs, I often get 4 failing runs. They all fail with:
  File "/share/apps/si2011/mesoscale/lib/python2.6/site-packages/flufl.lock-2.1.1-py2.6.egg/flufl/lock/_lockfile.py", line 258, in lock
    elif self._read() == self._claimfile:
  File "/share/apps/si2011/mesoscale/lib/python2.6/site-packages/flufl.lock-2.1.1-py2.6.egg/flufl/lock/_lockfile.py", line 430, in _read
    return fp.read()
IOError: [Errno 5] Input/output error

I am not sure what causes the Input/output error. You already catch another error, ENOENT, but not EIO, which is the one I am hitting.

I have not generated a small script that reproduces the error as it only turns up when running on a cluster, but I see it almost every time I run my simulations.

Johan

Johan Hake (johan-hake) wrote :

In addition to the error above, I also sometimes get a failure at the same place with this error:

  IOError: [Errno 116] Stale NFS file handle

I suspect that my file system is playing games with me here. I have tried using the following updated _read method:

    def _read(self):
        """Read the contents of our lock file.

        :return: The contents of the lock file or None if the lock file does
            not exist.
        """
        try:
            with open(self._lockfile) as fp:
                return fp.read()
        except EnvironmentError as error:
            # Avoid problems occurring when reading a file on an NFS disk
            if error.errno in [errno.EIO, errno.ESTALE]:
                self._sleep()
                return self._read()
            if error.errno != errno.ENOENT:
                raise
            return None

and it looks like it fixed my problem. I am not sure whether it is safe to catch the EIO and ESTALE errors and retry, though.

Barry Warsaw (barry) on 2012-01-20
Changed in flufl.lock:
status: New → Triaged
importance: Undecided → Medium
assignee: nobody → Barry Warsaw (barry)
milestone: none → 2.2
Barry Warsaw (barry) wrote :

Hi Johan. I'm not sure if the retry is safe either. My main concern with it is that you could end up recursing quite a bit if the problems are persistent. I think that's solvable by using a loop instead of recursion though.

One possibility is to leave retry up to the application. Given the current situation, you'd have to wrap a few API calls in try/excepts, including .refresh(), .lock(), .is_locked(), .transfer_to(), and .take_possession(). As you note though, we're already checking for ENOENT, which means the file is missing. I don't know whether it's safe to treat EIO and ESTALE as a missing file, but I suspect the semantics are different enough not to treat things that way.
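As a rough sketch of what application-side retry could look like, the helper below wraps any call and retries it when it fails with one of a chosen set of errnos. The name `retry_on_errnos` and its `attempts` and `delay` parameters are hypothetical and not part of flufl.lock. The loop is bounded, so a persistent failure eventually propagates instead of retrying forever.

```python
import errno
import time


def retry_on_errnos(func, errnos=(errno.EIO, errno.ESTALE),
                    attempts=5, delay=0.1):
    """Call func(), retrying on EnvironmentError with a listed errno.

    Hypothetical helper for illustration; not part of flufl.lock.
    """
    if attempts < 1:
        raise ValueError('attempts must be at least 1')
    for attempt in range(attempts):
        try:
            return func()
        except EnvironmentError as error:
            # Re-raise unexpected errnos, and the final failed attempt.
            if error.errno not in errnos or attempt == attempts - 1:
                raise
            # Give the file system a moment before trying again.
            time.sleep(delay)
```

An application would then write something like `retry_on_errnos(lock.lock)`, and similarly for the other calls listed above.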

One option, I suppose, is to expose the set of errnos caught in the except clause of _read() through the API, giving you an easy way to override or extend them. It wouldn't solve the retry issue, since that would still be up to your application. I don't know if that would be acceptable to you. Here's an example of what I'm thinking of:

    def _read(self):
        """Read the contents of our lock file.

        :return: The contents of the lock file or None if the lock file does
            not exist.
        """
        while True:
            try:
                with open(self._lockfile) as fp:
                    return fp.read()
            except EnvironmentError as error:
                if error.errno in RETRY_ERRNOS:
                    # Retry the read on the next loop iteration.
                    continue
                elif error.errno != errno.ENOENT:
                    raise
                else:
                    # ENOENT: the lock file does not exist.
                    return None

Now, RETRY_ERRNOS would probably end up being an attribute of LockFile, either passed in the constructor or, more likely, set in a separate method (or a property with a getter and setter). It would likely be initialized to the empty set, but you could then set it to a list containing EIO and ESTALE for your application.

The more I think about it, the more I like this, so I will put something like this in 2.2. I'll have to think about how to mock up a test case for it.

Barry Warsaw (barry) wrote :

 * Provide a new API for dealing with possible additional unexpected errnos
   while trying to read the lock file. These can happen in some NFS
   environments. If you want to retry the read, set the lock file's
   `retry_errnos` property to a sequence of errnos. If one of those errnos
   occurs, the read is unconditionally (and infinitely) retried.
   `retry_errnos` is a property which must be set to a sequence; it has a
   getter and a deleter too. (LP: #882261)
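
The behaviour described in this entry can be sketched roughly as follows. This is not the actual flufl.lock code: `SketchLock` is a hypothetical stand-in class, and only the `retry_errnos` property name and the `_read` method name come from this report. It assumes the deleter resets the property to an empty list.

```python
import errno


class SketchLock(object):
    """Hypothetical stand-in for illustration; not flufl.lock's code."""

    def __init__(self, lockfile):
        self._lockfile = lockfile
        self._retry_errnos = []

    @property
    def retry_errnos(self):
        # Return a copy so callers cannot mutate the internal list.
        return self._retry_errnos[:]

    @retry_errnos.setter
    def retry_errnos(self, errnos):
        self._retry_errnos = list(errnos)

    @retry_errnos.deleter
    def retry_errnos(self):
        # Assumption: deleting restores the default of no retries.
        self._retry_errnos = []

    def _read(self):
        """Return the lock file contents, or None if it does not exist."""
        while True:
            try:
                with open(self._lockfile) as fp:
                    return fp.read()
            except EnvironmentError as error:
                if error.errno in self._retry_errnos:
                    # Unconditional retry, with no limit on attempts.
                    continue
                if error.errno != errno.ENOENT:
                    raise
                return None
```

With a class like this, an application on a flaky NFS mount would set `lock.retry_errnos = [errno.EIO, errno.ESTALE]` before acquiring the lock, and `del lock.retry_errnos` to go back to the default.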

Changed in flufl.lock:
status: Triaged → Fix Committed
Barry Warsaw (barry) on 2012-01-20
Changed in flufl.lock:
status: Fix Committed → Fix Released