ESTALE in cluster configuration

Bug #1839438 reported by Hiroyuki Homma
Affects: GNU Mailman
Status: New
Importance: Undecided
Assigned to: Unassigned

Bug Description

I am running a two-server Mailman cluster in which the archives/data/lists/locks/spam directories are shared.
The qfiles and logs directories are placed on each server's local volumes.

Our environment is:
CentOS 7.6
Mailman 2.1.29
GlusterFS 5.2 for the shared volume.

When I sent 1000 messages to the same list over 500 seconds (2 messages per second), about 20 of them were shunted because of 'Stale file handle' errors.

Aug 06 15:14:45 2019 (15817) Uncaught runner exception: [Errno 116] Stale file handle
Aug 06 15:14:45 2019 (15817) Traceback (most recent call last):
  File "/usr/lib/mailman/Mailman/Queue/Runner.py", line 119, in _oneloop
    self._onefile(msg, msgdata)
  File "/usr/lib/mailman/Mailman/Queue/Runner.py", line 165, in _onefile
    mlist = self._open_list(listname)
  File "/usr/lib/mailman/Mailman/Queue/Runner.py", line 208, in _open_list
    mlist = MailList.MailList(listname, lock=False)
  File "/usr/lib/mailman/Mailman/MailList.py", line 133, in __init__
    self.Load()
  File "/usr/lib/mailman/Mailman/MailList.py", line 692, in Load
    dict, e = self.__load(file)
  File "/usr/lib/mailman/Mailman/MailList.py", line 663, in __load
    dict = loadfunc(fp)
IOError: [Errno 116] Stale file handle

Aug 06 15:14:45 2019 (15817) SHUNTING: 1565072084.945903+914cbad4e11aaa0523b7492edba5f4836db939d1

This happens when the recipient list's config.pck file is replaced by the other server while it is being read.
ESTALE can occur normally on shared volumes, and in most cases simply retrying the open/read is enough to recover from the error.
So I think retry logic should be implemented in the MailList.__load() method.
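
For illustration, here is a rough sketch of the kind of retry I have in mind; the load_with_retry name and the retries/delay parameters are hypothetical, not existing Mailman code. __load() could wrap its open/unpickle in something like this:

import errno
import time
import cPickle

def load_with_retry(dbfile, retries=3, delay=0.1):
    # Load a config.pck, re-opening the file if the handle on the
    # shared volume goes stale because the other node replaced it.
    for attempt in range(retries + 1):
        try:
            fp = open(dbfile, 'rb')
            try:
                return cPickle.load(fp)
            finally:
                fp.close()
        except IOError as e:
            # ESTALE (errno 116): the file was replaced under us;
            # re-opening gets a fresh handle, so retry a few times.
            if e.errno != errno.ESTALE or attempt == retries:
                raise
            time.sleep(delay)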

Hiroyuki Homma (hhomma) wrote:

ESTALE errors can also occur when closing a file, in LockFile.py and MailList.py (see the sketch after the tracebacks below):

Aug 28 12:13:14 2019 (25794) Uncaught runner exception: [Errno 116] Stale file handle
Aug 28 12:13:14 2019 (25794) Traceback (most recent call last):
  File "/usr/lib/mailman/Mailman/Queue/Runner.py", line 119, in _oneloop
    self._onefile(msg, msgdata)
  File "/usr/lib/mailman/Mailman/Queue/Runner.py", line 190, in _onefile
    keepqueued = self._dispose(mlist, msg, msgdata)
  File "/usr/lib/mailman/Mailman/Queue/IncomingRunner.py", line 115, in _dispose
    mlist.Lock(timeout=mm_cfg.LIST_LOCK_TIMEOUT)
  File "/usr/lib/mailman/Mailman/MailList.py", line 164, in Lock
    self.__lock.lock(timeout)
  File "/usr/lib/mailman/Mailman/LockFile.py", line 288, in lock
    elif self.__read() == self.__tmpfname:
  File "/usr/lib/mailman/Mailman/LockFile.py", line 432, in __read
    fp.close()
IOError: [Errno 116] Stale file handle

Aug 28 12:13:14 2019 (25794) SHUNTING: 1566961993.343576+720c68684e735dafb08c192202bfe1454009a201

Aug 29 13:05:15 2019 (27163) Uncaught runner exception: [Errno 116] Stale file handle
Aug 29 13:05:15 2019 (27163) Traceback (most recent call last):
  File "/usr/lib/mailman/Mailman/Queue/Runner.py", line 119, in _oneloop
    self._onefile(msg, msgdata)
  File "/usr/lib/mailman/Mailman/Queue/Runner.py", line 165, in _onefile
    mlist = self._open_list(listname)
  File "/usr/lib/mailman/Mailman/Queue/Runner.py", line 208, in _open_list
    mlist = MailList.MailList(listname, lock=False)
  File "/usr/lib/mailman/Mailman/MailList.py", line 133, in __init__
    self.Load()
  File "/usr/lib/mailman/Mailman/MailList.py", line 692, in Load
    dict, e = self.__load(file)
  File "/usr/lib/mailman/Mailman/MailList.py", line 670, in __load
    fp.close()
IOError: [Errno 116] Stale file handle

Aug 29 13:05:15 2019 (27163) SHUNTING: 1567051501.717527+7b52807884fed08cbdb6f523d1eddef1779ddfdb
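
Purely as an illustration of the close case (the close_ignoring_estale helper is made up, not part of Mailman): once the data has already been read successfully, an ESTALE raised by close() could simply be swallowed in LockFile.__read() and MailList.__load(), e.g.:

import errno

def close_ignoring_estale(fp):
    # The contents were already read, so a stale handle reported at
    # close time on the shared volume can be safely ignored.
    try:
        fp.close()
    except IOError as e:
        if e.errno != errno.ESTALE:
            raise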
