The master-qrunner lock file causes issues when running in a cluster

Bug #1082308 reported by Axis Communications
This bug affects 1 person
Affects: GNU Mailman
Status: Fix Released
Importance: Low
Assigned to: Mark Sapiro

Bug Description

Hi

It is possible to run Mailman in a failover or load-balancing cluster; see:
http://wiki.list.org/pages/viewpage.action?pageId=4030621

When running a cluster, it is crucial to use:
* a shared directory for list and archive data
* a shared directory for locks
* separate queue (qfiles) and log directories for each node

This can be done by setting the directories in mm_cfg.py, for example as follows (where <host> is the node's host name):
import os
VAR_PREFIX = '<shared dir>'
LIST_DATA_DIR = os.path.join(VAR_PREFIX, 'lists')
LOCK_DIR = os.path.join(VAR_PREFIX, 'locks')
DATA_DIR = os.path.join(VAR_PREFIX, 'data')
SPAM_DIR = os.path.join(VAR_PREFIX, 'spam')
PUBLIC_ARCHIVE_FILE_DIR = os.path.join(VAR_PREFIX, 'archives', 'public')
PRIVATE_ARCHIVE_FILE_DIR = os.path.join(VAR_PREFIX, 'archives', 'private')
# For logs and qfiles, <dir>-<host> is used to avoid conflicts between nodes
LOG_DIR = os.path.join(VAR_PREFIX, 'logs-<host>')
QUEUE_DIR = os.path.join(VAR_PREFIX, 'qfiles-<host>')
INQUEUE_DIR = os.path.join(QUEUE_DIR, 'in')
OUTQUEUE_DIR = os.path.join(QUEUE_DIR, 'out')
CMDQUEUE_DIR = os.path.join(QUEUE_DIR, 'commands')
BOUNCEQUEUE_DIR = os.path.join(QUEUE_DIR, 'bounces')
NEWSQUEUE_DIR = os.path.join(QUEUE_DIR, 'news')
ARCHQUEUE_DIR = os.path.join(QUEUE_DIR, 'archive')
SHUNTQUEUE_DIR = os.path.join(QUEUE_DIR, 'shunt')
VIRGINQUEUE_DIR = os.path.join(QUEUE_DIR, 'virgin')
BADQUEUE_DIR = os.path.join(QUEUE_DIR, 'bad')
RETRYQUEUE_DIR = os.path.join(QUEUE_DIR, 'retry')
MAILDIR_DIR = os.path.join(QUEUE_DIR, 'maildir')
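Instead of hard-coding <host> in each node's copy of mm_cfg.py, the per-host names could be derived when the file is loaded. This is a minimal sketch, assuming every node reports a unique, stable name from socket.gethostname(); the helper per_host_dir and the path '/srv/mailman-shared' are hypothetical placeholders, not part of Mailman.

```python
import os
import socket


def per_host_dir(var_prefix, name, host=None):
    """Return <var_prefix>/<name>-<host> (hypothetical helper, not Mailman API).

    If host is omitted, the local host name is used, so an identical
    mm_cfg.py can be deployed on every node of the cluster.
    """
    if host is None:
        host = socket.gethostname()
    return os.path.join(var_prefix, '%s-%s' % (name, host))


# Example use inside mm_cfg.py; the shared path is a placeholder.
VAR_PREFIX = '/srv/mailman-shared'
LOG_DIR = per_host_dir(VAR_PREFIX, 'logs')
QUEUE_DIR = per_host_dir(VAR_PREFIX, 'qfiles')
```

The individual queue directories (INQUEUE_DIR, OUTQUEUE_DIR and so on) would still be joined onto QUEUE_DIR exactly as in the listing above.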

Unfortunately, the master-qrunner lock causes problems with this setup. mailmanctl -s start succeeds even if a master-qrunner lock file already exists (provided that no mailmanctl is running on the local host), so the service can be brought up on more than one host. Once a day, however, mailmanctl checks whether it still holds the lock, and if it does not, it shuts down. Because all nodes share the same lock file, at most one node can hold it, and the service will be shut down on every other node.
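The failure mode described above can be illustrated with a toy model. This is a simplified sketch, not Mailman's LockFile implementation: the class SharedLock and the function daily_check are hypothetical, and starting a node is modelled as simply overwriting the recorded owner, following the description of mailmanctl -s above.

```python
class SharedLock:
    """Toy stand-in for one lock file; records only which host claimed it last."""

    def __init__(self):
        self.owner = None

    def claim(self, host):
        # Models a start that proceeds even though the file already exists,
        # replacing whatever owner was recorded before.
        self.owner = host

    def held_by(self, host):
        return self.owner == host


def daily_check(lock, host):
    """Return True if the master on `host` keeps running, False if it shuts down."""
    return lock.held_by(host)


nodes = ['node1', 'node2']

# One master-qrunner file shared by all nodes: both start, one loses it.
shared = SharedLock()
for node in nodes:
    shared.claim(node)
running = {node: daily_check(shared, node) for node in nodes}
print(running)  # {'node1': False, 'node2': True}

# One lock file per node, as proposed below: every daily check passes.
per_node = {node: SharedLock() for node in nodes}
for node in nodes:
    per_node[node].claim(node)
assert all(daily_check(per_node[node], node) for node in nodes)
```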

To solve this, I propose that the LOCKFILE name in mailmanctl be made configurable, so that instead of:
LOCKFILE = os.path.join(mm_cfg.LOCK_DIR, 'master-qrunner')
we have:
LOCKFILE = os.path.join(mm_cfg.LOCK_DIR, mm_cfg.QRUNNER_LOCK_FILE)

Then add QRUNNER_LOCK_FILE = 'master-qrunner' to Defaults.py, so that the default behaviour is unchanged.

This would make it easy to give each node in a cluster its own master qrunner lock file, by overriding the setting in that node's mm_cfg.py.
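A sketch of how the proposed setting could fit together, assuming the name QRUNNER_LOCK_FILE from the proposal. The helpers node_lock_file_name and master_lock_path are hypothetical; in Mailman itself the default would live in Defaults.py, the per-node override in mm_cfg.py, and the path computation in mailmanctl.

```python
import os
import socket

# Defaults.py (proposed): keep the historical name as the default.
QRUNNER_LOCK_FILE = 'master-qrunner'


def node_lock_file_name(host=None):
    """Hypothetical mm_cfg.py helper: a distinct lock file name per node."""
    if host is None:
        host = socket.gethostname()
    return '%s-%s' % (QRUNNER_LOCK_FILE, host)


def master_lock_path(lock_dir, lock_file_name=QRUNNER_LOCK_FILE):
    """What mailmanctl's LOCKFILE would evaluate to under the proposal."""
    return os.path.join(lock_dir, lock_file_name)


# Single host, no override: same path as today.
single = master_lock_path('/var/lock/mailman')

# Cluster node with a shared LOCK_DIR (placeholder path) and its own lock name.
clustered = master_lock_path('/srv/mailman-shared/locks', node_lock_file_name('node1'))
```

With no override, the lock path is the same master-qrunner file as before, so single-host installations would not be affected by the change.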

Tags: cluster


Mark Sapiro (msapiro)
Changed in mailman:
assignee: nobody → Mark Sapiro (msapiro)
importance: Undecided → Low
milestone: none → 2.1.16
status: New → Fix Committed
Mark Sapiro (msapiro)
Changed in mailman:
milestone: 2.1.16 → 2.1.16rc1
status: Fix Committed → Fix Released