MailList configuration race condition between qrunners

Bug #1077908 reported by Jan Kaluza
This bug affects 1 person
Affects: GNU Mailman
Status: Incomplete
Importance: Medium
Assigned to: Mark Sapiro
Milestone: none

Bug Description

Hi,

If two qrunner processes access a list's configuration file within the same second, the second qrunner process does not reload the configuration file, because it thinks the file has not changed. The decision is based on the config file's modification time, which has only one-second resolution (no sub-second precision) on older systems.

This breaks mlist.post_id incrementation if the first process is IncomingRunner and the second is ArchRunner, and it leads to a situation where two emails get the same sequence number.

The flow is as follows (everything happens within the same second; a sketch of the timestamp check follows the steps):

1. AR (ArchRunner) loads the config file with post_id=1.
2. AR saves the config file with post_id=1.
3. IR (IncomingRunner) loads the config file with post_id=1.
4. IR increments post_id and saves the config file with post_id=2.
5. AR loads the config file, but mlist.__timestamp == mtime, so it thinks it already has the latest config file.
6. AR uses post_id=1 (instead of post_id=2).
7. AR saves the config file with post_id=1 (this is the bug: post_id=2 is replaced with post_id=1 in the config file).
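
For illustration, here is a minimal, self-contained Python sketch of a second-granularity mtime check of this kind (ListConfig and its save()/load() methods are invented names for this sketch, not Mailman's actual MailList code):

    import os
    import pickle

    class ListConfig:
        """Illustrative stand-in for a MailList-style object (not Mailman code)."""

        def __init__(self, path):
            self.path = path
            self._timestamp = 0       # mtime recorded at the last load/save
            self.post_id = 1

        def save(self):
            with open(self.path, 'wb') as fp:
                pickle.dump({'post_id': self.post_id}, fp)
            self._timestamp = os.path.getmtime(self.path)

        def load(self):
            # getmtime() has one-second resolution on older filesystems, so a
            # file rewritten by another runner in the same second looks unchanged.
            mtime = os.path.getmtime(self.path)
            if mtime <= self._timestamp:
                return                # stale in-memory post_id is kept (the bug)
            with open(self.path, 'rb') as fp:
                self.post_id = pickle.load(fp)['post_id']
            self._timestamp = mtime

With a check like this, step 5 in the flow above returns early, and the stale post_id=1 then overwrites post_id=2 in step 7.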

The second problem is that during the GENERAL_PIPELINE, in some cases the list lock is released and the next mlist.Save() fails. The patch checks for this case and tries to Lock() the mailing list again.
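
For reference, a minimal sketch of the guard the patch describes, assuming a MailList-style object that provides Locked(), Lock(), and Save() (this shows the idea only, not the actual patch):

    def save_with_relock(mlist):
        # If some handler released the list lock earlier in the pipeline,
        # re-acquire it so that mlist.Save() does not fail with NotLockedError.
        if not mlist.Locked():
            mlist.Lock()
        mlist.Save()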

Revision history for this message
Jan Kaluza (jkaluza) wrote :

Hm, thinking about it more, my patch isn't the best fix for the race condition. I think it now loads the config file every time. But you get the idea of what's wrong there.

Revision history for this message
Mark Sapiro (msapiro) wrote :

What Mailman version are you running? I think this duplicates LP:266464, which was fixed in Mailman 2.1.15. Please confirm.

Changed in mailman:
assignee: nobody → Mark Sapiro (msapiro)
importance: Undecided → Medium
milestone: none → 2.1.15
status: New → Fix Committed
Revision history for this message
Mark Sapiro (msapiro) wrote :

Regarding your second problem of the list being unlocked by a handler during processing of the pipeline (I assume you mean GLOBAL_PIPELINE, not GENERAL_PIPELINE), I have never seen a report of this. Can you say which handler or under what circumstance this happens?

Revision history for this message
Mark Sapiro (msapiro) wrote :

Also see the post at <http://mail.python.org/pipermail/mailman-developers/2011-September/021524.html> and other posts linked therefrom.

Revision history for this message
Jan Kaluza (jkaluza) wrote :

Sorry, this is mailman-2.1.9 and mailman-2.1.12. You are right that it's fixed in 2.1.15, thanks.

This is my debugging log showing the second problem. I reproduced it only once; my test case was sending a lot of emails per second (I basically sent 10000 emails from multiple threads).

It prints "SAVE post_id" or "LOAD post_id" at the end of the mlist.Save() and mlist.Load() methods, followed by a traceback (to find out what called those methods - this was useful for debugging the first problem). You can see the lock problem at the end.
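
Reconstructed from the tracebacks below, the instrumentation is essentially one extra syslog() call at the end of each method (the import is the one MailList.py already uses; exact placement is approximate):

    import traceback
    from Mailman.Logging.Syslog import syslog

    # added as the last statement of MailList.Save():
    syslog('error', 'SAVE %d\n%s\n', self.post_id, repr(traceback.format_stack()))

    # added as the last statement of MailList.Load():
    syslog('error', 'LOAD %d\n%s\n', self.post_id, repr(traceback.format_stack()))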

Oct 15 06:47:27 2012 (20378) SAVE 9345
[' File "/usr/lib/mailman/bin/qrunner", line 278, in ?\n main()\n',
 ' File "/usr/lib/mailman/bin/qrunner", line 238, in main\n qrunner.run()\n',
 ' File "/usr/lib/mailman/Mailman/Queue/Runner.py", line 71, in run\n filecnt = self._oneloop()\n',
 ' File "/usr/lib/mailman/Mailman/Queue/Runner.py", line 113, in _oneloop\n self._onefile(msg, msgdata)\n',
 ' File "/usr/lib/mailman/Mailman/Queue/Runner.py", line 176, in _onefile\n keepqueued = self._dispose(mlist, msg, msgdata)\n',
 ' File "/usr/lib/mailman/Mailman/Queue/IncomingRunner.py", line 133, in _dispose\n mlist.Save()\n',
 ' File "/usr/lib/mailman/Mailman/MailList.py", line 561, in Save\n syslog(\'error\', \'SAVE %d\\n%s\\n\', self.post_id, repr(traceback.format_stack()))\n']

Oct 15 06:47:28 2012 (20378) SAVE 9346
[' File "/usr/lib/mailman/bin/qrunner", line 278, in ?\n main()\n',
 ' File "/usr/lib/mailman/bin/qrunner", line 238, in main\n qrunner.run()\n',
 ' File "/usr/lib/mailman/Mailman/Queue/Runner.py", line 71, in run\n filecnt = self._oneloop()\n',
 ' File "/usr/lib/mailman/Mailman/Queue/Runner.py", line 113, in _oneloop\n self._onefile(msg, msgdata)\n',
 ' File "/usr/lib/mailman/Mailman/Queue/Runner.py", line 176, in _onefile\n keepqueued = self._dispose(mlist, msg, msgdata)\n',
 ' File "/usr/lib/mailman/Mailman/Queue/IncomingRunner.py", line 133, in _dispose\n mlist.Save()\n',
 ' File "/usr/lib/mailman/Mailman/MailList.py", line 561, in Save\n syslog(\'error\', \'SAVE %d\\n%s\\n\', self.post_id, repr(traceback.format_stack()))\n']

Oct 15 06:47:28 2012 (20380) LOAD 9346
[' File "/usr/lib/mailman/bin/qrunner", line 278, in ?\n main()\n',
 ' File "/usr/lib/mailman/bin/qrunner", line 238, in main\n qrunner.run()\n',
 ' File "/usr/lib/mailman/Mailman/Queue/Runner.py", line 71, in run\n filecnt = self._oneloop()\n',
 ' File "/usr/lib/mailman/Mailman/Queue/Runner.py", line 113, in _oneloop\n self._onefile(msg, msgdata)\n',
 ' File "/usr/lib/mailman/Mailman/Queue/Runner.py", line 176, in _onefile\n keepqueued = self._dispose(mlist, msg, msgdata)\n',
 ' File "/usr/lib/mailman/Mailman/Queue/OutgoingRunner.py", line 70, in _dispose\n mlist.Load()\n',
 ' File "/usr/lib/mailman/Mailman/MailList.py", line 666, in Load\n syslog(\'error\', \'LOAD %d\\n%s\\n\', self.post_id, repr(traceback.format_stack()))\n']

Oct 15 06:47:28 2012 (20378) Uncaught runner exception: <LockFile -1211698068: /var/lock/mailman/testlist.lock [unlocked: 18000sec] pid=20378>: None
Oct 15 06:47:28 2012 (20378) Traceback (most recent call last):
  File "/usr/lib/...


Mark Sapiro (msapiro)
Changed in mailman:
status: Fix Committed → Triaged
status: Triaged → Incomplete
Revision history for this message
Mark Sapiro (msapiro) wrote :

Reopening this bug for the second issue only. The first issue is a duplicate of LP:266464. The second issue requires further study.

Changed in mailman:
milestone: 2.1.15 → none
Revision history for this message
Mark Sapiro (msapiro) wrote :

It seems that the NotLockedError and traceback at the end of comment #6 occurred when IncomingRunner attempted to Save() the list (two log messages earlier in comment #6) and found it not locked, or possibly this was the Save() after the next message, which wasn't logged yet.

In either case, this may have only occurred because of LP:266464. Pending further information and evidence that this can occur with the fix for LP:266464 installed, I'm leaving this 'incomplete'.

Revision history for this message
Jan Kaluza (jkaluza) wrote :

When I reproduced this second bug, my fix for the first bug had not been applied yet. I was not able to reproduce the second bug with your patch for LP:266464. I think you can close this bug. Thank you for all the information and help.
