MailList configuration race condition between qrunners

Bug #1077908 reported by Jan Kaluza
This bug affects 1 person
Affects: GNU Mailman
Status: Incomplete
Importance: Medium
Assigned to: Mark Sapiro
Milestone: none

Bug Description

Hi,

If two qrunner processes access a list's configuration file within the same second, the second qrunner process does not reload the configuration file, because it thinks the file has not changed. The decision is based on the config file's modification time, which has only one-second resolution (no sub-second precision) on older systems.

This breaks mlist.post_id incrementation if the first process is IncomingRunner and the second is ArchRunner, and it leads to a situation where two emails get the same sequence number.

The flow is as follows (everything happens within the same second; a sketch of the timestamp check follows the steps):

1. AR (ArchRunner) loads the config file with post_id=1.
2. AR saves the config file with post_id=1.
3. IR (IncomingRunner) loads the config file with post_id=1.
4. IR increments post_id and saves the config file with post_id=2.
5. AR loads the config file, but mlist.__timestamp == mtime, so it thinks it already has the latest config file.
6. AR uses post_id=1 (instead of post_id=2).
7. AR saves the config file with post_id=1 (this is the bug: post_id=2 is replaced with post_id=1 in the config file).
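
For illustration, here is a minimal, self-contained Python sketch of a second-granularity mtime check of this kind (ListConfig and its save()/load() methods are invented names for this sketch, not Mailman's actual MailList code):

    import os
    import pickle

    class ListConfig:
        """Illustrative stand-in for a MailList-style object (not Mailman code)."""

        def __init__(self, path):
            self.path = path
            self._timestamp = 0       # mtime recorded at the last load/save
            self.post_id = 1

        def save(self):
            with open(self.path, 'wb') as fp:
                pickle.dump({'post_id': self.post_id}, fp)
            self._timestamp = os.path.getmtime(self.path)

        def load(self):
            # getmtime() has one-second resolution on older filesystems, so a
            # file rewritten by another runner in the same second looks unchanged.
            mtime = os.path.getmtime(self.path)
            if mtime <= self._timestamp:
                return                # stale in-memory post_id is kept (the bug)
            with open(self.path, 'rb') as fp:
                self.post_id = pickle.load(fp)['post_id']
            self._timestamp = mtime

With a check like this, step 5 in the flow above returns early, and the stale post_id=1 then overwrites post_id=2 in step 7.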

The second problem is that during the GENERAL_PIPELINE, in some cases the list lock is released and the next mlist.Save() fails. The patch checks for this case and tries to Lock() the mailing list again.
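
For reference, a minimal sketch of the guard the patch describes, assuming a MailList-style object that provides Locked(), Lock(), and Save() (this shows the idea only, not the actual patch):

    def save_with_relock(mlist):
        # If some handler released the list lock earlier in the pipeline,
        # re-acquire it so that mlist.Save() does not fail with NotLockedError.
        if not mlist.Locked():
            mlist.Lock()
        mlist.Save()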

Revision history for this message
Jan Kaluza (jkaluza) wrote :

Hm, thinking about it more, my patch isn't the best fix for the race condition. I think it now loads the config file every time. But you get the idea of what's wrong there.

Revision history for this message
Mark Sapiro (msapiro) wrote :

What Mailman version are you running? I think this duplicates LP:266464, which was fixed in Mailman 2.1.15. Please confirm.

Changed in mailman:
assignee: nobody → Mark Sapiro (msapiro)
importance: Undecided → Medium
milestone: none → 2.1.15
status: New → Fix Committed
Revision history for this message
Mark Sapiro (msapiro) wrote :

Regarding your second problem of the list being unlocked by a handler during processing of the pipeline (I assume you mean GLOBAL_PIPELINE, not GENERAL_PIPELINE), I have never seen a report of this. Can you say which handler or under what circumstance this happens?

Revision history for this message
Mark Sapiro (msapiro) wrote :

Also see the post at <http://mail.python.org/pipermail/mailman-developers/2011-September/021524.html> and other posts linked therefrom.

Revision history for this message
Jan Kaluza (jkaluza) wrote :

Sorry, this is mailman-2.1.9 and mailman-2.1.12. You are right that it's fixed in 2.1.15, thanks.

This is my debugging log showing the second problem. I reproduced it only once; my test case was sending a lot of emails per second (I basically sent 10000 emails from multiple threads).

It prints "SAVE post_id" or "LOAD post_id" at the end of the mlist.Save() and mlist.Load() methods, followed by a traceback (to find out what called those methods - this was useful for debugging the first problem). You can see the lock problem at the end.
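
Reconstructed from the tracebacks below, the instrumentation is essentially one extra syslog() call at the end of each method (the import is the one MailList.py already uses; exact placement is approximate):

    import traceback
    from Mailman.Logging.Syslog import syslog

    # added as the last statement of MailList.Save():
    syslog('error', 'SAVE %d\n%s\n', self.post_id, repr(traceback.format_stack()))

    # added as the last statement of MailList.Load():
    syslog('error', 'LOAD %d\n%s\n', self.post_id, repr(traceback.format_stack()))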

Oct 15 06:47:27 2012 (20378) SAVE 9345
[' File "/usr/lib/mailman/bin/qrunner", line 278, in ?\n main()\n',
 ' File "/usr/lib/mailman/bin/qrunner", line 238, in main\n qrunner.run()\n',
 ' File "/usr/lib/mailman/Mailman/Queue/Runner.py", line 71, in run\n filecnt = self._oneloop()\n',
 ' File "/usr/lib/mailman/Mailman/Queue/Runner.py", line 113, in _oneloop\n self._onefile(msg, msgdata)\n',
 ' File "/usr/lib/mailman/Mailman/Queue/Runner.py", line 176, in _onefile\n keepqueued = self._dispose(mlist, msg, msgdata)\n',
 ' File "/usr/lib/mailman/Mailman/Queue/IncomingRunner.py", line 133, in _dispose\n mlist.Save()\n',
 ' File "/usr/lib/mailman/Mailman/MailList.py", line 561, in Save\n syslog(\'error\', \'SAVE %d\\n%s\\n\', self.post_id, repr(traceback.format_stack()))\n']

Oct 15 06:47:28 2012 (20378) SAVE 9346
[' File "/usr/lib/mailman/bin/qrunner", line 278, in ?\n main()\n',
 ' File "/usr/lib/mailman/bin/qrunner", line 238, in main\n qrunner.run()\n',
 ' File "/usr/lib/mailman/Mailman/Queue/Runner.py", line 71, in run\n filecnt = self._oneloop()\n',
 ' File "/usr/lib/mailman/Mailman/Queue/Runner.py", line 113, in _oneloop\n self._onefile(msg, msgdata)\n',
 ' File "/usr/lib/mailman/Mailman/Queue/Runner.py", line 176, in _onefile\n keepqueued = self._dispose(mlist, msg, msgdata)\n',
 ' File "/usr/lib/mailman/Mailman/Queue/IncomingRunner.py", line 133, in _dispose\n mlist.Save()\n',
 ' File "/usr/lib/mailman/Mailman/MailList.py", line 561, in Save\n syslog(\'error\', \'SAVE %d\\n%s\\n\', self.post_id, repr(traceback.format_stack()))\n']

Oct 15 06:47:28 2012 (20380) LOAD 9346
[' File "/usr/lib/mailman/bin/qrunner", line 278, in ?\n main()\n',
 ' File "/usr/lib/mailman/bin/qrunner", line 238, in main\n qrunner.run()\n',
 ' File "/usr/lib/mailman/Mailman/Queue/Runner.py", line 71, in run\n filecnt = self._oneloop()\n',
 ' File "/usr/lib/mailman/Mailman/Queue/Runner.py", line 113, in _oneloop\n self._onefile(msg, msgdata)\n',
 ' File "/usr/lib/mailman/Mailman/Queue/Runner.py", line 176, in _onefile\n keepqueued = self._dispose(mlist, msg, msgdata)\n',
 ' File "/usr/lib/mailman/Mailman/Queue/OutgoingRunner.py", line 70, in _dispose\n mlist.Load()\n',
 ' File "/usr/lib/mailman/Mailman/MailList.py", line 666, in Load\n syslog(\'error\', \'LOAD %d\\n%s\\n\', self.post_id, repr(traceback.format_stack()))\n']

Oct 15 06:47:28 2012 (20378) Uncaught runner exception: <LockFile -1211698068: /var/lock/mailman/testlist.lock [unlocked: 18000sec] pid=20378>: None
Oct 15 06:47:28 2012 (20378) Traceback (most recent call last):
  File "/usr/lib/...


Mark Sapiro (msapiro)
Changed in mailman:
status: Fix Committed → Triaged
status: Triaged → Incomplete
Revision history for this message
Mark Sapiro (msapiro) wrote :

Reopening this bug for the second issue only. The first issue is a duplicate of LP:266464. The second issue requires further study.

Changed in mailman:
milestone: 2.1.15 → none
Revision history for this message
Mark Sapiro (msapiro) wrote :

It seems that the NotLockedError and traceback at the end of comment #6 occurred when IncomingRunner attempted to Save() the list (two log messages earlier in comment #6) and found it not locked, or possibly this was the Save() after the next message, which wasn't logged yet.

In either case, this may have only occurred because of LP:266464. Pending further information and evidence that this can occur with the fix for LP:266464 installed, I'm leaving this 'incomplete'.

Revision history for this message
Jan Kaluza (jkaluza) wrote :

When I reproduced this second bug, my fix for the first bug had not been applied yet. I was not able to reproduce the second bug with your patch for LP:266464. I think you can close this bug. Thank you for all the information and help.
