busy looping qrunner keeps CPU needlessly awake

Bug #346006 reported by Heinrich Langos
Affects: GNU Mailman
Status: New
Importance: Wishlist
Assigned to: Barry Warsaw

Bug Description

I just ran powertop on my mostly idle server, and it seems as if mailman's various qrunner processes keep waking up the CPU, mostly just to see that there is nothing to do.

> Wakeups-from-idle per second : 46.5 interval: 15.0s
> no ACPI power usage estimate available
>
> Top causes for wakeups:
> 21.2% ( 7.0) python : schedule_timeout (process_timeout)
> 17.3% ( 5.7) <interrupt> : eth0
> 11.5% ( 3.8) <interrupt> : extra timer interrupt

I don't know how mailman works internally, so I can't even begin to guess if there is too short a timeout somewhere in a select() call waiting for network connections, or some thread scanning a directory.

Anyway, it keeps waking up the CPU and thus wastes a lot of energy. Don't get me wrong: it doesn't actually produce any significant CPU load, but those wakeups cause context switches, TLB flushes, and in general wasted CPU cycles that translate into wasted energy and higher latencies for the processes that may actually need the CPU.

Taking into account the number of installations that mailman has worldwide I consider this behaviour a grave bug.

Revision history for this message
Mark Sapiro (msapiro) wrote :

This is not a bug. The queue runners sleep for QRUNNER_SLEEP_TIME when they have nothing to do. The default is 1 second. If you would like them to sleep longer, set QRUNNER_SLEEP_TIME in mm_cfg.py.

Note however that this is not an interrupt driven process. The qrunners have to poll their queues, so don't set the time too long or performance will suffer.

Changed in mailman:
assignee: nobody → msapiro
status: New → Invalid
Revision history for this message
Heinrich Langos (henrik-launchpad) wrote :

Could you explain WHY exactly they have to poll their queues?
How are those queues implemented? (Are they sockets or file descriptors, or are they directories that are scanned for new entries?)

How exactly would performance suffer? I guess (hope) that it would only hurt latency, not throughput (i.e. when there is more than one entry in the queue, they should be handled in one go, without sleeping QRUNNER_SLEEP_TIME between entries).

Anyway .. it's the 21st century... polling should be a thing of the past. ;-)

Revision history for this message
Heinrich Langos (henrik-launchpad) wrote :

Sorry, but I consider polling a bug. 9 out of 10 times it is laziness on the implementer's side, and the rest are design flaws.

Changed in mailman:
status: Invalid → New
Revision history for this message
Mark Sapiro (msapiro) wrote :

The queues are directories that are scanned for new entries. And, yes, the effect of increasing QRUNNER_SLEEP_TIME would be increased latency.

The current polling architecture predates my involvement with Mailman, and I am unable to comment on the design decisions. You are welcome to consider it a bug and to submit patches to fix it. For my part, I will consider this a feature request.

Revision history for this message
Barry Warsaw (barry) wrote :

Mark is right, this is not a bug. You might not like the design decision but that doesn't mean the current operation is buggy.

Ten years ago polling was the only reliable cross-platform solution for discovering new queue files to process. I think it's probably still the only cross-platform solution; however, I would accept patches that implement, say, an inotify-based callback machinery, as long as the system still provides the same high responsiveness and robustness with or without inotify.

Changed in mailman:
assignee: msapiro → barry
importance: Undecided → Wishlist
Revision history for this message
Heinrich Langos (henrik-launchpad) wrote :

I'm glad you see it at least as a feature request. I'll continue to see it as a bug and I'd fix it myself ... but I don't know any python. So, I'm just a mailman user complaining and ranting... :-)

I don't know if there already is a wrapper for different environments, but on Linux there is the
"python fam" library ( http://sf.net/projects/python-fam ), and here are three (well, two; the first solution is polling again) ways to do it on Windows:
http://timgolden.me.uk/python/win32_how_do_i/watch_directory_for_changes.html

The best way to do it would be to wrap that "waiting for changes" into a function and do the best you can to avoid polling. Of course you'd need to avoid doing the checks for your OS and the installed support libraries inside that wait-for-changes call, as they would probably be quite expensive; they'd have to be done at install/start time.

BTW: Who writes into those directories?
If it is an external application I'd call it dangerous (How do you know it is done writing?)
If it is just a different part of mailman, then I'd definitely call it bad design. Assuming the other part of mailman moves the already written and closed file into the directory with a (hopefully atomic) move/rename() call. Otherwise it would be a race condition. In any case, it is inter-process communication implemented via the file system... </rant> :-)

Revision history for this message
Barry Warsaw (barry) wrote :

Well, I don't care much about supporting Windows, but we have to support all the major *nixes, e.g. Linux, Mac OS X, Solaris, etc.

Mailman mostly writes to those directories, but other processes can do it too, which is absolutely a design requirement. It's up to them to do the proper atomic renames. It's not a Mailman bug if they break the queues.

File system "IPC" is useful if you care about never losing a message <wink>. I consider the minor overhead of polling to be worth it.

As for learning Python, well, it's easy. You should give it a shot. :)

Is the polling causing you a real problem, or do you just not like it?

Revision history for this message
Heinrich Langos (henrik-launchpad) wrote :

Hi Barry,
> Well, I don't care much about supporting Windows, but we have to support all the
> major *nixes, e.g. Linux, Mac OS X, Solaris, etc.

Unfortunately, I don't think there's a POSIX API for watching directories. FAM seems to be quite portable, and I hope somebody will write the glue code between FAM and the Mac's "File System Events" API. But as of now there is no complete cross-platform way of doing it.

> Mailman mostly writes to those directories, but other processes can do it too,
> which is absolutely a design requirement.
So I take it this is a "supported API" that you can't abandon and replace with a completely different IPC mechanism. OK. But it wouldn't hurt to make it more efficient, would it?

> It's not a Mailman bug if they break the queues.
It may not be your bug, but it may be your problem .. but OK. That's your decision.

> Files system "IPC" is useful if you care about never losing a message <wink>. I consider
> the minor overhead of polling to be worth it.

Well, if you don't fsync() after each close(), you may be better off with a system that keeps the original message around until the whole transaction (including all cascading actions by child/sibling processes) is finished and committed to disk. Without that fsync(), no amount of filesystem IPC will keep your message safer than keeping it in memory. In fact, a crash of a system that does lots of temporary work on the filesystem will usually be more severe than a crash of a system that has a few well-defined recovery points where stuff goes to disk but otherwise keeps its state in memory.

Anyway, even if you need to do IPC via the filesystem, there should be no need for polling. At least on Linux you could use FAM without losing anything. In fact, you would gain performance, since your qrunners would be woken up as soon as something arrives, not half a polling interval later.

> Is the polling causing you a real problem, or do you just not like it?
It wakes up my CPU (and every other user's CPU) seven times per second. Do you guys have numbers on the installed mailman systems in the world? Try adding up the wasted CPU cycles and the time and energy they translate into. I'd say "Yes! It causes a real problem." It keeps perfectly idle machines, which could otherwise respond immediately to incoming network packets, busy switching contexts, flushing TLBs, invalidating caches, and so on....

Polling may also be quite bad when you think about installations that run on networked storage.
I am in the process of virtualizing a lot of servers, and in order to avoid a SPOF I may have to move the storage away from the machine to allow migration and fast recovery in case of hardware failure.
On such a system you want to avoid "hitting the disk" like the plague.

Revision history for this message
Heinrich Langos (henrik-launchpad) wrote :

Seems like this problem has been reported and partially fixed a long time ago:

https://bugzilla.redhat.com/show_bug.cgi?id=252127

http://sourceforge.net/tracker2/index.php?func=detail&aid=1776178&group_id=103&atid=300103

The fix is Linux-only as it uses inotify, but at least it is there and it is there for mailman 2 and 3.
