Mailman's qrunner doesn't provide any way to be monitored

Bug #183372 reported by Tom Haddon
Affects: Launchpad itself
Status: Fix Released
Importance: High
Assigned to: Unassigned

Bug Description

We currently don't have any way to monitor the health of the qrunner process, besides a check for the process.

Can we add a variable that controls how often something is written to either the xmlrpc or qrunner log file, so that if there's no activity within the specified period there will be a "--MARK--" entry with a timestamp? This way an external process (such as nagios) can easily check the health of the qrunner.
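A minimal sketch of the requested behaviour, not Mailman's actual code: the class name `IdleMarker`, its `interval` parameter, and the `tick()` method are all hypothetical names chosen for illustration. It writes a "--MARK--" line only when nothing else has been logged for `interval` seconds.

```python
import time

MARK = "--MARK--"


class IdleMarker:
    """Hypothetical helper: emit a MARK line after a quiet period."""

    def __init__(self, stream, interval, clock=time.time):
        self.stream = stream        # any file-like object opened for writing
        self.interval = interval    # seconds of silence before a mark is written
        self.clock = clock          # injectable so the logic can be tested
        self.last_write = clock()

    def log(self, message):
        """Write a timestamped line and remember when it was written."""
        now = self.clock()
        stamp = time.strftime("%b %d %H:%M:%S %Y", time.localtime(now))
        self.stream.write("%s %s\n" % (stamp, message))
        self.last_write = now

    def tick(self):
        """Call periodically; returns True if a mark was written."""
        if self.clock() - self.last_write >= self.interval:
            self.log(MARK)
            return True
        return False
```

Calling `tick()` from whatever loop the qrunner already runs would keep the log quiet during normal activity and produce a mark only when the process is alive but idle.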


Tom Haddon (mthaddon)
Changed in launchpad:
assignee: nobody → barry
Curtis Hovey (sinzui)
Changed in launchpad-foundations:
assignee: barry → nobody
importance: Undecided → Low
status: New → Triaged
Tom Haddon (mthaddon) wrote :

This is now made worse by the fact that the only way we can currently monitor mailman is by doing a process listing, and if you have two instances on the same server (such as staging and qastaging) there's no difference in the process listing to be able to tell which one is which.

tags: added: canonical-losa-lp
Changed in launchpad-registry:
importance: Low → High
Tom Haddon (mthaddon) wrote :

Marking as "high" per latest discussion in Launchpad/IS meeting.

Barry Warsaw (barry) wrote :

Note that there are actually multiple qrunner processes that must run in order to deliver mail. It would not be hard to add something to the XMLRPC qrunner to write the tag every so often (I think there's even a debug switch that makes the log more chatty). Or you could of course write a small qrunner that just woke up periodically and Did Something. The health of one qrunner does not necessarily imply the health of the entire system though.

Curtis Hovey (sinzui)
Changed in launchpad-registry:
milestone: none → series-future
tags: added: mailing-lists
Curtis Hovey (sinzui) wrote :

Hi Tom.

The recent outages were caused when mailman was not syncing with Lp. Mailman reports that it cannot talk to xmlrpc and after a time, it just gives up. I think we want Lp's xmlrpc-mailman code to record the timestamp of the last request in the Lp db. I think OSAs know how to pull a timestamp from the db and raise a warning if 30 minutes have passed.

Francis J. Lacoste (flacoste) wrote :

Hi Curtis,

I think simply logging every time data has been synced up is more flexible than modifying LP to write to the DB every time a read method is called.

Curtis Hovey (sinzui) wrote :

Hi Tom and all LOSAs.

Our mailman installation has both an error log and an xmlrpc log to report actions and exceptions. I think we want to watch one of these logs using a nagios plugin that I do not understand. The xmlrpc class has a _oneloop() method that manages what happens during a loop; it also catches exceptions and writes them to a log. It could have a rule to log a heartbeat if there were no exceptions during the run:

    Nov 30 16:55:13 2010 (10236) --MARK--

In the case of the last outage, the --MARK-- would have stopped appearing on Dec 01 and something could have reported that mailman xmlrpc was dead. I imagine we need to decide on an acceptable period without a sync. My own expectation is that everything is synced every 10 minutes. 30 minutes becomes a problem for users setting up teams or removing members.
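As an illustration of how a monitoring check might consume these entries, here is a rough sketch that is not the plugin actually deployed. It assumes the log line format shown above; the function names and the 15-minute warning threshold are assumptions, and the 30-minute critical threshold is taken from the discussion. The return codes follow the usual Nagios plugin convention (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN).

```python
from datetime import datetime, timedelta

MARK = "--MARK--"
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3  # Nagios plugin exit codes


def last_mark_time(lines):
    """Return the timestamp of the last MARK line, or None if there is none."""
    latest = None
    for line in lines:
        if not line.rstrip().endswith(MARK):
            continue
        # "Nov 30 16:55:13 2010 (10236) --MARK--": the first four
        # whitespace-separated fields form the timestamp.
        fields = line.split()
        try:
            latest = datetime.strptime(" ".join(fields[:4]),
                                       "%b %d %H:%M:%S %Y")
        except ValueError:
            continue  # skip lines whose timestamp cannot be parsed
    return latest


def check(lines, now,
          warn=timedelta(minutes=15),   # assumed threshold
          crit=timedelta(minutes=30)):  # "30 minutes becomes a problem"
    """Return (status_code, message) based on the age of the last mark."""
    mark = last_mark_time(lines)
    if mark is None:
        return UNKNOWN, "no %s entry found" % MARK
    age = now - mark
    if age >= crit:
        return CRITICAL, "last %s was %s ago" % (MARK, age)
    if age >= warn:
        return WARNING, "last %s was %s ago" % (MARK, age)
    return OK, "last %s was %s ago" % (MARK, age)
```

A wrapper script would read the xmlrpc log, call `check(lines, datetime.now())`, print the message, and exit with the returned status code. Since several qrunners must be healthy for mail to flow, a check like this covers only the one log it reads.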

Curtis Hovey (sinzui)
Changed in launchpad:
status: Triaged → In Progress
milestone: none → 11.01
milestone: 11.01 → 11.02
Curtis Hovey (sinzui)
Changed in launchpad:
assignee: nobody → Curtis Hovey (sinzui)
Launchpad QA Bot (lpqabot) wrote : Bug fixed by a commit
tags: added: qa-needstesting
Changed in launchpad:
status: In Progress → Fix Committed
Robert Collins (lifeless) wrote :

Mark entries are showing up on qastaging. We need a separate RT to get this monitored by nagios.

tags: added: qa-ok
removed: qa-needstesting
Robert Collins (lifeless) wrote :

rt 43392

Changed in launchpad:
status: Fix Committed → Fix Released
Curtis Hovey (sinzui)
Changed in launchpad:
assignee: Curtis Hovey (sinzui) → nobody
This report contains Public information  
Everyone can see this information.
