nagios cannot be sure mailman is really working ok

Bug #435886 reported by Tom Haddon on 2009-09-24
This bug affects 1 person
Affects: Launchpad itself

Bug Description

The extent of our monitoring for mailman at the moment is a process check for mailmanctl / a heartbeat in the log files.

This does not cover end to end operations and things like archiver backlogs will not trigger nagios alerts.

One way to solve this is a tiny process that continually sends and probes for email to be sure it's working. There may even be a helper Out There for nagios to do this already.
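A minimal sketch of such a probe, in Python, might look like the following. The list address, mailbox credentials, and hosts are all hypothetical placeholders; the point is that a unique token travels the whole send → mailman → deliver path, rather than just checking the process table.

```python
#!/usr/bin/env python3
"""Minimal end-to-end mail probe sketch (addresses/hosts are hypothetical)."""
import time
import uuid
import poplib
import smtplib
from email.message import EmailMessage

LIST_ADDR = "nagios-probe@lists.example.com"   # hypothetical test list
TIMEOUT = 300                                  # seconds before CRITICAL

def build_probe_message(token, sender="nagios@example.com"):
    """A probe message whose Subject carries a unique token."""
    msg = EmailMessage()
    msg["From"] = sender
    msg["To"] = LIST_ADDR
    msg["Subject"] = "mailman-probe %s" % token
    msg.set_content("end-to-end mailman probe; safe to delete")
    return msg

def probe_returned(pop, token):
    """Scan a poplib.POP3 mailbox for the token; delete any hit."""
    for i in range(len(pop.list()[1])):
        headers = b"\n".join(pop.top(i + 1, 0)[1])
        if token.encode() in headers:
            pop.dele(i + 1)  # keep the probe mailbox from growing
            return True
    return False

def run_probe():
    """Send one probe and poll for its return; 0=OK, 2=CRITICAL."""
    token = uuid.uuid4().hex
    with smtplib.SMTP("localhost") as smtp:
        smtp.send_message(build_probe_message(token))
    deadline = time.time() + TIMEOUT
    while time.time() < deadline:
        pop = poplib.POP3("localhost")
        pop.user("nagios-probe")   # hypothetical probe mailbox
        pop.pass_("secret")
        try:
            if probe_returned(pop, token):
                print("OK: probe delivered")
                return 0
        finally:
            pop.quit()
        time.sleep(15)
    print("CRITICAL: probe not delivered within %ds" % TIMEOUT)
    return 2
```

The 0/2 return values follow the usual Nagios plugin exit-code convention; `run_probe()` would be wired up as the plugin entry point.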

affects: launchpad → launchpad-registry
Curtis Hovey (sinzui) wrote :

What does this involve? What must the registry team do to close this bug?

tags: added: mailing-lists
Changed in launchpad-registry:
status: New → Incomplete
Tom Haddon (mthaddon) wrote :

It requires someone to tell us how we can definitively check that things are working okay in mailman. Currently we only check for the mailman process because we don't know how to actually confirm mailman is working as expected. I think Barry had suggested we create a sample list, post items to it, and confirm they show up on the list in the web UI. If this is really the only way to do that, I guess we'd need to write a script to do all of that, but making changes to the production system (by posting things to a list) isn't really conducive to a nagios check, not least because it would need to run very frequently. We could have the script only post every hour (for instance) and check if there's a new message on the web UI every hour, but that would effectively mean it'd take us up to an hour to tell if there was a problem with mailman.

Changed in launchpad-registry:
status: Incomplete → New
Barry Warsaw (barry) wrote :

I can't think of any other reliable way to tell that the system is working from end-to-end.

Curtis Hovey (sinzui) wrote :

I think this needs to be done for this release. A 1-hour turnaround is only acceptable if there is no faster way; something will have to drop from our heavy commitments to make this faster. Does the API help? Barry was working on an API to work with held messages; that will tell you that mailman thinks everything is okay, but it does not verify that the message was successfully delivered.

Curtis Hovey (sinzui) wrote :

I think messages could be sent to verify they land in the hold queue, and nagios can verify they are there and remove them.

Changed in launchpad-registry:
importance: Undecided → Low
status: New → Triaged
Tom Haddon (mthaddon) on 2010-05-28
tags: added: canonical-losa-lp
Curtis Hovey (sinzui) wrote :

Hi Tom.

Did the logging changes I added to mailman address this issue? Mailman logs have a heartbeat now.

Tom Haddon (mthaddon) wrote :

It's an interesting one, this... So it helps us to monitor the process to some extent for regular usage, but it doesn't allow us to use it as part of a deployment. The reason is that we need to be able to stop a service, verify it's down, and then restart it with new code and verify it's up. If we're just looking for an entry in a logfile within a certain interval, this has no connection to whether the process is still running or not. For example, it heartbeats every minute, so we have a check that says "has it made a heartbeat entry in the logfile within the last minute?" - for this to work in a deployment scenario we'd have to wait at least a minute before we could tell the service was down, and then wait another minute to verify it's up again.
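The heartbeat check being discussed can be sketched as a small Nagios-style plugin. The log line format here (an ISO-8601 timestamp followed by the word "heartbeat") is an assumption, not mailman's actual format; the logic is what matters: find the newest heartbeat line and compare its age to the interval.

```python
#!/usr/bin/env python3
"""Sketch of a heartbeat-freshness check (assumed log line format:
"2010-05-28 12:00:00 heartbeat"). Nagios convention: 0=OK, 2=CRITICAL."""
import datetime

HEARTBEAT_INTERVAL = 60  # seconds between heartbeats

def seconds_since_heartbeat(lines, now):
    """Age in seconds of the newest heartbeat line, or None if absent."""
    latest = None
    for line in lines:
        if "heartbeat" in line:
            stamp = datetime.datetime.strptime(line[:19], "%Y-%m-%d %H:%M:%S")
            if latest is None or stamp > latest:
                latest = stamp
    if latest is None:
        return None
    return (now - latest).total_seconds()

def check(lines, now, max_age=HEARTBEAT_INTERVAL * 2):
    """Return (exit_code, status_message) for the given log lines."""
    age = seconds_since_heartbeat(lines, now)
    if age is None or age > max_age:
        return 2, "CRITICAL: no heartbeat within %ds" % max_age
    return 0, "OK: last heartbeat %ds ago" % age
```

As Tom's comment notes, the weakness of this scheme is latency: the check cannot distinguish "down" from "not yet heartbeated" until `max_age` has elapsed.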

Robert Collins (lifeless) wrote :

So for shutdown, looking at the cron + running services as we previously did should be fine. For coming up, it should write a heartbeat straight away as it comes up, and then every 60 seconds after. That seems like it would work to me?

Barry Warsaw (barry) wrote :

Given that a running mailman consists of a master process and many long running subprocesses, how would you *like* such a check to work? I may not be able to help for Mailman 2 but at least I could build something into Mailman 3. Would it be enough to hit some status URL in the REST API? It may not be possible to tell the health of all the subprocesses though.

Robert Collins (lifeless) wrote :

So I think Tom is asking for a functional check that it's 'all good'; we don't, in e.g. LP, actually mutate data.

if mailman cannot tell that its own children are healthy, that seems like a mailman issue we should delegate to mailman.

Doing a full end to end check on a dedicated private list would probably work, and if we add a 'zap list contents' facility it needn't accumulate too much data.

So, I'm going to split this up as follows:
 - it's a nagios problem (e.g. not an LP codebase issue) to send a mail to a list, poll for the response back on a known address that nagios gets, and check it shows up in the archive.
 - but we need to write this script; and probably need to have it happen async of the nagios checks - e.g. it writes 'OK' every N minutes to a log file, and the nagios check is then 'is the OK in the log file < N+1 minutes old' where N is the maximum latency we're willing to tolerate for things flowing through mailman.
 - we need to manually create a dedicated private list for this monitoring to happen on, and the address for the response.
 - we need a code change to permit easy nuking of list contents from time to time
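The second point above decouples the slow probe from the fast Nagios poll: the probe writes an 'OK' marker every N minutes, and Nagios only asks how old that marker is. A sketch of the marker-age side, with a hypothetical marker path:

```python
#!/usr/bin/env python3
"""Sketch of the 'stale OK marker' check: the async probe touches a
marker file every N minutes; Nagios just checks its age.
Marker path is hypothetical. Nagios convention: 0=OK, 2=CRITICAL."""
import os
import time

OK_FILE = "/var/run/mailman-probe.ok"  # hypothetical marker file
N_MINUTES = 5                          # probe interval

def marker_is_fresh(mtime, now, n_minutes=N_MINUTES):
    """True if the marker was touched within the last N+1 minutes."""
    return (now - mtime) <= (n_minutes + 1) * 60

def main():
    try:
        mtime = os.path.getmtime(OK_FILE)
    except OSError:
        print("CRITICAL: probe marker %s missing" % OK_FILE)
        return 2
    if marker_is_fresh(mtime, time.time()):
        print("OK: mailman probe marker is fresh")
        return 0
    print("CRITICAL: mailman probe marker is stale")
    return 2
```

This keeps the Nagios check itself cheap and side-effect free, with N set to the maximum latency we are willing to tolerate for mail flowing through mailman.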

Robert Collins (lifeless) wrote :

Bug 817794 about being able to nuke the list contents.

This bug can now focus on writing a script (doesn't need to be part of the LP tree) that will:
 * send mail to a list
 * retrieve mail from a mailbox to check it was forwarded
 * and look up the same mail to make sure it got archived
 * and every day? week? month? trigger an API request to clean out the list.
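The archive-lookup and periodic-cleanup steps in that list might be sketched as follows. The archive URL is hypothetical, and the cleanup facility it schedules is the one bug 817794 asks for, so it is only represented here as a "when is it due" predicate:

```python
#!/usr/bin/env python3
"""Sketch of the archive-verification and cleanup-scheduling steps
(archive URL is hypothetical; the cleanup API does not exist yet,
see bug 817794, so only its scheduling is modelled)."""
import urllib.request

ARCHIVE_URL = "https://lists.example.com/archives/nagios-probe/"  # hypothetical

def fetch_archive_page(url=ARCHIVE_URL):
    """Fetch the archive index page for the probe list."""
    with urllib.request.urlopen(url, timeout=30) as resp:
        return resp.read().decode("utf-8", "replace")

def message_archived(token, page_text):
    """The probe Subject embeds a unique token; the message was archived
    if that token appears anywhere in the archive index page."""
    return token in page_text

def cleanup_due(last_cleanup_day, today, every_days=7):
    """True when the (future) list-cleanup API should be triggered;
    weekly by default, matching the 'day? week? month?' question above."""
    return (today - last_cleanup_day) >= every_days
```

Because the token is unique per probe run, a plain substring search over the archive page is enough; no parsing of the archive's HTML structure is required.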

Once that's written and we have archive cleaning in place, we can file an RT to make a dedicated private list + nagios user account for the probes to use, and close this bug.

summary: - Need a way to monitor mailman via nagios
+ nagios cannot be sure mailman is really working ok
description: updated
Curtis Hovey (sinzui) wrote :

documents slow delivery that might have been caught if we had end-to-end monitoring. Only large lists were affected by slow delivery, so effective monitoring may need to use a large list to provide the state.

An alternate solution to the previous suggestions in this bug could involve diagnostic pipelets in the queue runners. Maybe the queue runner processing the outgoing emails can signal a problem if the message's incoming timestamp is older than the service level we set for the time to complete the outgoing mail. A separate pipelet could be in the archive queue to manage a different service level. A pipelet that adds metadata to messages when they enter the queues might be able to supply this diagnostic data directly, rather than relying on existing message data or a log.
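That pipelet idea can be sketched in a few lines. The handler interface and metadata key here are hypothetical, not Mailman's real handler API; the two service levels mirror the outgoing-vs-archive distinction in the comment:

```python
#!/usr/bin/env python3
"""Sketch of the 'diagnostic pipelet' idea: stamp a message with an
incoming timestamp, then let each queue runner compare that stamp
against its own service level. The msgdata interface is hypothetical."""
import time

OUTGOING_SLA = 600   # seconds allowed to complete outgoing mail (assumed)
ARCHIVE_SLA = 1800   # a separate service level for the archive queue (assumed)

def stamp_incoming(msgdata, now=None):
    """Add diagnostic metadata when the message first enters the queues."""
    msgdata.setdefault("probe_received_at", now if now is not None else time.time())
    return msgdata

def over_sla(msgdata, sla, now=None):
    """True if the message has been in flight longer than the given SLA."""
    received = msgdata.get("probe_received_at")
    if received is None:
        return False  # unstamped messages can't be judged
    current = now if now is not None else time.time()
    return (current - received) > sla
```

A runner would call `over_sla(msgdata, OUTGOING_SLA)` (or `ARCHIVE_SLA`) as each message passes through, and raise a nagios-visible signal on the first laggard, which catches the slow-delivery case described above without any synthetic probe traffic.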

Francis J. Lacoste (flacoste) wrote :

Raising to Critical as this is making users report operational problems before we are aware of them.

Changed in launchpad:
importance: Low → Critical
Curtis Hovey (sinzui) on 2013-01-08
Changed in launchpad:
importance: Critical → High