wish: A Unified scriptactivity checker

Bug #644012 reported by Steve McInerney on 2010-09-21
This bug affects 1 person
Affects: Launchpad itself
Importance: Low
Assigned to: Unassigned

Bug Description

Context:
Currently, any cron jobs that do backend work or simple job scheduling are "monitored" via scriptactivity checks.
e.g.:

# HOURLY
07 * * * * $LP_PY /srv/launchpad.net/production/launchpad/scripts/script-monitor.py -U statistician 60 loganberry:send-bug-notifications loganberry:send-person-notifications loganberry:branchscanner loganberry:sendbranchmail loganberry:process-mail loganberry:process_apport_blobs loganberry:merge-proposal-jobs loganberry:rosettabranches

We currently have about 21 separate crontab entries for 58 separate tasks being monitored.

Issues:
* adding new crontab tasks is easy and somewhat common
* adding new crontab task scriptactivity checks is easy to overlook, by both losas and the requesting developers
* the scriptactivity check only monitors scripts it knows about; no check, no alert. This is bad.
* the scriptactivity check doesn't distinguish between failed jobs that are "ZOMG AAAAAAAA" and ones that are "Meh"
* using email for alerting on the "Meh" jobs is fine; it's not good for the ZOMG ones
- aside, I can only think of one that really is ZOMG, but that's one too many that we don't have a good immediate alerting mechanism for
* The existing scheduling is a tad inflexible. This is most noticeable with the serial nature of the nightly.sh run. All scripts *do* complete (usually), but due to length creep the run often takes longer than 24 hours, and thus we get lots of spurious alerts.
(related, we should probably run nightly.sh every 2-4 hours, with locking and a "have I run in < 20 hours?" check that aborts the run if so. That way, even if a run does take longer than 24 hours, we minimise the loss from all those jobs not running at all)
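
A minimal Python sketch of that nightly.sh guard, to make the idea concrete. The lock and stamp file paths and the function names are hypothetical, and stamping at the start of a run (rather than at the end) is an assumption about what "have I run" should mean here:

```python
import fcntl
import os
import time

MIN_INTERVAL_SECONDS = 20 * 60 * 60  # the "have I run in < 20 hours?" window


def should_run(stamp_path, now, min_interval=MIN_INTERVAL_SECONDS):
    """Return True if the last recorded run started at least min_interval ago."""
    try:
        last_start = os.path.getmtime(stamp_path)
    except OSError:
        return True  # no stamp file yet, so we have never run
    return (now - last_start) >= min_interval


def run_guarded(lock_path, stamp_path, job, now=None):
    """Run job() unless another instance holds the lock or a run is recent.

    Returns "locked", "recent" or "ran".
    """
    now = time.time() if now is None else now
    with open(lock_path, "w") as lock_file:
        try:
            # Non-blocking exclusive lock: a second instance gives up at once.
            fcntl.flock(lock_file, fcntl.LOCK_EX | fcntl.LOCK_NB)
        except OSError:
            return "locked"
        if not should_run(stamp_path, now):
            return "recent"
        # Assumption: record the start time, so a run that overshoots
        # 24 hours does not delay the next one any further.
        with open(stamp_path, "a"):
            pass
        os.utime(stamp_path, (now, now))
        job()
        return "ran"
```

With something like this in place, cron could fire the wrapper every 2-4 hours; most invocations would return "locked" or "recent" almost immediately.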

Proposal:
A unified scriptactivity check that solves the above problems.
* single crontab entry vs 21
* will alert (via email) on new entries in the scriptactivity table that it's not aware of
* a central 'store' of definitions for the monitoring. Probably an SQL table, but a simple text file would also work
* an exception mechanism for the ZOMG jobs that could in turn be monitored via the existing near-real-time alert system (nagios). Probably via writing a status text file containing "OK" or "CRITICAL". We use this approach in several places.
* needs to be fast. Ideally would easily complete in under a minute, so even on a loaded scripts server, could be run every minute
* (obviously) needs self locking so it doesn't contribute to a death spiral in times of severe load
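
A rough Python sketch of the core of such a check, under stated assumptions: the `Check` fields, the function names, and the "example:zomg-job" entry are hypothetical, and `last_runs` stands in for whatever would actually be read from the scriptactivity table. It only classifies; sending email and writing the nagios status file are left out.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Check:
    name: str               # e.g. "loganberry:process-mail"
    max_age_minutes: int    # alert if no completed run within this window
    critical: bool = False  # True for "ZOMG" jobs that should reach nagios


def evaluate(checks, last_runs, now):
    """Classify monitored scripts.

    last_runs maps script name -> epoch seconds of the last completed run.
    Returns (stale, unknown): Check objects that are overdue, and names
    present in last_runs that have no definition in the central store.
    """
    known = {c.name for c in checks}
    stale = [
        c for c in checks
        if c.name not in last_runs
        or now - last_runs[c.name] > c.max_age_minutes * 60
    ]
    unknown = sorted(set(last_runs) - known)
    return stale, unknown


def nagios_status(stale):
    """Text for the status file: CRITICAL only if a critical job is stale."""
    return "CRITICAL" if any(c.critical for c in stale) else "OK"


checks = [
    Check("example:zomg-job", 60, critical=True),  # hypothetical ZOMG job
    Check("loganberry:branchscanner", 60),
]
last_runs = {
    "example:zomg-job": 1000,
    "loganberry:branchscanner": 9000,
    "loganberry:new-job": 9000,  # in the table, but not in the store
}
stale, unknown = evaluate(checks, last_runs, now=10000)
```

Here `stale` would feed the email alert, `unknown` the "new entry I'm not aware of" email, and `nagios_status(stale)` the status file.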

Possibly, Nice to haves:
* simple trend alerting, "Oh Hai, this job was running for 30 minutes for the past month, but in the past week has been 45-60 mins. this could be bad"
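
One possible shape for that trend check, as a Python sketch only. The function name, the 1.5x threshold, and the idea of comparing mean durations are all assumptions for illustration; the durations would have to come from the scriptactivity records.

```python
from statistics import mean


def duration_trend(baseline_minutes, recent_minutes, threshold=1.5):
    """Return a warning string if the recent mean duration is at least
    threshold times the baseline mean, otherwise None."""
    if not baseline_minutes or not recent_minutes:
        return None  # not enough history to compare
    base = mean(baseline_minutes)
    recent = mean(recent_minutes)
    if base > 0 and recent >= base * threshold:
        return "mean duration rose from %.0f to %.0f minutes" % (base, recent)
    return None


# A month at about 30 minutes, then a week at 45-60 minutes.
warning = duration_trend([30, 30, 30, 30], [45, 60, 50])
quiet = duration_trend([30, 30, 30, 30], [30, 32, 29])
```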

Steve McInerney (spm) on 2010-09-21
tags: added: canonical-losa-lp
Gary Poster (gary) wrote :

Triaging as low because it sounds like we have one known ZOMG job, and Foundations has lots of other high importance bugs, so I'm defaulting to low. That said, if LOSAs or Robert or others disagree, please follow up. To be clear, I'd like to fix this, I'm just not sure I'm going to schedule this anytime soon unless it's made clear that it is of higher importance than other important things.

Changed in launchpad-foundations:
status: New → Triaged
importance: Undecided → Low

I agree with the scheduling for now; getting rollouts sorted is my
current bugbear ;)
