upgrade robustness for cronscripts
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
Launchpad itself | Fix Released | High | Stuart Bishop |
Bug Description
We want to do more full-rollouts without adding downtime except when there are actual db patches to deploy.
We have ~170 cronscript instances spread over many machines, and these are a likely source of fragility / perceived downtime.
Some possible problems:
- they run from cron, so either they try to run while we're upgrading, or they may still be running while we upgrade
- they run from the 'current' symlink, not the 'active revno dir', so they could in principle do late buggy imports (but this is rare, we can ignore for now)
- some of them have terrible knock-on effects if interrupted, and take 15-20 minutes of every hour.
Specifically: the publisher script takes ages and if interrupted makes a mess, so we will want to deploy *around* it.
I have in mind a single simple wrapper, placed around every cronscript, that will:
- check a single well known place to determine whether to load the script or not
- (optionally) not run scripts that take more than <estimate> leading up to a rollout (e.g. 1 minute scripts might still run)
However, that's only a preconceived idea. The actual constraints are:
- be better than the current rollout process of mass crontab editing
- put the policy somewhere more central (e.g. db, config file, whatever)
- for specific highly sensitive cronscripts we may want some 'is it safe yet' check, but that shouldn't be conflated with how we coordinate whether things run or not, unless it makes sense.
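To make the wrapper idea concrete, here is a minimal sketch in Python, under stated assumptions. Nothing in it comes from the actual fix: the control file path, its JSON format, the `disabled` and `rollout_at` keys, and the `should_run` function are all hypothetical names chosen for illustration. It checks one well-known place for policy, and optionally refuses to start a script whose estimated runtime would overlap a planned rollout.

```python
import json
import time
from pathlib import Path

# Hypothetical well-known location; the bug report does not name one.
DEFAULT_CONTROL_FILE = Path("/var/tmp/cronscripts-control.json")


def should_run(script_name, estimated_seconds,
               control_file=DEFAULT_CONTROL_FILE, now=None):
    """Return True if the named cronscript may start now.

    The control file (an assumed format) is a JSON object with optional keys:
      "disabled":   list of script names, or ["*"] to disable everything
      "rollout_at": Unix timestamp of the next planned rollout
    """
    now = time.time() if now is None else now
    try:
        policy = json.loads(Path(control_file).read_text())
    except FileNotFoundError:
        # No policy file present: run as usual.
        return True

    disabled = policy.get("disabled", [])
    if "*" in disabled or script_name in disabled:
        return False

    # Skip scripts that are expected to still be running when the
    # rollout starts; short scripts may still fit in before it.
    rollout_at = policy.get("rollout_at")
    if rollout_at is not None and now <= rollout_at < now + estimated_seconds:
        return False
    return True
```

A cronscript would call something like `should_run("publisher", 20 * 60)` first and exit quietly if it returns False. This sketch lets a malformed control file raise an error rather than guessing a policy, and it deliberately leaves out any per-script 'is it safe yet' check, which the constraints above treat as a separate concern.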
The current process causes extended downtime windows by not running anything leading up to the rollout; part of the issue is that individual scripts can't tolerate other services (e.g. the xmlrpc server) going down. We may need a bunch of fine-grained bugs to make them better, but the robustness thing here should at least let us start closing the gap.
I'm marking this as high because it will be hard for us to change our merge-qa-deploy workflow until we reduce the downtime of production non-db-patch rollouts, and I know everyone is keen to change that workflow :).
Related branches
- Robert Collins (community): Approve
- Canonical Launchpad Engineering: Pending requested
- Diff: 1068 lines (+537/-90), 16 files modified
lib/canonical/launchpad/scripts/logger.py (+147/-61)
lib/canonical/launchpad/scripts/tests/loglevels.py (+9/-10)
lib/canonical/launchpad/scripts/tests/test_logger.txt (+28/-5)
lib/canonical/launchpad/scripts/tests/test_scriptmonitor.py (+1/-1)
lib/canonical/launchpad/webapp/errorlog.py (+5/-4)
lib/lp/codehosting/codeimport/tests/test_dispatcher.py (+11/-0)
lib/lp/services/log/loglevels.py (+91/-0)
lib/lp/services/log/mappingfilter.py (+27/-0)
lib/lp/services/scripts/base.py (+60/-3)
lib/lp/services/scripts/doc/launchpad-scripts.txt (+39/-0)
lib/lp/services/scripts/tests/cronscript-crash.py (+44/-0)
lib/lp/services/scripts/tests/test_doc.py (+7/-2)
lib/lp/translations/doc/remove-translations-by.txt (+2/-2)
lib/lp/translations/scripts/tests/test_reupload_translations.py (+4/-2)
lib/lp_sitecustomize.py (+56/-0)
scripts/branch-rewrite.py (+6/-0)
Changed in launchpad-foundations: | |
assignee: | nobody → Stuart Bishop (stub) |
tags: | added: cron |
tags: | added: qa-ok removed: qa-needstesting |
Changed in launchpad-foundations: | |
status: | Fix Committed → Fix Released |
Addressing 605822 at the same time, or at least preparing for it, would be nice.