upgrade robustness for cronscripts

Bug #607391 reported by Robert Collins on 2010-07-19
This bug affects 1 person
Affects: Launchpad itself
Importance: High
Assigned to: Stuart Bishop

Bug Description

We want to do more full-rollouts without adding downtime except when there are actual db patches to deploy.

We have ~170 cronscript instances spread over many machines, and these are a likely source of fragility and perceived downtime.

Some possible problems:
 - they run from cron, so either they try to run while we're upgrading, or they may still be running while we upgrade
 - they run from the 'current' symlink, not the 'active revno dir', so they could in principle do late buggy imports (but this is rare, we can ignore for now)
 - some of them have terrible knock-on effects if interrupted, and take 15-20 minutes every hour.

Specifically: the publisher script takes ages and if interrupted makes a mess, so we will want to deploy *around* it.

I have in mind a single simple wrapper that we can surround all cronscripts with that will:
 - check a single well known place to determine whether to load the script or not
 - (optionally) not run scripts that take more than <estimate> leading up to a rollout (e.g. 1 minute scripts might still run)
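
A minimal sketch of what such a wrapper could look like, not the actual Launchpad implementation. The control file path, its JSON layout ("enabled", "rollout_at") and the per-script runtime estimate passed on the command line are all assumptions made for illustration; the report names none of them.

```python
import json
import subprocess
import sys
import time
from pathlib import Path

# Hypothetical well-known location; the bug report does not specify one.
CONTROL_FILE = Path("/etc/cronscripts-control.json")


def should_run(control, estimate_seconds, now):
    """Decide whether a cronscript may start.

    control is a dict with two optional (assumed) keys:
      "enabled"    - False disables every wrapped script
      "rollout_at" - Unix timestamp of the next rollout, or None
    """
    if not control.get("enabled", True):
        return False
    rollout_at = control.get("rollout_at")
    if rollout_at is None:
        return True
    # Only start scripts expected to finish before the rollout begins.
    # Once rollout_at has passed, nothing starts until the file is updated.
    return now + estimate_seconds <= rollout_at


def load_control(path):
    """Read the control file; a missing file means 'run as normal'."""
    try:
        return json.loads(path.read_text())
    except FileNotFoundError:
        return {}


def main(argv):
    # Assumed usage: wrapper.py <estimate_seconds> <command> [args...]
    estimate = float(argv[1])
    command = argv[2:]
    if not should_run(load_control(CONTROL_FILE), estimate, time.time()):
        return 0  # Skip quietly so cron does not send mail.
    return subprocess.call(command)
```

A crontab entry would then call the wrapper (via `sys.exit(main(sys.argv))` under a `__main__` guard) instead of the script directly. With a rollout 30 seconds away, a script estimated at 60 seconds is skipped while one estimated at 10 seconds still runs.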

However that's only a preconceived idea. The actual constraints are:
 - be better than the current rollout process of mass crontab editing
 - put the policy somewhere more central (e.g. db, config file, whatever)
 - for specific highly sensitive cronscripts we may want some 'is it safe yet' check, but that shouldn't be conflated with how we coordinate whether things run or not, unless it makes sense.

The current process causes extended downtime windows by not running anything leading up to the rollout. Part of the issue is that individual scripts can't tolerate other services (e.g. the xmlrpc server) going down. We may need a bunch of fine-grained bugs to make them better, but the robustness work here should at least let us start closing the gap.

I'm marking this as high because it will be hard for us to change our merge-qa-deploy workflow until we reduce the downtime of production non-db-patch rollouts, and I know everyone is keen to change that workflow :).

Gary Poster (gary) wrote :

Addressing bug 605822 at the same time, or at least preparing for it, would be nice.

Changed in launchpad-foundations:
status: New → Triaged
Gary Poster (gary) on 2010-07-19
Changed in launchpad-foundations:
assignee: nobody → Stuart Bishop (stub)
Stuart Bishop (stub) on 2010-08-06
tags: added: cron
Launchpad QA Bot (lpqabot) wrote : Bug fixed by a commit
Changed in launchpad-foundations:
milestone: none → 10.10
tags: added: qa-needstesting
Changed in launchpad-foundations:
status: Triaged → Fix Committed
Stuart Bishop (stub) on 2010-09-29
tags: added: qa-ok
removed: qa-needstesting
Curtis Hovey (sinzui) on 2010-10-14
Changed in launchpad-foundations:
status: Fix Committed → Fix Released