Need more visibility into the progress of schema updates across master and slave DBs

Bug #531833 reported by Tom Haddon
Affects: Launchpad itself
Status: Won't Fix
Importance: High
Assigned to: Stuart Bishop
Milestone: —

Bug Description

Each LP rollout is currently something of a lottery in terms of timing. Since figuring out whether any of the upgrade.py/fti.py/security.py steps are blocked is non-trivial, it's hard for us to know when things are going wrong. If we've estimated that the upgrade should take 30 minutes, we only really start to worry that something has gone wrong after about 35 minutes. If we're still only 20% of the way through at that stage, our outage estimates are going to be completely wrong.

Ideally we'd have some easy way of determining the progress of updates against the master and each slave DB, and whether they're being blocked in any way.

More generally, for knowing what's happening on the slave DBs, it would also be useful to be able to see the queue of items each slave is waiting to process and whether it's blocked in any way.
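
As a concrete illustration of the kind of per-slave check being asked for, here is a minimal sketch (not existing Launchpad code) that polls Slony-I's sl_status view for each subscriber's event backlog and time lag. The DSN and the "_sl" cluster schema name are assumptions; substitute the real cluster name.

    import psycopg2  # assumed driver; any PostgreSQL client would do

    MASTER_DSN = "dbname=launchpad_prod host=master"  # hypothetical DSN

    def replication_lag(dsn=MASTER_DSN):
        """Print, per subscriber node, how many events it still has queued."""
        conn = psycopg2.connect(dsn)
        try:
            cur = conn.cursor()
            # sl_status reports the backlog of unconfirmed events and the
            # time lag for every origin/subscriber pair in the cluster.
            cur.execute("""
                SELECT st_origin, st_received, st_lag_num_events, st_lag_time
                FROM _sl.sl_status
                ORDER BY st_received
            """)
            for origin, node, backlog, lag in cur.fetchall():
                print("node %s: %s events behind node %s (lag %s)"
                      % (node, backlog, origin, lag))
        finally:
            conn.close()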

Tom Haddon (mthaddon)
Changed in launchpad:
importance: Undecided → High
Brad Crittenden (bac)
affects: launchpad → launchpad-foundations
Changed in launchpad-foundations:
status: New → Triaged
Changed in launchpad-foundations:
status: Triaged → New
status: New → Triaged
Gary Poster (gary)
Changed in launchpad-foundations:
milestone: none → 10.03
Tom Haddon (mthaddon)
summary: - Need more visibility into the progress of upgrade.py/fti.py/security.py
+ Need more visibility into the progress of schema updates across master
+ and slave DBs
description: updated
Revision history for this message
Stuart Bishop (stub) wrote :

We can't get this level of detail out of the Slony tools themselves.

We might be able to get meaningful information out of the slony log files.

We could get the slony tools customized to provide less noise and more meaningful feedback.

Have there been blockages detected that were not caused by connections that should have been disconnected? pg_stat_activity can provide information on open connections, and we could write a report that aggregates it across all the servers if we need to.
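
A hedged sketch of such a report (not an existing Launchpad tool): connect to each server in turn and list the connections that look stuck. The host names are placeholders, and pg_stat_activity's column names vary between PostgreSQL versions; this uses the post-9.2 names (pid, state, query).

    import psycopg2

    SERVERS = ["master", "slave1", "slave2"]  # placeholder host names

    def report_active_connections(servers=SERVERS):
        """List non-idle connections on every server, oldest query first."""
        for host in servers:
            conn = psycopg2.connect(host=host, dbname="launchpad_prod")
            try:
                cur = conn.cursor()
                # Older PostgreSQL releases used procpid/current_query
                # instead of pid/query, and had no state column.
                cur.execute("""
                    SELECT pid, usename, state,
                           now() - query_start AS age, query
                    FROM pg_stat_activity
                    WHERE state <> 'idle'
                    ORDER BY age DESC
                """)
                print("--- %s ---" % host)
                for pid, user, state, age, query in cur.fetchall():
                    print("%s %s %s %s %s" % (pid, user, state, age, query[:60]))
            finally:
                conn.close()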

Or is this more about getting 'I'm not blocked, I'm busy doing stuff' information? That information should all be in the slony log files, but currently it is buried in a lot of noise. We should consider switching to the Slony-I 2.x series, which has apparently cleaned up the logging a lot.

Gary Poster (gary)
Changed in launchpad-foundations:
milestone: 10.03 → none
Revision history for this message
Stuart Bishop (stub) wrote :

I'll look at getting this information from the logs during the rollout. I think we can just run grep on the slon logs with a bit of filtering to get everything we need.

Changed in launchpad-foundations:
assignee: nobody → Stuart Bishop (stub)
Revision history for this message
Stuart Bishop (stub) wrote :

The slon log for one of the slaves, when correctly filtered, provides the relevant information.

When making changes, slonik first applies all the DB patches to the master. If they succeed, the patches are then applied in sequence to each of the slaves: patch1 on slave1, patch1 on slave2, patch2 on slave1, patch2 on slave2, and so on.
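
A toy sketch (not Launchpad code) of what that ordering means for progress estimates: with equal-cost patches, the fraction of slave-side work done can be read straight off the (patch, slave) step counter.

    def slave_phase_progress(patches_done, slaves_done, num_patches, num_slaves):
        """Fraction of the slave phase finished, assuming equal-cost patches.

        patches_done: patches already applied to every slave.
        slaves_done: slaves the current patch has already reached.
        """
        steps_done = patches_done * num_slaves + slaves_done
        return float(steps_done) / (num_patches * num_slaves)

    # e.g. patch1 done everywhere, patch2 applied on 1 of 2 slaves:
    print(slave_phase_progress(1, 1, num_patches=2, num_slaves=2))  # 0.75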

The following seems good to follow what is happening:

tail -f /var/log/slon/slon-launchpad_prod_1.log | grep -v '] DEBUG2'

I'm leaving the DEBUG1 messages in, as they are not that noisy: they happen regularly enough to let you know things haven't crashed, but not so often that they obscure the more useful information, such as the DDL statements being applied.
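
For anyone who prefers a script to a shell pipeline, a rough Python equivalent of the tail/grep above, assuming only that slon log lines carry a '] DEBUG2' level token as in the grep:

    import sys
    import time

    def follow_slon_log(path, drop="] DEBUG2"):
        """Follow a slon log, printing everything except DEBUG2 noise."""
        with open(path) as log:
            log.seek(0, 2)              # start at the end, like tail -f
            while True:
                line = log.readline()
                if not line:
                    time.sleep(0.5)     # wait for slon to write more
                    continue
                if drop not in line:
                    sys.stdout.write(line)

    # follow_slon_log("/var/log/slon/slon-launchpad_prod_1.log")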

Revision history for this message
Stuart Bishop (stub) wrote :

I'll flag this as Won't Fix, as I believe this is good enough for our needs.

Changed in launchpad-foundations:
status: Triaged → Won't Fix