Replication lag checks can block
Bug #504696 reported by Stuart Bishop
This bug affects 1 person
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
Launchpad itself | Fix Released | Critical | Stuart Bishop |
Bug Description
For various perfectly normal reasons, querying the _sl.sl_status view can become slow. This view is what replication_lag() uses. On production, we have seen cases where dozens of appservers query this information while it is slow, causing timeouts and making the load balancers think the appservers have died.
The preferred approach would be a less intrusive way of querying database lag, which would also be faster.
The alternative is to set the statement timeout to something small, such as 0.5 or 0.25 seconds, before doing the lag checks. If the timeout occurs, we can assume we are lagged and should raise a Retry exception to run the request in master-only mode.
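The timeout alternative could look roughly like the sketch below. It is not the code from the merged branch. It assumes a DB-API connection that is not in autocommit mode and that the check runs before the request has done any other work in the transaction. `Retry` here is a hypothetical stand-in for the exception mentioned above, and `checked_replication_lag`, `timeout_ms` and `timeout_errors` are names invented for illustration.

```python
class Retry(Exception):
    """Hypothetical stand-in for the Retry exception named in the report."""


def checked_replication_lag(connection, timeout_ms=250,
                            timeout_errors=(Exception,)):
    """Return replication_lag(), or raise Retry if the check fails.

    Assumption: `connection` is a DB-API connection outside autocommit
    mode. Pass the driver's query-cancelled exception class(es) as
    `timeout_errors`; the broad default only keeps this sketch
    independent of any particular driver.
    """
    cursor = connection.cursor()
    try:
        # Cap how long the server may spend on the lag query (milliseconds).
        cursor.execute("SET statement_timeout TO %d" % int(timeout_ms))
        cursor.execute("SELECT replication_lag()")
        lag = cursor.fetchone()[0]
        # Restore the configured timeout for the rest of the request.
        cursor.execute("RESET statement_timeout")
    except timeout_errors as exc:
        # A cancelled statement aborts the transaction. Rolling back also
        # discards the SET issued inside that transaction.
        connection.rollback()
        raise Retry("replication lag check failed or timed out") from exc
    return lag
```

A caller would catch `Retry` and rerun the request against the master only. With psycopg2, `psycopg2.extensions.QueryCanceledError` is the likely value for `timeout_errors`.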
Related branches
lp:~stub/launchpad/replication: Merged into lp:launchpad
- Abel Deuring (community): Approve (code)
- Aaron Bentley (community): Approve
Diff: 196 lines (+106/-19), 7 files modified
daemons/cache-database-replication-lag.py (+53/-0)
database/replication/helpers.py (+1/-0)
database/schema/comments.sql (+4/-0)
database/schema/patch-2207-28-1.sql (+9/-0)
database/schema/security.cfg (+6/-0)
database/schema/trusted.sql (+22/-0)
lib/canonical/launchpad/webapp/dbpolicy.py (+11/-19)
Changed in launchpad-foundations:
assignee: nobody → Stuart Bishop (stub)
Changed in launchpad-foundations:
status: New → In Progress
milestone: none → 10.01
Changed in launchpad-foundations:
status: Fix Committed → Fix Released
We've now been hit twice by this today, at 10:14 UTC and 11:18 UTC.
The following query can help diagnose the problem. If you see a lot of replication_lag queries, that's a problem:
select current_query from pg_stat_activity where usename like 'lpnet%';
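For repeated checks, the same diagnostic can be wrapped in a small helper. This is only a sketch: `count_lag_queries` is a hypothetical name, and it assumes a DB-API cursor on a server whose pg_stat_activity view still exposes a current_query column.

```python
DIAGNOSTIC_QUERY = (
    "select current_query from pg_stat_activity "
    "where usename like 'lpnet%'"
)


def count_lag_queries(cursor):
    """Return (lag_queries, total_queries) for the lpnet% users.

    Assumption: `cursor` is a DB-API cursor. No parameters are passed to
    execute(), so the query text is sent unchanged.
    """
    cursor.execute(DIAGNOSTIC_QUERY)
    # current_query may be NULL, so treat None as an empty string.
    queries = [row[0] or "" for row in cursor.fetchall()]
    lagging = sum(1 for q in queries if "replication_lag" in q)
    return lagging, len(queries)
```

A high first number relative to the second suggests the appservers are piling up on the lag check.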