Ubuntu Distributed Development

udd importer should make tea while launchpad is down

Reported by Martin Pool on 2011-06-10
Affects                         Status        Importance  Assigned to
Launchpad itself                Invalid       High        Unassigned
Ubuntu Distributed Development  Fix Released  High        Vincent Ladeuil

Bug Description

Launchpad goes offline or readonly roughly every month for about an hour.

At the moment the package importer handles this quite poorly by failing every single package that it tries to import during that time. It can take some time for them to get retried; I seem to recall that in some cases manual intervention is needed.

It might be better if, before starting an import, it checked whether Launchpad is up (and writable) at all. If not, it should probably just not try anything else for a few minutes.

I don't think Launchpad exports a specific machine-readable interface to say "I'm down" or "I'm readonly". We could look for "readonly" in the html of the front page or try some other requests. Possibly that code should go into lplib if it's reusable.
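As an illustration only (Launchpad documents no such interface, so the "read-only" marker, the URL handling, and the timeout below are all assumptions), such a probe could look like:

```python
from urllib.request import urlopen

READONLY_MARKER = b"read-only"  # assumed marker text in the front-page HTML

def launchpad_is_up(url="https://launchpad.net/", fetch=None):
    """Return True if Launchpad looks up and writable, False otherwise.

    `fetch` can be injected for testing; by default the front page is
    fetched and scanned for the assumed read-only marker.
    """
    fetch = fetch or (lambda u: urlopen(u, timeout=30).read())
    try:
        body = fetch(url)
    except OSError:
        return False   # can't reach the front page at all: treat as down
    return READONLY_MARKER not in body
```

If code like this turned out to be reusable across scripts, lplib would indeed be a natural home for it.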

James Westby (james-w) wrote :

On Fri, 10 Jun 2011 00:09:47 -0000, Martin Pool <email address hidden> wrote:
> It might be better if, before starting an import, it checked if
> Launchpad is up (and writable) at all. If not, it should probably just
> not try anything else for a few minutes.

There's a small issue here in that there are two processes involved.

The mass_import process is what should ideally sleep for a while. This
doesn't communicate with LP at all currently.

import_package.py could check whether LP is up and exit with a "LP
is down" exit code, which would then trigger a sleep.

Another way to do it would just be to keep a tally of the number of
failures in the last N minutes and sleep if it's 100% failures for a
while.

Perhaps even both should be implemented?

Thanks,

James
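Both ideas could be sketched together in the mass_import loop along these lines (the exit code 42, the window size, and the function name are invented for illustration, not taken from the importer):

```python
from collections import deque

LP_DOWN_EXIT_CODE = 42   # hypothetical "LP is down" code from import_package.py
WINDOW = 20              # how many recent import results to tally

recent = deque(maxlen=WINDOW)   # True = success, False = failure

def should_sleep(exit_code):
    """Decide whether mass_import should pause before starting the next job.

    Pauses either when the child explicitly reports LP down, or when the
    last WINDOW imports were 100% failures.
    """
    if exit_code == LP_DOWN_EXIT_CODE:
        return True                     # child told us LP is down
    recent.append(exit_code == 0)
    return len(recent) == WINDOW and not any(recent)
```

mass_import would then call something like `should_sleep(child.returncode)` after each import and sleep for a few minutes whenever it returns True.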

Vincent Ladeuil (vila) wrote :

Another point to keep in mind is that some imports are running when lp goes down and it's probably not worth the effort to try to cope with that.

It's probably simpler to have mass_import detect that lp is down (canary polling thread), suspend starting new imports while that is the case, and requeue whatever failures happen while lp is down.

The circuit breaker pattern may interest you guys.
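A minimal sketch of such a canary thread, assuming some `probe()` callable that returns True when Launchpad answers (the probe itself, the interval, and the class name are all invented here):

```python
import threading

class LaunchpadCanary:
    """Poll `probe` in the background and gate the start of new imports."""

    def __init__(self, probe, interval=60):
        self._probe = probe
        self._interval = interval
        self._up = threading.Event()
        self._up.set()                        # assume up until proven down
        self._stop = threading.Event()
        self._thread = threading.Thread(target=self._run, daemon=True)

    def start(self):
        self._thread.start()

    def stop(self):
        self._stop.set()
        self._thread.join()

    def _run(self):
        while not self._stop.wait(self._interval):
            if self._probe():
                self._up.set()                # resume starting new imports
            else:
                self._up.clear()              # suspend starting new imports

    def wait_until_up(self, timeout=None):
        """Block the import loop until Launchpad looks up again."""
        return self._up.wait(timeout)
```

mass_import would call `canary.wait_until_up()` before starting each import, and requeue anything that failed while the canary reported LP down.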

Martin Pool (mbp) wrote :

That does look like a good pattern for backing off and gently retrying. http://davybrion.com/blog/2008/05/the-circuit-breaker/
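For reference, a generic sketch of that pattern (the threshold, timeout, and names are illustrative, not from the importer): after `threshold` consecutive failures the breaker opens and calls fail fast, until `reset_timeout` seconds pass and one trial call is let through.

```python
import time

class CircuitBreaker:
    def __init__(self, threshold=5, reset_timeout=300, clock=time.monotonic):
        self.threshold = threshold
        self.reset_timeout = reset_timeout
        self.clock = clock
        self.failures = 0
        self.opened_at = None

    def call(self, func, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_timeout:
                raise RuntimeError("circuit open: Launchpad assumed down")
            self.opened_at = None            # half-open: allow one trial call
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.threshold:
                self.opened_at = self.clock()
            raise
        self.failures = 0                    # any success closes the circuit
        return result
```

Wrapping each Launchpad call in `breaker.call(...)` would turn a failure storm into one cheap, fast-failing check per backoff period.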

Changed in udd:
assignee: nobody → Jonathan Riddell (jr)
Stuart Bishop (stub) wrote :

Adding Launchpad for now in case we should do something in Launchpad or launchpadlib to ease the pain. We now expect weekly 5 minute outages rather than monthly hour long outages.

tags: added: fastdowntime-later
Changed in launchpad:
status: New → Triaged
importance: Undecided → High
Vincent Ladeuil (vila) on 2011-09-14
Changed in udd:
assignee: Jonathan Riddell (jr) → Vincent Ladeuil (vila)
status: Confirmed → In Progress
Stuart Bishop (stub) wrote :

This might be much nicer when Launchpad is generating 503 error codes instead of 500 error codes when the database is unavailable. launchpadlib could handle 503s by retrying (and some sort of backoff algorithm of course). Launchpad generating 503s is Bug #844631 .
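Since launchpadlib reportedly does not do this yet, here is only a generic sketch of what retry-on-503 with exponential backoff could look like; `do_request` returning a (status, body) pair and all the numbers are stand-ins:

```python
import time

def call_with_backoff(do_request, retries=5, base_delay=2.0, sleep=time.sleep):
    """Retry `do_request` on 503, doubling the delay between attempts."""
    for attempt in range(retries):
        status, body = do_request()
        if status != 503:
            return status, body
        if attempt < retries - 1:
            sleep(base_delay * (2 ** attempt))   # 2s, 4s, 8s, ...
    return status, body
```

With defaults like these the retries span roughly half a minute, so the delays would need tuning to ride out a full 5-minute outage transparently.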

Vincent Ladeuil (vila) wrote :

@Stuart: I am subscribed to bug #844631 (shortly after you filed it even ;).

The proposed fix for this bug can evolve to catch up with launchpadlib and Launchpad themselves behaving differently; in the meantime, it will help avoid the failure storms we're seeing now (currently leading to hundreds of import failures).

All fixes in lp and launchpadlib will be warmly welcome but the importer will need to be modified to benefit from them anyway.

Belt and suspenders ftw :)

James Westby (james-w) wrote :

On Thu, 22 Sep 2011 08:45:01 -0000, Stuart Bishop <email address hidden> wrote:
> This might be much nicer when Launchpad is generating 503 error codes
> instead of 500 error codes when the database is unavailable.
> launchpadlib could handle 503s by retrying (and some sort of backoff
> algorithm of course). Launchpad generating 503s is Bug #844631 .

Hi,

The package importer already handles 503s with some backoff, so it may
be that with fastdowntime tea making isn't really needed (though still a
good thing to have for when there may be some notsofastdowntime.)

I believe that there is code in recent versions of lazr.restfulclient to
do backoff+retry, but I'm not sure whether it amounts to 5 minutes of
retries, so fastdowntime may not be completely transparent to scripts.

Thanks,

James

Martin Pool (mbp) wrote :

We should deploy a new restful client, which will stop some 50x errors being
incorrectly mapped and thereby breaking retries.

Vincent Ladeuil (vila) wrote :

Yup, there is a lot more that can be done on every call site (or shared entry point) related to Launchpad; the 'make tea' is a catch-all for all issues that are either not fixed yet or plainly unknown.

Robert Collins (lifeless) wrote :

Marked Invalid for LP: there are other bugs about LP returning 503 etc., and no actions in LP for this particular bug. If this changes, please do reopen it.

Changed in launchpad:
status: Triaged → Invalid
Vincent Ladeuil (vila) wrote :

Today's rollout was a complete success. Relevant log excerpts:

2011-09-30 08:32:02,308 - __main__ - INFO - Launchpad is down, re-trying jcifs

2011-09-30 08:32:02,427 - __main__ - INFO - Testing if Launchpad is back

2011-09-30 08:32:02,440 - __main__ - INFO - Launchpad is down, re-trying jaxml
2011-09-30 08:32:02,443 - __main__ - INFO - Launchpad is down, re-trying jazip
2011-09-30 08:32:02,537 - __main__ - INFO - Trying jaxml
2011-09-30 08:32:02,546 - __main__ - INFO - Starting thread for jaxml
2011-09-30 08:32:02,548 - __main__ - INFO - Testing if Launchpad is back

<bunch of failures/retries>

2011-09-30 08:33:42,151 - __main__ - INFO - Launchpad is down, re-trying jaxml
2011-09-30 08:33:42,269 - __main__ - INFO - Trying jaxml
2011-09-30 08:33:42,279 - __main__ - INFO - Starting thread for jaxml
2011-09-30 08:33:42,281 - __main__ - INFO - Testing if Launchpad is back
2011-09-30 08:34:09,269 - __main__ - INFO - Success jclicmoodle: Nothing new
2011-09-30 08:34:09,336 - __main__ - INFO - thread for jclicmoodle finished
2011-09-30 08:34:09,337 - __main__ - INFO - threads for [u'jcifs', u'jbofihe', u'jcal', u'jazip', u'jclic', u'jcharts', u'jaxml'] still active
2011-09-30 08:34:09,337 - __main__ - INFO - Launchpad *is* back

For the record, [u'jcifs', u'jbofihe', u'jcal', u'jazip', u'jclic', u'jcharts', u'jaxml'] were used to check if lp was back. jclicmoodle was the first to finish successfully, in 27".

So the perceived outage was 2'7" (including these 27"), so approximately 1'40" excluding them.

There was no failure spike on the importer and the lp downtime was recovered from roughly 30 seconds after it ended instead of the previous *hours*.
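For readers reconstructing the behaviour from these logs (this is a sketch of what the log shows, not the actual importer code): the packages that failed while LP was down are kept and re-tried as probes, and the first success is taken as the signal that Launchpad is back.

```python
def drain_outage(failed, try_import, sleep=lambda s: None, delay=30):
    """Retry `failed` package names until one `try_import` call succeeds.

    Returns the first package to import successfully ("Launchpad *is* back").
    A real implementation would also bound how long it keeps trying.
    """
    while True:
        for name in list(failed):
            if try_import(name):         # "Success <name>"
                failed.remove(name)
                return name
        sleep(delay)                     # still down; wait before re-trying
```
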

Changed in udd:
status: In Progress → Fix Released
James Westby (james-w) wrote :

On Fri, 30 Sep 2011 09:24:25 -0000, Vincent Ladeuil <email address hidden> wrote:
> There was no failure spike on the importer and the lp downtime was
> recovered from roughly 30 seconds after it ended instead of the previous
> *hours*.

Great work Vincent, this will save a lot of confusion in future.

Thanks,

James
