should support rate limiting connections to particular servers

Bug #718478 reported by Michael Hudson-Doyle
Affects: Linaro Android Mirror
Status: Fix Released
Importance: High
Assigned to: Paul Sokolovsky

Bug Description

It will likely be a good idea to not connect to external servers too frequently lest we overload them.

Revision history for this message
Paul Sokolovsky (pfalcon) wrote :

I guess we could do the following: maintain and check a timestamp for each
server, and by default not repo-sync it if it was already updated within
(say) 15 minutes. And specify a list of override servers which should be
synced every time (git.linaro.org).
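
A minimal sketch of that check in Python (the names here are illustrative
only, not actual mirror-service code):

import time

# Hosts that should be synced on every request, regardless of the timestamp.
ALWAYS_SYNC = {"git.linaro.org"}
DEFAULT_STALE = 15 * 60  # 15 minutes

def should_sync(host, last_sync, now=None):
    """Decide whether to repo-sync `host`, given a dict of last sync times."""
    if host in ALWAYS_SYNC:
        return True
    now = now if now is not None else time.time()
    return now - last_sync.get(host, 0) >= DEFAULT_STALE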

One problem I have with Twisted is that it's easy to write a server which
does direct processing, but it becomes increasingly complex to add features
like logging, error-reporting, auto-retrying, rate limiting, etc. - the code
becomes more and more involved and non-obvious. Meanwhile, it would be
trivial to add such features to standard synchronous (+threading/forking)
code ;-). Please see my attempt at adding logging for lp:724096 though.

Revision history for this message
Michael Hudson-Doyle (mwhudson) wrote : Re: [Bug 718478] Re: should support rate limiting connections to particular servers

On Mon, 18 Apr 2011 12:49:02 -0000, Paul Sokolovsky <email address hidden> wrote:
> I guess we could do following: maintain and check timestamp for each
> server, and by default not repo-sync it if already updated within (say)
> 15mins. And specify list of override servers which should be synced
> every time (git.linaro.org).

Yeah, something like that would be good.

> One problem I have with Twisted is that it's easy to write a server
> which does direct processing, but it becomes increasingly complex to add
> features like logging, error-reporting, auto-retrying, rate limiting,
> etc. - with code become more and more involved and non-obvious. While,
> it would be trivial to add such features to standard synchronous
> (+threading/forking) code ;-).

I'm not at all sure I buy this, but it's more your codebase than mine
now so feel free to stop using Twisted if it will make your life easier!

Changed in linaro-android-mirror:
assignee: nobody → Paul Sokolovsky (pfalcon)
Revision history for this message
Paul Sokolovsky (pfalcon) wrote :

After making a change to propagate mirror failures (lp:724096), while we
made sure we build up-to-date and correct code, we also became susceptible
to the flakiness of upstream 3rd-party servers. And AOSP server flakiness
increases every week :-(. Right now I cannot proceed with toolchain build
work because I haven't been able to do a build for a few hours. And Patrik
Ryd reported similar issues at the end of last week, when he couldn't
complete a source tree checkout of the omapzoom repo with the mirror
service default of 8 HTTP threads for repo sync.

So, this appears to be a pressing issue which needs to be resolved out of
band. The plan:

Add a config variable UPSTREAM_HOSTS to settings.py: a dictionary mapping a
host name to a dictionary of mirroring parameters. E.g.:

UPSTREAM_HOSTS = {
    "android.git.kernel.org": {"stale": 60*60, "jobs": 6},
    "git.omapzoom.org": {"stale": 30*60, "jobs": 1},
    "git.linaro.org": {"stale": 1*60, "jobs": 8},
    "*": {"stale": 10*60, "jobs": 4},
}

"stale" specifies period after which host's repo should be updated (if less than that passed since last update, no need to sync again), "jobs" - -j value.

Revision history for this message
Paul Sokolovsky (pfalcon) wrote : Fw: Mirror service per-host configs and timeouts to alleviate upstream downtimes

Begin forwarded message:

Date: Mon, 6 Jun 2011 11:31:42 +0200
From: Alexander Sack <email address hidden>
To: James Westby <email address hidden>
Cc: Paul Sokolovsky <email address hidden>
Subject: Re: Mirror service per-host configs and timeouts to alleviate
upstream downtimes

On Fri, Jun 3, 2011 at 11:51 PM, James Westby <email address hidden>
wrote:
> On Fri, 3 Jun 2011 23:40:35 +0300, Paul Sokolovsky
> <email address hidden> wrote:
>> What exact issues do you see here? These changes would allow us to
>> always sync with trees we directly develop on (git.linaro.org, maybe
>> some vendor trees later), while sync more lazily with 3rd party
>> trees, of which we don't even use master branch, but some release
>> branch/tag which updates infrequently on their own.
>>
>> Just in case, Alexander on IRC expressed desire for the following
>> schedule: always sync with git.linaro.org, for other hosts, can sync
>> like twice a day.
>
> Well, that's the exact problem that I see. Google does another code
> drop and we have a choice of waiting 12 hours until we can build it,
> or doing some manual intervention?

ATM, we manually track google tags (point releases) for our builds and
rebase our changes to new tags when they show up. We might want to
track the release branch rather than tags at some point, but for our
current practice this does not really matter and 12h is probably fine
if this helps.

--

 - Alexander

--
Best Regards,
Paul

Revision history for this message
Paul Sokolovsky (pfalcon) wrote : Re: Mirror service per-host configs and timeouts to alleviate upstream downtimes

Hello James,

On Fri, 03 Jun 2011 17:51:08 -0400
James Westby <email address hidden> wrote:

> I wasn't suggesting that we block this change, just start the
> discussion around a more complete solution.

Ok, sounds good. I deployed those changes in the meantime - they seem to
work nicely, and, well, android.git.kernel.org is back up. I'm also cc:ing
lp:718478 so the discussion is captured.

> On Fri, 3 Jun 2011 23:40:35 +0300, Paul Sokolovsky
> <email address hidden> wrote:
> > What exact issues do you see here? These changes would allow us to
> > always sync with trees we directly develop on (git.linaro.org, maybe
> > some vendor trees later), while sync more lazily with 3rd party
> > trees, of which we don't even use master branch, but some release
> > branch/tag which updates infrequently on their own.
> >
> > Just in case, Alexander on IRC expressed desire for the following
> > schedule: always sync with git.linaro.org, for other hosts, can sync
> > like twice a day.
>
> Well, that's the exact problem that I see. Google does another code
> drop and we have a choice of waiting 12 hours until we can build it,
> or doing some manual intervention?

Well, as Alexander pointed out, 12hrs is probably not that much, but I
agree that a good solution should minimize the delay automagically; I have
ideas on that (below).

But let's first consider the situation we used to have. It's a fact that
upstream git servers can be overloaded or down, even for longer than 12hrs.
Potentially, during any such outage Google can make a code drop (a pretty
realistic scenario actually - Google did a code drop and the servers got
DDoSed). So, would we want, in case of upstream server non-availability, to
not build anything at all, on the basis that there's a possibility that in
a place far, far away new code has landed that we don't have?

I guess that's a worse alternative than being able to still build what we
have, especially when what we already have is exactly what we need. After
all, we added the mirroring service to minimize extra-cloud traffic, but it
brings us extras, like allowing us to also improve our HA.

Now let's consider what risks there are. Building stale code will be a
problem for release builds, but release builds should really only be made
from tags. So, we either have that tag and can build it, or we don't have
it and can't (this relies on a good upstream tagging policy, like not
moving tags).

For daily builds of branches, we'd normally just have a 12hr average delay,
the same as for the builds themselves. But here's an idea how to improve
that: following the previous patch, also add "soft_stale" and "hard_stale"
settings. An upstream synced less than soft_stale ago won't be synced at
all. After that, a sync will be attempted, but it's ok for it to fail
without affecting a build. Once hard_stale has passed, a failed sync will
fail the build. So, for android.git we could set soft_stale=2hrs and
hard_stale=24hrs and be pretty good.
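
A rough sketch of that policy in Python, assuming a per-host "last
successful sync" timestamp is tracked (names are illustrative only):

import time

def sync_policy(last_sync_ts, soft_stale, hard_stale, now=None):
    """Classify what to do with a host based on the age of its last sync."""
    now = now if now is not None else time.time()
    age = now - last_sync_ts
    if age < soft_stale:
        return "skip"  # fresh enough: don't touch upstream at all
    if age < hard_stale:
        return "try"   # attempt a sync, but don't fail the build on error
    return "must"      # attempt a sync; a failure here fails the build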

Finally, for real-time developers' builds, we could indeed provide at first
a script, and later a frontend UI, to request an unconditional sync.
How does that sound?

--
Best Regards,
Paul

Revision history for this message
James Westby (james-w) wrote : Re: [Bug 718478] Fw: Mirror service per-host configs and timeouts to alleviate upstream downtimes

On Mon, 06 Jun 2011 16:30:25 -0000, Paul Sokolovsky <email address hidden> wrote:
> ATM, we manually track google tags (point releases) for our builds and
> rebase our changes to new tags when they show up. We might want to
> track the release branch rather than tags at some point, but for our
> current practice this does not really matter and 12h is probably fine
> if this helps.

The Android release was just an example. What if instead the example was
"a bugfix we want was added to the pandroid tree"?

Thanks,

James

Revision history for this message
James Westby (james-w) wrote : Re: [Bug 718478] Re: Mirror service per-host configs and timeouts to alleviate upstream downtimes

On Mon, 06 Jun 2011 17:01:59 -0000, Paul Sokolovsky <email address hidden> wrote:
> But let's first consider situation we used to have. It's the fact that
> upstream git servers can be overloaded/down, and even for longer than
> 12hrs. Potentially, during any such outages Google can made a code drop
> (pretty realistic scenario actually - Google did code drop and servers
> got DDOSed). So, would we want, in case of upstream server
> non-availability, to not build anything at all, on the basis that
> there's possibility that in place far, far away a new code has landed
> that we don't have?

No, I don't know where you get the idea that I am suggesting that. We
need to design a robust system that gives the possibility to have quick
turnaround when needed. We used to have a non-robust system with
quick-turnaround. We now have a robust system with slow turnaround.

> I guess, that's worse alternative than be able to still build what we
> have, especially when what we already have is exactly what we need.
> After all, we added mirroring service to minimize extra-cloud traffic,
> but it brings us extra, like allows to also improve our HA points.
>
> Now let's consider what risks are there. Building stale code will be
> problem for release builds, but release builds should really use only
> builds from tags. So, we either have that tag and can build it, or
> don't have, and can't (this relies on good upstream tagging policy,
> like not moving tags).

Right, if we can't get the code we need to build then we shouldn't
build, that much seems obvious to me.

> For daily builds for branches, we'd just normally have 12hrs average
> delay, the same as for builds themselves. But here's idea how to
> improve that: following previous patch, add also "soft_stale" and
> "hard_stale" settings. Upstream synced less than soft_stale time ago
> won't be synced at all. After that, sync will be attempted, but it's ok
> for it to fail w/o affecting a build. After hard_stale time passed,
> failed sync will fail the build. So, for android.git we could set
> soft_delay=2hrs and hard_delay=24hrs and be pretty good.

That sounds like a useful part of a solution to me, provided that the
sync is atomic and so a failed sync on the soft_delay period doesn't
corrupt the repos.

> Finally, for real-time developers' builds, we indeed could provide
> at first a script, later frontend UI to request unconditional sync.

This would be a "request sync" step to force a sync?

Why is it a separate step? The developer would then have to request a
sync and then wait until it was complete before submitting their build
to ensure that the build used the code that they wanted.

Could it just be a part of the build config which is translated to extra
info passed to the mirror service to force an override?

Thanks,

James

Revision history for this message
Paul Sokolovsky (pfalcon) wrote :

On Mon, 06 Jun 2011 18:55:30 -0000
James Westby <email address hidden> wrote:

> On Mon, 06 Jun 2011 17:01:59 -0000, Paul Sokolovsky
> <email address hidden> wrote:
> > But let's first consider situation we used to have. It's the fact
> > that upstream git servers can be overloaded/down, and even for
> > longer than 12hrs. Potentially, during any such outages Google can
> > made a code drop (pretty realistic scenario actually - Google did
> > code drop and servers got DDOSed). So, would we want, in case of
> > upstream server non-availability, to not build anything at all, on
> > the basis that there's possibility that in place far, far away a
> > new code has landed that we don't have?
>
> No, I don't know where you get the idea that I am suggesting that.

Well, I didn't say that; I just wanted to draw the extremes, to find a good
place in between where the system can sustainably function.

> We
> need to design a robust system that gives the possibility to have
> quick turnaround when needed. We used to have a non-robust system with
> quick-turnaround. We now have a robust system with slow turnaround.
>

[]

> > For daily builds for branches, we'd just normally have 12hrs average
> > delay, the same as for builds themselves. But here's idea how to
> > improve that: following previous patch, add also "soft_stale" and
> > "hard_stale" settings. Upstream synced less than soft_stale time ago
> > won't be synced at all. After that, sync will be attempted, but
> > it's ok for it to fail w/o affecting a build. After hard_stale time
> > passed, failed sync will fail the build. So, for android.git we
> > could set soft_delay=2hrs and hard_delay=24hrs and be pretty good.
>
> That sounds like a useful part of a solution to me, provided that the
> sync is atomic and so a failed sync on the soft_delay period doesn't
> corrupt the repos.

Can you elaborate on this? If by corrupt you mean an inconsistent state,
then the picture may indeed be not that bright - git pull is of course
atomic, but repo just pulls the git subtrees one by one, so it can be the
case that one subtree is updated while another is not.

>
> > Finally, for real-time developers' builds, we indeed could provide
> > at first a script, later frontend UI to request unconditional sync.
>
> This would be a "request sync" step to force a sync?
>
> Why is it a separate step? The developer would then have to request a
> sync and then wait until it was complete before submitting their build
> to ensure that the build used the code that they wanted.
>
> Could it just be a part of the build config which is translated to
> extra info passed to the mirror service to force an override?

Well, it's a matter of separation of concerns. So far, what's in the build
config affects just that build. Mirror control, on the other hand, is at
another level, and may affect other builds.

Usecase:

1. The mirror has synced, and the upstream host soon went down.
2. During the soft_stale period, all builds would still succeed.
3. But one developer decided to force a mirror sync (essentially
by removing the last sync timestamps).
4. As the upstream host is down, the sync didn't succeed.
5. The developer's build thus failed.
6. But any other build after this one will also fail (whereas
otherwise there would be a soft_stale fail-free period).


Revision history for this message
James Westby (james-w) wrote :

On Tue, 07 Jun 2011 16:17:27 -0000, Paul Sokolovsky <email address hidden> wrote:
> On Mon, 06 Jun 2011 18:55:30 -0000
> James Westby <email address hidden> wrote:
> > That sounds like a useful part of a solution to me, provided that the
> > sync is atomic and so a failed sync on the soft_delay period doesn't
> > corrupt the repos.
>
> Can you elaborate on this? If by corrupt you mean inconsistent state,
> then it may be not that bright - git pull is of course atomic, but repo
> just pulls git subtrees one by one, so it can be the case that one
> subtree is updated, while another not.

If we are doing soft_delay syncs that fail and corrupt the repo such
that no-one can do a build, then there's no practical difference between
this approach and the one-timeout approach. Either way, if the upstream is
down then no-one can do a build anymore.

I think this means it just needs to be atomic at the git repo level,
though there will be the odd occasion where skew would cause a build
failure that wouldn't have been there had the partial sync not happened.

That's not a big problem though, and so given that git is atomic this
should be fine.

> Well, it's matter of concept separation. So far, what's in the build
> config, affects just that build. Mirror control, on the other hand, is
> on another level, and may affect other builds.
>
> Usecase:
>
> 1. Mirror has synced, and upstream host soon went down.
> 2. During soft_stale period, all builds would still succeed.
> 3. But one developer decided to force mirror sync (essentially
> by removing last sync timestamps).
> 4. As upstream host is down, the sync didn't succeed.
> 5. Developer's build thus failed.
> 6. But any other build since this build will also fail. (Whereas
> otherwise there would soft_stale fail-free period).

Why would other builds have to fail?

> Taking into account that build config changes are persistent (vs
> one-off), I envisioned "force sync" feature as a separate, special
> frontend action available only to build admins (like ability to create
> official builds). But if you think that controlling it on the build
> config level is useful and scenario above isn't problematic, it can be
> done there.

My argument wasn't so much about where we put it, but about the user
experience. If it is its own action, then I have to babysit it as it
syncs until I can start my build.

If however I say "build now and ensure that the mirror is up to date"
then I can go away and either get the build I want, or get a failure if
the mirror is not up to date.

> And we still can start with just providing couple of scripts on
> android-build.linaro.org, so people with access can SSH in and do:
>
> force-mirror-sync - next mirror service request will force mirrored
> repos update (impl: remove last sync timestamps for all hosts)
>
> pretend-mirror-sync - mark all hosts as if they were just synced, the
> remedy for prolonged upstream host unavailability (impl: touch last
> sync timestamps)

Sure, these may be useful things to have, but I don't think they address
the use case I am discussing.

   As an Android build service user I want to trigger a build
   immediately after seeing a fix I need committed to an upstream repo
   and be sure that my build will see that fix so that I can react
   quickly to upstream changes.


Changed in linaro-android-mirror:
importance: Undecided → High
status: New → In Progress
Revision history for this message
Paul Sokolovsky (pfalcon) wrote :

On Tue, 07 Jun 2011 22:26:11 -0000
James Westby <email address hidden> wrote:

> On Tue, 07 Jun 2011 16:17:27 -0000, Paul Sokolovsky
> <email address hidden> wrote:
> > On Mon, 06 Jun 2011 18:55:30 -0000
> > James Westby <email address hidden> wrote:
> > > That sounds like a useful part of a solution to me, provided that
> > > the sync is atomic and so a failed sync on the soft_delay period
> > > doesn't corrupt the repos.
> >
> > Can you elaborate on this? If by corrupt you mean inconsistent
> > state, then it may be not that bright - git pull is of course
> > atomic, but repo just pulls git subtrees one by one, so it can be
> > the case that one subtree is updated, while another not.
>
> If we are doing soft_delay syncs that fail and corrupt the repo such
> that no-one can do a build then there's no practical difference with
> this approach to the one-timeout approach. Either way if the upstream
> is down then no-one can do a build anymore.

Well, it's not that wreck-ful: a failed sync never updates the "last
successful sync at" timestamp, so the algorithm essentially works as: don't
sync until soft_stale; then keep banging upstream on each build until it
syncs successfully, just without failing the build on a sync error; then,
after hard_stale has passed, fail the build. So, in the scenario you
describe, one can at least keep trying.

And you're of course right that this is not a 100% robust solution, but a
heuristic to reduce the number of errors we experience due to upstream
flips by an order of magnitude.
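
As a rough sketch of that behaviour (the helper names are made up, not the
deployed code), only a successful sync advances the timestamp, so failed
attempts are naturally retried on every subsequent build:

import time

def sync_host(host, last_sync, soft_stale, hard_stale, do_sync):
    """do_sync() runs the actual `repo sync` and returns True on success."""
    age = time.time() - last_sync.get(host, 0)
    if age < soft_stale:
        return True                    # fresh enough: skip upstream entirely
    if do_sync():
        last_sync[host] = time.time()  # only success updates the timestamp
        return True
    return age < hard_stale            # tolerate failures until hard_stale expires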

>
> I think this means it just needs to be atomic at the git repo level,
> though there will be an odd occaision where skew would cause a build
> failure that wouldn't have been there had the partial sync not
> happened.
>
> That's not a big problem though, and so given that git is atomic this
> should be fine.

I also cc:d you on a mail with links showing that repo is moving in the
direction of reusing git submodules, so it may become atomic itself over
time (assuming that git submodules support proper atomic updates
themselves).

[]
> Sure, these may be useful things to have, but I don't think they
> address the use case I am discussing.
>
> As an Android build service user I want to trigger a build
> immediately after seeing a fix I need committed to an upstream repo
> and be sure that my build will see that fix so that I can react
> quickly to upstream changes

Ok, so we settled this on IRC; I filed
https://bugs.launchpad.net/linaro-android-frontend/+bug/795145 for this.
Note that asac recently asked for a similar (at least UI-wise) thing, so
it's indeed a good idea to consider what's useful and then pack the
implementation together:
https://bugs.launchpad.net/linaro-android-frontend/+bug/786466

Thanks,
Paul

Revision history for this message
Paul Sokolovsky (pfalcon) wrote :

Changes made in the course of this bug improved the situation considerably
- we haven't seen any problems for more than 2 months. Closing.

Changed in linaro-android-mirror:
status: In Progress → Fix Released