linaro-cloud-buildd: current setup does not scale slave nodes

Bug #744648 reported by Alexander Sack
This bug affects 1 person

Affects: Linaro Android Build Tools
Status: Won't Fix
Importance: Medium
Assigned to: Unassigned

Bug Description

Waiting for an experimental build to start, I looked at the jenkins interface and saw five builds being queued up.

My understanding was that by design we fire off an instance for each build and then shut it down (no pooling or recycling), and that we would set a cap on the number of slaves.

I remember seeing IRC discussion on this, and that there is work in progress to better identify the slave instances in the cloud so we can enforce a proper cap.

Please use this bug to document the current status, and then track progress on this issue.

A solution to get more build throughput and unleash the powers of cloud computing would be highly desirable.

Revision history for this message
Alexander Sack (asac) wrote :

02:58 < james_w> asac, I don't really understand bug 744648. You are asking for us to raise the cap so that we get more than two instances at once?

I wanted this bug to document the current state, and then see if there is anything to improve.

If "we fire off an instance for each build and then shut it down (so not pooling, recycling)", then I don't see why we shouldn't increase the cap, because the price is per build and not per parallel build.

Revision history for this message
Alexander Sack (asac) wrote :

From what I can see on https://android-build.linaro.org/jenkins/, we have idle executors that get reused.

 * Please ensure that we start up a new executor for each build, and that it shuts down at the end of each build.
 * We don't want any machine reuse/pooling.
 * We also don't want any long-running idle executors.

Revision history for this message
Alexander Sack (asac) wrote :

Reuse of running instances isn't OK: it has security implications as we open the service up to more folks, and it does not reset the machine to a clean state for installing different build requirements, etc. (see https://android-build.linaro.org/jenkins/job/asac_jserv-toolchain-test/2/console which was an intermittent failure, I think, and it keeps popping up on builds now).

Changed in linaro-android:
importance: Undecided → High
Revision history for this message
Michael Hudson-Doyle (mwhudson) wrote :

Fixing the "reusing executors" thing requires changes to the Jenkins ec2 plugin, I think to use a different "SlaveRetentionStrategy". But I'm not really sure; it's lots of Java, of course, so there are layers and layers to peel through.
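
The retention policy being asked for can be sketched in plain Python (this is an illustration of the desired behaviour, not the Jenkins plugin API; the class and function names are hypothetical):

```python
# Sketch of a "one-shot" slave retention policy: a slave exists for
# exactly one build, and is marked for termination as soon as that
# build finishes, instead of being returned to an idle pool.

class OneShotSlave:
    def __init__(self, instance_id):
        self.instance_id = instance_id
        self.build_done = False

    def on_build_finished(self):
        self.build_done = True


def retention_check(slave):
    """Retention decision: never keep a slave idle after its build."""
    return "terminate" if slave.build_done else "keep"


s = OneShotSlave("i-0abc123")
print(retention_check(s))   # build still running -> "keep"
s.on_build_finished()
print(retention_check(s))   # build finished -> "terminate"
```

The point is that the decision depends only on whether the slave's single build has completed, so idle executors can never accumulate.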

Revision history for this message
Michael Hudson-Doyle (mwhudson) wrote :

Another alternative, which would perhaps be more robust, would be to have the android-build job trigger a build that kills the instance the triggering build ran on. This probably still requires a plugin to be written, though.
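
The shape of that teardown step, sketched in Python (the `terminate_instance` callable is a hypothetical stand-in for whatever EC2 termination call the real implementation would use; on a real slave the instance id would come from the EC2 metadata service):

```python
# Sketch of a post-build teardown hook: the triggered job receives the
# id of the instance the triggering build ran on and terminates it.

def build_teardown(instance_id, terminate_instance):
    """Kill the instance the triggering build ran on."""
    return terminate_instance(instance_id)


# Usage with a stand-in for the real EC2 call:
terminated = []
build_teardown("i-0abc123", terminated.append)
print(terminated)   # ['i-0abc123']
```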

affects: linaro-android → linaro-android-build-tools
Revision history for this message
Paul Sokolovsky (pfalcon) wrote :

> Waiting for an experimental build to start, I looked at the jenkins interface and saw five builds being queued up.

Well, the main scope of this ticket appears to be the no-reuse policy for slaves, but I'd like to add something about the part quoted above. I have also seen, a couple of times, a build that was scheduled in the frontend but sat for quite a long time in Jenkins, with a clock icon linked to https://wiki.jenkins-ci.org/display/JENKINS/Executor+Starvation . After some time (10+ minutes) it finally started to spin up a slave. If another job was queued during this time, it got the same clock icon, with a message that it was waiting for an available executor on the same instance that was already running the first job. All this time there were only a few (4-5) instances running, which is below the ec2 plugin cap (10 instances total).

So, we more or less regularly see two extremes: Jenkins starting up two instances in a row and dropping one on the floor (lp:760745), or, vice versa, being too shy to start up even the first one, let alone another for the next build. Granted, that behaviour is a bit erratic. Worse, there doesn't appear to be any logging of the ec2 plugin's decision making.
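
For contrast, the provisioning rule one would *expect* here is simple enough to state in a few lines of Python (this is a back-of-envelope model of the desired behaviour, not the actual ec2 plugin logic; the numbers match the situation described above):

```python
# Expected rule: launch one fresh instance per queued build, but never
# exceed the configured instance cap.

def instances_to_launch(queued_builds, running_instances, cap):
    headroom = max(0, cap - running_instances)
    return min(queued_builds, headroom)


# 5 builds queued, 4 of 10 instances running: launch 5 more.
print(instances_to_launch(queued_builds=5, running_instances=4, cap=10))

# 5 builds queued, 8 of 10 running: only 2 slots left under the cap.
print(instances_to_launch(queued_builds=5, running_instances=8, cap=10))
```

Both observed extremes, launching two instances for one build and launching none while builds starve below the cap, violate this rule.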

My plan would be to finish my first iteration on our own codebase (logging, etc.) and then continue Michael's work and attack Jenkins and its plugins on a wide front, even at the source level. We should of course discuss the approach in more detail on tomorrow's call.

Revision history for this message
Michael Hudson-Doyle (mwhudson) wrote : Re: [Bug 744648] Re: linaro-cloud-buildd: current setup does not scale slave nodes

On Wed, 20 Apr 2011 18:40:51 -0000, Paul Sokolovsky <email address hidden> wrote:
> [quote of Paul's previous comment snipped]

I would say we should give /some/ consideration to not using Jenkins.
Jenkins provides various things that it would be annoying to replace --
instance management, live log updates, ... -- but in some ways we seem
to be fighting it too. I suspect we'll end up keeping it, but we should
at least _think_ about this :)

Cheers,
mwh

Revision history for this message
Paul Sokolovsky (pfalcon) wrote :

We seem to be doing well now, so I'm lowering the priority.

Changed in linaro-android-build-tools:
importance: High → Medium
Revision history for this message
Alan Bennett (akbennett) wrote :

Due to the age of this issue, we are acknowledging that it will likely not be fixed. If this issue is still important, please add details and reopen it.

Changed in linaro-android-build-tools:
status: New → Won't Fix