linaro-cloud-buildd: current setup does not scale slave nodes

Bug #744648 reported by Alexander Sack
This bug affects 1 person

Affects: Linaro Android Build Tools
Status: Won't Fix
Importance: Medium
Assigned to: Unassigned

Bug Description

Waiting for an experimental build to start, I looked at the jenkins interface and saw five builds being queued up.

My understanding was that by design we fire off an instance for each build and then shut it down (no pooling or recycling), and that we would set a cap on the number of slaves.

I remember seeing IRC discussion on this, and that there is work in progress to better identify the slave instances in the cloud so we can enforce a proper cap.

Please use this bug to document the current status, and then track progress on this issue.

A solution to get more build throughput and unleash the powers of cloud computing would be highly desirable.

Revision history for this message
Alexander Sack (asac) wrote :

02:58 < james_w> asac, I don't really understand bug 744648. You are asking for us to raise the cap so that we get more than two instances at once?

I wanted this bug to document the current state, and then see if there is anything to improve.

If "we fire off an instance for each build and then shut it down (so not pooling, recycling)", then I don't see why we shouldn't increase the cap, because the price is per build and not per parallel build.

Revision history for this message
Alexander Sack (asac) wrote :

From what I can see on https://android-build.linaro.org/jenkins/, we have idle executors that get reused.

 * Please ensure that we start up a new executor for each build, and that it shuts down at the end of each build.
 * We don't want any machine reuse/pooling.
 * We also don't want any long-running idle executors.

Revision history for this message
Alexander Sack (asac) wrote :

Reuse of running instances isn't OK: it has security implications as we open the service up to more folks, and it does not reset the machine to a clean state for installing different build requirements, etc. (see https://android-build.linaro.org/jenkins/job/asac_jserv-toolchain-test/2/console which was an intermittent failure, I think, and it keeps popping up on builds now).

Changed in linaro-android:
importance: Undecided → High
Revision history for this message
Michael Hudson-Doyle (mwhudson) wrote :

Fixing the "reusing executors" thing requires changes to the Jenkins ec2 plugin, I think to use a different "SlaveRetentionStrategy". But I'm not really sure; it's lots of Java, of course, so there are layers and layers to peel through.
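
The retention policy being asked for can be sketched in plain Python (this is an illustration of the desired behaviour, not the Jenkins plugin API; the class and function names are hypothetical):

```python
# Sketch of a "one-shot" slave retention policy: a slave exists for
# exactly one build, and is marked for termination as soon as that
# build finishes, instead of being returned to an idle pool.

class OneShotSlave:
    def __init__(self, instance_id):
        self.instance_id = instance_id
        self.build_done = False

    def on_build_finished(self):
        self.build_done = True


def retention_check(slave):
    """Retention decision: never keep a slave idle after its build."""
    return "terminate" if slave.build_done else "keep"


s = OneShotSlave("i-0abc123")
print(retention_check(s))   # build still running -> "keep"
s.on_build_finished()
print(retention_check(s))   # build finished -> "terminate"
```

The point is that the decision depends only on whether the slave's single build has completed, so idle executors can never accumulate.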

Revision history for this message
Michael Hudson-Doyle (mwhudson) wrote :

Another alternative, which would perhaps be more robust, would be to have the android-build job trigger a build that kills the instance the triggering build ran on. This probably still requires a plugin to be written, though.
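
The shape of that teardown step, sketched in Python (the `terminate_instance` callable is a hypothetical stand-in for whatever EC2 termination call the real implementation would use; on a real slave the instance id would come from the EC2 metadata service):

```python
# Sketch of a post-build teardown hook: the triggered job receives the
# id of the instance the triggering build ran on and terminates it.

def build_teardown(instance_id, terminate_instance):
    """Kill the instance the triggering build ran on."""
    return terminate_instance(instance_id)


# Usage with a stand-in for the real EC2 call:
terminated = []
build_teardown("i-0abc123", terminated.append)
print(terminated)   # ['i-0abc123']
```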

affects: linaro-android → linaro-android-build-tools
Revision history for this message
Paul Sokolovsky (pfalcon) wrote :

> Waiting for an experimental build to start, I looked at the jenkins interface and saw five builds being queued up.

Well, the main scope of this ticket appears to be the no-reuse policy for slaves, but I'd like to add something about the part quoted above. I have also seen, a couple of times, a build that was scheduled in the frontend but sat for quite a long time in Jenkins, with a clock icon linked to https://wiki.jenkins-ci.org/display/JENKINS/Executor+Starvation . After some time (10+ minutes) it finally started to spin up a slave. If another job was queued during this time, it got the same clock icon, with a message that it was waiting for an available executor on the same instance that was already running the first job. All this time there were only a few (4-5) instances running, which is below the ec2 plugin cap (10 instances total).

So, we more or less regularly see two extremes: Jenkins starting up two instances in a row and dropping one on the floor (lp:760745), or, vice versa, being too shy to start up even the first one, let alone another for the next build. Granted, that behaviour is a bit erratic. Worse, there doesn't appear to be any logging of the ec2 plugin's decision making.
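
For contrast, the provisioning rule one would *expect* here is simple enough to state in a few lines of Python (this is a back-of-envelope model of the desired behaviour, not the actual ec2 plugin logic; the numbers match the situation described above):

```python
# Expected rule: launch one fresh instance per queued build, but never
# exceed the configured instance cap.

def instances_to_launch(queued_builds, running_instances, cap):
    headroom = max(0, cap - running_instances)
    return min(queued_builds, headroom)


# 5 builds queued, 4 of 10 instances running: launch 5 more.
print(instances_to_launch(queued_builds=5, running_instances=4, cap=10))

# 5 builds queued, 8 of 10 running: only 2 slots left under the cap.
print(instances_to_launch(queued_builds=5, running_instances=8, cap=10))
```

Both observed extremes, launching two instances for one build and launching none while builds starve below the cap, violate this rule.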

My plan would be to finish my first iteration on our own codebase (logging, etc.) and then continue Michael's work and attack Jenkins and its plugins on a wide front, even at the source level. We should of course discuss the approach in more detail on tomorrow's call.

Revision history for this message
Michael Hudson-Doyle (mwhudson) wrote : Re: [Bug 744648] Re: linaro-cloud-buildd: current setup does not scale slave nodes

On Wed, 20 Apr 2011 18:40:51 -0000, Paul Sokolovsky <email address hidden> wrote:
> [quote of Paul's previous comment snipped]

I would say we should give /some/ consideration to not using Jenkins.
Jenkins provides various things that it would be annoying to replace --
instance management, live log updates, ... -- but in some ways we seem
to be fighting it too. I suspect we'll end up keeping it, but we should
at least _think_ about this :)

Cheers,
mwh

Revision history for this message
Paul Sokolovsky (pfalcon) wrote :

We seem to be doing well now, so I'm lowering the priority.

Changed in linaro-android-build-tools:
importance: High → Medium
Revision history for this message
Alan Bennett (akbennett) wrote :

Due to the age of this issue, we are acknowledging that it will likely not be fixed. If this issue is still important, please add details and reopen it.

Changed in linaro-android-build-tools:
status: New → Won't Fix