launchpad lost track of a build

Bug #676262 reported by LaMont Jones
This bug affects 1 person
Affects            Status      Importance   Assigned to   Milestone
Launchpad itself   Won't Fix   Critical     Unassigned
launchpad-buildd   Won't Fix   Critical     Unassigned

Bug Description

Today at about 20:30 London time, I was asked about "ross not dispatching builds". When I investigated, I found that it was, in fact, building gcc-snapshot. The last entry in the build-manager log for ross was at about 07:30 this morning, which would be consistent with a reasonable start time for that build.

Once I killed the build on ross, it happily received and started building qt4-x11.

Clearly, some better understanding of the roots of this amnesia is in order.

Julian Edwards (julian-edwards) wrote :

Lamont, I've no idea what bug you're trying to report here! Can you please state:

1. Behaviour you thought was bad
2. Expected behaviour

Thanks.

Changed in soyuz:
status: New → Incomplete
James Troup (elmo) wrote :

1. The build manager failed to accurately keep track of what its client buildds were building
2. Once a build is started (by the build manager) on a client, the build manager doesn't forget about it or otherwise lose track of it until a) the build finishes, or b) the build is explicitly aborted (by the manager or the client), or c) the client goes away
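The expected behaviour in point 2 can be read as a small invariant: a tracked build is forgotten only on one of three terminal events. Below is a minimal Python sketch of that reading. It is not buildd-manager's code; the class, event names, and the decision to treat a ping timeout as non-terminal are all assumptions made for illustration.

```python
from enum import Enum


class Event(Enum):
    """Things the manager can observe about a builder (hypothetical names)."""
    PING_TIMEOUT = "ping_timeout"
    BUILD_FINISHED = "build_finished"
    BUILD_ABORTED = "build_aborted"
    CLIENT_GONE = "client_gone"


# Only these events are allowed to end tracking of a build (cases a, b, c).
TERMINAL_EVENTS = {Event.BUILD_FINISHED, Event.BUILD_ABORTED, Event.CLIENT_GONE}


class BuildTracker:
    """Maps a builder name to the build it is believed to be running."""

    def __init__(self):
        self._active = {}

    def start(self, builder, build):
        """Record that `build` was dispatched to `builder`."""
        if builder in self._active:
            raise RuntimeError(
                f"{builder} is already building {self._active[builder]}"
            )
        self._active[builder] = build

    def observe(self, builder, event):
        """Return the build that stopped being tracked, or None.

        Assumption: a ping timeout alone is not evidence that the
        client has gone away, so it never ends tracking.
        """
        if event in TERMINAL_EVENTS:
            return self._active.pop(builder, None)
        return None

    def current(self, builder):
        """Return the build tracked for `builder`, or None."""
        return self._active.get(builder)
```

With this shape, a timed-out ping on ross would leave gcc-snapshot recorded against it until the build finishes, is aborted, or the builder is declared gone.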

tags: added: buildd-manager
Changed in soyuz:
status: Incomplete → Triaged
importance: Undecided → High
LaMont Jones (lamont) wrote :

This seems to be the closest bug I can find to dump the current state of things in...

First, the reality:

1. There are some really big builds that cause the 512MB arm builders to go into swapstorm convulsions.
2. During this process, launchpad-buildd gets swapped out of RAM, and may take an extremely long time to get back into RAM
3. This results in timeouts of the xmlrpc ping that buildd-manager does of the buildd.

The thinking was that socket_timeout was the only variable here, but setting that to 2 days has done nothing to help, and I don't believe the builds are actually getting the full 2 days before they time out.
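For readers unfamiliar with the mechanism, here is a rough Python sketch of how a socket timeout bounds a single XML-RPC ping. It uses only the standard library and is not how buildd-manager itself is written. The helper names and the remote method name `echo` are assumptions for this example.

```python
import socket
import xmlrpc.client


class TimeoutTransport(xmlrpc.client.Transport):
    """XML-RPC transport whose connections use a fixed socket timeout."""

    def __init__(self, timeout_seconds):
        super().__init__()
        self.timeout_seconds = timeout_seconds

    def make_connection(self, host):
        # The parent returns an http.client.HTTPConnection that has not
        # connected yet, so setting its timeout here takes effect.
        connection = super().make_connection(host)
        connection.timeout = self.timeout_seconds
        return connection


def ping_builder(url, timeout_seconds):
    """Return True if the builder answers within the timeout, else False.

    Assumption: the builder exposes an `echo` method over XML-RPC.
    """
    proxy = xmlrpc.client.ServerProxy(
        url, transport=TimeoutTransport(timeout_seconds)
    )
    try:
        proxy.echo("ping")
        return True
    except (socket.timeout, ConnectionError):
        # A slow (swapped-out) builder and a dead one look the same here.
        return False
```

Note that a `False` result cannot distinguish a builder that is swapping heavily from one that has really gone away, which is the ambiguity described above.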

Here's the sequence I believe is happening:
1. User uploads a large build (gcc-4.5, openjdk-6b18, etc)
2. arm builders start timing out _WHILE OTHERWISE_WORKING_CORRECTLY_ because of it, generally in either the test suite, or during dpkg-deb.
3. buildd-manager fails the build and attempts to abort it (abort currently seems to be implemented as "wait for the build to finish")
4. buildd-manager queues the build on the next arm builder
5. The build actually finishes on the first builder (it was never actually dead, just really really slow). But now it can't upload, because buildd-manager has moved on without it.
6. This repeats
7. The user notices that his build has been given up on and retries it (manual intervention; I'm assuming that the user in question has retry ability), since there really isn't anything reasonable to do in the source to address the fact that the builder is just plain swamped.
8. Eventually a GSA or I manually intervene and recover the builder.
9. Launchpad automatically loops us back to the top of the cycle.

Automated processes that needlessly create work-queues for humans (as is the case here) are flawed by design and need to be fixed.

Changed in launchpad:
importance: High → Medium
importance: Medium → Critical
Robert Collins (lifeless) wrote :

Seems like either dropping swap to a reasonable level, or pinning
buildd-slave into ram would solve this.

Julian Edwards (julian-edwards) wrote :

I favour the pinning initially so we can eliminate one theory at least.

Changed in launchpad-buildd:
status: New → Triaged
importance: Undecided → Critical
LaMont Jones (lamont) wrote :

> Seems like either dropping swap to a reasonable level, or pinning
> buildd-slave into ram would solve this.

I'm guessing you mean "dropping swap usage", and yes, that would be really
nice. OTOH, I don't see distro doing that, nor can we make it a requirement.
At best, it's a workaround (that's not realizable in the general case - some
of the swaphogs are simply linking oo.o/other-phat binaries, etc.)

> I favour the pinning initially so we can eliminate one theory at least.

See the capabilities(7) man page (CAP_IPC_LOCK) and the mlockall(2) man page.
Obviously, we'd need to grant the buildd user CAP_IPC_LOCK at the same time
as we deploy the code change.

We probably don't want to do that on all architectures, and should have a way
to disable it trivially.
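A minimal sketch of what such pinning with a trivial off switch might look like in Python, calling mlockall(2) through ctypes. This is not launchpad-buildd code. The environment variable `BUILDD_MLOCKALL` and both function names are hypothetical, and the flag values are the common Linux ones, which differ on a few architectures.

```python
import ctypes
import ctypes.util
import os

# Values from <sys/mman.h> on most Linux architectures (including x86
# and arm). Assumption: some architectures use different values, so
# check the local headers before relying on these.
MCL_CURRENT = 1
MCL_FUTURE = 2


def pinning_enabled(environ=None):
    """Return True unless the hypothetical BUILDD_MLOCKALL switch is '0'."""
    environ = os.environ if environ is None else environ
    return environ.get("BUILDD_MLOCKALL", "1") != "0"


def pin_process_in_ram(environ=None):
    """Try to lock all current and future pages of this process into RAM.

    Returns True on success. Returns False if pinning is disabled or if
    mlockall(2) fails, for example because CAP_IPC_LOCK is missing and
    RLIMIT_MEMLOCK is too low.
    """
    if not pinning_enabled(environ):
        return False
    libc = ctypes.CDLL(ctypes.util.find_library("c"), use_errno=True)
    if libc.mlockall(MCL_CURRENT | MCL_FUTURE) != 0:
        err = ctypes.get_errno()
        print(f"mlockall failed: {os.strerror(err)}; continuing unpinned")
        return False
    return True
```

Falling back to running unpinned when the call fails keeps the daemon usable on builders where the capability has not been granted, and setting the switch to `0` would disable the behaviour per machine or per architecture.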

William Grant (wgrant)
tags: added: soyuz-build
William Grant (wgrant) wrote :

ARM builder hardware is now less completely terrible, so communication failures are usually legitimate.

Changed in launchpad-buildd:
status: Triaged → Won't Fix
Changed in launchpad:
status: Triaged → Won't Fix