Dispatched build queue item has no builder

Bug #527617 reported by Michael Nelson
6
This bug affects 1 person
Affects Status Importance Assigned to Milestone
Launchpad itself
Fix Released
Medium
Michael Nelson

Bug Description

Occasionally we get into a situation where a build has been dispatched (the linked job has a date_started and a running status) but the IBuildQueue.builder attribute is None.

We haven't yet identified why this situation is occurring. IBuildQueue.markAsBuilding() sets the builder, IBuildQueue.reset() and updateBuild_ABORTED() both set it back to None and reset the date_started at the same time. IBuildQueue.setDateStarted() seems to be only ever used to reset back to None. There are a few instances in lp.buildmaster.buildergroup.py which set builder to None, but it is always done at the same time as setting the date_started to None.

This issue is related to bug 499421 (there is a lot of collected info there), but we've since landed a fix to ensure that the buildfarm doesn't fall over in this situation.

Tags: lp-soyuz qa-ok

Related branches

Revision history for this message
Tom Haddon (mthaddon) wrote :
Revision history for this message
Michael Nelson (michael.nelson) wrote :

And here are the details of the 3 builds, after the above query reset their build states: https://pastebin.canonical.com/28835/

Revision history for this message
Michael Nelson (michael.nelson) wrote :

Just taking a look at the above paste in which 3 builds ended up in this state all within 0.52seconds of each other:

* each build has been dispatched previously (but the buildlog doesn't yet help us see what happened to them, Julian's fixed this in devel so we'll have better logging info soon)
* There doesn't appear to be anything strange about the state of the builds themselves, and the samples span different processors, distroarchseries, pockets, archives etc.,
* Importantly, when the issue occurs, the builds *are* dispatched to a builder, it just doesn't end up on the build queue record. So with the above, matching the exact dispatch time with the logs you can see that:

At 22:20:14:
BuildID:1549354 is dispatched to aluminum, and aluminum is rescued at 22:20:58 'rescued from '1549354-3163554': 'No job assigned to builder'
BuildID:1470875 is dispatched to shipova, and at 22:20:16 shipova hits a resume failure: 'A process has ended with a probable error condition: process ended with exit code 255.'
BuildID:1544799 is dispatched to thorium, and at 22:20:51 the bm reports for thorium: 'communication failed (User timeout caused connection failure.)' and it is marked as done.

Yes, that's right, all three builds without a builder resulted in different log messages.

Now the current IBuilder implementation does the following during IBuilder.startBuild():
{{{
# Do it.
build_queue_item.markAsBuilding(self)
try:
    self.current_build_behavior.dispatchBuildToSlave(build_queue_item.id, logger)
}}}

and IBuildQueue.markAsBuilding() is defined as:

{{{
    def markAsBuilding(self, builder):
        """See `IBuildQueue`."""
        self.builder = builder
        if self.job.status != JobStatus.RUNNING:
            self.job.start()
        self.specific_job.jobStarted()
}}}

In all the samples we have, we can verify that job.start() is called (job.date_started reflects the correct time of the event), as is the specific job's jobStarted() method (as in this case, it sets IBuild.buildstate to BUILDING, which is used in the query linked by mthaddon above).

The logical first thought would be that builder is being passed in as None, but this method is only called as above from the builder itself, passing in self.

And then the key appears: all three jobs were aborted for various reasons, and what does updateBuild_ABORTED() do? It set's the queue_item.builder to None and calls PackageBuildFarmJob.jobAborted() which sets the buildstate to... BUILDING.

Looks like the effect of an untested refactoring :/ Writing the test and fix now.

Changed in soyuz:
assignee: nobody → Michael Nelson (michael.nelson)
milestone: none → 10.03
status: Triaged → In Progress
Revision history for this message
Michael Nelson (michael.nelson) wrote :

Here's the relevant snippet of the log, included so others (such as William) can take a look too.
http://pastebin.ubuntu.com/391713/

(all public PPA builds)

Revision history for this message
Ursula Junque (ursinha) wrote : Bug fixed by a commit
Changed in soyuz:
status: In Progress → Fix Committed
tags: added: qa-needstesting
Revision history for this message
Michael Nelson (michael.nelson) wrote :

I QA'd this on dogfood by:

1. Adding queueItem.builder.requestAbort() to updateBuild_BUILDING(), this ensures that as soon as a job is successfully dispatched, the BM will request an abort.
2. Copied a source and verified that it was dispatched, then aborted, after which case the build oopses as expected (as it is BUILDING but has no builder).
3. Patched so that updateBuild_ABORTED() sets the buildstate to NEEDSBUILD
4. Copied a source and verified that it was dispatched, then aborted, and reset to NEEDSBUILD as expected (no oops when viewing it).

tags: added: qa-done
removed: qa-needstesting
tags: added: qa-ok
removed: qa-done
Revision history for this message
LaMont Jones (lamont) wrote :

Grepping for artigas in the build logs:

2010-03-12 19:21:55+0000 [-] Disabling builder: http://artigas.buildd:8221/ -- (113, 'No route to host')
2010-03-12 19:29:53+0000 [-] Builder 'artigas' rescued from '1557394-3179236': 'No job assigned to builder'

Changed in soyuz:
status: Fix Committed → Fix Released
To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.