Comment 3 for bug 527617

Revision history for this message
Michael Nelson (michael.nelson) wrote :

Just taking a look at the above paste in which 3 builds ended up in this state all within 0.52seconds of each other:

* each build has been dispatched previously (but the buildlog doesn't yet help us see what happened to them, Julian's fixed this in devel so we'll have better logging info soon)
* There doesn't appear to be anything strange about the state of the builds themselves, and the samples span different processors, distroarchseries, pockets, archives etc.,
* Importantly, when the issue occurs, the builds *are* dispatched to a builder, it just doesn't end up on the build queue record. So with the above, matching the exact dispatch time with the logs you can see that:

At 22:20:14:
BuildID:1549354 is dispatched to aluminum, and aluminum is rescued at 22:20:58 'rescued from '1549354-3163554': 'No job assigned to builder'
BuildID:1470875 is dispatched to shipova, and at 22:20:16 shipova hits a resume failure: 'A process has ended with a probable error condition: process ended with exit code 255.'
BuildID:1544799 is dispatched to thorium, and at 22:20:51 the bm reports for thorium: 'communication failed (User timeout caused connection failure.)' and it is marked as done.

Yes, that's right, all three builds without a builder resulted in different log messages.

Now the current IBuilder implementation does the following during IBuilder.startBuild():
{{{
# Do it.
build_queue_item.markAsBuilding(self)
try:
    self.current_build_behavior.dispatchBuildToSlave(build_queue_item.id, logger)
}}}

and IBuildQueue.markAsBuilding() is defined as:

{{{
    def markAsBuilding(self, builder):
        """See `IBuildQueue`."""
        self.builder = builder
        if self.job.status != JobStatus.RUNNING:
            self.job.start()
        self.specific_job.jobStarted()
}}}

In all the samples we have, we can verify that job.start() is called (job.date_started reflects the correct time of the event), as is the specific job's jobStarted() method (as in this case, it sets IBuild.buildstate to BUILDING, which is used in the query linked by mthaddon above).

The logical first thought would be that builder is being passed in as None, but this method is only called as above from the builder itself, passing in self.

And then the key appears: all three jobs were aborted for various reasons, and what does updateBuild_ABORTED() do? It set's the queue_item.builder to None and calls PackageBuildFarmJob.jobAborted() which sets the buildstate to... BUILDING.

Looks like the effect of an untested refactoring :/ Writing the test and fix now.