Builder stuck on reset build

Reported by Michael Nelson on 2009-12-21
8
This bug affects 1 person
Affects Status Importance Assigned to Milestone
Launchpad itself
Medium
Michael Nelson

Bug Description

After a previous issue with this build, it was reset manually with some SQL so that job.date_started = null.

Now it's been dispatched again, and is causing the builder to be stuck.

https://edge.launchpad.net/ubuntu/+archive/test-rebuild-20090909/+build/1226841

Michael Nelson (michael.nelson) wrote :
Michael Nelson (michael.nelson) wrote :

<bigjools> noodles: arg I just noticed an issue on the build farm, can you have a look no particular rush, things are still working. https://edge.launchpad.net/ubuntu/+archive/test-rebuild-20090909/+build/1226841
<noodles> Sure
<bigjools> that build was one I reset previously
<bigjools> the builder it says it's on is reporting as idle
<bigjools> but since we have a build referring to it nothing is being dispatched to that builder
<noodles> um, something's dispatched now?
<noodles> gar, gone again. It had something listed as dispatched (open clipart)
<noodles> OK, so it keeps trying to dispatch and fails... k, looking into it.
<bigjools> yeah that's the bug :(
<noodles> bigjools: but when you say that you reset it, do you mean via sql?
<bigjools> noodles: whatever works :)
<noodles> bigjools: no, that was a question - you said that *you* reset it previously - I'm wondering whether that was an SQL query - if so, it'd be great to have the paste, as it's a very different thing if the data is inconsistent as opposed to a bug?
<noodles> Although, we should still handle the inconsistent data gracefully of course.
<bigjools> oh I see, yes it was
<bigjools> noodles: the data was originally inconsistent, so I just nulled the job start, BQ.builder and made buildstate 0 (NEEDSBUILD)
<bigjools> (sorry was OTP)
<noodles> Thanks
<bigjools> it's happened again
<bigjools> so I suspect a bug
<bigjools> which could be lingering data corruption
<bigjools> or it could be code
<noodles> And it was both times for this exact build 1226841?
<bigjools> yep
<bigjools> https://pastebin.canonical.com/25907/
<bigjools> found my paste
<bigjools> seems like I superseded it

Michael Nelson (michael.nelson) wrote :

I'm just trying to piece together the original situation, after struggling to see how the build could have been re-dispatched after being set as superseded.

The original issue according to the internal irc log that: the slave scanner is stuck with "Scanning failed with: 'NoneType' object has no attribute 'slaveStatus'".

The cause was this corrupt build, which at the time thought it had already started building (job.date_started is not null) but it didn't have an associated builder (bq.builder is null) as shown by a count of 1 from the query https://pastebin.canonical.com/25913/

The fix taken on that date was not actually paste 25907 above, but https://pastebin.canonical.com/25914/, which simply set the job.date_started back to null for this build. So it's expected that it would be dispatched again.

OK, now to figure out why it's stalling on hassium...

description: updated
Michael Nelson (michael.nelson) wrote :

So the issue is:

1. We manually reset job.date_started to null due to the original issue outlined above, but we did *not* reset job.status.
2. When the job was started at 2009-12-17 21:33:34 (https://pastebin.canonical.com/26090/), the build.markAsBuilding() was called, setting the builder, but it did *not* call job.start(), which in turn sets job.date_started as BuildQueue's method is as follows:

{{{
    def markAsBuilding(self, builder):
        """See `IBuildQueue`."""
        self.builder = builder
        if self.job.status != JobStatus.RUNNING:
            self.job.start()
        self.specific_job.jobStarted()
}}}

3. Because job.date_started is null, this job is never returned as part of BuildQueueSet.getActiveBuildJobs(), which is used by the build master to scan and update the current jobs.

Which means that the solution should be to simply set the job.date_started to "2009-12-17 21:33:34".

Changed in soyuz:
status: Confirmed → In Progress
Michael Nelson (michael.nelson) wrote :

I've written a fix that simply sets the date_started as above, but can't find anyone to approve it. Just so it doesn't get lost:

https://pastebin.canonical.com/26093/ (or a variation thereof that sets a more recent date started rather than the true value so as to not exaggerate any subsequent estimates that will be based on this build duration.)

On Wednesday 23 December 2009 14:55:36 Michael Nelson wrote:
> 3. Because job.date_started is null, this job is never returned as part
> of BuildQueueSet.getActiveBuildJobs(), which is used by the build master
> to scan and update the current jobs.
>

Urgh!

It should probably look at BuildQueue instead and look for rows with a non-
null builder. This isolates the decision making to looking at a single table
rather than spreading this wacky set of exclusive fields everywhere.

Michael Nelson (michael.nelson) wrote :

Looks like we've got a few more candidates:

https://answers.edge.launchpad.net/soyuz/+question/96057

I've updated the SQL to automatically set the date_started to NOW - 1hr.

https://pastebin.canonical.com/26126/

Changed in soyuz:
milestone: none → 10.01
Michael Nelson (michael.nelson) wrote :

The above query to check for the problem builds was run on production identifying the following builds:

https://edge.launchpad.net/~n-muench/+archive/gnu-programs/+build/1424385
https://edge.launchpad.net/~n-muench/+archive/gnu-programs/+build/1424386
https://edge.launchpad.net/~n-muench/+archive/gnu-programs/+build/1424387
https://edge.launchpad.net/~pgquiles/+archive/ppa/+build/1401950
https://edge.launchpad.net/~pgquiles/+archive/ppa/+build/1401929

The last two items were expected, but the first three are more recent (built on 20100102), which indicates that we're actually creating the issue each time we run the query fixing bug 499421. Indeed, that is the case. To fix bug 499421 when it occurs (only once over the holidays on the 1st) we're manually setting the date_started to null, but not resetting the job status.

I'll (1) update the query to be used in bug 499421, and (2) redo the query to fix the above builds so that the date_started is set more accurately.

Michael Nelson (michael.nelson) wrote :

OK, the above builds have had their related job.date_started set accurately (which means they do have huge durations as they were stuck on the builders). And the SQL fix used by losa's to fix bug 499421 has been updated so that we don't re-create this situation again.

Changed in soyuz:
status: In Progress → Fix Released
Changed in soyuz:
status: Fix Released → In Progress

Some more candidates without start-dates have turned up on the builders:

https://edge.launchpad.net/~n-muench/+archive/gnu-programs/+build/1428814
https://edge.launchpad.net/~n-muench/+archive/gnu-programs/+build/1428815
https://edge.launchpad.net/~n-muench/+archive/gnu-programs/+build/1428816

I had thought this was because the query that was used to fix the previous builds:

https://pastebin.canonical.com/26156/

only fixed the data for builds that were currently stuck on builders (due to the bq.builder is not null clause), so I prepared the following to check for other examples of the same data corruption that are *not* currently building:

https://pastebin.canonical.com/26187/

(btw: is there any reason why these queries can't go on the public pastebin? I'm just doing them there as we seem to always do so).

But then the date created for the above builds indicates that the build records were only created just before the building started. I checked with mthaddon and there haven't been any more occurrences of bug 499421, so it seems to be unrelated to fixing 499421.

So there are two causes for builds in this state:

1. We manually reset certain builds so that the related job.date_started was null, but without updating job.status. This was the case with the original build in the description (1226841).

And, thanks to wgrant's help, we found that other occurrences of this build state all mention "E: There are problems and -y was used without --force-yes" in the buildlog, which is used to match the buildd.SBuildExitCode.GIVENBACK status, and following through on that leads to:

2. Builds that result in a buildd.SBuildExitCode.GIVENBACK also have their date_started set to null with the builds put back directly into NEEDSBUILD (ie. leaving the job.status as RUNNING). It looks as though BUILDERFAIL will cause the same issue. Other states destroy the BuildQueue entry, so that IBuild.retry() is called, which creates a new bq entry with a new job.

AFAICS, rhis has only been triggering issues recently due to the recent change in BuildQueue.markAsBuilding() which does not call job.start() if the job is already in the running state.

The fix is simple, we'll see about the test :)

William Grant (wgrant) wrote :

The bug is in buildStatus_GIVENBACK and buildStatus_BUILDERFAIL -- job.status is not reset to WAITING. Replacing the manual buildqueue erasure in both methods with queueItem.reset() resolves the issue.

r10114 devel

Changed in soyuz:
status: In Progress → Fix Committed

Here's the QA on dogfood, so it's ready to be CP'd:

Setup
=====
1. from codelines/current, bzr pull -r8840 (db-devel rev. before the fix landed)
2. Emptied the current queue of builds (after issues with ferraz and new concordia builders).
3. Applied http://pastebin.ubuntu.com/353343/ to force GIVENBACK processing.
4. Uploaded and processed:
    https://dogfood.launchpad.net/~michael.nelson/+archive/pocketsphinx/+build/1344510

Reproduce problem
=================
1. Started buildd-manager and confirmed that..
  a) initial build_a correctly set date started etc.:
      http://people.canonical.com/~michaeln/tmp/499095-build_a_1.png
  b) After going through GIVENBACK it was rescheduled but got stuck as expected (no start date, no build log etc.)
      http://people.canonical.com/~michaeln/tmp/499095-build_a_2.png

2. Cleanup:
  a) Manually set the build date_started and job.status
  b) Reverted the forec GIVEBACK change, restarted buildd-manager and let the buildmanager clear the job normally.

Demo fix
========
1) from codelines/current, bzr pull (which pulled 8842 - including the fix).
2) Re-applied http://pastebin.ubuntu.com/353343/ to force GIVENBACK processing.
3) Created, uploaded and processed:
     https://dogfood.launchpad.net/~michael.nelson/+archive/pocketsphinx/+build/1344517
4) Started buildd-manager and confirmed that initial build_b correctly set date_started etc.
    http://people.canonical.com/~michaeln/tmp/499095-build_b_1.png
5) After going through GIVENBACK it was rescheduled correctly (has a new date started, and the log is being updated as the build progresses:
    http://people.canonical.com/~michaeln/tmp/499095-build_b_2.png

    Note: we still need some UI changes here. We're displaying the current builder being used to build (build_queue.builder) as well as the historical builder that was used to do the previous build (build.builder), as well as the preserved log/start/finish times. Heh, there's a note without a bug in the code about this:

        # XXX cprov 2006-05-30: Currently this information is not
        # properly presented in the Web UI. We will discuss it in
        # the next Paris Summit, infinity has some ideas about how
        # to use this content. For now we just ensure it's stored.

so I just created bug 504660.

6) Reverted the force GIVEBACK patch and restarted the buildd_manager and the build completed as per normal without any manual intervention:
http://people.canonical.com/~michaeln/tmp/499095-build_b_3.png

So, I'll be requesting a CP of stable 10114. Until that is landed, we can run https://pastebin.canonical.com/26358/ to clear the stuck builders (and may need to re-run it once after the CP lands in case any more crept in at the last minute.)

Changed in soyuz:
status: Fix Committed → Fix Released
To post a comment you must log in.
This report contains Public information  Edit
Everyone can see this information.

Other bug subscribers

Related questions