Build job failure counting does not differentiate between dispatching and scanning failures

Bug #680445 reported by Julian Edwards
Affects: Launchpad itself
Status: Fix Released
Importance: Low
Assigned to: William Grant
Milestone: none

Bug Description

The failure counting code was initially written with only dispatching in mind, so it immediately resets the build job when there's a failure. That is not appropriate once the job has already been dispatched; we should allow more leeway there, since the failure is probably transient.

One thing we could do is apply the same threshold value that builders use to the job once we know it has been dispatched. That is, the scan would need to fail a few times before we say "OK, you're dead, let's push to a new builder."
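
As a rough illustration of that threshold idea, here is a hypothetical, self-contained sketch; the JOB_FAILURE_THRESHOLD value and the toy class below are assumptions for illustration, not the actual Launchpad code:

# Hypothetical sketch: a dispatched job is only rescheduled onto another
# builder after its scan has failed a few times, mirroring the leeway
# already given to builders.

JOB_FAILURE_THRESHOLD = 3  # assumed value, analogous to the builder threshold


class DispatchedJob:
    """Toy stand-in for a build job that has already been dispatched."""

    def __init__(self):
        self.failure_count = 0

    def got_scan_failure(self):
        """Record one failed scan; most of these are transient."""
        self.failure_count += 1

    def should_reset(self):
        """Only push the job to a new builder once the threshold is hit."""
        return self.failure_count >= JOB_FAILURE_THRESHOLD


if __name__ == "__main__":
    job = DispatchedJob()
    for _ in range(JOB_FAILURE_THRESHOLD):
        job.got_scan_failure()
    print(job.should_reset())  # True only after the final failure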

Changed in soyuz:
status: New → Triaged
importance: Undecided → High
tags: added: buildd-manager
Changed in soyuz:
status: Triaged → In Progress
assignee: nobody → Julian Edwards (julian-edwards)
Revision history for this message
Julian Edwards (julian-edwards) wrote :

This is the sort of change I had in mind, but it still does not take into account some of the failure scenarios that require an immediate job or builder failure. Fortunately, this diff makes the relevant tests fail, so it's just a matter of amending the assessFailureCounts() method further to DTRT. I thought maybe we could pass in a force_reset_job parameter from its callsites; a rough sketch of that idea follows the diff below.

=== modified file 'lib/lp/buildmaster/manager.py'
--- lib/lp/buildmaster/manager.py 2010-11-26 15:16:53 +0000
+++ lib/lp/buildmaster/manager.py 2010-12-09 15:39:23 +0000
@@ -62,25 +62,27 @@
     if builder.failure_count == job_failure_count and current_job is not None:
         # If the failure count for the builder is the same as the
         # failure count for the job being built, then we cannot
-        # tell whether the job or the builder is at fault. The best
-        # we can do is try them both again, and hope that the job
-        # runs against a different builder.
-        current_job.reset()
+        # tell whether the job or the builder is at fault. Because we
+        # don't want to simply rip the job off the builder in case this
+        # is transient, we do nothing unless the number of failures
+        # exceeds the builder failure threshold.
+        if builder.failure_count >= Builder.FAILURE_THRESHOLD:
+            current_job.reset()
+            builder.failBuilder(fail_notes)
         return

     if builder.failure_count > job_failure_count:
         # The builder has failed more than the jobs it's been
         # running.

-        # Re-schedule the build if there is one.
-        if current_job is not None:
-            current_job.reset()
-
         # We are a little more tolerant with failing builders than
         # failing jobs because sometimes they get unresponsive due to
         # human error, flaky networks etc. We expect the builder to get
         # better, whereas jobs are very unlikely to get better.
         if builder.failure_count >= Builder.FAILURE_THRESHOLD:
+            # Re-schedule the build if there is one.
+            if current_job is not None:
+                current_job.reset()
             # It's also gone over the threshold so let's disable it.
             builder.failBuilder(fail_notes)
     else:
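
For reference, a minimal sketch of the force_reset_job idea might look like the following. This is hypothetical: the signature and the parameter are assumptions layered on top of the real assessFailureCounts() in lib/lp/buildmaster/manager.py, not the landed fix.

def assessFailureCounts(builder, current_job, fail_notes, force_reset_job=False):
    # Hypothetical sketch, not the landed code.
    if force_reset_job and current_job is not None:
        # The callsite knows this failure is fatal (e.g. the dispatch never
        # succeeded), so skip the transient-failure leeway and reschedule now.
        current_job.reset()
        return
    # Otherwise fall through to the threshold-based handling shown in the
    # diff above: only reset the job and fail the builder once
    # builder.failure_count reaches Builder.FAILURE_THRESHOLD.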

Revision history for this message
Julian Edwards (julian-edwards) wrote :

Grar, the formatting is blown in that last comment; here's a pastebin: http://pastebin.ubuntu.com/550653/

Changed in launchpad:
status: In Progress → Triaged
assignee: Julian Edwards (julian-edwards) → William Grant (wgrant)
William Grant (wgrant)
Changed in launchpad:
assignee: William Grant (wgrant) → nobody
Revision history for this message
Robert Collins (lifeless) wrote :

I'm dropping this to Low because we're surviving, and the separate bug to keep the buildd-slave executive in memory will help a lot with the slow-responses case.

summary: - Build job failure counting should differentiate between dispatching and
- scanning failures
+ Build job failure counting does not differentiate between dispatching
+ and scanning failures
Changed in launchpad:
importance: High → Low
William Grant (wgrant)
Changed in launchpad:
assignee: nobody → William Grant (wgrant)
status: Triaged → Fix Released