Build job failure counting does not differentiate between dispatching and scanning failures

Bug #680445 reported by Julian Edwards
Affects: Launchpad itself
Status: Fix Released
Importance: Low
Assigned to: William Grant
Milestone: none

Bug Description

The failure counting code was initially written with only dispatching in mind, so it immediately resets the build job when there's a failure. That is not appropriate once the job has already been dispatched; we should allow more leeway there, since the failure is probably transient.

One thing we could do is apply the same threshold value that builders use to the job once we know it has been dispatched. That is, the scan would need to fail a few times before we say "OK, you're dead, let's push to a new builder."
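
As a rough illustration of that threshold idea, here is a hypothetical, self-contained sketch; the JOB_FAILURE_THRESHOLD value and the toy class below are assumptions for illustration, not the actual Launchpad code:

# Hypothetical sketch: a dispatched job is only rescheduled onto another
# builder after its scan has failed a few times, mirroring the leeway
# already given to builders.

JOB_FAILURE_THRESHOLD = 3  # assumed value, analogous to the builder threshold


class DispatchedJob:
    """Toy stand-in for a build job that has already been dispatched."""

    def __init__(self):
        self.failure_count = 0

    def got_scan_failure(self):
        """Record one failed scan; most of these are transient."""
        self.failure_count += 1

    def should_reset(self):
        """Only push the job to a new builder once the threshold is hit."""
        return self.failure_count >= JOB_FAILURE_THRESHOLD


if __name__ == "__main__":
    job = DispatchedJob()
    for _ in range(JOB_FAILURE_THRESHOLD):
        job.got_scan_failure()
    print(job.should_reset())  # True only after the final failure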

Changed in soyuz:
status: New → Triaged
importance: Undecided → High
tags: added: buildd-manager
Changed in soyuz:
status: Triaged → In Progress
assignee: nobody → Julian Edwards (julian-edwards)
Revision history for this message
Julian Edwards (julian-edwards) wrote :

This is the sort of change I had in mind, but it still does not take into account some of the failure scenarios that require an immediate job or builder failure. Fortunately, this diff makes the relevant tests fail, so it's just a matter of amending the assessFailureCounts() method further to DTRT. I thought maybe we could pass in a force_reset_job parameter from its callsites; a rough sketch of that idea follows the diff below.

=== modified file 'lib/lp/buildmaster/manager.py'
--- lib/lp/buildmaster/manager.py 2010-11-26 15:16:53 +0000
+++ lib/lp/buildmaster/manager.py 2010-12-09 15:39:23 +0000
@@ -62,25 +62,27 @@
     if builder.failure_count == job_failure_count and current_job is not None:
         # If the failure count for the builder is the same as the
         # failure count for the job being built, then we cannot
-        # tell whether the job or the builder is at fault. The best
-        # we can do is try them both again, and hope that the job
-        # runs against a different builder.
-        current_job.reset()
+        # tell whether the job or the builder is at fault. Because we
+        # don't want to simply rip the job off the builder in case this
+        # is transient, we do nothing unless the number of failures
+        # exceeds the builder failure threshold.
+        if builder.failure_count >= Builder.FAILURE_THRESHOLD:
+            current_job.reset()
+            builder.failBuilder(fail_notes)
         return

     if builder.failure_count > job_failure_count:
         # The builder has failed more than the jobs it's been
         # running.

-        # Re-schedule the build if there is one.
-        if current_job is not None:
-            current_job.reset()
-
         # We are a little more tolerant with failing builders than
         # failing jobs because sometimes they get unresponsive due to
         # human error, flaky networks etc. We expect the builder to get
         # better, whereas jobs are very unlikely to get better.
         if builder.failure_count >= Builder.FAILURE_THRESHOLD:
+            # Re-schedule the build if there is one.
+            if current_job is not None:
+                current_job.reset()
             # It's also gone over the threshold so let's disable it.
             builder.failBuilder(fail_notes)
     else:
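
For reference, a minimal sketch of the force_reset_job idea might look like the following. This is hypothetical: the signature and the parameter are assumptions layered on top of the real assessFailureCounts() in lib/lp/buildmaster/manager.py, not the landed fix.

def assessFailureCounts(builder, current_job, fail_notes, force_reset_job=False):
    # Hypothetical sketch, not the landed code.
    if force_reset_job and current_job is not None:
        # The callsite knows this failure is fatal (e.g. the dispatch never
        # succeeded), so skip the transient-failure leeway and reschedule now.
        current_job.reset()
        return
    # Otherwise fall through to the threshold-based handling shown in the
    # diff above: only reset the job and fail the builder once
    # builder.failure_count reaches Builder.FAILURE_THRESHOLD.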

Revision history for this message
Julian Edwards (julian-edwards) wrote :

Grar, the formatting is blown in that last comment; here's a pastebin: http://pastebin.ubuntu.com/550653/

Changed in launchpad:
status: In Progress → Triaged
assignee: Julian Edwards (julian-edwards) → William Grant (wgrant)
William Grant (wgrant)
Changed in launchpad:
assignee: William Grant (wgrant) → nobody
Revision history for this message
Robert Collins (lifeless) wrote :

I'm dropping this to Low because we're surviving, and the separate bug to keep the buildd-slave executive in memory will help a lot with the slow-responses case.

summary: - Build job failure counting should differentiate between dispatching and
- scanning failures
+ Build job failure counting does not differentiate between dispatching
+ and scanning failures
Changed in launchpad:
importance: High → Low
William Grant (wgrant)
Changed in launchpad:
assignee: nobody → William Grant (wgrant)
status: Triaged → Fix Released