Build job failure counting does not differentiate between dispatching and scanning failures
Bug #680445 reported by Julian Edwards
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
Launchpad itself | Fix Released | Low | William Grant |
Bug Description
The failure counting code was initially written with only dispatching in mind, so the code immediately resets the build job when there's a failure. This is not useful if the job is already dispatched - we should allow more leeway since it's probably a transient error.
One thing we could do is to apply the same threshold value as for builders to the job once we know it's been dispatched. That is, the scan needs to fail a few times before we say "ok you're dead, let's push to a new builder."
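The idea above can be sketched as follows. This is a hypothetical illustration, not Launchpad's actual code: the function name, the `FAILURE_THRESHOLD` value, and the string return values are all made up for the example.

```python
# Illustrative only: names and threshold value are assumptions,
# not the real buildd-manager API.
FAILURE_THRESHOLD = 5


def assess_failure(job_failures: int, dispatched: bool) -> str:
    """Decide what to do with a build job after a failure.

    Dispatch failures are cheap to retry, so reset immediately.
    Scan failures on an already-dispatched job are often transient,
    so apply the same leeway (threshold) that builders get.
    """
    if not dispatched:
        # Failure happened while dispatching: reset right away.
        return "reset-job"
    if job_failures >= FAILURE_THRESHOLD:
        # The scan has failed enough times; give up on this builder.
        return "move-to-new-builder"
    # Probably transient: leave the job on the builder and retry.
    return "wait"
```

Under this sketch a single scan hiccup leaves the job in place, and only repeated failures push it to a new builder.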
Changed in soyuz:
status: New → Triaged
importance: Undecided → High
tags: added: buildd-manager

Changed in soyuz:
status: Triaged → In Progress
assignee: nobody → Julian Edwards (julian-edwards)

Changed in launchpad:
assignee: William Grant (wgrant) → nobody

Changed in launchpad:
assignee: nobody → William Grant (wgrant)
status: Triaged → Fix Released
This is the sort of change I had in mind, but it still does not take into account some of the failure scenarios that require an immediate job or builder failure. Fortunately this diff makes the relevant tests fail, so it's just a matter of further amending the assessFailureCounts() method to DTRT. I thought maybe we could pass in a force_reset_job parameter from its call sites.
```diff
=== modified file 'lib/lp/buildmaster/manager.py'
--- lib/lp/buildmaster/manager.py	2010-11-26 15:16:53 +0000
+++ lib/lp/buildmaster/manager.py	2010-12-09 15:39:23 +0000
@@ -62,25 +62,27 @@
         if builder.failure_count == job_failure_count and current_job is not None:
             # If the failure count for the builder is the same as the
             # failure count for the job being built, then we cannot
-            # tell whether the job or the builder is at fault. The best
-            # we can do is try them both again, and hope that the job
-            # runs against a different builder.
-            current_job.reset()
+            # tell whether the job or the builder is at fault. Because we
+            # don't want to simply rip the job off the builder in case this
+            # is transient, we do nothing unless the number of failures
+            # exceeds the builder failure threshold.
+            if builder.failure_count >= Builder.FAILURE_THRESHOLD:
+                current_job.reset()
+                builder.failBuilder(fail_notes)
             return
 
         if builder.failure_count > job_failure_count:
             # The builder has failed more than the jobs it's been
             # running.
-            # Re-schedule the build if there is one.
-            if current_job is not None:
-                current_job.reset()
-
             # We are a little more tolerant with failing builders than
             # failing jobs because sometimes they get unresponsive due to
             # human error, flaky networks etc. We expect the builder to get
             # better, whereas jobs are very unlikely to get better.
             if builder.failure_count >= Builder.FAILURE_THRESHOLD:
+                # Re-schedule the build if there is one.
+                if current_job is not None:
+                    current_job.reset()
                 # It's also gone over the threshold so let's disable it.
                 builder.failBuilder(fail_notes)
         else:
```
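The force_reset_job escape hatch suggested in the comment could look roughly like this. This is a sketch under assumptions: the function shape, threshold value, and return values are illustrative, not the code that actually landed.

```python
# Illustrative sketch of the suggested force_reset_job parameter;
# names and values are assumptions, not Launchpad's real API.
FAILURE_THRESHOLD = 5


def assess_failure_counts(builder_failures: int, job_failures: int,
                          force_reset_job: bool = False) -> str:
    """Decide the outcome of a failure, with an immediate-reset override."""
    if force_reset_job:
        # Some failure scenarios must reset the job right away,
        # regardless of how many failures have accumulated.
        return "reset-job"
    if builder_failures >= job_failures:
        # Either we can't tell job and builder apart, or the builder
        # looks worse; in both cases only act past the threshold.
        if builder_failures >= FAILURE_THRESHOLD:
            return "reset-job-and-fail-builder"
        return "wait"
    # The job has failed more than the builder: the job is the culprit.
    return "fail-job"
```

Callers that hit a known-fatal error would pass force_reset_job=True; ordinary scan failures would leave it False and rely on the threshold.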