buildd-manager doesn't give us a good way of determining it's in a failed state

Bug #451351 reported by Tom Haddon on 2009-10-14
This bug affects 1 person
Affects Status Importance Assigned to Milestone
Launchpad itself
High
Unassigned

Bug Description

After a DB restart this morning we were seeing the following pattern in the logs:

2009-10-14 09:00:40+0100 [-] Starting scanning cycle.
2009-10-14 09:00:40+0100 [-] Slave Scan Process Initiated.
2009-10-14 09:00:40+0100 [-] Buildd Master has been initialised
2009-10-14 09:00:40+0100 [-] Setting Builders.
2009-10-14 09:00:40+0100 [-] Slave Scan Process Initiated.
2009-10-14 09:00:40+0100 [-] Buildd Master has been initialised
2009-10-14 09:00:40+0100 [-] Setting Builders.
2009-10-14 09:00:40+0100 [-] Slave Scan Process Initiated.
2009-10-14 09:00:40+0100 [-] Buildd Master has been initialised
2009-10-14 09:00:40+0100 [-] Setting Builders.
2009-10-14 09:00:40+0100 [-] Scanning failed with: Already disconnected
2009-10-14 09:00:40+0100 [-] Finishing scanning cycle.
2009-10-14 09:00:40+0100 [-] Scanning cycle finished.

However, the process was responding to nagios checks fine. As a result, we were only able to tell something was wrong based on user feedback.

Tom Haddon (mthaddon) on 2009-10-14
Changed in soyuz:
importance: Undecided → High
Changed in soyuz:
status: New → Triaged
tags: added: soyuz-build
tags: added: tech-debt
tags: added: buildd-manager
Julian Edwards (julian-edwards) wrote :

The internal log reporting for scan failures is currently very opaque. It would help to add a stack trace to the error shown, which is an easy change, something like this:

=== modified file 'lib/lp/buildmaster/manager.py'
--- lib/lp/buildmaster/manager.py 2009-07-26 14:19:49 +0000
+++ lib/lp/buildmaster/manager.py 2009-12-14 20:46:44 +0000
@@ -238,6 +238,7 @@
         """Deal with scanning failures."""
         self.logger.info(
             'Scanning failed with: %s' % error.getErrorMessage())
+        error.printTraceback()
         self.finishCycle()

Tom Haddon (mthaddon) on 2010-05-28
tags: added: canonical-losa-lp
Julian Edwards (julian-edwards) wrote :

The ideal check is one that looks at the queue for a particular architecture: if it has outstanding builds but the builders for that arch have been idle for > N seconds, we raise a Nagios error.

We should be able to make the relevant data available on the API.
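A rough sketch of what such a check could look like, assuming the queue length and builder idle times per architecture have already been fetched from somewhere. The names ArchSnapshot, check_queues and max_idle_seconds are made up for illustration and are not part of any existing Launchpad API; only the exit codes (0 for OK, 2 for CRITICAL) are the standard Nagios plugin convention.

```python
from dataclasses import dataclass
from typing import List, Tuple

# Standard Nagios plugin exit codes.
OK = 0
CRITICAL = 2


@dataclass
class ArchSnapshot:
    """Hypothetical per-architecture data; not an existing Launchpad type."""
    arch: str
    pending_builds: int
    builder_idle_seconds: List[float]  # one entry per builder for this arch


def is_stalled(snapshot: ArchSnapshot, max_idle_seconds: float) -> bool:
    """True if builds are queued but every builder has idled for too long.

    Assumption: an arch with no builders at all is not flagged here.
    """
    if snapshot.pending_builds <= 0 or not snapshot.builder_idle_seconds:
        return False
    return all(idle > max_idle_seconds
               for idle in snapshot.builder_idle_seconds)


def check_queues(snapshots: List[ArchSnapshot],
                 max_idle_seconds: float) -> Tuple[int, str]:
    """Return a (nagios_exit_code, message) pair for the given snapshots."""
    stalled = [s.arch for s in snapshots if is_stalled(s, max_idle_seconds)]
    if stalled:
        return CRITICAL, "CRITICAL: stalled architectures: " + ", ".join(stalled)
    return OK, "OK: no stalled build queues"


if __name__ == "__main__":
    import sys

    # Made-up sample data; a real check would read this from the API.
    sample = [
        ArchSnapshot("i386", pending_builds=5, builder_idle_seconds=[400.0, 900.0]),
        ArchSnapshot("amd64", pending_builds=0, builder_idle_seconds=[1000.0]),
    ]
    status, message = check_queues(sample, max_idle_seconds=300)
    print(message)
    sys.exit(status)
```

Requiring every builder to be idle, rather than any one of them, keeps the check from firing when a single builder is merely between jobs; whether that is the right threshold is a judgment call.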

Robert Collins (lifeless) wrote :

Also on the error side - OOPS FTW. :)
