buildd-manager doesn't give us a good way of determining it's in a failed state

Bug #451351 reported by Tom Haddon
This bug affects 1 person
Affects: Launchpad itself
Status: Triaged
Importance: High
Assigned to: Unassigned
Milestone: (none)

Bug Description

After a DB restart this morning we were seeing the following pattern in the logs:

2009-10-14 09:00:40+0100 [-] Starting scanning cycle.
2009-10-14 09:00:40+0100 [-] Slave Scan Process Initiated.
2009-10-14 09:00:40+0100 [-] Buildd Master has been initialised
2009-10-14 09:00:40+0100 [-] Setting Builders.
2009-10-14 09:00:40+0100 [-] Slave Scan Process Initiated.
2009-10-14 09:00:40+0100 [-] Buildd Master has been initialised
2009-10-14 09:00:40+0100 [-] Setting Builders.
2009-10-14 09:00:40+0100 [-] Slave Scan Process Initiated.
2009-10-14 09:00:40+0100 [-] Buildd Master has been initialised
2009-10-14 09:00:40+0100 [-] Setting Builders.
2009-10-14 09:00:40+0100 [-] Scanning failed with: Already disconnected
2009-10-14 09:00:40+0100 [-] Finishing scanning cycle.
2009-10-14 09:00:40+0100 [-] Scanning cycle finished.

However, the process was responding to nagios checks fine. As a result, we were only able to tell something was wrong based on user feedback.
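As an illustration of what a check based on the log alone might look like, here is a minimal sketch. It assumes each cycle begins with a "Starting scanning cycle." line and that a failed cycle logs "Scanning failed with:", as in the excerpt above; the function name `last_cycle_failed` is hypothetical and not part of buildd-manager.

```python
# Hypothetical log-based check; the marker strings are copied from the
# log excerpt in this report, everything else is an assumption.
CYCLE_START = "Starting scanning cycle."
FAILURE_MARKER = "Scanning failed with:"


def last_cycle_failed(log_lines):
    """Return True if the most recent scanning cycle logged a failure."""
    failed = False
    for line in log_lines:
        if CYCLE_START in line:
            # A new cycle began, so forget the previous cycle's result.
            failed = False
        elif FAILURE_MARKER in line:
            failed = True
    return failed


if __name__ == "__main__":
    sample = [
        "2009-10-14 09:00:40+0100 [-] Starting scanning cycle.",
        "2009-10-14 09:00:40+0100 [-] Setting Builders.",
        "2009-10-14 09:00:40+0100 [-] Scanning failed with: Already disconnected",
        "2009-10-14 09:00:40+0100 [-] Scanning cycle finished.",
    ]
    print(last_cycle_failed(sample))
```

A check like this only says that the latest cycle failed; it would not notice a manager that has stopped scanning altogether.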

Tom Haddon (mthaddon)
Changed in soyuz:
importance: Undecided → High
Changed in soyuz:
status: New → Triaged
tags: added: soyuz-build
tags: added: tech-debt
tags: added: buildd-manager
Julian Edwards (julian-edwards) wrote :

The internal log reporting for scan failures is currently very opaque. It could do with a stack trace added to the error shown. This is very easy to do with something like this:

=== modified file 'lib/lp/buildmaster/manager.py'
--- lib/lp/buildmaster/manager.py 2009-07-26 14:19:49 +0000
+++ lib/lp/buildmaster/manager.py 2009-12-14 20:46:44 +0000
@@ -238,6 +238,7 @@
         """Deal with scanning failures."""
         self.logger.info(
             'Scanning failed with: %s' % error.getErrorMessage())
+        error.printTraceback()
         self.finishCycle()

Tom Haddon (mthaddon)
tags: added: canonical-losa-lp
Julian Edwards (julian-edwards) wrote :

The ideal check is one that looks at the queue for a particular architecture: if it has outstanding builds but the builders for that arch have been idle for > N seconds, we raise a Nagios error.

We should be able to make the relevant data available on the API.
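The check described above could be sketched roughly as follows. This is only an illustration: `ArchStatus`, `check_arch`, and the shape of the data are hypothetical, since the API that would expose queue length and builder idle times does not exist yet. The exit codes are the standard Nagios plugin values, and treating "no builders at all" as UNKNOWN is an assumption.

```python
from dataclasses import dataclass, field

# Standard Nagios plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


@dataclass
class ArchStatus:
    """Hypothetical snapshot of one architecture's build queue."""
    arch: str
    pending_builds: int
    # Seconds each builder for this arch has been idle.
    builder_idle_seconds: list = field(default_factory=list)


def check_arch(status, max_idle_seconds):
    """Return (exit_code, message) for one architecture."""
    if status.pending_builds == 0:
        return OK, "%s: no outstanding builds" % status.arch
    if not status.builder_idle_seconds:
        # Assumption: no builders registered is reported as UNKNOWN.
        return UNKNOWN, "%s: no builders found" % status.arch
    # Every builder has been idle longer than the threshold exactly
    # when the smallest idle time exceeds it.
    if min(status.builder_idle_seconds) > max_idle_seconds:
        return CRITICAL, "%s: %d builds pending, all builders idle > %ds" % (
            status.arch, status.pending_builds, max_idle_seconds)
    return OK, "%s: builds pending, builders active" % status.arch


if __name__ == "__main__":
    code, message = check_arch(ArchStatus("i386", 5, [400.0, 900.0]), 300)
    print(message)
    raise SystemExit(code)
```

The threshold N would need tuning per architecture so that a short pause between dispatches does not raise an alert.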

Robert Collins (lifeless) wrote :

Also on the error side - OOPS FTW. :)
