PPA buildds can be reclaimed mid-build, master needs to recover more gracefully when they do

Bug #343683 reported by Nick Moffitt
Affects           Status        Importance  Assigned to      Milestone
Launchpad itself  Fix Released  High        Celso Providelo

Bug Description

Most of the PPA buildds are dual-duty systems that may be reclaimed for their original purpose at any time. When this happens, the buildd is deactivated and then immediately whisked off the network, and that can happen mid-build.

This happened recently with the "radon" PPA buildd, and buildmaster did not recover from the situation gracefully. Celso suggests that it simply needs to rescue active jobs assigned to deactivated builders.

Celso Providelo (cprov) wrote :

We should also take the opportunity to revisit the case described in bug #32154, where builders fail mid-build (instead of being deactivated).

Changed in soyuz:
assignee: nobody → cprov
importance: Undecided → High
milestone: none → pending
status: New → Triaged
Adam Conrad (adconrad) wrote :

In a short discussion on IRC, we came to the conclusion that this (and a whole class of bugs relating to this) could be solved with the following two actions:

1) builders shouldn't be marked NOT OK immediately upon a failed attempt to contact them, but rather we should give a 5-minute window for the machine to come back (so a short network hiccup, for instance, doesn't offline 12 buildds and kill their builds), marking them NOT OK at the end of that 5-minute grace period.

2) builders that are marked NOT OK (either manually, or at the end of the above 5-minute window) should have their active jobs reclaimed, so they can be pushed to active buildds.
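
A minimal sketch of how the two actions above might fit together, in Python. This is not the buildmaster's actual code: the `Builder` class, its fields, `record_contact`, and the `pending_jobs` list are all hypothetical names made up for illustration, and the only value taken from the discussion is the 5-minute window.

```python
from dataclasses import dataclass
from typing import List, Optional

GRACE_PERIOD_SECONDS = 5 * 60  # the 5-minute window proposed above


@dataclass
class Builder:
    """Hypothetical stand-in for a builder record (not Launchpad's model)."""
    name: str
    builder_ok: bool = True
    current_job: Optional[str] = None
    first_failure_at: Optional[float] = None


def record_contact(builder: Builder, reachable: bool, now: float,
                   pending_jobs: List[str]) -> None:
    """Update one builder after a contact attempt made at time `now` (seconds)."""
    if reachable:
        # The builder answered, so any earlier failure is forgotten.
        builder.first_failure_at = None
        return
    if builder.first_failure_at is None:
        # First failed attempt: start the grace period and keep the build.
        builder.first_failure_at = now
        return
    if now - builder.first_failure_at >= GRACE_PERIOD_SECONDS:
        # Grace period expired: mark NOT OK and reclaim the active job
        # so it can be dispatched to another builder.
        builder.builder_ok = False
        if builder.current_job is not None:
            pending_jobs.append(builder.current_job)
            builder.current_job = None
```

With this shape, a short network hiccup only sets `first_failure_at`; the build is taken away only if the builder stays unreachable for the whole window.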

Adam Conrad (adconrad) wrote :

2a) the reclaiming of builds from NOT OK builders must include inactive buildds too, not just active ones, as we mark buildds inactive when we claim them for enablement use, and they can't hold on to their builds indefinitely.

(Maybe the master buildd-master process (i.e., not the per-builder "forks", which one would expect to disappear for inactive builders) can, at a regular interval, quickly compare the list of NOT OK and/or inactive builders against the builder IDs in the list of building builds, and clean up that intersection?)
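
The periodic comparison suggested above could look roughly like the following Python sketch. The data shapes are assumptions made for illustration: `builders` maps a builder ID to a hypothetical `(ok, active)` pair, and `building` maps a build ID to the builder ID it is assigned to. Neither reflects the real Soyuz schema.

```python
from typing import Dict, List, Tuple


def rescue_orphaned_builds(builders: Dict[int, Tuple[bool, bool]],
                           building: Dict[str, int]) -> List[str]:
    """Return builds held by NOT OK or inactive builders, unassigning them.

    `builders` maps builder ID -> (ok, active); `building` maps
    build ID -> builder ID. Both shapes are hypothetical.
    """
    # Builders that can no longer be trusted to finish a build.
    unusable = {
        builder_id
        for builder_id, (ok, active) in builders.items()
        if not ok or not active
    }
    # The intersection: builds currently assigned to an unusable builder.
    rescued = sorted(
        build_id
        for build_id, builder_id in building.items()
        if builder_id in unusable
    )
    for build_id in rescued:
        # Drop the assignment so the build can be re-dispatched.
        del building[build_id]
    return rescued
```

Run from the master process on a timer, a sweep like this would not depend on the per-builder processes still existing for inactive builders.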

Celso Providelo (cprov)
Changed in soyuz:
milestone: pending → 2.2.4
Celso Providelo (cprov)
Changed in soyuz:
status: Triaged → In Progress
Celso Providelo (cprov) wrote :

Jobs assigned to broken/unavailable builders will be rescued and re-dispatched as of r8285 (devel).

Celso Providelo (cprov) wrote :

The better handling of builder communication failures will be addressed by bug #369109.

Changed in soyuz:
status: In Progress → Fix Committed
Changed in soyuz:
status: Fix Committed → Fix Released