Lost builder detection is insufficiently aggressive
Affects | Status | Importance | Assigned to | Milestone | |
---|---|---|---|---|---|
Launchpad itself | Fix Released | Low | William Grant | | |
Bug Description
rescueBuilderIfLost needs to verify that buildqueue.builder matches and buildqueue.
This was a big problem this morning, causing the build farm to collapse after lots of buildds dropped off the network for a few minutes. buildd-manager marked all the builders as not-OK, and they were manually enabled again once connectivity was restored. LP showed all builders as OK and idle, but was not dispatching to more than a couple of builders.
Inspection afterwards revealed that the slave on artigas (among others) had finished its build and was sitting WAITING, having completed build-buildqueue 1312047-2740484. Since that buildqueue had been unassigned as soon as buildd-manager detected that the builder was not-OK, the builder should have been declared lost and had a rescue attempted.
Unfortunately, rescueBuilderIfLost only verifies the existence and correct linkage of the build and buildqueue. It doesn't confirm that buildqueue.builder is the current builder, or that buildqueue.
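The missing builder check can be illustrated with a minimal sketch. This is not the Launchpad implementation: `Builder`, `BuildQueue`, `parse_slave_job_id` and `is_builder_lost` are hypothetical names invented for illustration, and the only details taken from the report are the `build-buildqueue` job ID format and the requirement that `buildqueue.builder` match the builder being scanned.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Tuple


@dataclass
class Builder:
    """Hypothetical stand-in for a builder record."""
    name: str


@dataclass
class BuildQueue:
    """Hypothetical stand-in for a buildqueue row."""
    id: int
    build_id: int
    builder: Optional[Builder]  # None once the job has been unassigned


def parse_slave_job_id(job_id: str) -> Tuple[int, int]:
    """Split a 'build-buildqueue' ID such as '1312047-2740484'."""
    build_id, queue_id = job_id.split("-")
    return int(build_id), int(queue_id)


def is_builder_lost(current: Builder, job_id: str,
                    queues: Dict[int, BuildQueue]) -> bool:
    """Return True if the slave is working on a job we no longer track."""
    build_id, queue_id = parse_slave_job_id(job_id)
    queue = queues.get(queue_id)
    # Existence and linkage: the checks the report says already exist.
    if queue is None or queue.build_id != build_id:
        return True
    # The check the report says is missing: the buildqueue must still
    # be assigned to the builder that is reporting it.
    return queue.builder != current


artigas = Builder("artigas")
# The buildqueue was unassigned when the builder was marked not-OK.
queues = {2740484: BuildQueue(id=2740484, build_id=1312047, builder=None)}
lost = is_builder_lost(artigas, "1312047-2740484", queues)
```

With the unassigned buildqueue from the incident, `lost` is `True`, so under this sketch a rescue would be attempted instead of leaving the slave sitting in WAITING.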
Related branches
- Canonical Launchpad Engineering: Pending requested
- Diff: 311 lines (+191/-7) (has conflicts), 4 files modified
  - lib/lp/buildmaster/buildergroup.py (+86/-0)
  - lib/lp/buildmaster/model/buildfarmjobbehavior.py (+9/-1)
  - lib/lp/soyuz/doc/buildd-slavescanner.txt (+74/-6)
  - lib/lp/soyuz/tests/soyuzbuilddhelpers.py (+22/-0)
Changed in soyuz: | |
status: | New → In Progress |
assignee: | nobody → William Grant (wgrant) |
Changed in soyuz: | |
status: | In Progress → Triaged |
Changed in soyuz: | |
status: | Incomplete → New |
assignee: | William Grant (wgrant) → nobody |
Changed in soyuz: | |
status: | New → Triaged |
importance: | Undecided → Low |
Changed in soyuz: | |
status: | Triaged → In Progress |
assignee: | nobody → William Grant (wgrant) |
Did you do any work on this, William?