buildd handling lives in ivory tower of perfect networks

Bug #54946 reported by James Troup
This bug affects 1 person
Affects: Launchpad itself
Status: Fix Released
Importance: High
Assigned to: Julian Edwards

Bug Description

The buildd design seems to assume perfect networks. If a connection drops, not only is the current build lost forever, but the buildd itself is marked NOT OK and never retried until a human comes along to reset it.

This is really bad for a couple of reasons:

 (1) Even in our data centre, the network is not perfect: cables get knocked out, typos happen in firewall scripts, etc. And if it's the right cable or machine, it can affect all buildds at once, which affects a lot of builds and is a lot of human work to recover from.

 (2) The Launchpad buildd stuff is meant to scale to remote buildds connected to the build-master over the Internet. The connection there is going to be even more fragile, and it's entirely possible that connection drops will be routine.
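
The behaviour this report asks for (tolerate transient connection drops, keep the build, and only give up on a buildd after repeated failures) can be sketched roughly as follows. This is a minimal illustration, not Launchpad's actual code or the fix that eventually landed: the `Builder` class, `FAILURE_THRESHOLD`, `record_result`, and the `requeue` callback are all hypothetical names invented for the example.

```python
from dataclasses import dataclass
from typing import Callable, Optional

# Hypothetical threshold: how many consecutive communication failures
# are tolerated before a builder is marked NOT OK.
FAILURE_THRESHOLD = 5


@dataclass
class Builder:
    """Minimal stand-in for a buildd record (not Launchpad's real model)."""
    name: str
    builderok: bool = True
    failure_count: int = 0
    current_build: Optional[str] = None


def record_result(builder: Builder, succeeded: bool,
                  requeue: Callable[[str], None]) -> None:
    """Update builder state after one attempt to talk to the buildd."""
    if succeeded:
        # Any successful contact clears the history of failures.
        builder.failure_count = 0
        return

    builder.failure_count += 1

    # Hand the interrupted build back to the queue instead of losing it.
    if builder.current_build is not None:
        requeue(builder.current_build)
        builder.current_build = None

    # Only after repeated consecutive failures is the builder disabled,
    # at which point a human (or a later automated check) must reset it.
    if builder.failure_count >= FAILURE_THRESHOLD:
        builder.builderok = False
```

With this shape, a single dropped connection costs one requeued build and one increment of the counter, while the builder stays available for the next scan. The threshold value is an assumption and would need tuning against how often real outages occur.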

Julian Edwards (julian-edwards) wrote :

I don't think this happens any more, marking as released.

Changed in soyuz:
status: New → Fix Released
James Troup (elmo) wrote :

Sorry, but reality begs to differ. We had to reset several buildds in soyuz due to this bug just on Friday.

Changed in soyuz:
status: Fix Released → New
Julian Edwards (julian-edwards) wrote :

We don't lose builds any more though, right? And there are several dupes of (2) above.

Julian Edwards (julian-edwards) wrote :

Gah, when I say (2) I mean the buildd goes to "NOT OK"

Changed in soyuz:
status: New → Triaged
importance: Undecided → Medium
tags: added: buildd-manager
Tom Haddon (mthaddon) wrote :

Just had to restart buildd-manager because it hung:

2010-02-18 12:28:53+0000 [-] <lawrencium:http://lawrencium.ppa:8221/> communication failed (User timeout caused connection failure.)

This seems to be the log entry most likely related to the hang.

Tom Haddon (mthaddon)
tags: added: canonical-losa-lp
Tom Haddon (mthaddon) wrote :

This has bitten us again over the weekend, so a bunch of the builders need resetting (they've been marked disabled). Increasing the priority of this, as it's a lot of manual work to recover from.

Changed in soyuz:
importance: Medium → High
Changed in soyuz:
status: Triaged → In Progress
assignee: nobody → Julian Edwards (julian-edwards)
Brad Crittenden (bac)
tags: added: bad-commit-11801
Launchpad QA Bot (lpqabot) wrote : Bug fixed by a commit
Changed in soyuz:
milestone: none → 10.11
tags: added: qa-needstesting
Changed in soyuz:
status: In Progress → Fix Committed
Launchpad QA Bot (lpqabot) wrote :
tags: added: qa-ok
removed: qa-needstesting
Launchpad QA Bot (lpqabot) wrote :

Fixed in stable r11815 (http://bazaar.launchpad.net/~launchpad-pqm/launchpad/stable/revision/11815) by a commit, but not testable.

tags: added: qa-untestable
removed: qa-ok
Robert Collins (lifeless) wrote :

Needs a deployment to the buildd-manager machine to fix-release this.

Launchpad QA Bot (lpqabot) wrote :

Fixed in stable r11815 (http://bazaar.launchpad.net/~launchpad-pqm/launchpad/stable/revision/11815) by a commit, but not testable.

tags: added: buildd-scalability
Changed in soyuz:
status: Fix Committed → Fix Released