buildd handling lives in ivory tower of perfect networks

Bug #54946 reported by James Troup on 2006-08-02
This bug affects 1 person
Affects: Launchpad itself
Assigned to: Julian Edwards

Bug Description

The buildd design seems to assume perfect networks. If a connection drops, not only is the current build lost forever, but the buildd itself is marked NOT OK and never retried until a human comes along to reset it.

This is really bad for a couple of reasons:

 (1) Even in our data centre, the network is not perfect: cables get knocked out, typos happen in firewall scripts, etc. And if it's the right cable or machine, it can affect all buildds at once, which affects a lot of builds and takes a lot of human work to recover from.

 (2) The launchpad buildd stuff is meant to scale to remote buildds connected to the build-master over the internet. The connection there is going to be even more fragile and it's entirely possible that connection drops will be routine.
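As a rough illustration of the behaviour being asked for, here is a minimal sketch in which a dropped connection puts the build back on the queue and the buildd is only marked NOT OK after several consecutive failures. Every name here (Builder, BuildQueue, record_failure, FAILURE_THRESHOLD) is hypothetical and made up for this example; it is not the buildd-manager code, and the threshold value is an arbitrary assumption.

```python
from dataclasses import dataclass, field
from typing import List, Optional

# Hypothetical value: consecutive failures tolerated before a human is needed.
FAILURE_THRESHOLD = 5


@dataclass
class Builder:
    """Illustrative stand-in for a buildd as seen by the build-master."""
    name: str
    builder_ok: bool = True
    failure_count: int = 0
    current_build: Optional[str] = None


@dataclass
class BuildQueue:
    """Illustrative stand-in for the queue of builds waiting for a buildd."""
    pending: List[str] = field(default_factory=list)


def record_success(builder: Builder) -> None:
    """A successful exchange clears the consecutive-failure counter."""
    builder.failure_count = 0


def record_failure(builder: Builder, queue: BuildQueue,
                   threshold: int = FAILURE_THRESHOLD) -> bool:
    """Handle one communication failure; return whether the buildd is still OK.

    The in-flight build is requeued rather than lost, and the buildd is
    only marked NOT OK once `threshold` consecutive failures accumulate.
    """
    builder.failure_count += 1
    if builder.current_build is not None:
        queue.pending.append(builder.current_build)
        builder.current_build = None
    if builder.failure_count >= threshold:
        builder.builder_ok = False
    return builder.builder_ok
```

With this shape, a single knocked-out cable costs one retry per affected build instead of a manual reset of every buildd; only a buildd that keeps failing ends up needing attention.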

Julian Edwards (julian-edwards) wrote :

I don't think this happens any more, marking as released.

Changed in soyuz:
status: New → Fix Released
James Troup (elmo) wrote :

Sorry, but reality begs to differ. We had to reset several buildds in soyuz due to this bug just on Friday.

Changed in soyuz:
status: Fix Released → New
Julian Edwards (julian-edwards) wrote :

We don't lose builds any more though, right? And there are several dupes of (2) above.

Julian Edwards (julian-edwards) wrote :

Gah, when I say (2) I mean the buildd goes to "NOT OK"

Changed in soyuz:
status: New → Triaged
importance: Undecided → Medium
tags: added: buildd-manager
Tom Haddon (mthaddon) wrote :

Just had to restart buildd-manager because it hung:

"2010-02-18 12:28:53+0000 [-] <lawrencium:http://lawrencium.ppa:8221/> communication failed (User timeout caused connection failure.)"

Seems like the most obvious log entry that may have caused this.

Tom Haddon (mthaddon) on 2010-05-28
tags: added: canonical-losa-lp
Tom Haddon (mthaddon) wrote :

This has bitten us again over the weekend, meaning that a bunch of the builders need resetting (they've been marked disabled). Increasing the priority of this, as it's a lot of manual work to recover from.

Changed in soyuz:
importance: Medium → High
Changed in soyuz:
status: Triaged → In Progress
assignee: nobody → Julian Edwards (julian-edwards)
Brad Crittenden (bac) on 2010-10-27
tags: added: bad-commit-11801
Changed in soyuz:
milestone: none → 10.11
tags: added: qa-needstesting
Changed in soyuz:
status: In Progress → Fix Committed
tags: added: qa-ok
removed: qa-needstesting
Launchpad QA Bot (lpqabot) wrote :

Fixed in stable r11815 by a commit, but not testable.

tags: added: qa-untestable
removed: qa-ok
Robert Collins (lifeless) wrote :

Needs a deployment to the buildd-manager machine to fix-release this.

Launchpad QA Bot (lpqabot) wrote :

Fixed in stable r11815 by a commit, but not testable.

tags: added: buildd-scalability
Changed in soyuz:
status: Fix Committed → Fix Released