buildd-manager fails to deal with "Fault 8002" errors

Reported by Tom Haddon on 2009-12-14
This bug affects 2 people
Affects: Launchpad itself
Importance: High
Assigned to: Julian Edwards

Bug Description

The buildd-manager was returning "Fault 8002" in the logs. This was preventing it from processing new builds, and we were only made aware of the problem from a user report.

Related branches

lp:~bac/launchpad/rollback-11801
Rejected for merging into lp:launchpad
Brad Crittenden: Approve (code) on 2010-10-27
Julian Edwards (julian-edwards) wrote :

See also bug 451351 and bug 369109

tags: added: buildd-manager soyuz-build
Changed in soyuz:
status: New → Triaged
importance: Undecided → High
Jeroen T. Vermeulen (jtv) wrote :

I also got this while testing the translation templates build jobs on dogfood. Failure of the jobs themselves was sort of expected, what with some setup remaining to be done, but my job just kept being restarted.

Tom Haddon (mthaddon) on 2010-05-28
tags: added: canonical-losa-lp
Tom Haddon (mthaddon) wrote :

Happened again on four i386 buildds just now.

Julian Edwards (julian-edwards) wrote :

In the most recent case, the builders were disabled, which is the correct thing to do. So in terms of "handling" the problem, I'm not sure what else it should be doing. I don't think performing an automatic reset is really correct in case someone needs to debug a builder problem.

Tom Haddon (mthaddon) wrote :

On Mon, 2010-06-07 at 10:22 +0000, Julian Edwards wrote:
> In the most recent case, the builders were disabled, which is the
> correct thing to do. So in terms of "handling" the problem, I'm not sure
> what else it should be doing. I don't think performing an automatic
> reset is really correct in case someone needs to debug a builder
> problem.
>

Maybe we should be approaching it slightly differently. What causes a
"Fault 8002"? If we don't exactly know, perhaps that's where the focus
of this bug should be...

Julian Edwards (julian-edwards) wrote :

On Monday 07 June 2010 12:44:29 Tom Haddon wrote:
> Maybe we should be approaching it slightly differently. What causes a
> "Fault 8002"? If we don't exactly know, perhaps that's where the focus
> of this bug should be...

It's a genuine error on the slave, and this is how Twisted XMLRPC materialises
it on the client (buildd-manager) side.

It could be a bunch of different problems but it usually indicates a fatal
problem on the slave, such as a coding error or something similar that leads
to an exception. Disabling the builder is the right thing to do until we work
out what the problem is. In this particular case we need to investigate more,
though; I suspect that Twisted is throwing a wobbly somewhere.
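
To illustrate the behaviour described above, here is a minimal sketch of a client that disables a builder when the slave returns a fault with code 8002. It is not the buildd-manager code: the `Builder` class, its `enabled` and `notes` fields, and the `poll` helper are all hypothetical names made up for this example. It assumes that 8002 is the generic failure code Twisted's XML-RPC server sends when a remote method raises an unhandled exception (`twisted.web.xmlrpc.XMLRPC.FAILURE`), and it uses the standard library's `xmlrpc.client.Fault` so that it runs without Twisted installed.

```python
from dataclasses import dataclass
from xmlrpc.client import Fault

# Assumption: Twisted's XML-RPC server reports an unhandled exception in a
# remote method as a fault with this code (twisted.web.xmlrpc.XMLRPC.FAILURE).
TWISTED_XMLRPC_FAILURE = 8002


@dataclass
class Builder:
    """Hypothetical stand-in for a builder record; not Launchpad's model."""
    name: str
    enabled: bool = True
    notes: str = ""


def handle_slave_fault(builder, fault):
    """Disable the builder on a generic slave failure.

    Returns True if the builder was disabled, False for any other fault code.
    """
    if fault.faultCode != TWISTED_XMLRPC_FAILURE:
        return False
    builder.enabled = False
    # Keep the fault text so that someone can debug the builder later.
    builder.notes = "Fault %d: %s" % (fault.faultCode, fault.faultString)
    return True


def poll(builder, status_call):
    """Call the slave once; on Fault 8002, disable the builder and give up.

    `status_call` is any zero-argument callable that performs the XML-RPC
    request. Other fault codes are re-raised for the caller to handle.
    """
    try:
        return status_call()
    except Fault as fault:
        if not handle_slave_fault(builder, fault):
            raise
        return None
```

The sketch deliberately does not reset or retry the builder automatically, which matches the point made earlier in the thread: a disabled builder stays available for someone to inspect.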

Julian Edwards (julian-edwards) wrote :

The linked branch is the buildd-manager almost-re-write. It handles failures *much* better and will shut a job down if it Goes Bad.

Changed in soyuz:
status: Triaged → In Progress
assignee: nobody → Julian Edwards (julian-edwards)
Brad Crittenden (bac) on 2010-10-27
tags: added: bad-commit-11801
Changed in soyuz:
milestone: none → 10.11
tags: added: qa-needstesting
Changed in soyuz:
status: In Progress → Fix Committed
tags: added: qa-ok
removed: qa-needstesting
Launchpad QA Bot (lpqabot) wrote :

Fixed in stable r11815 (http://bazaar.launchpad.net/~launchpad-pqm/launchpad/stable/revision/11815) by a commit, but not testable.

tags: added: qa-untestable
removed: qa-ok
Robert Collins (lifeless) wrote :

Needs a deployment to the buildd-manager machine to fix-release this.

tags: added: buildd-scalability
Changed in soyuz:
status: Fix Committed → Fix Released