Resume trigger hangs buildd-manager

Bug #618955 reported by William Grant on 2010-08-16
Affects: Launchpad itself
Importance: High
Assigned to: Julian Edwards

Bug Description

This morning buildd-manager logged the traceback at http://pastebin.ubuntu.com/479111/, then promptly hung. Some local testing reveals that the normal resume trigger asynchrony and timeout code doesn't work when a resume is triggered from within Builder.handleTimeout, so a single unresettable builder can bring down everything. Killing the rogue ssh reset process woke things up again.

To reproduce:

 1) Tweak vm_resume_command to something like 'sleep 30'.
 2) Enable both sampledata builders.
 3) Start launchpad-buildd (so connecting to the non-virt bob will succeed, but connecting to the virt frog will fail).
 4) Start buildd-manager.
 5) Watch buildd-manager and launchpad-buildd logs, noting large breaks in activity while a resume attempt is made on the unreachable frog builder.

I've not been able to confirm whether this also affects normal build-start resumes.
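
The stall described above can be illustrated in miniature. This is only a sketch under assumptions, not buildd-manager code: it uses asyncio rather than Launchpad's own event loop, `blocking_resume` is a hypothetical stand-in for a resume command such as 'sleep 30' run synchronously, and `poll_builder` stands in for routine polling of a healthy builder.

```python
import asyncio
import time


def blocking_resume(seconds):
    """Hypothetical stand-in for a resume command run synchronously."""
    time.sleep(seconds)


async def poll_builder(ticks, interval, stamps):
    """Record a timestamp each time the healthy builder is polled."""
    for _ in range(ticks):
        stamps.append(time.monotonic())
        await asyncio.sleep(interval)


async def failing_builder(delay, resume_seconds):
    """After a short delay, run the blocking resume inside the event loop."""
    await asyncio.sleep(delay)
    blocking_resume(resume_seconds)  # nothing else runs until this returns


async def main(resume_seconds):
    stamps = []
    await asyncio.gather(
        poll_builder(5, 0.05, stamps),
        failing_builder(0.08, resume_seconds),
    )
    # Largest gap between consecutive polls of the healthy builder.
    return max(b - a for a, b in zip(stamps, stamps[1:]))


def largest_gap(resume_seconds=0.5):
    """Return the longest pause seen by the healthy builder, in seconds."""
    return asyncio.run(main(resume_seconds))
```

The healthy builder is meant to be polled every 0.05 seconds, but the largest gap it sees is at least as long as the blocking resume, which is the same kind of break in activity that step 5 looks for in the logs.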

tags: added: canonical-losa-lp
tags: added: buildd-manager
Changed in soyuz:
status: New → Triaged
importance: Undecided → High
Julian Edwards (julian-edwards) wrote :

Actually I saw this happening on dogfood but it always came back to life. There was also no break in activity on other builders.

The traceback shows the call stack to be updateBuilderStatus -> checkSlaveAlive -> self.slave.echo

Can you explain what you mean by "timeout code doesn't work when a resume is triggered from within Builder"?

On Tue, 2010-08-17 at 08:31 +0000, Julian Edwards wrote:
> Actually I saw this happening on dogfood but it always came back to
> life. There was also no break in activity on other builders.

Can you try to confirm the lack of an activity break on other builders?
My testing locally showed that it did indeed block other builders, and
the production issue this morning strongly suggests it.

> The traceback shows the call stack to be updateBuilderStatus ->
> checkSlaveAlive -> self.slave.echo

Yes. The exception handler then prints the exception, and calls
handleTimeout which resumes the slave (or at least sets the flag to
request it).

> Can you explain what you mean by "timeout code doesn't work when a
> resume is triggered from within Builder" ?

There is meant to be a timeout applied to the resume trigger. This
doesn't seem to work when Builder.handleTimeout calls it, but I'm not
sure if it even works in the normal case (when called on build start).
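
For reference, a timeout on an external resume command might look roughly like the sketch below. It is an assumption-laden illustration using only the Python standard library, not the code in buildd-manager; `run_resume` is a hypothetical helper name.

```python
import subprocess
import sys


def run_resume(command, timeout):
    """Run a resume command, giving up after `timeout` seconds.

    Returns True if the command exits with status 0 in time,
    False if it fails or times out.
    """
    try:
        result = subprocess.run(command, timeout=timeout)
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child and waits for it before raising,
        # so no rogue process is left behind.
        return False
    return result.returncode == 0


if __name__ == "__main__":
    # Portable stand-in for a resume command that never finishes in time.
    slow = [sys.executable, "-c", "import time; time.sleep(30)"]
    print(run_resume(slow, timeout=1))
```

Even with the timeout, this form still blocks its caller for up to `timeout` seconds, so it bounds the hang rather than removing it.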

Julian Edwards (julian-edwards) wrote :

Ah, I see what's going on now; the traceback was misleading.

So handleTimeout() should not be trying to reset the (virtual) machine here,
it should do nothing. It doesn't make any sense to immediately try and reset
a machine that we know to be timing out, especially synchronously!

If we just leave it, the next dispatch will try and reset it anyway, and that
will also get done *asynchronously*.
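
A rough sketch of what an asynchronous reset with a timeout could look like follows. It assumes asyncio purely for illustration (it is not the actual dispatch code), and `resume_async` and `dispatch_all` are hypothetical names.

```python
import asyncio
import sys


async def resume_async(command, timeout):
    """Run a resume command without blocking the event loop.

    Returns True on a zero exit status, False on failure or timeout.
    """
    proc = await asyncio.create_subprocess_exec(*command)
    try:
        returncode = await asyncio.wait_for(proc.wait(), timeout)
    except asyncio.TimeoutError:
        proc.kill()        # do not leave a rogue process behind
        await proc.wait()  # reap the killed child
        return False
    return returncode == 0


async def dispatch_all(commands, timeout):
    """Resume several (hypothetical) builders concurrently."""
    return await asyncio.gather(
        *(resume_async(command, timeout) for command in commands)
    )


if __name__ == "__main__":
    slow = [sys.executable, "-c", "import time; time.sleep(30)"]
    fast = [sys.executable, "-c", "pass"]
    print(asyncio.run(dispatch_all([slow, fast], timeout=2.0)))
```

Because each reset is awaited rather than run inline, an unreachable builder only costs its own timeout and the other builders keep being serviced in the meantime.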

Changed in soyuz:
status: Triaged → In Progress
assignee: nobody → Julian Edwards (julian-edwards)
Brad Crittenden (bac) on 2010-10-27
tags: added: bad-commit-11801
Changed in soyuz:
milestone: none → 10.11
tags: added: qa-needstesting
Changed in soyuz:
status: In Progress → Fix Committed
tags: added: qa-ok
removed: qa-needstesting
Launchpad QA Bot (lpqabot) wrote :

Fixed in stable r11815 (http://bazaar.launchpad.net/~launchpad-pqm/launchpad/stable/revision/11815) by a commit, but not testable.

tags: added: qa-untestable
removed: qa-ok
Robert Collins (lifeless) wrote :

Needs a deployment to the buildd-manager machine to fix-release this.

Changed in soyuz:
status: Fix Committed → Fix Released
tags: added: buildd-scalability