Resume trigger hangs buildd-manager

Bug #618955 reported by William Grant on 2010-08-16

This bug affects 1 person

Affects	Status	Importance	Assigned to	Milestone
Launchpad itself	Fix Released	High	Julian Edwards	Launchpad itself 10.11 "Tenacious Turkey"

Bug Description

This morning buildd-manager logged the traceback at http://pastebin.ubuntu.com/479111/, then promptly hung. Local testing shows that the normal resume trigger asynchrony and timeout code does not work when a resume is triggered from within Builder.handleTimeout, so a single unresettable builder can bring down everything. Killing the rogue ssh reset process woke things up again.

To reproduce:

1) Tweak vm_resume_command to something like 'sleep 30'.
2) Enable both sampledata builders.
3) Start launchpad-buildd (so connecting to the non-virt bob will succeed, but the virt frog will fail).
4) Start buildd-manager.
5) Watch buildd-manager and launchpad-buildd logs, noting large breaks in activity while a resume attempt is made on the unreachable frog builder.

I've not been able to confirm whether this also affects normal build-start resumes.
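
The stall described above can be illustrated outside Launchpad. The following is a minimal sketch, not the buildd-manager code: it uses Python's asyncio as a stand-in for Twisted, and every name in it (SLOW_RESUME, poll_other_builder, blocking_resume, async_resume, largest_gap) is hypothetical. It compares a resume command run synchronously from inside the event loop with the same command run as an asynchronous subprocess, and measures the longest pause seen by an unrelated polling task.

```python
import asyncio
import subprocess
import sys
import time

# Stand-in for vm_resume_command tweaked to a slow command (step 1 above).
SLOW_RESUME = [sys.executable, "-c", "import time; time.sleep(0.5)"]


async def poll_other_builder(ticks):
    """Record a timestamp every 50 ms, like a healthy builder being scanned."""
    for _ in range(10):
        ticks.append(time.monotonic())
        await asyncio.sleep(0.05)


async def blocking_resume():
    """Hypothetical timeout handler that runs the resume synchronously."""
    await asyncio.sleep(0.1)
    subprocess.call(SLOW_RESUME)  # blocks the whole event loop until it exits


async def async_resume():
    """Same resume command, but the event loop stays free while it runs."""
    await asyncio.sleep(0.1)
    proc = await asyncio.create_subprocess_exec(*SLOW_RESUME)
    await proc.wait()


def largest_gap(resume):
    """Run the poller alongside `resume`; return the longest pause in polling."""
    ticks = []

    async def main():
        await asyncio.gather(poll_other_builder(ticks), resume())

    asyncio.run(main())
    return max(later - earlier for earlier, later in zip(ticks, ticks[1:]))
```

Calling largest_gap(blocking_resume) should show a pause at least as long as the resume command itself, while largest_gap(async_resume) should stay close to the polling interval. That matches the "large breaks in activity" seen in step 5.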

Related branches

lp:~julian-edwards/launchpad/builderslave-resume

Merged into lp:launchpad at revision 11801

Jonathan Lange (community): Approve on 2010-10-19

lp:~bac/launchpad/rollback-11801

Rejected for merging into lp:launchpad

Brad Crittenden (community): Approve (code) on 2010-10-27

Michael Barnett (mbarnett) on 2010-08-16

tags:	added: canonical-losa-lp

Julian Edwards (julian-edwards) on 2010-08-17

tags:	added: buildd-manager
Changed in soyuz:
status:	New → Triaged
importance:	Undecided → High

Julian Edwards (julian-edwards) wrote on 2010-08-17:

Actually I saw this happening on dogfood but it always came back to life. There was also no break in activity on other builders.

The traceback shows the call stack to be updateBuilderStatus -> checkSlaveAlive -> self.slave.echo

Can you explain what you mean by "timeout code doesn't work when a resume is triggered from within Builder" ?

William Grant (wgrant) wrote on 2010-08-17: Re: [Bug 618955] Re: Resume trigger hangs buildd-manager

On Tue, 2010-08-17 at 08:31 +0000, Julian Edwards wrote:
> Actually I saw this happening on dogfood but it always came back to
> life. There was also no break in activity on other builders.

Can you try to confirm the lack of an activity break on other builders?
My testing locally showed that it did indeed block other builders, and
the production issue this morning strongly suggests it.

> The traceback shows the call stack to be updateBuilderStatus ->
> checkSlaveAlive -> self.slave.echo

Yes. The exception handler then prints the exception, and calls
handleTimeout which resumes the slave (or at least sets the flag to
request it).

> Can you explain what you mean by "timeout code doesn't work when a
> resume is triggered from within Builder" ?

There is meant to be a timeout applied to the resume trigger. This
doesn't seem to work when Builder.handleTimeout calls it, but I'm not
sure if it even works in the normal case (when called on build start).

Julian Edwards (julian-edwards) wrote on 2010-08-17:

Ah I see what's going on now, the traceback was misleading.

So handleTimeout() should not be trying to reset the (virtual) machine here,
it should do nothing. It doesn't make any sense to immediately try and reset
a machine that we know to be timing out, especially synchronously!

If we just leave it, the next dispatch will try and reset it anyway, and that
will also get done *asynchronously*.
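
As an illustration only, an asynchronous resume with a timeout might look like the sketch below. It is not the change that landed in r11801: it uses asyncio rather than Twisted, and resume_with_timeout and its parameters are hypothetical names. The idea is that the resume command runs as a subprocess without blocking the event loop, and is killed if it overruns its time limit, so an unreachable builder cannot hold up the others.

```python
import asyncio
import sys


async def resume_with_timeout(command, timeout):
    """Run a resume command without blocking; kill it if it overruns.

    Returns True if the command exited with status 0 within `timeout`
    seconds, False otherwise.
    """
    proc = await asyncio.create_subprocess_exec(*command)
    try:
        await asyncio.wait_for(proc.wait(), timeout)
    except asyncio.TimeoutError:
        # The builder did not come back in time: stop waiting and clean up
        # the child instead of letting it hang around.
        proc.kill()
        await proc.wait()
        return False
    return proc.returncode == 0


if __name__ == "__main__":
    # A command that sleeps far longer than the allowed timeout.
    slow = [sys.executable, "-c", "import time; time.sleep(5)"]
    print(asyncio.run(resume_with_timeout(slow, timeout=0.2)))
```

Under this design the timeout handler itself would only record the failure, leaving the actual reset to the next dispatch.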

Julian Edwards (julian-edwards) on 2010-10-19

Changed in soyuz:
status:	Triaged → In Progress
assignee:	nobody → Julian Edwards (julian-edwards)

Brad Crittenden (bac) on 2010-10-27

tags:	added: bad-commit-11801

Launchpad QA Bot (lpqabot) wrote on 2010-10-27: Bug fixed by a commit

Fixed in stable r11801 <http://bazaar.launchpad.net/~launchpad-pqm/launchpad/stable/revision/11801>.

Changed in soyuz:
milestone:	none → 10.11
tags:	added: qa-needstesting
Changed in soyuz:
status:	In Progress → Fix Committed

Launchpad QA Bot (lpqabot) wrote on 2010-10-28:

Fixed in stable r11808 <http://bazaar.launchpad.net/~launchpad-pqm/launchpad/stable/revision/11808>.

Julian Edwards (julian-edwards) on 2010-10-29

tags:	added: qa-ok
	removed: qa-needstesting

Launchpad QA Bot (lpqabot) wrote on 2010-10-29:

Fixed in stable r11815 (http://bazaar.launchpad.net/~launchpad-pqm/launchpad/stable/revision/11815) by a commit, but not testable.

tags:	added: qa-untestable
	removed: qa-ok

Robert Collins (lifeless) wrote on 2010-11-01:

Needs a deployment to the buildd-manager machine to fix-release this.

Launchpad QA Bot (lpqabot) wrote on 2010-11-04:

Fixed in stable r11815 (http://bazaar.launchpad.net/~launchpad-pqm/launchpad/stable/revision/11815) by a commit, but not testable.

Julian Edwards (julian-edwards) on 2010-11-16

Changed in soyuz:
status:	Fix Committed → Fix Released
tags:	added: buildd-scalability
