New buildd-manager disabling everything in sight

Bug #671242 reported by Steve Kowalik
Affects: Launchpad itself
Status: Fix Released
Importance: High
Assigned to: Julian Edwards
Milestone: 10.11

Bug Description

After the new rollout to cesium, buildds were marked as disabled by the buildd-manager with the following traceback.

2010-11-04 21:53:08+0000 [QueryProtocol,client] Scanning failed with: 'NoneType' object has no attribute 'processor'
2010-11-04 21:53:08+0000 [QueryProtocol,client] Traceback (most recent call last):
2010-11-04 21:53:08+0000 [QueryProtocol,client] File "/srv/launchpad.net/codelines/soyuz-production-rev-11824/eggs/Twisted-10.1.0-py2.6-linux-x86_64.egg/twisted/web/xmlrpc.py", line 318, in handleResponse
2010-11-04 21:53:08+0000 [QueryProtocol,client] self.factory.parseResponse(contents)
2010-11-04 21:53:08+0000 [QueryProtocol,client] File "/srv/launchpad.net/codelines/soyuz-production-rev-11824/eggs/Twisted-10.1.0-py2.6-linux-x86_64.egg/twisted/web/xmlrpc.py", line 395, in parseResponse
2010-11-04 21:53:08+0000 [QueryProtocol,client] deferred.callback(response)
2010-11-04 21:53:08+0000 [QueryProtocol,client] File "/srv/launchpad.net/codelines/soyuz-production-rev-11824/eggs/Twisted-10.1.0-py2.6-linux-x86_64.egg/twisted/internet/defer.py", line 318, in callback
2010-11-04 21:53:08+0000 [QueryProtocol,client] self._startRunCallbacks(result)
2010-11-04 21:53:08+0000 [QueryProtocol,client] File "/srv/launchpad.net/codelines/soyuz-production-rev-11824/eggs/Twisted-10.1.0-py2.6-linux-x86_64.egg/twisted/internet/defer.py", line 424, in _startRunCallbacks
2010-11-04 21:53:08+0000 [QueryProtocol,client] self._runCallbacks()
2010-11-04 21:53:08+0000 [QueryProtocol,client] --- <exception caught here> ---
2010-11-04 21:53:08+0000 [QueryProtocol,client] File "/srv/launchpad.net/codelines/soyuz-production-rev-11824/eggs/Twisted-10.1.0-py2.6-linux-x86_64.egg/twisted/internet/defer.py", line 441, in _runCallbacks
2010-11-04 21:53:08+0000 [QueryProtocol,client] self.result = callback(self.result, *args, **kw)
2010-11-04 21:53:08+0000 [QueryProtocol,client] File "/srv/launchpad.net/codelines/soyuz-production-rev-11824/lib/lp/buildmaster/model/builder.py", line 298, in rescue_slave
2010-11-04 21:53:08+0000 [QueryProtocol,client] builder.verifySlaveBuildCookie(slave_build_id)
2010-11-04 21:53:08+0000 [QueryProtocol,client] File "/srv/launchpad.net/codelines/soyuz-production-rev-11824/lib/lp/buildmaster/model/builder.py", line 561, in verifySlaveBuildCookie
2010-11-04 21:53:08+0000 [QueryProtocol,client] slave_build_id)
2010-11-04 21:53:08+0000 [QueryProtocol,client] File "/srv/launchpad.net/codelines/soyuz-production-rev-11824/lib/lp/buildmaster/model/buildfarmjobbehavior.py", line 69, in verifySlaveBuildCookie
2010-11-04 21:53:08+0000 [QueryProtocol,client] expected_cookie = self.buildfarmjob.generateSlaveBuildCookie()
2010-11-04 21:53:08+0000 [QueryProtocol,client] File "/srv/launchpad.net/codelines/soyuz-production-rev-11824/lib/lp/buildmaster/model/buildfarmjob.py", line 166, in generateSlaveBuildCookie
2010-11-04 21:53:08+0000 [QueryProtocol,client] if buildqueue.processor is None:
2010-11-04 21:53:08+0000 [QueryProtocol,client] exceptions.AttributeError: 'NoneType' object has no attribute 'processor'
2010-11-04 21:53:08+0000 [QueryProtocol,client]
2010-11-04 21:53:17+0000 [QueryProtocol,client] Builder adare failed a probe, count: 6
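
For reference, here is a minimal self-contained sketch of the failing pattern, using stand-in classes (BuildQueueSet and BuildFarmJob below are simplified placeholders, not the real Launchpad implementations): once the BuildQueue row for a job is gone, getByJob returns None and the attribute access blows up exactly as in the traceback.

class BuildQueueSet:
    """Stand-in for the utility behind getByJob; maps jobs to queue rows."""
    def __init__(self):
        self._queues = {}

    def getByJob(self, job):
        # Returns None once the BuildQueue row has been destroyed.
        return self._queues.get(job)


class BuildFarmJob:
    """Stand-in for the object whose cookie generation fails."""
    def __init__(self, job, queue_set):
        self.job = job
        self.queue_set = queue_set

    def generateSlaveBuildCookie(self):
        buildqueue = self.queue_set.getByJob(self.job)
        # buildqueue is None here, so this access raises AttributeError.
        if buildqueue.processor is None:
            return 'no-processor-cookie'
        return 'cookie-%s' % buildqueue.processor


farm_job = BuildFarmJob(job='job-1', queue_set=BuildQueueSet())
try:
    farm_job.generateSlaveBuildCookie()
except AttributeError as e:
    print(e)  # 'NoneType' object has no attribute 'processor'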

Tags: lp-soyuz qa-ok

Steve Kowalik (stevenk)
Changed in soyuz:
importance: Undecided → High
status: New → Confirmed
William Grant (wgrant) wrote :

It looks like it's performing the slave build cookie check after a failure is detected and the current build is reset. But the Builder object is retained, so its cached BuildFarmJobBehavior is used instead of IdleBuildBehavior.

I suspect that rescueIfLost shouldn't be called after a failure.
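
A toy sketch of that caching hazard (the classes below are illustrative stand-ins, not the real lp.buildmaster code): a behaviour cached on the Builder survives the job being reset, so the next scan sees a job-specific behaviour where it should see the idle one.

class IdleBuildBehavior:
    """Placeholder for the behaviour used when a builder has no job."""


class BinaryPackageBuildBehavior:
    """Placeholder for a real, job-specific build behaviour."""


class Builder:
    def __init__(self, currentjob):
        self.currentjob = currentjob
        self._behavior_cache = None

    @property
    def current_build_behavior(self):
        # Caching the behaviour once means it outlives a job reset.
        if self._behavior_cache is None:
            if self.currentjob is not None:
                self._behavior_cache = BinaryPackageBuildBehavior()
            else:
                self._behavior_cache = IdleBuildBehavior()
        return self._behavior_cache


builder = Builder(currentjob='some-job')
builder.current_build_behavior        # caches the job-specific behaviour
builder.currentjob = None             # failure handling resets the build
# Stale cache: still BinaryPackageBuildBehavior, not IdleBuildBehavior.
print(type(builder.current_build_behavior).__name__)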

Julian Edwards (julian-edwards) wrote :

The Builder object should get reloaded on each scan; obviously that's not happening here.

Julian Edwards (julian-edwards) wrote :

In fact, I should probably invalidate self.builder.

William Grant (wgrant) wrote :

That should already be happening. transaction.commit and transaction.abort are called frequently. I think we may need better logging to properly observe the exact sequence of callbacks.

Changed in soyuz:
status: Confirmed → Triaged
assignee: nobody → Julian Edwards (julian-edwards)
Francis J. Lacoste (flacoste) wrote :

We need to fix this or roll back the refactoring ASAP, as we have a roll-out next week.

Julian Edwards (julian-edwards) wrote : Re: [Bug 671242] Re: New buildd-manager disabling everything in sight

On Monday 08 November 2010 21:21:39 you wrote:
> We need to fix this or roll-back the refactoring ASAP as we have a roll-
> out next week.

I'll be looking at it all day tomorrow and will do one or the other by Wednesday.

Robert Collins (lifeless) wrote :

The specific rollout concern is that we need a merge of db-stable + devel that's safe to deploy everywhere, because we can't do partial deploys of database changes.

Julian Edwards (julian-edwards) wrote :

So far I've narrowed this down to when the builder is yanked from the pool by the Enablement guys. The scanning logic in that regard has regressed; it should be easy to fix.

Robert Collins (lifeless) wrote :

o/

Julian Edwards (julian-edwards) wrote :

In fact it's more general than that: whenever a builder becomes unavailable mid-build, we get trouble.

Julian Edwards (julian-edwards) wrote :

The consequence of this appears to be that the job is getting dispatched to a different builder right away, and I suspect the buildqueue for the failing builder is not getting reset.

2010-11-04 13:56:35+0000 [-] builder shipova failure count: 1, job 'i386 build of mbp-nvidia-bl-dkms 0.25.2~maverick in ubuntu maverick RELEASE' failure count: 1
2010-11-04 13:56:38+0000 [QueryProtocol,client] startBuild(http://terranova.ppa:8221/, mbp-nvidia-bl-dkms, 0.25.2~maverick, Release)
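
A toy illustration of that suspicion (Job and Builder here are bare stand-ins, not the real model classes): the queue entry moves to the new builder, but nothing clears the old builder's side of the link.

class Job:
    def __init__(self):
        self.builder = None


class Builder:
    def __init__(self, name):
        self.name = name
        self.currentjob = None


job = Job()
shipova, terranova = Builder('shipova'), Builder('terranova')

# shipova picks up the job, then fails a scan.
shipova.currentjob = job
job.builder = shipova

# Failure handling requeues the job and it is dispatched to terranova...
job.builder = terranova
terranova.currentjob = job

# ...but if the failing builder's side is never reset, shipova still
# believes it is building the same job.
print(shipova.currentjob is job)  # True: stale reference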

Julian Edwards (julian-edwards) wrote :

Noting more of my findings before I go to bed:

After the previous scan failure, because the fail counts are equal, current_job.reset() is called in the assessFailureCounts() code.

The next thing that will happen to shipova is that we try to scan it again, which along the way ends up calling builder.current_build_behaviour when it tries to check the cookie. The builder is a new object at this point, so current_build_behaviour should return currentjob.required_build_behavior if there's a job, or IdleBuildBehavior() if not.

We are expecting IdleBuildBehavior but it gets a real behaviour. I've no idea why, since the start of the scan re-fetches the builder object. I need to try and simulate this on dogfood tomorrow.

As a separate note, I need to implement distinct failure assessment depending on whether it's a dispatching failure or a polling failure, as we need to be more lenient in resetting the job in the latter case.
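
A rough sketch of that split (names and thresholds are invented for illustration; the real method signatures differ):

class Stub:
    """Tiny stand-in carrying only a failure count; not a real model class."""
    def __init__(self):
        self.failure_count = 0

    def reset(self):
        print('job reset and requeued')

    def disable(self):
        print('builder disabled')


def assess_failure_counts(builder, job, kind):
    # kind is 'dispatch' or 'poll'; be more lenient about resetting the
    # job for polling failures, since a slow or flaky slave shouldn't
    # immediately requeue an otherwise healthy build.
    builder.failure_count += 1
    job.failure_count += 1
    job_threshold = 1 if kind == 'dispatch' else 3
    if job.failure_count >= job_threshold:
        job.reset()
    if builder.failure_count >= 5:
        builder.disable()


builder, job = Stub(), Stub()
assess_failure_counts(builder, job, 'poll')  # first poll failure: no reset yet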

William Grant (wgrant) wrote :

That's about as far as I got, with the limited available log snippets. My next line of investigation was going to be whether the rescue callback was still being called even after the failure *inside the same scan*. Is that possible?

Julian Edwards (julian-edwards) wrote :

On Tuesday 09 November 2010 22:16:47 you wrote:
> That's about as far as I got, with the limited available log snippets.
> My next line of investigation was going to be whether the rescue
> callback was still being called even after the failure *inside the same
> scan*. Is that possible?

That's not possible; it can only do it at the top of the scan.

I think the quick way around this is to reset the build behaviour on the build
when we do buildqueue.reset(). But I need to re-create the problem first to
verify that this works.
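
Sketched with invented names, the idea is roughly this (here the cached behaviour is cleared on the builder when the queue entry is reset; whether the cache lives on the build or the builder is exactly the detail to verify):

class FakeBuilder:
    """Stand-in builder holding a cached behaviour."""
    current_build_behavior = object()


class BuildQueue:
    """Illustrative queue entry; the real class lives in lp.buildmaster."""
    def __init__(self, builder):
        self.builder = builder

    def reset(self):
        # Proposed fix sketch: when the job is taken off the builder, also
        # drop the cached behaviour so the next scan sees the builder idle.
        if self.builder is not None:
            self.builder.current_build_behavior = None
        self.builder = None


b = FakeBuilder()
BuildQueue(b).reset()
print(b.current_build_behavior)  # None: the next scan falls back to idle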

Julian Edwards (julian-edwards) wrote :

Further debugging with a test case gets me here:

162 ->     def generateSlaveBuildCookie(self):
163             """See `IBuildFarmJobOld`."""
164             buildqueue = getUtility(IBuildQueueSet).getByJob(self.job)

The buildqueue.reset() does not remove the job connection between the buildfarmjob and the buildqueue, so this statement works.

Something else is resetting the job.
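
To illustrate the distinction with simplified stand-ins: reset() leaves the job-to-queue link intact, so the getByJob lookup above still succeeds; only destroying the row severs it.

class QueueRegistry:
    """Stand-in for the job -> BuildQueue mapping behind getByJob."""
    def __init__(self):
        self._by_job = {}

    def add(self, job, queue):
        self._by_job[job] = queue

    def getByJob(self, job):
        return self._by_job.get(job)

    def destroy(self, job):
        # Mimics the buildqueue/job/buildfarmjob rows being destroyed
        # once the build completes on another builder.
        del self._by_job[job]


registry = QueueRegistry()
registry.add('job-1', object())

# After a reset() the link survives, so cookie generation still works:
assert registry.getByJob('job-1') is not None

# Destruction is what makes the later lookup return None:
registry.destroy('job-1')
assert registry.getByJob('job-1') is None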

Julian Edwards (julian-edwards) wrote :

Latest info:

It looks like the scan failure doesn't happen until the job that was released from the builder has finished building elsewhere. As soon as that happens and the buildqueue/job/buildfarmjob get destroyed, the scan of the originally failing builder goes bang.

This means that the build is somehow not getting reset properly; I'm still looking at it.
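
Independent of the root cause, a defensive guard on the lookup would at least keep the scan alive. A sketch, using a hypothetical error type rather than whatever the fix that actually landed uses:

class BuildCookieError(Exception):
    """Hypothetical error type; the real code has its own exceptions."""


def generate_slave_build_cookie(queue_set, job):
    # Defensive variant of the failing path: tolerate a destroyed
    # BuildQueue row instead of letting AttributeError kill the scan.
    buildqueue = queue_set.getByJob(job)
    if buildqueue is None:
        raise BuildCookieError('no BuildQueue for job %r' % (job,))
    if buildqueue.processor is None:
        return 'generic-cookie'
    return 'cookie-%s' % buildqueue.processor


class EmptyQueueSet:
    """Queue set whose rows have all been destroyed."""
    def getByJob(self, job):
        return None


try:
    generate_slave_build_cookie(EmptyQueueSet(), 'job-1')
except BuildCookieError as e:
    print(e)  # no BuildQueue for job 'job-1'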

Changed in soyuz:
milestone: none → 10.11
tags: added: qa-ok
Changed in soyuz:
status: Triaged → Fix Committed
Launchpad QA Bot (lpqabot) wrote : Bug fixed by a commit
tags: added: qa-needstesting
removed: qa-ok
tags: added: qa-ok
removed: qa-needstesting
Robert Collins (lifeless) wrote :

Julian, should we do a deploy to cesium in advance of the downtime rollout, to detect issues early?

Julian Edwards (julian-edwards) wrote :

Yep, I'd be happy with that. It was working quite well last time, with the exception of the issue I fixed. Please make sure I am around when it rolls, though.

Robert Collins (lifeless) wrote : Re: [Bug 671242] Re: New buildd-manager disabling everything in sight

On Tue, Nov 16, 2010 at 4:22 AM, Julian Edwards
<email address hidden> wrote:
> Yep, I'd be happy with that.  It was working quite well last time, with
> the exception of the issue I fixed.  Please make sure I am around when
> it rolls though.

I suggest you request it your Tuesday AM from Tom; just put it on the LPS page and use qatagger as the approver. There's no downtime involved, so no need to schedule a specific time.

Curtis Hovey (sinzui)
Changed in soyuz:
status: Fix Committed → Fix Released