New buildd-manager disabling everything in sight

Bug #671242 reported by Steve Kowalik
Affects: Launchpad itself
Status: Fix Released
Importance: High
Assigned to: Julian Edwards
Milestone: 10.11

Bug Description

After the new rollout to cesium, buildds were marked as disabled by the buildd-manager with the following traceback.

2010-11-04 21:53:08+0000 [QueryProtocol,client] Scanning failed with: 'NoneType' object has no attribute 'processor'
2010-11-04 21:53:08+0000 [QueryProtocol,client] Traceback (most recent call last):
2010-11-04 21:53:08+0000 [QueryProtocol,client] File "/srv/launchpad.net/codelines/soyuz-production-rev-11824/eggs/Twisted-10.1.0-py2.6-linux-x86_64.egg/twisted/web/xmlrpc.py", line 318, in handleResponse
2010-11-04 21:53:08+0000 [QueryProtocol,client] self.factory.parseResponse(contents)
2010-11-04 21:53:08+0000 [QueryProtocol,client] File "/srv/launchpad.net/codelines/soyuz-production-rev-11824/eggs/Twisted-10.1.0-py2.6-linux-x86_64.egg/twisted/web/xmlrpc.py", line 395, in parseResponse
2010-11-04 21:53:08+0000 [QueryProtocol,client] deferred.callback(response)
2010-11-04 21:53:08+0000 [QueryProtocol,client] File "/srv/launchpad.net/codelines/soyuz-production-rev-11824/eggs/Twisted-10.1.0-py2.6-linux-x86_64.egg/twisted/internet/defer.py", line 318, in callback
2010-11-04 21:53:08+0000 [QueryProtocol,client] self._startRunCallbacks(result)
2010-11-04 21:53:08+0000 [QueryProtocol,client] File "/srv/launchpad.net/codelines/soyuz-production-rev-11824/eggs/Twisted-10.1.0-py2.6-linux-x86_64.egg/twisted/internet/defer.py", line 424, in _startRunCallbacks
2010-11-04 21:53:08+0000 [QueryProtocol,client] self._runCallbacks()
2010-11-04 21:53:08+0000 [QueryProtocol,client] --- <exception caught here> ---
2010-11-04 21:53:08+0000 [QueryProtocol,client] File "/srv/launchpad.net/codelines/soyuz-production-rev-11824/eggs/Twisted-10.1.0-py2.6-linux-x86_64.egg/twisted/internet/defer.py", line 441, in _runCallbacks
2010-11-04 21:53:08+0000 [QueryProtocol,client] self.result = callback(self.result, *args, **kw)
2010-11-04 21:53:08+0000 [QueryProtocol,client] File "/srv/launchpad.net/codelines/soyuz-production-rev-11824/lib/lp/buildmaster/model/builder.py", line 298, in rescue_slave
2010-11-04 21:53:08+0000 [QueryProtocol,client] builder.verifySlaveBuildCookie(slave_build_id)
2010-11-04 21:53:08+0000 [QueryProtocol,client] File "/srv/launchpad.net/codelines/soyuz-production-rev-11824/lib/lp/buildmaster/model/builder.py", line 561, in verifySlaveBuildCookie
2010-11-04 21:53:08+0000 [QueryProtocol,client] slave_build_id)
2010-11-04 21:53:08+0000 [QueryProtocol,client] File "/srv/launchpad.net/codelines/soyuz-production-rev-11824/lib/lp/buildmaster/model/buildfarmjobbehavior.py", line 69, in verifySlaveBuildCookie
2010-11-04 21:53:08+0000 [QueryProtocol,client] expected_cookie = self.buildfarmjob.generateSlaveBuildCookie()
2010-11-04 21:53:08+0000 [QueryProtocol,client] File "/srv/launchpad.net/codelines/soyuz-production-rev-11824/lib/lp/buildmaster/model/buildfarmjob.py", line 166, in generateSlaveBuildCookie
2010-11-04 21:53:08+0000 [QueryProtocol,client] if buildqueue.processor is None:
2010-11-04 21:53:08+0000 [QueryProtocol,client] exceptions.AttributeError: 'NoneType' object has no attribute 'processor'
2010-11-04 21:53:08+0000 [QueryProtocol,client]
2010-11-04 21:53:17+0000 [QueryProtocol,client] Builder adare failed a probe, count: 6
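
For reference, here is a minimal self-contained sketch of the failing pattern, using stand-in classes (BuildQueueSet and BuildFarmJob below are simplified placeholders, not the real Launchpad implementations): once the BuildQueue row for a job is gone, getByJob returns None and the attribute access blows up exactly as in the traceback.

class BuildQueueSet:
    """Stand-in for the utility behind getByJob; maps jobs to queue rows."""
    def __init__(self):
        self._queues = {}

    def getByJob(self, job):
        # Returns None once the BuildQueue row has been destroyed.
        return self._queues.get(job)


class BuildFarmJob:
    """Stand-in for the object whose cookie generation fails."""
    def __init__(self, job, queue_set):
        self.job = job
        self.queue_set = queue_set

    def generateSlaveBuildCookie(self):
        buildqueue = self.queue_set.getByJob(self.job)
        # buildqueue is None here, so this access raises AttributeError.
        if buildqueue.processor is None:
            return 'no-processor-cookie'
        return 'cookie-%s' % buildqueue.processor


farm_job = BuildFarmJob(job='job-1', queue_set=BuildQueueSet())
try:
    farm_job.generateSlaveBuildCookie()
except AttributeError as e:
    print(e)  # 'NoneType' object has no attribute 'processor'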

Tags: lp-soyuz qa-ok

Steve Kowalik (stevenk)
Changed in soyuz:
importance: Undecided → High
status: New → Confirmed
William Grant (wgrant) wrote :

It looks like it's performing the slave build cookie check after a failure is detected and the current build is reset. But the Builder object is retained, so its cached BuildFarmJobBehavior is used instead of IdleBuildBehavior.

I suspect that rescueIfLost shouldn't be called after a failure.
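
A toy sketch of that caching hazard (the classes below are illustrative stand-ins, not the real lp.buildmaster code): a behaviour cached on the Builder survives the job being reset, so the next scan sees a job-specific behaviour where it should see the idle one.

class IdleBuildBehavior:
    """Placeholder for the behaviour used when a builder has no job."""


class BinaryPackageBuildBehavior:
    """Placeholder for a real, job-specific build behaviour."""


class Builder:
    def __init__(self, currentjob):
        self.currentjob = currentjob
        self._behavior_cache = None

    @property
    def current_build_behavior(self):
        # Caching the behaviour once means it outlives a job reset.
        if self._behavior_cache is None:
            if self.currentjob is not None:
                self._behavior_cache = BinaryPackageBuildBehavior()
            else:
                self._behavior_cache = IdleBuildBehavior()
        return self._behavior_cache


builder = Builder(currentjob='some-job')
builder.current_build_behavior        # caches the job-specific behaviour
builder.currentjob = None             # failure handling resets the build
# Stale cache: still BinaryPackageBuildBehavior, not IdleBuildBehavior.
print(type(builder.current_build_behavior).__name__)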

Julian Edwards (julian-edwards) wrote :

The Builder object should get reloaded on each scan; obviously that's not happening here.

Julian Edwards (julian-edwards) wrote :

In fact, I should probably invalidate self.builder.

William Grant (wgrant) wrote :

That should already be happening. transaction.commit and transaction.abort are called frequently. I think we may need better logging to properly observe the exact sequence of callbacks.

Changed in soyuz:
status: Confirmed → Triaged
assignee: nobody → Julian Edwards (julian-edwards)
Francis J. Lacoste (flacoste) wrote :

We need to fix this or roll back the refactoring ASAP, as we have a roll-out next week.

Julian Edwards (julian-edwards) wrote : Re: [Bug 671242] Re: New buildd-manager disabling everything in sight

On Monday 08 November 2010 21:21:39 you wrote:
> We need to fix this or roll-back the refactoring ASAP as we have a roll-
> out next week.

I'll be looking at it all day tomorrow and will do one or the other by Wednesday.

Robert Collins (lifeless) wrote :

The specific rollout concern is that we need a merge of db-stable + devel that's safe to deploy everywhere, because we can't do partial deploys of database changes.

Julian Edwards (julian-edwards) wrote :

So far I've narrowed this down to when the builder is yanked from the pool by the Enablement guys. The scanning logic in that regard has regressed; it should be easy to fix.

Robert Collins (lifeless) wrote :

o/

Julian Edwards (julian-edwards) wrote :

In fact it's more general than that: whenever a builder becomes unavailable mid-build, we get trouble.

Julian Edwards (julian-edwards) wrote :

The consequence of this appears to be that the job is getting dispatched to a different builder right away, and I suspect the buildqueue for the failing builder is not getting reset.

2010-11-04 13:56:35+0000 [-] builder shipova failure count: 1, job 'i386 build of mbp-nvidia-bl-dkms 0.25.2~maverick in ubuntu maverick RELEASE' failure count: 1
2010-11-04 13:56:38+0000 [QueryProtocol,client] startBuild(http://terranova.ppa:8221/, mbp-nvidia-bl-dkms, 0.25.2~maverick, Release)
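
A toy illustration of that suspicion (Job and Builder here are bare stand-ins, not the real model classes): the queue entry moves to the new builder, but nothing clears the old builder's side of the link.

class Job:
    def __init__(self):
        self.builder = None


class Builder:
    def __init__(self, name):
        self.name = name
        self.currentjob = None


job = Job()
shipova, terranova = Builder('shipova'), Builder('terranova')

# shipova picks up the job, then fails a scan.
shipova.currentjob = job
job.builder = shipova

# Failure handling requeues the job and it is dispatched to terranova...
job.builder = terranova
terranova.currentjob = job

# ...but if the failing builder's side is never reset, shipova still
# believes it is building the same job.
print(shipova.currentjob is job)  # True: stale reference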

Julian Edwards (julian-edwards) wrote :

Noting more of my findings before I go to bed:

After the previous scan failure, because the fail counts are equal, current_job.reset() is called in the assessFailureCounts() code.

The next thing that will happen to shipova is that we try to scan it again, which along the way ends up calling builder.current_build_behaviour when it tries to check the cookie. The builder is a new object at this point, so current_build_behaviour should return currentjob.required_build_behavior if there's a job, or IdleBuildBehavior() if not.

We are expecting IdleBuildBehavior but it gets a real behaviour. I've no idea why, since the start of the scan re-fetches the builder object. I need to try and simulate this on dogfood tomorrow.

As a separate note, I need to implement distinct failure assessment depending on whether it's a dispatching failure or a polling failure, as we need to be more lenient in resetting the job in the latter case.
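
A rough sketch of that split (names and thresholds are invented for illustration; the real method signatures differ):

class Stub:
    """Tiny stand-in carrying only a failure count; not a real model class."""
    def __init__(self):
        self.failure_count = 0

    def reset(self):
        print('job reset and requeued')

    def disable(self):
        print('builder disabled')


def assess_failure_counts(builder, job, kind):
    # kind is 'dispatch' or 'poll'; be more lenient about resetting the
    # job for polling failures, since a slow or flaky slave shouldn't
    # immediately requeue an otherwise healthy build.
    builder.failure_count += 1
    job.failure_count += 1
    job_threshold = 1 if kind == 'dispatch' else 3
    if job.failure_count >= job_threshold:
        job.reset()
    if builder.failure_count >= 5:
        builder.disable()


builder, job = Stub(), Stub()
assess_failure_counts(builder, job, 'poll')  # first poll failure: no reset yet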

William Grant (wgrant) wrote :

That's about as far as I got, with the limited available log snippets. My next line of investigation was going to be whether the rescue callback was still being called even after the failure *inside the same scan*. Is that possible?

Julian Edwards (julian-edwards) wrote :

On Tuesday 09 November 2010 22:16:47 you wrote:
> That's about as far as I got, with the limited available log snippets.
> My next line of investigation was going to be whether the rescue
> callback was still being called even after the failure *inside the same
> scan*. Is that possible?

That's not possible; it can only do it at the top of the scan.

I think the quick way around this is to reset the build behaviour on the build
when we do buildqueue.reset(). But I need to re-create the problem first to
verify that this works.
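
Sketched with invented names, the idea is roughly this (here the cached behaviour is cleared on the builder when the queue entry is reset; whether the cache lives on the build or the builder is exactly the detail to verify):

class FakeBuilder:
    """Stand-in builder holding a cached behaviour."""
    current_build_behavior = object()


class BuildQueue:
    """Illustrative queue entry; the real class lives in lp.buildmaster."""
    def __init__(self, builder):
        self.builder = builder

    def reset(self):
        # Proposed fix sketch: when the job is taken off the builder, also
        # drop the cached behaviour so the next scan sees the builder idle.
        if self.builder is not None:
            self.builder.current_build_behavior = None
        self.builder = None


b = FakeBuilder()
BuildQueue(b).reset()
print(b.current_build_behavior)  # None: the next scan falls back to idle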

Julian Edwards (julian-edwards) wrote :

Further debugging with a test case gets me here:

162 ->     def generateSlaveBuildCookie(self):
163             """See `IBuildFarmJobOld`."""
164             buildqueue = getUtility(IBuildQueueSet).getByJob(self.job)

The buildqueue.reset() does not remove the job connection between the buildfarmjob and the buildqueue, so this statement works.

Something else is resetting the job.
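
To illustrate the distinction with simplified stand-ins: reset() leaves the job-to-queue link intact, so the getByJob lookup above still succeeds; only destroying the row severs it.

class QueueRegistry:
    """Stand-in for the job -> BuildQueue mapping behind getByJob."""
    def __init__(self):
        self._by_job = {}

    def add(self, job, queue):
        self._by_job[job] = queue

    def getByJob(self, job):
        return self._by_job.get(job)

    def destroy(self, job):
        # Mimics the buildqueue/job/buildfarmjob rows being destroyed
        # once the build completes on another builder.
        del self._by_job[job]


registry = QueueRegistry()
registry.add('job-1', object())

# After a reset() the link survives, so cookie generation still works:
assert registry.getByJob('job-1') is not None

# Destruction is what makes the later lookup return None:
registry.destroy('job-1')
assert registry.getByJob('job-1') is None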

Julian Edwards (julian-edwards) wrote :

Latest info:

It looks like the scan failure doesn't happen until the job that was released from the builder has finished building elsewhere. As soon as that happens and the buildqueue/job/buildfarmjob get destroyed, the scan of the originally failing builder goes bang.

This means that the build is somehow not getting reset properly; I'm still looking at it.
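
Independent of the root cause, a defensive guard on the lookup would at least keep the scan alive. A sketch, using a hypothetical error type rather than whatever the fix that actually landed uses:

class BuildCookieError(Exception):
    """Hypothetical error type; the real code has its own exceptions."""


def generate_slave_build_cookie(queue_set, job):
    # Defensive variant of the failing path: tolerate a destroyed
    # BuildQueue row instead of letting AttributeError kill the scan.
    buildqueue = queue_set.getByJob(job)
    if buildqueue is None:
        raise BuildCookieError('no BuildQueue for job %r' % (job,))
    if buildqueue.processor is None:
        return 'generic-cookie'
    return 'cookie-%s' % buildqueue.processor


class EmptyQueueSet:
    """Queue set whose rows have all been destroyed."""
    def getByJob(self, job):
        return None


try:
    generate_slave_build_cookie(EmptyQueueSet(), 'job-1')
except BuildCookieError as e:
    print(e)  # no BuildQueue for job 'job-1'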

Changed in soyuz:
milestone: none → 10.11
tags: added: qa-ok
Changed in soyuz:
status: Triaged → Fix Committed
Launchpad QA Bot (lpqabot) wrote : Bug fixed by a commit
tags: added: qa-needstesting
removed: qa-ok
tags: added: qa-ok
removed: qa-needstesting
Robert Collins (lifeless) wrote :

Julian, should we do a deploy to cesium in advance of the downtime rollout, to detect issues early?

Julian Edwards (julian-edwards) wrote :

Yep, I'd be happy with that. It was working quite well last time, with the exception of the issue I fixed. Please make sure I am around when it rolls, though.

Robert Collins (lifeless) wrote : Re: [Bug 671242] Re: New buildd-manager disabling everything in sight

On Tue, Nov 16, 2010 at 4:22 AM, Julian Edwards
<email address hidden> wrote:
> Yep, I'd be happy with that.  It was working quite well last time, with
> the exception of the issue I fixed.  Please make sure I am around when
> it rolls though.

I suggest you request it your Tuesday AM from Tom; just put it on the LPS page and use qatagger as the approver. There's no downtime involved, so no need to schedule a specific time.

Curtis Hovey (sinzui)
Changed in soyuz:
status: Fix Committed → Fix Released