detect faulty builder and schedule to other builders in the farm as a fallback

Bug #797645 reported by Fathi Boudra
Affects            Status         Importance   Assigned to    Milestone
Linaro Offspring   New            Undecided    Unassigned
Offspring          Fix Released   Medium       Nicola Heald

Bug Description

All hwpacks failed to build on offspring: https://offspring.linaro.org/

Looking at the 1st attempted build log on malus builder (omap3-natty):
http://snapshots.linaro.org/build/omap3-natty/20110615/0/build-log.txt

I: Saving copy of build setup to build results directory.
Building for armel
Fetching packages
Traceback (most recent call last):
 File "/usr/bin/linaro-hwpack-create", line 66, in <module>
   builder.build()
 File "/usr/lib/pymodules/python2.6/linaro_image_tools/hwpack/builder.py", line 100, in build
   f.write(hwpack.manifest_text())
 File "/usr/lib/pymodules/python2.6/linaro_image_tools/hwpack/packages.py", line 173, in __exit__
   shutil.rmtree(tmpdir)
 File "/usr/lib/python2.6/shutil.py", line 204, in rmtree
 File "/usr/lib/python2.6/shutil.py", line 202, in rmtree
OSError: [Errno 5] Input/output error: '/tmp/tmp7dymq0'
E: A fatal error has ocurred. Shutting down.

After that, malus was wedged. I guess the other builders were busy
building the 11.05 images, so every scheduled build was sent to malus,
which caused the mass build ERROR.

Can we detect when a builder is stuck and schedule the next build on another host?

James Westby (james-w) wrote :

How would you detect that malus was stuck in this case?

Thanks,

James

Cody A.W. Somerville (cody-somerville) wrote :

Fathi Boudra: After omap3-natty 20110615-0 failed on malus, did the subsequent builds that were dispatched to malus return a build result with a name 'ERROR' instead of a proper name like 'yyyymmdd-n'?

Changed in offspring:
status: New → Incomplete
Fathi Boudra (fboudra) wrote : Re: [Bug 797645] Re: detect faulty builder and schedule to other builders in the farm as a fallback

On 2 November 2011 19:36, Cody A.W. Somerville
<email address hidden> wrote:
> Fathi Boudra: After omap3-natty 20110615-0 failed on malus, did the
> subsequent builds that were dispatched to malus return a build result
> with a name 'ERROR' instead of a proper name like 'yyyymmdd-n'?

I'm pretty sure 'ERROR' has been returned.

Changed in offspring:
importance: Undecided → Medium
status: Incomplete → Confirmed
Kevin McDermott (bigkevmcd) wrote :

We could increment a counter every time a build fails, and reset it to 0 every time we succeed.

In the builder finding logic we could order by this value (or filter if it gets too high), to avoid sending builds to a builder that's repeatedly failing.
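The counter idea above could be sketched roughly as follows. This is only an illustration written for modern Python 3, not Offspring's actual code: the Builder class, its fields, the pick_builder function, and the threshold of 3 are all hypothetical names and values chosen for the example.

```python
from dataclasses import dataclass
from typing import Iterable, Optional

# Hypothetical threshold: stop dispatching to a builder after this many
# consecutive failed builds.
MAX_CONSECUTIVE_FAILURES = 3


@dataclass
class Builder:
    """Illustrative stand-in for a builder record (not Offspring's model)."""
    name: str
    busy: bool = False
    consecutive_failures: int = 0

    def record_result(self, succeeded: bool) -> None:
        # Reset the counter on success, increment it on failure.
        if succeeded:
            self.consecutive_failures = 0
        else:
            self.consecutive_failures += 1


def pick_builder(builders: Iterable[Builder],
                 max_failures: int = MAX_CONSECUTIVE_FAILURES) -> Optional[Builder]:
    """Return the idle builder with the fewest consecutive failures.

    Builders at or above max_failures are filtered out. Returns None when
    no builder qualifies, so the caller can leave the build queued.
    """
    candidates = [b for b in builders
                  if not b.busy and b.consecutive_failures < max_failures]
    if not candidates:
        return None
    # Ordering by the counter prefers healthy builders; ties keep list order.
    return min(candidates, key=lambda b: b.consecutive_failures)
```

With this shape, a builder that keeps returning 'ERROR' drops to the back of the ordering after its first failure and is skipped entirely once it reaches the threshold. A single successful build makes it eligible again.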

David Murphy (schwuk)
tags: added: improvement scheduled
Changed in offspring:
status: Confirmed → Fix Released
assignee: nobody → Mike Heald (mike-powerthroughwords)