Offspring Image Build System

detect faulty builder and schedule to other builders in the farm as a fallback

Reported by Fathi Boudra on 2011-06-15
This bug affects 1 person
Affects           Importance  Assigned to
Linaro Offspring  Undecided   Unassigned
Offspring         Medium      Mike Heald

Bug Description

All hwpacks failed to build on offspring: https://offspring.linaro.org/

Looking at the 1st attempted build log on malus builder (omap3-natty):
http://snapshots.linaro.org/build/omap3-natty/20110615/0/build-log.txt

I: Saving copy of build setup to build results directory.
Building for armel
Fetching packages
Traceback (most recent call last):
 File "/usr/bin/linaro-hwpack-create", line 66, in <module>
   builder.build()
 File "/usr/lib/pymodules/python2.6/linaro_image_tools/hwpack/builder.py", line 100, in build
    f.write(hwpack.manifest_text())
 File "/usr/lib/pymodules/python2.6/linaro_image_tools/hwpack/packages.py", line 173, in __exit__
   shutil.rmtree(tmpdir)
 File "/usr/lib/python2.6/shutil.py", line 204, in rmtree

 File "/usr/lib/python2.6/shutil.py", line 202, in rmtree

OSError: [Errno 5] Input/output error: '/tmp/tmp7dymq0'
E: A fatal error has ocurred. Shutting down.

After that, malus was wedged. I guess the other builders were busy
building the 11.05 images, so every scheduled build was dispatched to
malus, which caused the mass build ERROR.

Can we detect when a builder is stuck and schedule the next build on another host?

James Westby (james-w) wrote :

How would you detect that malus was stuck in this case?

Thanks,

James

Cody A.W. Somerville wrote :

Fathi Boudra: After omap3-natty 20110615-0 failed on malus, did the subsequent builds that were dispatched to malus return a build result with a name 'ERROR' instead of a proper name like 'yyyymmdd-n'?

Changed in offspring:
status: New → Incomplete

On 2 November 2011 19:36, Cody A.W. Somerville
<email address hidden> wrote:
> Fathi Boudra: After omap3-natty 20110615-0 failed on malus, did the
> subsequent builds that were dispatched to malus return a build result
> with a name 'ERROR' instead of a proper name like 'yyyymmdd-n'?

I'm pretty sure 'ERROR' has been returned.
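
If the result name is the signal, the check could look roughly like the sketch below. This is not Offspring code: the function name is_failed_result is made up for illustration, and the pattern assumes a proper name is exactly eight digits, a hyphen and a build number, as in '20110615-0'.

```python
import re

# Assumed shape of a proper build result name: 'yyyymmdd-n', e.g. '20110615-0'.
BUILD_NAME_RE = re.compile(r"\d{8}-\d+")


def is_failed_result(name: str) -> bool:
    """Return True when the result name is not a proper 'yyyymmdd-n' name.

    A literal 'ERROR' falls into this category, as does anything else
    that does not match the assumed pattern.
    """
    return BUILD_NAME_RE.fullmatch(name) is None
```

A scheduler could call this on each result a builder hands back and treat a run of failures from the same host as a hint that the host is wedged.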

Changed in offspring:
importance: Undecided → Medium
status: Incomplete → Confirmed

Kevin McDermott (bigkevmcd) wrote :

We could increment a counter every time a build fails, and reset it to 0 every time we succeed.

In the builder finding logic we could order by this value (or filter if it gets too high), to avoid sending builds to a builder that's repeatedly failing.
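
A minimal sketch of that idea, to make the proposal concrete. It is not taken from the Offspring code base: the Builder class, the pick_builder function and the threshold of 3 are all assumptions chosen for illustration.

```python
from dataclasses import dataclass
from typing import Iterable, Optional

# Assumed cut-off; the right value would need tuning against the real farm.
MAX_CONSECUTIVE_FAILURES = 3


@dataclass
class Builder:
    """Hypothetical stand-in for a builder record."""
    name: str
    busy: bool = False
    consecutive_failures: int = 0

    def record_result(self, succeeded: bool) -> None:
        # Reset to 0 on success, increment on every failure.
        if succeeded:
            self.consecutive_failures = 0
        else:
            self.consecutive_failures += 1


def pick_builder(
    builders: Iterable[Builder],
    max_failures: int = MAX_CONSECUTIVE_FAILURES,
) -> Optional[Builder]:
    """Pick an idle builder, preferring the one with the fewest recent failures.

    Builders at or above max_failures are filtered out entirely.
    Returns None when no builder qualifies.
    """
    candidates = [
        b for b in builders
        if not b.busy and b.consecutive_failures < max_failures
    ]
    if not candidates:
        return None
    # min() keeps the first builder among ties, so the input order is preserved.
    return min(candidates, key=lambda b: b.consecutive_failures)
```

With this ordering, a builder that has just failed once is still eligible but loses to any healthy idle builder, and a builder that keeps failing stops receiving builds until something resets its counter. A filtered-out builder never gets the chance to succeed again on its own, so a manual reset or a periodic probe build would probably be needed as well.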

David Murphy (schwuk) on 2013-05-20
tags: added: improvement scheduled
Changed in offspring:
status: Confirmed → Fix Released
assignee: nobody → Mike Heald (mike-powerthroughwords)