All the builds are being aborted/unstable due to publishing bottleneck

Bug #1164273 reported by vishal
This bug affects 4 people
Affects: Linaro Android Infrastructure
Status: Fix Released
Importance: Critical
Assigned to: Paul Sokolovsky

Bug Description

All the Android builds are being aborted since they reach the timeout of 255 minutes. I am not sure what has changed. We had recently added the native toolchain to the build, which was slow. We have disabled it, but the timeouts are still happening.

Fathi Boudra (fboudra)
Changed in linaro-android-infrastructure:
milestone: none → 2013.04
importance: Undecided → Critical
status: New → Confirmed
Bernhard Rosenkraenzer (berolinux) wrote :

We can speed up the native toolchain bits by moving a lot of stuff currently pulled in through bzr and wget into git repositories and adding them to the manifests -- but given even builds without native toolchain are failing, we should probably just increase the timeout.
Android can take forever to build esp. on boxes with less than 32 GB RAM.

Paul Sokolovsky (pfalcon) wrote : Re: All the builds are being aborted/unstabled

Looking at https://android-build.linaro.org/jenkins/job/linaro-android-restricted_juice-aosp/buildTimeTrend , the big build time increase happened after the Jenkins upgrade on Friday: on Friday morning there was a build with normal build time, we don't build on Saturday, and on Sunday we see the build time jump. #196 on Tuesday was still able to complete, so publishing itself appears to work, and builds indeed appear to be aborted due to the timeout (some after they have already reached the publishing stage, which produces weird log messages and the "unstable" status).

Changed in linaro-android-infrastructure:
assignee: nobody → Paul Sokolovsky (pfalcon)
summary: - all the builds are being aborted
+ All the builds are being aborted/unstabled
Paul Sokolovsky (pfalcon) wrote :

Build stage times:

$ grep "^TIME" consoleText-linaro-android-restricted_juice-aosp_193.txt
TIME: Seed download and uncompress: 6m27.002s
TIME: Repo sync (using seed as reference): 11m54.480s
TIME: Compilation: 92m12.334s

$ grep "^TIME" consoleText-linaro-android-restricted_juice-aosp_194.txt
TIME: Seed download and uncompress: 24m52.504s
TIME: Repo sync (using seed as reference): 16m38.810s
TIME: Compilation: 164m43.034s
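
For a side-by-side view of the two builds, the TIME lines above can be parsed and compared per stage. The following is only a minimal sketch: the function name `parse_stage_times` and the constants `BUILD_193` and `BUILD_194` are made up for illustration, and the line format is assumed to be exactly what the grep output above shows.

```python
import re

# Matches lines such as "TIME: Compilation: 92m12.334s" (format assumed
# from the grep output quoted in this comment).
TIME_LINE = re.compile(
    r"^TIME: (?P<stage>.+): (?P<min>\d+)m(?P<sec>\d+(?:\.\d+)?)s$"
)


def parse_stage_times(console_text):
    """Return {stage name: duration in seconds} for every TIME line."""
    stages = {}
    for line in console_text.splitlines():
        match = TIME_LINE.match(line.strip())
        if match:
            seconds = int(match["min"]) * 60 + float(match["sec"])
            stages[match["stage"]] = seconds
    return stages


# Stage timings copied from the two console logs quoted above.
BUILD_193 = """\
TIME: Seed download and uncompress: 6m27.002s
TIME: Repo sync (using seed as reference): 11m54.480s
TIME: Compilation: 92m12.334s
"""
BUILD_194 = """\
TIME: Seed download and uncompress: 24m52.504s
TIME: Repo sync (using seed as reference): 16m38.810s
TIME: Compilation: 164m43.034s
"""

before = parse_stage_times(BUILD_193)
after = parse_stage_times(BUILD_194)

# Print each stage's duration in both builds and the slowdown factor.
for stage in before:
    ratio = after[stage] / before[stage]
    print(f"{stage}: {before[stage] / 60:.1f} min -> "
          f"{after[stage] / 60:.1f} min ({ratio:.2f}x)")

total_before = sum(before.values())
total_after = sum(after.values())
print(f"Total of timed stages: {total_before / 60:.1f} min -> "
      f"{total_after / 60:.1f} min")
```

Only the three timed stages are covered here; anything the build does outside them (publishing, for example) does not show up in these totals.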

Paul Sokolovsky (pfalcon) wrote :

More visible in this spreadsheet: https://docs.google.com/a/linaro.org/spreadsheet/ccc?key=0AqfqtPrXYdq7dEpKSFVTMFh6MzFCbnlQaGV0c2JSdlE#gid=0

So, the time increased in all stages of the build, which is not really easy to attribute to the Jenkins upgrade (though anything is possible), and there is a whole 50 min of unaccounted time in builds after the upgrade.

Bernhard Rosenkraenzer (berolinux) wrote :

Hard to blame this on Jenkins, but even harder to blame anything that changed between 193 and 194 (there is no difference whatsoever between 193 and 194, neither in the build config nor in pinned-manifest.xml), so we may need to look at anything else that may be influencing this.

Could this be a connectivity issue (Compile stage etc. slowing down not because the actual compile is getting slower, but because communication between ec2 and jenkins times out)?

Does reverting to the old Jenkins help?

Changed in linaro-android-infrastructure:
status: Confirmed → In Progress
Paul Sokolovsky (pfalcon) wrote :

Well, we didn't consider reverting to the old Jenkins (even if we did and it resolved the issue, what would we do? We can't stick with an ancient Jenkins forever, and we hadn't upgraded for quite a long time already). Instead, I performed the pending migrations in it (those are mostly related to security issues) and upgraded plugins. Then I watched Jenkins start up: it started lean on CPU usage until the first build went into the publishing stage, then CPU usage went to 99%, and subjectively publishing went pretty slowly. Even after publishing finished, usage didn't subside.

That all pretty much reminded me of the year-old Jenkins upgrade on ci.linaro.org, where after some runtime Jenkins just came to a halt, and we had to upgrade the master to a Medium instance (a-b is still on Small). I checked the current ci.linaro.org state with the same Jenkins version 1.480.3: CPU usage is also at 99%, but no issues like this one were reported (and re: CPU usage, granted, no one really checks it until issues arise, so it may have been like that all the time).

So, I wasn't too optimistic, but today's nightly went pretty OK. Also, looking at https://android-build.linaro.org/jenkins/view/All/job/linaro-android-restricted_juice-aosp/buildTimeTrend , we're kind of back to usual build times. So, it seems to have helped after all, but we definitely need to watch it for some time yet.

Paul Sokolovsky (pfalcon) wrote :

<bhoj> pfalcon, we are again crossing the timeout on android builds. there are being marked as unstable ...

Paul Sokolovsky (pfalcon) wrote :

Analyzed more build results from today: https://docs.google.com/a/linaro.org/spreadsheet/ccc?key=0AqfqtPrXYdq7dEpKSFVTMFh6MzFCbnlQaGV0c2JSdlE#gid=1

It seems that with the new Jenkins version, the master gets bottlenecked when a few publishing jobs run in parallel.

I'm going to bump the build timeout, but I don't think that will help anyway.

Two possible solutions:

1. Upgrade android-build master to Medium instance (may help or not, doesn't resolve real problem).
2. Rework publishing process altogether (this would fix real problem, if reworked well).

Paul Sokolovsky (pfalcon) wrote :

Bumped timeout to 275min.

vishal (vishalbhoj) wrote : Re: All the builds are being aborted/unstable
summary: - All the builds are being aborted/unstabled
+ All the builds are being aborted/unstable
Paul Sokolovsky (pfalcon) wrote :

https://blueprints.launchpad.net/linaro-android-infrastructure/+spec/prototype-new-publishing has been filed to fix the underlying issues (high priority, started).

summary: - All the builds are being aborted/unstable
+ All the builds are being aborted/unstable due to publishing bottleneck
Paul Sokolovsky (pfalcon) wrote :

Changes per the BP above were deployed last night, and the previously known failures are seen to have recovered. Going to keep monitoring. Please report any issues.

Changed in linaro-android-infrastructure:
status: In Progress → Fix Committed
Paul Sokolovsky (pfalcon) wrote :

After additional fixes yesterday, publishing appears to work reliably: my checks of failed builds show missing tags/compile errors only. No issues were raised today by other parties either. Closing.

Changed in linaro-android-infrastructure:
status: Fix Committed → Fix Released
vishal (vishalbhoj) wrote :
Paul Sokolovsky (pfalcon) wrote :

So, yesterday android-build proved to be not performant or even stable enough (it came to a halt due to load) to support the release process. The decision was made to perform an unscheduled upgrade to the next bigger instance, and to temporarily switch to more powerful build slaves to compensate for the time already lost. The work was completed, and the new a-b on a Medium EC2 instance appears to work well; the Android release process is unblocked.

Changed in linaro-android-infrastructure:
status: Fix Released → In Progress
status: In Progress → Fix Committed
Changed in linaro-android-infrastructure:
status: Fix Committed → Fix Released