CI jobs hang on archiving artifacts

Bug #1288092 reported by Milo Casagrande on 2014-03-05
This bug affects 2 people
Affects Status Importance Assigned to Milestone
Linaro CI
In Progress
Critical
Paul Sokolovsky

Bug Description

As reported by Fathi, we have some CI jobs that hang and fail when archiving artifacts. This is an example of a failing job:

https://ci.linaro.org/jenkins/job/odp-api-doxygen-generation/164/console

Milo Casagrande (milo) on 2014-03-05
Changed in linaro-ci:
assignee: nobody → Paul Sokolovsky (pfalcon)
status: New → Triaged
milestone: none → 2014.03
Changed in linaro-ci:
importance: Undecided → Critical
Changed in linaro-ci:
status: Triaged → In Progress
Paul Sokolovsky (pfalcon) wrote :

Yesterday, my previous fix to https://issues.jenkins-ci.org/browse/JENKINS-7641 was deployed on ci.linaro.org (in the form of an updated remoting.jar) and tested to work as expected (i.e. aborting archiving after a sensible timeout if it hangs). So we now have protection against build lock-ups, but archiving can still fail (and some jobs, like the one above, fail and are aborted with ~100% probability).
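The behaviour the remoting.jar fix provides can be illustrated in miniature with coreutils `timeout` (a hypothetical stand-in, not the actual Java implementation): a hung step is killed after a deadline and reports exit code 124, so the build fails fast instead of hanging forever.

```shell
# Illustrative sketch only: `sleep 30` stands in for a stuck artifact
# transfer, and the 2-second deadline is an arbitrary example value.
timeout 2 sleep 30
status=$?
if [ "$status" -eq 124 ]; then
  echo "archiving hung; aborted after deadline"
fi
```

In a real job the deadline would be generous enough to accommodate a slow but progressing transfer, which is exactly the subtlety discussed below.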

Discussed the results with Fathi and decided that the next step would be trying alternative JDKs on the build slaves. Fathi will drive this effort.

Riku Voipio (riku-voipio) wrote :

Another project consistently failing:

https://ci.linaro.org/jenkins/job/libvirt/

These jobs worked just fine until Feb 23rd, so looking at the JDK version seems wrong. What needs to be found out is what changed in Jenkins around then that broke publishing from the ARM-based nodes.

Paul Sokolovsky (pfalcon) wrote :

To start, I carefully reviewed my previous patch to the Jenkins "remoting" component and found a bug in it that could cause a premature timeout on big or slow transfers. I fixed it and redeployed remoting.jar, but to no avail: a couple of builds I started still failed.

> so looking at jdk version seems wrong

Trying different JDK versions and vendors comes from comments in https://issues.jenkins-ci.org/browse/JENKINS-7641 . I don't hold my breath that it will help; that's why I communicated it to Fathi for a second opinion on feasibility and to consider if, when, and how we can try it (I know almost nothing about the ARM build slaves, so I can't drive that myself). Generally, the explanation for what happens would be that the overcomplicated and ever-changing code in Jenkins triggers subtle bugs in the JDK (i.e. the Java stdlib) and the JVM. Nothing is stable and deterministic there, which is why JENKINS-7641 has been open for 3+ years. Now add "unconventional" and lower-performance architectures like ARM, and one could attribute even 100% failures to some crazy mix of issues.

> These jobs worked just fine until feb 23rd

We'd need to ask Fathi for exact details, because most changes on that host come through him now, but judging by file timestamps, the Jenkins 1.532.1 -> 1.532.2 upgrade happened on Feb 19, and then on Feb 23 some plugins were upgraded. Unfortunately, I don't think fiddling with these will help much, but we can try.

Finally, I had a look at https://ci.linaro.org/jenkins/job/odp-api-doxygen-generation , which is all green now. The recovery came from dropping Jenkins' native archiving and switching to publishing to snapshots.linaro.org. As for https://ci.linaro.org/jenkins/job/libvirt/ : why do we need to copy artifacts back to the job's workspace on master (which is the step that fails)? Looking at the job, I can imagine only one reason: this job triggers another job, and that one tries to pull files from job/libvirt's workspace. But job/libvirt also publishes its artifacts to snapshots and can know their download URL, so it can pass that info to the downstream job, and the problematic step can be avoided.
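The proposed rework could look roughly like this in the downstream job: the upstream job passes the published download URL as a build parameter, and the downstream job fetches the artifact itself instead of reading the upstream workspace. `ARTIFACT_URL` is an assumed parameter name, not taken from the actual job configuration; a `file://` URL is used here only as a network-free default for illustration.

```shell
# Hypothetical downstream-job step: fetch the artifact from the URL the
# upstream job published to snapshots.linaro.org and passed as a parameter.
ARTIFACT_URL="${ARTIFACT_URL:-file:///dev/null}"
mkdir -p artifacts
curl -fsSL "$ARTIFACT_URL" -o "artifacts/$(basename "$ARTIFACT_URL")"
echo "fetched $(basename "$ARTIFACT_URL") from upstream's published URL"
```

This removes any copy-back-to-master step from the upstream job, which is precisely the transfer that hangs.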

Summing up, I can propose the following options to resolve the issue, ordered by productivity:

1. Rework the libvirt job and its dependencies on other jobs to avoid the problematic transfers - this will resolve the issue with certainty.
2. Try other JDKs on the build slaves. I would actually very much like to see this done, as I'd really like to get to the bottom of this issue, except that: a) it might not help; b) I'm personally booked with other stuff for the time being. But if you gentlemen can take this on yourselves, or request it from the Infra team as a priority, that sounds good.
3. We can try to juggle plugin versions on ci.linaro.org. That will be pretty boring and is unlikely to lead to success (again, this looks like a weird combination of issues producing the effect, not a single culprit), but if someone thinks it's worth trying, we can.

Fathi Boudra (fboudra) on 2014-04-07
Changed in linaro-ci:
milestone: 2014.03 → 2014.04
Fathi Boudra (fboudra) on 2014-05-05
Changed in linaro-ci:
milestone: 2014.04 → 2014.05
Chase Qi (chase-qi) wrote :