ci.l.o "toolchain_cloud" slaves hang and do not terminate properly

Bug #1324882 reported by Paul Sokolovsky
This bug affects 1 person
Affects: Linaro CI
Status: Confirmed
Importance: High
Assigned to: Unassigned
Milestone: (none)

Bug Description

This problem has existed for ~2 months already - after a build, the "toolchain_cloud" EC2 build slave goes into "offline" state on the Jenkins side, but does not terminate properly in EC2. All this time I have been killing them manually (at first it seemed like random hung slaves, but over time it became clear this affects ~100% of these slaves).
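
For reference, a rough sketch of that manual cleanup, assuming the AWS CLI is configured and that the slaves can be spotted by a "toolchain_cloud" Name tag (the tag filter and the instance id below are placeholders, not our actual naming):

# list running instances that look like toolchain_cloud slaves (tag filter is an assumption)
aws ec2 describe-instances \
  --filters "Name=tag:Name,Values=toolchain_cloud*" "Name=instance-state-name,Values=running" \
  --query 'Reservations[].Instances[].InstanceId' --output text
# then terminate the ones Jenkins already shows as offline
aws ec2 terminate-instances --instance-ids i-xxxxxxxx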

Revision history for this message
Paul Sokolovsky (pfalcon) wrote :

So, the questions are:

1. What is this special slave type "toolchain_cloud"?
2. Why can't we use one of standard slaves (e.g. "Precise-64 3exec")?
3. What's wrong with "toolchain_cloud" config?

With the zero info I possess so far, question 2 seems the most productive - assuming "toolchain_cloud" builds can be converted to an existing well-behaving slave type, the problem is solved.

Revision history for this message
Paul Sokolovsky (pfalcon) wrote :

Ok, "toolchain_cloud" is currently used by 13 jobs, which is a bit too many to just switch to another slave type.
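
A rough way to count that, assuming shell access to the master and the stock /var/lib/jenkins/jobs layout:

# count job configs referencing the toolchain_cloud label/slave
grep -l 'toolchain_cloud' /var/lib/jenkins/jobs/*/config.xml | wc -l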

Changed in linaro-ci:
importance: Undecided → High
status: New → Confirmed
Revision history for this message
Paul Sokolovsky (pfalcon) wrote :

So, what I did was rename the older "toolchain_cloud" slave config to "toolchain_cloud_old" and clone "Precise 3 exec" as the new "toolchain_cloud" - to avoid trying to dig into/guess what may be wrong in the original config. Unfortunately, that didn't really help.

Then, watching builds that led to a hanging slave, I noticed that there are some long operations which don't update the log for a while, and then the connection is reported as broken. Intuitively, I tied that to timeouts due to channel inactivity. As a reminder, some time ago I applied additional timeout checks to Jenkins for lp:1288092. Note that the errors didn't show that termination happened due to the added check, and that check has likely been overwritten by now due to a Jenkins upgrade. Nonetheless, Jenkins should have its own checks, so I followed this trail.
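
If channel inactivity really is the trigger, one knob worth trying is the remoting ping interval on the master. A sketch, assuming our Jenkins version still honours the hudson.slaves.ChannelPinger.pingInterval system property (in minutes) - I haven't verified the exact property name against it:

# master side; property name as assumed above, adjust to however Jenkins is actually launched
java -Dhudson.slaves.ChannelPinger.pingInterval=15 -jar jenkins.war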

I added "bash -x" within the build steps to get more output from the build scripts (which are themselves in VCS, so I couldn't easily change them). This seemingly helped some builds running on toolchain_cloud - at least some builds I watched succeeded, while previous builds of the same jobs had failed. But the main culprit, the android-ndk job, still kept failing.
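
Roughly, each wrapped build step now looks like this (build-toolchain.sh is a placeholder name for whichever VCS-managed script the job runs):

# Jenkins "Execute shell" build step: trace the script instead of editing it in VCS
bash -x ./build-toolchain.sh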

Revision history for this message
Paul Sokolovsky (pfalcon) wrote :

https://ci.linaro.org/jenkins/job/android-ndk/ had its last successful build (of those currently available) on Mar 26: https://ci.linaro.org/jenkins/job/android-ndk/240/

Builds typically fail like this:
02:41:57.388 /mnt/ci_build/workspace/android-ndk/src/build/../gcc/gcc-4.9/gcc/targhooks.c:1513:79: warning: ignoring return value of 'int asprintf(char**, const char*, ...)', declared with attribute warn_unused_result [-Wunused-result]
02:42:00.723 yes
02:42:02.470 checking whether log10l is declared without a macro... build/genautomata /mnt/ci_build/workspace/android-ndk/src/build/../gcc/gcc-4.9/gcc/config/i386/i386.md \
02:42:03.986 insn-conditions.md > tmp-automata.c
02:42:13.138 FATAL: Unable to delete script file /tmp/hudson6799312210914075568.sh
02:42:13.139 hudson.util.IOException2: remote file operation failed: /tmp/hudson6799312210914075568.sh at hudson.remoting.Channel@138d814:toolchain_cloud (i-66177335)
02:42:13.142 at hudson.FilePath.act(FilePath.java:907)
02:42:13.143 at hudson.FilePath.act(FilePath.java:884)
02:42:13.146 at hudson.FilePath.delete(FilePath.java:1302)
02:42:13.146 at hudson.tasks.CommandInterpreter.perform(CommandInterpreter.java:101)
02:42:13.147 at hudson.tasks.CommandInterpreter.perform(CommandInterpreter.java:60)
02:42:13.148 at hudson.tasks.BuildStepMonitor$1.perform(BuildStepMonitor.java:20)
02:42:13.149 at hudson.model.AbstractBuild$AbstractBuildExecution.perform(AbstractBuild.java:785)
02:42:13.150 at hudson.model.Build$BuildExecution.build(Build.java:199)
02:42:13.151 at hudson.model.Build$BuildExecution.doRun(Build.java:160)
02:42:13.151 at hudson.model.AbstractBuild$AbstractBuildExecution.run(AbstractBuild.java:566)
02:42:13.152 at hudson.model.Run.execute(Run.java:1665)
02:42:13.153 at hudson.model.FreeStyleBuild.run(FreeStyleBuild.java:46)
02:42:13.153 at hudson.model.ResourceController.execute(ResourceController.java:88)
02:42:13.154 at hudson.model.Executor.run(Executor.java:246)
02:42:13.155 Caused by: hudson.remoting.ChannelClosedException: channel is already closed
02:42:13.156 at hudson.remoting.Channel.send(Channel.java:524)
02:42:13.156 at hudson.remoting.Request.call(Request.java:129)
02:42:13.157 at hudson.remoting.Channel.call(Channel.java:722)
02:42:13.158 at hudson.FilePath.act(FilePath.java:900)
02:42:13.158 ... 13 more
02:42:13.158 Caused by: java.io.IOException: Unexpected termination of the channel
02:42:13.159 at hudson.remoting.SynchronousCommandTransport$ReaderThread.run(SynchronousCommandTransport.java:50)
02:42:13.160 Caused by: java.io.EOFException
02:42:13.161 at java.io.ObjectInputStream$BlockDataInputStream.peekByte(Unknown Source)
02:42:13.162 at java.io.ObjectInputStream.readObject0(Unknown Source)
02:42:13.162 at java.io.ObjectInputStream.readObject(Unknown Source)
02:42:13.163 at hudson.remoting.Command.readFrom(Command.java:92)
02:42:13.164 at hudson.remoting.ClassicCommandTransport.read(ClassicCommandTransport.java:71)
02:42:13.164 at hudson.remoting.SynchronousCommandTransport$ReaderThread.run(SynchronousCommandTransport.java:48)
02:42:13.166 FATAL: hudson.remoting.RequestAbortedException: java.io.IOException: Unexpected termination of the channel
02:42:1...


Revision history for this message
Paul Sokolovsky (pfalcon) wrote :

On the slave side this looks like:

<===[JENKINS REMOTING CAPACITY]===>Slave.jar version: 2.33
This is a Unix slave
Evacuated stdout
Slave successfully connected and online
ERROR: Connection terminated
java.io.IOException: Unexpected termination of the channel
 at hudson.remoting.SynchronousCommandTransport$ReaderThread.run(SynchronousCommandTransport.java:50)
Caused by: java.io.EOFException
 at java.io.ObjectInputStream$BlockDataInputStream.peekByte(Unknown Source)
 at java.io.ObjectInputStream.readObject0(Unknown Source)
 at java.io.ObjectInputStream.readObject(Unknown Source)
 at hudson.remoting.Command.readFrom(Command.java:92)
 at hudson.remoting.ClassicCommandTransport.read(ClassicCommandTransport.java:71)
 at hudson.remoting.SynchronousCommandTransport$ReaderThread.run(SynchronousCommandTransport.java:48)

Revision history for this message
Paul Sokolovsky (pfalcon) wrote :

Catching the moment of build failure: "java -jar slave.jar" isn't running on the slave, i.e. the JVM with the slave control crashed. Trying to restart the slave soon after the crash fails with "Could not allocate enough heap space". That's due to the fact that the gcc/clang build is still continuing and apparently putting resources under serious stress (I see a loadavg of 100). That's how I got the idea that slave.jar may be failing due to an out-of-memory condition (note that I didn't see an explicit Java OOM exception).
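
For the record, one way to confirm or rule out the OOM theory on the next dead slave (the log paths assume a stock Ubuntu image) would be:

# evidence of the kernel OOM killer having picked the java process
dmesg | grep -iE 'out of memory|killed process'
grep -iE 'out of memory|killed process' /var/log/syslog | tail
# memory pressure while the gcc/clang build is still churning
free -m; uptime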

Revision history for this message
Paul Sokolovsky (pfalcon) wrote :

So, I bumped the initial and max heap sizes for slave.jar (-Xms512m -Xmx768m) and watched it running on an EC2 slave closely using the "jps" tool. I saw (some parameters of) the heap growing, then GC kicking in and usage dropping, etc. And then the build crashed just as before (maybe it survived a little bit longer, but still far away from #240's 6 hrs to success). I didn't catch the exact moment, but jps (running in "watch" mode) still showed more lines; when I stopped it and ran a normal ps, I saw that java wasn't running, and re-running jps of course errored out.
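
For the record, roughly what that looked like (the slave is really launched by the EC2 plugin, so the bare java invocation below is just illustrative, and the interval is arbitrary):

# slave side: bumped heap, as described
java -Xms512m -Xmx768m -jar slave.jar &
SLAVE_PID=$!
# in another terminal: watch the slave JVM and its GC behaviour
watch -n 5 "jps -lvm"
jstat -gcutil $SLAVE_PID 5000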

So, what we have is the JVM running slave.jar crashing randomly without good reporting (so far I wasn't able to find any traces), and quick googling didn't show any similar case (well, except for the usual random hits of people complaining about OpenJDK and reporting better results with Oracle JDK).
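
One way to get better evidence out of the next crash would be to pin down where HotSpot writes its fatal error log - a sketch using standard HotSpot flags (the /var/log path is just a suggestion):

# force predictable locations for the fatal error log and GC log
java -Xms512m -Xmx768m \
     -XX:ErrorFile=/var/log/jenkins-slave/hs_err_%p.log \
     -Xloggc:/var/log/jenkins-slave/gc.log \
     -jar slave.jar
# afterwards, look for any crash log left behind (the default name is hs_err_pid<pid>.log in the JVM's working directory)
find / -name 'hs_err_*.log' -mmin -180 2>/dev/null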

Revision history for this message
Paul Sokolovsky (pfalcon) wrote :

To add a bit of gossip, the errors roughly resemble those of lp:1288092, though there we were getting a hang, not a crash (could it be a misdetected crash?). We're also having weird crashes of Gerrit instances (mostly review.android.git.linaro.org, though lately we had it on review.git.linaro.org) - the logs usually don't show anything definitive about the reasons.

Revision history for this message
Chase Qi (chase-qi) wrote :