ci.linaro.org is going down intermittently

Bug #943901 reported by Deepti B. Kalakeri
This bug affects 1 person
Affects: Linaro CI
Status: Fix Released
Importance: Critical
Assigned to: James Tunnicliffe

Bug Description

ci.linaro.org is going out of service intermittently because of a JVM Out Of Memory Error, as shown in the jenkins.log below:

"
[Winstone 2012/03/01 07:15:37] - Untrapped Error in Servlet
java.lang.OutOfMemoryError: Java heap space

"
I have been seeing this intermittently for the last week or so, but it is becoming very frequent now.
I am seeing it with Jenkins version 1.419.
So far I have worked around the problem by restarting Jenkins, but that is not a permanent solution.
The following causes are commonly cited for this kind of error:

    1. Jenkins is growing in data size, requiring a bigger heap space. In this case you just want to give it a bigger heap.
    2. Jenkins is temporarily processing a large amount of data (like test reports), requiring more head room in memory. In this case you also just want to give it a bigger heap.
    3. Jenkins is leaking memory, in which case we need to find and fix the leak.

Reason 2 does not seem to be the cause of the OOM, as only 3 jobs are scheduled to run at a time. That leaves 1 or 3 as the likely culprits.

I think I have given the JVM enough memory ("-Xms640m -Xmx1024m"), although the actual RAM on the machine is only 1 GB.

I think the 1 GB of RAM is also a constraint here, since a 1024 MB heap plus JVM overhead already exceeds physical memory.

There are some options for getting a dump of the JVM at https://wiki.jenkins-ci.org/display/JENKINS/I%27m+getting+OutOfMemoryError. I will have to try those.

But I am afraid that, with the increasing number of jobs being added to ci.linaro.org, we will see this problem more often.

Here are the solutions I think might help us:

1) Try the latest Jenkins upgrade and see whether it solves the problem. I am not able to decide on a version yet, but I see that 1.439 has memory fixes.

2) Migrate ci.linaro.org onto another EC2 instance with more memory for the time being, until we get set up in the DC.
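For reference, a sketch of how the bigger heap and the heap-dump suggestion could be combined; the file path follows the Debian-style /etc/default/jenkins convention, and both the path and the exact sizes are assumptions, not the values actually deployed:

```shell
# Sketch for /etc/default/jenkins (path and sizes are assumptions).
# Keeps the current -Xms, raises -Xmx, and asks HotSpot to write a heap
# dump automatically the next time an OutOfMemoryError is thrown.
JAVA_ARGS="-Xms640m -Xmx1280m \
-XX:+HeapDumpOnOutOfMemoryError \
-XX:HeapDumpPath=/var/log/jenkins"
```

The resulting .hprof file can then be inspected offline (e.g. with jhat or Eclipse MAT) instead of debugging the live master.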

Revision history for this message
Paul Sokolovsky (pfalcon) wrote : Re: Out of Memory Error

Hello Deepti,

On Thu, 1 Mar 2012 13:29:41 +0530
Deepti Kalakeri <email address hidden> wrote:

> Hello Danilo/Paul,
>
> I am seeing a lot of Out of Memory error on ci.linaro.org.

Deepti now opened bug for this,
https://bugs.launchpad.net/linaro-ci/+bug/943901 (I'm cc:ing it).

> [...]

Well, the small EC2 instance we use for Jenkins masters has 1.7 GB of
RAM, so there's room for a heap increase. And surprisingly, it turns out
android-build still uses the Java defaults (it doesn't pass explicit
-Xms/-Xmx), so we can't compare to that.

What's more worrying, though, is that the OOMs are accompanied by 99%
CPU usage by Java. Actually, after the restart, Jenkins is back to
eating 99% CPU time in 5 minutes or less (the site still responds). So
I wouldn't dismiss p.2 "Large processing" above, especially since you
mentioned that there were already some issues due to frequent SCM polls.
Actually, during the 99% usage I saw an exception in the logs related
to SCM polling:

Mar 1, 2012 10:20:28 AM hudson.triggers.SCMTrigger$Runner runPolling
SEVERE: Failed to record SCM polling
java.lang.NullPointerException
        at
hudson.plugins.bazaar.BazaarSCM.calcRevisionsFromBuild(BazaarSCM.java:196)

So, my proposals for how to deal with it:

1. Increase the heap size by 256 MB as a stop-gap measure.
2. Consider upgrading to a new version of Jenkins (read the changelogs,
test on a sandbox).
3. Review/investigate how Jenkins does SCM polling. Consider again that
the most efficient way to handle SCM tip builds is interrupt-driven, not
polling (i.e. have a trigger in the SCM repository queue a build).
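Point 3 could look something like the following repository hook, using Jenkins' remote build-trigger URL; the job name and token here are hypothetical placeholders, and the token would have to match the job's "Trigger builds remotely" setting:

```shell
#!/bin/sh
# Hypothetical post-commit hook: queue a Jenkins build on push instead
# of having the master poll the repository. JOB_NAME and TOKEN are
# placeholders, not real values from ci.linaro.org.
JENKINS_URL="https://ci.linaro.org"
JOB_NAME="example-job"
TOKEN="example-token"
TRIGGER_URL="${JENKINS_URL}/job/${JOB_NAME}/build?token=${TOKEN}"
echo "queueing build via ${TRIGGER_URL}"
# curl -s -X POST "${TRIGGER_URL}"   # enable on a host that can reach the master
```

This moves the cost from continuous polling on the master to a single request per commit.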

>
> There are some options to get the dump of the JVM @
> https://wiki.jenkins-ci.org/display/JENKINS/I%27m+getting+OutOfMemoryError .
> I will have to try that.

Well, investigating JVM memory usage is certainly a technically sound
plan, especially if you have experience with it.

> [...]

Revision history for this message
Deepti B. Kalakeri (deeptik) wrote :

On Thu, Mar 1, 2012 at 4:49 PM, Paul Sokolovsky
<email address hidden> wrote:

> [...]

Yes, that was my concern: a-b is working fine with the defaults, unlike
ci*, which is having problems even with the same Jenkins version and
even after being given more memory.

> [...]
>
> So, my proposals for how to deal with it:
>
> 1. Increase heap size by 256Mb as a stop-gap measure.
>

I have increased the heap size by 256 MB; let's hope that is sufficient.
After the Jenkins restart the CPU usage was around 4-5% initially, but
once I started the builds it was anywhere between 70-99%, while memory
usage stayed well within 80%.
I am not sure what is so CPU-intensive on the master when a build
starts, because the slaves are where all the builds actually run.


Revision history for this message
Deepti B. Kalakeri (deeptik) wrote :
> [...]

Revision history for this message
Paul Sokolovsky (pfalcon) wrote :

On Thu, 1 Mar 2012 17:15:00 +0530
Deepti Kalakeri <email address hidden> wrote:

[]

> > > @Paul,
> > >
> > > What kind of prior testing did we do before we upgraded our
> > > jenkins service?
> >
> > Mostly reading changelog and testing it on a sandbox, to make sure
> > that normal a-b functionality works as expected. I can setup a
> > sandbox with latest Jenkins version for you to play with.
> >
> oh! that will be nice if you could provide me a sandbox if it
> already does
> exist.

Well, it's easy to create one for Android Build, so here it is:
https://ec2-23-20-130-114.compute-1.amazonaws.com/jenkins/ . You have
Jenkins admin and SSH access to it, so feel free to do anything you
want with it, or let me know if I can help. The only thing is that the
sandbox lifetime should be less than 5 days, so this one will need to
be deleted by the end of Monday; another can be created if needed.

--
Best Regards,
Paul

Linaro.org | Open source software for ARM SoCs
Follow Linaro: http://www.facebook.com/pages/Linaro
http://twitter.com/#!/linaroorg - http://www.linaro.org/linaro-blog

Changed in linaro-ci:
status: New → Triaged
Changed in linaro-ci:
assignee: nobody → Paul Sokolovsky (pfalcon)
Revision history for this message
Deepti B. Kalakeri (deeptik) wrote :

On Thu, Mar 1, 2012 at 8:00 PM, Paul Sokolovsky
<email address hidden> wrote:

> [...]
>
I will not be able to make use of the sandbox before Monday, so it
would be good to have one from Monday.


--
Thanks and Regards,
Deepti
Infrastructure Team Member, Linaro Platform Teams
Linaro.org | Open source software for ARM SoCs
Follow Linaro: http://www.facebook.com/pages/Linaro
http://twitter.com/#!/linaroorg - http://www.linaro.org/linaro-blog

Revision history for this message
Paul Sokolovsky (pfalcon) wrote :

After the heap increase, there have been no reports of the system going down. One extra day of monitoring, and I am going to close this tomorrow.

Changed in linaro-ci:
status: Triaged → Fix Committed
Revision history for this message
Paul Sokolovsky (pfalcon) wrote :

No issues visible on ci.linaro.org, closing.

Changed in linaro-ci:
status: Fix Committed → Fix Released
Revision history for this message
Deepti B. Kalakeri (deeptik) wrote : Re: [Bug 943901] Re: ci.linaro.org is going down intermittently

I saw ci.linaro.org going unresponsive again.

Here is the jenkins.log message:

Mar 8, 2012 6:45:59 AM hudson.triggers.SafeTimerTask run
SEVERE: Timer task hudson.slaves.NodeProvisioner$NodeProvisionerInvoker@a4d265 failed
java.lang.OutOfMemoryError: Java heap space
Mar 8, 2012 6:46:11 AM hudson.triggers.SafeTimerTask run
SEVERE: Timer task hudson.model.LoadStatistics$LoadStatisticsUpdater@82ceea failed
java.lang.OutOfMemoryError: Java heap space

Guess we need to monitor this for some more time before we close this bug.

On Wed, Mar 7, 2012 at 10:08 PM, Paul Sokolovsky
<email address hidden> wrote:

> No issues visible on ci.linaro.org, closing.
> [...]
>


Revision history for this message
James Tunnicliffe (dooferlad) wrote :

I tried to attach some debugging tools today, but failed to get anywhere. JConsole will start over SSH with X forwarding, but doing that seems to make Jenkins run very slowly indeed. I expect that, because of the lag of running the application over SSH, any blocking calls JConsole makes cause Jenkins to stall. So that approach is out.

I then tried JConsole using the remote (JMX) debug feature, which would be fine if I had full VPN access, but it needs more than one port open. One port is fixed (easy to forward) and the other is picked at random. I have seen some source code for creating a launcher that makes this manageable:
http://docs.oracle.com/javase/6/docs/technotes/guides/management/agent.html#gdfvq
linked to from:
http://blog.cantremember.com/debugging-with-jconsole-jmx-ssh-tunnels/

I also tried getting a stack trace (sudo jps to get the PID, then sudo jstack -F <PID>). This showed a lot of threads that jstack simply ran off the end of the stack for, which could be a jstack problem.

There are some interesting posts on here:
http://jenkins.361315.n4.nabble.com/Performance-problems-on-Hudson-master-td3242456.html

If we can look into which plugins we actually require and disable the rest, that would be good. I don't have admin access to do this and wouldn't do it without consultation anyway. I thought the master in that thread using a 5 GB heap was interesting! It makes me feel OK about our little machine being overloaded.
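The jps/jstack route above can be sketched as follows, with jmap added as an assumption for capturing a heap dump at the same time; the process-name match and the /tmp output paths are illustrative, not what was actually run:

```shell
#!/bin/sh
# Sketch of the jps/jstack steps described above, with jmap added for a
# heap dump. The 'jenkins|winstone' match and /tmp paths are illustrative.
PID=$(jps -l 2>/dev/null | awk 'tolower($0) ~ /jenkins|winstone/ {print $1; exit}')
if [ -n "$PID" ]; then
    jstack -F "$PID" > /tmp/jenkins-threads.txt             # full thread dump
    jmap -dump:live,format=b,file=/tmp/jenkins.hprof "$PID" # heap dump for offline analysis
    STATUS="dumped pid ${PID}"
else
    STATUS="no jenkins JVM found"
fi
echo "$STATUS"
```

Both commands would need to run as the Jenkins user (or via sudo) to attach to the master's JVM.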

Revision history for this message
Paul Sokolovsky (pfalcon) wrote :

Reopening.

Changed in linaro-ci:
status: Fix Released → In Progress
assignee: Paul Sokolovsky (pfalcon) → James Tunnicliffe (dooferlad)
Revision history for this message
Deepti B. Kalakeri (deeptik) wrote :

I removed the unnecessary plugins this morning. It does not help to a great extent, though.

Thanks!!!
Deepti.

Revision history for this message
Paul Sokolovsky (pfalcon) wrote :

Fixed with lp:961070

Changed in linaro-ci:
status: In Progress → Fix Released