Frequent java task hang on ARM server

Bug #845158 reported by James Page
20
This bug affects 3 people
Affects Status Importance Assigned to Milestone
ubuntu-omap4-extras-graphics
New
Undecided
Unassigned
linux-ti-omap4 (Ubuntu)
Confirmed
Undecided
Unassigned
linux-ti-omap4 (Ubuntu Linaro Evaluation Build)
New
Undecided
Unassigned
openjdk-6 (Ubuntu)
Confirmed
Undecided
Unassigned

Bug Description

I'm using an up-to-date oneiric ARM server image running jenkins - after a short period of time (< 1 hour) the server locks up - hard reboot and I can see these messages in the syslog:

Sep 8 21:50:51 winehouse kernel: [ 7920.496795] INFO: task java:20973 blocked for more than 120 seconds.
Sep 8 21:50:51 winehouse kernel: [ 7920.496795] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Sep 8 21:50:51 winehouse kernel: [ 7920.497009] java D c05ca9e0 0 20973 561 0x00000000
Sep 8 21:50:51 winehouse kernel: [ 7920.497222] [<c05ca9e0>] (schedule+0x540/0x618) from [<c05cc728>] (__down_read+0xc0/0xd8)
Sep 8 21:50:51 winehouse kernel: [ 7920.497222] [<c05cc728>] (__down_read+0xc0/0xd8) from [<c05cf5e8>] (do_page_fault.part.2+0x80/0x1e0)
Sep 8 21:50:51 winehouse kernel: [ 7920.497222] [<c05cf5e8>] (do_page_fault.part.2+0x80/0x1e0) from [<c05cf7e4>] (do_page_fault+0x9c/0xac)
Sep 8 21:50:51 winehouse kernel: [ 7920.497222] [<c05cf7e4>] (do_page_fault+0x9c/0xac) from [<c004c410>] (do_DataAbort+0x34/0x98)
Sep 8 21:50:51 winehouse kernel: [ 7920.497406] [<c004c410>] (do_DataAbort+0x34/0x98) from [<c05cd210>] (__dabt_svc+0x70/0xa0)
Sep 8 21:50:51 winehouse kernel: [ 7920.497436] Exception stack(0xec91dec0 to 0xec91df08)
Sep 8 21:50:51 winehouse kernel: [ 7920.497467] dec0: 56e0f000 56e12000 00000020 0000001f ef248040 56e110c0 56e0f44c ef248000
Sep 8 21:50:51 winehouse kernel: [ 7920.497467] dee0: 00000000 ec91c000 00000000 56b4446c 56e11000 ec91df08 c00553d4 c005d198
Sep 8 21:50:51 winehouse kernel: [ 7920.497497] df00: 800f0013 ffffffff
Sep 8 21:50:51 winehouse kernel: [ 7920.497497] [<c05cd210>] (__dabt_svc+0x70/0xa0) from [<c005d198>] (v7_coherent_kern_range+0x1c/0x7c)
Sep 8 21:50:51 winehouse kernel: [ 7920.497497] [<c005d198>] (v7_coherent_kern_range+0x1c/0x7c) from [<c00553d4>] (arm_syscall+0x140/0x294)
Sep 8 21:50:51 winehouse kernel: [ 7920.497497] [<c00553d4>] (arm_syscall+0x140/0x294) from [<c0051220>] (ret_fast_syscall+0x0/0x30)

Architecture: armel
Date: Fri Sep 9 08:51:09 2011
Dependencies:
 ca-certificates 20110502+nmu1
 ca-certificates-java 20110531
 coreutils 8.5-1ubuntu6
 debconf 1.5.40ubuntu1
 dpkg 1.16.0.3ubuntu3
 gcc-4.6-base 4.6.1-9ubuntu2
 icedtea-6-jre-jamvm 6b23~pre8-1ubuntu1
 java-common 0.42ubuntu1
 libacl1 2.2.51-3
 libattr1 1:2.4.46-3
 libavahi-client3 0.6.30-4ubuntu1
 libavahi-common-data 0.6.30-4ubuntu1
 libavahi-common3 0.6.30-4ubuntu1
 libbz2-1.0 1.0.5-6ubuntu1
 libc-bin 2.13-17ubuntu2
 libc6 2.13-17ubuntu2
 libcomerr2 1.41.14-1ubuntu3
 libcups2 1.5.0-5
 libdbus-1-3 1.4.14-1ubuntu1
 libfreetype6 2.4.4-2ubuntu1
 libgcc1 1:4.6.1-9ubuntu2
 libgcrypt11 1.5.0-1
 libgnutls26 2.10.5-1ubuntu3
 libgpg-error0 1.10-0.3ubuntu1
 libgssapi-krb5-2 1.9.1+dfsg-1ubuntu1
 libjpeg62 6b1-1ubuntu2
 libk5crypto3 1.9.1+dfsg-1ubuntu1
 libkeyutils1 1.4-6
 libkrb5-3 1.9.1+dfsg-1ubuntu1
 libkrb5support0 1.9.1+dfsg-1ubuntu1
 liblcms1 1.19.dfsg-1ubuntu1
 liblzma2 5.0.0-2
 libnspr4 4.8.7-0ubuntu3
 libnss3 3.12.9+ckbi-1.82-0ubuntu4
 libnss3-1d 3.12.9+ckbi-1.82-0ubuntu4
 libselinux1 2.0.98-1.1
 libsqlite3-0 3.7.7-2ubuntu1
 libssl1.0.0 1.0.0d-2ubuntu2
 libstdc++6 4.6.1-9ubuntu2
 libtasn1-3 2.9-4
 multiarch-support 2.13-17ubuntu2
 openjdk-6-jre-headless 6b23~pre8-1ubuntu1
 openjdk-6-jre-lib 6b23~pre8-1ubuntu1
 openssl 1.0.0d-2ubuntu2
 perl-base 5.12.4-4
 tzdata 2011g-0ubuntu0.11.04
 tzdata-java 2011g-0ubuntu0.11.04
 xz-utils 5.0.0-2
 zlib1g 1:1.2.3.4.dfsg-3ubuntu3
DistroRelease: Ubuntu 11.10
Package: openjdk-6-jre-headless 6b23~pre8-1ubuntu1
PackageArchitecture: armel
ProcEnviron:
 LANGUAGE=en_GB:
 LANG=en_GB.UTF-8
 SHELL=/bin/bash
ProcVersionSignature: Ubuntu 3.0.0-1204.9-omap4 3.0.3
SourcePackage: openjdk-6
Uname: Linux 3.0.0-1204-omap4 armv7l
UpgradeStatus: No upgrade log present (probably fresh install)

Revision history for this message
James Page (james-page) wrote :
affects: ubuntu → openjdk-6 (Ubuntu)
Revision history for this message
James Page (james-page) wrote :

I see if I can grab a stack trace from the java process but unlikely as system is frozen

Revision history for this message
Xerxes Rånby (xranby) wrote :

Hi James, are you able to reproduce this bug from inside a oneiric chroot while running a Natty kernel?

Do Jenkins need to run any specific task to trigger the bug?

Which ARM hardware + hardware revision and kernel version are you using?

Revision history for this message
James Page (james-page) wrote :

Hi Xerxes

I've attached more information to the main description of the bug which should cover kernel version.

I'm using an OMAP4 Pandaboard - not sure how I check hardware revision TBH.

I've not got a chroot environment setup for ARM so can't test the natty kernel stuff (yet).

With reference to jenkins this was a fresh install with no configuration - so it was not actually doing anything other than general jenkins background stuff like checking for updates etc..

description: updated
tags: added: oneiric
description: updated
Revision history for this message
James Page (james-page) wrote :

I see a similar issue with jetty:

Sep 12 11:18:43 winehouse kernel: [ 602.998992] INFO: task jsvc:693 blocked for more than 120 seconds.
Sep 12 11:18:43 winehouse kernel: [ 602.999176] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Sep 12 11:18:43 winehouse kernel: [ 602.999176] jsvc D c05ca9e0 0 693 673 0x00000001
Sep 12 11:18:43 winehouse kernel: [ 602.999176] [<c05ca9e0>] (schedule+0x540/0x618) from [<c05cc728>] (__down_read+0xc0/0xd8)
Sep 12 11:18:43 winehouse kernel: [ 602.999176] [<c05cc728>] (__down_read+0xc0/0xd8) from [<c05cf5e8>] (do_page_fault.part.2+0x80/0x1e0)
Sep 12 11:18:43 winehouse kernel: [ 602.999694] [<c05cf5e8>] (do_page_fault.part.2+0x80/0x1e0) from [<c05cf7e4>] (do_page_fault+0x9c/0xac)
Sep 12 11:18:43 winehouse kernel: [ 602.999694] [<c05cf7e4>] (do_page_fault+0x9c/0xac) from [<c004c410>] (do_DataAbort+0x34/0x98)
Sep 12 11:18:43 winehouse kernel: [ 602.999755] [<c004c410>] (do_DataAbort+0x34/0x98) from [<c05cd844>] (ret_from_exception+0x0/0x10)
Sep 12 11:18:43 winehouse kernel: [ 602.999786] Exception stack(0xe88c9fb0 to 0xe88c9ff8)
Sep 12 11:18:43 winehouse kernel: [ 602.999786] 9fa0: 40007000 00000104 40007001 00000022
Sep 12 11:18:43 winehouse kernel: [ 602.999816] 9fc0: 40275000 00000104 40007000 00000100 00000040 00000000 00a71718 00a954e8
Sep 12 11:18:43 winehouse kernel: [ 602.999816] 9fe0: 00000000 597fec60 40242751 4024275c 000f0030 ffffffff
Sep 12 11:18:43 winehouse kernel: [ 602.999816] INFO: task jsvc:694 blocked for more than 120 seconds.
Sep 12 11:18:43 winehouse kernel: [ 603.000030] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Sep 12 11:18:43 winehouse kernel: [ 603.000244] jsvc D c05ca9e0 0 694 673 0x00000001
Sep 12 11:18:43 winehouse kernel: [ 603.000244] [<c05ca9e0>] (schedule+0x540/0x618) from [<c05cc728>] (__down_read+0xc0/0xd8)
Sep 12 11:18:43 winehouse kernel: [ 603.000305] [<c05cc728>] (__down_read+0xc0/0xd8) from [<c008eb04>] (exit_mm+0x28/0x15c)
Sep 12 11:18:43 winehouse kernel: [ 603.000335] [<c008eb04>] (exit_mm+0x28/0x15c) from [<c008ee54>] (do_exit+0x21c/0x3ac)
Sep 12 11:18:43 winehouse kernel: [ 603.000335] [<c008ee54>] (do_exit+0x21c/0x3ac) from [<c008f1bc>] (do_group_exit+0x90/0xc0)
Sep 12 11:18:43 winehouse kernel: [ 603.000366] [<c008f1bc>] (do_group_exit+0x90/0xc0) from [<c009d8d0>] (get_signal_to_deliver+0x37c/0x3c0)
Sep 12 11:18:43 winehouse kernel: [ 603.000366] [<c009d8d0>] (get_signal_to_deliver+0x37c/0x3c0) from [<c00541cc>] (do_signal+0xb0/0x20c)
Sep 12 11:18:43 winehouse kernel: [ 603.000366] [<c00541cc>] (do_signal+0xb0/0x20c) from [<c005487c>] (do_notify_resume+0x18/0x50)
Sep 12 11:18:43 winehouse kernel: [ 603.000457] [<c005487c>] (do_notify_resume+0x18/0x50) from [<c0051278>] (work_pending+0x24/0x28)

Matthias Klose (doko)
tags: added: jamvm
Revision history for this message
Tobin Davis (gruemaster) wrote :

Can you also comment on what part of jenkins you see this with? I have the jenkins slave running on my systems with the same jvm package and have had no problems.

Give a list of jenkins packages or a link to the jenkins release that you ran and details on installation for reproduction. Thanks.

Revision history for this message
James Page (james-page) wrote : Re: [Bug 845158] Re: Frequent java task hang on ARM server

On Mon, 2011-09-12 at 21:23 +0000, Tobin Davis wrote:
> Can you also comment on what part of jenkins you see this with? I
> have
> the jenkins slave running on my systems with the same jvm package and
> have had no problems.
>
> Give a list of jenkins packages or a link to the jenkins release that
> you ran and details on installation for reproduction. Thanks.

I see this with the 'jenkins' package as currently available in Oneiric;
the upstream release is 1.409.1.

Its literally a clean install as follows:

sudo apt-get install jenkins

At some point in time later I get the Java task hang messages; process
listing freeze and I end up having to hard reset the board.

--
James Page
Ubuntu Server Developer

Revision history for this message
Xerxes Rånby (xranby) wrote :

I am able to trigger a similar looking bug (java process entered uninterruptible sleep) while running NetBeans.

Steps to reproduce:
#go to netbeans download page http://netbeans.org/downloads/start.html?platform=zip&lang=en&option=javase
wget http://download.netbeans.org/netbeans/7.0.1/final/zip/netbeans-7.0.1-201107282000-ml-javase.zip
unzip netbeans-7.0.1-201107282000-ml-javase.zip
cd netbeans/bin/
./netbeans
#Manually accept the GPL license.
#The stall hits after the loading modules and plugins "spash screen" when the main gui and welcome screen are visible
#A popup bubble gets displayed indicating that 17 updates (or more ) are available.
#try click on Help menu -> About and verify that you are running jamvm. (you cant the gui are stalled)
#check dmesg and observe that the kernel have detected the freeze

I am able to reproduce this bug on a ac100 that makes java enter the uninterruptible sleep (D) state without any output from netbeans. kernel

I was not able to trigger the uninterruptible stall on a pandaboard running oneiric. kernel
Here instead I see a "regular stall" like this:
xranby@panda:~/netbeans/bin$ ./netbeans
jmm_GetInputArgumentArray
jmm_GetBoolAttribute: Unknown attribute 24
jmm_DumpThreads
jmm_DumpThreads
jmm_DumpThreads
jmm_DumpThreads
jmm_DumpThreads
jmm_DumpThreads
... <endless loop> netbeans gui are stalled.

The testcase above do work fine on ia32 when using jamvm. here the output are:
xranby@xranby-ESPRIMO-P7935:~/netbeans/bin$ ./netbeans
jmm_GetInputArgumentArray
jmm_GetMemoryUsage
jmm_GetMemoryUsage
jmm_GetMemoryUsage
... <gets displayed a lot> netbeans remains responsive.

I will forward this testcase to jamvm upstream and check if the stalls are related to the three unimplemented jmm_* JMX management extensions.

Revision history for this message
Launchpad Janitor (janitor) wrote :

Status changed to 'Confirmed' because the bug affects multiple users.

Changed in openjdk-6 (Ubuntu):
status: New → Confirmed
Revision history for this message
Xerxes Rånby (xranby) wrote :

This are possibly caused by a arm linux kernel pagefault stall
Do this patch improve the situation?
http://www.arm.linux.org.uk/developer/patches/viewpatch.php?id=7178/1

Revision history for this message
Xerxes Rånby (xranby) wrote :

The reason why this bug happens are that some process, currently unknown to us which one, have locked and keeps on holding the kernel &mm->mmap_sem lock without releasing it.

When java hits a pagefault then the Linux kernel needs to aquire this lock in order to serve the pagefault, it waits, and wait for this lock to become available again, unfortunally in this case the kernel waits forever...

note that this bug also prevents other tools from operating like
ps ax

http://www.kernel.org/doc/Documentation/vm/locking

Revision history for this message
Xerxes Rånby (xranby) wrote :

The best way to track this kernel lock bug are probably to recompile a kernel with lockdep enabled.

(11:40:44) xranby: mjw: can systemtap help diagnose kernel locks? i would like to know what part of the kernel are holding a &mm->mmap_sem semaphore
(11:42:20) xranby: all i currently know are all parts of the kernel gets stuck waiting for this semaphre to get released :/
(11:45:29) mjw: xranby, for that kind of kernel lock issue I would use lockdep
(11:47:38) mjw: https://www.kernel.org/doc/Documentation/lockdep-design.txt
(11:48:13) mjw: xranby, that does mean you want a kernel build with lockdep enabled
(11:48:17) mjw: which is slow
(11:48:30) xranby: sounds like a reasonable way to debug this
(11:48:45) xranby: http://lwn.net/Articles/321663/ found this lwn article on the subject as well
(11:48:55) xranby: thank you for the heads up of lockdep existance
(11:49:15) mjw: http://linuxplumbersconf.org/2011/ocw//system/presentations/153/original/lockdep-plumbers-2011.odp
(11:49:31) mjw: It is not as cool as using systemtap of course :)
(11:50:10) mjw: But for systemtap to help you with this issue you have to have some hint already where you suspect the problem might be
(11:50:21) mjw: if so, then you can set probes on the suspected functions
(11:50:35) mjw: and trace how/when/in-which-order they are called
(11:51:05) mjw: lockdep automates that for you (with the downside of having to recompile your kernel of course)

Revision history for this message
Xerxes Rånby (xranby) wrote :

The patch http://www.arm.linux.org.uk/developer/patches/viewpatch.php?id=7178/1 in #10 enables the user to kill the stuck process when this bug triggers but it do not fix the deadlock itself.

Testcase that can reproduce this lock bug using a pandaboard running kernel 3.1 using Ubuntu oneiric with the ubuntu-omap4-extras package installed:

# First I installed all build dependencies needed
sudo apt-get install openjdk-6-jdk ant git gcc libgl1-mesa-dev libglu1-mesa-dev xorg-dev libice-dev libsm-dev libx11-dev libxext-dev libxxf86vm-dev libxinerama-dev libxrandr-dev libxrender-dev install p7zip-full

# clone the jogamp sourcecode into the machine using git
git clone git://jogamp.org/srv/scm/gluegen.git gluegen
git clone git://jogamp.org/srv/scm/jogl.git jogl

# according to the JOGL build instructions i had to unset CLASSPATH to prevent
# the build to pick up any system installed GlueGen or JOGL files during compilation.
unsetenv CLASSPATH
unset CLASSPATH

cd gluegen/make
# Make sure that TARGET_PLATFORM_LIBS and TARGET_JAVA_LIBS match my
# system, for example to make it build for Ubuntu armhf Precise i had to set:
# export TARGET_PLATFORM_LIBS=/usr/lib/arm-linux-gnueabihf
# export TARGET_JAVA_LIBS=/usr/lib/jvm/default-java/jre/lib/arm
# in the file scripts/make.gluegen.all.linux-armv7.sh
#
sh scripts/make.gluegen.all.linux-armv7.sh

# gluegen BUILD SUCESSFUL

# Next step building of JOGL
cd ../../jogl/make
# Similary here check that TARGET_PLATFORM_LIBS
# and TARGET_JAVA_LIBS are set to match your system
# then start the build.
#
sh scripts/make.jogl.all.linux-armv7.sh

# jogl BUILD SUCESSFUL

# run all the jogl unittests
sh scripts/make.jogl.all.linux-armv7.sh junit.run

#The lock bugs triggers on my pandaboard after running the unittest for about 10 minutes.

Revision history for this message
Xerxes Rånby (xranby) wrote :

The Jogamp team have found that this bug triggers more frequently while testing against the ubuntu-omap4-extras libEGL compared to using the mesa software packages.
http://forum.jogamp.org/Linux-ARM-freezes-Java-EGL-ES-JOGL-td3801301.html

"However platform-1b does not freeze (software OpenGL)
hence some correlation between hardware and Java might
cause the problem."

Revision history for this message
Xerxes Rånby (xranby) wrote :

I can reproduce this bug using the openjdk6 + linux-ti-omap4 kernel using the Linaro 12.02 release.

Xerxes Rånby (xranby)
tags: added: arm-porting-queue
Revision history for this message
Xerxes Rånby (xranby) wrote :

please re-test using https://launchpad.net/ubuntu/+source/linux-ti-omap4/3.2.0-1409.12

I belive this bug have been fixed when using the precise 3.2 kernels that contain a reworked arch/arm/mm/fault.c
where this part in do_page_fault:

         * As per x86, we may deadlock here. However, since the kernel only
         * validly references user space from well defined areas of the code,
         * we can bug out early if this is from code which shouldn't.
         */
        if (!down_read_trylock(&mm->mmap_sem)) {
                if (!user_mode(regs) && !search_exception_tables(regs->ARM_pc))
                        goto no_context;
retry:
                down_read(&mm->mmap_sem);

have ben changed to remove the retry: label and goto retry: logic.

 /*
  * As per x86, we may deadlock here. However, since the kernel only
  * validly references user space from well defined areas of the code,
  * we can bug out early if this is from code which shouldn't.
  */
 if (!down_read_trylock(&mm->mmap_sem)) {
  if (!user_mode(regs) && !search_exception_tables(regs->ARM_pc))
   goto no_context;
  down_read(&mm->mmap_sem);

Revision history for this message
Xerxes Rånby (xranby) wrote :

hum.. ignore my comment in #16 it is irrelevant..

the goto retry logic got added with the patch in #10

Revision history for this message
Xerxes Rånby (xranby) wrote :

I have moved the #16 discussion of a potential deadlock in the #10 patch into the https://bugs.launchpad.net/ubuntu/+source/linux-ti-omap4/+bug/951043 bugreport that deals with the merge of #10 into precise.

Revision history for this message
Launchpad Janitor (janitor) wrote :

Status changed to 'Confirmed' because the bug affects multiple users.

Changed in linux-ti-omap4 (Ubuntu):
status: New → Confirmed
To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.