Bug 559

Summary: Linux ARM freezes (Java, EGL/ES, JOGL)
Product: [JogAmp] Jogl
Reporter: Sven Gothel <sgothel>
Component: embedded
Assignee: Sven Gothel <sgothel>
Status: RESOLVED WORKSFORME
Severity: blocker
CC: xerxes
Priority: P1
Version: 2
Hardware: embedded_arm
OS: linux
Type: ---
SCM Refs:
Workaround: ---

Description Sven Gothel 2012-03-05 18:26:08 CET
Phenomenon:

The freeze I am reporting is characterized by 

  - hanging java process

  - the command 'ps ax' hangs before the line
    where it would presumably report the java process

  - syslog message: <see bottom of the description>

  - 'kill -9 <PID>' doesn't work

  - reboot freezes as well, the reset button needs to be pressed
 
This is different than an implementation error, e.g. a 'software deadlock',
since such a freeze should not affect the overall system
and the user process should remain interruptible.
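
The 'D' in the syslog line 'java D c0576f58' at the bottom of this description
is the kernel's task state for uninterruptible sleep. A task in that state does
not act on signals, SIGKILL included, until its kernel wait ends, which is
consistent with 'kill -9 <PID>' having no effect. A minimal Python sketch for
listing such tasks via /proc/<pid>/stat follows. The function names are
hypothetical, and whether reading /proc would itself block on an affected
machine (as 'ps ax' did) is not established by this report.

```python
import os


def parse_stat_line(line):
    """Return (pid, comm, state) from the content of /proc/<pid>/stat."""
    # comm is wrapped in parentheses and may itself contain spaces or ')',
    # so split at the first '(' and the last ')' instead of on whitespace.
    open_idx = line.index("(")
    close_idx = line.rindex(")")
    pid = int(line[:open_idx].strip())
    comm = line[open_idx + 1:close_idx]
    state = line[close_idx + 1:].split()[0]
    return pid, comm, state


def uninterruptible_tasks(proc_root="/proc"):
    """List (pid, comm) of tasks currently reported in state 'D'."""
    found = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue
        try:
            with open(os.path.join(proc_root, entry, "stat")) as f:
                pid, comm, state = parse_stat_line(f.read())
        except (OSError, ValueError, IndexError):
            continue  # process exited meanwhile, or the entry is unreadable
        if state == "D":
            found.append((pid, comm))
    return found
```

Tasks pass through state 'D' briefly during normal I/O, so only an entry that
stays there across repeated checks would point to a freeze of this kind.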
+++

The native es2redsquare hasn't frozen the machine so far,
e.g. with 800 loops from the shell:
  cd ./jogl/src/test/native/mesa-demos-patched
  bash make.sh es2redsquare.c
  bash shell_loop.sh

+++

TestRedSquareES2NEWT or TestGearsES2NEWT
with '-loops 1000 -loop-shutdown 1 -time 100' doesn't freeze either.

Note: '-loop-shutdown 2' triggers a bug in EGL, eglGetDisplay(..) fails
sometimes, probably due to some EGL race condition?

+++

A recent test of 'shell' loops w/ TestRedSquareES2NEWT or TestGearsES2NEWT
and the args '-loops 1 -time 100' didn't freeze the machines either;
tested a few times, up to ~250 iterations.
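
An outer loop of this kind can be sketched in Python as below. This is only an
illustration under assumptions: run_loop and its parameters are hypothetical
names, and the actual shell_loop.sh script is not reproduced here. The sketch
deliberately does not wait for the child after a timeout, because a process
stuck in the kernel as described above would not react to kill().

```python
import subprocess
import sys


def run_loop(cmd, iterations, timeout_s):
    """Run cmd repeatedly; return None if every run exits with 0,
    else (iteration, reason) for the first run that failed or timed out."""
    for i in range(1, iterations + 1):
        proc = subprocess.Popen(cmd)
        try:
            rc = proc.wait(timeout=timeout_s)
        except subprocess.TimeoutExpired:
            # kill() sends SIGKILL; it may have no effect on a task that is
            # blocked inside the kernel, so do not wait for it here.
            proc.kill()
            return i, "timeout"
        if rc != 0:
            return i, "exit %d" % rc
    return None


# Hypothetical usage; the java command line is a placeholder:
#   run_loop(["java", "<test class>", "-loops", "1", "-time", "100"], 250, 120)
```

A timeout well above the normal run time of one test keeps slow runs from
being counted as freezes.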

+++

Platform-1a + Platform-2:

The remote NEWT unit tests pass properly the 1st time.
You have to remove the AWT*NEWT* test collection manually 
from the junit.run.remote.ssh target in build-test.xml.

However a 2nd run freezes the machines (pandaboard/ac100) 
within an arbitrary test.

Running all remote unit tests (default) freezes both machines 
within the 'AWT/NEWT tests', which come after the NEWT-only tests.

+++

Platform-1b:

Running the NEWT unit tests, occasional 'hangs' occur in:
  'jogamp.opengl.x11.glx.GLX.dispatch_glXMakeContextCurrent1'

'ps ax' works and discloses the PID, 
which can be killed via 'kill -9 <PID>'.

The unit tests then continue properly.

+++

This has been reproduced w/ OpenJDK 
  - (IcedTea6 1.11pre) (6b23~pre11-0ubuntu1.11.10.2) + 
    JamVM (build 1.6.0-devel, inline-threaded interpreter with stack-caching)

  - Oracle J2SE/JRE build 1.6.0_30-b12 +
    Java HotSpot(TM) Client VM (build 20.5-b03, mixed mode)


Platform-1a:
  - Pandaboard ES (Omap4)

  - Ubuntu 11.10

  - GLX and Mesa3D Software 'enabled'

  - EGL/ES: pvr-omap4 1.7.10.0.1.9-1

  - Linux panda01 3.1.0-1282-omap4 #11-Ubuntu SMP PREEMPT Mon Feb 13 15:38:55 
    UTC 2012 armv7l armv7l armv7l GNU/Linux


Platform-1b:
  - Pandaboard ES (Omap4)

  - Ubuntu 11.10

  - GLX and Mesa3D Software 'enabled'

  - EGL/ES: disabled (moved libEGL* libGLESv* away)

  - Linux panda01 3.1.0-1282-omap4 #11-Ubuntu SMP PREEMPT Mon Feb 13 15:38:55 
    UTC 2012 armv7l armv7l armv7l GNU/Linux



Platform-2:
  - Toshiba AC100 (Tegra2)

  - Ubuntu 11.10

  - GLX and Mesa3D Software 'enabled'

  - EGL/ES: nvidia-tegra 12~beta1-0ubuntu1

  - Linux jautab02 2.6.38-1001-ac100 #2-Ubuntu SMP PREEMPT Tue Dec 20 08:05:25 
    UTC 2011 armv7l armv7l armv7l GNU/Linux

+++

The point of the freeze is completely arbitrary:
it rarely happens within the demo code's call of EGLContextImpl.makeCurrent(), 
but more often before test setup or finish, w/o any EGL/ES calls involved.

+++

Both platforms have a similar if not equal package setup.

They differ in their:
  - Linux kernel 
  - EGL/ES driver.

Since neither the internal loop nor the native test 
could reproduce this freeze, 
one could assume that the EGL/ES drivers are not the culprit.

This assumption may also be deduced from the fact that platform-1a
and platform-2 use different EGL/ES drivers.

However platform-1b does not freeze (software OpenGL)
hence some correlation between hardware and Java might
cause the problem.

The common ground on all freezing platforms is the 
Xorg server/client, besides the other generic dependencies.

The Xorg server/client is being treated differently 
when using software OpenGL or proprietary EGL/ES.

+++

Cause: TBD

+++

Java Freeze Syslog Message:

Mar  5 17:27:34 panda01 kernel: [  372.084716] INFO: task java:1503 blocked for more than 120 seconds.
Mar  5 17:27:34 panda01 kernel: [  372.084716] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Mar  5 17:27:34 panda01 kernel: [  372.084716] java            D c0576f58     0  1503   1167 0x00000000
Mar  5 17:27:34 panda01 kernel: [  372.084716] [<c0576f58>] (__schedule+0x4f0/0x5cc) from [<c0578d1c>] (__down_read+0xc0/0xd8)
Mar  5 17:27:34 panda01 kernel: [  372.084716] [<c0578d1c>] (__down_read+0xc0/0xd8) from [<c057b0e8>] (do_page_fault.part.2+0x90/0x1f8)
Mar  5 17:27:34 panda01 kernel: [  372.085205] [<c057b0e8>] (do_page_fault.part.2+0x90/0x1f8) from [<c057b2ec>] (do_page_fault+0x9c/0xac)
Mar  5 17:27:34 panda01 kernel: [  372.085266] [<c057b2ec>] (do_page_fault+0x9c/0xac) from [<c0008674>] (do_DataAbort+0x34/0x98)
Mar  5 17:27:34 panda01 kernel: [  372.085266] [<c0008674>] (do_DataAbort+0x34/0x98) from [<c05797d8>] (__dabt_svc+0x38/0x60)
Mar  5 17:27:34 panda01 kernel: [  372.085266] Exception stack(0xeb9a5ec0 to 0xeb9a5f08)
Mar  5 17:27:34 panda01 kernel: [  372.085357] 5ec0: 595ac000 595ae000 00000020 0000001f ee5a555c 595ad9f4 595ac680 ee5a5520
Mar  5 17:27:34 panda01 kernel: [  372.085357] 5ee0: 00000000 eb9a4000 00000000 000002ff 595ad000 eb9a5f08 c0011758 c0019958
Mar  5 17:27:34 panda01 kernel: [  372.085357] 5f00: 800f0113 ffffffff
Mar  5 17:27:34 panda01 kernel: [  372.085510] [<c05797d8>] (__dabt_svc+0x38/0x60) from [<c0019958>] (v7_coherent_kern_range+0x1c/0x7c)
Mar  5 17:27:34 panda01 kernel: [  372.085510] [<c0019958>] (v7_coherent_kern_range+0x1c/0x7c) from [<c0011758>] (arm_syscall+0x140/0x294)
Mar  5 17:27:34 panda01 kernel: [  372.085632] [<c0011758>] (arm_syscall+0x140/0x294) from [<c000d500>] (ret_fast_syscall+0x0/0x30)
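
For scanning longer syslog files for reports like the one above, a small parser
can help. The following Python sketch is an illustration only; the function name
and the regular expressions are assumptions modelled on the lines quoted above,
not on a documented log format.

```python
import re

# Header line, e.g. "INFO: task java:1503 blocked for more than 120 seconds."
HUNG_TASK_RE = re.compile(
    r"INFO: task (?P<comm>.+):(?P<pid>\d+) "
    r"blocked for more than (?P<secs>\d+) seconds"
)
# Backtrace line, e.g. "[<c0576f58>] (__schedule+0x4f0/0x5cc) from ..."
FRAME_RE = re.compile(
    r"\[<[0-9a-f]+>\] \((?P<func>[^+\s]+)\+0x[0-9a-f]+/0x[0-9a-f]+\) from"
)


def parse_hung_task(lines):
    """Return the last hung-task report in lines as a dict, or None.

    'frames' holds the first function named on each backtrace line that
    follows the header, innermost call first.
    """
    report = None
    for line in lines:
        header = HUNG_TASK_RE.search(line)
        if header:
            report = {
                "comm": header.group("comm"),
                "pid": int(header.group("pid")),
                "secs": int(header.group("secs")),
                "frames": [],
            }
            continue
        if report is not None:
            frame = FRAME_RE.search(line)
            if frame:
                report["frames"].append(frame.group("func"))
    return report
```

Applied to the excerpt above, the frames list starts with __schedule and
__down_read, i.e. the task is sleeping while waiting for a read lock.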
Comment 1 Sven Gothel 2012-03-05 19:15:45 CET
Similar experiences:

<https://bugs.launchpad.net/ubuntu/+source/openjdk-6/+bug/845158>
<http://forums.debian.net/viewtopic.php?f=5&t=49368>

+++

I followed:
  <http://www.nico.schottelius.org/blog/reboot-linux-if-task-blocked-for-more-than-n-seconds/>
and set  
  </proc/sys/kernel/hung_task_timeout_secs> from 120 (2 min) to 360 (6 min).

/etc/sysctl.conf:
  vm.min_free_kbytes = 32000
  kernel.hung_task_timeout_secs = 360
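
To compare such settings between the two machines, the file can be parsed with
a few lines of Python. This is a minimal sketch; parse_sysctl_conf and proc_path
are hypothetical helper names, and the dot-to-slash mapping follows the pairing
of kernel.hung_task_timeout_secs and /proc/sys/kernel/hung_task_timeout_secs
shown above.

```python
def parse_sysctl_conf(text):
    """Parse 'key = value' lines of a sysctl.conf-style file into a dict."""
    settings = {}
    for raw in text.splitlines():
        line = raw.strip()
        # Skip blank lines and comment lines starting with '#' or ';'.
        if not line or line.startswith(("#", ";")):
            continue
        key, sep, value = line.partition("=")
        if not sep:
            continue  # not a key/value line
        settings[key.strip()] = value.strip()
    return settings


def proc_path(key):
    """Map a sysctl key to its file below /proc/sys (dots become slashes)."""
    return "/proc/sys/" + key.replace(".", "/")
```

Reading the file returned by proc_path shows the value currently in effect,
which may differ from the configured one until the settings are reloaded.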

Results on platform-1a:
  - passed 2 consecutive remote NEWT junit test runs (no AWT)
  - freezes w/ all test runs (before hang timeout),
    somewhere within the AWT tests.

Results on platform-2:
  - freezes w/ all test runs (before hang timeout),
    somewhere within the AWT tests.

So this is inconclusive - reset the timeout value back to 120.

+++
Comment 2 Sven Gothel 2012-03-05 19:27:05 CET
Freeze could also be reproduced when running all tests locally.
Comment 3 Sven Gothel 2012-03-05 20:02:44 CET
platform-1a: remote ssh (NEWT only) freeze @ 6th run
platform-2:  remote ssh (NEWT only) freeze @ 4th run
Comment 4 Sven Gothel 2012-03-18 09:44:51 CET
It has been determined that the root cause is not within JOGL itself
but probably within the Linux kernel version we use on pandaboard and AC100.
Comment 5 Sven Gothel 2012-03-18 11:26:44 CET
keeping the bug open .. to track success, Xerxes is currently looking for a 
Linux kernel remedy within the vm page_fault area.
Comment 6 Sven Gothel 2012-03-18 16:10:44 CET
This 'external' bug is a P1 blocker prohibiting us from running our unit tests on linux-arm.
Comment 7 Xerxes Rånby 2012-03-19 12:48:08 CET
Inside the Linux kernel, while it is handling a Java JVM page fault
(arch/arm/mm/fault.c, do_page_fault, around line 300), the kernel tries to grab the &mm->mmap_sem kernel semaphore lock.

down_read(&mm->mmap_sem);

This can be seen in the kernel dmesg dump:

Mar  5 17:27:34 panda01 kernel: [  372.084716] [<c0576f58>] (__schedule+0x4f0/0x5cc) from [<c0578d1c>] (__down_read+0xc0/0xd8)
Mar  5 17:27:34 panda01 kernel: [  372.084716] [<c0578d1c>] (__down_read+0xc0/0xd8) from [<c057b0e8>] (do_page_fault.part.2+0x90/0x1f8)

Basically, some part of the kernel is already holding this mm->mmap_sem lock and has forgotten to release it; thus the java process is stuck in the Linux kernel waiting for this lock to be released.
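
The pattern can be illustrated in user space with a lock that is taken and
never released. The Python sketch below is only an analogy, not kernel code:
mmap_sem is a reader/writer semaphore while this uses a plain mutex, and the
name demo_stuck_waiter is hypothetical. The waiter here gives up after a
timeout, whereas the java task inside down_read() has no timeout and no way
to be interrupted.

```python
import threading


def demo_stuck_waiter(timeout_s=0.2):
    """Return True if the waiter obtained the lock within timeout_s seconds."""
    lock = threading.Lock()
    holder_ready = threading.Event()

    def holder():
        lock.acquire()       # take the lock ...
        holder_ready.set()   # ... and return without ever releasing it

    t = threading.Thread(target=holder)
    t.start()
    holder_ready.wait()
    t.join()  # the holder is gone, but the lock stays locked

    # The waiter can only time out; nobody is left to release the lock.
    return lock.acquire(timeout=timeout_s)
```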

https://bugs.launchpad.net/ubuntu-leb/+source/linux-ti-omap4/+bug/845158
Comment 8 Sven Gothel 2012-04-17 10:16:32 CEST
It is confirmed that this bug no longer appears on the Ubuntu 12.* armhf build 
running on the Pandaboard ES.