<?xml version="1.0" encoding="UTF-8" standalone="yes" ?>
<!DOCTYPE bugzilla SYSTEM "https://jogamp.org/bugzilla/page.cgi?id=bugzilla.dtd">

<bugzilla version="5.2"
          urlbase="https://jogamp.org/bugzilla/"
          
          maintainer="sgothel@jausoft.com"
>

    <bug>
          <bug_id>559</bug_id>
          
          <creation_ts>2012-03-05 18:26:08 +0100</creation_ts>
          <short_desc>Linux ARM freezes (Java, EGL/ES, JOGL)</short_desc>
          <delta_ts>2012-04-17 10:16:32 +0200</delta_ts>
          <reporter_accessible>1</reporter_accessible>
          <cclist_accessible>1</cclist_accessible>
          <classification_id>3</classification_id>
          <classification>JogAmp</classification>
          <product>Jogl</product>
          <component>embedded</component>
          <version>2</version>
          <rep_platform>embedded_arm</rep_platform>
          <op_sys>linux</op_sys>
          <bug_status>RESOLVED</bug_status>
          <resolution>WORKSFORME</resolution>
          
          
          <bug_file_loc></bug_file_loc>
          <status_whiteboard></status_whiteboard>
          <keywords></keywords>
          <priority>P1</priority>
          <bug_severity>blocker</bug_severity>
          <target_milestone>---</target_milestone>
          
          
          <everconfirmed>1</everconfirmed>
          <reporter name="Sven Gothel">sgothel</reporter>
          <assigned_to name="Sven Gothel">sgothel</assigned_to>
          <cc>xerxes</cc>
          
          <cf_type>---</cf_type>
          <cf_scm_refs></cf_scm_refs>
          <cf_workaround>---</cf_workaround>

      

      

      

          <comment_sort_order>oldest_to_newest</comment_sort_order>  
          <long_desc isprivate="0" >
    <commentid>1421</commentid>
    <comment_count>0</comment_count>
    <who name="Sven Gothel">sgothel</who>
    <bug_when>2012-03-05 18:26:08 +0100</bug_when>
    <thetext>Phenomenon:

The freeze I am reporting is characterized by 

  - hanging java process

  - the command &apos;ps ax&apos; hangs before the line
    where it would presumably report the java process

  - syslog message: &lt;see bottom of the description&gt;

  - &apos;kill -9 &lt;PID&gt;&apos; doesn&apos;t work

  - reboot freezes as well, the reset button needs to be pressed
 
This is different than an implementation error, e.g. a &apos;software deadlock&apos;,
since such a freeze should not affect the overall system
and the user process should remain interruptible.
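
A minimal sketch for spotting such a task without relying on ps. The java
task in the syslog message at the bottom is reported in state D
(uninterruptible sleep), which would match kill -9 having no effect.
Assumptions, not verified on the affected boards: /proc/PID/stat stays
readable during the freeze, and the helper names task_state and
blocked_pids are hypothetical.

```python
import os


def task_state(stat_line):
    """Return the one-letter scheduler state from a /proc/PID/stat line."""
    # The command name is wrapped in parentheses and may itself contain
    # spaces or parentheses, so split at the last closing parenthesis.
    _, _, rest = stat_line.rpartition(")")
    return rest.split()[0]


def blocked_pids(proc_root="/proc"):
    """List PIDs whose state is D (uninterruptible sleep)."""
    pids = []
    for name in os.listdir(proc_root):
        if not name.isdigit():
            continue  # skip non-process entries such as /proc/meminfo
        try:
            with open(os.path.join(proc_root, name, "stat")) as f:
                if task_state(f.read()) == "D":
                    pids.append(int(name))
        except OSError:
            continue  # process exited while scanning
    return pids
```

Calling blocked_pids() on the target should list the hung java PID if the
assumption above holds.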
+++

The native es2redsquare hasn&apos;t frozen the machine so far,
e.g. after 800 loops from the shell:
  cd ./jogl/src/test/native/mesa-demos-patched
  bash make.sh es2redsquare.c
  bash shell_loop.sh

+++

TestRedSquareES2NEWT or TestGearsES2NEWT
with &apos;-loops 1000 -loop-shutdown 1 -time 100&apos; doesn&apos;t freeze either.

Note: &apos;-loop-shutdown 2&apos; triggers a bug in EGL, eglGetDisplay(..) fails
sometimes, probably due to an EGL race condition?

+++

Recent &apos;shell&apos; loop tests w/ TestRedSquareES2NEWT or TestGearsES2NEWT
and the args &apos;-loops 1 -time 100&apos; didn&apos;t freeze the machines,
tested a few times up to ~250 loops.

+++

Platform-1a + Platform-2:

The remote NEWT unit tests pass properly the 1st time.
You have to remove the AWT*NEWT* test collection manually 
from the junit.run.remote.ssh target in build-test.xml.

However a 2nd run freezes the machines (pandaboard/ac100) 
within an arbitrary test.

Running all remote unit tests (default) freezes both machines
within the &apos;AWT/NEWT tests&apos;, which come after the NEWT-only tests.

+++

Platform-1b:

Running the NEWT unit tests, occasional &apos;hangs&apos; occur in:
  &apos;jogamp.opengl.x11.glx.GLX.dispatch_glXMakeContextCurrent1&apos;

&apos;ps ax&apos; works and discloses the PID, 
which can be killed via &apos;kill -9 &lt;PID&gt;&apos;.

The unit tests then continue properly.

+++

This has been reproduced w/ OpenJDK 
  - IcedTea6 1.11pre (6b23~pre11-0ubuntu1.11.10.2) +
    JamVM (build 1.6.0-devel, inline-threaded interpreter with stack-caching)

  - Oracle J2SE/JRE build 1.6.0_30-b12 +
    Java HotSpot(TM) Client VM (build 20.5-b03, mixed mode)


Platform-1a:
  - Pandaboard ES (Omap4)

  - Ubuntu 11.10

  - GLX and Mesa3D Software &apos;enabled&apos;

  - EGL/ES: pvr-omap4 1.7.10.0.1.9-1

  - Linux panda01 3.1.0-1282-omap4 #11-Ubuntu SMP PREEMPT Mon Feb 13 15:38:55 
    UTC 2012 armv7l armv7l armv7l GNU/Linux


Platform-1b:
  - Pandaboard ES (Omap4)

  - Ubuntu 11.10

  - GLX and Mesa3D Software &apos;enabled&apos;

  - EGL/ES: disabled (moved libEGL* libGLESv* away)

  - Linux panda01 3.1.0-1282-omap4 #11-Ubuntu SMP PREEMPT Mon Feb 13 15:38:55 
    UTC 2012 armv7l armv7l armv7l GNU/Linux



Platform-2:
  - Toshiba AC100 (Tegra2)

  - Ubuntu 11.10

  - GLX and Mesa3D Software &apos;enabled&apos;

  - EGL/ES: nvidia-tegra 12~beta1-0ubuntu1

  - Linux jautab02 2.6.38-1001-ac100 #2-Ubuntu SMP PREEMPT Tue Dec 20 08:05:25 
    UTC 2011 armv7l armv7l armv7l GNU/Linux

+++

The freeze is completely arbitrary:
it rarely happens within the demo code&apos;s call of EGLContextImpl.makeCurrent(),
but more often before test setup or finish w/o any EGL/ES calls involved.

+++

Both platforms have a similar if not equal package setup.

They differ in their:
  - Linux kernel 
  - EGL/ES driver.

Since neither the internal loop nor the native test
could reproduce this freeze,
one could assume that the EGL/ES drivers are not the culprit.

This assumption may also be deduced from the fact that platform-1a
and platform-2 use different EGL/ES drivers.

However platform-1b does not freeze (software OpenGL)
hence some correlation between hardware and Java might
cause the problem.

The common ground on all freezing platforms is the 
Xorg server/client, besides the other generic dependencies.

The Xorg server/client is being treated differently
when using software OpenGL or proprietary EGL/ES.

+++

Cause: TBD

+++

Java Freeze Syslog Message:

Mar  5 17:27:34 panda01 kernel: [  372.084716] INFO: task java:1503 blocked for more than 120 seconds.
Mar  5 17:27:34 panda01 kernel: [  372.084716] &quot;echo 0 &gt; /proc/sys/kernel/hung_task_timeout_secs&quot; disables this message.
Mar  5 17:27:34 panda01 kernel: [  372.084716] java            D c0576f58     0  1503   1167 0x00000000
Mar  5 17:27:34 panda01 kernel: [  372.084716] [&lt;c0576f58&gt;] (__schedule+0x4f0/0x5cc) from [&lt;c0578d1c&gt;] (__down_read+0xc0/0xd8)
Mar  5 17:27:34 panda01 kernel: [  372.084716] [&lt;c0578d1c&gt;] (__down_read+0xc0/0xd8) from [&lt;c057b0e8&gt;] (do_page_fault.part.2+0x90/0x1f8)
Mar  5 17:27:34 panda01 kernel: [  372.085205] [&lt;c057b0e8&gt;] (do_page_fault.part.2+0x90/0x1f8) from [&lt;c057b2ec&gt;] (do_page_fault+0x9c/0xac)
Mar  5 17:27:34 panda01 kernel: [  372.085266] [&lt;c057b2ec&gt;] (do_page_fault+0x9c/0xac) from [&lt;c0008674&gt;] (do_DataAbort+0x34/0x98)
Mar  5 17:27:34 panda01 kernel: [  372.085266] [&lt;c0008674&gt;] (do_DataAbort+0x34/0x98) from [&lt;c05797d8&gt;] (__dabt_svc+0x38/0x60)
Mar  5 17:27:34 panda01 kernel: [  372.085266] Exception stack(0xeb9a5ec0 to 0xeb9a5f08)
Mar  5 17:27:34 panda01 kernel: [  372.085357] 5ec0: 595ac000 595ae000 00000020 0000001f ee5a555c 595ad9f4 595ac680 ee5a5520
Mar  5 17:27:34 panda01 kernel: [  372.085357] 5ee0: 00000000 eb9a4000 00000000 000002ff 595ad000 eb9a5f08 c0011758 c0019958
Mar  5 17:27:34 panda01 kernel: [  372.085357] 5f00: 800f0113 ffffffff
Mar  5 17:27:34 panda01 kernel: [  372.085510] [&lt;c05797d8&gt;] (__dabt_svc+0x38/0x60) from [&lt;c0019958&gt;] (v7_coherent_kern_range+0x1c/0x7c)
Mar  5 17:27:34 panda01 kernel: [  372.085510] [&lt;c0019958&gt;] (v7_coherent_kern_range+0x1c/0x7c) from [&lt;c0011758&gt;] (arm_syscall+0x140/0x294)
Mar  5 17:27:34 panda01 kernel: [  372.085632] [&lt;c0011758&gt;] (arm_syscall+0x140/0x294) from [&lt;c000d500&gt;] (ret_fast_syscall+0x0/0x30)</thetext>
  </long_desc><long_desc isprivate="0" >
    <commentid>1422</commentid>
    <comment_count>1</comment_count>
    <who name="Sven Gothel">sgothel</who>
    <bug_when>2012-03-05 19:15:45 +0100</bug_when>
    <thetext>Similar experiences:

&lt;https://bugs.launchpad.net/ubuntu/+source/openjdk-6/+bug/845158&gt;
&lt;http://forums.debian.net/viewtopic.php?f=5&amp;t=49368&gt;

+++

I followed:
  &lt;http://www.nico.schottelius.org/blog/reboot-linux-if-task-blocked-for-more-than-n-seconds/&gt;
and set  
  &lt;/proc/sys/kernel/hung_task_timeout_secs&gt; from 120 (2 min) to 360 (6 min).

/etc/sysctl.conf:
  vm.min_free_kbytes = 32000
  kernel.hung_task_timeout_secs = 360

Results on platform-1a:
  - passed 2 consecutive remote NEWT junit test runs (no AWT)
  - freezes w/ all test runs (before hang timeout),
    somewhere within the AWT tests.

Results on platform-2:
  - freezes w/ all test runs (before hang timeout),
    somewhere within the AWT tests.

So this is inconclusive - reset the timeout value back to 120.

+++</thetext>
  </long_desc><long_desc isprivate="0" >
    <commentid>1423</commentid>
    <comment_count>2</comment_count>
    <who name="Sven Gothel">sgothel</who>
    <bug_when>2012-03-05 19:27:05 +0100</bug_when>
    <thetext>Freeze could also be reproduced when running all tests locally.</thetext>
  </long_desc><long_desc isprivate="0" >
    <commentid>1424</commentid>
    <comment_count>3</comment_count>
    <who name="Sven Gothel">sgothel</who>
    <bug_when>2012-03-05 20:02:44 +0100</bug_when>
    <thetext>platform-1a: remote ssh (NEWT only) freeze @ 6th run
platform-2:  remote ssh (NEWT only) freeze @ 4th run</thetext>
  </long_desc><long_desc isprivate="0" >
    <commentid>1462</commentid>
    <comment_count>4</comment_count>
    <who name="Sven Gothel">sgothel</who>
    <bug_when>2012-03-18 09:44:51 +0100</bug_when>
    <thetext>It has been determined that the root cause is not within JOGL itself
but probably within the Linux kernel version we use on pandaboard and AC100.</thetext>
  </long_desc><long_desc isprivate="0" >
    <commentid>1463</commentid>
    <comment_count>5</comment_count>
    <who name="Sven Gothel">sgothel</who>
    <bug_when>2012-03-18 11:26:44 +0100</bug_when>
    <thetext>keeping the bug open .. to track success, Xerxes is currently looking for a 
Linux kernel remedy within the vm page_fault area.</thetext>
  </long_desc><long_desc isprivate="0" >
    <commentid>1468</commentid>
    <comment_count>6</comment_count>
    <who name="Sven Gothel">sgothel</who>
    <bug_when>2012-03-18 16:10:44 +0100</bug_when>
    <thetext>This &apos;external&apos; bug is a P1 blocker prohibiting us from running our unit tests on linux-arm.</thetext>
  </long_desc><long_desc isprivate="0" >
    <commentid>1478</commentid>
    <comment_count>7</comment_count>
    <who name="Xerxes Rånby">xerxes</who>
    <bug_when>2012-03-19 12:48:08 +0100</bug_when>
    <thetext>Inside the Linux kernel, while it is handling a Java JVM page fault
(arch/arm/mm/fault.c, do_page_fault, around line 300), the kernel tries to grab the &amp;mm-&gt;mmap_sem kernel semaphore lock.

down_read(&amp;mm-&gt;mmap_sem);

This can be seen in the kernel dmesg dump:

Mar  5 17:27:34 panda01 kernel: [  372.084716] [&lt;c0576f58&gt;] (__schedule+0x4f0/0x5cc) from [&lt;c0578d1c&gt;] (__down_read+0xc0/0xd8)
Mar  5 17:27:34 panda01 kernel: [  372.084716] [&lt;c0578d1c&gt;] (__down_read+0xc0/0xd8) from [&lt;c057b0e8&gt;] (do_page_fault.part.2+0x90/0x1f8)

Basically, some part of the kernel is already holding this mm-&gt;mmap_sem lock and has forgotten to release it, thus the java process is stuck in the Linux kernel waiting for this lock to be released.
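
A user-space analogy of this situation, as a minimal sketch only. It is not
kernel code: a plain Python threading.Lock stands in for the kernel
semaphore (which is a reader/writer semaphore in reality), and the names
try_enter and mmap_sem are hypothetical stand-ins chosen for illustration.

```python
import threading


def try_enter(lock, timeout):
    """Try to take the lock; report whether the caller got in."""
    got_it = lock.acquire(timeout=timeout)
    if got_it:
        lock.release()
    return got_it


# Stand-in for the mmap_sem semaphore; hypothetical, not kernel code.
mmap_sem = threading.Lock()

# Some code path takes the lock and never releases it.
mmap_sem.acquire()

# The page-fault path now cannot enter. The timeout exists only so the
# sketch terminates; the kernel path waits without one.
stuck = not try_enter(mmap_sem, timeout=0.1)
```

Here stuck ends up True. The java process is in the position of the second
caller, except that it waits without a timeout and cannot be interrupted.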

https://bugs.launchpad.net/ubuntu-leb/+source/linux-ti-omap4/+bug/845158</thetext>
  </long_desc><long_desc isprivate="0" >
    <commentid>1532</commentid>
    <comment_count>8</comment_count>
    <who name="Sven Gothel">sgothel</who>
    <bug_when>2012-04-17 10:16:32 +0200</bug_when>
    <thetext>It is confirmed that this bug no longer appears on the Ubuntu 12.* armhf build
running on pandaboard es.</thetext>
  </long_desc>
      
      

    </bug>

</bugzilla>