Bug 1112 - Missing OpenCL devices with java webstart launched JOCL/JOGL
Summary: Missing OpenCL devices with java webstart launched JOCL/JOGL
Status: RESOLVED INVALID
Alias: None
Product: Jocl
Classification: JogAmp
Component: opencl
Version: 2.3.0
Hardware: pc_x86_64 linux
Importance: --- critical
Assignee: Wade Walker
URL:
Depends on:
Blocks:
 
Reported: 2014-12-21 21:09 CET by Tim Barth
Modified: 2019-03-29 17:54 CET
CC List: 2 users

See Also:
Type: ---
SCM Refs:
Workaround: ---


Attachments
sh etc/test.sh log file (15.37 KB, text/x-log)
2014-12-21 21:09 CET, Tim Barth
simple test code that works correctly when run from command line but fails when launched from webstart (1.59 KB, text/x-java)
2014-12-21 21:12 CET, Tim Barth
output when test code is run from command line (1.51 KB, application/octet-stream)
2014-12-21 21:14 CET, Tim Barth
output when test code is launched from java webstart (2.54 KB, application/octet-stream)
2014-12-21 21:15 CET, Tim Barth
second test code that runs correctly when run from command line and also runs when launched from webstart but is missing OpenCL platform and devices (1.00 KB, text/x-java)
2014-12-21 21:18 CET, Tim Barth
output when second test code is run from command line (816 bytes, application/octet-stream)
2014-12-21 21:20 CET, Tim Barth
output when second test code is launched from webstart (note missing platform and devices) (1.39 KB, application/octet-stream)
2014-12-21 21:21 CET, Tim Barth
JNLP for first test problem (2.70 KB, application/x-java-jnlp-file)
2014-12-21 21:21 CET, Tim Barth
JNLP for second test problem (2.70 KB, application/x-java-jnlp-file)
2014-12-21 21:22 CET, Tim Barth
Output from test 1 (4.65 KB, text/plain)
2014-12-30 21:52 CET, Wade Walker
Output from test 2 (4.34 KB, text/plain)
2014-12-30 21:53 CET, Wade Walker
Output from test 1 with Intel OpenCL (5.26 KB, text/plain)
2015-01-01 21:53 CET, Wade Walker
Output from test 2 with Intel OpenCL (4.70 KB, text/plain)
2015-01-01 21:53 CET, Wade Walker

Description Tim Barth 2014-12-21 21:09:12 CET
Created attachment 672
sh etc/test.sh log file
Comment 1 Tim Barth 2014-12-21 21:12:58 CET
Created attachment 673
simple test code that works correctly when run from command line but fails when launched from webstart
Comment 2 Tim Barth 2014-12-21 21:14:37 CET
Created attachment 674
output when test code is run from command line
Comment 3 Tim Barth 2014-12-21 21:15:47 CET
Created attachment 675
output when test code is launched from java webstart
Comment 4 Tim Barth 2014-12-21 21:18:36 CET
Created attachment 676
second test code that runs correctly when run from command line and also runs when launched from webstart but is missing OpenCL platform and devices
Comment 5 Tim Barth 2014-12-21 21:20:09 CET
Created attachment 677
output when second test code is run from command line
Comment 6 Tim Barth 2014-12-21 21:21:12 CET
Created attachment 678
output when second test code is launched from webstart (note missing platform and devices)
Comment 7 Tim Barth 2014-12-21 21:21:52 CET
Created attachment 679
JNLP for first test problem
Comment 8 Tim Barth 2014-12-21 21:22:27 CET
Created attachment 680
JNLP for second test problem
Comment 9 Tim Barth 2014-12-21 21:38:38 CET
The issue pertains to Java Web Start launched applications that mix JOCL and JOGL. The same applications work correctly when executed from the command line (no Java Web Start). For Java Web Start launched applications:

  -- Retrieving JOCL OpenCL platform(s) and device(s) BEFORE calling JOGL GLProfile fails when launched via webstart (see Test.java, Test.jnlp, webstart_console_test, output_commandline_test). It works correctly from command-line execution.

  -- Retrieving JOCL OpenCL platform(s) and device(s) AFTER calling JOGL GLProfile runs when launched via webstart (see Test2.java, Test2.jnlp, webstart_console_test2, output_commandline_test2), but the number of OpenCL platforms found is 1 when it should be 2, and the number of devices found is 1 when it should be 3 in total across both platforms.

This is a critical problem. For example, the JogAmp CL/GL interop demo does not work for me because only one device is found, a CPU, and this device does not have interop capability, even though I do have a GPU with interop capability that JOCL does otherwise find (see above) when JOGL is not invoked.
Comment 10 Wade Walker 2014-12-22 00:01:15 CET
So from looking at your logs, it seems that from the command line, everything works as expected, but in a webstart app:

1. If you enumerate OpenCL devices, then call GLProfile.getDefault(), then getDefault() fails with a web start error saying "Profile GL_DEFAULT is not available on null, but: []".

2. If you call GLProfile.getDefault(), then enumerate OpenCL devices, nothing crashes, but some OpenCL devices are not shown.

Does this sum up the behavior you're seeing?
Comment 11 Wade Walker 2014-12-30 21:51:57 CET
I ran both the included tests on my machine:

- Windows 8, 64-bit
- Java 1.8.0_25, 64-bit
- Nvidia GeForce GTX 660, driver 335.23
- slightly old JogAmp code (checked out a few months back)

I include both outputs here, but I don't see the problems you're seeing. Looking at the diffs between our outputs though, I see a few potential things you could try to help diagnose this:

- I'm running JWS from the command line with "javaws Test.jnlp"; if you're running some other way you might see if this makes a difference
- I don't have Intel's OpenCL installed; you might see if removing this makes a difference
- I can't tell what Java or JogAmp version you're on; maybe a newer version would make a difference (I had to get a very recent Java just to get JWS to let me run the app :))
- Ditto with your graphics drivers; maybe newer ones would help if you're not on the newest already
Comment 12 Wade Walker 2014-12-30 21:52:51 CET
Created attachment 682
Output from test 1
Comment 13 Wade Walker 2014-12-30 21:53:08 CET
Created attachment 683
Output from test 2
Comment 14 Tim Barth 2014-12-31 01:21:48 CET
Wade, 
   Based on your findings of 12/30, I have narrowed the problem down to an interaction of the Intel OpenCL platform with JOGL-JOCL methods.
Recall that I have 2 OpenCL vendor platforms on my Linux system: (1) AMD OpenCL 1.2 and (2) Intel OpenCL 1.2. I experimented with removing the ICD file for each platform, with the following results:

-- only the AMD ICD file installed: javaws launches of both Test.jnlp and Test2.jnlp work correctly.

-- only the INTEL ICD file installed: javaws launch of Test.jnlp FAILS with "Profile GL_DEFAULT is not available on null, but: []". Test2.jnlp does run correctly, but the success is inconclusive because the INTEL platform has only 1 device.

-- both AMD and INTEL ICD files installed: same results as reported on 12/21, i.e. a javaws launch of Test.jnlp fails with "Profile GL_DEFAULT is not available on null, but: []". Test2.jnlp runs, but only 1 device is detected after GLProfile.getDefault() is executed (3 devices should be detected).

So it seems that the problem relates to a JOGL-JOCL interaction with the Intel OpenCL platform, and to observe it you may need to install the Intel OpenCL SDK.
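For anyone repeating the ICD experiment above, it helps to record which ICD files were installed for each run. The following is a minimal Java sketch that lists the *.icd files in a vendors directory. The class name is hypothetical, and the default directory /etc/OpenCL/vendors is an assumption (it is the usual location read by the OpenCL ICD loader on Linux, but this report does not name it).

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical helper, not part of JOCL: lists the ICD files visible in a directory.
class IcdLister {
    // Returns the sorted names of all *.icd files in vendorsDir,
    // or an empty list if the directory does not exist.
    static List<String> listIcdFiles(Path vendorsDir) throws IOException {
        List<String> names = new ArrayList<>();
        if (!Files.isDirectory(vendorsDir)) {
            return names;
        }
        try (DirectoryStream<Path> stream = Files.newDirectoryStream(vendorsDir, "*.icd")) {
            for (Path p : stream) {
                names.add(p.getFileName().toString());
            }
        }
        Collections.sort(names);
        return names;
    }

    public static void main(String[] args) throws IOException {
        // Assumed default location; pass another directory as the first argument.
        Path dir = Paths.get(args.length > 0 ? args[0] : "/etc/OpenCL/vendors");
        for (String name : listIcdFiles(dir)) {
            System.out.println(name);
        }
    }
}
```

Running it before each javaws launch gives a record of which vendor platforms the loader could have seen for that run.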
Comment 15 Wade Walker 2015-01-01 21:52:58 CET
OK, I installed Intel's OpenCL runtime (version 14.2) for 64-bit Windows and re-ran, with identical results -- I can't seem to duplicate the crash you're seeing in test 1, or the missing device in test 2. New output is attached. Perhaps this problem is OS-specific?
Comment 16 Wade Walker 2015-01-01 21:53:26 CET
Created attachment 684
Output from test 1 with Intel OpenCL
Comment 17 Wade Walker 2015-01-01 21:53:47 CET
Created attachment 685
Output from test 2 with Intel OpenCL
Comment 18 Tim Barth 2015-01-02 19:18:46 CET
I (re)confirmed that the 2 test codes launched via Java Web Start execute correctly on Windows 8.1.

I then set up a test on a Linux platform running Red Hat 6 (2.6.32-504.1.3.el6.x86_64), and both test problems launched via Java Web Start execute CORRECTLY.

So at this point, the results for the test problems launched via Java Web Start are:

 * Windows 8.1 (Intel and AMD OpenCL platforms) - success

 * Linux Red Hat Enterprise 6 (Intel and NVIDIA OpenCL platforms) - success

 * Linux Ubuntu 14.04 (Intel and AMD OpenCL platforms) - FAIL
Comment 19 Wade Walker 2015-01-08 02:27:12 CET
I've duplicated this error on Ubuntu 14.04 LTS 64-bit, with Nvidia and Intel OpenCL drivers installed. The stack trace for the JWS error is this:

javax.media.opengl.GLException: Profile GL_DEFAULT is not available on null, but: []
	at javax.media.opengl.GLProfile.get(GLProfile.java:990)
	at javax.media.opengl.GLProfile.getDefault(GLProfile.java:721)
	at javax.media.opengl.GLProfile.getDefault(GLProfile.java:732)
	at Test.main(Test.java:38)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:606)
	at com.sun.javaws.Launcher.executeApplication(Unknown Source)
	at com.sun.javaws.Launcher.executeMainClass(Unknown Source)
	at com.sun.javaws.Launcher.doLaunchApp(Unknown Source)
	at com.sun.javaws.Launcher.run(Unknown Source)
	at java.lang.Thread.run(Thread.java:744)

Next I just need to trace the execution when running in JWS to find out why GLProfile.getDefault() is acting so strangely.
Comment 20 Wade Walker 2015-01-17 20:34:00 CET
Still in the process of debugging this one. I've got a debug build of JOGL set up and I can trace execution into it after attaching to the JWS process, but I haven't found the root cause yet. There are some runtime exceptions that are swallowed inside GLProfile.getDefault(), but so far they seem to be false alarms.

In the error message "Profile GL_DEFAULT is not available on null, but: []", the "null" is the default device handle, and the "[]" is the empty list of GL profiles JOGL finds for the device. Not clear yet why this list isn't populated, but my guess is that somehow when you enumerate CL devices the default device is set to the Intel device, which is a CPU not a GPU, and thus has no GL profiles.

Will work more on this next week.
Comment 21 Wade Walker 2015-01-27 01:51:01 CET
Debugged a little more on this. JOGL is seeing a failure on a dlopen("libGL.so"), but only if it's previously done a dlopen("libOpenCL.so"). I'm not yet sure why this should be happening, since dlopen() is supposed to search the same paths regardless of what has previously been loaded. A call to dlerror() after the failing dlopen() doesn't return any information.

My next step will probably be to write a C program that duplicates this dlopen() behavior, to try to see if it happens regardless of JOGL, or if JOGL is somehow responsible.

Some other people reporting problems with GL/CL interop on Ubuntu have found that adding their libGL.so's directory to LD_LIBRARY_PATH fixed the problem for them -- I haven't tried that yet, though.
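Before trying the LD_LIBRARY_PATH suggestion, it can be useful to check whether the directory holding libGL.so is already on the path the JVM sees. The following is a minimal Java sketch with a hypothetical class name. It assumes the colon-separated format used on Linux, and note that the environment of a process started by javaws may differ from that of the shell.

```java
// Hypothetical helper, not part of JOGL: checks an LD_LIBRARY_PATH-style string for a directory.
class LdPathCheck {
    // True if dir appears as an entry of the colon-separated ldLibraryPath.
    // A single trailing slash on either side is ignored; symlinks are not resolved.
    static boolean containsDir(String ldLibraryPath, String dir) {
        if (ldLibraryPath == null || dir == null) {
            return false;
        }
        String wanted = normalize(dir);
        for (String entry : ldLibraryPath.split(":")) {
            if (!entry.isEmpty() && normalize(entry).equals(wanted)) {
                return true;
            }
        }
        return false;
    }

    // Drops one trailing slash so "/usr/lib/" and "/usr/lib" compare equal.
    private static String normalize(String p) {
        return (p.length() > 1 && p.endsWith("/")) ? p.substring(0, p.length() - 1) : p;
    }

    public static void main(String[] args) {
        // The default directory is only a placeholder; pass the real libGL.so directory.
        String dir = args.length > 0 ? args[0] : "/usr/lib";
        boolean found = containsDir(System.getenv("LD_LIBRARY_PATH"), dir);
        System.out.println(dir + " on LD_LIBRARY_PATH: " + found);
    }
}
```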
Comment 22 Wade Walker 2015-01-28 02:38:15 CET
OK, I think I know the cause of this and a possible solution, but it doesn't seem to be JOGL's fault.

Once I identified that the failure was at dlopen(), I turned on JOGL's debug messages for that module by passing "-J-Djogamp.debug.NativeLibrary" to javaws, which shows a lot of diagnostic info in the Java console.

Then as a test, I tried setting LD_LIBRARY_PATH to explicitly include the path to my libGL.so (/usr/lib/nvidia-304). It still failed dlopen(), but it also showed a message for dlerror() now:

NativeLibrary.open(global true): Trying to load /usr/lib/nvidia-304/libGL.so.1 dlopen "/usr/lib/nvidia-304/libGL.so.1" failed, error: dlopen: cannot load any more object with static TLS

Aha! When you look up that error, it turns out that there can be a static thread-local storage area required to load a shared object file, and glibc's loader only reserves a set amount of space for this. If you try to load too many .so files that need static TLS, dlopen() will eventually fail. And it looks like going through javaws and then loading the Intel OpenCL drivers pushes it over the limit -- the drivers alone load 3 or 4 different .so files, each of which needs some static TLS.
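To spot this condition in a long Java console log, the dlopen() failure lines can be filtered mechanically. The following is a minimal Java sketch that extracts the library path and the dlerror() text from a line shaped like the one quoted above and flags the static TLS case. The class and method names are hypothetical, and the line format is assumed from that single quoted line, so other JogAmp versions may log differently.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of JogAmp: classifies one NativeLibrary debug line.
class DlopenFailure {
    // Assumed shape: dlopen "<path>" failed, error: <dlerror text>
    private static final Pattern FAILED =
            Pattern.compile("dlopen \"([^\"]+)\" failed, error: (.*)");

    // The glibc message quoted in this report for an exhausted static TLS reserve.
    static final String STATIC_TLS_MARKER = "cannot load any more object with static TLS";

    final String library;
    final String error;

    private DlopenFailure(String library, String error) {
        this.library = library;
        this.error = error;
    }

    // Returns null when the line does not describe a failed dlopen().
    static DlopenFailure parse(String logLine) {
        if (logLine == null) {
            return null;
        }
        Matcher m = FAILED.matcher(logLine);
        return m.find() ? new DlopenFailure(m.group(1), m.group(2).trim()) : null;
    }

    boolean isStaticTlsExhaustion() {
        return error.contains(STATIC_TLS_MARKER);
    }

    public static void main(String[] args) {
        String line = "NativeLibrary.open(global true): Trying to load /usr/lib/nvidia-304/libGL.so.1 "
                + "dlopen \"/usr/lib/nvidia-304/libGL.so.1\" failed, error: "
                + "dlopen: cannot load any more object with static TLS";
        DlopenFailure f = parse(line);
        System.out.println(f.library + " -> static TLS exhausted: " + f.isStaticTlsExhaustion());
    }
}
```

A failure whose error text is anything else (for example a missing file) would parse but not be flagged, which separates this glibc limit from ordinary search-path problems.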

Normally the solution is to recompile the .so files so they don't use static TLS, but that's not an option here since we don't own them. My test machine's Ubuntu 14.04 LTS glibc library is already at the latest version, so no easy way to upgrade it. But it looks like people have seen this error in another context, so there's a patch in Ubuntu 15:

https://launchpad.net/ubuntu/+source/glibc/2.19-10ubuntu2
https://bugs.launchpad.net/ubuntu/+source/glibc/+bug/1375555

So my suggestion would be to try Ubuntu 15 (confirming it's got the right glibc version) and see if that fixes it. Unfortunately my test machine was borrowed, so I couldn't upgrade it to try this myself. And if it works, please let me know so we'll have a record in this bug report for the future.
Comment 23 Tim Barth 2015-01-28 20:21:59 CET
Great news! I upgraded from Ubuntu 14.04 to 14.10. The glibc in Ubuntu 14.10 has been fixed:

     http://www.ubuntuupdates.org/package/core/utopic/universe/proposed/glibc

>>glibc (2.19-10ubuntu2) utopic; urgency=medium
>>
>>  * Add patches/ubuntu/unsubmitted-increase-dtv-surplus.diff from Fedora to
>>    allow up to 32 dlopened modules to use static TLS (LP: #1375555).
>> -- Colin Watson <email address hidden> Tue, 30 Sep 2014 14:33:02 +0100
>>
>> 1375555 	global static TLS slot limit breaks the x86 emulator

Both JOGL/JOCL test problems now function correctly when launched from JWS!!  

Thanks for getting to the bottom of this Ubuntu problem.

From the previous comments, it does not seem that there is a workaround for Ubuntu 14.04? Should this problem be expected for previous versions of Ubuntu?
Comment 24 Wade Walker 2015-01-29 02:13:12 CET
I'm not sure if they'll back-port this patch to older versions of Ubuntu or not :( From Googling around, it seemed to be a problem that's only happened to a few applications, so it might not be a huge priority. But I don't have any inside knowledge of Ubuntu, so who knows? Apparently it's also possible to compile your own glibc from scratch, but that's probably too scary for a customer-facing fix. Maybe you could get this ported back to 14.04 LTS if you reported it as a bug to them?

I'm just glad upgrading to 14.10 fixed it, since that's a little more palatable for some folks than upgrading to 15.