Bug 513

Summary: WGL.wglMakeCurrent() crashes in native code
Product: [JogAmp] Jogl Reporter: Wade Walker <wwalker3>
Component: windows Assignee: Sven Gothel <sgothel>
Status: RESOLVED INVALID    
Severity: normal CC: bcutler
Priority: ---    
Version: 2   
Hardware: pc_x86_32   
OS: windows   
Type: --- SCM Refs:
Workaround: ---
Attachments: Log file
Error file
Log file with new WGL/WGLExt instrumentation -- "gold" output
Log file - extra output
Error file - extra output
Log file - threaded optimization = off
Error file - threaded optimization = off
Log file - CS_OWNDC
Error file - CS_OWNDC
Log file from CPP version
Log file - GDI/WGL-only
Error file - GDI/WGL-only
Log file - super minimal GDI/WGL
Error file - super minimal GDI/WGL
Log file - minimal 10/13
Error file - minimal 10/13
Log file - Statically linked WGL
Error file - statically linked WGL
Log file - jogl 1.1.1a
Error file - jogl 1.1.1a
Log file - 10/27 version

Description Wade Walker 2011-08-30 15:34:14 CEST
Bug originally reported at http://forum.jogamp.org/Crash-in-native-code-dispatch-wglMakeCurrent1-tp3279419p3279419.html

Happens on Windows XP SP3 32-bit, Java 1.6.0_26, using two NVIDIA GeForce GTX 260 graphics cards, driver version 6.14.12.8026 (released 8/3/11).

Bug also occurs in JOGL1.1.1a, so it doesn't seem to be JOGL2-specific.
Comment 1 bcutler 2011-08-30 23:04:20 CEST
I ran the Geeks3D GPU Caps Viewer as Wade suggested, and all of the OpenGL demos ran perfectly, except for the GL 4.x Tessellation demo, which popped up a dialog saying it wasn't supported.  So it does appear to be a JOGL problem.
Comment 2 Wade Walker 2011-08-31 17:28:12 CEST
Emailed test case to bug reporter to gather initial logs.
Comment 3 bcutler 2011-08-31 19:09:52 CEST
This is the contents of test.log after running Wade's test program.  I hope it is helpful:
java.lang.UnsupportedClassVersionError: javax/media/opengl/GLCapabilitiesImmutable : Unsupported major.minor version 51.0
	at java.lang.ClassLoader.defineClass1(Native Method)
	at java.lang.ClassLoader.defineClassCond(Unknown Source)
	at java.lang.ClassLoader.defineClass(Unknown Source)
	at java.security.SecureClassLoader.defineClass(Unknown Source)
	at java.net.URLClassLoader.defineClass(Unknown Source)
	at java.net.URLClassLoader.access$000(Unknown Source)
	at java.net.URLClassLoader$1.run(Unknown Source)
	at java.security.AccessController.doPrivileged(Native Method)
	at java.net.URLClassLoader.findClass(Unknown Source)
	at java.lang.ClassLoader.loadClass(Unknown Source)
	at sun.misc.Launcher$AppClassLoader.loadClass(Unknown Source)
	at java.lang.ClassLoader.loadClass(Unknown Source)
Could not find the main class: name.wadewalker.drivertest.OneTriangleAWT.  Program will exit.
Exception in thread "main"
Comment 4 Wade Walker 2011-08-31 21:52:12 CEST
Oops, my fault -- I compiled the test with Java 7, which is class version 51.0. I'll redo it with Java 6 to match your environment, and re-send you the test.
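
For reference, the numbers in that error come straight from the class file header: every .class file starts with the magic number 0xCAFEBABE, followed by a minor and a major version. Major 51 corresponds to Java 7 and major 50 to Java 6. A minimal sketch of reading that header is below. The ClassVersion class and its method name are made up for illustration, and the header bytes in main are hand-written rather than taken from the actual test jar.

```java
import java.nio.ByteBuffer;

/** Reads the class-file version from the first 8 bytes of a .class file. */
final class ClassVersion {
    /** Returns the major version, e.g. 50 for Java 6 and 51 for Java 7. */
    static int majorVersion(byte[] classFileHeader) {
        ByteBuffer buf = ByteBuffer.wrap(classFileHeader); // big-endian by default
        if (classFileHeader.length < 8 || buf.getInt() != 0xCAFEBABE) {
            throw new IllegalArgumentException("not a class file");
        }
        buf.getShort();                  // minor version, not needed here
        return buf.getShort() & 0xFFFF;  // major version as an unsigned 16-bit value
    }

    public static void main(String[] args) {
        // Hand-written header standing in for a real file: magic, minor 0, major 51.
        byte[] java7Header = {(byte) 0xCA, (byte) 0xFE, (byte) 0xBA, (byte) 0xBE, 0, 0, 0, 51};
        System.out.println("major version " + majorVersion(java7Header)); // prints "major version 51"
    }
}
```

Running a check like this over the classes in a jar before sending it out would catch a mismatch between the build JDK and the reporter's JRE.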

When you post the log, it'll be large (> 100K) so please post it and the HotSpot log as attachments rather than as comments :) Also, please set the attachment types to plaintext instead of auto-detect -- this helps readability on some platforms.
Comment 5 bcutler 2011-08-31 23:17:17 CEST
Created attachment 266 [details]
Log file
Comment 6 bcutler 2011-08-31 23:17:52 CEST
Created attachment 267 [details]
Error file
Comment 7 Wade Walker 2011-09-02 03:03:38 CEST
I've identified which WGL and WGLExt methods JOGL uses to interact with the OpenGL driver, and I'm in the process of instrumenting them to dump out extra logging information and check for violations of API contracts.

Last time we had a bug like this (e.g. https://jogamp.org/bugzilla/show_bug.cgi?id=480) I had to do an assembly-level debugging session with the machine in my possession to find the problem. That's why I'm attempting this strategy of adding more logging and checking -- to try to improve our ability to find these sorts of driver interaction bugs without requiring the reporter to give us login (or physical access) to their machine.
Comment 8 Sven Gothel 2011-09-02 10:42:33 CEST
Re comment 7: Of course, we use this technique already for a long time, hence all the DEBUG code.
This is good for some situations, ofc.
However - reaching the goal of catching all unknown circumstances is IMHO impossible.

We already catch and report a lot of states and check results.
Adding missed API checks can be a good thing, sure.

IMHO these things shall be in some balance and I doubt we can catch unknown situations
w/o access to the machine. Adding another 40% of DEBUG / INFO statements cannot be the solution.

I never really needed to disassemble code (the other extreme) ..
Most of the time it's just about reading the API, digesting the feedback (debug on machine)
and learn whether it's our code or the GL driver's bug.
For the latest cases - I would never be able to do this with static DEBUG code alone
even though it gives me a hint already.

Just my 2 cents ..

+++

Since we already have a few WinXP bug reports (when will this platform die again ?)
maybe I should set one up here .. hmm
But if it's not reproducible .. well, we should get access to the specific machine.
Comment 9 Wade Walker 2011-09-02 16:03:11 CEST
Of course, if you're available to do a remote debug on this one, that would be the quickest solution :) I was assuming you'd be too busy with Android and Mac OS X, and I wasn't sure if Beth could make her machine available for remote login.

I'm currently running JOGL 2 on two different Windows XP machines, so it's definitely this particular machine/driver/OS combination and not Windows XP in general that's the problem.

For this particular bug, we know that JOGL 1 and JOGL 2 both fail, but C OpenGL programs run fine. This implies to me that JOGL is using WGL in a way that may work for many cards/drivers, but is somehow not fully "correct".

I was thinking of temporarily wrapping JOGL's WGL/WGLExt calls with a class WGLCheck that would print logging info and perform strict consistency and error checking. For example (in pseudo-Java):

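class WGLCheck {
    // Handles are plain longs here, on the assumption that JOGL's WGL bindings pass them that way.
    Map<Long, Long> mapGLContextsToHDCs = new HashMap<Long, Long>();
    Map<Long, Long> mapGLContextsToThreads = new HashMap<Long, Long>();

    public long createContext( long hdc ) {
        log( "createContext: thread " + Thread.currentThread().getId() + ", hdc " + hdc );

        long hglrc = WGL.wglCreateContext( hdc );

        mapGLContextsToHDCs.put( hglrc, hdc );
        return hglrc;
    }

    public void makeCurrent( long hdc, long hglrc ) {
        long threadID = Thread.currentThread().getId();
        log( "makeCurrent: thread " + threadID + ", hdc " + hdc + ", hglrc " + hglrc );
        // only complain if a *different* thread already holds this context
        Long owner = mapGLContextsToThreads.get( hglrc );
        if( owner != null && owner != threadID )
            log( "error: context can only be current on one thread at a time" );

        WGL.wglMakeCurrent( hdc, hglrc );

        mapGLContextsToThreads.put( hglrc, threadID );
        if( !mapGLContextsToHDCs.containsKey( hglrc ) )
            log( "error: GL context not created" );
        else if( mapGLContextsToHDCs.get( hglrc ) == null )
            log( "error: GL context deleted" );   // a deleteContext() wrapper would store null
        else if( mapGLContextsToHDCs.get( hglrc ) != hdc )
            log( "error: GL context not associated with hdc" );
    }

    ... // a log( String ) helper, plus wrappers for the other 20 or so WGL/WGLExt functions JOGL uses
}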

Then once I know the exact sequence of WGL calls that fails, I could examine that sequence closely and see what we're doing wrong. Plus, the extra consistency checking might find some problem in the way we're using the WGL API. It would be nice to have code checks to ensure that we're properly satisfying the complex WGL API contract.

Does this sound crazy? You're the maintainer, so of course you have the final say -- all this is just my humble suggestion :)
Comment 10 bcutler 2011-09-03 01:32:45 CEST
It would be difficult, though probably not impossible, to get you access to the machine.  If you have exhausted all other avenues, I will investigate how we could make that work.
Comment 11 Sven Gothel 2011-09-03 22:06:34 CEST
Re: comment 9

ofc, something like DebugWGL and TraceWGL (like we produce those for DebugGL* TraceGL*)
is a great idea. You can easily use our pipeline generator for this task.

See build-jogl.xml target 'java.generate.composable.pipeline.es2' for example ..
Don't know if we need to patch gluegen/jogl for this task - it should just work.

Indeed, this will give us a nice clue about what passes through incl. the return values
(-> TraceWGL*). The use of DebugWGL* might be tricky, since they are allowed to fail
here and there .. and there are known (documented) WGL driver cases where it 
returns a failure value but actually works (or vice versa).
Hence DebugWGL* might need to continue .. but dump a marker/warning in the stderr stream.

Great idea - go for it!

Even if this doesn't pinpoint the problem right away, it will be a great help,
as DebugGL* and TraceGL* are already - at least for me.
Comment 12 Wade Walker 2011-09-07 16:22:31 CEST
I've looked at com.jogamp.gluegen.opengl.BuildComposablePipeline and the code it emits, and it seems to be custom to GL only, so it can't produce DebugWGL without being modified. For example, the prefixes to GL calls and the code to check for current GL contexts is hard-coded into this class.

Also, for WGL it's not so easy to wrap one object inside WGLDebug at the point where it's instantiated. WGL methods are all static, and are called all over the code without using a wrappable instance :(

Let me first try manually creating a WGLDebug and WGLExtDebug that only implements the 20 or so methods that JOGL actually uses. I can use this to help find Beth's problem. Then afterwards I can use this experience to figure out how to make a general solution like BuildComposablePipeline.
Comment 13 Wade Walker 2011-09-09 03:12:53 CEST
I finished up the trace code for WGL this evening. Tomorrow I'll do WGLExt, and then I should be able to give Beth a new version of the test program that will dump out much more information about our driver calls.
Comment 14 Sven Gothel 2011-09-10 18:51:38 CEST
(In reply to comment #12)
> I've looked at com.jogamp.gluegen.opengl.BuildComposablePipeline and the code
> it emits, and it seems to be custom to GL only, so it can't produce DebugWGL
> without being modified. For example, the prefixes to GL calls and the code to
> check for current GL contexts is hard-coded into this class.

ok, so we would need to make them more generic

> 
> Also, for WGL it's not so easy to wrap one object inside WGLDebug at the point
> where it's instantiated. WGL methods are all static, and are called all over
> the code without using a wrappable instance :(

ofc .. we would need to make them use interfaces and dynamic lookup code,
but that should not be too hard. ie like GLXExt, WGLExt ..

> 
> Let me first try manually creating a WGLDebug and WGLExtDebug that only
> implements the 20 or so methods that JOGL actually uses. I can use this to help
> find Beth's problem. Then afterwards I can use this experience to figure out
> how to make a general solution like BuildComposablePipeline.

yup .. ofc
Comment 15 Sven Gothel 2011-09-10 18:53:09 CEST
(In reply to comment #13)
> I finished up the trace code for WGL this evening. Tomorrow I'll do WGLExt, and
> then I should be able to give Beth a new version of the test program that will
> dump out much more information about our driver calls.

sounds great, kudos

if this helps in this case, we definitely should make WGL and GLX interface-based with dynamic lookup,
allowing pipelining with an automatically created debug implementation.
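
The interface-based pipelining idea can be sketched with java.lang.reflect.Proxy, which builds a wrapper for any interface at run time. This is only an illustration under assumptions: WglCalls is a hypothetical interface (the WGL bindings discussed here are static), TracePipeline is an invented name, and the fake implementation never touches a driver. It is a different technique from the code generation that BuildComposablePipeline does.

```java
import java.lang.reflect.InvocationHandler;
import java.lang.reflect.Proxy;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical interface standing in for an instance-based WGL binding. */
interface WglCalls {
    long wglCreateContext(long hdc);
    boolean wglMakeCurrent(long hdc, long hglrc);
}

/** Wraps an interface implementation so every call is recorded before and after it runs. */
final class TracePipeline {
    static <T> T trace(Class<T> iface, T target, List<String> log) {
        InvocationHandler handler = (proxy, method, args) -> {
            // Record the call first, so the entry exists even if the call itself never returns.
            log.add("-> " + method.getName() + Arrays.toString(args));
            Object result = method.invoke(target, args);
            log.add("<- " + method.getName() + " = " + result);
            return result;
        };
        return iface.cast(
            Proxy.newProxyInstance(iface.getClassLoader(), new Class<?>[] {iface}, handler));
    }

    public static void main(String[] args) {
        // Fake driver used only for this demo; the return values are arbitrary.
        WglCalls fake = new WglCalls() {
            public long wglCreateContext(long hdc) { return hdc + 1; }
            public boolean wglMakeCurrent(long hdc, long hglrc) { return hglrc != 0; }
        };
        List<String> log = new ArrayList<>();
        WglCalls traced = trace(WglCalls.class, fake, log);
        traced.wglMakeCurrent(16, traced.wglCreateContext(16));
        log.forEach(System.out::println);
    }
}
```

Logging the call before invoking it matters for a native crash: the last "->" line without a matching "<-" line points at the call that took the process down. A real trace would also need to flush that line to disk before making the call.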
Comment 16 Wade Walker 2011-09-15 16:37:23 CEST
I finished the logging code for both WGL and WGLExt last night, and emailed a
new version of the test program to Beth. Once she attaches the new log file
here, we should be able to see the arguments, return code, and stack trace of
every WGL and WGLExt call -- hopefully comparing that to the correct output
will tell us what's going on.
Comment 17 Wade Walker 2011-09-16 16:53:57 CEST
Created attachment 268 [details]
Log file with new WGL/WGLExt instrumentation -- "gold" output

This is the log output when the test works correctly on a Windows XP system. When Beth uploads her log we should be able to compare directly to this and see where they diverge.
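
Comparing a failing log against the gold log by eye gets tedious at this size, so the comparison could be scripted. Below is a rough sketch that finds the first line where two logs diverge. LogDiff is an invented name, and masking all hex values is an assumption on my part, made because window handles and similar values differ between machines. The second sample line in main is invented; only the "Created dummy window" wording comes from the real logs.

```java
import java.util.Arrays;
import java.util.List;

final class LogDiff {
    /** Returns the 0-based index of the first differing line, or -1 if the logs match. */
    static int firstDivergence(List<String> gold, List<String> actual) {
        int common = Math.min(gold.size(), actual.size());
        for (int i = 0; i < common; i++) {
            if (!normalize(gold.get(i)).equals(normalize(actual.get(i)))) {
                return i;
            }
        }
        // Same content so far: the logs diverge only if one of them stops early.
        return gold.size() == actual.size() ? -1 : common;
    }

    /** Masks hex values such as handles, which can legitimately differ between machines. */
    static String normalize(String line) {
        return line.replaceAll("0x[0-9a-fA-F]+", "0x?");
    }

    public static void main(String[] args) {
        List<String> gold = Arrays.asList(
            "Created dummy window  hwnd=0x440396.", "wglMakeCurrent returned true");
        List<String> failing = Arrays.asList(
            "Created dummy window  hwnd=0x5503a2.", "wglMakeCurrent crashed");
        System.out.println("first divergence at line index " + firstDivergence(gold, failing));
        // prints "first divergence at line index 1"
    }
}
```

Masking is a trade-off: it hides harmless differences, but it would also hide a wrong function pointer, so addresses that are expected to match are better compared separately.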
Comment 18 bcutler 2011-09-19 19:12:52 CEST
Created attachment 269 [details]
Log file - extra output
Comment 19 bcutler 2011-09-19 19:13:21 CEST
Created attachment 270 [details]
Error file - extra output
Comment 20 Wade Walker 2011-09-21 18:54:53 CEST
Sorry I've taken so long to look at this -- I'm traveling on business right now and don't have much computer access. I should be able to get to this on Friday once I'm back home.
Comment 21 Sven Gothel 2011-09-22 01:04:20 CEST
(In reply to comment #19)
> Created attachment 270 [details]
> Error file - extra output

Good logs, at least they show me that all 'should' be fine,
but still you get a native SIGSEGV.

Reminds me of the NV bug I have fixed here:
  http://jogamp.org/git/?p=jogl.git;a=commit;h=5166d6a6b617ccb15c40fcb8d4eac2800527aa7b

Maybe this workaround doesn't work on your system,
so you could try to set your NV driver to:
   'Threaded Optimization' := 'off'

Just an idea .. otherwise I don't have a clue why MakeCurrent with a valid 
HDC and context doesn't work.
Comment 22 bcutler 2011-09-23 18:42:12 CEST
Created attachment 271 [details]
Log file - threaded optimization = off

Using Sven's suggestion of turning off the threaded optimization in the Nvidia settings.  My extremely simple jogl test program still failed, but it failed in a slightly different way, so I thought I'd post the log file for this case as well.  The log looks just slightly different.
Comment 23 bcutler 2011-09-23 18:42:48 CEST
Created attachment 272 [details]
Error file - threaded optimization = off
Comment 24 Wade Walker 2011-09-25 01:03:59 CEST
Looking through this log, the crash happens when trying to make an OpenGL context current for the dummy window that JOGL creates to get access to the WGL functions.

The WGL trace first calls these functions, which return null since there's no current OpenGL context:

getWGLProcAddressTable
wglGetProcAddress (45 times)

Then it calls these two, using the HDC of the dummy window created by GDI.CreateDummyWindow0():

wglCreateContext
wglMakeCurrent (fails)

There are at least two things I can think of that might be causing this failure:

- The dummy window isn't created with CS_OWNDC in its window class (supposedly some older drivers/cards/OSes need this)
- Maybe the pixel format choosing/setting doesn't work right for dual-card configurations

I'll create some more test versions to send to Beth that try these ideas out.
Comment 25 Wade Walker 2011-09-28 02:53:46 CEST
I've emailed Beth a new version of my test program with CS_OWNDC set to see if that makes a difference. I've also written a C program that calls the same WGL functions that JOGL calls in this trace. I'll send that to her next if we need to look at the pixel format code. That way we can determine why it works in C but not in Java.
Comment 26 bcutler 2011-09-29 18:45:54 CEST
Created attachment 273 [details]
Log file - CS_OWNDC
Comment 27 bcutler 2011-09-29 18:46:20 CEST
Created attachment 274 [details]
Error file - CS_OWNDC
Comment 28 Wade Walker 2011-09-29 22:58:32 CEST
Looking at the results of the CS_OWNDC test, it doesn't make any difference to this crash.

So moving to theory #2, I've sent Beth a C++ program I wrote last night that enumerates the pixel formats, opens a window, and draws a single triangle, using similar C++ code to what's inside JOGL. If this program runs, it confirms (again) that OpenGL works on Beth's machine from C++, just not from Java. It also prints the list of pixel formats so I can compare that to the one we see in JOGL.

The idea is to establish working code (the C++ test) and failing code (the Java test), then figure out what the crucial difference between the two is. Fortunately this failure is early in the program's execution, so we should be able to pin it down pretty quickly :)
Comment 29 bcutler 2011-09-29 23:21:36 CEST
The cpp test failed with this error:
"This application has failed to start because libgcc_s_dw2-1.dll was not found.  Re-installing the application may fix this problem."
Comment 30 Wade Walker 2011-09-30 02:37:57 CEST
Oops, forgot to statically link :) I'll fix it and resend.
Comment 31 bcutler 2011-09-30 20:03:30 CEST
Created attachment 275 [details]
Log file from CPP version

The new CPP version (with proper dlls) seemed to run without errors.  A window popped up showing a color triangle.
Comment 32 Wade Walker 2011-10-01 02:41:03 CEST
Yep, that looks perfect. The C++ code behaves totally as expected, so there's definitely nothing wrong with your OpenGL drivers.

My next step is to write a Java test that invokes the exact same GDI and WGL functions as the C++ test, but using JOGL's GDI and WGL wrappers. That way I can rule out a bunch of minor differences in parameters JOGL uses to set up the dummy window. I got some of it done tonight, so it should be ready for you by Monday.

Thanks again for helping me track this bug down. I'm very interested to see what the cause turns out to be :)
Comment 33 Wade Walker 2011-10-05 20:58:44 CEST
Quick status report: getting my next test case working has proved to be a challenge. I wrote a Java program to try to duplicate Beth's bug using only JOGL's WGL and GDI functions (therefore narrowing down the scope of the code that has to be debugged), but I haven't quite got it to run properly on my own test platform yet. It may take me a while longer to debug this before I can submit it to Beth for testing -- I'll keep you posted.
Comment 34 Wade Walker 2011-10-07 22:11:09 CEST
Status report: I finally got a simple Java program working that creates and makes current an OpenGL context using nothing but GDI and WGL calls. I'll polish it up a bit over the weekend, then send it out to Beth for testing.

This program should demonstrate the absolute minimum number of lines of Java needed to make an OpenGL context current, so it will tell us whether the problem is in our basic GDI/WGL wrappers, or somewhere in the JOGL framework above that.
Comment 35 Wade Walker 2011-10-10 00:23:42 CEST
Just sent the latest GDI/WGL-only test case to Beth. This test calls the exact same GDI and WGL calls in the exact same order as the C++ test that we've confirmed works correctly on Beth's machine. The only other code in it is the bare minimum initialization (maybe 4 lines) so GDI and WGL can be called.

If this test still fails in wglMakeCurrent like the full-size test, then the bug has to be in the very small amount of setup code that's in this test.

If this test doesn't fail, then the bug is caused by the rest of the JOGL code that's normally called when setting up a GL context (which this test is not calling). Either way, this should narrow things down.
Comment 36 Sven Gothel 2011-10-10 08:53:10 CEST
KUDOS to your hard work Wade - thx a lot Wade & Beth.
Even though I cannot help w/ this issue at the moment
for sure it is very much appreciated.
Comment 37 bcutler 2011-10-10 18:27:09 CEST
Created attachment 277 [details]
Log file - GDI/WGL-only
Comment 38 bcutler 2011-10-10 18:27:39 CEST
Created attachment 278 [details]
Error file - GDI/WGL-only
Comment 39 Wade Walker 2011-10-11 17:38:15 CEST
This looks like good news. Since the test still fails, the problem must be in one of the very few remaining differences between the Java and C++ versions.

My prime suspect now is wglGetProcAddress. JOGL calls it a bunch of times in the process of setting up its WGL object, relying on the driver to return null since there's no current OpenGL context yet. But there could easily be a driver bug where calling this function without a current context messes up the internal driver state.

This would explain why JOGL works fine on most drivers, but not quite all -- it's relying on the driver to be robust in the face of unnecessary calls that aren't supposed to have any effect. If the driver writers haven't carefully tested these sorts of cases, JOGL could easily hit bad driver behavior that other programs don't see.

I'll create a new, even more minimal test case that doesn't call wglGetProcAddress unnecessarily and see if that fixes our problem.
Comment 40 Sven Gothel 2011-10-11 19:56:17 CEST
Yes, this definitely narrows it down.

In regards to comment 39, 
the premature wglGetProcAddress(..) calls are returning NULL as far as I see in the log files
and the crashing call is to the statically linked wglMakeCurrent(..).
Hence I don't see how this could be the culprit ?
Also, only a few 'manual' wglGetProcAddress(..) calls are being issued,
not the whole ProcAddressTable - since, as you pointed out correctly,
you shall do this only after a context is current.

However .. it definitely is great progress.

Maybe we can double check the handle values on the Java side and the native JNI side ?
Comment 41 Sven Gothel 2011-10-11 20:02:24 CEST
comment 39:
>> But there could easily be a driver
>> bug where calling this function without a current context messes up the
>> internal driver state.

Yes, maybe .. even though I would guess it's nothing but a TLS function table fetch.
But .. well, sometimes pigs do fly :)
Comment 42 bcutler 2011-10-12 18:14:03 CEST
Created attachment 279 [details]
Log file - super minimal GDI/WGL
Comment 43 bcutler 2011-10-12 18:14:34 CEST
Created attachment 280 [details]
Error file - super minimal GDI/WGL
Comment 44 Sven Gothel 2011-10-12 20:20:54 CEST
Ok, so pigs do not fly :)

Another idea, besides double checking the native DC ..
  Created dummy window  hwnd=0x440396.
  Got DC of dummy window  dc=0xffffffff9b010aae.

would be to evaluate if the native function lookup (win's dlsym/..) ..
  Lookup-Native: <wglMakeCurrent> 0x5ed19bd5 in lib NativeLibrary[OpenGL32.dll, 0x5ed00000]
  Got address of wglMakeCurrent  0x5ed19bd5.

may cause havoc ..

A way to verify would be to link the WGL part statically against OpenGL32.dll,
not using function pointers.

Below I walked through the GlueGen Win32 lookup code .. no result.

However, maybe there is a problem w/ the function lookup,
ie. using the wrong OpenGL32.dll library or something like that ?

How to test this ?

You could try installing and using GLIntercept's OpenGL32.dll
  http://code.google.com/p/glintercept/wiki/Readme
.. and use it to debug/trace even the wgl commands ..

Maybe that gives some additional clues ?

I have also looked for Mesa3D's DLL .. but I don't have it anymore and I couldn't find 
a precompiled one - however, that would be worth a try as well.

Cheers, Sven

+++

Recap WindowsWGLDynamicLibraryBundleInfo, 
which uses GLDynamicLibraryBundleInfo default setting:
  shallLinkGlobal() { return false; }
  shallLookupGlobal() { return false; }

linkGlobal==true leads to -> dynLink.openLibraryGlobal(path...);
  and makes no diff on the Win32 code, since both local/global 'open' are equal.

lookupGlobal==true just leads to NOP, and hence lookupLocal is being used.

Result: it's using the 'local' codepath regardless of the settings ..

+++
Comment 45 Wade Walker 2011-10-13 16:45:42 CEST
I sent Beth one more test case that removes one lookup of wglGetProcAddress that I hadn't noticed last time. This probably makes no difference, I just wanted to rule it out.

I'll also try sending Beth a test that launches using "java -Xss4096k" as a sanity check. Since we're getting an EXCEPTION_STACK_OVERFLOW, I guess it's possible that her video driver needs more stack than the default 512KB that the JVM uses. But the binary drivertestcpp.exe that I gave Beth only had a 200K stack size (apparently that's the default for gcc & ld under MinGW), so I'm not sure this really makes sense.

I verified that the WGL function pointers in Beth's log file have the exact same addresses that they do on my Windows XP system, so the function pointer lookup seems correct.
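
Comparing absolute addresses works when the DLL loads at the same base address on both machines. A slightly more robust variant is to compare each function's offset from its library's load address, which the "Lookup-Native" lines already contain. The sketch below parses one such line; the LookupLine class is invented for illustration, and the pattern assumes every lookup line has the same shape as the one quoted in comment 44.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class LookupLine {
    // Captures: function name, function address, library load address.
    private static final Pattern LOOKUP = Pattern.compile(
        "Lookup-Native: <(\\w+)> 0x([0-9a-fA-F]+) in lib NativeLibrary\\[[^,]+, 0x([0-9a-fA-F]+)\\]");

    /** Returns "name+0xoffset", the function's offset from its library's load address. */
    static String relativeAddress(String logLine) {
        Matcher m = LOOKUP.matcher(logLine);
        if (!m.find()) {
            throw new IllegalArgumentException("not a Lookup-Native line: " + logLine);
        }
        long address = Long.parseUnsignedLong(m.group(2), 16);
        long base = Long.parseUnsignedLong(m.group(3), 16);
        return m.group(1) + "+0x" + Long.toHexString(address - base);
    }

    public static void main(String[] args) {
        String line = "  Lookup-Native: <wglMakeCurrent> 0x5ed19bd5 in lib "
            + "NativeLibrary[OpenGL32.dll, 0x5ed00000]";
        System.out.println(relativeAddress(line)); // prints "wglMakeCurrent+0x19bd5"
    }
}
```

Two logs whose offsets match are looking up the same entry point in the same build of the library, even if the library was relocated.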

If this latest test still fails, next I'll try statically linking to WGL to avoid all function pointer lookup and use.
Comment 46 bcutler 2011-10-13 18:14:02 CEST
Created attachment 281 [details]
Log file - minimal 10/13
Comment 47 bcutler 2011-10-13 18:14:31 CEST
Created attachment 282 [details]
Error file - minimal 10/13
Comment 48 Wade Walker 2011-10-14 17:21:20 CEST
Did a bit more analysis last night:

- Verified that we're loading opengl32.dll from the correct location (C:\WINDOWS\system32\OpenGL32.dll)

- Checked the amount of stack used when the exception happens, and it's very small (the stack grows towards low addresses from 0x00910000, and it's only at 0x0090fab8 when the stack overflow occurs). Supposedly the stack should be able to go all the way down to 0x008c0000, about 327 KB, so the program is nowhere near that.

- Disassembled the code around the exception, which happens at 0x69e39e87

69e39e68: 08            DB      08          ; likely the last byte of the preceding instruction
69e39e69: 0BC1          OR      EAX,ECX
69e39e6B: 5F            POP     EDI
69e39e6C: 5E            POP     ESI
69e39e6D: 5B            POP     EBX
69e39e6E: C9            LEAVE
69e39e6F: C3            RET
69e39e70: 51            PUSH    ECX
69e39e71: 3D00100000    CMP     EAX,00001000
69e39e76: 8D4C2408      LEA     ECX,[ESP+08]
69e39e7A: 7214          JB      69e39e90
69e39e7C: 81E900100000  SUB     ECX,00001000
69e39e82: 2D00100000    SUB     EAX,00001000
69e39e87: 8501          TEST    [ECX],EAX
69e39e89: 3D00100000    CMP     EAX,00001000
69e39e8E: 73EC          JNB     69e39e7c
69e39e90: 2BC8          SUB     ECX,EAX
69e39e92: 8BC4          MOV     EAX,ESP
69e39e94: 8501          TEST    [ECX],EAX
69e39e96: 8BE1          MOV     ESP,ECX
69e39e98: 8B08          MOV     ECX,[EAX]
69e39e9A: 8B4004        MOV     EAX,[EAX+04]
69e39e9D: 50            PUSH    EAX
69e39e9E: C3            RET
69e39e9F: 8B442404      MOV     EAX,[ESP+04]
69e39eA3: 83C0E0        ADD     EAX,-20
69e39eA6: C3            RET
69e39eA7: 653A0D        DB      65,3A,0D    ; instruction continues past the end of the dump

The faulting instruction, "TEST [ECX],EAX" at 0x69e39e87, doesn't modify ESP; it only reads the memory at ECX. The loop around it walks ECX down from ESP one 0x1000-byte page at a time, touching each page, until fewer than 0x1000 of the EAX bytes requested remain. That is the shape of a stack probe (like the compiler's _chkstk helper), run before a large block of stack is claimed by "MOV ESP,ECX" at 0x69e39e96.

This makes sense since ECX = 0x008c0ac0, which is close enough to the bottom of the stack to hit the guard page.

It doesn't look like the most common cause of stack overflow (namely infinite recursion). It might be some calling convention problem that results in a too-large value being added to the stack pointer. So it sounds like trying static linkage to WGL next is the way to go.
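
The stack numbers in this comment can be checked with a few subtractions. The sketch below uses only addresses quoted above (stack base, lowest stack address, stack pointer at the fault, and ECX). The StackMath class and the constant names are invented for the example, and reading ECX as the address the stack pointer was about to move to is an interpretation, not something the error file states.

```java
final class StackMath {
    static final long STACK_BASE   = 0x00910000L; // where the stack starts (highest address)
    static final long STACK_LIMIT  = 0x008c0000L; // lowest address the stack may reach
    static final long ESP_AT_FAULT = 0x0090fab8L; // stack pointer when the exception hit
    static final long ECX_AT_FAULT = 0x008c0ac0L; // ECX value reported for the fault

    /** Number of bytes between a higher and a lower stack address. */
    static long bytesBetween(long high, long low) {
        return high - low;
    }

    public static void main(String[] args) {
        System.out.println("total stack:     " + bytesBetween(STACK_BASE, STACK_LIMIT));   // 327680
        System.out.println("used at fault:   " + bytesBetween(STACK_BASE, ESP_AT_FAULT));  // 1352
        System.out.println("ESP down to ECX: " + bytesBetween(ESP_AT_FAULT, ECX_AT_FAULT)); // 323576
        System.out.println("ECX above limit: " + bytesBetween(ECX_AT_FAULT, STACK_LIMIT)); // 2752
    }
}
```

If that reading of ECX is right, a single step was reaching about 316 KB below the current stack pointer, which is nearly the whole 320 KB region even though only 1352 bytes were in use.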
Comment 49 Wade Walker 2011-10-19 16:21:49 CEST
Status report: Still working on getting static linkage set up for WGL with GlueGen. It looks like I just need to use JavaEmitter instead of GLEmitter, but I need to resolve some GlueGen errors in the code emission process.

I did figure out a possible theory of this bug. When the NVIDIA driver DLL loads, it could be hooking/patching some of the entry points of opengl32.dll, which would cause our saved function pointers to become stale. I'll try a version of the test program that re-checks these pointers after the NVIDIA driver load (which happens during SetPixelFormat()) just to be sure.
Comment 50 Wade Walker 2011-10-21 18:29:44 CEST
Just emailed Beth a new test I finished up last night. It uses JNI to link directly to opengl32.lib/.dll for the WGL functions, instead of querying function pointers and calling through them like JOGL does. I created a WGLStatic that sits beside WGL to do this, so I can use them both at once and compare the results if needed.

If this test passes, it will mean something is wrong with JOGL's treatment of WGL function pointers (either the NVIDIA driver is hacking or changing them, or we're somehow not querying/using/storing them correctly on some systems).
Comment 51 bcutler 2011-10-25 00:26:40 CEST
Created attachment 284 [details]
Log file - Statically linked WGL
Comment 52 bcutler 2011-10-25 00:27:13 CEST
Created attachment 285 [details]
Error file - statically linked WGL
Comment 53 Wade Walker 2011-10-25 15:28:33 CEST
OK, the latest log shows that the test still fails, even when I link WGL with the DLL import library (instead of using function pointers and LoadLibrary).

There are many more things I can try with the JOGL 2 test, but first I'd like to see if JOGL 1.1.1a works on Beth's machine. I wrote a quick test for this last night, and I'll email it to her shortly.

If JOGL 1.1.1a works, this will let us know that the problem is something that's been changed in JOGL 2 (so it should be relatively easy to find).
Comment 54 Wade Walker 2011-10-25 15:50:10 CEST
Looking back through this bug report again, I remember now that Beth reported that JOGL 1.1.1a fails too, but we never got any logs of it failing. So hopefully this new JOGL 1.1.1a test will give us a different failure message than JOGL 2 that will help narrow things down.
Comment 55 bcutler 2011-10-25 20:03:25 CEST
Created attachment 286 [details]
Log file - jogl 1.1.1a
Comment 56 bcutler 2011-10-25 20:03:52 CEST
Created attachment 287 [details]
Error file - jogl 1.1.1a
Comment 57 Wade Walker 2011-10-26 21:18:25 CEST
Looks like JOGL 1.1.1a fails in exactly the same way as JOGL 2 - at the same address inside nvoglnt.dll, with the same stack overflow.

I'll go back to stripping down the statically linked JOGL 2 test to resemble the C++ test more. Since we know that the C++ test works, there's got to be a way to get the JOGL test to duplicate its results.

I can also add some code to the C++ test to verify that it's seeing the same opengl32.dll and nvoglnt.dll that the Java version sees.
Comment 58 Wade Walker 2011-10-27 16:16:18 CEST
I made a few more changes to the test case last night and emailed it to Beth just now.

This version manually loads the JOGL libraries with System.loadLibrary() instead of invoking any JOGL code at all. This avoids trying and failing to load nativewindow_x11, and also avoids loading opengl32 redundantly now that it's linked via import library to jogl_desktop.dll.

This version also sets the C stack ridiculously large with -Xss4096k just to see if this NVIDIA driver has strangely large stack requirements. At the least, increasing stack size should change where the error happens.
Comment 59 bcutler 2011-10-27 18:42:17 CEST
Created attachment 288 [details]
Log file - 10/27 version

This test did not produce an error file, so I think it succeeded.  No window was made visible though.  Hopefully that's a good sign?
Comment 60 Wade Walker 2011-10-27 20:50:48 CEST
(Hmm, my emailed comment doesn't seem to have made it -- retyping in the web form)

Yep, it finally worked! This test doesn't create a window, it just writes a log, so this is the expected behavior. I changed two things though, so we'll need to check which one fixed the problem.

Could you edit the run.bat file of that latest test to remove the -Xss4096k and then rerun it? If the failure comes back, this means the -Xss4096k fixed the problem. In that case, try adding the same -Xss4096k to the JVM options of your own JOGL program and see if that makes it work.
Comment 61 bcutler 2011-10-28 22:26:16 CEST
Hey, it's fixed!
The -Xss4096k argument did the trick, on all of my tests!

Thank you so much, Wade, for sticking it out and tracking down the problem!
Comment 62 Wade Walker 2011-10-28 23:04:22 CEST
Glad to hear it works! Sorry I took such a long way around to finding this -- when hearing hoofbeats, I need to think "horses", not "zebras" :)

One other thing you might want to do: try reducing the -Xss4096k by powers of two until you find the smallest size that works. This will keep your threads' memory footprint from being needlessly large.
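
The halving procedure can be written down as a small loop. In this sketch the "does it work" check is passed in as a function, since actually relaunching the test with a different -Xss value is specific to your setup. StackSizeSearch and its method name are invented, and the 400 KB threshold in main is a made-up stand-in for the real crash test, not a measured value.

```java
import java.util.function.IntPredicate;

final class StackSizeSearch {
    /**
     * Halves the size while the check still passes and returns the smallest
     * passing size in KB, or -1 if even the starting size fails.
     */
    static int smallestWorkingKb(int startKb, IntPredicate works) {
        if (!works.test(startKb)) {
            return -1;
        }
        int best = startKb;
        for (int kb = startKb / 2; kb >= 1 && works.test(kb); kb /= 2) {
            best = kb;
        }
        return best;
    }

    public static void main(String[] args) {
        // Stand-in for "run the test with -Xss<kb>k and see whether it crashes".
        int kb = smallestWorkingKb(4096, size -> size >= 400);
        System.out.println("-Xss" + kb + "k"); // prints "-Xss512k"
    }
}
```

This assumes that if a given size works, every larger size works too. It only finds the smallest power-of-two step, which is probably precise enough here.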

This problem appears to be an NVIDIA driver mistake (though not strictly a driver bug, since it can still work). The driver code is doing something like this:

void somefunc( bigstruct b );

when it should have done this:

void somefunc( bigstruct *b );

So they're pushing a huge amount of data (hundreds of KB) on the stack in just one function call, instead of passing a pointer like they ought to (which would use just 4 bytes).

The JVM sets a smaller stack size for its threads than Windows does for executables, so that's why it works from C++ but not from Java (unless you manually set -Xss).
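
For threads that an application creates itself, there is a per-thread alternative to -Xss: the four-argument Thread constructor takes a requested stack size. The sketch below shows the call. BigStackThread is an invented name, the JVM is allowed to round or ignore the requested size, and this only helps if the OpenGL calls actually run on a thread you create. Threads the JVM creates for you, such as the AWT event thread, still need -Xss.

```java
final class BigStackThread {
    /** Runs the task on a new thread that requests stackBytes of stack, then waits for it. */
    static void runWithStack(Runnable task, long stackBytes) {
        // Thread(ThreadGroup, Runnable, String, long): the last argument is a
        // requested stack size in bytes, which the JVM treats as a hint.
        Thread worker = new Thread(null, task, "big-stack", stackBytes);
        worker.start();
        try {
            worker.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the interrupt for the caller
        }
    }

    public static void main(String[] args) {
        // 4096 KB, matching the -Xss4096k value that worked in this report.
        runWithStack(
            () -> System.out.println("running on " + Thread.currentThread().getName()),
            4096L * 1024);
        // prints "running on big-stack"
    }
}
```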