Summary: | Intel Graphics: Dead GLCanvas after hide/show in CardLayout w/ Animator start()/stop() [works w/ pause()/resume()] | ||
---|---|---|---|
Product: | [JogAmp] Jogl | Reporter: | Gene <gene.ressler> |
Component: | awt | Assignee: | Sven Gothel <sgothel> |
Status: | RESOLVED WORKSFORME | ||
Severity: | minor | CC: | gouessej, wwalker3 |
Priority: | P5 | ||
Version: | 2 | ||
Hardware: | pc_x86_64 | ||
OS: | windows | ||
Type: | --- | SCM Refs: |
97940607411e33b28886ae4ac8a5e345fc7d015a
2c005c6bf4abd2beafdc9b8cb4b713229bc2b359
7ce29d85bb85c003c9dc3b94efa84b55dfbb7f86
098398c2a9145447da5314eed9792b3738c2d515
6451e3c0c79fca92a39e32c2600c69f16dfc7f4d
9e61d4529143ff3f6de15ce55f8e8747f67a86c9
c49d29784986b1945343b9a90b5e0c9f3d95d937
acb48154608c8f4e3f49306ff6e2ab3d5df8bc72
|
Workaround: | --- | ||
Attachments: |
JOGL 1 test driver outputs from 64-bit Intel graphics Win 7.
JOGL 2 test driver outputs from 64-bit Intel graphics Win 7.
Zipped log file from JOGL 2 test |
Description
Gene
2011-12-11 09:53:40 CET
Created attachment 299 [details]
JOGL 1 test driver outputs from 64-bit Intel graphics Win 7.
JOGL 1 Test driver furnished by Wade Walker.
This had 32-bit DLLs, which I swapped with current 64-bit DLLs because the target is running a 64-bit JVM.
Created attachment 300 [details]
JOGL 2 test driver outputs from 64-bit Intel graphics Win 7.
JOGL 2 Test driver furnished by Wade Walker.
This had 32-bit DLLs, which I swapped with current 64-bit DLLs because the target is running a 64-bit JVM.
Uploaded to the project site because it's too big for Bugzilla.
Created attachment 301 [details]
Zipped log file from JOGL 2 test
Good work substituting the 64-bit JOGL DLLs in my test -- I wasn't smart enough to think of that yesterday :)

From the log, my JOGL 2 driver test looks like it ran fine. Did you see the triangle pop up in the window? If so, the driver is probably "generally OK" on this machine. Usually if there's a driver problem, this test will crash outright.

Perhaps when the GLCanvas is being hidden, some function call is being made that messes up or invalidates the GL context? It might be worth attaching the zipped JOGL source code to your JOGL JAR and seeing if (for example) GLEventListener.dispose() is being called.

Noted on same machine:

(In reply to comment #4)
> From the log, my JOGL 2 driver test looks like it ran fine. Did you see the
> triangle pop up in the window? If so, the driver is probably "generally OK" on
> this machine. Usually if there's a driver problem, this test will crash
> outright.
>
> Perhaps when the GLCanvas is being hidden, some function call is being made
> that messes up or invalidates the GL context? It might be worth attaching the
> zipped JOGL source code to your JOGL JAR and seeing if (for example)
> GLEventListener.dispose() is being called.

Yes the triangle popped up as expected and then disappeared. I did notice in the JOGL 2 log at the end that a MakeCurrent call with null arguments returns success = false. This may just be a redundant cleanup operation as the window was being destroyed, but thought I'd mention it.

I noticed another quirk last night. My application gets a depth buffer with limited resolution (24-bit?) in JOGL 2 compared to JOGL 1 (floating point?). I've had no chance to attach a debugger to see what's really going on. But I mention this because it means the JOGL 2 initialization must be picking a different pixel/buffer format than JOGL 1. Maybe this is invoking a driver path with a bug that doesn't show up with JOGL 1's choices. I agree the problem acts like a broken GL context.
Can you go into a bit more detail on what you're saying about attaching source? I have Netbeans set up with JOGL source in my dev machine. Are you saying I can get trace information on the target without installing a dev environment? Thanks.

Yes, it's possible to run the program on your buggy machine and connect the Netbeans debugger to it from another machine. This way you don't need an IDE or dev environment on the buggy machine :)

1. Make sure you can set breakpoints and step through the code locally in Netbeans, including stepping into JOGL methods. You'll need to attach the JOGL source code zip files to your project to enable this.

2. Start the program on the buggy machine like this:

    java -jar -Xdebug -Xnoagent -Djava.compiler=NONE -Xrunjdwp:transport=dt_socket,server=y,address=8000,suspend=y <your JAR> <your command-line args>

This starts the program and immediately suspends it.

3. Connect to that JVM from Netbeans using "Debug > Start Session > Attach". You'll have to specify the hostname and port 8000.

4. Once you've attached, you can set breakpoints and step through code just like in a local debug session.

If you want full debug visibility inside JOGL code (i.e. the ability to see variable values, not just step through lines), you'll need to build your own version of the JOGL JARs with extra debugging info turned on. I've written a step-by-step guide for this at http://jogamp.org/wiki/index.php/Building_JOGL_on_the_command_line. It only takes a few minutes to set up the build from scratch, so this is simpler than it sounds. Make sure to set up the custom Ant properties (see http://jogamp.org/wiki/index.php/Building_JOGL_on_the_command_line#Set_up_custom_Ant_properties_.28optional.29), as these are what turn on the debugging info.

Let me know if you have any problems, and I can walk you through any rough spots.

Two more thoughts about this bug:

1. You may be right about the canvas properties hitting a different driver path.
You can try manipulating the GLCapabilities you pass into your GLCanvas to test this, e.g.

    GLCapabilities glcapabilities = new GLCapabilities( glprofile );
    glcapabilities.setDepthBits( 32 );
    GLCanvas glcanvas = new GLCanvas( glcapabilities );

2. It could be that Nvidia's drivers improperly allow you to use a GL context after it's been made non-current. So the Intel drivers could be more correct! You might want to check all the events sent to your GLCanvas to see if the context is not being made current again when the card is re-displayed.

(In reply to comment #7)
> Two more thoughts about this bug:
>
> 1. You may be right about the canvas properties hitting a different driver
> path. You can try manipulating the GLCapabilities you pass into your GLCanvas
> to test this, e.g.
>
> GLCapabilities glcapabilities = new GLCapabilities( glprofile );
> glcapabilities.setDepthBits( 32 );
> GLCanvas glcanvas = new GLCanvas( glcapabilities );

Thanks. I know about capabilities and will try this. I wonder why 2 would have changed the algorithm of 1.1.1a for this.

>
> 2. It could be that Nvidia's drivers improperly allow you to use a GL context
> after it's been made non-current. So the Intel drivers could be more correct!
> You might want to check all the events sent to your GLCanvas to see if the
> context is not being made current again when the card is re-displayed.

I'm sorry I'm dumb on this. Is there a simple way to observe all events to the GLCanvas? I can't find anything in the Netbeans debugging environment. Do I need to attach my own event dispatcher?

I did look at CardLayout. All it does is call setVisible false and true to hide the formerly visible component and make the new one show. So if there is a problem with event dispatching, it's caused within setVisible(). It would be amazing if Swing has been around all these years and there's still a problem at this level. But you never know.

> I wonder why 2 would have changed the algorithm of 1.1.1a for this.
The driver calls for new GL contexts are different from those of old contexts when it comes to selecting pixel formats. So it's not a matter of JOGL intentionally changing, it's just that to create GL 2.0+ contexts you have to make driver calls that didn't exist (or weren't exploited yet) back in the JOGL 1.1.1a days.

> Is there a simple way to observe all events to the GLCanvas?

I was thinking of just setting breakpoints in all the GLCanvas methods in the Netbeans IDE :) That's what I normally do. It also lets me see the call stack at each point where a GLCanvas method is called, which might reveal something unknown about the canvas lifecycle.

> I did look at CardLayout. All it does is call setVisible false and true to hide
> the formerly visible component and make the new one show. So if there is a
> problem with event dispatching, it's caused within setVisible(). It would be
> amazing if Swing has been around all these years and there's still a problem at
> this level. But you never know.

I don't mean to suggest that it's a Swing bug. I'm just hypothesizing that perhaps setVisible() ends up invoking some GLCanvas method that makes the GL context non-current (or changes the context in some other way). Then after that, perhaps Nvidia drivers still let you draw into the context, but Intel drivers don't.

There may be (probably is) some way to change the program so it works with both sets of drivers, we just don't know what it is yet. I've seen other similar cases where I had to use certain pixel formats or GL rendering mode settings to work around driver problems. The key is just to try a bunch of different things until you find what works, then work backwards from that to the cause of the bug :)

(In reply to comment #9)
> > I wonder why 2 would have changed the algorithm of 1.1.1a for this.
>
> The driver calls for new GL contexts are different from those of old contexts
> when it comes to selecting pixel formats.
> So it's not a matter of JOGL
> intentionally changing, it's just that to create GL 2.0+ contexts you have to
> make driver calls that didn't exist (or weren't exploited yet) back in the JOGL
> 1.1.1a days.

I see.

>
> > Is there a simple way to observe all events to the GLCanvas?
>
> I was thinking of just setting breakpoints in all the GLCanvas methods in the
> Netbeans IDE :) That's what I normally do. It also lets me see the call stack
> at each point where a GLCanvas method is called, which might reveal something
> unknown about the canvas lifecycle.
>

Ok. Thanks.

> There may be (probably is) some way to change the program so it works with both
> sets of drivers, we just don't know what it is yet. I've seen other similar
> cases where I had to use certain pixel formats or GL rendering mode settings to
> work around driver problems. The key is just to try a bunch of different things
> until you find what works, then work backwards from that to the cause of the
> bug :)

I did try setting the capabilities. Now a good depth buffer is provided, but the bad behavior remains. Next is tracing events and watching context validity as you say. It's good to have the benefit of your time and experience on this.

(In reply to comment #9)
> I don't mean to suggest that it's a Swing bug. I'm just hypothesizing that
> perhaps setVisible() ends up invoking some GLCanvas method that makes the GL
> context non-current (or changes the context in some other way). Then after
> that, perhaps Nvidia drivers still let you draw into the context, but Intel
> drivers don't.

Wade,

Thanks again for the insight. I'm copying a Forum post I made on some progress:

------------

I may have a fix, but am more puzzled than ever. I tried removing the animator in the pared down example, and good news: the GLCanvas card started flipping correctly, responding to paint(), no longer dead.
I tried this because Wade had a hunch that the problem may be a missing MakeCurrent somewhere that the NVIDIA card tolerates and the Intel card does not. The animator has a timer thread, so taking it out of the picture seemed a way to reduce chances for such a bug to manifest.

Another clue in Wade's favor: If I start() the animator just once when it's created, use pause() / resume() in place of stop() / start() in the button actions, and finally call stop() just once when the canvas is disposed, all works correctly. The only difference between pause/resume and start/stop is that the latter create/destroy a timer, including its thread. Pause/resume use the same timer.

Incidentally, the reason for pause/resume in addition to start/stop is not explained in the FPSAnimator API docs. So maybe I've just discovered the way the timer is supposed to be used? I'd really appreciate comments from the experts on this.

Random brainstorming:

+++

hmm .. ok the initial setVisible() should be put in the AWT/EDT thread
javax.swing.SwingUtilities.invokeAndWait(new Runnable() {
    public void run() {
        ...
    }
});
the other 'manipulators' as well. But I guess actionPerformed() is run
from AWT's EDT... ?!

Will the CardLayout 'switch' issue removeNotify/addNotify ?

+++

GLAutoDrawable's paint method respects a hooked AnimatorControl (Animator).
If the Animator is animating, paint does nothing since the animator thread does
the update.
Maybe there is a glitch in that logic ..

However, GLContext is either blocking or throwing an Exception if you attempt
to lock (makeCurrent) from 2 threads at the same time. GLCanvas uses the
blocking method AFAIK.
Release (unlock) shall only happen after a successful lock.
The sequence is
[1]GLCanvas.display -> [2]GLDrawableHelper.invokeGL(..) ->
[3]GLEventListener.display(),
where [2] encapsulates the calls to [3] with makeCurrent/release.
The latter is ofc a makeCurrent(NULL).

Will be interesting to find the culprit (concurrency issue ..
makeCurrent(NULL) pulling the ctx).

Maybe we should trace makeCurrent/release w/ thread names properly (debug output)
and w/ the other traces we should be able to see the culprit.

Hope I can join this effort soon. Great work!

Hmmm, hard to say what's going on here from a remote vantage point. Replacing animator.start()/stop() with resume()/pause() may indeed be the correct way to use the animator (since I notice that GL canvas uses pause()/resume() itself in GLCanvas.dispose()).

You do have a working solution now, which is good enough sometimes :) I couldn't blame you if you wanted to stop debugging this now. However, if you wanted to go farther, you could put log statements in the GLCanvas methods, then compare the traces when you use start()/stop() vs. resume()/pause() and see how (or if) they differ. I would guess that maybe animator.stop() somehow results in GLCanvas.dispose() being called, or something of that nature. With multithreaded, event-driven code it can be hard to say by inspection -- sometimes logging is the only way.

(In reply to comment #12)
> Random brainstorming:
>
> +++
>
> hmm .. ok the initial setVisible() should be put in the AWT/EDT thread
> javax.swing.SwingUtilities.invokeAndWait(new Runnable() {
>     public void run() {
>         ...
>     }
> });
> the other 'manipulators' as well. But I guess actionPerformed() is run
> from AWT's EDT... ?!
>

Yes the button listeners run in the AWT dispatch thread. The setVisible() point is interesting. I grabbed code from a web tutorial to modify and this was part of it. I will see if this makes any difference. But the advertised way to initialize Swing apps is with top level setVisible() in the main thread. In fact the Java Desktop framework, which I am using in my large app, does this.

I would be willing to help with the logging/tracing effort, but don't know how to recognize the problem when I see it. It seems GLContexts are in thread local storage, so the GL object always looks consistent with itself.
Is there a way to query for the last GLContext that was actually sent to makeCurrent globally across all threads?

(In reply to comment #12)
Additional notes below:

> Random brainstorming:
>
> +++
>
> hmm .. ok the initial setVisible() should be put in the AWT/EDT thread
> javax.swing.SwingUtilities.invokeAndWait(new Runnable() {
>     public void run() {
>         ...
>     }
> });
> the other 'manipulators' as well. But I guess actionPerformed() is run
> from AWT's EDT... ?!

I tried this. Makes no difference. Again, the initial setVisible is run in the main() thread in every example I can find and in the Swing Desktop framework, so I don't think we'll be able to tell everyone to change their init code.

>
> Will the CardLayout 'switch' issue removeNotify/addNotify ?

No. CardLayout calls only setVisible: false on the old card, then true on the new one being "flipped to the top".

> GLAutoDrawable's paint method respects a hooked AnimatorControl (Animator).
> If the Animator is animating, paint does nothing since the animator thread does
> the update.
> Maybe there is a glitch in that logic ..

No glitch. Things first started working for me when I _removed_ the FPSAnimator completely, so paint() was being called (so the image was static). Then I put the Animator back and tried pause/resume instead of start/stop, and that worked with the animator running.

> However, GLContext is either blocking or throwing an Exception if you attempt
> to lock (makeCurrent) from 2 threads at the same time. GLCanvas uses the
> blocking method AFAIK.
> Release (unlock) shall only happen after a successful lock.
> The sequence is
> [1]GLCanvas.display -> [2]GLDrawableHelper.invokeGL(..) ->
> [3]GLEventListener.display(),
> where [2] encapsulates the calls to [3] with makeCurrent/release.
> The latter is ofc a makeCurrent(NULL).

Okay I see the makeCurrent and release calls in invokeGL. In my traces, the AWT thread is calling display().
I'm logging a stack dump and GLContext handles for the GLCanvas and also the GL2 object in display(), and they match. The entire trace is identical for start/stop and pause/resume. Yet the latter works and the former doesn't. So not sure what to trace next.

>
> Will be interesting to find the culprit (concurrency issue .. makeCurrent(NULL)
> pulling the ctx).
>
> Maybe we should trace makeCurrent/release w/ thread names properly (debug
> output)
> and w/ the other traces we should be able to see the culprit.

I'm not sure how to go about this.

I tried tracing through some of the GLCanvas code myself, and it's very confusing :) Lots of threads passing runnables to each other in a way that's hard to debug, even for this simple test.

Maybe encapsulate this test as a JUnit test, then add it to the regression and see if it fails on any of the platforms in Sven's cluster? But I'm not sure how to tell if it fails or not (in an automated way), since it seems to think it's drawing (it just doesn't produce any visible output).

Maybe logging that included the thread IDs would tell us if there was any difference (from start()/stop() to resume()/pause()) in which functions are being called from which threads? There could be something there that the Intel driver doesn't tolerate well.

Added Unit-Test for Bug 532 to test Animator behavior w/ CardLayout and diff. Platforms, commit 97940607411e33b28886ae4ac8a5e345fc7d015a

Tests permutations of FPSAnimator,Animator,StartStop,PauseResume, automated or manual (see main method).

Next step is to determine whether the behavior is a bug or not. I could already see an animation freeze when using Animator...

Thanks. Don't forget the Gears demo also fails to display anything on this Intel HD Graphics hardware. If this should be a separate bug report, I'll be happy to write it up.

TestAWTCardLayoutAnimatorStartStopBug532: Refine, add 'continue' mode, ..
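The thread-aware tracing suggested in the comments above (log each call together with the calling thread's name, then compare the start()/stop() trace against the pause()/resume() trace) could be sketched roughly as follows. This is only an illustration: ThreadTrace, log(), snapshot() and firstDifference() are hypothetical names that do not exist in JOGL, and the calls would have to be added by hand to the methods of interest.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

/** Hypothetical helper (not part of JOGL): records which thread made which call. */
public class ThreadTrace {
    private final List<String> events =
            Collections.synchronizedList(new ArrayList<String>());

    /** Record one call name together with the name of the calling thread. */
    public void log(String call) {
        events.add(Thread.currentThread().getName() + " " + call);
    }

    /** Copy of the events recorded so far, in recording order. */
    public List<String> snapshot() {
        synchronized (events) {
            return new ArrayList<String>(events);
        }
    }

    /** Index of the first entry where two traces differ, or -1 if they are identical. */
    public static int firstDifference(List<String> a, List<String> b) {
        int n = Math.min(a.size(), b.size());
        for (int i = 0; i < n; i++) {
            if (!a.get(i).equals(b.get(i))) {
                return i;
            }
        }
        return a.size() == b.size() ? -1 : n;
    }

    public static void main(String[] args) throws InterruptedException {
        ThreadTrace trace = new ThreadTrace();
        trace.log("makeCurrent"); // recorded from the main thread

        // Stand-in for a second thread (e.g. an animator thread) making a call.
        Thread other = new Thread(() -> trace.log("release"), "animator-demo");
        other.start();
        other.join();

        for (String event : trace.snapshot()) {
            System.out.println(event);
        }
    }
}
```

Running the same scenario once with start()/stop() and once with pause()/resume(), each into its own ThreadTrace, and passing the two snapshots to firstDifference() would point at the first call that happens on a different thread or in a different order.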
Previous commit 098398c2a9145447da5314eed9792b3738c2d515 cleaned up and fixed context/drawable lock/unlock for makeCurrent()/release()/destroy(), and consistency looks much better now in this regard.

However, on Intel HD 3000 / Windows7, our AnimatorControl start/stop still leaves the 2nd switch to the GLCanvas within the CardPanel not showing rendering results.

One interesting artefact though:

1st switch to GLCanvas (rendering visible):
*** hdc 0x2f010ec5, hdw(hdc) 0x1003a0, hdw 0x1003a0 - AWT-EventQueue-0,
*** hdc 0x160110c4, hdw(hdc) 0x1003a0, hdw 0x1003a0 - AWT-EventQueue-0,
*** hdc 0x2f010ec5, hdw(hdc) 0x1003a0, hdw 0x1003a0 - AWT-EventQueue-0,
*** hdc 0x160110c4, hdw(hdc) 0x1003a0, hdw 0x1003a0 - AWT-EventQueue-0,
-> alternating HDC's

2nd switch to GLCanvas (rendering _not_ visible):
*** hdc 0x160110c4, hdw(hdc) 0x1003a0, hdw 0x1003a0 - AWT-EventQueue-0,
*** hdc 0x160110c4, hdw(hdc) 0x1003a0, hdw 0x1003a0 - AWT-EventQueue-0,
*** hdc 0x160110c4, hdw(hdc) 0x1003a0, hdw 0x1003a0 - AWT-EventQueue-0,
*** hdc 0x160110c4, hdw(hdc) 0x1003a0, hdw 0x1003a0 - AWT-EventQueue-0,
-> fixed HDC

Maybe this is a hint for what is going wrong in JAWTWindow locking, which acquires the frame's HDC. Verifying the recursive lock shows proper lock/unlock actions though.

Testing the 'wgl' functions of the 5 GDI/WGL function set (commit c49d29784986b1945343b9a90b5e0c9f3d95d937) didn't solve this issue either.

The last series of patches, which cleaned some 'dirty' code and are good in general, didn't reveal any 'situation' which could lead to this bug.

Maybe the bug lies in some timing issues etc. impacting the Intel driver only.

The Animator stop() command is issued from the AWT-EDT and since we are using the AWT implementation, it cannot wait until the animator thread completes (-> finishLifecycleAction()). This causes pending frames to be rendered _after_ issuing stop(), maybe this confuses the Intel driver ?
Actually it should not, since those pending frames were still able to properly acquire the locks (drawable/context) .. and the 'Continue' test case works as well. And 'pause()' works the same way (pending frames, non blocking in AWT mode).

The only difference between stop() and pause() is that the former kills the rendering loop thread and starts a newly created one at start(). However, this rendering loop thread only triggers the display method in case of AWT, since all GL commands are issued from the AWT-EDT thread (GLCanvas).

If this issue is the culprit here and leads to invisible rendering results on Intel, it's a bug in their drivers.

Let's just say: If you intend to continue using the GLAutoDrawable w/ your AnimatorControl, just use pause()/resume() - they use fewer resources anyway.

I lower the bug level now, since I couldn't find any major flaw in the JOGL code.

For now I will pause this effort .. but feel free to continue, otherwise we can change the status to 'works for me' !
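To make the stop()/start() vs. pause()/resume() distinction described above concrete, here is a small stand-alone model in plain Java. It is not JOGL's Animator implementation: ModelAnimator and all of its members are hypothetical, and unlike the AWT case discussed above, this model's stop() does wait for the loop thread to finish. The only thing it is meant to show is that pause()/resume() keep the same loop thread, while stop()/start() replace it.

```java
import java.util.concurrent.atomic.AtomicLong;

/** Hypothetical model of an animator; it is NOT the JOGL Animator class. */
public class ModelAnimator {
    private Thread loop;                 // current rendering-loop thread, if any
    private volatile boolean running;
    private volatile boolean paused;
    private final AtomicLong frames = new AtomicLong();

    /** Creates and starts a brand-new loop thread. */
    public synchronized void start() {
        if (running) {
            return;
        }
        running = true;
        paused = false;
        loop = new Thread(() -> {
            while (running) {
                if (!paused) {
                    frames.incrementAndGet(); // a real animator would trigger display() here
                }
                try {
                    Thread.sleep(5);
                } catch (InterruptedException e) {
                    return; // stop() interrupts the sleep to end the loop promptly
                }
            }
        }, "model-animator");
        loop.setDaemon(true);
        loop.start();
    }

    /** Ends the loop thread; a later start() has to create a new one. */
    public synchronized void stop() {
        if (!running) {
            return;
        }
        running = false;
        loop.interrupt();
        try {
            loop.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        loop = null;
    }

    /** Keeps the loop thread alive but idle. */
    public void pause() {
        paused = true;
    }

    public void resume() {
        paused = false;
    }

    public synchronized Thread loopThread() {
        return loop;
    }

    public static void main(String[] args) {
        ModelAnimator animator = new ModelAnimator();
        animator.start();
        Thread first = animator.loopThread();

        animator.pause();
        animator.resume();
        // prints "same thread after pause/resume: true"
        System.out.println("same thread after pause/resume: " + (first == animator.loopThread()));

        animator.stop();
        animator.start();
        // prints "same thread after stop/start: false"
        System.out.println("same thread after stop/start: " + (first == animator.loopThread()));

        animator.stop();
    }
}
```

In terms of the recommendation above, this corresponds to calling start() once when the canvas is created, using pause()/resume() when the card is hidden and shown, and calling stop() once when the canvas is disposed.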